From nobody Sun Feb 8 17:13:32 2026
From: Gregory Price
To: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-cxl@vger.kernel.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, kernel-team@meta.com, longman@redhat.com,
	tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, corbet@lwn.net,
	gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
	dave@stgolabs.net, jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com, ira.weiny@intel.com,
	dan.j.williams@intel.com, akpm@linux-foundation.org, vbabka@suse.cz,
	surenb@google.com, mhocko@suse.com, jackmanb@google.com, ziy@nvidia.com,
	david@kernel.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	rppt@kernel.org, axelrasmussen@google.com, yuanchu@google.com,
	weixugc@google.com, yury.norov@gmail.com, linux@rasmusvillemoes.dk,
	rientjes@google.com, shakeel.butt@linux.dev, chrisl@kernel.org,
	kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
	bhe@redhat.com, baohua@kernel.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, roman.gushchin@linux.dev, muchun.song@linux.dev,
	osalvador@suse.de, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	ying.huang@linux.alibaba.com, apopple@nvidia.com, cl@gentwo.org,
	harry.yoo@oracle.com, zhengqi.arch@bytedance.com, Balbir Singh
Subject: [RFC PATCH v3 1/8] numa,memory_hotplug: create N_PRIVATE (Private Nodes)
Date: Thu, 8 Jan 2026 15:37:48 -0500
Message-ID: <20260108203755.1163107-2-gourry@gourry.net>
In-Reply-To: <20260108203755.1163107-1-gourry@gourry.net>
References: <20260108203755.1163107-1-gourry@gourry.net>

N_MEMORY nodes are intended to contain general System RAM. Today, some
device drivers hotplug their memory (marked Specific Purpose or
Reserved) to get access to mm/ services, but don't intend it for
general consumption. This creates reliability issues, as there are no
isolation guarantees.

Create N_PRIVATE for memory nodes whose memory is not intended for
general consumption. This state is mutually exclusive with N_MEMORY.
This allows existing service code (like page_alloc.c) to manage
N_PRIVATE nodes without exposing N_MEMORY users to that memory.

Add node_register_private() for device drivers to call to mark a node
as private prior to hotplugging memory. This fails if the node already
has N_MEMORY set, regardless of online state.

Private nodes must have a memory type so that multiple drivers trying
to online private memory onto the same node are warned when a conflict
occurs.
Suggested-by: David Hildenbrand
Suggested-by: Balbir Singh
Signed-off-by: Gregory Price
---
 drivers/base/node.c      | 199 +++++++++++++++++++++++++++++++++++++++
 include/linux/node.h     |  60 ++++++++++++
 include/linux/nodemask.h |   1 +
 mm/memory_hotplug.c      |   2 +-
 4 files changed, 261 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 00cf4532f121..b503782ea109 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -861,6 +861,193 @@ void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
 			(void *)&nid, register_mem_block_under_node_hotplug);
 	return;
 }
+
+static enum private_memtype *private_nodes;
+/* Per-node list of private node operations callbacks */
+static struct list_head private_node_ops_list[MAX_NUMNODES];
+static DEFINE_MUTEX(private_node_ops_lock);
+static bool private_node_ops_initialized;
+
+/*
+ * Note: private_node_ops_list is initialized in node_dev_init() before
+ * any calls to node_register_private() can occur.
+ */
+
+/**
+ * node_register_private - Mark a node as private and register ops
+ * @nid: Node identifier
+ * @ops: Callback operations structure (required, but callbacks may be NULL)
+ *
+ * Mark a node as private and register the given ops structure. The ops
+ * structure must have res_start and res_end set to the physical address
+ * range covered by this registration, and memtype set to the private
+ * memory type. Multiple registrations for the same node are allowed as
+ * long as they have the same memtype.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+int node_register_private(int nid, struct private_node_ops *ops)
+{
+	int rc = 0;
+	enum private_memtype ctype;
+	enum private_memtype type;
+
+	if (!ops)
+		return -EINVAL;
+
+	type = ops->memtype;
+
+	if (!node_possible(nid) || !private_nodes || type >= NODE_MAX_MEMTYPE)
+		return -EINVAL;
+
+	/* Validate resource bounds */
+	if (ops->res_start > ops->res_end)
+		return -EINVAL;
+
+	mutex_lock(&private_node_ops_lock);
+
+	/* hotplug lock must be held while checking online/node state */
+	mem_hotplug_begin();
+
+	/*
+	 * N_PRIVATE and N_MEMORY are mutually exclusive. Fail if the node
+	 * already has N_MEMORY set, regardless of online state.
+	 */
+	if (node_state(nid, N_MEMORY)) {
+		rc = -EBUSY;
+		goto out;
+	}
+
+	ctype = private_nodes[nid];
+	if (ctype > NODE_MEM_NOTYPE && ctype != type) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	/* Initialize the ops list entry and add to the node's list */
+	INIT_LIST_HEAD(&ops->list);
+	list_add_tail_rcu(&ops->list, &private_node_ops_list[nid]);
+
+	private_nodes[nid] = type;
+	node_set_state(nid, N_PRIVATE);
+out:
+	mem_hotplug_done();
+	mutex_unlock(&private_node_ops_lock);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(node_register_private);
+
+/**
+ * node_unregister_private - Unregister ops and potentially unmark node as private
+ * @nid: Node identifier
+ * @ops: Callback operations structure to remove
+ *
+ * Remove the given ops structure from the node's ops list. If this is
+ * the last ops structure for the node and the node is offline, the
+ * node is unmarked as private.
+ */
+void node_unregister_private(int nid, struct private_node_ops *ops)
+{
+	if (!node_possible(nid) || !private_nodes || !ops)
+		return;
+
+	mutex_lock(&private_node_ops_lock);
+	mem_hotplug_begin();
+
+	list_del_rcu(&ops->list);
+	/* If list is now empty, clear private state */
+	if (list_empty(&private_node_ops_list[nid])) {
+		private_nodes[nid] = NODE_MEM_NOTYPE;
+		node_clear_state(nid, N_PRIVATE);
+	}
+
+	mem_hotplug_done();
+	mutex_unlock(&private_node_ops_lock);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(node_unregister_private);
+
+/**
+ * node_private_allocated - Validate a page allocation from a private node
+ * @page: The allocated page
+ *
+ * Find the ops structure whose region contains the page's physical address
+ * and call its page_allocated callback if one is registered.
+ *
+ * Returns:
+ *   0 if the callback succeeds or no callback is registered for this region
+ *   -ENXIO if the page is not found in any registered region
+ *   Other negative error code if the callback indicates the page is not safe
+ */
+int node_private_allocated(struct page *page)
+{
+	struct private_node_ops *ops;
+	phys_addr_t page_phys;
+	int nid = page_to_nid(page);
+	int ret = -ENXIO;
+
+	if (!node_possible(nid) || nid >= MAX_NUMNODES)
+		return -ENXIO;
+
+	if (!private_node_ops_initialized)
+		return -ENXIO;
+
+	page_phys = page_to_phys(page);
+
+	/*
+	 * Use RCU to safely traverse the list without holding locks.
+	 * Writers use list_add_tail_rcu/list_del_rcu with synchronize_rcu()
+	 * to ensure safe concurrent access.
+	 */
+	rcu_read_lock();
+	list_for_each_entry_rcu(ops, &private_node_ops_list[nid], list) {
+		if (page_phys >= ops->res_start && page_phys <= ops->res_end) {
+			if (ops->page_allocated)
+				ret = ops->page_allocated(page, ops->data);
+			else
+				ret = 0;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(node_private_allocated);
+
+/**
+ * node_private_freed - Notify that a page from a private node is being freed
+ * @page: The page being freed
+ *
+ * Find the ops structure whose region contains the page's physical address
+ * and call its page_freed callback if one is registered.
+ */
+void node_private_freed(struct page *page)
+{
+	struct private_node_ops *ops;
+	phys_addr_t page_phys;
+	int nid = page_to_nid(page);
+
+	if (!node_possible(nid) || nid >= MAX_NUMNODES)
+		return;
+
+	if (!private_node_ops_initialized)
+		return;
+
+	page_phys = page_to_phys(page);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ops, &private_node_ops_list[nid], list) {
+		if (page_phys >= ops->res_start && page_phys <= ops->res_end) {
+			if (ops->page_freed)
+				ops->page_freed(page, ops->data);
+			break;
+		}
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(node_private_freed);
+
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 /**
@@ -959,6 +1146,7 @@ static struct node_attr node_state_attr[] = {
 	[N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
 #endif
 	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
+	[N_PRIVATE] = _NODE_ATTR(has_private_memory, N_PRIVATE),
 	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
 	[N_GENERIC_INITIATOR] = _NODE_ATTR(has_generic_initiator,
 					   N_GENERIC_INITIATOR),
@@ -972,6 +1160,7 @@ static struct attribute *node_state_attrs[] = {
 	&node_state_attr[N_HIGH_MEMORY].attr.attr,
 #endif
 	&node_state_attr[N_MEMORY].attr.attr,
+	&node_state_attr[N_PRIVATE].attr.attr,
 	&node_state_attr[N_CPU].attr.attr,
 	&node_state_attr[N_GENERIC_INITIATOR].attr.attr,
 	NULL
@@ -1007,5 +1196,15 @@ void __init node_dev_init(void)
 			panic("%s() failed to add node: %d\n", __func__, ret);
 	}
 
+	private_nodes = kzalloc(sizeof(enum private_memtype) * MAX_NUMNODES,
+				GFP_KERNEL);
+	if (!private_nodes)
+		pr_warn("Failed to allocate private_nodes, private node support disabled\n");
+
+	/* Initialize private node ops lists */
+	for (i = 0; i < MAX_NUMNODES; i++)
+		INIT_LIST_HEAD(&private_node_ops_list[i]);
+	private_node_ops_initialized = true;
+
 	register_memory_blocks_under_nodes();
 }
diff --git a/include/linux/node.h b/include/linux/node.h
index 0269b064ba65..53a9fb63b60e 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -62,6 +62,47 @@ enum cache_mode {
 	NODE_CACHE_ADDR_MODE_EXTENDED_LINEAR,
 };
 
+enum private_memtype {
+	NODE_MEM_NOTYPE,
+	NODE_MEM_ZSWAP,
+	NODE_MEM_COMPRESSED,
+	NODE_MEM_ACCELERATOR,
+	NODE_MEM_DEMOTE_ONLY,
+	NODE_MAX_MEMTYPE,
+};
+
+/**
+ * struct private_node_ops - Callbacks for private node operations
+ * @list: List node for per-node ops list
+ * @res_start: Start physical address of the memory region
+ * @res_end: End physical address of the memory region (inclusive)
+ * @memtype: Private node memory type for this region
+ * @page_allocated: Called after a page is allocated from this region
+ *                  to validate that the page is safe to use. Returns 0
+ *                  on success, negative error code on failure. If this
+ *                  returns an error, the caller should free the page
+ *                  and try another node. May be NULL if no validation
+ *                  is needed.
+ * @page_freed: Called when a page from this region is being freed.
+ *              Allows the driver to update its internal tracking.
+ *              May be NULL if no notification is needed.
+ * @data: Driver-private data passed to callbacks
+ *
+ * Multiple drivers may register ops for a single private node. Each
+ * registration covers a specific physical memory region. When a page
+ * is allocated, the appropriate ops structure is found by matching
+ * the page's physical address against the registered regions.
+ */
+struct private_node_ops {
+	struct list_head list;
+	resource_size_t res_start;
+	resource_size_t res_end;
+	enum private_memtype memtype;
+	int (*page_allocated)(struct page *page, void *data);
+	void (*page_freed)(struct page *page, void *data);
+	void *data;
+};
+
 /**
  * struct node_cache_attrs - system memory caching attributes
  *
@@ -121,6 +162,10 @@ extern struct node *node_devices[];
 #if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA)
 void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
 					       unsigned long end_pfn);
+int node_register_private(int nid, struct private_node_ops *ops);
+void node_unregister_private(int nid, struct private_node_ops *ops);
+int node_private_allocated(struct page *page);
+void node_private_freed(struct page *page);
 #else
 static inline void register_memory_blocks_under_node_hotplug(int nid,
 							     unsigned long start_pfn,
@@ -130,6 +175,21 @@ static inline void register_memory_blocks_under_node_hotplug(int nid,
 static inline void register_memory_blocks_under_nodes(void)
 {
 }
+static inline int node_register_private(int nid, struct private_node_ops *ops)
+{
+	return -ENODEV;
+}
+static inline void node_unregister_private(int nid,
+					   struct private_node_ops *ops)
+{
+}
+static inline int node_private_allocated(struct page *page)
+{
+	return -ENXIO;
+}
+static inline void node_private_freed(struct page *page)
+{
+}
 #endif
 
 struct node_notify {
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index bd38648c998d..dac250c6f1a9 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -391,6 +391,7 @@ enum node_states {
 	N_HIGH_MEMORY = N_NORMAL_MEMORY,
 #endif
 	N_MEMORY,		/* The node has memory(regular, high, movable) */
+	N_PRIVATE,		/* The node's memory is private */
 	N_CPU,		/* The node has one or more cpus */
 	N_GENERIC_INITIATOR,	/* The node has one or more Generic Initiators */
 	NR_NODE_STATES
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 389989a28abe..57463fcb4021 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1207,7 +1207,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
 	online_pages_range(pfn, nr_pages);
 	adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
 
-	if (node_arg.nid >= 0)
+	if (node_arg.nid >= 0 && !node_state(nid, N_PRIVATE))
 		node_set_state(nid, N_MEMORY);
 	if (need_zonelists_rebuild)
 		build_all_zonelists(NULL);
-- 
2.52.0

From nobody Sun Feb 8 17:13:32 2026
From: Gregory Price
To: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-cxl@vger.kernel.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, kernel-team@meta.com, longman@redhat.com,
	tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, corbet@lwn.net,
	gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
	dave@stgolabs.net, jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com, ira.weiny@intel.com,
	dan.j.williams@intel.com, akpm@linux-foundation.org, vbabka@suse.cz,
	surenb@google.com, mhocko@suse.com, jackmanb@google.com, ziy@nvidia.com,
	david@kernel.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	rppt@kernel.org, axelrasmussen@google.com, yuanchu@google.com,
	weixugc@google.com, yury.norov@gmail.com, linux@rasmusvillemoes.dk,
	rientjes@google.com, shakeel.butt@linux.dev, chrisl@kernel.org,
	kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
	bhe@redhat.com, baohua@kernel.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, roman.gushchin@linux.dev, muchun.song@linux.dev,
	osalvador@suse.de, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	ying.huang@linux.alibaba.com, apopple@nvidia.com, cl@gentwo.org,
	harry.yoo@oracle.com, zhengqi.arch@bytedance.com
Subject: [RFC PATCH v3 2/8] mm: constify oom_control, scan_control, and alloc_context nodemask
Date: Thu, 8 Jan 2026 15:37:49 -0500
Message-ID: <20260108203755.1163107-3-gourry@gourry.net>
In-Reply-To: <20260108203755.1163107-1-gourry@gourry.net>
References: <20260108203755.1163107-1-gourry@gourry.net>

The nodemasks in these structures may come from a variety of sources,
including tasks and cpusets, and should never be modified while being
passed around inside another context. Constify the pointers so the
compiler enforces this.
Signed-off-by: Gregory Price
---
 include/linux/cpuset.h | 4 ++--
 include/linux/mm.h     | 4 ++--
 include/linux/mmzone.h | 6 +++---
 include/linux/oom.h    | 2 +-
 include/linux/swap.h   | 2 +-
 kernel/cgroup/cpuset.c | 2 +-
 mm/internal.h          | 2 +-
 mm/mmzone.c            | 5 +++--
 mm/page_alloc.c        | 4 ++--
 mm/show_mem.c          | 9 ++++++---
 mm/vmscan.c            | 6 +++---
 11 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 631577384677..fe4f29624117 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -81,7 +81,7 @@ extern bool cpuset_cpu_is_isolated(int cpu);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
-int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
+int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask);
 
 extern bool cpuset_current_node_allowed(int node, gfp_t gfp_mask);
 
@@ -226,7 +226,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 #define cpuset_current_mems_allowed (node_states[N_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 
-static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
+static inline int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask)
 {
 	return 1;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 45dfb2f2883c..dd4f5d49f638 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3572,7 +3572,7 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
 extern void mem_init(void);
 extern void __init mmap_init(void);
 
-extern void __show_mem(unsigned int flags, nodemask_t *nodemask, int max_zone_idx);
+extern void __show_mem(unsigned int flags, const nodemask_t *nodemask, int max_zone_idx);
 static inline void show_mem(void)
 {
 	__show_mem(0, NULL, MAX_NR_ZONES - 1);
@@ -3582,7 +3582,7 @@ extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
 extern __printf(3, 4)
-void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
+void warn_alloc(gfp_t gfp_mask, const nodemask_t *nodemask, const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a7db0fee54a..7f94d67ffac4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1721,7 +1721,7 @@ static inline int zonelist_node_idx(const struct zoneref *zoneref)
 
 struct zoneref *__next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes);
+					const nodemask_t *nodes);
 
 /**
  * next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
@@ -1740,7 +1740,7 @@ struct zoneref *__next_zones_zonelist(struct zoneref *z,
 */
static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+					const nodemask_t *nodes)
 {
 	if (likely(!nodes && zonelist_zone_idx(z) <= highest_zoneidx))
 		return z;
@@ -1766,7 +1766,7 @@ static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
 */
static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+					const nodemask_t *nodes)
 {
 	return next_zones_zonelist(zonelist->_zonerefs,
 							highest_zoneidx, nodes);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7b02bc1d0a7e..00da05d227e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -30,7 +30,7 @@ struct oom_control {
 	struct zonelist *zonelist;
 
 	/* Used to determine mempolicy */
-	nodemask_t *nodemask;
+	const nodemask_t *nodemask;
 
 	/* Memory cgroup in which oom is invoked, or NULL for global oom */
 	struct mem_cgroup *memcg;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..1569f3f4773b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -370,7 +370,7 @@ extern void swap_setup(void);
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-					gfp_t gfp_mask, nodemask_t *mask);
+					gfp_t gfp_mask, const nodemask_t *mask);
 
 #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
 #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 289fb1a72550..a3ade9d5968b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4326,7 +4326,7 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 *
 * Are any of the nodes in the nodemask allowed in current->mems_allowed?
 */
-int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
+int cpuset_nodemask_valid_mems_allowed(const nodemask_t *nodemask)
 {
 	return nodes_intersects(*nodemask, current->mems_allowed);
 }
diff --git a/mm/internal.h b/mm/internal.h
index 6dc83c243120..50d32055b544 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -587,7 +587,7 @@ void page_alloc_sysctl_init(void);
 */
struct alloc_context {
 	struct zonelist *zonelist;
-	nodemask_t *nodemask;
+	const nodemask_t *nodemask;
 	struct zoneref *preferred_zoneref;
 	int migratetype;
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 0c8f181d9d50..59dc3f2076a6 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -43,7 +43,8 @@ struct zone *next_zone(struct zone *zone)
 	return zone;
 }
 
-static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+static inline int zref_in_nodemask(struct zoneref *zref,
+				   const nodemask_t *nodes)
 {
 #ifdef CONFIG_NUMA
 	return node_isset(zonelist_node_idx(zref), *nodes);
@@ -55,7 +56,7 @@ static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 struct zoneref *__next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+					const nodemask_t *nodes)
 {
 	/*
 	 * Find the next suitable zone to use for the allocation.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ecb2646b57ba..bb89d81aa68c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3988,7 +3988,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	return NULL;
 }
 
-static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
+static void warn_alloc_show_mem(gfp_t gfp_mask, const nodemask_t *nodemask)
 {
 	unsigned int filter = SHOW_MEM_FILTER_NODES;
 
@@ -4008,7 +4008,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
 	mem_cgroup_show_protected_memory(NULL);
 }
 
-void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
+void warn_alloc(gfp_t gfp_mask, const nodemask_t *nodemask, const char *fmt, ...)
 {
 	struct va_format vaf;
 	va_list args;
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 3a4b5207635d..24685b5c6dcf 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -116,7 +116,8 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 * Determine whether the node should be displayed or not, depending on whether
 * SHOW_MEM_FILTER_NODES was passed to show_free_areas().
 */
-static bool show_mem_node_skip(unsigned int flags, int nid, nodemask_t *nodemask)
+static bool show_mem_node_skip(unsigned int flags, int nid,
+			       const nodemask_t *nodemask)
 {
 	if (!(flags & SHOW_MEM_FILTER_NODES))
 		return false;
@@ -177,7 +178,8 @@ static bool node_has_managed_zones(pg_data_t *pgdat, int max_zone_idx)
 * SHOW_MEM_FILTER_NODES: suppress nodes that are not allowed by current's
 * cpuset.
 */
-static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
+static void show_free_areas(unsigned int filter, const nodemask_t *nodemask,
+			    int max_zone_idx)
 {
 	unsigned long free_pcp = 0;
 	int cpu, nid;
@@ -399,7 +401,8 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 	show_swap_cache_info();
 }
 
-void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
+void __show_mem(unsigned int filter, const nodemask_t *nodemask,
+		int max_zone_idx)
 {
 	unsigned long total = 0, reserved = 0, highmem = 0;
 	struct zone *zone;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7c962ee7819f..23f68e754738 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -80,7 +80,7 @@ struct scan_control {
 	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
 	 * are scanned.
 	 */
-	nodemask_t *nodemask;
+	const nodemask_t *nodemask;
 
 	/*
 	 * The memory cgroup that hit its limit and as a result is the
@@ -6502,7 +6502,7 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
 * happens, the page allocator should not consider triggering the OOM killer.
 */
 static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
-					nodemask_t *nodemask)
+					const nodemask_t *nodemask)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -6582,7 +6582,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 }
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-				gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, const nodemask_t *nodemask)
 {
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
-- 
2.52.0

From nobody Sun Feb 8 17:13:32 2026
From: Gregory Price
To: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-cxl@vger.kernel.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, kernel-team@meta.com, longman@redhat.com,
 tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, corbet@lwn.net,
 gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
 dave@stgolabs.net, jonathan.cameron@huawei.com, dave.jiang@intel.com,
 alison.schofield@intel.com, vishal.l.verma@intel.com, ira.weiny@intel.com,
 dan.j.williams@intel.com, akpm@linux-foundation.org, vbabka@suse.cz,
 surenb@google.com, mhocko@suse.com, jackmanb@google.com, ziy@nvidia.com,
 david@kernel.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
 rppt@kernel.org, axelrasmussen@google.com, yuanchu@google.com,
 weixugc@google.com, yury.norov@gmail.com, linux@rasmusvillemoes.dk,
 rientjes@google.com, shakeel.butt@linux.dev,
 chrisl@kernel.org, kasong@tencent.com, shikemeng@huaweicloud.com,
 nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org,
 yosry.ahmed@linux.dev, chengming.zhou@linux.dev, roman.gushchin@linux.dev,
 muchun.song@linux.dev, osalvador@suse.de, matthew.brost@intel.com,
 joshua.hahnjy@gmail.com, rakie.kim@sk.com, byungchul@sk.com,
 gourry@gourry.net, ying.huang@linux.alibaba.com, apopple@nvidia.com,
 cl@gentwo.org, harry.yoo@oracle.com, zhengqi.arch@bytedance.com
Subject: [RFC PATCH v3 3/8] mm: restrict slub, compaction, and page_alloc
 to sysram
Date: Thu, 8 Jan 2026 15:37:50 -0500
Message-ID: <20260108203755.1163107-4-gourry@gourry.net>
In-Reply-To: <20260108203755.1163107-1-gourry@gourry.net>
References: <20260108203755.1163107-1-gourry@gourry.net>

Restrict page allocation and zone iteration to N_MEMORY nodes via
cpusets - or node_states[N_MEMORY] when cpusets is disabled.
__GFP_THISNODE allows N_PRIVATE nodes to be used explicitly (all nodes
become valid targets with __GFP_THISNODE).

This constrains core users of nodemasks to node_states[N_MEMORY], which
is guaranteed to at least contain the set of nodes with sysram memory
blocks present at boot.
Signed-off-by: Gregory Price
---
 include/linux/gfp.h |  6 ++++++
 mm/compaction.c     |  6 ++----
 mm/page_alloc.c     | 27 ++++++++++++++++-----------
 mm/slub.c           |  8 ++++++--
 4 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b155929af5b1..0b6cdef7a232 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -321,6 +321,7 @@ struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
 		struct mempolicy *mpol, pgoff_t ilx, int nid);
 struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
 		unsigned long addr);
+bool numa_zone_allowed(int alloc_flags, struct zone *zone, gfp_t gfp_mask);
 #else
 static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order)
 {
@@ -337,6 +338,11 @@ static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int orde
 }
 #define vma_alloc_folio_noprof(gfp, order, vma, addr) \
 	folio_alloc_noprof(gfp, order)
+static inline bool numa_zone_allowed(int alloc_flags, struct zone *zone,
+				     gfp_t gfp_mask)
+{
+	return true;
+}
 #endif
 
 #define alloc_pages(...)			alloc_hooks(alloc_pages_noprof(__VA_ARGS__))
diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c..63ef9803607f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2829,10 +2829,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 					ac->highest_zoneidx, ac->nodemask) {
 		enum compact_result status;
 
-		if (cpusets_enabled() &&
-		    (alloc_flags & ALLOC_CPUSET) &&
-		    !__cpuset_zone_allowed(zone, gfp_mask))
-			continue;
+		if (!numa_zone_allowed(alloc_flags, zone, gfp_mask))
+			continue;
 
 		if (prio > MIN_COMPACT_PRIORITY &&
 		    compaction_deferred(zone, order)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb89d81aa68c..76b12cef7dfc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3723,6 +3723,16 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
 				node_reclaim_distance;
 }
+bool numa_zone_allowed(int alloc_flags, struct zone *zone, gfp_t gfp_mask)
+{
+	/* If cpusets is being used, check mems_allowed or sysram_nodes */
+	if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET))
+		return cpuset_zone_allowed(zone, gfp_mask);
+
+	/* Otherwise only allow N_PRIVATE if __GFP_THISNODE is present */
+	return (gfp_mask & __GFP_THISNODE) ||
+	       node_isset(zone_to_nid(zone), node_states[N_MEMORY]);
+}
 #else	/* CONFIG_NUMA */
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
@@ -3814,10 +3824,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		struct page *page;
 		unsigned long mark;
 
-		if (cpusets_enabled() &&
-		    (alloc_flags & ALLOC_CPUSET) &&
-		    !__cpuset_zone_allowed(zone, gfp_mask))
-			continue;
+		if (!numa_zone_allowed(alloc_flags, zone, gfp_mask))
+			continue;
+
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a node that is within its dirty
@@ -4618,10 +4627,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		unsigned long min_wmark = min_wmark_pages(zone);
 		bool wmark;
 
-		if (cpusets_enabled() &&
-		    (alloc_flags & ALLOC_CPUSET) &&
-		    !__cpuset_zone_allowed(zone, gfp_mask))
-			continue;
+		if (!numa_zone_allowed(alloc_flags, zone, gfp_mask))
+			continue;
 
 		available = reclaimable = zone_reclaimable_pages(zone);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@ -5131,10 +5138,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	for_next_zone_zonelist_nodemask(zone, z, ac.highest_zoneidx, ac.nodemask) {
 		unsigned long mark;
 
-		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
-		    !__cpuset_zone_allowed(zone, gfp)) {
+		if (!numa_zone_allowed(alloc_flags, zone, gfp))
 			continue;
-		}
 
 		if (nr_online_nodes > 1 && zone != zonelist_zone(ac.preferred_zoneref) &&
 		    zone_to_nid(zone) != zonelist_node_idx(ac.preferred_zoneref)) {
diff --git a/mm/slub.c b/mm/slub.c
index 861592ac5425..adebbddc48f6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3594,9 +3594,13 @@ static struct slab *get_any_partial(struct kmem_cache *s,
 		struct kmem_cache_node *n;
 
 		n = get_node(s, zone_to_nid(zone));
+		if (!n)
+			continue;
+
+		if (!numa_zone_allowed(ALLOC_CPUSET, zone, pc->flags))
+			continue;
 
-		if (n && cpuset_zone_allowed(zone, pc->flags) &&
-		    n->nr_partial > s->min_partial) {
+		if (n->nr_partial > s->min_partial) {
 			slab = get_partial_node(s, n, pc);
 			if (slab) {
 				/*
-- 
2.52.0

From nobody Sun Feb 8 17:13:32 2026
From: Gregory Price
To: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-cxl@vger.kernel.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, kernel-team@meta.com, longman@redhat.com,
 tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, corbet@lwn.net,
 gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
 dave@stgolabs.net, jonathan.cameron@huawei.com, dave.jiang@intel.com,
 alison.schofield@intel.com, vishal.l.verma@intel.com, ira.weiny@intel.com,
 dan.j.williams@intel.com, akpm@linux-foundation.org, vbabka@suse.cz,
 surenb@google.com, mhocko@suse.com, jackmanb@google.com, ziy@nvidia.com,
 david@kernel.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
 rppt@kernel.org, axelrasmussen@google.com, yuanchu@google.com,
 weixugc@google.com, yury.norov@gmail.com, linux@rasmusvillemoes.dk,
 rientjes@google.com, shakeel.butt@linux.dev, chrisl@kernel.org,
 kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
 bhe@redhat.com, baohua@kernel.org, yosry.ahmed@linux.dev,
 chengming.zhou@linux.dev, roman.gushchin@linux.dev, muchun.song@linux.dev,
 osalvador@suse.de, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
 rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
 ying.huang@linux.alibaba.com, apopple@nvidia.com, cl@gentwo.org,
 harry.yoo@oracle.com, zhengqi.arch@bytedance.com
Subject: [RFC PATCH v3 4/8] cpuset: introduce cpuset.mems.sysram
Date: Thu, 8 Jan 2026 15:37:51 -0500
Message-ID: <20260108203755.1163107-5-gourry@gourry.net>
In-Reply-To: <20260108203755.1163107-1-gourry@gourry.net>
References: <20260108203755.1163107-1-gourry@gourry.net>
Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" mems_sysram contains only SystemRAM nodes (omitting Private Nodes). The nodemask is intersect(effective_mems, node_states[N_MEMORY]). When checking mems_allowed, check for __GFP_THISNODE to determine if the check should be made against sysram_nodes or mems_allowed. This omits Private Nodes (N_PRIVATE) from default mems_allowed checks, making those nodes unreachable via normal allocation paths (page faults, mempolicies, etc). Signed-off-by: Gregory Price --- include/linux/cpuset.h | 20 +++++-- kernel/cgroup/cpuset-internal.h | 8 +++ kernel/cgroup/cpuset-v1.c | 8 +++ kernel/cgroup/cpuset.c | 96 +++++++++++++++++++++++++-------- mm/memcontrol.c | 2 +- mm/mempolicy.c | 6 +-- mm/migrate.c | 4 +- 7 files changed, 112 insertions(+), 32 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index fe4f29624117..1ae09ec0fcb7 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -174,7 +174,9 @@ static inline void set_mems_allowed(nodemask_t nodemask) task_unlock(current); } =20 -extern void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask); +extern void cpuset_sysram_nodes_allowed(struct cgroup *cgroup, + nodemask_t *mask); +extern nodemask_t cpuset_sysram_nodemask(struct task_struct *p); #else /* !CONFIG_CPUSETS */ =20 static inline bool cpusets_enabled(void) { return false; } @@ -218,7 +220,13 @@ static inline bool cpuset_cpu_is_isolated(int cpu) return false; } =20 -static inline nodemask_t cpuset_mems_allowed(struct task_struct *p) +static inline void cpuset_sysram_nodes_allowed(struct cgroup *cgroup, + nodemask_t *mask) +{ + nodes_copy(*mask, node_possible_map); +} + +static inline nodemask_t cpuset_sysram_nodemask(struct task_struct *p) { return node_possible_map; } @@ -301,10 +309,16 @@ static inline bool read_mems_allowed_retry(unsigned i= nt seq) return false; } =20 -static inline void cpuset_nodes_allowed(struct cgroup *cgroup, 
nodemask_t = *mask) +static inline void cpuset_sysram_nodes_allowed(struct cgroup *cgroup, + nodemask_t *mask) { nodes_copy(*mask, node_states[N_MEMORY]); } + +static nodemask_t cpuset_sysram_nodemask(struct task_struct *p) +{ + return node_states[N_MEMORY]; +} #endif /* !CONFIG_CPUSETS */ =20 #endif /* _LINUX_CPUSET_H */ diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-interna= l.h index 01976c8e7d49..4764aaef585f 100644 --- a/kernel/cgroup/cpuset-internal.h +++ b/kernel/cgroup/cpuset-internal.h @@ -53,6 +53,7 @@ typedef enum { FILE_MEMORY_MIGRATE, FILE_CPULIST, FILE_MEMLIST, + FILE_MEMS_SYSRAM, FILE_EFFECTIVE_CPULIST, FILE_EFFECTIVE_MEMLIST, FILE_SUBPARTS_CPULIST, @@ -104,6 +105,13 @@ struct cpuset { cpumask_var_t effective_cpus; nodemask_t effective_mems; =20 + /* + * SystemRAM Memory Nodes for tasks. + * This is the intersection of effective_mems and node_states[N_MEMORY]. + * Tasks will have their sysram_nodes set to this value. + */ + nodemask_t mems_sysram; + /* * Exclusive CPUs dedicated to current cgroup (default hierarchy only) * diff --git a/kernel/cgroup/cpuset-v1.c b/kernel/cgroup/cpuset-v1.c index 12e76774c75b..45b74181effd 100644 --- a/kernel/cgroup/cpuset-v1.c +++ b/kernel/cgroup/cpuset-v1.c @@ -293,6 +293,8 @@ void cpuset1_hotplug_update_tasks(struct cpuset *cs, cpumask_copy(cs->effective_cpus, new_cpus); cs->mems_allowed =3D *new_mems; cs->effective_mems =3D *new_mems; + nodes_and(cs->mems_sysram, cs->effective_mems, node_states[N_MEMORY]); + cpuset_update_tasks_nodemask(cs); cpuset_callback_unlock_irq(); =20 /* @@ -532,6 +534,12 @@ struct cftype cpuset1_files[] =3D { .private =3D FILE_EFFECTIVE_MEMLIST, }, =20 + { + .name =3D "mems_sysram", + .seq_show =3D cpuset_common_seq_show, + .private =3D FILE_MEMS_SYSRAM, + }, + { .name =3D "cpu_exclusive", .read_u64 =3D cpuset_read_u64, diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index a3ade9d5968b..4c213a2ea7ac 100644 --- a/kernel/cgroup/cpuset.c +++ 
b/kernel/cgroup/cpuset.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -454,11 +455,11 @@ static void guarantee_active_cpus(struct task_struct = *tsk, * * Call with callback_lock or cpuset_mutex held. */ -static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask) +static void guarantee_online_sysram_nodes(struct cpuset *cs, nodemask_t *p= mask) { - while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY])) + while (!nodes_intersects(cs->mems_sysram, node_states[N_MEMORY])) cs =3D parent_cs(cs); - nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]); + nodes_and(*pmask, cs->mems_sysram, node_states[N_MEMORY]); } =20 /** @@ -2791,7 +2792,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs) =20 cpuset_being_rebound =3D cs; /* causes mpol_dup() rebind */ =20 - guarantee_online_mems(cs, &newmems); + guarantee_online_sysram_nodes(cs, &newmems); =20 /* * The mpol_rebind_mm() call takes mmap_lock, which we couldn't @@ -2816,7 +2817,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs) =20 migrate =3D is_memory_migrate(cs); =20 - mpol_rebind_mm(mm, &cs->mems_allowed); + mpol_rebind_mm(mm, &cs->mems_sysram); if (migrate) cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems); else @@ -2876,6 +2877,8 @@ static void update_nodemasks_hier(struct cpuset *cs, = nodemask_t *new_mems) =20 spin_lock_irq(&callback_lock); cp->effective_mems =3D *new_mems; + nodes_and(cp->mems_sysram, cp->effective_mems, + node_states[N_MEMORY]); spin_unlock_irq(&callback_lock); =20 WARN_ON(!is_in_v2_mode() && @@ -3304,11 +3307,11 @@ static void cpuset_attach(struct cgroup_taskset *ts= et) * by skipping the task iteration and update. 
*/ if (cpuset_v2() && !cpus_updated && !mems_updated) { - cpuset_attach_nodemask_to =3D cs->effective_mems; + cpuset_attach_nodemask_to =3D cs->mems_sysram; goto out; } =20 - guarantee_online_mems(cs, &cpuset_attach_nodemask_to); + guarantee_online_sysram_nodes(cs, &cpuset_attach_nodemask_to); =20 cgroup_taskset_for_each(task, css, tset) cpuset_attach_task(cs, task); @@ -3319,7 +3322,7 @@ static void cpuset_attach(struct cgroup_taskset *tset) * if there is no change in effective_mems and CS_MEMORY_MIGRATE is * not set. */ - cpuset_attach_nodemask_to =3D cs->effective_mems; + cpuset_attach_nodemask_to =3D cs->mems_sysram; if (!is_memory_migrate(cs) && !mems_updated) goto out; =20 @@ -3441,6 +3444,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void = *v) case FILE_EFFECTIVE_MEMLIST: seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems)); break; + case FILE_MEMS_SYSRAM: + seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->mems_sysram)); + break; case FILE_EXCLUSIVE_CPULIST: seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->exclusive_cpus)); break; @@ -3552,6 +3558,12 @@ static struct cftype dfl_files[] =3D { .private =3D FILE_EFFECTIVE_MEMLIST, }, =20 + { + .name =3D "mems.sysram", + .seq_show =3D cpuset_common_seq_show, + .private =3D FILE_MEMS_SYSRAM, + }, + { .name =3D "cpus.partition", .seq_show =3D cpuset_partition_show, @@ -3654,6 +3666,8 @@ static int cpuset_css_online(struct cgroup_subsys_sta= te *css) if (is_in_v2_mode()) { cpumask_copy(cs->effective_cpus, parent->effective_cpus); cs->effective_mems =3D parent->effective_mems; + nodes_and(cs->mems_sysram, cs->effective_mems, + node_states[N_MEMORY]); } spin_unlock_irq(&callback_lock); =20 @@ -3685,6 +3699,8 @@ static int cpuset_css_online(struct cgroup_subsys_sta= te *css) spin_lock_irq(&callback_lock); cs->mems_allowed =3D parent->mems_allowed; cs->effective_mems =3D parent->mems_allowed; + nodes_and(cs->mems_sysram, cs->effective_mems, + node_states[N_MEMORY]); cpumask_copy(cs->cpus_allowed, 
parent->cpus_allowed); cpumask_copy(cs->effective_cpus, parent->cpus_allowed); spin_unlock_irq(&callback_lock); @@ -3838,7 +3854,7 @@ static void cpuset_fork(struct task_struct *task) =20 /* CLONE_INTO_CGROUP */ mutex_lock(&cpuset_mutex); - guarantee_online_mems(cs, &cpuset_attach_nodemask_to); + guarantee_online_sysram_nodes(cs, &cpuset_attach_nodemask_to); cpuset_attach_task(cs, task); =20 dec_attach_in_progress_locked(cs); @@ -3887,7 +3903,8 @@ int __init cpuset_init(void) cpumask_setall(top_cpuset.effective_xcpus); cpumask_setall(top_cpuset.exclusive_cpus); nodes_setall(top_cpuset.effective_mems); - + nodes_and(top_cpuset.mems_sysram, top_cpuset.effective_mems, + node_states[N_MEMORY]); fmeter_init(&top_cpuset.fmeter); =20 BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL)); @@ -3916,6 +3933,7 @@ hotplug_update_tasks(struct cpuset *cs, spin_lock_irq(&callback_lock); cpumask_copy(cs->effective_cpus, new_cpus); cs->effective_mems =3D *new_mems; + nodes_and(cs->mems_sysram, cs->effective_mems, node_states[N_MEMORY]); spin_unlock_irq(&callback_lock); =20 if (cpus_updated) @@ -4064,7 +4082,15 @@ static void cpuset_handle_hotplug(void) =20 /* fetch the available cpus/mems and find out which changed how */ cpumask_copy(&new_cpus, cpu_active_mask); - new_mems =3D node_states[N_MEMORY]; + + /* + * Effective mems is union(N_MEMORY, N_PRIVATE), this allows + * control over N_PRIVATE node usage from cgroups while + * mems.sysram prevents N_PRIVATE nodes from being used + * without __GFP_THISNODE. 
+ */ + nodes_clear(new_mems); + nodes_or(new_mems, node_states[N_MEMORY], node_states[N_PRIVATE]); =20 /* * If subpartitions_cpus is populated, it is likely that the check @@ -4106,6 +4132,8 @@ static void cpuset_handle_hotplug(void) if (!on_dfl) top_cpuset.mems_allowed =3D new_mems; top_cpuset.effective_mems =3D new_mems; + nodes_and(top_cpuset.mems_sysram, top_cpuset.effective_mems, + node_states[N_MEMORY]); spin_unlock_irq(&callback_lock); cpuset_update_tasks_nodemask(&top_cpuset); } @@ -4176,6 +4204,7 @@ void __init cpuset_init_smp(void) =20 cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask); top_cpuset.effective_mems =3D node_states[N_MEMORY]; + top_cpuset.mems_sysram =3D node_states[N_MEMORY]; =20 hotplug_node_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI); =20 @@ -4293,14 +4322,18 @@ bool cpuset_cpus_allowed_fallback(struct task_struc= t *tsk) return changed; } =20 +/* + * At this point in time, no hotplug nodes can have been added, so just set + * the sysram_nodes of the init task to the set of N_MEMORY nodes. + */ void __init cpuset_init_current_mems_allowed(void) { - nodes_setall(current->mems_allowed); + current->mems_allowed =3D node_states[N_MEMORY]; } =20 /** - * cpuset_mems_allowed - return mems_allowed mask from a tasks cpuset. - * @tsk: pointer to task_struct from which to obtain cpuset->mems_allowed. + * cpuset_sysram_nodemask - return mems_sysram mask from a tasks cpuset. + * @tsk: pointer to task_struct from which to obtain cpuset->mems_sysram. * * Description: Returns the nodemask_t mems_allowed of the cpuset * attached to the specified @tsk. Guaranteed to return some non-empty @@ -4308,13 +4341,13 @@ void __init cpuset_init_current_mems_allowed(void) * tasks cpuset. 
**/ =20 -nodemask_t cpuset_mems_allowed(struct task_struct *tsk) +nodemask_t cpuset_sysram_nodemask(struct task_struct *tsk) { nodemask_t mask; unsigned long flags; =20 spin_lock_irqsave(&callback_lock, flags); - guarantee_online_mems(task_cs(tsk), &mask); + guarantee_online_sysram_nodes(task_cs(tsk), &mask); spin_unlock_irqrestore(&callback_lock, flags); =20 return mask; @@ -4383,17 +4416,30 @@ static struct cpuset *nearest_hardwall_ancestor(str= uct cpuset *cs) * tsk_is_oom_victim - any node ok * GFP_KERNEL - any node in enclosing hardwalled cpuset ok * GFP_USER - only nodes in current tasks mems allowed ok. + * GFP_THISNODE - allows private memory nodes in mems_allowed */ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask) { struct cpuset *cs; /* current cpuset ancestors */ bool allowed; /* is allocation in zone z allowed? */ unsigned long flags; + bool private_nodes =3D gfp_mask & __GFP_THISNODE; =20 + /* Only SysRAM nodes are valid in interrupt context */ if (in_interrupt()) - return true; - if (node_isset(node, current->mems_allowed)) - return true; + return node_isset(node, node_states[N_MEMORY]); + + if (private_nodes) { + rcu_read_lock(); + cs =3D task_cs(current); + allowed =3D node_isset(node, cs->effective_mems); + rcu_read_unlock(); + } else + allowed =3D node_isset(node, current->mems_allowed); + + if (allowed) + return allowed; + /* * Allow tasks that have access to memory reserves because they have * been OOM killed to get memory anywhere. 
@@ -4412,6 +4458,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp= _mask) cs =3D nearest_hardwall_ancestor(task_cs(current)); allowed =3D node_isset(node, cs->mems_allowed); =20 + /* If not allowing private node allocation, restrict to sysram nodes */ + if (!private_nodes) + allowed &=3D node_isset(node, node_states[N_MEMORY]); + spin_unlock_irqrestore(&callback_lock, flags); return allowed; } @@ -4434,7 +4484,7 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_= mask) * online due to hot plugins. Callers should check the mask for validity on * return based on its subsequent use. **/ -void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask) +void cpuset_sysram_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask) { struct cgroup_subsys_state *css; struct cpuset *cs; @@ -4457,16 +4507,16 @@ void cpuset_nodes_allowed(struct cgroup *cgroup, no= demask_t *mask) =20 /* * The reference taken via cgroup_get_e_css is sufficient to - * protect css, but it does not imply safe accesses to effective_mems. + * protect css, but it does not imply safe accesses to mems_sysram. * - * Normally, accessing effective_mems would require the cpuset_mutex + * Normally, accessing mems_sysram would require the cpuset_mutex * or callback_lock - but the correctness of this information is stale * immediately after the query anyway. We do not acquire the lock * during this process to save lock contention in exchange for racing * against mems_allowed rebinds. */ cs =3D container_of(css, struct cpuset, css); - nodes_copy(*mask, cs->effective_mems); + nodes_copy(*mask, cs->mems_sysram); css_put(css); } =20 diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7fbe9565cd06..2df7168edca0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5610,7 +5610,7 @@ void mem_cgroup_node_filter_allowed(struct mem_cgroup= *memcg, nodemask_t *mask) * in effective_mems and hot-unpluging of nodes, inaccurate allowed * mask is acceptable. 
         */
-       cpuset_nodes_allowed(memcg->css.cgroup, &allowed);
+       cpuset_sysram_nodes_allowed(memcg->css.cgroup, &allowed);
        nodes_and(*mask, *mask, allowed);
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 76da50425712..760b5b6b4ae6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1901,14 +1901,14 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
        }
        rcu_read_unlock();
 
-       task_nodes = cpuset_mems_allowed(task);
+       task_nodes = cpuset_sysram_nodemask(task);
        /* Is the user allowed to access the target nodes? */
        if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) {
                err = -EPERM;
                goto out_put;
        }
 
-       task_nodes = cpuset_mems_allowed(current);
+       task_nodes = cpuset_sysram_nodemask(current);
        nodes_and(*new, *new, task_nodes);
        if (nodes_empty(*new))
                goto out_put;
@@ -2833,7 +2833,7 @@ struct mempolicy *__mpol_dup(struct mempolicy *old)
        *new = *old;
 
        if (current_cpuset_is_being_rebound()) {
-               nodemask_t mems = cpuset_mems_allowed(current);
+               nodemask_t mems = cpuset_sysram_nodemask(current);
                mpol_rebind_policy(new, &mems);
        }
        atomic_set(&new->refcnt, 1);
diff --git a/mm/migrate.c b/mm/migrate.c
index 5169f9717f60..0ad893bf862b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2534,7 +2534,7 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
         */
        if (!pid) {
                mmget(current->mm);
-               *mem_nodes = cpuset_mems_allowed(current);
+               *mem_nodes = cpuset_sysram_nodemask(current);
                return current->mm;
        }
 
@@ -2555,7 +2555,7 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
        mm = ERR_PTR(security_task_movememory(task));
        if (IS_ERR(mm))
                goto out;
-       *mem_nodes = cpuset_mems_allowed(task);
+       *mem_nodes = cpuset_sysram_nodemask(task);
        mm = get_task_mm(task);
 out:
        put_task_struct(task);
-- 
2.52.0
From: Gregory Price <gourry@gourry.net>
To: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-cxl@vger.kernel.org
Subject: [RFC PATCH v3 5/8] Documentation/admin-guide/cgroups: update docs for mems_allowed
Date: Thu, 8 Jan 2026 15:37:52 -0500
Message-ID: <20260108203755.1163107-6-gourry@gourry.net>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260108203755.1163107-1-gourry@gourry.net>
References: <20260108203755.1163107-1-gourry@gourry.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Document the new semantics of mems_allowed and sysram_nodes:
mems_allowed may contain the union of N_MEMORY and N_PRIVATE nodes,
while sysram_nodes may only contain a subset of the N_MEMORY nodes.

cpuset.mems.sysram is a new read-only ABI which reports the list of
N_MEMORY nodes the cpuset is allowed to use, while cpuset.mems and
cpuset.mems.effective may also contain N_PRIVATE nodes.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 .../admin-guide/cgroup-v1/cpusets.rst     | 19 +++++++++++---
 Documentation/admin-guide/cgroup-v2.rst   | 26 +++++++++++++++++--
 Documentation/filesystems/proc.rst        |  2 +-
 3 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index c7909e5ac136..6d326056f7b4 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -158,21 +158,26 @@ new system calls are added for cpusets - all support for querying and
 modifying cpusets is via this cpuset file system.
 
 The /proc/<pid>/status file for each task has four added lines,
-displaying the task's cpus_allowed (on which CPUs it may be scheduled)
-and mems_allowed (on which Memory Nodes it may obtain memory),
-in the two formats seen in the following example::
+displaying the task's cpus_allowed (on which CPUs it may be scheduled),
+and mems_allowed (on which SystemRAM nodes it may obtain memory),
+in the formats seen in the following example::
 
   Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
   Cpus_allowed_list:      0-127
   Mems_allowed:   ffffffff,ffffffff
   Mems_allowed_list:      0-63
 
+Note that Mems_allowed only shows SystemRAM nodes (N_MEMORY), not
+Private Nodes.  Private Nodes may be accessible via __GFP_THISNODE
+allocations if they appear in the task's cpuset.effective_mems.
+
 Each cpuset is represented by a directory in the cgroup file system
 containing (on top of the standard cgroup files) the following
 files describing that cpuset:
 
  - cpuset.cpus: list of CPUs in that cpuset
  - cpuset.mems: list of Memory Nodes in that cpuset
+ - cpuset.mems.sysram: read-only list of SystemRAM nodes (excludes Private Nodes)
  - cpuset.memory_migrate flag: if set, move pages to cpusets nodes
  - cpuset.cpu_exclusive flag: is cpu placement exclusive?
  - cpuset.mem_exclusive flag: is memory placement exclusive?
@@ -227,7 +232,9 @@ nodes with memory--using the cpuset_track_online_nodes() hook.
 
 The cpuset.effective_cpus and cpuset.effective_mems files are
 normally read-only copies of cpuset.cpus and cpuset.mems files
-respectively.  If the cpuset cgroup filesystem is mounted with the
+respectively.  The cpuset.effective_mems file may include both
+regular SystemRAM nodes (N_MEMORY) and Private Nodes (N_PRIVATE).
+If the cpuset cgroup filesystem is mounted with the
 special "cpuset_v2_mode" option, the behavior of these files will
 become similar to the corresponding files in cpuset v2.  In other
 words, hotplug events will not change cpuset.cpus and cpuset.mems.
 Those events will
@@ -236,6 +243,10 @@ the actual cpus and memory nodes that are currently used by this
 cpuset.  See Documentation/admin-guide/cgroup-v2.rst for more
 information about cpuset v2 behavior.
 
+The cpuset.mems.sysram file shows only the SystemRAM nodes (N_MEMORY)
+from cpuset.effective_mems, excluding any Private Nodes.  This
+represents the nodes available for general memory allocation.
+
 
 1.4 What are exclusive cpusets ?
 --------------------------------
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 7f5b59d95fce..6af54efb84a2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2530,8 +2530,11 @@ Cpuset Interface Files
        cpuset-enabled cgroups.
 
        It lists the onlined memory nodes that are actually granted to
-       this cgroup by its parent.  These memory nodes are allowed to
-       be used by tasks within the current cgroup.
+       this cgroup by its parent.  This includes both regular SystemRAM
+       nodes (N_MEMORY) and Private Nodes (N_PRIVATE) that provide
+       device-specific memory not intended for general consumption.
+       Tasks within this cgroup may access Private Nodes using explicit
+       __GFP_THISNODE allocations if the node is in this mask.
 
        If "cpuset.mems" is empty, it shows all the memory nodes from the
        parent cgroup that will be available to be used by this cgroup.
@@ -2541,6 +2544,25 @@ Cpuset Interface Files
 
        Its value will be affected by memory nodes hotplug events.
 
+  cpuset.mems.sysram
+       A read-only multiple values file which exists on all
+       cpuset-enabled cgroups.
+
+       It lists the SystemRAM nodes (N_MEMORY) that are available for
+       general memory allocation by tasks within this cgroup.  This is
+       a subset of "cpuset.mems.effective" that excludes Private Nodes.
+
+       Normal page allocations are restricted to nodes in this mask.
+       The kernel page allocator, slab allocator, and compaction only
+       consider SystemRAM nodes when allocating memory for tasks.
+
+       Private Nodes are excluded from this mask because their memory
+       is managed by device drivers for specific purposes (e.g., CXL
+       compressed memory, accelerator memory) and should not be used
+       for general allocations.
+
+       Its value will be affected by memory nodes hotplug events.
+
   cpuset.cpus.exclusive
        A read-write multiple values file which exists on non-root
        cpuset-enabled cgroups.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c92e95e28047..68f3d8ffc03b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -294,7 +294,7 @@ It's slow but very precise.
  Cpus_active_mm              mask of CPUs on which this process has an active memory context
  Cpus_active_mm_list         Same as previous, but in "list format"
- Mems_allowed                mask of memory nodes allowed to this process
+ Mems_allowed                mask of SystemRAM nodes for general allocations
  Mems_allowed_list           Same as previous, but in "list format"
  voluntary_ctxt_switches     number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
-- 
2.52.0
From: Gregory Price <gourry@gourry.net>
To: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-cxl@vger.kernel.org
Subject: [RFC PATCH v3 6/8] drivers/cxl/core/region: add private_region
Date: Thu, 8 Jan 2026 15:37:53 -0500
Message-ID: <20260108203755.1163107-7-gourry@gourry.net>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260108203755.1163107-1-gourry@gourry.net>
References: <20260108203755.1163107-1-gourry@gourry.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

A private_region is just a RAM region which attempts to set the
target_node to N_PRIVATE before continuing to create a DAX device and
subsequently hotplugging memory onto the system.

A CXL device driver would create a private_region with the intent to
manage how the memory can be used more granularly than typical
SystemRAM.

This patch adds the infrastructure for a private memory region.
Added as a separate folder to keep private region types organized.

usage:
  echo regionN > decoderX.Y/create_private_region
  echo type > regionN/private_type

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/core.h                       |   4 +
 drivers/cxl/core/port.c                       |   4 +
 drivers/cxl/core/private_region/Makefile      |   9 ++
 .../cxl/core/private_region/private_region.c  | 119 ++++++++++++++++++
 .../cxl/core/private_region/private_region.h  |  10 ++
 drivers/cxl/core/region.c                     |  63 ++++++++--
 drivers/cxl/cxl.h                             |  20 +++
 8 files changed, 219 insertions(+), 11 deletions(-)
 create mode 100644 drivers/cxl/core/private_region/Makefile
 create mode 100644 drivers/cxl/core/private_region/private_region.c
 create mode 100644 drivers/cxl/core/private_region/private_region.h

diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 5ad8fef210b5..2dd882a52609 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -17,6 +17,7 @@ cxl_core-y += cdat.o
 cxl_core-y += ras.o
 cxl_core-$(CONFIG_TRACING) += trace.o
 cxl_core-$(CONFIG_CXL_REGION) += region.o
+obj-$(CONFIG_CXL_REGION) += private_region/
 cxl_core-$(CONFIG_CXL_MCE) += mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1fb66132b777..159f92e4bea1 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -21,6 +21,7 @@ enum cxl_detach_mode {
 #ifdef CONFIG_CXL_REGION
 extern struct device_attribute dev_attr_create_pmem_region;
 extern struct device_attribute dev_attr_create_ram_region;
+extern struct device_attribute dev_attr_create_private_region;
 extern struct device_attribute dev_attr_delete_region;
 extern struct device_attribute dev_attr_region;
 extern const struct device_type cxl_pmem_region_type;
@@ -30,6 +31,9 @@ extern const struct device_type cxl_region_type;
 int cxl_decoder_detach(struct cxl_region *cxlr,
                       struct cxl_endpoint_decoder *cxled, int pos,
                       enum cxl_detach_mode mode);
+int devm_cxl_add_dax_region(struct cxl_region *cxlr);
+struct cxl_region *to_cxl_region(struct device *dev);
+extern struct device_attribute dev_attr_private_type;
 
 #define CXL_REGION_ATTR(x) (&dev_attr_##x.attr)
 #define CXL_REGION_TYPE(x) (&cxl_region_type)
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index fef3aa0c6680..aedecb83e59b 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -333,6 +333,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
        &dev_attr_qos_class.attr,
        SET_CXL_REGION_ATTR(create_pmem_region)
        SET_CXL_REGION_ATTR(create_ram_region)
+       SET_CXL_REGION_ATTR(create_private_region)
        SET_CXL_REGION_ATTR(delete_region)
        NULL,
 };
@@ -362,6 +363,9 @@ static umode_t cxl_root_decoder_visible(struct kobject *kobj, struct attribute *
        if (a == CXL_REGION_ATTR(create_ram_region) && !can_create_ram(cxlrd))
                return 0;
 
+       if (a == CXL_REGION_ATTR(create_private_region) && !can_create_ram(cxlrd))
+               return 0;
+
        if (a == CXL_REGION_ATTR(delete_region) &&
            !(can_create_pmem(cxlrd) || can_create_ram(cxlrd)))
                return 0;
diff --git a/drivers/cxl/core/private_region/Makefile b/drivers/cxl/core/private_region/Makefile
new file mode 100644
index 000000000000..d17498129ba6
--- /dev/null
+++ b/drivers/cxl/core/private_region/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# CXL Private Region type implementations
+#
+
+ccflags-y += -I$(srctree)/drivers/cxl
+
+# Core dispatch and sysfs
+obj-$(CONFIG_CXL_REGION) += private_region.o
diff --git a/drivers/cxl/core/private_region/private_region.c b/drivers/cxl/core/private_region/private_region.c
new file mode 100644
index 000000000000..ead48abb9fc7
--- /dev/null
+++ b/drivers/cxl/core/private_region/private_region.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CXL Private Region - dispatch and lifecycle management
+ *
+ * This file implements the main registration and unregistration dispatch
+ * for CXL private regions. It handles common initialization and delegates
+ * to type-specific implementations.
+ */
+
+#include
+#include
+#include "../../cxl.h"
+#include "private_region.h"
+
+static const char *private_type_to_string(enum cxl_private_region_type type)
+{
+       switch (type) {
+       default:
+               return "";
+       }
+}
+
+static enum cxl_private_region_type string_to_private_type(const char *str)
+{
+       return CXL_PRIVATE_NONE;
+}
+
+static ssize_t private_type_show(struct device *dev,
+                                struct device_attribute *attr, char *buf)
+{
+       struct cxl_region *cxlr = to_cxl_region(dev);
+
+       return sysfs_emit(buf, "%s\n", private_type_to_string(cxlr->private_type));
+}
+
+static ssize_t private_type_store(struct device *dev,
+                                 struct device_attribute *attr,
+                                 const char *buf, size_t len)
+{
+       struct cxl_region *cxlr = to_cxl_region(dev);
+       struct cxl_region_params *p = &cxlr->params;
+       enum cxl_private_region_type type;
+       ssize_t rc;
+
+       type = string_to_private_type(buf);
+       if (type == CXL_PRIVATE_NONE)
+               return -EINVAL;
+
+       ACQUIRE(rwsem_write_kill, rwsem)(&cxl_rwsem.region);
+       if ((rc = ACQUIRE_ERR(rwsem_write_kill, &rwsem)))
+               return rc;
+
+       /* Can only change type before region is committed */
+       if (p->state >= CXL_CONFIG_COMMIT)
+               return -EBUSY;
+
+       cxlr->private_type = type;
+
+       return len;
+}
+DEVICE_ATTR_RW(private_type);
+
+/*
+ * Register a private CXL region based on its private_type.
+ *
+ * This function is called during commit. It validates the private_type,
+ * initializes the private_ops, and dispatches to the appropriate
+ * registration function which handles memtype, callbacks, and node
+ * registration.
+ */
+int cxl_register_private_region(struct cxl_region *cxlr)
+{
+       int rc = 0;
+
+       if (!cxlr->params.res)
+               return -EINVAL;
+
+       if (cxlr->private_type == CXL_PRIVATE_NONE) {
+               dev_err(&cxlr->dev, "private_type must be set before commit\n");
+               return -EINVAL;
+       }
+
+       /* Initialize the private_ops with region info */
+       cxlr->private_ops.res_start = cxlr->params.res->start;
+       cxlr->private_ops.res_end = cxlr->params.res->end;
+       cxlr->private_ops.data = cxlr;
+
+       /* Call type-specific registration which sets memtype and callbacks */
+       switch (cxlr->private_type) {
+       default:
+               dev_dbg(&cxlr->dev, "unsupported private_type: %d\n",
+                       cxlr->private_type);
+               rc = -EINVAL;
+               break;
+       }
+
+       if (!rc)
+               set_bit(CXL_REGION_F_PRIVATE_REGISTERED, &cxlr->flags);
+       return rc;
+}
+
+/*
+ * Unregister a private CXL region.
+ *
+ * This function is called during region reset or device release.
+ * It dispatches to the appropriate type-specific cleanup function.
+ */
+void cxl_unregister_private_region(struct cxl_region *cxlr)
+{
+       if (!test_and_clear_bit(CXL_REGION_F_PRIVATE_REGISTERED, &cxlr->flags))
+               return;
+
+       /* Dispatch to type-specific cleanup */
+       switch (cxlr->private_type) {
+       default:
+               break;
+       }
+}
diff --git a/drivers/cxl/core/private_region/private_region.h b/drivers/cxl/core/private_region/private_region.h
new file mode 100644
index 000000000000..9b34e51d8df4
--- /dev/null
+++ b/drivers/cxl/core/private_region/private_region.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __CXL_PRIVATE_REGION_H__
+#define __CXL_PRIVATE_REGION_H__
+
+struct cxl_region;
+
+int cxl_register_private_region(struct cxl_region *cxlr);
+void cxl_unregister_private_region(struct cxl_region *cxlr);
+
+#endif /* __CXL_PRIVATE_REGION_H__ */
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index ae899f68551f..c60eef96c0ca 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include "core.h"
+#include "private_region/private_region.h"
 
 /**
  * DOC: cxl core region
@@ -38,8 +39,6 @@
  */
 static nodemask_t nodemask_region_seen = NODE_MASK_NONE;
 
-static struct cxl_region *to_cxl_region(struct device *dev);
-
 #define __ACCESS_ATTR_RO(_level, _name) {                              \
        .attr = { .name = __stringify(_name), .mode = 0444 },           \
        .show = _name##_access##_level##_show,                          \
@@ -398,9 +397,6 @@ static int __commit(struct cxl_region *cxlr)
                return rc;
 
        rc = cxl_region_decode_commit(cxlr);
-       if (rc)
-               return rc;
-
        p->state = CXL_CONFIG_COMMIT;
 
        return 0;
@@ -615,12 +611,17 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
        struct cxl_region *cxlr = to_cxl_region(dev);
        const char *desc;
 
-       if (cxlr->mode == CXL_PARTMODE_RAM)
-               desc = "ram";
-       else if (cxlr->mode == CXL_PARTMODE_PMEM)
+       switch (cxlr->mode) {
+       case CXL_PARTMODE_RAM:
+               desc = cxlr->private ? "private" : "ram";
+               break;
+       case CXL_PARTMODE_PMEM:
                desc = "pmem";
-       else
+               break;
+       default:
                desc = "";
+               break;
+       }
 
        return sysfs_emit(buf, "%s\n", desc);
 }
@@ -772,6 +773,7 @@ static struct attribute *cxl_region_attrs[] = {
        &dev_attr_size.attr,
        &dev_attr_mode.attr,
        &dev_attr_extended_linear_cache_size.attr,
+       &dev_attr_private_type.attr,
        NULL,
 };
@@ -2400,6 +2402,9 @@ static void cxl_region_release(struct device *dev)
        struct cxl_region *cxlr = to_cxl_region(dev);
        int id = atomic_read(&cxlrd->region_id);
 
+       /* Ensure private region is cleaned up if not already done */
+       cxl_unregister_private_region(cxlr);
+
        /*
         * Try to reuse the recently idled id rather than the cached
         * next id to prevent the region id space from increasing
@@ -2429,7 +2434,7 @@ bool is_cxl_region(struct device *dev)
 }
 EXPORT_SYMBOL_NS_GPL(is_cxl_region, "CXL");
 
-static struct cxl_region *to_cxl_region(struct device *dev)
+struct cxl_region *to_cxl_region(struct device *dev)
 {
        if (dev_WARN_ONCE(dev, dev->type != &cxl_region_type,
                          "not a cxl_region device\n"))
@@ -2638,6 +2643,13 @@ static ssize_t create_ram_region_show(struct device *dev,
        return __create_region_show(to_cxl_root_decoder(dev), buf);
 }
 
+static ssize_t create_private_region_show(struct device *dev,
+                                         struct device_attribute *attr,
+                                         char *buf)
+{
+       return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
 static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
                                          enum cxl_partition_mode mode, int id)
 {
@@ -2698,6 +2710,28 @@ static ssize_t create_ram_region_store(struct device *dev,
 }
 DEVICE_ATTR_RW(create_ram_region);
 
+static ssize_t create_private_region_store(struct device *dev,
+                                          struct device_attribute *attr,
+                                          const char *buf, size_t len)
+{
+       struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
+       struct cxl_region *cxlr;
+       int rc, id;
+
+       rc = sscanf(buf, "region%d\n", &id);
+       if (rc != 1)
+               return -EINVAL;
+
+       cxlr = __create_region(cxlrd, CXL_PARTMODE_RAM, id);
+       if (IS_ERR(cxlr))
+               return PTR_ERR(cxlr);
+
+       cxlr->private = true;
+
+       return len;
+}
+DEVICE_ATTR_RW(create_private_region);
+
 static ssize_t region_show(struct device *dev, struct device_attribute *attr,
                           char *buf)
 {
@@ -3431,7 +3465,7 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
        device_unregister(&cxlr_dax->dev);
 }
 
-static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
+int devm_cxl_add_dax_region(struct cxl_region *cxlr)
 {
        struct cxl_dax_region *cxlr_dax;
        struct device *dev;
@@ -3974,6 +4008,13 @@ static int cxl_region_probe(struct device *dev)
                                      p->res->start, p->res->end, cxlr,
                                      is_system_ram) > 0)
                        return 0;
+
+
+               if (cxlr->private) {
+                       rc = cxl_register_private_region(cxlr);
+                       if (rc)
+                               return rc;
+               }
                return devm_cxl_add_dax_region(cxlr);
        default:
                dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index ba17fa86d249..b276956ff88d 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -525,6 +525,20 @@ enum cxl_partition_mode {
  */
 #define CXL_REGION_F_LOCK 2
 
+/*
+ * Indicate that this region has been registered as a private region.
+ * Used to track lifecycle and prevent double-unregistration.
+ */ +#define CXL_REGION_F_PRIVATE_REGISTERED 3 + +/** + * enum cxl_private_region_type - CXL private region types + * @CXL_PRIVATE_NONE: No private region type set + */ +enum cxl_private_region_type { + CXL_PRIVATE_NONE, +}; + /** * struct cxl_region - CXL region * @dev: This region's device @@ -534,10 +548,13 @@ enum cxl_partition_mode { * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge * @flags: Region state flags + * @private: Region is private (not exposed to system memory) * @params: active + config params for the region * @coord: QoS access coordinates for the region * @node_notifier: notifier for setting the access coordinates to node * @adist_notifier: notifier for calculating the abstract distance of node + * @private_type: CXL private region type for dispatch (set via sysfs) + * @private_ops: private node operations for callbacks (if mode is PRIVATE) */ struct cxl_region { struct device dev; @@ -547,10 +564,13 @@ struct cxl_region { struct cxl_nvdimm_bridge *cxl_nvb; struct cxl_pmem_region *cxlr_pmem; unsigned long flags; + bool private; struct cxl_region_params params; struct access_coordinate coord[ACCESS_COORDINATE_MAX]; struct notifier_block node_notifier; struct notifier_block adist_notifier; + enum cxl_private_region_type private_type; + struct private_node_ops private_ops; }; =20 struct cxl_nvdimm_bridge { --=20 2.52.0 From nobody Sun Feb 8 17:13:32 2026 Received: from mail-qt1-f178.google.com (mail-qt1-f178.google.com [209.85.160.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F265C3328F7 for ; Thu, 8 Jan 2026 20:39:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.160.178 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767904756; cv=none; 
From: Gregory Price
To: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-cxl@vger.kernel.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, kernel-team@meta.com, longman@redhat.com,
	tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, corbet@lwn.net,
	gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
	dave@stgolabs.net, jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com, ira.weiny@intel.com,
	dan.j.williams@intel.com, akpm@linux-foundation.org, vbabka@suse.cz,
	surenb@google.com, mhocko@suse.com, jackmanb@google.com, ziy@nvidia.com,
	david@kernel.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	rppt@kernel.org, axelrasmussen@google.com, yuanchu@google.com,
	weixugc@google.com, yury.norov@gmail.com, linux@rasmusvillemoes.dk,
	rientjes@google.com, shakeel.butt@linux.dev, chrisl@kernel.org,
	kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
	bhe@redhat.com, baohua@kernel.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, roman.gushchin@linux.dev, muchun.song@linux.dev,
	osalvador@suse.de, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	ying.huang@linux.alibaba.com, apopple@nvidia.com, cl@gentwo.org,
	harry.yoo@oracle.com, zhengqi.arch@bytedance.com
Subject: [RFC PATCH v3 7/8] mm/zswap: compressed ram direct integration
Date: Thu, 8 Jan 2026 15:37:54 -0500
Message-ID: <20260108203755.1163107-8-gourry@gourry.net>
In-Reply-To: <20260108203755.1163107-1-gourry@gourry.net>
References: <20260108203755.1163107-1-gourry@gourry.net>
MIME-Version: 1.0
If a private zswap node is available, skip the software compression path
entirely: memcpy directly into a compressed-memory folio, and store the
newly allocated compressed-memory page as the zswap entry->handle.

On decompress, do the opposite: copy directly from the stored page to the
destination, and free the compressed-memory page.

The driver callback is responsible for preventing runaway compression-ratio
failures by checking that each allocated page is safe to use (i.e. that a
compression ratio limit has not been crossed).

Signed-off-by: Gregory Price
---
 include/linux/zswap.h |   5 ++
 mm/zswap.c            | 106 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..4b52fe447e7e 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -35,6 +35,8 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_folio_swapin(struct folio *folio);
 bool zswap_is_enabled(void);
 bool zswap_never_enabled(void);
+void zswap_add_direct_node(int nid);
+void zswap_remove_direct_node(int nid);
 #else
 
 struct zswap_lruvec_state {};
@@ -69,6 +71,9 @@ static inline bool zswap_never_enabled(void)
 	return true;
 }
 
+static inline void zswap_add_direct_node(int nid) {}
+static inline void zswap_remove_direct_node(int nid) {}
+
 #endif
 
 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/zswap.c b/mm/zswap.c
index de8858ff1521..aada588c957e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include
 
 #include "swap.h"
 #include "internal.h"
@@ -190,6 +191,7 @@ struct zswap_entry {
 	swp_entry_t swpentry;
 	unsigned int length;
 	bool referenced;
+	bool direct;
 	struct zswap_pool *pool;
 	unsigned long handle;
 	struct obj_cgroup *objcg;
@@ -199,6 +201,20 @@ struct zswap_entry {
 static struct xarray *zswap_trees[MAX_SWAPFILES];
 static unsigned int nr_zswap_trees[MAX_SWAPFILES];
 
+/* Nodemask for compressed RAM nodes used by zswap_compress_direct */
+static nodemask_t zswap_direct_nodes = NODE_MASK_NONE;
+
+void zswap_add_direct_node(int nid)
+{
+	node_set(nid, zswap_direct_nodes);
+}
+
+void zswap_remove_direct_node(int nid)
+{
+	if (!node_online(nid))
+		node_clear(nid, zswap_direct_nodes);
+}
+
 /* RCU-protected iteration */
 static LIST_HEAD(zswap_pools);
 /* protects zswap_pools list modification */
@@ -716,7 +732,13 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
 static void zswap_entry_free(struct zswap_entry *entry)
 {
 	zswap_lru_del(&zswap_list_lru, entry);
-	zs_free(entry->pool->zs_pool, entry->handle);
+	if (entry->direct) {
+		struct page *page = (struct page *)entry->handle;
+
+		node_private_freed(page);
+		__free_page(page);
+	} else
+		zs_free(entry->pool->zs_pool, entry->handle);
 	zswap_pool_put(entry->pool);
 	if (entry->objcg) {
 		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
@@ -849,6 +871,58 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
 	mutex_unlock(&acomp_ctx->mutex);
 }
 
+static struct page *zswap_compress_direct(struct page *src,
+					  struct zswap_entry *entry)
+{
+	int nid;
+	struct page *dst;
+	gfp_t gfp;
+	nodemask_t tried_nodes = NODE_MASK_NONE;
+
+	if (nodes_empty(zswap_direct_nodes))
+		return NULL;
+
+	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE |
+	      __GFP_THISNODE;
+
+	for_each_node_mask(nid, zswap_direct_nodes) {
+		int ret;
+
+		/* Skip nodes we've already tried and failed */
+		if (node_isset(nid, tried_nodes))
+			continue;
+
+		dst = __alloc_pages(gfp, 0, nid, &zswap_direct_nodes);
+		if (!dst)
+			continue;
+
+		/*
+		 * Check with the device driver that this page is safe to use.
+		 * If the device reports an error (e.g., compression ratio is
+		 * too low and the page can't safely store data), free the page
+		 * and try another node.
+		 */
+		ret = node_private_allocated(dst);
+		if (ret) {
+			__free_page(dst);
+			node_set(nid, tried_nodes);
+			continue;
+		}
+
+		goto found;
+	}
+
+	return NULL;
+
+found:
+	/* If we fail to copy at this point just fallback */
+	if (copy_mc_highpage(dst, src)) {
+		__free_page(dst);
+		dst = NULL;
+	}
+	return dst;
+}
+
 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 			   struct zswap_pool *pool)
 {
@@ -860,6 +934,17 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	gfp_t gfp;
 	u8 *dst;
 	bool mapped = false;
+	struct page *zpage;
+
+	/* Try to shunt directly to compressed ram */
+	zpage = zswap_compress_direct(page, entry);
+	if (zpage) {
+		entry->handle = (unsigned long)zpage;
+		entry->length = PAGE_SIZE;
+		entry->direct = true;
+		return true;
+	}
+	/* otherwise fallback to normal zswap */
 
 	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
 	dst = acomp_ctx->buffer;
@@ -913,6 +998,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	zs_obj_write(pool->zs_pool, handle, dst, dlen);
 	entry->handle = handle;
 	entry->length = dlen;
+	entry->direct = false;
 
 unlock:
 	if (mapped)
@@ -936,6 +1022,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	int decomp_ret = 0, dlen = PAGE_SIZE;
 	u8 *src, *obj;
 
+	/* compressed ram page */
+	if (entry->direct) {
+		struct page *src = (struct page *)entry->handle;
+		struct folio *zfolio = page_folio(src);
+
+		memcpy_folio(folio, 0, zfolio, 0, PAGE_SIZE);
+		goto direct_done;
+	}
+
 	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
 	obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
 
@@ -969,6 +1064,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	zs_obj_read_end(pool->zs_pool, entry->handle, obj);
 	acomp_ctx_put_unlock(acomp_ctx);
 
+direct_done:
 	if (!decomp_ret && dlen == PAGE_SIZE)
 		return true;
 
@@ -1483,7 +1579,13 @@ static bool zswap_store_page(struct page *page,
 	return true;
 
 store_failed:
-	zs_free(pool->zs_pool, entry->handle);
+	if (entry->direct) {
+		struct page *freepage = (struct page *)entry->handle;
+
+		node_private_freed(freepage);
+		__free_page(freepage);
+	} else
+		zs_free(pool->zs_pool, entry->handle);
 compress_failed:
 	zswap_entry_cache_free(entry);
 	return false;
-- 
2.52.0

From nobody Sun Feb 8 17:13:32 2026
From: Gregory Price
To: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-cxl@vger.kernel.org
Subject: [RFC PATCH v3 8/8] drivers/cxl: add zswap private_region type
Date: Thu, 8 Jan 2026 15:37:55 -0500
Message-ID: <20260108203755.1163107-9-gourry@gourry.net>
In-Reply-To: <20260108203755.1163107-1-gourry@gourry.net>
References: <20260108203755.1163107-1-gourry@gourry.net>
MIME-Version: 1.0

Add a sample zswap private region type, which registers its node with
mm/zswap as a valid compression target. Zswap calls back into the driver
on each page allocation and free.

On cxl_zswap_page_allocated(), the driver would check whether the worst
case vs. current compression ratio makes it safe to allow new writes.
On cxl_zswap_page_freed(), the page is zeroed to adjust the ratio back
down.

A device driver registering a zswap private region would need to tell
this component whether to allow new allocations - most likely via an
interrupt setting a bit which says the compression ratio has reached
some conservative threshold.
Signed-off-by: Gregory Price
---
 drivers/cxl/core/private_region/Makefile      |   3 +
 .../cxl/core/private_region/private_region.c  |  10 ++
 .../cxl/core/private_region/private_region.h  |   4 +
 drivers/cxl/core/private_region/zswap.c       | 127 ++++++++++++++++++
 drivers/cxl/cxl.h                             |   2 +
 5 files changed, 146 insertions(+)
 create mode 100644 drivers/cxl/core/private_region/zswap.c

diff --git a/drivers/cxl/core/private_region/Makefile b/drivers/cxl/core/private_region/Makefile
index d17498129ba6..ba495cd3f89f 100644
--- a/drivers/cxl/core/private_region/Makefile
+++ b/drivers/cxl/core/private_region/Makefile
@@ -7,3 +7,6 @@ ccflags-y += -I$(srctree)/drivers/cxl
 
 # Core dispatch and sysfs
 obj-$(CONFIG_CXL_REGION) += private_region.o
+
+# Type-specific implementations
+obj-$(CONFIG_CXL_REGION) += zswap.o
diff --git a/drivers/cxl/core/private_region/private_region.c b/drivers/cxl/core/private_region/private_region.c
index ead48abb9fc7..da5fb3d264e1 100644
--- a/drivers/cxl/core/private_region/private_region.c
+++ b/drivers/cxl/core/private_region/private_region.c
@@ -16,6 +16,8 @@
 static const char *private_type_to_string(enum cxl_private_region_type type)
 {
 	switch (type) {
+	case CXL_PRIVATE_ZSWAP:
+		return "zswap";
 	default:
 		return "";
 	}
@@ -23,6 +25,8 @@ static const char *private_type_to_string(enum cxl_private_region_type type)
 
 static enum cxl_private_region_type string_to_private_type(const char *str)
 {
+	if (sysfs_streq(str, "zswap"))
+		return CXL_PRIVATE_ZSWAP;
 	return CXL_PRIVATE_NONE;
 }
 
@@ -88,6 +92,9 @@ int cxl_register_private_region(struct cxl_region *cxlr)
 
 	/* Call type-specific registration which sets memtype and callbacks */
 	switch (cxlr->private_type) {
+	case CXL_PRIVATE_ZSWAP:
+		rc = cxl_register_zswap_region(cxlr);
+		break;
 	default:
 		dev_dbg(&cxlr->dev, "unsupported private_type: %d\n",
 			cxlr->private_type);
@@ -113,6 +120,9 @@ void cxl_unregister_private_region(struct cxl_region *cxlr)
 
 	/* Dispatch to type-specific cleanup */
 	switch (cxlr->private_type) {
+	case CXL_PRIVATE_ZSWAP:
+		cxl_unregister_zswap_region(cxlr);
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/cxl/core/private_region/private_region.h b/drivers/cxl/core/private_region/private_region.h
index 9b34e51d8df4..84d43238dbe1 100644
--- a/drivers/cxl/core/private_region/private_region.h
+++ b/drivers/cxl/core/private_region/private_region.h
@@ -7,4 +7,8 @@ struct cxl_region;
 int cxl_register_private_region(struct cxl_region *cxlr);
 void cxl_unregister_private_region(struct cxl_region *cxlr);
 
+/* Type-specific registration functions - called from region.c dispatch */
+int cxl_register_zswap_region(struct cxl_region *cxlr);
+void cxl_unregister_zswap_region(struct cxl_region *cxlr);
+
 #endif /* __CXL_PRIVATE_REGION_H__ */
diff --git a/drivers/cxl/core/private_region/zswap.c b/drivers/cxl/core/private_region/zswap.c
new file mode 100644
index 000000000000..c213abe2fad7
--- /dev/null
+++ b/drivers/cxl/core/private_region/zswap.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CXL Private Region - zswap type implementation
+ *
+ * This file implements the zswap private region type for CXL devices.
+ * It handles registration/unregistration of CXL regions as zswap
+ * compressed memory targets.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include "../../cxl.h"
+#include "../core.h"
+#include "private_region.h"
+
+/*
+ * CXL zswap region page_allocated callback
+ *
+ * This callback is invoked by zswap when a page is allocated from a private
+ * node to validate that the page is safe to use. For a real compressed memory
+ * device, this would check the device's compression ratio and return an error
+ * if the page cannot safely store data.
+ *
+ * Currently this is a placeholder that always succeeds. A real implementation
+ * would query the device hardware to determine if sufficient compression
+ * headroom exists.
+ */
+static int cxl_zswap_page_allocated(struct page *page, void *data)
+{
+	struct cxl_region *cxlr = data;
+
+	/*
+	 * TODO: Query the CXL device to check if this page allocation is safe.
+	 *
+	 * A real compressed memory device would track its compression ratio
+	 * and report whether it has headroom to accept new data. If the
+	 * compression ratio is too low (device is near capacity), this should
+	 * return -ENOSPC to tell zswap to try another node.
+	 *
+	 * For now, always succeed since we're testing with regular memory.
+	 */
+	dev_dbg(&cxlr->dev, "page_allocated callback for nid %d\n",
+		page_to_nid(page));
+
+	return 0;
+}
+
+/*
+ * CXL zswap region page_freed callback
+ *
+ * This callback is invoked when a page from a private node is being freed.
+ * We zero the page before returning it to the allocator so that the compressed
+ * memory device can reclaim capacity - zeroed pages achieve excellent
+ * compression ratios.
+ */
+static void cxl_zswap_page_freed(struct page *page, void *data)
+{
+	struct cxl_region *cxlr = data;
+
+	/*
+	 * Zero the page to improve the device's compression ratio.
+	 * Zeroed pages compress extremely well, reclaiming device capacity.
+	 */
+	clear_highpage(page);
+
+	dev_dbg(&cxlr->dev, "page_freed callback for nid %d\n",
+		page_to_nid(page));
+}
+
+/*
+ * Unregister a zswap region from the zswap subsystem.
+ *
+ * This function removes the node from zswap direct nodes and unregisters
+ * the private node operations.
+ */
+void cxl_unregister_zswap_region(struct cxl_region *cxlr)
+{
+	int nid;
+
+	if (!cxlr->private ||
+	    cxlr->private_ops.memtype != NODE_MEM_ZSWAP)
+		return;
+
+	if (!cxlr->params.res)
+		return;
+
+	nid = phys_to_target_node(cxlr->params.res->start);
+
+	zswap_remove_direct_node(nid);
+	node_unregister_private(nid, &cxlr->private_ops);
+
+	dev_dbg(&cxlr->dev, "unregistered zswap region for nid %d\n", nid);
+}
+
+/*
+ * Register a zswap region with the zswap subsystem.
+ *
+ * This function sets up the memtype, page_allocated callback, and
+ * registers the node with zswap as a direct compression target.
+ * The caller is responsible for adding the dax region after this succeeds.
+ */
+int cxl_register_zswap_region(struct cxl_region *cxlr)
+{
+	int nid, rc;
+
+	if (!cxlr->private || !cxlr->params.res)
+		return -EINVAL;
+
+	nid = phys_to_target_node(cxlr->params.res->start);
+
+	/* Register with node subsystem as zswap memory */
+	cxlr->private_ops.memtype = NODE_MEM_ZSWAP;
+	cxlr->private_ops.page_allocated = cxl_zswap_page_allocated;
+	cxlr->private_ops.page_freed = cxl_zswap_page_freed;
+	rc = node_register_private(nid, &cxlr->private_ops);
+	if (rc)
+		return rc;
+
+	/* Register this node with zswap as a direct compression target */
+	zswap_add_direct_node(nid);
+
+	dev_dbg(&cxlr->dev, "registered zswap region for nid %d\n", nid);
+	return 0;
+}
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b276956ff88d..89d8ae4e796c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -534,9 +534,11 @@ enum cxl_partition_mode {
 /**
  * enum cxl_private_region_type - CXL private region types
  * @CXL_PRIVATE_NONE: No private region type set
+ * @CXL_PRIVATE_ZSWAP: Region used for zswap compressed memory
  */
 enum cxl_private_region_type {
 	CXL_PRIVATE_NONE,
+	CXL_PRIVATE_ZSWAP,
 };
 
 /**
-- 
2.52.0
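
[Editorial note appended after the series: the admission policy that the
page_allocated TODO describes can be modeled in a few lines of plain C.
This is a toy sketch, not driver code - `toy_zswap_dev`,
`toy_page_allocated`, and the worst-case byte accounting are all
hypothetical stand-ins for state a real device would report via hardware.]

```c
#include <errno.h>

#define ZPAGE_SIZE 4096UL

/*
 * Toy model of the admission check: the device tracks how many media
 * bytes it has promised, assuming every outstanding page could turn out
 * to be incompressible. All names here are illustrative.
 */
struct toy_zswap_dev {
	unsigned long phys_capacity;	/* media bytes behind the region */
	unsigned long bytes_committed;	/* worst-case bytes already promised */
};

/*
 * Admit a new page only if, even when it compresses not at all, the
 * device still fits everything it has already promised to store.
 */
static int toy_page_allocated(struct toy_zswap_dev *dev)
{
	if (dev->bytes_committed + ZPAGE_SIZE > dev->phys_capacity)
		return -ENOSPC;	/* zswap would move on to the next node */

	dev->bytes_committed += ZPAGE_SIZE;
	return 0;
}

/* Freed pages are zeroed by the driver, so their commitment is returned. */
static void toy_page_freed(struct toy_zswap_dev *dev)
{
	dev->bytes_committed -= ZPAGE_SIZE;
}
```

A real callback would replace the byte counter with a ratio threshold
reported by the device, but the control flow - succeed or return `-ENOSPC`
so zswap falls back to software compression - is the same shape as the
`cxl_zswap_page_allocated()` placeholder above.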