From: Gregory Price
To: linux-mm@kvack.org, vishal.l.verma@intel.com, dave.jiang@intel.com,
    akpm@linux-foundation.org, david@kernel.org, osalvador@suse.de
Cc: dan.j.williams@intel.com, ljs@kernel.org, Liam.Howlett@oracle.com,
    vbabka@kernel.org, rppt@kernel.org, surenb@google.com, mhocko@suse.com,
    linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
    linux-cxl@vger.kernel.org, kernel-team@meta.com, Hannes Reinecke
Subject: [PATCH 8/8] dax/kmem: add sysfs interface for atomic whole-device hotplug
Date: Sat, 21 Mar 2026 11:04:04 -0400
Message-ID: <20260321150404.3288786-9-gourry@gourry.net>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260321150404.3288786-1-gourry@gourry.net>
References: <20260321150404.3288786-1-gourry@gourry.net>

The dax kmem driver currently onlines memory automatically during probe
using the system's default online policy, but provides no way to control
or query the state of the entire region at runtime.

Additionally, there is no atomic mechanism to offline and remove the
entire set of memory blocks together. Instead, this is presently done
in two steps (offline all, then remove all). This creates a race
condition where external entities can operate directly on the blocks
between the two steps and cause hot-unplug to fail.

Add a new 'hotplug' sysfs attribute that allows userspace to control
and query the state of the entire memory region.

The interface accepts the following values (the unplugged state is
requested by writing "unplug" and is reported as "unplugged"):
  - "unplug": memory is offline and blocks are not present
  - "online": memory is online as normal system RAM
  - "online_movable": memory is online in ZONE_MOVABLE

Valid transitions:
  - unplugged -> online
  - unplugged -> online_movable
  - online -> unplugged
  - online_movable -> unplugged

"offline" (memory blocks exist but are offline by default) is not
supported because it is functionally equivalent to "unplugged" and
invites races between offlining and unplugging.

Probe currently checks whether online_type matches
mhp_get_default_online_type() and, if so, calls dax_kmem_do_hotplug().
This creates the memory blocks even though the device should start in
the unplugged state. The behavior preserves backward compatibility for
existing userland tools that expect the memory blocks to be present
after kmem probe, and it can be deprecated over time.

As with any hot-remove mechanism, the removal can fail, and if rollback
also fails the system can be left in an inconsistent state.

Unbind Note:
  We used to call remove_memory() during unbind, which would fire a
  BUG() if any of the memory blocks were online at that time. We lift
  this into a WARN in the cleanup routine and don't attempt hotremove
  if ->state is neither DAX_KMEM_UNPLUGGED nor MMOP_OFFLINE.

  The resources are still leaked, but this prevents deadlock on unbind
  if a memory region happens to be impossible to hotremove.

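The transition rules above can be modeled outside the kernel. The
following is a minimal userspace sketch, not the driver code: the enum
kmem_state and the helpers kmem_parse() and kmem_transition_ok() are
hypothetical names introduced only for illustration. It assumes the
input string carries no trailing newline, whereas the driver uses
sysfs_streq(), which tolerates one.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the states named in the commit message. */
enum kmem_state {
	KMEM_UNPLUGGED,
	KMEM_ONLINE,
	KMEM_ONLINE_MOVABLE,
	KMEM_INVALID,
};

/* Map a value written to 'hotplug' onto a state ("unplug" -> unplugged). */
static enum kmem_state kmem_parse(const char *s)
{
	assert(s != NULL);
	if (strcmp(s, "unplug") == 0)
		return KMEM_UNPLUGGED;
	if (strcmp(s, "online") == 0)
		return KMEM_ONLINE;
	if (strcmp(s, "online_movable") == 0)
		return KMEM_ONLINE_MOVABLE;
	return KMEM_INVALID;	/* includes "offline", which is not supported */
}

/*
 * A request is acceptable if it is a no-op (already in that state) or if
 * one side of the transition is the unplugged state. Switching directly
 * between the two online types is rejected.
 */
static bool kmem_transition_ok(enum kmem_state from, enum kmem_state to)
{
	if (from == KMEM_INVALID || to == KMEM_INVALID)
		return false;
	if (from == to)
		return true;
	return from == KMEM_UNPLUGGED || to == KMEM_UNPLUGGED;
}
```

In this model, kmem_transition_ok(KMEM_ONLINE, KMEM_ONLINE_MOVABLE)
is false, which corresponds to the case the driver rejects with -EBUSY.
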
Suggested-by: Hannes Reinecke
Suggested-by: David Hildenbrand
Signed-off-by: Gregory Price
---
 Documentation/ABI/testing/sysfs-bus-dax |  17 +++
 drivers/dax/kmem.c                      | 164 +++++++++++++++++++++---
 2 files changed, 161 insertions(+), 20 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
index b34266bfae49..faf6f63a368c 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -151,3 +151,20 @@ Description:
 		memmap_on_memory parameter for memory_hotplug. This is
 		typically set on the kernel command line -
 		memory_hotplug.memmap_on_memory set to 'true' or 'force'."
+
+What:		/sys/bus/dax/devices/daxX.Y/hotplug
+Date:		January, 2026
+KernelVersion:	v6.21
+Contact:	nvdimm@lists.linux.dev
+Description:
+		(RW) Controls the hotplug state of the memory region.
+		Applies to all memory blocks associated with the device.
+		Only applies to dax_kmem devices.
+
+		States: [unplugged, online, online_movable]
+		Arguments:
+		  "unplug": memory is offline and blocks are not present
+		  "online": memory is online as normal system RAM
+		  "online_movable": memory is online in ZONE_MOVABLE
+
+		A device must be unplugged before it can be onlined into a different state.
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 8be9286f0ea3..5dbd5b7862fd 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -40,10 +40,16 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
 	return 0;
 }
 
+#define DAX_KMEM_UNPLUGGED	(-1)
+
 struct dax_kmem_data {
 	const char *res_name;
 	int mgid;
 	struct memory_dev_type *mtype;
+	int numa_node;
+	struct dev_dax *dev_dax;
+	int state;
+	struct mutex lock; /* protects hotplug state transitions */
 	struct resource *res[];
 };
 
@@ -51,8 +57,10 @@ struct dax_kmem_data {
  * dax_kmem_do_hotplug - hotplug memory for dax kmem device
  * @dev_dax: the dev_dax instance
  * @data: the dax_kmem_data structure with resource tracking
+ * @online_type: MMOP_ONLINE or MMOP_ONLINE_MOVABLE
  *
- * Hotplugs all ranges in the dev_dax region as system memory.
+ * Hotplugs all ranges in the dev_dax region as system memory using
+ * the specified online type.
  *
  * Returns the number of successfully mapped ranges, or negative error.
  */
@@ -64,6 +72,12 @@ static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
 	int i, rc, onlined = 0;
 	mhp_t mhp_flags;
 
+	if (data->state == MMOP_ONLINE || data->state == MMOP_ONLINE_MOVABLE)
+		return -EINVAL;
+
+	if (online_type != MMOP_ONLINE && online_type != MMOP_ONLINE_MOVABLE)
+		return -EINVAL;
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range range;
 
@@ -156,9 +170,9 @@ static int dax_kmem_init_resources(struct dev_dax *dev_dax,
  * @dev_dax: the dev_dax instance
  * @data: the dax_kmem_data structure with resource tracking
  *
- * Removes all ranges in the dev_dax region.
+ * Offlines and removes all ranges in the dev_dax region.
  *
- * Returns the number of successfully removed ranges.
+ * Returns the number of successfully removed ranges, or negative error.
  */
 static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
 				 struct dax_kmem_data *data)
@@ -178,7 +192,7 @@ static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
 		if (!data->res[i])
 			continue;
 
-		rc = remove_memory(range.start, range_len(&range));
+		rc = offline_and_remove_memory(range.start, range_len(&range));
 		if (rc == 0) {
 			/* Release the resource for the successfully removed range */
 			remove_resource(data->res[i]);
@@ -214,6 +228,20 @@ static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
 {
 	int i;
 
+	/*
+	 * If the device unbind occurs before memory is hotremoved, we can never
+	 * remove the memory (requires reboot). Attempting an offline operation
+	 * here may cause deadlock and a failure to finish the unbind.
+	 *
+	 * This WARN used to be a BUG called by remove_memory().
+	 *
+	 * Note: This leaks the resources.
+	 */
+	if (WARN(((data->state != DAX_KMEM_UNPLUGGED) &&
+		  (data->state != MMOP_OFFLINE)),
+		 "Hotplug memory regions stuck online until reboot"))
+		return;
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		if (!data->res[i])
 			continue;
@@ -223,6 +251,98 @@ static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
 	}
 }
 
+static int dax_kmem_parse_state(const char *buf)
+{
+	if (sysfs_streq(buf, "unplug"))
+		return DAX_KMEM_UNPLUGGED;
+	if (sysfs_streq(buf, "online"))
+		return MMOP_ONLINE;
+	if (sysfs_streq(buf, "online_movable"))
+		return MMOP_ONLINE_MOVABLE;
+	return -EINVAL;
+}
+
+static ssize_t hotplug_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct dax_kmem_data *data = dev_get_drvdata(dev);
+	const char *state_str;
+
+	if (!data)
+		return -ENXIO;
+
+	switch (data->state) {
+	case DAX_KMEM_UNPLUGGED:
+		state_str = "unplugged";
+		break;
+	case MMOP_OFFLINE:
+		state_str = "offline";
+		break;
+	case MMOP_ONLINE:
+		state_str = "online";
+		break;
+	case MMOP_ONLINE_MOVABLE:
+		state_str = "online_movable";
+		break;
+	default:
+		state_str = "unknown";
+		break;
+	}
+
+	return sysfs_emit(buf, "%s\n", state_str);
+}
+
+static ssize_t hotplug_store(struct device *dev, struct device_attribute *attr,
+			     const char *buf, size_t len)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct dax_kmem_data *data = dev_get_drvdata(dev);
+	int online_type;
+	int rc;
+
+	if (!data)
+		return -ENXIO;
+
+	online_type = dax_kmem_parse_state(buf);
+	if (online_type < DAX_KMEM_UNPLUGGED)
+		return online_type;
+
+	guard(mutex)(&data->lock);
+
+	/* Already in requested state */
+	if (data->state == online_type)
+		return len;
+
+	if (online_type == DAX_KMEM_UNPLUGGED) {
+		rc = dax_kmem_do_hotremove(dev_dax, data);
+		if (rc < 0) {
+			dev_warn(dev, "hotplug state is inconsistent\n");
+			return rc;
+		}
+		if (rc < dev_dax->nr_range)
+			dev_warn(dev, "partial hotremove: %d of %d ranges removed\n",
+				 rc, dev_dax->nr_range);
+		else
+			data->state = DAX_KMEM_UNPLUGGED;
+		return len;
+	}
+
+	/*
+	 * online_type is MMOP_ONLINE or MMOP_ONLINE_MOVABLE
+	 * Cannot switch between online types without unplugging first
+	 */
+	if (data->state == MMOP_ONLINE || data->state == MMOP_ONLINE_MOVABLE)
+		return -EBUSY;
+
+	rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
+	if (rc < 0)
+		return rc;
+
+	data->state = online_type;
+	return len;
+}
+static DEVICE_ATTR_RW(hotplug);
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
@@ -291,6 +411,10 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		goto err_reg_mgid;
 	data->mgid = rc;
 	data->mtype = mtype;
+	data->numa_node = numa_node;
+	data->dev_dax = dev_dax;
+	data->state = DAX_KMEM_UNPLUGGED;
+	mutex_init(&data->lock);
 
 	dev_set_drvdata(dev, data);
 
@@ -301,9 +425,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	/*
 	 * Hotplug using the configured online type for this device.
 	 */
-	rc = dax_kmem_do_hotplug(dev_dax, data, dev_dax->online_type);
-	if (rc < 0)
-		goto err_hotplug;
+	if (dev_dax->online_type != MMOP_OFFLINE ||
+	    dev_dax->online_type == mhp_get_default_online_type()) {
+		rc = dax_kmem_do_hotplug(dev_dax, data, dev_dax->online_type);
+		if (rc < 0)
+			goto err_hotplug;
+		data->state = dev_dax->online_type;
+	}
+
+	rc = device_create_file(dev, &dev_attr_hotplug);
+	if (rc)
+		dev_warn(dev, "failed to create hotplug sysfs entry\n");
 
 	return 0;
 
@@ -324,23 +456,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
-	int success;
 	int node = dev_dax->target_node;
 	struct device *dev = &dev_dax->dev;
 	struct dax_kmem_data *data = dev_get_drvdata(dev);
 
-	/*
-	 * We have one shot for removing memory, if some memory blocks were not
-	 * offline prior to calling this function remove_memory() will fail, and
-	 * there is no way to hotremove this memory until reboot because device
-	 * unbind will succeed even if we return failure.
-	 */
-	success = dax_kmem_do_hotremove(dev_dax, data);
-	if (success < dev_dax->nr_range) {
-		dev_err(dev, "Hotplug regions stuck online until reboot\n");
-		return;
-	}
-
+	device_remove_file(dev, &dev_attr_hotplug);
 	dax_kmem_cleanup_resources(dev_dax, data);
 	memory_group_unregister(data->mgid);
 	kfree(data->res_name);
@@ -358,6 +478,10 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 #else
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
+	struct device *dev = &dev_dax->dev;
+
+	device_remove_file(dev, &dev_attr_hotplug);
+
 	/*
 	 * Without hotremove purposely leak the request_mem_region() for the
 	 * device-dax range and return '0' to ->remove() attempts. The removal
-- 
2.53.0
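
For readers who want to exercise the 'hotplug' attribute from userspace,
the following is a minimal sketch and not part of the patch. The helpers
attr_write(), attr_read() and kmem_reonline() are hypothetical names,
and the caller supplies the attribute path (for example a concrete
instance of /sys/bus/dax/devices/daxX.Y/hotplug). The sketch follows the
rule that a device must pass through the unplugged state before it is
onlined into a different state.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write a state string to an attribute file. Returns 0 on success, -1 on error. */
static int attr_write(const char *path, const char *state)
{
	FILE *f;
	int rc = 0;

	assert(path != NULL && state != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(state, f) == EOF)
		rc = -1;
	/* Buffered writes are flushed here, so a failed store shows up on close. */
	if (fclose(f) == EOF)
		rc = -1;
	return rc;
}

/* Read the current state into buf and strip the trailing newline. */
static int attr_read(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path != NULL && buf != NULL && len > 0);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Switch online type by unplugging first, then onlining into the target. */
static int kmem_reonline(const char *path, const char *target)
{
	if (attr_write(path, "unplug") != 0)
		return -1;
	return attr_write(path, target);
}
```

Reading the attribute back with attr_read() after a write is a
reasonable check, because the store path in this patch returns success
on a partial hotremove while leaving the reported state unchanged.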