[PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs

Gregory Price posted 8 patches 1 week, 6 days ago
Documentation/ABI/testing/sysfs-bus-dax |  17 +
drivers/dax/bus.c                       |   3 +
drivers/dax/bus.h                       |   2 +
drivers/dax/cxl.c                       |   1 +
drivers/dax/dax-private.h               |   3 +
drivers/dax/hmem/hmem.c                 |   1 +
drivers/dax/kmem.c                      | 457 ++++++++++++++++++------
include/linux/memory-tiers.h            |  34 +-
include/linux/memory.h                  |  22 ++
include/linux/memory_hotplug.h          |  32 ++
mm/memory-tiers.c                       |  29 +-
mm/memory_hotplug.c                     |  67 +++-
12 files changed, 501 insertions(+), 167 deletions(-)
[PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs
Posted by Gregory Price 1 week, 6 days ago
The dax kmem driver currently onlines memory during probe using the
system default policy, with no way to control or query the region state
at runtime - other than by inspecting the state of individual blocks.

Offlining and removing an entire region requires operating on individual
memory blocks, creating race conditions where external entities can
interfere between the offline and remove steps.

The problem was discussed specifically in the LPC2025 device memory
sessions - https://lpc.events/event/19/contributions/2016/ - where
it was discussed how the non-atomic interface for dax hotplug is causing
issues in some distributions which have competing userland controllers
that interfere with each other.

This series adds a sysfs "hotplug" attribute for atomic whole-device
hotplug control, along with the mm and dax plumbing to support it.

The first five patches prepare the mm and dax layers:

  1. Consolidate memory-tier type deduplication into mt_get_memory_type(),
     removing redundant per-driver infrastructure.
  2. Add a memory_block_align_range() helper for hotplug range alignment.
  3-5. Thread an explicit online_type through the memory hotplug and dax
     paths, allowing drivers to specify a preferred auto-online policy
     (ZONE_NORMAL vs ZONE_MOVABLE) instead of being forced to the
     system default.
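
As an illustration of what item 2 is about, here is a minimal userspace
sketch of shrinking a range inward to memory-block boundaries.  This is
not the helper from the series: the signature of
memory_block_align_range() is not shown in this thread, so the names
align_range_to_blocks() and struct byte_range are hypothetical, and the
block size is assumed to be a nonzero power of two.  Overflow at the top
of the address space is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive byte range (hypothetical type for this sketch). */
struct byte_range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/*
 * Shrink [in->start, in->end] inward so that it starts and ends on
 * block_size boundaries.  Returns false if no whole block fits.
 * Assumes block_size is a nonzero power of two.
 */
bool align_range_to_blocks(const struct byte_range *in, uint64_t block_size,
			   struct byte_range *out)
{
	uint64_t mask = block_size - 1;
	uint64_t start, end_excl;

	assert(block_size != 0 && (block_size & mask) == 0);

	start = (in->start + mask) & ~mask;	/* round start up */
	end_excl = (in->end + 1) & ~mask;	/* round exclusive end down */

	if (end_excl <= start)
		return false;

	out->start = start;
	out->end = end_excl - 1;
	return true;
}
```

Only the aligned part of a range can be added as memory blocks, so a
range smaller than one block yields nothing to hotplug.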

The last three patches build the dax/kmem feature:

  6. Plumb online_type through the dax device creation path.
  7. Extract hotplug/hotremove into helper functions to separate resource
     lifecycle from memory onlining.
  8. Add the "hotplug" sysfs attribute supporting three states:
     - "unplug": memory blocks removed
     - "online": online as normal system RAM
     - "online_movable": online in ZONE_MOVABLE

Transitions are atomic across all ranges in the device.  Backward
compatibility is preserved: probe still auto-onlines when the configured
policy matches the system default.
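
For illustration, a userspace tool might drive the new attribute roughly
as sketched below.  This is a sketch under assumptions, not code from
the series: the three state strings are the ones listed above, but the
enum, the function names, and the example path in the note after the
block are hypothetical, and the exact format the attribute returns on
read is not shown in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The three states named in the cover letter. */
enum dax_hotplug_state {
	DAX_HOTPLUG_UNPLUG,
	DAX_HOTPLUG_ONLINE,
	DAX_HOTPLUG_ONLINE_MOVABLE,
	DAX_HOTPLUG_NSTATES
};

static const char *const dax_hotplug_names[DAX_HOTPLUG_NSTATES] = {
	"unplug",
	"online",
	"online_movable",
};

/*
 * Map a state string to the enum.  Anything after the first newline is
 * ignored.  Returns -1 for an unknown string.
 */
int dax_hotplug_parse(const char *s)
{
	size_t len = strcspn(s, "\n");
	int i;

	for (i = 0; i < DAX_HOTPLUG_NSTATES; i++) {
		if (strlen(dax_hotplug_names[i]) == len &&
		    strncmp(s, dax_hotplug_names[i], len) == 0)
			return i;
	}
	return -1;
}

/*
 * Request a state with a single write to the attribute file at
 * attr_path.  Returns 0 on success, -1 on failure.
 */
int dax_hotplug_request(const char *attr_path, enum dax_hotplug_state st)
{
	FILE *f;
	int rc = 0;

	assert((unsigned int)st < DAX_HOTPLUG_NSTATES);

	f = fopen(attr_path, "w");
	if (!f)
		return -1;
	if (fputs(dax_hotplug_names[st], f) == EOF)
		rc = -1;
	if (fclose(f) == EOF)	/* write errors may only surface here */
		rc = -1;
	return rc;
}
```

A caller would pass something like
"/sys/bus/dax/devices/dax0.0/hotplug" as attr_path (assumed location;
see the ABI document added by the series for the real one).  The point
of the design is that one write covers every range in the device, so
there is no per-block loop for another agent to interleave with.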

Specific notes for maintainers:

I downgraded a BUG() to a WARN() when unbind is called while the dax
device is not in an UNPLUGGED state.  This is because the old pattern of
toggling individual memory blocks is still used by userland tools, and
will disconnect the `hotplug` value from the actual state of the overall
memory region.

Unless we move to deprecate per-block controls, we should just WARN()
instead of BUG() as an indicator that userland tools need to be updated
to use the new pattern (the old pattern is subject to race conditions).

The first two commits (memory-tier dedup and align_range helper) are
semi-unrelated cleanups that conflict with the changes made in the
refactoring commits.  These are intended to be used for future cxl region
extensions, but if you prefer them to be dropped or submitted separately
let me know.

This is technically v3, but the patch line has diverged considerably and
I've reworked the cover letter.  Apologies for the prior obtuseness.
Link: https://lore.kernel.org/all/20260114235022.3437787-1-gourry@gourry.net/

Gregory Price (8):
  mm/memory-tiers: consolidate memory type dedup into
    mt_get_memory_type()
  mm/memory: add memory_block_align_range() helper
  mm/memory_hotplug: pass online_type to online_memory_block() via arg
  mm/memory_hotplug: export mhp_get_default_online_type
  mm/memory_hotplug: add __add_memory_driver_managed() with online_type
    arg
  dax: plumb hotplug online_type through dax
  dax/kmem: extract hotplug/hotremove helper functions
  dax/kmem: add sysfs interface for atomic whole-device hotplug


-- 
2.53.0
Re: [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs
Posted by Andrew Morton 1 week, 6 days ago
On Sat, 21 Mar 2026 11:03:56 -0400 Gregory Price <gourry@gourry.net> wrote:

> The dax kmem driver currently onlines memory during probe using the
> system default policy, with no way to control or query the region state
> at runtime - other than by inspecting the state of individual blocks.
> 
> Offlining and removing an entire region requires operating on individual
> memory blocks, creating race conditions where external entities can
> interfere between the offline and remove steps.
> 
> The problem was discussed specifically in the LPC2025 device memory
> sessions - https://lpc.events/event/19/contributions/2016/ - where
> it was discussed how the non-atomic interface for dax hotplug is causing
> issues in some distributions which have competing userland controllers
> that interfere with each other.
> 
> This series adds a sysfs "hotplug" attribute for atomic whole-device
> hotplug control, along with the mm and dax plumbing to support it.

AI review (which hasn't completed at this time) has a lot to say:
	https://sashiko.dev/#/patchset/20260321150404.3288786-1-gourry@gourry.net
Re: [PATCH 0/8] dax/kmem: atomic whole-device hotplug via sysfs
Posted by Gregory Price 1 week, 6 days ago
On Sat, Mar 21, 2026 at 10:40:21AM -0700, Andrew Morton wrote:
> On Sat, 21 Mar 2026 11:03:56 -0400 Gregory Price <gourry@gourry.net> wrote:
> 
> > The dax kmem driver currently onlines memory during probe using the
> > system default policy, with no way to control or query the region state
> > at runtime - other than by inspecting the state of individual blocks.
> > 
> > Offlining and removing an entire region requires operating on individual
> > memory blocks, creating race conditions where external entities can
> > interfere between the offline and remove steps.
> > 
> > The problem was discussed specifically in the LPC2025 device memory
> > sessions - https://lpc.events/event/19/contributions/2016/ - where
> > it was discussed how the non-atomic interface for dax hotplug is causing
> > issues in some distributions which have competing userland controllers
> > that interfere with each other.
> > 
> > This series adds a sysfs "hotplug" attribute for atomic whole-device
> > hotplug control, along with the mm and dax plumbing to support it.
> 
> AI review (which hasn't completed at this time) has a lot to say:
> 	https://sashiko.dev/#/patchset/20260321150404.3288786-1-gourry@gourry.net

Looking at the results - I mucked up a UAF during the rebase that I
didn't catch during testing.  Will clean that up.

I also just realized I left an extern in one of the patches that I
thought I had removed.

So I owe a respin on this in more ways than one.

But on the AI review comments about the non-trivial stuff:
---

Much of the remaining commentary is about either the pre-existing code
race conditions, or design questions in the space of that race
condition.

Specifically: userland can still try to twiddle the memoryN/state bits
while the dax device loops over non-contiguous regions.
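
To make the window concrete, here is a rough userspace sketch of the old
per-block pattern.  The memoryN/state files and the "offline" string are
the existing per-block sysfs interface; the function names and the
caller-supplied root directory are hypothetical, chosen so the sketch
can be pointed at a scratch directory instead of a live system.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build "<root>/memory<N>/state" into buf.
 * Returns the path length, or -1 if it does not fit.
 */
int memblock_state_path(char *buf, size_t len, const char *root,
			unsigned long block)
{
	int n;

	assert(buf && root && len > 0);
	n = snprintf(buf, len, "%s/memory%lu/state", root, block);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Old pattern: offline 'count' blocks one write at a time, starting at
 * block 'first'.  Stops at the first failure and returns how many
 * blocks were written successfully.
 */
unsigned long offline_blocks_one_by_one(const char *root,
					unsigned long first,
					unsigned long count)
{
	char path[256];
	unsigned long done;

	for (done = 0; done < count; done++) {
		FILE *f;

		if (memblock_state_path(path, sizeof(path), root,
					first + done) < 0)
			break;
		f = fopen(path, "w");
		if (!f)
			break;
		/*
		 * Race window: nothing stops another agent from writing
		 * "online" to a block this loop already offlined.
		 */
		if (fputs("offline", f) == EOF) {
			fclose(f);
			break;
		}
		if (fclose(f) == EOF)
			break;
	}
	return done;
}
```

On a real system root would be "/sys/devices/system/memory".  Each
iteration is a separate operation, so the region as a whole is never
offlined atomically, and a later remove step can find blocks online
again.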

I dropped this commit:
https://lore.kernel.org/all/20260114235022.3437787-6-gourry@gourry.net/

from the series, because the feedback here:
https://lore.kernel.org/linux-mm/d1938a63-839b-44a5-a68f-34ad290fef21@kernel.org/

suggested that offline_and_remove_memory() would resolve the race
condition problem - but the patch proposed actually solved two issues:

1) Inconsistent hotplug state issue (user is still using the old
   per-block offlining pattern)

2) The old offline pattern calling BUG() instead of WARN() when trying
   to unbind while things are still online.

But this goes to the issue of:  If the race condition in userland has
been around for many years, is it to be considered a feature we should
not break - or on what time scale should we consider breaking it?

I don't know the answer, David will have to weigh in on that.

~Gregory