[PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Ira Weiny 9 months, 4 weeks ago
A git tree of this series can be found here:

	https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13

This is now based on 6.15-rc2.

Due to the stagnation of solid requirements for users of DCD, I do not
plan to rev this work in Q2 of 2025, and possibly beyond.

It is anticipated that this will support at least the initial
implementation of DCD devices, if and when they appear in the ecosystem.
The patch set should be reviewed with this limited set of functionality in
mind.  Additional functionality can be added as devices come to support it.

Individuals or companies wishing to bring DCD devices to market are
strongly encouraged to review this set with their customer use cases
in mind.

Series info
===========

This series has 2 parts:

Patches 1-17:  Core DCD support
Patches 18-19: cxl_test support

Background
==========

A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
device that allows memory capacity within a region to change
dynamically without the need for resetting the device, reconfiguring
HDM decoders, or reconfiguring software DAX regions.

One of the biggest anticipated use cases for Dynamic Capacity is to
allow memory to be dynamically added to or removed from a host within a
data center without physically changing the per-host attached memory or
rebooting the host.

The general flow for the addition or removal of memory is to have an
orchestrator coordinate the use of the memory.  Generally there are five
actors in such a system: the Orchestrator, the Fabric Manager (FM), the
Logical Device, the Host Kernel, and a Host User.

An example workflow is shown below.

Orchestrator      FM         Device       Host Kernel    Host User

    |             |           |            |               |
    |-------------- Create region ------------------------>|
    |             |           |            |               |
    |             |           |            |<-- Create ----|
    |             |           |            |    Region     |
    |             |           |            |(dynamic_ram_a)|
    |<------------- Signal done ---------------------------|
    |             |           |            |               |
    |-- Add ----->|-- Add --->|--- Add --->|               |
    |  Capacity   |  Extent   |   Extent   |               |
    |             |           |            |               |
    |             |<- Accept -|<- Accept  -|               |
    |             |   Extent  |   Extent   |               |
    |             |           |            |<- Create -----|
    |             |           |            |   DAX dev     |-- Use memory
    |             |           |            |               |   |
    |             |           |            |               |   |
    |             |           |            |<- Release ----| <-+
    |             |           |            |   DAX dev     |
    |             |           |            |               |
    |<------------- Signal done ---------------------------|
    |             |           |            |               |
    |-- Remove -->|- Release->|- Release ->|               |
    |  Capacity   |  Extent   |   Extent   |               |
    |             |           |            |               |
    |             |<- Release-|<- Release -|               |
    |             |   Extent  |   Extent   |               |
    |             |           |            |               |
    |-- Add ----->|-- Add --->|--- Add --->|               |
    |  Capacity   |  Extent   |   Extent   |               |
    |             |           |            |               |
    |             |<- Accept -|<- Accept  -|               |
    |             |   Extent  |   Extent   |               |
    |             |           |            |<- Create -----|
    |             |           |            |   DAX dev     |-- Use memory
    |             |           |            |               |   |
    |             |           |            |<- Release ----| <-+
    |             |           |            |   DAX dev     |
    |<------------- Signal done ---------------------------|
    |             |           |            |               |
    |-- Remove -->|- Release->|- Release ->|               |
    |  Capacity   |  Extent   |   Extent   |               |
    |             |           |            |               |
    |             |<- Release-|<- Release -|               |
    |             |   Extent  |   Extent   |               |
    |             |           |            |               |
    |-- Add ----->|-- Add --->|--- Add --->|               |
    |  Capacity   |  Extent   |   Extent   |               |
    |             |           |            |<- Create -----|
    |             |           |            |   DAX dev     |-- Use memory
    |             |           |            |               |   |
    |-- Remove -->|- Release->|- Release ->|               |   |
    |  Capacity   |  Extent   |   Extent   |               |   |
    |             |           |            |               |   |
    |             |           |     (Release Ignored)      |   |
    |             |           |            |               |   |
    |             |           |            |<- Release ----| <-+
    |             |           |            |   DAX dev     |
    |<------------- Signal done ---------------------------|
    |             |           |            |               |
    |             |- Release->|- Release ->|               |
    |             |  Extent   |   Extent   |               |
    |             |           |            |               |
    |             |<- Release-|<- Release -|               |
    |             |   Extent  |   Extent   |               |
    |             |           |            |<- Destroy ----|
    |             |           |            |   Region      |
    |             |           |            |               |

Implementation
==============

This series requires the creation of regions and DAX devices to be
closely synchronized with the Orchestrator and Fabric Manager.  The host
kernel will reject extents if a region is not yet created.  It also
ignores extent release if memory is in use (DAX device created).  These
synchronizations are not anticipated to be an issue with real
applications.
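
As a rough illustration of that policy (hypothetical names, not the
series' actual functions), the decision logic amounts to:

/* Sketch of the synchronization policy; all names are hypothetical. */
#include <linux/errno.h>

struct dc_region;			/* NULL until userspace creates it */

struct dc_extent {
	unsigned long long dpa, len;
	int refcount;			/* DAX devices using this extent */
};

/* Offered capacity is rejected unless a region already exists. */
int host_add_extent(struct dc_region *region, struct dc_extent *ext)
{
	if (!region)
		return -ENXIO;		/* device sees a reject */
	/* ... realize the extent within the region ... */
	return 0;			/* accepted */
}

/* A release is ignored while any DAX device references the extent. */
int host_release_extent(struct dc_extent *ext)
{
	if (ext->refcount)
		return -EBUSY;		/* ignored; the FM must retry */
	/* ... release the extent back to the device ... */
	return 0;
}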

Only a single dynamic ram partition is supported (dynamic_ram_a).  The
requirements, use cases, and existence of actual hardware devices to
support more than one DC partition are unknown at this time, so a less
complex implementation was chosen.

In order to allow capacity to be added and removed, a new concept of
a sparse DAX region is introduced.  A sparse DAX region may have 0 or
more bytes of available space.  The total space depends on the number
and size of the extents which have been added.

It is anticipated that users of the memory will carefully coordinate the
surfacing of capacity with the creation of DAX devices which use that
capacity.  Therefore, the allocation of the memory to DAX devices does
not allow for specific associations between DAX device and extent.  This
keeps allocations of DAX devices similar to existing DAX region
behavior.

To keep the DAX memory allocation aligned with the existing DAX devices
which do not have tags, extents are not allowed to have tags in this
implementation.  Future support for tags can be added when real use
cases surface.

Great care was taken to keep the extent tracking simple.  Some xarrays
needed to be added, but extra software objects are kept to a minimum.

Region extents are tracked as sub-devices of the DAX region.  This
ensures that region destruction cleans up all extent allocations
properly.
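
As a sketch of that bookkeeping (illustrative names only; the actual
structures in the series differ), extents can be tracked in an xarray
keyed by each extent's starting offset:

#include <linux/xarray.h>

struct region_extent;			/* extent sub-device */

static DEFINE_XARRAY(extent_xa);	/* start offset -> region_extent */

static int track_extent(unsigned long start, struct region_extent *ext)
{
	/* xa_insert() returns -EBUSY if the slot is already in use,
	 * which doubles as duplicate detection at identical starts. */
	return xa_insert(&extent_xa, start, ext, GFP_KERNEL);
}

static void untrack_all(void)
{
	struct region_extent *ext;
	unsigned long idx;

	xa_for_each(&extent_xa, idx, ext)
		xa_erase(&extent_xa, idx);
}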

The major functionality of this series includes:

- Getting the dynamic capacity (DC) configuration information from CXL
  devices

- Configuring a DC partition found in hardware.

- Enhancing the CXL and DAX regions for dynamic capacity support
	a. Maintain a logical separation between hardware extents and
	   software managed extents.  This provides an abstraction
	   between the layers and should allow for interleaving in the
	   future

- Get existing hardware extent lists for endpoint decoders upon region
  creation.

- Respond to DC capacity events and adjust available region memory.
	a. Add capacity events
	b. Release capacity events

- Host response for add capacity
	a. Do not accept the extent if the region does not exist or an
	   error occurs realizing the extent.
	b. If the region does exist, realize a DAX region extent with a
	   1:1 mapping (no interleaving yet).
	c. Support the event 'more' bit by processing a list of extents
	   marked with the more bit together before setting up a
	   response.  (See the sketch after this list.)

- Host response for remove capacity
	a. If no DAX device references the extent, release the extent.
	b. If a reference does exist, ignore the request.
	   (Require the FM to issue the release again.)
	c. Release extents flagged with the 'more' bit individually, as
	   the specification allows for the asynchronous release of
	   memory and the implementation is simplified by doing so.

- Modify DAX device creation/resize to account for extents within a
  sparse DAX region

- Trace Dynamic Capacity events for debugging

- Add cxl-test infrastructure to allow for faster unit testing
  (See new ndctl branch for cxl-dcd.sh test[1])

- Only support 0 value extent tags
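
For the 'more' bit handling in particular, here is a minimal sketch of
the intended flow.  The names (struct dcd_event, queue_extent(),
realize_and_respond()) and the flag position are hypothetical, not the
series' actual API or the spec's exact record layout:

#include <linux/types.h>

#define DCD_EVENT_MORE	0x01	/* assumed flag position, for illustration */

struct dcd_event {
	u8 flags;
	struct dc_extent *extent;	/* as in the earlier sketch */
};

void queue_extent(struct dc_extent *ext);	/* stash until chain ends */
int realize_and_respond(void);			/* one response per chain */

int process_add_events(struct dcd_event *evts, int nr)
{
	int i;

	/* Gather every extent in a more-bit chain... */
	for (i = 0; i < nr; i++) {
		queue_extent(evts[i].extent);
		if (!(evts[i].flags & DCD_EVENT_MORE))
			break;		/* chain complete */
	}
	/* ...then realize them together and send a single response. */
	return realize_and_respond();
}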

Fan Ni's upstream QEMU DCD support was used for testing.

Remaining work:

	1) Allow mapping to specific extents (perhaps based on
	   label/tag)
	   1a) devise region size reporting based on tags
	2) Interleave support

Possible additional work depending on requirements:

	1) Accept a new extent which extends (but overlaps) already
	   accepted extent(s)
	2) Rework DAX device interfaces, memfd has been explored a bit
	3) Support more than 1 DC partition

[1] https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13

---
Changes in v9:
- djbw: pare down support to only a single DC partition
- djbw: adjust to the new core partition processing which aligns with
  new type2 work.
- iweiny: address smaller comments from v8
- iweiny: rebase off of 6.15-rc1
- Link to v8: https://patch.msgid.link/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com

---
Ira Weiny (19):
      cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
      cxl/mem: Read dynamic capacity configuration from the device
      cxl/cdat: Gather DSMAS data for DCD partitions
      cxl/core: Enforce partition order/simplify partition calls
      cxl/mem: Expose dynamic ram A partition in sysfs
      cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
      cxl/region: Add sparse DAX region support
      cxl/events: Split event msgnum configuration from irq setup
      cxl/pci: Factor out interrupt policy check
      cxl/mem: Configure dynamic capacity interrupts
      cxl/core: Return endpoint decoder information from region search
      cxl/extent: Process dynamic partition events and realize region extents
      cxl/region/extent: Expose region extent information in sysfs
      dax/bus: Factor out dev dax resize logic
      dax/region: Create resources on sparse DAX regions
      cxl/region: Read existing extents on region creation
      cxl/mem: Trace Dynamic capacity Event Record
      tools/testing/cxl: Make event logs dynamic
      tools/testing/cxl: Add DC Regions to mock mem data

 Documentation/ABI/testing/sysfs-bus-cxl |  100 ++-
 drivers/cxl/core/Makefile               |    2 +-
 drivers/cxl/core/cdat.c                 |   11 +
 drivers/cxl/core/core.h                 |   33 +-
 drivers/cxl/core/extent.c               |  495 +++++++++++++++
 drivers/cxl/core/hdm.c                  |   13 +-
 drivers/cxl/core/mbox.c                 |  632 ++++++++++++++++++-
 drivers/cxl/core/memdev.c               |   87 ++-
 drivers/cxl/core/port.c                 |    5 +
 drivers/cxl/core/region.c               |   76 ++-
 drivers/cxl/core/trace.h                |   65 ++
 drivers/cxl/cxl.h                       |   61 +-
 drivers/cxl/cxlmem.h                    |  134 +++-
 drivers/cxl/mem.c                       |    2 +-
 drivers/cxl/pci.c                       |  115 +++-
 drivers/dax/bus.c                       |  356 +++++++++--
 drivers/dax/bus.h                       |    4 +-
 drivers/dax/cxl.c                       |   71 ++-
 drivers/dax/dax-private.h               |   40 ++
 drivers/dax/hmem/hmem.c                 |    2 +-
 drivers/dax/pmem.c                      |    2 +-
 include/cxl/event.h                     |   31 +
 include/linux/ioport.h                  |    3 +
 tools/testing/cxl/Kbuild                |    3 +-
 tools/testing/cxl/test/mem.c            | 1021 +++++++++++++++++++++++++++----
 25 files changed, 3102 insertions(+), 262 deletions(-)
---
base-commit: 8ffd015db85fea3e15a77027fda6c02ced4d2444
change-id: 20230604-dcd-type2-upstream-0cd15f6216fd

Best regards,
-- 
Ira Weiny <ira.weiny@intel.com>
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Fan Ni 9 months, 4 weeks ago
On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> A git tree of this series can be found here:
> 
> 	https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> 
> This is now based on 6.15-rc2.
> 
> Due to the stagnation of solid requirements for users of DCD, I do not
> plan to rev this work in Q2 of 2025, and possibly beyond.
> 
> It is anticipated that this will support at least the initial
> implementation of DCD devices, if and when they appear in the ecosystem.
> The patch set should be reviewed with this limited set of functionality in
> mind.  Additional functionality can be added as devices come to support it.
> 
> Individuals or companies wishing to bring DCD devices to market are
> strongly encouraged to review this set with their customer use cases
> in mind.

Hi Ira,
thanks for sending it out.

I have not had a chance to check the code or test it extensively.

I tried to test one specific case and hit an issue.

I tried to add some DC extents to the extent list on the device when the
VM is launched by hacking QEMU as below:
diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
index 87fa308495..4049fc8dd9 100644
--- a/hw/mem/cxl_type3.c
+++ b/hw/mem/cxl_type3.c
@@ -826,6 +826,11 @@ static bool cxl_create_dc_regions(CXLType3Dev *ct3d, Error **errp)
     QTAILQ_INIT(&ct3d->dc.extents);
     QTAILQ_INIT(&ct3d->dc.extents_pending);
 
+    cxl_insert_extent_to_extent_list(&ct3d->dc.extents, 0,
+                                     CXL_CAPACITY_MULTIPLIER, NULL, 0);
+    ct3d->dc.total_extent_count = 1;
+    ct3_set_region_block_backed(ct3d, 0, CXL_CAPACITY_MULTIPLIER);
+
     return true;
 }


Then after the VM is launched, I tried to create a DC region with the
command: cxl create-region -m mem0 -d decoder0.0 -s 1G -t
dynamic_ram_a.

It works fine.  As you can see below, the region is created and the
extent is shown correctly.

root@debian:~# cxl list -r region0 -N
[
  {
    "region":"region0",
    "resource":79725330432,
    "size":1073741824,
    "interleave_ways":1,
    "interleave_granularity":256,
    "decode_state":"commit",
    "extents":[
      {
        "offset":0,
        "length":268435456,
        "uuid":"00000000-0000-0000-0000-000000000000"
      }
    ]
  }
]


However, after that, I tried to create a dax device as below, and it failed.

root@debian:~# daxctl create-device -r region0 -v
libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
error creating devices: No such device or address
created 0 devices
root@debian:~# 

root@debian:~# ls /sys/class/dax 
ls: cannot access '/sys/class/dax': No such file or directory

The dmesg shows that really_probe() returns early because resources are
present before probing, as below:

[ 1745.505068] cxl_core:devm_cxl_add_dax_region:3251: cxl_region region0: region0: register dax_region0
[ 1745.506063] cxl_pci:__cxl_pci_mbox_send_cmd:263: cxl_pci 0000:0d:00.0: Sending command: 0x4801
[ 1745.506953] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
[ 1745.507911] cxl_core:__cxl_process_extent_list:1802: cxl_pci 0000:0d:00.0: Got extent list 0-0 of 1 generation Num:0
[ 1745.508958] cxl_core:__cxl_process_extent_list:1815: cxl_pci 0000:0d:00.0: Processing extent 0/1
[ 1745.509843] cxl_core:cxl_validate_extent:975: cxl_pci 0000:0d:00.0: DC extent DPA [range 0x0000000000000000-0x000000000fffffff] (DCR:[range 0x0000000000000000-0x000000007fffffff])(00000000-0000-0000-0000-000000000000)
[ 1745.511748] cxl_core:__cxl_dpa_to_region:2869: cxl decoder2.0: dpa:0x0 mapped in region:region0
[ 1745.512626] cxl_core:cxl_add_extent:460: cxl decoder2.0: Checking ED ([mem 0x00000000-0x3fffffff flags 0x80000200]) for extent [range 0x0000000000000000-0x000000000fffffff]
[ 1745.514143] cxl_core:cxl_add_extent:492: cxl decoder2.0: Add extent [range 0x0000000000000000-0x000000000fffffff] (00000000-0000-0000-0000-000000000000)
[ 1745.515485] cxl_core:online_region_extent:176:  extent0.0: region extent HPA [range 0x0000000000000000-0x000000000fffffff]
[ 1745.516576] cxl_core:cxlr_notify_extent:285: cxl dax_region0: Trying notify: type 0 HPA [range 0x0000000000000000-0x000000000fffffff]
[ 1745.517768] cxl_core:cxl_bus_probe:2087: cxl_region region0: probe: 0
[ 1745.524984] cxl dax_region0: Resources present before probing


btw, I hit the same issue with the previous version also.

Fan

[snip]
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Ira Weiny 9 months, 4 weeks ago
Fan Ni wrote:
> On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> > A git tree of this series can be found here:
> > 
> > 	https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> > 
> > This is now based on 6.15-rc2.
> > 
> > Due to the stagnation of solid requirements for users of DCD, I do not
> > plan to rev this work in Q2 of 2025, and possibly beyond.
> > 
> > It is anticipated that this will support at least the initial
> > implementation of DCD devices, if and when they appear in the ecosystem.
> > The patch set should be reviewed with this limited set of functionality in
> > mind.  Additional functionality can be added as devices come to support it.
> > 
> > Individuals or companies wishing to bring DCD devices to market are
> > strongly encouraged to review this set with their customer use cases
> > in mind.
> 
> Hi Ira,
> thanks for sending it out.
> 
> I have not had a chance to check the code or test it extensively.
> 
> I tried to test one specific case and hit an issue.
> 
> I tried to add some DC extents to the extent list on the device when the
> VM is launched by hacking QEMU as below:
> 
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index 87fa308495..4049fc8dd9 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -826,6 +826,11 @@ static bool cxl_create_dc_regions(CXLType3Dev *ct3d, Error **errp)
>      QTAILQ_INIT(&ct3d->dc.extents);
>      QTAILQ_INIT(&ct3d->dc.extents_pending);
>  
> +    cxl_insert_extent_to_extent_list(&ct3d->dc.extents, 0,
> +                                     CXL_CAPACITY_MULTIPLIER, NULL, 0);
> +    ct3d->dc.total_extent_count = 1;
> +    ct3_set_region_block_backed(ct3d, 0, CXL_CAPACITY_MULTIPLIER);
> +
>      return true;
>  }
> 
> 
> Then after the VM is launched, I tried to create a DC region with the
> command: cxl create-region -m mem0 -d decoder0.0 -s 1G -t
> dynamic_ram_a.
> 
> It works fine.  As you can see below, the region is created and the
> extent is shown correctly.
> 
> root@debian:~# cxl list -r region0 -N
> [
>   {
>     "region":"region0",
>     "resource":79725330432,
>     "size":1073741824,
>     "interleave_ways":1,
>     "interleave_granularity":256,
>     "decode_state":"commit",
>     "extents":[
>       {
>         "offset":0,
>         "length":268435456,
>         "uuid":"00000000-0000-0000-0000-000000000000"
>       }
>     ]
>   }
> ]
> 
> 
> However, after that, I tried to create a dax device as below, and it failed.
> 
> root@debian:~# daxctl create-device -r region0 -v
> libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
> error creating devices: No such device or address
> created 0 devices
> root@debian:~# 
> 
> root@debian:~# ls /sys/class/dax 
> ls: cannot access '/sys/class/dax': No such file or directory

Have you updated daxctl along with cxl-cli?

I was confused by this lack of /sys/class/dax and checked with Vishal.  He
says this is legacy.

I have /sys/bus/dax and that works fine for me with the latest daxctl
built from the ndctl code I sent out:

https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13

Could you build and use the executables from that version?

Ira

[snip]
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Fan Ni 8 months, 4 weeks ago
On Mon, Apr 14, 2025 at 09:37:02PM -0500, Ira Weiny wrote:
> Fan Ni wrote:
> > On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> > > A git tree of this series can be found here:
> > > 
> > > 	https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> > > 
> > > This is now based on 6.15-rc2.
> > > 
> > > Due to the stagnation of solid requirements for users of DCD, I do not
> > > plan to rev this work in Q2 of 2025, and possibly beyond.
> > > 
> > > It is anticipated that this will support at least the initial
> > > implementation of DCD devices, if and when they appear in the ecosystem.
> > > The patch set should be reviewed with this limited set of functionality in
> > > mind.  Additional functionality can be added as devices come to support it.
> > > 
> > > Individuals or companies wishing to bring DCD devices to market are
> > > strongly encouraged to review this set with their customer use cases
> > > in mind.
> > 
> > Hi Ira,
> > thanks for sending it out.
> > 
> > I have not had a chance to check the code or test it extensively.
> > 
> > I tried to test one specific case and hit an issue.
> > 
> > I tried to add some DC extents to the extent list on the device when the
> > VM is launched by hacking QEMU as below:
> > 
> > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > index 87fa308495..4049fc8dd9 100644
> > --- a/hw/mem/cxl_type3.c
> > +++ b/hw/mem/cxl_type3.c
> > @@ -826,6 +826,11 @@ static bool cxl_create_dc_regions(CXLType3Dev *ct3d, Error **errp)
> >      QTAILQ_INIT(&ct3d->dc.extents);
> >      QTAILQ_INIT(&ct3d->dc.extents_pending);
> >  
> > +    cxl_insert_extent_to_extent_list(&ct3d->dc.extents, 0,
> > +                                     CXL_CAPACITY_MULTIPLIER, NULL, 0);
> > +    ct3d->dc.total_extent_count = 1;
> > +    ct3_set_region_block_backed(ct3d, 0, CXL_CAPACITY_MULTIPLIER);
> > +
> >      return true;
> >  }
> > 
> > 
> > Then after the VM is launched, I tried to create a DC region with the
> > command: cxl create-region -m mem0 -d decoder0.0 -s 1G -t
> > dynamic_ram_a.
> > 
> > It works fine.  As you can see below, the region is created and the
> > extent is shown correctly.
> > 
> > root@debian:~# cxl list -r region0 -N
> > [
> >   {
> >     "region":"region0",
> >     "resource":79725330432,
> >     "size":1073741824,
> >     "interleave_ways":1,
> >     "interleave_granularity":256,
> >     "decode_state":"commit",
> >     "extents":[
> >       {
> >         "offset":0,
> >         "length":268435456,
> >         "uuid":"00000000-0000-0000-0000-000000000000"
> >       }
> >     ]
> >   }
> > ]
> > 
> > 
> > However, after that, I tried to create a dax device as below, and it failed.
> > 
> > root@debian:~# daxctl create-device -r region0 -v
> > libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
> > error creating devices: No such device or address
> > created 0 devices
> > root@debian:~# 
> > 
> > root@debian:~# ls /sys/class/dax 
> > ls: cannot access '/sys/class/dax': No such file or directory
> 
> Have you updated daxctl along with cxl-cli?
> 
> I was confused by this lack of /sys/class/dax and checked with Vishal.  He
> says this is legacy.
> 
> I have /sys/bus/dax and that works fine for me with the latest daxctl
> built from the ndctl code I sent out:
> 
> https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
> 
> Could you build and use the executables from that version?
> 
> Ira

Hi Ira,
Here are more details about the issue and reasoning.


# ISSUE: No DAX device created

## What we see: No DAX device is created after creating the DC region
<pre>
fan@smc-140338-bm01:~/cxl/linux-dcd$ cxl-tool.py --dcd-test mem0
Load cxl drivers first
ssh root@localhost -p 2024 "modprobe -a cxl_acpi cxl_core cxl_pci cxl_port cxl_mem"

Module                  Size  Used by
dax_pmem               12288  0
device_dax             16384  0
nd_pmem                24576  0
nd_btt                 28672  1 nd_pmem
dax                    57344  3 dax_pmem,device_dax,nd_pmem
cxl_pmu                28672  0
cxl_mem                12288  0
cxl_pmem               24576  0
libnvdimm             217088  4 cxl_pmem,dax_pmem,nd_btt,nd_pmem
cxl_pci                28672  0
cxl_acpi               24576  0
cxl_port               16384  0
cxl_core              368640  7 cxl_pmem,cxl_port,cxl_mem,cxl_pci,cxl_acpi,cxl_pmu
ssh root@localhost -p 2024 "cxl enable-memdev mem0"
cxl memdev: cmd_enable_memdev: enabled 1 mem
{
  "region":"region0",
  "resource":79725330432,
  "size":2147483648,
  "interleave_ways":1,
  "interleave_granularity":256,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region
sn=3840
cxl-memdev0
sn=3840
Choose OP: 0: add, 1: release, 2: print extent, 9: exit
Choice: 9
Do you want to continue to create dax device for DC(Y/N):y
daxctl create-device -r region0
error creating devices: No such device or address
created 0 devices
daxctl list -r region0 -D

Create dax device failed
</pre>

## What caused the issue: Resources present before probing

<pre>
...
[   14.251500] cxl_core:cxl_region_probe:3571: cxl_region region0: config state: 0
[   14.254129] cxl_core:cxl_bus_probe:2087: cxl_region region0: probe: -6
[   14.256536] cxl_core:devm_cxl_add_region:2535: cxl_acpi ACPI0017:00: decoder0.0: created region0
[   14.281676] cxl_core:cxl_port_attach_region:1169: cxl region0: mem0:endpoint2 decoder2.0 add: mem0:decoder2.0 @ 0 next: none nr_eps: 1 nr_targets: 1
[   14.286254] cxl_core:cxl_port_attach_region:1169: cxl region0: pci0000:0c:port1 decoder1.0 add: mem0:decoder2.0 @ 0 next: mem0 nr_eps: 1 nr_targets: 1
[   14.290995] cxl_core:cxl_port_setup_targets:1489: cxl region0: pci0000:0c:port1 iw: 1 ig: 256
[   14.294161] cxl_core:cxl_port_setup_targets:1513: cxl region0: pci0000:0c:port1 target[0] = 0000:0c:00.0 for mem0:decoder2.0 @ 0
[   14.298209] cxl_core:cxl_calc_interleave_pos:1880: cxl_mem mem0: decoder:decoder2.0 parent:0000:0d:00.0 port:endpoint2 range:0x1290000000-0x130fffffff pos:0
[   14.303224] cxl_core:cxl_region_attach:2080: cxl decoder2.0: Test cxl_calc_interleave_pos(): success test_pos:0 cxled->pos:0
[   14.307522] cxl region0: Bypassing cpu_cache_invalidate_memregion() for testing!
[   14.319576] cxl_core:devm_cxl_add_dax_region:3251: cxl_region region0: region0: register dax_region0
[   14.322918] cxl_pci:__cxl_pci_mbox_send_cmd:263: cxl_pci 0000:0d:00.0: Sending command: 0x4801
[   14.326102] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
[   14.329523] cxl_core:__cxl_process_extent_list:1802: cxl_pci 0000:0d:00.0: Got extent list 0-0 of 1 generation Num:0
[   14.333141] cxl_core:__cxl_process_extent_list:1815: cxl_pci 0000:0d:00.0: Processing extent 0/1
[   14.336172] cxl_core:cxl_validate_extent:975: cxl_pci 0000:0d:00.0: DC extent DPA [range 0x0000000000000000-0x000000000fffffff] (DCR:[range 0x0000000000000000-0x000000007fffffff])(00000000-0000-0000-0000-000000000000)
[   14.342736] cxl_core:__cxl_dpa_to_region:2869: cxl decoder2.0: dpa:0x0 mapped in region:region0
[   14.345447] cxl_core:cxl_add_extent:460: cxl decoder2.0: Checking ED ([mem 0x00000000-0x7fffffff flags 0x80000200]) for extent [range 0x0000000000000000-0x000000000fffffff]
[   14.350198] cxl_core:cxl_add_extent:492: cxl decoder2.0: Add extent [range 0x0000000000000000-0x000000000fffffff] (00000000-0000-0000-0000-000000000000)
[   14.354574] cxl_core:online_region_extent:176:  extent0.0: region extent HPA [range 0x0000000000000000-0x000000000fffffff]
[   14.357876] cxl_core:cxlr_notify_extent:285: cxl dax_region0: Trying notify: type 0 HPA [range 0x0000000000000000-0x000000000fffffff]
[   14.361361] cxl_core:cxl_bus_probe:2087: cxl_region region0: probe: 0
[   14.395020] cxl dax_region0: Resources present before probing
...
</pre>
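
For reference, that message comes from the driver core rather than the
CXL code.  Paraphrasing the check in really_probe() (drivers/base/dd.c;
the exact surrounding code varies by kernel version):

<pre>
	/* A device that already owns devres entries before its
	 * driver binds is rejected outright. */
	if (!list_empty(&dev->devres_head)) {
		dev_crit(dev, "Resources present before probing\n");
		ret = -EBUSY;
		goto done;
	}
</pre>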

## Workaround (not a fix)

By chasing why the devres linked list is not empty, and when add_dr() is
called, I located the code that caused the issue.  The hack below confirms
that the issue is caused by the devm_add_action_or_reset() call.

<pre>
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 4dc0dec486f6..26daa7906717 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -174,6 +174,7 @@ static int online_region_extent(struct region_extent *region_extent)
                goto err;
 
        dev_dbg(dev, "region extent HPA %pra\n", &region_extent->hpa_range);
+       return 0;
        return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
                                        region_extent);
</pre> 
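
Why dropping the call avoids the check (an illustration with a
hypothetical helper, not the series' code): devm_add_action_or_reset()
attaches a devres entry to its target device immediately, whether or
not that device's driver has probed yet.

<pre>
#include <linux/device.h>

static void cleanup(void *data)
{
	/* teardown for @data */
}

static int attach_cleanup(struct device *unprobed_dev, void *data)
{
	/*
	 * After this call unprobed_dev->devres_head is non-empty.  If
	 * the device's driver probes later, really_probe() returns
	 * -EBUSY with "Resources present before probing".  In the
	 * series the target is cxlr_dax->dev, whose dax_region driver
	 * has not bound yet, hence the failure.
	 */
	return devm_add_action_or_reset(unprobed_dev, cleanup, data);
}
</pre>

So the hack trades the probe failure for a leaked unregister action; a
real fix presumably needs the action anchored on a device that is
already past probe.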

## Output

<pre>
fan@smc-140338-bm01:~/cxl/linux-dcd$ cxl-tool.py --run --create-topo
Info: back memory/lsa file exist under /tmp/host0 from previous run, delete them Y/N(default Y): 
Starting VM...
QEMU instance is up, access it: ssh root@localhost -p 2024
fan@smc-140338-bm01:~/cxl/linux-dcd$ cxl-tool.py --dcd-test mem0
Load cxl drivers first
ssh root@localhost -p 2024 "modprobe -a cxl_acpi cxl_core cxl_pci cxl_port cxl_mem"

Module                  Size  Used by
dax_pmem               12288  0
device_dax             16384  0
nd_pmem                24576  0
nd_btt                 28672  1 nd_pmem
dax                    57344  3 dax_pmem,device_dax,nd_pmem
cxl_pmem               24576  0
cxl_pmu                28672  0
cxl_mem                12288  0
libnvdimm             217088  4 cxl_pmem,dax_pmem,nd_btt,nd_pmem
cxl_pci                28672  0
cxl_acpi               24576  0
cxl_port               16384  0
cxl_core              368640  7 cxl_pmem,cxl_port,cxl_mem,cxl_pci,cxl_acpi,cxl_pmu
ssh root@localhost -p 2024 "cxl enable-memdev mem0"
cxl memdev: cmd_enable_memdev: enabled 1 mem
cxl region: cmd_create_region: created 1 region
{
  "region":"region0",
  "resource":79725330432,
  "size":2147483648,
  "interleave_ways":1,
  "interleave_granularity":256,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
sn=3840
cxl-memdev0
sn=3840
Choose OP: 0: add, 1: release, 2: print extent, 9: exit
Choice: 2
cat /tmp/qmp-show.json|ncat localhost 4445
{"QMP": {"version": {"qemu": {"micro": 90, "minor": 2, "major": 9}, "package": "v6.2.0-28065-g3537a06886-dirty"}, "capabilities": ["oob"]}}
{"return": {}}
{"return": {}}
{"return": {}}
Print accepted extent info:
0: [0x0 - 0x10000000]
In total, 1 extents printed!
Print pending-to-add extent info:
In total, 0 extents printed!
Choose OP: 0: add, 1: release, 2: print extent, 9: exit
Choice: 9
Do you want to continue to create dax device for DC(Y/N):y
daxctl create-device -r region0
[
  {
    "chardev":"dax0.1",
    "size":268435456,
    "target_node":1,
    "align":2097152,
    "mode":"devdax"
  }
]
created 1 device
daxctl list -r region0 -D
[
  {
    "chardev":"dax0.1",
    "size":268435456,
    "target_node":1,
    "align":2097152,
    "mode":"devdax"
  }
]
ssh root@localhost -p 2024 "daxctl reconfigure-device dax0.1 -m system-ram"
[
  {
    "chardev":"dax0.1",
    "size":268435456,
    "target_node":1,
    "align":2097152,
    "mode":"system-ram",
    "online_memblocks":2,
    "total_memblocks":2,
    "movable":true
  }
]
reconfigured 1 device
RANGE                                  SIZE  STATE REMOVABLE   BLOCK
0x0000000000000000-0x000000007fffffff    2G online       yes    0-15
0x0000000100000000-0x000000027fffffff    6G online       yes   32-79
0x0000001290000000-0x000000129fffffff  256M online       yes 594-595

Memory block size:       128M
Total online memory:     8.3G
</pre>



fan

[snip]
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Dan Williams 9 months, 4 weeks ago
Ira Weiny wrote:
[..]
> > However, after that, I tried to create a dax device as below, it failed.
> > 
> > root@debian:~# daxctl create-device -r region0 -v
> > libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax

Note that /sys/class/dax support was removed from the kernel back in
v5.17:

83762cb5c7c4 dax: Kill DEV_DAX_PMEM_COMPAT

daxctl still supports pre-v5.17 kernels and always checks both subsystem
types. This is a debug message just confirming that it is running on a
new kernel, see dax_regions_init() in daxctl.

> > error creating devices: No such device or address
> > created 0 devices
> > root@debian:~# 
> > 
> > root@debian:~# ls /sys/class/dax 
> > ls: cannot access '/sys/class/dax': No such file or directory
> 
> Have you update daxctl with cxl-cli?
> 
> I was confused by this lack of /sys/class/dax and checked with Vishal.  He
> says this is legacy.
> 
> I have /sys/bus/dax and that works fine for me with the latest daxctl
> built from the ndctl code I sent out:
> 
> https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
> 
> Could you build and use the executables from that version?

The same debug message still exists in that version and will fire every
time when debug is enabled.
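
A hedged sketch of that lookup order (not the verbatim libdaxctl
source; the function name and structure here are illustrative of the
behavior described above):

#include <stddef.h>
#include <sys/stat.h>

static const char *dax_subsys_path(void)
{
	struct stat st;

	if (stat("/sys/bus/dax/devices", &st) == 0)
		return "/sys/bus/dax";		/* modern layout */
	if (stat("/sys/class/dax", &st) == 0)
		return "/sys/class/dax";	/* legacy, pre-v5.17 */
	return NULL;				/* neither: no DAX support */
}

On a current kernel the first check succeeds, so the /sys/class/dax
debug line is informational, not an error by itself.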
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Fan Ni 9 months, 4 weeks ago
On Mon, Apr 14, 2025 at 09:37:02PM -0500, Ira Weiny wrote:
> Fan Ni wrote:
> > On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> > > A git tree of this series can be found here:
> > > 
> > > 	https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> > > 
> > > This is now based on 6.15-rc2.
> > > 
> > > Due to the stagnation of solid requirements for users of DCD, I do not
> > > plan to rev this work in Q2 of 2025, and possibly beyond.
> > > 
> > > It is anticipated that this will support at least the initial
> > > implementation of DCD devices, if and when they appear in the ecosystem.
> > > The patch set should be reviewed with this limited set of functionality in
> > > mind.  Additional functionality can be added as devices come to support it.
> > > 
> > > Individuals or companies wishing to bring DCD devices to market are
> > > strongly encouraged to review this set with their customer use cases
> > > in mind.
> > 
> > Hi Ira,
> > thanks for sending it out.
> > 
> > I have not had a chance to check the code or test it extensively.
> > 
> > I tried to test one specific case and hit an issue.
> > 
> > I tried to add some DC extents to the extent list on the device when the
> > VM is launched by hacking QEMU as below:
> > 
> > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > index 87fa308495..4049fc8dd9 100644
> > --- a/hw/mem/cxl_type3.c
> > +++ b/hw/mem/cxl_type3.c
> > @@ -826,6 +826,11 @@ static bool cxl_create_dc_regions(CXLType3Dev *ct3d, Error **errp)
> >      QTAILQ_INIT(&ct3d->dc.extents);
> >      QTAILQ_INIT(&ct3d->dc.extents_pending);
> >  
> > +    cxl_insert_extent_to_extent_list(&ct3d->dc.extents, 0,
> > +                                     CXL_CAPACITY_MULTIPLIER, NULL, 0);
> > +    ct3d->dc.total_extent_count = 1;
> > +    ct3_set_region_block_backed(ct3d, 0, CXL_CAPACITY_MULTIPLIER);
> > +
> >      return true;
> >  }
> > 
> > 
> > Then after the VM is launched, I tried to create a DC region with the
> > command: cxl create-region -m mem0 -d decoder0.0 -s 1G -t
> > dynamic_ram_a.
> > 
> > It works fine.  As you can see below, the region is created and the
> > extent is shown correctly.
> > 
> > root@debian:~# cxl list -r region0 -N
> > [
> >   {
> >     "region":"region0",
> >     "resource":79725330432,
> >     "size":1073741824,
> >     "interleave_ways":1,
> >     "interleave_granularity":256,
> >     "decode_state":"commit",
> >     "extents":[
> >       {
> >         "offset":0,
> >         "length":268435456,
> >         "uuid":"00000000-0000-0000-0000-000000000000"
> >       }
> >     ]
> >   }
> > ]
> > 
> > 
> > However, after that, I tried to create a dax device as below, and it failed.
> > 
> > root@debian:~# daxctl create-device -r region0 -v
> > libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
> > error creating devices: No such device or address
> > created 0 devices
> > root@debian:~# 
> > 
> > root@debian:~# ls /sys/class/dax 
> > ls: cannot access '/sys/class/dax': No such file or directory
> 
> Have you updated daxctl along with cxl-cli?
> 
> I was confused by this lack of /sys/class/dax and checked with Vishal.  He
> says this is legacy.
> 
> I have /sys/bus/dax and that works fine for me with the latest daxctl
> built from the ndctl code I sent out:
> 
> https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
> 
> Could you build and use the executables from that version?
> 
> Ira

That is my setup.

root@debian:~# cxl list -r region0 -N
[
  {
    "region":"region0",
    "resource":79725330432,
    "size":2147483648,
    "interleave_ways":1,
    "interleave_granularity":256,
    "decode_state":"commit",
    "extents":[
      {
        "offset":0,
        "length":268435456,
        "uuid":"00000000-0000-0000-0000-000000000000"
      }
    ]
  }
]
root@debian:~# cd ndctl/
root@debian:~/ndctl# git branch
* dcd-region3-2025-04-13
root@debian:~/ndctl# ./build/daxctl/daxctl create-device -r region0 -v
libdaxctl: __dax_regions_init: no dax regions found via: /sys/class/dax
error creating devices: No such device or address
created 0 devices

root@debian:~/ndctl# cat .git/config 
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
[remote "origin"]
	url = https://github.com/weiny2/ndctl.git
	fetch = +refs/heads/dcd-region3-2025-04-13:refs/remotes/origin/dcd-region3-2025-04-13
[branch "dcd-region3-2025-04-13"]
	remote = origin
	merge = refs/heads/dcd-region3-2025-04-13


Fan

> 
> > 
> > The dmesg shows the really_probe function returns early as resource
> > presents before probe as below,
> > 
> > [ 1745.505068] cxl_core:devm_cxl_add_dax_region:3251: cxl_region region0: region0: register dax_region0
> > [ 1745.506063] cxl_pci:__cxl_pci_mbox_send_cmd:263: cxl_pci 0000:0d:00.0: Sending command: 0x4801
> > [ 1745.506953] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
> > [ 1745.507911] cxl_core:__cxl_process_extent_list:1802: cxl_pci 0000:0d:00.0: Got extent list 0-0 of 1 generation Num:0
> > [ 1745.508958] cxl_core:__cxl_process_extent_list:1815: cxl_pci 0000:0d:00.0: Processing extent 0/1
> > [ 1745.509843] cxl_core:cxl_validate_extent:975: cxl_pci 0000:0d:00.0: DC extent DPA [range 0x0000000000000000-0x000000000fffffff] (DCR:[range 0x0000000000000000-0x000000007fffffff])(00000000-0000-0000-0000-000000000000)
> > [ 1745.511748] cxl_core:__cxl_dpa_to_region:2869: cxl decoder2.0: dpa:0x0 mapped in region:region0
> > [ 1745.512626] cxl_core:cxl_add_extent:460: cxl decoder2.0: Checking ED ([mem 0x00000000-0x3fffffff flags 0x80000200]) for extent [range 0x0000000000000000-0x000000000fffffff]
> > [ 1745.514143] cxl_core:cxl_add_extent:492: cxl decoder2.0: Add extent [range 0x0000000000000000-0x000000000fffffff] (00000000-0000-0000-0000-000000000000)
> > [ 1745.515485] cxl_core:online_region_extent:176:  extent0.0: region extent HPA [range 0x0000000000000000-0x000000000fffffff]
> > [ 1745.516576] cxl_core:cxlr_notify_extent:285: cxl dax_region0: Trying notify: type 0 HPA [range 0x0000000000000000-0x000000000fffffff]
> > [ 1745.517768] cxl_core:cxl_bus_probe:2087: cxl_region region0: probe: 0
> > [ 1745.524984] cxl dax_region0: Resources present before probing
> > 
> > 
> > btw, I hit the same issue with the previous version also.
> > 
> > Fan
> 
> [snip]
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Fan Ni 8 months, 1 week ago
On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> A git tree of this series can be found here:
> 
> 	https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> 
> This is now based on 6.15-rc2.
> 
> Due to the stagnation of solid requirements for users of DCD I do not
> plan to rev this work in Q2 of 2025 and possibly beyond.
> 
> It is anticipated that this will support at least the initial
> implementation of DCD devices, if and when they appear in the ecosystem.
> The patch set should be reviewed with the limited set of functionality in
> mind.  Additional functionality can be added as devices support them.
> 
> It is strongly encouraged for individuals or companies wishing to bring
> DCD devices to market review this set with the customer use cases they
> have in mind.

Hi,
I have a general question about DCD.

How will the start DPA of the first DC region be set before any extent is
offered to the hosts?

In this series, no DPA gap (skip) is allowed between static capacity and
dynamic capacity.  That seems to imply that some component which knows the
layout of the host memory will need to set the start DPA of the first DC
region?  The firmware?

Also, if a DC extent is shared among multiple hosts, each of which has a
different memory configuration, how does the DCD device provide the extents
to each host to make sure there is no DPA gap between the static and dynamic
capacity ranges on all the hosts?
It seems the start DPA of the DCD needs to be different for each host.  Not
sure how to achieve that.

Fan

> 
> Series info
> ===========
> 
> This series has 2 parts:
> 
> Patch 1-17: Core DCD support
> Patch 18-19: cxl_test support
> 
> Background
> ==========
> 
> A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
> device that allows memory capacity within a region to change
> dynamically without the need for resetting the device, reconfiguring
> HDM decoders, or reconfiguring software DAX regions.
> 
> One of the biggest anticipated use cases for Dynamic Capacity is to
> allow hosts to dynamically add or remove memory from a host within a
> data center without physically changing the per-host attached memory nor
> rebooting the host.
> 
> The general flow for the addition or removal of memory is to have an
> orchestrator coordinate the use of the memory.  Generally there are 5
> actors in such a system, the Orchestrator, Fabric Manager, the Logical
> device, the Host Kernel, and a Host User.
> 
> An example work flow is shown below.
> 
> Orchestrator      FM         Device       Host Kernel    Host User
> 
>     |             |           |            |               |
>     |-------------- Create region ------------------------>|
>     |             |           |            |               |
>     |             |           |            |<-- Create ----|
>     |             |           |            |    Region     |
>     |             |           |            |(dynamic_ram_a)|
>     |<------------- Signal done ---------------------------|
>     |             |           |            |               |
>     |-- Add ----->|-- Add --->|--- Add --->|               |
>     |  Capacity   |  Extent   |   Extent   |               |
>     |             |           |            |               |
>     |             |<- Accept -|<- Accept  -|               |
>     |             |   Extent  |   Extent   |               |
>     |             |           |            |<- Create ---->|
>     |             |           |            |   DAX dev     |-- Use memory
>     |             |           |            |               |   |
>     |             |           |            |               |   |
>     |             |           |            |<- Release ----| <-+
>     |             |           |            |   DAX dev     |
>     |             |           |            |               |
>     |<------------- Signal done ---------------------------|
>     |             |           |            |               |
>     |-- Remove -->|- Release->|- Release ->|               |
>     |  Capacity   |  Extent   |   Extent   |               |
>     |             |           |            |               |
>     |             |<- Release-|<- Release -|               |
>     |             |   Extent  |   Extent   |               |
>     |             |           |            |               |
>     |-- Add ----->|-- Add --->|--- Add --->|               |
>     |  Capacity   |  Extent   |   Extent   |               |
>     |             |           |            |               |
>     |             |<- Accept -|<- Accept  -|               |
>     |             |   Extent  |   Extent   |               |
>     |             |           |            |<- Create -----|
>     |             |           |            |   DAX dev     |-- Use memory
>     |             |           |            |               |   |
>     |             |           |            |<- Release ----| <-+
>     |             |           |            |   DAX dev     |
>     |<------------- Signal done ---------------------------|
>     |             |           |            |               |
>     |-- Remove -->|- Release->|- Release ->|               |
>     |  Capacity   |  Extent   |   Extent   |               |
>     |             |           |            |               |
>     |             |<- Release-|<- Release -|               |
>     |             |   Extent  |   Extent   |               |
>     |             |           |            |               |
>     |-- Add ----->|-- Add --->|--- Add --->|               |
>     |  Capacity   |  Extent   |   Extent   |               |
>     |             |           |            |<- Create -----|
>     |             |           |            |   DAX dev     |-- Use memory
>     |             |           |            |               |   |
>     |-- Remove -->|- Release->|- Release ->|               |   |
>     |  Capacity   |  Extent   |   Extent   |               |   |
>     |             |           |            |               |   |
>     |             |           |     (Release Ignored)      |   |
>     |             |           |            |               |   |
>     |             |           |            |<- Release ----| <-+
>     |             |           |            |   DAX dev     |
>     |<------------- Signal done ---------------------------|
>     |             |           |            |               |
>     |             |- Release->|- Release ->|               |
>     |             |  Extent   |   Extent   |               |
>     |             |           |            |               |
>     |             |<- Release-|<- Release -|               |
>     |             |   Extent  |   Extent   |               |
>     |             |           |            |<- Destroy ----|
>     |             |           |            |   Region      |
>     |             |           |            |               |
> 
> Implementation
> ==============
> 
> This series requires the creation of regions and DAX devices to be
> closely synchronized with the Orchestrator and Fabric Manager.  The host
> kernel will reject extents if a region is not yet created.  It also
> ignores extent release if memory is in use (DAX device created).  These
> synchronizations are not anticipated to be an issue with real
> applications.
> 
> Only a single dynamic ram partition is supported (dynamic_ram_a).  The
> requirements, use cases, and existence of actual hardware devices to
> support more than one DC partition are unknown at this time, so a less
> complex implementation was chosen.
> 
> In order to allow for capacity to be added and removed a new concept of
> a sparse DAX region is introduced.  A sparse DAX region may have 0 or
> more bytes of available space.  The total space depends on the number
> and size of the extents which have been added.
> 
> It is anticipated that users of the memory will carefully coordinate the
> surfacing of capacity with the creation of DAX devices which use that
> capacity.  Therefore, the allocation of the memory to DAX devices does
> not allow for specific associations between DAX device and extent.  This
> keeps allocations of DAX devices similar to existing DAX region
> behavior.
> 
> To keep the DAX memory allocation aligned with the existing DAX devices
> which do not have tags, extents are not allowed to have tags in this
> implementation.  Future support for tags can be added when real use
> cases surface.
> 
> Great care was taken to keep the extent tracking simple.  Some xarrays
> needed to be added, but extra software objects are kept to a minimum.
> 
> Region extents are tracked as sub-devices of the DAX region.  This
> ensures that region destruction cleans up all extent allocations
> properly.
> 
> The major functionality of this series includes:
> 
> - Getting the dynamic capacity (DC) configuration information from cxl
>   devices
> 
> - Configuring a DC partition found in hardware.
> 
> - Enhancing the CXL and DAX regions for dynamic capacity support
> 	a. Maintain a logical separation between hardware extents and
> 	   software managed extents.  This provides an abstraction
> 	   between the layers and should allow for interleaving in the
> 	   future
> 
> - Get existing hardware extent lists for endpoint decoders upon region
>   creation.
> 
> - Respond to DC capacity events and adjust available region memory.
>         a. Add capacity Events
> 	b. Release capacity events
> 
> - Host response for add capacity
> 	a. do not accept the extent if the region does not exist
> 	   or an error occurs realizing the extent
> 	b. If the region does exist
> 		realize a DAX region extent with 1:1 mapping (no
> 		interleave yet)
> 	c. Support the event more bit by processing a list of extents
> 	   marked with the more bit together before setting up a
> 	   response.
> 
> - Host response for remove capacity
> 	a. If no DAX device references the extent; release the extent
> 	b. If a reference does exist, ignore the request.
> 	   (Require FM to issue release again.)
> 	c. Release extents flagged with the 'more' bit individually as
> 	   the specification allows for the asynchronous release of
> 	   memory and the implementation is simplified by doing so.
> 
> - Modify DAX device creation/resize to account for extents within a
>   sparse DAX region
> 
> - Trace Dynamic Capacity events for debugging
> 
> - Add cxl-test infrastructure to allow for faster unit testing
>   (See new ndctl branch for cxl-dcd.sh test[1])
> 
> - Only support 0 value extent tags
> 
> Fan Ni's upstream of Qemu DCD was used for testing.
> 
> Remaining work:
> 
> 	1) Allow mapping to specific extents (perhaps based on
> 	   label/tag)
> 	   1a) devise region size reporting based on tags
> 	2) Interleave support
> 
> Possible additional work depending on requirements:
> 
> 	1) Accept a new extent which extends (but overlaps) already
> 	   accepted extent(s)
> 	2) Rework DAX device interfaces, memfd has been explored a bit
> 	3) Support more than 1 DC partition
> 
> [1] https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13
> 
> ---
> Changes in v9:
> - djbw: pare down support to only a single DC partition
> - djbw: adjust to the new core partition processing which aligns with
>   new type2 work.
> - iweiny: address smaller comments from v8
> - iweiny: rebase off of 6.15-rc1
> - Link to v8: https://patch.msgid.link/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com
> 
> ---
> Ira Weiny (19):
>       cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
>       cxl/mem: Read dynamic capacity configuration from the device
>       cxl/cdat: Gather DSMAS data for DCD partitions
>       cxl/core: Enforce partition order/simplify partition calls
>       cxl/mem: Expose dynamic ram A partition in sysfs
>       cxl/port: Add 'dynamic_ram_a' to endpoint decoder mode
>       cxl/region: Add sparse DAX region support
>       cxl/events: Split event msgnum configuration from irq setup
>       cxl/pci: Factor out interrupt policy check
>       cxl/mem: Configure dynamic capacity interrupts
>       cxl/core: Return endpoint decoder information from region search
>       cxl/extent: Process dynamic partition events and realize region extents
>       cxl/region/extent: Expose region extent information in sysfs
>       dax/bus: Factor out dev dax resize logic
>       dax/region: Create resources on sparse DAX regions
>       cxl/region: Read existing extents on region creation
>       cxl/mem: Trace Dynamic capacity Event Record
>       tools/testing/cxl: Make event logs dynamic
>       tools/testing/cxl: Add DC Regions to mock mem data
> 
>  Documentation/ABI/testing/sysfs-bus-cxl |  100 ++-
>  drivers/cxl/core/Makefile               |    2 +-
>  drivers/cxl/core/cdat.c                 |   11 +
>  drivers/cxl/core/core.h                 |   33 +-
>  drivers/cxl/core/extent.c               |  495 +++++++++++++++
>  drivers/cxl/core/hdm.c                  |   13 +-
>  drivers/cxl/core/mbox.c                 |  632 ++++++++++++++++++-
>  drivers/cxl/core/memdev.c               |   87 ++-
>  drivers/cxl/core/port.c                 |    5 +
>  drivers/cxl/core/region.c               |   76 ++-
>  drivers/cxl/core/trace.h                |   65 ++
>  drivers/cxl/cxl.h                       |   61 +-
>  drivers/cxl/cxlmem.h                    |  134 +++-
>  drivers/cxl/mem.c                       |    2 +-
>  drivers/cxl/pci.c                       |  115 +++-
>  drivers/dax/bus.c                       |  356 +++++++++--
>  drivers/dax/bus.h                       |    4 +-
>  drivers/dax/cxl.c                       |   71 ++-
>  drivers/dax/dax-private.h               |   40 ++
>  drivers/dax/hmem/hmem.c                 |    2 +-
>  drivers/dax/pmem.c                      |    2 +-
>  include/cxl/event.h                     |   31 +
>  include/linux/ioport.h                  |    3 +
>  tools/testing/cxl/Kbuild                |    3 +-
>  tools/testing/cxl/test/mem.c            | 1021 +++++++++++++++++++++++++++----
>  25 files changed, 3102 insertions(+), 262 deletions(-)
> ---
> base-commit: 8ffd015db85fea3e15a77027fda6c02ced4d2444
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
> 
> Best regards,
> -- 
> Ira Weiny <ira.weiny@intel.com>
> 

-- 
Fan Ni
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Fan Ni 8 months ago
On Tue, Jun 03, 2025 at 09:32:18AM -0700, Fan Ni wrote:
> On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> > A git tree of this series can be found here:
> > 
> > 	https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> > 
> > This is now based on 6.15-rc2.
> > 
> > Due to the stagnation of solid requirements for users of DCD I do not
> > plan to rev this work in Q2 of 2025 and possibly beyond.
> > 
> > It is anticipated that this will support at least the initial
> > implementation of DCD devices, if and when they appear in the ecosystem.
> > The patch set should be reviewed with the limited set of functionality in
> > mind.  Additional functionality can be added as devices support them.
> > 
> > It is strongly encouraged for individuals or companies wishing to bring
> > DCD devices to market review this set with the customer use cases they
> > have in mind.
> 
> Hi,
> I have a general question about DCD.
> 
> How will the start DPA of the first DC region be set before any extent is
> offered to the hosts?
> 
> In this series, no DPA gap (skip) is allowed between static capacity and
> dynamic capacity.  That seems to imply that some component which knows the
> layout of the host memory will need to set the start DPA of the first DC
> region?  The firmware?
> 
> Also, if a DC extent is shared among multiple hosts, each of which has a
> different memory configuration, how does the DCD device provide the extents
> to each host to make sure there is no DPA gap between the static and dynamic
> capacity ranges on all the hosts?
> It seems the start DPA of the DCD needs to be different for each host.  Not
> sure how to achieve that.
> 
> Fan

Ignore the above message, the question does not make sense.

Fan
> [snip]

-- 
Fan Ni
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Jonathan Cameron 9 months, 4 weeks ago
On Sun, 13 Apr 2025 17:52:08 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> A git tree of this series can be found here:
> 
> 	https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> 
> This is now based on 6.15-rc2.

Hi Ira,

Firstly thanks for the update and your hard work driving this forwards.

> 
> Due to the stagnation of solid requirements for users of DCD I do not
> plan to rev this work in Q2 of 2025 and possibly beyond.

Hopefully there will be limited need to make changes (it looks pretty
good to me - we'll run a bunch of tests, though, which I haven't done
yet).  I do have reason to want this code upstream, and it is now
simple enough that I hope it is not controversial.  Let's discuss the
path forwards on the sync call tomorrow, as I'm sure I'm not the only one.

If needed I'm fine picking up the baton to keep this moving forwards
(I'm even more happy to let someone else step up though!)

To me we don't need to answer the question of whether we fully understand
requirements, or whether this support covers them, but rather to ask
if anyone has requirements that are not sensible to satisfy with additional
work building on this?

I'm not aware of any such blocker.  For the things I care about the
path forwards looks fine (particularly tagged capacity and sharing).

> 
> It is anticipated that this will support at least the initial
> implementation of DCD devices, if and when they appear in the ecosystem.
> The patch set should be reviewed with the limited set of functionality in
> mind.  Additional functionality can be added as devices support them.

Personally I think that's a chicken and egg problem but fully understand
the desire to keep things simple in the short term.  Getting initial DCD
support in will help reduce the response (that I frequently hear) of
'the ecosystem isn't ready, let's leave that for a generation'.


> 
> It is strongly encouraged for individuals or companies wishing to bring
> DCD devices to market review this set with the customer use cases they
> have in mind.
> 

Absolutely.  I can't share anything about devices at this time but you
can read whatever you want into my willingness to help get this (and a
bunch of things built on top of it) over the line.



> Remaining work:
> 
> 	1) Allow mapping to specific extents (perhaps based on
> 	   label/tag)
> 	   1a) devise region size reporting based on tags
> 	2) Interleave support

I'd maybe label these as 'additional possible future features'.
Personally I'm doubtful that hardware interleave of DCD is a short
term feature and it definitely doesn't have to be there for this to be useful.

Tags will matter but that is a 'next step' that this series does
not seem to hinder.


> 
> Possible additional work depending on requirements:
> 
> 	1) Accept a new extent which extends (but overlaps) already
> 	   accepted extent(s)
> 	2) Rework DAX device interfaces, memfd has been explored a bit
> 	3) Support more than 1 DC partition
> 
> [1] https://github.com/weiny2/ndctl/tree/dcd-region3-2025-04-13

Thanks,

Jonathan
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Dan Williams 9 months, 4 weeks ago
Jonathan Cameron wrote:
[..]
> To me we don't need to answer the question of whether we fully understand
> requirements, or whether this support covers them, but rather to ask
> if anyone has requirements that are not sensible to satisfy with additional
> work building on this?

Wearing only my upstream kernel development hat, the question for
merging is "what is the end user visible impact of merging this?". As
long as DCD remains in proof-of-concept mode then leave the code out of
tree until it is ready to graduate past that point.

Same held for HDM-D support which was an out-of-tree POC until
Alejandro arrived with the SFC consumer.

DCD is joined by HDM-DB (awaiting an endpoint) and CXL Error Isolation
(awaiting a production consumer) as solutions that have time to validate
that the ecosystem is indeed graduating to consume them. There was no
"chicken-egg" paradox for the ecosystem to deliver base
static-memory-expander CXL support.

The ongoing failure to get productive engagement on just how ruthlessly
simple the implementation could be and still meet planned usages
continues to give the impression that Linux is way out in front of
hardware here. Uncomfortably so.
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Jonathan Cameron 9 months, 4 weeks ago
On Mon, 14 Apr 2025 21:50:31 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> [..]
> > To me we don't need to answer the question of whether we fully understand
> > requirements, or whether this support covers them, but rather to ask
> > if anyone has requirements that are not sensible to satisfy with additional
> > work building on this?  
> 
> Wearing only my upstream kernel development hat, the question for
> merging is "what is the end user visible impact of merging this?". As
> long as DCD remains in proof-of-concept mode then leave the code out of
> tree until it is ready to graduate past that point.

Hi Dan,

Seems like we'll have to disagree on this. The only thing I can
therefore do is help to keep this patch set in a 'ready to go' state.

I would ask that people review it with that in mind so that we can
merge it the day someone is willing to announce a product which
is a lot more about marketing decisions than anything technical.
Note that will be far too late for distro cycles so distro folk
may have to pick up the fork (which they will hate).

Hopefully that 'fork' will provide a base on which we can build
the next set of key features. 

> 
> Same held for HDM-D support which was an out-of-tree POC until
> Alejandro arrived with the SFC consumer.

Obviously I can't comment on status of that hardware!

> 
> DCD is joined by HDM-DB (awaiting an endpoint) and CXL Error Isolation
> (awaiting a production consumer) as solutions that have time to validate
> that the ecosystem is indeed graduating to consume them. 

Those I'm fine with waiting on, though obviously others may not be!

> There was no
> "chicken-egg" paradox for the ecosystem to deliver base
> static-memory-expander CXL support.

That is (at least partly) because the ecosystem for those was initially BIOS
only.  That's not true for DCD, so people built devices on the basis that
they didn't need any kernel support.  Lots of disadvantages to that, but
it's what happened.  As a side note, I'd much rather that path had never
been there, as it is continuing to make a mess for Gregory and others.

> 
> The ongoing failure to get productive engagement on just how ruthlessly
> simple the implementation could be and still meet planned usages
> continues to give the impression that Linux is way out in front of
> hardware here. Uncomfortably so.

I'll keep pushing for others to engage with this.  I also have on my
list writing a document on the future of DCD, proposing at least one
way to add all the features on that roadmap.  A major intent of that is
to show that there is no blocker to what we have here, i.e. we can
extend it in a logical fashion to exactly what is needed.

Reality is I cannot say anything about unannounced products.  Whilst some
companies will talk about stuff well ahead of hardware being ready for
customers, we do not do that (normally we announce long after customers
have it).  Hence it seems I have no way to get this upstream other than to
hope someone else has a more flexible policy.

Jonathan
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Dan Williams 9 months, 3 weeks ago
Jonathan Cameron wrote:
> On Mon, 14 Apr 2025 21:50:31 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Jonathan Cameron wrote:
> > [..]
> > > To me we don't need to answer the question of whether we fully understand
> > > requirements, or whether this support covers them, but rather to ask
> > > if anyone has requirements that are not sensible to satisfy with additional
> > > work building on this?  
> > 
> > Wearing only my upstream kernel development hat, the question for
> > merging is "what is the end user visible impact of merging this?". As
> > long as DCD remains in proof-of-concept mode then leave the code out of
> > tree until it is ready to graduate past that point.
> 
> Hi Dan,
> 
> Seems like we'll have to disagree on this. The only thing I can
> therefore do is help to keep this patch set in a 'ready to go' state.
> 
> I would ask that people review it with that in mind so that we can
> merge it the day someone is willing to announce a product which
> is a lot more about marketing decisions than anything technical.
> Note that will be far too late for distro cycles so distro folk
> may have to pick up the fork (which they will hate).

This is overstated. Distros say "no" to supporting even *shipping*
hardware when there is insufficient customer pull-through.  If none of
the distros' customers can get their hands on DCD hardware, that
contraindicates merge and distro intercept decisions.

> Hopefully that 'fork' will provide a base on which we can build
> the next set of key features. 

They are only key features when the adoption approaches inevitability.
The LSF/MM discussions around the ongoing challenges of managing
disparate performance memory pools still have me uneasy about whether
Linux yet has the right ABI in hand for dedicated-memory.

What folks seem to want is an anon-only memory provider that does not
ever leak into kernel allocations, and optionally a filesystem
abstraction to provide file-backed allocation of dedicated memory. What
they do not want is to teach their applications anything beyond
"malloc()" for anon.

[..]
> That is (at least partly) because the ecosystem for those was initially BIOS
> only.  That's not true for DCD, so people built devices on the basis that
> they didn't need any kernel support.  Lots of disadvantages to that, but
> it's what happened.  As a side note, I'd much rather that path had never
> been there, as it is continuing to make a mess for Gregory and others.

The mess is driven by insufficient communication between platform
firmware implementations and Linux expectations. That is a tractable
problem.
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Gregory Price 4 days, 14 hours ago
On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> A git tree of this series can be found here:
> 
> 	https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> 
> This is now based on 6.15-rc2.
> 

Extreme necro-bump for this set, but I wonder what folks' opinion is on
DCD support if we expose a new region control pattern a la:

https://lore.kernel.org/linux-cxl/20260129210442.3951412-1-gourry@gourry.net/

The major difference would be the elimination of sparse-DAX, which I know
has been a concern, in favor of a per-region-driver policy on how to
manage hot-add/remove events.

Things I've discussed with folks in different private contexts:

sysram usecase:
----
  echo regionN > decoder0.0/create_dc_region
  /* configure decoders */
  echo regionN > cxl/drivers/sysram/bind

tagged extents arrive and leave as a group, no sparseness (see the
    sketch after this block)
    extents cannot share a tag unless they arrive together
    e.g. set(A) & set(B) must have different tags
    add and expose daxN.M/uuid as the tag for collective management

Can decide whether linux wants to support untagged extents
    cxl_sysram could choose to track and hotplug untagged extents
    directly without going through DAX. Partial release would be
    possible on a per-extent granularity in this case.
----
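
To make the grouping rule concrete, here is a minimal sketch of the
"one set per tag" check.  struct extent_set and cxl_sysram_accept_set()
are invented for illustration; only the list and uuid helpers are
existing kernel API:

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/uuid.h>

struct extent_set {
        uuid_t tag;             /* tag shared by the whole set */
        struct list_head node;
        int nr_extents;         /* extents that arrived together (more bit) */
};

static LIST_HEAD(accepted_sets);

static int cxl_sysram_accept_set(struct extent_set *new)
{
        struct extent_set *set;

        /* a second set(A) is rejected because a set(A) already arrived */
        list_for_each_entry(set, &accepted_sets, node)
                if (uuid_equal(&set->tag, &new->tag))
                        return -EBUSY;

        list_add_tail(&new->node, &accepted_sets);
        return 0;
}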


virtio usecase:  (making some stuff up here)
----
  echo regionN > decoder0.0/create_dc_region
  /* configure decoders */
  echo regionN > cxl/drivers/virtio/bind

tags are required and may imply specific VM routing
    may or may not use DAX under the hood

extents may be tracked individually and added/removed individually
    if using DAX, this implies 1 device per extent.
    This probably requires a minimum extent size to be reasonable.

Does not expose the memory as SysRAM, instead builds new interface
    to handle memory management message routing to/from the VMM
    (N_MEMORY_PRIVATE?)
----


devdax usecase (FAMFS?)
---- 
  echo regionN > decoder0.0/create_dc_region
  /* configure decoders */
  echo regionN > cxl/drivers/devdax/bind

All sets of extents appear as new DAX devices
Tags are exposed via daxN.M/uuid
Tags are required
   otherwise you can't make sense of what that devdax represents
---

Begs the question:
   Do we require tags as a baseline feature for all modes?
   No tag - no service.
   Heavily implied:  Tags are globally unique (uuid)

But I think this resolves a lot of the disparate disagreements on "what
to do with tags" and how to manage sparseness - just split the policy
into each individual use-case's respective driver.

If a sufficiently unique use-case comes along that doesn't fit the
existing categories - a new region-driver may be warranted.

~Gregory
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Ira Weiny 3 days, 12 hours ago
Gregory Price wrote:
> On Sun, Apr 13, 2025 at 05:52:08PM -0500, Ira Weiny wrote:
> > A git tree of this series can be found here:
> > 
> > 	https://github.com/weiny2/linux-kernel/tree/dcd-v6-2025-04-13
> > 
> > This is now based on 6.15-rc2.
> > 
> 
> Extreme necro-bump for this set, but I wonder what folks' opinion is on
> DCD support if we expose a new region control pattern a la:
> 
> https://lore.kernel.org/linux-cxl/20260129210442.3951412-1-gourry@gourry.net/
> 
> The major difference would be the elimination of sparse-DAX, which I know

Sparse-dax is somewhat of a misnomer; 'sparse regions' may have been a
better name for it.  That is really what we are speaking of: the idea
that we have regions which don't necessarily have memory backing the
full size of the region.

For the DCD series I wrote, dax devices could only be created after
extents appeared.

> has been a concern, in favor of a per-region-driver policy on how to
> manage hot-add/remove events.

I think a concern would be that each region driver is implementing a
'policy' which requires new drivers for new policies.

My memory is very weak on all this stuff...

My general architecture tried to expose the extent ranges to user
space and allow userspace to build them into ranges with whatever
policy it wanted.

The tests[1] were all written to create dax devices on top of the extents
in certain ways to link together those extents.

[1] https://github.com/weiny2/ndctl/blob/dcd-region3-2025-04-13/test/cxl-dcd.sh

I did not like the 'implicit' nature of the association of dax device with
extent.  But it maintained backwards compatibility with non-sparse
regions...

My vision for tags was that eventually dax device creation could have a
tag specified up front and would only allocate from extents with that tag.
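
Roughly, that allocation rule would look like this (struct dax_ext and
alloc_tagged() are hypothetical, just to show the shape of it):

#include <linux/list.h>
#include <linux/minmax.h>
#include <linux/types.h>
#include <linux/uuid.h>

struct dax_ext {                /* hypothetical per-extent bookkeeping */
        uuid_t tag;
        u64 avail;              /* unallocated bytes in this extent */
        struct list_head node;
};

/* satisfy a dax device resize only from extents carrying the requested tag */
static u64 alloc_tagged(struct list_head *extents, const uuid_t *tag, u64 need)
{
        struct dax_ext *ext;
        u64 got = 0;

        list_for_each_entry(ext, extents, node) {
                if (!uuid_equal(&ext->tag, tag))
                        continue;       /* extents with other tags are off-limits */
                got += min(ext->avail, need - got);
                if (got == need)
                        break;
        }
        return got;     /* caller fails the resize if got < need */
}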

> 
> Things I've discussed with folks in different private contexts
> 
> sysram usecase:
> ----
>   echo regionN > decoder0.0/create_dc_region
>   /* configure decoders */
>   echo regionN > cxl/drivers/sysram/bind
> 
> tagged extents arrive and leave as a group, no sparseness
>     extents cannot share a tag unless they arrive together
>     e.g. set(A) & set(B) must have different tags
>     add and expose daxN.M/uuid as the tag for collective management

I'm not following this.  If set(A) arrives can another set(A) arrive
later?

How long does the kernel wait for all the 'A's to arrive?  Or must they be
in a ...  'more bit set' set of extents.

Regardless, IMO if user space is monitoring the extents with tag A, it
can decide if and when all those extents have arrived and can build on
top of that.

> 
> Can decide whether linux wants to support untagged extents
>     cxl_sysram could choose to track and hotplug untagged extents

'cxl_sysram' is the sysram region driver, right?

Are we expecting to have tags and non-tagged extents on the same DCD
region?

I'm ok not supporting that.  But just to be clear about what you are
suggesting.

Would the cxl_sysram region driver be attached to the DCD partition?  Then
it would have some DCD functionality built in...  I guess make a common
extent processing lib for the 2 drivers?

I feel like that is a lot of policy being built into the kernel,
whereas having the DCD region driver simply tell user space 'Hey, there
is a new extent here' and then having user space online that as sysram
keeps the policy decision in user space.

Segueing into the N_PRIVATE work: couldn't we assign that memory to a
NUMA node with N_PRIVATE-only memory via userspace?  Then it is onlined
in a way that any app which allocates from that node would get that
memory, and it stays out of kernel space.

But keep all that policy in user space when an extent appears.  Not baked
into a particular driver.

>     directly without going through DAX. Partial release would be
>     possible on a per-extent granularity in this case.
> ----
> 
> 
> virtio usecase:  (making some stuff up here)
> ----
>   echo regionN > decoder0.0/create_dc_region
>   /* configure decoders */
>   echo regionN > cxl/drivers/virtio/bind
> 
> tags are required and may imply specific VM routing
>     may or may not use DAX under the hood
> 
> extents may be tracked individually and add/removed individually
>     if using DAX, this implies 1 device per extent.
>     This probably requires a minimum extent size to be reasonable.
> 
> Does not expose the memory as SysRAM, instead builds new interface
>     to handle memory management message routing to/from the VMM
>     (N_MEMORY_PRIVATE?)
> ----
> 
> 
> devdax usecase (FAMFS?)
> ---- 
>   echo regionN > decoder0.0/create_dc_region
>   /* configure decoders */
>   echo regionN > cxl/drivers/devdax/bind
> 
> All sets of extents appear as new DAX devices
> Tags are exposed via daxN.M/uuid
> Tags are required
>    otherwise you can't make sense of what that devdax represents
> ---
> 
> Begs the question:
>    Do we require tags as a baseline feature for all modes?

Previously, no.  But I've often thought of 'no tag' as just a special
case of tag == 0.  We did agree at one time, though, that they would
have a special 'no tag' meaning such that it was just memory to be used
however...

>    No tag - no service.
>    Heavily implied:  Tags are globally unique (uuid)
> 
> But I think this resolves a lot of the disparate disagreements on "what
> to do with tags" and how to manage sparseness - just split the policy
> into each individual use-case's respective driver.

I think what I'm worried about is where that policy resides.

I think it is best to have a DCD region driver which simply exposes
extents and allows user space to control how those extents are used.  I
think some of what you have above works like that but I want to be careful
baking in policy.

> 
> If a sufficiently unique use-case comes along that doesn't fit the
> existing categories - a new region-driver may be warranted.

Again I don't like the idea of needing new drivers for new policies.  That
goes against how things should work in the kernel.

Ira
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Gregory Price 2 days, 19 hours ago
On Tue, Feb 03, 2026 at 04:04:23PM -0600, Ira Weiny wrote:
> Gregory Price wrote:

... snipping this to the top ...
> Again I don't like the idea of needing new drivers for new policies.  That
> goes against how things should work in the kernel.

If you define "How should virtio consume an extent" and "How should
FAMFS consume an extent" as "Policy" I can see your argument, and we
should address this.

I view "All things shall route through DAX" as "A policy" that
dictates cxl-driven changes to dax - including new dax drivers
(see: famfs new dax mechanism).

So we're already there.  Might as well reduce the complexity (as
explained below) and cut out dax where it makes sense rather than
force everyone to eat DAX (for potentially negative value).

---

> > has been a concern, in favor of a per-region-driver policy on how to
> > manage hot-add/remove events.
> 
> I think a concern would be that each region driver is implementing a
> 'policy' which requires new drivers for new policies.
> 

This is fair, we don't want infinite drivers - and many use cases
(we imagine) will end up using DAX - I'm not arguing to get rid of the
dax driver.

There are at least 3 or 4 use-cases i've seen so far

- dax (dev and fs): can share a driver w/ DAXDRV_ selection

- sysram : preferably doing direct hotplug - not via dax
           private-ram may re-use this cleanly with some config bits

- virtio : may not even want to expose objects to userland
           may prefer to simply directly interact with a VMM
	   dax may present a security issue if reconfig'd to device

- type-2 : may have wildly different patterns and preferences
           may also end up somewhat generalized

I think trying to pump all of these through dax and into userland by
default is a mistake - if only because it drives more complexity.

We should get form from function.

Example: for sysram - dax_kmem is just glue, the hotplug logic should
         live in cxl and operate directly on extents.  It's simpler and
	 doesn't add a bunch of needless dependencies.

Consider a hot-unplug request

Current setup
----
FM -> Host
   1) Unplug Extent A
Host
   2) cxl: hotunplug(dax_map[A])
   3) dax: Does this cover the entire dax? (no->reject, yes->unplug())
      - might fail due to dax-reasons
      - might fail due to normal hot-unplug reasons
   4) unbind dax
   5) return extent

Dropping Dax in favor of sysram doing direct hotplug
----
FM -> Host
   1) Unplug Extent A 
Host
   2) hotunplug(extents_map[A])
      - might fail because of normal hot-unplug reasons
   3) return extent

It's just simpler and gives you the option of complete sparseness
(untagged extents) or tracking related extents (tagged extents).
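
A rough sketch of step 2 in the shorter flow: the tracking struct and
release helper below are made up, but offline_and_remove_memory() is
the existing mm/memory_hotplug.c entry point that dax_kmem would
otherwise be wrapping:

#include <linux/memory_hotplug.h>
#include <linux/range.h>

struct sysram_extent {          /* hypothetical tracking object */
        struct range hpa;       /* HPA span the extent decodes to */
};

static int sysram_release_extent(struct sysram_extent *ext)
{
        /* may fail for the normal hot-unplug reasons (pinned pages, etc.) */
        int rc = offline_and_remove_memory(ext->hpa.start,
                                           range_len(&ext->hpa));

        if (rc)
                return rc;      /* keep the extent; the FM must ask again */

        return 0;               /* caller sends the release response */
}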

This pattern may not carry over the same with dax or virtio uses.

> I did not like the 'implicit' nature of the association of dax device with
> extent.  But it maintained backwards compatibility with non-sparse
> regions...
> 
> My vision for tags was that eventually dax device creation could have a
> tag specified prior and would only allocate from extents with that tag.
>

Yeah, I think it's pretty clear the dax case wants a daxN.M/uuid of
some kind (we can argue whether it needs to be exposed to userland -
but having had some conversations about FAMFS, this sounds useful).

> I'm not following this.  If set(A) arrives can another set(A) arrive
> later?
> 
> How long does the kernel wait for all the 'A's to arrive?  Or must they be
> in a ...  'more bit set' set of extents.
> 

Set(A) = extents that arrive together with the more bit set

So let's say you get two sets that arrive with the same tag (A):
Set(A) + Set(A)'

Set(A)' would get rejected because Set(A) has already arrived.
Otherwise, accepting Set(A)' implies sparseness of Set(A).

Having a tag map to a region is pointless - the HPA maps extent to
region.  So there's no other use for a tag in the sysram case.

On the flip side - assuming you want to try to allow Set(A)+Set(A)'

How is userland expected to know when all extents have arrived, if
hotplug cannot occur until all the extents have arrived and the only
place to put those extents is DAX?  Seems needlessly complex.

> Regardless, IMO if user space is monitoring the extents with tag A, it
> can decide if and when all those extents have arrived and can build on
> top of that.
> 

This assumes userland has something to build on top of, and moreover
that this something will be DAX.

- I agree for a filesystem-consumption pattern.
- I disagree for hotplug - dax is pointless glue.
- I don't know if DAX is the right fit for other use cases (it might
  just want to pass the raw IORESOURCE region to the VMM, for example).

> Are we expecting to have tags and non-tagged extents on the same DCD
> region?
> 
> I'm ok not supporting that.  But just to be clear about what you are
> suggesting.
> 

Probably not.  And in fact I think that should be one configuration bit
(either you support tags or you don't - reject the other state).

But I can imagine a driver wanting to support either (exclusive-or)

> Would the cxl_sysram region driver be attached to the DCD partition?  Then
> it would have some DCD functionality built in...  I guess make a common
> extent processing lib for the 2 drivers?
> 

Same driver - allow it to bind PARTMODE_RAM or PARTMODE_DC.

A RAM region hotplugs exactly once: at bind/unbind
A DC region hotplugs at runtime.

Same code, DC just adds the log monitoring stuff.
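
Something like this shape (all names below are hypothetical - the point
is just that RAM and DC bind to the same driver and share one online
helper):

#include <linux/types.h>

enum part_mode { PARTMODE_RAM, PARTMODE_DC };

struct dc_region {              /* pared-down stand-in for the real region object */
        enum part_mode mode;
        u64 hpa_start, hpa_size;
};

int sysram_online_range(u64 start, u64 size);           /* hypothetical */
int dcd_monitor_start(struct dc_region *r,
                      int (*online)(u64 start, u64 size)); /* hypothetical */

static int cxl_sysram_probe(struct dc_region *r)
{
        /* RAM region hotplugs exactly once, at bind */
        if (r->mode == PARTMODE_RAM)
                return sysram_online_range(r->hpa_start, r->hpa_size);

        /* DC region: same online helper, driven by extent add/release
         * events from the device log instead of by bind */
        return dcd_monitor_start(r, sysram_online_range);
}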

> I feel like that is a lot of policy being built into the kernel,
> whereas having the DCD region driver simply tell user space 'Hey, there
> is a new extent here' and then having user space online that as sysram
> keeps the policy decision in user space.
> 
> Segueing into the N_PRIVATE work: couldn't we assign that memory to a
> NUMA node with N_PRIVATE-only memory via userspace?  Then it is onlined
> in a way that any app which allocates from that node would get that
> memory, and it stays out of kernel space.
> 
> But keep all that policy in user space when an extent appears.  Not baked
> into a particular driver.
> 

I would need to think this over a bit more; I'm not quite seeing how
what you are suggesting would work.

N_MEMORY_PRIVATE implies there is some special feature of the device
that should be taken into account when managing the memory - but that
you want to re-use (some of) the existing mm/ infrastructure for basic
operations (page_alloc, reclaim, migration, etc).

There's an argument that some such nodes shouldn't even be visible to
userspace (of what use is knowing a node is there if mempolicy commands
are rejected or ignored if you try to bind to it?)

But also, setting N_MEMORY_PRIVATE vs N_MEMORY would explicitly be an
mm/memory_hotplug.c operation - so there's a pretty long path from
userland to "Setting N_MEMORY_PRIVATE" that goes through the drivers.

You can't set N_MEMORY_PRIVATE before going online (has to be done
during the hotplug process, otherwise you get nasty race conditions).
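
For concreteness, a minimal sketch of the driver side - MHP_PRIVATE is
made up, add_memory_driver_managed() is real - the point being that the
flag rides through mm/memory_hotplug.c as part of onlining rather than
being set from userland afterwards:

        /* hypothetical: request the private node state during onlining */
        rc = add_memory_driver_managed(nid, ext->hpa, ext->len,
                                       "System RAM (cxl_sysram)",
                                       MHP_PRIVATE | MHP_MERGE_RESOURCE);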

> > But I think this resolves a lot of the disparate disagreements on "what
> > to do with tags" and how to manage sparseness - just split the policy
> > into each individual use-case's respective driver.
> 
> I think what I'm worried about is where that policy resides.
>
> I think it is best to have a DCD region driver which simply exposes
> extents and allows user space to control how those extents are used.  I
> think some of what you have above works like that but I want to be careful
> baking in policy.
> 

I guess summarizing the sysram case: The policy seems simple enough to
not warrant over-complicating the infrastructure for the sake of making
dax "The One Interface To Rule Them All".

All userland wants to do for sysram is hot(un)plug.  Why bother with
dax at all?

~Gregory
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Ira Weiny 2 days, 16 hours ago
Gregory Price wrote:
> On Tue, Feb 03, 2026 at 04:04:23PM -0600, Ira Weiny wrote:
> > Gregory Price wrote:
>
> ... snipping this to the top ...
> > Again I don't like the idea of needing new drivers for new policies.  That
> > goes against how things should work in the kernel.
> 
> If you define "How should virtio consume an extent" and "How should
> FAMFS consume an extent" as "Policy" I can see your argument, and we
> should address this.

TLDR; I just don't want to see an explosion of 'drivers' for various
'policies'.  I think your use of the word 'policy' triggered me.

> 
> I view "All things shall route through DAX" as "A policy" that
> dictates cxl-driven changes to dax - including new dax drivers
> (see: famfs new dax mechanism).
> 
> So we're already there.  Might as well reduce the complexity (as
> explained below) and cut out dax where it makes sense rather than
> force everyone to eat DAX (for potentially negative value).
> 
> ---
> 
> > > has been a concern, in favor of a per-region-driver policy on how to
> > > manage hot-add/remove events.
> > 
> > I think a concern would be that each region driver is implementing a
> > 'policy' which requires new drivers for new policies.
> > 
> 
> This is fair, we don't want infinite drivers - and many use cases
> (we imagine) will end up using DAX - I'm not arguing to get rid of the
> dax driver.
> 
> There are at least 3 or 4 use-cases I've seen so far:
> 
> - dax (dev and fs): can share a driver w/ DAXDRV_ selection

Legacy...  check!

> 
> - sysram : preferably doing direct hotplug - not via dax
>            private-ram may re-use this cleanly with some config bits

Having pre-read this entire email, I think what I was thinking was
bundling a lot of this in here.  Put knobs here to control 'policy';
don't add to this list for more policies.

> 
> - virtio : may not even want to expose objects to userland
>            may prefer to simply directly interact with a VMM

Even if it directly interacts with the VMM, there have to be controls
by which user space directs this.  I'm not a virtio expert so...  OK,
let's just say there is another flow here.  Don't call it a policy though.

> 	   dax may present a security issue if reconfig'd to device

I don't understand this comment.

> 
> - type-2 : may have wildly different patterns and preferences
>            may also end up somewhat generalized

I think this is all going to be handled in the specific drivers of the
specific devices.  There is no policy here other than 'special' for the
device and we can't control that.

> 
> I think trying to pump all of these through dax and into userland by
> default is a mistake - if only because it drives more complexity.

I don't want to preserve DAX.  I don't.

So I think this list is fine.

> 
> We should get form from function.
> 
> Example: for sysram - dax_kmem is just glue, the hotplug logic should
>          live in cxl and operate directly on extents.  It's simpler and
> 	 doesn't add a bunch of needless dependencies.

Agreed.

> 
> Consider a hot-unplug request
> 
> Current setup
> ----
> FM -> Host
>    1) Unplug Extent A
> Host
>    2) cxl: hotunplug(dax_map[A])
>    3) dax: Does this cover the entire dax? (no->reject, yes->unplug())
>       - might fail due to dax-reasons
>       - might fail due to normal hot-unplug reasons
>    4) unbind dax
>    5) return extent
> 
> Dropping Dax in favor of sysram doing direct hotplug
> ----
> FM -> Host
>    1) Unplug Extent A 
> Host
>    2) hotunplug(extents_map[A])
>       - might fail because of normal hot-unplug reasons
>    3) return extent

Agreed.
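
Something like this, I'd imagine (offline_and_remove_memory() exists
today; the region struct and extent helper are made-up names):

        static int cxl_sysram_unplug(struct cxl_sysram_region *sr,
                                     struct cxl_extent *ext)
        {
                int rc;

                /* might fail for normal hot-unplug reasons */
                rc = offline_and_remove_memory(ext->hpa, ext->len);
                if (rc)
                        return rc;

                /* accept the release; the extent goes back to the device */
                return cxl_release_extent(sr, ext);
        }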

> 
> It's just simpler and gives you the option of complete sparseness
> (untagged extents) or tracking related extents (tagged extents).

Just add the knobs for the tags and yea...  the policy of how to handle
the extents can then be controlled by user space.

> 
> This pattern may not carry over the same with dax or virtio uses.

I don't fully understand the virtio case.  So I'll defer this.  But I feel
like this is not so much of a new policy as a different path which is, as
you said above, potentially not in user space at all.

> 
> > I did not like the 'implicit' nature of the association of dax device with
> > extent.  But it maintained backwards compatibility with non-sparse
> > regions...
> > 
> > My vision for tags was that eventually dax device creation could have a
> > tag specified prior and would only allocate from extents with that tag.
> >
> 
> Yeah, I think it's pretty clear the dax case wants a daxN.M/uuid of some
> kind (we can argue whether it needs to be exposed to userland - but
> having had some conversations about FAMFS, this sounds useful).
> 
> > I'm not following this.  If set(A) arrives can another set(A) arrive
> > later?
> > 
> > How long does the kernel wait for all the 'A's to arrive?  Or must they be
> > in a ...  'more bit set' set of extents.
> > 
> 
> Set(A) = extents that arrive together with the more bit set
> 
> So let's say you get two sets that arrive with the same tag (A):
> Set(A) + Set(A)'
> 
> Set(A)' would get rejected because Set(A) has already arrived.
> Otherwise, accepting Set(A)' implies sparseness of Set(A).
> 
> Having a tag map to a region is pointless - the HPA maps extent to
> region.  So there's no other use for a tag in the sysram case.
> 
> On the flip side - assuming you want to try to allow Set(A)+Set(A)'
> 
> How is userland expected to know when all extents have arrived, if
> hotplug cannot occur until all the extents have arrived and the only
> place to put those extents is DAX?  Seems needlessly complex.

Ok I think we need to sync up on the driver here.

For FAMFS/famdax they can expect the more bit and all that jazz.  I can't
stop that.

But for sysram.  No.  It is easy enough to assign a tag to the region and
any extent which shows up without that tag (be it NULL tag or tag A) gets
rejected.  All valid tagged extents get hot plugged.

Simple.  Easy policy for user space to control.
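
Roughly (names illustrative only; uuid_equal() is the kernel helper):

        /* region was configured with a non-null tag; reject all else */
        static bool extent_tag_ok(const struct cxl_sysram_region *sr,
                                  const struct cxl_extent *ext)
        {
                return uuid_equal(&sr->tag, &ext->tag);
        }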

> 
> > Regardless IMO if user space was monitoring the extents with tag A they
> > can decide if and when all those extents have arrived and can build on top
> > of that.
> > 
> 
> This assumes userland has something to build on top of, and moreover
> that this something will be DAX.
> 
> - I agree for a filesystem-consumption pattern.
> - I disagree for hotplug - dax is pointless glue.
> - I don't know if DAX is right-fit for other use cases. (it might just
>   want to pass the raw IORESOURCE region to the VMM, for example).
> 
> > Are we expecting to have tags and non-tagged extents on the same DCD
> > region?
> > 
> > I'm ok not supporting that.  But just to be clear about what you are
> > suggesting.
> > 
> 
> Probably not.  And in fact I think that should be one configuration bit
> (either you support tags or you don't - reject the other state).

Not a bit.  Just a non-null uuid set.

> 
> But I can imagine a driver wanting to support either (exclusive-or)

Yes.  Set the uuid.

> 
> > Would the cxl_sysram region driver be attached to the DCD partition?  Then
> > it would have some DCD functionality built in...  I guess make a common
> > extent processing lib for the 2 drivers?
> > 
> 
> Same driver - allow it to bind PARTMODE_RAM or PARTMODE_DC.

ok good.

> 
> A RAM region hotplugs exactly once: at bind/unbind
> A DC region hotplugs at runtime.

Yes, for every extent as it is seen.

> 
> Same code, DC just adds the log monitoring stuff.

Yep.

> 
> > I feel like that is a lot of policy being built into the kernel, where
> > having the DCD region driver simply tell user space 'Hey, there is a new
> > extent here' and then having user space online that as sysram makes the
> > policy decision in user space.
> > 
> > Segueing into the N_PRIVATE work.  Couldn't we assign that memory to a
> > NUMA node with N_PRIVATE only memory via userspace...  Then it is onlined
> > in a way that any app which is allocating from that node would get that
> > memory.  And keep it out of kernel space?
> > 
> > But keep all that policy in user space when an extent appears.  Not baked
> > into a particular driver.
> > 
> 
> I would need to think this over a bit more; I'm not quite seeing how
> what you are suggesting would work.

I think you set it out above.  I thought the sysram driver would have a
control for N_MEMORY_PRIVATE vs N_MEMORY which could control that policy
during hotplug.  Maybe I'm hallucinating.

> 
> N_MEMORY_PRIVATE implies there is some special feature of the device
> that should be taken into account when managing the memory - but that
> you want to re-use (some of) the existing mm/ infrastructure for basic
> operations (page_alloc, reclaim, migration, etc).
> 
> There's an argument that some such nodes shouldn't even be visible to
> userspace (of what use is knowing a node is there if mempolicy commands
> are rejected or ignored if you try to bind to it?)
> 
> But also, setting N_MEMORY_PRIVATE vs N_MEMORY would explicitly be an
> mm/memory_hotplug.c operation - so there's a pretty long path from
> userland to "Setting N_MEMORY_PRIVATE" that goes through the drivers.
> 
> You can't set N_MEMORY_PRIVATE before going online (has to be done
> during the hotplug process, otherwise you get nasty race conditions).
> 
> > > But I think this resolves a lot of the disparate disagreements on "what
> > > to do with tags" and how to manage sparseness - just split the policy
> > > into each individual use-case's respective driver.
> > 
> > I think what I'm worried about is where that policy resides.
> >
> > I think it is best to have a DCD region driver which simply exposes
> > extents and allows user space to control how those extents are used.  I
> > think some of what you have above works like that but I want to be careful
> > baking in policy.
> > 
> 
> I guess summarizing the sysram case: The policy seems simple enough to
> not warrant over-complicating the infrastructure for the sake of making
> dax "The One Interface To Rule Them All".
> 
> All userland wants to do for sysram is hot(un)plug.  Why bother with
> dax at all?

I did not want dax.  Was not advocating for dax.  Just did not want to
build a bunch of new 'drivers' for each new policy.

Summary, it is fine to add new knobs to the sysram driver for new policy
controls.  It is _not_ ok to have to put in a new driver.

I'm not clear if sysram could be used for virtio, or even needed.  I'm
still figuring out how virtio for simple memory devices is a gain.

Ira
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Gregory Price 2 days, 16 hours ago
On Wed, Feb 04, 2026 at 11:57:34AM -0600, Ira Weiny wrote:
> Gregory Price wrote:
> 
> TLDR; I just don't want to see an explosion of 'drivers' for various
> 'policies'.  I think your use of the word 'policy' triggered me.
> 

Gotcha.  Yeah words are hard.  I'm not sure what to call the difference
between the dax pattern and the sysram pattern... workflow?

You're *kind of* encoding "a policy", but more like defining a workflow,
I guess.  I suppose I'll update to that terminology unless someone has
something better.

> > - sysram : preferably doing direct hotplug - not via dax
> >            private-ram may re-use this cleanly with some config bits
> 
> Having pre-read this entire email, I think what I was thinking was
> bundling a lot of this in here.  Put knobs here to control 'policy';
> don't add to this list for more policies.
> 

Yup, so you have some sysram_region/-specific knobs:
	sysram_region0/online_type
	sysram_region0/extents/[A,B,C]
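
e.g. on the driver side, something like this (attribute plumbing is
illustrative only; read side shown for brevity, and to_sysram_region()
is a made-up container_of() helper):

        static ssize_t online_type_show(struct device *dev,
                                        struct device_attribute *attr,
                                        char *buf)
        {
                struct sysram_region *sr = to_sysram_region(dev);

                return sysfs_emit(buf, "%s\n", sr->online_movable ?
                                  "online_movable" : "online");
        }
        static DEVICE_ATTR_RO(online_type);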


> >
> >
... snipping out virtio stuff until the end ...
> 
> But for sysram.  No.  It is easy enough to assign a tag to the region and
> any extent which shows up without that tag (be it NULL tag or tag A) gets
> rejected.  All valid tagged extents get hot plugged.
> 
> Simple.  Easy policy for user space to control.
> 

Of what use is a tag for a sysram region?

The HPA is effectively a tag in this case.

An HPA can only belong to one region.

> > 
> > I would need to think this over a bit more; I'm not quite seeing how
> > what you are suggesting would work.
> 
> I think you set it out above.  I thought the sysram driver would have a
> control for N_MEMORY_PRIVATE vs N_MEMORY which could control that policy
> during hotplug.  Maybe I'm hallucinating.
> 

I imagine a device driver setting up a sysram_region with a private bit
before it goes to hotplug.

this would dictate whether it called
   add_memory_driver_managed() or
   add_private_memory_driver_managed()

so like

my_driver_code:
   sysram = create_sysram_region(...);
   sysram.private_callbacks = my_driver_callbacks;
   ... continue with the rest of configuration ...
   probe(sysram); /* sysram does the registration */

Since private-memory users actually have *device-defined* POLICY (yes,
policy) of some kind, I can imagine those devices needing to provide
drivers that set up that policy.

example: compressed memory devices may want to be on a demote-only node
         and control page-table mappings to enforce Read-Only.

(note: don't get hung up on callbacks; the design here is not set, just
       things floating around)

But in the short term, we should try to design it such that additional
drivers are not needed where reasonable.

I can imagine this showing up as needing mm/cram.c and registering a
compressed-node with mm/cram.c rather than enabling driver callbacks
(I'm learning callbacks are a mess, and am going to try to avoid them).

> Summary, it is fine to add new knobs to the sysram driver for new policy
> controls.  It is _not_ ok to have to put in a new driver.
>

Well, we don't have a sysram driver at the moment :P

We have a region driver :]

We should have a sysram driver and split up the workflows between dax
and sysram.

> I'm not clear if sysram could be used for virtio, or even needed.  I'm
> still figuring out how virtio for simple memory devices is a gain.
> 

Jonathan mentioned that he thinks it would be possible to just bring it
online as a private-node and inform the consumer of this.  I think
that's probably reasonable.

~Gregory
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Jonathan Cameron 1 day, 17 hours ago
> > I'm not clear if sysram could be used for virtio, or even needed.  I'm
> > still figuring out how virtio for simple memory devices is a gain.
> >   
> 
> Jonathan mentioned that he thinks it would be possible to just bring it
> online as a private-node and inform the consumer of this.  I think
> that's probably reasonable.

Firstly, VM == Application.  If we have, say, a DB that wants to do
everything itself, it would use the same interface as a VM to get the
whole memory on offer.  (I'm still trying to get that Application
Specific Memory term adopted ;)

This would be better if we didn't assume anything to do with virtio
- that's just one option (and right now for CXL mem probably not the
sensible one, as it's missing too many things we get for free by just
emulating CXL devices - e.g. all the stuff you are describing here
for the host is just as valid in the guest).  We have a path to
get that emulation and should have the big missing piece posted shortly
(DCD backed by 'things - this discussion' that turn up after VM boot).

The real topic is memory for a VM, and we need a way to tie a memory
backend in qemu to it, so that whatever the fabric manager provided for
that VM is given to the VM and not used for anything else.

If it's for a specific VM, then it's tagged, as otherwise how else
would we know the intent? (Let's ignore random other out-of-band paths.)

Layering-wise we can surface as many backing sources as we like at
runtime via 1+ emulated DCD devices (to give perf information etc).
They each show up in the guest as a contiguous (maybe tagged) single
extent, and then we apply whatever comes out of the rest of this
discussion on top of that.

So all we care about is how the host presents it.

Bunch of things might work for this.

1. Just put it in a numa node that requires specific selection to allocate
   from.  This is nice because it just looks like normal memory and we
   can apply any type of front end on top of that.  Not good if we have a lot
   of these coming and going.

2. Provide it as something with an fd we can mmap.  I was fine with DAX for
   this, but if it's normal ram just for a VM, anything that gives me a handle
   that I can mmap is fine.  Just need a way to know which one (so, tag).
   (See the sketch just below this list.)
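
Userspace side, all I'm really asking for is something like this (the
path is purely illustrative - whatever handle the kernel ends up
exposing for the tagged capacity):

        #include <fcntl.h>
        #include <stddef.h>
        #include <sys/mman.h>

        static void *map_tagged_capacity(const char *path, size_t len)
        {
                int fd = open(path, O_RDWR);

                if (fd < 0)
                        return NULL;
                return mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        }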

It's pretty similar for shared cases.  Just need a handle to mmap.
In that case, the tag goes straight up to the guest OS (we've just
unwound the extent ordering in the host and presented it as a contiguous
single extent).

The assumption here is that we always provide to the VM all the capacity
that was tagged for it.  Things may get more entertaining if we have
a bunch of capacity that was tagged to provide extra space for a set of
VMs (e.g. we overcommit on top of the DCD extents) - to me that's a
job for another day.

So I'm not really envisioning anything special for the VM case; it's
just a dedicated allocation of memory for a user who knows how to get it.
We will want a way to get perf info though, so we can provide that
in the VM.  Maybe we can figure that out from the CXL HW backing it
without needing anything special in what is being discussed here.

Jonathan

> 
> ~Gregory
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Alireza Sanaee 23 hours ago
On Thu, 5 Feb 2026 17:48:47 +0000
Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

Hi Jonathan,

Thanks for the clarifications.

Quick thought inline.

> > > I'm not clear if sysram could be used for virtio, or even needed.  I'm
> > > still figuring out how virtio for simple memory devices is a gain.
> > >     
> > 
> > Jonathan mentioned that he thinks it would be possible to just bring it
> > online as a private-node and inform the consumer of this.  I think
> > that's probably reasonable.  
> 
> Firstly, VM == Application.  If we have, say, a DB that wants to do
> everything itself, it would use the same interface as a VM to get the
> whole memory on offer.  (I'm still trying to get that Application
> Specific Memory term adopted ;)
> 
> This would be better if we didn't assume anything to do with virtio
> - that's just one option (and right now for CXL mem probably not the
> sensible one, as it's missing too many things we get for free by just
> emulating CXL devices - e.g. all the stuff you are describing here
> for the host is just as valid in the guest).  We have a path to
> get that emulation and should have the big missing piece posted shortly
> (DCD backed by 'things - this discussion' that turn up after VM boot).
> 
> The real topic is memory for a VM, and we need a way to tie a memory
> backend in qemu to it, so that whatever the fabric manager provided for
> that VM is given to the VM and not used for anything else.
> 
> If it's for a specific VM, then it's tagged, as otherwise how else
> would we know the intent? (Let's ignore random other out-of-band paths.)
> 
> Layering-wise we can surface as many backing sources as we like at
> runtime via 1+ emulated DCD devices (to give perf information etc).
> They each show up in the guest as a contiguous (maybe tagged) single
> extent, and then we apply whatever comes out of the rest of this
> discussion on top of that.
> 
> So all we care about is how the host presents it.
> 
> Bunch of things might work for this.
> 
> 1. Just put it in a numa node that requires specific selection to allocate
>    from.  This is nice because it just looks like normal memory and we
>    can apply any type of front end on top of that.  Not good if we have a lot
>    of these coming and going.
> 
> 2. Provide it as something with an fd we can mmap.  I was fine with DAX for
>    this, but if it's normal ram just for a VM, anything that gives me a handle
>    that I can mmap is fine.  Just need a way to know which one (so, tag).

I think both of these approaches are OK, but looking at it from a
developer's perspective: if someone wants specific memory for their
workload, they should rather get an fd and play with it in whichever way
they want.  NUMA may not give that much flexibility.  As a developer I
would prefer 2.  Though you may say, oh, dax then?  Not sure!
> 
> It's pretty similar for shared cases.  Just need a handle to mmap.
> In that case, the tag goes straight up to the guest OS (we've just
> unwound the extent ordering in the host and presented it as a contiguous
> single extent).
> 
> The assumption here is that we always provide to the VM all the capacity
> that was tagged for it.  Things may get more entertaining if we have
> a bunch of capacity that was tagged to provide extra space for a set of
> VMs (e.g. we overcommit on top of the DCD extents) - to me that's a
> job for another day.
> 
> So I'm not really envisioning anything special for the VM case; it's
> just a dedicated allocation of memory for a user who knows how to get it.
> We will want a way to get perf info though, so we can provide that
> in the VM.  Maybe we can figure that out from the CXL HW backing it
> without needing anything special in what is being discussed here.
> 
> Jonathan
> 
> > 
> > ~Gregory  
>
Re: [PATCH v9 00/19] DCD: Add support for Dynamic Capacity Devices (DCD)
Posted by Gregory Price 21 hours ago
On Fri, Feb 06, 2026 at 11:01:30AM +0000, Alireza Sanaee wrote:
> On Thu, 5 Feb 2026 17:48:47 +0000
> Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> 
> I think both of these approaches are OK, but looking at it from a
> developer's perspective: if someone wants specific memory for their
> workload, they should rather get an fd and play with it in whichever way
> they want.  NUMA may not give that much flexibility.  As a developer I
> would prefer 2.  Though you may say, oh, dax then?  Not sure!

DAX or numa-aware memfd

If you want *specific* memory (a particular HPA/DPA range), tagged dax is
probably appropriate.

If you just want any old page from a particular chunk of HPA, then
probably some kind of numa-aware memfd would be simplest (though this
may require new interfaces, since memfd is not currently numa-aware).

We might be able to make a private node work specifically with a membind
policy on a VMA (not on a task).  That would probably be sufficient.
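
i.e. something like this, if binding to a private node were permitted at
all (sketch only; discovering the private nid is hand-waved, and mbind()
is the existing libnuma/syscall interface):

        #include <numaif.h>
        #include <stddef.h>
        #include <sys/mman.h>

        static void *alloc_from_private_node(size_t len, int nid)
        {
                unsigned long mask = 1UL << nid;
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        return NULL;
                /* bind just this VMA (not the task) to the private node */
                mbind(p, len, MPOL_BIND, &mask, nid + 2, 0);
                return p;
        }
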

~Gregory