[PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs

Srirangan Madhavan posted 9 patches 1 week, 4 days ago
[PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs
Posted by Srirangan Madhavan 1 week, 4 days ago
Hi folks!

This patch series introduces support for the CXL Reset method for CXL
Type 2 devices, implementing the reset procedure outlined in the CXL
Specification r3.2 [1], Sections 8.1.3, 9.6, and 9.7.

The userspace ABI is a write-only cxl_reset attribute under the CXL
memdev device:

    /sys/bus/cxl/devices/memX/cxl_reset

The memdev is the userspace handle, while the implementation coordinates
the target PCI function, affected sibling PCI functions, active CXL
memdevs, and any CXL regions reachable through those memdevs.

v6 changes (from v5 [2]):
- Rebased on the current CXL tree used for v7.1-rc4 development.
- Move the ABI from /sys/bus/pci/devices/.../cxl_reset to
  /sys/bus/cxl/devices/memX/cxl_reset.
- Use the memdev as the userspace handle while keeping the reset
  orchestration scoped to the CXL device reset scope.
- Reduce the earlier PCI/CXL save/restore series [3] to a single CXL HDM
  decoder restore/commit helper patch, included here as patch 1.
- Do not offline or hot-remove memory as part of reset. Return -EBUSY
  if an affected CXL region is online as System RAM or has an active
  region driver bound.
- Add reset-idle validation and CPU cache invalidation for affected CXL
  regions.
- Add CXL sibling PCI function discovery using the Non-CXL Function Map
  DVSEC and CXL.cache/CXL.mem capability bits.
- Coordinate PCI save/disable/restore and IOMMU reset prepare/done for
  the target and affected sibling functions.
- Add CXL DVSEC reset sequencing, including CXL.cache disable,
  writeback-invalidate, a minimum 100ms quiet period, reset-complete
  polling, and Reset Error reporting.
- Track affected memdevs, lock active memdevs across reset, restore and
  commit decoder state, re-enable CXL.mem, and wait for media ready
  after reset.
- Cache reset capability at memdev registration time for sysfs
  visibility.
- Document reset scope, Memory Clear not being requested, and -EBUSY
  behavior for active CXL regions.
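
The quiet period and reset-complete polling listed above can be pictured
with a small userspace sketch. This is not the code from the series: the
status bit values, the 10ms polling interval, the reset_io callbacks, and
the fake device are all hypothetical, chosen only so the loop can run
without hardware. Only the 100ms minimum quiet period and the
complete/error outcomes come from the list above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bits; the real layout is defined by the CXL DVSEC. */
#define SKETCH_RESET_COMPLETE (1u << 1)
#define SKETCH_RESET_ERROR    (1u << 2)

#define SKETCH_QUIET_MS 100u /* minimum quiet period from the cover letter */
#define SKETCH_POLL_MS   10u /* assumed polling interval */

/* Callbacks so the loop can run against real or simulated hardware. */
struct reset_io {
	uint16_t (*read_status)(void *ctx);
	void (*sleep_ms)(void *ctx, unsigned int ms);
	void *ctx;
};

/* Wait out the quiet period, then poll until complete, error, or timeout. */
static int await_reset_complete(const struct reset_io *io,
				unsigned int timeout_ms)
{
	unsigned int waited;

	assert(io && io->read_status && io->sleep_ms);

	/* Status is not sampled until the quiet period has elapsed. */
	io->sleep_ms(io->ctx, SKETCH_QUIET_MS);

	for (waited = 0; waited <= timeout_ms; waited += SKETCH_POLL_MS) {
		uint16_t status = io->read_status(io->ctx);

		if (status & SKETCH_RESET_ERROR)
			return -EIO;
		if (status & SKETCH_RESET_COMPLETE)
			return 0;
		io->sleep_ms(io->ctx, SKETCH_POLL_MS);
	}
	return -ETIMEDOUT;
}

/* Simulated device: reports completion (or error) once time has passed. */
struct fake_dev {
	unsigned int now_ms;
	unsigned int complete_at_ms;
	bool report_error;
};

static uint16_t fake_read_status(void *ctx)
{
	struct fake_dev *d = ctx;

	if (d->now_ms < d->complete_at_ms)
		return 0;
	return d->report_error ? SKETCH_RESET_ERROR : SKETCH_RESET_COMPLETE;
}

static void fake_sleep_ms(void *ctx, unsigned int ms)
{
	struct fake_dev *d = ctx;

	d->now_ms += ms; /* advance simulated time instead of sleeping */
}
```

Passing the sleep and status reads in as callbacks keeps the loop
testable; a kernel version would presumably use config space reads and
msleep() directly.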

Motivation:
-----------
- As support for Type 2 devices is being introduced, more devices need a
  CXL-specific reset mechanism beyond bus-wide PCI reset methods.

- FLR does not affect CXL.cache or CXL.mem protocol state, making CXL
  Reset the appropriate mechanism for cases where those protocols must
  be reset.

- The CXL specification highlights use cases such as function rebinding
  and error recovery where CXL Reset is explicitly required.

Change Description:
-------------------

Patch 1: cxl/hdm: Add helpers to restore and commit memdev decoders
- Restore endpoint decoder programming from CXL core's cached decoder
  objects while keeping CXL.mem disabled.
- Commit restored HDM decoders as a separate step so reset orchestration
  can re-enable CXL.mem only after safety checks complete.

Patch 2: PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
- Export PCI reset lifecycle helpers so CXL reset orchestration can save,
  disable, restore, and invoke reset callbacks for affected functions.

Patch 3: cxl: Add reset-idle and cache flush helpers
- Collect CXL regions affected by a memdev reset.
- Fail reset if affected regions are not idle.
- Invalidate CPU caches for each affected region once.

Patch 4: PCI/CXL: Add sibling function coordination for reset
- Identify CXL.cache/CXL.mem sibling functions in the reset scope.
- Use the Non-CXL Function Map DVSEC to exclude non-CXL functions.
- Save, disable, restore, and unlock affected PCI sibling functions.

Patch 5: cxl/pci: Add CXL DVSEC reset helper
- Execute CXL Reset through the CXL Device DVSEC.
- Disable CXL.cache and request writeback-invalidate where supported.
- Enforce the post-reset quiet period and poll for reset completion.
- Block and restore IOMMU traffic while reset is active.

Patch 6: cxl/pci: Track memdevs affected by CXL reset
- Track the target memdev and any sibling-function memdevs affected by
  reset.
- Revalidate and lock active memdevs before reset proceeds.

Patch 7: cxl/pci: Orchestrate CXL reset for affected memdevs
- Coordinate region validation, CPU cache invalidation, PCI function
  preparation, DVSEC reset, decoder restore and commit, CXL.mem enable,
  and media-ready wait.

Patch 8: cxl/memdev: Add cxl_reset sysfs attribute
- Expose /sys/bus/cxl/devices/memX/cxl_reset.
- Only make the attribute visible when the underlying PCI function is
  Type 2 and reset capable.
- Write a boolean true value, such as "1" or "true", to trigger reset.

Patch 9: Documentation/ABI: Document CXL memdev cxl_reset
- Document the new memdev sysfs ABI, reset scope, Memory Clear behavior,
  and idle-region requirement.
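
As an illustration of the sibling filtering described for Patch 4, here
is a minimal userspace sketch. It assumes the Non-CXL Function Map is a
set of eight 32-bit registers with one bit per function number, where a
set bit marks the function as non-CXL, and it assumes a sibling is in
scope when it is not in that map and advertises CXL.cache or CXL.mem.
The struct and function names are hypothetical and do not come from the
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FN_MAP_REGS 8 /* assumed: eight 32-bit map registers, 256 bits */

struct non_cxl_fn_map {
	uint32_t reg[FN_MAP_REGS];
};

/* Hypothetical per-function view: number plus CXL capability bits. */
struct sibling_fn {
	unsigned int fn;
	bool cache_capable;
	bool mem_capable;
};

/* True when the map marks function @fn as non-CXL. */
static bool fn_is_non_cxl(const struct non_cxl_fn_map *map, unsigned int fn)
{
	if (fn >= FN_MAP_REGS * 32)
		return false;
	return (map->reg[fn / 32] & (1u << (fn % 32))) != 0;
}

/*
 * Collect the function numbers of siblings that share the reset scope
 * with @target_fn. Returns how many entries were written to @out, which
 * must have room for @count entries.
 */
static unsigned int collect_reset_siblings(const struct non_cxl_fn_map *map,
					   const struct sibling_fn *fns,
					   unsigned int count,
					   unsigned int target_fn,
					   unsigned int *out)
{
	unsigned int i, n = 0;

	assert(map && fns && out);

	for (i = 0; i < count; i++) {
		if (fns[i].fn == target_fn)
			continue; /* the target is handled separately */
		if (fn_is_non_cxl(map, fns[i].fn))
			continue; /* excluded by the function map */
		if (!fns[i].cache_capable && !fns[i].mem_capable)
			continue; /* no CXL.cache/CXL.mem state to reset */
		out[n++] = fns[i].fn;
	}
	return n;
}
```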

The CPU cache invalidation step depends on
cpu_cache_invalidate_memregion() support for the affected address ranges.
If no provider is available, reset fails before hardware reset is
requested.
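
The ordering matters here: every check that can refuse the reset runs
before the hardware is touched. A minimal simulation of that ordering,
using the step list from Patch 7, might look like the following. All
names are hypothetical, the steps are stubs, and the -ENXIO value for a
missing cache-invalidation provider is an assumption; only -EBUSY for
busy regions is stated in the cover letter. Unwinding of already
prepared PCI functions is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulated reset state; every field is illustrative. */
struct sim_reset {
	bool region_busy;         /* region online as System RAM or bound */
	bool have_cache_provider; /* cache invalidation support available */
	bool hw_reset_requested;  /* set only when the DVSEC step runs */
	unsigned int steps_done;
};

typedef int (*reset_step_fn)(struct sim_reset *s);

static int step_validate_regions(struct sim_reset *s)
{
	return s->region_busy ? -EBUSY : 0;
}

static int step_invalidate_cpu_cache(struct sim_reset *s)
{
	/* Assumed errno for "no provider"; the series may use another. */
	return s->have_cache_provider ? 0 : -ENXIO;
}

static int step_dvsec_reset(struct sim_reset *s)
{
	s->hw_reset_requested = true;
	return 0;
}

/* Stub for steps that always succeed in this simulation. */
static int step_ok(struct sim_reset *s)
{
	(void)s;
	return 0;
}

/* Order taken from the Patch 7 description. */
static const reset_step_fn reset_sequence[] = {
	step_validate_regions,     /* region validation */
	step_invalidate_cpu_cache, /* CPU cache invalidation */
	step_ok,                   /* PCI function preparation */
	step_dvsec_reset,          /* DVSEC reset */
	step_ok,                   /* decoder restore and commit */
	step_ok,                   /* CXL.mem enable */
	step_ok,                   /* media-ready wait */
};

/* Run the steps in order and stop at the first failure. */
static int sim_cxl_reset(struct sim_reset *s)
{
	size_t i;

	assert(s);
	s->hw_reset_requested = false;
	s->steps_done = 0;

	for (i = 0; i < sizeof(reset_sequence) / sizeof(reset_sequence[0]); i++) {
		int rc = reset_sequence[i](s);

		if (rc)
			return rc;
		s->steps_done++;
	}
	return 0;
}
```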

Command line to test CXL reset on a capable memdev:

    echo 1 > /sys/bus/cxl/devices/memX/cxl_reset
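
For tooling that drives the attribute from C instead of a shell, a
minimal sketch could look like this. The helper names are hypothetical.
The only behavior taken from this series is that a boolean true value
such as "1" triggers the reset and that -EBUSY indicates an active
affected region; any other errno is simply passed through.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write "1" to a cxl_reset-style attribute.
 * Returns 0 on success or a negative errno value on failure.
 */
static int trigger_cxl_reset(const char *attr_path)
{
	ssize_t n;
	int fd, rc = 0;

	assert(attr_path);

	/* No O_CREAT: the attribute must already exist. */
	fd = open(attr_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, "1", 1);
	if (n < 0)
		rc = -errno; /* e.g. -EBUSY when a region is active */
	else if (n != 1)
		rc = -EIO;

	if (close(fd) < 0 && rc == 0)
		rc = -errno;
	return rc;
}

/* Print a short human-readable result for a trigger attempt. */
static void report_cxl_reset(const char *attr_path, int rc)
{
	if (rc == 0)
		printf("%s: reset requested\n", attr_path);
	else if (rc == -EBUSY)
		printf("%s: affected CXL region is active\n", attr_path);
	else
		printf("%s: %s\n", attr_path, strerror(-rc));
}
```

A caller would pass the full attribute path, for example the memX path
shown above with X replaced by the memdev index.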

Basic CXL DVSEC reset testing was done on a CXL Type 2 device. The reset
sequence completed successfully and ResetComplete was observed. Full
memdev/region integration testing is still in progress.

References:
[1] https://computeexpresslink.org/wp-content/uploads/2024/12/CXL_3.2-Spec-Announcement_FINAL-1.pdf
[2] https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/
[3] https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/

Srirangan Madhavan (9):
  cxl/hdm: Add helpers to restore and commit memdev decoders
  PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
  cxl: Add reset-idle and cache flush helpers
  PCI/CXL: Add sibling function coordination for reset
  cxl/pci: Add CXL DVSEC reset helper
  cxl/pci: Track memdevs affected by CXL reset
  cxl/pci: Orchestrate CXL reset for affected memdevs
  cxl/memdev: Add cxl_reset sysfs attribute
  Documentation/ABI: Document CXL memdev cxl_reset

 Documentation/ABI/testing/sysfs-bus-cxl |   28 +
 drivers/cxl/core/hdm.c                  |  318 ++++++-
 drivers/cxl/core/memdev.c               |   30 +
 drivers/cxl/core/pci.c                  | 1140 +++++++++++++++++++++++
 drivers/cxl/cxl.h                       |    5 +
 drivers/cxl/cxlmem.h                    |    2 +
 drivers/pci/pci.c                       |   22 +-
 include/linux/pci.h                     |    2 +
 include/uapi/linux/pci_regs.h           |   15 +
 9 files changed, 1557 insertions(+), 5 deletions(-)

base-commit: abb3c0de119032f4c0c81177884a3bb0a133e6ca
-- 
2.43.0
Re: [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs
Posted by Dan Williams (nvidia) 5 days, 16 hours ago
Srirangan Madhavan wrote:
> Hi folks!
> 
> This patch series introduces support for the CXL Reset method for CXL
> Type 2 devices, implementing the reset procedure outlined in the CXL
> Specification r3.2 [1], Sections 8.1.3, 9.6, and 9.7.
> 
> The userspace ABI is a write-only cxl_reset attribute under the CXL
> memdev device:
> 
>     /sys/bus/cxl/devices/memX/cxl_reset

Hi Srirangan,

To move this forward we need a compromise between not reimplementing CXL
bits in drivers/pci/ (what I reacted to in the initial postings) and
still using the /sys/bus/pci reset entry point (what you and Alex
reacted to in my comments).

I started a suggestion here...

http://lore.kernel.org/6a0620acec806_57ad71008c@djbw-dev.notmuch

...however, looking at it again, this:

     echo 1 > /sys/bus/pci/devices/$pdev/cxl/reset

...ends up functionally equivalent to the original:

     echo cxl_reset > /sys/bus/pci/devices/$pdev/reset_method
     echo 1 > /sys/bus/pci/devices/$pdev/reset

The reasons I pushed for /sys/bus/cxl/devices/memX/cxl_reset were to
avoid duplicating HDM enumeration in multiple places and to allow
changes to the CXL memory configuration to be coordinated with CXL
reset, i.e. CXL reset can take HDM locks (where the device locks taken
by PCI reset may not be sufficient).

The fatal downside of that proposal is that the memX/cxl_reset ABI
requires the driver to be loaded. Long term, as you and Alex convinced
me, that is going to be a pain, and it breaks current device assignment
flows.

A compromise that lets PCI and CXL share infrastructure while still
supporting the long-standing PCI reset ABI is:

1/ Carry CXL decoder settings in the PCI device
2/ Build in shared low level helpers for marshaling decoder settings
   to/from hardware.
3/ Allow the low-level helpers to reference CXL locks

I drafted a rough conversion of what would be needed to share this
low-level coordination across the PCI and CXL core.

It introduces 'struct cxl_decoder_settings' and moves all the HDM decode
related definitions to cxl/cxl.h. It moves the core locks and low-level
hardware update helpers into a built-in drivers/cxl/core/reset.o object
where all of this reset coordination can be shared. It provides for
saving and restoring HDM state not just over reset, but from initial
device enumeration for devices that may forget their CXL configuration
for other reasons besides PCI reset.
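
To make the 'struct cxl_decoder_settings' idea concrete, here is a
userspace sketch of the grouping trick. The macro below is a simplified
stand-in for the kernel's struct_group_tagged(), and the sketch_* names
and reduced field list are illustrative only. The point it shows is that
the grouped fields stay reachable directly on the decoder while also
being addressable as one tagged struct that can be saved and restored as
a unit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Simplified stand-in for the kernel helper: the same members are
 * declared twice inside an anonymous union, once anonymously and once
 * as a named, tagged struct, so both views alias the same storage.
 */
#define sketch_struct_group_tagged(TAG, NAME, ...) \
	union { \
		struct { __VA_ARGS__ }; \
		struct TAG { __VA_ARGS__ } NAME; \
	}

struct sketch_decoder {
	const char *name; /* stands in for 'struct device dev' */
	sketch_struct_group_tagged(sketch_decoder_settings, settings,
		int id;
		uint64_t hpa_start;
		uint64_t hpa_end;
		union {
			uint64_t skip;    /* endpoint decoders */
			uint64_t targets; /* switch decoders */
		};
		int interleave_ways;
		unsigned long flags;
	);
	void *region; /* not part of the saved settings */
};

/* Copy only the grouped settings out of a decoder. */
static void save_settings(const struct sketch_decoder *d,
			  struct sketch_decoder_settings *out)
{
	assert(d && out);
	*out = d->settings;
}

/* Write saved settings back without touching name or region. */
static void restore_settings(struct sketch_decoder *d,
			     const struct sketch_decoder_settings *in)
{
	assert(d && in);
	d->settings = *in;
}
```

Because the settings are a plain struct with no device or locking
state, an array of them can be carried outside the CXL driver, which is
what the proposal relies on.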

The bulk of this is movement from drivers/cxl/cxl.h to
include/cxl/cxl.h, and drivers/cxl/core/hdm.c to
drivers/cxl/core/reset.c.

Thoughts? Does this compromise address all the open ABI concerns? I will
go through the rest of the patches and provide some notes with this
proposal in mind.

Applies against v7.1-rc3, needs splitting once we agree on this shape
(only build tested):

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 80aeb0d556bd..a809ba0dcc0c 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -5,6 +5,7 @@ menuconfig CXL_BUS
 	select FW_LOADER
 	select FW_UPLOAD
 	select PCI_DOE
+	select CXL_HDM
 	select FIRMWARE_TABLE
 	select NUMA_KEEP_MEMINFO if NUMA_MEMBLKS
 	select FWCTL if CXL_FEATURES
@@ -243,4 +244,7 @@ config CXL_ATL
 	depends on CXL_REGION
 	depends on ACPI_PRMT && AMD_NB
 
+config CXL_HDM
+	bool
+
 endif
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index ce7213818d3c..ebb0891daeb5 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_CXL_BUS) += cxl_core.o
 obj-$(CONFIG_CXL_SUSPEND) += suspend.o
+obj-$(CONFIG_CXL_HDM) += reset.o
 
 ccflags-y += -I$(srctree)/drivers/cxl
 CFLAGS_trace.o = -DTRACE_INCLUDE_PATH=. -I$(src)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 1297594beaec..e31462fcf37b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -252,49 +252,8 @@ int cxl_dport_map_rcd_linkcap(struct pci_dev *pdev, struct cxl_dport *dport);
 #define CXL_DECODER_F_NORMALIZED_ADDRESSING BIT(6)
 #define CXL_DECODER_F_RESET_MASK (CXL_DECODER_F_ENABLE | CXL_DECODER_F_LOCK)
 
-enum cxl_decoder_type {
-	CXL_DECODER_DEVMEM = 2,
-	CXL_DECODER_HOSTONLYMEM = 3,
-};
-
-/*
- * Current specification goes up to 8, double that seems a reasonable
- * software max for the foreseeable future
- */
-#define CXL_DECODER_MAX_INTERLEAVE 16
-
 #define CXL_QOS_CLASS_INVALID -1
 
-/**
- * struct cxl_decoder - Common CXL HDM Decoder Attributes
- * @dev: this decoder's device
- * @id: kernel device name id
- * @hpa_range: Host physical address range mapped by this decoder
- * @interleave_ways: number of cxl_dports in this decode
- * @interleave_granularity: data stride per dport
- * @target_type: accelerator vs expander (type2 vs type3) selector
- * @region: currently assigned region for this decoder
- * @flags: memory type capabilities and locking
- * @target_map: cached copy of hardware port-id list, available at init
- *              before all @dport objects have been instantiated. While
- *              dport id is 8bit, CFMWS interleave targets are 32bits.
- * @commit: device/decoder-type specific callback to commit settings to hw
- * @reset: device/decoder-type specific callback to reset hw settings
-*/
-struct cxl_decoder {
-	struct device dev;
-	int id;
-	struct range hpa_range;
-	int interleave_ways;
-	int interleave_granularity;
-	enum cxl_decoder_type target_type;
-	struct cxl_region *region;
-	unsigned long flags;
-	u32 target_map[CXL_DECODER_MAX_INTERLEAVE];
-	int (*commit)(struct cxl_decoder *cxld);
-	void (*reset)(struct cxl_decoder *cxld);
-};
-
 /*
  * Track whether this decoder is free for userspace provisioning, reserved for
  * region autodiscovery, whether it is started connecting (awaiting other
@@ -310,7 +269,6 @@ enum cxl_decoder_state {
  * struct cxl_endpoint_decoder - Endpoint  / SPA to DPA decoder
  * @cxld: base cxl_decoder_object
  * @dpa_res: actively claimed DPA span of this decoder
- * @skip: offset into @dpa_res where @cxld.hpa_range maps
  * @state: autodiscovery state
  * @part: partition index this decoder maps
  * @pos: interleave position in @cxld.region
@@ -318,7 +276,6 @@ enum cxl_decoder_state {
 struct cxl_endpoint_decoder {
 	struct cxl_decoder cxld;
 	struct resource *dpa_res;
-	resource_size_t skip;
 	enum cxl_decoder_state state;
 	int part;
 	int pos;
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index fa7269154620..1460bfefe593 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -5,6 +5,7 @@
 #ifndef __CXL_CXL_H__
 #define __CXL_CXL_H__
 
+#include <linux/device.h>
 #include <linux/node.h>
 #include <linux/ioport.h>
 #include <cxl/mailbox.h>
@@ -23,7 +24,56 @@ enum cxl_devtype {
 	CXL_DEVTYPE_CLASSMEM,
 };
 
-struct device;
+enum cxl_decoder_type {
+	CXL_DECODER_DEVMEM = 2,
+	CXL_DECODER_HOSTONLYMEM = 3,
+};
+
+/*
+ * Current specification goes up to 8, double that seems a reasonable
+ * software max for the foreseeable future
+ */
+#define CXL_DECODER_MAX_INTERLEAVE 16
+
+/**
+ * struct cxl_decoder - Common CXL HDM Decoder Attributes
+ * @dev: this decoder's device
+ * @id: kernel device name id
+ * @hpa_range: Host physical address range mapped by this decoder
+ * @skip: offset into @dpa_res where @cxld.hpa_range maps (endpoint)
+ * @targets: interleave position to dport mapping (switch)
+ * @interleave_ways: number of cxl_dports in this decode
+ * @interleave_granularity: data stride per dport
+ * @target_type: accelerator vs expander (type2 vs type3) selector
+ * @flags: memory type capabilities and locking
+ * @region: currently assigned region for this decoder
+ * @target_map: cached copy of hardware port-id list, available at init
+ *              before all @dport objects have been instantiated. While
+ *              dport id is 8bit, CFMWS interleave targets are 32bits.
+ * @commit: device/decoder-type specific callback to commit settings to hw
+ * @reset: device/decoder-type specific callback to reset hw settings
+*/
+struct cxl_decoder {
+	struct device dev;
+	struct_group_tagged(cxl_decoder_settings, settings,
+		int id;
+		struct range hpa_range;
+		union {
+			u64 skip;
+			u64 targets;
+		};
+		int interleave_ways;
+		int interleave_granularity;
+		enum cxl_decoder_type target_type;
+		unsigned long flags;
+	);
+	struct cxl_region *region;
+	u32 target_map[CXL_DECODER_MAX_INTERLEAVE];
+	int (*commit)(struct cxl_decoder *cxld);
+	void (*reset)(struct cxl_decoder *cxld);
+};
+
+int cxl_commit(struct cxl_decoder_settings *cxld, void __iomem *hdm);
 
 /*
  * Using struct_group() allows for per register-block-type helper routines,
@@ -116,6 +166,12 @@ struct cxl_register_map {
 	};
 };
 
+struct cxl_hdm_info {
+	int decoder_count;
+	struct cxl_component_regs regs;
+	struct cxl_decoder_settings settings[] __counted_by(decoder_count);
+};
+
 /**
  * struct cxl_dpa_perf - DPA performance property entry
  * @dpa_range: range for DPA address
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..35d05c8bdd43 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/resource_ext.h>
 #include <linux/msi_api.h>
+#include <cxl/cxl.h>
 #include <uapi/linux/pci.h>
 
 #include <linux/pci_ids.h>
@@ -577,6 +578,9 @@ struct pci_dev {
 #endif
 #ifdef CONFIG_PCI_TSM
 	struct pci_tsm *tsm;		/* TSM operation state */
+#endif
+#ifdef CONFIG_CXL_HDM
+	struct cxl_hdm_info *hdm;
 #endif
 	u16		acs_cap;	/* ACS Capability offset */
 	u16		acs_capabilities; /* ACS Capabilities */
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 0c80b76a5f9b..8c236d116174 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -16,11 +16,6 @@
  * for enumerating these registers and capabilities.
  */
 
-struct cxl_rwsem cxl_rwsem = {
-	.region = __RWSEM_INITIALIZER(cxl_rwsem.region),
-	.dpa = __RWSEM_INITIALIZER(cxl_rwsem.dpa),
-};
-
 static int add_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld)
 {
 	int rc;
@@ -249,17 +244,18 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	struct cxl_decoder *cxld = &cxled->cxld;
 	struct resource *res = cxled->dpa_res;
 	resource_size_t skip_start;
 
 	lockdep_assert_held_write(&cxl_rwsem.dpa);
 
 	/* save @skip_start, before @res is released */
-	skip_start = res->start - cxled->skip;
+	skip_start = res->start - cxld->skip;
 	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
-	if (cxled->skip)
-		release_skip(cxlds, skip_start, cxled->skip);
-	cxled->skip = 0;
+	if (cxld->skip)
+		release_skip(cxlds, skip_start, cxld->skip);
+	cxld->skip = 0;
 	cxled->dpa_res = NULL;
 	put_device(&cxled->cxld.dev);
 	port->hdm_end--;
@@ -343,6 +339,7 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	struct cxl_decoder *cxld = &cxled->cxld;
 	struct device *dev = &port->dev;
 	struct resource *res;
 	int rc;
@@ -388,7 +385,7 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 		return -EBUSY;
 	}
 	cxled->dpa_res = res;
-	cxled->skip = skipped;
+	cxld->skip = skipped;
 
 	/*
 	 * When allocating new capacity, ->part is already set, when
@@ -679,39 +676,12 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, u64 size)
 	return devm_add_action_or_reset(&port->dev, cxl_dpa_release, cxled);
 }
 
-static void cxld_set_interleave(struct cxl_decoder *cxld, u32 *ctrl)
-{
-	u16 eig;
-	u8 eiw;
-
-	/*
-	 * Input validation ensures these warns never fire, but otherwise
-	 * suppress unititalized variable usage warnings.
-	 */
-	if (WARN_ONCE(ways_to_eiw(cxld->interleave_ways, &eiw),
-		      "invalid interleave_ways: %d\n", cxld->interleave_ways))
-		return;
-	if (WARN_ONCE(granularity_to_eig(cxld->interleave_granularity, &eig),
-		      "invalid interleave_granularity: %d\n",
-		      cxld->interleave_granularity))
-		return;
-
-	u32p_replace_bits(ctrl, eig, CXL_HDM_DECODER0_CTRL_IG_MASK);
-	u32p_replace_bits(ctrl, eiw, CXL_HDM_DECODER0_CTRL_IW_MASK);
-	*ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
-}
-
-static void cxld_set_type(struct cxl_decoder *cxld, u32 *ctrl)
-{
-	u32p_replace_bits(ctrl,
-			  !!(cxld->target_type == CXL_DECODER_HOSTONLYMEM),
-			  CXL_HDM_DECODER0_CTRL_HOSTONLY);
-}
-
-static void cxlsd_set_targets(struct cxl_switch_decoder *cxlsd, u64 *tgt)
+static void cxlsd_set_targets(struct cxl_decoder *cxld)
 {
+	struct cxl_switch_decoder *cxlsd = to_cxl_switch_decoder(&cxld->dev);
 	struct cxl_dport **t = &cxlsd->target[0];
 	int ways = cxlsd->cxld.interleave_ways;
+	u64 *tgt = &cxld->targets;
 
 	*tgt = FIELD_PREP(GENMASK(7, 0), t[0]->port_id);
 	if (ways > 1)
@@ -730,73 +700,6 @@ static void cxlsd_set_targets(struct cxl_switch_decoder *cxlsd, u64 *tgt)
 		*tgt |= FIELD_PREP(GENMASK_ULL(63, 56), t[7]->port_id);
 }
 
-/*
- * Per CXL 2.0 8.2.5.12.20 Committing Decoder Programming, hardware must set
- * committed or error within 10ms, but just be generous with 20ms to account for
- * clock skew and other marginal behavior
- */
-#define COMMIT_TIMEOUT_MS 20
-static int cxld_await_commit(void __iomem *hdm, int id)
-{
-	u32 ctrl;
-	int i;
-
-	for (i = 0; i < COMMIT_TIMEOUT_MS; i++) {
-		ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
-		if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMIT_ERROR, ctrl)) {
-			ctrl &= ~CXL_HDM_DECODER0_CTRL_COMMIT;
-			writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
-			return -EIO;
-		}
-		if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl))
-			return 0;
-		fsleep(1000);
-	}
-
-	return -ETIMEDOUT;
-}
-
-static void setup_hw_decoder(struct cxl_decoder *cxld, void __iomem *hdm)
-{
-	int id = cxld->id;
-	u64 base, size;
-	u32 ctrl;
-
-	/* common decoder settings */
-	ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
-	cxld_set_interleave(cxld, &ctrl);
-	cxld_set_type(cxld, &ctrl);
-	base = cxld->hpa_range.start;
-	size = range_len(&cxld->hpa_range);
-
-	writel(upper_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(id));
-	writel(lower_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
-	writel(upper_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(id));
-	writel(lower_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(id));
-
-	if (is_switch_decoder(&cxld->dev)) {
-		struct cxl_switch_decoder *cxlsd =
-			to_cxl_switch_decoder(&cxld->dev);
-		void __iomem *tl_hi = hdm + CXL_HDM_DECODER0_TL_HIGH(id);
-		void __iomem *tl_lo = hdm + CXL_HDM_DECODER0_TL_LOW(id);
-		u64 targets;
-
-		cxlsd_set_targets(cxlsd, &targets);
-		writel(upper_32_bits(targets), tl_hi);
-		writel(lower_32_bits(targets), tl_lo);
-	} else {
-		struct cxl_endpoint_decoder *cxled =
-			to_cxl_endpoint_decoder(&cxld->dev);
-		void __iomem *sk_hi = hdm + CXL_HDM_DECODER0_SKIP_HIGH(id);
-		void __iomem *sk_lo = hdm + CXL_HDM_DECODER0_SKIP_LOW(id);
-
-		writel(upper_32_bits(cxled->skip), sk_hi);
-		writel(lower_32_bits(cxled->skip), sk_lo);
-	}
-
-	writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
-}
-
 static int cxl_decoder_commit(struct cxl_decoder *cxld)
 {
 	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
@@ -832,21 +735,17 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld)
 				dev_name(&cxld->dev));
 			return -EBUSY;
 		}
-	}
-
-	scoped_guard(rwsem_read, &cxl_rwsem.dpa)
-		setup_hw_decoder(cxld, hdm);
+	} else
+		cxlsd_set_targets(cxld);
 
-	rc = cxld_await_commit(hdm, cxld->id);
-	if (rc) {
+	rc = cxl_commit(&cxld->settings, hdm);
+	if (rc)
 		dev_dbg(&port->dev, "%s: error %d committing decoder\n",
 			dev_name(&cxld->dev), rc);
-		return rc;
-	}
-	port->commit_end++;
-	cxld->flags |= CXL_DECODER_F_ENABLE;
+	else
+		port->commit_end++;
 
-	return 0;
+	return rc;
 }
 
 static int commit_reap(struct device *dev, void *data)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e50dc716d4e8..0349d73140e3 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2899,6 +2899,7 @@ static int poison_by_decoder(struct device *dev, void *arg)
 	struct cxl_endpoint_decoder *cxled;
 	enum cxl_partition_mode mode;
 	struct cxl_dev_state *cxlds;
+	struct cxl_decoder *cxld;
 	struct cxl_memdev *cxlmd;
 	u64 offset, length;
 	int rc = 0;
@@ -2912,11 +2913,12 @@ static int poison_by_decoder(struct device *dev, void *arg)
 
 	cxlmd = cxled_to_memdev(cxled);
 	cxlds = cxlmd->cxlds;
+	cxld = &cxled->cxld;
 	mode = cxlds->part[cxled->part].mode;
 
-	if (cxled->skip) {
-		offset = cxled->dpa_res->start - cxled->skip;
-		length = cxled->skip;
+	if (cxld->skip) {
+		offset = cxled->dpa_res->start - cxld->skip;
+		length = cxld->skip;
 		rc = cxl_mem_get_poison(cxlmd, offset, length, NULL);
 		if (rc == -EFAULT && mode == CXL_PARTMODE_RAM)
 			rc = 0;
diff --git a/drivers/cxl/core/reset.c b/drivers/cxl/core/reset.c
new file mode 100644
index 000000000000..0b4372b6d608
--- /dev/null
+++ b/drivers/cxl/core/reset.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 NVIDIA Corporation & Affiliates */
+#include <cxl/cxl.h>
+#include <linux/bitfield.h>
+#include <linux/delay.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/range.h>
+#include <cxl.h>
+#include "core.h"
+
+/*
+ * Common lowlevel setup and re-initialization (reset) helpers for the
+ * CXL memory associated with a PCI device. CXL core locks are built-in
+ * to the main kernel image for coordination with in-kernel mechanisms
+ * like reset.
+ */
+
+struct cxl_rwsem cxl_rwsem = {
+	.region = __RWSEM_INITIALIZER(cxl_rwsem.region),
+	.dpa = __RWSEM_INITIALIZER(cxl_rwsem.dpa),
+};
+EXPORT_SYMBOL_FOR_MODULES(cxl_rwsem, "cxl_core");
+
+static void cxld_set_interleave(struct cxl_decoder_settings *cxld, u32 *ctrl)
+{
+	u16 eig;
+	u8 eiw;
+
+	/*
+	 * Input validation ensures these warns never fire, but otherwise
+	 * suppress uninitialized variable usage warnings.
+	 */
+	if (WARN_ONCE(ways_to_eiw(cxld->interleave_ways, &eiw),
+		      "invalid interleave_ways: %d\n", cxld->interleave_ways))
+		return;
+	if (WARN_ONCE(granularity_to_eig(cxld->interleave_granularity, &eig),
+		      "invalid interleave_granularity: %d\n",
+		      cxld->interleave_granularity))
+		return;
+
+	u32p_replace_bits(ctrl, eig, CXL_HDM_DECODER0_CTRL_IG_MASK);
+	u32p_replace_bits(ctrl, eiw, CXL_HDM_DECODER0_CTRL_IW_MASK);
+	*ctrl |= CXL_HDM_DECODER0_CTRL_COMMIT;
+}
+
+static void cxld_set_type(struct cxl_decoder_settings *cxld, u32 *ctrl)
+{
+	u32p_replace_bits(ctrl,
+			  !!(cxld->target_type == CXL_DECODER_HOSTONLYMEM),
+			  CXL_HDM_DECODER0_CTRL_HOSTONLY);
+}
+
+static void setup_hw_decoder(struct cxl_decoder_settings *cxld, void __iomem *hdm)
+{
+	u32 ctrl;
+	u64 base, size;
+	int id = cxld->id;
+	void __iomem *sk_hi = hdm + CXL_HDM_DECODER0_SKIP_HIGH(id);
+	void __iomem *sk_lo = hdm + CXL_HDM_DECODER0_SKIP_LOW(id);
+
+	/* common decoder settings */
+	ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
+	cxld_set_interleave(cxld, &ctrl);
+	cxld_set_type(cxld, &ctrl);
+	base = cxld->hpa_range.start;
+	size = range_len(&cxld->hpa_range);
+
+	writel(upper_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_HIGH_OFFSET(id));
+	writel(lower_32_bits(base), hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
+	writel(upper_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(id));
+	writel(lower_32_bits(size), hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(id));
+
+	/* endpoint 'skip' and switch 'targets' settings alias */
+	writel(upper_32_bits(cxld->skip), sk_hi);
+	writel(lower_32_bits(cxld->skip), sk_lo);
+
+	writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
+}
+
+/*
+ * Per CXL 2.0 8.2.5.12.20 Committing Decoder Programming, hardware must set
+ * committed or error within 10ms, but just be generous with 20ms to account for
+ * clock skew and other marginal behavior
+ */
+#define COMMIT_TIMEOUT_MS 20
+static int cxld_await_commit(void __iomem *hdm, int id)
+{
+	u32 ctrl;
+	int i;
+
+	for (i = 0; i < COMMIT_TIMEOUT_MS; i++) {
+		ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
+		if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMIT_ERROR, ctrl)) {
+			ctrl &= ~CXL_HDM_DECODER0_CTRL_COMMIT;
+			writel(ctrl, hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
+			return -EIO;
+		}
+		if (FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl))
+			return 0;
+		fsleep(1000);
+	}
+
+	return -ETIMEDOUT;
+}
+
+int cxl_commit(struct cxl_decoder_settings *cxld, void __iomem *hdm)
+{
+	int rc;
+
+	scoped_guard(rwsem_read, &cxl_rwsem.dpa)
+		setup_hw_decoder(cxld, hdm);
+
+	rc = cxld_await_commit(hdm, cxld->id);
+	if (rc == 0)
+		cxld->flags |= CXL_DECODER_F_ENABLE;
+	return rc;
+}
+EXPORT_SYMBOL_FOR_MODULES(cxl_commit, "cxl_core");
diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 418669927fb0..de088bb930c3 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -840,11 +840,11 @@ static int cxld_registry_restore(struct cxl_decoder *cxld,
 		dbg_cxld(port, "restore", &td->cxled.cxld);
 		cxld_copy(cxld, &td->cxled.cxld);
 		cxled->state = td->cxled.state;
-		cxled->skip = td->cxled.skip;
+		cxld->skip = td->cxled.cxld.skip;
 		if (range_len(&td->dpa_range)) {
 			rc = devm_cxl_dpa_reserve(cxled, td->dpa_range.start,
 						  range_len(&td->dpa_range),
-						  td->cxled.skip);
+						  td->cxled.cxld.skip);
 			if (rc) {
 				init_disabled_mock_decoder(cxld);
 				return rc;
@@ -882,7 +882,7 @@ static void __cxld_registry_save(struct cxl_test_decoder *td,
 
 		cxld_copy(&td->cxled.cxld, cxld);
 		td->cxled.state = cxled->state;
-		td->cxled.skip = cxled->skip;
+		td->cxled.cxld.skip = cxld->skip;
 
 		if (!(cxld->flags & CXL_DECODER_F_ENABLE)) {
 			td->dpa_range.start = 0;
@@ -970,7 +970,7 @@ static void mock_decoder_reset(struct cxl_decoder *cxld)
 			to_cxl_endpoint_decoder(&cxld->dev);
 
 		cxled->state = CXL_DECODER_STATE_MANUAL;
-		cxled->skip = 0;
+		cxld->skip = 0;
 	}
 	if (decoder_reset_preserve_registry)
 		dev_dbg(port->uport_dev, "decoder%d: skip registry update\n",
@@ -1021,7 +1021,7 @@ static void init_disabled_mock_decoder(struct cxl_decoder *cxld)
 			to_cxl_endpoint_decoder(&cxld->dev);
 
 		cxled->state = CXL_DECODER_STATE_MANUAL;
-		cxled->skip = 0;
+		cxld->skip = 0;
 	}
 }
Re: [PATCH v6 0/9] cxl: Add cxl_reset sysfs attribute for memdevs
Posted by Cheatham, Benjamin 5 days, 18 hours ago
On 5/28/2026 3:31 AM, Srirangan Madhavan wrote:
> Hi folks!
> 
> This patch series introduces support for the CXL Reset method for CXL
> Type 2 devices, implementing the reset procedure outlined in the CXL
> Specification r3.2 [1], Sections 8.1.3, 9.6, and 9.7.
> 
> The userspace ABI is a write-only cxl_reset attribute under the CXL
> memdev device:
> 
>     /sys/bus/cxl/devices/memX/cxl_reset
> 
> The memdev is the userspace handle, while the implementation coordinates
> the target PCI function, affected sibling PCI functions, active CXL
> memdevs, and any CXL regions reachable through those memdevs.
> 

This may be a dumb question, but where do Type 2 drivers slot into this?
I assume they are expected to implement the ->reset_XXX callbacks for
their own handling? If not, you may want to look at where Type 2 drivers
outside of drivers/cxl/ can have a hook to do any reset-related cleanup
or prep.

Thanks,
Ben
> v6 changes (from v5 [2]):
> - Rebased on the current CXL tree used for v7.1-rc4 development.
> - Move the ABI from /sys/bus/pci/devices/.../cxl_reset to
>   /sys/bus/cxl/devices/memX/cxl_reset.
> - Use the memdev as the userspace handle while keeping the reset
>   orchestration scoped to the CXL device reset scope.
> - Reduce the earlier PCI/CXL save/restore series [3] to a single CXL HDM
>   decoder restore/commit helper patch, included here as patch 1.
> - Do not offline or hot-remove memory as part of reset. Return -EBUSY
>   if an affected CXL region is online as System RAM or has an active
>   region driver bound.
> - Add reset-idle validation and CPU cache invalidation for affected CXL
>   regions.
> - Add CXL sibling PCI function discovery using the Non-CXL Function Map
>   DVSEC and CXL.cache/CXL.mem capability bits.
> - Coordinate PCI save/disable/restore and IOMMU reset prepare/done for
>   the target and affected sibling functions.
> - Add CXL DVSEC reset sequencing, including CXL.cache disable,
>   writeback-invalidate, a minimum 100ms quiet period, reset-complete
>   polling, and Reset Error reporting.
> - Track affected memdevs, lock active memdevs across reset, restore and
>   commit decoder state, re-enable CXL.mem, and wait for media ready
>   after reset.
> - Cache reset capability at memdev registration time for sysfs
>   visibility.
> - Document reset scope, Memory Clear not being requested, and -EBUSY
>   behavior for active CXL regions.
> 
> Motivation:
> -----------
> - As support for Type 2 devices is being introduced, more devices need a
>   CXL-specific reset mechanism beyond bus-wide PCI reset methods.
> 
> - FLR does not affect CXL.cache or CXL.mem protocol state, making CXL
>   Reset the appropriate mechanism for cases where those protocols must
>   be reset.
> 
> - The CXL specification highlights use cases such as function rebinding
>   and error recovery where CXL Reset is explicitly required.
> 
> Change Description:
> -------------------
> 
> Patch 1: cxl/hdm: Add helpers to restore and commit memdev decoders
> - Restore endpoint decoder programming from CXL core's cached decoder
>   objects while keeping CXL.mem disabled.
> - Commit restored HDM decoders as a separate step so reset orchestration
>   can re-enable CXL.mem only after safety checks complete.
> 
> Patch 2: PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
> - Export PCI reset lifecycle helpers so CXL reset orchestration can save,
>   disable, restore, and invoke reset callbacks for affected functions.
> 
> Patch 3: cxl: Add reset-idle and cache flush helpers
> - Collect CXL regions affected by a memdev reset.
> - Fail reset if affected regions are not idle.
> - Invalidate CPU caches for each affected region once.
> 
> Patch 4: PCI/CXL: Add sibling function coordination for reset
> - Identify CXL.cache/CXL.mem sibling functions in the reset scope.
> - Use the Non-CXL Function Map DVSEC to exclude non-CXL functions.
> - Save, disable, restore, and unlock affected PCI sibling functions.
> 
> Patch 5: cxl/pci: Add CXL DVSEC reset helper
> - Execute CXL Reset through the CXL Device DVSEC.
> - Disable CXL.cache and request writeback-invalidate where supported.
> - Enforce the post-reset quiet period and poll for reset completion.
> - Block and restore IOMMU traffic while reset is active.
> 
> Patch 6: cxl/pci: Track memdevs affected by CXL reset
> - Track the target memdev and any sibling-function memdevs affected by
>   reset.
> - Revalidate and lock active memdevs before reset proceeds.
> 
> Patch 7: cxl/pci: Orchestrate CXL reset for affected memdevs
> - Coordinate region validation, CPU cache invalidation, PCI function
>   preparation, DVSEC reset, decoder restore and commit, CXL.mem enable,
>   and media-ready wait.
> 
> Patch 8: cxl/memdev: Add cxl_reset sysfs attribute
> - Expose /sys/bus/cxl/devices/memX/cxl_reset.
> - Only make the attribute visible when the underlying PCI function is
>   Type 2 and reset capable.
> - Write a boolean true value, such as "1" or "true", to trigger reset.
> 
> Patch 9: Documentation/ABI: Document CXL memdev cxl_reset
> - Document the new memdev sysfs ABI, reset scope, Memory Clear behavior,
>   and idle-region requirement.
> 
> The CPU cache invalidation step depends on
> cpu_cache_invalidate_memregion() support for the affected address ranges.
> If no provider is available, reset fails before hardware reset is
> requested.
> 
> Command line to test CXL reset on a capable memdev:
> 
>     echo 1 > /sys/bus/cxl/devices/memX/cxl_reset
> 
> Basic CXL DVSEC reset testing was done on a CXL Type 2 device. The reset
> sequence completed successfully and ResetComplete was observed. Full
> memdev/region integration testing is still in progress.
> 
> References:
> [1] https://computeexpresslink.org/wp-content/uploads/2024/12/CXL_3.2-Spec-Announcement_FINAL-1.pdf
> [2] https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/
> [3] https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/
> 
> Srirangan Madhavan (9):
>   cxl/hdm: Add helpers to restore and commit memdev decoders
>   PCI: Export pci_dev_save_and_disable() and pci_dev_restore()
>   cxl: Add reset-idle and cache flush helpers
>   PCI/CXL: Add sibling function coordination for reset
>   cxl/pci: Add CXL DVSEC reset helper
>   cxl/pci: Track memdevs affected by CXL reset
>   cxl/pci: Orchestrate CXL reset for affected memdevs
>   cxl/memdev: Add cxl_reset sysfs attribute
>   Documentation/ABI: Document CXL memdev cxl_reset
> 
>  Documentation/ABI/testing/sysfs-bus-cxl |   28 +
>  drivers/cxl/core/hdm.c                  |  318 ++++++-
>  drivers/cxl/core/memdev.c               |   30 +
>  drivers/cxl/core/pci.c                  | 1140 +++++++++++++++++++++++
>  drivers/cxl/cxl.h                       |    5 +
>  drivers/cxl/cxlmem.h                    |    2 +
>  drivers/pci/pci.c                       |   22 +-
>  include/linux/pci.h                     |    2 +
>  include/uapi/linux/pci_regs.h           |   15 +
>  9 files changed, 1557 insertions(+), 5 deletions(-)
> 
> base-commit: abb3c0de119032f4c0c81177884a3bb0a133e6ca