Fix port enumeration failure and NULL endpoint issue

[PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

Posted by Li Ming 1 week ago

CXL testing environment can trigger following trace

 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000092: 0000 [#1] SMP KASAN NOPTI
 KASAN: null-ptr-deref in range [0x0000000000000490-0x0000000000000497]
 RIP: 0010:cxl_dpa_to_region+0x105/0x1f0 [cxl_core]
 Call Trace:
  <TASK>
  cxl_event_trace_record+0xd1/0xa70 [cxl_core]
  __cxl_event_trace_record+0x12f/0x1e0 [cxl_core]
  cxl_mem_get_records_log+0x261/0x500 [cxl_core]
  cxl_mem_get_event_records+0x7c/0xc0 [cxl_core]
  cxl_mock_mem_probe+0xd38/0x1c60 [cxl_mock_mem]
  platform_probe+0x9d/0x130
  really_probe+0x1c8/0x960
  __driver_probe_device+0x187/0x3e0
  driver_probe_device+0x45/0x120
  __device_attach_driver+0x15d/0x280

commit 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
initializes cxlmd->endpoint to ERR_PTR(-ENXIO) in cxl_memdev_alloc().
However, cxl_dpa_to_region() treats a non-NULL cxlmd->endpoint as a
valid endpoint.

Across the CXL core, endpoint availability is generally determined by
checking whether it is NULL. Align with this convention by initializing
cxlmd->endpoint to NULL by default.

Fixes: 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
Signed-off-by: Li Ming <ming.li@zohomail.com>
---
 drivers/cxl/core/memdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index af3d0cc65138..41a507b5daa4 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -675,7 +675,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
 	cxlmd->id = rc;
 	cxlmd->depth = -1;
 	cxlmd->attach = attach;
-	cxlmd->endpoint = ERR_PTR(-ENXIO);
+	cxlmd->endpoint = NULL;
 
 	dev = &cxlmd->dev;
 	device_initialize(dev);
-- 
2.43.0

Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

Posted by dan.j.williams@intel.com 6 days, 6 hours ago

Li Ming wrote:
> CXL testing environment can trigger following trace
> 
>  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000092: 0000 [#1] SMP KASAN NOPTI
>  KASAN: null-ptr-deref in range [0x0000000000000490-0x0000000000000497]
>  RIP: 0010:cxl_dpa_to_region+0x105/0x1f0 [cxl_core]
>  Call Trace:
>   <TASK>
>   cxl_event_trace_record+0xd1/0xa70 [cxl_core]
>   __cxl_event_trace_record+0x12f/0x1e0 [cxl_core]
>   cxl_mem_get_records_log+0x261/0x500 [cxl_core]
>   cxl_mem_get_event_records+0x7c/0xc0 [cxl_core]
>   cxl_mock_mem_probe+0xd38/0x1c60 [cxl_mock_mem]

Hooray for unit tests, but I wonder why this failure is so reliable for
you and absent for me? I retried with latest cxl/next, no luck.

No matter, it still looks like something worth addressing, but not with
setting ->endpoint to NULL.

>   platform_probe+0x9d/0x130
>   really_probe+0x1c8/0x960
>   __driver_probe_device+0x187/0x3e0
>   driver_probe_device+0x45/0x120
>   __device_attach_driver+0x15d/0x280
> 
> commit 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
> initializes cxlmd->endpoint to ERR_PTR(-ENXIO) in cxl_memdev_alloc().
> However, cxl_dpa_to_region() treats a non-NULL cxlmd->endpoint as a
> valid endpoint.
> 
> Across the CXL core, endpoint availability is generally determined by
> checking whether it is NULL. Align with this convention by initializing
> cxlmd->endpoint to NULL by default.
> 
> Fixes: 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
> Signed-off-by: Li Ming <ming.li@zohomail.com>
> ---
>  drivers/cxl/core/memdev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index af3d0cc65138..41a507b5daa4 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -675,7 +675,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
>  	cxlmd->id = rc;
>  	cxlmd->depth = -1;
>  	cxlmd->attach = attach;
> -	cxlmd->endpoint = ERR_PTR(-ENXIO);
> +	cxlmd->endpoint = NULL;

So, this does not look like a fix to me, it looks like a bug report
against all the code paths that want to assume that a mere NULL check of
->endpoint is sufficient.

I think this unintentional ERR_PTR() trip hazard is now a *feature* in
retrospect to catch those cases. If something depends on ->endpoint
being valid, it depends on ->endpoint *staying* valid.

A proposed fix for this case is below (passes cxl_test), and if other
ERR_PTR() crashes show up I expect they need similar inspection. This
probably wants additional cleanup to consolidate boiler-plate and make
the critical change to cxl_dpa_to_region() easier to see:

-- 8< --
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 422531799af2..e652980098cf 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -39,7 +39,7 @@ int cxl_decoder_detach(struct cxl_region *cxlr,
 int cxl_region_init(void);
 void cxl_region_exit(void);
 int cxl_get_poison_by_endpoint(struct cxl_port *port);
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa);
+struct cxl_region *cxl_dpa_to_region(struct cxl_memdev *cxlmd, u64 dpa);
 u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
 		   u64 dpa);
 
@@ -50,7 +50,7 @@ static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
 	return ULLONG_MAX;
 }
 static inline
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(struct cxl_memdev *cxlmd, u64 dpa)
 {
 	return NULL;
 }
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index ef202b34e5ea..610ed6744ddc 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -864,7 +864,7 @@ void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
 void clear_exclusive_cxl_commands(struct cxl_memdev_state *mds,
 				  unsigned long *cmds);
 void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status);
-void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
+void cxl_event_trace_record(struct cxl_memdev *cxlmd,
 			    enum cxl_event_log_type type,
 			    enum cxl_event_type event_type,
 			    const uuid_t *uuid, union cxl_event *evt);
diff --git a/include/linux/device.h b/include/linux/device.h
index 0be95294b6e6..4fafee80524b 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -911,6 +911,7 @@ static inline void device_unlock(struct device *dev)
 }
 
 DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))
+DEFINE_GUARD_COND(device, _intr, device_lock_interruptible(_T), _RET == 0)
 
 static inline void device_lock_assert(struct device *dev)
 {
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index fa6dd0c94656..7b923408b6c5 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -886,7 +886,7 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, "CXL");
 
-void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
+void cxl_event_trace_record(struct cxl_memdev *cxlmd,
 			    enum cxl_event_log_type type,
 			    enum cxl_event_type event_type,
 			    const uuid_t *uuid, union cxl_event *evt)
@@ -913,6 +913,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 		 * translations. Take topology mutation locks and lookup
 		 * { HPA, REGION } from { DPA, MEMDEV } in the event record.
 		 */
+		guard(device)(&cxlmd->dev);
 		guard(rwsem_read)(&cxl_rwsem.region);
 		guard(rwsem_read)(&cxl_rwsem.dpa);
 
@@ -961,7 +962,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_event_trace_record, "CXL");
 
-static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
+static void __cxl_event_trace_record(struct cxl_memdev *cxlmd,
 				     enum cxl_event_log_type type,
 				     struct cxl_event_record_raw *record)
 {
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index af3d0cc65138..d2bee5608e50 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -331,6 +331,10 @@ int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa)
 {
 	int rc;
 
+	ACQUIRE(device_intr, devlock)(&cxlmd->dev);
+	if ((rc = ACQUIRE_ERR(device_intr, &devlock)))
+		return rc;
+
 	ACQUIRE(rwsem_read_intr, region_rwsem)(&cxl_rwsem.region);
 	if ((rc = ACQUIRE_ERR(rwsem_read_intr, &region_rwsem)))
 		return rc;
@@ -355,6 +359,7 @@ int cxl_clear_poison_locked(struct cxl_memdev *cxlmd, u64 dpa)
 	if (!IS_ENABLED(CONFIG_DEBUG_FS))
 		return 0;
 
+	device_lock_assert(&cxlmd->dev);
 	lockdep_assert_held(&cxl_rwsem.dpa);
 	lockdep_assert_held(&cxl_rwsem.region);
 
@@ -400,6 +405,10 @@ int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa)
 {
 	int rc;
 
+	ACQUIRE(device_intr, devlock)(&cxlmd->dev);
+	if ((rc = ACQUIRE_ERR(device_intr, &devlock)))
+		return rc;
+
 	ACQUIRE(rwsem_read_intr, region_rwsem)(&cxl_rwsem.region);
 	if ((rc = ACQUIRE_ERR(rwsem_read_intr, &region_rwsem)))
 		return rc;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 96888d87a8df..a7d391757e16 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2936,7 +2936,7 @@ static int __cxl_dpa_to_region(struct device *dev, void *arg)
 	return 1;
 }
 
-struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
+struct cxl_region *cxl_dpa_to_region(struct cxl_memdev *cxlmd, u64 dpa)
 {
 	struct cxl_dpa_to_region_context ctx;
 	struct cxl_port *port;
@@ -2944,8 +2944,12 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
 	ctx = (struct cxl_dpa_to_region_context) {
 		.dpa = dpa,
 	};
+
+	device_lock_assert(&cxlmd->dev);
+	if (!cxlmd->dev.driver)
+		return NULL;
 	port = cxlmd->endpoint;
-	if (port && is_cxl_endpoint(port) && cxl_num_decoders_committed(port))
+	if (cxl_num_decoders_committed(port))
 		device_for_each_child(&port->dev, &ctx, __cxl_dpa_to_region);
 
 	return ctx.cxlr;
@@ -4004,10 +4008,9 @@ static int validate_region_offset(struct cxl_region *cxlr, u64 offset)
 	return 0;
 }
 
-static int cxl_region_debugfs_poison_inject(void *data, u64 offset)
+static int region_poison_lookup(struct cxl_region *cxlr, u64 offset,
+				struct dpa_result *result)
 {
-	struct dpa_result result = { .dpa = ULLONG_MAX, .cxlmd = NULL };
-	struct cxl_region *cxlr = data;
 	int rc;
 
 	ACQUIRE(rwsem_read_intr, region_rwsem)(&cxl_rwsem.region);
@@ -4022,8 +4025,8 @@ static int cxl_region_debugfs_poison_inject(void *data, u64 offset)
 		return -EINVAL;
 
 	offset -= cxlr->params.cache_size;
-	rc = region_offset_to_dpa_result(cxlr, offset, &result);
-	if (rc || !result.cxlmd || result.dpa == ULLONG_MAX) {
+	rc = region_offset_to_dpa_result(cxlr, offset, result);
+	if (rc || !result->cxlmd || result->dpa == ULLONG_MAX) {
 		dev_dbg(&cxlr->dev,
 			"Failed to resolve DPA for region offset %#llx rc %d\n",
 			offset, rc);
@@ -4031,7 +4034,43 @@ static int cxl_region_debugfs_poison_inject(void *data, u64 offset)
 		return rc ? rc : -EINVAL;
 	}
 
-	return cxl_inject_poison_locked(result.cxlmd, result.dpa);
+	return 0;
+}
+
+static int cxl_region_debugfs_poison_inject(void *data, u64 offset)
+{
+	struct dpa_result result1 = { .dpa = ULLONG_MAX };
+	struct dpa_result result2 = { .dpa = ULLONG_MAX };
+	struct cxl_region *cxlr = data;
+	struct cxl_memdev *cxlmd;
+	int rc;
+
+	rc = region_poison_lookup(cxlr, offset, &result1);
+	if (rc)
+		return rc;
+
+	cxlmd = result1.cxlmd;
+	ACQUIRE(device_intr, devlock)(&cxlmd->dev);
+	if ((rc = ACQUIRE_ERR(device_intr, &devlock)))
+		return rc;
+
+	ACQUIRE(rwsem_read_intr, region_rwsem)(&cxl_rwsem.region);
+	if ((rc = ACQUIRE_ERR(rwsem_read_intr, &region_rwsem)))
+		return rc;
+
+	ACQUIRE(rwsem_read_intr, dpa_rwsem)(&cxl_rwsem.dpa);
+	if ((rc = ACQUIRE_ERR(rwsem_read_intr, &dpa_rwsem)))
+		return rc;
+
+	offset -= cxlr->params.cache_size;
+	rc = region_offset_to_dpa_result(cxlr, offset, &result2);
+	if (rc || memcmp(&result1, &result2, sizeof(result1) != 0)) {
+		dev_dbg(&cxlr->dev,
+			"Error injection raced region reconfiguration: %d\n", rc);
+		return rc ? rc : -ENXIO;
+	}
+
+	return cxl_inject_poison_locked(result1.cxlmd, result1.dpa);
 }
 
 DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_inject_fops, NULL,
@@ -4039,10 +4078,21 @@ DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_inject_fops, NULL,
 
 static int cxl_region_debugfs_poison_clear(void *data, u64 offset)
 {
-	struct dpa_result result = { .dpa = ULLONG_MAX, .cxlmd = NULL };
+	struct dpa_result result1 = { .dpa = ULLONG_MAX };
+	struct dpa_result result2 = { .dpa = ULLONG_MAX };
 	struct cxl_region *cxlr = data;
+	struct cxl_memdev *cxlmd;
 	int rc;
 
+	rc = region_poison_lookup(cxlr, offset, &result1);
+	if (rc)
+		return rc;
+
+	cxlmd = result1.cxlmd;
+	ACQUIRE(device_intr, devlock)(&cxlmd->dev);
+	if ((rc = ACQUIRE_ERR(device_intr, &devlock)))
+		return rc;
+
 	ACQUIRE(rwsem_read_intr, region_rwsem)(&cxl_rwsem.region);
 	if ((rc = ACQUIRE_ERR(rwsem_read_intr, &region_rwsem)))
 		return rc;
@@ -4051,20 +4101,15 @@ static int cxl_region_debugfs_poison_clear(void *data, u64 offset)
 	if ((rc = ACQUIRE_ERR(rwsem_read_intr, &dpa_rwsem)))
 		return rc;
 
-	if (validate_region_offset(cxlr, offset))
-		return -EINVAL;
-
 	offset -= cxlr->params.cache_size;
-	rc = region_offset_to_dpa_result(cxlr, offset, &result);
-	if (rc || !result.cxlmd || result.dpa == ULLONG_MAX) {
+	rc = region_offset_to_dpa_result(cxlr, offset, &result2);
+	if (rc || memcmp(&result1, &result2, sizeof(result1) != 0)) {
 		dev_dbg(&cxlr->dev,
-			"Failed to resolve DPA for region offset %#llx rc %d\n",
-			offset, rc);
-
-		return rc ? rc : -EINVAL;
+			"Error clearing raced region reconfiguration: %d\n", rc);
+		return rc ? rc : -ENXIO;
 	}
 
-	return cxl_clear_poison_locked(result.cxlmd, result.dpa);
+	return cxl_clear_poison_locked(result1.cxlmd, result1.dpa);
 }
 
 DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_clear_fops, NULL,
-- 
2.52.0

Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

Posted by Li Ming 5 days, 14 hours ago



From:  <dan.j.williams@intel.com>
To: "Li Ming"<ming.li@zohomail.com>, <dave@stgolabs.net>, <jonathan.cameron@huawei.com>, <dave.jiang@intel.com>, <alison.schofield@intel.com>, <vishal.l.verma@intel.com>, <ira.weiny@intel.com>, <dan.j.williams@intel.com>
Cc: <linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>, "Li Ming"<ming.li@zohomail.com>
Date: Tue, 03 Feb 2026 08:01:04 +0800
Subject: Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

 > Li Ming wrote:
 > > CXL testing environment can trigger following trace
 > > 
 > >  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000092: 0000 [#1] SMP KASAN NOPTI
 > >  KASAN: null-ptr-deref in range [0x0000000000000490-0x0000000000000497]
 > >  RIP: 0010:cxl_dpa_to_region+0x105/0x1f0 [cxl_core]
 > >  Call Trace:
 > >   <TASK>
 > >   cxl_event_trace_record+0xd1/0xa70 [cxl_core]
 > >   __cxl_event_trace_record+0x12f/0x1e0 [cxl_core]
 > >   cxl_mem_get_records_log+0x261/0x500 [cxl_core]
 > >   cxl_mem_get_event_records+0x7c/0xc0 [cxl_core]
 > >   cxl_mock_mem_probe+0xd38/0x1c60 [cxl_mock_mem]
 > 
 > Hooray for unit tests, but I wonder why this failure is so reliable for
 > you and absent for me? I retried with latest cxl/next, no luck.
 > 
 > No matter, it still looks like something worth addressing, but not with
 > setting ->endpoint to NULL.
 > 
 > >   platform_probe+0x9d/0x130
 > >   really_probe+0x1c8/0x960
 > >   __driver_probe_device+0x187/0x3e0
 > >   driver_probe_device+0x45/0x120
 > >   __device_attach_driver+0x15d/0x280
 > > 
 > > commit 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
 > > initializes cxlmd->endpoint to ERR_PTR(-ENXIO) in cxl_memdev_alloc().
 > > However, cxl_dpa_to_region() treats a non-NULL cxlmd->endpoint as a
 > > valid endpoint.
 > > 
 > > Across the CXL core, endpoint availability is generally determined by
 > > checking whether it is NULL. Align with this convention by initializing
 > > cxlmd->endpoint to NULL by default.
 > > 
 > > Fixes: 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
 > > Signed-off-by: Li Ming <ming.li@zohomail.com>
 > > ---
 > >  drivers/cxl/core/memdev.c | 2 +-
 > >  1 file changed, 1 insertion(+), 1 deletion(-)
 > > 
 > > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
 > > index af3d0cc65138..41a507b5daa4 100644
 > > --- a/drivers/cxl/core/memdev.c
 > > +++ b/drivers/cxl/core/memdev.c
 > > @@ -675,7 +675,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
 > >      cxlmd->id = rc;
 > >      cxlmd->depth = -1;
 > >      cxlmd->attach = attach;
 > > -    cxlmd->endpoint = ERR_PTR(-ENXIO);
 > > +    cxlmd->endpoint = NULL;
 > 
 > So, this does not look like a fix to me, it looks like a bug report
 > against all the code paths that want to assume that a mere NULL check of
 > ->endpoint is sufficient.
 > 
 > I think this unintentional ERR_PTR() trip hazard is now a *feature* in
 > retrospect to catch those cases. If something depends on ->endpoint
 > being valid, it depends on ->endpoint *staying* valid.
 > 
 > A proposed fix for this case is below (passes cxl_test), and if other
 > ERR_PTR() crashes show up I expect they need similar inspection. This
 > probably wants additional cleanup to consolidate boiler-plate and make
 > the critical change to cxl_dpa_to_region() easier to see:
 > 
Hi Dan,

Thanks for your proposal, I think your change can solve this problem too.
But the change is a lot, and need more time to review all driver code to confirm
if there are other places needed such checking. (I found that cxl_reset_done also needs some changes like you mentioned)
Maybe we can consider my change as a quick fix? Then I can prepare a new patchset for the consolidation.

Ming

 > -- 8< --
 > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
 > index 422531799af2..e652980098cf 100644
 > --- a/drivers/cxl/core/core.h
 > +++ b/drivers/cxl/core/core.h
 > @@ -39,7 +39,7 @@ int cxl_decoder_detach(struct cxl_region *cxlr,
 >  int cxl_region_init(void);
 >  void cxl_region_exit(void);
 >  int cxl_get_poison_by_endpoint(struct cxl_port *port);
 > -struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa);
 > +struct cxl_region *cxl_dpa_to_region(struct cxl_memdev *cxlmd, u64 dpa);
 >  u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
 >             u64 dpa);
 >  
 > @@ -50,7 +50,7 @@ static inline u64 cxl_dpa_to_hpa(struct cxl_region *cxlr,
 >      return ULLONG_MAX;
 >  }
 >  static inline
 > -struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
 > +struct cxl_region *cxl_dpa_to_region(struct cxl_memdev *cxlmd, u64 dpa)
 >  {
 >      return NULL;
 >  }
 > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
 > index ef202b34e5ea..610ed6744ddc 100644
 > --- a/drivers/cxl/cxlmem.h
 > +++ b/drivers/cxl/cxlmem.h
 > @@ -864,7 +864,7 @@ void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
 >  void clear_exclusive_cxl_commands(struct cxl_memdev_state *mds,
 >                    unsigned long *cmds);
 >  void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status);
 > -void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 > +void cxl_event_trace_record(struct cxl_memdev *cxlmd,
 >                  enum cxl_event_log_type type,
 >                  enum cxl_event_type event_type,
 >                  const uuid_t *uuid, union cxl_event *evt);
 > diff --git a/include/linux/device.h b/include/linux/device.h
 > index 0be95294b6e6..4fafee80524b 100644
 > --- a/include/linux/device.h
 > +++ b/include/linux/device.h
 > @@ -911,6 +911,7 @@ static inline void device_unlock(struct device *dev)
 >  }
 >  
 >  DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))
 > +DEFINE_GUARD_COND(device, _intr, device_lock_interruptible(_T), _RET == 0)
 >  
 >  static inline void device_lock_assert(struct device *dev)
 >  {
 > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
 > index fa6dd0c94656..7b923408b6c5 100644
 > --- a/drivers/cxl/core/mbox.c
 > +++ b/drivers/cxl/core/mbox.c
 > @@ -886,7 +886,7 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
 >  }
 >  EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, "CXL");
 >  
 > -void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 > +void cxl_event_trace_record(struct cxl_memdev *cxlmd,
 >                  enum cxl_event_log_type type,
 >                  enum cxl_event_type event_type,
 >                  const uuid_t *uuid, union cxl_event *evt)
 > @@ -913,6 +913,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 >           * translations. Take topology mutation locks and lookup
 >           * { HPA, REGION } from { DPA, MEMDEV } in the event record.
 >           */
 > +        guard(device)(&cxlmd->dev);
 >          guard(rwsem_read)(&cxl_rwsem.region);
 >          guard(rwsem_read)(&cxl_rwsem.dpa);
 >  
 > @@ -961,7 +962,7 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 >  }
 >  EXPORT_SYMBOL_NS_GPL(cxl_event_trace_record, "CXL");
 >  
 > -static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
 > +static void __cxl_event_trace_record(struct cxl_memdev *cxlmd,
 >                       enum cxl_event_log_type type,
 >                       struct cxl_event_record_raw *record)
 >  {
 > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
 > index af3d0cc65138..d2bee5608e50 100644
 > --- a/drivers/cxl/core/memdev.c
 > +++ b/drivers/cxl/core/memdev.c
 > @@ -331,6 +331,10 @@ int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa)
 >  {
 >      int rc;
 >  
 > +    ACQUIRE(device_intr, devlock)(&cxlmd->dev);
 > +    if ((rc = ACQUIRE_ERR(device_intr, &devlock)))
 > +        return rc;
 > +
 >      ACQUIRE(rwsem_read_intr, region_rwsem)(&cxl_rwsem.region);
 >      if ((rc = ACQUIRE_ERR(rwsem_read_intr, &region_rwsem)))
 >          return rc;
 > @@ -355,6 +359,7 @@ int cxl_clear_poison_locked(struct cxl_memdev *cxlmd, u64 dpa)
 >      if (!IS_ENABLED(CONFIG_DEBUG_FS))
 >          return 0;
 >  
 > +    device_lock_assert(&cxlmd->dev);
 >      lockdep_assert_held(&cxl_rwsem.dpa);
 >      lockdep_assert_held(&cxl_rwsem.region);
 >  
 > @@ -400,6 +405,10 @@ int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa)
 >  {
 >      int rc;
 >  
 > +    ACQUIRE(device_intr, devlock)(&cxlmd->dev);
 > +    if ((rc = ACQUIRE_ERR(device_intr, &devlock)))
 > +        return rc;
 > +
 >      ACQUIRE(rwsem_read_intr, region_rwsem)(&cxl_rwsem.region);
 >      if ((rc = ACQUIRE_ERR(rwsem_read_intr, &region_rwsem)))
 >          return rc;
 > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
 > index 96888d87a8df..a7d391757e16 100644
 > --- a/drivers/cxl/core/region.c
 > +++ b/drivers/cxl/core/region.c
 > @@ -2936,7 +2936,7 @@ static int __cxl_dpa_to_region(struct device *dev, void *arg)
 >      return 1;
 >  }
 >  
 > -struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
 > +struct cxl_region *cxl_dpa_to_region(struct cxl_memdev *cxlmd, u64 dpa)
 >  {
 >      struct cxl_dpa_to_region_context ctx;
 >      struct cxl_port *port;
 > @@ -2944,8 +2944,12 @@ struct cxl_region *cxl_dpa_to_region(const struct cxl_memdev *cxlmd, u64 dpa)
 >      ctx = (struct cxl_dpa_to_region_context) {
 >          .dpa = dpa,
 >      };
 > +
 > +    device_lock_assert(&cxlmd->dev);
 > +    if (!cxlmd->dev.driver)
 > +        return NULL;
 >      port = cxlmd->endpoint;
 > -    if (port && is_cxl_endpoint(port) && cxl_num_decoders_committed(port))
 > +    if (cxl_num_decoders_committed(port))
 >          device_for_each_child(&port->dev, &ctx, __cxl_dpa_to_region);
 >  
 >      return ctx.cxlr;
 > @@ -4004,10 +4008,9 @@ static int validate_region_offset(struct cxl_region *cxlr, u64 offset)
 >      return 0;
 >  }
 >  
 > -static int cxl_region_debugfs_poison_inject(void *data, u64 offset)
 > +static int region_poison_lookup(struct cxl_region *cxlr, u64 offset,
 > +                struct dpa_result *result)
 >  {
 > -    struct dpa_result result = { .dpa = ULLONG_MAX, .cxlmd = NULL };
 > -    struct cxl_region *cxlr = data;
 >      int rc;
 >  
 >      ACQUIRE(rwsem_read_intr, region_rwsem)(&cxl_rwsem.region);
 > @@ -4022,8 +4025,8 @@ static int cxl_region_debugfs_poison_inject(void *data, u64 offset)
 >          return -EINVAL;
 >  
 >      offset -= cxlr->params.cache_size;
 > -    rc = region_offset_to_dpa_result(cxlr, offset, &result);
 > -    if (rc || !result.cxlmd || result.dpa == ULLONG_MAX) {
 > +    rc = region_offset_to_dpa_result(cxlr, offset, result);
 > +    if (rc || !result->cxlmd || result->dpa == ULLONG_MAX) {
 >          dev_dbg(&cxlr->dev,
 >              "Failed to resolve DPA for region offset %#llx rc %d\n",
 >              offset, rc);
 > @@ -4031,7 +4034,43 @@ static int cxl_region_debugfs_poison_inject(void *data, u64 offset)
 >          return rc ? rc : -EINVAL;
 >      }
 >  
 > -    return cxl_inject_poison_locked(result.cxlmd, result.dpa);
 > +    return 0;
 > +}
 > +
 > +static int cxl_region_debugfs_poison_inject(void *data, u64 offset)
 > +{
 > +    struct dpa_result result1 = { .dpa = ULLONG_MAX };
 > +    struct dpa_result result2 = { .dpa = ULLONG_MAX };
 > +    struct cxl_region *cxlr = data;
 > +    struct cxl_memdev *cxlmd;
 > +    int rc;
 > +
 > +    rc = region_poison_lookup(cxlr, offset, &result1);
 > +    if (rc)
 > +        return rc;
 > +
 > +    cxlmd = result1.cxlmd;
 > +    ACQUIRE(device_intr, devlock)(&cxlmd->dev);
 > +    if ((rc = ACQUIRE_ERR(device_intr, &devlock)))
 > +        return rc;
 > +
 > +    ACQUIRE(rwsem_read_intr, region_rwsem)(&cxl_rwsem.region);
 > +    if ((rc = ACQUIRE_ERR(rwsem_read_intr, &region_rwsem)))
 > +        return rc;
 > +
 > +    ACQUIRE(rwsem_read_intr, dpa_rwsem)(&cxl_rwsem.dpa);
 > +    if ((rc = ACQUIRE_ERR(rwsem_read_intr, &dpa_rwsem)))
 > +        return rc;
 > +
 > +    offset -= cxlr->params.cache_size;
 > +    rc = region_offset_to_dpa_result(cxlr, offset, &result2);
 > +    if (rc || memcmp(&result1, &result2, sizeof(result1) != 0)) {
 > +        dev_dbg(&cxlr->dev,
 > +            "Error injection raced region reconfiguration: %d\n", rc);
 > +        return rc ? rc : -ENXIO;
 > +    }
 > +
 > +    return cxl_inject_poison_locked(result1.cxlmd, result1.dpa);
 >  }
 >  
 >  DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_inject_fops, NULL,
 > @@ -4039,10 +4078,21 @@ DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_inject_fops, NULL,
 >  
 >  static int cxl_region_debugfs_poison_clear(void *data, u64 offset)
 >  {
 > -    struct dpa_result result = { .dpa = ULLONG_MAX, .cxlmd = NULL };
 > +    struct dpa_result result1 = { .dpa = ULLONG_MAX };
 > +    struct dpa_result result2 = { .dpa = ULLONG_MAX };
 >      struct cxl_region *cxlr = data;
 > +    struct cxl_memdev *cxlmd;
 >      int rc;
 >  
 > +    rc = region_poison_lookup(cxlr, offset, &result1);
 > +    if (rc)
 > +        return rc;
 > +
 > +    cxlmd = result1.cxlmd;
 > +    ACQUIRE(device_intr, devlock)(&cxlmd->dev);
 > +    if ((rc = ACQUIRE_ERR(device_intr, &devlock)))
 > +        return rc;
 > +
 >      ACQUIRE(rwsem_read_intr, region_rwsem)(&cxl_rwsem.region);
 >      if ((rc = ACQUIRE_ERR(rwsem_read_intr, &region_rwsem)))
 >          return rc;
 > @@ -4051,20 +4101,15 @@ static int cxl_region_debugfs_poison_clear(void *data, u64 offset)
 >      if ((rc = ACQUIRE_ERR(rwsem_read_intr, &dpa_rwsem)))
 >          return rc;
 >  
 > -    if (validate_region_offset(cxlr, offset))
 > -        return -EINVAL;
 > -
 >      offset -= cxlr->params.cache_size;
 > -    rc = region_offset_to_dpa_result(cxlr, offset, &result);
 > -    if (rc || !result.cxlmd || result.dpa == ULLONG_MAX) {
 > +    rc = region_offset_to_dpa_result(cxlr, offset, &result2);
 > +    if (rc || memcmp(&result1, &result2, sizeof(result1) != 0)) {
 >          dev_dbg(&cxlr->dev,
 > -            "Failed to resolve DPA for region offset %#llx rc %d\n",
 > -            offset, rc);
 > -
 > -        return rc ? rc : -EINVAL;
 > +            "Error clearing raced region reconfiguration: %d\n", rc);
 > +        return rc ? rc : -ENXIO;
 >      }
 >  
 > -    return cxl_clear_poison_locked(result.cxlmd, result.dpa);
 > +    return cxl_clear_poison_locked(result1.cxlmd, result1.dpa);
 >  }
 >  
 >  DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_clear_fops, NULL,
 > -- 
 > 2.52.0
 > 
 >

Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

Posted by dan.j.williams@intel.com 5 days, 7 hours ago

Li Ming wrote:
[..]
> Hi Dan,
> 
> Thanks for your proposal, I think your change can solve this problem too.
> But the change is a lot, and need more time to review all driver code
> to confirm if there are other places needed such checking. (I found
> that cxl_reset_done also needs some changes like you mentioned) Maybe
> we can consider my change as a quick fix? Then I can prepare a new
> patchset for the consolidation.

I am not convinced that it is a fix. The fact that you say the bug
disappears when patch2 is applied leads me to believe that is
potentially the only bug.

I.e. it may be the case that cxl_dpa_to_region() is safe to assume that
a valid ->endpoint pointer will remain valid once the port
bus_add_device() vs bus_probe_device() hole is plugged.

I would say start with patch2 by itself. Then circle back to prove
that a mere NULL check is a fix or just makes the vulnerability window
smaller and the locking rework is needed.

Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

Posted by Dave Jiang 6 days, 9 hours ago


On 2/1/26 2:30 AM, Li Ming wrote:
> CXL testing environment can trigger following trace
> 
>  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000092: 0000 [#1] SMP KASAN NOPTI
>  KASAN: null-ptr-deref in range [0x0000000000000490-0x0000000000000497]
>  RIP: 0010:cxl_dpa_to_region+0x105/0x1f0 [cxl_core]
>  Call Trace:
>   <TASK>
>   cxl_event_trace_record+0xd1/0xa70 [cxl_core]
>   __cxl_event_trace_record+0x12f/0x1e0 [cxl_core]
>   cxl_mem_get_records_log+0x261/0x500 [cxl_core]
>   cxl_mem_get_event_records+0x7c/0xc0 [cxl_core]
>   cxl_mock_mem_probe+0xd38/0x1c60 [cxl_mock_mem]
>   platform_probe+0x9d/0x130
>   really_probe+0x1c8/0x960
>   __driver_probe_device+0x187/0x3e0
>   driver_probe_device+0x45/0x120
>   __device_attach_driver+0x15d/0x280

So I talked to Dan and the assigning of ERR_PTR(-ENXIO) is from a previous revision of implementation and isn't intended to be there. So removing the line entirely (which would mean endpoint == NULL from kzalloc) is probably the right change. However, I'm not able to reproduce the error. How is it triggered? Also, by setting it to NULL, are we covering up a different issue? Are you able to determine whether cxlmd->endpoint at that point is unexpected and perhaps triggered by a different bug?

DJ

> 
> commit 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
> initializes cxlmd->endpoint to ERR_PTR(-ENXIO) in cxl_memdev_alloc().
> However, cxl_dpa_to_region() treats a non-NULL cxlmd->endpoint as a
> valid endpoint.
> 
> Across the CXL core, endpoint availability is generally determined by
> checking whether it is NULL. Align with this convention by initializing
> cxlmd->endpoint to NULL by default.
> 
> Fixes: 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
> Signed-off-by: Li Ming <ming.li@zohomail.com>
> ---
>  drivers/cxl/core/memdev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index af3d0cc65138..41a507b5daa4 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -675,7 +675,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
>  	cxlmd->id = rc;
>  	cxlmd->depth = -1;
>  	cxlmd->attach = attach;
> -	cxlmd->endpoint = ERR_PTR(-ENXIO);
> +	cxlmd->endpoint = NULL;
>  
>  	dev = &cxlmd->dev;
>  	device_initialize(dev);

Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

Posted by Li Ming 5 days, 15 hours ago



From: Dave Jiang <dave.jiang@intel.com>
To: "Li Ming"<ming.li@zohomail.com>, <dave@stgolabs.net>, <jonathan.cameron@huawei.com>, <alison.schofield@intel.com>, <vishal.l.verma@intel.com>, <ira.weiny@intel.com>, <dan.j.williams@intel.com>
Cc: <linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Date: Tue, 03 Feb 2026 05:04:44 +0800
Subject: Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

 > 
 > 
 > On 2/1/26 2:30 AM, Li Ming wrote:
 > > CXL testing environment can trigger following trace
 > > 
 > >  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000092: 0000 [#1] SMP KASAN NOPTI
 > >  KASAN: null-ptr-deref in range [0x0000000000000490-0x0000000000000497]
 > >  RIP: 0010:cxl_dpa_to_region+0x105/0x1f0 [cxl_core]
 > >  Call Trace:
 > >   <TASK>
 > >   cxl_event_trace_record+0xd1/0xa70 [cxl_core]
 > >   __cxl_event_trace_record+0x12f/0x1e0 [cxl_core]
 > >   cxl_mem_get_records_log+0x261/0x500 [cxl_core]
 > >   cxl_mem_get_event_records+0x7c/0xc0 [cxl_core]
 > >   cxl_mock_mem_probe+0xd38/0x1c60 [cxl_mock_mem]
 > >   platform_probe+0x9d/0x130
 > >   really_probe+0x1c8/0x960
 > >   __driver_probe_device+0x187/0x3e0
 > >   driver_probe_device+0x45/0x120
 > >   __device_attach_driver+0x15d/0x280
 > 
 > So I talked to Dan and the assigning of ERR_PTR(-ENXIO) is from a previous revision of implementation and isn't intended to be there. So removing the line entirely (which would mean endpoint == NULL from kzalloc) is probably the right change. However, I'm not able to reproduce the error. How is it triggered? Also, by setting it to NULL, are we covering up a different issue? Are you able to determine whether cxlmd->endpoint at that point is unexpected and perhaps triggered by a different bug?
 > 
 > DJ
Hi Dave,

As I implied in cover-letter, When I trigger issue #2, it always trigger this issue.
I have a simple script to run cxl test over and over again in a VM. I usually reproduce it in one hour with KASAN enabled. 
When I reproduced it, I dumpped out the value of cxlmd->endpoint, it is -ENXIO. That means it was not updated in cxl_port_add() yet at that time. I cannot reproduce this issue only with patch #2.

Ming

 > 
 > > 
 > > commit 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
 > > initializes cxlmd->endpoint to ERR_PTR(-ENXIO) in cxl_memdev_alloc().
 > > However, cxl_dpa_to_region() treats a non-NULL cxlmd->endpoint as a
 > > valid endpoint.
 > > 
 > > Across the CXL core, endpoint availability is generally determined by
 > > checking whether it is NULL. Align with this convention by initializing
 > > cxlmd->endpoint to NULL by default.
 > > 
 > > Fixes: 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
 > > Signed-off-by: Li Ming <ming.li@zohomail.com>
 > > ---
 > >  drivers/cxl/core/memdev.c | 2 +-
 > >  1 file changed, 1 insertion(+), 1 deletion(-)
 > > 
 > > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
 > > index af3d0cc65138..41a507b5daa4 100644
 > > --- a/drivers/cxl/core/memdev.c
 > > +++ b/drivers/cxl/core/memdev.c
 > > @@ -675,7 +675,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
 > >      cxlmd->id = rc;
 > >      cxlmd->depth = -1;
 > >      cxlmd->attach = attach;
 > > -    cxlmd->endpoint = ERR_PTR(-ENXIO);
 > > +    cxlmd->endpoint = NULL;
 > >  
 > >      dev = &cxlmd->dev;
 > >      device_initialize(dev);
 > 
 >

Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

Posted by Jonathan Cameron 6 days, 15 hours ago

On Sun,  1 Feb 2026 17:30:01 +0800
Li Ming <ming.li@zohomail.com> wrote:

> CXL testing environment can trigger following trace
> 
>  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000092: 0000 [#1] SMP KASAN NOPTI
>  KASAN: null-ptr-deref in range [0x0000000000000490-0x0000000000000497]
>  RIP: 0010:cxl_dpa_to_region+0x105/0x1f0 [cxl_core]
>  Call Trace:
>   <TASK>
>   cxl_event_trace_record+0xd1/0xa70 [cxl_core]
>   __cxl_event_trace_record+0x12f/0x1e0 [cxl_core]
>   cxl_mem_get_records_log+0x261/0x500 [cxl_core]
>   cxl_mem_get_event_records+0x7c/0xc0 [cxl_core]
>   cxl_mock_mem_probe+0xd38/0x1c60 [cxl_mock_mem]
>   platform_probe+0x9d/0x130
>   really_probe+0x1c8/0x960
>   __driver_probe_device+0x187/0x3e0
>   driver_probe_device+0x45/0x120
>   __device_attach_driver+0x15d/0x280
> 
> commit 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
> initializes cxlmd->endpoint to ERR_PTR(-ENXIO) in cxl_memdev_alloc().
> However, cxl_dpa_to_region() treats a non-NULL cxlmd->endpoint as a
> valid endpoint.
> 
> Across the CXL core, endpoint availability is generally determined by
> checking whether it is NULL. Align with this convention by initializing
> cxlmd->endpoint to NULL by default.

I had a look at whether it made sense to use use IS_ERR_OR_NULL() to check
for validity of the endpoint, but it would be somewhat fiddly and I think
you are correct that convention here seems to be NULL means not set.
We don't need the error code.  One comment inline.

Either way nice catch
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


> 
> Fixes: 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
> Signed-off-by: Li Ming <ming.li@zohomail.com>
> ---
>  drivers/cxl/core/memdev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index af3d0cc65138..41a507b5daa4 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -675,7 +675,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
>  	cxlmd->id = rc;
>  	cxlmd->depth = -1;
>  	cxlmd->attach = attach;
> -	cxlmd->endpoint = ERR_PTR(-ENXIO);
> +	cxlmd->endpoint = NULL;
cxlmd has just been allocated with kzalloc so I'd argue we don't need this to be explicitly
set at all.  Seems like a natural and safe default.

>  
>  	dev = &cxlmd->dev;
>  	device_initialize(dev);

Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

Posted by Li Ming 5 days, 15 hours ago



From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: "Li Ming"<ming.li@zohomail.com>
Cc: <dave@stgolabs.net>, <dave.jiang@intel.com>, <alison.schofield@intel.com>, <vishal.l.verma@intel.com>, <ira.weiny@intel.com>, <dan.j.williams@intel.com>, <linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Date: Mon, 02 Feb 2026 22:41:03 +0800
Subject: Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

 > On Sun,  1 Feb 2026 17:30:01 +0800
 > Li Ming <ming.li@zohomail.com> wrote:
 > 
 > > CXL testing environment can trigger following trace
 > > 
 > >  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000092: 0000 [#1] SMP KASAN NOPTI
 > >  KASAN: null-ptr-deref in range [0x0000000000000490-0x0000000000000497]
 > >  RIP: 0010:cxl_dpa_to_region+0x105/0x1f0 [cxl_core]
 > >  Call Trace:
 > >   <TASK>
 > >   cxl_event_trace_record+0xd1/0xa70 [cxl_core]
 > >   __cxl_event_trace_record+0x12f/0x1e0 [cxl_core]
 > >   cxl_mem_get_records_log+0x261/0x500 [cxl_core]
 > >   cxl_mem_get_event_records+0x7c/0xc0 [cxl_core]
 > >   cxl_mock_mem_probe+0xd38/0x1c60 [cxl_mock_mem]
 > >   platform_probe+0x9d/0x130
 > >   really_probe+0x1c8/0x960
 > >   __driver_probe_device+0x187/0x3e0
 > >   driver_probe_device+0x45/0x120
 > >   __device_attach_driver+0x15d/0x280
 > > 
 > > commit 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
 > > initializes cxlmd->endpoint to ERR_PTR(-ENXIO) in cxl_memdev_alloc().
 > > However, cxl_dpa_to_region() treats a non-NULL cxlmd->endpoint as a
 > > valid endpoint.
 > > 
 > > Across the CXL core, endpoint availability is generally determined by
 > > checking whether it is NULL. Align with this convention by initializing
 > > cxlmd->endpoint to NULL by default.
 > 
 > I had a look at whether it made sense to use use IS_ERR_OR_NULL() to check
 > for validity of the endpoint, but it would be somewhat fiddly and I think
 > you are correct that convention here seems to be NULL means not set.
 > We don't need the error code.  One comment inline.
 > 
 > Either way nice catch
 > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Thanks for the review.

 > 
 > 
 > > 
 > > Fixes: 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
 > > Signed-off-by: Li Ming <ming.li@zohomail.com>
 > > ---
 > >  drivers/cxl/core/memdev.c | 2 +-
 > >  1 file changed, 1 insertion(+), 1 deletion(-)
 > > 
 > > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
 > > index af3d0cc65138..41a507b5daa4 100644
 > > --- a/drivers/cxl/core/memdev.c
 > > +++ b/drivers/cxl/core/memdev.c
 > > @@ -675,7 +675,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
 > >      cxlmd->id = rc;
 > >      cxlmd->depth = -1;
 > >      cxlmd->attach = attach;
 > > -    cxlmd->endpoint = ERR_PTR(-ENXIO);
 > > +    cxlmd->endpoint = NULL;
 > cxlmd has just been allocated with kzalloc so I'd argue we don't need this to be explicitly
 > set at all.  Seems like a natural and safe default.
Yes, I forgot it. Thanks for pointing it out.

Ming
 > 
 > >  
 > >      dev = &cxlmd->dev;
 > >      device_initialize(dev);
 > 
 >

Re: [PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default

Posted by Gregory Price 6 days, 14 hours ago

On Mon, Feb 02, 2026 at 02:41:03PM +0000, Jonathan Cameron wrote:
> On Sun,  1 Feb 2026 17:30:01 +0800
> Li Ming <ming.li@zohomail.com> wrote:
> 
> > CXL testing environment can trigger following trace
> > 
> >  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000092: 0000 [#1] SMP KASAN NOPTI
> >  KASAN: null-ptr-deref in range [0x0000000000000490-0x0000000000000497]
> >  RIP: 0010:cxl_dpa_to_region+0x105/0x1f0 [cxl_core]
> >  Call Trace:
> >   <TASK>
> >   cxl_event_trace_record+0xd1/0xa70 [cxl_core]
> >   __cxl_event_trace_record+0x12f/0x1e0 [cxl_core]
> >   cxl_mem_get_records_log+0x261/0x500 [cxl_core]
> >   cxl_mem_get_event_records+0x7c/0xc0 [cxl_core]
> >   cxl_mock_mem_probe+0xd38/0x1c60 [cxl_mock_mem]
> >   platform_probe+0x9d/0x130
> >   really_probe+0x1c8/0x960
> >   __driver_probe_device+0x187/0x3e0
> >   driver_probe_device+0x45/0x120
> >   __device_attach_driver+0x15d/0x280
> > 
> > commit 29317f8dc6ed ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
> > initializes cxlmd->endpoint to ERR_PTR(-ENXIO) in cxl_memdev_alloc().
> > However, cxl_dpa_to_region() treats a non-NULL cxlmd->endpoint as a
> > valid endpoint.
> > 
> > Across the CXL core, endpoint availability is generally determined by
> > checking whether it is NULL. Align with this convention by initializing
> > cxlmd->endpoint to NULL by default.
> 
> I had a look at whether it made sense to use use IS_ERR_OR_NULL() to check
> for validity of the endpoint, but it would be somewhat fiddly and I think
> you are correct that convention here seems to be NULL means not set.
> We don't need the error code.  One comment inline.
> 

doing validity checks on pointers by checking for null is a pretty
common convention kernel-wide, I would consider setting some structure's
value to an ERR_PTR to be the aberration.

So yeah, good catch

Reviewed-by: Gregory Price <gourry@gourry.net>

~Gregory

[PATCH 1/2] cxl/core: Set cxlmd->endpoint to NULL by default
[PATCH 2/2] cxl/core: Hold grandparent port lock while dport adding