[PATCH v4 8/9] cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft Reserved
Posted by Smita Koralahalli 2 months, 3 weeks ago
If CXL regions do not fully cover a Soft Reserved span, HMEM takes
ownership. Tear down overlapping CXL regions before allowing HMEM to
register and online the memory.

Add cxl_region_teardown() to walk CXL regions overlapping a span and
unregister them via devm_release_action() and unregister_region().

Force the region state back to CXL_CONFIG_ACTIVE before unregistering to
prevent the teardown path from resetting decoders HMEM still relies on
to create its dax and online memory.

Co-developed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
 drivers/cxl/core/region.c | 38 ++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxl.h         |  5 +++++
 drivers/dax/hmem/hmem.c   |  4 +++-
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 38e7ec6a087b..266b24028df0 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3784,6 +3784,44 @@ struct cxl_range_ctx {
 	bool found;
 };
 
+static int cxl_region_teardown_cb(struct device *dev, void *data)
+{
+	struct cxl_range_ctx *ctx = data;
+	struct cxl_root_decoder *cxlrd;
+	struct cxl_region_params *p;
+	struct cxl_region *cxlr;
+	struct cxl_port *port;
+
+	cxlr = cxlr_overlapping_range(dev, ctx->start, ctx->end);
+	if (!cxlr)
+		return 0;
+
+	cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
+	port = cxlrd_to_port(cxlrd);
+	p = &cxlr->params;
+
+	/* Force the region state back to CXL_CONFIG_ACTIVE so that
+	 * unregister_region() does not run the full decoder reset path
+	 * which would invalidate the decoder programming that HMEM
+	 * relies on to create its DAX device and online the underlying
+	 * memory.
+	 */
+	scoped_guard(rwsem_write, &cxl_rwsem.region)
+		p->state = min(p->state, CXL_CONFIG_ACTIVE);
+
+	devm_release_action(port->uport_dev, unregister_region, cxlr);
+
+	return 0;
+}
+
+void cxl_region_teardown(resource_size_t start, resource_size_t end)
+{
+	struct cxl_range_ctx ctx = { .start = start, .end = end };
+
+	bus_for_each_dev(&cxl_bus_type, NULL, &ctx, cxl_region_teardown_cb);
+}
+EXPORT_SYMBOL_GPL(cxl_region_teardown);
+
 static void cxl_region_enable_dax(struct cxl_region *cxlr)
 {
 	struct cxl_region_params *p = &cxlr->params;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 414ddf6c35d7..a215a88ef59c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -880,6 +880,7 @@ struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
 u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa);
 bool cxl_regions_fully_map(resource_size_t start, resource_size_t end);
 void cxl_register_dax(resource_size_t start, resource_size_t end);
+void cxl_region_teardown(resource_size_t start, resource_size_t end);
 #else
 static inline bool is_cxl_pmem_region(struct device *dev)
 {
@@ -911,6 +912,10 @@ static inline void cxl_register_dax(resource_size_t start,
 				    resource_size_t end)
 {
 }
+static inline void cxl_region_teardown(resource_size_t start,
+				       resource_size_t end)
+{
+}
 #endif
 
 void cxl_endpoint_parse_cdat(struct cxl_port *port);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index b9312e0f2e62..7d874ee169ac 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -158,8 +158,10 @@ static int handle_deferred_cxl(struct device *host, int target_nid,
 		if (cxl_regions_fully_map(res->start, res->end)) {
 			dax_cxl_mode = DAX_CXL_MODE_DROP;
 			cxl_register_dax(res->start, res->end);
-		} else
+		} else {
 			dax_cxl_mode = DAX_CXL_MODE_REGISTER;
+			cxl_region_teardown(res->start, res->end);
+		}
 
 		hmem_register_device(host, target_nid, res);
 	}
-- 
2.17.1
Re: [PATCH v4 8/9] cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft Reserved
Posted by dan.j.williams@intel.com 2 months, 1 week ago
Smita Koralahalli wrote:
> If CXL regions do not fully cover a Soft Reserved span, HMEM takes
> ownership. Tear down overlapping CXL regions before allowing HMEM to
> register and online the memory.
> 
> Add cxl_region_teardown() to walk CXL regions overlapping a span and
> unregister them via devm_release_action() and unregister_region().
> 
> Force the region state back to CXL_CONFIG_ACTIVE before unregistering to
> prevent the teardown path from resetting decoders HMEM still relies on
> to create its dax and online memory.
> 
> Co-developed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
>  drivers/cxl/core/region.c | 38 ++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxl.h         |  5 +++++
>  drivers/dax/hmem/hmem.c   |  4 +++-
>  3 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 38e7ec6a087b..266b24028df0 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3784,6 +3784,44 @@ struct cxl_range_ctx {
>  	bool found;
>  };
>  
> +static int cxl_region_teardown_cb(struct device *dev, void *data)
> +{
> +	struct cxl_range_ctx *ctx = data;
> +	struct cxl_root_decoder *cxlrd;
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	struct cxl_port *port;
> +
> +	cxlr = cxlr_overlapping_range(dev, ctx->start, ctx->end);
> +	if (!cxlr)
> +		return 0;
> +
> +	cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
> +	port = cxlrd_to_port(cxlrd);
> +	p = &cxlr->params;
> +
> +	/* Force the region state back to CXL_CONFIG_ACTIVE so that

Minor, and moot given the follow on comments below, but please keep
consistent comment-style and lead with a /*, i.e.:

/*
 * Force the region...
 
> +	 * unregister_region() does not run the full decoder reset path
> +	 * which would invalidate the decoder programming that HMEM
> +	 * relies on to create its DAX device and online the underlying
> +	 * memory.
> +	 */
> +	scoped_guard(rwsem_write, &cxl_rwsem.region)
> +		p->state = min(p->state, CXL_CONFIG_ACTIVE);

I think the thickness of the above comment belies that this is too much
of a layering violation and likely to cause problems. For minimizing the
mental load of analyzing future bug reports, I want all regions gone
when any handshake with the platform firmware and dax-hmem occurs.  When
that happens it may mean destroying regions that were dynamically
created while waiting for wait_for_initial_probe() to time out, who
knows. The simple policy is "CXL subsystem understands everything, or
touches nothing."

For this reset determination, what I think makes more sense, and is
generally useful for shutting down CXL even outside of the hmem deferral
trickery, is to always record whether decoders were idle or not at the
time of region creation. In fact we already have that flag, it is called
CXL_REGION_F_AUTO.

If CXL_REGION_F_AUTO is still set at detach_target() time, it means that
we are giving up on auto-assembly and leaving the decoders alone.

If the administrator actually wants to destroy and reclaim that
physical address space then they need to forcefully de-commit that
auto-assembled region via the @commit sysfs attribute. So that means
commit_store() needs to clear CXL_REGION_F_AUTO to get the decoder reset
to happen. 
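
Something like the following, as a rough sketch only (surrounding
context, locking, and error handling elided; names approximate):

	/*
	 * commit_store(): an explicit de-commit by the administrator
	 * takes the region out of "auto" mode so the decoder reset is
	 * allowed to run.
	 */
	if (!commit)
		clear_bit(CXL_REGION_F_AUTO, &cxlr->flags);

	/*
	 * ...while the detach / teardown path leaves firmware-programmed
	 * decoders untouched for a region that is still marked auto.
	 */
	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags))
		return 0;	/* skip the decoder reset */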

[..]
>  void cxl_endpoint_parse_cdat(struct cxl_port *port);
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index b9312e0f2e62..7d874ee169ac 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -158,8 +158,10 @@ static int handle_deferred_cxl(struct device *host, int target_nid,
>  		if (cxl_regions_fully_map(res->start, res->end)) {
>  			dax_cxl_mode = DAX_CXL_MODE_DROP;
>  			cxl_register_dax(res->start, res->end);
> -		} else
> +		} else {
>  			dax_cxl_mode = DAX_CXL_MODE_REGISTER;
> +			cxl_region_teardown(res->start, res->end);
> +		}

Like I alluded to above, I am not on board with making a range-by-range
decision on teardown. The check for "all clear" vs "abort" should be a
global event before proceeding with either allowing cxl_region instances
to attach or destroying all of them. Recall that if
cxl_dax_region_probe() is globally rejecting all cxl_dax_region devices
until dax_cxl_mode moves to DAX_CXL_MODE_DROP, it keeps the consistent
behavior that either all regions attach or none do.
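
In code terms the gate is global, not per-range. Roughly (sketch only;
how dax_cxl_mode from dax/hmem is consulted by dax/cxl.c is glossed
over here):

	/* cxl_dax_region_probe(): all-or-nothing gate */
	if (dax_cxl_mode != DAX_CXL_MODE_DROP)
		return -ENODEV;	/* hmem retains ownership for now */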
Re: [PATCH v4 8/9] cxl/region, dax/hmem: Tear down CXL regions when HMEM reclaims Soft Reserved
Posted by Koralahalli Channabasappa, Smita 1 month, 4 weeks ago
On 12/3/2025 4:50 PM, dan.j.williams@intel.com wrote:
> Smita Koralahalli wrote:
>> If CXL regions do not fully cover a Soft Reserved span, HMEM takes
>> ownership. Tear down overlapping CXL regions before allowing HMEM to
>> register and online the memory.
>>
>> Add cxl_region_teardown() to walk CXL regions overlapping a span and
>> unregister them via devm_release_action() and unregister_region().
>>
>> Force the region state back to CXL_CONFIG_ACTIVE before unregistering to
>> prevent the teardown path from resetting decoders HMEM still relies on
>> to create its dax and online memory.
>>
>> Co-developed-by: Alison Schofield <alison.schofield@intel.com>
>> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> ---
>>   drivers/cxl/core/region.c | 38 ++++++++++++++++++++++++++++++++++++++
>>   drivers/cxl/cxl.h         |  5 +++++
>>   drivers/dax/hmem/hmem.c   |  4 +++-
>>   3 files changed, 46 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>> index 38e7ec6a087b..266b24028df0 100644
>> --- a/drivers/cxl/core/region.c
>> +++ b/drivers/cxl/core/region.c
>> @@ -3784,6 +3784,44 @@ struct cxl_range_ctx {
>>   	bool found;
>>   };
>>   
>> +static int cxl_region_teardown_cb(struct device *dev, void *data)
>> +{
>> +	struct cxl_range_ctx *ctx = data;
>> +	struct cxl_root_decoder *cxlrd;
>> +	struct cxl_region_params *p;
>> +	struct cxl_region *cxlr;
>> +	struct cxl_port *port;
>> +
>> +	cxlr = cxlr_overlapping_range(dev, ctx->start, ctx->end);
>> +	if (!cxlr)
>> +		return 0;
>> +
>> +	cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
>> +	port = cxlrd_to_port(cxlrd);
>> +	p = &cxlr->params;
>> +
>> +	/* Force the region state back to CXL_CONFIG_ACTIVE so that
> 
> Minor, and moot given the follow on comments below, but please keep
> consistent comment-style and lead with a /*, i.e.:
> 
> /*
>   * Force the region...
>   
>> +	 * unregister_region() does not run the full decoder reset path
>> +	 * which would invalidate the decoder programming that HMEM
>> +	 * relies on to create its DAX device and online the underlying
>> +	 * memory.
>> +	 */
>> +	scoped_guard(rwsem_write, &cxl_rwsem.region)
>> +		p->state = min(p->state, CXL_CONFIG_ACTIVE);
> 
> I think the thickness of the above comment belies that this is too much
> of a layering violation and likely to cause problems. For minimizing the
> mental load of analyzing future bug reports, I want all regions gone
> when any handshake with the platform firmware and dax-hmem occurs.  When
> that happens it may mean destroying regions that were dynamically
> created while waiting for wait_for_initial_probe() to time out, who
> knows. The simple policy is "CXL subsystem understands everything, or
> touches nothing."
> 
> For this reset determination, what I think makes more sense, and is
> generally useful for shutting down CXL even outside of the hmem deferral
> trickery, is to always record whether decoders were idle or not at the
> time of region creation. In fact we already have that flag, it is called
> CXL_REGION_F_AUTO.
> 
> If CXL_REGION_F_AUTO is still set at detach_target() time, it means that
> we are giving up on auto-assembly and leaving the decoders alone.
> 
> If the administrator actually wants to destroy and reclaim that
> physical address space then they need to forcefully de-commit that
> auto-assembled region via the @commit sysfs attribute. So that means
> commit_store() needs to clear CXL_REGION_F_AUTO to get the decoder reset
> to happen.
> 
> [..]
>>   void cxl_endpoint_parse_cdat(struct cxl_port *port);
>> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
>> index b9312e0f2e62..7d874ee169ac 100644
>> --- a/drivers/dax/hmem/hmem.c
>> +++ b/drivers/dax/hmem/hmem.c
>> @@ -158,8 +158,10 @@ static int handle_deferred_cxl(struct device *host, int target_nid,
>>   		if (cxl_regions_fully_map(res->start, res->end)) {
>>   			dax_cxl_mode = DAX_CXL_MODE_DROP;
>>   			cxl_register_dax(res->start, res->end);
>> -		} else
>> +		} else {
>>   			dax_cxl_mode = DAX_CXL_MODE_REGISTER;
>> +			cxl_region_teardown(res->start, res->end);
>> +		}
> 
> Like I alluded to above, I am not on board with making a range-by-range
> decision on teardown. The check for "all clear" vs "abort" should be a
> global event before proceeding with either allowing cxl_region instances
> to attach or destroying all of them. Recall that if
> cxl_dax_region_probe() is globally rejecting all cxl_dax_region devices
> until dax_cxl_mode moves to DAX_CXL_MODE_DROP, it keeps the consistent
> behavior that either all regions attach or none do.

Thanks. I will restructure this and rely on CXL_REGION_F_AUTO semantics 
for teardown.

I wanted to check one nuance to make sure I'm applying this correctly.

Consider a case like:

Soft Reserved spans (HMAT-visible):
SR1: 0x085000–0x284fff
SR2: 0x285000–0x484fff
SR3: 0x485000–0x684fff

CXL regions:
R1: 0x085000–0x274fff
R2: 0x285000–0x484fff
R3: 0x485000–0x684fff

Here SR1 is not fully contained by any committed CXL region, even though
SR2 and SR3 are. In this scenario, my understanding is that the global 
check should fail, and we abort CXL ownership entirely (tear down all 
auto-assembled regions) and proceed with pure HMEM, rather than allowing 
partial CXL ownership. Is that correct?

Related to that, the flow should be something like:

Inside process_defer_work():

process_defer_work()
	wait_for_device_probe();
	if (cxl_contains_soft_reserve())
		dax_cxl_mode = DROP
		rescan the cxl bus (bus_rescan_devices(&cxl_bus_type);)
	else
		dax_cxl_mode = REGISTER
		tear down all regions
	walk_hmem_resources(&pdev->dev, hmem_register_device);
	(here we walk all hmem resources a second time)

and as you mentioned

cxl_contains_soft_reserve()
	for_each_cxl_intersecting_hmem_resource()
		(here we walk all hmem resources a first time)
		found = false
		for_each_region()
			if (resource_contains(cxl_region_resource, hmem_resource))
				found = true
		if (!found)
			return false
	return true
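
For concreteness, a rough sketch of that coverage walk (helper names
hypothetical, signatures approximate, and a file-scope flag used only
to keep the sketch short):

	static bool cxl_covered;

	static int hmem_res_covered(struct device *host, int target_nid,
				    const struct resource *res)
	{
		/* cxl_regions_fully_map() from earlier in this series */
		if (!cxl_regions_fully_map(res->start, res->end))
			cxl_covered = false;
		return 0;
	}

	static bool cxl_contains_soft_reserve(struct device *host)
	{
		cxl_covered = true;
		walk_hmem_resources(host, hmem_res_covered);
		return cxl_covered;
	}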

Thanks
Smita