[PATCH v6 3/3] acpi,srat: give memory block size advice based on CFMWS alignment

Gregory Price posted 3 patches 2 weeks, 3 days ago
Posted by Gregory Price 2 weeks, 3 days ago
Capacity is stranded when CFMWS regions are not aligned to block size.
On x86, block size increases with capacity (2G blocks @ 64G capacity).

Use CFMWS base/size to report memory block size alignment advice.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
---
 drivers/acpi/numa/srat.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 44f91f2c6c5d..34b6993e7d6c 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -14,6 +14,7 @@
 #include <linux/errno.h>
 #include <linux/acpi.h>
 #include <linux/memblock.h>
+#include <linux/memory.h>
 #include <linux/numa.h>
 #include <linux/nodemask.h>
 #include <linux/topology.h>
@@ -338,13 +339,22 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
 {
 	struct acpi_cedt_cfmws *cfmws;
 	int *fake_pxm = arg;
-	u64 start, end;
+	u64 start, end, align;
 	int node;
 
 	cfmws = (struct acpi_cedt_cfmws *)header;
 	start = cfmws->base_hpa;
 	end = cfmws->base_hpa + cfmws->window_size;
 
+	/* Align memblock size to CFMW regions if possible */
+	align = 1UL << __ffs(start | end);
+	if (align >= SZ_256M) {
+		if (memory_block_advise_max_size(align) < 0)
+			pr_warn("CFMWS: memblock size advise failed\n");
+	} else {
+		pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n");
+	}
+
 	/*
 	 * The SRAT may have already described NUMA details for all,
 	 * or a portion of, this CFMWS HPA range. Extend the memblks
-- 
2.43.0
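
For readers who want to try the alignment expression outside the kernel, here is a minimal userspace sketch. It assumes GCC or Clang, with `__builtin_ctzll()` standing in for the kernel's `__ffs()`; the helper name `cfmws_align()` and the local `SZ_256M` define are illustrative and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_256M 0x10000000ULL

/*
 * Largest power of two that divides both the window base and the
 * window end, i.e. the lowest set bit of (start | end).
 * Hypothetical helper; mirrors "1UL << __ffs(start | end)".
 */
static uint64_t cfmws_align(uint64_t base_hpa, uint64_t window_size)
{
	uint64_t bits = base_hpa | (base_hpa + window_size);

	assert(bits != 0);	/* count-trailing-zeros of 0 is undefined */
	return 1ULL << __builtin_ctzll(bits);
}
```

With a base of 4G and a size of 768M this returns 256M, which passes the `align >= SZ_256M` check. With a base of 4G + 128M it returns 128M, which is the case the patch reports as a BIOS bug.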
Re: [PATCH v6 3/3] acpi,srat: give memory block size advice based on CFMWS alignment
Posted by Dan Williams 1 week, 4 days ago
Gregory Price wrote:
> Capacity is stranded when CFMWS regions are not aligned to block size.
> On x86, block size increases with capacity (2G blocks @ 64G capacity).
> 
> Use CFMWS base/size to report memory block size alignment advice.
> 
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/acpi/numa/srat.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> index 44f91f2c6c5d..34b6993e7d6c 100644
> --- a/drivers/acpi/numa/srat.c
> +++ b/drivers/acpi/numa/srat.c
> @@ -14,6 +14,7 @@
>  #include <linux/errno.h>
>  #include <linux/acpi.h>
>  #include <linux/memblock.h>
> +#include <linux/memory.h>
>  #include <linux/numa.h>
>  #include <linux/nodemask.h>
>  #include <linux/topology.h>
> @@ -338,13 +339,22 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
>  {
>  	struct acpi_cedt_cfmws *cfmws;
>  	int *fake_pxm = arg;
> -	u64 start, end;
> +	u64 start, end, align;
>  	int node;
>  
>  	cfmws = (struct acpi_cedt_cfmws *)header;
>  	start = cfmws->base_hpa;
>  	end = cfmws->base_hpa + cfmws->window_size;
>  
> +	/* Align memblock size to CFMW regions if possible */
> +	align = 1UL << __ffs(start | end);
> +	if (align >= SZ_256M) {
> +		if (memory_block_advise_max_size(align) < 0)
> +			pr_warn("CFMWS: memblock size advise failed\n");

Oh, this made me go back to look at what happens if CFMWS has multiple
alignment suggestions. Should not memory_block_advise_max_size() be
considering the max advice?

    if (memory_block_advised_size) {
        ...    
    } else {
            memory_block_advised_size = max(memory_block_advised_size, size);
    }

For example, if region0 is an x4 region and region1 is an x1 region then
the memory block size should be 1GB, not 256M. I.e. CFMWS alignment
follows CXL hardware decoder alignment of "256M * InterleaveWays".
Re: [PATCH v6 3/3] acpi,srat: give memory block size advice based on CFMWS alignment
Posted by Gregory Price 1 week, 4 days ago
On Tue, Nov 12, 2024 at 01:41:55PM -0800, Dan Williams wrote:
> Gregory Price wrote:
> > Capacity is stranded when CFMWS regions are not aligned to block size.
> > On x86, block size increases with capacity (2G blocks @ 64G capacity).
> > 
> > Use CFMWS base/size to report memory block size alignment advice.
> > 
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> > Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > ---
> >  drivers/acpi/numa/srat.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> > index 44f91f2c6c5d..34b6993e7d6c 100644
> > --- a/drivers/acpi/numa/srat.c
> > +++ b/drivers/acpi/numa/srat.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/errno.h>
> >  #include <linux/acpi.h>
> >  #include <linux/memblock.h>
> > +#include <linux/memory.h>
> >  #include <linux/numa.h>
> >  #include <linux/nodemask.h>
> >  #include <linux/topology.h>
> > @@ -338,13 +339,22 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> >  {
> >  	struct acpi_cedt_cfmws *cfmws;
> >  	int *fake_pxm = arg;
> > -	u64 start, end;
> > +	u64 start, end, align;
> >  	int node;
> >  
> >  	cfmws = (struct acpi_cedt_cfmws *)header;
> >  	start = cfmws->base_hpa;
> >  	end = cfmws->base_hpa + cfmws->window_size;
> >  
> > +	/* Align memblock size to CFMW regions if possible */
> > +	align = 1UL << __ffs(start | end);
> > +	if (align >= SZ_256M) {
> > +		if (memory_block_advise_max_size(align) < 0)
> > +			pr_warn("CFMWS: memblock size advise failed\n");
> 
> Oh, this made me go back to look at what happens if CFMWS has multiple
> alignment suggestions. Should not memory_block_advise_max_size() be
> considering the max advice?
> 
>     if (memory_block_advised_size) {
>         ...    
>     } else {
>             memory_block_advised_size = max(memory_block_advised_size, size);
>     }
> 
> For example, if region0 is an x4 region and region1 is an x1 region then
> the memory block size should be 1GB, not 256M. I.e. CFMWS alignment
> follows CXL hardware decoder alignment of "256M * InterleaveWays".

The advice is a max size, meant to minimize capacity loss due to alignment
truncation.

If CFMW-0 is aligned at 1GB and CFMW-1 is aligned at 256MB, then selecting 1GB
leaves some portion of CFMW-1 unmappable.

So you want min(memory_block_advised_size, size) to ensure the hotplug memblock
size aligns to the *smallest* CFMW (or any other source) alignment.

Unless I'm misunderstanding your feedback here.


I'm not clear on why the interleave data is relevant here - that just tells us
how decoders line up with the memory region described in the CFMW.  The window
still gets chopped up into N memblocks of memory_block_advised_size.
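
A rough way to see the stranded-capacity argument in numbers, as a userspace sketch rather than kernel code. It assumes that only whole, naturally aligned blocks inside a window can be used; `usable_bytes()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of [base, base + size) that can be covered by whole,
 * naturally aligned blocks of the given power-of-two size.
 */
static uint64_t usable_bytes(uint64_t base, uint64_t size, uint64_t block)
{
	uint64_t first, last;

	assert(block && (block & (block - 1)) == 0);	/* power of two */
	first = (base + block - 1) & ~(block - 1);	/* round base up */
	last = (base + size) & ~(block - 1);		/* round end down */
	return last > first ? last - first : 0;
}
```

For a 2G window based at 4G + 256M, 1G blocks cover only 1G of it under this model, while 256M blocks cover the full 2G. That is why the smaller alignment wins.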

~Gregory
Re: [PATCH v6 3/3] acpi,srat: give memory block size advice based on CFMWS alignment
Posted by Dan Williams 1 week, 4 days ago
Gregory Price wrote:
> On Tue, Nov 12, 2024 at 01:41:55PM -0800, Dan Williams wrote:
> > Gregory Price wrote:
> > > Capacity is stranded when CFMWS regions are not aligned to block size.
> > > On x86, block size increases with capacity (2G blocks @ 64G capacity).
> > > 
> > > Use CFMWS base/size to report memory block size alignment advice.
> > > 
> > > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > > Signed-off-by: Gregory Price <gourry@gourry.net>
> > > Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > > Acked-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >  drivers/acpi/numa/srat.c | 12 +++++++++++-
> > >  1 file changed, 11 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> > > index 44f91f2c6c5d..34b6993e7d6c 100644
> > > --- a/drivers/acpi/numa/srat.c
> > > +++ b/drivers/acpi/numa/srat.c
> > > @@ -14,6 +14,7 @@
> > >  #include <linux/errno.h>
> > >  #include <linux/acpi.h>
> > >  #include <linux/memblock.h>
> > > +#include <linux/memory.h>
> > >  #include <linux/numa.h>
> > >  #include <linux/nodemask.h>
> > >  #include <linux/topology.h>
> > > @@ -338,13 +339,22 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> > >  {
> > >  	struct acpi_cedt_cfmws *cfmws;
> > >  	int *fake_pxm = arg;
> > > -	u64 start, end;
> > > +	u64 start, end, align;
> > >  	int node;
> > >  
> > >  	cfmws = (struct acpi_cedt_cfmws *)header;
> > >  	start = cfmws->base_hpa;
> > >  	end = cfmws->base_hpa + cfmws->window_size;
> > >  
> > > +	/* Align memblock size to CFMW regions if possible */
> > > +	align = 1UL << __ffs(start | end);
> > > +	if (align >= SZ_256M) {
> > > +		if (memory_block_advise_max_size(align) < 0)
> > > +			pr_warn("CFMWS: memblock size advise failed\n");
> > 
> > Oh, this made me go back to look at what happens if CFMWS has multiple
> > alignment suggestions. Should not memory_block_advise_max_size() be
> > considering the max advice?
> > 
> >     if (memory_block_advised_size) {
> >         ...    
> >     } else {
> >             memory_block_advised_size = max(memory_block_advised_size, size);
> >     }
> > 
> > For example, if region0 is an x4 region and region1 is an x1 region then
> > the memory block size should be 1GB, not 256M. I.e. CFMWS alignment
> > follows CXL hardware decoder alignment of "256M * InterleaveWays".
> 
> The advice is a max size, meant to minimize capacity loss due to alignment
> truncation.
> 
> If CFMW-0 is aligned at 1GB and CFMW-1 is aligned at 256MB, then selecting 1GB
> leaves some portion of CFMW-1 unmappable.
> 
> So you want min(memory_block_advised_size, size) to ensure the hotplug memblock
> size aligns to the *smallest* CFMW (or any other source) alignment.
> 
> Unless I'm misunderstanding your feedback here.

No, whoops, you didn't misunderstand, I just misread
memory_block_advise_max_size(). It makes sense and the current code looks
good; you can add:

Acked-by: Dan Williams <dan.j.williams@intel.com>

> I'm not clear on why the interleave data is relevant here - that just tells us
> how decoders line up with the memory region described in the CFMW.  The window
> still gets chopped up into N memblocks of memory_block_advised_size.

Yes, the window still gets chopped, but the alignment is meant to follow
256M * InterleaveWays. The algorithm as you have it will pick that
up. So, no concerns from me from the CXL side.
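
To make the outcome of this exchange concrete, here is a small userspace model of the x4 plus x1 example. It assumes the advice helper keeps the smallest power-of-two value it has been given, as agreed above; `advise_max_size()`, `window_align()`, and the `-1` error value are stand-ins for illustration, not the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_256M 0x10000000ULL

static uint64_t advised_size;	/* 0 means "no advice yet" */

/* Hypothetical model: remember the smallest valid advice seen so far. */
static int advise_max_size(uint64_t size)
{
	if (!size || (size & (size - 1)))
		return -1;	/* reject zero and non-powers of two */
	if (!advised_size || size < advised_size)
		advised_size = size;
	return 0;
}

/* Lowest set bit of (start | end), as in the patch. */
static uint64_t window_align(uint64_t base, uint64_t size)
{
	uint64_t bits = base | (base + size);

	assert(bits != 0);
	return 1ULL << __builtin_ctzll(bits);
}
```

Feeding in an x4 window (1G at base 4G) and then an x1 window (256M at base 5G) leaves the advice at 256M. For a window laid out in 3 * 256M units, the same expression also yields 256M, since only the lowest set bit matters.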