[PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
Posted by Jan Beulich 1 year, 7 months ago
SRAT may describe individual nodes using multiple ranges. When they're
adjacent (with or without a gap in between), only the start of the first
such range actually needs accounting for. Furthermore the very first
range's start address doesn't need considering at all, as it's fine to
associate all lower addresses (with no memory) with that same node. For
this to work, the array of ranges needs to be sorted by address - adjust
the logic in acpi_numa_memory_affinity_init() accordingly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
On my Dinar and Rome systems this changes memnodemapsize to a single
page each. They originally used 9 and 130 pages respectively (with the
shift going from 8 to 15 and from 6 to 16), resulting from the lowmem
gaps [A0,FF] and [A0,BF].

This goes on top of "x86/NUMA: correct memnode_shift calculation for
single node system".

--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -127,7 +127,8 @@ static int __init extract_lsb_from_nodes
         epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
         if ( spdx >= epdx )
             continue;
-        bitfield |= spdx;
+        if ( i && (!nodeids || nodeids[i - 1] != nodeids[i]) )
+            bitfield |= spdx;
         if ( !i || !nodeids || nodeids[i - 1] != nodeids[i] )
             nodes_used++;
         if ( epdx > memtop )
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -312,6 +312,7 @@ acpi_numa_memory_affinity_init(const str
 	unsigned pxm;
 	nodeid_t node;
 	unsigned int i;
+	bool next = false;
 
 	if (srat_disabled())
 		return;
@@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
 	       node, pxm, start, end - 1,
 	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
 
-	node_memblk_range[num_node_memblks].start = start;
-	node_memblk_range[num_node_memblks].end = end;
-	memblk_nodeid[num_node_memblks] = node;
+	/* Keep node_memblk_range[] sorted by address. */
+	for (i = 0; i < num_node_memblks; ++i)
+		if (node_memblk_range[i].start > start ||
+		    (node_memblk_range[i].start == start &&
+		     node_memblk_range[i].end > end))
+			break;
+
+	memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
+	        (num_node_memblks - i) * sizeof(*node_memblk_range));
+	node_memblk_range[i].start = start;
+	node_memblk_range[i].end = end;
+
+	memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
+	        (num_node_memblks - i) * sizeof(*memblk_nodeid));
+	memblk_nodeid[i] = node;
+
 	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
-		__set_bit(num_node_memblks, memblk_hotplug);
+		next = true;
 		if (end > mem_hotplug)
 			mem_hotplug = end;
 	}
+	for (; i <= num_node_memblks; ++i) {
+		bool prev = next;
+
+		next = test_bit(i, memblk_hotplug);
+		if (prev)
+			__set_bit(i, memblk_hotplug);
+		else
+			__clear_bit(i, memblk_hotplug);
+	}
+
 	num_node_memblks++;
 }
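
For illustration, here is a minimal, standalone model of the shift
derivation the numa.c hunk feeds into (a sketch only: the names, the raw
PDX-unit numbers and the single-node fallback are simplifications, not
Xen's exact code):

#include <stdint.h>
#include <stdio.h>

typedef unsigned int nodeid_t;
struct node { uint64_t start, end; };   /* end exclusive, PDX units */

/*
 * Model of extract_lsb_from_nodes() with this patch applied.  The real
 * code also copes with a NULL nodeids[] and caps the single-node shift;
 * both are omitted for brevity.  The ranges must be sorted by address,
 * as the srat.c hunk now guarantees.
 */
static unsigned int model_shift(const struct node *nodes,
                                const nodeid_t *nodeids, unsigned int n)
{
    uint64_t bitfield = 0, memtop = 0;
    unsigned int i;

    for ( i = 0; i < n; ++i )
    {
        if ( nodes[i].start >= nodes[i].end )
            continue;                       /* skip empty ranges */
        /*
         * A range merely continuing the previous node adds no boundary
         * needing representation, and the very first start can always
         * be ignored.
         */
        if ( i && nodeids[i - 1] != nodeids[i] )
            bitfield |= nodes[i].start;
        if ( nodes[i].end > memtop )
            memtop = nodes[i].end;
    }

    if ( !bitfield )                        /* single node (or none) */
        return memtop ? 63 - __builtin_clzll(memtop) : 63;

    return __builtin_ctzll(bitfield);       /* lowest set boundary bit */
}

int main(void)
{
    /* Dinar-like layout: one node, lowmem gap [A0,FF] (PDX units). */
    const struct node nodes[] = { { 0, 0xA0 }, { 0x100, 0x8000 } };
    const nodeid_t ids[] = { 0, 0 };

    printf("shift = %u\n", model_shift(nodes, ids, 2));
    return 0;
}

Before the patch the second range's start (0x100) was ORed into the
bitfield as well, limiting the shift to 8; ignoring the same-node
continuation leaves the bitfield empty here and allows shift 15,
matching the 8 -> 15 movement mentioned above (the 0x8000 top-of-memory
value is made up purely for illustration).
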
Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
Posted by Andrew Cooper 1 year, 7 months ago
On 27/09/2022 17:20, Jan Beulich wrote:
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>  	       node, pxm, start, end - 1,
>  	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>  
> -	node_memblk_range[num_node_memblks].start = start;
> -	node_memblk_range[num_node_memblks].end = end;
> -	memblk_nodeid[num_node_memblks] = node;
> +	/* Keep node_memblk_range[] sorted by address. */
> +	for (i = 0; i < num_node_memblks; ++i)
> +		if (node_memblk_range[i].start > start ||
> +		    (node_memblk_range[i].start == start &&
> +		     node_memblk_range[i].end > end))
> +			break;
> +
> +	memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
> +	        (num_node_memblks - i) * sizeof(*node_memblk_range));
> +	node_memblk_range[i].start = start;
> +	node_memblk_range[i].end = end;
> +
> +	memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
> +	        (num_node_memblks - i) * sizeof(*memblk_nodeid));
> +	memblk_nodeid[i] = node;

This is now the 4th example we have of logic wanting a sorted array
(two examples in ARM code which want to switch away from using sort(),
plus the VMX MSR lists).

I was already contemplating doing a small library (static inline, or
perhaps extern inline now we've started using that) to abstract away the
insert/find/delete operations and their decidedly non-trivial pointer
operations.

The secondary purpose was to be able to do some actual unit tests of the
library, so we can be rather better assured of correctness.


For this case, and the two ARM cases, the firmware data is supposed to
be sorted to begin with, so the search-for-insertion loop should look at
the num_node_memblks entry first, because the overwhelmingly common case
is that the end is the correct place to put it.  If not, it should
binary search backwards rather than doing a linear search.

Obviously not work for 4.17, but there's a lot of value in such a library.

~Andrew
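
For concreteness, the suggested search might look like this (a sketch
under assumptions: struct range and the comparison predicate are lifted
from the patch, and the fallback is a plain lower-bound binary search
rather than one literally walking backwards):

#include <stdint.h>

struct range { uint64_t start, end; };

/*
 * Return the index at which (start, end) belongs in an array sorted by
 * (start, end).  Sorted firmware data makes appending the overwhelmingly
 * common case, so try the tail entry before searching.
 */
static unsigned int insert_idx(const struct range *r, unsigned int n,
                               uint64_t start, uint64_t end)
{
    unsigned int lo = 0, hi = n;

    if ( !n || r[n - 1].start < start ||
         (r[n - 1].start == start && r[n - 1].end <= end) )
        return n;                           /* fast path: append */

    while ( lo < hi )                       /* binary search */
    {
        unsigned int mid = lo + (hi - lo) / 2;

        if ( r[mid].start > start ||
             (r[mid].start == start && r[mid].end > end) )
            hi = mid;                       /* r[mid] sorts after us */
        else
            lo = mid + 1;
    }

    return lo;                              /* first entry sorting after */
}
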
Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
Posted by Jan Beulich 1 year, 7 months ago
On 30.09.2022 13:54, Andrew Cooper wrote:
> On 27/09/2022 17:20, Jan Beulich wrote:
>> --- a/xen/arch/x86/srat.c
>> +++ b/xen/arch/x86/srat.c
>> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>>  	       node, pxm, start, end - 1,
>>  	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>>  
>> -	node_memblk_range[num_node_memblks].start = start;
>> -	node_memblk_range[num_node_memblks].end = end;
>> -	memblk_nodeid[num_node_memblks] = node;
>> +	/* Keep node_memblk_range[] sorted by address. */
>> +	for (i = 0; i < num_node_memblks; ++i)
>> +		if (node_memblk_range[i].start > start ||
>> +		    (node_memblk_range[i].start == start &&
>> +		     node_memblk_range[i].end > end))
>> +			break;
>> +
>> +	memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
>> +	        (num_node_memblks - i) * sizeof(*node_memblk_range));
>> +	node_memblk_range[i].start = start;
>> +	node_memblk_range[i].end = end;
>> +
>> +	memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
>> +	        (num_node_memblks - i) * sizeof(*memblk_nodeid));
>> +	memblk_nodeid[i] = node;
> 
> This is now the 4th example we have of logic wanting a sorted array
> (two examples in ARM code which want to switch away from using sort(),
> plus the VMX MSR lists).
> 
> I was already contemplating doing a small library (static inline, or
> perhaps extern inline now we've started using that) to abstract away the
> insert/find/delete operations and their decidedly non-trivial pointer
> operations.

For using such library routines the data structures here would need
re-organizing first: We're inserting into two arrays and a bitmap at
the same time.

> The secondary purpose was to be able to do some actual unit tests of the
> library, so we can be rather better assured of correctness.
> 
> 
> For this case, and the two ARM cases, the firmware data is supposed to
> be sorted to begin with, so the search-for-insertion loop should look at
> the num_node_memblks entry first, because the overwhelmingly common case
> is that the end is the correct place to put it.  If not, it should
> binary search backwards rather than doing a linear search.

Well, yes, perhaps. Of course we don't expect there to be very many
entries, at which point I guess a linear search can be deemed acceptable.

Jan

Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
Posted by Roger Pau Monné 1 year, 7 months ago
On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
> SRAT may describe individual nodes using multiple ranges. When they're
> adjacent (with or without a gap in between), only the start of the first
> such range actually needs accounting for. Furthermore the very first
> range's start address doesn't need considering at all, as it's fine to
> associate all lower addresses (with no memory) with that same node. For
> this to work, the array of ranges needs to be sorted by address - adjust
> the logic in acpi_numa_memory_affinity_init() accordingly.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.

Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
Posted by Roger Pau Monné 1 year, 7 months ago
On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
> SRAT may describe individual nodes using multiple ranges. When they're
> adjacent (with or without a gap in between), only the start of the first
> such range actually needs accounting for. Furthermore the very first
> range's start address doesn't need considering at all, as it's fine to
> associate all lower addresses (with no memory) with that same node. For
> this to work, the array of ranges needs to be sorted by address - adjust
> the logic in acpi_numa_memory_affinity_init() accordingly.

Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
would benefit from a bit of context on why and how memnode_shift is
used.

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> On my Dinar and Rome systems this changes memnodemapsize to a single
> page each. They originally used 9 and 130 pages respectively (with the
> shift going from 8 to 15 and from 6 to 16), resulting from the lowmem
> gaps [A0,FF] and [A0,BF].
> 
> This goes on top of "x86/NUMA: correct memnode_shift calculation for
> single node system".
> 
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -127,7 +127,8 @@ static int __init extract_lsb_from_nodes
>          epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
>          if ( spdx >= epdx )
>              continue;
> -        bitfield |= spdx;
> +        if ( i && (!nodeids || nodeids[i - 1] != nodeids[i]) )
> +            bitfield |= spdx;
>          if ( !i || !nodeids || nodeids[i - 1] != nodeids[i] )
>              nodes_used++;
>          if ( epdx > memtop )
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -312,6 +312,7 @@ acpi_numa_memory_affinity_init(const str
>  	unsigned pxm;
>  	nodeid_t node;
>  	unsigned int i;
> +	bool next = false;
>  
>  	if (srat_disabled())
>  		return;
> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>  	       node, pxm, start, end - 1,
>  	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>  
> -	node_memblk_range[num_node_memblks].start = start;
> -	node_memblk_range[num_node_memblks].end = end;
> -	memblk_nodeid[num_node_memblks] = node;
> +	/* Keep node_memblk_range[] sorted by address. */
> +	for (i = 0; i < num_node_memblks; ++i)
> +		if (node_memblk_range[i].start > start ||
> +		    (node_memblk_range[i].start == start &&

Maybe I'm confused, but won't .start == start mean we have
overlapping ranges?

> +		     node_memblk_range[i].end > end))
> +			break;
> +
> +	memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
> +	        (num_node_memblks - i) * sizeof(*node_memblk_range));
> +	node_memblk_range[i].start = start;
> +	node_memblk_range[i].end = end;
> +
> +	memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
> +	        (num_node_memblks - i) * sizeof(*memblk_nodeid));
> +	memblk_nodeid[i] = node;
> +
>  	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
> -		__set_bit(num_node_memblks, memblk_hotplug);
> +		next = true;
>  		if (end > mem_hotplug)
>  			mem_hotplug = end;
>  	}
> +	for (; i <= num_node_memblks; ++i) {
> +		bool prev = next;
> +
> +		next = test_bit(i, memblk_hotplug);
> +		if (prev)
> +			__set_bit(i, memblk_hotplug);
> +		else
> +			__clear_bit(i, memblk_hotplug);

Nit: I think you could avoid doing the clear for the last bit, ie:
else if (i != num_node_memblks) __clear_bit(...);

But I'm not sure it's worth adding the logic, just makes it more
complicated to follow.

Thanks, Roger.
Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
Posted by Jan Beulich 1 year, 7 months ago
On 30.09.2022 10:25, Roger Pau Monné wrote:
> On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
>> SRAT may describe individual nodes using multiple ranges. When they're
>> adjacent (with or without a gap in between), only the start of the first
>> such range actually needs accounting for. Furthermore the very first
>> range's start address doesn't need considering at all, as it's fine to
>> associate all lower addresses (with no memory) with that same node. For
>> this to work, the array of ranges needs to be sorted by address - adjust
>> the logic in acpi_numa_memory_affinity_init() accordingly.
> 
> Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
> would benefit from a bit of context on why and how memnode_shift is
> used.

It's used solely to shift PDXes right before indexing memnodemap[].
Hence a larger shift allows for a smaller array (less memory and,
perhaps more importantly, less cache footprint). Personally I don't
think this needs mentioning in a patch, but I know others think
differently (George for example seems to believe that the present
situation should always be described). In the case here a simple
grep for memnode_shift would tell you, and even if I described this
here it wouldn't really help review, I think: you'd still want to
verify that what I claim actually matches reality.
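
Concretely, that lookup is essentially the following (simplified from
Xen's phys_to_nid(); the real function additionally ASSERT()s that the
index and the resulting node ID are in range):

static inline nodeid_t phys_to_nid(paddr_t addr)
{
    /*
     * One memnodemap[] entry covers 1 << memnode_shift PDXes, so a
     * larger shift directly shrinks the map and its cache footprint.
     */
    return memnodemap[paddr_to_pdx(addr) >> memnode_shift];
}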

>> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>>  	       node, pxm, start, end - 1,
>>  	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>>  
>> -	node_memblk_range[num_node_memblks].start = start;
>> -	node_memblk_range[num_node_memblks].end = end;
>> -	memblk_nodeid[num_node_memblks] = node;
>> +	/* Keep node_memblk_range[] sorted by address. */
>> +	for (i = 0; i < num_node_memblks; ++i)
>> +		if (node_memblk_range[i].start > start ||
>> +		    (node_memblk_range[i].start == start &&
> 
> Maybe I'm confused, but won't .start == start mean we have
> overlapping ranges?

Yes, except when a range is empty.

>> +		     node_memblk_range[i].end > end))
>> +			break;
>> +
>> +	memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
>> +	        (num_node_memblks - i) * sizeof(*node_memblk_range));
>> +	node_memblk_range[i].start = start;
>> +	node_memblk_range[i].end = end;
>> +
>> +	memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
>> +	        (num_node_memblks - i) * sizeof(*memblk_nodeid));
>> +	memblk_nodeid[i] = node;
>> +
>>  	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
>> -		__set_bit(num_node_memblks, memblk_hotplug);
>> +		next = true;
>>  		if (end > mem_hotplug)
>>  			mem_hotplug = end;
>>  	}
>> +	for (; i <= num_node_memblks; ++i) {
>> +		bool prev = next;
>> +
>> +		next = test_bit(i, memblk_hotplug);
>> +		if (prev)
>> +			__set_bit(i, memblk_hotplug);
>> +		else
>> +			__clear_bit(i, memblk_hotplug);
> 
> Nit: I think you could avoid doing the clear for the last bit, ie:
> else if (i != num_node_memblks) __clear_bit(...);
> 
> But I'm not sure it's worth adding the logic, just makes it more
> complicated to follow.

Indeed. I did consider both this and using a single __change_bit()
wrapped in a suitable if(). Both looked to me like they would hamper
proving the code is doing what it ought to do.
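
For reference, the __change_bit() variant mentioned would have looked
roughly like this (behaviourally equivalent to the posted loop, flipping
only those bits whose old and new values differ):

	for (; i <= num_node_memblks; ++i) {
		bool prev = next;

		next = test_bit(i, memblk_hotplug);
		if (prev != next)
			__change_bit(i, memblk_hotplug);
	}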

Jan

Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
Posted by Roger Pau Monné 1 year, 7 months ago
On Fri, Sep 30, 2022 at 10:36:20AM +0200, Jan Beulich wrote:
> On 30.09.2022 10:25, Roger Pau Monné wrote:
> > On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
> >> SRAT may describe individual nodes using multiple ranges. When they're
> >> adjacent (with or without a gap in between), only the start of the first
> >> such range actually needs accounting for. Furthermore the very first
> >> range's start address doesn't need considering at all, as it's fine to
> >> associate all lower addresses (with no memory) with that same node. For
> >> this to work, the array of ranges needs to be sorted by address - adjust
> >> the logic in acpi_numa_memory_affinity_init() accordingly.
> > 
> > Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
> > would benefit from a bit of context on why and how memnode_shift is
> > used.
> 
> It's used solely to shift PDXes right before indexing memnodemap[].
> Hence a larger shift allows for a smaller array (less memory and,
> perhaps more importantly, less cache footprint). Personally I don't
> think this needs mentioning in a patch, but I know others think
> differently (George for example seems to believe that the present
> situation should always be described). In the case here a simple
> grep for memnode_shift would tell you, and even if I described this
> here it wouldn't really help review, I think: you'd still want to
> verify that what I claim actually matches reality.

Right, that's why I like some context with the patches.  Sometimes (and
I'm not saying it's the case here) the context analysis done by the
patch submitter is wrong, and hence it's easier to detect and point out
if it's part of the commit message.

IMO it also helps when looking at git history.

> >> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
> >>  	       node, pxm, start, end - 1,
> >>  	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
> >>  
> >> -	node_memblk_range[num_node_memblks].start = start;
> >> -	node_memblk_range[num_node_memblks].end = end;
> >> -	memblk_nodeid[num_node_memblks] = node;
> >> +	/* Keep node_memblk_range[] sorted by address. */
> >> +	for (i = 0; i < num_node_memblks; ++i)
> >> +		if (node_memblk_range[i].start > start ||
> >> +		    (node_memblk_range[i].start == start &&
> > 
> > Maybe I'm confused, but won't .start == start mean we have
> > overlapping ranges?
> 
> Yes, except when a range is empty.

Hm, OK.  Won't overlapping ranges be bad if not empty?

Shouldn't Xen complain if it finds overlapping ranges, or is that taken
care of somewhere else?

Thanks, Roger.

Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
Posted by Jan Beulich 1 year, 7 months ago
On 30.09.2022 12:03, Roger Pau Monné wrote:
> On Fri, Sep 30, 2022 at 10:36:20AM +0200, Jan Beulich wrote:
>> On 30.09.2022 10:25, Roger Pau Monné wrote:
>>> On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
>>>> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>>>>  	       node, pxm, start, end - 1,
>>>>  	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>>>>  
>>>> -	node_memblk_range[num_node_memblks].start = start;
>>>> -	node_memblk_range[num_node_memblks].end = end;
>>>> -	memblk_nodeid[num_node_memblks] = node;
>>>> +	/* Keep node_memblk_range[] sorted by address. */
>>>> +	for (i = 0; i < num_node_memblks; ++i)
>>>> +		if (node_memblk_range[i].start > start ||
>>>> +		    (node_memblk_range[i].start == start &&
>>>
>>> Maybe I'm confused, but won't .start == start mean we have
>>> overlapping ranges?
>>
>> Yes, except when a range is empty.
> 
> Hm, OK.  Won't overlapping ranges be bad if not empty?

They are and ...

> Shouldn't Xen complain if it finds overlapping ranges, or is that taken
> care of somewhere else?

... we do, elsewhere. Still I'd like this code to be generic, at the very
least to - as said - deal with empty ranges as well. Special-casing empty
ranges here didn't seem warranted to me, when the code is clearer written
in a more generic manner.

Jan