SRAT may describe individual nodes using multiple ranges. When they're
adjacent (with or without a gap in between), only the start of the first
such range actually needs accounting for. Furthermore the very first
range doesn't need considering of its start address at all, as it's fine
to associate all lower addresses (with no memory) with that same node.
For this to work, the array of ranges needs to be sorted by address -
adjust logic accordingly in acpi_numa_memory_affinity_init().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
On my Dinar and Rome systems this changes memnodemapsize to a single
page. Originally they used 9 / 130 pages (with shifts going from 8 / 6
to 15 / 16) respectively, resulting from lowmem gaps [A0,FF] / [A0,BF].
This goes on top of "x86/NUMA: correct memnode_shift calculation for
single node system".
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -127,7 +127,8 @@ static int __init extract_lsb_from_nodes
epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
if ( spdx >= epdx )
continue;
- bitfield |= spdx;
+ if ( i && (!nodeids || nodeids[i - 1] != nodeids[i]) )
+ bitfield |= spdx;
if ( !i || !nodeids || nodeids[i - 1] != nodeids[i] )
nodes_used++;
if ( epdx > memtop )
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -312,6 +312,7 @@ acpi_numa_memory_affinity_init(const str
unsigned pxm;
nodeid_t node;
unsigned int i;
+ bool next = false;
if (srat_disabled())
return;
@@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
node, pxm, start, end - 1,
ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
- node_memblk_range[num_node_memblks].start = start;
- node_memblk_range[num_node_memblks].end = end;
- memblk_nodeid[num_node_memblks] = node;
+ /* Keep node_memblk_range[] sorted by address. */
+ for (i = 0; i < num_node_memblks; ++i)
+ if (node_memblk_range[i].start > start ||
+ (node_memblk_range[i].start == start &&
+ node_memblk_range[i].end > end))
+ break;
+
+ memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
+ (num_node_memblks - i) * sizeof(*node_memblk_range));
+ node_memblk_range[i].start = start;
+ node_memblk_range[i].end = end;
+
+ memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
+ (num_node_memblks - i) * sizeof(*memblk_nodeid));
+ memblk_nodeid[i] = node;
+
if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
- __set_bit(num_node_memblks, memblk_hotplug);
+ next = true;
if (end > mem_hotplug)
mem_hotplug = end;
}
+ for (; i <= num_node_memblks; ++i) {
+ bool prev = next;
+
+ next = test_bit(i, memblk_hotplug);
+ if (prev)
+ __set_bit(i, memblk_hotplug);
+ else
+ __clear_bit(i, memblk_hotplug);
+ }
+
num_node_memblks++;
}
On 27/09/2022 17:20, Jan Beulich wrote:
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>  		       node, pxm, start, end - 1,
>  		       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>
> -	node_memblk_range[num_node_memblks].start = start;
> -	node_memblk_range[num_node_memblks].end = end;
> -	memblk_nodeid[num_node_memblks] = node;
> +	/* Keep node_memblk_range[] sorted by address. */
> +	for (i = 0; i < num_node_memblks; ++i)
> +		if (node_memblk_range[i].start > start ||
> +		    (node_memblk_range[i].start == start &&
> +		     node_memblk_range[i].end > end))
> +			break;
> +
> +	memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
> +		(num_node_memblks - i) * sizeof(*node_memblk_range));
> +	node_memblk_range[i].start = start;
> +	node_memblk_range[i].end = end;
> +
> +	memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
> +		(num_node_memblks - i) * sizeof(*memblk_nodeid));
> +	memblk_nodeid[i] = node;

This is now the 4th example we have of logic wanting a sorted array (two
examples in ARM code, which want to switch away from using sort(), and
the VMX MSR lists).

I was already contemplating doing a small library (static inline, or
perhaps extern inline now we've started using that) to abstract away the
insert/find/delete operations and their decidedly non-trivial pointer
operations.

The secondary purpose was to be able to do some actual unit tests of the
library, so we can be rather better assured of correctness.

For this case, and the two ARM cases, the firmware data is supposed to
be sorted to begin with, so the search-for-insertion loop should look at
the num_node_memblks entry first, because the overwhelmingly common case
is that the end is the correct place to put it.  If not, it should
binary search backwards rather than doing a linear search.

Obviously not work for 4.17, but there's a lot of value in such a
library.

~Andrew
On 30.09.2022 13:54, Andrew Cooper wrote:
> On 27/09/2022 17:20, Jan Beulich wrote:
>> --- a/xen/arch/x86/srat.c
>> +++ b/xen/arch/x86/srat.c
>> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>>  		       node, pxm, start, end - 1,
>>  		       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>>
>> -	node_memblk_range[num_node_memblks].start = start;
>> -	node_memblk_range[num_node_memblks].end = end;
>> -	memblk_nodeid[num_node_memblks] = node;
>> +	/* Keep node_memblk_range[] sorted by address. */
>> +	for (i = 0; i < num_node_memblks; ++i)
>> +		if (node_memblk_range[i].start > start ||
>> +		    (node_memblk_range[i].start == start &&
>> +		     node_memblk_range[i].end > end))
>> +			break;
>> +
>> +	memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
>> +		(num_node_memblks - i) * sizeof(*node_memblk_range));
>> +	node_memblk_range[i].start = start;
>> +	node_memblk_range[i].end = end;
>> +
>> +	memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
>> +		(num_node_memblks - i) * sizeof(*memblk_nodeid));
>> +	memblk_nodeid[i] = node;
>
> This is now the 4th example we have of logic wanting a sorted array
> (two examples in ARM code which want to switch away from using sort(),
> and the VMX MSR lists).
>
> I was already contemplating doing a small library (static inline, or
> perhaps extern inline now we've started using that) to abstract away the
> insert/find/delete operations and their decidedly non-trivial pointer
> operations.

For using such library routines the data structures here would need
re-organizing first: We're inserting into two arrays and a bitmap at the
same time.

> The secondary purpose was to be able to do some actual unit tests of the
> library, so we can be rather better assured of correctness.
>
> For this case, and the two ARM cases, the firmware data is supposed to
> be sorted to begin with, so the search-for-insertion loop should look at
> the num_node_memblks entry first, because the overwhelmingly common case
> is that the end is the correct place to put it.  If not, it should
> binary search backwards rather than doing a linear search.

Well, yes, perhaps. Of course we don't expect there to be very many
entries, at which point I guess a linear search can be deemed
acceptable.

Jan
On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
> SRAT may describe individual nodes using multiple ranges. When they're
> adjacent (with or without a gap in between), only the start of the first
> such range actually needs accounting for. Furthermore the very first
> range doesn't need considering of its start address at all, as it's fine
> to associate all lower addresses (with no memory) with that same node.
> For this to work, the array of ranges needs to be sorted by address -
> adjust logic accordingly in acpi_numa_memory_affinity_init().
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.
On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
> SRAT may describe individual nodes using multiple ranges. When they're
> adjacent (with or without a gap in between), only the start of the first
> such range actually needs accounting for. Furthermore the very first
> range doesn't need considering of its start address at all, as it's fine
> to associate all lower addresses (with no memory) with that same node.
> For this to work, the array of ranges needs to be sorted by address -
> adjust logic accordingly in acpi_numa_memory_affinity_init().

Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
would benefit from a bit of context on why and how memnode_shift is
used.

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> On my Dinar and Rome systems this changes memnodemapsize to a single
> page. Originally they used 9 / 130 pages (with shifts going from 8 / 6
> to 15 / 16) respectively, resulting from lowmem gaps [A0,FF] / [A0,BF].
>
> This goes on top of "x86/NUMA: correct memnode_shift calculation for
> single node system".
>
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -127,7 +127,8 @@ static int __init extract_lsb_from_nodes
>          epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
>          if ( spdx >= epdx )
>              continue;
> -        bitfield |= spdx;
> +        if ( i && (!nodeids || nodeids[i - 1] != nodeids[i]) )
> +            bitfield |= spdx;
>          if ( !i || !nodeids || nodeids[i - 1] != nodeids[i] )
>              nodes_used++;
>          if ( epdx > memtop )
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -312,6 +312,7 @@ acpi_numa_memory_affinity_init(const str
>  	unsigned pxm;
>  	nodeid_t node;
>  	unsigned int i;
> +	bool next = false;
>
>  	if (srat_disabled())
>  		return;
> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>  		       node, pxm, start, end - 1,
>  		       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>
> -	node_memblk_range[num_node_memblks].start = start;
> -	node_memblk_range[num_node_memblks].end = end;
> -	memblk_nodeid[num_node_memblks] = node;
> +	/* Keep node_memblk_range[] sorted by address. */
> +	for (i = 0; i < num_node_memblks; ++i)
> +		if (node_memblk_range[i].start > start ||
> +		    (node_memblk_range[i].start == start &&

Maybe I'm confused, but won't .start == start mean we have overlapping
ranges?

> +		     node_memblk_range[i].end > end))
> +			break;
> +
> +	memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
> +		(num_node_memblks - i) * sizeof(*node_memblk_range));
> +	node_memblk_range[i].start = start;
> +	node_memblk_range[i].end = end;
> +
> +	memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
> +		(num_node_memblks - i) * sizeof(*memblk_nodeid));
> +	memblk_nodeid[i] = node;
> +
>  	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
> -		__set_bit(num_node_memblks, memblk_hotplug);
> +		next = true;
>  		if (end > mem_hotplug)
>  			mem_hotplug = end;
>  	}
> +	for (; i <= num_node_memblks; ++i) {
> +		bool prev = next;
> +
> +		next = test_bit(i, memblk_hotplug);
> +		if (prev)
> +			__set_bit(i, memblk_hotplug);
> +		else
> +			__clear_bit(i, memblk_hotplug);

Nit: I think you could avoid doing the clear for the last bit, ie:

    else if (i != num_node_memblks)
        __clear_bit(...);

But I'm not sure it's worth adding the logic, it just makes it more
complicated to follow.

Thanks, Roger.
On 30.09.2022 10:25, Roger Pau Monné wrote:
> On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
>> SRAT may describe individual nodes using multiple ranges. When they're
>> adjacent (with or without a gap in between), only the start of the first
>> such range actually needs accounting for. Furthermore the very first
>> range doesn't need considering of its start address at all, as it's fine
>> to associate all lower addresses (with no memory) with that same node.
>> For this to work, the array of ranges needs to be sorted by address -
>> adjust logic accordingly in acpi_numa_memory_affinity_init().
>
> Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
> would benefit from a bit of context on why and how memnode_shift is
> used.

It's used solely to shift PDXes right before indexing memnodemap[].
Hence a larger shift allows for a smaller array (less memory and,
perhaps more importantly, less cache footprint). Personally I don't
think such needs mentioning in a patch, but I know others think
differently (George for example looks to believe that the present
situation should always be described). In the case here a simple
grep for memnode_shift would tell you, and even if I described this
here it wouldn't really help review I think: You'd still want to
verify then that what I claim actually matches reality.

>> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>>  		       node, pxm, start, end - 1,
>>  		       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>>
>> -	node_memblk_range[num_node_memblks].start = start;
>> -	node_memblk_range[num_node_memblks].end = end;
>> -	memblk_nodeid[num_node_memblks] = node;
>> +	/* Keep node_memblk_range[] sorted by address. */
>> +	for (i = 0; i < num_node_memblks; ++i)
>> +		if (node_memblk_range[i].start > start ||
>> +		    (node_memblk_range[i].start == start &&
>
> Maybe I'm confused, but won't .start == start mean we have
> overlapping ranges?

Yes, except when a range is empty.

>> +		     node_memblk_range[i].end > end))
>> +			break;
>> +
>> +	memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
>> +		(num_node_memblks - i) * sizeof(*node_memblk_range));
>> +	node_memblk_range[i].start = start;
>> +	node_memblk_range[i].end = end;
>> +
>> +	memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
>> +		(num_node_memblks - i) * sizeof(*memblk_nodeid));
>> +	memblk_nodeid[i] = node;
>> +
>>  	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
>> -		__set_bit(num_node_memblks, memblk_hotplug);
>> +		next = true;
>>  		if (end > mem_hotplug)
>>  			mem_hotplug = end;
>>  	}
>> +	for (; i <= num_node_memblks; ++i) {
>> +		bool prev = next;
>> +
>> +		next = test_bit(i, memblk_hotplug);
>> +		if (prev)
>> +			__set_bit(i, memblk_hotplug);
>> +		else
>> +			__clear_bit(i, memblk_hotplug);
>
> Nit: I think you could avoid doing the clear for the last bit, ie:
> else if (i != num_node_memblks) __clear_bit(...);
>
> But I'm not sure it's worth adding the logic, it just makes it more
> complicated to follow.

Indeed. I did consider both this and using a single __change_bit()
wrapped in a suitable if(). Both looked to me like they would hamper
proving the code is doing what it ought to do.

Jan
On Fri, Sep 30, 2022 at 10:36:20AM +0200, Jan Beulich wrote:
> On 30.09.2022 10:25, Roger Pau Monné wrote:
> > On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
> >> SRAT may describe individual nodes using multiple ranges. When they're
> >> adjacent (with or without a gap in between), only the start of the first
> >> such range actually needs accounting for. Furthermore the very first
> >> range doesn't need considering of its start address at all, as it's fine
> >> to associate all lower addresses (with no memory) with that same node.
> >> For this to work, the array of ranges needs to be sorted by address -
> >> adjust logic accordingly in acpi_numa_memory_affinity_init().
> >
> > Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
> > would benefit from a bit of context on why and how memnode_shift is
> > used.
>
> It's used solely to shift PDXes right before indexing memnodemap[].
> Hence a larger shift allows for a smaller array (less memory and,
> perhaps more importantly, less cache footprint). Personally I don't
> think such needs mentioning in a patch, but I know others think
> differently (George for example looks to believe that the present
> situation should always be described). In the case here a simple
> grep for memnode_shift would tell you, and even if I described this
> here it wouldn't really help review I think: You'd still want to
> verify then that what I claim actually matches reality.

Right, that's why I like some context with the patches. Sometimes (and
I'm not saying it's the case here) the context analysis done by the
patch submitter is wrong, and hence it's easier to detect and point out
if it's part of the commit message. IMO it also helps when looking at
git history.

> > On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
> >> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
> >>  		       node, pxm, start, end - 1,
> >>  		       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
> >>
> >> -	node_memblk_range[num_node_memblks].start = start;
> >> -	node_memblk_range[num_node_memblks].end = end;
> >> -	memblk_nodeid[num_node_memblks] = node;
> >> +	/* Keep node_memblk_range[] sorted by address. */
> >> +	for (i = 0; i < num_node_memblks; ++i)
> >> +		if (node_memblk_range[i].start > start ||
> >> +		    (node_memblk_range[i].start == start &&
> >
> > Maybe I'm confused, but won't .start == start mean we have
> > overlapping ranges?
>
> Yes, except when a range is empty.

Hm, OK. Won't overlapping ranges be bad if not empty? Shouldn't Xen
complain if it finds overlapping ranges, or is that taken care of
somewhere else?

Thanks, Roger.
On 30.09.2022 12:03, Roger Pau Monné wrote:
> On Fri, Sep 30, 2022 at 10:36:20AM +0200, Jan Beulich wrote:
>> On 30.09.2022 10:25, Roger Pau Monné wrote:
>>> On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
>>>> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>>>>  		       node, pxm, start, end - 1,
>>>>  		       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>>>>
>>>> -	node_memblk_range[num_node_memblks].start = start;
>>>> -	node_memblk_range[num_node_memblks].end = end;
>>>> -	memblk_nodeid[num_node_memblks] = node;
>>>> +	/* Keep node_memblk_range[] sorted by address. */
>>>> +	for (i = 0; i < num_node_memblks; ++i)
>>>> +		if (node_memblk_range[i].start > start ||
>>>> +		    (node_memblk_range[i].start == start &&
>>>
>>> Maybe I'm confused, but won't .start == start mean we have
>>> overlapping ranges?
>>
>> Yes, except when a range is empty.
>
> Hm, OK. Won't overlapping ranges be bad if not empty?

They are and ...

> Shouldn't Xen complain if it finds overlapping ranges, or is that taken
> care of somewhere else?

... we do, elsewhere. Still I'd like this code to be generic, at the
very least to - as said - deal with empty ranges as well. It didn't
seem warranted to me to special-case empty ranges here, when the code is
clearer written in a more generic manner.

Jan