[PATCH] x86/NUMA: correct off-by-1 in node map size calculation

Posted by Jan Beulich 1 year, 7 months ago
extract_lsb_from_nodes() accumulates "memtop" from all PDXes one past
the covered ranges. Hence the maximum address which can validly be used
to index the node map is one below this value, and we may currently set
up a node map with an unused (and never initialized) trailing entry. In
boundary cases this may also mean we dynamically allocate a page when
the static (64-entry) map would suffice.
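
To make the arithmetic concrete, here is a minimal standalone sketch of
the two formulas (node layout invented purely for illustration, with
__builtin_ctzl standing in for find_first_bit(); this is not Xen code):

#include <stdio.h>

int main(void)
{
    /* Two adjacent nodes: PDXes [0x0000,0x7fff] and [0x8000,0xffff]. */
    unsigned long starts[] = { 0x0000, 0x8000 };
    unsigned long ends[]   = { 0x8000, 0x10000 };  /* one past the range */
    unsigned long bitfield = 0, memtop = 0;
    unsigned int i, shift;

    for ( i = 0; i < 2; i++ )
    {
        bitfield |= starts[i];
        if ( ends[i] > memtop )
            memtop = ends[i];
    }
    shift = __builtin_ctzl(bitfield);              /* lowest set bit: 15 */

    printf("old size: %lu\n", (memtop >> shift) + 1);       /* 3 entries */
    printf("new size: %lu\n", ((memtop - 1) >> shift) + 1); /* 2 entries */
    return 0;
}

The third entry produced by the old formula can never be indexed. When
memtop is exactly 64 << shift, the same off-by-1 turns a size of 64
(which fits the static map) into 65 (forcing a dynamic allocation).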

While there also correct the comment ahead of the function, for it to
match the actual code: Linux commit 54413927f022 ("x86-64:
x86_64-make-the-numa-hash-function-nodemap-allocation fix fix") removed
the ORing in of the end address before we actually cloned their code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Really the shift value may end up needlessly small when there's
discontiguous memory. Within a gap, any address can be taken for the
node boundary, and hence neither the end of the lower range nor the
start of the higher range necessarily is the best address to use. For
example with these two node ranges (numbers are frame addresses)

[10000,17fff]
[28000,2ffff]

we'd calculate the shift as 15 when 16 or even 17 (because the start of
the 1st range can also be ignored) would do. I haven't tried to properly
prove it yet, but it looks to me as if the top bit of the XOR of lower
range (inclusive) end and higher range start is what would want
accumulating (of course requiring the entries to be sorted, or to be
processed in address order). This would then "naturally" exclude lowest
range start and highest range end.
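
A rough sketch of how such an accumulation might look (unproven, as said
above; struct and function names are invented for the sketch, ranges are
assumed sorted by address, non-overlapping, and with exclusive ends):

#include <limits.h>

struct range { unsigned long start, end; };    /* end is exclusive */

unsigned int candidate_shift(const struct range *r, unsigned int n)
{
    unsigned int shift = sizeof(unsigned long) * CHAR_BIT - 1, i;

    for ( i = 0; i + 1 < n; i++ )
    {
        /*
         * Highest bit in which the lower range's inclusive end and the
         * higher range's start differ: chunks of that size (or any
         * smaller power of two) already separate the two ranges.
         */
        unsigned long x = (r[i].end - 1) ^ r[i + 1].start;
        unsigned int msb = sizeof(unsigned long) * CHAR_BIT - 1
                           - __builtin_clzl(x);

        if ( msb < shift )
            shift = msb;
    }
    return shift;
}

For the two example ranges above this yields 17, matching the observation
that even 17 would do.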

--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -110,7 +110,7 @@ static int __init allocate_cachealigned_
 }
 
 /*
- * The LSB of all start and end addresses in the node map is the value of the
+ * The LSB of all start addresses in the node map is the value of the
  * maximum possible shift.
  */
 static int __init extract_lsb_from_nodes(const struct node *nodes,
@@ -135,7 +135,7 @@ static int __init extract_lsb_from_nodes
         i = BITS_PER_LONG - 1;
     else
         i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
-    memnodemapsize = (memtop >> i) + 1;
+    memnodemapsize = ((memtop - 1) >> i) + 1;
     return i;
 }
RE: [PATCH] x86/NUMA: correct off-by-1 in node map size calculation
Posted by Wei Chen 1 year, 7 months ago
Hi Jan,

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan
> Beulich
> Sent: 27 September 2022 22:14
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Wei Liu <wl@xen.org>; Roger
> Pau Monné <roger.pau@citrix.com>
> Subject: [PATCH] x86/NUMA: correct off-by-1 in node map size calculation
> 
> extract_lsb_from_nodes() accumulates "memtop" from all PDXes one past
> the covered ranges. Hence the maximum address which can validly be used
> to index the node map is one below this value, and we may currently set
> up a node map with an unused (and never initialized) trailing entry. In
> boundary cases this may also mean we dynamically allocate a page when
> the static (64-entry) map would suffice.
> 
> While there also correct the comment ahead of the function, for it to
> match the actual code: Linux commit 54413927f022 ("x86-64:
> x86_64-make-the-numa-hash-function-nodemap-allocation fix fix") removed
> the ORing in of the end address before we actually cloned their code.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Really the shift value may end up needlessly small when there's
> discontiguous memory. Within a gap, any address can be taken for the
> node boundary, and hence neither the end of the lower range nor the
> start of the higher range necessarily is the best address to use. For
> example with these two node ranges (numbers are frame addresses)
> 
> [10000,17fff]
> [28000,2ffff]
> 
> we'd calculate the shift as 15 when 16 or even 17 (because the start of
> the 1st range can also be ignored) would do. I haven't tried to properly
> prove it yet, but it looks to me as if the top bit of the XOR of lower
> range (inclusive) end and higher range start is what would want
> accumulating (of course requiring the entries to be sorted, or to be
> processed in address order). This would then "naturally" exclude lowest
> range start and highest range end.
> 
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -110,7 +110,7 @@ static int __init allocate_cachealigned_
>  }
> 
>  /*
> - * The LSB of all start and end addresses in the node map is the value of the
> + * The LSB of all start addresses in the node map is the value of the
>   * maximum possible shift.
>   */
>  static int __init extract_lsb_from_nodes(const struct node *nodes,
> @@ -135,7 +135,7 @@ static int __init extract_lsb_from_nodes
>          i = BITS_PER_LONG - 1;
>      else
>          i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
> -    memnodemapsize = (memtop >> i) + 1;
> +    memnodemapsize = ((memtop - 1) >> i) + 1;
>      return i;
>  }
> 

Thanks for this fix.

Reviewed-by: Wei Chen <Wei.Chen@arm.com>

Re: [PATCH] x86/NUMA: correct off-by-1 in node map size calculation
Posted by Roger Pau Monné 1 year, 7 months ago
On Tue, Sep 27, 2022 at 04:14:21PM +0200, Jan Beulich wrote:
> extract_lsb_from_nodes() accumulates "memtop" from all PDXes one past
> the covered ranges. Hence the maximum address which can validly be used
> to index the node map is one below this value, and we may currently set
> up a node map with an unused (and never initialized) trailing entry. In
> boundary cases this may also mean we dynamically allocate a page when
> the static (64-entry) map would suffice.
> 
> While there also correct the comment ahead of the function, for it to
> match the actual code: Linux commit 54413927f022 ("x86-64:
> x86_64-make-the-numa-hash-function-nodemap-allocation fix fix") removed
> the ORing in of the end address before we actually cloned their code.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> Really the shift value may end up needlessly small when there's
> discontiguous memory. Within a gap, any address can be taken for the
> node boundary, and hence neither the end of the lower range nor the
> start of the higher range necessarily is the best address to use. For
> example with these two node ranges (numbers are frame addresses)
> 
> [10000,17fff]
> [28000,2ffff]
> 
> we'd calculate the shift as 15 when 16 or even 17 (because the start of
> the 1st range can also be ignored) would do. I haven't tried to properly
> prove it yet, but it looks to me as if the top bit of the XOR of lower
> range (inclusive) end and higher range start is what would want
> accumulating (of course requiring the entries to be sorted, or to be
> processed in address order). This would then "naturally" exclude lowest
> range start and highest range end.

I'm not familiar with the logic in the NUMA code; this seems like a
possible optimization.  It might be good to mention in which way a
bigger shift is beneficial.

Thanks, Roger.