[PATCH 0/1] mm/fake-numa: allow later numa node hotplug
Posted by Bruno Faccini 1 year, 1 month ago
When trying to use the fake-NUMA feature on our system, where new
NUMA nodes are hot-plugged upon driver load, it fails with the
following type of message and warning with stack trace:

node 8 was absent from the node_possible_map
WARNING: CPU: 61 PID: 4259 at mm/memory_hotplug.c:1506
add_memory_resource+0x3dc/0x418

This issue prevents use of the fake-NUMA debug feature with the
system's full configuration, although it has proven extremely
useful for performance testing of multi-tasked, memory-bound
applications: it enables better isolation of processes/ranks than
fat NUMA nodes do.

Usual numactl output after the driver has hot-plugged/unveiled some
new NUMA nodes, with and without memory:
$ numactl --hardware
available: 9 nodes (0-8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71
node 0 size: 490037 MB
node 0 free: 484432 MB
node 1 cpus:
node 1 size: 97280 MB
node 1 free: 97279 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus:
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus:
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus:
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus:
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus:
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus:
node 8 size: 0 MB
node 8 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7   8
  0:  10  80  80  80  80  80  80  80  80
  1:  80  10  255  255  255  255  255  255  255
  2:  80  255  10  255  255  255  255  255  255
  3:  80  255  255  10  255  255  255  255  255
  4:  80  255  255  255  10  255  255  255  255
  5:  80  255  255  255  255  10  255  255  255
  6:  80  255  255  255  255  255  10  255  255
  7:  80  255  255  255  255  255  255  10  255
  8:  80  255  255  255  255  255  255  255  10


With the recent M. Rapoport set of fake-NUMA patches in mm-everything
and the numa=fake=4 boot parameter:
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71
node 0 size: 122518 MB
node 0 free: 117141 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71
node 1 size: 219911 MB
node 1 free: 219751 MB
node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71
node 2 size: 122599 MB
node 2 free: 122541 MB
node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71
node 3 size: 122479 MB
node 3 free: 122408 MB
node distances:
node   0   1   2   3
  0:  10  10  10  10
  1:  10  10  10  10
  2:  10  10  10  10
  3:  10  10  10  10


With the recent M. Rapoport set of fake-NUMA patches in mm-everything,
this patch on top, and the numa=fake=4 boot parameter:
# numactl --hardware
available: 12 nodes (0-11)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71
node 0 size: 122518 MB
node 0 free: 116429 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71
node 1 size: 122631 MB
node 1 free: 122576 MB
node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71
node 2 size: 122599 MB
node 2 free: 122544 MB
node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71
node 3 size: 122479 MB
node 3 free: 122419 MB
node 4 cpus:
node 4 size: 97280 MB
node 4 free: 97279 MB
node 5 cpus:
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus:
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus:
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus:
node 8 size: 0 MB
node 8 free: 0 MB
node 9 cpus:
node 9 size: 0 MB
node 9 free: 0 MB
node 10 cpus:
node 10 size: 0 MB
node 10 free: 0 MB
node 11 cpus:
node 11 size: 0 MB
node 11 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11
  0:  10  10  10  10  80  80  80  80  80  80  80  80
  1:  10  10  10  10  80  80  80  80  80  80  80  80
  2:  10  10  10  10  80  80  80  80  80  80  80  80
  3:  10  10  10  10  80  80  80  80  80  80  80  80
  4:  80  80  80  80  10  255  255  255  255  255  255  255
  5:  80  80  80  80  255  10  255  255  255  255  255  255
  6:  80  80  80  80  255  255  10  255  255  255  255  255
  7:  80  80  80  80  255  255  255  10  255  255  255  255
  8:  80  80  80  80  255  255  255  255  10  255  255  255
  9:  80  80  80  80  255  255  255  255  255  10  255  255
 10:  80  80  80  80  255  255  255  255  255  255  10  255
 11:  80  80  80  80  255  255  255  255  255  255  255  10


Bruno Faccini (1):
  mm/fake-numa: allow later numa node hotplug

 drivers/acpi/numa/srat.c     | 86 ++++++++++++++++++++++++++++++++++++
 include/acpi/acpi_numa.h     |  5 +++
 include/linux/numa_memblks.h |  3 ++
 mm/numa_emulation.c          | 45 ++++++++++++++++---
 mm/numa_memblks.c            |  2 +-
 5 files changed, 133 insertions(+), 8 deletions(-)


base-commit: 4c10320ffbe7d6273b153b112a6e5f9b61ac008a
-- 
2.43.0

Re: [PATCH 0/1] mm/fake-numa: allow later numa node hotplug
Posted by David Hildenbrand 1 year, 1 month ago
Hi,

> 
> With the recent M. Rapoport set of fake-NUMA patches in mm-everything
> and the numa=fake=4 boot parameter:
> $ numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
> 65 66 67 68 69 70 71
> node 0 size: 122518 MB
> node 0 free: 117141 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
> 65 66 67 68 69 70 71
> node 1 size: 219911 MB
> node 1 free: 219751 MB
> node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
> 65 66 67 68 69 70 71
> node 2 size: 122599 MB
> node 2 free: 122541 MB
> node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
> 65 66 67 68 69 70 71
> node 3 size: 122479 MB
> node 3 free: 122408 MB

Why are all CPUs indicated as belonging to all nodes? Is that expected 
or a BUG?

I would have thought, just like memory, that one resource only belongs 
to one NUMA node.


> 
> With the recent M. Rapoport set of fake-NUMA patches in mm-everything,
> this patch on top, and the numa=fake=4 boot parameter:
> # numactl --hardware
> available: 12 nodes (0-11)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
> 65 66 67 68 69 70 71
> node 0 size: 122518 MB
> node 0 free: 116429 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
> 65 66 67 68 69 70 71
> node 1 size: 122631 MB
> node 1 free: 122576 MB
> node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
> 65 66 67 68 69 70 71
> node 2 size: 122599 MB
> node 2 free: 122544 MB
> node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
> 65 66 67 68 69 70 71
> node 3 size: 122479 MB
> node 3 free: 122419 MB
> node 4 cpus:
> node 4 size: 97280 MB
> node 4 free: 97279 MB


^ Is this where your driver hotplugged a single node and hotplugged memory?


-- 
Cheers,

David / dhildenb

Re: [PATCH 0/1] mm/fake-numa: allow later numa node hotplug
Posted by Bruno Faccini 1 year, 1 month ago
Hello David,

On 07/01/2025 at 11:08, David Hildenbrand wrote:
> 
> Hi,
> 
>>
>> With the recent M. Rapoport set of fake-NUMA patches in mm-everything
>> and the numa=fake=4 boot parameter:
>> $ numactl --hardware
>> available: 4 nodes (0-3)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>> 65 66 67 68 69 70 71
>> node 0 size: 122518 MB
>> node 0 free: 117141 MB
>> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>> 65 66 67 68 69 70 71
>> node 1 size: 219911 MB
>> node 1 free: 219751 MB
>> node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>> 65 66 67 68 69 70 71
>> node 2 size: 122599 MB
>> node 2 free: 122541 MB
>> node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>> 65 66 67 68 69 70 71
>> node 3 size: 122479 MB
>> node 3 free: 122408 MB
> 
> Why are all CPUs indicated as belonging to all nodes? Is that expected
> or a BUG?

This behaviour comes from the original fake-NUMA implementation, and
has been left as is by M. Rapoport's recent fake-NUMA changes.

> 
> I would have thought, just like memory, that one resource only belongs
> to one NUMA node.
"All fake-NUMA nodes that belong to a physical NUMA node share the
same CPU cores"; this was already the case in the original, x86-only
implementation, so that fake-NUMA does not affect application launch
commands.

> 
> 
>>
>> With the recent M. Rapoport set of fake-NUMA patches in mm-everything,
>> this patch on top, and the numa=fake=4 boot parameter:
>> # numactl --hardware
>> available: 12 nodes (0-11)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>> 65 66 67 68 69 70 71
>> node 0 size: 122518 MB
>> node 0 free: 116429 MB
>> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>> 65 66 67 68 69 70 71
>> node 1 size: 122631 MB
>> node 1 free: 122576 MB
>> node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>> 65 66 67 68 69 70 71
>> node 2 size: 122599 MB
>> node 2 free: 122544 MB
>> node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>> 65 66 67 68 69 70 71
>> node 3 size: 122479 MB
>> node 3 free: 122419 MB
>> node 4 cpus:
>> node 4 size: 97280 MB
>> node 4 free: 97279 MB
> 
> 
> ^ Is this where your driver hotplugged a single node and hotplugged memory?
Yes, Node 4 is a GPU node and its memory has been hotplugged by the driver.

> 
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
Thanks for your review and comments/questions, bye,
Bruno

Re: [PATCH 0/1] mm/fake-numa: allow later numa node hotplug
Posted by David Hildenbrand 1 year ago
>>
>> I would have thought, just like memory, that one resource only belongs
>> to one NUMA node.
> "All fake-NUMA nodes that belong to a physical NUMA node share the same
> CPU cores", this was already the case in original/x86-only
> implementation so that fake-NUMA does not affect application launch
> commands.

Thanks! Interesting; a bit unexpected :)

> 
>>
>>
>>>
>>> With the recent M. Rapoport set of fake-NUMA patches in mm-everything,
>>> this patch on top, and the numa=fake=4 boot parameter:
>>> # numactl --hardware
>>> available: 12 nodes (0-11)
>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>>> 65 66 67 68 69 70 71
>>> node 0 size: 122518 MB
>>> node 0 free: 116429 MB
>>> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>>> 65 66 67 68 69 70 71
>>> node 1 size: 122631 MB
>>> node 1 free: 122576 MB
>>> node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>>> 65 66 67 68 69 70 71
>>> node 2 size: 122599 MB
>>> node 2 free: 122544 MB
>>> node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
>>> 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
>>> 65 66 67 68 69 70 71
>>> node 3 size: 122479 MB
>>> node 3 free: 122419 MB
>>> node 4 cpus:
>>> node 4 size: 97280 MB
>>> node 4 free: 97279 MB
>>
>>
>> ^ Is this where your driver hotplugged a single node and hotplugged memory?
> Yes, Node 4 is a GPU node and its memory has been hotplugged by the driver.

Thanks!

-- 
Cheers,

David / dhildenb