drivers/base/node.c | 1 + 1 file changed, 1 insertion(+)
On large systems with more than 64TB of DRAM, if the memory blocks
are interleaved, node initialization (node_dev_init()) could take
a long time since it iterates over each memory block. If the memory
block belongs to the current iterating node, the first pfn_to_nid
will provide the correct value. Otherwise, it will iterate over all
PFNs and check the nid. On non-preemptive kernels, this can result
in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
is enabled in kernels now [1], we may still need to fix older
stable kernels to avoid encountering these kernel warnings during
boot.
This patch adds a cond_resched() call in node_dev_init() to avoid
this warning.
node_dev_init()
register_one_node
register_memory_blocks_under_node
walk_memory_blocks()
register_mem_block_under_node_early
get_nid_for_pfn
early_pfn_to_nid
In my system node4 has a memory block ranging from memory30351
to memory38524, and memory128433. The memory blocks between
memory38524 and memory128433 do not belong to this node.
In walk_memory_blocks() we iterate over all memblocks starting
from memory38524 to memory128433.
In register_mem_block_under_node_early(), up to memory38524, the
first pfn correctly returns the corresponding nid and the function
returns from there. But after memory38524 and until memory128433,
the loop iterates through each pfn and checks the nid. Since the nid
does not match the required nid, the loop continues. This causes
the soft lockups.
[1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
drivers/base/node.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..107eb508e28e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -975,5 +975,6 @@ void __init node_dev_init(void)
ret = register_one_node(i);
if (ret)
panic("%s() failed to add node: %d\n", __func__, ret);
+ cond_resched();
}
}
--
2.43.5
On 10.03.25 12:53, Donet Tom wrote:
> On large systems with more than 64TB of DRAM, if the memory blocks
> are interleaved, node initialization (node_dev_init()) could take
> a long time since it iterates over each memory block. If the memory
> block belongs to the current iterating node, the first pfn_to_nid
> will provide the correct value. Otherwise, it will iterate over all
> PFNs and check the nid. On non-preemptive kernels, this can result
> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
> is enabled in kernels now [1], we may still need to fix older
> stable kernels to avoid encountering these kernel warnings during
> boot.
If it's not an issue upstream, there is no need for an upstream patch.
Fix stable kernels separately.
Or did I get you wrong and this can be triggered upstream?
>
> This patch adds a cond_resched() call in node_dev_init() to avoid
> this warning.
>
> node_dev_init()
> register_one_node
> register_memory_blocks_under_node
> walk_memory_blocks()
> register_mem_block_under_node_early
> get_nid_for_pfn
> early_pfn_to_nid
>
> In my system node4 has a memory block ranging from memory30351
> to memory38524, and memory128433. The memory blocks between
> memory38524 and memory128433 do not belong to this node.
>
> In walk_memory_blocks() we iterate over all memblocks starting
> from memory38524 to memory128433.
> In register_mem_block_under_node_early(), up to memory38524, the
> first pfn correctly returns the corresponding nid and the function
> returns from there. But after memory38524 and until memory128433,
> the loop iterates through each pfn and checks the nid. Since the nid
> does not match the required nid, the loop continues. This causes
> the soft lockups.
>
> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
That commit only moved code; so very likely, that is not the problematic
commit.
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
> drivers/base/node.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..107eb508e28e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
> ret = register_one_node(i);
> if (ret)
> panic("%s() failed to add node: %d\n", __func__, ret);
> + cond_resched();
> }
> }
--
Cheers,
David / dhildenb
On 3/11/25 2:52 PM, David Hildenbrand wrote:
> On 10.03.25 12:53, Donet Tom wrote:
>> On large systems with more than 64TB of DRAM, if the memory blocks
>> are interleaved, node initialization (node_dev_init()) could take
>> a long time since it iterates over each memory block. If the memory
>> block belongs to the current iterating node, the first pfn_to_nid
>> will provide the correct value. Otherwise, it will iterate over all
>> PFNs and check the nid. On non-preemptive kernels, this can result
>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>> is enabled in kernels now [1], we may still need to fix older
>> stable kernels to avoid encountering these kernel warnings during
>> boot.
>
> If it's not an issue upstream, there is no need for an upstream patch.
>
> Fix stable kernels separately.
>
> Or did I get you wrong and this can be triggered upstream?
Yes, the issue is present upstream if CONFIG_PREEMPT_LAZY is disabled.
Thanks
Donet
>
>>
>> This patch adds a cond_resched() call in node_dev_init() to avoid
>> this warning.
>>
>> node_dev_init()
>> register_one_node
>> register_memory_blocks_under_node
>> walk_memory_blocks()
>> register_mem_block_under_node_early
>> get_nid_for_pfn
>> early_pfn_to_nid
>>
>> In my system node4 has a memory block ranging from memory30351
>> to memory38524, and memory128433. The memory blocks between
>> memory38524 and memory128433 do not belong to this node.
>>
>> In walk_memory_blocks() we iterate over all memblocks starting
>> from memory38524 to memory128433.
>> In register_mem_block_under_node_early(), up to memory38524, the
>> first pfn correctly returns the corresponding nid and the function
>> returns from there. But after memory38524 and until memory128433,
>> the loop iterates through each pfn and checks the nid. Since the nid
>> does not match the required nid, the loop continues. This causes
>> the soft lockups.
>>
>> [1]:
>> https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device
>> subsystem initialization in node_dev_init()")
>
> That commit only moved code; so very likely, that is not the
> problematic commit.
>
>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>> drivers/base/node.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 0ea653fa3433..107eb508e28e 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>> ret = register_one_node(i);
>> if (ret)
>> panic("%s() failed to add node: %d\n", __func__, ret);
>> + cond_resched();
>> }
>> }
>
>
On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
> On large systems with more than 64TB of DRAM, if the memory blocks
> are interleaved, node initialization (node_dev_init()) could take
> a long time since it iterates over each memory block. If the memory
> block belongs to the current iterating node, the first pfn_to_nid
> will provide the correct value. Otherwise, it will iterate over all
> PFNs and check the nid. On non-preemptive kernels, this can result
> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
> is enabled in kernels now [1], we may still need to fix older
> stable kernels to avoid encountering these kernel warnings during
> boot.
>
> This patch adds a cond_resched() call in node_dev_init() to avoid
> this warning.
>
> node_dev_init()
> register_one_node
> register_memory_blocks_under_node
> walk_memory_blocks()
> register_mem_block_under_node_early
> get_nid_for_pfn
> early_pfn_to_nid
>
> In my system node4 has a memory block ranging from memory30351
> to memory38524, and memory128433. The memory blocks between
> memory38524 and memory128433 do not belong to this node.
>
> In walk_memory_blocks() we iterate over all memblocks starting
> from memory38524 to memory128433.
> In register_mem_block_under_node_early(), up to memory38524, the
> first pfn correctly returns the corresponding nid and the function
> returns from there. But after memory38524 and until memory128433,
> the loop iterates through each pfn and checks the nid. Since the nid
> does not match the required nid, the loop continues. This causes
> the soft lockups.
>
> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
> drivers/base/node.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..107eb508e28e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
> ret = register_one_node(i);
> if (ret)
> panic("%s() failed to add node: %d\n", __func__, ret);
> + cond_resched();
That's a horrible hack, sorry, but no, we can't sprinkle this around in
random locations, especially as this is actually fixed by using a
different scheduler model as you say.
Why not just make the code faster so as to avoid the long time this
takes?
thanks,
greg k-h
On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>> On large systems with more than 64TB of DRAM, if the memory blocks
>> are interleaved, node initialization (node_dev_init()) could take
>> a long time since it iterates over each memory block. If the memory
>> block belongs to the current iterating node, the first pfn_to_nid
>> will provide the correct value. Otherwise, it will iterate over all
>> PFNs and check the nid. On non-preemptive kernels, this can result
>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>> is enabled in kernels now [1], we may still need to fix older
>> stable kernels to avoid encountering these kernel warnings during
>> boot.
>>
>> This patch adds a cond_resched() call in node_dev_init() to avoid
>> this warning.
>>
>> node_dev_init()
>> register_one_node
>> register_memory_blocks_under_node
>> walk_memory_blocks()
>> register_mem_block_under_node_early
>> get_nid_for_pfn
>> early_pfn_to_nid
>>
>> In my system node4 has a memory block ranging from memory30351
>> to memory38524, and memory128433. The memory blocks between
>> memory38524 and memory128433 do not belong to this node.
>>
>> In walk_memory_blocks() we iterate over all memblocks starting
>> from memory38524 to memory128433.
>> In register_mem_block_under_node_early(), up to memory38524, the
>> first pfn correctly returns the corresponding nid and the function
>> returns from there. But after memory38524 and until memory128433,
>> the loop iterates through each pfn and checks the nid. Since the nid
>> does not match the required nid, the loop continues. This causes
>> the soft lockups.
>>
>> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>> drivers/base/node.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 0ea653fa3433..107eb508e28e 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>> ret = register_one_node(i);
>> if (ret)
>> panic("%s() failed to add node: %d\n", __func__, ret);
>> + cond_resched();
> That's a horrible hack, sorry, but no, we can't sprinkle this around in
> random locations, especially as this is actually fixed by using a
> different scheduler model as you say.
>
> Why not just make the code faster so as to avoid the long time this
> takes?
Thanks Greg
I was checking the code to see how to make it faster in order to
avoid the long time it takes.
In below code path
register_one_node()
register_memory_blocks_under_node()
walk_memory_blocks()
register_mem_block_under_node_early()
walk_memory_blocks() is iterating over all memblocks, and
register_mem_block_under_node_early() is iterating over the PFNs
to find the page_nid
If the page_nid and the requested nid are the same, we will register
the memblock under the node and return.
But if get_nid_for_pfn() returns a different nid (This means the
memblock is part of different nid), then the loop will iterate
over all PFNs of the memblock and check if the page_nid returned by
get_nid_for_pfn() and the requested nid are the same.
IIUC, since all PFNs of a memblock return the same page_nid, we
should stop the loop if the page_nid returned does not match the
requested nid.
With the change below, softlockups are no longer observed.
What are your thoughts on this?
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..5ca417e8672e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -811,7 +811,7 @@ static int
register_mem_block_under_node_early(struct memory_block *mem_blk,
if (page_nid < 0)
continue;
if (page_nid != nid)
- continue;
+ break;
do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
return 0;
> thanks,
>
> greg k-h
On 11.03.25 09:56, Donet Tom wrote:
>
> On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
>> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>>> On large systems with more than 64TB of DRAM, if the memory blocks
>>> are interleaved, node initialization (node_dev_init()) could take
>>> a long time since it iterates over each memory block. If the memory
>>> block belongs to the current iterating node, the first pfn_to_nid
>>> will provide the correct value. Otherwise, it will iterate over all
>>> PFNs and check the nid. On non-preemptive kernels, this can result
>>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>>> is enabled in kernels now [1], we may still need to fix older
>>> stable kernels to avoid encountering these kernel warnings during
>>> boot.
>>>
>>> This patch adds a cond_resched() call in node_dev_init() to avoid
>>> this warning.
>>>
>>> node_dev_init()
>>> register_one_node
>>> register_memory_blocks_under_node
>>> walk_memory_blocks()
>>> register_mem_block_under_node_early
>>> get_nid_for_pfn
>>> early_pfn_to_nid
>>>
>>> In my system node4 has a memory block ranging from memory30351
>>> to memory38524, and memory128433. The memory blocks between
>>> memory38524 and memory128433 do not belong to this node.
>>>
>>> In walk_memory_blocks() we iterate over all memblocks starting
>>> from memory38524 to memory128433.
>>> In register_mem_block_under_node_early(), up to memory38524, the
>>> first pfn correctly returns the corresponding nid and the function
>>> returns from there. But after memory38524 and until memory128433,
>>> the loop iterates through each pfn and checks the nid. Since the nid
>>> does not match the required nid, the loop continues. This causes
>>> the soft lockups.
>>>
>>> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>> ---
>>> drivers/base/node.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>> index 0ea653fa3433..107eb508e28e 100644
>>> --- a/drivers/base/node.c
>>> +++ b/drivers/base/node.c
>>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>> ret = register_one_node(i);
>>> if (ret)
>>> panic("%s() failed to add node: %d\n", __func__, ret);
>>> + cond_resched();
>> That's a horrible hack, sorry, but no, we can't sprinkle this around in
>> random locations, especially as this is actually fixed by using a
>> different scheduler model as you say.
>>
>> Why not just make the code faster so as to avoid the long time this
>> takes?
>
>
> Thanks Greg
>
> I was checking the code to see how to make it faster in order to
> avoid the long time it takes.
>
> In below code path
>
> register_one_node()
> register_memory_blocks_under_node()
> walk_memory_blocks()
> register_mem_block_under_node_early()
>
> walk_memory_blocks() is iterating over all memblocks, and
> register_mem_block_under_node_early() is iterating over the PFNs
> to find the page_nid
>
> If the page_nid and the requested nid are the same, we will register
> the memblock under the node and return.
>
> But if get_nid_for_pfn() returns a different nid (This means the
> memblock is part of different nid), then the loop will iterate
> over all PFNs of the memblock and check if the page_nid returned by
> get_nid_for_pfn() and the requested nid are the same.
>
> IIUC, since all PFNs of a memblock return the same page_nid, we
> should stop the loop if the page_nid returned does not match the
> requested nid.
Nodes can easily partially span "memory blocks". So your patch would
break these setups?
But I agree that iterating all pages is rather nasty. I wonder if we
could just walk all memblocks in the range?
early_pfn_to_nid()->__early_pfn_to_nid() would lookup the memblock ...
for each PFN. Testing a range instead could be better.
Something like "early_pfn_range_spans_nid()" could be useful for that.
--
Cheers,
David / dhildenb
On 3/11/25 2:59 PM, David Hildenbrand wrote:
> On 11.03.25 09:56, Donet Tom wrote:
>>
>> On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
>>> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>>>> On large systems with more than 64TB of DRAM, if the memory blocks
>>>> are interleaved, node initialization (node_dev_init()) could take
>>>> a long time since it iterates over each memory block. If the memory
>>>> block belongs to the current iterating node, the first pfn_to_nid
>>>> will provide the correct value. Otherwise, it will iterate over all
>>>> PFNs and check the nid. On non-preemptive kernels, this can result
>>>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>>>> is enabled in kernels now [1], we may still need to fix older
>>>> stable kernels to avoid encountering these kernel warnings during
>>>> boot.
>>>>
>>>> This patch adds a cond_resched() call in node_dev_init() to avoid
>>>> this warning.
>>>>
>>>> node_dev_init()
>>>> register_one_node
>>>> register_memory_blocks_under_node
>>>> walk_memory_blocks()
>>>> register_mem_block_under_node_early
>>>> get_nid_for_pfn
>>>> early_pfn_to_nid
>>>>
>>>> In my system node4 has a memory block ranging from memory30351
>>>> to memory38524, and memory128433. The memory blocks between
>>>> memory38524 and memory128433 do not belong to this node.
>>>>
>>>> In walk_memory_blocks() we iterate over all memblocks starting
>>>> from memory38524 to memory128433.
>>>> In register_mem_block_under_node_early(), up to memory38524, the
>>>> first pfn correctly returns the corresponding nid and the function
>>>> returns from there. But after memory38524 and until memory128433,
>>>> the loop iterates through each pfn and checks the nid. Since the nid
>>>> does not match the required nid, the loop continues. This causes
>>>> the soft lockups.
>>>>
>>>> [1]:
>>>> https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>>>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device
>>>> subsystem initialization in node_dev_init()")
>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>> ---
>>>> drivers/base/node.c | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index 0ea653fa3433..107eb508e28e 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>>> ret = register_one_node(i);
>>>> if (ret)
>>>> panic("%s() failed to add node: %d\n", __func__, ret);
>>>> + cond_resched();
>>> That's a horrible hack, sorry, but no, we can't sprinkle this around in
>>> random locations, especially as this is actually fixed by using a
>>> different scheduler model as you say.
>>>
>>> Why not just make the code faster so as to avoid the long time this
>>> takes?
>>
>>
>> Thanks Greg
>>
>> I was checking the code to see how to make it faster in order to
>> avoid the long time it takes.
>>
>> In below code path
>>
>> register_one_node()
>> register_memory_blocks_under_node()
>> walk_memory_blocks()
>> register_mem_block_under_node_early()
>>
>> walk_memory_blocks() is iterating over all memblocks, and
>> register_mem_block_under_node_early() is iterating over the PFNs
>> to find the page_nid
>>
>> If the page_nid and the requested nid are the same, we will register
>> the memblock under the node and return.
>>
>> But if get_nid_for_pfn() returns a different nid (This means the
>> memblock is part of different nid), then the loop will iterate
>> over all PFNs of the memblock and check if the page_nid returned by
>> get_nid_for_pfn() and the requested nid are the same.
>>
>> IIUC, since all PFNs of a memblock return the same page_nid, we
>> should stop the loop if the page_nid returned does not match the
>> requested nid.
>
> Nodes can easily partially span "memory blocks". So your patch would
> break these setups?
Does this mean one memory block can be part of two or
more nodes ? Some PFNs belong to one node, and the remaining
PFNs belong to another node?"
In that case, the current implementation adds the memory block to
only one node. In register_mem_block_under_node_early(), if the
first PFN returns the correct expected nid, the memory block will
be registered under that node. It does not iterate over the other
PFNs. Is this because of the assumption that one memory block
cannot be part of multiple nodes?
If one memory block cannot be part of multiple nodes, then we can
break if get_nid_for_pfn() returns the wrong nid, right?
>
> But I agree that iterating all pages is rather nasty. I wonder if we
> could just walk all memblocks in the range?
>
> early_pfn_to_nid()->__early_pfn_to_nid() would lookup the memblock ...
> for each PFN. Testing a range instead could be better.
>
> Something like "early_pfn_range_spans_nid()" could be useful for that.
Do you mean we should do it section by section within a memory block?
Thanks
Donet
On 11.03.25 16:00, Donet Tom wrote:
>
> On 3/11/25 2:59 PM, David Hildenbrand wrote:
>> On 11.03.25 09:56, Donet Tom wrote:
>>>
>>> On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
>>>> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>>>>> On large systems with more than 64TB of DRAM, if the memory blocks
>>>>> are interleaved, node initialization (node_dev_init()) could take
>>>>> a long time since it iterates over each memory block. If the memory
>>>>> block belongs to the current iterating node, the first pfn_to_nid
>>>>> will provide the correct value. Otherwise, it will iterate over all
>>>>> PFNs and check the nid. On non-preemptive kernels, this can result
>>>>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>>>>> is enabled in kernels now [1], we may still need to fix older
>>>>> stable kernels to avoid encountering these kernel warnings during
>>>>> boot.
>>>>>
>>>>> This patch adds a cond_resched() call in node_dev_init() to avoid
>>>>> this warning.
>>>>>
>>>>> node_dev_init()
>>>>> register_one_node
>>>>> register_memory_blocks_under_node
>>>>> walk_memory_blocks()
>>>>> register_mem_block_under_node_early
>>>>> get_nid_for_pfn
>>>>> early_pfn_to_nid
>>>>>
>>>>> In my system node4 has a memory block ranging from memory30351
>>>>> to memory38524, and memory128433. The memory blocks between
>>>>> memory38524 and memory128433 do not belong to this node.
>>>>>
>>>>> In walk_memory_blocks() we iterate over all memblocks starting
>>>>> from memory38524 to memory128433.
>>>>> In register_mem_block_under_node_early(), up to memory38524, the
>>>>> first pfn correctly returns the corresponding nid and the function
>>>>> returns from there. But after memory38524 and until memory128433,
>>>>> the loop iterates through each pfn and checks the nid. Since the nid
>>>>> does not match the required nid, the loop continues. This causes
>>>>> the soft lockups.
>>>>>
>>>>> [1]:
>>>>> https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>>>>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device
>>>>> subsystem initialization in node_dev_init()")
>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>>> ---
>>>>> drivers/base/node.c | 1 +
>>>>> 1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>>> index 0ea653fa3433..107eb508e28e 100644
>>>>> --- a/drivers/base/node.c
>>>>> +++ b/drivers/base/node.c
>>>>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>>>> ret = register_one_node(i);
>>>>> if (ret)
>>>>> panic("%s() failed to add node: %d\n", __func__, ret);
>>>>> + cond_resched();
>>>> That's a horrible hack, sorry, but no, we can't sprinkle this around in
>>>> random locations, especially as this is actually fixed by using a
>>>> different scheduler model as you say.
>>>>
>>>> Why not just make the code faster so as to avoid the long time this
>>>> takes?
>>>
>>>
>>> Thanks Greg
>>>
>>> I was checking the code to see how to make it faster in order to
>>> avoid the long time it takes.
>>>
>>> In below code path
>>>
>>> register_one_node()
>>> register_memory_blocks_under_node()
>>> walk_memory_blocks()
>>> register_mem_block_under_node_early()
>>>
>>> walk_memory_blocks() is iterating over all memblocks, and
>>> register_mem_block_under_node_early() is iterating over the PFNs
>>> to find the page_nid
>>>
>>> If the page_nid and the requested nid are the same, we will register
>>> the memblock under the node and return.
>>>
>>> But if get_nid_for_pfn() returns a different nid (This means the
>>> memblock is part of different nid), then the loop will iterate
>>> over all PFNs of the memblock and check if the page_nid returned by
>>> get_nid_for_pfn() and the requested nid are the same.
>>>
>>> IIUC, since all PFNs of a memblock return the same page_nid, we
>>> should stop the loop if the page_nid returned does not match the
>>> requested nid.
>>
>> Nodes can easily partially span "memory blocks". So your patch would
>> break these setups?
>
>
> Does this mean one memory block can be part of two or
> more nodes ? Some PFNs belong to one node, and the remaining
> PFNs belong to another node?"
Exactly.
Consider the following qemu cmdline as one example:
qemu-system-x86_64 --enable-kvm -smp 10 -M q35 -m 4G -hda
Fedora-Server-KVM-40-1.14.x86_64.qcow2 -nographic -object
memory-backend-ram,size=2000M,id=mem0 -object
memory-backend-ram,size=2096M,id=mem1 -numa node,cpus=0-4,memdev=mem0
-numa node,cpus=5-9,memdev=mem1
Inside the VM:
[root@localhost ~]# ls /sys/devices/system/node/node0/
compact cpu4 meminfo memory12 memory3 memory8 subsystem
cpu0 cpulist memory0 memory13 memory4 memory9 uevent
cpu1 cpumap memory1 memory14 memory5 memory_failure vmstat
cpu2 distance memory10 memory15 memory6 numastat x86
cpu3 hugepages memory11 memory2 memory7 power
[root@localhost ~]# ls /sys/devices/system/node/node1/
compact cpu9 meminfo memory35 memory40 memory45 power
cpu5 cpulist memory15 memory36 memory41 memory46 subsystem
cpu6 cpumap memory32 memory37 memory42 memory47 uevent
cpu7 distance memory33 memory38 memory43 memory_failure vmstat
cpu8 hugepages memory34 memory39 memory44 numastat x86
Observer how memory15 shows up for both nodes.
[root@localhost ~]# ls /sys/devices/system/memory/memory15/
node0 online phys_index removable subsystem valid_zones
node1 phys_device power state uevent
Observe how both nodes are listed
[root@localhost ~]# cat /sys/devices/system/memory/memory15/valid_zones
none
And "valid_zone = none" indicates that this memory block cannot get
offlined because it spans multiple zones (here: from multiple nodes)
>
> In that case, the current implementation adds the memory block to
> only one node. In register_mem_block_under_node_early(), if the
> first PFN returns the correct expected nid, the memory block will
> be registered under that node. It does not iterate over the other
> PFNs. Is this because of the assumption that one memory block
> cannot be part of multiple nodes?
See my example above. But note that my test VM has
]# uname -a
Linux localhost.localdomain 6.11.10-200.fc40.x86_64 #1 SMP
PREEMPT_DYNAMIC Sat Nov 23 00:53:13 UTC 2024 x86_64 GNU/Linux
>
> If one memory block cannot be part of multiple nodes, then we can
> break if get_nid_for_pfn() returns the wrong nid, right?
Again, see above. Hopefully that makes it clearer.
>
>
>>
>> But I agree that iterating all pages is rather nasty. I wonder if we
>> could just walk all memblocks in the range?
>>
>> early_pfn_to_nid()->__early_pfn_to_nid() would lookup the memblock ...
>> for each PFN. Testing a range instead could be better.
>>
>> Something like "early_pfn_range_spans_nid()" could be useful for that.
>
> Do you mean we should do it section by section within a memory block?
All we want to know is if the memblock allocator ("early") thinks that
any part of the memory block device (memory_block_size_bytes()) belongs
to that node.
--
Cheers,
David / dhildenb
© 2016 - 2026 Red Hat, Inc.