[PATCH] drivers/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks

Donet Tom posted 1 patch 11 months ago
drivers/base/node.c | 1 +
1 file changed, 1 insertion(+)
[PATCH] drivers/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks
Posted by Donet Tom 11 months ago
On large systems with more than 64TB of DRAM, if the memory blocks
are interleaved across nodes, node initialization (node_dev_init())
can take a long time, since it iterates over every memory block for
each node. If a memory block belongs to the node currently being
registered, the first pfn_to_nid lookup returns the matching nid and
the block is registered right away. Otherwise, every PFN in the block
is checked against the nid. On non-preemptive kernels this can result
in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY is
now enabled in current kernels [1], older stable kernels may still
need a fix to avoid hitting these warnings during boot.

This patch adds a cond_resched() call in node_dev_init() to avoid
this warning.

node_dev_init()
    register_one_node
        register_memory_blocks_under_node
            walk_memory_blocks()
                register_mem_block_under_node_early
                    get_nid_for_pfn
                        early_pfn_to_nid

On my system, node4 has memory blocks ranging from memory30351 to
memory38524, plus memory128433. The memory blocks between memory38524
and memory128433 do not belong to this node.

walk_memory_blocks() iterates over all memory blocks in the node's
range, including every block from memory38524 up to memory128433 that
does not belong to this node. In register_mem_block_under_node_early(),
for the blocks up to memory38524 the first PFN returns the matching
nid and the function returns immediately. But for the blocks between
memory38524 and memory128433, the loop walks every PFN and checks its
nid. Since the nid never matches the requested nid, the loop runs to
completion for each of these blocks. This causes the soft lockups.

[1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/base/node.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..107eb508e28e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -975,5 +975,6 @@ void __init node_dev_init(void)
 		ret = register_one_node(i);
 		if (ret)
 			panic("%s() failed to add node: %d\n", __func__, ret);
+		cond_resched();
 	}
 }
-- 
2.43.5
Re: [PATCH] drivers/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks
Posted by David Hildenbrand 11 months ago
On 10.03.25 12:53, Donet Tom wrote:
> On large systems with more than 64TB of DRAM, if the memory blocks
> are interleaved, node initialization (node_dev_init()) could take
> a long time since it iterates over each memory block. If the memory
> block belongs to the current iterating node, the first pfn_to_nid
> will provide the correct value. Otherwise, it will iterate over all
> PFNs and check the nid. On non-preemptive kernels, this can result
> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
> is enabled in kernels now [1], we may still need to fix older
> stable kernels to avoid encountering these kernel warnings during
> boot.

If it's not an issue upstream, there is no need for an upstream patch.

Fix stable kernels separately.

Or did I get you wrong and this can be triggered upstream?

> 
> This patch adds a cond_resched() call in node_dev_init() to avoid
> this warning.
> 
> node_dev_init()
>      register_one_node
>          register_memory_blocks_under_node
>              walk_memory_blocks()
>                  register_mem_block_under_node_early
>                      get_nid_for_pfn
>                          early_pfn_to_nid
> 
> In my system node4 has a memory block ranging from memory30351
> to memory38524, and memory128433. The memory blocks between
> memory38524 and memory128433 do not belong to this node.
> 
> In  walk_memory_blocks() we iterate over all memblocks starting
> from memory38524 to memory128433.
> In register_mem_block_under_node_early(), up to memory38524, the
> first pfn correctly returns the corresponding nid and the function
> returns from there. But after memory38524 and until memory128433,
> the loop iterates through each pfn and checks the nid. Since the nid
> does not match the required nid, the loop continues. This causes
> the soft lockups.
> 
> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")

That commit only moved code; so very likely, that is not the problematic 
commit.


> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>   drivers/base/node.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..107eb508e28e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>   		ret = register_one_node(i);
>   		if (ret)
>   			panic("%s() failed to add node: %d\n", __func__, ret);
> +		cond_resched();
>   	}
>   }


-- 
Cheers,

David / dhildenb
Re: [PATCH] drivers/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks
Posted by Donet Tom 11 months ago
On 3/11/25 2:52 PM, David Hildenbrand wrote:
> On 10.03.25 12:53, Donet Tom wrote:
>> On large systems with more than 64TB of DRAM, if the memory blocks
>> are interleaved, node initialization (node_dev_init()) could take
>> a long time since it iterates over each memory block. If the memory
>> block belongs to the current iterating node, the first pfn_to_nid
>> will provide the correct value. Otherwise, it will iterate over all
>> PFNs and check the nid. On non-preemptive kernels, this can result
>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>> is enabled in kernels now [1], we may still need to fix older
>> stable kernels to avoid encountering these kernel warnings during
>> boot.
>
> If it's not an issue upstream, there is no need for an upstream patch.
>
> Fix stable kernels separately.
>
> Or did I get you wrong and this can be triggered upstream?


Yes, the issue is present upstream if CONFIG_PREEMPT_LAZY is disabled.

Thanks
Donet


>
>>
>> This patch adds a cond_resched() call in node_dev_init() to avoid
>> this warning.
>>
>> node_dev_init()
>>      register_one_node
>>          register_memory_blocks_under_node
>>              walk_memory_blocks()
>>                  register_mem_block_under_node_early
>>                      get_nid_for_pfn
>>                          early_pfn_to_nid
>>
>> In my system node4 has a memory block ranging from memory30351
>> to memory38524, and memory128433. The memory blocks between
>> memory38524 and memory128433 do not belong to this node.
>>
>> In  walk_memory_blocks() we iterate over all memblocks starting
>> from memory38524 to memory128433.
>> In register_mem_block_under_node_early(), up to memory38524, the
>> first pfn correctly returns the corresponding nid and the function
>> returns from there. But after memory38524 and until memory128433,
>> the loop iterates through each pfn and checks the nid. Since the nid
>> does not match the required nid, the loop continues. This causes
>> the soft lockups.
>>
>> [1]: 
>> https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device 
>> subsystem initialization in node_dev_init()")
>
> That commit only moved code; so very likely, that is not the 
> problematic commit.
>
>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>   drivers/base/node.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 0ea653fa3433..107eb508e28e 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>           ret = register_one_node(i);
>>           if (ret)
>>               panic("%s() failed to add node: %d\n", __func__, ret);
>> +        cond_resched();
>>       }
>>   }
>
>
Re: [PATCH] drivers/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks
Posted by Greg Kroah-Hartman 11 months ago
On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
> On large systems with more than 64TB of DRAM, if the memory blocks
> are interleaved, node initialization (node_dev_init()) could take
> a long time since it iterates over each memory block. If the memory
> block belongs to the current iterating node, the first pfn_to_nid
> will provide the correct value. Otherwise, it will iterate over all
> PFNs and check the nid. On non-preemptive kernels, this can result
> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
> is enabled in kernels now [1], we may still need to fix older
> stable kernels to avoid encountering these kernel warnings during
> boot.
> 
> This patch adds a cond_resched() call in node_dev_init() to avoid
> this warning.
> 
> node_dev_init()
>     register_one_node
>         register_memory_blocks_under_node
>             walk_memory_blocks()
>                 register_mem_block_under_node_early
>                     get_nid_for_pfn
>                         early_pfn_to_nid
> 
> In my system node4 has a memory block ranging from memory30351
> to memory38524, and memory128433. The memory blocks between
> memory38524 and memory128433 do not belong to this node.
> 
> In  walk_memory_blocks() we iterate over all memblocks starting
> from memory38524 to memory128433.
> In register_mem_block_under_node_early(), up to memory38524, the
> first pfn correctly returns the corresponding nid and the function
> returns from there. But after memory38524 and until memory128433,
> the loop iterates through each pfn and checks the nid. Since the nid
> does not match the required nid, the loop continues. This causes
> the soft lockups.
> 
> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>  drivers/base/node.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..107eb508e28e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>  		ret = register_one_node(i);
>  		if (ret)
>  			panic("%s() failed to add node: %d\n", __func__, ret);
> +		cond_resched();

That's a horrible hack, sorry, but no, we can't sprinkle this around in
random locations, especially as this is actually fixed by using a
different scheduler model as you say.

Why not just make the code faster so as to avoid the long time this
takes?

thanks,

greg k-h
Re: [PATCH] drivers/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks
Posted by Donet Tom 11 months ago
On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>> On large systems with more than 64TB of DRAM, if the memory blocks
>> are interleaved, node initialization (node_dev_init()) could take
>> a long time since it iterates over each memory block. If the memory
>> block belongs to the current iterating node, the first pfn_to_nid
>> will provide the correct value. Otherwise, it will iterate over all
>> PFNs and check the nid. On non-preemptive kernels, this can result
>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>> is enabled in kernels now [1], we may still need to fix older
>> stable kernels to avoid encountering these kernel warnings during
>> boot.
>>
>> This patch adds a cond_resched() call in node_dev_init() to avoid
>> this warning.
>>
>> node_dev_init()
>>      register_one_node
>>          register_memory_blocks_under_node
>>              walk_memory_blocks()
>>                  register_mem_block_under_node_early
>>                      get_nid_for_pfn
>>                          early_pfn_to_nid
>>
>> In my system node4 has a memory block ranging from memory30351
>> to memory38524, and memory128433. The memory blocks between
>> memory38524 and memory128433 do not belong to this node.
>>
>> In  walk_memory_blocks() we iterate over all memblocks starting
>> from memory38524 to memory128433.
>> In register_mem_block_under_node_early(), up to memory38524, the
>> first pfn correctly returns the corresponding nid and the function
>> returns from there. But after memory38524 and until memory128433,
>> the loop iterates through each pfn and checks the nid. Since the nid
>> does not match the required nid, the loop continues. This causes
>> the soft lockups.
>>
>> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>   drivers/base/node.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 0ea653fa3433..107eb508e28e 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>   		ret = register_one_node(i);
>>   		if (ret)
>>   			panic("%s() failed to add node: %d\n", __func__, ret);
>> +		cond_resched();
> That's a horrible hack, sorry, but no, we can't sprinkle this around in
> random locations, especially as this is actually fixed by using a
> different scheduler model as you say.
>
> Why not just make the code faster so as to avoid the long time this
> takes?


Thanks Greg

I was checking the code to see how to make it faster in order to
avoid the long time it takes.

In the code path below:

register_one_node()
     register_memory_blocks_under_node()
         walk_memory_blocks()
             register_mem_block_under_node_early()

walk_memory_blocks() iterates over all memory blocks, and
register_mem_block_under_node_early() iterates over the PFNs of each
block to find the page_nid.

If the page_nid matches the requested nid, we register the memory
block under the node and return.

But if get_nid_for_pfn() returns a different nid (meaning the memory
block belongs to a different node), the loop keeps walking every PFN
of the block, checking each page_nid against the requested nid.

IIUC, since all PFNs of a memory block return the same page_nid, we
can stop the loop as soon as the returned page_nid does not match the
requested nid.

With the change below, softlockups are no longer observed.
What are your thoughts on this?

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..5ca417e8672e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -811,7 +811,7 @@ static int register_mem_block_under_node_early(struct memory_block *mem_blk,
 		if (page_nid < 0)
 			continue;
 		if (page_nid != nid)
-			continue;
+			break;
 
 		do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
 		return 0;


> thanks,
>
> greg k-h
Re: [PATCH] drivers/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks
Posted by David Hildenbrand 11 months ago
On 11.03.25 09:56, Donet Tom wrote:
> 
> On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
>> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>>> On large systems with more than 64TB of DRAM, if the memory blocks
>>> are interleaved, node initialization (node_dev_init()) could take
>>> a long time since it iterates over each memory block. If the memory
>>> block belongs to the current iterating node, the first pfn_to_nid
>>> will provide the correct value. Otherwise, it will iterate over all
>>> PFNs and check the nid. On non-preemptive kernels, this can result
>>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>>> is enabled in kernels now [1], we may still need to fix older
>>> stable kernels to avoid encountering these kernel warnings during
>>> boot.
>>>
>>> This patch adds a cond_resched() call in node_dev_init() to avoid
>>> this warning.
>>>
>>> node_dev_init()
>>>       register_one_node
>>>           register_memory_blocks_under_node
>>>               walk_memory_blocks()
>>>                   register_mem_block_under_node_early
>>>                       get_nid_for_pfn
>>>                           early_pfn_to_nid
>>>
>>> In my system node4 has a memory block ranging from memory30351
>>> to memory38524, and memory128433. The memory blocks between
>>> memory38524 and memory128433 do not belong to this node.
>>>
>>> In  walk_memory_blocks() we iterate over all memblocks starting
>>> from memory38524 to memory128433.
>>> In register_mem_block_under_node_early(), up to memory38524, the
>>> first pfn correctly returns the corresponding nid and the function
>>> returns from there. But after memory38524 and until memory128433,
>>> the loop iterates through each pfn and checks the nid. Since the nid
>>> does not match the required nid, the loop continues. This causes
>>> the soft lockups.
>>>
>>> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>> ---
>>>    drivers/base/node.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>> index 0ea653fa3433..107eb508e28e 100644
>>> --- a/drivers/base/node.c
>>> +++ b/drivers/base/node.c
>>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>>    		ret = register_one_node(i);
>>>    		if (ret)
>>>    			panic("%s() failed to add node: %d\n", __func__, ret);
>>> +		cond_resched();
>> That's a horrible hack, sorry, but no, we can't sprinkle this around in
>> random locations, especially as this is actually fixed by using a
>> different scheduler model as you say.
>>
>> Why not just make the code faster so as to avoid the long time this
>> takes?
> 
> 
> Thanks Greg
> 
> I was checking the code to see how to make it faster in order to
> avoid the long time it takes.
> 
> In below code path
> 
> register_one_node()
>       register_memory_blocks_under_node()
>           walk_memory_blocks()
>               register_mem_block_under_node_early()
> 
> walk_memory_blocks() is iterating over all memblocks, and
> register_mem_block_under_node_early() is iterating over the PFNs
> to find the page_nid
> 
> If the page_nid and the requested nid are the same, we will register
> the memblock under the node and return.
> 
> But if get_nid_for_pfn() returns a different nid (This means the
> memblock is part of different nid), then the loop will iterate
> over all PFNs of the memblock and check if the page_nid returned by
> get_nid_for_pfn() and the requested nid are the same.
> 
> IIUC, since all PFNs of a memblock return the same page_nid, we
> should stop the loop if the page_nid returned does not match the
> requested nid.

Nodes can easily partially span "memory blocks". So your patch would 
break these setups?

But I agree that iterating all pages is rather nasty. I wonder if we 
could just walk all memblocks in the range?

early_pfn_to_nid()->__early_pfn_to_nid() would look up the memblock ... 
for each PFN. Testing a range instead could be better.

Something like "early_pfn_range_spans_nid()" could be useful for that.

-- 
Cheers,

David / dhildenb

Re: [PATCH] drivers/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks
Posted by Donet Tom 11 months ago
On 3/11/25 2:59 PM, David Hildenbrand wrote:
> On 11.03.25 09:56, Donet Tom wrote:
>>
>> On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
>>> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>>>> On large systems with more than 64TB of DRAM, if the memory blocks
>>>> are interleaved, node initialization (node_dev_init()) could take
>>>> a long time since it iterates over each memory block. If the memory
>>>> block belongs to the current iterating node, the first pfn_to_nid
>>>> will provide the correct value. Otherwise, it will iterate over all
>>>> PFNs and check the nid. On non-preemptive kernels, this can result
>>>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>>>> is enabled in kernels now [1], we may still need to fix older
>>>> stable kernels to avoid encountering these kernel warnings during
>>>> boot.
>>>>
>>>> This patch adds a cond_resched() call in node_dev_init() to avoid
>>>> this warning.
>>>>
>>>> node_dev_init()
>>>>       register_one_node
>>>>           register_memory_blocks_under_node
>>>>               walk_memory_blocks()
>>>>                   register_mem_block_under_node_early
>>>>                       get_nid_for_pfn
>>>>                           early_pfn_to_nid
>>>>
>>>> In my system node4 has a memory block ranging from memory30351
>>>> to memory38524, and memory128433. The memory blocks between
>>>> memory38524 and memory128433 do not belong to this node.
>>>>
>>>> In  walk_memory_blocks() we iterate over all memblocks starting
>>>> from memory38524 to memory128433.
>>>> In register_mem_block_under_node_early(), up to memory38524, the
>>>> first pfn correctly returns the corresponding nid and the function
>>>> returns from there. But after memory38524 and until memory128433,
>>>> the loop iterates through each pfn and checks the nid. Since the nid
>>>> does not match the required nid, the loop continues. This causes
>>>> the soft lockups.
>>>>
>>>> [1]: 
>>>> https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>>>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device 
>>>> subsystem initialization in node_dev_init()")
>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>> ---
>>>>    drivers/base/node.c | 1 +
>>>>    1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index 0ea653fa3433..107eb508e28e 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>>>            ret = register_one_node(i);
>>>>            if (ret)
>>>>                panic("%s() failed to add node: %d\n", __func__, ret);
>>>> +        cond_resched();
>>> That's a horrible hack, sorry, but no, we can't sprinkle this around in
>>> random locations, especially as this is actually fixed by using a
>>> different scheduler model as you say.
>>>
>>> Why not just make the code faster so as to avoid the long time this
>>> takes?
>>
>>
>> Thanks Greg
>>
>> I was checking the code to see how to make it faster in order to
>> avoid the long time it takes.
>>
>> In below code path
>>
>> register_one_node()
>>       register_memory_blocks_under_node()
>>           walk_memory_blocks()
>>               register_mem_block_under_node_early()
>>
>> walk_memory_blocks() is iterating over all memblocks, and
>> register_mem_block_under_node_early() is iterating over the PFNs
>> to find the page_nid
>>
>> If the page_nid and the requested nid are the same, we will register
>> the memblock under the node and return.
>>
>> But if get_nid_for_pfn() returns a different nid (This means the
>> memblock is part of different nid), then the loop will iterate
>> over all PFNs of the memblock and check if the page_nid returned by
>> get_nid_for_pfn() and the requested nid are the same.
>>
>> IIUC, since all PFNs of a memblock return the same page_nid, we
>> should stop the loop if the page_nid returned does not match the
>> requested nid.
>
> Nodes can easily partially span "memory blocks". So your patch would 
> break these setups?


Does this mean one memory block can be part of two or more nodes?
Some PFNs belong to one node, and the remaining PFNs belong to
another node?

In that case, the current implementation adds the memory block to
only one node. In register_mem_block_under_node_early(), if the
first PFN returns the expected nid, the memory block is registered
under that node; the remaining PFNs are not checked. Is this because
of the assumption that one memory block cannot be part of multiple
nodes?

If one memory block cannot be part of multiple nodes, then we can
break if get_nid_for_pfn() returns the wrong nid, right?


>
> But I agree that iterating all pages is rather nasty. I wonder if we 
> could just walk all memblocks in the range?
>
> early_pfn_to_nid()->__early_pfn_to_nid() would lookup the memblock ... 
> for each PFN. Testing a range instead could be better.
>
> Something like "early_pfn_range_spans_nid()" could be useful for that.

Do you mean we should do it section by section within a memory block?

Thanks
Donet
Re: [PATCH] drivers/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks
Posted by David Hildenbrand 11 months ago
On 11.03.25 16:00, Donet Tom wrote:
> 
> On 3/11/25 2:59 PM, David Hildenbrand wrote:
>> On 11.03.25 09:56, Donet Tom wrote:
>>>
>>> On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
>>>> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>>>>> On large systems with more than 64TB of DRAM, if the memory blocks
>>>>> are interleaved, node initialization (node_dev_init()) could take
>>>>> a long time since it iterates over each memory block. If the memory
>>>>> block belongs to the current iterating node, the first pfn_to_nid
>>>>> will provide the correct value. Otherwise, it will iterate over all
>>>>> PFNs and check the nid. On non-preemptive kernels, this can result
>>>>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>>>>> is enabled in kernels now [1], we may still need to fix older
>>>>> stable kernels to avoid encountering these kernel warnings during
>>>>> boot.
>>>>>
>>>>> This patch adds a cond_resched() call in node_dev_init() to avoid
>>>>> this warning.
>>>>>
>>>>> node_dev_init()
>>>>>        register_one_node
>>>>>            register_memory_blocks_under_node
>>>>>                walk_memory_blocks()
>>>>>                    register_mem_block_under_node_early
>>>>>                        get_nid_for_pfn
>>>>>                            early_pfn_to_nid
>>>>>
>>>>> In my system node4 has a memory block ranging from memory30351
>>>>> to memory38524, and memory128433. The memory blocks between
>>>>> memory38524 and memory128433 do not belong to this node.
>>>>>
>>>>> In  walk_memory_blocks() we iterate over all memblocks starting
>>>>> from memory38524 to memory128433.
>>>>> In register_mem_block_under_node_early(), up to memory38524, the
>>>>> first pfn correctly returns the corresponding nid and the function
>>>>> returns from there. But after memory38524 and until memory128433,
>>>>> the loop iterates through each pfn and checks the nid. Since the nid
>>>>> does not match the required nid, the loop continues. This causes
>>>>> the soft lockups.
>>>>>
>>>>> [1]:
>>>>> https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>>>>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device
>>>>> subsystem initialization in node_dev_init()")
>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>>> ---
>>>>>     drivers/base/node.c | 1 +
>>>>>     1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>>> index 0ea653fa3433..107eb508e28e 100644
>>>>> --- a/drivers/base/node.c
>>>>> +++ b/drivers/base/node.c
>>>>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>>>>             ret = register_one_node(i);
>>>>>             if (ret)
>>>>>                 panic("%s() failed to add node: %d\n", __func__, ret);
>>>>> +        cond_resched();
>>>> That's a horrible hack, sorry, but no, we can't sprinkle this around in
>>>> random locations, especially as this is actually fixed by using a
>>>> different scheduler model as you say.
>>>>
>>>> Why not just make the code faster so as to avoid the long time this
>>>> takes?
>>>
>>>
>>> Thanks Greg
>>>
>>> I was checking the code to see how to make it faster in order to
>>> avoid the long time it takes.
>>>
>>> In below code path
>>>
>>> register_one_node()
>>>        register_memory_blocks_under_node()
>>>            walk_memory_blocks()
>>>                register_mem_block_under_node_early()
>>>
>>> walk_memory_blocks() is iterating over all memblocks, and
>>> register_mem_block_under_node_early() is iterating over the PFNs
>>> to find the page_nid
>>>
>>> If the page_nid and the requested nid are the same, we will register
>>> the memblock under the node and return.
>>>
>>> But if get_nid_for_pfn() returns a different nid (This means the
>>> memblock is part of different nid), then the loop will iterate
>>> over all PFNs of the memblock and check if the page_nid returned by
>>> get_nid_for_pfn() and the requested nid are the same.
>>>
>>> IIUC, since all PFNs of a memblock return the same page_nid, we
>>> should stop the loop if the page_nid returned does not match the
>>> requested nid.
>>
>> Nodes can easily partially span "memory blocks". So your patch would
>> break these setups?
> 
> 
> Does this mean one memory block can be part of two or
> more nodes ? Some PFNs belong to one node, and the remaining
> PFNs belong to another node?"

Exactly.

Consider the following qemu cmdline as one example:

qemu-system-x86_64 --enable-kvm -smp 10 -M q35 -m 4G \
    -hda Fedora-Server-KVM-40-1.14.x86_64.qcow2 -nographic \
    -object memory-backend-ram,size=2000M,id=mem0 \
    -object memory-backend-ram,size=2096M,id=mem1 \
    -numa node,cpus=0-4,memdev=mem0 \
    -numa node,cpus=5-9,memdev=mem1


Inside the VM:

[root@localhost ~]# ls /sys/devices/system/node/node0/
compact  cpu4       meminfo   memory12  memory3  memory8         subsystem
cpu0     cpulist    memory0   memory13  memory4  memory9         uevent
cpu1     cpumap     memory1   memory14  memory5  memory_failure  vmstat
cpu2     distance   memory10  memory15  memory6  numastat        x86
cpu3     hugepages  memory11  memory2   memory7  power
[root@localhost ~]# ls /sys/devices/system/node/node1/
compact  cpu9       meminfo   memory35  memory40  memory45        power
cpu5     cpulist    memory15  memory36  memory41  memory46        subsystem
cpu6     cpumap     memory32  memory37  memory42  memory47        uevent
cpu7     distance   memory33  memory38  memory43  memory_failure  vmstat
cpu8     hugepages  memory34  memory39  memory44  numastat        x86


Observe how memory15 shows up for both nodes.

[root@localhost ~]# ls /sys/devices/system/memory/memory15/
node0  online       phys_index  removable  subsystem  valid_zones
node1  phys_device  power       state      uevent

Observe how both nodes are listed

[root@localhost ~]# cat /sys/devices/system/memory/memory15/valid_zones
none

And "valid_zone = none" indicates that this memory block cannot get 
offlined because it spans multiple zones (here: from multiple nodes)

> 
> In that case, the current implementation adds the memory block to
> only one node. In register_mem_block_under_node_early(), if the
> first PFN returns the correct expected nid, the memory block will
> be registered under that node. It does not iterate over the other
> PFNs. Is this because of the assumption that one memory block
> cannot be part of multiple nodes?

See my example above. But note that my test VM has

]# uname -a
Linux localhost.localdomain 6.11.10-200.fc40.x86_64 #1 SMP 
PREEMPT_DYNAMIC Sat Nov 23 00:53:13 UTC 2024 x86_64 GNU/Linux


> 
> If one memory block cannot be part of multiple nodes, then we can
> break if get_nid_for_pfn() returns the wrong nid, right?

Again, see above. Hopefully that makes it clearer.

> 
> 
>>
>> But I agree that iterating all pages is rather nasty. I wonder if we
>> could just walk all memblocks in the range?
>>
>> early_pfn_to_nid()->__early_pfn_to_nid() would lookup the memblock ...
>> for each PFN. Testing a range instead could be better.
>>
>> Something like "early_pfn_range_spans_nid()" could be useful for that.
> 
> Do you mean we should do it section by section within a memory block?

All we want to know is whether the memblock allocator ("early") thinks that 
any part of the memory block device (memory_block_size_bytes()) belongs 
to that node.
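
Something along these lines (rough, untested sketch using the
early_pfn_range_spans_nid() idea from earlier in the thread) could then
replace the per-PFN scan entirely:

	static int register_mem_block_under_node_early(struct memory_block *mem_blk,
						       void *arg)
	{
		unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
		unsigned long nr_pfns = memory_block_size_bytes() / PAGE_SIZE;
		int nid = *(int *)arg;

		/* One range test against the memblock allocator per block. */
		if (early_pfn_range_spans_nid(start_pfn, start_pfn + nr_pfns, nid))
			do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);

		return 0;
	}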

-- 
Cheers,

David / dhildenb