RE: [PATCH] mm/memory hotplug/unplug: Optimize zone contiguous check when changing pfn range

Liu, Yuan1 posted 1 patch 1 week ago
RE: [PATCH] mm/memory hotplug/unplug: Optimize zone contiguous check when changing pfn range
Posted by Liu, Yuan1 1 week ago
> -----Original Message-----
> From: David Hildenbrand (Arm) <david@kernel.org>
> Sent: Monday, March 23, 2026 7:42 PM
> To: Mike Rapoport <rppt@kernel.org>
> Cc: Liu, Yuan1 <yuan1.liu@intel.com>; Oscar Salvador <osalvador@suse.de>;
> Wei Yang <richard.weiyang@gmail.com>; linux-mm@kvack.org; Hu, Yong
> <yong.hu@intel.com>; Zou, Nanhai <nanhai.zou@intel.com>; Tim Chen
> <tim.c.chen@linux.intel.com>; Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Chen, Yu
> C <yu.c.chen@intel.com>; Deng, Pan <pan.deng@intel.com>; Li, Tianyou
> <tianyou.li@intel.com>; Chen Zhang <zhangchen.kidd@jd.com>; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH] mm/memory hotplug/unplug: Optimize zone contiguous
> check when changing pfn range
> 
> On 3/23/26 12:31, Mike Rapoport wrote:
> > On Mon, Mar 23, 2026 at 11:56:35AM +0100, David Hildenbrand (Arm) wrote:
> >> On 3/19/26 10:56, Yuan Liu wrote:
> >
> > ...
> >
> >>> diff --git a/mm/mm_init.c b/mm/mm_init.c
> >>> index df34797691bd..96690e550024 100644
> >>> --- a/mm/mm_init.c
> >>> +++ b/mm/mm_init.c
> >>> @@ -946,6 +946,7 @@ static void __init memmap_init_zone_range(struct zone *zone,
> >>>  	unsigned long zone_start_pfn = zone->zone_start_pfn;
> >>>  	unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
> >>>  	int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
> >>> +	unsigned long zone_hole_start, zone_hole_end;
> >>>
> >>>  	start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
> >>>  	end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
> >>> @@ -957,8 +958,19 @@ static void __init memmap_init_zone_range(struct zone *zone,
> >>>  			  zone_end_pfn, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE,
> >>>  			  false);
> >>>
> >>> -	if (*hole_pfn < start_pfn)
> >>> +	WRITE_ONCE(zone->pages_with_online_memmap,
> >>> +		   READ_ONCE(zone->pages_with_online_memmap) +
> >>> +		   (end_pfn - start_pfn));
> >>> +
> >>> +	if (*hole_pfn < start_pfn) {
> >>>  		init_unavailable_range(*hole_pfn, start_pfn, zone_id, nid);
> >>> +		zone_hole_start = clamp(*hole_pfn, zone_start_pfn, zone_end_pfn);
> >>> +		zone_hole_end = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
> >>> +		if (zone_hole_start < zone_hole_end)
> >>> +			WRITE_ONCE(zone->pages_with_online_memmap,
> >>> +				   READ_ONCE(zone->pages_with_online_memmap) +
> >>> +				   (zone_hole_end - zone_hole_start));
> >>> +	}
> >>
> >> The range can have larger holes without a memmap, and I think we would be
> >> missing pages handled by the other init_unavailable_range() call?
> >>
> >>
> >> There is one question for Mike, though: couldn't it happen that the
> >> init_unavailable_range() call in memmap_init() would initialize
> >> the memmap outside of the node/zone span?
> >
> > Yes, and it most likely will.
> >
> > Very common example is page 0 on x86 systems:
> >
> > [    0.012196]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > [    0.012221] On node 0, zone DMA: 1 pages in unavailable ranges
> > [    0.012205] Early memory node ranges
> > [    0.012206]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> >
> > The unavailable page in zone DMA is the page from 0x0 to 0x1000 that is
> > neither in node 0 nor in zone DMA.
> >
> > For ZONE_NORMAL it would be a more pathological case when zone/node span
> > ends in a middle of a section, but that's still possible.
> >
> >> If so, I wonder whether we would want to adjust the node+zone space to
> >> include these ranges.
> >>
> >> Later memory onlining could make these ranges suddenly fall into the
> >> node/zone span.
> >
> > But doesn't memory onlining always happen at section boundaries?
> 
> Sure, but assume ZONE_NORMAL ends in the middle of a section, and then
> you hotplug the next section.
> 
> Then, the zone spans that memmap. zone->pages_with_online_memmap will be
> wrong.
> 
> Once we unplug the hotplugged section, zone shrinking code will stumble
> over the hole pfns and assume they belong to the zone.
> zone->pages_with_online_memmap will be wrong.
> 
> zone->pages_with_online_memmap being wrong means that it is smaller than
> it should be. I guess it would not be broken, but we would fail to detect
> contiguous zones.
> 
> If there were an easy way to avoid that, it would be cleaner.

I tried to capture your points and drafted the code below.

+static void adjust_pages_with_online_memmap(struct zone *zone, long nr_pages,
+                                           long added_spanned_pages)
+{
+       if (added_spanned_pages == nr_pages)
+               zone->pages_with_online_memmap += nr_pages;
+       else
+               zone->pages_with_online_memmap += added_spanned_pages;
+}
 /*
  * Must be called with mem_hotplug_lock in write mode.
  */
@@ -1154,6 +1162,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
        const int nid = zone_to_nid(zone);
        int need_zonelists_rebuild = 0;
        unsigned long flags;
+       unsigned long old_spanned_pages = zone->spanned_pages;
        int ret;

        /*
@@ -1206,6 +1215,8 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,

        online_pages_range(pfn, nr_pages);
        adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
+       adjust_pages_with_online_memmap(zone, nr_pages,
+                                       zone->spanned_pages - old_spanned_pages);

        if (node_arg.nid >= 0)
                node_set_state(nid, N_MEMORY);
@@ -1905,6 +1916,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
        struct node_notify node_arg = {
                .nid = NUMA_NO_NODE,
        };
+       unsigned long old_spanned_pages = zone->spanned_pages;
        unsigned long flags;
        char *reason;
        int ret;
@@ -2051,6 +2063,8 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
        /* removal success */
        adjust_managed_page_count(pfn_to_page(start_pfn), -managed_pages);
        adjust_present_page_count(pfn_to_page(start_pfn), group, -nr_pages);
+       adjust_pages_with_online_memmap(zone, nr_pages,
+                                       zone->spanned_pages - old_spanned_pages);

Btw, can we introduce a new kernel command-line parameter to allow users to select 
the memory block size? This could also address the current issue.

Test Results as below, memory block size 128MB Vs. 2GB
+----------------+------+---------------+--------------+----------------+
|                | Size |    128MB      |    2GB       | Time Reduction |
|                +------+---------------+--------------+----------------+
| Plug Memory    | 256G |      10s      |       3s     |       70%      |
|                +------+---------------+--------------+----------------+
|                | 512G |      36s      |       7s     |       81%      |
+----------------+------+---------------+--------------+----------------+
 
+----------------+------+---------------+--------------+----------------+
|                | Size |    128MB      |    2GB       | Time Reduction |
|                +------+---------------+--------------+----------------+
| Unplug Memory  | 256G |      11s      |      3s      |       72%      |
|                +------+---------------+--------------+----------------+
|                | 512G |      36s      |      7s      |       81%      |
+----------------+------+---------------+--------------+----------------+

And I see the UV system already has this (the kernel parameter is uv_memblksize).
Could we introduce a common kernel parameter for memory block size configuration?

--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1458,6 +1458,26 @@ int __init set_memory_block_size_order(unsigned int order)
        return 0;
 }

+static int __init cmdline_parse_memory_block_size(char *p)
+{
+	unsigned long size;
+	char *endp;
+	int ret;
+
+	size = memparse(p, &endp);
+	if (*endp != '\0' || !is_power_of_2(size))
+		return -EINVAL;
+
+	ret = set_memory_block_size_order(ilog2(size));
+	if (ret)
+		return ret;
+
+	pr_info("x86/mm: memory_block_size cmdline override: %luMB\n",
+		size >> 20);
+	return 0;
+}
+early_param("x86_memory_block_size", cmdline_parse_memory_block_size);
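If something along these lines were accepted, usage would look like the following (the parameter name is just the one from the sketch above; the sysfs path is the existing memory-hotplug interface, which reports the block size in hex):

```
x86_memory_block_size=2G          # appended to the kernel command line

# after boot, verify the effective memory block size:
cat /sys/devices/system/memory/block_size_bytes
```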

> --
> Cheers,
> 
> David
Re: [PATCH] mm/memory hotplug/unplug: Optimize zone contiguous check when changing pfn range
Posted by Chen, Yu C 1 week ago
On 3/26/2026 3:30 PM, Liu, Yuan1 wrote:
>> -----Original Message-----

[ .... ]

> 
> Btw, can we introduce a new kernel command-line parameter to allow users to select
> the memory block size? This could also address the current issue.
> 
> Test Results as below, memory block size 128MB Vs. 2GB
> +----------------+------+---------------+--------------+----------------+
> |                | Size |    128MB      |    2GB       | Time Reduction |
> |                +------+---------------+--------------+----------------+
> | Plug Memory    | 256G |      10s      |       3s     |       70%      |
> |                +------+---------------+--------------+----------------+
> |                | 512G |      36s      |       7s     |       81%      |
> +----------------+------+---------------+--------------+----------------+
>   
> +----------------+------+---------------+--------------+----------------+
> |                | Size |    128MB      |    2GB       | Time Reduction |
> |                +------+---------------+--------------+----------------+
> | Unplug Memory  | 256G |      11s      |      3s      |       72%      |
> |                +------+---------------+--------------+----------------+
> |                | 512G |      36s      |      7s      |       81%      |
> +----------------+------+---------------+--------------+----------------+
> 
> And I see the UV system has already this (Kernel parameter is uv_memblksize).
> I think if we can introduce a common kernel parameter for memory block size configuration?
> 

Is it possible to turn uv_memblksize into a generic command-line
memblksize without introducing an extra parameter?

thanks,
Chenyu
Re: [PATCH] mm/memory hotplug/unplug: Optimize zone contiguous check when changing pfn range
Posted by David Hildenbrand (Arm) 1 week ago
On 3/26/26 08:38, Chen, Yu C wrote:
> On 3/26/2026 3:30 PM, Liu, Yuan1 wrote:
>>> -----Original Message-----
> 
> [ .... ]
> 
>>
>> Btw, can we introduce a new kernel command-line parameter to allow
>> users to select
>> the memory block size? This could also address the current issue.
>>
>> Test Results as below, memory block size 128MB Vs. 2GB
>> +----------------+------+---------------+--------------+----------------+
>> |                | Size |    128MB      |    2GB       | Time Reduction |
>> |                +------+---------------+--------------+----------------+
>> | Plug Memory    | 256G |      10s      |       3s     |       70%      |
>> |                +------+---------------+--------------+----------------+
>> |                | 512G |      36s      |       7s     |       81%      |
>> +----------------+------+---------------+--------------+----------------+
>>
>> +----------------+------+---------------+--------------+----------------+
>> |                | Size |    128MB      |    2GB       | Time Reduction |
>> |                +------+---------------+--------------+----------------+
>> | Unplug Memory  | 256G |      11s      |      3s      |       72%      |
>> |                +------+---------------+--------------+----------------+
>> |                | 512G |      36s      |      7s      |       81%      |
>> +----------------+------+---------------+--------------+----------------+
>>
>> And I see the UV system has already this (Kernel parameter is
>> uv_memblksize).
>> I think if we can introduce a common kernel parameter for memory block
>> size configuration?
>>
> 
> Is it possible to turn uv_memblksize into a generic commandline
> memblksize without
> introducing extra parameter?

We don't want that, and it's kind of a workaround for the problem. :)

I think we would want to only account pages towards
pages_with_online_memmap that fall within the zone span.

We will not account pages initialized that are outside the zone span.

Growing the zone and later trying to shrink it will only possibly see
a "too small" pages_with_online_memmap value. That is fine, it simply
prevents detecting "contiguous", so it's safe.

We can document that, and in the future we could handle it a bit nicer
(e.g., indicate these pages as being just fill material).

So indeed, I guess we want to teach init_unavailable_range() to only
account towards zone->pages_with_online_memmap whatever falls into the
zone span.

That could be done internally, or from the callers by calling
init_unavailable_range() once for the out-of-zone range and once for the
in-zone-range.

-- 
Cheers,

David
RE: [PATCH] mm/memory hotplug/unplug: Optimize zone contiguous check when changing pfn range
Posted by Liu, Yuan1 6 days, 12 hours ago
> -----Original Message-----
> From: David Hildenbrand (Arm) <david@kernel.org>
> Sent: Thursday, March 26, 2026 5:54 PM
> To: Chen, Yu C <yu.c.chen@intel.com>; Liu, Yuan1 <yuan1.liu@intel.com>;
> Mike Rapoport <rppt@kernel.org>
> Cc: Oscar Salvador <osalvador@suse.de>; Wei Yang
> <richard.weiyang@gmail.com>; linux-mm@kvack.org; Hu, Yong
> <yong.hu@intel.com>; Zou, Nanhai <nanhai.zou@intel.com>; Tim Chen
> <tim.c.chen@linux.intel.com>; Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Deng,
> Pan <pan.deng@intel.com>; Li, Tianyou <tianyou.li@intel.com>; Chen Zhang
> <zhangchen.kidd@jd.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] mm/memory hotplug/unplug: Optimize zone contiguous
> check when changing pfn range
> 
> On 3/26/26 08:38, Chen, Yu C wrote:
> > On 3/26/2026 3:30 PM, Liu, Yuan1 wrote:
> >>> -----Original Message-----
> >
> > [ .... ]
> >
> >>
> >> Btw, can we introduce a new kernel command-line parameter to allow
> >> users to select
> >> the memory block size? This could also address the current issue.
> >>
> >> Test Results as below, memory block size 128MB Vs. 2GB
> >> +----------------+------+---------------+--------------+----------------+
> >> |                | Size |    128MB      |    2GB       | Time Reduction |
> >> |                +------+---------------+--------------+----------------+
> >> | Plug Memory    | 256G |      10s      |       3s     |       70%      |
> >> |                +------+---------------+--------------+----------------+
> >> |                | 512G |      36s      |       7s     |       81%      |
> >> +----------------+------+---------------+--------------+----------------+
> >>
> >> +----------------+------+---------------+--------------+----------------+
> >> |                | Size |    128MB      |    2GB       | Time Reduction |
> >> |                +------+---------------+--------------+----------------+
> >> | Unplug Memory  | 256G |      11s      |      3s      |       72%      |
> >> |                +------+---------------+--------------+----------------+
> >> |                | 512G |      36s      |      7s      |       81%      |
> >> +----------------+------+---------------+--------------+----------------+
> >>
> >> And I see the UV system has already this (Kernel parameter is
> >> uv_memblksize).
> >> I think if we can introduce a common kernel parameter for memory block
> >> size configuration?
> >>
> >
> > Is it possible to turn uv_memblksize into a generic commandline
> > memblksize without
> > introducing extra parameter?
> 
> We don't want that, and it's kind of a workaround for the problem. :)
> I think we would want to only account pages towards
> pages_with_online_memmap that fall within the zone span.
> We will not account pages initialized that are outside the zone span.
>
> Growing the zone and later trying to shrink them will only possibly see
> a "too small"  pages_with_online_memmap value. That is fine, it simply
> prevents detecting "contiguous" so it's safe.
> 
> We can document that, and in the future we could handle it a bit nicer
> (e.g., indicate these pages as being just fill material).
> 
> So indeed, I guess we want to teach init_unavailable_range() to only
> account towards zone->pages_with_online_memmap whatever falls into the
> zone span.
> 
> That could be done internally, or from the callers by calling
> init_unavailable_range() once for the out-of-zone range and once for the
> in-zone-range.

This looks like a clear approach; I will implement it in the next version.

> --
> Cheers,
> 
> David