[RFC LPC2025 PATCH 0/4] Deprecate zone_reclaim_mode
Posted by Joshua Hahn 1 week, 2 days ago
Hello folks, 
This is a code RFC for my upcoming discussion at LPC 2025 in Tokyo [1].

<preface>
You might notice that the RFC that I'm sending out is different from the
proposed abstract. Initially when I submitted my proposal, I was interested
in addressing how fallback allocations work under pressure for
NUMA-restricted allocations. Soon after, Johannes proposed a patch [2] which
addressed the problem I was investigating, so I wanted to explore a different
direction in the same area of fallback allocations.

At the same time, I was also thinking about zone_reclaim_mode [3]. I thought
that LPC would be a good opportunity to discuss deprecating zone_reclaim_mode,
so I hope to discuss this topic at LPC during my presentation slot.

Sorry for the patch submission so close to the conference as well. I thought
it would still be better to send this RFC out late, instead of just presenting
the topic at the conference without giving folks some time to think about it.
</preface>

zone_reclaim_mode was introduced in 2005 to spare the kernel the high remote
access latency associated with NUMA systems. With it enabled, when the kernel
sees that the local node is full, it stalls the allocation and triggers direct
reclaim locally instead of allocating from a remote node, even when free memory
may still be available remotely. This is the preferred way to consume memory
if remote memory access is more expensive than performing direct reclaim.
The choice is made on a system-wide basis, but can be toggled at runtime.
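
For reference, the mode is a small bitmask behind /proc/sys/vm/zone_reclaim_mode.
The bit values below match the RECLAIM_* definitions in
include/uapi/linux/mempolicy.h (one of the files this series removes); the
toggle helper is only an illustrative userspace sketch:

#include <stdio.h>

/*
 * Bit values exposed by the vm.zone_reclaim_mode sysctl ABI
 * (RECLAIM_* in include/uapi/linux/mempolicy.h):
 *   1: run local node reclaim instead of going off-node
 *   2: also write out dirty pages during that reclaim
 *   4: also unmap and swap pages during that reclaim
 */
#define RECLAIM_ZONE  (1 << 0)
#define RECLAIM_WRITE (1 << 1)
#define RECLAIM_UNMAP (1 << 2)

/* Illustrative helper: flip the sysctl at runtime (needs root). */
static int set_zone_reclaim_mode(unsigned int mode)
{
        FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");

        if (!f)
                return -1;
        fprintf(f, "%u\n", mode);
        return fclose(f);
}

int main(void)
{
        /* Plain local reclaim only; OR in RECLAIM_WRITE/RECLAIM_UNMAP as needed. */
        return set_zone_reclaim_mode(RECLAIM_ZONE) ? 1 : 0;
}

The same can of course be done from the shell with
echo 1 > /proc/sys/vm/zone_reclaim_mode.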

This series deprecates the zone_reclaim_mode sysctl in favor of other
NUMA-aware mechanisms, such as NUMA balancing, memory.reclaim, membind, and
tiering / promotion / demotion. Let's break down how these mechanisms differ,
based on workload characteristics.

Scenario 1) Workload fits in a single NUMA node
In this case, if the rest of the NUMA node is unused, zone_reclaim_mode
does nothing. On the other hand, if there are several workloads competing
for memory in the same NUMA node, with sum(workload_mem) > mem_capacity(node),
then zone_reclaim_mode is actively harmful. Direct reclaim is aggressively
triggered whenever one workload makes an allocation that pushes the node over
its capacity, and there is no fairness mechanism to prevent one workload from
completely blocking the others from making progress.

Scenario 2) Workload does not fit in a single NUMA node
Again, in this case, zone_reclaim_mode is actively harmful. Direct reclaim
will constantly be triggered whenever memory use goes above the node's
capacity, leading to memory thrashing. Moreover, even if the user really wants
to avoid remote allocations, membind is a better alternative here;
zone_reclaim_mode forces the user to make the decision for all workloads on
the system, whereas membind gives per-process granularity.
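
For reference, the per-process binding can be done with numactl --membind or
from the program itself via libnuma. A minimal sketch, assuming libnuma is
available and the workload should be confined to node 0 (build with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        struct bitmask *nodes;

        if (numa_available() < 0) {
                fprintf(stderr, "NUMA is not available on this system\n");
                return EXIT_FAILURE;
        }

        /* Confine all future allocations of this process to node 0. */
        nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, 0);
        numa_set_membind(nodes);
        numa_bitmask_free(nodes);

        /*
         * ... run the workload; allocations now reclaim locally or fail
         * instead of silently spilling over to remote nodes ...
         */
        return EXIT_SUCCESS;
}

From the shell, the equivalent is simply numactl --membind=0 <workload>.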

Scenario 3) Workload size is approximately the same as the NUMA node's capacity
This is probably the case for most workloads. When it is uncertain whether
memory consumption will exceed the node's capacity, it doesn't make much sense
to place a system-wide bet on whether direct reclaim is better or worse than
remote allocations. In other words, it might make more sense to allow memory
to spill over to remote nodes, and let NUMA balancing correct the placement
depending on how cold or hot the newly allocated memory turns out to be.

These examples might make it seem like zone_reclaim_mode is harmful for
all scenarios. But that is not the case:

Scenario 4) Newly allocated memory is going to be hot
This is probably the scenario where zone_reclaim_mode shines the most.
If the newly allocated memory is going to be hot, then it makes much more
sense to reclaim locally, which kicks out cold(er) memory and avoids paying
remote access latency on the frequent accesses that follow.

Scenario 5) Tiered NUMA system makes remote access latency higher
In some tiered memory scenarios, remote access latency can be higher for
lower memory tiers. In these scenarios, the cost of direct reclaim may be
cheaper, relative to placing hot memory on a remote node with high access
latency.

Now, let me try to present a case for deprecating zone_reclaim_mode, despite
these two scenarios where it performs as intended.

In scenario 4, the catch is that the system is not an oracle that can predict
that newly allocated memory is going to be hot. In fact, a lot of the kernel
assumes that newly allocated memory is cold, and the memory has to "prove"
that it is hot through accesses. In a perfect world, the kernel would be able
to selectively trigger direct reclaim or allocate remotely, based on whether
the current allocation will be cold or hot in the future.

But without these insights, it is hard to justify a system-wide bet that
always triggers direct reclaim locally, when we might be reclaiming or
evicting relatively hotter memory from the local node just to make room.

In scenario 5, remote access latency is higher, which means the cost of
placing hot memory in remote nodes is higher. But today, we have many
strategies that can help us overcome the higher cost of placing hot memory in
remote nodes. If the system has tiered memory with different memory
access characteristics per-node, then the user is probably already enabling
promotion and demotion mechanisms that can quickly correct the placement of
hot pages in lower tiers. In these systems, it might make more sense to allow
the kernel to naturally consume all of the memory it can (whether it is local
or on a lower-tier remote node), and then take corrective action based on what
it finds to be hot or cold memory.
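
For reference, on kernels recent enough to have them, the relevant knobs are
/sys/kernel/mm/numa/demotion_enabled and the tiering mode of NUMA balancing.
A minimal illustrative sketch (treat the exact paths and values as assumptions
for your kernel version):

#include <stdio.h>

/* Write a single value to a sysctl/sysfs control file; 0 on success. */
static int write_knob(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(void)
{
        int err = 0;

        /* Let reclaim demote cold pages to lower memory tiers. */
        err |= write_knob("/sys/kernel/mm/numa/demotion_enabled", "1");
        /*
         * 2 == NUMA_BALANCING_MEMORY_TIERING: hint faults promote hot
         * pages from lower tiers back to top-tier nodes.
         */
        err |= write_knob("/proc/sys/kernel/numa_balancing", "2");

        return err ? 1 : 0;
}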

Of course, demonstrating that there are alternatives is not enough to warrant
a deprecation. I think the real benefit of this series comes in reduced sysctl
maintenance and much easier-to-read code.

This series, with 466 deletions and 9 insertions:
- Deprecates the zone_reclaim_mode sysctl (patch 4)
- Deprecates the min_slab_ratio sysctl (patch 3)
- Deprecates the min_unmapped_ratio sysctl (patch 3)
- Removes the node_reclaim() function and simplifies the get_page_from_freelist
  watermark checks (which is already a very large function) (patch 2)
- Simplifies hpage_collapse_scan_{pmd, file} (patch 1).
- There are also more opportunities for future cleanup, like removing
  __node_reclaim and converting its last caller to use try_to_free_pages
  (suggested by Johannes Weiner)

Here are some discussion points that I hope to discuss at LPC:
- For workloads that are assumed to fit in a NUMA node, is membind really
  enough to achieve the same effect?
- Is NUMA balancing good enough to take corrective action when memory spills
  over to remote nodes and ends up being accessed frequently?
- How widely is zone_reclaim_mode currently being used?
- Are there usecases for zone_reclaim_mode that cannot be replaced by any
  of the mentioned alternatives?
- Now that node_reclaim() is removed in patch 2, patch 3 deprecates
  min_slab_ratio and min_unmapped_ratio. Does this change make sense?
  IOW, should proactive reclaim via memory.reclaim (see the sketch after
  this list) still care about these thresholds before deciding to reclaim?
- If we agree that there are better alternatives to zone_reclaim_mode, how
  should we make the transition to deprecate it, along with the other
  sysctls that are deprecated in this series (min_{slab, unmapped}_ratio)?
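
As referenced above, memory.reclaim is the cgroup v2 interface for proactive
reclaim, and driving it is just a size written to the cgroup's control file.
A minimal sketch (the cgroup path is hypothetical):

#include <stdio.h>

int main(void)
{
        /* Hypothetical cgroup; assumes cgroup v2 mounted at /sys/fs/cgroup. */
        FILE *f = fopen("/sys/fs/cgroup/workload/memory.reclaim", "w");

        if (!f) {
                perror("memory.reclaim");
                return 1;
        }
        /*
         * Ask the kernel to proactively reclaim 512M from this cgroup;
         * the write fails if the full amount could not be reclaimed.
         */
        fprintf(f, "512M\n");
        return fclose(f) ? 1 : 0;
}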

Please also note that I've excluded all individual email addresses from the
Cc list. It was ~30 addresses, and I wanted to avoid spamming maintainers and
reviewers, so I've only kept the mailing list targets. The individuals are
Cc'd on the relevant patches, though.

Thank you everyone. I'm looking forward to discussing this idea with you all!
Joshua

[1] https://lpc.events/event/19/contributions/2142/
[2] https://lore.kernel.org/linux-mm/20250919162134.1098208-1-hannes@cmpxchg.org/
[3] https://lore.kernel.org/all/20250805205048.1518453-1-joshua.hahnjy@gmail.com/

Joshua Hahn (4):
  mm/khugepaged: Remove hpage_collapse_scan_abort
  mm/vmscan/page_alloc: Remove node_reclaim
  mm/vmscan/page_alloc: Deprecate min_{slab, unmapped}_ratio
  mm/vmscan: Deprecate zone_reclaim_mode

 Documentation/admin-guide/sysctl/vm.rst       |  78 ---------
 Documentation/mm/physical_memory.rst          |   9 -
 .../translations/zh_CN/mm/physical_memory.rst |   8 -
 arch/powerpc/include/asm/topology.h           |   4 -
 include/linux/mmzone.h                        |   8 -
 include/linux/swap.h                          |   5 -
 include/linux/topology.h                      |   6 -
 include/linux/vm_event_item.h                 |   4 -
 include/trace/events/huge_memory.h            |   1 -
 include/uapi/linux/mempolicy.h                |  14 --
 mm/internal.h                                 |  22 ---
 mm/khugepaged.c                               |  34 ----
 mm/page_alloc.c                               | 120 +------------
 mm/vmscan.c                                   | 158 +-----------------
 mm/vmstat.c                                   |   4 -
 15 files changed, 9 insertions(+), 466 deletions(-)


base-commit: e4c4d9892021888be6d874ec1be307e80382f431
-- 
2.47.3
Re: [RFC LPC2025 PATCH 0/4] Deprecate zone_reclaim_mode
Posted by mawupeng 6 days, 9 hours ago

On 2025/12/6 7:32, Joshua Hahn wrote:
> Hello folks, 
> This is a code RFC for my upcoming discussion at LPC 2025 in Tokyo [1].
> 
> <preface>
> You might notice that the RFC that I'm sending out is different from the
> proposed abstract. Initially when I submitted my proposal, I was interested
> in addressing how fallback allocations work under pressure for
> NUMA-restricted allocations. Soon after, Johannes proposed a patch [2] which
> addressed the problem I was investigating, so I wanted to explore a different
> direction in the same area of fallback allocations.
> 
> At the same time, I was also thinking about zone_reclaim_mode [3]. I thought
> that LPC would be a good opportunity to discuss deprecating zone_reclaim_mode,
> so I hope to discuss this topic at LPC during my presentation slot.
> 
> Sorry for the patch submission so close to the conference as well. I thought
> it would still be better to send this RFC out late, instead of just presenting
> the topic at the conference without giving folks some time to think about it.
> </preface>
> 
> zone_reclaim_mode was introduced in 2005 to prevent the kernel from facing
> the high remote access latency associated with NUMA systems. With it enabled,
> when the kernel sees that the local node is full, it will stall allocations and
> trigger direct reclaim locally, instead of making a remote allocation, even
> when there may still be free memory. This is the preferred way to consume memory
> if remote memory access is more expensive than performing direct reclaim.
> The choice is made on a system-wide basis, but can be toggled at runtime.
> 
> This series deprecates the zone_reclaim_mode sysctl in favor of other NUMA
> aware mechanisms, such as NUMA balancing, memory.reclaim, membind, and
> tiering / promotion / demotion. Let's break down what differences there are
> in these mechanisms, based on workload characteristics.
> 
> Scenario 1) Workload fits in a single NUMA node
> In this case, if the rest of the NUMA node is unused, the zone_reclaim_mode
> does nothing. On the other hand, if there are several workloads competing
> for memory in the same NUMA node, with sum(workload_mem) > mem_capacity(node),
> then zone_reclaim_mode is actively harmful. Direct reclaim is aggressively
> triggered whenever one workload makes an allocation that goes over the limit,
> and there is no fairness mechanism to prevent one workload from completely
> blocking the other workload from making progress.
> 
> Scenario 2) Workload does not fit in a single NUMA node
> Again, in this case, zone_reclaim_mode is actively harmful. Direct reclaim
> will constantly be triggered whenever memory goes above the limit, leading
> to memory thrashing. Moreover, even if the user really wants to avoid remote
> allocations, membind is a better alternative in this case; zone_reclaim_mode
> forces the user to make the decision for all workloads on the system, whereas
> membind gives per-process granularity.
> 
> Scenario 3) Workload size is approximately the same as the NUMA capacity
> This is probably the case for most workloads. When it is uncertain whether
> memory consumption will exceed the capacity, it doesn't really make a lot
> of sense to make a system-wide bet on whether direct reclaim is better or
> worse than remote allocations. In other words, it might make more sense to
> allow memory to spill over to remote nodes, and let the kernel handle the
> NUMA balancing depending on how cold or hot the newly allocated memory is.
> 
> These examples might make it seem like zone_reclaim_mode is harmful for
> all scenarios. But that is not the case:
> 
> Scenario 4) Newly allocated memory is going to be hot
> This is probably the scenario that makes zone_reclaim_mode shine the most.
> If the newly allocated memory is going to be hot, then it makes much more
> sense to try and reclaim locally, which would kick out cold(er) memory and
> prevent eating any remote memory access latency frequently.
> 
> Scenario 5) Tiered NUMA system makes remote access latency higher
> In some tiered memory scenarios, remote access latency can be higher for
> lower memory tiers. In these scenarios, the cost of direct reclaim may be
> cheaper, relative to placing hot memory on a remote node with high access
> latency.
> 
> Now, let me try and present a case for deprecating zone_reclaim_mode, despite
> these two scenarios where it performs as intended.
> In scenario 4, the catch is that the system is not an oracle that can predict
> that newly allocated memory is going to be hot. In fact, a lot of the kernel
> assumes that newly allocated memory is cold, and it has to "prove" that it
> is hot through accesses. In a perfect world, the kernel would be able to
> selectively trigger direct reclaim or allocate remotely, based on whether the
> current allocation will be cold or hot in the future.
> 
> But without these insights, it is difficult to make a system-wide bet and
> always trigger direct reclaim locally, when we might be reclaiming or
> evicting relatively hotter memory from the local node in order to make room.
> 
> In scenario 5, remote access latency is higher, which means the cost of
> placing hot memory in remote nodes is higher. But today, we have many
> strategies that can help us overcome the higher cost of placing hot memory in
> remote nodes. If the system has tiered memory with different memory
> access characteristics per-node, then the user is probably already enabling
> promotion and demotion mechanisms that can quickly correct the placement of
> hot pages in lower tiers. In these systems, it might make more sense to allow
> the kernel to naturally consume all of the memory it can (whether it is local
> or on a lower tier remote node), then allow the kernel to then take corrective
> action based on what it finds as hot or cold memory.
> 
> Of course, demonstrating that there are alternatives is not enough to warrant
> a deprecation. I think that the real benefit of this patch comes in reduced
> sysctl maintenance and what I think is much easier code to read.
> 
> This series which has 466 deletions and 9 insertions:
> - Deprecates the zone_reclaim_mode sysctl (patch 4)
> - Deprecates the min_slab_ratio sysctl (patch 3)
> - Deprecates the min_unmapped_ratio sysctl (patch 3)
> - Removes the node_reclaim() function and simplifies the get_page_from_freelist
>   watermark checks (which is already a very large function) (patch 2)
> - Simplifies hpage_collapse_scan_{pmd, file} (patch 1).
> - There are also more opportunities for future cleanup, like removing
>   __node_reclaim and converting its last caller to use try_to_free_pages
>   (suggested by Johannes Weiner)
> 
> Here are some discussion points that I hope to discuss at LPC:
> - For workloads that are assumed to fit in a NUMA node, is membind really
>   enough to achieve the same effect?

In real-world scenarios, we have observed on a dual-socket (2P) server with multiple
NUMA nodes (each having relatively limited local memory capacity) that page cache
negatively impacts overall performance. The zone_reclaim_mode feature is used to
alleviate these performance issues.

The main reason is that page cache consumes free memory on the local node, causing
processes without mbind restrictions to fall back to other nodes that still have free
memory. Accessing remote memory comes with a significant latency penalty. In extreme
testing, if a system is fully populated with page cache beforehand, Spark application
performance can drop by 80%. However, with zone_reclaim enabled, the performance
degradation is limited to only about 30%.

Furthermore, for typical HPC applications, memory pressure tends to be balanced
across NUMA nodes. Yet page cache is often generated by background tasks—such as
logging modules—which breaks memory locality and adversely affects overall performance.

At the same time, there are a large number of __GFP_THISNODE memory allocation requests in
the system. Anonymous pages that fall back from other nodes cannot be migrated or easily
reclaimed (especially when swap is disabled), leading to uneven distribution of available
memory within a single node. By enabling zone_reclaim_mode, the kernel preferentially reclaims
file pages within the local NUMA node to satisfy local anonymous-page allocations, which
effectively avoids warn_alloc problems caused by uneven distribution of anonymous pages.

In such scenarios, relying solely on mbind may offer limited flexibility.

We have also experimented with proactively waking kswapd to improve synchronous reclaim
efficiency. Our actual tests show that this can roughly double the memory allocation rate[1].

We could also discuss whether there are better solutions for such HPC scenarios.

[1]: https://lore.kernel.org/all/20251011062043.772549-1-mawupeng1@huawei.com/

> - Is NUMA balancing good enough to take corrective action when memory spills
>   over to remote nodes and ends up being accessed frequently?
> - How widely is zone_reclaim_mode currently being used?
> - Are there usecases for zone_reclaim_mode that cannot be replaced by any
>   of the mentioned alternatives?
> - Now that node_reclaim() is deprecated in patch 2, patch 3 deprecates
>   min_slab_ratio and min_unmapped_ratio. Does this change make sense?
>   IOW, should proactive reclaim via memory.reclaim still care about
>   these thresholds before making a decision to reclaim?
> - If we agree that there are better alternatives to zone_reclaim_mode, how
>   should we make the transition to deprecate it, along with the other
>   sysctls that are deprecated in this series (min_{slab, unmapped}_ratio)?
> 
> Please also note that I've excluded all individual email addresses for the
> Cc list. It was ~30 addresses, as I just wanted to avoid spamming
> maintainers and reviewers, so I've just left the mailing list targets.
> The individuals are Cc-ed in the relevant patches, though.
> 
> Thank you everyone. I'm looking forward to discussing this idea with you all!
> Joshua
> 
> [1] https://lpc.events/event/19/contributions/2142/
> [2] https://lore.kernel.org/linux-mm/20250919162134.1098208-1-hannes@cmpxchg.org/
> [3] https://lore.kernel.org/all/20250805205048.1518453-1-joshua.hahnjy@gmail.com/
> 
> Joshua Hahn (4):
>   mm/khugepaged: Remove hpage_collapse_scan_abort
>   mm/vmscan/page_alloc: Remove node_reclaim
>   mm/vmscan/page_alloc: Deprecate min_{slab, unmapped}_ratio
>   mm/vmscan: Deprecate zone_reclaim_mode
> 
>  Documentation/admin-guide/sysctl/vm.rst       |  78 ---------
>  Documentation/mm/physical_memory.rst          |   9 -
>  .../translations/zh_CN/mm/physical_memory.rst |   8 -
>  arch/powerpc/include/asm/topology.h           |   4 -
>  include/linux/mmzone.h                        |   8 -
>  include/linux/swap.h                          |   5 -
>  include/linux/topology.h                      |   6 -
>  include/linux/vm_event_item.h                 |   4 -
>  include/trace/events/huge_memory.h            |   1 -
>  include/uapi/linux/mempolicy.h                |  14 --
>  mm/internal.h                                 |  22 ---
>  mm/khugepaged.c                               |  34 ----
>  mm/page_alloc.c                               | 120 +------------
>  mm/vmscan.c                                   | 158 +-----------------
>  mm/vmstat.c                                   |   4 -
>  15 files changed, 9 insertions(+), 466 deletions(-)
> 
> 
> base-commit: e4c4d9892021888be6d874ec1be307e80382f431

Re: [RFC LPC2025 PATCH 0/4] Deprecate zone_reclaim_mode
Posted by Joshua Hahn 5 days, 9 hours ago
On Tue, 9 Dec 2025 20:43:01 +0800 mawupeng <mawupeng1@huawei.com> wrote:
> 
> On 2025/12/6 7:32, Joshua Hahn wrote:
> > Hello folks, 
> > This is a code RFC for my upcoming discussion at LPC 2025 in Tokyo [1].
> >
> > zone_reclaim_mode was introduced in 2005 to prevent the kernel from facing
> > the high remote access latency associated with NUMA systems. With it enabled,
> > when the kernel sees that the local node is full, it will stall allocations and
> > trigger direct reclaim locally, instead of making a remote allocation, even
> > when there may still be free memory. This is the preferred way to consume memory
> > if remote memory access is more expensive than performing direct reclaim.
> > The choice is made on a system-wide basis, but can be toggled at runtime.
> > 
> > This series deprecates the zone_reclaim_mode sysctl in favor of other NUMA
> > aware mechanisms, such as NUMA balancing, memory.reclaim, membind, and
> > tiering / promotion / demotion. Let's break down what differences there are
> > in these mechanisms, based on workload characteristics.

[...snip...]

Hello mawupeng, thank you for your feedback on this RFC.

I was wondering if you were planning to attend LPC this year. If so, I'll be
discussing this idea at the MM microconference tomorrow (December 11th) and
would love to discuss this after the presentation with you in the hallway.
I want to make sure that I'm not missing any important nuances or use cases
for zone_reclaim_mode. After all, my only motivation for deprecating this is
to simplify the allocation code path and reduce maintenance burden, both of
which definitely do not outweigh valid use cases. On the other hand, if we
find that we can deprecate zone_reclaim_mode and also find some alternatives
that lead to better performance on your end, that sounds like the ultimate
win-win scenario for me :-)

> In real-world scenarios, we have observed on a dual-socket (2P) server with multiple
> NUMA nodes—each having relatively limited local memory capacity—that page cache
> negatively impacts overall performance. The zone_reclaim_mode feature is used to
> alleviate performance issues.
> 
> The main reason is that page cache consumes free memory on the local node, causing
> processes without mbind restrictions to fall back to other nodes that still have free
> memory. Accessing remote memory comes with a significant latency penalty. In extreme
> testing, if a system is fully populated with page cache beforehand, Spark application
> performance can drop by 80%. However, with zone_reclaim enabled, the performance
> degradation is limited to only about 30%.

This sounds right to me. In fact, I have observed similar results in some
experiments that I ran myself, where, on a 2-node NUMA system with 125GB of
memory per node, I filled up one node with 100G of garbage file cache and then
ran a 60G anon workload on it. Here are the average access latency results:

- zone_reclaim_mode enabled: 56.34 ns/access
- zone_reclaim_mode disabled: 67.86 ns/access

However, I was able to achieve better results by disabling zone_reclaim_mode
and using membind instead:

- zone_reclaim_mode disabled + membind: 52.98 ns/access
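
(For context: average access latency like this can be measured with a simple
pointer-chase over an anonymous buffer. The sketch below is only an
illustration of that kind of benchmark, not the exact harness behind the
numbers above, and the buffer size / iteration count are arbitrary.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NSLOTS  (64UL << 20)    /* 64M pointers == 512M of anon memory */
#define NACCESS (256UL << 20)

int main(void)
{
        size_t *next = malloc(NSLOTS * sizeof(*next));
        size_t i, pos = 0;
        struct timespec t0, t1;
        double ns;

        if (!next)
                return 1;

        /* Sattolo's algorithm: one random cycle, so every slot is visited. */
        for (i = 0; i < NSLOTS; i++)
                next[i] = i;
        for (i = NSLOTS - 1; i > 0; i--) {
                size_t j = rand() % i, tmp = next[i];

                next[i] = next[j];
                next[j] = tmp;
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < NACCESS; i++)
                pos = next[pos];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Print pos as well so the chase loop cannot be optimized away. */
        printf("%.2f ns/access (%zu)\n", ns / NACCESS, pos);
        return 0;
}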

Of course, these are on my specific system with my specific workload so the
numbers (and results) may be different on your end. You specifically mentioned
"processes without mbind restrictions". Is there a reason why these workloads
cannot be membound to a node?

On that note, I had another follow-up question. If remote latency really is a
big concern, I am wondering if you have seen remote allocations despite
enabling zone_reclaim_mode. From my understanding of the code, zone_reclaim_mode
is not a strict guarantee of memory locality. If direct reclaim fails and
we fail to reclaim enough, the allocation is serviced from a remote node anyways.

Maybe I did not make this clear in my RFC, but I definitely believe that there
are workloads out there that benefit from zone_reclaim_mode. However, I
also believe that membind is just a better alternative for all the scenarios
that I can think of, so it would really be helpful for my education to learn
about workloads that benefit from zone_reclaim_mode but cannot use membind.

> Furthermore, for typical HPC applications, memory pressure tends to be balanced
> across NUMA nodes. Yet page cache is often generated by background tasks—such as
> logging modules—which breaks memory locality and adversely affects overall performance.

I see. From my very limited understanding of HPC applications, they tend to be
perfectly sized for the nodes they run on, so having logging agents generate
additional page cache really does sound like a problem to me. 

> At the same time, there are a large number of __GFP_THISNODE memory allocation requests in
> the system. Anonymous pages that fall back from other nodes cannot be migrated or easily
> reclaimed (especially when swap is disabled), leading to uneven distribution of available
> memory within a single node. By enabling zone_reclaim_mode, the kernel preferentially reclaims
> file pages within the local NUMA node to satisfy local anonymous-page allocations, which
> effectively avoids warn_alloc problems caused by uneven distribution of anonymous pages.
> 
> In such scenarios, relying solely on mbind may offer limited flexibility.

I see. So if I understand your scenario correctly, what you want is something
between mbind, which strictly guarantees that memory comes from the local
node(s), and the default allocation policy, which falls back to allocating
from remote nodes when the local node runs out of memory.

I have some follow-up questions here.
It seems like remote processes leaking anonymous memory into the current node
is actually caused by two characteristics of zone_reclaim_mode: namely, that
it does not guarantee memory locality, and that it is a system-wide setting.
Under your scenario, we cannot have a mixture of HPC workloads that cannot
tolerate remote memory access latency and non-HPC workloads that would
actually benefit from being able to consume free memory from remote nodes
before triggering reclaim.

So in a scenario where we have multiple HPC workloads running on a multi-NUMA
system, we can just size each workload to fit the nodes, and membind them so
that we don't have to worry about migrating or reclaiming remote processes'
anonymous memory.

In a scenario where we have an HPC workload + non-HPC workloads, we can membind
the HPC workload to a single node, and exclude that node from the other
workloads' nodemasks to prevent anonymous memory from leaking into it.
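
A minimal libnuma sketch of that second setup (node numbers are just an
example; build with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Illustrative: keep this (non-HPC) process off the node reserved for the
 * membound HPC workload, while leaving every other node usable.
 */
static void avoid_node(int reserved_node)
{
        struct bitmask *nodes = numa_allocate_nodemask();
        int node;

        for (node = 0; node <= numa_max_node(); node++)
                if (node != reserved_node)
                        numa_bitmask_setbit(nodes, node);
        numa_set_membind(nodes);
        numa_bitmask_free(nodes);
}

int main(void)
{
        if (numa_available() < 0) {
                fprintf(stderr, "NUMA is not available on this system\n");
                return EXIT_FAILURE;
        }
        avoid_node(0);  /* node 0 is reserved for the HPC workload */
        /* ... run the non-HPC workload ... */
        return EXIT_SUCCESS;
}

From the shell, the same can be expressed with numactl --membind and the
corresponding node lists for each workload.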

> We have also experimented with proactively waking kswapd to improve synchronous reclaim
> efficiency. Our actual tests show that this can roughly double the memory allocation rate[1].

Personally I believe that this could be the way forward. However, there are
still some problems that we have to address, the biggest one being: pagecache
can be considered "garbage" in both your HPC workloads and my microbenchmark.
However, the pagecache can be very valuable in certain scenarios. What if
the workload will access the pagecache in the future? I'm not really sure if
it makes sense to clean up that pagecache and allocate locally, when the
worst-case scenario is that we incur much more latency reading those pages
back in from disk, even though there is still free memory available in the
system.

Perhaps the real solution is to deprecate zone_reclaim_mode and offer more
granular (per-workload) and sane (guarantee memory locality and also wake
kswapd when the node is full) options for the user.

> We could also discuss whether there are better solutions for such HPC scenarios.

Yes, I really hope that we can reach the win-win scenario that I mentioned at
the beginning of the reply. I really want to help users achieve the best
performance they can, and also help keep the kernel easy to maintain in the
long-run. Thank you for sharing your perspective, I really learned a lot.
Looking forward to your response, or if you are coming to LPC, would love to
grab a coffee. Have a great day!
Joshua

> [1]: https://lore.kernel.org/all/20251011062043.772549-1-mawupeng1@huawei.com/
Re: [RFC LPC2025 PATCH 0/4] Deprecate zone_reclaim_mode
Posted by mawupeng 4 days, 15 hours ago

On 2025/12/10 20:30, Joshua Hahn wrote:
> On Tue, 9 Dec 2025 20:43:01 +0800 mawupeng <mawupeng1@huawei.com> wrote:
>>
>> On 2025/12/6 7:32, Joshua Hahn wrote:
>>> Hello folks, 
>>> This is a code RFC for my upcoming discussion at LPC 2025 in Tokyo [1].
>>>
>>> zone_reclaim_mode was introduced in 2005 to prevent the kernel from facing
>>> the high remote access latency associated with NUMA systems. With it enabled,
>>> when the kernel sees that the local node is full, it will stall allocations and
>>> trigger direct reclaim locally, instead of making a remote allocation, even
>>> when there may still be free memory. This is the preferred way to consume memory
>>> if remote memory access is more expensive than performing direct reclaim.
>>> The choice is made on a system-wide basis, but can be toggled at runtime.
>>>
>>> This series deprecates the zone_reclaim_mode sysctl in favor of other NUMA
>>> aware mechanisms, such as NUMA balancing, memory.reclaim, membind, and
>>> tiering / promotion / demotion. Let's break down what differences there are
>>> in these mechanisms, based on workload characteristics.
> 
> [...snip...]
> 
> Hello mawupeng, thank you for your feedback on this RFC.
> 
> I was wondering if you were planning to attend LPC this year. If so, I'll be
> discussing this idea at the MM microconference tomorrow (December 11th) and
> would love to discuss this after the presentation with you in the hallway.
> I want to make sure that I'm not missing any important nuances or use cases
> for zone_reclaim_mode. After all, my only motivation for deprecating this is
> to simplify the code allocation path and reduce maintenance burden, both of
> which definitely does not outweigh valid usecases. On the other hand if we can
> find out that we can deprecate zone_reclaim_mode, and also find some
> alternatives that lead to better performance on your end, that sounds
> like the ultimate win-win scenario for me : -)
> 
>> In real-world scenarios, we have observed on a dual-socket (2P) server with multiple
>> NUMA nodes—each having relatively limited local memory capacity—that page cache
>> negatively impacts overall performance. The zone_reclaim_mode feature is used to
>> alleviate performance issues.
>>
>> The main reason is that page cache consumes free memory on the local node, causing
>> processes without mbind restrictions to fall back to other nodes that still have free
>> memory. Accessing remote memory comes with a significant latency penalty. In extreme
>> testing, if a system is fully populated with page cache beforehand, Spark application
>> performance can drop by 80%. However, with zone_reclaim enabled, the performance
>> degradation is limited to only about 30%.
> 
> This sounds right to me. In fact, I have observed similar results in some
> experiments that I ran myself, where on a 2-NUMA system with 125GB memory each,
> I fill up one node with 100G of garbage filecache and try to run a 60G anon
> workload in it. Here are the average access latency results:
> 
> - zone_reclaim_mode enabled: 56.34 ns/access
> - zone_reclaim_mode disabled: 67.86 ns/access
> 
> However, I was able to achieve better results by disabling zone_reclaim_mode
> and using membind instead:
> 
> - zone_reclaim_mode disabled + membind: 52.98 ns/access
> 
> Of course, these are on my specific system with my specific workload so the
> numbers (and results) may be different on your end. You specifically mentioned
> "processes without mbind restrictions". Is there a reason why these workloads
> cannot be membound to a node?

My apologies for the delayed response — I’ve been discussing the actual workload
scenarios with HPC experts.

In HPC workloads, certain specialized tasks are responsible for initialization,
checkpointing by specific processes, post‑processing, I/O operations, and so on.
Since the memory access patterns are not completely uniform, binding them to a
specific NUMA node via membind is often a more suitable approach.

However, two main concerns arise here:

If a workload spans more than a single NUMA node, binding it to just one node may
not be appropriate. Binding to multiple NUMA nodes might be feasible, but the
impact on page cache behavior requires further evaluation.

HPC applications vary widely, and not all can be addressed by membind. At the same
time, zone_reclaim_mode may not be a complete solution either—this also needs
further investigation.

We should explore whether there are better ways to balance memory locality with
memory availability.

> 
> On that note, I had another follow-up question. If remote latency really is a
> big concern, I am wondering if you have seen remote allocations despite
> enabling zone_reclaim_mode. From my understanding of the code, zone_reclaim_mode
> is not a strict guarantee of memory locality. If direct reclaim fails and
> we fail to reclaim enough, the allocation is serviced from a remote node anyways.
> 
> Maybe I did not make this clear in my RFC, but I definitely believe that there
> are workloads out there that benefit from zone_reclaim_mode. However, I
> also believe that membind is just a better alternative for all the scenarios
> that I can think of, so it would really be helpful for my education to learn
> about workloads that benefit from zone_reclaim_mode but cannot use membind.
> 
>> Furthermore, for typical HPC applications, memory pressure tends to be balanced
>> across NUMA nodes. Yet page cache is often generated by background tasks—such as
>> logging modules—which breaks memory locality and adversely affects overall performance.
> 
> I see. From my very limited understanding of HPC applications, they tend to be
> perfectly sized for the nodes they run on, so having logging agents generate
> additional page cache really does sound like a problem to me. 
> 
>> At the same time, there are a large number of __GFP_THISNODE memory allocation requests in
>> the system. Anonymous pages that fall back from other nodes cannot be migrated or easily
>> reclaimed (especially when swap is disabled), leading to uneven distribution of available
>> memory within a single node. By enabling zone_reclaim_mode, the kernel preferentially reclaims
>> file pages within the local NUMA node to satisfy local anonymous-page allocations, which
>> effectively avoids warn_alloc problems caused by uneven distribution of anonymous pages.
>>
>> In such scenarios, relying solely on mbind may offer limited flexibility.
> 
> I see. So if I understand your scenario correctly, what you want is something
> between mbind which is strict in guaranteeing that memory comes locally, and
> the default memory allocation preference, which prefers allocating from
> remote nodes when the local node runs out of memory.
> 
> I have some follow-up questions here.
> It seems like the fact that anonymous memory from remote processes leaking
> their memory into the current node is actually caused by two characteristics
> of zone_reclaim_mode. Namely, that it does not guarantee memory locality,
> and that it is a system-wide setting. Under your scenario, we cannot have
> a mixture of HPC workloads that cannot handle remote memory access latency,
> as well as non-HPC workloads that would actually benefit from being able to
> consume free memory from remote nodes before triggering reclaim.
> 
> So in a scenario where we have multiple HPC workloads running on a multi-NUMA
> system, we can just size each workload to fit the nodes, and membind them so
> that we don't have to worry about migrating or reclaiming remote processes'
> anonymous memory.
> 
> In a scenario where we have an HPC workload + non-HPC workloads, we can membind
> the HPC workload to a single node, and exclude that node from the other
> workloads' nodemasks to prevent anonymous memory from leaking into it.

Maybe we can try to mbind to multiple nodes in this scenario, as suggested above.

> 
>> We have also experimented with proactively waking kswapd to improve synchronous reclaim
>> efficiency. Our actual tests show that this can roughly double the memory allocation rate[1].
> 
> Personally I believe that this could be the way forward. However, there are
> still some problems that we have to address, the biggest one being: pagecache
> can be considered "garbage" in both your HPC workloads and my microbenchmark.
> However, the pagecache can be very valuable in certain scenarios. What if
> the workload will access the pagecache in the future? I'm not really sure if
> it makes sense to clean up that pagecache and allocate locally, when the
> worst-case scenario is that we have to incur much more latency reading from
> disk and bringing in those pages again, when there is free memory still
> available in the system.

If a larger file cache is required, disabling zone_reclaim_mode may help alleviate the
issue, but it indeed lacks some flexibility.

In my personal understanding, the main problem is that page cache tends to occupy all
available free memory, leading to an unreasonable increase in fallback allocations.
> 
> Perhaps the real solution is to deprecate zone_reclaim_mode and offer more
> granular (per-workload basis), and sane (guarantee memory locality and also
> perform kswapd when the node is full) options for the user.

I think it’s worth looking into whether there are better approaches to address the
current issue.

> 
>> We could also discuss whether there are better solutions for such HPC scenarios.
> 
> Yes, I really hope that we can reach the win-win scenario that I mentioned at
> the beginning of the reply. I really want to help users achieve the best
> performance they can, and also help keep the kernel easy to maintain in the
> long-run. Thank you for sharing your perspective, I really learned a lot.
> Looking forward to your response, or if you are coming to LPC, would love to
> grab a coffee. Have a great day!
> Joshua
> 
>> [1]: https://lore.kernel.org/all/20251011062043.772549-1-mawupeng1@huawei.com/