[PATCH 0/1] mm/page_alloc: dynamic min_free_kbytes adjustment

Posted by wujing 1 month ago
Atomic allocations (GFP_ATOMIC), particularly in network interrupt contexts,
are prone to failure during bursts of traffic if the pre-configured 
min_free_kbytes (atomic reserve) is insufficient. These failures lead to 
packet drops and performance degradation.

Static tuning of vm.min_free_kbytes is often challenging: setting it too 
low risks drops, while setting it too high wastes valuable memory.

This patch series introduces a reactive mechanism that:
1. Detects critical order-0 GFP_ATOMIC allocation failures.
2. Automatically doubles vm.min_free_kbytes to reserve more memory for 
   future bursts.
3. Enforces a safety cap (1% of total RAM) to prevent OOM or excessive waste.

This allows the system to self-adjust to the workload's specific atomic 
memory requirements without manual intervention.
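The v1 policy above (double on failure, cap at 1% of RAM) can be sketched as a small pure function. This is an illustrative model only, not code from the patch; the name `next_min_free_kbytes` and the explicit `total_ram_kb` parameter are assumptions made for the sketch.

```c
#include <assert.h>

/* Illustrative sketch of the v1 policy: on an order-0 GFP_ATOMIC
 * failure, double the reserve, but never beyond 1% of total RAM.
 * Function name and parameters are hypothetical, not from the patch. */
static unsigned long next_min_free_kbytes(unsigned long cur_kb,
                                          unsigned long total_ram_kb)
{
    unsigned long cap_kb = total_ram_kb / 100; /* 1% safety cap */
    unsigned long doubled_kb = cur_kb * 2;

    return doubled_kb < cap_kb ? doubled_kb : cap_kb;
}
```

For example, on a 16 GiB machine the cap is about 160 MiB, so a reserve of 64 MiB doubles to 128 MiB, while a second failure clamps at the cap instead of doubling again.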

wujing (1):
  mm/page_alloc: auto-tune min_free_kbytes on atomic allocation failure

 mm/page_alloc.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

-- 
2.39.5
[PATCH v3 0/1] mm/page_alloc: dynamic watermark boosting
Posted by wujing 1 month ago
This is v3 of the auto-tuning patch, addressing feedback from Vlastimil Babka,
Andrew Morton, and Matthew Wilcox.

Major shift in v3:
Following Vlastimil's suggestion, this version abandons the direct modification
of min_free_kbytes. Instead, it leverages the existing watermark_boost
infrastructure. This approach is more idiomatic as it:
- Avoids conflicts with administrative sysctl settings.
- Only affects specific zones experiencing pressure.
- Utilizes standard kswapd logic for natural decay after reclamation.
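The watermark_boost behavior described above can be modeled roughly as follows. This is a toy user-space model, not the kernel's implementation: the struct and function names are hypothetical, and kswapd's actual decay logic is reduced to a single "reclaim finished" step.

```c
#include <assert.h>

/* Toy model of the watermark_boost idea: the effective min watermark is
 * base + boost; a pressure event raises the boost up to a ceiling, and
 * kswapd-style reclaim later clears it, giving natural decay.
 * All names here are illustrative, not kernel identifiers. */
struct zone_model {
    unsigned long watermark_min;   /* pages, derived from min_free_kbytes */
    unsigned long watermark_boost; /* temporary extra reserve, pages */
    unsigned long boost_max;       /* ceiling for the boost */
};

static unsigned long effective_min(const struct zone_model *z)
{
    return z->watermark_min + z->watermark_boost;
}

static void boost_on_atomic_failure(struct zone_model *z, unsigned long pages)
{
    z->watermark_boost += pages;
    if (z->watermark_boost > z->boost_max)
        z->watermark_boost = z->boost_max;
}

/* Once reclaim catches up, the boost is dropped: no manual decay or
 * persistence logic is needed, which is what v3 relies on. */
static void kswapd_finished_reclaim(struct zone_model *z)
{
    z->watermark_boost = 0;
}
```

The key property, visible even in this toy model, is that the boost is per-zone state with a bounded lifetime: it never touches the administrator's sysctl value and disappears on its own after reclaim.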

Responses to Vlastimil Babka's feedback:
> "Were they really packet drops observed? AFAIK the receive is deferred to non-irq 
> context if those atomic allocations fail, it shouldn't mean a drop."
In our high-concurrency production environment, we observed that while the 
network stack tries to defer processing, persistent GFP_ATOMIC failures 
eventually lead to NIC-level drops due to RX buffer exhaustion.

> "As for the implementation I'd rather not be changing min_free_kbytes directly... 
> We already have watermark_boost to dynamically change watermarks"
Agreed and implemented in v3.

Changes in v3:
- Replaced min_free_kbytes modification with watermark_boost calls.
- Removed all complex decay/persistence logic from v2, relying on kswapd's 
  standard behavior.
- Maintained the 10-second debounce mechanism.
- Engaged netdev@ community as requested by Andrew Morton.

Thanks for the thoughtful reviews!

wujing (1):
  mm/page_alloc: auto-tune watermarks on atomic allocation failure

 mm/page_alloc.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

-- 
2.39.5
[PATCH v2 0/1] mm/page_alloc: dynamic min_free_kbytes adjustment
Posted by wujing 1 month ago
This is v2 of the auto-tuning patch, addressing feedback from Andrew Morton
and Matthew Wilcox.

## Responses to Andrew Morton's feedback:

> "But no attempt to reduce it again after the load spike has gone away."

v2 implements a decay mechanism: min_free_kbytes automatically reduces by 5%
every 5 minutes after being increased. However, it stops at 1.2x the initial
value rather than returning to baseline, ensuring the system "remembers"
previous pressure patterns.
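The decay chain described above (shrink by 5% per 5-minute tick, floor at 1.2x the initial value) reduces to a small step function. This is a sketch with hypothetical names, not code from v2; only the 5% step and 1.2x floor come from the cover letter.

```c
#include <assert.h>

/* Sketch of one v2 decay tick (runs every 5 minutes in the design):
 * shrink the reserve by 5%, but never below 1.2x the initial value,
 * so the system "remembers" earlier pressure. Names are illustrative. */
static unsigned long decay_min_free_kbytes(unsigned long cur_kb,
                                           unsigned long initial_kb)
{
    unsigned long floor_kb = initial_kb + initial_kb / 5; /* 1.2x baseline */
    unsigned long next_kb = cur_kb - cur_kb / 20;         /* -5% per tick */

    return next_kb > floor_kb ? next_kb : floor_kb;
}
```

Starting from an initial reserve of 64 MiB, the floor works out to about 76.8 MiB: ticks above that shrink by 5%, and the first tick that would cross it clamps to the floor instead.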

> "Probably this should be selectable and tunable via a kernel boot parameter
> or a procfs tunable."

Per Matthew Wilcox's preference to avoid new tunables, v2 implements an
algorithm designed to work automatically without configuration. The parameters
(50% increase, 5% decay, 10s debounce) are chosen to be responsive yet stable.

> "Can I suggest that you engage with [the networking people]? netdev@"

Done - netdev@ is now CC'd on this v2 submission.

## Responses to Matthew Wilcox's feedback:

> "Is doubling too aggressive? Would an increase of, say, 10% or 20% be more
> appropriate?"

v2 uses a 50% increase (compromise between responsiveness and conservatism).
20% felt too slow for burst traffic scenarios based on our observations.
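The +50% step, with the same safety cap as before, can be sketched like this. The function name and explicit cap parameter are assumptions for illustration; only the 50% figure is from the cover letter.

```c
#include <assert.h>

/* Sketch of the v2 growth step: +50% instead of doubling, still
 * clamped to a cap. Illustrative names, not code from the patch. */
static unsigned long grow_min_free_kbytes(unsigned long cur_kb,
                                          unsigned long cap_kb)
{
    unsigned long next_kb = cur_kb + cur_kb / 2; /* +50% */

    return next_kb < cap_kb ? next_kb : cap_kb;
}
```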

> "Do we have to wait for failure before increasing? Could we schedule the
> increase for when we get to within, say, 10% of the current limit?"

We considered proactive monitoring but concluded it would add overhead and
complexity. The debounce mechanism (10s) ensures we don't thrash while still
being reactive.
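The 10-second debounce amounts to a timestamp check gating the adjustment. In this sketch the current time is passed in explicitly to keep it testable; the kernel would compare jiffies instead, and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the 10-second debounce: an allocation failure only triggers
 * an adjustment if at least DEBOUNCE_SECS have elapsed since the last
 * one, preventing adjustment storms under sustained failure bursts. */
#define DEBOUNCE_SECS 10

static bool should_adjust(unsigned long now_secs, unsigned long *last_secs)
{
    if (now_secs - *last_secs < DEBOUNCE_SECS)
        return false;
    *last_secs = now_secs;
    return true;
}
```

A burst of failures within one window thus produces at most one adjustment, which is what keeps the reactive scheme from thrashing.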

> "Hm, how would we do that? Automatically decay by 5%, 300 seconds after
> increasing; then schedule another decay for 300 seconds after that..."

Exactly as you suggested! v2 implements this decay chain. The only addition
is stopping at 1.2x baseline to preserve learning.

> "Ugh, please, no new tunables. Let's just implement an algorithm that works."

Agreed - v2 has zero new tunables.

## Changes in v2:
- Reduced aggressiveness: +50% increase instead of doubling
- Added debounce: Only trigger once per 10 seconds to prevent storms
- Added decay: Automatically reduce by 5% every 5 minutes
- Preserve learning: Decay stops at 1.2x initial value, not baseline
- Engaged networking community (netdev@)

Thanks for the thoughtful reviews!

wujing (1):
  mm/page_alloc: auto-tune min_free_kbytes on atomic allocation failure

 mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

-- 
2.39.5