From: Barry Song <v-songbaohua@oppo.com>
On phones, we have observed significant phone heating when running apps
with high network bandwidth. This is caused by the network stack frequently
waking kswapd for order-3 allocations. As a result, memory reclamation becomes
constantly active, even though plenty of memory is still available for network
allocations which can fall back to order-0.
Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
introduced high_order_alloc_disable for the transmit (TX) path
(skb_page_frag_refill()) to mitigate some memory reclamation issues,
allowing the TX path to fall back to order-0 immediately, while leaving the
receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
generally unaware of the sysctl and cannot easily adjust it for specific use
cases. Enabling high_order_alloc_disable also completely disables the
benefit of order-3 allocations. Additionally, the sysctl does not apply to the
RX path.
An alternative approach is to disable kswapd for these frequent
allocations and provide best-effort order-3 service for both TX and RX paths,
while removing the sysctl entirely.
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Cc: Huacai Zhou <zhouhuacai@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 Documentation/admin-guide/sysctl/net.rst | 12 ------------
 include/net/sock.h                       |  1 -
 mm/page_frag_cache.c                     |  2 +-
 net/core/sock.c                          |  8 ++------
 net/core/sysctl_net_core.c               |  7 -------
 5 files changed, 3 insertions(+), 27 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 2ef50828aff1..b903bbae239c 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
list is then passed to the stack when the number of segments reaches the
gro_normal_batch limit.
-high_order_alloc_disable
-------------------------
-
-By default the allocator for page frags tries to use high order pages (order-3
-on x86). While the default behavior gives good results in most cases, some users
-might have hit a contention in page allocations/freeing. This was especially
-true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
-lists. This allows to opt-in for order-0 allocation instead but is now mostly of
-historical importance.
-
-Default: 0
-
2. /proc/sys/net/unix - Parameters for Unix domain sockets
----------------------------------------------------------
diff --git a/include/net/sock.h b/include/net/sock.h
index 60bcb13f045c..62306c1095d5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
extern __u32 sysctl_rmem_default;
#define SKB_FRAG_PAGE_ORDER get_order(32768)
-DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
{
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d2423f30577e..dd36114dd16f 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
gfp_t gfp = gfp_mask;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
+ gfp_mask = (gfp_mask & ~__GFP_RECLAIM) | __GFP_COMP |
__GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
numa_mem_id(), NULL);
diff --git a/net/core/sock.c b/net/core/sock.c
index dc03d4b5909a..1fa1e9177d86 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3085,8 +3085,6 @@ static void sk_leave_memory_pressure(struct sock *sk)
}
}
-DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
-
/**
* skb_page_frag_refill - check that a page_frag contains enough room
* @sz: minimum size of the fragment we want to get
@@ -3110,10 +3108,8 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
}
pfrag->offset = 0;
- if (SKB_FRAG_PAGE_ORDER &&
- !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
- /* Avoid direct reclaim but allow kswapd to wake */
- pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+ if (SKB_FRAG_PAGE_ORDER) {
+ pfrag->page = alloc_pages((gfp & ~__GFP_RECLAIM) |
__GFP_COMP | __GFP_NOWARN |
__GFP_NORETRY,
SKB_FRAG_PAGE_ORDER);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 8cf04b57ade1..181f6532beb8 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -599,13 +599,6 @@ static struct ctl_table net_core_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_THREE,
},
- {
- .procname = "high_order_alloc_disable",
- .data = &net_high_order_alloc_disable_key.key,
- .maxlen = sizeof(net_high_order_alloc_disable_key),
- .mode = 0644,
- .proc_handler = proc_do_static_key,
- },
{
.procname = "gro_normal_batch",
.data = &net_hotdata.gro_normal_batch,
--
2.39.3 (Apple Git-146)
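Both hunks above implement the same policy; as a reading aid, here is a
minimal sketch of the resulting allocation pattern (not the literal
patched functions; it reuses SKB_FRAG_PAGE_ORDER from include/net/sock.h):

/* Sketch: best-effort order-3 that neither wakes kswapd
 * (__GFP_KSWAPD_RECLAIM) nor enters direct reclaim
 * (__GFP_DIRECT_RECLAIM), then an order-0 fallback that keeps the
 * caller's full reclaim behaviour.
 */
static struct page *frag_refill_sketch(gfp_t gfp)
{
	struct page *page;

	page = alloc_pages((gfp & ~__GFP_RECLAIM) | __GFP_COMP |
			   __GFP_NOWARN | __GFP_NORETRY,
			   SKB_FRAG_PAGE_ORDER);
	if (page)
		return page;

	/* The original mask keeps its reclaim bits, so order-0
	 * allocations still reclaim as before.
	 */
	return alloc_pages(gfp, 0);
}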
On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> On phones, we have observed significant phone heating when running apps
> with high network bandwidth. This is caused by the network stack frequently
> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> constantly active, even though plenty of memory is still available for network
> allocations which can fall back to order-0.

I think we need to understand what's going on here a whole lot more than
this!

So, we try to do an order-3 allocation. kswapd runs and ... succeeds in
creating order-3 pages? Or fails to?

If it fails, that's something we need to sort out.

If it succeeds, now we have several order-3 pages, great. But where do
they all go that we need to run kswapd again?
On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> > On phones, we have observed significant phone heating when running apps
> > with high network bandwidth. This is caused by the network stack frequently
> > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > constantly active, even though plenty of memory is still available for network
> > allocations which can fall back to order-0.
>
> I think we need to understand what's going on here a whole lot more than
> this!
>
> So, we try to do an order-3 allocation. kswapd runs and ... succeeds in
> creating order-3 pages? Or fails to?

Our team observed that most of the time we successfully obtain order-3
memory, but the cost is excessive memory reclamation, since we end up
over-reclaiming order-0 pages that could have remained in memory.

> If it fails, that's something we need to sort out.
>
> If it succeeds, now we have several order-3 pages, great. But where do
> they all go that we need to run kswapd again?

The network app keeps running and continues to issue new order-3 allocation
requests, so those few order-3 pages won’t be enough to satisfy the
continuous demand.

Thanks
Barry
On Mon, Oct 13, 2025 at 9:09 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> > > On phones, we have observed significant phone heating when running apps
> > > with high network bandwidth. This is caused by the network stack frequently
> > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > > constantly active, even though plenty of memory is still available for network
> > > allocations which can fall back to order-0.
> >
> > I think we need to understand what's going on here a whole lot more than
> > this!
> >
> > So, we try to do an order-3 allocation. kswapd runs and ... succeeds in
> > creating order-3 pages? Or fails to?
> >
>
> Our team observed that most of the time we successfully obtain order-3
> memory, but the cost is excessive memory reclamation, since we end up
> over-reclaiming order-0 pages that could have remained in memory.
>
> > If it fails, that's something we need to sort out.
> >
> > If it succeeds, now we have several order-3 pages, great. But where do
> > they all go that we need to run kswapd again?
>
> The network app keeps running and continues to issue new order-3 allocation
> requests, so those few order-3 pages won’t be enough to satisfy the
> continuous demand.

These pages are freed as order-3 pages, and should replenish the buddy
as if nothing happened.

I think you are missing something to control how much memory can be
pushed on each TCP socket ?

What is tcp_wmem on your phones ? What about tcp_mem ?

Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
On Tue, Oct 14, 2025 at 1:04 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Oct 13, 2025 at 9:09 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> > > > On phones, we have observed significant phone heating when running apps
> > > > with high network bandwidth. This is caused by the network stack frequently
> > > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > > > constantly active, even though plenty of memory is still available for network
> > > > allocations which can fall back to order-0.
> > >
> > > I think we need to understand what's going on here a whole lot more than
> > > this!
> > >
> > > So, we try to do an order-3 allocation. kswapd runs and ... succeeds in
> > > creating order-3 pages? Or fails to?
> > >
> >
> > Our team observed that most of the time we successfully obtain order-3
> > memory, but the cost is excessive memory reclamation, since we end up
> > over-reclaiming order-0 pages that could have remained in memory.
> >
> > > If it fails, that's something we need to sort out.
> > >
> > > If it succeeds, now we have several order-3 pages, great. But where do
> > > they all go that we need to run kswapd again?
> >
> > The network app keeps running and continues to issue new order-3 allocation
> > requests, so those few order-3 pages won’t be enough to satisfy the
> > continuous demand.
>
> These pages are freed as order-3 pages, and should replenish the buddy
> as if nothing happened.

Ideally, that would be the case if the workload were simple. However, the
system may have many other processes and kernel drivers running
simultaneously, also consuming memory from the buddy allocator and possibly
taking the replenished pages. As a result, we can still observe multiple
kswapd wakeups and instances of over-reclamation caused by the network
stack’s high-order allocations.

>
> I think you are missing something to control how much memory can be
> pushed on each TCP socket ?
>
> What is tcp_wmem on your phones ? What about tcp_mem ?
>
> Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat

# cat /proc/sys/net/ipv4/tcp_wmem
524288 1048576 6710886
# cat /proc/sys/net/ipv4/tcp_mem
131220 174961 262440
# cat /proc/sys/net/ipv4/tcp_notsent_lowat
4294967295

Any thoughts on these settings?

Thanks
Barry
On Tue, Oct 14, 2025 at 1:58 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Oct 14, 2025 at 1:04 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 9:09 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> > > > > On phones, we have observed significant phone heating when running apps
> > > > > with high network bandwidth. This is caused by the network stack frequently
> > > > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > > > > constantly active, even though plenty of memory is still available for network
> > > > > allocations which can fall back to order-0.
> > > >
> > > > I think we need to understand what's going on here a whole lot more than
> > > > this!
> > > >
> > > > So, we try to do an order-3 allocation. kswapd runs and ... succeeds in
> > > > creating order-3 pages? Or fails to?
> > > >
> > >
> > > Our team observed that most of the time we successfully obtain order-3
> > > memory, but the cost is excessive memory reclamation, since we end up
> > > over-reclaiming order-0 pages that could have remained in memory.
> > >
> > > > If it fails, that's something we need to sort out.
> > > >
> > > > If it succeeds, now we have several order-3 pages, great. But where do
> > > > they all go that we need to run kswapd again?
> > >
> > > The network app keeps running and continues to issue new order-3 allocation
> > > requests, so those few order-3 pages won’t be enough to satisfy the
> > > continuous demand.
> >
> > These pages are freed as order-3 pages, and should replenish the buddy
> > as if nothing happened.
>
> Ideally, that would be the case if the workload were simple. However, the
> system may have many other processes and kernel drivers running
> simultaneously, also consuming memory from the buddy allocator and possibly
> taking the replenished pages. As a result, we can still observe multiple
> kswapd wakeups and instances of over-reclamation caused by the network
> stack’s high-order allocations.
>
> >
> > I think you are missing something to control how much memory can be
> > pushed on each TCP socket ?
> >
> > What is tcp_wmem on your phones ? What about tcp_mem ?
> >
> > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
>
> # cat /proc/sys/net/ipv4/tcp_wmem
> 524288 1048576 6710886

Ouch. That is insane tcp_wmem[0] .

Please stick to 4096, or risk OOM of various sorts.

>
> # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> 4294967295
>
> Any thoughts on these settings?

Please look at
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

tcp_notsent_lowat - UNSIGNED INTEGER
A TCP socket can control the amount of unsent bytes in its write queue,
thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
reports POLLOUT events if the amount of unsent bytes is below a per
socket value, and if the write queue is not full. sendmsg() will
also not add new buffers if the limit is hit.

This global variable controls the amount of unsent data for
sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
to the global variable has immediate effect.

Setting this sysctl to 2MB can effectively reduce the amount of memory
in TCP write queues by 66 %,
or allow you to increase tcp_wmem[2] so that only flows needing big
BDP can get it.
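As an aside, the per-socket form of the knob Eric points to is the
TCP_NOTSENT_LOWAT socket option; a minimal user-space sketch, where the
2MB value is just the figure suggested above, not a recommended default:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Cap unsent bytes queued on one TCP socket, independently of the
 * global net.ipv4.tcp_notsent_lowat sysctl.
 */
static int limit_unsent(int fd)
{
	int lowat = 2 * 1024 * 1024;	/* 2MB, the figure from the doc above */

	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &lowat, sizeof(lowat));
}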
> >
> > >
> > > I think you are missing something to control how much memory can be
> > > pushed on each TCP socket ?
> > >
> > > What is tcp_wmem on your phones ? What about tcp_mem ?
> > >
> > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
> >
> > # cat /proc/sys/net/ipv4/tcp_wmem
> > 524288 1048576 6710886
>
> Ouch. That is insane tcp_wmem[0] .
>
> Please stick to 4096, or risk OOM of various sorts.
>
> >
> > # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> > 4294967295
> >
> > Any thoughts on these settings?
>
> Please look at
> https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
>
> tcp_notsent_lowat - UNSIGNED INTEGER
> A TCP socket can control the amount of unsent bytes in its write queue,
> thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
> reports POLLOUT events if the amount of unsent bytes is below a per
> socket value, and if the write queue is not full. sendmsg() will
> also not add new buffers if the limit is hit.
>
> This global variable controls the amount of unsent data for
> sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
> to the global variable has immediate effect.
>
> Setting this sysctl to 2MB can effectively reduce the amount of memory
> in TCP write queues by 66 %,
> or allow you to increase tcp_wmem[2] so that only flows needing big
> BDP can get it.

We obtained these settings from our hardware vendors.

It might be worth exploring these settings further, but I can’t quite see
their connection to high-order allocations, since high-order allocations are
kernel macros.

#define SKB_FRAG_PAGE_ORDER get_order(32768)
#define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK)
#define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE)

Is there anything I’m missing?

Thanks
Barry
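For the arithmetic behind those macros: with the usual 4 KiB PAGE_SIZE,
get_order(32768) evaluates to 3, i.e. eight contiguous pages, and
PAGE_FRAG_CACHE_MAX_ORDER works out to the same value. A user-space
sketch mirroring the kernel helper's rounding (the 4096-byte page size
is an assumption here):

#include <stdio.h>

/* Mirrors the kernel's get_order(): the smallest order whose block of
 * 2^order pages covers size, assuming 4 KiB pages.
 */
static int get_order(unsigned long size)
{
	unsigned long pages = (size + 4095) / 4096;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

int main(void)
{
	printf("get_order(32768) = %d\n", get_order(32768)); /* prints 3 */
	return 0;
}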
On Tue, Oct 14, 2025 at 06:19:05PM +0800, Barry Song wrote:
> > > >
> > > > I think you are missing something to control how much memory can be
> > > > pushed on each TCP socket ?
> > > >
> > > > What is tcp_wmem on your phones ? What about tcp_mem ?
> > > >
> > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
> > >
> > > # cat /proc/sys/net/ipv4/tcp_wmem
> > > 524288 1048576 6710886
> >
> > Ouch. That is insane tcp_wmem[0] .
> >
> > Please stick to 4096, or risk OOM of various sorts.
> >
> > >
> > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> > > 4294967295
> > >
> > > Any thoughts on these settings?
> >
> > Please look at
> > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> >
> > tcp_notsent_lowat - UNSIGNED INTEGER
> > A TCP socket can control the amount of unsent bytes in its write queue,
> > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
> > reports POLLOUT events if the amount of unsent bytes is below a per
> > socket value, and if the write queue is not full. sendmsg() will
> > also not add new buffers if the limit is hit.
> >
> > This global variable controls the amount of unsent data for
> > sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
> > to the global variable has immediate effect.
> >
> > Setting this sysctl to 2MB can effectively reduce the amount of memory
> > in TCP write queues by 66 %,
> > or allow you to increase tcp_wmem[2] so that only flows needing big
> > BDP can get it.
>
> We obtained these settings from our hardware vendors.
>
> It might be worth exploring these settings further, but I can’t quite see
> their connection to high-order allocations,

I don't think there is a connection between them. Is there a reason you
are expecting a connection/relation between them?
On Tue, Oct 14, 2025 at 3:19 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > >
> > > >
> > > > I think you are missing something to control how much memory can be
> > > > pushed on each TCP socket ?
> > > >
> > > > What is tcp_wmem on your phones ? What about tcp_mem ?
> > > >
> > > > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat
> > >
> > > # cat /proc/sys/net/ipv4/tcp_wmem
> > > 524288 1048576 6710886
> >
> > Ouch. That is insane tcp_wmem[0] .
> >
> > Please stick to 4096, or risk OOM of various sorts.
> >
> > >
> > > # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> > > 4294967295
> > >
> > > Any thoughts on these settings?
> >
> > Please look at
> > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> >
> > tcp_notsent_lowat - UNSIGNED INTEGER
> > A TCP socket can control the amount of unsent bytes in its write queue,
> > thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
> > reports POLLOUT events if the amount of unsent bytes is below a per
> > socket value, and if the write queue is not full. sendmsg() will
> > also not add new buffers if the limit is hit.
> >
> > This global variable controls the amount of unsent data for
> > sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
> > to the global variable has immediate effect.
> >
> >
> > Setting this sysctl to 2MB can effectively reduce the amount of memory
> > in TCP write queues by 66 %,
> > or allow you to increase tcp_wmem[2] so that only flows needing big
> > BDP can get it.
>
> We obtained these settings from our hardware vendors.
Tell them they are wrong.
>
> It might be worth exploring these settings further, but I can’t quite see
> their connection to high-order allocations, since high-order allocations are
> kernel macros.
>
> #define SKB_FRAG_PAGE_ORDER get_order(32768)
> #define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK)
> #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE)
>
> Is there anything I’m missing?
What is your question exactly ? You read these macros just fine. What
is your point ?
We had in the past something dynamic that we removed
commit d9b2938aabf757da2d40153489b251d4fc3fdd18
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Aug 27 20:49:34 2014 -0700
net: attempt a single high order allocation
On Mon, Oct 13, 2025 at 3:16 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> On phones, we have observed significant phone heating when running apps
> with high network bandwidth. This is caused by the network stack frequently
> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> constantly active, even though plenty of memory is still available for network
> allocations which can fall back to order-0.
>
> Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> introduced high_order_alloc_disable for the transmit (TX) path
> (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> allowing the TX path to fall back to order-0 immediately, while leaving the
> receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> generally unaware of the sysctl and cannot easily adjust it for specific use
> cases. Enabling high_order_alloc_disable also completely disables the
> benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> RX path.
>
> An alternative approach is to disable kswapd for these frequent
> allocations and provide best-effort order-3 service for both TX and RX paths,
> while removing the sysctl entirely.
>
>
...
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> Documentation/admin-guide/sysctl/net.rst | 12 ------------
> include/net/sock.h | 1 -
> mm/page_frag_cache.c | 2 +-
> net/core/sock.c | 8 ++------
> net/core/sysctl_net_core.c | 7 -------
> 5 files changed, 3 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> index 2ef50828aff1..b903bbae239c 100644
> --- a/Documentation/admin-guide/sysctl/net.rst
> +++ b/Documentation/admin-guide/sysctl/net.rst
> @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> list is then passed to the stack when the number of segments reaches the
> gro_normal_batch limit.
>
> -high_order_alloc_disable
> -------------------------
> -
> -By default the allocator for page frags tries to use high order pages (order-3
> -on x86). While the default behavior gives good results in most cases, some users
> -might have hit a contention in page allocations/freeing. This was especially
> -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> -historical importance.
> -
The sysctl is quite useful for testing purposes, say on a freshly
booted host, with plenty of free memory.
Also, having order-3 pages if possible is quite important for IOMMU use cases.
Perhaps kswapd should have some kind of heuristic to not start if a
recent run has already happened.
I am guessing phones do not need to send 1.6 Tbit per second on
network devices (yet),
an option could be to disable it in your boot scripts.
> >
> > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > index 2ef50828aff1..b903bbae239c 100644
> > --- a/Documentation/admin-guide/sysctl/net.rst
> > +++ b/Documentation/admin-guide/sysctl/net.rst
> > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > list is then passed to the stack when the number of segments reaches the
> > gro_normal_batch limit.
> >
> > -high_order_alloc_disable
> > -------------------------
> > -
> > -By default the allocator for page frags tries to use high order pages (order-3
> > -on x86). While the default behavior gives good results in most cases, some users
> > -might have hit a contention in page allocations/freeing. This was especially
> > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > -historical importance.
> > -
>
> The sysctl is quite useful for testing purposes, say on a freshly
> booted host, with plenty of free memory.
>
> Also, having order-3 pages if possible is quite important for IOMMU use cases.
>
> Perhaps kswapd should have some kind of heuristic to not start if a
> recent run has already happened.
I don’t understand why it shouldn’t start when users continuously request
order-3 allocations and ask kswapd to prepare order-3 memory — it doesn’t
make sense logically to skip it just because earlier requests were already
satisfied.
>
> I am guessing phones do not need to send 1.6 Tbit per second on
> network devices (yet),
> an option could be to disable it in your boot scripts.
A problem with the existing sysctl is that it only covers the TX path;
for the RX path, we also observe that kswapd consumes significant power.
I could add the patch below to make it support the RX path, but it feels
like a bit of a layer violation, since the RX path code resides in mm
and is intended to serve generic users rather than networking, even
though the current callers are primarily network-related.
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d2423f30577e..8ad18ec49f39 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -18,6 +18,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/page_frag_cache.h>
+#include <net/sock.h>
#include "internal.h"
static unsigned long encoded_page_create(struct page *page, unsigned int order,
@@ -54,10 +55,12 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
gfp_t gfp = gfp_mask;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
- __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
- page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
- numa_mem_id(), NULL);
+ if (!static_branch_unlikely(&net_high_order_alloc_disable_key)) {
+ gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
+ __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
+ page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
+ numa_mem_id(), NULL);
+ }
#endif
if (unlikely(!page)) {
Do you have a better idea on how to make the sysctl also cover the RX path?
Thanks
Barry
On Mon, Oct 13, 2025 at 8:58 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > > index 2ef50828aff1..b903bbae239c 100644
> > > --- a/Documentation/admin-guide/sysctl/net.rst
> > > +++ b/Documentation/admin-guide/sysctl/net.rst
> > > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > > list is then passed to the stack when the number of segments reaches the
> > > gro_normal_batch limit.
> > >
> > > -high_order_alloc_disable
> > > -------------------------
> > > -
> > > -By default the allocator for page frags tries to use high order pages (order-3
> > > -on x86). While the default behavior gives good results in most cases, some users
> > > -might have hit a contention in page allocations/freeing. This was especially
> > > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > > -historical importance.
> > > -
> >
> > The sysctl is quite useful for testing purposes, say on a freshly
> > booted host, with plenty of free memory.
> >
> > Also, having order-3 pages if possible is quite important for IOMMU use cases.
> >
> > Perhaps kswapd should have some kind of heuristic to not start if a
> > recent run has already happened.
>
> I don’t understand why it shouldn’t start when users continuously request
> order-3 allocations and ask kswapd to prepare order-3 memory — it doesn’t
> make sense logically to skip it just because earlier requests were already
> satisfied.
>
> >
> > I am guessing phones do not need to send 1.6 Tbit per second on
> > network devices (yet),
> > an option could be to disable it in your boot scripts.
>
> A problem with the existing sysctl is that it only covers the TX path;
> for the RX path, we also observe that kswapd consumes significant power.
> I could add the patch below to make it support the RX path, but it feels
> like a bit of a layer violation, since the RX path code resides in mm
> and is intended to serve generic users rather than networking, even
> though the current callers are primarily network-related.
You might have a buggy driver.
High performance drivers use order-0 allocations only.
>
> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> index d2423f30577e..8ad18ec49f39 100644
> --- a/mm/page_frag_cache.c
> +++ b/mm/page_frag_cache.c
> @@ -18,6 +18,7 @@
> #include <linux/init.h>
> #include <linux/mm.h>
> #include <linux/page_frag_cache.h>
> +#include <net/sock.h>
> #include "internal.h"
>
> static unsigned long encoded_page_create(struct page *page, unsigned int order,
> @@ -54,10 +55,12 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> gfp_t gfp = gfp_mask;
>
> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> - gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> - __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> - page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> - numa_mem_id(), NULL);
> + if (!static_branch_unlikely(&net_high_order_alloc_disable_key)) {
> + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> + __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> + page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> + numa_mem_id(), NULL);
> + }
> #endif
> if (unlikely(!page)) {
>
>
> Do you have a better idea on how to make the sysctl also cover the RX path?
>
> Thanks
> Barry
>
> >
> > A problem with the existing sysctl is that it only covers the TX path;
> > for the RX path, we also observe that kswapd consumes significant power.
> > I could add the patch below to make it support the RX path, but it feels
> > like a bit of a layer violation, since the RX path code resides in mm
> > and is intended to serve generic users rather than networking, even
> > though the current callers are primarily network-related.
>
> You might have a buggy driver.
We are observing the RX path as follows:
do_softirq
tasklet_hi_action
kalPacketAlloc
__netdev_alloc_skb
page_frag_alloc_align
__page_frag_cache_refill
This appears to be a fairly common stack.
So it is a buggy driver?
>
> High performance drivers use order-0 allocations only.
>
Do you have an example of high-performance drivers that use only order-0 memory?
Thanks
Barry
On Mon, Oct 13, 2025 at 11:43 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > >
> > > A problem with the existing sysctl is that it only covers the TX path;
> > > for the RX path, we also observe that kswapd consumes significant power.
> > > I could add the patch below to make it support the RX path, but it feels
> > > like a bit of a layer violation, since the RX path code resides in mm
> > > and is intended to serve generic users rather than networking, even
> > > though the current callers are primarily network-related.
> >
> > You might have a buggy driver.
>
> We are observing the RX path as follows:
>
> do_softirq
> tasklet_hi_action
> kalPacketAlloc
> __netdev_alloc_skb
> page_frag_alloc_align
> __page_frag_cache_refill
>
> This appears to be a fairly common stack.
>
> So it is a buggy driver?

No idea, kalPacketAlloc is not in upstream trees.

It apparently needs high order allocations. It will fail at some point.

>
> >
> > High performance drivers use order-0 allocations only.
>
> Do you have an example of high-performance drivers that use only order-0 memory?

About all drivers using XDP, and/or using napi_get_frags()

XDP has been using order-0 pages from the very beginning.
On Tue, Oct 14, 2025 at 3:01 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Oct 13, 2025 at 11:43 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > >
> > > > A problem with the existing sysctl is that it only covers the TX path;
> > > > for the RX path, we also observe that kswapd consumes significant power.
> > > > I could add the patch below to make it support the RX path, but it feels
> > > > like a bit of a layer violation, since the RX path code resides in mm
> > > > and is intended to serve generic users rather than networking, even
> > > > though the current callers are primarily network-related.
> > >
> > > You might have a buggy driver.
> >
> > We are observing the RX path as follows:
> >
> > do_softirq
> > tasklet_hi_action
> > kalPacketAlloc
> > __netdev_alloc_skb
> > page_frag_alloc_align
> > __page_frag_cache_refill
> >
> > This appears to be a fairly common stack.
> >
> > So it is a buggy driver?
>
> No idea, kalPacketAlloc is not in upstream trees.
>
> It apparently needs high order allocations. It will fail at some point.
>
> >
> > >
> > > High performance drivers use order-0 allocations only.
> > >
> >
> > Do you have an example of high-performance drivers that use only order-0 memory?
>
> About all drivers using XDP, and/or using napi_get_frags()
>
> XDP has been using order-0 pages from the very beginning.
Thanks! But there are still many drivers using netdev_alloc_skb()—we
shouldn’t overlook them, right?
net % git grep netdev_alloc_skb | wc -l
359
Thanks
Barry
On Tue, Oct 14, 2025 at 1:17 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Oct 14, 2025 at 3:01 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 11:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > > >
> > > > > A problem with the existing sysctl is that it only covers the TX path;
> > > > > for the RX path, we also observe that kswapd consumes significant power.
> > > > > I could add the patch below to make it support the RX path, but it feels
> > > > > like a bit of a layer violation, since the RX path code resides in mm
> > > > > and is intended to serve generic users rather than networking, even
> > > > > though the current callers are primarily network-related.
> > > >
> > > > You might have a buggy driver.
> > >
> > > We are observing the RX path as follows:
> > >
> > > do_softirq
> > > tasklet_hi_action
> > > kalPacketAlloc
> > > __netdev_alloc_skb
> > > page_frag_alloc_align
> > > __page_frag_cache_refill
> > >
> > > This appears to be a fairly common stack.
> > >
> > > So it is a buggy driver?
> >
> > No idea, kalPacketAlloc is not in upstream trees.
> >
> > It apparently needs high order allocations. It will fail at some point.
> >
> > >
> > > >
> > > > High performance drivers use order-0 allocations only.
> > > >
> > >
> > > Do you have an example of high-performance drivers that use only order-0 memory?
> >
> > About all drivers using XDP, and/or using napi_get_frags()
> >
> > XDP has been using order-0 pages from the very beginning.
>
> Thanks! But there are still many drivers using netdev_alloc_skb()—we
> shouldn’t overlook them, right?
>
> net % git grep netdev_alloc_skb | wc -l
> 359

Only the ones that are using 16KB allocations like some WAN drivers :)

Some networks use MTU=9000

If a hardware does not provide SG support on receive, a kmalloc() based
skb will use 16KB of memory.

By using a frag allocator, we can pack 3 allocations per 32KB instead of 2.

TCP can go 50% faster. If memory is short, it will fail no matter what.
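The packing math behind that 50% figure, as a sketch; the roughly 10 KiB
per-frame size (MTU 9000 plus skb overhead) is an assumption here:

#include <stdio.h>

int main(void)
{
	unsigned int frag_cache = 32768;     /* PAGE_FRAG_CACHE_MAX_SIZE */
	unsigned int rx_buf = 10240;         /* assumed MTU-9000 buffer incl. overhead */
	unsigned int kmalloc_bucket = 16384; /* kmalloc rounds rx_buf up to 16KB */

	/* 2 buffers per 32KB when each one costs a 16KB kmalloc bucket */
	printf("kmalloc packing: %u per 32KB\n", frag_cache / kmalloc_bucket);
	/* 3 buffers per 32KB when carved out of one order-3 frag cache */
	printf("frag packing:    %u per 32KB\n", frag_cache / rx_buf);
	return 0;
}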
On 10/13/25 12:16, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> On phones, we have observed significant phone heating when running apps
> with high network bandwidth. This is caused by the network stack frequently
> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> constantly active, even though plenty of memory is still available for network
> allocations which can fall back to order-0.
>
> Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> introduced high_order_alloc_disable for the transmit (TX) path
> (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> allowing the TX path to fall back to order-0 immediately, while leaving the
> receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> generally unaware of the sysctl and cannot easily adjust it for specific use
> cases. Enabling high_order_alloc_disable also completely disables the
> benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> RX path.
>
> An alternative approach is to disable kswapd for these frequent
> allocations and provide best-effort order-3 service for both TX and RX paths,
> while removing the sysctl entirely.
>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Kuniyuki Iwashima <kuniyu@google.com>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Simon Horman <horms@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Yunsheng Lin <linyunsheng@huawei.com>
> Cc: Huacai Zhou <zhouhuacai@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> Documentation/admin-guide/sysctl/net.rst | 12 ------------
> include/net/sock.h | 1 -
> mm/page_frag_cache.c | 2 +-
> net/core/sock.c | 8 ++------
> net/core/sysctl_net_core.c | 7 -------
> 5 files changed, 3 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> index 2ef50828aff1..b903bbae239c 100644
> --- a/Documentation/admin-guide/sysctl/net.rst
> +++ b/Documentation/admin-guide/sysctl/net.rst
> @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> list is then passed to the stack when the number of segments reaches the
> gro_normal_batch limit.
>
> -high_order_alloc_disable
> -------------------------
> -
> -By default the allocator for page frags tries to use high order pages (order-3
> -on x86). While the default behavior gives good results in most cases, some users
> -might have hit a contention in page allocations/freeing. This was especially
> -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> -historical importance.
> -
> -Default: 0
> -
> 2. /proc/sys/net/unix - Parameters for Unix domain sockets
> ----------------------------------------------------------
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 60bcb13f045c..62306c1095d5 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
> extern __u32 sysctl_rmem_default;
>
> #define SKB_FRAG_PAGE_ORDER get_order(32768)
> -DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
>
> static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
> {
> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> index d2423f30577e..dd36114dd16f 100644
> --- a/mm/page_frag_cache.c
> +++ b/mm/page_frag_cache.c
> @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> gfp_t gfp = gfp_mask;
>
> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> - gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> + gfp_mask = (gfp_mask & ~__GFP_RECLAIM) | __GFP_COMP |
> __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
fine for the page allocator itself where we have a different entry point
that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully.
I wonder if we should either:
1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
determine it precisely.
2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
not being disturbing (like proposed here), but that can in fact allow
spinning. Instead, decide to not wake up kswapd by those when other
information indicates it's an opportunistic allocation
(~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
order > 0...)
3) something better?
Vlastimil
> page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> numa_mem_id(), NULL);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index dc03d4b5909a..1fa1e9177d86 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3085,8 +3085,6 @@ static void sk_leave_memory_pressure(struct sock *sk)
> }
> }
>
> -DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> -
> /**
> * skb_page_frag_refill - check that a page_frag contains enough room
> * @sz: minimum size of the fragment we want to get
> @@ -3110,10 +3108,8 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
> }
>
> pfrag->offset = 0;
> - if (SKB_FRAG_PAGE_ORDER &&
> - !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
> - /* Avoid direct reclaim but allow kswapd to wake */
> - pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> + if (SKB_FRAG_PAGE_ORDER) {
> + pfrag->page = alloc_pages((gfp & ~__GFP_RECLAIM) |
> __GFP_COMP | __GFP_NOWARN |
> __GFP_NORETRY,
> SKB_FRAG_PAGE_ORDER);
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index 8cf04b57ade1..181f6532beb8 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -599,13 +599,6 @@ static struct ctl_table net_core_table[] = {
> .extra1 = SYSCTL_ZERO,
> .extra2 = SYSCTL_THREE,
> },
> - {
> - .procname = "high_order_alloc_disable",
> - .data = &net_high_order_alloc_disable_key.key,
> - .maxlen = sizeof(net_high_order_alloc_disable_key),
> - .mode = 0644,
> - .proc_handler = proc_do_static_key,
> - },
> {
> .procname = "gro_normal_batch",
> .data = &net_hotdata.gro_normal_batch,
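For reference on the gfpflags_allow_spinning() interaction Vlastimil
raises above: in recent mainline the predicate keys off exactly the two
bits this patch clears, so a mask built with ~__GFP_RECLAIM becomes
indistinguishable from a trylock-only allocation. A sketch of the helper
(check include/linux/gfp.h in your tree):

/* Sketch of gfpflags_allow_spinning(): the absence of both reclaim
 * bits is what alloc_pages_nolock()/kmalloc_nolock() callers rely on
 * to mean "trylock only, never spin on locks".
 */
static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
{
	return !!(gfp_flags & __GFP_RECLAIM);
}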
On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> On 10/13/25 12:16, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
[...]
> I wonder if we should either:
>
> 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> determine it precisely.

As said in other reply I do not think this is a good fit for this
specific case as it is all or nothing approach. Soon enough we discover
that "no effort to reclaim/compact" hurts other usecases. So I do not
think we need a dedicated flag for this specific case. We need a way to
tell kswapd/kcompactd how much to try instead.

--
Michal Hocko
SUSE Labs
On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote:
> On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> > On 10/13/25 12:16, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> [...]
> > I wonder if we should either:
> >
> > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > determine it precisely.
>
> As said in other reply I do not think this is a good fit for this
> specific case as it is all or nothing approach. Soon enough we discover
> that "no effort to reclaim/compact" hurts other usecases. So I do not
> think we need a dedicated flag for this specific case. We need a way to
> tell kswapd/kcompactd how much to try instead.

To me this new flag is to decouple two orthogonal requests i.e. no lock
semantic and don't wakeup kswapd. At the moment the lack of the kswapd gfp
flag conveys the semantics of no lock. This can lead to unintended usage
of no lock semantics by users which for whatever reason don't want to
wakeup kswapd.
On Tue 14-10-25 07:27:06, Shakeel Butt wrote:
> On Tue, Oct 14, 2025 at 09:26:49AM +0200, Michal Hocko wrote:
> > On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> > > On 10/13/25 12:16, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > [...]
> > > I wonder if we should either:
> > >
> > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > > determine it precisely.
> >
> > As said in other reply I do not think this is a good fit for this
> > specific case as it is all or nothing approach. Soon enough we discover
> > that "no effort to reclaim/compact" hurts other usecases. So I do not
> > think we need a dedicated flag for this specific case. We need a way to
> > tell kswapd/kcompactd how much to try instead.
>
> To me this new flag is to decouple two orthogonal requests i.e. no lock
> semantic and don't wakeup kswapd. At the moment the lack of the kswapd gfp
> flag conveys the semantics of no lock. This can lead to unintended usage
> of no lock semantics by users which for whatever reason don't want to
> wakeup kswapd.

I would argue that callers should have no business in saying whether
the MM should wake up kswapd or not. The flag name currently suggests
that but that is mostly for historic reasons. A random page allocator
user shouldn't really care about this low level detail, really.

--
Michal Hocko
SUSE Labs
On Tue, Oct 14, 2025 at 3:26 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 13-10-25 20:30:13, Vlastimil Babka wrote:
> > On 10/13/25 12:16, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> [...]
> > I wonder if we should either:
> >
> > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > determine it precisely.
>
> As said in other reply I do not think this is a good fit for this
> specific case as it is all or nothing approach. Soon enough we discover
> that "no effort to reclaim/compact" hurts other usecases. So I do not
> think we need a dedicated flag for this specific case. We need a way to
> tell kswapd/kcompactd how much to try instead.
+Baolin, who may have observed the same issue.
An issue with vmscan is that kcompactd is woken up very late, only after
reclaiming a large number of order-0 pages to satisfy an order-3
allocation.
static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
{
...
balanced = pgdat_balanced(pgdat, sc.order, highest_zoneidx);
if (!balanced && nr_boost_reclaim) {
nr_boost_reclaim = 0;
goto restart;
}
/*
* If boosting is not active then only reclaim if there are no
* eligible zones. Note that sc.reclaim_idx is not used as
* buffer_heads_over_limit may have adjusted it.
*/
if (!nr_boost_reclaim && balanced)
goto out;
...
if (kswapd_shrink_node(pgdat, &sc))
raise_priority = false;
...
out:
...
/*
* As there is now likely space, wakeup kcompact to defragment
* pageblocks.
*/
wakeup_kcompactd(pgdat, pageblock_order, highest_zoneidx);
}
As pgdat_balanced() needs at least one order-3 page to return true:
bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
int highest_zoneidx, unsigned int alloc_flags,
long free_pages)
{
...
if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
return false;
/* If this is an order-0 request then the watermark is fine */
if (!order)
return true;
/* For a high-order request, check at least one suitable page is free */
for (o = order; o < NR_PAGE_ORDERS; o++) {
struct free_area *area = &z->free_area[o];
int mt;
if (!area->nr_free)
continue;
for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
if (!free_area_empty(area, mt))
return true;
}
#ifdef CONFIG_CMA
if ((alloc_flags & ALLOC_CMA) &&
!free_area_empty(area, MIGRATE_CMA)) {
return true;
}
#endif
if ((alloc_flags & (ALLOC_HIGHATOMIC|ALLOC_OOM)) &&
!free_area_empty(area, MIGRATE_HIGHATOMIC)) {
return true;
}
}
This appears to be incorrect and will always lead to over-reclamation of
order-0 pages to satisfy high-order allocations.
I wonder if we should "goto out" earlier to wake up kcompactd when there
is plenty of memory available, even if no order-3 pages exist.
Conceptually, what I mean is:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c80fcae7f2a1..d0e03066bbaa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7057,9 +7057,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
* eligible zones. Note that sc.reclaim_idx is not used as
* buffer_heads_over_limit may have adjusted it.
*/
- if (!nr_boost_reclaim && balanced)
+ if (!nr_boost_reclaim && (balanced || we_have_plenty_memory_to_compact()))
goto out;
/* Limit the priority of boosting to avoid reclaim writeback */
if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
raise_priority = false;
Thanks
Barry
Vlastimil Babka <vbabka@suse.cz> writes:
> On 10/13/25 12:16, Barry Song wrote:
>> From: Barry Song <v-songbaohua@oppo.com>
>>
>> On phones, we have observed significant phone heating when running apps
>> with high network bandwidth. This is caused by the network stack frequently
>> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
>> constantly active, even though plenty of memory is still available for network
>> allocations which can fall back to order-0.
>>
>> Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
>> introduced high_order_alloc_disable for the transmit (TX) path
>> (skb_page_frag_refill()) to mitigate some memory reclamation issues,
>> allowing the TX path to fall back to order-0 immediately, while leaving the
>> receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
>> generally unaware of the sysctl and cannot easily adjust it for specific use
>> cases. Enabling high_order_alloc_disable also completely disables the
>> benefit of order-3 allocations. Additionally, the sysctl does not apply to the
>> RX path.
>>
>> An alternative approach is to disable kswapd for these frequent
>> allocations and provide best-effort order-3 service for both TX and RX paths,
>> while removing the sysctl entirely.
I'm not sure this is the right path long-term. There are significant
benefits associated with using larger pages, so making the kernel fall
back to order-0 pages easier and sooner feels wrong, tbh. Without kswapd
trying to defragment memory, the only other option is to force tasks
into the direct compaction and it's known to be problematic.
I wonder if instead we should look into optimizing kswapd to be less
power-hungry?
And if you still prefer to disable kswapd for this purpose, at least it
should be conditional to vm.laptop_mode. But again, I don't think it's
the right long-term approach.
Thanks!
On Mon 13-10-25 15:46:54, Roman Gushchin wrote:
> Vlastimil Babka <vbabka@suse.cz> writes:
>
> > On 10/13/25 12:16, Barry Song wrote:
[...]
> >> An alternative approach is to disable kswapd for these frequent
> >> allocations and provide best-effort order-3 service for both TX and RX paths,
> >> while removing the sysctl entirely.
>
> I'm not sure this is the right path long-term. There are significant
> benefits associated with using larger pages, so making the kernel fall
> back to order-0 pages easier and sooner feels wrong, tbh. Without kswapd
> trying to defragment memory, the only other option is to force tasks
> into the direct compaction and it's known to be problematic.
>
> I wonder if instead we should look into optimizing kswapd to be less
> power-hungry?

Exactly. If your specific needs prefer low power consumption to higher
order pages availability then we should have a more flexible way to say
that than a hardcoded allocation mode. We should be able to tell
kswapd/kcompactd how much to try for those allocations.

--
Michal Hocko
SUSE Labs
On Tue, Oct 14, 2025 at 6:47 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Vlastimil Babka <vbabka@suse.cz> writes:
>
> > On 10/13/25 12:16, Barry Song wrote:
> >> From: Barry Song <v-songbaohua@oppo.com>
> >>
> >> On phones, we have observed significant phone heating when running apps
> >> with high network bandwidth. This is caused by the network stack frequently
> >> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> >> constantly active, even though plenty of memory is still available for network
> >> allocations which can fall back to order-0.
> >>
> >> Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> >> introduced high_order_alloc_disable for the transmit (TX) path
> >> (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> >> allowing the TX path to fall back to order-0 immediately, while leaving the
> >> receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> >> generally unaware of the sysctl and cannot easily adjust it for specific use
> >> cases. Enabling high_order_alloc_disable also completely disables the
> >> benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> >> RX path.
> >>
> >> An alternative approach is to disable kswapd for these frequent
> >> allocations and provide best-effort order-3 service for both TX and RX paths,
> >> while removing the sysctl entirely.
>
> I'm not sure this is the right path long-term. There are significant
> benefits associated with using larger pages, so making the kernel fall
> back to order-0 pages easier and sooner feels wrong, tbh. Without kswapd
> trying to defragment memory, the only other option is to force tasks
> into the direct compaction and it's known to be problematic.
I guess the benefits depend on the hardware: for loopback, they might be
significant, while for slower network devices, order-3 memory may provide
much smaller gains?
On the other hand, I wonder if we could make kcompactd more active when
kswapd is woken for order-3 allocations, instead of reclaiming
order-0 pages to form order-3.
>
> I wonder if instead we should look into optimizing kswapd to be less
> power-hungry?
People have been working on this for years, yet reclaiming a folio still
requires a lot of effort, including folio_referenced, try_to_unmap_one,
and compressing folios to swap out to zRAM.
>
> And if you still prefer to disable kswapd for this purpose, at least it
> should be conditional to vm.laptop_mode. But again, I don't think it's
> the right long-term approach.
My point is that phones generally have much slower network hardware
compared to PCs, and far slower hardware compared to servers, so they
are likely not very sensitive to whether memory is order-3 or order-0. On
the other hand, phones are highly sensitive to power consumption. As a
result, the power cost of creating order-3 pages is likely to outweigh any
benefit that order-3 memory might offer for network performance.
It might be worth extending the existing net_high_order_alloc_disable_key
to the RX path, as I mentioned in my reply to Eric[1], allowing users to
decide whether network or power consumption is more important?
[1] https://lore.kernel.org/linux-mm/20251014035846.1519-1-21cnbao@gmail.com/
Thanks
Barry
On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
> On 10/13/25 12:16, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > On phones, we have observed significant phone heating when running apps
> > with high network bandwidth. This is caused by the network stack frequently
> > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > constantly active, even though plenty of memory is still available for network
> > allocations which can fall back to order-0.
> >
> > Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> > introduced high_order_alloc_disable for the transmit (TX) path
> > (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> > allowing the TX path to fall back to order-0 immediately, while leaving the
> > receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> > generally unaware of the sysctl and cannot easily adjust it for specific use
> > cases. Enabling high_order_alloc_disable also completely disables the
> > benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> > RX path.
> >
> > An alternative approach is to disable kswapd for these frequent
> > allocations and provide best-effort order-3 service for both TX and RX paths,
> > while removing the sysctl entirely.
> >
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: Kuniyuki Iwashima <kuniyu@google.com>
> > Cc: Paolo Abeni <pabeni@redhat.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Simon Horman <horms@kernel.org>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Brendan Jackman <jackmanb@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Yunsheng Lin <linyunsheng@huawei.com>
> > Cc: Huacai Zhou <zhouhuacai@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> > Documentation/admin-guide/sysctl/net.rst | 12 ------------
> > include/net/sock.h | 1 -
> > mm/page_frag_cache.c | 2 +-
> > net/core/sock.c | 8 ++------
> > net/core/sysctl_net_core.c | 7 -------
> > 5 files changed, 3 insertions(+), 27 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > index 2ef50828aff1..b903bbae239c 100644
> > --- a/Documentation/admin-guide/sysctl/net.rst
> > +++ b/Documentation/admin-guide/sysctl/net.rst
> > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > list is then passed to the stack when the number of segments reaches the
> > gro_normal_batch limit.
> >
> > -high_order_alloc_disable
> > -------------------------
> > -
> > -By default the allocator for page frags tries to use high order pages (order-3
> > -on x86). While the default behavior gives good results in most cases, some users
> > -might have hit a contention in page allocations/freeing. This was especially
> > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > -historical importance.
> > -
> > -Default: 0
> > -
> > 2. /proc/sys/net/unix - Parameters for Unix domain sockets
> > ----------------------------------------------------------
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 60bcb13f045c..62306c1095d5 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
> > extern __u32 sysctl_rmem_default;
> >
> > #define SKB_FRAG_PAGE_ORDER get_order(32768)
> > -DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> >
> > static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
> > {
> > diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> > index d2423f30577e..dd36114dd16f 100644
> > --- a/mm/page_frag_cache.c
> > +++ b/mm/page_frag_cache.c
> > @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> > gfp_t gfp = gfp_mask;
> >
> > #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > - gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> > + gfp_mask = (gfp_mask & ~__GFP_RECLAIM) | __GFP_COMP |
> > __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>
> I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
> we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
> interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
> fine for the page allocator itself where we have a different entry point
> that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
> of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
> objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully
>
> I wonder if we should either:
>
> 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> determine it precisely.
>
> 2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
> not being disturbing (like proposed here), but that can in fact allow
> spinning. Instead, decide to not wake up kswapd by those when other
> information indicates it's an opportunistic allocation
> (~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
> order > 0...)
>
> 3) something better?
>
For the !allow_spin allocations, I think we should just add a new __GFP
flag instead of adding more complexity to other allocators which may or
may not want kswapd wakeup for many different reasons.
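For context, gfpflags_allow_spinning() today keys off exactly the bits
this patch clears; a rough paraphrase of the include/linux/gfp.h helper:

static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
{
	/*
	 * Only alloc_pages_nolock()/kmalloc_nolock() pass a mask with
	 * both reclaim bits cleared today, so an ordinary caller that
	 * clears ~__GFP_RECLAIM (as this patch does) becomes
	 * indistinguishable from a "cannot spin" context for nested
	 * metadata allocations.
	 */
	return !!(gfp_flags & __GFP_RECLAIM);
}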
On Mon, Oct 13, 2025 at 2:35 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
> > On 10/13/25 12:16, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > On phones, we have observed significant phone heating when running apps
> > > with high network bandwidth. This is caused by the network stack frequently
> > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > > constantly active, even though plenty of memory is still available for network
> > > allocations which can fall back to order-0.
> > >
> > > Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> > > introduced high_order_alloc_disable for the transmit (TX) path
> > > (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> > > allowing the TX path to fall back to order-0 immediately, while leaving the
> > > receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> > > generally unaware of the sysctl and cannot easily adjust it for specific use
> > > cases. Enabling high_order_alloc_disable also completely disables the
> > > benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> > > RX path.
> > >
> > > An alternative approach is to disable kswapd for these frequent
> > > allocations and provide best-effort order-3 service for both TX and RX paths,
> > > while removing the sysctl entirely.
> > >
> > > Cc: Jonathan Corbet <corbet@lwn.net>
> > > Cc: Eric Dumazet <edumazet@google.com>
> > > Cc: Kuniyuki Iwashima <kuniyu@google.com>
> > > Cc: Paolo Abeni <pabeni@redhat.com>
> > > Cc: Willem de Bruijn <willemb@google.com>
> > > Cc: "David S. Miller" <davem@davemloft.net>
> > > Cc: Jakub Kicinski <kuba@kernel.org>
> > > Cc: Simon Horman <horms@kernel.org>
> > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Brendan Jackman <jackmanb@google.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Zi Yan <ziy@nvidia.com>
> > > Cc: Yunsheng Lin <linyunsheng@huawei.com>
> > > Cc: Huacai Zhou <zhouhuacai@oppo.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > > Documentation/admin-guide/sysctl/net.rst | 12 ------------
> > > include/net/sock.h | 1 -
> > > mm/page_frag_cache.c | 2 +-
> > > net/core/sock.c | 8 ++------
> > > net/core/sysctl_net_core.c | 7 -------
> > > 5 files changed, 3 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > > index 2ef50828aff1..b903bbae239c 100644
> > > --- a/Documentation/admin-guide/sysctl/net.rst
> > > +++ b/Documentation/admin-guide/sysctl/net.rst
> > > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > > list is then passed to the stack when the number of segments reaches the
> > > gro_normal_batch limit.
> > >
> > > -high_order_alloc_disable
> > > -------------------------
> > > -
> > > -By default the allocator for page frags tries to use high order pages (order-3
> > > -on x86). While the default behavior gives good results in most cases, some users
> > > -might have hit a contention in page allocations/freeing. This was especially
> > > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > > -historical importance.
> > > -
> > > -Default: 0
> > > -
> > > 2. /proc/sys/net/unix - Parameters for Unix domain sockets
> > > ----------------------------------------------------------
> > >
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 60bcb13f045c..62306c1095d5 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
> > > extern __u32 sysctl_rmem_default;
> > >
> > > #define SKB_FRAG_PAGE_ORDER get_order(32768)
> > > -DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> > >
> > > static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
> > > {
> > > diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> > > index d2423f30577e..dd36114dd16f 100644
> > > --- a/mm/page_frag_cache.c
> > > +++ b/mm/page_frag_cache.c
> > > @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> > > gfp_t gfp = gfp_mask;
> > >
> > > #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > > - gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> > > + gfp_mask = (gfp_mask & ~__GFP_RECLAIM) | __GFP_COMP |
> > > __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> >
> > I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
> > we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
> > interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
> > fine for the page allocator itself where we have a different entry point
> > that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
> > of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
> > objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully
> >
> > I wonder if we should either:
> >
> > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > determine it precisely.
> >
> > 2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
> > not being disturbing (like proposed here), but that can in fact allow
> > spinning. Instead, decide to not wake up kswapd by those when other
> > information indicates it's an opportunistic allocation
> > (~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
> > order > 0...)
> >
> > 3) something better?
> >
>
> For the !allow_spin allocations, I think we should just add a new __GFP
> flag instead of adding more complexity to other allocators which may or
> may not want kswapd wakeup for many different reasons.
That's what I proposed long ago, but was convinced that the new flag
would add more complexity. Looks like we've walked this road far enough
that the new flag will actually make things simpler.
Back then I proposed __GFP_TRYLOCK, which is not a good name.
How about __GFP_NOLOCK or __GFP_NOSPIN?
On Mon, Oct 13, 2025 at 02:53:17PM -0700, Alexei Starovoitov wrote:
> On Mon, Oct 13, 2025 at 2:35 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
[...]
> > >
> > > I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
> > > we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
> > > interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
> > > fine for the page allocator itself where we have a different entry point
> > > that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
> > > of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
> > > objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully
> > >
> > > I wonder if we should either:
> > >
> > > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > > determine it precisely.
> > >
> > > 2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
> > > not being disturbing (like proposed here), but that can in fact allow
> > > spinning. Instead, decide to not wake up kswapd by those when other
> > > information indicates it's an opportunistic allocation
> > > (~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
> > > order > 0...)
> > >
> > > 3) something better?
> > >
> >
> > For the !allow_spin allocations, I think we should just add a new __GFP
> > flag instead of adding more complexity to other allocators which may or
> > may not want kswapd wakeup for many different reasons.
>
> That's what I proposed long ago, but was convinced that the new flag
> would add more complexity.

Oh somehow I thought we took that route because we are low on available
bits.

> Looks like we've walked this road far enough
> that the new flag will actually make things simpler.
> Back then I proposed __GFP_TRYLOCK, which is not a good name.
> How about __GFP_NOLOCK or __GFP_NOSPIN?

Let's go with __GFP_NOLOCK as we already have nolock variants of the
allocation APIs.
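A hypothetical sketch of what the dedicated flag could look like (the
bit value and wiring below are illustrative only, not the eventual
patch):

/* include/linux/gfp_types.h: claim a free bit (value illustrative) */
#define ___GFP_NOLOCK		0x10000000u
#define __GFP_NOLOCK		((__force gfp_t)___GFP_NOLOCK)

/* include/linux/gfp.h */
static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
{
	/*
	 * "Cannot spin" is now an explicit request, so callers such as
	 * the page frag cache can drop the reclaim bits purely to avoid
	 * waking kswapd without being mistaken for nolock contexts.
	 */
	return !(gfp_flags & __GFP_NOLOCK);
}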