This code previously always used vmalloc for anything above
PAGE_ALLOC_COSTLY_ORDER, but this logic was changed in
commit eacd86ca3b036 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()").

The commit that changed it did so because "xt_alloc_table_info()
basically opencodes kvmalloc()", which is not actually what it was
doing. kvmalloc() does not go directly to vmalloc when the order the
caller is asking for is "expensive"; instead it only falls back to
vmalloc in case the buddy allocator is unable to fulfill the request.

The difference between the two is huge when the system is under memory
pressure and has no free pages of a large order. Before the change to
kvmalloc we wouldn't even try the buddy allocator for large orders, but
now we force it to find a page of the required order by waking up
kswapd/kcompactd and dropping reclaimable memory, for no reason at all,
to satisfy a huge-order allocation that could easily live in vmalloc'ed
memory instead.

Revert the change and always call vmalloc, since this code doesn't
really benefit from physically contiguous memory, and the size it
allocates is directly dictated by the userspace-passed table buffer,
allowing userspace to torture the buddy allocator by carefully crafting
a huge table that fits right at the maximum available memory order on
the system.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
net/netfilter/x_tables.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 90b7630421c4..c98f4b05d79d 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1190,7 +1190,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
return NULL;
- info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
+ info = __vmalloc(sz, GFP_KERNEL_ACCOUNT);
if (!info)
return NULL;
@@ -1210,7 +1210,7 @@ void xt_free_table_info(struct xt_table_info *info)
kvfree(info->jumpstack);
}
- kvfree(info);
+ vfree(info);
}
EXPORT_SYMBOL(xt_free_table_info);
--
2.34.1
On Mon, Sep 22, 2025 at 12:48 PM Daniil Tatianin <d-tatianin@yandex-team.ru> wrote:
>
> This code previously always used vmalloc for anything above
> PAGE_ALLOC_COSTLY_ORDER, but this logic was changed in
> commit eacd86ca3b036 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()").
>
> [...]
>
> The difference between the two is actually huge in case the system is
> under memory pressure and has no free pages of a large order. Before the
> change to kvmalloc we wouldn't even try going to the buddy allocator for
> large orders, but now we would force it to try to find a page of the
> required order by waking up kswapd/kcompactd and dropping reclaimable
> memory for no reason at all to satisfy our huge order allocation that
> could easily exist within vmalloc'ed memory instead.

This would hint at an issue with kvmalloc(); why not fix it, instead of
trying to fix all its users?

There was a time where PAGE_ALLOC_COSTLY_ORDER was used.

> Revert the change to always call vmalloc, since this code doesn't really
> benefit from contiguous physical memory, and the size it allocates is
> directly dictated by the userspace-passed table buffer thus allowing it
> to torture the buddy allocator by carefully crafting a huge table that
> fits right at the maximum available memory order on the system.
On 9/23/25 12:12 AM, Eric Dumazet wrote:
> On Mon, Sep 22, 2025 at 12:48 PM Daniil Tatianin
> <d-tatianin@yandex-team.ru> wrote:
>> [...]
>> The difference between the two is actually huge in case the system is
>> under memory pressure and has no free pages of a large order.
>> [...]
>
> This would hint at an issue with kvmalloc(), why not fixing it, instead
> of trying to fix all its users ?

Thanks for the quick reply! From my understanding, there are a lot of
callers of kvmalloc() that do indeed benefit from the physical memory
being contiguous, because it is then used for hardware DMA etc., so I'm
not sure that would be feasible.

> There was a time where PAGE_ALLOC_COSTLY_ORDER was used.

Out of curiosity, do you mean kvmalloc used to always fall back to
vmalloc for > COSTLY_ORDER? If so, do you happen to know which commit
changed that behavior? I tried grepping the logs and looking at the git
blame of slub.c, but I guess it was changed too long ago, so I wasn't
successful.
Daniil Tatianin <d-tatianin@yandex-team.ru> wrote:
>>> This would hint at an issue with kvmalloc(), why not fixing it, instead
>>> of trying to fix all its users ?

I agree with Eric. There is nothing special in xtables compared to
kvmalloc usage elsewhere in the stack. Why "fix" xtables and not e.g.
rhashtable?

Please work with mm hackers to improve the situation for your use case.

Maybe it's enough to raise __GFP_NORETRY in kmalloc_gfp_adjust() if the
size results in a >= PAGE_ALLOC_COSTLY_ORDER allocation.

> Thanks for the quick reply! From my understanding, there are a lot of
> callers of kvmalloc that do indeed benefit from the physical memory
> being contiguous, because it is then used for hardware DMA etc., so I'm
> not sure that would be feasible.
How can that work? kvmalloc won't make vmalloc backed memory physically contiguous.
On 9/23/25 2:36 PM, Florian Westphal wrote:
> I agree with Eric. There is nothing special in xtables compared to
> kvmalloc usage elsewhere in the stack. Why "fix" xtables and not e.g.
> rhashtable?
>
> Please work with mm hackers to improve the situation for your use case.
>
> Maybe its enough to raise __GFP_NORETRY in kmalloc_gfp_adjust() if size
> results in >= PAGE_ALLOC_COSTLY_ORDER allocation.

Thanks for your reply! Perhaps this is the way to go, although it might
have much broader implications, since there are tons of other callers to
take into account.

I'm not sure whether rhashtable's size also directly depends on user
input; I was only aware of x_tables since this is the case we ran into
specifically.

> How can that work? kvmalloc won't make vmalloc backed memory
> physically contiguous.

The allocated physical memory won't be contiguous only in the fallback
case (which should be rare); I assume in that case the hardware
operation may end up being more expensive, with larger scatter-gather
lists etc. So most of the time such code can take optimized paths for
fully contiguous memory. This is not the case for x_tables etc.
On Tue, Sep 23, 2025 at 5:04 AM Daniil Tatianin <d-tatianin@yandex-team.ru> wrote:
> The allocated physical memory won't be contiguous only in the fallback
> case (which should be rare); I assume in that case the hardware
> operation may end up being more expensive, with larger scatter-gather
> lists etc. So most of the time such code can take optimized paths for
> fully contiguous memory. This is not the case for x_tables etc.

At least some years ago, we were seeing a performance difference.
x_tables data is often read sequentially; I do not know if modern CPU
hardware prefetchers use the TLB (virtual space), or if they can span a
4K page boundary even when the memory is physically contiguous.

Some context:

commit 6c5ab6511f718c3fb19bcc3f78a90b0e0b601675
Author: Michal Hocko <mhocko@suse.com>
Date:   Mon May 8 15:57:15 2017 -0700

    mm: support __GFP_REPEAT in kvmalloc_node for >32kB

    vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp.
    vhost_vsock because it would really like to prefer kmalloc to the
    vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device
    allocation to vmalloc") for more context. Michael Tsirkin has also
    noted:

commit 23cc5a991c7a9fb7e6d6550e65cee4f4173111c5
Author: Michael S. Tsirkin <mst@redhat.com>
Date:   Wed Jan 23 21:46:47 2013 +0100

    vhost-net: extend device allocation to vmalloc