We are about to use the page_frag_alloc_*() API not just to
allocate memory for skb->data, but also to do the memory
allocation for skb frags. Currently the page_frag
implementation in the mm subsystem runs the offset as a
countdown rather than a count-up value; there may be several
advantages to that, as mentioned in [1], but it also has some
disadvantages, for example, it may prevent skb frag coalescing
and more effective cache prefetching.

We have a trade-off to make in order to have a unified
implementation and API for page_frag, so use an initial zero
offset in this patch; the following patch will try to
optimize away the disadvantages as much as possible.
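As an illustration, the count-up scheme this patch switches to can be
modeled in plain C (a hypothetical userspace sketch, where
frag_cache_model stands in for struct page_frag_cache; not the kernel
code itself):

```c
#include <assert.h>

/*
 * Hypothetical userspace model of the count-up offset scheme.
 * "size" stands for the bytes available in the cached page
 * (PAGE_FRAG_CACHE_MAX_SIZE or PAGE_SIZE in the kernel).
 */
struct frag_cache_model {
	unsigned int offset;	/* first free byte, counting up from 0 */
	unsigned int size;	/* bytes in the cached page */
};

/*
 * Return the offset the new fragment starts at, or -1 when the cached
 * page is exhausted (the real code would refill the cache or return
 * NULL at that point).
 */
static long frag_alloc_model(struct frag_cache_model *nc,
			     unsigned int fragsz, unsigned int align_mask)
{
	/* Align up, mirroring __ALIGN_KERNEL_MASK(nc->offset, ~align_mask). */
	unsigned int offset = (nc->offset + ~align_mask) & align_mask;

	if (offset + fragsz > nc->size)
		return -1;

	nc->offset = offset + fragsz;
	return offset;
}
```

Because the offset only grows, two back-to-back allocations return
adjacent addresses, which is what makes skb frag coalescing possible.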
1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@gmail.com/
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
---
mm/page_frag_cache.c | 46 ++++++++++++++++++++++----------------------
1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index 609a485cd02a..4c8e04379cb3 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -63,9 +63,13 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask,
unsigned int align_mask)
{
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+ unsigned int size = nc->size;
+#else
unsigned int size = PAGE_SIZE;
+#endif
+ unsigned int offset;
struct page *page;
- int offset;
if (unlikely(!nc->va)) {
refill:
@@ -85,11 +89,24 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
/* reset page count bias and offset to start of new frag */
nc->pfmemalloc = page_is_pfmemalloc(page);
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- nc->offset = size;
+ nc->offset = 0;
}
- offset = nc->offset - fragsz;
- if (unlikely(offset < 0)) {
+ offset = __ALIGN_KERNEL_MASK(nc->offset, ~align_mask);
+ if (unlikely(offset + fragsz > size)) {
+ if (unlikely(fragsz > PAGE_SIZE)) {
+ /*
+ * The caller is trying to allocate a fragment
+ * with fragsz > PAGE_SIZE but the cache isn't big
+ * enough to satisfy the request, this may
+ * happen in low memory conditions.
+ * We don't release the cache page because
+ * it could make memory pressure worse
+ * so we simply return NULL here.
+ */
+ return NULL;
+ }
+
page = virt_to_page(nc->va);
if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
@@ -100,33 +117,16 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
goto refill;
}
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
/* OK, page count is 0, we can safely set it */
set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
/* reset page count bias and offset to start of new frag */
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- offset = size - fragsz;
- if (unlikely(offset < 0)) {
- /*
- * The caller is trying to allocate a fragment
- * with fragsz > PAGE_SIZE but the cache isn't big
- * enough to satisfy the request, this may
- * happen in low memory conditions.
- * We don't release the cache page because
- * it could make memory pressure worse
- * so we simply return NULL here.
- */
- return NULL;
- }
+ offset = 0;
}
nc->pagecnt_bias--;
- offset &= align_mask;
- nc->offset = offset;
+ nc->offset = offset + fragsz;
return nc->va + offset;
}
--
2.33.0
Hi Yunsheng,
On 10/28/24 04:53, Yunsheng Lin wrote:
> We are about to use page_frag_alloc_*() API to not just
> allocate memory for skb->data, but also use them to do
> the memory allocation for skb frag too. Currently the
> implementation of page_frag in mm subsystem is running
> the offset as a countdown rather than count-up value,
> there may have several advantages to that as mentioned
> in [1], but it may have some disadvantages, for example,
> it may disable skb frag coalescing and more correct cache
> prefetching
>
> We have a trade-off to make in order to have a unified
> implementation and API for page_frag, so use a initial zero
> offset in this patch, and the following patch will try to
> make some optimization to avoid the disadvantages as much
> as possible.
>
> 1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@gmail.com/
>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Linux-MM <linux-mm@kvack.org>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Sorry for the late feedback; this patch causes the bgmac driver's
.ndo_open() function to return -ENOMEM. The call trace looks like this:
bgmac_open
-> bgmac_dma_init
-> bgmac_dma_rx_skb_for_slot
-> netdev_alloc_frag
BGMAC_RX_ALLOC_SIZE = 10048 and PAGE_FRAG_CACHE_MAX_SIZE = 32768.
Eventually we land into __page_frag_alloc_align() with the following
parameters across multiple successive calls:
__page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=0
__page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=10048
__page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=20096
__page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=30144
So in that case we do indeed have offset + fragsz (40192) > size (32768)
and so we would eventually return NULL.
Any idea on how to best fix that within the bgmac driver?
Thanks!
> ---
> mm/page_frag_cache.c | 46 ++++++++++++++++++++++----------------------
> 1 file changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> index 609a485cd02a..4c8e04379cb3 100644
> --- a/mm/page_frag_cache.c
> +++ b/mm/page_frag_cache.c
> @@ -63,9 +63,13 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
> unsigned int fragsz, gfp_t gfp_mask,
> unsigned int align_mask)
> {
> +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> + unsigned int size = nc->size;
> +#else
> unsigned int size = PAGE_SIZE;
> +#endif
> + unsigned int offset;
> struct page *page;
> - int offset;
>
> if (unlikely(!nc->va)) {
> refill:
> @@ -85,11 +89,24 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
> /* reset page count bias and offset to start of new frag */
> nc->pfmemalloc = page_is_pfmemalloc(page);
> nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> - nc->offset = size;
> + nc->offset = 0;
> }
>
> - offset = nc->offset - fragsz;
> - if (unlikely(offset < 0)) {
> + offset = __ALIGN_KERNEL_MASK(nc->offset, ~align_mask);
> + if (unlikely(offset + fragsz > size)) {
> + if (unlikely(fragsz > PAGE_SIZE)) {
> + /*
> + * The caller is trying to allocate a fragment
> + * with fragsz > PAGE_SIZE but the cache isn't big
> + * enough to satisfy the request, this may
> + * happen in low memory conditions.
> + * We don't release the cache page because
> + * it could make memory pressure worse
> + * so we simply return NULL here.
> + */
> + return NULL;
> + }
> +
> page = virt_to_page(nc->va);
>
> if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
> @@ -100,33 +117,16 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
> goto refill;
> }
>
> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> - /* if size can vary use size else just use PAGE_SIZE */
> - size = nc->size;
> -#endif
> /* OK, page count is 0, we can safely set it */
> set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
>
> /* reset page count bias and offset to start of new frag */
> nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> - offset = size - fragsz;
> - if (unlikely(offset < 0)) {
> - /*
> - * The caller is trying to allocate a fragment
> - * with fragsz > PAGE_SIZE but the cache isn't big
> - * enough to satisfy the request, this may
> - * happen in low memory conditions.
> - * We don't release the cache page because
> - * it could make memory pressure worse
> - * so we simply return NULL here.
> - */
> - return NULL;
> - }
> + offset = 0;
> }
>
> nc->pagecnt_bias--;
> - offset &= align_mask;
> - nc->offset = offset;
> + nc->offset = offset + fragsz;
>
> return nc->va + offset;
> }
--
Florian
On 2025/1/24 3:15, Florian Fainelli wrote:
>
> Sorry for the late feedback, this patch causes the bgmac driver in is
> .ndo_open() function to return -ENOMEM, the call trace looks like this:

Hi, Florian

Thanks for the report.

>
> bgmac_open
> -> bgmac_dma_init
> -> bgmac_dma_rx_skb_for_slot
> -> netdev_alloc_frag
>
> BGMAC_RX_ALLOC_SIZE = 10048 and PAGE_FRAG_CACHE_MAX_SIZE = 32768.

I guess BGMAC_RX_ALLOC_SIZE being bigger than PAGE_SIZE is the problem
here, as the frag API does not really support allocating a fragment
that is bigger than PAGE_SIZE: it falls back to allocating a base page
when the order-3 compound page allocation fails, see
__page_frag_cache_refill().

Also, it seems strange that the bgmac driver always uses the jumbo
frame size to allocate fragments; isn't it more appropriate to allocate
the fragment according to the MTU?

>
> Eventually we land into __page_frag_alloc_align() with the following
> parameters across multiple successive calls:
>
> __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=0
> __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=10048
> __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=20096
> __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=30144
>
> So in that case we do indeed have offset + fragsz (40192) > size (32768)
> and so we would eventually return NULL.

It seems the change of the '(unlikely(offset < 0))' check to
"fragsz > PAGE_SIZE" causes the bgmac driver to break more easily here.
But the bgmac driver might likely have broken too, before this patch,
if system memory was severely fragmented and we fell back to allocating
a base page.

>
> Any idea on how to best fix that within the bgmac driver?

Maybe use the page allocator API directly when allocating a fragment
with BGMAC_RX_ALLOC_SIZE > PAGE_SIZE, as a quick fix.

In the long term, maybe it makes sense to use the page_pool API, as
more drivers are converting to the page_pool API already.
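As an illustration of the quick-fix routing suggested above, this is
the decision it would make (a hypothetical userspace model assuming
4 KiB pages; the actual driver change and alloc_pages() call are not
shown):

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096u	/* assumes 4 KiB pages */

/*
 * True when a receive buffer should go to the page allocator directly
 * (e.g. alloc_pages()) instead of netdev_alloc_frag(). Userspace model,
 * not the actual bgmac change.
 */
static bool use_page_allocator(unsigned int bufsz)
{
	return bufsz > MODEL_PAGE_SIZE;
}

/* Smallest allocation order the page allocator path would need. */
static unsigned int alloc_order(unsigned int bufsz)
{
	unsigned int order = 0;

	while ((MODEL_PAGE_SIZE << order) < bufsz)
		order++;
	return order;
}
```

For BGMAC_RX_ALLOC_SIZE = 10048 this routes the request to an order-2
(16 KiB) page allocation instead of the frag cache.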
On 1/24/25 01:52, Yunsheng Lin wrote:
> On 2025/1/24 3:15, Florian Fainelli wrote:
>>
>> Sorry for the late feedback, this patch causes the bgmac driver in is
>> .ndo_open() function to return -ENOMEM, the call trace looks like this:
>
> Hi, Florian
> Thanks for the report.
>
>>
>> bgmac_open
>> -> bgmac_dma_init
>> -> bgmac_dma_rx_skb_for_slot
>> -> netdev_alloc_frag
>>
>> BGMAC_RX_ALLOC_SIZE = 10048 and PAGE_FRAG_CACHE_MAX_SIZE = 32768.
>
> I guess BGMAC_RX_ALLOC_SIZE being bigger than PAGE_SIZE is the
> problem here, as the frag API is not really supporting allocating
> fragment that is bigger than PAGE_SIZE, as it will fall back to
> allocate the base page when order-3 compound page allocation fails,
> see __page_frag_cache_refill().
>
> Also, it seems strange that bgmac driver seems to always use jumbo
> frame size to allocate fragment, isn't more appropriate to allocate
> the fragment according to MTU?

Totally. Even though my email domain would suggest otherwise, I am just
a user of that driver here, not its maintainer. Though I do have some
familiarity with it, I don't know why that choice was made.

>
>> Eventually we land into __page_frag_alloc_align() with the following
>> parameters across multiple successive calls:
>>
>> __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=0
>> __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=10048
>> __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=20096
>> __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=30144
>>
>> So in that case we do indeed have offset + fragsz (40192) > size (32768)
>> and so we would eventually return NULL.
>
> It seems the changing of '(unlikely(offset < 0))' checking to
> "fragsz > PAGE_SIZE" causes bgmac driver to break more easily
> here. But bgmac driver might likely break too if the system
> memory is severely fragmented when falling back to allocate
> base page before this patch.
>
>> Any idea on how to best fix that within the bgmac driver?
>
> Maybe use the page allocator API directly when allocating fragment
> with BGMAC_RX_ALLOC_SIZE > PAGE_SIZE for a quick fix.
>
> In the long term, maybe it makes sense to use the page_pool API
> as more drivers are converting to use page_pool API already.

Short term, I think I am going to submit a quick fix inspired by the
out-of-tree patches carried in OpenWrt:

https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=target/linux/bcm53xx/patches-6.6/700-bgmac-reduce-max-frame-size-to-support-just-MTU-1500.patch;h=3a2f4b06ed6d8cda1f3f0be23e1066267234766b;hb=HEAD

Thanks!
--
Florian