[v3] introduce page_pool_alloc() API

[PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Yunsheng Lin 2 years, 8 months ago

Currently page pool supports the below use cases:
use case 1: allocate page without page splitting using
            page_pool_alloc_pages() API if the driver knows
            that the memory it need is always bigger than
            half of the page allocated from page pool.
use case 2: allocate page frag with page splitting using
            page_pool_alloc_frag() API if the driver knows
            that the memory it need is always smaller than
            or equal to the half of the page allocated from
            page pool.

There is emerging use case [1] & [2] that is a mix of the
above two case: the driver doesn't know the size of memory it
need beforehand, so the driver may use something like below to
allocate memory with least memory utilization and performance
penalty:

if (size << 1 > max_size)
	page = page_pool_alloc_pages();
else
	page = page_pool_alloc_frag();

To avoid the driver doing something like above, add the
page_pool_alloc() API to support the above use case, and update
the true size of memory that is acctually allocated by updating
'*size' back to the driver in order to avoid the truesize
underestimate problem.

1. https://lore.kernel.org/all/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org/
2. https://lore.kernel.org/all/20230526054621.18371-3-liangchen.linux@gmail.com/

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
---
 include/net/page_pool.h | 43 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 0b8cd2acc1d7..c135cd157cea 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -260,6 +260,49 @@ static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
 	return page_pool_alloc_frag(pool, offset, size, gfp);
 }
 
+static inline struct page *page_pool_alloc(struct page_pool *pool,
+					   unsigned int *offset,
+					   unsigned int *size, gfp_t gfp)
+{
+	unsigned int max_size = PAGE_SIZE << pool->p.order;
+	struct page *page;
+
+	*size = ALIGN(*size, dma_get_cache_alignment());
+
+	if (WARN_ON(*size > max_size))
+		return NULL;
+
+	if ((*size << 1) > max_size || PAGE_POOL_DMA_USE_PP_FRAG_COUNT) {
+		*size = max_size;
+		*offset = 0;
+		return page_pool_alloc_pages(pool, gfp);
+	}
+
+	page = __page_pool_alloc_frag(pool, offset, *size, gfp);
+	if (unlikely(!page))
+		return NULL;
+
+	/* There is very likely not enough space for another frag, so append the
+	 * remaining size to the current frag to avoid truesize underestimate
+	 * problem.
+	 */
+	if (pool->frag_offset + *size > max_size) {
+		*size = max_size - *offset;
+		pool->frag_offset = max_size;
+	}
+
+	return page;
+}
+
+static inline struct page *page_pool_dev_alloc(struct page_pool *pool,
+					       unsigned int *offset,
+					       unsigned int *size)
+{
+	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+	return page_pool_alloc(pool, offset, size, gfp);
+}
+
 /* get the stored dma direction. A driver might decide to treat this locally and
  * avoid the extra cache line from page_pool to determine the direction
  */
-- 
2.33.0

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Alexander Duyck 2 years, 8 months ago

On Fri, Jun 9, 2023 at 6:20 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> Currently page pool supports the below use cases:
> use case 1: allocate page without page splitting using
>             page_pool_alloc_pages() API if the driver knows
>             that the memory it need is always bigger than
>             half of the page allocated from page pool.
> use case 2: allocate page frag with page splitting using
>             page_pool_alloc_frag() API if the driver knows
>             that the memory it need is always smaller than
>             or equal to the half of the page allocated from
>             page pool.
>
> There is emerging use case [1] & [2] that is a mix of the
> above two case: the driver doesn't know the size of memory it
> need beforehand, so the driver may use something like below to
> allocate memory with least memory utilization and performance
> penalty:
>
> if (size << 1 > max_size)
>         page = page_pool_alloc_pages();
> else
>         page = page_pool_alloc_frag();
>
> To avoid the driver doing something like above, add the
> page_pool_alloc() API to support the above use case, and update
> the true size of memory that is acctually allocated by updating
> '*size' back to the driver in order to avoid the truesize
> underestimate problem.
>
> 1. https://lore.kernel.org/all/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org/
> 2. https://lore.kernel.org/all/20230526054621.18371-3-liangchen.linux@gmail.com/
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> CC: Lorenzo Bianconi <lorenzo@kernel.org>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> ---
>  include/net/page_pool.h | 43 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
>
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 0b8cd2acc1d7..c135cd157cea 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -260,6 +260,49 @@ static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
>         return page_pool_alloc_frag(pool, offset, size, gfp);
>  }
>
> +static inline struct page *page_pool_alloc(struct page_pool *pool,
> +                                          unsigned int *offset,
> +                                          unsigned int *size, gfp_t gfp)
> +{
> +       unsigned int max_size = PAGE_SIZE << pool->p.order;
> +       struct page *page;
> +
> +       *size = ALIGN(*size, dma_get_cache_alignment());
> +
> +       if (WARN_ON(*size > max_size))
> +               return NULL;
> +
> +       if ((*size << 1) > max_size || PAGE_POOL_DMA_USE_PP_FRAG_COUNT) {
> +               *size = max_size;
> +               *offset = 0;
> +               return page_pool_alloc_pages(pool, gfp);
> +       }
> +
> +       page = __page_pool_alloc_frag(pool, offset, *size, gfp);
> +       if (unlikely(!page))
> +               return NULL;
> +
> +       /* There is very likely not enough space for another frag, so append the
> +        * remaining size to the current frag to avoid truesize underestimate
> +        * problem.
> +        */
> +       if (pool->frag_offset + *size > max_size) {
> +               *size = max_size - *offset;
> +               pool->frag_offset = max_size;
> +       }
> +

Rather than preventing a truesize underestimation this will cause one.
You are adding memory to the size of the page reserved and not
accounting for it anywhere as this isn't reported up to the network
stack. I would suggest dropping this from your patch.

> +       return page;
> +}
> +
> +static inline struct page *page_pool_dev_alloc(struct page_pool *pool,
> +                                              unsigned int *offset,
> +                                              unsigned int *size)
> +{
> +       gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
> +
> +       return page_pool_alloc(pool, offset, size, gfp);
> +}
> +
>  /* get the stored dma direction. A driver might decide to treat this locally and
>   * avoid the extra cache line from page_pool to determine the direction
>   */
> --
> 2.33.0
>

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Yunsheng Lin 2 years, 8 months ago

On 2023/6/13 22:36, Alexander Duyck wrote:
> On Fri, Jun 9, 2023 at 6:20 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:

...

>>
>> +static inline struct page *page_pool_alloc(struct page_pool *pool,
>> +                                          unsigned int *offset,
>> +                                          unsigned int *size, gfp_t gfp)
>> +{
>> +       unsigned int max_size = PAGE_SIZE << pool->p.order;
>> +       struct page *page;
>> +
>> +       *size = ALIGN(*size, dma_get_cache_alignment());
>> +
>> +       if (WARN_ON(*size > max_size))
>> +               return NULL;
>> +
>> +       if ((*size << 1) > max_size || PAGE_POOL_DMA_USE_PP_FRAG_COUNT) {
>> +               *size = max_size;
>> +               *offset = 0;
>> +               return page_pool_alloc_pages(pool, gfp);
>> +       }
>> +
>> +       page = __page_pool_alloc_frag(pool, offset, *size, gfp);
>> +       if (unlikely(!page))
>> +               return NULL;
>> +
>> +       /* There is very likely not enough space for another frag, so append the
>> +        * remaining size to the current frag to avoid truesize underestimate
>> +        * problem.
>> +        */
>> +       if (pool->frag_offset + *size > max_size) {
>> +               *size = max_size - *offset;
>> +               pool->frag_offset = max_size;
>> +       }
>> +
> 
> Rather than preventing a truesize underestimation this will cause one.
> You are adding memory to the size of the page reserved and not
> accounting for it anywhere as this isn't reported up to the network
> stack. I would suggest dropping this from your patch.

I was thinking about the driver author reporting it up to the network
stack using the new API as something like below:

int truesize = size;
struct page *page;
int offset;

page = page_pool_dev_alloc(pool, &offset, &truesize);
if (unlikely(!page))
	goto err;

skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
		offset, size, truesize);

and similiar handling for *_build_skb() case too.

Does it make senses for that? or did I miss something obvious here?


>

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Alexander Duyck 2 years, 7 months ago

On Tue, Jun 13, 2023 at 8:51 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/6/13 22:36, Alexander Duyck wrote:
> > On Fri, Jun 9, 2023 at 6:20 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> ...
>
> >>
> >> +static inline struct page *page_pool_alloc(struct page_pool *pool,
> >> +                                          unsigned int *offset,
> >> +                                          unsigned int *size, gfp_t gfp)
> >> +{
> >> +       unsigned int max_size = PAGE_SIZE << pool->p.order;
> >> +       struct page *page;
> >> +
> >> +       *size = ALIGN(*size, dma_get_cache_alignment());
> >> +
> >> +       if (WARN_ON(*size > max_size))
> >> +               return NULL;
> >> +
> >> +       if ((*size << 1) > max_size || PAGE_POOL_DMA_USE_PP_FRAG_COUNT) {
> >> +               *size = max_size;
> >> +               *offset = 0;
> >> +               return page_pool_alloc_pages(pool, gfp);
> >> +       }
> >> +
> >> +       page = __page_pool_alloc_frag(pool, offset, *size, gfp);
> >> +       if (unlikely(!page))
> >> +               return NULL;
> >> +
> >> +       /* There is very likely not enough space for another frag, so append the
> >> +        * remaining size to the current frag to avoid truesize underestimate
> >> +        * problem.
> >> +        */
> >> +       if (pool->frag_offset + *size > max_size) {
> >> +               *size = max_size - *offset;
> >> +               pool->frag_offset = max_size;
> >> +       }
> >> +
> >
> > Rather than preventing a truesize underestimation this will cause one.
> > You are adding memory to the size of the page reserved and not
> > accounting for it anywhere as this isn't reported up to the network
> > stack. I would suggest dropping this from your patch.
>
> I was thinking about the driver author reporting it up to the network
> stack using the new API as something like below:
>
> int truesize = size;
> struct page *page;
> int offset;
>
> page = page_pool_dev_alloc(pool, &offset, &truesize);
> if (unlikely(!page))
>         goto err;
>
> skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
>                 offset, size, truesize);
>
> and similiar handling for *_build_skb() case too.
>
> Does it make senses for that? or did I miss something obvious here?

It is more the fact that you are creating a solution in search of a
problem. As I said before most of the drivers will deal with these
sorts of issues by just doing the fragmentation themselves or
allocating fixed size frags and knowing how it will be divided into
the page.

If you are going to go down this path then you should have a consumer
for the API and fully implement it instead of taking half measures and
making truesize underreporting worse by evicting pages earlier.

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Yunsheng Lin 2 years, 7 months ago

On 2023/6/14 22:18, Alexander Duyck wrote:
> On Tue, Jun 13, 2023 at 8:51 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2023/6/13 22:36, Alexander Duyck wrote:
>>> On Fri, Jun 9, 2023 at 6:20 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> ...
>>
>>>>
>>>> +static inline struct page *page_pool_alloc(struct page_pool *pool,
>>>> +                                          unsigned int *offset,
>>>> +                                          unsigned int *size, gfp_t gfp)
>>>> +{
>>>> +       unsigned int max_size = PAGE_SIZE << pool->p.order;
>>>> +       struct page *page;
>>>> +
>>>> +       *size = ALIGN(*size, dma_get_cache_alignment());
>>>> +
>>>> +       if (WARN_ON(*size > max_size))
>>>> +               return NULL;
>>>> +
>>>> +       if ((*size << 1) > max_size || PAGE_POOL_DMA_USE_PP_FRAG_COUNT) {
>>>> +               *size = max_size;
>>>> +               *offset = 0;
>>>> +               return page_pool_alloc_pages(pool, gfp);
>>>> +       }
>>>> +
>>>> +       page = __page_pool_alloc_frag(pool, offset, *size, gfp);
>>>> +       if (unlikely(!page))
>>>> +               return NULL;
>>>> +
>>>> +       /* There is very likely not enough space for another frag, so append the
>>>> +        * remaining size to the current frag to avoid truesize underestimate
>>>> +        * problem.
>>>> +        */
>>>> +       if (pool->frag_offset + *size > max_size) {
>>>> +               *size = max_size - *offset;
>>>> +               pool->frag_offset = max_size;
>>>> +       }
>>>> +
>>>
>>> Rather than preventing a truesize underestimation this will cause one.
>>> You are adding memory to the size of the page reserved and not
>>> accounting for it anywhere as this isn't reported up to the network
>>> stack. I would suggest dropping this from your patch.
>>
>> I was thinking about the driver author reporting it up to the network
>> stack using the new API as something like below:
>>
>> int truesize = size;
>> struct page *page;
>> int offset;
>>
>> page = page_pool_dev_alloc(pool, &offset, &truesize);
>> if (unlikely(!page))
>>         goto err;
>>
>> skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
>>                 offset, size, truesize);
>>
>> and similiar handling for *_build_skb() case too.
>>
>> Does it make senses for that? or did I miss something obvious here?
> 
> It is more the fact that you are creating a solution in search of a
> problem. As I said before most of the drivers will deal with these
> sorts of issues by just doing the fragmentation themselves or
> allocating fixed size frags and knowing how it will be divided into
> the page.

It seems that there are already some drivers which using the page pool
API with different frag size for almost every calling, the virtio_net
and veth are the obvious ones.

When reviewing the page frag support for virtio_net, I found that it
was manipulating the page_pool->frag_offset directly to do something
as this patch does, see:

https://lore.kernel.org/lkml/CAKhg4tL9PrUebqQHL+s7A6-xqNnju3erNQejMr7UFjwTaOduZw@mail.gmail.com/

I am not sure we are both agreed that drivers should not be manipulating
the page_pool->frag_offset directly unless it is really necessary?

For the specific case for virtio_net, it seems we have the below options:
1. both the driver and page pool do not handle it.
2. the driver handles it by manipulating the page_pool->frag_offset
   directly.
3. the page pool handles it as this patch does.

Is there any other options I missed for the specific case for virtio_net?
What is your perfer option? And why?

> 
> If you are going to go down this path then you should have a consumer
> for the API and fully implement it instead of taking half measures and
> making truesize underreporting worse by evicting pages earlier.

I am not sure I understand what do you mean by "a consumer for the API",
Do you mean adding a new API something like page_pool_free() to do
something ligthweight, such as decrementing the frag user and adjusting
the frag_offset, which is corresponding to the page_pool_alloc() API
introduced in this patch?
If yes, I was considering about that before, but I am not sure it worth
the effort, as for most usecase, it is a very rare case for error handling
as my understanding.

I just note that we already have page_pool_free() used by the page pool
destroy process，we might need to do something to avoid the confusion
between page_pool_alloc() and page_pool_free() :(

> 
> .
>

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Alexander Duyck 2 years, 7 months ago

On Wed, Jun 14, 2023 at 11:39 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/6/14 22:18, Alexander Duyck wrote:
> > On Tue, Jun 13, 2023 at 8:51 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2023/6/13 22:36, Alexander Duyck wrote:
> >>> On Fri, Jun 9, 2023 at 6:20 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> ...
> >>
> >>>>
> >>>> +static inline struct page *page_pool_alloc(struct page_pool *pool,
> >>>> +                                          unsigned int *offset,
> >>>> +                                          unsigned int *size, gfp_t gfp)
> >>>> +{
> >>>> +       unsigned int max_size = PAGE_SIZE << pool->p.order;
> >>>> +       struct page *page;
> >>>> +
> >>>> +       *size = ALIGN(*size, dma_get_cache_alignment());
> >>>> +
> >>>> +       if (WARN_ON(*size > max_size))
> >>>> +               return NULL;
> >>>> +
> >>>> +       if ((*size << 1) > max_size || PAGE_POOL_DMA_USE_PP_FRAG_COUNT) {
> >>>> +               *size = max_size;
> >>>> +               *offset = 0;
> >>>> +               return page_pool_alloc_pages(pool, gfp);
> >>>> +       }
> >>>> +
> >>>> +       page = __page_pool_alloc_frag(pool, offset, *size, gfp);
> >>>> +       if (unlikely(!page))
> >>>> +               return NULL;
> >>>> +
> >>>> +       /* There is very likely not enough space for another frag, so append the
> >>>> +        * remaining size to the current frag to avoid truesize underestimate
> >>>> +        * problem.
> >>>> +        */
> >>>> +       if (pool->frag_offset + *size > max_size) {
> >>>> +               *size = max_size - *offset;
> >>>> +               pool->frag_offset = max_size;
> >>>> +       }
> >>>> +
> >>>
> >>> Rather than preventing a truesize underestimation this will cause one.
> >>> You are adding memory to the size of the page reserved and not
> >>> accounting for it anywhere as this isn't reported up to the network
> >>> stack. I would suggest dropping this from your patch.
> >>
> >> I was thinking about the driver author reporting it up to the network
> >> stack using the new API as something like below:
> >>
> >> int truesize = size;
> >> struct page *page;
> >> int offset;
> >>
> >> page = page_pool_dev_alloc(pool, &offset, &truesize);
> >> if (unlikely(!page))
> >>         goto err;
> >>
> >> skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
> >>                 offset, size, truesize);
> >>
> >> and similiar handling for *_build_skb() case too.
> >>
> >> Does it make senses for that? or did I miss something obvious here?
> >
> > It is more the fact that you are creating a solution in search of a
> > problem. As I said before most of the drivers will deal with these
> > sorts of issues by just doing the fragmentation themselves or
> > allocating fixed size frags and knowing how it will be divided into
> > the page.
>
> It seems that there are already some drivers which using the page pool
> API with different frag size for almost every calling, the virtio_net
> and veth are the obvious ones.
>
> When reviewing the page frag support for virtio_net, I found that it
> was manipulating the page_pool->frag_offset directly to do something
> as this patch does, see:
>
> https://lore.kernel.org/lkml/CAKhg4tL9PrUebqQHL+s7A6-xqNnju3erNQejMr7UFjwTaOduZw@mail.gmail.com/
>
> I am not sure we are both agreed that drivers should not be manipulating
> the page_pool->frag_offset directly unless it is really necessary?

Agreed, they are doing something similar to this. The difference is
though that they have chosen to do that. With your change you are
forcing driver writers into a setup that will likely not work for
most.

> For the specific case for virtio_net, it seems we have the below options:
> 1. both the driver and page pool do not handle it.
> 2. the driver handles it by manipulating the page_pool->frag_offset
>    directly.

I view 2 as being the only acceptable approach. Otherwise we are
forcing drivers into a solution that may not fit and forcing yet
another fork of allocation setups. There is a reason vendors have
already taken the approach of manipulating frag_offset directly. In
many cases trying to pre-allocate things just isn't going to work.

> 3. the page pool handles it as this patch does.

The problem is the page pool isn't handling it. It is forcing the
allocations larger without reporting them as of this patch. It is
trying to forecast what the next request is going to be which is
problematic since it doesn't have enough info to necessarily be making
such assumptions.

> Is there any other options I missed for the specific case for virtio_net?
> What is your perfer option? And why?

My advice would be to leave it to the driver.

What concerns me is that you seem to be taking the page pool API in a
direction other than what it was really intended for. For any physical
device you aren't going to necessarily know what size fragment you are
working with until you have already allocated the page and DMA has
been performed. That is why drivers such as the Mellanox driver are
fragmenting in the driver instead of allocated pre-fragmented pages.

> >
> > If you are going to go down this path then you should have a consumer
> > for the API and fully implement it instead of taking half measures and
> > making truesize underreporting worse by evicting pages earlier.
>
> I am not sure I understand what do you mean by "a consumer for the API",
> Do you mean adding a new API something like page_pool_free() to do
> something ligthweight, such as decrementing the frag user and adjusting
> the frag_offset, which is corresponding to the page_pool_alloc() API
> introduced in this patch?

What I was getting at is that if you are going to add an API you have
to have a consumer for the API. That is rule #1 for kernel API
development. You don't add API without a consumer for it. The changes
you are making are to support some future implementation, and I see it
breaking most of the existing implementation. That is my concern.

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Yunsheng Lin 2 years, 7 months ago

On 2023/6/15 22:45, Alexander Duyck wrote:
>>
>> It seems that there are already some drivers which using the page pool
>> API with different frag size for almost every calling, the virtio_net
>> and veth are the obvious ones.
>>
>> When reviewing the page frag support for virtio_net, I found that it
>> was manipulating the page_pool->frag_offset directly to do something
>> as this patch does, see:
>>
>> https://lore.kernel.org/lkml/CAKhg4tL9PrUebqQHL+s7A6-xqNnju3erNQejMr7UFjwTaOduZw@mail.gmail.com/
>>
>> I am not sure we are both agreed that drivers should not be manipulating
>> the page_pool->frag_offset directly unless it is really necessary?
> 
> Agreed, they are doing something similar to this. The difference is
> though that they have chosen to do that. With your change you are
> forcing driver writers into a setup that will likely not work for
> most.
> 
>> For the specific case for virtio_net, it seems we have the below options:
>> 1. both the driver and page pool do not handle it.
>> 2. the driver handles it by manipulating the page_pool->frag_offset
>>    directly.
> 
> I view 2 as being the only acceptable approach. Otherwise we are
> forcing drivers into a solution that may not fit and forcing yet
> another fork of allocation setups. There is a reason vendors have

I respectly disagree with driver manipulating the page_pool->frag_offset
directly.

It is a implemenation detail which should be hiden from the driver:
For page_pool_alloc_frag() API, page_pool->frag_offset is not even
useful for arch with PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true,
similar cases for page_pool_alloc() returning mono-frag if I understand
'mono-frag ' correctly.

IMHO, if the driver try to do the their own page spilting, it should
use it's own offset, not messing with the offset the page pool is using.
Yes, that may mean driver doing it's own page splitting and page pool
doing it's own page splitting for the same page if we really like to
make the best out of a page.

That way I see the page splitting in page pool as a way to share a
page between different desc, and page splitting in driver as way to
reclaim some memory for small packet, something like ena driver is
doing:
https://lore.kernel.org/netdev/20230612121448.28829-1-darinzon@amazon.com/T/

And hns3 driver has already done something similar for old per-desc
page flipping with 64K page size:

https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c#L3737

As we have done the page splitting to share a page between different desc
in the page pool, I really double that the benefit will justify the
complexity of the page splitting in the driver.

> already taken the approach of manipulating frag_offset directly. In
> many cases trying to pre-allocate things just isn't going to work.

As above, I think the driver trying to do it's own splitting should use
it's own offset instead of page pool's frag_offset.

If the mlx5 way of doing page splitting in the driver is proved to be
useful, we should really provide some API to allow that to work in
arch with PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true and make the page
splitting in the driver play along with the page splitting in the page
pool.

I am not sure if there is any other 'trying to pre-allocate things just
isn't going to work' case that I missed, it will be very appreciatived
if you can provide the complete cases here, so that we can discuss it
throughly.

> 
>> 3. the page pool handles it as this patch does.
> 
> The problem is the page pool isn't handling it. It is forcing the
> allocations larger without reporting them as of this patch. It is
> trying to forecast what the next request is going to be which is
> problematic since it doesn't have enough info to necessarily be making
> such assumptions.

We are talking about rx for networking, right? I think the driver
does not have that kind of enough info too, Or I am missing something
here?

> 
>> Is there any other options I missed for the specific case for virtio_net?
>> What is your perfer option? And why?
> 
> My advice would be to leave it to the driver.
> 
> What concerns me is that you seem to be taking the page pool API in a
> direction other than what it was really intended for. For any physical
> device you aren't going to necessarily know what size fragment you are
> working with until you have already allocated the page and DMA has
> been performed. That is why drivers such as the Mellanox driver are
> fragmenting in the driver instead of allocated pre-fragmented pages.

Why do you think using the page pool API to do the fragmenting in the
driver is the direction that page pool was intended for?

I thought page pool API was not intended for any fragmenting in the
first place by the disscusion in the maillist, I think we should be
more open about what direction the page pool API is heading to
considering the emerging use case:)

> 
>>>
>>> If you are going to go down this path then you should have a consumer
>>> for the API and fully implement it instead of taking half measures and
>>> making truesize underreporting worse by evicting pages earlier.
>>
>> I am not sure I understand what do you mean by "a consumer for the API",
>> Do you mean adding a new API something like page_pool_free() to do
>> something ligthweight, such as decrementing the frag user and adjusting
>> the frag_offset, which is corresponding to the page_pool_alloc() API
>> introduced in this patch?
> 
> What I was getting at is that if you are going to add an API you have
> to have a consumer for the API. That is rule #1 for kernel API
> development. You don't add API without a consumer for it. The changes
> you are making are to support some future implementation, and I see it
> breaking most of the existing implementation. That is my concern.

The patch is extending a new api, the behavior of current api is preserved
as much as possible, so I am not sure which implementation is broken by
this patch? How and why?

As for the '#1 for kernel API development', I think I had mention the
usecase it is intended for in the coverletter, and if I recall correctly,
the page_pool_fragment_page() API you added also do not come with a
actual consumer, I was overloaded at that time, so just toke a glance
and wonder why there was no user with a API added.

Anyway, as jesper was offering to help out, will add veth as a consumer
for the new api:)

> .
>

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Alexander Duyck 2 years, 7 months ago

On Fri, Jun 16, 2023 at 4:47 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/6/15 22:45, Alexander Duyck wrote:
> >>
> >> It seems that there are already some drivers which using the page pool
> >> API with different frag size for almost every calling, the virtio_net
> >> and veth are the obvious ones.
> >>
> >> When reviewing the page frag support for virtio_net, I found that it
> >> was manipulating the page_pool->frag_offset directly to do something
> >> as this patch does, see:
> >>
> >> https://lore.kernel.org/lkml/CAKhg4tL9PrUebqQHL+s7A6-xqNnju3erNQejMr7UFjwTaOduZw@mail.gmail.com/
> >>
> >> I am not sure we are both agreed that drivers should not be manipulating
> >> the page_pool->frag_offset directly unless it is really necessary?
> >
> > Agreed, they are doing something similar to this. The difference is
> > though that they have chosen to do that. With your change you are
> > forcing driver writers into a setup that will likely not work for
> > most.
> >
> >> For the specific case for virtio_net, it seems we have the below options:
> >> 1. both the driver and page pool do not handle it.
> >> 2. the driver handles it by manipulating the page_pool->frag_offset
> >>    directly.
> >
> > I view 2 as being the only acceptable approach. Otherwise we are
> > forcing drivers into a solution that may not fit and forcing yet
> > another fork of allocation setups. There is a reason vendors have
>
> I respectly disagree with driver manipulating the page_pool->frag_offset
> directly.
>
> It is a implemenation detail which should be hiden from the driver:
> For page_pool_alloc_frag() API, page_pool->frag_offset is not even
> useful for arch with PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true,
> similar cases for page_pool_alloc() returning mono-frag if I understand
> 'mono-frag ' correctly.
>
> IMHO, if the driver try to do the their own page spilting, it should
> use it's own offset, not messing with the offset the page pool is using.
> Yes, that may mean driver doing it's own page splitting and page pool
> doing it's own page splitting for the same page if we really like to
> make the best out of a page.

Actually, now that I reread this I agree with you. It shouldn't be
manipulating the frag_offset. The frag offset isn't really a thing
that we have to worry about if we are being given the entire page to
fragment as we want. Basically the driver needs the ability to access
any offset within the page that it will need to. The frag_offset is an
implementation of the page_pool and is not an aspect of the fragment
or page that is given out. That is one of the reasons why the page
fragments are nothing more than a virtual address that is known to be
a given size. With that what we can do is subdivide the page further
in the drivers.

What I was thinking of was the frag count. That is something the
driver should have the ability to manipulate, be it adding or removing
frags as it takes the section of memory it was given and it decides to
break it up further before handing it out in skb frames.

> That way I see the page splitting in page pool as a way to share a
> page between different desc, and page splitting in driver as way to
> reclaim some memory for small packet, something like ena driver is
> doing:
> https://lore.kernel.org/netdev/20230612121448.28829-1-darinzon@amazon.com/T/
>
> And hns3 driver has already done something similar for old per-desc
> page flipping with 64K page size:
>
> https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c#L3737

Yeah, I am well aware of the approach. I was the one that originally
implemented that in igb/ixgbe over a decade ago.

> As we have done the page splitting to share a page between different desc
> in the page pool, I really double that the benefit will justify the
> complexity of the page splitting in the driver.

The point I am getting at is that there are drivers already using this
code. There is a tradeoff for adding complexity to update things to
make it fit another use case. What I question is if it is worth it for
the other drivers to take on any extra overhead you are adding for a
use case that doesn't really seem to fix the existing one.

> > already taken the approach of manipulating frag_offset directly. In
> > many cases trying to pre-allocate things just isn't going to work.
>
> As above, I think the driver trying to do it's own splitting should use
> it's own offset instead of page pool's frag_offset.

Yes. The frag_offset should be of no value to the driver itself. The
driver should be getting a virtual address that it can then fragment
further by adding or subtracting from what is the frag count and by
adding its own internal offset.

> If the mlx5 way of doing page splitting in the driver is proved to be
> useful, we should really provide some API to allow that to work in
> arch with PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true and make the page
> splitting in the driver play along with the page splitting in the page
> pool.
>
> I am not sure if there is any other 'trying to pre-allocate things just
> isn't going to work' case that I missed, it will be very appreciatived
> if you can provide the complete cases here, so that we can discuss it
> throughly.

The idea is to keep it simple. Basically just like with a classic page
we can add to or remove from the reference count. That is how most of
the drivers did all this before the page pool was available.

> >
> >> 3. the page pool handles it as this patch does.
> >
> > The problem is the page pool isn't handling it. It is forcing the
> > allocations larger without reporting them as of this patch. It is
> > trying to forecast what the next request is going to be which is
> > problematic since it doesn't have enough info to necessarily be making
> > such assumptions.
>
> We are talking about rx for networking, right? I think the driver
> does not have that kind of enough info too, Or I am missing something
> here?

Yes, we are talking about Rx networking. Most drivers will map a page
without knowing the size of the frame they are going to receive. As
such they can end up breaking up the page into multiple fragments with
the offsets being provided by the device descriptors.

> >
> >> Is there any other options I missed for the specific case for virtio_net?
> >> What is your perfer option? And why?
> >
> > My advice would be to leave it to the driver.
> >
> > What concerns me is that you seem to be taking the page pool API in a
> > direction other than what it was really intended for. For any physical
> > device you aren't going to necessarily know what size fragment you are
> > working with until you have already allocated the page and DMA has
> > been performed. That is why drivers such as the Mellanox driver are
> > fragmenting in the driver instead of allocated pre-fragmented pages.
>
> Why do you think using the page pool API to do the fragmenting in the
> driver is the direction that page pool was intended for?
>
> I thought page pool API was not intended for any fragmenting in the
> first place by the discussion in the maillist, I think we should be
> more open about what direction the page pool API is heading to
> considering the emerging use case:)

The problem is virtual devices are very different from physical
devices. One of the big things we had specifically designed the page
pool for was to avoid the overhead of DMA mapping and unmapping
involved in allocating Rx buffers for network devices. Much of it was
based on the principals we had in drivers such as ixgbe that were
pinning the Rx pages using reference counting hacks in order to avoid
having to perform the unmap.

> >
> >>>
> >>> If you are going to go down this path then you should have a consumer
> >>> for the API and fully implement it instead of taking half measures and
> >>> making truesize underreporting worse by evicting pages earlier.
> >>
> >> I am not sure I understand what do you mean by "a consumer for the API",
> >> Do you mean adding a new API something like page_pool_free() to do
> >> something ligthweight, such as decrementing the frag user and adjusting
> >> the frag_offset, which is corresponding to the page_pool_alloc() API
> >> introduced in this patch?
> >
> > What I was getting at is that if you are going to add an API you have
> > to have a consumer for the API. That is rule #1 for kernel API
> > development. You don't add API without a consumer for it. The changes
> > you are making are to support some future implementation, and I see it
> > breaking most of the existing implementation. That is my concern.
>
> The patch is extending a new api, the behavior of current api is preserved
> as much as possible, so I am not sure which implementation is broken by
> this patch? How and why?
>
> As for the '#1 for kernel API development', I think I had mention the
> usecase it is intended for in the coverletter, and if I recall correctly,
> the page_pool_fragment_page() API you added also do not come with a
> actual consumer, I was overloaded at that time, so just toke a glance
> and wonder why there was no user with a API added.

As I recall it was partly due to the fact that I had offered to step
in and take over that part of the implementation you were working on
as we had been going back and forth for a while without making much
progress on the patchset.

So there was supposed to be a consumer, and it looks like the Mellanox
guys were able to jump on it and make good use of it. So I would say
the patch has done what was expected. If that hadn't been the case
then I would have been in favor of just reverting it since there would
have been no consumers.

> Anyway, as jesper was offering to help out, will add veth as a consumer
> for the new api:)

Sounds good.

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Yunsheng Lin 2 years, 7 months ago

On 2023/6/16 22:46, Alexander Duyck wrote:
...

>>>> 2. the driver handles it by manipulating the page_pool->frag_offset
>>>>    directly.
>>>
>>> I view 2 as being the only acceptable approach. Otherwise we are
>>> forcing drivers into a solution that may not fit and forcing yet
>>> another fork of allocation setups. There is a reason vendors have
>>
>> I respectly disagree with driver manipulating the page_pool->frag_offset
>> directly.
>>
>> It is a implemenation detail which should be hiden from the driver:
>> For page_pool_alloc_frag() API, page_pool->frag_offset is not even
>> useful for arch with PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true,
>> similar cases for page_pool_alloc() returning mono-frag if I understand
>> 'mono-frag ' correctly.
>>
>> IMHO, if the driver try to do the their own page spilting, it should
>> use it's own offset, not messing with the offset the page pool is using.
>> Yes, that may mean driver doing it's own page splitting and page pool
>> doing it's own page splitting for the same page if we really like to
>> make the best out of a page.
> 
> Actually, now that I reread this I agree with you. It shouldn't be
> manipulating the frag_offset. The frag offset isn't really a thing
> that we have to worry about if we are being given the entire page to
> fragment as we want. Basically the driver needs the ability to access
> any offset within the page that it will need to. The frag_offset is an
> implementation of the page_pool and is not an aspect of the fragment
> or page that is given out. That is one of the reasons why the page
> fragments are nothing more than a virtual address that is known to be
> a given size. With that what we can do is subdivide the page further
> in the drivers.

I am really doubtful that the need of 'subdividing the page further
in the drivers' if we have the page splitting in page pool to allow
multi-descs to share the same page for most of the nic driver. IOW,
we should do only one level of page splitting IMHO.

If I understand it correctly, most hw have a per-queue fixed buffer
size, even the mlx5 one with per-desc buffer size support through
mlx5_wqe_data_seg, the driver seems to use the 'per-queue fixed
buffer size' model, I assume that using per-desc buffer size is just
not worth the effort?

Let's say we use the per-queue fixed buffer with 2K buffer size, I
am supposing that is most drivers is using by default, so the question
is how much memory is needed for each desc to allowing subdividing
within desc? I am supposing we mostly need a 4K buffer for each desc,
right? 

For system with 4K page size, that means we are only able to do one
level of page splitting, either in the page pool or in the driver.
What is your perfer option? and Why?

For system with larger page size, that means we are able to do
multi level of page splitting, and I suppose page splitting in
the page pool is always needed, the question is whether we allow
subdividing in the driver, right?

I think we need to consider the truesize underestimate problem
for small packet if we want to do page splitting in the driver
to save memory, even for system with 4K page size, not to
mention the 64K page size.

As you said in the previous thread:
"
> IMHO, doing the above only alleviate the problem. How is above splitting
> different from splitting it evently with 16 2K fragments, and when 15 frag
> is released, we still have the last 2K fragment holding onto 32K memory,
> doesn't that also cause massive truesize underestimate? Not to mention that
> for system with 64K page size.

Yes, that is a known issue. That is why I am not wanting us to further
exacerbate the issue.
"
Also If there is any known solution to this 'known issue'? it
seems unfixable to me, so I did not continue that discussion in
that thread.

And for packet with normal size, it seems the hns3 case seems
like a proof that doing the page splitting in the page pool
seems like a better choice, even for system with 4K page size,
mainly because doing the page splitting in page pool is per-queue,
and doing the page splitting in driver is per-desc, which is a
smaller 'pool' comparing to page pool.

> 
> What I was thinking of was the frag count. That is something the
> driver should have the ability to manipulate, be it adding or removing
> frags as it takes the section of memory it was given and it decides to
> break it up further before handing it out in skb frames.

As my understanding, there is no essential difference between frag
count and frag offet if we want to do 'subdividing', just like we
have frag_count for page pool and _refcount for page allocator, we
may need a third one for this 'subdividing'.

> 
>> That way I see the page splitting in page pool as a way to share a
>> page between different desc, and page splitting in driver as way to
>> reclaim some memory for small packet, something like ena driver is
>> doing:
>> https://lore.kernel.org/netdev/20230612121448.28829-1-darinzon@amazon.com/T/
>>
>> And hns3 driver has already done something similar for old per-desc
>> page flipping with 64K page size:
>>
>> https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c#L3737
> 
> Yeah, I am well aware of the approach. I was the one that originally
> implemented that in igb/ixgbe over a decade ago.
> 
>> As we have done the page splitting to share a page between different desc
>> in the page pool, I really double that the benefit will justify the
>> complexity of the page splitting in the driver.
> 
> The point I am getting at is that there are drivers already using this
> code. There is a tradeoff for adding complexity to update things to
> make it fit another use case. What I question is if it is worth it for
> the other drivers to take on any extra overhead you are adding for a
> use case that doesn't really seem to fix the existing one.

I am not sure I understand your point here.
page_pool_alloc() is more of way to simplify the interface for
driver, so yes, it is impossible to take care of all use cases,
and there is some extra overhead for the simplicity, but it does
not means that we can not try to start generalizing thing, right?

If page_pool_alloc() does not fit a specific driver/hw' need, there
are still page_pool_alloc_pages() and page_pool_alloc_frag()
available.

> 

...

>>
>> I am not sure if there is any other 'trying to pre-allocate things just
>> isn't going to work' case that I missed, it will be very appreciatived
>> if you can provide the complete cases here, so that we can discuss it
>> throughly.
> 
> The idea is to keep it simple. Basically just like with a classic page
> we can add to or remove from the reference count. That is how most of
> the drivers did all this before the page pool was available.

I understand that is how most of the drivers did all this before the
page pool was available, but I am really not convinced that we need
that when page pool was available yet as the reasons mentioned at the
begining.

> 
>>>
>>>> 3. the page pool handles it as this patch does.
>>>
>>> The problem is the page pool isn't handling it. It is forcing the
>>> allocations larger without reporting them as of this patch. It is
>>> trying to forecast what the next request is going to be which is
>>> problematic since it doesn't have enough info to necessarily be making
>>> such assumptions.
>>
>> We are talking about rx for networking, right? I think the driver
>> does not have that kind of enough info too, Or I am missing something
>> here?
> 
> Yes, we are talking about Rx networking. Most drivers will map a page
> without knowing the size of the frame they are going to receive. As
> such they can end up breaking up the page into multiple fragments with
> the offsets being provided by the device descriptors.

For most hw without support for multi-packet in one desc, the driver
still need to allocate a per-queue size buffer for the next packet,
so I am really not sure the benefit will justify the complexity and
the truesize underestimate exacerbating yet.

> 
>>>
>>>> Is there any other options I missed for the specific case for virtio_net?
>>>> What is your perfer option? And why?
>>>
>>> My advice would be to leave it to the driver.
>>>
>>> What concerns me is that you seem to be taking the page pool API in a
>>> direction other than what it was really intended for. For any physical
>>> device you aren't going to necessarily know what size fragment you are
>>> working with until you have already allocated the page and DMA has
>>> been performed. That is why drivers such as the Mellanox driver are
>>> fragmenting in the driver instead of allocated pre-fragmented pages.
>>
>> Why do you think using the page pool API to do the fragmenting in the
>> driver is the direction that page pool was intended for?
>>
>> I thought page pool API was not intended for any fragmenting in the
>> first place by the discussion in the maillist, I think we should be
>> more open about what direction the page pool API is heading to
>> considering the emerging use case:)
> 
> The problem is virtual devices are very different from physical
> devices. One of the big things we had specifically designed the page
> pool for was to avoid the overhead of DMA mapping and unmapping
> involved in allocating Rx buffers for network devices. Much of it was
> based on the principals we had in drivers such as ixgbe that were
> pinning the Rx pages using reference counting hacks in order to avoid
> having to perform the unmap.

I think I am agreed on this one if I understand it correctly, the basic
idea is to aovid adding another layer of caching as page allocator has
the per-cpu page allocator, right?

If the veth can find a better sultion as discussed on other thread,
then I may need to find another consumer for the new API:) 

> 
>>>
>>>>>
>>>>> If you are going to go down this path then you should have a consumer
>>>>> for the API and fully implement it instead of taking half measures and
>>>>> making truesize underreporting worse by evicting pages earlier.
>>>>
>>>> I am not sure I understand what do you mean by "a consumer for the API",
>>>> Do you mean adding a new API something like page_pool_free() to do
>>>> something ligthweight, such as decrementing the frag user and adjusting
>>>> the frag_offset, which is corresponding to the page_pool_alloc() API
>>>> introduced in this patch?
>>>
>>> What I was getting at is that if you are going to add an API you have
>>> to have a consumer for the API. That is rule #1 for kernel API
>>> development. You don't add API without a consumer for it. The changes
>>> you are making are to support some future implementation, and I see it
>>> breaking most of the existing implementation. That is my concern.
>>
>> The patch is extending a new api, the behavior of current api is preserved
>> as much as possible, so I am not sure which implementation is broken by
>> this patch? How and why?
>>
>> As for the '#1 for kernel API development', I think I had mention the
>> usecase it is intended for in the coverletter, and if I recall correctly,
>> the page_pool_fragment_page() API you added also do not come with a
>> actual consumer, I was overloaded at that time, so just toke a glance
>> and wonder why there was no user with a API added.
> 
> As I recall it was partly due to the fact that I had offered to step
> in and take over that part of the implementation you were working on
> as we had been going back and forth for a while without making much
> progress on the patchset.

Let's improve on that so that there is less of going back and forth
for a while without making much progress this time:)

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Alexander Duyck 2 years, 7 months ago

On Sat, Jun 17, 2023 at 4:41 AM Yunsheng Lin <yunshenglin0825@gmail.com> wrote:
>
> On 2023/6/16 22:46, Alexander Duyck wrote:
> ...
>
> >>>> 2. the driver handles it by manipulating the page_pool->frag_offset
> >>>>    directly.
> >>>
> >>> I view 2 as being the only acceptable approach. Otherwise we are
> >>> forcing drivers into a solution that may not fit and forcing yet
> >>> another fork of allocation setups. There is a reason vendors have
> >>
> >> I respectly disagree with driver manipulating the page_pool->frag_offset
> >> directly.
> >>
> >> It is a implemenation detail which should be hiden from the driver:
> >> For page_pool_alloc_frag() API, page_pool->frag_offset is not even
> >> useful for arch with PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true,
> >> similar cases for page_pool_alloc() returning mono-frag if I understand
> >> 'mono-frag ' correctly.
> >>
> >> IMHO, if the driver try to do the their own page spilting, it should
> >> use it's own offset, not messing with the offset the page pool is using.
> >> Yes, that may mean driver doing it's own page splitting and page pool
> >> doing it's own page splitting for the same page if we really like to
> >> make the best out of a page.
> >
> > Actually, now that I reread this I agree with you. It shouldn't be
> > manipulating the frag_offset. The frag offset isn't really a thing
> > that we have to worry about if we are being given the entire page to
> > fragment as we want. Basically the driver needs the ability to access
> > any offset within the page that it will need to. The frag_offset is an
> > implementation of the page_pool and is not an aspect of the fragment
> > or page that is given out. That is one of the reasons why the page
> > fragments are nothing more than a virtual address that is known to be
> > a given size. With that what we can do is subdivide the page further
> > in the drivers.
>
> I am really doubtful that the need of 'subdividing the page further
> in the drivers' if we have the page splitting in page pool to allow
> multi-descs to share the same page for most of the nic driver. IOW,
> we should do only one level of page splitting IMHO.

The problem is there are architectures that use things like 64K pages.
On those systems drivers are usually taking the page and having to
split it into 4K sub-sections of the page since the device cannot
fully utilize the 64K page. The thought here is that if we are already
treating 4K pages as a single fragged page, and performing
fragmentation on it then if we were to take the 64K page and break it
into 16 4K fragments, those fragments could further be broken up and
used by the device instead of wasting 60K due to the large page size.

> If I understand it correctly, most hw have a per-queue fixed buffer
> size, even the mlx5 one with per-desc buffer size support through
> mlx5_wqe_data_seg, the driver seems to use the 'per-queue fixed
> buffer size' model, I assume that using per-desc buffer size is just
> not worth the effort?

The problem is the device really has two buffer sizes it is dealing
with. The wqe size, and the cqe size. What goes in as a 4K page can
come up as multiple frames depending on the packet sizes being
received.

> Let's say we use the per-queue fixed buffer with 2K buffer size, I
> am supposing that is most drivers is using by default, so the question
> is how much memory is needed for each desc to allowing subdividing
> within desc? I am supposing we mostly need a 4K buffer for each desc,
> right?

The problem is your solution here is naive. Most drivers are getting
much more complicated than just using a fixed buffer size these days.
Again, the point I am getting at is that they will be taking in 4K
pages most likely and then having to subdivide it after the DMA has
been performed and they are sorting out completion queue entries to
figure out how the hardware used the page and what was written.

> For system with 4K page size, that means we are only able to do one
> level of page splitting, either in the page pool or in the driver.
> What is your perfer option? and Why?

We can do both. All we have to do is add references from the driver.
It is no different than how in the old drivers pre-page pool we were
taking the reference count on the page in the driver and splitting it
ourselves for reuse.

> For system with larger page size, that means we are able to do
> multi level of page splitting, and I suppose page splitting in
> the page pool is always needed, the question is whether we allow
> subdividing in the driver, right?

Right, and there isn't really anything we can do to prevent that nor
should we. Otherwise we risk making the page pool obsolete and it will
be replaced by something that can support it since that is the
hardware model that vendor can support with their device.

> I think we need to consider the truesize underestimate problem
> for small packet if we want to do page splitting in the driver
> to save memory, even for system with 4K page size, not to
> mention the 64K page size.
>
> As you said in the previous thread:
> "
> > IMHO, doing the above only alleviate the problem. How is above splitting
> > different from splitting it evently with 16 2K fragments, and when 15 frag
> > is released, we still have the last 2K fragment holding onto 32K memory,
> > doesn't that also cause massive truesize underestimate? Not to mention that
> > for system with 64K page size.
>
> Yes, that is a known issue. That is why I am not wanting us to further
> exacerbate the issue.
> "
> Also If there is any known solution to this 'known issue'? it
> seems unfixable to me, so I did not continue that discussion in
> that thread.
>
> And for packet with normal size, it seems the hns3 case seems
> like a proof that doing the page splitting in the page pool
> seems like a better choice, even for system with 4K page size,
> mainly because doing the page splitting in page pool is per-queue,
> and doing the page splitting in driver is per-desc, which is a
> smaller 'pool' comparing to page pool.

That seems like a very narrow perspective. The fact is the Mellanox
use case is more likely the direction things are going in. It is more
heavily used and is more widely supported. That is one of the reasons
why trying to modify it to fit your use case is going to be much more
tricky, and why I believe you should think about that more with the
hns3 being a secondary use case.

> >
> > What I was thinking of was the frag count. That is something the
> > driver should have the ability to manipulate, be it adding or removing
> > frags as it takes the section of memory it was given and it decides to
> > break it up further before handing it out in skb frames.
>
> As my understanding, there is no essential difference between frag
> count and frag offet if we want to do 'subdividing', just like we
> have frag_count for page pool and _refcount for page allocator, we
> may need a third one for this 'subdividing'.

There is a huge difference, and may be part of the reason why you and
I have such a different understanding of this.

The offset is just local to your fragmentation, whereas the count is
the global value for the page at which it can finally be freed back to
the pool. You could have multiple threads all working with different
offsets as long as they are all bounded within separate regions of the
page, however they must all agree on the frag count they are working
with since that is a property specific to the page. This is why
frag_count must be atomic whereas we keep frag_offset as a local
variable.

No additional counts needed. We never added another _refcount when we
were doing splitting in the drivers, and we wouldn't need to in order
to do splitting with page_pool pages. We would just have to start with
a frag count of 1.

> >
> >> That way I see the page splitting in page pool as a way to share a
> >> page between different desc, and page splitting in driver as way to
> >> reclaim some memory for small packet, something like ena driver is
> >> doing:
> >> https://lore.kernel.org/netdev/20230612121448.28829-1-darinzon@amazon.com/T/
> >>
> >> And hns3 driver has already done something similar for old per-desc
> >> page flipping with 64K page size:
> >>
> >> https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c#L3737
> >
> > Yeah, I am well aware of the approach. I was the one that originally
> > implemented that in igb/ixgbe over a decade ago.
> >
> >> As we have done the page splitting to share a page between different desc
> >> in the page pool, I really double that the benefit will justify the
> >> complexity of the page splitting in the driver.
> >
> > The point I am getting at is that there are drivers already using this
> > code. There is a tradeoff for adding complexity to update things to
> > make it fit another use case. What I question is if it is worth it for
> > the other drivers to take on any extra overhead you are adding for a
> > use case that doesn't really seem to fix the existing one.
>
> I am not sure I understand your point here.
> page_pool_alloc() is more of way to simplify the interface for
> driver, so yes, it is impossible to take care of all use cases,
> and there is some extra overhead for the simplicity, but it does
> not means that we can not try to start generalizing thing, right?
>
> If page_pool_alloc() does not fit a specific driver/hw' need, there
> are still page_pool_alloc_pages() and page_pool_alloc_frag()
> available.

No offence, it just comes across that you are wanting to fork the
fragmentation API to fit one specific use case and want to strip it
from all others.

When I had implemented the API I had intended for it to be used the
way the Mellanox driver isusing it while also supporting other use
cases as well. However I hadn't planned for it to be used in non-DMA
use cases for instance since the whole advantage of the API is that it
retains the DMA mapping.

> >>
> >> I am not sure if there is any other 'trying to pre-allocate things just
> >> isn't going to work' case that I missed, it will be very appreciatived
> >> if you can provide the complete cases here, so that we can discuss it
> >> throughly.
> >
> > The idea is to keep it simple. Basically just like with a classic page
> > we can add to or remove from the reference count. That is how most of
> > the drivers did all this before the page pool was available.
>
> I understand that is how most of the drivers did all this before the
> page pool was available, but I am really not convinced that we need
> that when page pool was available yet as the reasons mentioned at the
> begining.

Rather than debating this, perhaps it would be better to present the
code for how you plan to make use of all this. I know there was a bit
of veth code, but it ended up being a counterexample more than
anything else since it was using the page poool API in a spot where it
should have been using the page fragment cache API.

If we can see code that demonstrates the benefits of this change
without making things worse for other drivers then we can talk. My
concern is that last time I had to step in as it seemed like we just
kept going down paths that made a mess of things for everyone else to
barely enable one specific use case. I would rather not have to repeat
that.

> >
> >>>
> >>>> 3. the page pool handles it as this patch does.
> >>>
> >>> The problem is the page pool isn't handling it. It is forcing the
> >>> allocations larger without reporting them as of this patch. It is
> >>> trying to forecast what the next request is going to be which is
> >>> problematic since it doesn't have enough info to necessarily be making
> >>> such assumptions.
> >>
> >> We are talking about rx for networking, right? I think the driver
> >> does not have that kind of enough info too, Or I am missing something
> >> here?
> >
> > Yes, we are talking about Rx networking. Most drivers will map a page
> > without knowing the size of the frame they are going to receive. As
> > such they can end up breaking up the page into multiple fragments with
> > the offsets being provided by the device descriptors.
>
> For most hw without support for multi-packet in one desc, the driver
> still need to allocate a per-queue size buffer for the next packet,
> so I am really not sure the benefit will justify the complexity and
> the truesize underestimate exacerbating yet.

What I was getting at here are the drivers that are just blindly using
page pool pages without first verifying what the page size is. There
are some that blindly assume a page is always 4K because that is
something like 99% of their market. You then put the driver one
something like a PowerPC and the system has OOM issues since it is
sitting on a pile of 64K pages. The idea is that we make page pool
always return something like a 4K page unless a higher order is
requested and we just do page subdividing in page pool rather than
having to implement it in every driver.

> >
> >>>
> >>>> Is there any other options I missed for the specific case for virtio_net?
> >>>> What is your perfer option? And why?
> >>>
> >>> My advice would be to leave it to the driver.
> >>>
> >>> What concerns me is that you seem to be taking the page pool API in a
> >>> direction other than what it was really intended for. For any physical
> >>> device you aren't going to necessarily know what size fragment you are
> >>> working with until you have already allocated the page and DMA has
> >>> been performed. That is why drivers such as the Mellanox driver are
> >>> fragmenting in the driver instead of allocated pre-fragmented pages.
> >>
> >> Why do you think using the page pool API to do the fragmenting in the
> >> driver is the direction that page pool was intended for?
> >>
> >> I thought page pool API was not intended for any fragmenting in the
> >> first place by the discussion in the maillist, I think we should be
> >> more open about what direction the page pool API is heading to
> >> considering the emerging use case:)
> >
> > The problem is virtual devices are very different from physical
> > devices. One of the big things we had specifically designed the page
> > pool for was to avoid the overhead of DMA mapping and unmapping
> > involved in allocating Rx buffers for network devices. Much of it was
> > based on the principals we had in drivers such as ixgbe that were
> > pinning the Rx pages using reference counting hacks in order to avoid
> > having to perform the unmap.
>
> I think I am agreed on this one if I understand it correctly, the basic
> idea is to aovid adding another layer of caching as page allocator has
> the per-cpu page allocator, right?
>
> If the veth can find a better sultion as discussed on other thread,
> then I may need to find another consumer for the new API:)
>
> >
> >>>
> >>>>>
> >>>>> If you are going to go down this path then you should have a consumer
> >>>>> for the API and fully implement it instead of taking half measures and
> >>>>> making truesize underreporting worse by evicting pages earlier.
> >>>>
> >>>> I am not sure I understand what do you mean by "a consumer for the API",
> >>>> Do you mean adding a new API something like page_pool_free() to do
> >>>> something ligthweight, such as decrementing the frag user and adjusting
> >>>> the frag_offset, which is corresponding to the page_pool_alloc() API
> >>>> introduced in this patch?
> >>>
> >>> What I was getting at is that if you are going to add an API you have
> >>> to have a consumer for the API. That is rule #1 for kernel API
> >>> development. You don't add API without a consumer for it. The changes
> >>> you are making are to support some future implementation, and I see it
> >>> breaking most of the existing implementation. That is my concern.
> >>
> >> The patch is extending a new api, the behavior of current api is preserved
> >> as much as possible, so I am not sure which implementation is broken by
> >> this patch? How and why?
> >>
> >> As for the '#1 for kernel API development', I think I had mention the
> >> usecase it is intended for in the coverletter, and if I recall correctly,
> >> the page_pool_fragment_page() API you added also do not come with a
> >> actual consumer, I was overloaded at that time, so just toke a glance
> >> and wonder why there was no user with a API added.
> >
> > As I recall it was partly due to the fact that I had offered to step
> > in and take over that part of the implementation you were working on
> > as we had been going back and forth for a while without making much
> > progress on the patchset.
>
> Let's improve on that so that there is less of going back and forth
> for a while without making much progress this time:)

Agreed.

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Yunsheng Lin 2 years, 7 months ago

On 2023/6/20 23:39, Alexander Duyck wrote:
...

> 
>> If I understand it correctly, most hw have a per-queue fixed buffer
>> size, even the mlx5 one with per-desc buffer size support through
>> mlx5_wqe_data_seg, the driver seems to use the 'per-queue fixed
>> buffer size' model, I assume that using per-desc buffer size is just
>> not worth the effort?
> 
> The problem is the device really has two buffer sizes it is dealing
> with. The wqe size, and the cqe size. What goes in as a 4K page can
> come up as multiple frames depending on the packet sizes being
> received.

Yes, I understand that the buffer associated with wqe must be large
enough to hold the biggest packet, and sometimes hw may report that
only a small portion of that buffer is used as indicated in cqe when
a small packet is received. The problem is: how much buffer is
associated with a wqe to allow subdividing within wqe? With biggest
packet being 2K size, we need a buffer with 4K size to be associated
with a wqe, right? Isn't it wasteful to do that? Not to mention true
size exacerbating problem for small packet.

And it seems mlx5 is not using the page_pool_fragment_page() API as
you expected.
As my understanding, for a mpwqe, it have multi strides, a packet
seems to be able to fit in a stride or multi strides within a mpwqe,
and a stride seems to be corresponding to a frag, and there seems to
be no subdividing within a stride, see mlx5e_handle_rx_cqe_mpwrq().

https://elixir.bootlin.com/linux/v6.4-rc6/source/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L2366

...

> 
>>>
>>> What I was thinking of was the frag count. That is something the
>>> driver should have the ability to manipulate, be it adding or removing
>>> frags as it takes the section of memory it was given and it decides to
>>> break it up further before handing it out in skb frames.
>>
>> As my understanding, there is no essential difference between frag
>> count and frag offet if we want to do 'subdividing', just like we
>> have frag_count for page pool and _refcount for page allocator, we
>> may need a third one for this 'subdividing'.
> 
> There is a huge difference, and may be part of the reason why you and
> I have such a different understanding of this.
> 
> The offset is just local to your fragmentation, whereas the count is
> the global value for the page at which it can finally be freed back to
> the pool. You could have multiple threads all working with different
> offsets as long as they are all bounded within separate regions of the
> page, however they must all agree on the frag count they are working
> with since that is a property specific to the page. This is why
> frag_count must be atomic whereas we keep frag_offset as a local
> variable.
> 
> No additional counts needed. We never added another _refcount when we
> were doing splitting in the drivers, and we wouldn't need to in order
> to do splitting with page_pool pages. We would just have to start with
> a frag count of 1.

In that case, we can not do something like below as _refcount if we have
the same frag count for page pool and driver, right?

https://elixir.bootlin.com/linux/v6.4-rc6/source/drivers/net/ethernet/intel/iavf/iavf_txrx.c#L1220

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Jesper Dangaard Brouer 2 years, 7 months ago

On 15/06/2023 16.45, Alexander Duyck wrote:
[..]
> 
> What concerns me is that you seem to be taking the page pool API in a
> direction other than what it was really intended for. For any physical
> device you aren't going to necessarily know what size fragment you are
> working with until you have already allocated the page and DMA has
> been performed. That is why drivers such as the Mellanox driver are
> fragmenting in the driver instead of allocated pre-fragmented pages.
> 

+1

I share concerns with Alexander Duyck here. As the inventor and
maintainer, I can say this is taking the page_pool API in a direction I
didn't intent or planned for. As Alex also says, the intent was for
fixed sized memory chunks that are DMA ready.  Use-case was the physical
device RX "early demux problem", where the size is not known before hand.

I need to be convinced this is a good direction to take the page_pool
design/architecture into... e.g. allocations with dynamic sizes.
Maybe it is a good idea, but as below "consumers" of the API is usually
the way to show this is the case.

[...]
> 
> What I was getting at is that if you are going to add an API you have
> to have a consumer for the API. That is rule #1 for kernel API
> development. You don't add API without a consumer for it. The changes
> you are making are to support some future implementation, and I see it
> breaking most of the existing implementation. That is my concern.
> 

You have mentioned veth as the use-case. I know I acked adding page_pool
use-case to veth, for when we need to convert an SKB into an
xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
In this case in veth, the size is known at the page allocation time.
Thus, using the page_pool API is wasting memory.  We did this for
performance reasons, but we are not using PP for what is was intended
for.  We mostly use page_pool, because it an existing recycle return
path, and we were too lazy to add another alloc-type (see enum
xdp_mem_type).

Maybe you/we can extend veth to use this dynamic size API, to show us
that this is API is a better approach.  I will signup for benchmarking
this (and coordinating with CC Maryam as she came with use-case we
improved on).

--Jesper

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Yunsheng Lin 2 years, 7 months ago

On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:

...

> You have mentioned veth as the use-case. I know I acked adding page_pool
> use-case to veth, for when we need to convert an SKB into an
> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
> In this case in veth, the size is known at the page allocation time.
> Thus, using the page_pool API is wasting memory.  We did this for
> performance reasons, but we are not using PP for what is was intended
> for.  We mostly use page_pool, because it an existing recycle return
> path, and we were too lazy to add another alloc-type (see enum
> xdp_mem_type).
> 
> Maybe you/we can extend veth to use this dynamic size API, to show us
> that this is API is a better approach.  I will signup for benchmarking
> this (and coordinating with CC Maryam as she came with use-case we
> improved on).

Thanks, let's find out if page pool is the right hammer for the
veth XDP case.

Below is the change for veth using the new api in this patch.
Only compile test as I am not familiar enough with veth XDP and
testing environment for it.
Please try it if it is helpful.

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..8850394f1d29 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
        if (skb_shared(skb) || skb_head_is_locked(skb) ||
            skb_shinfo(skb)->nr_frags ||
            skb_headroom(skb) < XDP_PACKET_HEADROOM) {
-               u32 size, len, max_head_size, off;
+               u32 size, len, max_head_size, off, truesize, page_offset;
                struct sk_buff *nskb;
                struct page *page;
                int i, head_off;
@@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
                if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
                        goto drop;

+               size = min_t(u32, skb->len, max_head_size);
+               truesize = size;
+
                /* Allocate skb head */
-               page = page_pool_dev_alloc_pages(rq->page_pool);
+               page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
                if (!page)
                        goto drop;

-               nskb = napi_build_skb(page_address(page), PAGE_SIZE);
+               nskb = napi_build_skb(page_address(page) + page_offset, truesize);
                if (!nskb) {
                        page_pool_put_full_page(rq->page_pool, page, true);
                        goto drop;
@@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
                skb_copy_header(nskb, skb);
                skb_mark_for_recycle(nskb);

-               size = min_t(u32, skb->len, max_head_size);
                if (skb_copy_bits(skb, 0, nskb->data, size)) {
                        consume_skb(nskb);
                        goto drop;
@@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
                len = skb->len - off;

                for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
-                       page = page_pool_dev_alloc_pages(rq->page_pool);
+                       size = min_t(u32, len, PAGE_SIZE);
+                       truesize = size;
+
+                       page = page_pool_dev_alloc(rq->page_pool, &page_offset,
+                                                  &truesize);
                        if (!page) {
                                consume_skb(nskb);
                                goto drop;
                        }

-                       size = min_t(u32, len, PAGE_SIZE);
-                       skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
+                       skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
                        if (skb_copy_bits(skb, off, page_address(page),
                                          size)) {
                                consume_skb(nskb);


> 
> --Jesper
> 
> .
>

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Jesper Dangaard Brouer 2 years, 7 months ago


On 16/06/2023 13.57, Yunsheng Lin wrote:
> On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:
> 
> ...
> 
>> You have mentioned veth as the use-case. I know I acked adding page_pool
>> use-case to veth, for when we need to convert an SKB into an
>> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
>> In this case in veth, the size is known at the page allocation time.
>> Thus, using the page_pool API is wasting memory.  We did this for
>> performance reasons, but we are not using PP for what is was intended
>> for.  We mostly use page_pool, because it an existing recycle return
>> path, and we were too lazy to add another alloc-type (see enum
>> xdp_mem_type).
>>
>> Maybe you/we can extend veth to use this dynamic size API, to show us
>> that this is API is a better approach.  I will signup for benchmarking
>> this (and coordinating with CC Maryam as she came with use-case we
>> improved on).
> 
> Thanks, let's find out if page pool is the right hammer for the
> veth XDP case.
> 
> Below is the change for veth using the new api in this patch.
> Only compile test as I am not familiar enough with veth XDP and
> testing environment for it.
> Please try it if it is helpful.
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 614f3e3efab0..8850394f1d29 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>          if (skb_shared(skb) || skb_head_is_locked(skb) ||
>              skb_shinfo(skb)->nr_frags ||
>              skb_headroom(skb) < XDP_PACKET_HEADROOM) {
> -               u32 size, len, max_head_size, off;
> +               u32 size, len, max_head_size, off, truesize, page_offset;
>                  struct sk_buff *nskb;
>                  struct page *page;
>                  int i, head_off;
> @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>                  if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
>                          goto drop;
> 
> +               size = min_t(u32, skb->len, max_head_size);
> +               truesize = size;
> +
>                  /* Allocate skb head */
> -               page = page_pool_dev_alloc_pages(rq->page_pool);
> +               page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);

Maybe rename API to:

  addr = netmem_alloc(rq->page_pool, &truesize);

>                  if (!page)
>                          goto drop;
> 
> -               nskb = napi_build_skb(page_address(page), PAGE_SIZE);
> +               nskb = napi_build_skb(page_address(page) + page_offset, truesize);

IMHO this illustrates that API is strange/funky.
(I think this is what Alex Duyck is also pointing out).

This is the memory (virtual) address "pointer":
  addr = page_address(page) + page_offset

This is what napi_build_skb() takes as input. (I looked at other users 
of napi_build_skb() whom all give a mem ptr "va" as arg.)
So, why does your new API provide the "page" and not just the address?

As proposed above:
   addr = netmem_alloc(rq->page_pool, &truesize);

Maybe the API should be renamed, to indicate this isn't returning a "page"?
We have talked about the name "netmem" before.

>                  if (!nskb) {
>                          page_pool_put_full_page(rq->page_pool, page, true);
>                          goto drop;
> @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>                  skb_copy_header(nskb, skb);
>                  skb_mark_for_recycle(nskb);
> 
> -               size = min_t(u32, skb->len, max_head_size);
>                  if (skb_copy_bits(skb, 0, nskb->data, size)) {
>                          consume_skb(nskb);
>                          goto drop;
> @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>                  len = skb->len - off;
> 
>                  for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> -                       page = page_pool_dev_alloc_pages(rq->page_pool);
> +                       size = min_t(u32, len, PAGE_SIZE);
> +                       truesize = size;
> +
> +                       page = page_pool_dev_alloc(rq->page_pool, &page_offset,
> +                                                  &truesize);
>                          if (!page) {
>                                  consume_skb(nskb);
>                                  goto drop;
>                          }
> 
> -                       size = min_t(u32, len, PAGE_SIZE);
> -                       skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> +                       skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);

Guess, this shows the opposite; that the "page" _is_ used by the 
existing API.

>                          if (skb_copy_bits(skb, off, page_address(page),
>                                            size)) {
>                                  consume_skb(nskb);

--Jesper

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Alexander Duyck 2 years, 7 months ago

On Fri, Jun 16, 2023 at 9:31 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 16/06/2023 13.57, Yunsheng Lin wrote:
> > On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:
> >
> > ...
> >
> >> You have mentioned veth as the use-case. I know I acked adding page_pool
> >> use-case to veth, for when we need to convert an SKB into an
> >> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
> >> In this case in veth, the size is known at the page allocation time.
> >> Thus, using the page_pool API is wasting memory.  We did this for
> >> performance reasons, but we are not using PP for what is was intended
> >> for.  We mostly use page_pool, because it an existing recycle return
> >> path, and we were too lazy to add another alloc-type (see enum
> >> xdp_mem_type).
> >>
> >> Maybe you/we can extend veth to use this dynamic size API, to show us
> >> that this is API is a better approach.  I will signup for benchmarking
> >> this (and coordinating with CC Maryam as she came with use-case we
> >> improved on).
> >
> > Thanks, let's find out if page pool is the right hammer for the
> > veth XDP case.
> >
> > Below is the change for veth using the new api in this patch.
> > Only compile test as I am not familiar enough with veth XDP and
> > testing environment for it.
> > Please try it if it is helpful.
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index 614f3e3efab0..8850394f1d29 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >          if (skb_shared(skb) || skb_head_is_locked(skb) ||
> >              skb_shinfo(skb)->nr_frags ||
> >              skb_headroom(skb) < XDP_PACKET_HEADROOM) {
> > -               u32 size, len, max_head_size, off;
> > +               u32 size, len, max_head_size, off, truesize, page_offset;
> >                  struct sk_buff *nskb;
> >                  struct page *page;
> >                  int i, head_off;
> > @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >                  if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
> >                          goto drop;
> >
> > +               size = min_t(u32, skb->len, max_head_size);
> > +               truesize = size;
> > +
> >                  /* Allocate skb head */
> > -               page = page_pool_dev_alloc_pages(rq->page_pool);
> > +               page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
>
> Maybe rename API to:
>
>   addr = netmem_alloc(rq->page_pool, &truesize);
>
> >                  if (!page)
> >                          goto drop;
> >
> > -               nskb = napi_build_skb(page_address(page), PAGE_SIZE);
> > +               nskb = napi_build_skb(page_address(page) + page_offset, truesize);
>
> IMHO this illustrates that API is strange/funky.
> (I think this is what Alex Duyck is also pointing out).
>
> This is the memory (virtual) address "pointer":
>   addr = page_address(page) + page_offset
>
> This is what napi_build_skb() takes as input. (I looked at other users
> of napi_build_skb() whom all give a mem ptr "va" as arg.)
> So, why does your new API provide the "page" and not just the address?
>
> As proposed above:
>    addr = netmem_alloc(rq->page_pool, &truesize);
>
> Maybe the API should be renamed, to indicate this isn't returning a "page"?
> We have talked about the name "netmem" before.

Yeah, this is more-or-less what I was getting at. Keep in mind this is
likely the most common case since most frames passed and forth aren't
ever usually much larger than 1500B.

> >                  if (!nskb) {
> >                          page_pool_put_full_page(rq->page_pool, page, true);
> >                          goto drop;
> > @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >                  skb_copy_header(nskb, skb);
> >                  skb_mark_for_recycle(nskb);
> >
> > -               size = min_t(u32, skb->len, max_head_size);
> >                  if (skb_copy_bits(skb, 0, nskb->data, size)) {
> >                          consume_skb(nskb);
> >                          goto drop;
> > @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >                  len = skb->len - off;
> >
> >                  for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> > -                       page = page_pool_dev_alloc_pages(rq->page_pool);
> > +                       size = min_t(u32, len, PAGE_SIZE);
> > +                       truesize = size;
> > +
> > +                       page = page_pool_dev_alloc(rq->page_pool, &page_offset,
> > +                                                  &truesize);
> >                          if (!page) {
> >                                  consume_skb(nskb);
> >                                  goto drop;
> >                          }
> >
> > -                       size = min_t(u32, len, PAGE_SIZE);
> > -                       skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> > +                       skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
>
> Guess, this shows the opposite; that the "page" _is_ used by the
> existing API.

This is a sort-of. One thing that has come up as of late is that all
this stuff is being moved over to folios anyway and getting away from
pages. In addition I am not sure how often we are having to take this
path as I am not sure how many non-Tx frames end up having to have
fragments added to them. For something like veth it might be more
common though since Tx becomes Rx in this case.

One thought I had on this is that we could look at adding a new
function that abstracts this away and makes use of netmem instead.
Then the whole page/folio thing would be that much further removed.

One other question I have now that I look at this code as well. Why is
it using page_pool and not just a frag cache allocator, or pages
themselves? It doesn't seem like it has a DMA mapping to deal with
since this is essentially copy-break code. Seems problematic that
there is no DMA involved here at all. This could be more easily
handled with just a single page_frag_cache style allocator.

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Yunsheng Lin 2 years, 7 months ago

On 2023/6/17 1:34, Alexander Duyck wrote:
...

>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 614f3e3efab0..8850394f1d29 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>          if (skb_shared(skb) || skb_head_is_locked(skb) ||
>>>              skb_shinfo(skb)->nr_frags ||
>>>              skb_headroom(skb) < XDP_PACKET_HEADROOM) {
>>> -               u32 size, len, max_head_size, off;
>>> +               u32 size, len, max_head_size, off, truesize, page_offset;
>>>                  struct sk_buff *nskb;
>>>                  struct page *page;
>>>                  int i, head_off;
>>> @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>                  if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
>>>                          goto drop;
>>>
>>> +               size = min_t(u32, skb->len, max_head_size);
>>> +               truesize = size;
>>> +
>>>                  /* Allocate skb head */
>>> -               page = page_pool_dev_alloc_pages(rq->page_pool);
>>> +               page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
>>
>> Maybe rename API to:
>>
>>   addr = netmem_alloc(rq->page_pool, &truesize);

Unless we create a subsystem called netmem, I am not sure about
the 'netmem', it seems more confusing to use it here.

>>
>>>                  if (!page)
>>>                          goto drop;
>>>
>>> -               nskb = napi_build_skb(page_address(page), PAGE_SIZE);
>>> +               nskb = napi_build_skb(page_address(page) + page_offset, truesize);
>>
>> IMHO this illustrates that API is strange/funky.
>> (I think this is what Alex Duyck is also pointing out).
>>
>> This is the memory (virtual) address "pointer":
>>   addr = page_address(page) + page_offset
>>
>> This is what napi_build_skb() takes as input. (I looked at other users
>> of napi_build_skb() whom all give a mem ptr "va" as arg.)
>> So, why does your new API provide the "page" and not just the address?
>>
>> As proposed above:
>>    addr = netmem_alloc(rq->page_pool, &truesize);
>>
>> Maybe the API should be renamed, to indicate this isn't returning a "page"?
>> We have talked about the name "netmem" before.
> 
> Yeah, this is more-or-less what I was getting at. Keep in mind this is
> likely the most common case since most frames passed and forth aren't
> ever usually much larger than 1500B.

I do feel the pain here, there is why I use a per cpu 'struct
page_pool_frag' to report the result back to user so that we
can report both 'va' and 'page' to the user in the RFC of this
patchset.

IHMO, compared to the above point, it is more importance that
we have a unified implementation for both of them instead
of page frag based on the page allocator.

Currently there are three implementations for page frag:
1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().
3. drivers/vhost/net.c: vhost seems to be using it to build
   xdp frame, and it's implementation seems to be a mix of
   the above two.

Acctually I have a patchset to remove the third one waiting
to send out after this one.

And I wonder if the first and second one can be unified as
one, as it seems the only user facing difference is one
returning va, and the other returning page. other difference
seems to be implementation specific, for example, one is
doing offset incrementing, and the other doing offset
decrementing.

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Jesper Dangaard Brouer 2 years, 7 months ago


On 16/06/2023 19.34, Alexander Duyck wrote:
> On Fri, Jun 16, 2023 at 9:31 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>>
>> On 16/06/2023 13.57, Yunsheng Lin wrote:
>>> On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:
>>>
>>> ...
>>>
>>>> You have mentioned veth as the use-case. I know I acked adding page_pool
>>>> use-case to veth, for when we need to convert an SKB into an
>>>> xdp_buff/xdp-frame, but maybe it was the wrong hammer(?).
>>>> In this case in veth, the size is known at the page allocation time.
>>>> Thus, using the page_pool API is wasting memory.  We did this for
>>>> performance reasons, but we are not using PP for what is was intended
>>>> for.  We mostly use page_pool, because it an existing recycle return
>>>> path, and we were too lazy to add another alloc-type (see enum
>>>> xdp_mem_type).
>>>>
>>>> Maybe you/we can extend veth to use this dynamic size API, to show us
>>>> that this is API is a better approach.  I will signup for benchmarking
>>>> this (and coordinating with CC Maryam as she came with use-case we
>>>> improved on).
>>>
>>> Thanks, let's find out if page pool is the right hammer for the
>>> veth XDP case.
>>>
>>> Below is the change for veth using the new api in this patch.
>>> Only compile test as I am not familiar enough with veth XDP and
>>> testing environment for it.
>>> Please try it if it is helpful.
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 614f3e3efab0..8850394f1d29 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -736,7 +736,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>           if (skb_shared(skb) || skb_head_is_locked(skb) ||
>>>               skb_shinfo(skb)->nr_frags ||
>>>               skb_headroom(skb) < XDP_PACKET_HEADROOM) {
>>> -               u32 size, len, max_head_size, off;
>>> +               u32 size, len, max_head_size, off, truesize, page_offset;
>>>                   struct sk_buff *nskb;
>>>                   struct page *page;
>>>                   int i, head_off;
>>> @@ -752,12 +752,15 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>                   if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
>>>                           goto drop;
>>>
>>> +               size = min_t(u32, skb->len, max_head_size);
>>> +               truesize = size;
>>> +
>>>                   /* Allocate skb head */
>>> -               page = page_pool_dev_alloc_pages(rq->page_pool);
>>> +               page = page_pool_dev_alloc(rq->page_pool, &page_offset, &truesize);
>>
>> Maybe rename API to:
>>
>>    addr = netmem_alloc(rq->page_pool, &truesize);
>>
>>>                   if (!page)
>>>                           goto drop;
>>>
>>> -               nskb = napi_build_skb(page_address(page), PAGE_SIZE);
>>> +               nskb = napi_build_skb(page_address(page) + page_offset, truesize);
>>
>> IMHO this illustrates that API is strange/funky.
>> (I think this is what Alex Duyck is also pointing out).
>>
>> This is the memory (virtual) address "pointer":
>>    addr = page_address(page) + page_offset
>>
>> This is what napi_build_skb() takes as input. (I looked at other users
>> of napi_build_skb() whom all give a mem ptr "va" as arg.)
>> So, why does your new API provide the "page" and not just the address?
>>
>> As proposed above:
>>     addr = netmem_alloc(rq->page_pool, &truesize);
>>
>> Maybe the API should be renamed, to indicate this isn't returning a "page"?
>> We have talked about the name "netmem" before.
> 
> Yeah, this is more-or-less what I was getting at. Keep in mind this is
> likely the most common case since most frames passed and forth aren't
> ever usually much larger than 1500B.
> 

Good to get confirmed this is "more-or-less" your suggestion/direction.


>>>                   if (!nskb) {
>>>                           page_pool_put_full_page(rq->page_pool, page, true);
>>>                           goto drop;
>>> @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>                   skb_copy_header(nskb, skb);
>>>                   skb_mark_for_recycle(nskb);
>>>
>>> -               size = min_t(u32, skb->len, max_head_size);
>>>                   if (skb_copy_bits(skb, 0, nskb->data, size)) {
>>>                           consume_skb(nskb);
>>>                           goto drop;
>>> @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>>                   len = skb->len - off;
>>>
>>>                   for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
>>> -                       page = page_pool_dev_alloc_pages(rq->page_pool);
>>> +                       size = min_t(u32, len, PAGE_SIZE);
>>> +                       truesize = size;
>>> +
>>> +                       page = page_pool_dev_alloc(rq->page_pool, &page_offset,
>>> +                                                  &truesize);
>>>                           if (!page) {
>>>                                   consume_skb(nskb);
>>>                                   goto drop;
>>>                           }
>>>
>>> -                       size = min_t(u32, len, PAGE_SIZE);
>>> -                       skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
>>> +                       skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
>>
>> Guess, this shows the opposite; that the "page" _is_ used by the
>> existing API.
> 
> This is a sort-of. One thing that has come up as of late is that all
> this stuff is being moved over to folios anyway and getting away from
> pages. In addition I am not sure how often we are having to take this
> path as I am not sure how many non-Tx frames end up having to have
> fragments added to them. For something like veth it might be more
> common though since Tx becomes Rx in this case.

I'm thinking, that is it very unlikely that XDP have modified the
fragments.  So, why are we allocating and copying the fragments?
Wouldn't it be possible for this veth code to bump the refcnt on these
fragments? (maybe I missed some detail).

> 
> One thought I had on this is that we could look at adding a new
> function that abstracts this away and makes use of netmem instead.
> Then the whole page/folio thing would be that much further removed.

I like this "thought" of yours :-)

> 
> One other question I have now that I look at this code as well. Why is
> it using page_pool and not just a frag cache allocator, or pages
> themselves? It doesn't seem like it has a DMA mapping to deal with
> since this is essentially copy-break code. Seems problematic that
> there is no DMA involved here at all. This could be more easily
> handled with just a single page_frag_cache style allocator.
> 

Yes, precisely.
I distinctly remember what I tried to poke you and Eric on this approach
earlier, but I cannot find a link to that email.

I would really appreciate, if you Alex, could give the approach in
veth_convert_skb_to_xdp_buff() some review, as I believe that is a huge
potential for improvements that will lead to large performance
improvements. (I'm sure Maryam will be eager to help re-test performance
for her use-cases).

--Jesper

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Alexander Duyck 2 years, 7 months ago

On Fri, Jun 16, 2023 at 11:41 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 16/06/2023 19.34, Alexander Duyck wrote:
> > On Fri, Jun 16, 2023 at 9:31 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >>
> >>
> >> On 16/06/2023 13.57, Yunsheng Lin wrote:
> >>> On 2023/6/16 0:19, Jesper Dangaard Brouer wrote:

[...]

>
>
> >>>                   if (!nskb) {
> >>>                           page_pool_put_full_page(rq->page_pool, page, true);
> >>>                           goto drop;
> >>> @@ -767,7 +770,6 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>>                   skb_copy_header(nskb, skb);
> >>>                   skb_mark_for_recycle(nskb);
> >>>
> >>> -               size = min_t(u32, skb->len, max_head_size);
> >>>                   if (skb_copy_bits(skb, 0, nskb->data, size)) {
> >>>                           consume_skb(nskb);
> >>>                           goto drop;
> >>> @@ -782,14 +784,17 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>>                   len = skb->len - off;
> >>>
> >>>                   for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> >>> -                       page = page_pool_dev_alloc_pages(rq->page_pool);
> >>> +                       size = min_t(u32, len, PAGE_SIZE);
> >>> +                       truesize = size;
> >>> +
> >>> +                       page = page_pool_dev_alloc(rq->page_pool, &page_offset,
> >>> +                                                  &truesize);
> >>>                           if (!page) {
> >>>                                   consume_skb(nskb);
> >>>                                   goto drop;
> >>>                           }
> >>>
> >>> -                       size = min_t(u32, len, PAGE_SIZE);
> >>> -                       skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> >>> +                       skb_add_rx_frag(nskb, i, page, page_offset, size, truesize);
> >>
> >> Guess, this shows the opposite; that the "page" _is_ used by the
> >> existing API.
> >
> > This is a sort-of. One thing that has come up as of late is that all
> > this stuff is being moved over to folios anyway and getting away from
> > pages. In addition I am not sure how often we are having to take this
> > path as I am not sure how many non-Tx frames end up having to have
> > fragments added to them. For something like veth it might be more
> > common though since Tx becomes Rx in this case.
>
> I'm thinking, that is it very unlikely that XDP have modified the
> fragments.  So, why are we allocating and copying the fragments?
> Wouldn't it be possible for this veth code to bump the refcnt on these
> fragments? (maybe I missed some detail).

From what I can tell this is an exception case with multiple caveats
for shared, locked, or nonlinear frames.

As such I suspect there may be some deprecated cases in there too
since XDP multi-buf support has been around for a while so the code
shouldn't need to reallocate to linearize a nonlinear frame.

> >
> > One other question I have now that I look at this code as well. Why is
> > it using page_pool and not just a frag cache allocator, or pages
> > themselves? It doesn't seem like it has a DMA mapping to deal with
> > since this is essentially copy-break code. Seems problematic that
> > there is no DMA involved here at all. This could be more easily
> > handled with just a single page_frag_cache style allocator.
> >
>
> Yes, precisely.
> I distinctly remember what I tried to poke you and Eric on this approach
> earlier, but I cannot find a link to that email.
>
> I would really appreciate, if you Alex, could give the approach in
> veth_convert_skb_to_xdp_buff() some review, as I believe that is a huge
> potential for improvements that will lead to large performance
> improvements. (I'm sure Maryam will be eager to help re-test performance
> for her use-cases).

Well just looking at it the quick and dirty answer would be to look at
making use of something like page_frag_cache. I won't go into details
since it isn't too different from the frag allocator, but it is much
simpler since it is just doing reference count hacks instead of having
to do the extra overhead to keep the DMA mapping in place. The veth
would then just be sitting on at most an order 3 page while it is
waiting to fully consume it rather than waiting on a full pool of
pages.

Alternatively it could do something similar to page_frag_alloc_align
itself and just bypass doing a custom allocator. If it went that route
it could do something almost like a ring buffer and greatly improve
the throughput since it would be able to allocate a higher order page
and just copy the entire skb in so the entire thing would be linear
rather than having to allocate a bunch of single pages.

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Lorenzo Bianconi 2 years, 7 months ago

[...]
> >
> > Yes, precisely.
> > I distinctly remember what I tried to poke you and Eric on this approach
> > earlier, but I cannot find a link to that email.
> >
> > I would really appreciate, if you Alex, could give the approach in
> > veth_convert_skb_to_xdp_buff() some review, as I believe that is a huge
> > potential for improvements that will lead to large performance
> > improvements. (I'm sure Maryam will be eager to help re-test performance
> > for her use-cases).
> 
> Well just looking at it the quick and dirty answer would be to look at
> making use of something like page_frag_cache. I won't go into details
> since it isn't too different from the frag allocator, but it is much
> simpler since it is just doing reference count hacks instead of having
> to do the extra overhead to keep the DMA mapping in place. The veth
> would then just be sitting on at most an order 3 page while it is
> waiting to fully consume it rather than waiting on a full pool of
> pages.

Hi,

I did some experiments using page_frag_cache/page_frag_alloc() instead of
page_pools in a simple environment I used to test XDP for veth driver.
In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
the page_frag_cache in order to copy the full skb in the new one, actually
"linearizing" the packet (since we know the original skb length).
I run an iperf TCP connection over a veth pair where the
remote device runs the xdp_rxq_info sample (available in the kernel source
tree, with action XDP_PASS):

TCP clietn -- v0 === v1 (xdp_rxq_info) -- TCP server

net-next (page_pool):
- MTU 1500B: ~  7.5 Gbps
- MTU 8000B: ~ 15.3 Gbps

net-next + page_frag_alloc:
- MTU 1500B: ~  8.4 Gbps
- MTU 8000B: ~ 14.7 Gbps

It seems there is no a clear "win" situation here (at least in this environment
and we this simple approach). Moreover:
- can the linearization introduce any issue whenever we perform XDP_REDIRECT
  into a destination device?
- can the page_frag_cache introduce more memory fragmentation (IIRC we were
  experiencing this issue in mt76 before switching to page_pools).

What do you think?

Regards,
Lorenzo

> 
> Alternatively it could do something similar to page_frag_alloc_align
> itself and just bypass doing a custom allocator. If it went that route
> it could do something almost like a ring buffer and greatly improve
> the throughput since it would be able to allocate a higher order page
> and just copy the entire skb in so the entire thing would be linear
> rather than having to allocate a bunch of single pages.

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Alexander Duyck 2 years, 7 months ago

On Sun, Jun 18, 2023 at 8:05 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> [...]
> > >
> > > Yes, precisely.
> > > I distinctly remember what I tried to poke you and Eric on this approach
> > > earlier, but I cannot find a link to that email.
> > >
> > > I would really appreciate, if you Alex, could give the approach in
> > > veth_convert_skb_to_xdp_buff() some review, as I believe that is a huge
> > > potential for improvements that will lead to large performance
> > > improvements. (I'm sure Maryam will be eager to help re-test performance
> > > for her use-cases).
> >
> > Well just looking at it the quick and dirty answer would be to look at
> > making use of something like page_frag_cache. I won't go into details
> > since it isn't too different from the frag allocator, but it is much
> > simpler since it is just doing reference count hacks instead of having
> > to do the extra overhead to keep the DMA mapping in place. The veth
> > would then just be sitting on at most an order 3 page while it is
> > waiting to fully consume it rather than waiting on a full pool of
> > pages.
>
> Hi,
>
> I did some experiments using page_frag_cache/page_frag_alloc() instead of
> page_pools in a simple environment I used to test XDP for veth driver.
> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
> the page_frag_cache in order to copy the full skb in the new one, actually
> "linearizing" the packet (since we know the original skb length).
> I run an iperf TCP connection over a veth pair where the
> remote device runs the xdp_rxq_info sample (available in the kernel source
> tree, with action XDP_PASS):
>
> TCP clietn -- v0 === v1 (xdp_rxq_info) -- TCP server
>
> net-next (page_pool):
> - MTU 1500B: ~  7.5 Gbps
> - MTU 8000B: ~ 15.3 Gbps
>
> net-next + page_frag_alloc:
> - MTU 1500B: ~  8.4 Gbps
> - MTU 8000B: ~ 14.7 Gbps
>
> It seems there is no a clear "win" situation here (at least in this environment
> and we this simple approach). Moreover:

For the 1500B packets it is a win, but for 8000B it looks like there
is a regression. Any idea what is causing it?

> - can the linearization introduce any issue whenever we perform XDP_REDIRECT
>   into a destination device?

It shouldn't. If it does it would probably point to an issue w/ the
destination driver rather than an issue with the code doing this.

> - can the page_frag_cache introduce more memory fragmentation (IIRC we were
>   experiencing this issue in mt76 before switching to page_pools).

I think it largely depends on where the packets are ending up. I know
this is the approach we are using for sockets, see
skb_page_frag_refill(). If nothing else, if you took a similar
approach to it you might be able to bypass the need for the
page_frag_cache itself, although you would likely still end up
allocating similar structures.

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Lorenzo Bianconi 2 years, 7 months ago

[...]

> > I did some experiments using page_frag_cache/page_frag_alloc() instead of
> > page_pools in a simple environment I used to test XDP for veth driver.
> > In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
> > the page_frag_cache in order to copy the full skb in the new one, actually
> > "linearizing" the packet (since we know the original skb length).
> > I run an iperf TCP connection over a veth pair where the
> > remote device runs the xdp_rxq_info sample (available in the kernel source
> > tree, with action XDP_PASS):
> >
> > TCP clietn -- v0 === v1 (xdp_rxq_info) -- TCP server
> >
> > net-next (page_pool):
> > - MTU 1500B: ~  7.5 Gbps
> > - MTU 8000B: ~ 15.3 Gbps
> >
> > net-next + page_frag_alloc:
> > - MTU 1500B: ~  8.4 Gbps
> > - MTU 8000B: ~ 14.7 Gbps
> >
> > It seems there is no a clear "win" situation here (at least in this environment
> > and we this simple approach). Moreover:
> 
> For the 1500B packets it is a win, but for 8000B it looks like there
> is a regression. Any idea what is causing it?

nope, I have not looked into it yet.

> 
> > - can the linearization introduce any issue whenever we perform XDP_REDIRECT
> >   into a destination device?
> 
> It shouldn't. If it does it would probably point to an issue w/ the
> destination driver rather than an issue with the code doing this.

ack, fine.

> 
> > - can the page_frag_cache introduce more memory fragmentation (IIRC we were
> >   experiencing this issue in mt76 before switching to page_pools).
> 
> I think it largely depends on where the packets are ending up. I know
> this is the approach we are using for sockets, see
> skb_page_frag_refill(). If nothing else, if you took a similar
> approach to it you might be able to bypass the need for the
> page_frag_cache itself, although you would likely still end up
> allocating similar structures.

ack.

Regards,
Lorenzo

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Jesper Dangaard Brouer 2 years, 7 months ago

On 20/06/2023 23.16, Lorenzo Bianconi wrote:
> [...]
> 
>>> I did some experiments using page_frag_cache/page_frag_alloc() instead of
>>> page_pools in a simple environment I used to test XDP for veth driver.
>>> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
>>> the page_frag_cache in order to copy the full skb in the new one, actually
>>> "linearizing" the packet (since we know the original skb length).
>>> I run an iperf TCP connection over a veth pair where the
>>> remote device runs the xdp_rxq_info sample (available in the kernel source
>>> tree, with action XDP_PASS):
>>>
>>> TCP clietn -- v0 === v1 (xdp_rxq_info) -- TCP server
>>>
>>> net-next (page_pool):
>>> - MTU 1500B: ~  7.5 Gbps
>>> - MTU 8000B: ~ 15.3 Gbps
>>>
>>> net-next + page_frag_alloc:
>>> - MTU 1500B: ~  8.4 Gbps
>>> - MTU 8000B: ~ 14.7 Gbps
>>>
>>> It seems there is no a clear "win" situation here (at least in this environment
>>> and we this simple approach). Moreover:
>>
>> For the 1500B packets it is a win, but for 8000B it looks like there
>> is a regression. Any idea what is causing it?
> 
> nope, I have not looked into it yet.
> 

I think I can explain via using micro-benchmark numbers.
(Lorenzo and I have discussed this over IRC, so this is our summary)

*** MTU 1500***

* The MTU 1500 case, where page_frag_alloc is faster than PP (page_pool):

The PP alloc a 4K page for MTU 1500. The cost of alloc + recycle via
ptr_ring cost 48 cycles (page_pool02_ptr_ring Per elem: 48 cycles(tsc)).

The page_frag_alloc API allocates a 32KB order-3 page, and chops it up
for packets.  The order-3 alloc + free cost 514 cycles (page_bench01:
alloc_pages order:3(32768B) 514 cycles). The MTU 1500 needs alloc size
1514+320+256 = 2090 bytes.  In 32KB we can fit 15 packets.  Thus, the
amortized cost per packet is only 34.3 cycles (514/15).

Thus, this explains why page_frag_alloc API have an advantage here, as
amortized cost per packet is lower (for page_frag_alloc).

*** MTU 8000 ***

* The MTU 8000 case, where PP is faster than page_frag_alloc.

The page_frag_alloc API cannot slice the same 32KB into as many packets.
The MTU 8000 needs alloc size 8000+14+256+320 = 8590 bytes.  This is can
only store 3 full packets (32768/8590 = 3.81).
Thus, cost is 514/3 = 171 cycles.

The PP is actually challenged at MTU 8000, because it unfortunately
leads to allocating 3 full pages (12KiB), due to needed alloc size 8590
bytes. Thus cost is 3x 48 cycles = 144 cycles.
(There is also a chance of Jakubs "allow_direct" optimization in 
page_pool_return_skb_page to increase performance for PP).

Thus, this explains why PP is fastest in this case.

*** Surprising insights ***

My (maybe) surprising conclusion is that we should combine the two
approaches.  Which is basically what Lin's patchset is doing!
Thus, I'm actually suddenly become a fan of this patchset...

The insight is that PP can also work with higher-order pages and the
cost of PP recycles via ptr_ring will be the same, regardless of page
order size.  Thus, we can reduced the order-3 cost 514 cycles to
basically 48 cycles, and fit 15 packets (MTU 1500) resulting is
amortized allocator cost 48/15 = 3.2 cycles.

On the PP alloc-side this will be amazingly fast. When PP recycles frags
side, see page_pool_defrag_page() there is an atomic_sub operation.
I've measured atomic_inc to cost 17 cycles (for optimal non-contended
case), thus 3+17 = 20 cycles, it should still be a win.

--Jesper

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Yunsheng Lin 2 years, 7 months ago

On 2023/6/21 19:55, Jesper Dangaard Brouer wrote:
> 
> 
> On 20/06/2023 23.16, Lorenzo Bianconi wrote:
>> [...]
>>
>>>> I did some experiments using page_frag_cache/page_frag_alloc() instead of
>>>> page_pools in a simple environment I used to test XDP for veth driver.
>>>> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
>>>> the page_frag_cache in order to copy the full skb in the new one, actually
>>>> "linearizing" the packet (since we know the original skb length).
>>>> I run an iperf TCP connection over a veth pair where the
>>>> remote device runs the xdp_rxq_info sample (available in the kernel source
>>>> tree, with action XDP_PASS):
>>>>
>>>> TCP clietn -- v0 === v1 (xdp_rxq_info) -- TCP server
>>>>
>>>> net-next (page_pool):
>>>> - MTU 1500B: ~  7.5 Gbps
>>>> - MTU 8000B: ~ 15.3 Gbps
>>>>
>>>> net-next + page_frag_alloc:
>>>> - MTU 1500B: ~  8.4 Gbps
>>>> - MTU 8000B: ~ 14.7 Gbps
>>>>
>>>> It seems there is no a clear "win" situation here (at least in this environment
>>>> and we this simple approach). Moreover:
>>>
>>> For the 1500B packets it is a win, but for 8000B it looks like there
>>> is a regression. Any idea what is causing it?
>>
>> nope, I have not looked into it yet.
>>
> 
> I think I can explain via using micro-benchmark numbers.
> (Lorenzo and I have discussed this over IRC, so this is our summary)
> 
> *** MTU 1500***
> 
> * The MTU 1500 case, where page_frag_alloc is faster than PP (page_pool):
> 
> The PP alloc a 4K page for MTU 1500. The cost of alloc + recycle via
> ptr_ring cost 48 cycles (page_pool02_ptr_ring Per elem: 48 cycles(tsc)).
> 
> The page_frag_alloc API allocates a 32KB order-3 page, and chops it up
> for packets.  The order-3 alloc + free cost 514 cycles (page_bench01:
> alloc_pages order:3(32768B) 514 cycles). The MTU 1500 needs alloc size
> 1514+320+256 = 2090 bytes.  In 32KB we can fit 15 packets.  Thus, the
> amortized cost per packet is only 34.3 cycles (514/15).
> 
> Thus, this explains why page_frag_alloc API have an advantage here, as
> amortized cost per packet is lower (for page_frag_alloc).
> 
> 
> *** MTU 8000 ***
> 
> * The MTU 8000 case, where PP is faster than page_frag_alloc.
> 
> The page_frag_alloc API cannot slice the same 32KB into as many packets.
> The MTU 8000 needs alloc size 8000+14+256+320 = 8590 bytes.  This is can
> only store 3 full packets (32768/8590 = 3.81).
> Thus, cost is 514/3 = 171 cycles.
> 
> The PP is actually challenged at MTU 8000, because it unfortunately
> leads to allocating 3 full pages (12KiB), due to needed alloc size 8590
> bytes. Thus cost is 3x 48 cycles = 144 cycles.
> (There is also a chance of Jakubs "allow_direct" optimization in page_pool_return_skb_page to increase performance for PP).
> 
> Thus, this explains why PP is fastest in this case.

Great analysis.
So the problem seems to be: can we optimize the page fragment cache
implementation so that it can at least match the performance of PP
for the above case? As Alexander seems to be against using PP for
the veth case without involving DMA mapping.

> 
> 
> *** Surprising insights ***
> 
> My (maybe) surprising conclusion is that we should combine the two
> approaches.  Which is basically what Lin's patchset is doing!
> Thus, I'm actually suddenly become a fan of this patchset...
> 
> The insight is that PP can also work with higher-order pages and the
> cost of PP recycles via ptr_ring will be the same, regardless of page
> order size.  Thus, we can reduced the order-3 cost 514 cycles to
> basically 48 cycles, and fit 15 packets (MTU 1500) resulting is
> amortized allocator cost 48/15 = 3.2 cycles.
> 
> On the PP alloc-side this will be amazingly fast. When PP recycles frags
> side, see page_pool_defrag_page() there is an atomic_sub operation.
> I've measured atomic_inc to cost 17 cycles (for optimal non-contended
> case), thus 3+17 = 20 cycles, it should still be a win.
> 
> 
> --Jesper
> 
>

Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API

Posted by Jakub Kicinski 2 years, 7 months ago

On Fri, 16 Jun 2023 20:41:45 +0200 Jesper Dangaard Brouer wrote:
> > This is a sort-of. One thing that has come up as of late is that all
> > this stuff is being moved over to folios anyway and getting away from
> > pages. In addition I am not sure how often we are having to take this
> > path as I am not sure how many non-Tx frames end up having to have
> > fragments added to them. For something like veth it might be more
> > common though since Tx becomes Rx in this case.  
> 
> I'm thinking, that is it very unlikely that XDP have modified the
> fragments.  So, why are we allocating and copying the fragments?
> Wouldn't it be possible for this veth code to bump the refcnt on these
> fragments? (maybe I missed some detail).

They may be page cache pages, AFAIU.

[PATCH net-next v3 1/4] page_pool: frag API support for 32-bit arch with 64-bit DMA
[PATCH net-next v3 2/4] page_pool: unify frag_count handling in page_pool_is_last_frag()
[PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
[PATCH net-next v3 4/4] page_pool: remove PP_FLAG_PAGE_FRAG flag