[PATCH net-next 2/2] net/mlx5e: Clamp page_pool size to max

Posted by Tariq Toukan 1 week, 2 days ago
From: Dragos Tatulea <dtatulea@nvidia.com>

When the user configures a large ring size (8K) and a large MTU (9000)
in HW-GRO mode, the queue will fail to allocate due to the size of the
page_pool going above the limit.

This change clamps the pool_size to the limit.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5e007bb3bad1..e56052895776 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -989,6 +989,8 @@ static int mlx5e_alloc_rq(struct mlx5e_params *params,
 		/* Create a page_pool and register it with rxq */
 		struct page_pool_params pp_params = { 0 };
 
+		pool_size = min_t(u32, pool_size, PAGE_POOL_SIZE_LIMIT);
+
 		pp_params.order     = 0;
 		pp_params.flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
 		pp_params.pool_size = pool_size;
-- 
2.31.1
Re: [PATCH net-next 2/2] net/mlx5e: Clamp page_pool size to max
Posted by Jakub Kicinski 1 week, 1 day ago
On Mon, 22 Sep 2025 12:18:35 +0300 Tariq Toukan wrote:
> When the user configures a large ring size (8K) and a large MTU (9000)
> in HW-GRO mode, the queue will fail to allocate due to the size of the
> page_pool going above the limit.

Please do some testing. A PP cache of 32k is just silly, you should
probably use a smaller limit.
Re: [PATCH net-next 2/2] net/mlx5e: Clamp page_pool size to max
Posted by Dragos Tatulea 1 week, 1 day ago
On Tue, Sep 23, 2025 at 07:23:56AM -0700, Jakub Kicinski wrote:
> On Mon, 22 Sep 2025 12:18:35 +0300 Tariq Toukan wrote:
> > When the user configures a large ring size (8K) and a large MTU (9000)
> > in HW-GRO mode, the queue will fail to allocate due to the size of the
> > page_pool going above the limit.
> 
> Please do some testing. A PP cache of 32k is just silly, you should
> probably use a smaller limit.
You mean clamping the pool_size to a certain limit so that the page_pool
ring size doesn't cover a full RQ when the RQ ring size is too large?

Thanks,
Dragos
Re: [PATCH net-next 2/2] net/mlx5e: Clamp page_pool size to max
Posted by Jakub Kicinski 1 week, 1 day ago
On Tue, 23 Sep 2025 15:12:33 +0000 Dragos Tatulea wrote:
> On Tue, Sep 23, 2025 at 07:23:56AM -0700, Jakub Kicinski wrote:
> > On Mon, 22 Sep 2025 12:18:35 +0300 Tariq Toukan wrote:  
> > > When the user configures a large ring size (8K) and a large MTU (9000)
> > > in HW-GRO mode, the queue will fail to allocate due to the size of the
> > > page_pool going above the limit.  
> > 
> > Please do some testing. A PP cache of 32k is just silly, you should
> > probably use a smaller limit.  
> You mean clamping the pool_size to a certain limit so that the page_pool
> ring size doesn't cover a full RQ when the RQ ring size is too large?

Yes, an 8k ring will take milliseconds to drain. We don't really need
milliseconds of page cache. By the time the driver has processed the full
ring we must have gone thru 128 NAPI cycles, and the application
has most likely already started freeing the pages.

If my math is right, at 80 Gbps per ring and 9k MTU it takes more than
1 usec to receive a frame. So ~8 msec just to _receive_ a full ring's
worth of data. At Meta we mostly use large rings to cover up scheduler
and IRQ masking latency.
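
For reference, a quick back-of-envelope check of those figures (a
standalone sketch, not part of the patch; the 80 Gbps per-ring rate and
the 64-packet NAPI budget are assumptions, the 9k MTU and 8k ring come
from the commit message). Payload-only math lands at roughly a
microsecond per frame and ~7.4 msec per full ring, i.e. the ~8 msec and
128 NAPI cycles mentioned above:

	#include <stdio.h>

	int main(void)
	{
		/* Assumed figures: 80 Gbps per ring, 9000-byte frames
		 * (9k MTU), 8192-entry RX ring, 64-packet NAPI budget.
		 */
		const double bits_per_sec = 80e9;
		const double frame_bits   = 9000.0 * 8;
		const int ring_size   = 8192;
		const int napi_budget = 64;

		double usec_per_frame = frame_bits / bits_per_sec * 1e6;  /* ~0.9 usec */
		double msec_per_ring  = usec_per_frame * ring_size / 1e3; /* ~7.4 msec */

		printf("%.2f usec/frame, %.1f msec per full ring, %d NAPI polls\n",
		       usec_per_frame, msec_per_ring, ring_size / napi_budget);
		return 0;
	}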
Re: [PATCH net-next 2/2] net/mlx5e: Clamp page_pool size to max
Posted by Jakub Kicinski 1 week, 1 day ago
On Tue, 23 Sep 2025 08:23:10 -0700 Jakub Kicinski wrote:
> On Tue, 23 Sep 2025 15:12:33 +0000 Dragos Tatulea wrote:
> > On Tue, Sep 23, 2025 at 07:23:56AM -0700, Jakub Kicinski wrote:  
> > > Please do some testing. A PP cache of 32k is just silly, you should
> > > probably use a smaller limit.    
> > You mean clamping the pool_size to a certain limit so that the page_pool
> > ring size doesn't cover a full RQ when the RQ ring size is too large?  
> 
> Yes, an 8k ring will take milliseconds to drain. We don't really need
> milliseconds of page cache. By the time the driver has processed the full
> ring we must have gone thru 128 NAPI cycles, and the application
> has most likely already started freeing the pages.
> 
> If my math is right, at 80 Gbps per ring and 9k MTU it takes more than
> 1 usec to receive a frame. So ~8 msec just to _receive_ a full ring's
> worth of data. At Meta we mostly use large rings to cover up scheduler
> and IRQ masking latency.

On second thought, let's just clamp it to 16k in the core and remove
the error. Clearly the expectations of the API are too intricate;
most drivers just use the ring size as the cache size.
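
Roughly something like the following (an untested, illustrative sketch
only; the macro and helper names below are made up and the actual
page_pool core code is structured differently):

	/* Sketch: clamp the requested size in the core instead of
	 * failing queue allocation with -E2BIG.
	 */
	#define PP_RING_SIZE_CLAMP	16384

	static void pp_clamp_pool_size_sketch(struct page_pool_params *p)
	{
		/* Instead of rejecting oversized requests, silently cap
		 * the ptr_ring size and continue with pool creation.
		 */
		if (p->pool_size > PP_RING_SIZE_CLAMP)
			p->pool_size = PP_RING_SIZE_CLAMP;
	}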
Re: [PATCH net-next 2/2] net/mlx5e: Clamp page_pool size to max
Posted by Simon Horman 1 week, 1 day ago
On Mon, Sep 22, 2025 at 12:18:35PM +0300, Tariq Toukan wrote:
> From: Dragos Tatulea <dtatulea@nvidia.com>
> 
> When the user configures a large ring size (8K) and a large MTU (9000)
> in HW-GRO mode, the queue will fail to allocate due to the size of the
> page_pool going above the limit.
> 
> This change clamps the pool_size to the limit.
> 
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 5e007bb3bad1..e56052895776 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -989,6 +989,8 @@ static int mlx5e_alloc_rq(struct mlx5e_params *params,
>  		/* Create a page_pool and register it with rxq */
>  		struct page_pool_params pp_params = { 0 };
>  
> +		pool_size = min_t(u32, pool_size, PAGE_POOL_SIZE_LIMIT);

pool_size is u32 and PAGE_POOL_SIZE_LIMIT is a constant.
AFAIK min() would work just fine here.
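
i.e. presumably just (untested):

		pool_size = min(pool_size, PAGE_POOL_SIZE_LIMIT);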

> +
>  		pp_params.order     = 0;
>  		pp_params.flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
>  		pp_params.pool_size = pool_size;
> -- 
> 2.31.1