[PATCH net v2 1/2] page_pool: fix timing for checking and disabling napi_local

Posted by Yunsheng Lin 2 months ago
A page_pool page may be freed from skb_defer_free_flush() in softirq
context, which may cause a concurrent access problem for the
pool->alloc cache due to the time window below: both CPU0 and CPU1
may access the pool->alloc cache concurrently in
page_pool_empty_alloc_cache_once() and page_pool_recycle_in_cache():

          CPU 0                           CPU1
    page_pool_destroy()          skb_defer_free_flush()
           .                               .
           .                   page_pool_put_unrefed_page()
           .                               .
           .               allow_direct = page_pool_napi_local()
           .                               .
page_pool_disable_direct_recycling()       .
           .                               .
page_pool_empty_alloc_cache_once() page_pool_recycle_in_cache()

Use the RCU mechanism to avoid the above concurrent access problem.

Note: the above was found during code review while working out how to
fix the problem in [1].

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/

Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 net/core/page_pool.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a813d30d2135..bec6e717cd22 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -818,8 +818,17 @@ static bool page_pool_napi_local(const struct page_pool *pool)
 void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
 				  unsigned int dma_sync_size, bool allow_direct)
 {
-	if (!allow_direct)
+	bool allow_direct_orig = allow_direct;
+
+	/* page_pool_put_unrefed_netmem() is not supposed to be called with
+	 * allow_direct being true after page_pool_destroy() has been called,
+	 * so the allow_direct == true case doesn't need synchronization.
+	 */
+	DEBUG_NET_WARN_ON_ONCE(allow_direct && pool->destroy_cnt);
+	if (!allow_direct_orig) {
+		rcu_read_lock();
 		allow_direct = page_pool_napi_local(pool);
+	}
 
 	netmem =
 		__page_pool_put_page(pool, netmem, dma_sync_size, allow_direct);
@@ -828,6 +837,9 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
 		recycle_stat_inc(pool, ring_full);
 		page_pool_return_page(pool, netmem);
 	}
+
+	if (!allow_direct_orig)
+		rcu_read_unlock();
 }
 EXPORT_SYMBOL(page_pool_put_unrefed_netmem);
 
@@ -861,6 +873,7 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 	bool allow_direct;
 	bool in_softirq;
 
+	rcu_read_lock();
 	allow_direct = page_pool_napi_local(pool);
 
 	for (i = 0; i < count; i++) {
@@ -876,8 +889,10 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 			data[bulk_len++] = (__force void *)netmem;
 	}
 
-	if (!bulk_len)
+	if (!bulk_len) {
+		rcu_read_unlock();
 		return;
+	}
 
 	/* Bulk producer into ptr_ring page_pool cache */
 	in_softirq = page_pool_producer_lock(pool);
@@ -892,14 +907,18 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 	page_pool_producer_unlock(pool, in_softirq);
 
 	/* Hopefully all pages was return into ptr_ring */
-	if (likely(i == bulk_len))
+	if (likely(i == bulk_len)) {
+		rcu_read_unlock();
 		return;
+	}
 
 	/* ptr_ring cache full, free remaining pages outside producer lock
 	 * since put_page() with refcnt == 1 can be an expensive operation
 	 */
 	for (; i < bulk_len; i++)
 		page_pool_return_page(pool, (__force netmem_ref)data[i]);
+
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL(page_pool_put_page_bulk);
 
@@ -1121,6 +1140,12 @@ void page_pool_destroy(struct page_pool *pool)
 		return;
 
 	page_pool_disable_direct_recycling(pool);
+
+	/* Wait for the freeing side to see that direct recycling is disabled,
+	 * to avoid concurrent access to the pool->alloc cache.
+	 */
+	synchronize_rcu();
+
 	page_pool_free_frag(pool);
 
 	if (!page_pool_release(pool))
-- 
2.33.0
Re: [PATCH net v2 1/2] page_pool: fix timing for checking and disabling napi_local
Posted by Jakub Kicinski 1 month, 3 weeks ago
On Wed, 25 Sep 2024 15:57:06 +0800 Yunsheng Lin wrote:
> Use rcu mechanism to avoid the above concurrent access problem.
> 
> Note, the above was found during code reviewing on how to fix
> the problem in [1].

The driver must make sure NAPI cannot be running while
page_pool_destroy() is called. There's even a WARN()
checking this.. if you know what to look for.
Re: [PATCH net v2 1/2] page_pool: fix timing for checking and disabling napi_local
Posted by Yunsheng Lin 1 month, 2 weeks ago
On 2024/10/9 8:40, Jakub Kicinski wrote:
> On Wed, 25 Sep 2024 15:57:06 +0800 Yunsheng Lin wrote:
>> Use rcu mechanism to avoid the above concurrent access problem.
>>
>> Note, the above was found during code reviewing on how to fix
>> the problem in [1].
> 
> The driver must make sure NAPI cannot be running while
> page_pool_destroy() is called. There's even a WARN()
> checking this.. if you know what to look for.

I am guessing you are referring to the WARN() in
page_pool_disable_direct_recycling(), right?
If yes, I am aware of that WARN().

The problem is that, at least in the skb_defer_free_flush() case, it
is not tied to any specific napi instance. When
skb_attempt_defer_free() calls kick_defer_list_purge() to trigger a
run of net_rx_action(), skb_defer_free_flush() can be called without
being tied to any specific napi instance, as I understand it:
https://elixir.bootlin.com/linux/v6.7-rc8/source/net/core/dev.c#L6719

Or am I missing something obvious here? I even used the diff below to
verify that, and it did trigger without any napi in the sd->poll_list:

@@ -6313,6 +6313,9 @@ static void skb_defer_free_flush(struct softnet_data *sd)
        spin_unlock(&sd->defer_lock);

        while (skb != NULL) {
+               if (list_empty(&sd->poll_list))
+                       pr_err("defer freeing: %px with empty napi list\n", skb);
+
                next = skb->next;
                napi_consume_skb(skb, 1);
                skb = next;


> 
>
Re: [PATCH net v2 1/2] page_pool: fix timing for checking and disabling napi_local
Posted by Jakub Kicinski 1 month, 2 weeks ago
On Wed, 9 Oct 2024 11:33:02 +0800 Yunsheng Lin wrote:
> Or am I missing something obvious here?

Seemingly the entire logic of how the safety of the lockless caching 
is ensured.

But I don't think you trust my opinion, so I'll let others explain
this to you..
Re: [PATCH net v2 1/2] page_pool: fix timing for checking and disabling napi_local
Posted by Yunsheng Lin 1 month, 2 weeks ago
On 2024/10/9 23:13, Jakub Kicinski wrote:
> On Wed, 9 Oct 2024 11:33:02 +0800 Yunsheng Lin wrote:
>> Or am I missing something obvious here?
> 
> Seemingly the entire logic of how the safety of the lockless caching 
> is ensured.

I looked at it more closely; it seems what you mean is that safety is
ensured by setting napi->list_owner to -1 when disabling NAPI, right?

But letting skb_defer_free_flush() hold on to the napi instance and
check napi->list_owner without synchronizing with page_pool_destroy()
seems a little surprising to me, as page_pool_destroy() may return and
let the driver free the napi. Even though the time window is very
small, that looks like a possible use-after-free problem:

          CPU 0                           CPU1
           .                               .
           .                   skb_defer_free_flush()
           .                               .
           .                  napi = READ_ONCE(pool->p.napi);
           .                               .
page_pool_disable_direct_recycling()       .
    driver free napi memory                .
           .                               .
           .              napi && READ_ONCE(napi->list_owner) == cpuid
           .                               .

> 
> But I don't think you trust my opinion, so I'll let others explain
> this to you..
Re: [PATCH net v2 1/2] page_pool: fix timing for checking and disabling napi_local
Posted by Paolo Abeni 1 month, 4 weeks ago
On 9/25/24 09:57, Yunsheng Lin wrote:
> page_pool page may be freed from skb_defer_free_flush() to
> softirq context, it may cause concurrent access problem for
> pool->alloc cache due to the below time window, as below,
> both CPU0 and CPU1 may access the pool->alloc cache
> concurrently in page_pool_empty_alloc_cache_once() and
> page_pool_recycle_in_cache():
> 
>            CPU 0                           CPU1
>      page_pool_destroy()          skb_defer_free_flush()
>             .                               .
>             .                   page_pool_put_unrefed_page()
>             .                               .
>             .               allow_direct = page_pool_napi_local()
>             .                               .
> page_pool_disable_direct_recycling()       .
>             .                               .
> page_pool_empty_alloc_cache_once() page_pool_recycle_in_cache()
> 
> Use rcu mechanism to avoid the above concurrent access problem.
> 
> Note, the above was found during code reviewing on how to fix
> the problem in [1].
> 
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
> 
> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> CC: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>   net/core/page_pool.c | 31 ++++++++++++++++++++++++++++---
>   1 file changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a813d30d2135..bec6e717cd22 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -818,8 +818,17 @@ static bool page_pool_napi_local(const struct page_pool *pool)
>   void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>   				  unsigned int dma_sync_size, bool allow_direct)
>   {
> -	if (!allow_direct)
> +	bool allow_direct_orig = allow_direct;
> +
> +	/* page_pool_put_unrefed_netmem() is not supposed to be called with
> +	 * allow_direct being true after page_pool_destroy() is called, so
> +	 * the allow_direct being true case doesn't need synchronization.
> +	 */
> +	DEBUG_NET_WARN_ON_ONCE(allow_direct && pool->destroy_cnt);
> +	if (!allow_direct_orig) {
> +		rcu_read_lock();
>   		allow_direct = page_pool_napi_local(pool);
> +	}
>   
>   	netmem =
>   		__page_pool_put_page(pool, netmem, dma_sync_size, allow_direct);
> @@ -828,6 +837,9 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>   		recycle_stat_inc(pool, ring_full);
>   		page_pool_return_page(pool, netmem);
>   	}
> +
> +	if (!allow_direct_orig)
> +		rcu_read_unlock();

What about always acquiring the rcu lock? Would that impact
performance negatively?

If not, I think it's preferable, as it would make the static checker
happy.

>   }
>   EXPORT_SYMBOL(page_pool_put_unrefed_netmem);
>   

[...]

> @@ -1121,6 +1140,12 @@ void page_pool_destroy(struct page_pool *pool)
>   		return;
>   
>   	page_pool_disable_direct_recycling(pool);
> +
> +	/* Wait for the freeing side see the disabling direct recycling setting
> +	 * to avoid the concurrent access to the pool->alloc cache.
> +	 */
> +	synchronize_rcu();

When turning a device with a lot of queues on or off, the above could
introduce a lot of long waits under the RTNL lock, right?

What about moving the tail of this function into a separate helper and
using call_rcu() instead?

Thanks!

Paolo


> +
>   	page_pool_free_frag(pool);
>   
>   	if (!page_pool_release(pool))
Re: [PATCH net v2 1/2] page_pool: fix timing for checking and disabling napi_local
Posted by Yunsheng Lin 1 month, 4 weeks ago
On 10/1/2024 7:30 PM, Paolo Abeni wrote:

...

>> @@ -828,6 +837,9 @@ void page_pool_put_unrefed_netmem(struct page_pool 
>> *pool, netmem_ref netmem,
>>           recycle_stat_inc(pool, ring_full);
>>           page_pool_return_page(pool, netmem);
>>       }
>> +
>> +    if (!allow_direct_orig)
>> +        rcu_read_unlock();
> 
> What about always acquiring the rcu lock? would that impact performances 
> negatively?
> 
> If not, I think it's preferable, as it would make static checker happy.

As mentioned in the cover letter, the overhead is about 2ns.
I guess it is the 'if' check before rcu_read_unlock() that the static
checker is not happy about; there is also an 'if' check before the
'destroy_lock' introduced in patch 2. Maybe '__cond_acquires' can be
used to make the static checker happy?

> 
>>   }
>>   EXPORT_SYMBOL(page_pool_put_unrefed_netmem);
> 
> [...]
> 
>> @@ -1121,6 +1140,12 @@ void page_pool_destroy(struct page_pool *pool)
>>           return;
>>       page_pool_disable_direct_recycling(pool);
>> +
>> +    /* Wait for the freeing side see the disabling direct recycling 
>> setting
>> +     * to avoid the concurrent access to the pool->alloc cache.
>> +     */
>> +    synchronize_rcu();
> 
> When turning on/off a device with a lot of queues, the above could 
> introduce a lot of long waits under the RTNL lock, right?
> 
> What about moving the trailing of this function in a separate helper and 
> use call_rcu() instead?

For this patch, yes, it can be done.
But patch 2 also relies on the RCU lock in this patch to ensure that
the free side is synchronized with the destroy path, and the dma
unmapping of the inflight pages in page_pool_item_uninit() cannot be
done in call_rcu(), as the driver might have been unbound by the time
the RCU callback is called, which would defeat the purpose of patch 2.

Maybe an optimization here is to only call synchronize_rcu() when there
are some inflight pages and pool->dma_map is set.

Re: [PATCH net v2 1/2] page_pool: fix timing for checking and disabling napi_local
Posted by Joe Damato 2 months ago
On Wed, Sep 25, 2024 at 03:57:06PM +0800, Yunsheng Lin wrote:
> page_pool page may be freed from skb_defer_free_flush() to
> softirq context, it may cause concurrent access problem for
> pool->alloc cache due to the below time window, as below,
> both CPU0 and CPU1 may access the pool->alloc cache
> concurrently in page_pool_empty_alloc_cache_once() and
> page_pool_recycle_in_cache():
> 
>           CPU 0                           CPU1
>     page_pool_destroy()          skb_defer_free_flush()
>            .                               .
>            .                   page_pool_put_unrefed_page()
>            .                               .
>            .               allow_direct = page_pool_napi_local()
>            .                               .
> page_pool_disable_direct_recycling()       .
>            .                               .
> page_pool_empty_alloc_cache_once() page_pool_recycle_in_cache()
> 
> Use rcu mechanism to avoid the above concurrent access problem.
> 
> Note, the above was found during code reviewing on how to fix
> the problem in [1].
> 
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
> 
> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> CC: Alexander Lobakin <aleksander.lobakin@intel.com>

Sorry for the noise, but I hit an assert in page_pool_unref_netmem
and I am trying to figure out whether it is related to what you all
are debugging. I thought it might be, but if not, my apologies.

Just in case it is, I've put the backtrace on github [1]. I
triggered this while testing an RFC [2] I've been working on. Please
note, the RFC posted publicly does not currently apply cleanly to
net-next and has some bugs I've fixed in my v4. I had planned to
send the v4 early next week and mention the page pool issue I am
hitting.

After triggering the assert in [1], I tried applying the patches of
this series and retesting the RFC v4 I have queued locally. When I
did that, I hit a new assertion in page_pool_destroy [3].

There are a few possibilities:
   1. I am hitting the same issue you are hitting
   2. I am hitting a different issue caused by a bug I introduced
   3. I am hitting a different page pool issue entirely

In case of 2 and 3, my apologies for the noise.

In case of 1: If you think I am hitting the same issue as you are
trying to solve, I can reliably reproduce this with my RFC v4 and
would be happy to test any patches meant to fix the issue.

[1]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning1.txt
[2]: https://lore.kernel.org/all/20240912100738.16567-1-jdamato@fastly.com/#r
[3]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning2.txt
Re: [PATCH net v2 1/2] page_pool: fix timing for checking and disabling napi_local
Posted by Yunsheng Lin 2 months ago
On 2024/9/27 4:06, Joe Damato wrote:
> On Wed, Sep 25, 2024 at 03:57:06PM +0800, Yunsheng Lin wrote:
>> page_pool page may be freed from skb_defer_free_flush() to
>> softirq context, it may cause concurrent access problem for
>> pool->alloc cache due to the below time window, as below,
>> both CPU0 and CPU1 may access the pool->alloc cache
>> concurrently in page_pool_empty_alloc_cache_once() and
>> page_pool_recycle_in_cache():
>>
>>           CPU 0                           CPU1
>>     page_pool_destroy()          skb_defer_free_flush()
>>            .                               .
>>            .                   page_pool_put_unrefed_page()
>>            .                               .
>>            .               allow_direct = page_pool_napi_local()
>>            .                               .
>> page_pool_disable_direct_recycling()       .
>>            .                               .
>> page_pool_empty_alloc_cache_once() page_pool_recycle_in_cache()
>>
>> Use rcu mechanism to avoid the above concurrent access problem.
>>
>> Note, the above was found during code reviewing on how to fix
>> the problem in [1].
>>
>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>>
>> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> CC: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
> Sorry for the noise, but I hit an assert in page_pool_unref_netmem
> and I am trying to figure out if it is related to what you all are
> debugging? I thought it might be, but if not, my apologies.
> 
> Just in case it is, I've put the backtrace on github [1]. I
> triggered this while testing an RFC [2] I've been working on. Please
> note, the RFC posted publicly does not currently apply cleanly to
> net-next and has some bugs I've fixed in my v4. I had planned to
> send the v4 early next week and mention the page pool issue I am
> hitting.
> 
> After triggering the assert in [1], I tried applying the patches of
> this series and retesting the RFC v4 I have queued locally. When I
> did that, I hit a new assertion page_pool_destroy [3].
> 
> There are a few possibilities:
>    1. I am hitting the same issue you are hitting
>    2. I am hitting a different issue caused by a bug I introduced
>    3. I am hitting a different page pool issue entirely
> 
> In case of 2 and 3, my apologies for the noise.
> 
> In case of 1: If you think I am hitting the same issue as you are
> trying to solve, I can reliably reproduce this with my RFC v4 and
> would be happy to test any patches meant to fix the issue.
> 
> [1]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning1.txt

It seems there may be some concurrent access of page->pp_ref_count /
page_pool->frag_users, or imbalanced incrementing and decrementing of
page->pp_ref_count.

For the concurrent access part, you may refer to the patch below for
debugging, but as I recall the mlx5 driver doesn't use the frag API
directly, so you can't use it as-is.

https://lore.kernel.org/all/20240903122208.3379182-1-linyunsheng@huawei.com/

> [2]: https://lore.kernel.org/all/20240912100738.16567-1-jdamato@fastly.com/#r
> [3]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning2.txt

That warning is only added to see if there are any inflight pages that
need dma unmapping when page_pool_destroy() is called; you can ignore
that warning or remove the WARN_ONCE() in page_pool_item_uninit() when
testing.

>