A page_pool page may be freed from skb_defer_free_flush() in
softirq context without being bound to any specific napi. This
may cause a use-after-free problem due to the time window shown
below, where CPU1 may still access napi->list_owner after CPU0
has freed the napi memory:
           CPU 0                                       CPU1
    page_pool_destroy()                       skb_defer_free_flush()
           .                                            .
           .                            napi = READ_ONCE(pool->p.napi);
           .                                            .
page_pool_disable_direct_recycling()                    .
    driver free napi memory                             .
           .                                            .
           .                  napi && READ_ONCE(napi->list_owner) == cpuid
           .                                            .
Use the RCU mechanism to avoid the above problem.

Note: the above was found during code review of how to fix the
problem in [1].

As the following IOMMU fix patch depends on the synchronize_rcu()
added in this patch, and the time window is so small that this does
not seem to be an urgent fix, target net-next as the IOMMU fix
patch does.
1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
net/core/page_pool.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 9733206d6406..1aa7b93bdcc8 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -799,6 +799,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
 static bool page_pool_napi_local(const struct page_pool *pool)
 {
 	const struct napi_struct *napi;
+	bool napi_local;
 	u32 cpuid;
 
 	if (unlikely(!in_softirq()))
@@ -814,9 +815,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
 	if (READ_ONCE(pool->cpuid) == cpuid)
 		return true;
 
+	/* Synchronized with page_pool_destroy() to avoid use-after-free
+	 * for 'napi'.
+	 */
+	rcu_read_lock();
 	napi = READ_ONCE(pool->p.napi);
+	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
+	rcu_read_unlock();
 
-	return napi && READ_ONCE(napi->list_owner) == cpuid;
+	return napi_local;
 }
 
 void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
@@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;
 
+	/* Paired with the rcu lock in page_pool_napi_local() to ensure the
+	 * clearing of pool->p.napi in page_pool_disable_direct_recycling() is
+	 * seen before returning to the driver that frees the napi instance.
+	 */
+	synchronize_rcu();
+
 	page_pool_detached(pool);
 	pool->defer_start = jiffies;
 	pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
--
2.33.0
Yunsheng Lin <linyunsheng@huawei.com> writes:
> A page_pool page may be freed from skb_defer_free_flush() in
> softirq context without being bound to any specific napi. This
> may cause a use-after-free problem due to the time window shown
> below, where CPU1 may still access napi->list_owner after CPU0
> has freed the napi memory:
>
>            CPU 0                                       CPU1
>     page_pool_destroy()                       skb_defer_free_flush()
>            .                                            .
>            .                            napi = READ_ONCE(pool->p.napi);
>            .                                            .
> page_pool_disable_direct_recycling()                    .
>     driver free napi memory                             .
>            .                                            .
>            .                  napi && READ_ONCE(napi->list_owner) == cpuid
>            .                                            .
Have you actually observed this happen, or are you just speculating?
Because I don't think it can; deleting a NAPI instance already requires
observing an RCU grace period, cf netdevice.h:
/**
 * __netif_napi_del - remove a NAPI context
 * @napi: NAPI context
 *
 * Warning: caller must observe RCU grace period before freeing memory
 * containing @napi. Drivers might want to call this helper to combine
 * all the needed RCU grace periods into a single one.
 */
void __netif_napi_del(struct napi_struct *napi);

/**
 * netif_napi_del - remove a NAPI context
 * @napi: NAPI context
 *
 * netif_napi_del() removes a NAPI context from the network device NAPI list
 */
static inline void netif_napi_del(struct napi_struct *napi)
{
	__netif_napi_del(napi);
	synchronize_net();
}
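As that comment notes, a driver tearing down many NAPI instances can
batch the grace periods itself with __netif_napi_del(); a minimal sketch
(the rings[] array and its fields are hypothetical driver state):

	for (i = 0; i < n; i++)
		__netif_napi_del(&rings[i].napi);

	synchronize_net();	/* one grace period covers all instances */

	for (i = 0; i < n; i++)
		kfree(rings[i].napi_mem);	/* hypothetical napi backing memory */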
> Use the RCU mechanism to avoid the above problem.
>
> Note: the above was found during code review of how to fix the
> problem in [1].
>
> As the following IOMMU fix patch depends on the synchronize_rcu()
> added in this patch, and the time window is so small that this does
> not seem to be an urgent fix, target net-next as the IOMMU fix
> patch does.
>
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>
> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> CC: Alexander Lobakin <aleksander.lobakin@intel.com>
> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
> net/core/page_pool.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 9733206d6406..1aa7b93bdcc8 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -799,6 +799,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>  static bool page_pool_napi_local(const struct page_pool *pool)
>  {
>  	const struct napi_struct *napi;
> +	bool napi_local;
>  	u32 cpuid;
>
>  	if (unlikely(!in_softirq()))
> @@ -814,9 +815,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
>  	if (READ_ONCE(pool->cpuid) == cpuid)
>  		return true;
>
> +	/* Synchronized with page_pool_destroy() to avoid use-after-free
> +	 * for 'napi'.
> +	 */
> +	rcu_read_lock();
>  	napi = READ_ONCE(pool->p.napi);
> +	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
> +	rcu_read_unlock();
This rcu_read_lock/unlock() pair is redundant in the context you mention
above, since skb_defer_free_flush() is only ever called from softirq
context (within local_bh_disable()), which already functions as an RCU
read lock.
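An illustrative sketch of that equivalence (not actual page_pool code;
since the RCU flavour consolidation, a BH-disabled region counts as an
RCU read-side critical section that synchronize_rcu() waits for):

	local_bh_disable();
	napi = READ_ONCE(pool->p.napi);	/* as safe as under rcu_read_lock() */
	local_bh_enable();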
> -	return napi && READ_ONCE(napi->list_owner) == cpuid;
> +	return napi_local;
>  }
>
> void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>  	if (!page_pool_release(pool))
>  		return;
>
> +	/* Paired with the rcu lock in page_pool_napi_local() to ensure the
> +	 * clearing of pool->p.napi in page_pool_disable_direct_recycling() is
> +	 * seen before returning to the driver that frees the napi instance.
> +	 */
> +	synchronize_rcu();
Most drivers call page_pool_destroy() in a loop for each RX queue, so
now you're introducing a full synchronize_rcu() wait for each queue.
That can delay tearing down the device significantly, so I don't think
this is a good idea.
-Toke
On 1/10/2025 11:40 PM, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
>
>> A page_pool page may be freed from skb_defer_free_flush() in
>> softirq context without being bound to any specific napi. This
>> may cause a use-after-free problem due to the time window shown
>> below, where CPU1 may still access napi->list_owner after CPU0
>> has freed the napi memory:
>>
>>            CPU 0                                       CPU1
>>     page_pool_destroy()                       skb_defer_free_flush()
>>            .                                            .
>>            .                            napi = READ_ONCE(pool->p.napi);
>>            .                                            .
>> page_pool_disable_direct_recycling()                    .
>>     driver free napi memory                             .
>>            .                                            .
>>            .                  napi && READ_ONCE(napi->list_owner) == cpuid
>>            .                                            .
>
> Have you actually observed this happen, or are you just speculating?
I did not actually observe it happen, but I added some delay and
pr_err() debugging code in page_pool_napi_local()/page_pool_destroy(),
and modified the page_pool test module in [1] to show that the race is
indeed possible if the delay between reading napi and checking
napi->list_owner is long enough.
1. https://patchwork.kernel.org/project/netdevbpf/patch/20240909091913.987826-1-linyunsheng@huawei.com/
> Because I don't think it can; deleting a NAPI instance already requires
> observing an RCU grace period, cf netdevice.h:
>
> /**
>  * __netif_napi_del - remove a NAPI context
>  * @napi: NAPI context
>  *
>  * Warning: caller must observe RCU grace period before freeing memory
>  * containing @napi. Drivers might want to call this helper to combine
>  * all the needed RCU grace periods into a single one.
>  */
> void __netif_napi_del(struct napi_struct *napi);
>
> /**
>  * netif_napi_del - remove a NAPI context
>  * @napi: NAPI context
>  *
>  * netif_napi_del() removes a NAPI context from the network device NAPI list
>  */
> static inline void netif_napi_del(struct napi_struct *napi)
> {
> 	__netif_napi_del(napi);
> 	synchronize_net();
> }
I am not sure we can reliably depend on the implicit synchronize_net()
above, as netif_napi_del() might not be called before page_pool_destroy()
at all; for example, when changing the rx_desc_num for a queue, which
seems to be the case in hns3_set_ringparam() for the hns3 driver.
>
>
>> Use the RCU mechanism to avoid the above problem.
>>
>> Note: the above was found during code review of how to fix the
>> problem in [1].
>>
>> As the following IOMMU fix patch depends on the synchronize_rcu()
>> added in this patch, and the time window is so small that this does
>> not seem to be an urgent fix, target net-next as the IOMMU fix
>> patch does.
>>
>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>>
>> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> CC: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> ---
>> net/core/page_pool.c | 15 ++++++++++++++-
>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 9733206d6406..1aa7b93bdcc8 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -799,6 +799,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>>  static bool page_pool_napi_local(const struct page_pool *pool)
>>  {
>>  	const struct napi_struct *napi;
>> +	bool napi_local;
>>  	u32 cpuid;
>>
>>  	if (unlikely(!in_softirq()))
>> @@ -814,9 +815,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
>>  	if (READ_ONCE(pool->cpuid) == cpuid)
>>  		return true;
>>
>> +	/* Synchronized with page_pool_destroy() to avoid use-after-free
>> +	 * for 'napi'.
>> +	 */
>> +	rcu_read_lock();
>>  	napi = READ_ONCE(pool->p.napi);
>> +	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
>> +	rcu_read_unlock();
>
> This rcu_read_lock/unlock() pair is redundant in the context you mention
> above, since skb_defer_free_flush() is only ever called from softirq
> context (within local_bh_disable()), which already functions as an RCU
> read lock.
I thought about it, but I am not sure if we need an explicit rcu lock
for different kernel PREEMPT and RCU configs.
Perhaps use rcu_read_lock_bh_held() to ensure that we are in the
correct context?
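For example, something like the following could assert the context
requirement without taking the lock (a sketch; the exact placement in
page_pool_napi_local() is only illustrative):

	RCU_LOCKDEP_WARN(!rcu_read_lock_bh_held(),
			 "page_pool_napi_local() called outside BH context");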
>
>> -	return napi && READ_ONCE(napi->list_owner) == cpuid;
>> +	return napi_local;
>>  }
>>
>> void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>>  	if (!page_pool_release(pool))
>>  		return;
>>
>> +	/* Paired with the rcu lock in page_pool_napi_local() to ensure the
>> +	 * clearing of pool->p.napi in page_pool_disable_direct_recycling() is
>> +	 * seen before returning to the driver that frees the napi instance.
>> +	 */
>> +	synchronize_rcu();
>
> Most drivers call page_pool_destroy() in a loop for each RX queue, so
> now you're introducing a full synchronize_rcu() wait for each queue.
> That can delay tearing down the device significantly, so I don't think
> this is a good idea.
synchronize_rcu() is called after page_pool_release(pool), which means
it is only called when there are some inflight pages, so there is not
necessarily a full synchronize_rcu() wait for each queue.
Anyway, it seems that there are some cases that need an explicit
synchronize_rcu() and some cases depending on the other API providing
synchronize_rcu() semantics; maybe we could provide two different APIs
for both cases, like the netif_napi_del()/__netif_napi_del() APIs do?
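Such a split, mirroring __netif_napi_del()/netif_napi_del(), might look
like the following (purely hypothetical names; neither function exists
upstream):

	/* Hypothetical: caller must itself observe an RCU grace period
	 * before freeing the napi memory referenced by pool->p.napi.
	 */
	void page_pool_destroy_norcu(struct page_pool *pool);

	static inline void page_pool_destroy_rcu(struct page_pool *pool)
	{
		page_pool_destroy_norcu(pool);
		synchronize_rcu();
	}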
Yunsheng Lin <yunshenglin0825@gmail.com> writes:
> On 1/10/2025 11:40 PM, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <linyunsheng@huawei.com> writes:
>>
>>> A page_pool page may be freed from skb_defer_free_flush() in
>>> softirq context without being bound to any specific napi. This
>>> may cause a use-after-free problem due to the time window shown
>>> below, where CPU1 may still access napi->list_owner after CPU0
>>> has freed the napi memory:
>>>
>>>            CPU 0                                       CPU1
>>>     page_pool_destroy()                       skb_defer_free_flush()
>>>            .                                            .
>>>            .                            napi = READ_ONCE(pool->p.napi);
>>>            .                                            .
>>> page_pool_disable_direct_recycling()                    .
>>>     driver free napi memory                             .
>>>            .                                            .
>>>            .                  napi && READ_ONCE(napi->list_owner) == cpuid
>>>            .                                            .
>>
>> Have you actually observed this happen, or are you just speculating?
>
> I did not actually observe it happen, but I added some delay and
> pr_err() debugging code in page_pool_napi_local()/page_pool_destroy(),
> and modified the page_pool test module in [1] to show that the race is
> indeed possible if the delay between reading napi and checking
> napi->list_owner is long enough.
>
> 1. https://patchwork.kernel.org/project/netdevbpf/patch/20240909091913.987826-1-linyunsheng@huawei.com/
Right, I wasn't contesting whether it's possible to trigger this race by
calling those two functions directly in some fashion. I was asking
whether there are any drivers that use the API in a way that this race
can happen; because I would consider any such driver buggy, and we
should fix this rather than adding more cruft to the page_pool API. See
below.
>> Because I don't think it can; deleting a NAPI instance already requires
>> observing an RCU grace period, cf netdevice.h:
>>
>> /**
>>  * __netif_napi_del - remove a NAPI context
>>  * @napi: NAPI context
>>  *
>>  * Warning: caller must observe RCU grace period before freeing memory
>>  * containing @napi. Drivers might want to call this helper to combine
>>  * all the needed RCU grace periods into a single one.
>>  */
>> void __netif_napi_del(struct napi_struct *napi);
>>
>> /**
>>  * netif_napi_del - remove a NAPI context
>>  * @napi: NAPI context
>>  *
>>  * netif_napi_del() removes a NAPI context from the network device NAPI list
>>  */
>> static inline void netif_napi_del(struct napi_struct *napi)
>> {
>> 	__netif_napi_del(napi);
>> 	synchronize_net();
>> }
>
> I am not sure we can reliably depend on the implicit synchronize_net()
> above, as netif_napi_del() might not be called before page_pool_destroy()
> at all; for example, when changing the rx_desc_num for a queue, which
> seems to be the case in hns3_set_ringparam() for the hns3 driver.
The hns3 driver doesn't use pp->napi at all AFAICT, so that's hardly
relevant.
>>
>>
>>> Use the RCU mechanism to avoid the above problem.
>>>
>>> Note: the above was found during code review of how to fix the
>>> problem in [1].
>>>
>>> As the following IOMMU fix patch depends on the synchronize_rcu()
>>> added in this patch, and the time window is so small that this does
>>> not seem to be an urgent fix, target net-next as the IOMMU fix
>>> patch does.
>>>
>>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>>>
>>> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>> CC: Alexander Lobakin <aleksander.lobakin@intel.com>
>>> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>> ---
>>> net/core/page_pool.c | 15 ++++++++++++++-
>>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>> index 9733206d6406..1aa7b93bdcc8 100644
>>> --- a/net/core/page_pool.c
>>> +++ b/net/core/page_pool.c
>>> @@ -799,6 +799,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>>>  static bool page_pool_napi_local(const struct page_pool *pool)
>>>  {
>>>  	const struct napi_struct *napi;
>>> +	bool napi_local;
>>>  	u32 cpuid;
>>>
>>>  	if (unlikely(!in_softirq()))
>>> @@ -814,9 +815,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
>>>  	if (READ_ONCE(pool->cpuid) == cpuid)
>>>  		return true;
>>>
>>> +	/* Synchronized with page_pool_destroy() to avoid use-after-free
>>> +	 * for 'napi'.
>>> +	 */
>>> +	rcu_read_lock();
>>>  	napi = READ_ONCE(pool->p.napi);
>>> +	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
>>> +	rcu_read_unlock();
>>
>> This rcu_read_lock/unlock() pair is redundant in the context you mention
>> above, since skb_defer_free_flush() is only ever called from softirq
>> context (within local_bh_disable()), which already functions as an RCU
>> read lock.
>
> I thought about it, but I am not sure if we need an explicit rcu lock
> for different kernel PREEMPT and RCU configs.
> Perhaps use rcu_read_lock_bh_held() to ensure that we are in the
> correct context?
page_pool_napi_local() returns immediately if in_softirq() returns
false. So the rcu_read_lock() is definitely not needed.
>>
>>> -	return napi && READ_ONCE(napi->list_owner) == cpuid;
>>> +	return napi_local;
>>>  }
>>>
>>> void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>>> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>>>  	if (!page_pool_release(pool))
>>>  		return;
>>>
>>> +	/* Paired with the rcu lock in page_pool_napi_local() to ensure the
>>> +	 * clearing of pool->p.napi in page_pool_disable_direct_recycling() is
>>> +	 * seen before returning to the driver that frees the napi instance.
>>> +	 */
>>> +	synchronize_rcu();
>>
>> Most drivers call page_pool_destroy() in a loop for each RX queue, so
>> now you're introducing a full synchronize_rcu() wait for each queue.
>> That can delay tearing down the device significantly, so I don't think
>> this is a good idea.
>
> synchronize_rcu() is called after page_pool_release(pool), which means
> it is only called when there are some inflight pages, so there is not
> necessarily a full synchronize_rcu() wait for each queue.
>
> Anyway, it seems that there are some cases that need an explicit
> synchronize_rcu() and some cases depending on the other API providing
> synchronize_rcu() semantics; maybe we could provide two different APIs
> for both cases, like the netif_napi_del()/__netif_napi_del() APIs do?
I don't think so. This race can only be triggered if:
- An skb is allocated from a page_pool with a napi instance attached
- That skb is freed *in softirq context* while the memory backing the
NAPI instance is being freed.
It's only valid to free a napi instance after calling netif_napi_del(),
which does a full synchronize_rcu(). This means that any running
softirqs will have exited at this point, and all packets will have been
flushed from the deferred freeing queues. And since the NAPI has been
stopped at this point, no new packets can enter the deferred freeing
queue from that NAPI instance.
So I really don't see a way for this race to happen with correct usage
of the page_pool and NAPI APIs, which means there's no reason to make
the change you are proposing here.
-Toke
On 2025/1/20 19:24, Toke Høiland-Jørgensen wrote:
...
>>>>
>>>> void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>>>> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>>>>  	if (!page_pool_release(pool))
>>>>  		return;
>>>>
>>>> +	/* Paired with the rcu lock in page_pool_napi_local() to ensure the
>>>> +	 * clearing of pool->p.napi in page_pool_disable_direct_recycling() is
>>>> +	 * seen before returning to the driver that frees the napi instance.
>>>> +	 */
>>>> +	synchronize_rcu();
>>>
>>> Most drivers call page_pool_destroy() in a loop for each RX queue, so
>>> now you're introducing a full synchronize_rcu() wait for each queue.
>>> That can delay tearing down the device significantly, so I don't think
>>> this is a good idea.
>>
>> synchronize_rcu() is called after page_pool_release(pool), which means
>> it is only called when there are some inflight pages, so there is not
>> necessarily a full synchronize_rcu() wait for each queue.
>>
>> Anyway, it seems that there are some cases that need an explicit
>> synchronize_rcu() and some cases depending on the other API providing
>> synchronize_rcu() semantics; maybe we could provide two different APIs
>> for both cases, like the netif_napi_del()/__netif_napi_del() APIs do?
>
> I don't think so. This race can only be triggered if:
>
> - An skb is allocated from a page_pool with a napi instance attached
>
> - That skb is freed *in softirq context* while the memory backing the
> NAPI instance is being freed.
>
> It's only valid to free a napi instance after calling netif_napi_del(),
> which does a full synchronize_rcu(). This means that any running
> softirqs will have exited at this point, and all packets will have been
> flushed from the deferred freeing queues. And since the NAPI has been
> stopped at this point, no new packets can enter the deferred freeing
> queue from that NAPI instance.
Note that skb_defer_free_flush() can be called without being bound to
any NAPI instance, see the skb_defer_free_flush() called by net_rx_action(),
which means packets from that NAPI instance can still be freed in
softirq context even when the NAPI has been stopped.
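A simplified paraphrase of that path (based on net/core/dev.c; details
may differ between kernel versions):

	static void skb_defer_free_flush(struct softnet_data *sd)
	{
		struct sk_buff *skb, *next;

		spin_lock(&sd->defer_lock);
		skb = sd->defer_list;	/* skbs queued for free by remote CPUs */
		sd->defer_list = NULL;
		sd->defer_count = 0;
		spin_unlock(&sd->defer_lock);

		while (skb != NULL) {
			next = skb->next;
			/* may end up in page_pool_napi_local() */
			napi_consume_skb(skb, 1);
			skb = next;
		}
	}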
>
> So I really don't see a way for this race to happen with correct usage
> of the page_pool and NAPI APIs, which means there's no reason to make
> the change you are proposing here.
I looked at one driver setting pp->napi; it seems the bnxt driver
doesn't call page_pool_disable_direct_recycling() when unloading, see
bnxt_half_close_nic(); page_pool_disable_direct_recycling() seems to be
only called for the new queue_mgmt API:
/* rtnl_lock held, this call can only be made after a previous successful
 * call to bnxt_half_open_nic().
 */
void bnxt_half_close_nic(struct bnxt *bp)
{
	bnxt_hwrm_resource_free(bp, false, true);
	bnxt_del_napi(bp);		/* napi del and rcu sync */
	bnxt_free_skbs(bp);
	bnxt_free_mem(bp, true);	/* calls page_pool_destroy() */
	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
}
Even if a page_pool_disable_direct_recycling() were called between
bnxt_del_napi() and bnxt_free_mem(), the timing window would still
exist, as the rcu sync needs to be called after
page_pool_disable_direct_recycling(); it seems some refactoring is
needed in the bnxt driver to reuse the rcu sync from the NAPI API, in
order to avoid calling the rcu sync for page_pool_destroy().
>
> -Toke
>
>
Yunsheng Lin <linyunsheng@huawei.com> writes:
>> So I really don't see a way for this race to happen with correct usage
>> of the page_pool and NAPI APIs, which means there's no reason to make
>> the change you are proposing here.
>
> I looked at one driver setting pp->napi; it seems the bnxt driver
> doesn't call page_pool_disable_direct_recycling() when unloading, see
> bnxt_half_close_nic(); page_pool_disable_direct_recycling() seems to be
> only called for the new queue_mgmt API:
>
> /* rtnl_lock held, this call can only be made after a previous successful
>  * call to bnxt_half_open_nic().
>  */
> void bnxt_half_close_nic(struct bnxt *bp)
> {
> 	bnxt_hwrm_resource_free(bp, false, true);
> 	bnxt_del_napi(bp);		/* napi del and rcu sync */
> 	bnxt_free_skbs(bp);
> 	bnxt_free_mem(bp, true);	/* calls page_pool_destroy() */
> 	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
> }
>
> Even if a page_pool_disable_direct_recycling() were called between
> bnxt_del_napi() and bnxt_free_mem(), the timing window would still
> exist, as the rcu sync needs to be called after
> page_pool_disable_direct_recycling(); it seems some refactoring is
> needed in the bnxt driver to reuse the rcu sync from the NAPI API, in
> order to avoid calling the rcu sync for page_pool_destroy().
Well, I would consider that usage buggy. A page pool object is created
with a reference to the napi struct; so the page pool should also be
destroyed (clearing its reference) before the napi memory is freed. I
guess this is not really documented anywhere, but it's pretty standard
practice to free objects in the opposite order of their creation.
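In other words, a teardown order like the following avoids the race
entirely (an illustrative sketch; 'ring' and its fields are hypothetical
driver state):

	netif_napi_del(&ring->napi);	/* includes an RCU grace period */
	page_pool_destroy(ring->pool);	/* drops the pool's napi reference */
	kfree(ring);			/* napi memory is freed only after both */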
So no, I don't think this is something that should be fixed on the page
pool side (and certainly not by adding another synchronize_rcu() call
per queue!); rather, we should fix the drivers that get this wrong (and
probably document the requirement a bit better).
-Toke
On 1/25/2025 1:13 AM, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
>
>>> So I really don't see a way for this race to happen with correct usage
>>> of the page_pool and NAPI APIs, which means there's no reason to make
>>> the change you are proposing here.
>>
>> I looked at one driver setting pp->napi; it seems the bnxt driver
>> doesn't call page_pool_disable_direct_recycling() when unloading, see
>> bnxt_half_close_nic(); page_pool_disable_direct_recycling() seems to be
>> only called for the new queue_mgmt API:
>>
>> /* rtnl_lock held, this call can only be made after a previous successful
>>  * call to bnxt_half_open_nic().
>>  */
>> void bnxt_half_close_nic(struct bnxt *bp)
>> {
>> 	bnxt_hwrm_resource_free(bp, false, true);
>> 	bnxt_del_napi(bp);		/* napi del and rcu sync */
>> 	bnxt_free_skbs(bp);
>> 	bnxt_free_mem(bp, true);	/* calls page_pool_destroy() */
>> 	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
>> }
>>
>> Even if a page_pool_disable_direct_recycling() were called between
>> bnxt_del_napi() and bnxt_free_mem(), the timing window would still
>> exist, as the rcu sync needs to be called after
>> page_pool_disable_direct_recycling(); it seems some refactoring is
>> needed in the bnxt driver to reuse the rcu sync from the NAPI API, in
>> order to avoid calling the rcu sync for page_pool_destroy().
>
> Well, I would consider that usage buggy. A page pool object is created
> with a reference to the napi struct; so the page pool should also be
> destroyed (clearing its reference) before the napi memory is freed. I
> guess this is not really documented anywhere, but it's pretty standard
> practice to free objects in the opposite order of their creation.
I am not so familiar with the rules about the NAPI creation API, but the
bnxt implementation can take a reference to 'struct napi' before even
calling netif_napi_add(), see below:
static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
{
	.......
	rc = bnxt_alloc_mem(bp, irq_re_init);	/* creates page_pool */
	if (rc) {
		netdev_err(bp->dev, "bnxt_alloc_mem err: %x\n", rc);
		goto open_err_free_mem;
	}

	if (irq_re_init) {
		bnxt_init_napi(bp);		/* netif_napi_add */
		rc = bnxt_request_irq(bp);
		if (rc) {
			netdev_err(bp->dev, "bnxt_request_irq err: %x\n", rc);
			goto open_err_irq;
		}
	}
	.....
}
>
> So no, I don't think this is something that should be fixed on the page
> pool side (and certainly not by adding another synchronize_rcu() call
> per queue!); rather, we should fix the drivers that get this wrong (and
> probably document the requirement a bit better).
Even if the timing problem of checking and disabling napi_local should
not be fixed on the page_pool side, do we have some common understanding
about fixing the DMA API misuse problem on the page_pool side?

If yes, do we have some common understanding that some mechanism like
synchronize_rcu() might still be needed on the page_pool side? If no, I
am not sure there is any better way to fix the DMA API misuse problem
after all the previous discussion. If yes, it may be better to focus on
discussing how to avoid calling the rcu sync for each queue, as
mentioned in [1].

1. https://lore.kernel.org/all/22de6033-744e-486e-bbd9-8950249cd018@huawei.com/
Yunsheng Lin <yunshenglin0825@gmail.com> writes:
> On 1/25/2025 1:13 AM, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <linyunsheng@huawei.com> writes:
>>
>>>> So I really don't see a way for this race to happen with correct usage
>>>> of the page_pool and NAPI APIs, which means there's no reason to make
>>>> the change you are proposing here.
>>>
>>> I looked at one driver setting pp->napi; it seems the bnxt driver
>>> doesn't call page_pool_disable_direct_recycling() when unloading, see
>>> bnxt_half_close_nic(); page_pool_disable_direct_recycling() seems to be
>>> only called for the new queue_mgmt API:
>>>
>>> /* rtnl_lock held, this call can only be made after a previous successful
>>>  * call to bnxt_half_open_nic().
>>>  */
>>> void bnxt_half_close_nic(struct bnxt *bp)
>>> {
>>> 	bnxt_hwrm_resource_free(bp, false, true);
>>> 	bnxt_del_napi(bp);		/* napi del and rcu sync */
>>> 	bnxt_free_skbs(bp);
>>> 	bnxt_free_mem(bp, true);	/* calls page_pool_destroy() */
>>> 	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
>>> }
>>>
>>> Even if a page_pool_disable_direct_recycling() were called between
>>> bnxt_del_napi() and bnxt_free_mem(), the timing window would still
>>> exist, as the rcu sync needs to be called after
>>> page_pool_disable_direct_recycling(); it seems some refactoring is
>>> needed in the bnxt driver to reuse the rcu sync from the NAPI API, in
>>> order to avoid calling the rcu sync for page_pool_destroy().
>>
>> Well, I would consider that usage buggy. A page pool object is created
>> with a reference to the napi struct; so the page pool should also be
>> destroyed (clearing its reference) before the napi memory is freed. I
>> guess this is not really documented anywhere, but it's pretty standard
>> practice to free objects in the opposite order of their creation.
>
> I am not so familiar with the rules about the NAPI creation API, but the
> bnxt implementation can take a reference to 'struct napi' before even
> calling netif_napi_add(), see below:
>
> static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
> {
> 	.......
> 	rc = bnxt_alloc_mem(bp, irq_re_init);	/* creates page_pool */
> 	if (rc) {
> 		netdev_err(bp->dev, "bnxt_alloc_mem err: %x\n", rc);
> 		goto open_err_free_mem;
> 	}
>
> 	if (irq_re_init) {
> 		bnxt_init_napi(bp);		/* netif_napi_add */
> 		rc = bnxt_request_irq(bp);
> 		if (rc) {
> 			netdev_err(bp->dev, "bnxt_request_irq err: %x\n", rc);
> 			goto open_err_irq;
> 		}
> 	}
> 	.....
> }
Regardless of the initialisation error, the fact that bnxt frees the
NAPI memory before calling page_pool_destroy() is a driver bug. Mina has
a suggestion for a warning to catch such bugs over in this thread:
https://lore.kernel.org/r/CAHS8izOv=tUiuzha6NFq1-ZurLGz9Jdi78jb3ey4ExVJirMprA@mail.gmail.com
>> So no, I don't think this is something that should be fixed on the page
>> pool side (and certainly not by adding another synchronize_rcu() call
>> per queue!); rather, we should fix the drivers that get this wrong (and
>> probably document the requirement a bit better).
>
> Even if the timing problem of checking and disabling napi_local should
> not be fixed on the page_pool side, do we have some common understanding
> about fixing the DMA API misuse problem on the page_pool side?
> If yes, do we have some common understanding that some mechanism
> like synchronize_rcu() might still be needed on the page_pool side?
I have not reviewed the rest of your patch set, I only looked at this
patch. I see you posted v8 without addressing Jesper's ask for a
conceptual description of your design. I am not going to review a
600-something line patch series without such a description to go by, so
please address that first.
> If yes, it may be better to focus on discussing how to avoid calling the
> rcu sync for each queue, as mentioned in [1].
Regardless of whether a synchronize_rcu() is needed in the final design
(and again, note that I don't have an opinion on this before reviewing
the whole series), this patch should be dropped from the series. The bug
it is purporting to fix is a driver API misuse and should be fixed in
the drivers, cf the above.
-Toke
On 1/27/2025 9:47 PM, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
>
>> On 1/25/2025 1:13 AM, Toke Høiland-Jørgensen wrote:
>>> Yunsheng Lin <linyunsheng@huawei.com> writes:
>>>
>>>>> So I really don't see a way for this race to happen with correct usage
>>>>> of the page_pool and NAPI APIs, which means there's no reason to make
>>>>> the change you are proposing here.
>>>>
>>>> I looked at one driver setting pp->napi; it seems the bnxt driver
>>>> doesn't call page_pool_disable_direct_recycling() when unloading, see
>>>> bnxt_half_close_nic(); page_pool_disable_direct_recycling() seems to be
>>>> only called for the new queue_mgmt API:
>>>>
>>>> /* rtnl_lock held, this call can only be made after a previous successful
>>>>  * call to bnxt_half_open_nic().
>>>>  */
>>>> void bnxt_half_close_nic(struct bnxt *bp)
>>>> {
>>>> 	bnxt_hwrm_resource_free(bp, false, true);
>>>> 	bnxt_del_napi(bp);		/* napi del and rcu sync */
>>>> 	bnxt_free_skbs(bp);
>>>> 	bnxt_free_mem(bp, true);	/* calls page_pool_destroy() */
>>>> 	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
>>>> }
>>>>
>>>> Even if a page_pool_disable_direct_recycling() were called between
>>>> bnxt_del_napi() and bnxt_free_mem(), the timing window would still
>>>> exist, as the rcu sync needs to be called after
>>>> page_pool_disable_direct_recycling(); it seems some refactoring is
>>>> needed in the bnxt driver to reuse the rcu sync from the NAPI API, in
>>>> order to avoid calling the rcu sync for page_pool_destroy().
>>>
>>> Well, I would consider that usage buggy. A page pool object is created
>>> with a reference to the napi struct; so the page pool should also be
>>> destroyed (clearing its reference) before the napi memory is freed. I
>>> guess this is not really documented anywhere, but it's pretty standard
>>> practice to free objects in the opposite order of their creation.
>>
>> I am not so familiar with the rules about the NAPI creation API, but the
>> bnxt implementation can take a reference to 'struct napi' before even
>> calling netif_napi_add(), see below:
>>
>> static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
>> {
>> 	.......
>> 	rc = bnxt_alloc_mem(bp, irq_re_init);	/* creates page_pool */
>> 	if (rc) {
>> 		netdev_err(bp->dev, "bnxt_alloc_mem err: %x\n", rc);
>> 		goto open_err_free_mem;
>> 	}
>>
>> 	if (irq_re_init) {
>> 		bnxt_init_napi(bp);		/* netif_napi_add */
>> 		rc = bnxt_request_irq(bp);
>> 		if (rc) {
>> 			netdev_err(bp->dev, "bnxt_request_irq err: %x\n", rc);
>> 			goto open_err_irq;
>> 		}
>> 	}
>> 	.....
>> }
>
> Regardless of the initialisation error, the fact that bnxt frees the
> NAPI memory before calling page_pool_destroy() is a driver bug. Mina has
> a suggestion for a warning to catch such bugs over in this thread:
>
> https://lore.kernel.org/r/CAHS8izOv=tUiuzha6NFq1-ZurLGz9Jdi78jb3ey4ExVJirMprA@mail.gmail.com
Thanks for the reminder.

As the main problem is about adding an rcu sync between
page_pool_disable_direct_recycling() and page_pool_destroy(), I am
really doubtful that a warning can be added to catch such bugs if
page_pool_destroy() does not use an explicit rcu sync and instead relies
on the rcu sync from the napi del API.
>
>>> So no, I don't think this is something that should be fixed on the page
>>> pool side (and certainly not by adding another synchronize_rcu() call
>>> per queue!); rather, we should fix the drivers that get this wrong (and
>>> probably document the requirement a bit better).
>>
>> Even if the timing problem of checking and disabling napi_local should
>> not be fixed on the page_pool side, do we have some common understanding
>> about fixing the DMA API misuse problem on the page_pool side?
>> If yes, do we have some common understanding that some mechanism
>> like synchronize_rcu() might still be needed on the page_pool side?
>
> I have not reviewed the rest of your patch set, I only looked at this
> patch. I see you posted v8 without addressing Jesper's ask for a
> conceptual description of your design. I am not going to review a
> 600-something line patch series without such a description to go by, so
> please address that first.
I thought Jesper's ask was mainly about why the page->pp pointer is
being hijacked.

I summarized the discussion in [1] as below; please let me know if that
addresses your concern too.
"By using the 'struct page_pool_item' referenced by page->pp_item,
page_pool is not only able to keep track of the inflight page to do dma
unmmaping when page_pool_destroy() is called if some pages are still
handled in networking stack, and networking stack is also able to find
the page_pool owning the page when returning pages back into page_pool.
struct page_pool_item {
	unsigned long state;

	union {
		netmem_ref pp_netmem;
		struct llist_node lentry;
	};
};
When a page is added to the page_pool, an item is taken from
pool->hold_items, its 'pp_netmem' is set to point to that page, and its
'state' is set accordingly in order to keep track of that page;
pool->hold_items is refilled from pool->release_items when it is empty,
and an item is taken from pool->slow_items when the fast items run out.
When a page is released from the page_pool, the functions below are used
to tell which page_pool the page belongs to:
static inline struct page_pool_item_block *
page_pool_item_to_block(struct page_pool_item *item)
{
	return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
}

static inline struct page_pool *page_pool_get_pp(struct page *page)
{
	/* The size of item_block is always PAGE_SIZE, the address of item_block
	 * for a specific item can be calculated using 'item & PAGE_MASK', so
	 * that we can find the page_pool object it belongs to.
	 */
	return page_pool_item_to_block(page->pp_item)->pp;
}
and after clearing 'pp_item->state', the item for the released page is
added back to pool->release_items so that it can be reused for new
pages, or simply freed if it came from pool->slow_items.

When page_pool_destroy() is called, pp_item->state is used to tell
whether a specific item is in use/dma mapped by scanning all the item
blocks in pool->item_blocks; pp_item->netmem can then be used to do the
dma unmapping if the corresponding inflight page is dma mapped."
1. https://lore.kernel.org/all/2b5a58f3-d67a-4bf7-921a-033326958ac6@huawei.com/
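The PAGE_MASK trick works because each item block is a single
page-aligned page carrying a pointer to the owning pool; a plausible
layout sketch (all field names except 'pp' are assumptions, not taken
from the series):

	struct page_pool_item_block {
		struct page_pool *pp;		/* owning pool, found via PAGE_MASK */
		struct list_head list;		/* assumed: linked into pool->item_blocks */
		struct page_pool_item items[];	/* rest of the page holds items */
	};

For an item at, say, address ...345e40, masking with PAGE_MASK yields
the page-aligned block base ...345000, from which the owning page_pool
pointer can be read.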
>
>> If yes, it may be better to focus on discussing how to avoid calling the
>> rcu sync for each queue, as mentioned in [1].
>
> Regardless of whether a synchronize_rcu() is needed in the final design
> (and again, note that I don't have an opinion on this before reviewing
> the whole series), this patch should be dropped from the series. The bug
> it is purporting to fix is a driver API misuse and should be fixed in
> the drivers, cf the above.
I am still a little doubtful it is a driver API misuse problem, as I am
not sure page_pool_destroy() can depend on the rcu sync from the napi
del API in all cases. Even if it can, this driver API misuse problem
seems to only exist after the page_pool NAPI recycling feature/API was
added, which might mean some refactoring is needed on the driver side to
support page_pool NAPI recycling.

Anyway, it seems to make sense to drop this patch from the series for
better forward progress on the dma misuse problem, as they are not
really related.
>
> -Toke
>
On 2025/1/11 13:24, Yunsheng Lin wrote:

...

>>> }
>>>
>>> void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>>> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>>>  	if (!page_pool_release(pool))
>>>  		return;
>>>
>>> +	/* Paired with the rcu lock in page_pool_napi_local() to ensure the
>>> +	 * clearing of pool->p.napi in page_pool_disable_direct_recycling() is
>>> +	 * seen before returning to the driver that frees the napi instance.
>>> +	 */
>>> +	synchronize_rcu();
>>
>> Most drivers call page_pool_destroy() in a loop for each RX queue, so
>> now you're introducing a full synchronize_rcu() wait for each queue.
>> That can delay tearing down the device significantly, so I don't think
>> this is a good idea.
>
> synchronize_rcu() is called after page_pool_release(pool), which means
> it is only called when there are some inflight pages, so there is not
> necessarily a full synchronize_rcu() wait for each queue.
>
> Anyway, it seems that there are some cases that need an explicit
> synchronize_rcu() and some cases depending on the other API providing
> synchronize_rcu() semantics; maybe we could provide two different APIs
> for both cases, like the netif_napi_del()/__netif_napi_del() APIs do?

As the synchronize_rcu() is also needed to fix the DMA API misuse
problem, we can not really handle it like the
netif_napi_del()/__netif_napi_del() APIs do; the best I can think of is
something like below:

	bool need_sync = false;

	for (each queue)
		need_sync |= page_pool_destroy_prepare(queue->pool);

	if (need_sync)
		synchronize_rcu();

	for (each queue)
		page_pool_destroy_commit(queue->pool);

But I am not sure the above is worth the effort for now, as the
synchronize_rcu() is only called for the inflight page case. Any better
idea? If not, maybe we can optimize the above later if the
synchronize_rcu() does turn out to be a problem.
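A concrete shape of that two-phase idea might be the following sketch
(page_pool_destroy_prepare()/page_pool_destroy_commit() are hypothetical,
as are the my_drv_* driver structures):

	static void my_drv_destroy_page_pools(struct my_drv_priv *priv)
	{
		bool need_sync = false;
		int i;

		/* phase 1: unlink each pool and note whether it still has
		 * inflight pages and therefore needs a grace period
		 */
		for (i = 0; i < priv->num_rx_rings; i++)
			need_sync |= page_pool_destroy_prepare(priv->rx_rings[i].pool);

		/* one grace period amortized over all queues, only if needed */
		if (need_sync)
			synchronize_rcu();

		/* phase 2: finish the teardown */
		for (i = 0; i < priv->num_rx_rings; i++)
			page_pool_destroy_commit(priv->rx_rings[i].pool);
	}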