Introduce a new tracepoint to track stalled page pool releases,
providing better observability for page pool lifecycle issues.
Problem:
Currently, when a page pool shutdown is stalled due to inflight pages,
the kernel only logs a warning message via pr_warn(). This has several
limitations:
1. The warning floods the kernel log after the initial DEFER_WARN_INTERVAL,
making it difficult to track the progression of stalled releases
2. There's no structured way to monitor or analyze these events
3. Debugging tools cannot easily capture and correlate stalled pool
events with other network activity
Solution:
Add a new tracepoint, page_pool_release_stalled, that fires when a page
pool shutdown is stalled. The tracepoint captures:
- pool: pointer to the stalled page_pool
- inflight: number of pages still in flight
- sec: seconds since the release was deferred
The implementation also modifies the logging behavior:
- pr_warn() is only emitted during the first warning interval
(DEFER_WARN_INTERVAL to DEFER_WARN_INTERVAL*2)
- The tracepoint always fires, reducing log noise while still
allowing monitoring tools to track the issue
This allows developers and system administrators to:
- Use tools like perf, ftrace, or eBPF to monitor stalled releases (see the example below)
- Correlate page pool issues with network driver behavior
- Analyze patterns without parsing kernel logs
- Track the progression of inflight page counts over time
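For example (a minimal sketch, assuming this patch is applied and tracefs is
mounted at /sys/kernel/tracing), the new event can be consumed directly from
the ftrace interface:

  # enable the proposed tracepoint and stream events as they fire
  echo 1 > /sys/kernel/tracing/events/page_pool/page_pool_release_stalled/enable
  cat /sys/kernel/tracing/trace_pipe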
Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
v2 -> v3:
- Print id using '%u'.
- https://lore.kernel.org/netdev/20260102061718.210248-1-leon.hwang@linux.dev/
v1 -> v2:
- Drop RFC.
- Store 'pool->user.id' to '__entry->id' (per Steven Rostedt).
- https://lore.kernel.org/netdev/20251125082207.356075-1-leon.hwang@linux.dev/
---
include/trace/events/page_pool.h | 24 ++++++++++++++++++++++++
net/core/page_pool.c | 6 ++++--
2 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/include/trace/events/page_pool.h b/include/trace/events/page_pool.h
index 31825ed30032..a851e0f6a384 100644
--- a/include/trace/events/page_pool.h
+++ b/include/trace/events/page_pool.h
@@ -113,6 +113,30 @@ TRACE_EVENT(page_pool_update_nid,
__entry->pool, __entry->pool_nid, __entry->new_nid)
);
+TRACE_EVENT(page_pool_release_stalled,
+
+ TP_PROTO(const struct page_pool *pool, int inflight, int sec),
+
+ TP_ARGS(pool, inflight, sec),
+
+ TP_STRUCT__entry(
+ __field(const struct page_pool *, pool)
+ __field(u32, id)
+ __field(int, inflight)
+ __field(int, sec)
+ ),
+
+ TP_fast_assign(
+ __entry->pool = pool;
+ __entry->id = pool->user.id;
+ __entry->inflight = inflight;
+ __entry->sec = sec;
+ ),
+
+ TP_printk("page_pool=%p id=%u inflight=%d sec=%d",
+ __entry->pool, __entry->id, __entry->inflight, __entry->sec)
+);
+
#endif /* _TRACE_PAGE_POOL_H */
/* This part must be outside protection */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 265a729431bb..01564aa84c89 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1222,8 +1222,10 @@ static void page_pool_release_retry(struct work_struct *wq)
(!netdev || netdev == NET_PTR_POISON)) {
int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
- pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
- __func__, pool->user.id, inflight, sec);
+ if (sec >= DEFER_WARN_INTERVAL / HZ && sec < DEFER_WARN_INTERVAL * 2 / HZ)
+ pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
+ __func__, pool->user.id, inflight, sec);
+ trace_page_pool_release_stalled(pool, inflight, sec);
pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
}
--
2.52.0
On 02/01/2026 08.17, Leon Hwang wrote:
> Introduce a new tracepoint to track stalled page pool releases,
> providing better observability for page pool lifecycle issues.
>
In general I like/support adding this tracepoint for "debugability" of
page pool lifecycle issues.
For "observability" @Kuba added a netlink scheme[1][2] for page_pool[3],
which gives us the ability to get events and list page_pools from userspace.
I've not used this myself (yet), so I need input from others on whether
this is something they have been using for page pool lifecycle issues.
Need input from @Kuba/others, as the "page-pool-get"[4] docs state that "Only
Page Pools associated with a net_device can be listed". Don't we want
the ability to list "invisible" page_pool's to allow debugging issues?
[1] https://docs.kernel.org/userspace-api/netlink/intro-specs.html
[2] https://docs.kernel.org/userspace-api/netlink/index.html
[3] https://docs.kernel.org/netlink/specs/netdev.html
[4] https://docs.kernel.org/netlink/specs/netdev.html#page-pool-get
Looking at the code, I see that the NETDEV_CMD_PAGE_POOL_CHANGE_NTF netlink
notification is only generated once (in page_pool_destroy) and not when
we retry in page_pool_release_retry (like this patch). In that sense,
this patch/tracepoint is catching something more than netlink provides.
First I thought we could add a netlink notification, but I can imagine
cases this could generate too many netlink messages e.g. a netdev with
128 RX queues generating these every second for every RX queue.
Guess, I've talked myself into liking this change, what do other
maintainers think? (e.g. netlink scheme and debugging balance)
> Problem:
> Currently, when a page pool shutdown is stalled due to inflight pages,
> the kernel only logs a warning message via pr_warn(). This has several
> limitations:
>
> 1. The warning floods the kernel log after the initial DEFER_WARN_INTERVAL,
> making it difficult to track the progression of stalled releases
> 2. There's no structured way to monitor or analyze these events
> 3. Debugging tools cannot easily capture and correlate stalled pool
> events with other network activity
>
> Solution:
> Add a new tracepoint, page_pool_release_stalled, that fires when a page
> pool shutdown is stalled. The tracepoint captures:
> - pool: pointer to the stalled page_pool
> - inflight: number of pages still in flight
> - sec: seconds since the release was deferred
>
> The implementation also modifies the logging behavior:
> - pr_warn() is only emitted during the first warning interval
> (DEFER_WARN_INTERVAL to DEFER_WARN_INTERVAL*2)
> - The tracepoint is fired always, reducing log noise while still
> allowing monitoring tools to track the issue
>
> This allows developers and system administrators to:
> - Use tools like perf, ftrace, or eBPF to monitor stalled releases
> - Correlate page pool issues with network driver behavior
> - Analyze patterns without parsing kernel logs
> - Track the progression of inflight page counts over time
>
> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
> v2 -> v3:
> - Print id using '%u'.
> - https://lore.kernel.org/netdev/20260102061718.210248-1-leon.hwang@linux.dev/
>
> v1 -> v2:
> - Drop RFC.
> - Store 'pool->user.id' to '__entry->id' (per Steven Rostedt).
> - https://lore.kernel.org/netdev/20251125082207.356075-1-leon.hwang@linux.dev/
> ---
> include/trace/events/page_pool.h | 24 ++++++++++++++++++++++++
> net/core/page_pool.c | 6 ++++--
> 2 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/include/trace/events/page_pool.h b/include/trace/events/page_pool.h
> index 31825ed30032..a851e0f6a384 100644
> --- a/include/trace/events/page_pool.h
> +++ b/include/trace/events/page_pool.h
> @@ -113,6 +113,30 @@ TRACE_EVENT(page_pool_update_nid,
> __entry->pool, __entry->pool_nid, __entry->new_nid)
> );
>
> +TRACE_EVENT(page_pool_release_stalled,
> +
> + TP_PROTO(const struct page_pool *pool, int inflight, int sec),
> +
> + TP_ARGS(pool, inflight, sec),
> +
> + TP_STRUCT__entry(
> + __field(const struct page_pool *, pool)
> + __field(u32, id)
> + __field(int, inflight)
> + __field(int, sec)
> + ),
> +
> + TP_fast_assign(
> + __entry->pool = pool;
> + __entry->id = pool->user.id;
> + __entry->inflight = inflight;
> + __entry->sec = sec;
> + ),
> +
> + TP_printk("page_pool=%p id=%u inflight=%d sec=%d",
> + __entry->pool, __entry->id, __entry->inflight, __entry->sec)
> +);
> +
> #endif /* _TRACE_PAGE_POOL_H */
>
> /* This part must be outside protection */
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 265a729431bb..01564aa84c89 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -1222,8 +1222,10 @@ static void page_pool_release_retry(struct work_struct *wq)
> (!netdev || netdev == NET_PTR_POISON)) {
> int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
>
> - pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
> - __func__, pool->user.id, inflight, sec);
> + if (sec >= DEFER_WARN_INTERVAL / HZ && sec < DEFER_WARN_INTERVAL * 2 / HZ)
> + pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
> + __func__, pool->user.id, inflight, sec);
> + trace_page_pool_release_stalled(pool, inflight, sec);
> pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
> }
>
On Fri, 2 Jan 2026 12:43:46 +0100 Jesper Dangaard Brouer wrote:
> On 02/01/2026 08.17, Leon Hwang wrote:
> > Introduce a new tracepoint to track stalled page pool releases,
> > providing better observability for page pool lifecycle issues.
>
> In general I like/support adding this tracepoint for "debugability" of
> page pool lifecycle issues.
>
> For "observability" @Kuba added a netlink scheme[1][2] for page_pool[3],
> which gives us the ability to get events and list page_pools from userspace.
> I've not used this myself (yet) so I need input from others if this is
> something that others have been using for page pool lifecycle issues?

My input here is the least valuable (since one may expect the person
who added the code uses it) - but FWIW yes, we do use the PP stats to
monitor PP lifecycle issues at Meta. That said - we only monitor for
accumulation of leaked memory from orphaned pages, as the whole reason
for adding this code was that in practice the page may be sitting in
a socket rx queue (or defer free queue etc.) IOW a PP which is not
getting destroyed for a long time is not necessarily a kernel issue.

> Need input from @Kuba/others as the "page-pool-get"[4] state that "Only
> Page Pools associated with a net_device can be listed". Don't we want
> the ability to list "invisible" page_pool's to allow debugging issues?
>
> [1] https://docs.kernel.org/userspace-api/netlink/intro-specs.html
> [2] https://docs.kernel.org/userspace-api/netlink/index.html
> [3] https://docs.kernel.org/netlink/specs/netdev.html
> [4] https://docs.kernel.org/netlink/specs/netdev.html#page-pool-get

The documentation should probably be updated :(
I think what I meant is that most _drivers_ didn't link their PP to the
netdev via params when the API was added. So if the user doesn't see the
page pools - the driver is probably not well maintained.

In practice only page pools which are not accessible / visible via the
API are page pools from already destroyed network namespaces (assuming
their netdevs were also destroyed and not re-parented to init_net).
Which I'd think is a rare case?

> Looking at the code, I see that NETDEV_CMD_PAGE_POOL_CHANGE_NTF netlink
> notification is only generated once (in page_pool_destroy) and not when
> we retry in page_pool_release_retry (like this patch). In that sense,
> this patch/tracepoint is catching something more than netlink provides.
> First I though we could add a netlink notification, but I can imagine
> cases this could generate too many netlink messages e.g. a netdev with
> 128 RX queues generating these every second for every RX queue.

FWIW yes, we can add more notifications. Tho, as I mentioned at the
start of my reply - the expectation is that page pools waiting for
a long time to be destroyed is something that _will_ happen in
production.

> Guess, I've talked myself into liking this change, what do other
> maintainers think? (e.g. netlink scheme and debugging balance)

We added the Netlink API to mute the pr_warn() in all practical cases.
If Xiang Mei is seeing the pr_warn() I think we should start by asking
what kernel and driver they are using, and what the usage pattern is :(
As I mentioned most commonly the pr_warn() will trigger because driver
doesn't link the pp to a netdev.
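For reference on the page-pool-get discussion above, visible page pools and
their stats can be dumped from userspace with the in-tree YNL CLI. This is a
sketch run from a kernel source tree; the exact script and spec paths are
assumptions and may differ between releases:

  # list page pools linked to a net_device, then dump their stats
  ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump page-pool-get
  ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump page-pool-stats-get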
On 04/01/2026 17.43, Jakub Kicinski wrote:
> On Fri, 2 Jan 2026 12:43:46 +0100 Jesper Dangaard Brouer wrote:
>> On 02/01/2026 08.17, Leon Hwang wrote:
>>> Introduce a new tracepoint to track stalled page pool releases,
>>> providing better observability for page pool lifecycle issues.
>>
>> In general I like/support adding this tracepoint for "debugability" of
>> page pool lifecycle issues.
>>
>> For "observability" @Kuba added a netlink scheme[1][2] for page_pool[3],
>> which gives us the ability to get events and list page_pools from userspace.
>> I've not used this myself (yet) so I need input from others if this is
>> something that others have been using for page pool lifecycle issues?
>
> My input here is the least valuable (since one may expect the person
> who added the code uses it) - but FWIW yes, we do use the PP stats to
> monitor PP lifecycle issues at Meta. That said - we only monitor for
> accumulation of leaked memory from orphaned pages, as the whole reason
> for adding this code was that in practice the page may be sitting in
> a socket rx queue (or defer free queue etc.) IOW a PP which is not
> getting destroyed for a long time is not necessarily a kernel issue.
>
>> Need input from @Kuba/others as the "page-pool-get"[4] state that "Only
>> Page Pools associated with a net_device can be listed". Don't we want
>> the ability to list "invisible" page_pool's to allow debugging issues?
>>
>> [1] https://docs.kernel.org/userspace-api/netlink/intro-specs.html
>> [2] https://docs.kernel.org/userspace-api/netlink/index.html
>> [3] https://docs.kernel.org/netlink/specs/netdev.html
>> [4] https://docs.kernel.org/netlink/specs/netdev.html#page-pool-get
>
> The documentation should probably be updated :(
> I think what I meant is that most _drivers_ didn't link their PP to the
> netdev via params when the API was added. So if the user doesn't see the
> page pools - the driver is probably not well maintained.
>
> In practice only page pools which are not accessible / visible via the
> API are page pools from already destroyed network namespaces (assuming
> their netdevs were also destroyed and not re-parented to init_net).
> Which I'd think is a rare case?
>
>> Looking at the code, I see that NETDEV_CMD_PAGE_POOL_CHANGE_NTF netlink
>> notification is only generated once (in page_pool_destroy) and not when
>> we retry in page_pool_release_retry (like this patch). In that sense,
>> this patch/tracepoint is catching something more than netlink provides.
>> First I though we could add a netlink notification, but I can imagine
>> cases this could generate too many netlink messages e.g. a netdev with
>> 128 RX queues generating these every second for every RX queue.
>
> FWIW yes, we can add more notifications. Tho, as I mentioned at the
> start of my reply - the expectation is that page pools waiting for
> a long time to be destroyed is something that _will_ happen in
> production.
>
>> Guess, I've talked myself into liking this change, what do other
>> maintainers think? (e.g. netlink scheme and debugging balance)
>
> We added the Netlink API to mute the pr_warn() in all practical cases.
> If Xiang Mei is seeing the pr_warn() I think we should start by asking
> what kernel and driver they are using, and what the usage pattern is :(
> As I mentioned most commonly the pr_warn() will trigger because driver
> doesn't link the pp to a netdev.
The commit that introduced this, be0096676e23 ("net: page_pool: mute the
periodic warning for visible page pools") (Author: Jakub Kicinski), was
added in kernel v6.8. Our fleet runs 6.12.
Looking at production logs I'm still seeing these messages, e.g.:
"page_pool_release_retry() stalled pool shutdown: id 322, 1 inflight
591248 sec"
One of these servers runs kernel 6.12.59 and the ice NIC driver.
I'm surprised to see these on our normal servers, and also by the long
period. Previously I was seeing these on k8s servers, which makes more
sense, as veth interfaces are likely to be removed and more easily reach
the pr_warn() (Jakub added an extra if statement checking the netdev in
that commit: (!netdev || netdev == NET_PTR_POISON)).
An example from a k8s server has a smaller stalled period, and I think it
recovered:
"page_pool_release_retry() stalled pool shutdown: id 18, 1 inflight
3020 sec"
I'm also surprised to see the ice NIC driver, as previously we mostly saw
these warnings on the bnxt_en driver. I did manage to find some cases with
the bnxt_en driver now, but I see that the server likely has a hardware defect.
Bottom line: yes, these stalled pool shutdown pr_warn()s are still
happening in production.
--Jesper
On 5/1/26 00:43, Jakub Kicinski wrote:
> On Fri, 2 Jan 2026 12:43:46 +0100 Jesper Dangaard Brouer wrote:

[...]

> We added the Netlink API to mute the pr_warn() in all practical cases.
> If Xiang Mei is seeing the pr_warn() I think we should start by asking
> what kernel and driver they are using, and what the usage pattern is :(
> As I mentioned most commonly the pr_warn() will trigger because driver
> doesn't link the pp to a netdev.

Hi Jakub, Jesper,

Thanks for the discussion. Since netlink notifications are only emitted
at page_pool_destroy(), the tracepoint still provides additional
debugging visibility for prolonged page_pool_release_retry() cases.

Steven has reviewed the tracepoint [1]. Any further feedback would be
appreciated.

Thanks,
Leon

[1] https://lore.kernel.org/netdev/20260102104504.7f593441@gandalf.local.home/
On 19/01/2026 09.49, Leon Hwang wrote:
>
> On 5/1/26 00:43, Jakub Kicinski wrote:
>> On Fri, 2 Jan 2026 12:43:46 +0100 Jesper Dangaard Brouer wrote:

[...]

>> My input here is the least valuable (since one may expect the person
>> who added the code uses it) - but FWIW yes, we do use the PP stats to
>> monitor PP lifecycle issues at Meta. That said - we only monitor for
>> accumulation of leaked memory from orphaned pages, as the whole reason
>> for adding this code was that in practice the page may be sitting in
>> a socket rx queue (or defer free queue etc.) IOW a PP which is not
>> getting destroyed for a long time is not necessarily a kernel issue.
>>

What monitoring tool did production people add metrics to?

People at CF recommend that I/we add this to prometheus/node_exporter.
Perhaps somebody else already added this to some other FOSS tool?

https://github.com/prometheus/node_exporter

[...]

> Hi Jakub, Jesper,
>
> Thanks for the discussion. Since netlink notifications are only emitted
> at page_pool_destroy(), the tracepoint still provides additional
> debugging visibility for prolonged page_pool_release_retry() cases.
>
> Steven has reviewed the tracepoint [1]. Any further feedback would be
> appreciated.

This change looks good as-is:

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

Your patch[0] is marked as "Changes Requested". I suggest you send a V4
with my Acked-by added.

--Jesper

[0] https://patchwork.kernel.org/project/netdevbpf/patch/20260102071745.291969-1-leon.hwang@linux.dev/
On Mon, 19 Jan 2026 10:54:13 +0100 Jesper Dangaard Brouer wrote:
> On 19/01/2026 09.49, Leon Hwang wrote:
> >> My input here is the least valuable (since one may expect the person
> >> who added the code uses it) - but FWIW yes, we do use the PP stats to
> >> monitor PP lifecycle issues at Meta. That said - we only monitor for
> >> accumulation of leaked memory from orphaned pages, as the whole reason
> >> for adding this code was that in practice the page may be sitting in
> >> a socket rx queue (or defer free queue etc.) IOW a PP which is not
> >> getting destroyed for a long time is not necessarily a kernel issue.
> >>
>
> What monitoring tool did production people add metrics to?
>
> People at CF recommend that I/we add this to prometheus/node_exporter.
> Perhaps somebody else already added this to some other FOSS tool?
>
> https://github.com/prometheus/node_exporter

We added it to this: https://github.com/facebookincubator/dynolog
But AFAICT it's missing from the open source version(?!)

Luckily ynltool now exists so one can just plug it into any monitoring
system that can hoover up JSON:

  ynltool -j page-pool stats
On 2026/1/2 19:43, Jesper Dangaard Brouer wrote:
> On 02/01/2026 08.17, Leon Hwang wrote:
>> Introduce a new tracepoint to track stalled page pool releases,
>> providing better observability for page pool lifecycle issues.

[...]

>> Solution:
>> Add a new tracepoint, page_pool_release_stalled, that fires when a page
>> pool shutdown is stalled. The tracepoint captures:
>> - pool: pointer to the stalled page_pool
>> - inflight: number of pages still in flight
>> - sec: seconds since the release was deferred
>>
>> The implementation also modifies the logging behavior:
>> - pr_warn() is only emitted during the first warning interval
>>   (DEFER_WARN_INTERVAL to DEFER_WARN_INTERVAL*2)
>> - The tracepoint is fired always, reducing log noise while still
>>   allowing monitoring tools to track the issue

If the initial log is still present, I don't really see what's the benefit
of re-triggering logs or tracepoints when the first two fields are unchanged
and the last two fields can be inspected using some tool? If there are none,
perhaps we only need to print the first trigger log and a log upon completion
of page_pool destruction.

[...]
On 4/1/26 10:18, Yunsheng Lin wrote:
> On 2026/1/2 19:43, Jesper Dangaard Brouer wrote:

[...]

> If the initial log is still present, I don't really see what's the benefit
> of re-triggering logs or tracepoints when the first two fields are unchanged
> and the last two fields can be inspected using some tool? If there are none,
> perhaps we only need to print the first trigger log and a log upon completion
> of page_pool destruction.
>

Even though it is possible to inspect the last two fields via the
workqueue (e.g., by tracing page_pool_release_retry with BPF tools),
this is not a practical approach for routine monitoring or debugging.

With the proposed tracepoint, obtaining these fields becomes
straightforward and lightweight, making it much easier to observe and
reason about stalled page pool releases in real systems.

In the issue I encountered, it was crucial to notice that the inflight
count was gradually decreasing over time. This gave me confidence that
some orphaned pages were eventually being returned to the page pool.
Based on that signal, I was then able to capture the call stack of
page_pool_put_defragged_page (kernel v6.6) to identify the code path
responsible for returning those pages.

Without repeated pr_warn logs or tracepoint events, it would have been
significantly harder to observe this progression and correlate it with
the eventual page returns.

Thanks,
Leon
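For illustration, a rough bpftrace sketch of the kind of call-stack capture
described above (assuming a v6.6-era kernel where page_pool_put_defragged_page()
still exists; the function has since been renamed in newer kernels):

  # count kernel stacks that return pages to the pool via this path
  bpftrace -e 'kprobe:page_pool_put_defragged_page { @stacks[kstack] = count(); }'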
On 2026/1/2 19:43, Jesper Dangaard Brouer wrote:
> On 02/01/2026 08.17, Leon Hwang wrote:
>> Introduce a new tracepoint to track stalled page pool releases,
>> providing better observability for page pool lifecycle issues.
>>
>
> In general I like/support adding this tracepoint for "debugability" of
> page pool lifecycle issues.

[...]

> Guess, I've talked myself into liking this change, what do other
> maintainers think? (e.g. netlink scheme and debugging balance)
>

Hi Jesper,

Thanks for the thoughtful review and for sharing the context around the
existing netlink-based observability.

I ran into a real-world issue where stalled pages were still referenced
by dangling TCP sockets. I wrote up the investigation in more detail in
my blog post "let page inflight" [1] (unfortunately only available in
Chinese at the moment).

In practice, the hardest part was identifying *who* was still holding
references to the inflight pages. With the current tooling, it is very
difficult to introspect the active users of a page once it becomes
stalled. If we can expose more information about current page users
(such as the user type and a user pointer), it becomes much easier to
debug these issues using BPF-based tools. For example, by tracing
page_pool_state_hold and page_pool_state_release, tools like bpftrace [2]
or bpfsnoop [3] (which I implemented) can correlate inflight page
pointers with their active users. This significantly lowers the barrier
to diagnosing page pool lifecycle problems.

As you noted, the existing netlink notifications are generated only at
page_pool_destroy, and not during retries in page_pool_release_retry.
In that sense, the proposed tracepoint captures a class of issues that
netlink does not currently cover, and does so without the risk of
generating excessive userspace events.

Thanks again for the feedback, and I'm happy to refine the approach
based on further input from you, Kuba, or other maintainers.

Links:
[1] https://blog.leonhw.com/post/linux-networking-6-inflight-page/
[2] https://github.com/bpftrace/bpftrace/
[3] https://github.com/bpfsnoop/bpfsnoop/

Thanks,
Leon

[...]
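As a rough illustration of the hold/release correlation described above, a
bpftrace sketch pairing the existing page_pool tracepoints. The field name
'netmem' is an assumption and varies by kernel version (older kernels expose
'page' instead):

  bpftrace -e '
  tracepoint:page_pool:page_pool_state_hold    { @held[args->netmem] = nsecs; }
  tracepoint:page_pool:page_pool_state_release { delete(@held[args->netmem]); }
  /* at exit, anything still in @held has been held but not yet released */
  END { print(@held); }
  '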