[PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success

[PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success
Posted by Breno Leitao 1 month ago
The current implementation has a flaw where it populates the skb_pool
with 32 SKBs before calling __netpoll_setup(). If the setup fails, the
skb_pool buffer will persist indefinitely and never be cleaned up.

This change moves the skb_pool population to after the successful
completion of __netpoll_setup(), ensuring that the buffers are not
unnecessarily retained. Additionally, this modification alleviates rtnl
lock pressure by allowing the buffer filling to occur outside of the
lock.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 net/core/netpoll.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index aa49b92e9194..e83fd8bdce36 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -772,13 +772,13 @@ int netpoll_setup(struct netpoll *np)
 		}
 	}
 
-	/* fill up the skb queue */
-	refill_skbs();
-
 	err = __netpoll_setup(np, ndev);
 	if (err)
 		goto put;
 	rtnl_unlock();
+
+	/* fill up the skb queue */
+	refill_skbs();
 	return 0;
 
 put:
-- 
2.43.5
Re: [PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success
Posted by Jakub Kicinski 3 weeks, 3 days ago
On Fri, 25 Oct 2024 07:20:18 -0700 Breno Leitao wrote:
> The current implementation has a flaw where it populates the skb_pool
> with 32 SKBs before calling __netpoll_setup(). If the setup fails, the
> skb_pool buffer will persist indefinitely and never be cleaned up.
> 
> This change moves the skb_pool population to after the successful
> completion of __netpoll_setup(), ensuring that the buffers are not
> unnecessarily retained. Additionally, this modification alleviates rtnl
> lock pressure by allowing the buffer filling to occur outside of the
> lock.

arguably if the setup succeeds there would now be a window of time
where np is active but pool is empty.
Re: [PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success
Posted by Breno Leitao 3 weeks, 3 days ago
Hello Jakub,

On Thu, Oct 31, 2024 at 06:26:47PM -0700, Jakub Kicinski wrote:
> On Fri, 25 Oct 2024 07:20:18 -0700 Breno Leitao wrote:
> > The current implementation has a flaw where it populates the skb_pool
> > with 32 SKBs before calling __netpoll_setup(). If the setup fails, the
> > skb_pool buffer will persist indefinitely and never be cleaned up.
> > 
> > This change moves the skb_pool population to after the successful
> > completion of __netpoll_setup(), ensuring that the buffers are not
> > unnecessarily retained. Additionally, this modification alleviates rtnl
> > lock pressure by allowing the buffer filling to occur outside of the
> > lock.
> 
> arguably if the setup succeeds there would now be a window of time
> where np is active but pool is empty.

I am not convinced this is a problem, given that netpoll_setup() is only
called from netconsole.

In netconsole, a target is not enabled (as in sending packets) until the
netconsole target is, in fact, enabled (nt->enabled = true). Enabling
the target (nt) only happens after netpoll_setup() returns successfully.

Example:

	static void write_ext_msg(struct console *con, const char *msg,
				  unsigned int len)
	{
		...
		list_for_each_entry(nt, &target_list, list)
			if (nt->extended && nt->enabled && netif_running(nt->np.dev))
				send_ext_msg_udp(nt, msg, len);

So, back to your point, the netpoll interface will be up, but not used
at all.

On top of that, two other considerations:

 * If the netpoll target is used without the buffer, it is not a big
deal, since refill_skbs() is called regardless of whether the pool is
full or not. (Which is not ideal, and I eventually want to improve it.)

Anyway, this is how the code works today:


	void netpoll_send_udp(struct netpoll *np, const char *msg, int len)
	{
		...
		skb = find_skb(np, total_len + np->dev->needed_tailroom,...
		// transmit the skb
		

	static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
	{
		...
		refill_skbs(np);
		skb = alloc_skb(len, GFP_ATOMIC);
		if (!skb)
			skb = skb_dequeue(&np->skb_pool);
		...
		// return the skb

So, even if there is a transmission between enabling the netpoll
target and populating the pool (which is NOT the case in the code
today), it would not be a problem, given that netpoll_send_udp() will
call refill_skbs() anyway.

I have an in-development patch to improve it, by deferring this to a
worker thread, mainly because this whole allocation dance is done with a
bunch of locks held, including the printk/console lock.

I think that a better mechanism might be something like:

 * If find_skb() needs to consume from the pool (which is rare, only
when alloc_skb() fails), wake a worker thread that tries to repopulate
the pool in the background.

 * Eventually avoid calling alloc_skb() first and instead get the skb
   directly from the pool; if the pool is depleted, fall back to
   alloc_skb(GFP_ATOMIC). This might make the code faster, but I don't
   have data yet.

   This might also require making the pool size configurable; today it
   is hardcoded (#define MAX_SKBS 32). This current patchset is the
   first step to individualize the pool; then we can have a field in
   struct netpoll that specifies the pool size (32 by default) but is
   user-configurable.

   On netconsole, we can do it using the configfs fields.
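
A rough sketch of how that could look (illustrative only; the pool_size
field and this refill loop are just a sketch, not code I have posted):

	struct netpoll {
		...
		struct sk_buff_head skb_pool;
		int pool_size;	/* defaults to MAX_SKBS (32) */
	};

	static void refill_skbs(struct netpoll *np)
	{
		struct sk_buff *skb;
		unsigned long flags;

		spin_lock_irqsave(&np->skb_pool.lock, flags);
		/* Refill up to the per-target limit instead of the
		 * hardcoded MAX_SKBS constant.
		 */
		while (np->skb_pool.qlen < np->pool_size) {
			skb = alloc_skb(MAX_SKB_SIZE, GFP_ATOMIC);
			if (!skb)
				break;
			__skb_queue_tail(&np->skb_pool, skb);
		}
		spin_unlock_irqrestore(&np->skb_pool.lock, flags);
	}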

Anyway, are these ideas too crazy?
Re: [PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success
Posted by Breno Leitao 3 weeks, 2 days ago
On Fri, Nov 01, 2024 at 03:51:59AM -0700, Breno Leitao wrote:
> Hello Jakub,
> 
> On Thu, Oct 31, 2024 at 06:26:47PM -0700, Jakub Kicinski wrote:
> > On Fri, 25 Oct 2024 07:20:18 -0700 Breno Leitao wrote:
> > > The current implementation has a flaw where it populates the skb_pool
> > > with 32 SKBs before calling __netpoll_setup(). If the setup fails, the
> > > skb_pool buffer will persist indefinitely and never be cleaned up.
> > > 
> > > This change moves the skb_pool population to after the successful
> > > completion of __netpoll_setup(), ensuring that the buffers are not
> > > unnecessarily retained. Additionally, this modification alleviates rtnl
> > > lock pressure by allowing the buffer filling to occur outside of the
> > > lock.
> > 
> > arguably if the setup succeeds there would now be a window of time
> > where np is active but pool is empty.
> 
> I am not convinced this is a problem, given that netpoll_setup() is only
> called from netconsole.
> 
> In netconsole, a target is not enabled (as in sending packets) until the
> netconsole target is, in fact, enabled (nt->enabled = true). Enabling
> the target (nt) only happens after netpoll_setup() returns successfully.
> 
> Example:
> 
> 	static void write_ext_msg(struct console *con, const char *msg,
> 				  unsigned int len)
> 	{
> 		...
> 		list_for_each_entry(nt, &target_list, list)
> 			if (nt->extended && nt->enabled && netif_running(nt->np.dev))
> 				send_ext_msg_udp(nt, msg, len);
> 
> So, back to your point, the netpoll interface will be up, but not used
> at all.
> 
> On top of that, two other considerations:
> 
>  * If the netpoll target is used without the buffer, it is not a big
> deal, since refill_skbs() is called regardless of whether the pool is
> full or not. (Which is not ideal, and I eventually want to improve it.)
> 
> Anyway, this is how the code works today:
> 
> 
> 	void netpoll_send_udp(struct netpoll *np, const char *msg, int len)
> 	{
> 		...
> 		skb = find_skb(np, total_len + np->dev->needed_tailroom,...
> 		// transmit the skb
> 		
> 
> 	static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
> 	{
> 		...
> 		refill_skbs(np);
> 		skb = alloc_skb(len, GFP_ATOMIC);
> 		if (!skb)
> 			skb = skb_dequeue(&np->skb_pool);
> 		...
> 		// return the skb
> 
> So, even if there is a transmission between enabling the netpoll
> target and populating the pool (which is NOT the case in the code
> today), it would not be a problem, given that netpoll_send_udp() will
> call refill_skbs() anyway.
> 
> I have an in-development patch to improve it, by deferring this to a
> worker thread, mainly because this whole allocation dance is done with a
> bunch of locks held, including the printk/console lock.
> 
> I think that a better mechanism might be something like:
> 
>  * If find_skb() needs to consume from the pool (which is rare, only
> when alloc_skb() fails), wake a worker thread that tries to repopulate
> the pool in the background.
> 
>  * Eventually avoid calling alloc_skb() first and instead get the skb
>    directly from the pool; if the pool is depleted, fall back to
>    alloc_skb(GFP_ATOMIC). This might make the code faster, but I don't
>    have data yet.

I've hacked this case (getting the skb from the pool first and refilling
it on a workqueue) today, and the performance improvement is significant.

I've tested sending 2k messages and measured the time it takes to
run `netpoll_send_udp`, which is the critical function in netpoll.

Current code (with this patchset applied), where the index is in
nanoseconds:

	[14K, 16K)           107 |@@@                                                 |
	[16K, 20K)          1757 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
	[20K, 24K)            59 |@                                                   |
	[24K, 28K)            35 |@                                                   |
	[28K, 32K)            35 |@                                                   |
	[32K, 40K)             5 |                                                    |
	[40K, 48K)             0 |                                                    |
	[48K, 56K)             0 |                                                    |
	[56K, 64K)             0 |                                                    |
	[64K, 80K)             1 |                                                    |
	[80K, 96K)             0 |                                                    |
	[96K, 112K)            1 |                                                    |


With the optimization applied, I get a solid win:

	[8K, 10K)             32 |@                                                   |
	[10K, 12K)           514 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
	[12K, 14K)           932 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
	[14K, 16K)           367 |@@@@@@@@@@@@@@@@@@@@                                |
	[16K, 20K)           102 |@@@@@                                               |
	[20K, 24K)            29 |@                                                   |
	[24K, 28K)            17 |                                                    |
	[28K, 32K)             1 |                                                    |
	[32K, 40K)             3 |                                                    |
	[40K, 48K)             0 |                                                    |
	[48K, 56K)             0 |                                                    |
	[56K, 64K)             1 |                                                    |
	[64K, 80K)             0 |                                                    |
	[80K, 96K)             1 |                                                    |
	[96K, 112K)            1 |                                                    |

That was captured with this simple bpftrace script:

	kprobe:netpoll_send_udp {
	    @start[tid] = nsecs;
	}

	kretprobe:netpoll_send_udp /@start[tid]/ {
	    $duration = nsecs - @start[tid];
	    delete(@start[tid]);
	    @ = hist($duration, 2)
	}

	END
	{
		clear(@start);
		print(@);
	}

And this is the patch I am testing right now:


	commit 262de00e439e0708fadf5e4c2896837c046a325b
	Author: Breno Leitao <leitao@debian.org>
	Date:   Fri Nov 1 10:03:58 2024 -0700

	    netpoll: prioritize the skb from the pool.

	    Prioritize allocating SKBs from the pool over alloc_skb() to reduce
	    overhead in the critical path.

	    Move the pool refill operation to a worktask, allowing it to run
	    outside of the printk/console lock.

	    This change improves performance by minimizing the time spent in the
	    critical path filling and allocating skbs, reducing contention on the
	    printk/console lock.

	    Signed-off-by: Breno Leitao <leitao@debian.org>

	diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
	index 77635b885c18..c81dc9cc0139 100644
	--- a/include/linux/netpoll.h
	+++ b/include/linux/netpoll.h
	@@ -33,6 +33,7 @@ struct netpoll {
		u16 local_port, remote_port;
		u8 remote_mac[ETH_ALEN];
		struct sk_buff_head skb_pool;
	+	struct work_struct work;
	 };

	 struct netpoll_info {
	diff --git a/net/core/netpoll.c b/net/core/netpoll.c
	index bf2064d689d5..657100393489 100644
	--- a/net/core/netpoll.c
	+++ b/net/core/netpoll.c
	@@ -278,18 +278,24 @@ static void zap_completion_queue(void)
		put_cpu_var(softnet_data);
	 }

	+
	+static void refill_skbs_wt(struct work_struct *work)
	+{
	+	struct netpoll *np = container_of(work, struct netpoll, work);
	+
	+	refill_skbs(np);
	+}
	+
	 static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
	 {
		int count = 0;
		struct sk_buff *skb;

		zap_completion_queue();
	-	refill_skbs(np);
	 repeat:
	-
	-	skb = alloc_skb(len, GFP_ATOMIC);
	+	skb = skb_dequeue(&np->skb_pool);
		if (!skb)
	-		skb = skb_dequeue(&np->skb_pool);
	+		skb = alloc_skb(len, GFP_ATOMIC);

		if (!skb) {
			if (++count < 10) {
	@@ -301,6 +307,7 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)

		refcount_set(&skb->users, 1);
		skb_reserve(skb, reserve);
	+	schedule_work(&np->work);
		return skb;
	 }

	@@ -780,6 +787,7 @@ int netpoll_setup(struct netpoll *np)

		/* fill up the skb queue */
		refill_skbs(np);
	+	INIT_WORK(&np->work, refill_skbs_wt);
		return 0;

	 put:
Re: [PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success
Posted by Jakub Kicinski 3 weeks, 2 days ago
On Fri, 1 Nov 2024 11:18:29 -0700 Breno Leitao wrote:
> > I think that a better mechanism might be something like:
> > 
> >  * If find_skb() needs to consume from the pool (which is rare, only
> > when alloc_skb() fails), wake a worker thread that tries to repopulate
> > the pool in the background.
> > 
> >  * Eventually avoid calling alloc_skb() first and instead get the skb
> >    directly from the pool; if the pool is depleted, fall back to
> >    alloc_skb(GFP_ATOMIC). This might make the code faster, but I don't
> >    have data yet.
> 
> I've hacked this case (getting the skb from the pool first and refilling
> it on a workqueue) today, and the performance improvement is significant.
> 
> I've tested sending 2k messages and measured the time it takes to
> run `netpoll_send_udp`, which is the critical function in netpoll.

The purpose of the pool is to have a reserve in case of OOM, AFAIU.
We may speed things up by taking the allocations out of line but
we risk the pool being empty when we really need it.
Re: [PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success
Posted by Breno Leitao 2 weeks, 6 days ago
On Fri, Nov 01, 2024 at 07:01:01PM -0700, Jakub Kicinski wrote:
> On Fri, 1 Nov 2024 11:18:29 -0700 Breno Leitao wrote:
> > > I think that a better mechanism might be something like:
> > > 
> > >  * If find_skb() needs to consume from the pool (which is rare, only
> > > when alloc_skb() fails), wake a worker thread that tries to repopulate
> > > the pool in the background.
> > > 
> > >  * Eventually avoid calling alloc_skb() first and instead get the skb
> > >    directly from the pool; if the pool is depleted, fall back to
> > >    alloc_skb(GFP_ATOMIC). This might make the code faster, but I don't
> > >    have data yet.
> > 
> > I've hacked this case (getting the skb from the pool first and refilling
> > it on a workqueue) today, and the performance improvement is significant.
> > 
> > I've tested sending 2k messages and measured the time it takes to
> > run `netpoll_send_udp`, which is the critical function in netpoll.
> 
> The purpose of the pool is to have a reserve in case of OOM, AFAIU.
> We may speed things up by taking the allocations out of line but
> we risk the pool being empty when we really need it.

Correct, but in a case of OOM, I am not sure this is going to
change the chances at all.

Let's assume the pool is full and we start getting OOMs. It doesn't
matter whether alloc_skb() fails in the critical path or in the work
thread: netpoll will have MAX_SKBS skbs buffered to use, and none will
be allocated; thus, just 32 SKBs will be available before -ENOMEM is
returned.

On the other side, let's suppose there is a bunch of OOM pressure for a
while (10 SKBs are consumed, for instance), and then some free memory
shows up, causing the pool to be replenished. It is better to do that
in the worker thread rather than in the hot path.

In both cases, the chance of not having SKBs to send the packet seems to
be the same, unless I am not modeling the problem correctly.

On top of that, a few other points where this new model could help
more in an OOM case:

1) Now, with Maksym's patches, we can monitor the OOMing rate

2) With the pool per target, we can easily increase the pool size if we
want. (patchset not pushed yet)

This will also help with another corner case we have in netconsole. When
printk() holds the console/target_list locks, the code that follows cannot
printk() anymore, otherwise it will deadlock the system. That is because a
printk() will call netconsole again (nested), and it will try to take the
console_lock/target_lock again, but those are already held. Moving the
alloc_skb() out of that hot path will reduce the probability of this
happening. This is something I am planning to fix later, by just
dropping the nested message; having this patch might result in fewer
packets being dropped.
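
The drop-on-recursion idea could look roughly like this (purely
illustrative; the per-CPU flag and its placement in write_msg() are an
assumption, not code I have posted):

	static DEFINE_PER_CPU(bool, netconsole_sending);

	static void write_msg(struct console *con, const char *msg,
			      unsigned int len)
	{
		/* A printk() issued while target_list_lock is already
		 * held would re-enter netconsole and deadlock; drop the
		 * nested message instead of trying to send it.
		 */
		if (this_cpu_read(netconsole_sending))
			return;

		this_cpu_write(netconsole_sending, true);
		/* ... take target_list_lock and send via netpoll ... */
		this_cpu_write(netconsole_sending, false);
	}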
Re: [PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success
Posted by Jakub Kicinski 2 weeks, 5 days ago
On Mon, 4 Nov 2024 12:40:00 -0800 Breno Leitao wrote:
> Let's assume the pool is full and we start getting OOMs. It doesn't
> matter whether alloc_skb() fails in the critical path or in the work
> thread: netpoll will have MAX_SKBS skbs buffered to use, and none will
> be allocated; thus, just 32 SKBs will be available before -ENOMEM is
> returned.

Do you assume the worker thread will basically keep up with the output?
Vadim was showing me a system earlier today where workqueue workers
didn't get scheduled in for minutes :( That's a bit extreme, but doesn't
inspire confidence in the worker replenishing the pool quickly.

> On the other side, let's suppose there is a bunch of OOM pressure for a
> while (10 SKBs are consumed, for instance), and then some free memory
> shows up, causing the pool to be replenished. It is better to do that
> in the worker thread rather than in the hot path.

We could cap how much we replenish in one go?
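
Something like this, perhaps (a sketch only; REFILL_BATCH is made up):

	#define REFILL_BATCH 4

	static void refill_skbs(struct netpoll *np)
	{
		int budget = REFILL_BATCH;
		struct sk_buff *skb;

		/* Top up at most REFILL_BATCH skbs per call, so a fully
		 * depleted pool does not stall the caller with 32
		 * back-to-back allocations.
		 */
		while (budget-- > 0 && skb_queue_len(&np->skb_pool) < MAX_SKBS) {
			skb = alloc_skb(MAX_SKB_SIZE, GFP_ATOMIC);
			if (!skb)
				break;
			skb_queue_tail(&np->skb_pool, skb);
		}
	}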

> In both cases, the chance of not having SKBs to send the packet seems to
> be the same, unless I am not modeling the problem correctly.

Maybe I misunderstood the proposal; I think you said earlier that you
want to consume from the pool instead of calling alloc(). If you mean
that we'd still alloc in the fast path but not replenish the pool,
that's different.

> On top of that, a few other points where this new model could help
> more in an OOM case:
> 
> 1) Now, with Maksym's patches, we can monitor the OOMing rate
> 
> 2) With the pool per target, we can easily increase the pool size if we
> want. (patchset not pushed yet)
Re: [PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success
Posted by Breno Leitao 2 weeks, 5 days ago
Hello Jakub,

On Tue, Nov 05, 2024 at 05:00:29PM -0800, Jakub Kicinski wrote:
> On Mon, 4 Nov 2024 12:40:00 -0800 Breno Leitao wrote:
> > Let's assume the pool is full and we start getting OOMs. It doesn't
> > matter whether alloc_skb() fails in the critical path or in the work
> > thread: netpoll will have MAX_SKBS skbs buffered to use, and none will
> > be allocated; thus, just 32 SKBs will be available before -ENOMEM is
> > returned.
> 
> Do you assume the worker thread will basically keep up with the output?
> Vadim was showing me a system earlier today where workqueue workers
> didn't get scheduled in for minutes :( That's a bit extreme, but doesn't
> inspire confidence in the worker replenishing the pool quickly.

Interesting. Thanks for the data point.

> > On the other side, let's suppose there is a bunch of OOM pressure for a
> > while (10 SKBs are consumed, for instance), and then some free memory
> > shows up, causing the pool to be replenished. It is better to do that
> > in the worker thread rather than in the hot path.
> 
> We could cap how much we replenish in one go?

If we keep the replenishing in the hot path, I think it is worth doing,
for sure.

> > In both cases, the chance of not having SKBs to send the packet seems to
> > be the same, unless I am not modeling the problem correctly.
> 
> Maybe I misunderstood the proposal; I think you said earlier that you
> want to consume from the pool instead of calling alloc(). If you mean
> that we'd still alloc in the fast path but not replenish the pool,
> that's different.

To clarify, let me take a step back and outline what this patchset proposes:

The patchset enhances SKB pool management in three key ways:

	a) It delays populating the skb pool until the target is active.
	b) It releases the skb pool when there are no more active users.
	c) It creates a separate pool for each target.

The third point (c) is the one that's open to discussion, as I
understand.

I proposed that having an individualized skb pool that users can control
would be beneficial. For example, users could define the number of skbs
in the pool. This could lead to additional advantages, such as allowing
netpoll to directly consume from the pool instead of relying on alloc()
in the optimal scenario, thereby speeding up the critical path.

--breno
Re: [PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success
Posted by Jakub Kicinski 2 weeks, 4 days ago
On Wed, 6 Nov 2024 07:06:06 -0800 Breno Leitao wrote:
> To clarify, let me take a step back and outline what this patchset proposes:
> 
> The patchset enhances SKB pool management in three key ways:
> 
> 	a) It delays populating the skb pool until the target is active.
> 	b) It releases the skb pool when there are no more active users.
> 	c) It creates a separate pool for each target.
> 
> The third point (c) is the one that's open to discussion, as I
> understand.
> 
> I proposed that having an individualized skb pool that users can control
> would be beneficial. For example, users could define the number of skbs
> in the pool. This could lead to additional advantages, such as allowing
> netpoll to directly consume from the pool instead of relying on alloc()
> in the optimal scenario, thereby speeding up the critical path.

Patch 1 is the one I'm not completely convinced by. I understand
the motivation, but it's rather unusual to activate partially initialized
objects. Maybe let's leave it out.

The rest is fine, although I'd invert the justification for the second
patch. We should in fact scale the number of pooled packets with the
number of consoles. Each message gets sent to every console, so a system
with 2 netconsoles has effectively half the OOM cushion.
Re: [PATCH net-next 1/3] net: netpoll: Defer skb_pool population until setup success
Posted by Breno Leitao 2 weeks, 4 days ago
Hello Jakub,

On Wed, Nov 06, 2024 at 03:43:49PM -0800, Jakub Kicinski wrote:
> On Wed, 6 Nov 2024 07:06:06 -0800 Breno Leitao wrote:
> > To clarify, let me take a step back and outline what this patchset proposes:
> > 
> > The patchset enhances SKB pool management in three key ways:
> > 
> > 	a) It delays populating the skb pool until the target is active.
> > 	b) It releases the skb pool when there are no more active users.
> > 	c) It creates a separate pool for each target.
> > 
> > The third point (c) is the one that's open to discussion, as I
> > understand.
> > 
> > I proposed that having an individualized skb pool that users can control
> > would be beneficial. For example, users could define the number of skbs
> > in the pool. This could lead to additional advantages, such as allowing
> > netpoll to directly consume from the pool instead of relying on alloc()
> > in the optimal scenario, thereby speeding up the critical path.
> 
> Patch 1 is the one I'm not completely convinced by. I understand
> the motivation, but it's rather unusual to activate partially initialized
> objects. Maybe let's leave it out.
> 
> The rest is fine, although I'd invert the justification for the second
> patch. We should in fact scale the number of pooled packets with the
> number of consoles. Each message gets sent to every console, so a system
> with 2 netconsoles has effectively half the OOM cushion.

That is fair. Thanks for the guidance. I will keep patch 1 out of it and
send a v2.

Thanks
--breno