[PATCH net-next v6 2/7] net: Suspend softirq when prefer_busy_poll is set

Joe Damato posted 7 patches 2 weeks, 5 days ago
There is a newer version of this series
[PATCH net-next v6 2/7] net: Suspend softirq when prefer_busy_poll is set
Posted by Joe Damato 2 weeks, 5 days ago
From: Martin Karsten <mkarsten@uwaterloo.ca>

When NAPI_F_PREFER_BUSY_POLL is set during busy_poll_stop and the
irq_suspend_timeout is nonzero, this timeout is used to defer softirq
scheduling, potentially longer than gro_flush_timeout. This can be used
to effectively suspend softirq processing during the time it takes for
an application to process data and return to the next busy-poll.

The call to napi->poll in busy_poll_stop might lead to an invocation of
napi_complete_done, but the prefer-busy flag is still set at that time,
so the same logic is used to defer softirq scheduling for
irq_suspend_timeout.

Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 v3:
   - Removed reference to non-existent sysfs parameter from commit
     message. No functional/code changes.

 net/core/dev.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4d910872963f..51d88f758e2e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6239,7 +6239,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 			timeout = napi_get_gro_flush_timeout(n);
 		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
 	}
-	if (n->defer_hard_irqs_count > 0) {
+	if (napi_prefer_busy_poll(n)) {
+		timeout = napi_get_irq_suspend_timeout(n);
+		if (timeout)
+			ret = false;
+	}
+	if (ret && n->defer_hard_irqs_count > 0) {
 		n->defer_hard_irqs_count--;
 		timeout = napi_get_gro_flush_timeout(n);
 		if (timeout)
@@ -6375,9 +6380,13 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 
 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
-		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
-		timeout = napi_get_gro_flush_timeout(napi);
-		if (napi->defer_hard_irqs_count && timeout) {
+		timeout = napi_get_irq_suspend_timeout(napi);
+		if (!timeout) {
+			napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
+			if (napi->defer_hard_irqs_count)
+				timeout = napi_get_gro_flush_timeout(napi);
+		}
+		if (timeout) {
 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
 			skip_schedule = true;
 		}
-- 
2.25.1
Re: [PATCH net-next v6 2/7] net: Suspend softirq when prefer_busy_poll is set
Posted by Jakub Kicinski 2 weeks, 4 days ago
On Mon,  4 Nov 2024 21:55:26 +0000 Joe Damato wrote:
> From: Martin Karsten <mkarsten@uwaterloo.ca>
> 
> When NAPI_F_PREFER_BUSY_POLL is set during busy_poll_stop and the
> irq_suspend_timeout is nonzero, this timeout is used to defer softirq
> scheduling, potentially longer than gro_flush_timeout. This can be used
> to effectively suspend softirq processing during the time it takes for
> an application to process data and return to the next busy-poll.
> 
> The call to napi->poll in busy_poll_stop might lead to an invocation of

The call to napi->poll when we're arming the timer is counter
productive, right? Maybe we can take this opportunity to add
the seemingly missing logic to skip over it?

> napi_complete_done, but the prefer-busy flag is still set at that time,
> so the same logic is used to defer softirq scheduling for
> irq_suspend_timeout.
> 
> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Co-developed-by: Joe Damato <jdamato@fastly.com>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> Tested-by: Joe Damato <jdamato@fastly.com>
> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---
>  v3:
>    - Removed reference to non-existent sysfs parameter from commit
>      message. No functional/code changes.
> 
>  net/core/dev.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4d910872963f..51d88f758e2e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6239,7 +6239,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>  			timeout = napi_get_gro_flush_timeout(n);
>  		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
>  	}
> -	if (n->defer_hard_irqs_count > 0) {
> +	if (napi_prefer_busy_poll(n)) {
> +		timeout = napi_get_irq_suspend_timeout(n);

Why look at the suspend timeout in napi_complete_done()?
We are unlikely to be exiting busy poll here.
Is it because we need more time than gro_flush_timeout
for the application to take over the polling?

> +		if (timeout)
> +			ret = false;
> +	}
> +	if (ret && n->defer_hard_irqs_count > 0) {
>  		n->defer_hard_irqs_count--;
>  		timeout = napi_get_gro_flush_timeout(n);
>  		if (timeout)
> @@ -6375,9 +6380,13 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
>  	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>  
>  	if (flags & NAPI_F_PREFER_BUSY_POLL) {
> -		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> -		timeout = napi_get_gro_flush_timeout(napi);
> -		if (napi->defer_hard_irqs_count && timeout) {
> +		timeout = napi_get_irq_suspend_timeout(napi);

Even here I'm not sure if we need to trigger suspend.
I don't know the eventpoll code well but it seems like you suspend 
and resume based on events when exiting epoll. Why also here?

> +		if (!timeout) {
> +			napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> +			if (napi->defer_hard_irqs_count)
> +				timeout = napi_get_gro_flush_timeout(napi);
> +		}
> +		if (timeout) {
>  			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
>  			skip_schedule = true;
>  		}
Re: [PATCH net-next v6 2/7] net: Suspend softirq when prefer_busy_poll is set
Posted by Joe Damato 2 weeks, 3 days ago
On Tue, Nov 05, 2024 at 09:03:38PM -0800, Jakub Kicinski wrote:
> On Mon,  4 Nov 2024 21:55:26 +0000 Joe Damato wrote:
> > From: Martin Karsten <mkarsten@uwaterloo.ca>
> > 
> > When NAPI_F_PREFER_BUSY_POLL is set during busy_poll_stop and the
> > irq_suspend_timeout is nonzero, this timeout is used to defer softirq
> > scheduling, potentially longer than gro_flush_timeout. This can be used
> > to effectively suspend softirq processing during the time it takes for
> > an application to process data and return to the next busy-poll.
> > 
> > The call to napi->poll in busy_poll_stop might lead to an invocation of
> 
> The call to napi->poll when we're arming the timer is counter
> productive, right? Maybe we can take this opportunity to add
> the seemingly missing logic to skip over it?

It seems like the call to napi->poll in busy_poll_stop is counter
productive and we're not opposed to making an optimization like that
in the future.

When we tried it, it triggered several bugs/system hangs, so we left
as much of the original code in place as possible.

The existing patch works and streamlining busy_poll_stop to skip the
call to napi->poll is an optimization that can be added as a later
series that focuses solely on when/where/how napi->poll is called.

Our focus was on:
  - Not breaking any of the existing mechanisms
  - Adding a new mechanism

I think we should avoid pulling the optimization you suggest into
this particular series and save that for the future.

> > napi_complete_done, but the prefer-busy flag is still set at that time,
> > so the same logic is used to defer softirq scheduling for
> > irq_suspend_timeout.
> > 
> > Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > Co-developed-by: Joe Damato <jdamato@fastly.com>
> > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > Tested-by: Joe Damato <jdamato@fastly.com>
> > Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > Acked-by: Stanislav Fomichev <sdf@fomichev.me>
> > Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> >  v3:
> >    - Removed reference to non-existent sysfs parameter from commit
> >      message. No functional/code changes.
> > 
> >  net/core/dev.c | 17 +++++++++++++----
> >  1 file changed, 13 insertions(+), 4 deletions(-)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 4d910872963f..51d88f758e2e 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6239,7 +6239,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> >  			timeout = napi_get_gro_flush_timeout(n);
> >  		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
> >  	}
> > -	if (n->defer_hard_irqs_count > 0) {
> > +	if (napi_prefer_busy_poll(n)) {
> > +		timeout = napi_get_irq_suspend_timeout(n);
> 
> Why look at the suspend timeout in napi_complete_done()?
> We are unlikely to be exiting busy poll here.

The idea is similar to commit 7fd3253a7de6 ("net: Introduce
preferred busy-polling"); continue to defer IRQs as long as forward
progress is being made. In this case, napi->poll ran, called
napi_complete_done -- the system is moving forward with processing
so prevent IRQs from interrupting us.

epoll_wait will re-enable IRQs (by calling napi_schedule) if
there are no events ready for processing.

> Is it because we need more time than gro_flush_timeout
> for the application to take over the polling?

That's right; we want the application to retain control of packet
processing. That's why we connected this to the "prefer_busy_poll"
flag.

> > +		if (timeout)
> > +			ret = false;
> > +	}
> > +	if (ret && n->defer_hard_irqs_count > 0) {
> >  		n->defer_hard_irqs_count--;
> >  		timeout = napi_get_gro_flush_timeout(n);
> >  		if (timeout)
> > @@ -6375,9 +6380,13 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
> >  	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> >  
> >  	if (flags & NAPI_F_PREFER_BUSY_POLL) {
> > -		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> > -		timeout = napi_get_gro_flush_timeout(napi);
> > -		if (napi->defer_hard_irqs_count && timeout) {
> > +		timeout = napi_get_irq_suspend_timeout(napi);
> 
> Even here I'm not sure if we need to trigger suspend.
> I don't know the eventpoll code well but it seems like you suspend 
> and resume based on events when exiting epoll. Why also here?

There's two questions wrapped up here and an overall point to make:

1. Suspend and resume based on events when exiting epoll - that's
   right and as you'll see in those patches that happens by:
     - arming the suspend timer (via a call to napi_suspend_irqs)
       when a positive number of events are retrieved
     - calling napi_schedule (via napi_resume_irqs) when there are
       no events or the epoll context is being freed.

2. Why defer the suspend timer here in busy_poll_stop? Note that the
   original code would set the timer to gro_flush_timeout, which
   would introduce the trade offs we mention in the cover letter
   (latency for large values, IRQ interruption for small values).

   We don't want the gro_flush_timeout to take over yet because we
   want to avoid these tradeoffs up until the point where epoll_wait
   finds no events for processing.

   Does that make sense? If we skipped the IRQ suspend deferral
   here, we'd be giving packet processing control back to
   gro_flush_timeout and napi_defer_hard_irqs, but the system might
   still have packets that can be processed in the next call to
   epoll_wait.

The overall point to make is that: the suspend timer is used to
prevent misbehaving userland applications from taking too long. It's
essentially a backstop and, as long as the app is making forward
progress, allows the app to continue running its busy poll loop
undisturbed (via napi_complete_done preventing the driver from
enabling IRQs).

Does that make sense?

> > +		if (!timeout) {
> > +			napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> > +			if (napi->defer_hard_irqs_count)
> > +				timeout = napi_get_gro_flush_timeout(napi);
> > +		}
> > +		if (timeout) {
> >  			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
> >  			skip_schedule = true;
> >  		}
>
Re: [PATCH net-next v6 2/7] net: Suspend softirq when prefer_busy_poll is set
Posted by Jakub Kicinski 2 weeks, 3 days ago
On Wed, 6 Nov 2024 08:52:00 -0800 Joe Damato wrote:
> On Tue, Nov 05, 2024 at 09:03:38PM -0800, Jakub Kicinski wrote:
> > On Mon,  4 Nov 2024 21:55:26 +0000 Joe Damato wrote:  
> > > From: Martin Karsten <mkarsten@uwaterloo.ca>
> > > 
> > > When NAPI_F_PREFER_BUSY_POLL is set during busy_poll_stop and the
> > > irq_suspend_timeout is nonzero, this timeout is used to defer softirq
> > > scheduling, potentially longer than gro_flush_timeout. This can be used
> > > to effectively suspend softirq processing during the time it takes for
> > > an application to process data and return to the next busy-poll.
> > > 
> > > The call to napi->poll in busy_poll_stop might lead to an invocation of  
> > 
> > The call to napi->poll when we're arming the timer is counter
> > productive, right? Maybe we can take this opportunity to add
> > the seemingly missing logic to skip over it?  
> 
> It seems like the call to napi->poll in busy_poll_stop is counter
> productive and we're not opposed to making an optimization like that
> in the future.
> 
> When we tried it, it triggered several bugs/system hangs, so we left
> as much of the original code in place as possible.

You don't happen to have the patch you used? Many ways to get the
skipping wrong.

> The existing patch works and streamlining busy_poll_stop to skip the
> call to napi->poll is an optimization that can be added as a later
> series that focuses solely on when/where/how napi->poll is called.

The reason I brought it up is that it rearms the timer, if driver 
ends up calling napi_complete_done(). So we arm the timer in
napi_poll_stop(), then call the driver which may rearm again, 
making the already complex code even harder to reason about.

> Our focus was on:
>   - Not breaking any of the existing mechanisms
>   - Adding a new mechanism
> 
> I think we should avoid pulling the optimization you suggest into
> this particular series and save that for the future.

I'm primarily worried about maintainability of this code.

> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 4d910872963f..51d88f758e2e 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6239,7 +6239,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> > >  			timeout = napi_get_gro_flush_timeout(n);
> > >  		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
> > >  	}
> > > -	if (n->defer_hard_irqs_count > 0) {
> > > +	if (napi_prefer_busy_poll(n)) {
> > > +		timeout = napi_get_irq_suspend_timeout(n);  
> > 
> > Why look at the suspend timeout in napi_complete_done()?
> > We are unlikely to be exiting busy poll here.  
> 
> The idea is similar to commit 7fd3253a7de6 ("net: Introduce
> preferred busy-polling"); continue to defer IRQs as long as forward
> progress is being made. In this case, napi->poll ran, called
> napi_complete_done -- the system is moving forward with processing
> so prevent IRQs from interrupting us.

We should clarify the mental models. You're describing IRQ deferal,
but say prefer busy poll.

Prefer busy poll has only one function - if we are at 100% busy
and always see >= budget of packets on the ring, we never call
napi_complete_done(). Drivers don't call napi_complete_done() if they
consumed full budget. So we need a way to break that re-polling loop,
release the NAPI ownership and give busy poll a chance to claim the
NAPI instance ownership (the SCHED bit). We check for prefer
busy poll in __napi_poll(), because, again, in the target scenario
napi_complete_done() is never called.

The IRQ deferal mechanism is necessary for prefer busy poll to work,
but it's separate and used by some drivers without good IRQ coalescing,
no user space polling involved.

In your case, when machine is not melting under 100% load - prefer busy
poll will be set once or not at all.

> epoll_wait will re-enable IRQs (by calling napi_schedule) if
> there are no events ready for processing.

To be 100% precise calling napi_schedule will not reenable IRQs 
if IRQ deferal is active. It only guarantees one NAPI run in 
softirq (or thread if threaded).

> > Is it because we need more time than gro_flush_timeout
> > for the application to take over the polling?  
> 
> That's right; we want the application to retain control of packet
> processing. That's why we connected this to the "prefer_busy_poll"
> flag.
> 
> > > +		if (timeout)
> > > +			ret = false;
> > > +	}
> > > +	if (ret && n->defer_hard_irqs_count > 0) {
> > >  		n->defer_hard_irqs_count--;
> > >  		timeout = napi_get_gro_flush_timeout(n);
> > >  		if (timeout)
> > > @@ -6375,9 +6380,13 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
> > >  	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> > >  
> > >  	if (flags & NAPI_F_PREFER_BUSY_POLL) {
> > > -		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> > > -		timeout = napi_get_gro_flush_timeout(napi);
> > > -		if (napi->defer_hard_irqs_count && timeout) {
> > > +		timeout = napi_get_irq_suspend_timeout(napi);  
> > 
> > Even here I'm not sure if we need to trigger suspend.
> > I don't know the eventpoll code well but it seems like you suspend 
> > and resume based on events when exiting epoll. Why also here?  
> 
> There's two questions wrapped up here and an overall point to make:
> 
> 1. Suspend and resume based on events when exiting epoll - that's
>    right and as you'll see in those patches that happens by:
>      - arming the suspend timer (via a call to napi_suspend_irqs)
>        when a positive number of events are retrieved
>      - calling napi_schedule (via napi_resume_irqs) when there are
>        no events or the epoll context is being freed.
> 
> 2. Why defer the suspend timer here in busy_poll_stop? Note that the
>    original code would set the timer to gro_flush_timeout, which
>    would introduce the trade offs we mention in the cover letter
>    (latency for large values, IRQ interruption for small values).
> 
>    We don't want the gro_flush_timeout to take over yet because we
>    want to avoid these tradeoffs up until the point where epoll_wait
>    finds no events for processing.
> 
>    Does that make sense? If we skipped the IRQ suspend deferral
>    here, we'd be giving packet processing control back to
>    gro_flush_timeout and napi_defer_hard_irqs, but the system might
>    still have packets that can be processed in the next call to
>    epoll_wait.

Let me tell you what I think happens and then you can correct me.

0 epoll
1   # ..does its magic..
2   __napi_busy_loop() 
3     # finds a waking packet
4     busy_poll_stop()
5       # arms the timer for long suspend
6   # epoll sees events
7     ep_suspend_napi_irqs()
8       napi_suspend_irqs()
9         # arms for long timer again

The timer we arm here only has to survive from line 5 to line 9,
because we will override the timeout on line 9.

> The overall point to make is that: the suspend timer is used to
> prevent misbehaving userland applications from taking too long. It's
> essentially a backstop and, as long as the app is making forward
> progress, allows the app to continue running its busy poll loop
> undisturbed (via napi_complete_done preventing the driver from
> enabling IRQs).
> 
> Does that make sense?

My mental model put in yet another way is that only epoll knows if it
has events, and therefore whether the timeout should be short or long.
So the suspend timer should only be applied by epoll.
Re: [PATCH net-next v6 2/7] net: Suspend softirq when prefer_busy_poll is set
Posted by Joe Damato 2 weeks, 3 days ago
On Wed, Nov 06, 2024 at 03:31:00PM -0800, Jakub Kicinski wrote:
> On Wed, 6 Nov 2024 08:52:00 -0800 Joe Damato wrote:
> > On Tue, Nov 05, 2024 at 09:03:38PM -0800, Jakub Kicinski wrote:
> > > On Mon,  4 Nov 2024 21:55:26 +0000 Joe Damato wrote:  
> > > > From: Martin Karsten <mkarsten@uwaterloo.ca>
> > > > 
> > > > When NAPI_F_PREFER_BUSY_POLL is set during busy_poll_stop and the
> > > > irq_suspend_timeout is nonzero, this timeout is used to defer softirq
> > > > scheduling, potentially longer than gro_flush_timeout. This can be used
> > > > to effectively suspend softirq processing during the time it takes for
> > > > an application to process data and return to the next busy-poll.
> > > > 
> > > > The call to napi->poll in busy_poll_stop might lead to an invocation of  
> > > 
> > > The call to napi->poll when we're arming the timer is counter
> > > productive, right? Maybe we can take this opportunity to add
> > > the seemingly missing logic to skip over it?  
> > 
> > It seems like the call to napi->poll in busy_poll_stop is counter
> > productive and we're not opposed to making an optimization like that
> > in the future.
> > 
> > When we tried it, it triggered several bugs/system hangs, so we left
> > as much of the original code in place as possible.
> 
> You don't happen to have the patch you used? Many ways to get the
> skipping wrong.

Please see below; I think we've found a solution.
 
> > The existing patch works and streamlining busy_poll_stop to skip the
> > call to napi->poll is an optimization that can be added as a later
> > series that focuses solely on when/where/how napi->poll is called.
> 
> The reason I brought it up is that it rearms the timer, if driver 
> ends up calling napi_complete_done(). So we arm the timer in
> napi_poll_stop(), then call the driver which may rearm again, 
> making the already complex code even harder to reason about.

Agreed that the timer is unnecessarily re-armed twice.

Martin ran some initial tests of this series but with this patch
(patch 2) dropped and the initial results from a small number of
runs seem fine.

In other words: I think we can simply drop this patch entirely,
re-run our tests to regenerate the data, update the documentation,
and send a v7.

But please continue reading below.

> > Our focus was on:
> >   - Not breaking any of the existing mechanisms
> >   - Adding a new mechanism
> > 
> > I think we should avoid pulling the optimization you suggest into
> > this particular series and save that for the future.
> 
> I'm primarily worried about maintainability of this code.

Of course and we're open to figuring out how to help with that.

> > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > index 4d910872963f..51d88f758e2e 100644
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -6239,7 +6239,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> > > >  			timeout = napi_get_gro_flush_timeout(n);
> > > >  		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
> > > >  	}
> > > > -	if (n->defer_hard_irqs_count > 0) {
> > > > +	if (napi_prefer_busy_poll(n)) {
> > > > +		timeout = napi_get_irq_suspend_timeout(n);  
> > > 
> > > Why look at the suspend timeout in napi_complete_done()?
> > > We are unlikely to be exiting busy poll here.  
> > 
> > The idea is similar to commit 7fd3253a7de6 ("net: Introduce
> > preferred busy-polling"); continue to defer IRQs as long as forward
> > progress is being made. In this case, napi->poll ran, called
> > napi_complete_done -- the system is moving forward with processing
> > so prevent IRQs from interrupting us.
> 
> We should clarify the mental models. You're describing IRQ deferal,
> but say prefer busy poll.

OK; we're open to using different language if that would be helpful.

> Prefer busy poll has only one function - if we are at 100% busy
> and always see >= budget of packets on the ring, we never call
> napi_complete_done(). Drivers don't call napi_complete_done() if they
> consumed full budget. So we need a way to break that re-polling loop,
> release the NAPI ownership and give busy poll a chance to claim the
> NAPI instance ownership (the SCHED bit). We check for prefer
> busy poll in __napi_poll(), because, again, in the target scenario
> napi_complete_done() is never called.
> 
> The IRQ deferal mechanism is necessary for prefer busy poll to work,
> but it's separate and used by some drivers without good IRQ coalescing,
> no user space polling involved.

Sure, we agree and are on the same page on the above about what
prefer busy poll is and its interaction with IRQ deferral.

> In your case, when machine is not melting under 100% load - prefer busy
> poll will be set once or not at all.

I am not sure what you mean by that last sentence, because the
prefer busy poll flag is set by the application?

Similar to prefer busy poll piggybacking on IRQ deferral, we
piggyback on prefer busy polling by allowing the application to use
an even larger timeout while it is processing incoming data, i.e.,
deferring IRQs for a longer period, but only after a successful busy
poll. This makes prefer busy poll + irq suspend useful when
utilization is below 100%.

> > epoll_wait will re-enable IRQs (by calling napi_schedule) if
> > there are no events ready for processing.
> 
> To be 100% precise calling napi_schedule will not reenable IRQs 
> if IRQ deferal is active. It only guarantees one NAPI run in 
> softirq (or thread if threaded).

Yes, I should have been more precise.

Calling napi_schedule doesn't enable IRQs, but runs NAPI which sets
about the process of *eventually* causing napi_complete_done to
return true which triggers the re-arming of IRQs by the driver.

> > > Is it because we need more time than gro_flush_timeout
> > > for the application to take over the polling?  
> > 
> > That's right; we want the application to retain control of packet
> > processing. That's why we connected this to the "prefer_busy_poll"
> > flag.
> > 
> > > > +		if (timeout)
> > > > +			ret = false;
> > > > +	}
> > > > +	if (ret && n->defer_hard_irqs_count > 0) {
> > > >  		n->defer_hard_irqs_count--;
> > > >  		timeout = napi_get_gro_flush_timeout(n);
> > > >  		if (timeout)
> > > > @@ -6375,9 +6380,13 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
> > > >  	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> > > >  
> > > >  	if (flags & NAPI_F_PREFER_BUSY_POLL) {
> > > > -		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> > > > -		timeout = napi_get_gro_flush_timeout(napi);
> > > > -		if (napi->defer_hard_irqs_count && timeout) {
> > > > +		timeout = napi_get_irq_suspend_timeout(napi);  
> > > 
> > > Even here I'm not sure if we need to trigger suspend.
> > > I don't know the eventpoll code well but it seems like you suspend 
> > > and resume based on events when exiting epoll. Why also here?  
> > 
> > There's two questions wrapped up here and an overall point to make:
> > 
> > 1. Suspend and resume based on events when exiting epoll - that's
> >    right and as you'll see in those patches that happens by:
> >      - arming the suspend timer (via a call to napi_suspend_irqs)
> >        when a positive number of events are retrieved
> >      - calling napi_schedule (via napi_resume_irqs) when there are
> >        no events or the epoll context is being freed.
> > 
> > 2. Why defer the suspend timer here in busy_poll_stop? Note that the
> >    original code would set the timer to gro_flush_timeout, which
> >    would introduce the trade offs we mention in the cover letter
> >    (latency for large values, IRQ interruption for small values).
> > 
> >    We don't want the gro_flush_timeout to take over yet because we
> >    want to avoid these tradeoffs up until the point where epoll_wait
> >    finds no events for processing.
> > 
> >    Does that make sense? If we skipped the IRQ suspend deferral
> >    here, we'd be giving packet processing control back to
> >    gro_flush_timeout and napi_defer_hard_irqs, but the system might
> >    still have packets that can be processed in the next call to
> >    epoll_wait.
> 
> Let me tell you what I think happens and then you can correct me.
> 
> 0 epoll
> 1   # ..does its magic..
> 2   __napi_busy_loop() 
> 3     # finds a waking packet
> 4     busy_poll_stop()
> 5       # arms the timer for long suspend
> 6   # epoll sees events
> 7     ep_suspend_napi_irqs()
> 8       napi_suspend_irqs()
> 9         # arms for long timer again
> 
> The timer we arm here only has to survive from line 5 to line 9,
> because we will override the timeout on line 9.

Yes, you are right. Thanks for highlighting and catching this.

> > The overall point to make is that: the suspend timer is used to
> > prevent misbehaving userland applications from taking too long. It's
> > essentially a backstop and, as long as the app is making forward
> > progress, allows the app to continue running its busy poll loop
> > undisturbed (via napi_complete_done preventing the driver from
> > enabling IRQs).
> > 
> > Does that make sense?
> 
> My mental model put in yet another way is that only epoll knows if it
> has events, and therefore whether the timeout should be short or long.
> So the suspend timer should only be applied by epoll.

Here's what we are thinking, can you let me know if you agree with
this?

  - We can drop patch 2 entirely
  - Update the documentation about IRQ suspension as needed now
    that patch 2 has been dropped
  - Leave the rest of the series as is
  - Re-run our tests to gather sufficient data for the test cases
    outlined in the cover letter to ensure that the performance
    numbers hold over several iterations

Does that seem reasonable for the v7 to you?

I am asking because generating the amount of data over the number of
scenarios we are testing takes a long time and I want to make sure
we are as aligned as we can be before I kick off another run :)
Re: [PATCH net-next v6 2/7] net: Suspend softirq when prefer_busy_poll is set
Posted by Jakub Kicinski 2 weeks, 2 days ago
On Wed, 6 Nov 2024 19:24:32 -0800 Joe Damato wrote:
> > In your case, when machine is not melting under 100% load - prefer busy
> > poll will be set once or not at all.  
> 
> I am not sure what you mean by that last sentence, because the
> prefer busy poll flag is set by the application?

There are two flags with almost exactly the same name, that's probably
the source of misunderstanding. NAPI_F_PREFER_BUSY_POLL will always be
set, but unless we are actually in a "napi live lock"
NAPI_STATE_PREFER_BUSY_POLL will not get set, and the latter is what
napi_prefer_busy_poll() tests. But we're dropping this patch so
doesn't matter. I was trying to pile up reasons why checking
napi_prefer_busy_poll() should not be necessary.

> Similar to prefer busy poll piggybacking on IRQ deferral, we
> piggyback on prefer busy polling by allowing the application to use
> an even larger timeout while it is processing incoming data, i.e.,
> deferring IRQs for a longer period, but only after a successful busy
> poll. This makes prefer busy poll + irq suspend useful when
> utilization is below 100%.
> 
> > > The overall point to make is that: the suspend timer is used to
> > > prevent misbehaving userland applications from taking too long. It's
> > > essentially a backstop and, as long as the app is making forward
> > > progress, allows the app to continue running its busy poll loop
> > > undisturbed (via napi_complete_done preventing the driver from
> > > enabling IRQs).
> > > 
> > > Does that make sense?  
> > 
> > My mental model put in yet another way is that only epoll knows if it
> > has events, and therefore whether the timeout should be short or long.
> > So the suspend timer should only be applied by epoll.  
> 
> Here's what we are thinking, can you let me know if you agree with
> this?
> 
>   - We can drop patch 2 entirely
>   - Update the documentation about IRQ suspension as needed now
>     that patch 2 has been dropped
>   - Leave the rest of the series as is
>   - Re-run our tests to gather sufficient data for the test cases
>     outlined in the cover letter to ensure that the performance
>     numbers hold over several iterations
> 
> Does that seem reasonable for the v7 to you?
> 
> I am asking because generating the amount of data over the number of
> scenarios we are testing takes a long time and I want to make sure
> we are as aligned as we can be before I kick off another run :)

SGTM, the rest of the series makes perfect sense to me.
Sorry for the delay..
Re: [PATCH net-next v6 2/7] net: Suspend softirq when prefer_busy_poll is set
Posted by Joe Damato 2 weeks, 2 days ago
On Wed, Nov 06, 2024 at 07:24:32PM -0800, Joe Damato wrote:
> On Wed, Nov 06, 2024 at 03:31:00PM -0800, Jakub Kicinski wrote:
> > On Wed, 6 Nov 2024 08:52:00 -0800 Joe Damato wrote:
> > > On Tue, Nov 05, 2024 at 09:03:38PM -0800, Jakub Kicinski wrote:
> > > > On Mon,  4 Nov 2024 21:55:26 +0000 Joe Damato wrote:  

[...]

> > 0 epoll
> > 1   # ..does its magic..
> > 2   __napi_busy_loop() 
> > 3     # finds a waking packet
> > 4     busy_poll_stop()
> > 5       # arms the timer for long suspend
> > 6   # epoll sees events
> > 7     ep_suspend_napi_irqs()
> > 8       napi_suspend_irqs()
> > 9         # arms for long timer again
> > 
> > The timer we arm here only has to survive from line 5 to line 9,
> > because we will override the timeout on line 9.
> 
> Yes, you are right. Thanks for highlighting and catching this.
> 
> > > The overall point to make is that: the suspend timer is used to
> > > prevent misbehaving userland applications from taking too long. It's
> > > essentially a backstop and, as long as the app is making forward
> > > progress, allows the app to continue running its busy poll loop
> > > undisturbed (via napi_complete_done preventing the driver from
> > > enabling IRQs).
> > > 
> > > Does that make sense?
> > 
> > My mental model put in yet another way is that only epoll knows if it
> > has events, and therefore whether the timeout should be short or long.
> > So the suspend timer should only be applied by epoll.
> 
> Here's what we are thinking, can you let me know if you agree with
> this?
> 
>   - We can drop patch 2 entirely
>   - Update the documentation about IRQ suspension as needed now
>     that patch 2 has been dropped
>   - Leave the rest of the series as is
>   - Re-run our tests to gather sufficient data for the test cases
>     outlined in the cover letter to ensure that the performance
>     numbers hold over several iterations
> 
> Does that seem reasonable for the v7 to you?
> 
> I am asking because generating the amount of data over the number of
> scenarios we are testing takes a long time and I want to make sure
> we are as aligned as we can be before I kick off another run :)

I just noticed this was marked "changes requested". I re-ran the
tests overnight and have the data to confirm results are the same
even after dropping patch 2, which simplifies the code and removes
the double arming of the timer.

I wasn't sure if you were asking for other changes other than
dropping patch 2, but since I have the data I'm going to proceed as
specified in my previous email above:
  - Drop patch 2
  - Update cover letter with new data
  - Send that as v7

Unless you'd like me to hold off for some reason? Or if there was
some other feedback I need to address that I am missing?