We found at Vates that there are a lot of spurious interrupts when
benchmarking the PV drivers of Xen. This issue appeared with a patch
that addresses security issue XSA-391 (see Fixes below). On an iperf
benchmark, spurious interrupts can represent up to 50% of the
interrupts.

Spurious interrupts are interrupts that are raised for nothing: there is
no work to do. This happens because the function that handles the
interrupts ("xennet_tx_buf_gc") is also called at the end of the request
path to garbage collect the responses received during the transmission
load.

The request path is doing the work that the interrupt handler would
otherwise have done. This is particularly true when there is more than
one vcpu, and it gets worse linearly with the number of vcpus/queues.

Moreover, this problem is amplified by the penalty imposed on a spurious
interrupt. When an interrupt is found spurious, the interrupt chip will
delay the EOI to slow down the backend. This delay allows more responses
to be handled by the request path, so the next interrupt is even more
likely to find no work to do, creating yet another spurious interrupt.
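
For reference, here is a simplified sketch of what the TX interrupt path
looks like after the XSA-391 hardening (paraphrased from b27d47950e48,
not a verbatim copy of the driver): the handler only acks the event with
a normal EOI when xennet_tx_buf_gc() actually consumed responses,
otherwise it flags the event as spurious, which triggers the EOI delay
described above.

	static irqreturn_t xennet_tx_interrupt(int irq, void *dev_id)
	{
		struct netfront_queue *queue = dev_id;
		unsigned int eoiflag = XEN_EOI_FLAG_SPURIOUS;
		unsigned long flags;

		spin_lock_irqsave(&queue->tx_lock, flags);
		if (xennet_tx_buf_gc(queue))	/* true if a response was processed */
			eoiflag = 0;		/* real work done: normal EOI */
		spin_unlock_irqrestore(&queue->tx_lock, flags);

		xen_irq_lateeoi(irq, eoiflag);	/* delayed EOI when spurious */

		return IRQ_HANDLED;
	}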

This causes a performance issue. The solution here is to remove the calls
from the request path and let the interrupt handler do the processing of
the responses. This approach removes spurious interrupts (<0.05%) and
also has the benefit of freeing up cycles in the request path, allowing
it to process more work, which improves performance compared to masking
the spurious interrupt one way or another.

Some vif throughput performance figures from an 8 vCPU, 4GB RAM HVM
guest(s):
Without this patch on the :
vm -> dom0: 4.5Gb/s
vm -> vm:   7.0Gb/s
Without XSA-391 patch (revert of b27d47950e48):
vm -> dom0: 8.3Gb/s
vm -> vm:   8.7Gb/s
With XSA-391 and this patch:
vm -> dom0: 11.5Gb/s
vm -> vm:   12.6Gb/s
Fixes: b27d47950e48 ("xen/netfront: harden netfront against event channel storms")
Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
---
 drivers/net/xen-netfront.c | 5 -----
 1 file changed, 5 deletions(-)
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 9bac50963477..a11a0e949400 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -638,8 +638,6 @@ static int xennet_xdp_xmit_one(struct net_device *dev,
 	tx_stats->packets++;
 	u64_stats_update_end(&tx_stats->syncp);
 
-	xennet_tx_buf_gc(queue);
-
 	return 0;
 }
 
@@ -849,9 +847,6 @@ static netdev_tx_t xennet_start_xmit(struct sk_buff *skb, struct net_device *dev
 	tx_stats->packets++;
 	u64_stats_update_end(&tx_stats->syncp);
 
-	/* Note: It is not safe to access skb after xennet_tx_buf_gc()! */
-	xennet_tx_buf_gc(queue);
-
 	if (!netfront_tx_slot_available(queue))
 		netif_tx_stop_queue(netdev_get_tx_queue(dev, queue->id));
 
-- 
2.49.1
Anthoine Bourgeois | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech

On 10.07.25 18:11, Anthoine Bourgeois wrote:
> We found at Vates that there are lot of spurious interrupts when
> benchmarking the PV drivers of Xen. This issue appeared with a patch
> that addresses security issue XSA-391 (see Fixes below). On an iperf
> benchmark, spurious interrupts can represent up to 50% of the
> interrupts.
> 
> Spurious interrupts are interrupts that are rised for nothing, there is
> no work to do. This appends because the function that handles the
> interrupts ("xennet_tx_buf_gc") is also called at the end of the request
> path to garbage collect the responses received during the transmission
> load.
> 
> The request path is doing the work that the interrupt handler should
> have done otherwise. This is particurary true when there is more than
> one vcpu and get worse linearly with the number of vcpu/queue.
> 
> Moreover, this problem is amplifyed by the penalty imposed by a spurious
> interrupt. When an interrupt is found spurious the interrupt chip will
> delay the EOI to slowdown the backend. This delay will allow more
> responses to be handled by the request path and then there will be more
> chance the next interrupt will not find any work to do, creating a new
> spurious interrupt.
> 
> This causes performance issue. The solution here is to remove the calls
> from the request path and let the interrupt handler do the processing of
> the responses. This approch removes spurious interrupts (<0.05%) and
> also has the benefit of freeing up cycles in the request path, allowing
> it to process more work, which improves performance compared to masking
> the spurious interrupt one way or another.
> 
> Some vif throughput performance figures from a 8 vCPUs, 4GB of RAM HVM
> guest(s):
> 
> Without this patch on the :
> vm -> dom0: 4.5Gb/s
> vm -> vm:   7.0Gb/s
> 
> Without XSA-391 patch (revert of b27d47950e48):
> vm -> dom0: 8.3Gb/s
> vm -> vm:   8.7Gb/s
> 
> With XSA-391 and this patch:
> vm -> dom0: 11.5Gb/s
> vm -> vm:   12.6Gb/s
> 
> Fixes: b27d47950e48 ("xen/netfront: harden netfront against event channel storms")
> Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
Please resend this patch with the relevant maintainers added in the
recipients list.
You can add my Reviewed-by: tag, of course.
Juergen
                
On Fri, Jul 11, 2025 at 05:33:43PM +0200, Juergen Gross wrote:
>On 10.07.25 18:11, Anthoine Bourgeois wrote:
>> We found at Vates that there are lot of spurious interrupts when
>> benchmarking the PV drivers of Xen. This issue appeared with a patch
>> that addresses security issue XSA-391 (see Fixes below). On an iperf
>> benchmark, spurious interrupts can represent up to 50% of the
>> interrupts.
>>
>> Spurious interrupts are interrupts that are rised for nothing, there is
>> no work to do. This appends because the function that handles the
>> interrupts ("xennet_tx_buf_gc") is also called at the end of the request
>> path to garbage collect the responses received during the transmission
>> load.
>>
>> The request path is doing the work that the interrupt handler should
>> have done otherwise. This is particurary true when there is more than
>> one vcpu and get worse linearly with the number of vcpu/queue.
>>
>> Moreover, this problem is amplifyed by the penalty imposed by a spurious
>> interrupt. When an interrupt is found spurious the interrupt chip will
>> delay the EOI to slowdown the backend. This delay will allow more
>> responses to be handled by the request path and then there will be more
>> chance the next interrupt will not find any work to do, creating a new
>> spurious interrupt.
>>
>> This causes performance issue. The solution here is to remove the calls
>> from the request path and let the interrupt handler do the processing of
>> the responses. This approch removes spurious interrupts (<0.05%) and
>> also has the benefit of freeing up cycles in the request path, allowing
>> it to process more work, which improves performance compared to masking
>> the spurious interrupt one way or another.
>>
>> Some vif throughput performance figures from a 8 vCPUs, 4GB of RAM HVM
>> guest(s):
>>
>> Without this patch on the :
>> vm -> dom0: 4.5Gb/s
>> vm -> vm:   7.0Gb/s
>>
>> Without XSA-391 patch (revert of b27d47950e48):
>> vm -> dom0: 8.3Gb/s
>> vm -> vm:   8.7Gb/s
>>
>> With XSA-391 and this patch:
>> vm -> dom0: 11.5Gb/s
>> vm -> vm:   12.6Gb/s
>>
>> Fixes: b27d47950e48 ("xen/netfront: harden netfront against event channel storms")
>> Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
>
>Please resend this patch with the relevant maintainers added in the
>recipients list.
Ok, I will resend the patch tomorrow.
>
>You can add my Reviewed-by: tag, of course.
Thanks!
Anthoine
Anthoine Bourgeois | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech
                
On Mon, Jul 14, 2025 at 07:11:06AM +0000, Anthoine Bourgeois wrote:
> On Fri, Jul 11, 2025 at 05:33:43PM +0200, Juergen Gross wrote:
> >On 10.07.25 18:11, Anthoine Bourgeois wrote:
> >> We found at Vates that there are lot of spurious interrupts when
> >> benchmarking the PV drivers of Xen. This issue appeared with a patch
> >> that addresses security issue XSA-391 (see Fixes below). On an iperf
> >> benchmark, spurious interrupts can represent up to 50% of the
> >> interrupts.
> >>
> >> Spurious interrupts are interrupts that are rised for nothing, there is
> >> no work to do. This appends because the function that handles the
> >> interrupts ("xennet_tx_buf_gc") is also called at the end of the request
> >> path to garbage collect the responses received during the transmission
> >> load.
> >>
> >> The request path is doing the work that the interrupt handler should
> >> have done otherwise. This is particurary true when there is more than
> >> one vcpu and get worse linearly with the number of vcpu/queue.
> >>
> >> Moreover, this problem is amplifyed by the penalty imposed by a spurious
> >> interrupt. When an interrupt is found spurious the interrupt chip will
> >> delay the EOI to slowdown the backend. This delay will allow more
> >> responses to be handled by the request path and then there will be more
> >> chance the next interrupt will not find any work to do, creating a new
> >> spurious interrupt.
> >>
> >> This causes performance issue. The solution here is to remove the calls
> >> from the request path and let the interrupt handler do the processing of
> >> the responses. This approch removes spurious interrupts (<0.05%) and
> >> also has the benefit of freeing up cycles in the request path, allowing
> >> it to process more work, which improves performance compared to masking
> >> the spurious interrupt one way or another.
> >>
> >> Some vif throughput performance figures from a 8 vCPUs, 4GB of RAM HVM
> >> guest(s):
> >>
> >> Without this patch on the :
> >> vm -> dom0: 4.5Gb/s
> >> vm -> vm:   7.0Gb/s
> >>
> >> Without XSA-391 patch (revert of b27d47950e48):
> >> vm -> dom0: 8.3Gb/s
> >> vm -> vm:   8.7Gb/s
> >>
> >> With XSA-391 and this patch:
> >> vm -> dom0: 11.5Gb/s
> >> vm -> vm:   12.6Gb/s
> >>
> >> Fixes: b27d47950e48 ("xen/netfront: harden netfront against event channel storms")
> >> Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
> >
> >Please resend this patch with the relevant maintainers added in the
> >recipients list.
> 
> Ok, I will resend the patch tomorrow.
> >
> >You can add my Reviewed-by: tag, of course.
> 
> Thanks!
Tested on a VM which this could be tried on.
Booting was successful, network appeared to function as it had been.
Spurious events continued to occur at roughly the same interval they had
been.
I can well believe this increases Xen network performance and may
reduce the occurrence of spurious interrupts, but it certainly doesn't
fully fix the problem(s).  Appears you're going to need to keep digging.
I believe this does count as Tested-by since I observed no new ill
effects.  Just the existing ill effects aren't fully solved.
-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
                
On Mon, Jul 14, 2025 at 05:37:51PM -0700, Elliott Mitchell wrote:
>On Mon, Jul 14, 2025 at 07:11:06AM +0000, Anthoine Bourgeois wrote:
>> On Fri, Jul 11, 2025 at 05:33:43PM +0200, Juergen Gross wrote:
>> >On 10.07.25 18:11, Anthoine Bourgeois wrote:
>
>Tested on a VM which this could be tried on.
>
>Booting was successful, network appeared to function as it had been.
>Spurious events continued to occur at roughly the same interval they had
>been.
>
>I can well believe this increases Xen network performance and may
>reduce the occurrence of spurious interrupts, but it certainly doesn't
>fully fix the problem(s).  Appears you're going to need to keep digging.
>
>I believe this does count as Tested-by since I observed no new ill
>effects.  Just the existing ill effects aren't fully solved.
>
Thank you for the test!
Could you send me the domU/dom0 kernel version and xen version ?

Regards,
Anthoine

Anthoine Bourgeois | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech
On Tue, Jul 15, 2025 at 08:21:40AM +0000, Anthoine Bourgeois wrote:
> On Mon, Jul 14, 2025 at 05:37:51PM -0700, Elliott Mitchell wrote:
> >On Mon, Jul 14, 2025 at 07:11:06AM +0000, Anthoine Bourgeois wrote:
> >> On Fri, Jul 11, 2025 at 05:33:43PM +0200, Juergen Gross wrote:
> >> >On 10.07.25 18:11, Anthoine Bourgeois wrote:
> >
> >Tested on a VM which this could be tried on.
> >
> >Booting was successful, network appeared to function as it had been.
> >Spurious events continued to occur at roughly the same interval they had
> >been.
> >
> >I can well believe this increases Xen network performance and may
> >reduce the occurrence of spurious interrupts, but it certainly doesn't
> >fully fix the problem(s).  Appears you're going to need to keep digging.
> >
> >I believe this does count as Tested-by since I observed no new ill
> >effects.  Just the existing ill effects aren't fully solved.
> >
>
> Thank you for the test!
> Could you send me the domU/dom0 kernel version and xen version ?

I tend to follow Debian, so kernel 6.1.140 and 4.17.6.  What may be
more notable is AMD processor.

When initially reported, it was reported as being more severe on systems
with AMD processors.  I've been wondering about the reason(s) behind
that.

-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
On Tue, Jul 15, 2025 at 12:19:34PM -0700, Elliott Mitchell wrote:
>On Tue, Jul 15, 2025 at 08:21:40AM +0000, Anthoine Bourgeois wrote:
>> On Mon, Jul 14, 2025 at 05:37:51PM -0700, Elliott Mitchell wrote:
>> >On Mon, Jul 14, 2025 at 07:11:06AM +0000, Anthoine Bourgeois wrote:
>> >> On Fri, Jul 11, 2025 at 05:33:43PM +0200, Juergen Gross wrote:
>> >> >On 10.07.25 18:11, Anthoine Bourgeois wrote:
>> >
>> >Tested on a VM which this could be tried on.
>> >
>> >Booting was successful, network appeared to function as it had been.
>> >Spurious events continued to occur at roughly the same interval they had
>> >been.
>> >
>> >I can well believe this increases Xen network performance and may
>> >reduce the occurrence of spurious interrupts, but it certainly doesn't
>> >fully fix the problem(s).  Appears you're going to need to keep digging.
>> >
>> >I believe this does count as Tested-by since I observed no new ill
>> >effects.  Just the existing ill effects aren't fully solved.
>> >
>>
>> Thank you for the test!
>> Could you send me the domU/dom0 kernel version and xen version ?
>
>I tend to follow Debian, so kernel 6.1.140 and 4.17.6.  What may be
>more notable is AMD processor.
>
>When initially reported, it was reported as being more severe on systems
>with AMD processors.  I've been wondering about the reason(s) behind
>that.

AMD processors could make a huge difference. On Ryzen, this patch could
almost double the bandwidth and on Epyc close to nothing with low
frequency models, there is another bottleneck here I guess.
On which one do you test?

Do you know there is also a workaround on AMD processors about remapping
grant tables as WriteBack?
Upstream patch is 22650d605462 from XenServer.
The test package for XCP-ng with the patch:
https://xcp-ng.org/forum/topic/10943/network-traffic-performance-on-amd-processors

Regards,
Anthoine

Anthoine Bourgeois | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech
On Wed, Jul 16, 2025 at 07:47:48AM +0000, Anthoine Bourgeois wrote:
> On Tue, Jul 15, 2025 at 12:19:34PM -0700, Elliott Mitchell wrote:
> >On Tue, Jul 15, 2025 at 08:21:40AM +0000, Anthoine Bourgeois wrote:
> >> On Mon, Jul 14, 2025 at 05:37:51PM -0700, Elliott Mitchell wrote:
> >> >On Mon, Jul 14, 2025 at 07:11:06AM +0000, Anthoine Bourgeois wrote:
> >> >> On Fri, Jul 11, 2025 at 05:33:43PM +0200, Juergen Gross wrote:
> >> >> >On 10.07.25 18:11, Anthoine Bourgeois wrote:
> >> >
> >> >Tested on a VM which this could be tried on.
> >> >
> >> >Booting was successful, network appeared to function as it had been.
> >> >Spurious events continued to occur at roughly the same interval they had
> >> >been.
> >> >
> >> >I can well believe this increases Xen network performance and may
> >> >reduce the occurrence of spurious interrupts, but it certainly doesn't
> >> >fully fix the problem(s).  Appears you're going to need to keep digging.
> >> >
> >> >I believe this does count as Tested-by since I observed no new ill
> >> >effects.  Just the existing ill effects aren't fully solved.
> >>
> >> Thank you for the test!
> >> Could you send me the domU/dom0 kernel version and xen version ?
> >
> >I tend to follow Debian, so kernel 6.1.140 and 4.17.6.  What may be
> >more notable is AMD processor.
> >
> >When initially reported, it was reported as being more severe on systems
> >with AMD processors.  I've been wondering about the reason(s) behind
> >that.
>
> AMD processors could make a huge difference. On Ryzen, this patch could
> almost double the bandwidth and on Epyc close to nothing with low
> frequency models, there is another bottleneck here I guess.
> On which one do you test?
>
> Do you know there is also a workaround on AMD processors about remapping
> grant tables as WriteBack?
> Upstream patch is 22650d605462 from XenServer.
> The test package for XCP-ng with the patch:
> https://xcp-ng.org/forum/topic/10943/network-traffic-performance-on-amd-processors
>

Why are you jumping onto mostly unrelated issues when the current bug is
unfinished?

Spurious events continue to be observed on the network backend.  Spurious
events are also being observed on block and PCI backends.  You identified
one cause, but others remain.

(I'm hoping the next one effects all the back/front ends; the PCI backend
is a bigger issue for me)

Should add, one VM being observed with these issue(s) is using 6.12.38.

-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
On Wed, Jul 16, 2025 at 11:31:06AM -0700, Elliott Mitchell wrote:
> On Wed, Jul 16, 2025 at 07:47:48AM +0000, Anthoine Bourgeois wrote:
> > On Tue, Jul 15, 2025 at 12:19:34PM -0700, Elliott Mitchell wrote:
> > >On Tue, Jul 15, 2025 at 08:21:40AM +0000, Anthoine Bourgeois wrote:
> > >>
> > >> Thank you for the test!
> > >> Could you send me the domU/dom0 kernel version and xen version ?
> > >
> > >I tend to follow Debian, so kernel 6.1.140 and 4.17.6.  What may be
> > >more notable is AMD processor.
> > >
> > >When initially reported, it was reported as being more severe on systems
> > >with AMD processors.  I've been wondering about the reason(s) behind
> > >that.
> > 
> > AMD processors could make a huge difference. On Ryzen, this patch could
> > almost double the bandwidth and on Epyc close to nothing with low
> > frequency models, there is another bottleneck here I guess.
> > On which one do you test?
> > 
> > Do you know there is also a workaround on AMD processors about remapping
> > grant tables as WriteBack?
> > Upstream patch is 22650d605462 from XenServer.
> > The test package for XCP-ng with the patch:
> > https://xcp-ng.org/forum/topic/10943/network-traffic-performance-on-amd-processors
> > 
> 
> Why are you jumping onto mostly unrelated issues when the current bug is
> unfinished?
> 
> Spurious events continue to be observed on the network backend.  Spurious
> events are also being observed on block and PCI backends.  You identified
> one cause, but others remain.
> 
> (I'm hoping the next one effects all the back/front ends; the PCI backend
> is a bigger issue for me)
> 
> Should add, one VM being observed with these issue(s) is using 6.12.38.
For reference, the following:
for d in /sys/devices/{pci,vbd,vif}-*[0-9]-*[0-9]/xenbus
do      if [ -f "$d/spurious_events" ]
        then    read s < "$d/spurious_events"
        else    s=0
        fi
        if [ "$s" -gt 0 ]
        then    printf "problem %s: %d\\n" "$d/spurious_events" "$s"
        else    printf "clean: %s\\n" "$d/spurious_events"
        fi
done
Flags all passthrough and virtual devices.  Even though there is a
reduction with virtual network devices, that is only a 10% reduction.
Most of the problem remains even though there is progress.
I was mentioning an AMD processor since the initial report stated the
problem was more severe with AMD processor machines.
This is likely a driver design issue.  Most pieces of hardware, telling
the hardware to process an empty queue is quite cheap.  Perhaps minor
energy loss, but most hardware isn't (yet) too worried about being
attacked.
Passthrough and virtual devices are quite unusual in there being a
concern over attacks.  There could be major design flaws due to the
front-ends being designed similar to normal drivers.
-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
                
On Fri, Jul 18, 2025 at 01:48:17PM -0700, Elliott Mitchell wrote:
>On Wed, Jul 16, 2025 at 11:31:06AM -0700, Elliott Mitchell wrote:
>> On Wed, Jul 16, 2025 at 07:47:48AM +0000, Anthoine Bourgeois wrote:
>> > On Tue, Jul 15, 2025 at 12:19:34PM -0700, Elliott Mitchell wrote:
>> > >
>> > >I tend to follow Debian, so kernel 6.1.140 and 4.17.6.  What may be
>> > >more notable is AMD processor.
>> > >
>> > >When initially reported, it was reported as being more severe on systems
>> > >with AMD processors.  I've been wondering about the reason(s) behind
>> > >that.
>> >
>> > AMD processors could make a huge difference. On Ryzen, this patch could
>> > almost double the bandwidth and on Epyc close to nothing with low
>> > frequency models, there is another bottleneck here I guess.
>> > On which one do you test?
>> >
>> > Do you know there is also a workaround on AMD processors about remapping
>> > grant tables as WriteBack?
>> > Upstream patch is 22650d605462 from XenServer.
>> > The test package for XCP-ng with the patch:
>> > https://xcp-ng.org/forum/topic/10943/network-traffic-performance-on-amd-processors
>> >
>>
>> Why are you jumping onto mostly unrelated issues when the current bug is
>> unfinished?
>>
>> Spurious events continue to be observed on the network backend.  Spurious
>> events are also being observed on block and PCI backends.  You identified
>> one cause, but others remain.
>>
>> (I'm hoping the next one effects all the back/front ends; the PCI backend
>> is a bigger issue for me)
>>
>> Should add, one VM being observed with these issue(s) is using 6.12.38.
>
>For reference, the following:
>
>for d in /sys/devices/{pci,vbd,vif}-*[0-9]-*[0-9]/xenbus
>do      if [ -f "$d/spurious_events" ]
>        then    read s < "$d/spurious_events"
>        else    s=0
>        fi
>        if [ "$s" -gt 0 ]
>        then    printf "problem %s: %d\\n" "$d/spurious_events" "$s"
>        else    printf "clean: %s\\n" "$d/spurious_events"
>        fi
>done
>
>Flags all passthrough and virtual devices.  Even though there is a
>reduction with virtual network devices, that is only a 10% reduction.
>Most of the problem remains even though there is progress.
>
>I was mentioning an AMD processor since the initial report stated the
>problem was more severe with AMD processor machines.
>
>This is likely a driver design issue.  Most pieces of hardware, telling
>the hardware to process an empty queue is quite cheap.  Perhaps minor
>energy loss, but most hardware isn't (yet) too worried about being
>attacked.
>
>Passthrough and virtual devices are quite unusual in there being a
>concern over attacks.  There could be major design flaws due to the
>front-ends being designed similar to normal drivers.
>
Hmm, you are checking the spurious events on the backend.
Sorry, I should have been more specific: this patch only mitigates the
spurious interrupts on the frontend.
I will take a look at the backend.
Regards,
Anthoine
Anthoine Bourgeois | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech
                
On 10.07.25 18:11, Anthoine Bourgeois wrote:
> We found at Vates that there are lot of spurious interrupts when
> benchmarking the PV drivers of Xen. This issue appeared with a patch
> that addresses security issue XSA-391 (see Fixes below). On an iperf
> benchmark, spurious interrupts can represent up to 50% of the
> interrupts.
> 
> Spurious interrupts are interrupts that are rised for nothing, there is
> no work to do. This appends because the function that handles the
> interrupts ("xennet_tx_buf_gc") is also called at the end of the request
> path to garbage collect the responses received during the transmission
> load.
> 
> The request path is doing the work that the interrupt handler should
> have done otherwise. This is particurary true when there is more than
> one vcpu and get worse linearly with the number of vcpu/queue.
> 
> Moreover, this problem is amplifyed by the penalty imposed by a spurious
> interrupt. When an interrupt is found spurious the interrupt chip will
> delay the EOI to slowdown the backend. This delay will allow more
> responses to be handled by the request path and then there will be more
> chance the next interrupt will not find any work to do, creating a new
> spurious interrupt.
> 
> This causes performance issue. The solution here is to remove the calls
> from the request path and let the interrupt handler do the processing of
> the responses. This approch removes spurious interrupts (<0.05%) and
> also has the benefit of freeing up cycles in the request path, allowing
> it to process more work, which improves performance compared to masking
> the spurious interrupt one way or another.
> 
> Some vif throughput performance figures from a 8 vCPUs, 4GB of RAM HVM
> guest(s):
> 
> Without this patch on the :
> vm -> dom0: 4.5Gb/s
> vm -> vm:   7.0Gb/s
> 
> Without XSA-391 patch (revert of b27d47950e48):
> vm -> dom0: 8.3Gb/s
> vm -> vm:   8.7Gb/s
> 
> With XSA-391 and this patch:
> vm -> dom0: 11.5Gb/s
> vm -> vm:   12.6Gb/s
> 
> Fixes: b27d47950e48 ("xen/netfront: harden netfront against event channel storms")
> Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
Reviewed-by: Juergen Gross <jgross@suse.com>
Juergen
                
You also probably want to send this to linux kernel mailing list too.

On 10/07/2025 at 18:14, Anthoine Bourgeois wrote:
> We found at Vates that there are lot of spurious interrupts when
> benchmarking the PV drivers of Xen. This issue appeared with a patch
> that addresses security issue XSA-391 (see Fixes below). On an iperf
> benchmark, spurious interrupts can represent up to 50% of the
> interrupts.
> 
> Spurious interrupts are interrupts that are rised for nothing, there is
> no work to do. This appends because the function that handles the
> interrupts ("xennet_tx_buf_gc") is also called at the end of the request
> path to garbage collect the responses received during the transmission
> load.
> 
> The request path is doing the work that the interrupt handler should
> have done otherwise. This is particurary true when there is more than
> one vcpu and get worse linearly with the number of vcpu/queue.
> 
> Moreover, this problem is amplifyed by the penalty imposed by a spurious
> interrupt. When an interrupt is found spurious the interrupt chip will
> delay the EOI to slowdown the backend. This delay will allow more
> responses to be handled by the request path and then there will be more
> chance the next interrupt will not find any work to do, creating a new
> spurious interrupt.
> 
> This causes performance issue. The solution here is to remove the calls
> from the request path and let the interrupt handler do the processing of
> the responses. This approch removes spurious interrupts (<0.05%) and
> also has the benefit of freeing up cycles in the request path, allowing
> it to process more work, which improves performance compared to masking
> the spurious interrupt one way or another.
> 
> Some vif throughput performance figures from a 8 vCPUs, 4GB of RAM HVM
> guest(s):
> 
> Without this patch on the :
> vm -> dom0: 4.5Gb/s
> vm -> vm:   7.0Gb/s
> 
> Without XSA-391 patch (revert of b27d47950e48):
> vm -> dom0: 8.3Gb/s
> vm -> vm:   8.7Gb/s
> 
> With XSA-391 and this patch:
> vm -> dom0: 11.5Gb/s
> vm -> vm:   12.6Gb/s
> 
> Fixes: b27d47950e48 ("xen/netfront: harden netfront against event channel storms")
> Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
> ---
>   drivers/net/xen-netfront.c | 5 -----
>   1 file changed, 5 deletions(-)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 9bac50963477..a11a0e949400 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -638,8 +638,6 @@ static int xennet_xdp_xmit_one(struct net_device *dev,
>   	tx_stats->packets++;
>   	u64_stats_update_end(&tx_stats->syncp);
>   
> -	xennet_tx_buf_gc(queue);
> -
>   	return 0;
>   }
>   
> @@ -849,9 +847,6 @@ static netdev_tx_t xennet_start_xmit(struct sk_buff *skb, struct net_device *dev
>   	tx_stats->packets++;
>   	u64_stats_update_end(&tx_stats->syncp);
>   
> -	/* Note: It is not safe to access skb after xennet_tx_buf_gc()! */
> -	xennet_tx_buf_gc(queue);
> -
>   	if (!netfront_tx_slot_available(queue))
>   		netif_tx_stop_queue(netdev_get_tx_queue(dev, queue->id));
>   
Is there a risk of having a condition where the ring is full and the 
event channel is not up (which would cause the interrupt to never be 
called, and no message to be received again) ?
Teddy
Teddy Astie | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech
                
On 11.07.25 11:29, Teddy Astie wrote:
> You also probably want to send this to linux kernel mailing list too.
> 
> On 10/07/2025 at 18:14, Anthoine Bourgeois wrote:
>> We found at Vates that there are lot of spurious interrupts when
>> benchmarking the PV drivers of Xen. This issue appeared with a patch
>> that addresses security issue XSA-391 (see Fixes below). On an iperf
>> benchmark, spurious interrupts can represent up to 50% of the
>> interrupts.
>>
>> Spurious interrupts are interrupts that are rised for nothing, there is
>> no work to do. This appends because the function that handles the
>> interrupts ("xennet_tx_buf_gc") is also called at the end of the request
>> path to garbage collect the responses received during the transmission
>> load.
>>
>> The request path is doing the work that the interrupt handler should
>> have done otherwise. This is particurary true when there is more than
>> one vcpu and get worse linearly with the number of vcpu/queue.
>>
>> Moreover, this problem is amplifyed by the penalty imposed by a spurious
>> interrupt. When an interrupt is found spurious the interrupt chip will
>> delay the EOI to slowdown the backend. This delay will allow more
>> responses to be handled by the request path and then there will be more
>> chance the next interrupt will not find any work to do, creating a new
>> spurious interrupt.
>>
>> This causes performance issue. The solution here is to remove the calls
>> from the request path and let the interrupt handler do the processing of
>> the responses. This approch removes spurious interrupts (<0.05%) and
>> also has the benefit of freeing up cycles in the request path, allowing
>> it to process more work, which improves performance compared to masking
>> the spurious interrupt one way or another.
>>
>> Some vif throughput performance figures from a 8 vCPUs, 4GB of RAM HVM
>> guest(s):
>>
>> Without this patch on the :
>> vm -> dom0: 4.5Gb/s
>> vm -> vm:   7.0Gb/s
>>
>> Without XSA-391 patch (revert of b27d47950e48):
>> vm -> dom0: 8.3Gb/s
>> vm -> vm:   8.7Gb/s
>>
>> With XSA-391 and this patch:
>> vm -> dom0: 11.5Gb/s
>> vm -> vm:   12.6Gb/s
>>
>> Fixes: b27d47950e48 ("xen/netfront: harden netfront against event channel storms")
>> Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
>> ---
>>    drivers/net/xen-netfront.c | 5 -----
>>    1 file changed, 5 deletions(-)
>>
>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>> index 9bac50963477..a11a0e949400 100644
>> --- a/drivers/net/xen-netfront.c
>> +++ b/drivers/net/xen-netfront.c
>> @@ -638,8 +638,6 @@ static int xennet_xdp_xmit_one(struct net_device *dev,
>>    	tx_stats->packets++;
>>    	u64_stats_update_end(&tx_stats->syncp);
>>    
>> -	xennet_tx_buf_gc(queue);
>> -
>>    	return 0;
>>    }
>>    
>> @@ -849,9 +847,6 @@ static netdev_tx_t xennet_start_xmit(struct sk_buff *skb, struct net_device *dev
>>    	tx_stats->packets++;
>>    	u64_stats_update_end(&tx_stats->syncp);
>>    
>> -	/* Note: It is not safe to access skb after xennet_tx_buf_gc()! */
>> -	xennet_tx_buf_gc(queue);
>> -
>>    	if (!netfront_tx_slot_available(queue))
>>    		netif_tx_stop_queue(netdev_get_tx_queue(dev, queue->id));
>>    
> 
> Is there a risk of having a condition where the ring is full and the
> event channel is not up (which would cause the interrupt to never be
> called, and no message to be received again) ?
That would be a backend bug.
Juergen
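
To make the dependency explicit: with the gc calls removed from the
transmit path, a full TX ring relies entirely on the TX interrupt to
restart a stopped queue. A minimal sketch of that dependency (the
transmit-path lines are from the patch context above; the interrupt-path
part is a summary, not a verbatim copy of the driver):

	/* Transmit path: once the ring has no free slots left, stop the
	 * queue so no further requests are submitted.
	 */
	if (!netfront_tx_slot_available(queue))
		netif_tx_stop_queue(netdev_get_tx_queue(dev, queue->id));

	/* Interrupt path: the backend's event runs xennet_tx_buf_gc(),
	 * which releases the consumed slots and then wakes the stopped
	 * queue.  If the backend never sent that event while responses
	 * are pending, the queue would stay stopped -- the "backend bug"
	 * case mentioned above.
	 */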
                
On Thu, Jul 10, 2025 at 04:11:15PM +0000, Anthoine Bourgeois wrote:
> We found at Vates that there are lot of spurious interrupts when
> benchmarking the PV drivers of Xen. This issue appeared with a patch
> that addresses security issue XSA-391 (see Fixes below). On an iperf
> benchmark, spurious interrupts can represent up to 50% of the
> interrupts.

If this is the correct fix, near-identical fixes are needed for *all*
of the Xen front-ends.  Xen virtual block-devices and Xen PCI-passthrough
devices are also effected by a similar issue.

Thanks for finding a candidate fix, this effects many other people who
have been troubled by this performance issue.

FreeBSD will also need a similar fix.

-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
On Thu, Jul 10, 2025 at 01:05:47PM -0700, Elliott Mitchell wrote:
>On Thu, Jul 10, 2025 at 04:11:15PM +0000, Anthoine Bourgeois wrote:
>> We found at Vates that there are lot of spurious interrupts when
>> benchmarking the PV drivers of Xen. This issue appeared with a patch
>> that addresses security issue XSA-391 (see Fixes below). On an iperf
>> benchmark, spurious interrupts can represent up to 50% of the
>> interrupts.
>
>If this is the correct fix, near-identical fixes are needed for *all*
>of the Xen front-ends.  Xen virtual block-devices and Xen PCI-passthrough
>devices are also effected by a similar issue.
>
blkfront doesn't call the response handle from multiple places. It
doesn't seem to be affected by this problem.
And pcifront neither.

>Thanks for finding a candidate fix, this effects many other people who
>have been troubled by this performance issue.
>
>FreeBSD will also need a similar fix.

In FreeBSD, netfront may also be affected.
xn_assemble_tx_request calls xn_txeof.
blkfront and pcifront seems good.

Regards,
Anthoine

Anthoine Bourgeois | Vates XCP-ng Developer
XCP-ng & Xen Orchestra - Vates solutions
web: https://vates.tech
On Fri, Jul 11, 2025 at 07:41:02AM +0000, Anthoine Bourgeois wrote:
> On Thu, Jul 10, 2025 at 01:05:47PM -0700, Elliott Mitchell wrote:
> >On Thu, Jul 10, 2025 at 04:11:15PM +0000, Anthoine Bourgeois wrote:
> >> We found at Vates that there are lot of spurious interrupts when
> >> benchmarking the PV drivers of Xen. This issue appeared with a patch
> >> that addresses security issue XSA-391 (see Fixes below). On an iperf
> >> benchmark, spurious interrupts can represent up to 50% of the
> >> interrupts.
> >
> >If this is the correct fix, near-identical fixes are needed for *all*
> >of the Xen front-ends.  Xen virtual block-devices and Xen PCI-passthrough
> >devices are also effected by a similar issue.
> >
> blkfront doesn't call the response handle from multiple places. It
> doesn't seem to be affected by this problem.
> And pcifront neither.

Ick.  I had hoped it might be a single bug, or at worst a group of highly
similar bugs.  This may be several distinct bugs then.  :-(

When this bug was first reported to xen-devel@, I noticed reports of
spurious interrupts on the block backend.  Due to a piece of hardware
intended for pass-through, I now know it manifests on the PCI backend too.

Perhaps you hadn't noticed due to caching lessening the impact of
spurious events on virtual block devices?  Whatever the case
`ls /sys/devices/??*-[0-9]*-[0-9]*/xenbus/spurious_events` shows spurious
interrupts on pretty well every virtual device for me (I plan to test the
fix for net-front in the near future).

> >Thanks for finding a candidate fix, this effects many other people who
> >have been troubled by this performance issue.
> >
> >FreeBSD will also need a similar fix.
>
> In FreeBSD, netfront may also be affected.
> xn_assemble_tx_request calls xn_txeof.
> blkfront and pcifront seems good.

Interestingly the problem doesn't seem nearly as severe on ARM.

-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445