[PATCH] tap: introduce IFF_NAPI

Jon Kohler posted 1 patch 10 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20230531031645.44335-1-jon@nutanix.com
Maintainers: Jason Wang <jasowang@redhat.com>
net/tap-linux.c | 4 ++++
net/tap-linux.h | 1 +
2 files changed, 5 insertions(+)
[PATCH] tap: introduce IFF_NAPI
Posted by Jon Kohler 10 months ago
If kernel supports IFF_NAPI, lets use it, which is especially useful
on kernels containing fb3f903769e8 ("tun: support NAPI for packets
received from batched XDP buffs"), as IFF_NAPI allows the
vhost_tx_batch path to use NAPI on XDP buffs.

Benchmark w/ iperf -c (remote srv) -P (thread count) -l (stream size)
from a guest running kernel 5.10.105 to remote bare metal running
patched code on kernel 5.10.139. Guests were configured 1x virtio-net
device with 4x queues, resulting in 4x vhost-worker threads. Hosts are
identical with Intel ICX 4314 @ 2.4 GHz with Mellanox CX5 25GbE NIC ->
Arista 25GbE switch. vhost-worker threads largely maxed out on CPU on
"Before" and around ~50-60% utilization "After".

Single Stream: iperf -P 1
iperf -l size | Before          | After          | Increase
64B           | 593 Mbits/sec   | 712 Mbits/sec  | ~20%
128B          | 1.00 Gbits/sec  | 1.18 Gbits/sec | ~18%
4KB           | 17.6 Gbits/sec  | 22.7 Gbits/sec | ~29%

Multiple Stream: iperf -P 12
iperf -l size | Before          | After          | Increase
64B           | 6.35 Gbits/sec  | 7.78 Gbits/sec | ~23%
128B          | 10.8 Gbits/sec  | 14.2 Gbits/sec | ~31%
4KB           | 23.6 Gbits/sec  | 23.6 Gbits/sec | (line speed)

Signed-off-by: Jon Kohler <jon@nutanix.com>
---
 net/tap-linux.c | 4 ++++
 net/tap-linux.h | 1 +
 2 files changed, 5 insertions(+)

diff --git a/net/tap-linux.c b/net/tap-linux.c
index f54f308d359..fd94df166e0 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -62,6 +62,10 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
         ifr.ifr_flags |= IFF_ONE_QUEUE;
     }
 
+    if (features & IFF_NAPI) {
+        ifr.ifr_flags |= IFF_NAPI;
+    }
+
     if (*vnet_hdr) {
         if (features & IFF_VNET_HDR) {
             *vnet_hdr = 1;
diff --git a/net/tap-linux.h b/net/tap-linux.h
index bbbb62c2a75..f4d8e55270b 100644
--- a/net/tap-linux.h
+++ b/net/tap-linux.h
@@ -37,6 +37,7 @@
 
 /* TUNSETIFF ifr flags */
 #define IFF_TAP          0x0002
+#define IFF_NAPI         0x0010
 #define IFF_NO_PI        0x1000
 #define IFF_ONE_QUEUE    0x2000
 #define IFF_VNET_HDR     0x4000
-- 
2.30.1 (Apple Git-130)
Re: [PATCH] tap: introduce IFF_NAPI
Posted by Jason Wang 10 months ago
On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@nutanix.com> wrote:
>
> If kernel supports IFF_NAPI, lets use it, which is especially useful
> on kernels containing fb3f903769e8 ("tun: support NAPI for packets
> received from batched XDP buffs"), as IFF_NAPI allows the
> vhost_tx_batch path to use NAPI on XDP buffs.
>
> Benchmark w/ iperf -c (remote srv) -P (thread count) -l (stream size)
> from a guest running kernel 5.10.105 to remote bare metal running
> patched code on kernel 5.10.139. Guests were configured 1x virtio-net
> device with 4x queues, resulting in 4x vhost-worker threads. Hosts are
> identical with Intel ICX 4314 @ 2.4 GHz with Mellanox CX5 25GbE NIC ->
> Arista 25GbE switch. vhost-worker threads largely maxed out on CPU on
> "Before" and around ~50-60% utilization "After".
>
> Single Stream: iperf -P 1
> iperf -l size | Before          | After          | Increase
> 64B           | 593 Mbits/sec   | 712 Mbits/sec  | ~20%
> 128B          | 1.00 Gbits/sec  | 1.18 Gbits/sec | ~18%
> 4KB           | 17.6 Gbits/sec  | 22.7 Gbits/sec | ~29%
>
> Multiple Stream: iperf -P 12
> iperf -l size | Before          | After          | Increase
> 64B           | 6.35 Gbits/sec  | 7.78 Gbits/sec | ~23%
> 128B          | 10.8 Gbits/sec  | 14.2 Gbits/sec | ~31%
> 4KB           | 23.6 Gbits/sec  | 23.6 Gbits/sec | (line speed)
>
> Signed-off-by: Jon Kohler <jon@nutanix.com>

Great, but I would suggest having an option.

So we can:

1) ease the debug and compare
2) enable this by default only for 8.1, disable it for pre 8.1

Thanks

Thanks

> ---
>  net/tap-linux.c | 4 ++++
>  net/tap-linux.h | 1 +
>  2 files changed, 5 insertions(+)
>
> diff --git a/net/tap-linux.c b/net/tap-linux.c
> index f54f308d359..fd94df166e0 100644
> --- a/net/tap-linux.c
> +++ b/net/tap-linux.c
> @@ -62,6 +62,10 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
>          ifr.ifr_flags |= IFF_ONE_QUEUE;
>      }
>
> +    if (features & IFF_NAPI) {
> +        ifr.ifr_flags |= IFF_NAPI;
> +    }
> +
>      if (*vnet_hdr) {
>          if (features & IFF_VNET_HDR) {
>              *vnet_hdr = 1;
> diff --git a/net/tap-linux.h b/net/tap-linux.h
> index bbbb62c2a75..f4d8e55270b 100644
> --- a/net/tap-linux.h
> +++ b/net/tap-linux.h
> @@ -37,6 +37,7 @@
>
>  /* TUNSETIFF ifr flags */
>  #define IFF_TAP          0x0002
> +#define IFF_NAPI         0x0010
>  #define IFF_NO_PI        0x1000
>  #define IFF_ONE_QUEUE    0x2000
>  #define IFF_VNET_HDR     0x4000
> --
> 2.30.1 (Apple Git-130)
>
Re: [PATCH] tap: introduce IFF_NAPI
Posted by Jason Wang 10 months ago
On Wed, May 31, 2023 at 11:32 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@nutanix.com> wrote:
> >
> > If kernel supports IFF_NAPI, lets use it, which is especially useful
> > on kernels containing fb3f903769e8 ("tun: support NAPI for packets
> > received from batched XDP buffs"), as IFF_NAPI allows the
> > vhost_tx_batch path to use NAPI on XDP buffs.
> >
> > Benchmark w/ iperf -c (remote srv) -P (thread count) -l (stream size)
> > from a guest running kernel 5.10.105 to remote bare metal running
> > patched code on kernel 5.10.139. Guests were configured 1x virtio-net
> > device with 4x queues, resulting in 4x vhost-worker threads. Hosts are
> > identical with Intel ICX 4314 @ 2.4 GHz with Mellanox CX5 25GbE NIC ->
> > Arista 25GbE switch. vhost-worker threads largely maxed out on CPU on
> > "Before" and around ~50-60% utilization "After".
> >
> > Single Stream: iperf -P 1
> > iperf -l size | Before          | After          | Increase
> > 64B           | 593 Mbits/sec   | 712 Mbits/sec  | ~20%
> > 128B          | 1.00 Gbits/sec  | 1.18 Gbits/sec | ~18%
> > 4KB           | 17.6 Gbits/sec  | 22.7 Gbits/sec | ~29%
> >
> > Multiple Stream: iperf -P 12
> > iperf -l size | Before          | After          | Increase
> > 64B           | 6.35 Gbits/sec  | 7.78 Gbits/sec | ~23%
> > 128B          | 10.8 Gbits/sec  | 14.2 Gbits/sec | ~31%
> > 4KB           | 23.6 Gbits/sec  | 23.6 Gbits/sec | (line speed)
> >
> > Signed-off-by: Jon Kohler <jon@nutanix.com>
>
> Great, but I would suggest having an option.
>
> So we can:
>
> 1) ease the debug and compare
> 2) enable this by default only for 8.1, disable it for pre 8.1

More thought, if the performance boost only after fb3f903769e8, we
probably need to disable it by default and let the mgmt layer to
enable it.

Thanks

>
> Thanks
>
> Thanks
>
> > ---
> >  net/tap-linux.c | 4 ++++
> >  net/tap-linux.h | 1 +
> >  2 files changed, 5 insertions(+)
> >
> > diff --git a/net/tap-linux.c b/net/tap-linux.c
> > index f54f308d359..fd94df166e0 100644
> > --- a/net/tap-linux.c
> > +++ b/net/tap-linux.c
> > @@ -62,6 +62,10 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
> >          ifr.ifr_flags |= IFF_ONE_QUEUE;
> >      }
> >
> > +    if (features & IFF_NAPI) {
> > +        ifr.ifr_flags |= IFF_NAPI;
> > +    }
> > +
> >      if (*vnet_hdr) {
> >          if (features & IFF_VNET_HDR) {
> >              *vnet_hdr = 1;
> > diff --git a/net/tap-linux.h b/net/tap-linux.h
> > index bbbb62c2a75..f4d8e55270b 100644
> > --- a/net/tap-linux.h
> > +++ b/net/tap-linux.h
> > @@ -37,6 +37,7 @@
> >
> >  /* TUNSETIFF ifr flags */
> >  #define IFF_TAP          0x0002
> > +#define IFF_NAPI         0x0010
> >  #define IFF_NO_PI        0x1000
> >  #define IFF_ONE_QUEUE    0x2000
> >  #define IFF_VNET_HDR     0x4000
> > --
> > 2.30.1 (Apple Git-130)
> >
Re: [PATCH] tap: introduce IFF_NAPI
Posted by Jon Kohler 10 months ago

> On May 30, 2023, at 11:35 PM, Jason Wang <jasowang@redhat.com> wrote:
> 
> On Wed, May 31, 2023 at 11:32 AM Jason Wang <jasowang@redhat.com> wrote:
>> 
>> On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@nutanix.com> wrote:
>>> 
>>> If kernel supports IFF_NAPI, lets use it, which is especially useful
>>> on kernels containing fb3f903769e8 ("tun: support NAPI for packets
>>> received from batched XDP buffs"), as IFF_NAPI allows the
>>> vhost_tx_batch path to use NAPI on XDP buffs.
>>> 
>>> Benchmark w/ iperf -c (remote srv) -P (thread count) -l (stream size)
>>> from a guest running kernel 5.10.105 to remote bare metal running
>>> patched code on kernel 5.10.139. Guests were configured 1x virtio-net
>>> device with 4x queues, resulting in 4x vhost-worker threads. Hosts are
>>> identical with Intel ICX 4314 @ 2.4 GHz with Mellanox CX5 25GbE NIC ->
>>> Arista 25GbE switch. vhost-worker threads largely maxed out on CPU on
>>> "Before" and around ~50-60% utilization "After".
>>> 
>>> Single Stream: iperf -P 1
>>> iperf -l size | Before          | After          | Increase
>>> 64B           | 593 Mbits/sec   | 712 Mbits/sec  | ~20%
>>> 128B          | 1.00 Gbits/sec  | 1.18 Gbits/sec | ~18%
>>> 4KB           | 17.6 Gbits/sec  | 22.7 Gbits/sec | ~29%
>>> 
>>> Multiple Stream: iperf -P 12
>>> iperf -l size | Before          | After          | Increase
>>> 64B           | 6.35 Gbits/sec  | 7.78 Gbits/sec | ~23%
>>> 128B          | 10.8 Gbits/sec  | 14.2 Gbits/sec | ~31%
>>> 4KB           | 23.6 Gbits/sec  | 23.6 Gbits/sec | (line speed)
>>> 
>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> 
>> Great, but I would suggest having an option.
>> 
>> So we can:
>> 
>> 1) ease the debug and compare
>> 2) enable this by default only for 8.1, disable it for pre 8.1

Fair enough, one favor to ask though - 
Would you be able to point me to an existing option like what you’re
proposing so I could make sure I’m on the same page?

> 
> More thought, if the performance boost only after fb3f903769e8, we
> probably need to disable it by default and let the mgmt layer to
> enable it.
> 

I focused my testing with that commit, but I could take it out and
we still should get benefit. Would you like me to profile that to validate?

Asking as NAPI support in tun.c has been there for a while, guessing
at first glance that there would be non-zero gains, with little downsides.
Looking at git blame, seems about ~5-6 years of support.

Also for posterity, that commit has been in since 5.18, so a little over 1 year.

> Thanks
> 
>> 
>> Thanks
>> 
>> Thanks
>> 
>>> ---
>>> net/tap-linux.c | 4 ++++
>>> net/tap-linux.h | 1 +
>>> 2 files changed, 5 insertions(+)
>>> 
>>> diff --git a/net/tap-linux.c b/net/tap-linux.c
>>> index f54f308d359..fd94df166e0 100644
>>> --- a/net/tap-linux.c
>>> +++ b/net/tap-linux.c
>>> @@ -62,6 +62,10 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
>>>         ifr.ifr_flags |= IFF_ONE_QUEUE;
>>>     }
>>> 
>>> +    if (features & IFF_NAPI) {
>>> +        ifr.ifr_flags |= IFF_NAPI;
>>> +    }
>>> +
>>>     if (*vnet_hdr) {
>>>         if (features & IFF_VNET_HDR) {
>>>             *vnet_hdr = 1;
>>> diff --git a/net/tap-linux.h b/net/tap-linux.h
>>> index bbbb62c2a75..f4d8e55270b 100644
>>> --- a/net/tap-linux.h
>>> +++ b/net/tap-linux.h
>>> @@ -37,6 +37,7 @@
>>> 
>>> /* TUNSETIFF ifr flags */
>>> #define IFF_TAP          0x0002
>>> +#define IFF_NAPI         0x0010
>>> #define IFF_NO_PI        0x1000
>>> #define IFF_ONE_QUEUE    0x2000
>>> #define IFF_VNET_HDR     0x4000
>>> --
>>> 2.30.1 (Apple Git-130)

Re: [PATCH] tap: introduce IFF_NAPI
Posted by Jason Wang 10 months ago
On Wed, May 31, 2023 at 11:47 AM Jon Kohler <jon@nutanix.com> wrote:
>
>
>
> > On May 30, 2023, at 11:35 PM, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, May 31, 2023 at 11:32 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@nutanix.com> wrote:
> >>>
> >>> If kernel supports IFF_NAPI, lets use it, which is especially useful
> >>> on kernels containing fb3f903769e8 ("tun: support NAPI for packets
> >>> received from batched XDP buffs"), as IFF_NAPI allows the
> >>> vhost_tx_batch path to use NAPI on XDP buffs.
> >>>
> >>> Benchmark w/ iperf -c (remote srv) -P (thread count) -l (stream size)
> >>> from a guest running kernel 5.10.105 to remote bare metal running
> >>> patched code on kernel 5.10.139. Guests were configured 1x virtio-net
> >>> device with 4x queues, resulting in 4x vhost-worker threads. Hosts are
> >>> identical with Intel ICX 4314 @ 2.4 GHz with Mellanox CX5 25GbE NIC ->
> >>> Arista 25GbE switch. vhost-worker threads largely maxed out on CPU on
> >>> "Before" and around ~50-60% utilization "After".
> >>>
> >>> Single Stream: iperf -P 1
> >>> iperf -l size | Before          | After          | Increase
> >>> 64B           | 593 Mbits/sec   | 712 Mbits/sec  | ~20%
> >>> 128B          | 1.00 Gbits/sec  | 1.18 Gbits/sec | ~18%
> >>> 4KB           | 17.6 Gbits/sec  | 22.7 Gbits/sec | ~29%
> >>>
> >>> Multiple Stream: iperf -P 12
> >>> iperf -l size | Before          | After          | Increase
> >>> 64B           | 6.35 Gbits/sec  | 7.78 Gbits/sec | ~23%
> >>> 128B          | 10.8 Gbits/sec  | 14.2 Gbits/sec | ~31%
> >>> 4KB           | 23.6 Gbits/sec  | 23.6 Gbits/sec | (line speed)
> >>>
> >>> Signed-off-by: Jon Kohler <jon@nutanix.com>
> >>
> >> Great, but I would suggest having an option.
> >>
> >> So we can:
> >>
> >> 1) ease the debug and compare
> >> 2) enable this by default only for 8.1, disable it for pre 8.1
>
> Fair enough, one favor to ask though -
> Would you be able to point me to an existing option like what you’re
> proposing so I could make sure I’m on the same page?

For example, the vhost option for tap. Maybe we can have an napi option.

>
> >
> > More thought, if the performance boost only after fb3f903769e8, we
> > probably need to disable it by default and let the mgmt layer to
> > enable it.
> >
>
> I focused my testing with that commit, but I could take it out and
> we still should get benefit. Would you like me to profile that to validate?
>

One problem is that NAPI for TAP was originally used for kernel
hardening. Looking at the history, it introduces a lot of bugs.

Consider:

1) it has been merged for several years
2) tap has been widely used for a long time as well

I think it would be still safe to keep the option off (at least for
pre 8.1 machines).

> Asking as NAPI support in tun.c has been there for a while, guessing
> at first glance that there would be non-zero gains, with little downsides.
> Looking at git blame, seems about ~5-6 years of support.

Yes.

>
> Also for posterity, that commit has been in since 5.18, so a little over 1 year.

Then I think we can make it enabled by default for 8.1 and see.

Thanks

>
> > Thanks
> >
> >>
> >> Thanks
> >>
> >> Thanks
> >>
> >>> ---
> >>> net/tap-linux.c | 4 ++++
> >>> net/tap-linux.h | 1 +
> >>> 2 files changed, 5 insertions(+)
> >>>
> >>> diff --git a/net/tap-linux.c b/net/tap-linux.c
> >>> index f54f308d359..fd94df166e0 100644
> >>> --- a/net/tap-linux.c
> >>> +++ b/net/tap-linux.c
> >>> @@ -62,6 +62,10 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
> >>>         ifr.ifr_flags |= IFF_ONE_QUEUE;
> >>>     }
> >>>
> >>> +    if (features & IFF_NAPI) {
> >>> +        ifr.ifr_flags |= IFF_NAPI;
> >>> +    }
> >>> +
> >>>     if (*vnet_hdr) {
> >>>         if (features & IFF_VNET_HDR) {
> >>>             *vnet_hdr = 1;
> >>> diff --git a/net/tap-linux.h b/net/tap-linux.h
> >>> index bbbb62c2a75..f4d8e55270b 100644
> >>> --- a/net/tap-linux.h
> >>> +++ b/net/tap-linux.h
> >>> @@ -37,6 +37,7 @@
> >>>
> >>> /* TUNSETIFF ifr flags */
> >>> #define IFF_TAP          0x0002
> >>> +#define IFF_NAPI         0x0010
> >>> #define IFF_NO_PI        0x1000
> >>> #define IFF_ONE_QUEUE    0x2000
> >>> #define IFF_VNET_HDR     0x4000
> >>> --
> >>> 2.30.1 (Apple Git-130)
>
Re: [PATCH] tap: introduce IFF_NAPI
Posted by Jason Wang 10 months ago
On Wed, May 31, 2023 at 11:55 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, May 31, 2023 at 11:47 AM Jon Kohler <jon@nutanix.com> wrote:
> >
> >
> >
> > > On May 30, 2023, at 11:35 PM, Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, May 31, 2023 at 11:32 AM Jason Wang <jasowang@redhat.com> wrote:
> > >>
> > >> On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@nutanix.com> wrote:
> > >>>
> > >>> If kernel supports IFF_NAPI, lets use it, which is especially useful
> > >>> on kernels containing fb3f903769e8 ("tun: support NAPI for packets
> > >>> received from batched XDP buffs"), as IFF_NAPI allows the
> > >>> vhost_tx_batch path to use NAPI on XDP buffs.
> > >>>
> > >>> Benchmark w/ iperf -c (remote srv) -P (thread count) -l (stream size)
> > >>> from a guest running kernel 5.10.105 to remote bare metal running
> > >>> patched code on kernel 5.10.139. Guests were configured 1x virtio-net
> > >>> device with 4x queues, resulting in 4x vhost-worker threads. Hosts are
> > >>> identical with Intel ICX 4314 @ 2.4 GHz with Mellanox CX5 25GbE NIC ->
> > >>> Arista 25GbE switch. vhost-worker threads largely maxed out on CPU on
> > >>> "Before" and around ~50-60% utilization "After".
> > >>>
> > >>> Single Stream: iperf -P 1
> > >>> iperf -l size | Before          | After          | Increase
> > >>> 64B           | 593 Mbits/sec   | 712 Mbits/sec  | ~20%
> > >>> 128B          | 1.00 Gbits/sec  | 1.18 Gbits/sec | ~18%
> > >>> 4KB           | 17.6 Gbits/sec  | 22.7 Gbits/sec | ~29%
> > >>>
> > >>> Multiple Stream: iperf -P 12
> > >>> iperf -l size | Before          | After          | Increase
> > >>> 64B           | 6.35 Gbits/sec  | 7.78 Gbits/sec | ~23%
> > >>> 128B          | 10.8 Gbits/sec  | 14.2 Gbits/sec | ~31%
> > >>> 4KB           | 23.6 Gbits/sec  | 23.6 Gbits/sec | (line speed)
> > >>>
> > >>> Signed-off-by: Jon Kohler <jon@nutanix.com>
> > >>
> > >> Great, but I would suggest having an option.
> > >>
> > >> So we can:
> > >>
> > >> 1) ease the debug and compare
> > >> 2) enable this by default only for 8.1, disable it for pre 8.1
> >
> > Fair enough, one favor to ask though -
> > Would you be able to point me to an existing option like what you’re
> > proposing so I could make sure I’m on the same page?
>
> For example, the vhost option for tap. Maybe we can have an napi option.
>
> >
> > >
> > > More thought, if the performance boost only after fb3f903769e8, we
> > > probably need to disable it by default and let the mgmt layer to
> > > enable it.
> > >
> >
> > I focused my testing with that commit, but I could take it out and
> > we still should get benefit. Would you like me to profile that to validate?
> >
>
> One problem is that NAPI for TAP was originally used for kernel
> hardening. Looking at the history, it introduces a lot of bugs.
>
> Consider:
>
> 1) it has been merged for several years
> 2) tap has been widely used for a long time as well
>
> I think it would be still safe to keep the option off (at least for
> pre 8.1 machines).
>
> > Asking as NAPI support in tun.c has been there for a while, guessing
> > at first glance that there would be non-zero gains, with little downsides.
> > Looking at git blame, seems about ~5-6 years of support.
>
> Yes.
>
> >
> > Also for posterity, that commit has been in since 5.18, so a little over 1 year.
>
> Then I think we can make it enabled by default for 8.1 and see.

Btw, it would be better if we can have some PPS benchmark as well.

Thanks

>
> Thanks
>
> >
> > > Thanks
> > >
> > >>
> > >> Thanks
> > >>
> > >> Thanks
> > >>
> > >>> ---
> > >>> net/tap-linux.c | 4 ++++
> > >>> net/tap-linux.h | 1 +
> > >>> 2 files changed, 5 insertions(+)
> > >>>
> > >>> diff --git a/net/tap-linux.c b/net/tap-linux.c
> > >>> index f54f308d359..fd94df166e0 100644
> > >>> --- a/net/tap-linux.c
> > >>> +++ b/net/tap-linux.c
> > >>> @@ -62,6 +62,10 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
> > >>>         ifr.ifr_flags |= IFF_ONE_QUEUE;
> > >>>     }
> > >>>
> > >>> +    if (features & IFF_NAPI) {
> > >>> +        ifr.ifr_flags |= IFF_NAPI;
> > >>> +    }
> > >>> +
> > >>>     if (*vnet_hdr) {
> > >>>         if (features & IFF_VNET_HDR) {
> > >>>             *vnet_hdr = 1;
> > >>> diff --git a/net/tap-linux.h b/net/tap-linux.h
> > >>> index bbbb62c2a75..f4d8e55270b 100644
> > >>> --- a/net/tap-linux.h
> > >>> +++ b/net/tap-linux.h
> > >>> @@ -37,6 +37,7 @@
> > >>>
> > >>> /* TUNSETIFF ifr flags */
> > >>> #define IFF_TAP          0x0002
> > >>> +#define IFF_NAPI         0x0010
> > >>> #define IFF_NO_PI        0x1000
> > >>> #define IFF_ONE_QUEUE    0x2000
> > >>> #define IFF_VNET_HDR     0x4000
> > >>> --
> > >>> 2.30.1 (Apple Git-130)
> >
Re: [PATCH] tap: introduce IFF_NAPI
Posted by Jon Kohler 10 months ago

> On May 31, 2023, at 1:27 AM, Jason Wang <jasowang@redhat.com> wrote:
> 
> On Wed, May 31, 2023 at 11:55 AM Jason Wang <jasowang@redhat.com> wrote:
>> 
>> On Wed, May 31, 2023 at 11:47 AM Jon Kohler <jon@nutanix.com> wrote:
>>> 
>>> 
>>> 
>>>> On May 30, 2023, at 11:35 PM, Jason Wang <jasowang@redhat.com> wrote:
>>>> 
>>>> On Wed, May 31, 2023 at 11:32 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>> 
>>>>> On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@nutanix.com> wrote:
>>>>>> 
>>>>>> If kernel supports IFF_NAPI, lets use it, which is especially useful
>>>>>> on kernels containing fb3f903769e8 ("tun: support NAPI for packets
>>>>>> received from batched XDP buffs"), as IFF_NAPI allows the
>>>>>> vhost_tx_batch path to use NAPI on XDP buffs.
>>>>>> 
>>>>>> Benchmark w/ iperf -c (remote srv) -P (thread count) -l (stream size)
>>>>>> from a guest running kernel 5.10.105 to remote bare metal running
>>>>>> patched code on kernel 5.10.139. Guests were configured 1x virtio-net
>>>>>> device with 4x queues, resulting in 4x vhost-worker threads. Hosts are
>>>>>> identical with Intel ICX 4314 @ 2.4 GHz with Mellanox CX5 25GbE NIC ->
>>>>>> Arista 25GbE switch. vhost-worker threads largely maxed out on CPU on
>>>>>> "Before" and around ~50-60% utilization "After".
>>>>>> 
>>>>>> Single Stream: iperf -P 1
>>>>>> iperf -l size | Before          | After          | Increase
>>>>>> 64B           | 593 Mbits/sec   | 712 Mbits/sec  | ~20%
>>>>>> 128B          | 1.00 Gbits/sec  | 1.18 Gbits/sec | ~18%
>>>>>> 4KB           | 17.6 Gbits/sec  | 22.7 Gbits/sec | ~29%
>>>>>> 
>>>>>> Multiple Stream: iperf -P 12
>>>>>> iperf -l size | Before          | After          | Increase
>>>>>> 64B           | 6.35 Gbits/sec  | 7.78 Gbits/sec | ~23%
>>>>>> 128B          | 10.8 Gbits/sec  | 14.2 Gbits/sec | ~31%
>>>>>> 4KB           | 23.6 Gbits/sec  | 23.6 Gbits/sec | (line speed)
>>>>>> 
>>>>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>>>>> 
>>>>> Great, but I would suggest having an option.
>>>>> 
>>>>> So we can:
>>>>> 
>>>>> 1) ease the debug and compare
>>>>> 2) enable this by default only for 8.1, disable it for pre 8.1
>>> 
>>> Fair enough, one favor to ask though -
>>> Would you be able to point me to an existing option like what you’re
>>> proposing so I could make sure I’m on the same page?
>> 
>> For example, the vhost option for tap. Maybe we can have an napi option.

OK thanks, I’ll see what I can do there.

>> 
>>> 
>>>> 
>>>> More thought, if the performance boost only after fb3f903769e8, we
>>>> probably need to disable it by default and let the mgmt layer to
>>>> enable it.
>>>> 
>>> 
>>> I focused my testing with that commit, but I could take it out and
>>> we still should get benefit. Would you like me to profile that to validate?
>>> 
>> 
>> One problem is that NAPI for TAP was originally used for kernel
>> hardening. Looking at the history, it introduces a lot of bugs.
>> 
>> Consider:
>> 
>> 1) it has been merged for several years
>> 2) tap has been widely used for a long time as well
>> 
>> I think it would be still safe to keep the option off (at least for
>> pre 8.1 machines).
>> 
>>> Asking as NAPI support in tun.c has been there for a while, guessing
>>> at first glance that there would be non-zero gains, with little downsides.
>>> Looking at git blame, seems about ~5-6 years of support.
>> 
>> Yes.
>> 
>>> 
>>> Also for posterity, that commit has been in since 5.18, so a little over 1 year.
>> 
>> Then I think we can make it enabled by default for 8.1 and see.

OK, I’ll see what I can come up with.

> 
> Btw, it would be better if we can have some PPS benchmark as well.
> 
> Thanks

Is there a set of specific benchmark(s) that you want to see? Certain packet
sizes? TCP/UDP? Certain tool (netperf, iperf, etc)? The existing benchmarks
in the commit msg have both single and multi stream and multiple payload
sizes, all of which are a corollary to PPS, no?

Happy to do more profiling, just want to make sure I get you exactly what you
want.

> 
>> 
>> Thanks
>> 
>>> 
>>>> Thanks
>>>> 
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Thanks
>>>>> 
>>>>>> ---
>>>>>> net/tap-linux.c | 4 ++++
>>>>>> net/tap-linux.h | 1 +
>>>>>> 2 files changed, 5 insertions(+)
>>>>>> 
>>>>>> diff --git a/net/tap-linux.c b/net/tap-linux.c
>>>>>> index f54f308d359..fd94df166e0 100644
>>>>>> --- a/net/tap-linux.c
>>>>>> +++ b/net/tap-linux.c
>>>>>> @@ -62,6 +62,10 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
>>>>>>        ifr.ifr_flags |= IFF_ONE_QUEUE;
>>>>>>    }
>>>>>> 
>>>>>> +    if (features & IFF_NAPI) {
>>>>>> +        ifr.ifr_flags |= IFF_NAPI;
>>>>>> +    }
>>>>>> +
>>>>>>    if (*vnet_hdr) {
>>>>>>        if (features & IFF_VNET_HDR) {
>>>>>>            *vnet_hdr = 1;
>>>>>> diff --git a/net/tap-linux.h b/net/tap-linux.h
>>>>>> index bbbb62c2a75..f4d8e55270b 100644
>>>>>> --- a/net/tap-linux.h
>>>>>> +++ b/net/tap-linux.h
>>>>>> @@ -37,6 +37,7 @@
>>>>>> 
>>>>>> /* TUNSETIFF ifr flags */
>>>>>> #define IFF_TAP          0x0002
>>>>>> +#define IFF_NAPI         0x0010
>>>>>> #define IFF_NO_PI        0x1000
>>>>>> #define IFF_ONE_QUEUE    0x2000
>>>>>> #define IFF_VNET_HDR     0x4000
>>>>>> --
>>>>>> 2.30.1 (Apple Git-130)
>>> 
> 

Re: [PATCH] tap: introduce IFF_NAPI
Posted by Jason Wang 10 months ago
On Wed, May 31, 2023 at 9:46 PM Jon Kohler <jon@nutanix.com> wrote:
>
>
>
> > On May 31, 2023, at 1:27 AM, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, May 31, 2023 at 11:55 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On Wed, May 31, 2023 at 11:47 AM Jon Kohler <jon@nutanix.com> wrote:
> >>>
> >>>
> >>>
> >>>> On May 30, 2023, at 11:35 PM, Jason Wang <jasowang@redhat.com> wrote:
> >>>>
> >>>> On Wed, May 31, 2023 at 11:32 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>
> >>>>> On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@nutanix.com> wrote:
> >>>>>>
> >>>>>> If kernel supports IFF_NAPI, lets use it, which is especially useful
> >>>>>> on kernels containing fb3f903769e8 ("tun: support NAPI for packets
> >>>>>> received from batched XDP buffs"), as IFF_NAPI allows the
> >>>>>> vhost_tx_batch path to use NAPI on XDP buffs.
> >>>>>>
> >>>>>> Benchmark w/ iperf -c (remote srv) -P (thread count) -l (stream size)
> >>>>>> from a guest running kernel 5.10.105 to remote bare metal running
> >>>>>> patched code on kernel 5.10.139. Guests were configured 1x virtio-net
> >>>>>> device with 4x queues, resulting in 4x vhost-worker threads. Hosts are
> >>>>>> identical with Intel ICX 4314 @ 2.4 GHz with Mellanox CX5 25GbE NIC ->
> >>>>>> Arista 25GbE switch. vhost-worker threads largely maxed out on CPU on
> >>>>>> "Before" and around ~50-60% utilization "After".
> >>>>>>
> >>>>>> Single Stream: iperf -P 1
> >>>>>> iperf -l size | Before          | After          | Increase
> >>>>>> 64B           | 593 Mbits/sec   | 712 Mbits/sec  | ~20%
> >>>>>> 128B          | 1.00 Gbits/sec  | 1.18 Gbits/sec | ~18%
> >>>>>> 4KB           | 17.6 Gbits/sec  | 22.7 Gbits/sec | ~29%
> >>>>>>
> >>>>>> Multiple Stream: iperf -P 12
> >>>>>> iperf -l size | Before          | After          | Increase
> >>>>>> 64B           | 6.35 Gbits/sec  | 7.78 Gbits/sec | ~23%
> >>>>>> 128B          | 10.8 Gbits/sec  | 14.2 Gbits/sec | ~31%
> >>>>>> 4KB           | 23.6 Gbits/sec  | 23.6 Gbits/sec | (line speed)
> >>>>>>
> >>>>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
> >>>>>
> >>>>> Great, but I would suggest having an option.
> >>>>>
> >>>>> So we can:
> >>>>>
> >>>>> 1) ease the debug and compare
> >>>>> 2) enable this by default only for 8.1, disable it for pre 8.1
> >>>
> >>> Fair enough, one favor to ask though -
> >>> Would you be able to point me to an existing option like what you’re
> >>> proposing so I could make sure I’m on the same page?
> >>
> >> For example, the vhost option for tap. Maybe we can have an napi option.
>
> OK thanks, I’ll see what I can do there.
>
> >>
> >>>
> >>>>
> >>>> More thought, if the performance boost only after fb3f903769e8, we
> >>>> probably need to disable it by default and let the mgmt layer to
> >>>> enable it.
> >>>>
> >>>
> >>> I focused my testing with that commit, but I could take it out and
> >>> we still should get benefit. Would you like me to profile that to validate?
> >>>
> >>
> >> One problem is that NAPI for TAP was originally used for kernel
> >> hardening. Looking at the history, it introduces a lot of bugs.
> >>
> >> Consider:
> >>
> >> 1) it has been merged for several years
> >> 2) tap has been widely used for a long time as well
> >>
> >> I think it would be still safe to keep the option off (at least for
> >> pre 8.1 machines).
> >>
> >>> Asking as NAPI support in tun.c has been there for a while, guessing
> >>> at first glance that there would be non-zero gains, with little downsides.
> >>> Looking at git blame, seems about ~5-6 years of support.
> >>
> >> Yes.
> >>
> >>>
> >>> Also for posterity, that commit has been in since 5.18, so a little over 1 year.
> >>
> >> Then I think we can make it enabled by default for 8.1 and see.
>
> OK, I’ll see what I can come up with.
>
> >
> > Btw, it would be better if we can have some PPS benchmark as well.
> >
> > Thanks
>
> Is there a set of specific benchmark(s) that you want to see? Certain packet
> sizes? TCP/UDP? Certain tool (netperf, iperf, etc)?

It could be like this:

1) Netperf TCP_STREAM with various packet size 64 to maximum
2) Pktgen test from guest to host

> The existing benchmarks
> in the commit msg have both single and multi stream and multiple payload
> sizes, all of which are a corollary to PPS, no?

The problem is that it's the PPS of the TCP, various layers in the
middle. For example, if TCP coalesce usersapce write to larger
packets, we will end up with a low PPS.

Using pktgen may help to know if TAP can deal with more packets per second.

Thanks

>
> Happy to do more profiling, just want to make sure I get you exactly what you
> want.
>
> >
> >>
> >> Thanks
> >>
> >>>
> >>>> Thanks
> >>>>
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>>> ---
> >>>>>> net/tap-linux.c | 4 ++++
> >>>>>> net/tap-linux.h | 1 +
> >>>>>> 2 files changed, 5 insertions(+)
> >>>>>>
> >>>>>> diff --git a/net/tap-linux.c b/net/tap-linux.c
> >>>>>> index f54f308d359..fd94df166e0 100644
> >>>>>> --- a/net/tap-linux.c
> >>>>>> +++ b/net/tap-linux.c
> >>>>>> @@ -62,6 +62,10 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
> >>>>>>        ifr.ifr_flags |= IFF_ONE_QUEUE;
> >>>>>>    }
> >>>>>>
> >>>>>> +    if (features & IFF_NAPI) {
> >>>>>> +        ifr.ifr_flags |= IFF_NAPI;
> >>>>>> +    }
> >>>>>> +
> >>>>>>    if (*vnet_hdr) {
> >>>>>>        if (features & IFF_VNET_HDR) {
> >>>>>>            *vnet_hdr = 1;
> >>>>>> diff --git a/net/tap-linux.h b/net/tap-linux.h
> >>>>>> index bbbb62c2a75..f4d8e55270b 100644
> >>>>>> --- a/net/tap-linux.h
> >>>>>> +++ b/net/tap-linux.h
> >>>>>> @@ -37,6 +37,7 @@
> >>>>>>
> >>>>>> /* TUNSETIFF ifr flags */
> >>>>>> #define IFF_TAP          0x0002
> >>>>>> +#define IFF_NAPI         0x0010
> >>>>>> #define IFF_NO_PI        0x1000
> >>>>>> #define IFF_ONE_QUEUE    0x2000
> >>>>>> #define IFF_VNET_HDR     0x4000
> >>>>>> --
> >>>>>> 2.30.1 (Apple Git-130)
> >>>
> >
>