[v6] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop

[PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop

Posted by Simon Schippers 2 months, 2 weeks ago

On 11/21/25 07:19, Jason Wang wrote:
> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> This patch series deals with tun/tap and vhost-net which drop incoming
>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
>> patch series, the associated netdev queue is stopped before this happens.
>> This allows the connected qdisc to function correctly as reported by [1]
>> and improves application-layer performance, see our paper [2]. Meanwhile
>> the theoretical performance differs only slightly:
>>
>> +--------------------------------+-----------+----------+
>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
>> | i5 6300HQ, 20M packets         |           |          |
>> +-----------------+--------------+-----------+----------+
>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
>> |                 +--------------+-----------+----------+
>> |                 | Lost         | 1615 Kpps | 0 pps    |
>> +-----------------+--------------+-----------+----------+
>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
>> |                 +--------------+-----------+----------+
>> |                 | Lost         | 1164 Kpps | 0 pps    |
>> +-----------------+--------------+-----------+----------+
> 

Hi Jason,

thank you for your reply!

> PPS drops somehow for TAP, any reason for that?

I have no explicit explanation for that except general overheads coming
with this implementation.

> 
> Btw, I had some questions:
> 
> 1) most of the patches in this series would introduce non-trivial
> impact on the performance, we probably need to benchmark each or split
> the series. What's more we need to run TCP benchmark
> (throughput/latency) as well as pktgen see the real impact

What could be done, IMO, is to activate tun_ring_consume() /
tap_ring_consume() before enabling tun_ring_produce(). Then we could see
if this alone drops performance.

For TCP benchmarks, you mean userspace performance like iperf3 between a
host and a guest system?

> 
> 2) I see this:
> 
>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>                 goto drop;
>         }
> 
> So there could still be packet drop? Or is this related to the XDP path?

Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
unconsume. Since those two happen so rarely, I figured we should just
drop in this case.

> 
> 3) The LLTX change would have performance implications, but the
> benmark doesn't cover the case where multiple transmission is done in
> parallel

Do you mean multiple applications that produce traffic and potentially
run on different CPUs?

> 
> 4) After the LLTX change, it seems we've lost the synchronization with
> the XDP_TX and XDP_REDIRECT path?

I must admit I did not take a look at XDP and cannot really judge if/how
lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
call and I do not see the impact for XDP, which calls its own methods.
> 
> 5) The series introduces various ptr_ring helpers with lots of
> ordering stuff which is complicated, I wonder if we first have a
> simple patch to implement the zero packet loss

I personally don't see how a simpler patch is possible without using
discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
spin locking between producer and consumer. But I am open for
suggestions :)

> 
>>
>> This patch series includes tun/tap, and vhost-net because they share
>> logic. Adjusting only one of them would break the others. Therefore, the
>> patch series is structured as follows:
>> 1+2: new ptr_ring helpers for 3
>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
>> management
>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
>> vhost-net
>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
>> not break git bisect
>> 8: tun/tap: drop get ring exports (not used anymore)
>>
>> Possible future work:
>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
> 
> This seems to be not easy. The tx completion depends on the userspace behaviour.

I agree, but I really would like to reduce the buffer bloat caused by the
default 500 TUN / 1000 TAP packet queue without losing performance.

> 
>> - Adaption of the netdev queue flow control for ipvtap & macvtap
>>
>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
>>
> 
> Thanks
> 

Thanks! :)

Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop

Posted by Michael S. Tsirkin 2 months, 2 weeks ago

On Fri, Nov 21, 2025 at 10:22:54AM +0100, Simon Schippers wrote:
> I agree, but I really would like to reduce the buffer bloat caused by the
> default 500 TUN / 1000 TAP packet queue without losing performance.

that default is part of the userspace API and can't be changed.
just change whatever userspace is creating your device.

-- 
MST

[PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop

Posted by Simon Schippers 2 months, 2 weeks ago

On 11/26/25 08:15, Michael S. Tsirkin wrote:
> On Fri, Nov 21, 2025 at 10:22:54AM +0100, Simon Schippers wrote:
>> I agree, but I really would like to reduce the buffer bloat caused by the
>> default 500 TUN / 1000 TAP packet queue without losing performance.
> 
> that default is part of the userspace API and can't be changed.
> just change whatever userspace is creating your device.
> 

Yes, but I’m thinking about introducing a new interface flag like
IFF_BQL. However, as noted earlier, there are significant implementation
challenges.

I think there can be advantages to something like VPN's on mobile
devices where the throughput varies between a few Mbit/s (small TUN/TAP
queue is fine) and multiple Gbit/s (need a bigger queue).

Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop

Posted by Jason Wang 2 months, 2 weeks ago

On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 11/21/25 07:19, Jason Wang wrote:
> > On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> This patch series deals with tun/tap and vhost-net which drop incoming
> >> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
> >> patch series, the associated netdev queue is stopped before this happens.
> >> This allows the connected qdisc to function correctly as reported by [1]
> >> and improves application-layer performance, see our paper [2]. Meanwhile
> >> the theoretical performance differs only slightly:
> >>
> >> +--------------------------------+-----------+----------+
> >> | pktgen benchmarks to Debian VM | Stock     | Patched  |
> >> | i5 6300HQ, 20M packets         |           |          |
> >> +-----------------+--------------+-----------+----------+
> >> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
> >> |                 +--------------+-----------+----------+
> >> |                 | Lost         | 1615 Kpps | 0 pps    |
> >> +-----------------+--------------+-----------+----------+
> >> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
> >> |                 +--------------+-----------+----------+
> >> |                 | Lost         | 1164 Kpps | 0 pps    |
> >> +-----------------+--------------+-----------+----------+
> >
>
> Hi Jason,
>
> thank you for your reply!
>
> > PPS drops somehow for TAP, any reason for that?
>
> I have no explicit explanation for that except general overheads coming
> with this implementation.

It would be better to fix that.

>
> >
> > Btw, I had some questions:
> >
> > 1) most of the patches in this series would introduce non-trivial
> > impact on the performance, we probably need to benchmark each or split
> > the series. What's more we need to run TCP benchmark
> > (throughput/latency) as well as pktgen see the real impact
>
> What could be done, IMO, is to activate tun_ring_consume() /
> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
> if this alone drops performance.
>
> For TCP benchmarks, you mean userspace performance like iperf3 between a
> host and a guest system?

Yes,

>
> >
> > 2) I see this:
> >
> >         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
> >                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >                 goto drop;
> >         }
> >
> > So there could still be packet drop? Or is this related to the XDP path?
>
> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
> unconsume. Since those two happen so rarely, I figured we should just
> drop in this case.
>
> >
> > 3) The LLTX change would have performance implications, but the
> > benmark doesn't cover the case where multiple transmission is done in
> > parallel
>
> Do you mean multiple applications that produce traffic and potentially
> run on different CPUs?

Yes.

>
> >
> > 4) After the LLTX change, it seems we've lost the synchronization with
> > the XDP_TX and XDP_REDIRECT path?
>
> I must admit I did not take a look at XDP and cannot really judge if/how
> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
> call and I do not see the impact for XDP, which calls its own methods.

Without LLTX tun_net_xmit is protected by tx lock but it is not the
case of tun_xdp_xmit. This is because, unlike other devices, tun
doesn't have a dedicated TX queue for XDP, so the queue is shared by
both XDP and skb. So XDP xmit path needs to be protected with tx lock
as well, and since we don't have queue discipline for XDP, it means we
could still drop packets when XDP is enabled. I'm not sure this would
defeat the whole idea or not.

> >
> > 5) The series introduces various ptr_ring helpers with lots of
> > ordering stuff which is complicated, I wonder if we first have a
> > simple patch to implement the zero packet loss
>
> I personally don't see how a simpler patch is possible without using
> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
> spin locking between producer and consumer. But I am open for
> suggestions :)

I see NETDEV_TX_BUSY is used by veth:

static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
{
        if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
                return NETDEV_TX_BUSY; /* signal qdisc layer */

        return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
}

Maybe it would be simpler to start from that (probably with a new tun->flags?).

Thanks

>
> >
> >>
> >> This patch series includes tun/tap, and vhost-net because they share
> >> logic. Adjusting only one of them would break the others. Therefore, the
> >> patch series is structured as follows:
> >> 1+2: new ptr_ring helpers for 3
> >> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
> >> management
> >> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
> >> vhost-net
> >> 7: tun/tap & vhost-net: only now use the previous implemented functions to
> >> not break git bisect
> >> 8: tun/tap: drop get ring exports (not used anymore)
> >>
> >> Possible future work:
> >> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
> >
> > This seems to be not easy. The tx completion depends on the userspace behaviour.
>
> I agree, but I really would like to reduce the buffer bloat caused by the
> default 500 TUN / 1000 TAP packet queue without losing performance.
>
> >
> >> - Adaption of the netdev queue flow control for ipvtap & macvtap
> >>
> >> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
> >> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
> >>
> >
> > Thanks
> >
>
> Thanks! :)
>

Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop

Posted by Simon Schippers 2 months, 2 weeks ago

On 11/24/25 02:04, Jason Wang wrote:
> On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 11/21/25 07:19, Jason Wang wrote:
>>> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> This patch series deals with tun/tap and vhost-net which drop incoming
>>>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
>>>> patch series, the associated netdev queue is stopped before this happens.
>>>> This allows the connected qdisc to function correctly as reported by [1]
>>>> and improves application-layer performance, see our paper [2]. Meanwhile
>>>> the theoretical performance differs only slightly:
>>>>
>>>> +--------------------------------+-----------+----------+
>>>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
>>>> | i5 6300HQ, 20M packets         |           |          |
>>>> +-----------------+--------------+-----------+----------+
>>>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
>>>> |                 +--------------+-----------+----------+
>>>> |                 | Lost         | 1615 Kpps | 0 pps    |
>>>> +-----------------+--------------+-----------+----------+
>>>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
>>>> |                 +--------------+-----------+----------+
>>>> |                 | Lost         | 1164 Kpps | 0 pps    |
>>>> +-----------------+--------------+-----------+----------+
>>>
>>
>> Hi Jason,
>>
>> thank you for your reply!
>>
>>> PPS drops somehow for TAP, any reason for that?
>>
>> I have no explicit explanation for that except general overheads coming
>> with this implementation.
> 
> It would be better to fix that.
> 
>>
>>>
>>> Btw, I had some questions:
>>>
>>> 1) most of the patches in this series would introduce non-trivial
>>> impact on the performance, we probably need to benchmark each or split
>>> the series. What's more we need to run TCP benchmark
>>> (throughput/latency) as well as pktgen see the real impact
>>
>> What could be done, IMO, is to activate tun_ring_consume() /
>> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
>> if this alone drops performance.
>>
>> For TCP benchmarks, you mean userspace performance like iperf3 between a
>> host and a guest system?
> 
> Yes,
> 
>>
>>>
>>> 2) I see this:
>>>
>>>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>                 goto drop;
>>>         }
>>>
>>> So there could still be packet drop? Or is this related to the XDP path?
>>
>> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
>> unconsume. Since those two happen so rarely, I figured we should just
>> drop in this case.
>>
>>>
>>> 3) The LLTX change would have performance implications, but the
>>> benmark doesn't cover the case where multiple transmission is done in
>>> parallel
>>
>> Do you mean multiple applications that produce traffic and potentially
>> run on different CPUs?
> 
> Yes.
> 
>>
>>>
>>> 4) After the LLTX change, it seems we've lost the synchronization with
>>> the XDP_TX and XDP_REDIRECT path?
>>
>> I must admit I did not take a look at XDP and cannot really judge if/how
>> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
>> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
>> call and I do not see the impact for XDP, which calls its own methods.
> 
> Without LLTX tun_net_xmit is protected by tx lock but it is not the
> case of tun_xdp_xmit. This is because, unlike other devices, tun
> doesn't have a dedicated TX queue for XDP, so the queue is shared by
> both XDP and skb. So XDP xmit path needs to be protected with tx lock
> as well, and since we don't have queue discipline for XDP, it means we
> could still drop packets when XDP is enabled. I'm not sure this would
> defeat the whole idea or not.

Good point.

> 
>>>
>>> 5) The series introduces various ptr_ring helpers with lots of
>>> ordering stuff which is complicated, I wonder if we first have a
>>> simple patch to implement the zero packet loss
>>
>> I personally don't see how a simpler patch is possible without using
>> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
>> spin locking between producer and consumer. But I am open for
>> suggestions :)
> 
> I see NETDEV_TX_BUSY is used by veth:
> 
> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> {
>         if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
>                 return NETDEV_TX_BUSY; /* signal qdisc layer */
> 
>         return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> }
> 
> Maybe it would be simpler to start from that (probably with a new tun->flags?).
> 
> Thanks

Do you mean that this patchset could be implemented using the same
approach that was used for veth in [1]?
This could then also fix the XDP path.

But is returning NETDEV_TX_BUSY fine in our case?

Do you mean a flag that enables or disables the no-drop behavior?

Thanks!

[1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#u

> 
>>
>>>
>>>>
>>>> This patch series includes tun/tap, and vhost-net because they share
>>>> logic. Adjusting only one of them would break the others. Therefore, the
>>>> patch series is structured as follows:
>>>> 1+2: new ptr_ring helpers for 3
>>>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
>>>> management
>>>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
>>>> vhost-net
>>>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
>>>> not break git bisect
>>>> 8: tun/tap: drop get ring exports (not used anymore)
>>>>
>>>> Possible future work:
>>>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
>>>
>>> This seems to be not easy. The tx completion depends on the userspace behaviour.
>>
>> I agree, but I really would like to reduce the buffer bloat caused by the
>> default 500 TUN / 1000 TAP packet queue without losing performance.
>>
>>>
>>>> - Adaption of the netdev queue flow control for ipvtap & macvtap
>>>>
>>>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
>>>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
>>>>
>>>
>>> Thanks
>>>
>>
>> Thanks! :)
>>
>

Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop

Posted by Jason Wang 2 months, 2 weeks ago

On Mon, Nov 24, 2025 at 5:20 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 11/24/25 02:04, Jason Wang wrote:
> > On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 11/21/25 07:19, Jason Wang wrote:
> >>> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> This patch series deals with tun/tap and vhost-net which drop incoming
> >>>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
> >>>> patch series, the associated netdev queue is stopped before this happens.
> >>>> This allows the connected qdisc to function correctly as reported by [1]
> >>>> and improves application-layer performance, see our paper [2]. Meanwhile
> >>>> the theoretical performance differs only slightly:
> >>>>
> >>>> +--------------------------------+-----------+----------+
> >>>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
> >>>> | i5 6300HQ, 20M packets         |           |          |
> >>>> +-----------------+--------------+-----------+----------+
> >>>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
> >>>> |                 +--------------+-----------+----------+
> >>>> |                 | Lost         | 1615 Kpps | 0 pps    |
> >>>> +-----------------+--------------+-----------+----------+
> >>>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
> >>>> |                 +--------------+-----------+----------+
> >>>> |                 | Lost         | 1164 Kpps | 0 pps    |
> >>>> +-----------------+--------------+-----------+----------+
> >>>
> >>
> >> Hi Jason,
> >>
> >> thank you for your reply!
> >>
> >>> PPS drops somehow for TAP, any reason for that?
> >>
> >> I have no explicit explanation for that except general overheads coming
> >> with this implementation.
> >
> > It would be better to fix that.
> >
> >>
> >>>
> >>> Btw, I had some questions:
> >>>
> >>> 1) most of the patches in this series would introduce non-trivial
> >>> impact on the performance, we probably need to benchmark each or split
> >>> the series. What's more we need to run TCP benchmark
> >>> (throughput/latency) as well as pktgen see the real impact
> >>
> >> What could be done, IMO, is to activate tun_ring_consume() /
> >> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
> >> if this alone drops performance.
> >>
> >> For TCP benchmarks, you mean userspace performance like iperf3 between a
> >> host and a guest system?
> >
> > Yes,
> >
> >>
> >>>
> >>> 2) I see this:
> >>>
> >>>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
> >>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>                 goto drop;
> >>>         }
> >>>
> >>> So there could still be packet drop? Or is this related to the XDP path?
> >>
> >> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
> >> unconsume. Since those two happen so rarely, I figured we should just
> >> drop in this case.
> >>
> >>>
> >>> 3) The LLTX change would have performance implications, but the
> >>> benmark doesn't cover the case where multiple transmission is done in
> >>> parallel
> >>
> >> Do you mean multiple applications that produce traffic and potentially
> >> run on different CPUs?
> >
> > Yes.
> >
> >>
> >>>
> >>> 4) After the LLTX change, it seems we've lost the synchronization with
> >>> the XDP_TX and XDP_REDIRECT path?
> >>
> >> I must admit I did not take a look at XDP and cannot really judge if/how
> >> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
> >> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
> >> call and I do not see the impact for XDP, which calls its own methods.
> >
> > Without LLTX tun_net_xmit is protected by tx lock but it is not the
> > case of tun_xdp_xmit. This is because, unlike other devices, tun
> > doesn't have a dedicated TX queue for XDP, so the queue is shared by
> > both XDP and skb. So XDP xmit path needs to be protected with tx lock
> > as well, and since we don't have queue discipline for XDP, it means we
> > could still drop packets when XDP is enabled. I'm not sure this would
> > defeat the whole idea or not.
>
> Good point.
>
> >
> >>>
> >>> 5) The series introduces various ptr_ring helpers with lots of
> >>> ordering stuff which is complicated, I wonder if we first have a
> >>> simple patch to implement the zero packet loss
> >>
> >> I personally don't see how a simpler patch is possible without using
> >> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
> >> spin locking between producer and consumer. But I am open for
> >> suggestions :)
> >
> > I see NETDEV_TX_BUSY is used by veth:
> >
> > static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> > {
> >         if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
> >                 return NETDEV_TX_BUSY; /* signal qdisc layer */
> >
> >         return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> > }
> >
> > Maybe it would be simpler to start from that (probably with a new tun->flags?).
> >
> > Thanks
>
> Do you mean that this patchset could be implemented using the same
> approach that was used for veth in [1]?
> This could then also fix the XDP path.

I think so.

>
> But is returning NETDEV_TX_BUSY fine in our case?

If it helps to avoid packet drop. But I'm not sure if qdisc is a must
in your case.

>
> Do you mean a flag that enables or disables the no-drop behavior?

Yes, via a new flags that could be set via TUNSETIFF.

Thanks

>
> Thanks!
>
> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#u
>
> >
> >>
> >>>
> >>>>
> >>>> This patch series includes tun/tap, and vhost-net because they share
> >>>> logic. Adjusting only one of them would break the others. Therefore, the
> >>>> patch series is structured as follows:
> >>>> 1+2: new ptr_ring helpers for 3
> >>>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
> >>>> management
> >>>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
> >>>> vhost-net
> >>>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
> >>>> not break git bisect
> >>>> 8: tun/tap: drop get ring exports (not used anymore)
> >>>>
> >>>> Possible future work:
> >>>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
> >>>
> >>> This seems to be not easy. The tx completion depends on the userspace behaviour.
> >>
> >> I agree, but I really would like to reduce the buffer bloat caused by the
> >> default 500 TUN / 1000 TAP packet queue without losing performance.
> >>
> >>>
> >>>> - Adaption of the netdev queue flow control for ipvtap & macvtap
> >>>>
> >>>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
> >>>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> >> Thanks! :)
> >>
> >
>

[PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop

Posted by Simon Schippers 2 months, 2 weeks ago

On 11/25/25 02:34, Jason Wang wrote:
> On Mon, Nov 24, 2025 at 5:20 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 11/24/25 02:04, Jason Wang wrote:
>>> On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 11/21/25 07:19, Jason Wang wrote:
>>>>> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> This patch series deals with tun/tap and vhost-net which drop incoming
>>>>>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
>>>>>> patch series, the associated netdev queue is stopped before this happens.
>>>>>> This allows the connected qdisc to function correctly as reported by [1]
>>>>>> and improves application-layer performance, see our paper [2]. Meanwhile
>>>>>> the theoretical performance differs only slightly:
>>>>>>
>>>>>> +--------------------------------+-----------+----------+
>>>>>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
>>>>>> | i5 6300HQ, 20M packets         |           |          |
>>>>>> +-----------------+--------------+-----------+----------+
>>>>>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
>>>>>> |                 +--------------+-----------+----------+
>>>>>> |                 | Lost         | 1615 Kpps | 0 pps    |
>>>>>> +-----------------+--------------+-----------+----------+
>>>>>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
>>>>>> |                 +--------------+-----------+----------+
>>>>>> |                 | Lost         | 1164 Kpps | 0 pps    |
>>>>>> +-----------------+--------------+-----------+----------+
>>>>>
>>>>
>>>> Hi Jason,
>>>>
>>>> thank you for your reply!
>>>>
>>>>> PPS drops somehow for TAP, any reason for that?
>>>>
>>>> I have no explicit explanation for that except general overheads coming
>>>> with this implementation.
>>>
>>> It would be better to fix that.
>>>
>>>>
>>>>>
>>>>> Btw, I had some questions:
>>>>>
>>>>> 1) most of the patches in this series would introduce non-trivial
>>>>> impact on the performance, we probably need to benchmark each or split
>>>>> the series. What's more we need to run TCP benchmark
>>>>> (throughput/latency) as well as pktgen see the real impact
>>>>
>>>> What could be done, IMO, is to activate tun_ring_consume() /
>>>> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
>>>> if this alone drops performance.
>>>>
>>>> For TCP benchmarks, you mean userspace performance like iperf3 between a
>>>> host and a guest system?
>>>
>>> Yes,
>>>
>>>>
>>>>>
>>>>> 2) I see this:
>>>>>
>>>>>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>>                 goto drop;
>>>>>         }
>>>>>
>>>>> So there could still be packet drop? Or is this related to the XDP path?
>>>>
>>>> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
>>>> unconsume. Since those two happen so rarely, I figured we should just
>>>> drop in this case.
>>>>
>>>>>
>>>>> 3) The LLTX change would have performance implications, but the
>>>>> benmark doesn't cover the case where multiple transmission is done in
>>>>> parallel
>>>>
>>>> Do you mean multiple applications that produce traffic and potentially
>>>> run on different CPUs?
>>>
>>> Yes.
>>>
>>>>
>>>>>
>>>>> 4) After the LLTX change, it seems we've lost the synchronization with
>>>>> the XDP_TX and XDP_REDIRECT path?
>>>>
>>>> I must admit I did not take a look at XDP and cannot really judge if/how
>>>> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
>>>> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
>>>> call and I do not see the impact for XDP, which calls its own methods.
>>>
>>> Without LLTX tun_net_xmit is protected by tx lock but it is not the
>>> case of tun_xdp_xmit. This is because, unlike other devices, tun
>>> doesn't have a dedicated TX queue for XDP, so the queue is shared by
>>> both XDP and skb. So XDP xmit path needs to be protected with tx lock
>>> as well, and since we don't have queue discipline for XDP, it means we
>>> could still drop packets when XDP is enabled. I'm not sure this would
>>> defeat the whole idea or not.
>>
>> Good point.
>>
>>>
>>>>>
>>>>> 5) The series introduces various ptr_ring helpers with lots of
>>>>> ordering stuff which is complicated, I wonder if we first have a
>>>>> simple patch to implement the zero packet loss
>>>>
>>>> I personally don't see how a simpler patch is possible without using
>>>> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
>>>> spin locking between producer and consumer. But I am open for
>>>> suggestions :)
>>>
>>> I see NETDEV_TX_BUSY is used by veth:
>>>
>>> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
>>> {
>>>         if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
>>>                 return NETDEV_TX_BUSY; /* signal qdisc layer */
>>>
>>>         return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>>> }
>>>
>>> Maybe it would be simpler to start from that (probably with a new tun->flags?).
>>>
>>> Thanks
>>
>> Do you mean that this patchset could be implemented using the same
>> approach that was used for veth in [1]?
>> This could then also fix the XDP path.
> 
> I think so.

Okay, I will do so and submit a v7 when net-next opens again for 6.19.

> 
>>
>> But is returning NETDEV_TX_BUSY fine in our case?
> 
> If it helps to avoid packet drop. But I'm not sure if qdisc is a must
> in your case.

I will try to avoid returning it.

When no qdisc is connected, I will just drop like veth does.

> 
>>
>> Do you mean a flag that enables or disables the no-drop behavior?
> 
> Yes, via a new flags that could be set via TUNSETIFF.
> 
> Thanks

I am not a fan of that, since I can not imagine a use case where
dropping packets is desired. veth does not introduce a flag either.

Of course, if there is a major performance degradation, it makes sense.
But I will benchmark it, and we will see.

Thank you!

> 
>>
>> Thanks!
>>
>> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#u
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> This patch series includes tun/tap, and vhost-net because they share
>>>>>> logic. Adjusting only one of them would break the others. Therefore, the
>>>>>> patch series is structured as follows:
>>>>>> 1+2: new ptr_ring helpers for 3
>>>>>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
>>>>>> management
>>>>>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
>>>>>> vhost-net
>>>>>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
>>>>>> not break git bisect
>>>>>> 8: tun/tap: drop get ring exports (not used anymore)
>>>>>>
>>>>>> Possible future work:
>>>>>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
>>>>>
>>>>> This seems to be not easy. The tx completion depends on the userspace behaviour.
>>>>
>>>> I agree, but I really would like to reduce the buffer bloat caused by the
>>>> default 500 TUN / 1000 TAP packet queue without losing performance.
>>>>
>>>>>
>>>>>> - Adaption of the netdev queue flow control for ipvtap & macvtap
>>>>>>
>>>>>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
>>>>>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
>>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>> Thanks! :)
>>>>
>>>
>>
>

Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop

Posted by Jason Wang 2 months, 2 weeks ago

On Tue, Nov 25, 2025 at 10:05 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 11/25/25 02:34, Jason Wang wrote:
> > On Mon, Nov 24, 2025 at 5:20 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 11/24/25 02:04, Jason Wang wrote:
> >>> On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 11/21/25 07:19, Jason Wang wrote:
> >>>>> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> This patch series deals with tun/tap and vhost-net which drop incoming
> >>>>>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
> >>>>>> patch series, the associated netdev queue is stopped before this happens.
> >>>>>> This allows the connected qdisc to function correctly as reported by [1]
> >>>>>> and improves application-layer performance, see our paper [2]. Meanwhile
> >>>>>> the theoretical performance differs only slightly:
> >>>>>>
> >>>>>> +--------------------------------+-----------+----------+
> >>>>>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
> >>>>>> | i5 6300HQ, 20M packets         |           |          |
> >>>>>> +-----------------+--------------+-----------+----------+
> >>>>>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
> >>>>>> |                 +--------------+-----------+----------+
> >>>>>> |                 | Lost         | 1615 Kpps | 0 pps    |
> >>>>>> +-----------------+--------------+-----------+----------+
> >>>>>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
> >>>>>> |                 +--------------+-----------+----------+
> >>>>>> |                 | Lost         | 1164 Kpps | 0 pps    |
> >>>>>> +-----------------+--------------+-----------+----------+
> >>>>>
> >>>>
> >>>> Hi Jason,
> >>>>
> >>>> thank you for your reply!
> >>>>
> >>>>> PPS drops somehow for TAP, any reason for that?
> >>>>
> >>>> I have no explicit explanation for that except general overheads coming
> >>>> with this implementation.
> >>>
> >>> It would be better to fix that.
> >>>
> >>>>
> >>>>>
> >>>>> Btw, I had some questions:
> >>>>>
> >>>>> 1) most of the patches in this series would introduce non-trivial
> >>>>> impact on the performance, we probably need to benchmark each or split
> >>>>> the series. What's more we need to run TCP benchmark
> >>>>> (throughput/latency) as well as pktgen see the real impact
> >>>>
> >>>> What could be done, IMO, is to activate tun_ring_consume() /
> >>>> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
> >>>> if this alone drops performance.
> >>>>
> >>>> For TCP benchmarks, you mean userspace performance like iperf3 between a
> >>>> host and a guest system?
> >>>
> >>> Yes,
> >>>
> >>>>
> >>>>>
> >>>>> 2) I see this:
> >>>>>
> >>>>>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
> >>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>>                 goto drop;
> >>>>>         }
> >>>>>
> >>>>> So there could still be packet drop? Or is this related to the XDP path?
> >>>>
> >>>> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
> >>>> unconsume. Since those two happen so rarely, I figured we should just
> >>>> drop in this case.
> >>>>
> >>>>>
> >>>>> 3) The LLTX change would have performance implications, but the
> >>>>> benmark doesn't cover the case where multiple transmission is done in
> >>>>> parallel
> >>>>
> >>>> Do you mean multiple applications that produce traffic and potentially
> >>>> run on different CPUs?
> >>>
> >>> Yes.
> >>>
> >>>>
> >>>>>
> >>>>> 4) After the LLTX change, it seems we've lost the synchronization with
> >>>>> the XDP_TX and XDP_REDIRECT path?
> >>>>
> >>>> I must admit I did not take a look at XDP and cannot really judge if/how
> >>>> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
> >>>> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
> >>>> call and I do not see the impact for XDP, which calls its own methods.
> >>>
> >>> Without LLTX tun_net_xmit is protected by tx lock but it is not the
> >>> case of tun_xdp_xmit. This is because, unlike other devices, tun
> >>> doesn't have a dedicated TX queue for XDP, so the queue is shared by
> >>> both XDP and skb. So XDP xmit path needs to be protected with tx lock
> >>> as well, and since we don't have queue discipline for XDP, it means we
> >>> could still drop packets when XDP is enabled. I'm not sure this would
> >>> defeat the whole idea or not.
> >>
> >> Good point.
> >>
> >>>
> >>>>>
> >>>>> 5) The series introduces various ptr_ring helpers with lots of
> >>>>> ordering stuff which is complicated, I wonder if we first have a
> >>>>> simple patch to implement the zero packet loss
> >>>>
> >>>> I personally don't see how a simpler patch is possible without using
> >>>> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
> >>>> spin locking between producer and consumer. But I am open for
> >>>> suggestions :)
> >>>
> >>> I see NETDEV_TX_BUSY is used by veth:
> >>>
> >>> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> >>> {
> >>>         if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
> >>>                 return NETDEV_TX_BUSY; /* signal qdisc layer */
> >>>
> >>>         return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> >>> }
> >>>
> >>> Maybe it would be simpler to start from that (probably with a new tun->flags?).
> >>>
> >>> Thanks
> >>
> >> Do you mean that this patchset could be implemented using the same
> >> approach that was used for veth in [1]?
> >> This could then also fix the XDP path.
> >
> > I think so.
>
> Okay, I will do so and submit a v7 when net-next opens again for 6.19.
>
> >
> >>
> >> But is returning NETDEV_TX_BUSY fine in our case?
> >
> > If it helps to avoid packet drop. But I'm not sure if qdisc is a must
> > in your case.
>
> I will try to avoid returning it.
>
> When no qdisc is connected, I will just drop like veth does.
>
> >
> >>
> >> Do you mean a flag that enables or disables the no-drop behavior?
> >
> > Yes, via a new flags that could be set via TUNSETIFF.
> >
> > Thanks
>
> I am not a fan of that, since I can not imagine a use case where
> dropping packets is desired.

Right, it's just for the case when we can see regression for some specific test.

> veth does not introduce a flag either.
>
> Of course, if there is a major performance degradation, it makes sense.
> But I will benchmark it, and we will see.

Exactly.

Thanks

>
> Thank you!
>
> >
> >>
> >> Thanks!
> >>
> >> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#u
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> This patch series includes tun/tap, and vhost-net because they share
> >>>>>> logic. Adjusting only one of them would break the others. Therefore, the
> >>>>>> patch series is structured as follows:
> >>>>>> 1+2: new ptr_ring helpers for 3
> >>>>>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
> >>>>>> management
> >>>>>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
> >>>>>> vhost-net
> >>>>>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
> >>>>>> not break git bisect
> >>>>>> 8: tun/tap: drop get ring exports (not used anymore)
> >>>>>>
> >>>>>> Possible future work:
> >>>>>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
> >>>>>
> >>>>> This seems to be not easy. The tx completion depends on the userspace behaviour.
> >>>>
> >>>> I agree, but I really would like to reduce the buffer bloat caused by the
> >>>> default 500 TUN / 1000 TAP packet queue without losing performance.
> >>>>
> >>>>>
> >>>>>> - Adaption of the netdev queue flow control for ipvtap & macvtap
> >>>>>>
> >>>>>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
> >>>>>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
> >>>>>>
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>> Thanks! :)
> >>>>
> >>>
> >>
> >
>