[PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present

Simon Schippers posted 9 patches 1 month ago
[PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Simon Schippers 1 month ago
This commit prevents tail-drop when a qdisc is present and the ptr_ring
becomes full. Once an entry is successfully produced and the ptr_ring
reaches capacity, the netdev queue is stopped instead of dropping
subsequent packets.

If producing an entry fails anyways, the tun_net_xmit returns
NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
LLTX is enabled and the transmit path operates without the usual locking.
As a result, concurrent calls to tun_net_xmit() are not prevented.

The existing __{tun,tap}_ring_consume functions free space in the
ptr_ring and wake the netdev queue. Races between this wakeup and the
queue-stop logic could leave the queue stopped indefinitely. To prevent
this, a memory barrier is enforced (as discussed in a similar
implementation in [1]), followed by a recheck that wakes the queue if
space is already available.

If no qdisc is present, the previous tail-drop behavior is preserved.

+-------------------------+-----------+---------------+----------------+
| pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
| Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
| 10M packets             |           |               |                |
+-----------+-------------+-----------+---------------+----------------+
| TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
|           +-------------+-----------+---------------+----------------+
|           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
+-----------+-------------+-----------+---------------+----------------+
| TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
|  +        +-------------+-----------+---------------+----------------+
| vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
+-----------+-------------+-----------+---------------+----------------+

[1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 71b6981d07d7..74d7fd09e9ba 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct netdev_queue *queue;
 	struct tun_file *tfile;
 	int len = skb->len;
+	bool qdisc_present;
+	int ret;
 
 	rcu_read_lock();
 	tfile = rcu_dereference(tun->tfiles[txq]);
@@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset_ct(skb);
 
-	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+	queue = netdev_get_tx_queue(dev, txq);
+	qdisc_present = !qdisc_txq_has_no_queue(queue);
+
+	spin_lock(&tfile->tx_ring.producer_lock);
+	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
+	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
+		netif_tx_stop_queue(queue);
+		/* Avoid races with queue wake-up in
+		 * __{tun,tap}_ring_consume by waking if space is
+		 * available in a re-check.
+		 * The barrier makes sure that the stop is visible before
+		 * we re-check.
+		 */
+		smp_mb__after_atomic();
+		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
+			netif_tx_wake_queue(queue);
+	}
+	spin_unlock(&tfile->tx_ring.producer_lock);
+
+	if (ret) {
+		/* If a qdisc is attached to our virtual device,
+		 * returning NETDEV_TX_BUSY is allowed.
+		 */
+		if (qdisc_present) {
+			rcu_read_unlock();
+			return NETDEV_TX_BUSY;
+		}
 		drop_reason = SKB_DROP_REASON_FULL_RING;
 		goto drop;
 	}
 
 	/* dev->lltx requires to do our own update of trans_start */
-	queue = netdev_get_tx_queue(dev, txq);
 	txq_trans_cond_update(queue);
 
 	/* Notify and wake up reader process */
-- 
2.43.0
Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Jason Wang 1 month ago
On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> becomes full. Once an entry is successfully produced and the ptr_ring
> reaches capacity, the netdev queue is stopped instead of dropping
> subsequent packets.
>
> If producing an entry fails anyways, the tun_net_xmit returns
> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> LLTX is enabled and the transmit path operates without the usual locking.
> As a result, concurrent calls to tun_net_xmit() are not prevented.
>
> The existing __{tun,tap}_ring_consume functions free space in the
> ptr_ring and wake the netdev queue. Races between this wakeup and the
> queue-stop logic could leave the queue stopped indefinitely. To prevent
> this, a memory barrier is enforced (as discussed in a similar
> implementation in [1]), followed by a recheck that wakes the queue if
> space is already available.
>
> If no qdisc is present, the previous tail-drop behavior is preserved.
>
> +-------------------------+-----------+---------------+----------------+
> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> | 10M packets             |           |               |                |
> +-----------+-------------+-----------+---------------+----------------+
> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> |           +-------------+-----------+---------------+----------------+
> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> +-----------+-------------+-----------+---------------+----------------+
> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> |  +        +-------------+-----------+---------------+----------------+
> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> +-----------+-------------+-----------+---------------+----------------+
>
> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 71b6981d07d7..74d7fd09e9ba 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>         struct netdev_queue *queue;
>         struct tun_file *tfile;
>         int len = skb->len;
> +       bool qdisc_present;
> +       int ret;
>
>         rcu_read_lock();
>         tfile = rcu_dereference(tun->tfiles[txq]);
> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>
>         nf_reset_ct(skb);
>
> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> +       queue = netdev_get_tx_queue(dev, txq);
> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> +
> +       spin_lock(&tfile->tx_ring.producer_lock);
> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> +               netif_tx_stop_queue(queue);
> +               /* Avoid races with queue wake-up in
> +                * __{tun,tap}_ring_consume by waking if space is
> +                * available in a re-check.
> +                * The barrier makes sure that the stop is visible before
> +                * we re-check.
> +                */
> +               smp_mb__after_atomic();
> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> +                       netif_tx_wake_queue(queue);

I'm not sure I will get here, but I think those should be moved to the
following if(ret) check. If __ptr_ring_produce() succeed, there's no
need to bother with those queue stop/wake logic?

> +       }
> +       spin_unlock(&tfile->tx_ring.producer_lock);
> +
> +       if (ret) {
> +               /* If a qdisc is attached to our virtual device,
> +                * returning NETDEV_TX_BUSY is allowed.
> +                */
> +               if (qdisc_present) {
> +                       rcu_read_unlock();
> +                       return NETDEV_TX_BUSY;
> +               }
>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>                 goto drop;
>         }
>
>         /* dev->lltx requires to do our own update of trans_start */
> -       queue = netdev_get_tx_queue(dev, txq);
>         txq_trans_cond_update(queue);
>
>         /* Notify and wake up reader process */
> --
> 2.43.0
>

Thanks
[PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Simon Schippers 1 month ago
On 1/8/26 05:37, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>> becomes full. Once an entry is successfully produced and the ptr_ring
>> reaches capacity, the netdev queue is stopped instead of dropping
>> subsequent packets.
>>
>> If producing an entry fails anyways, the tun_net_xmit returns
>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
>> LLTX is enabled and the transmit path operates without the usual locking.
>> As a result, concurrent calls to tun_net_xmit() are not prevented.
>>
>> The existing __{tun,tap}_ring_consume functions free space in the
>> ptr_ring and wake the netdev queue. Races between this wakeup and the
>> queue-stop logic could leave the queue stopped indefinitely. To prevent
>> this, a memory barrier is enforced (as discussed in a similar
>> implementation in [1]), followed by a recheck that wakes the queue if
>> space is already available.
>>
>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>
>> +-------------------------+-----------+---------------+----------------+
>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
>> | 10M packets             |           |               |                |
>> +-----------+-------------+-----------+---------------+----------------+
>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
>> |           +-------------+-----------+---------------+----------------+
>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
>> +-----------+-------------+-----------+---------------+----------------+
>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
>> |  +        +-------------+-----------+---------------+----------------+
>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
>> +-----------+-------------+-----------+---------------+----------------+
>>
>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 71b6981d07d7..74d7fd09e9ba 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>         struct netdev_queue *queue;
>>         struct tun_file *tfile;
>>         int len = skb->len;
>> +       bool qdisc_present;
>> +       int ret;
>>
>>         rcu_read_lock();
>>         tfile = rcu_dereference(tun->tfiles[txq]);
>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>
>>         nf_reset_ct(skb);
>>
>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>> +       queue = netdev_get_tx_queue(dev, txq);
>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
>> +
>> +       spin_lock(&tfile->tx_ring.producer_lock);
>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>> +               netif_tx_stop_queue(queue);
>> +               /* Avoid races with queue wake-up in
>> +                * __{tun,tap}_ring_consume by waking if space is
>> +                * available in a re-check.
>> +                * The barrier makes sure that the stop is visible before
>> +                * we re-check.
>> +                */
>> +               smp_mb__after_atomic();
>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>> +                       netif_tx_wake_queue(queue);
> 
> I'm not sure I will get here, but I think those should be moved to the
> following if(ret) check. If __ptr_ring_produce() succeed, there's no
> need to bother with those queue stop/wake logic?

There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
we stop the queue proactively.

I believe what you are aiming for is to always stop the queue if(ret),
which I can agree with. In that case, I would simply change the condition
to:

if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))

> 
>> +       }
>> +       spin_unlock(&tfile->tx_ring.producer_lock);
>> +
>> +       if (ret) {
>> +               /* If a qdisc is attached to our virtual device,
>> +                * returning NETDEV_TX_BUSY is allowed.
>> +                */
>> +               if (qdisc_present) {
>> +                       rcu_read_unlock();
>> +                       return NETDEV_TX_BUSY;
>> +               }
>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>                 goto drop;
>>         }
>>
>>         /* dev->lltx requires to do our own update of trans_start */
>> -       queue = netdev_get_tx_queue(dev, txq);
>>         txq_trans_cond_update(queue);
>>
>>         /* Notify and wake up reader process */
>> --
>> 2.43.0
>>
> 
> Thanks
> 
Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Jason Wang 1 month ago
On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/8/26 05:37, Jason Wang wrote:
> > On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >> becomes full. Once an entry is successfully produced and the ptr_ring
> >> reaches capacity, the netdev queue is stopped instead of dropping
> >> subsequent packets.
> >>
> >> If producing an entry fails anyways, the tun_net_xmit returns
> >> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> >> LLTX is enabled and the transmit path operates without the usual locking.
> >> As a result, concurrent calls to tun_net_xmit() are not prevented.
> >>
> >> The existing __{tun,tap}_ring_consume functions free space in the
> >> ptr_ring and wake the netdev queue. Races between this wakeup and the
> >> queue-stop logic could leave the queue stopped indefinitely. To prevent
> >> this, a memory barrier is enforced (as discussed in a similar
> >> implementation in [1]), followed by a recheck that wakes the queue if
> >> space is already available.
> >>
> >> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>
> >> +-------------------------+-----------+---------------+----------------+
> >> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> >> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> >> | 10M packets             |           |               |                |
> >> +-----------+-------------+-----------+---------------+----------------+
> >> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> >> |           +-------------+-----------+---------------+----------------+
> >> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> >> +-----------+-------------+-----------+---------------+----------------+
> >> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> >> |  +        +-------------+-----------+---------------+----------------+
> >> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> >> +-----------+-------------+-----------+---------------+----------------+
> >>
> >> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>
> >> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >> ---
> >>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
> >>  1 file changed, 29 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index 71b6981d07d7..74d7fd09e9ba 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>         struct netdev_queue *queue;
> >>         struct tun_file *tfile;
> >>         int len = skb->len;
> >> +       bool qdisc_present;
> >> +       int ret;
> >>
> >>         rcu_read_lock();
> >>         tfile = rcu_dereference(tun->tfiles[txq]);
> >> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>
> >>         nf_reset_ct(skb);
> >>
> >> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >> +       queue = netdev_get_tx_queue(dev, txq);
> >> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >> +
> >> +       spin_lock(&tfile->tx_ring.producer_lock);
> >> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >> +               netif_tx_stop_queue(queue);
> >> +               /* Avoid races with queue wake-up in
> >> +                * __{tun,tap}_ring_consume by waking if space is
> >> +                * available in a re-check.
> >> +                * The barrier makes sure that the stop is visible before
> >> +                * we re-check.
> >> +                */
> >> +               smp_mb__after_atomic();
> >> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >> +                       netif_tx_wake_queue(queue);
> >
> > I'm not sure I will get here, but I think those should be moved to the
> > following if(ret) check. If __ptr_ring_produce() succeed, there's no
> > need to bother with those queue stop/wake logic?
>
> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
> we stop the queue proactively.

This seems to conflict with the following NETDEV_TX_BUSY. Or is
NETDEV_TX_BUSY prepared for the xdp_xmit?

>
> I believe what you are aiming for is to always stop the queue if(ret),
> which I can agree with. In that case, I would simply change the condition
> to:
>
> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
>
> >
> >> +       }
> >> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >> +
> >> +       if (ret) {
> >> +               /* If a qdisc is attached to our virtual device,
> >> +                * returning NETDEV_TX_BUSY is allowed.
> >> +                */
> >> +               if (qdisc_present) {
> >> +                       rcu_read_unlock();
> >> +                       return NETDEV_TX_BUSY;
> >> +               }
> >>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>                 goto drop;
> >>         }
> >>
> >>         /* dev->lltx requires to do our own update of trans_start */
> >> -       queue = netdev_get_tx_queue(dev, txq);
> >>         txq_trans_cond_update(queue);
> >>
> >>         /* Notify and wake up reader process */
> >> --
> >> 2.43.0
> >>
> >
> > Thanks
> >
>

Thanks
[PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Simon Schippers 1 month ago
On 1/9/26 07:09, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/8/26 05:37, Jason Wang wrote:
>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>>> reaches capacity, the netdev queue is stopped instead of dropping
>>>> subsequent packets.
>>>>
>>>> If producing an entry fails anyways, the tun_net_xmit returns
>>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
>>>> LLTX is enabled and the transmit path operates without the usual locking.
>>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
>>>>
>>>> The existing __{tun,tap}_ring_consume functions free space in the
>>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
>>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
>>>> this, a memory barrier is enforced (as discussed in a similar
>>>> implementation in [1]), followed by a recheck that wakes the queue if
>>>> space is already available.
>>>>
>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>>>
>>>> +-------------------------+-----------+---------------+----------------+
>>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
>>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
>>>> | 10M packets             |           |               |                |
>>>> +-----------+-------------+-----------+---------------+----------------+
>>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
>>>> |           +-------------+-----------+---------------+----------------+
>>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
>>>> +-----------+-------------+-----------+---------------+----------------+
>>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
>>>> |  +        +-------------+-----------+---------------+----------------+
>>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
>>>> +-----------+-------------+-----------+---------------+----------------+
>>>>
>>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>>>>
>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>> ---
>>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index 71b6981d07d7..74d7fd09e9ba 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>         struct netdev_queue *queue;
>>>>         struct tun_file *tfile;
>>>>         int len = skb->len;
>>>> +       bool qdisc_present;
>>>> +       int ret;
>>>>
>>>>         rcu_read_lock();
>>>>         tfile = rcu_dereference(tun->tfiles[txq]);
>>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>
>>>>         nf_reset_ct(skb);
>>>>
>>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>>> +       queue = netdev_get_tx_queue(dev, txq);
>>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>> +
>>>> +       spin_lock(&tfile->tx_ring.producer_lock);
>>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>>> +               netif_tx_stop_queue(queue);
>>>> +               /* Avoid races with queue wake-up in
>>>> +                * __{tun,tap}_ring_consume by waking if space is
>>>> +                * available in a re-check.
>>>> +                * The barrier makes sure that the stop is visible before
>>>> +                * we re-check.
>>>> +                */
>>>> +               smp_mb__after_atomic();
>>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>>> +                       netif_tx_wake_queue(queue);
>>>
>>> I'm not sure I will get here, but I think those should be moved to the
>>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
>>> need to bother with those queue stop/wake logic?
>>
>> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
>> we stop the queue proactively.
> 
> This seems to conflict with the following NETDEV_TX_BUSY. Or is
> NETDEV_TX_BUSY prepared for the xdp_xmit?

Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
And I do not understand the connection with xdp_xmit.

> 
>>
>> I believe what you are aiming for is to always stop the queue if(ret),
>> which I can agree with. In that case, I would simply change the condition
>> to:
>>
>> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
>>
>>>
>>>> +       }
>>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
>>>> +
>>>> +       if (ret) {
>>>> +               /* If a qdisc is attached to our virtual device,
>>>> +                * returning NETDEV_TX_BUSY is allowed.
>>>> +                */
>>>> +               if (qdisc_present) {
>>>> +                       rcu_read_unlock();
>>>> +                       return NETDEV_TX_BUSY;
>>>> +               }
>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>                 goto drop;
>>>>         }
>>>>
>>>>         /* dev->lltx requires to do our own update of trans_start */
>>>> -       queue = netdev_get_tx_queue(dev, txq);
>>>>         txq_trans_cond_update(queue);
>>>>
>>>>         /* Notify and wake up reader process */
>>>> --
>>>> 2.43.0
>>>>
>>>
>>> Thanks
>>>
>>
> 
> Thanks
> 
Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Michael S. Tsirkin 4 weeks ago
On Fri, Jan 09, 2026 at 11:14:54AM +0100, Simon Schippers wrote:
> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?

We jump through a lot of hoops in virtio_net to avoid using
NETDEV_TX_BUSY because that bypasses all the net/ cleverness.
Given your patches aim to improve precisely ring full,
I would say stopping proactively before NETDEV_TX_BUSY
should be a priority.

-- 
MST
[PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Simon Schippers 4 weeks ago
On 1/12/26 05:33, Michael S. Tsirkin wrote:
> On Fri, Jan 09, 2026 at 11:14:54AM +0100, Simon Schippers wrote:
>> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> 
> We jump through a lot of hoops in virtio_net to avoid using
> NETDEV_TX_BUSY because that bypasses all the net/ cleverness.
> Given your patches aim to improve precisely ring full,
> I would say stopping proactively before NETDEV_TX_BUSY
> should be a priority.
> 

I already proactively stop here with the approach you proposed in
the v6.
Or am I missing something (apart from the xdp path)?

And yes I also dislike returning NETDEV_TX_BUSY but I do not see how
this can be prevented with lltx enabled.
Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Michael S. Tsirkin 4 weeks ago
On Mon, Jan 12, 2026 at 12:17:12PM +0100, Simon Schippers wrote:
> On 1/12/26 05:33, Michael S. Tsirkin wrote:
> > On Fri, Jan 09, 2026 at 11:14:54AM +0100, Simon Schippers wrote:
> >> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> > 
> > We jump through a lot of hoops in virtio_net to avoid using
> > NETDEV_TX_BUSY because that bypasses all the net/ cleverness.
> > Given your patches aim to improve precisely ring full,
> > I would say stopping proactively before NETDEV_TX_BUSY
> > should be a priority.
> > 
> 
> I already proactively stop here with the approach you proposed in
> the v6.
> Or am I missing something (apart from the xdp path)?

Yes, I am just answering the general question you posed.

> 
> And yes I also dislike returning NETDEV_TX_BUSY but I do not see how
> this can be prevented with lltx enabled.

Preventing NETDEV_TX_BUSY 100% of the time is not required. It's there
to handle races.

-- 
MST
[PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Simon Schippers 4 weeks ago
On 1/12/26 12:19, Michael S. Tsirkin wrote:
> On Mon, Jan 12, 2026 at 12:17:12PM +0100, Simon Schippers wrote:
>> On 1/12/26 05:33, Michael S. Tsirkin wrote:
>>> On Fri, Jan 09, 2026 at 11:14:54AM +0100, Simon Schippers wrote:
>>>> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
>>>
>>> We jump through a lot of hoops in virtio_net to avoid using
>>> NETDEV_TX_BUSY because that bypasses all the net/ cleverness.
>>> Given your patches aim to improve precisely ring full,
>>> I would say stopping proactively before NETDEV_TX_BUSY
>>> should be a priority.
>>>
>>
>> I already proactively stop here with the approach you proposed in
>> the v6.
>> Or am I missing something (apart from the xdp path)?
> 
> Yes, I am just answering the general question you posed.

Ah okay.

> 
>>
>> And yes I also dislike returning NETDEV_TX_BUSY but I do not see how
>> this can be prevented with lltx enabled.
> 
> Preventing NETDEV_TX_BUSY 100% of the time is not required. It's there
> to handle races.

Great to know. Thanks
Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Jason Wang 4 weeks ago
On Fri, Jan 9, 2026 at 6:15 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/9/26 07:09, Jason Wang wrote:
> > On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/8/26 05:37, Jason Wang wrote:
> >>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>>> reaches capacity, the netdev queue is stopped instead of dropping
> >>>> subsequent packets.
> >>>>
> >>>> If producing an entry fails anyways, the tun_net_xmit returns
> >>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> >>>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
> >>>>
> >>>> The existing __{tun,tap}_ring_consume functions free space in the
> >>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
> >>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
> >>>> this, a memory barrier is enforced (as discussed in a similar
> >>>> implementation in [1]), followed by a recheck that wakes the queue if
> >>>> space is already available.
> >>>>
> >>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>>>
> >>>> +-------------------------+-----------+---------------+----------------+
> >>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> >>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> >>>> | 10M packets             |           |               |                |
> >>>> +-----------+-------------+-----------+---------------+----------------+
> >>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> >>>> |           +-------------+-----------+---------------+----------------+
> >>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> >>>> +-----------+-------------+-----------+---------------+----------------+
> >>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> >>>> |  +        +-------------+-----------+---------------+----------------+
> >>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> >>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>
> >>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>>>
> >>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>> ---
> >>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
> >>>>  1 file changed, 29 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>> index 71b6981d07d7..74d7fd09e9ba 100644
> >>>> --- a/drivers/net/tun.c
> >>>> +++ b/drivers/net/tun.c
> >>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>         struct netdev_queue *queue;
> >>>>         struct tun_file *tfile;
> >>>>         int len = skb->len;
> >>>> +       bool qdisc_present;
> >>>> +       int ret;
> >>>>
> >>>>         rcu_read_lock();
> >>>>         tfile = rcu_dereference(tun->tfiles[txq]);
> >>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>
> >>>>         nf_reset_ct(skb);
> >>>>
> >>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>> +       queue = netdev_get_tx_queue(dev, txq);
> >>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>> +
> >>>> +       spin_lock(&tfile->tx_ring.producer_lock);
> >>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>> +               netif_tx_stop_queue(queue);
> >>>> +               /* Avoid races with queue wake-up in
> >>>> +                * __{tun,tap}_ring_consume by waking if space is
> >>>> +                * available in a re-check.
> >>>> +                * The barrier makes sure that the stop is visible before
> >>>> +                * we re-check.
> >>>> +                */
> >>>> +               smp_mb__after_atomic();
> >>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>> +                       netif_tx_wake_queue(queue);
> >>>
> >>> I'm not sure I will get here, but I think those should be moved to the
> >>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
> >>> need to bother with those queue stop/wake logic?
> >>
> >> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
> >> we stop the queue proactively.
> >
> > This seems to conflict with the following NETDEV_TX_BUSY. Or is
> > NETDEV_TX_BUSY prepared for the xdp_xmit?
>
> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?

No, I mean I don't understand why we still need to peek since we've
already used NETDEV_TX_BUSY.

> And I do not understand the connection with xdp_xmit.

Since there's we don't modify xdp_xmit path, so even if we peek next
ndo_start_xmit can still hit ring full.

Thanks

>
> >
> >>
> >> I believe what you are aiming for is to always stop the queue if(ret),
> >> which I can agree with. In that case, I would simply change the condition
> >> to:
> >>
> >> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
> >>
> >>>
> >>>> +       }
> >>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >>>> +
> >>>> +       if (ret) {
> >>>> +               /* If a qdisc is attached to our virtual device,
> >>>> +                * returning NETDEV_TX_BUSY is allowed.
> >>>> +                */
> >>>> +               if (qdisc_present) {
> >>>> +                       rcu_read_unlock();
> >>>> +                       return NETDEV_TX_BUSY;
> >>>> +               }
> >>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>                 goto drop;
> >>>>         }
> >>>>
> >>>>         /* dev->lltx requires to do our own update of trans_start */
> >>>> -       queue = netdev_get_tx_queue(dev, txq);
> >>>>         txq_trans_cond_update(queue);
> >>>>
> >>>>         /* Notify and wake up reader process */
> >>>> --
> >>>> 2.43.0
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> >
> > Thanks
> >
>
[PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Simon Schippers 4 weeks ago
On 1/12/26 03:22, Jason Wang wrote:
> On Fri, Jan 9, 2026 at 6:15 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/9/26 07:09, Jason Wang wrote:
>>> On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 1/8/26 05:37, Jason Wang wrote:
>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>>>>> reaches capacity, the netdev queue is stopped instead of dropping
>>>>>> subsequent packets.
>>>>>>
>>>>>> If producing an entry fails anyways, the tun_net_xmit returns
>>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
>>>>>> LLTX is enabled and the transmit path operates without the usual locking.
>>>>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
>>>>>>
>>>>>> The existing __{tun,tap}_ring_consume functions free space in the
>>>>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
>>>>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
>>>>>> this, a memory barrier is enforced (as discussed in a similar
>>>>>> implementation in [1]), followed by a recheck that wakes the queue if
>>>>>> space is already available.
>>>>>>
>>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>>>>>
>>>>>> +-------------------------+-----------+---------------+----------------+
>>>>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
>>>>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
>>>>>> | 10M packets             |           |               |                |
>>>>>> +-----------+-------------+-----------+---------------+----------------+
>>>>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
>>>>>> |           +-------------+-----------+---------------+----------------+
>>>>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
>>>>>> +-----------+-------------+-----------+---------------+----------------+
>>>>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
>>>>>> |  +        +-------------+-----------+---------------+----------------+
>>>>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
>>>>>> +-----------+-------------+-----------+---------------+----------------+
>>>>>>
>>>>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>>>>>>
>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>> ---
>>>>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
>>>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>> index 71b6981d07d7..74d7fd09e9ba 100644
>>>>>> --- a/drivers/net/tun.c
>>>>>> +++ b/drivers/net/tun.c
>>>>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>         struct netdev_queue *queue;
>>>>>>         struct tun_file *tfile;
>>>>>>         int len = skb->len;
>>>>>> +       bool qdisc_present;
>>>>>> +       int ret;
>>>>>>
>>>>>>         rcu_read_lock();
>>>>>>         tfile = rcu_dereference(tun->tfiles[txq]);
>>>>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>
>>>>>>         nf_reset_ct(skb);
>>>>>>
>>>>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>>>>> +       queue = netdev_get_tx_queue(dev, txq);
>>>>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>>>> +
>>>>>> +       spin_lock(&tfile->tx_ring.producer_lock);
>>>>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>>>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>>>>> +               netif_tx_stop_queue(queue);
>>>>>> +               /* Avoid races with queue wake-up in
>>>>>> +                * __{tun,tap}_ring_consume by waking if space is
>>>>>> +                * available in a re-check.
>>>>>> +                * The barrier makes sure that the stop is visible before
>>>>>> +                * we re-check.
>>>>>> +                */
>>>>>> +               smp_mb__after_atomic();
>>>>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>>>>> +                       netif_tx_wake_queue(queue);
>>>>>
>>>>> I'm not sure I will get here, but I think those should be moved to the
>>>>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
>>>>> need to bother with those queue stop/wake logic?
>>>>
>>>> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
>>>> we stop the queue proactively.
>>>
>>> This seems to conflict with the following NETDEV_TX_BUSY. Or is
>>> NETDEV_TX_BUSY prepared for the xdp_xmit?
>>
>> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> 
> No, I mean I don't understand why we still need to peek since we've
> already used NETDEV_TX_BUSY.

Yes, if __ptr_ring_produce() returns -ENOSPC, there is no need to check
__ptr_ring_produce_peek(). I agree with you on this point and will update
the code accordingly. In all other cases, checking
__ptr_ring_produce_peek() is still required in order to proactively stop
the queue.

> 
>> And I do not understand the connection with xdp_xmit.
> 
> Since there's we don't modify xdp_xmit path, so even if we peek next
> ndo_start_xmit can still hit ring full.

Ah okay. Would you apply the same stop-and-recheck logic in
tun_xdp_xmit when __ptr_ring_produce() fails to produce it, or is that
not permitted there?

Apart from that, as noted in the commit message, since we are using LLTX,
hitting a full ring is still possible anyway. I could see that especially
at multiqueue tests with pktgen by looking at the qdisc requeues.

Thanks

> 
> Thanks
> 
>>
>>>
>>>>
>>>> I believe what you are aiming for is to always stop the queue if(ret),
>>>> which I can agree with. In that case, I would simply change the condition
>>>> to:
>>>>
>>>> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
>>>>
>>>>>
>>>>>> +       }
>>>>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
>>>>>> +
>>>>>> +       if (ret) {
>>>>>> +               /* If a qdisc is attached to our virtual device,
>>>>>> +                * returning NETDEV_TX_BUSY is allowed.
>>>>>> +                */
>>>>>> +               if (qdisc_present) {
>>>>>> +                       rcu_read_unlock();
>>>>>> +                       return NETDEV_TX_BUSY;
>>>>>> +               }
>>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>>>                 goto drop;
>>>>>>         }
>>>>>>
>>>>>>         /* dev->lltx requires to do our own update of trans_start */
>>>>>> -       queue = netdev_get_tx_queue(dev, txq);
>>>>>>         txq_trans_cond_update(queue);
>>>>>>
>>>>>>         /* Notify and wake up reader process */
>>>>>> --
>>>>>> 2.43.0
>>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>
>>> Thanks
>>>
>>
> 
Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Jason Wang 3 weeks, 6 days ago
On Mon, Jan 12, 2026 at 7:08 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/12/26 03:22, Jason Wang wrote:
> > On Fri, Jan 9, 2026 at 6:15 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/9/26 07:09, Jason Wang wrote:
> >>> On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 1/8/26 05:37, Jason Wang wrote:
> >>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>>>>> reaches capacity, the netdev queue is stopped instead of dropping
> >>>>>> subsequent packets.
> >>>>>>
> >>>>>> If producing an entry fails anyways, the tun_net_xmit returns
> >>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> >>>>>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
> >>>>>>
> >>>>>> The existing __{tun,tap}_ring_consume functions free space in the
> >>>>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
> >>>>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
> >>>>>> this, a memory barrier is enforced (as discussed in a similar
> >>>>>> implementation in [1]), followed by a recheck that wakes the queue if
> >>>>>> space is already available.
> >>>>>>
> >>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>>>>>
> >>>>>> +-------------------------+-----------+---------------+----------------+
> >>>>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> >>>>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> >>>>>> | 10M packets             |           |               |                |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> >>>>>> |           +-------------+-----------+---------------+----------------+
> >>>>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> >>>>>> |  +        +-------------+-----------+---------------+----------------+
> >>>>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>>
> >>>>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>>>>>
> >>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>> ---
> >>>>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
> >>>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>> index 71b6981d07d7..74d7fd09e9ba 100644
> >>>>>> --- a/drivers/net/tun.c
> >>>>>> +++ b/drivers/net/tun.c
> >>>>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>         struct netdev_queue *queue;
> >>>>>>         struct tun_file *tfile;
> >>>>>>         int len = skb->len;
> >>>>>> +       bool qdisc_present;
> >>>>>> +       int ret;
> >>>>>>
> >>>>>>         rcu_read_lock();
> >>>>>>         tfile = rcu_dereference(tun->tfiles[txq]);
> >>>>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>
> >>>>>>         nf_reset_ct(skb);
> >>>>>>
> >>>>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>>>> +       queue = netdev_get_tx_queue(dev, txq);
> >>>>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>>>> +
> >>>>>> +       spin_lock(&tfile->tx_ring.producer_lock);
> >>>>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>>>> +               netif_tx_stop_queue(queue);
> >>>>>> +               /* Avoid races with queue wake-up in
> >>>>>> +                * __{tun,tap}_ring_consume by waking if space is
> >>>>>> +                * available in a re-check.
> >>>>>> +                * The barrier makes sure that the stop is visible before
> >>>>>> +                * we re-check.
> >>>>>> +                */
> >>>>>> +               smp_mb__after_atomic();
> >>>>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>>>> +                       netif_tx_wake_queue(queue);
> >>>>>
> >>>>> I'm not sure I will get here, but I think those should be moved to the
> >>>>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
> >>>>> need to bother with those queue stop/wake logic?
> >>>>
> >>>> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
> >>>> we stop the queue proactively.
> >>>
> >>> This seems to conflict with the following NETDEV_TX_BUSY. Or is
> >>> NETDEV_TX_BUSY prepared for the xdp_xmit?
> >>
> >> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> >
> > No, I mean I don't understand why we still need to peek since we've
> > already used NETDEV_TX_BUSY.
>
> Yes, if __ptr_ring_produce() returns -ENOSPC, there is no need to check
> __ptr_ring_produce_peek(). I agree with you on this point and will update
> the code accordingly. In all other cases, checking
> __ptr_ring_produce_peek() is still required in order to proactively stop
> the queue.
>
> >
> >> And I do not understand the connection with xdp_xmit.
> >
> > Since there's we don't modify xdp_xmit path, so even if we peek next
> > ndo_start_xmit can still hit ring full.
>
> Ah okay. Would you apply the same stop-and-recheck logic in
> tun_xdp_xmit when __ptr_ring_produce() fails to produce it, or is that
> not permitted there?

I think it won't work as there's no qdsic logic implemented in the XDP
xmit path. NETDEV_TX_BUSY for tun_net_xmit() should be sufficient.

>
> Apart from that, as noted in the commit message, since we are using LLTX,
> hitting a full ring is still possible anyway. I could see that especially
> at multiqueue tests with pktgen by looking at the qdisc requeues.
>
> Thanks

Thanks

>
> >
> > Thanks
> >
> >>
> >>>
> >>>>
> >>>> I believe what you are aiming for is to always stop the queue if(ret),
> >>>> which I can agree with. In that case, I would simply change the condition
> >>>> to:
> >>>>
> >>>> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
> >>>>
> >>>>>
> >>>>>> +       }
> >>>>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >>>>>> +
> >>>>>> +       if (ret) {
> >>>>>> +               /* If a qdisc is attached to our virtual device,
> >>>>>> +                * returning NETDEV_TX_BUSY is allowed.
> >>>>>> +                */
> >>>>>> +               if (qdisc_present) {
> >>>>>> +                       rcu_read_unlock();
> >>>>>> +                       return NETDEV_TX_BUSY;
> >>>>>> +               }
> >>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>>>                 goto drop;
> >>>>>>         }
> >>>>>>
> >>>>>>         /* dev->lltx requires to do our own update of trans_start */
> >>>>>> -       queue = netdev_get_tx_queue(dev, txq);
> >>>>>>         txq_trans_cond_update(queue);
> >>>>>>
> >>>>>>         /* Notify and wake up reader process */
> >>>>>> --
> >>>>>> 2.43.0
> >>>>>>
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> >
>
Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
Posted by Michael S. Tsirkin 4 weeks ago
On Mon, Jan 12, 2026 at 12:08:14PM +0100, Simon Schippers wrote:
> On 1/12/26 03:22, Jason Wang wrote:
> > On Fri, Jan 9, 2026 at 6:15 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/9/26 07:09, Jason Wang wrote:
> >>> On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 1/8/26 05:37, Jason Wang wrote:
> >>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>>>>> reaches capacity, the netdev queue is stopped instead of dropping
> >>>>>> subsequent packets.
> >>>>>>
> >>>>>> If producing an entry fails anyways, the tun_net_xmit returns
> >>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> >>>>>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
> >>>>>>
> >>>>>> The existing __{tun,tap}_ring_consume functions free space in the
> >>>>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
> >>>>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
> >>>>>> this, a memory barrier is enforced (as discussed in a similar
> >>>>>> implementation in [1]), followed by a recheck that wakes the queue if
> >>>>>> space is already available.
> >>>>>>
> >>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>>>>>
> >>>>>> +-------------------------+-----------+---------------+----------------+
> >>>>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> >>>>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> >>>>>> | 10M packets             |           |               |                |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> >>>>>> |           +-------------+-----------+---------------+----------------+
> >>>>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> >>>>>> |  +        +-------------+-----------+---------------+----------------+
> >>>>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>>
> >>>>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>>>>>
> >>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>> ---
> >>>>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
> >>>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>> index 71b6981d07d7..74d7fd09e9ba 100644
> >>>>>> --- a/drivers/net/tun.c
> >>>>>> +++ b/drivers/net/tun.c
> >>>>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>         struct netdev_queue *queue;
> >>>>>>         struct tun_file *tfile;
> >>>>>>         int len = skb->len;
> >>>>>> +       bool qdisc_present;
> >>>>>> +       int ret;
> >>>>>>
> >>>>>>         rcu_read_lock();
> >>>>>>         tfile = rcu_dereference(tun->tfiles[txq]);
> >>>>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>
> >>>>>>         nf_reset_ct(skb);
> >>>>>>
> >>>>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>>>> +       queue = netdev_get_tx_queue(dev, txq);
> >>>>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>>>> +
> >>>>>> +       spin_lock(&tfile->tx_ring.producer_lock);
> >>>>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>>>> +               netif_tx_stop_queue(queue);
> >>>>>> +               /* Avoid races with queue wake-up in
> >>>>>> +                * __{tun,tap}_ring_consume by waking if space is
> >>>>>> +                * available in a re-check.
> >>>>>> +                * The barrier makes sure that the stop is visible before
> >>>>>> +                * we re-check.
> >>>>>> +                */
> >>>>>> +               smp_mb__after_atomic();
> >>>>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>>>> +                       netif_tx_wake_queue(queue);
> >>>>>
> >>>>> I'm not sure I will get here, but I think those should be moved to the
> >>>>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
> >>>>> need to bother with those queue stop/wake logic?
> >>>>
> >>>> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
> >>>> we stop the queue proactively.
> >>>
> >>> This seems to conflict with the following NETDEV_TX_BUSY. Or is
> >>> NETDEV_TX_BUSY prepared for the xdp_xmit?
> >>
> >> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> > 
> > No, I mean I don't understand why we still need to peek since we've
> > already used NETDEV_TX_BUSY.
> 
> Yes, if __ptr_ring_produce() returns -ENOSPC, there is no need to check
> __ptr_ring_produce_peek(). I agree with you on this point and will update
> the code accordingly. In all other cases, checking
> __ptr_ring_produce_peek() is still required in order to proactively stop
> the queue.
> 
> > 
> >> And I do not understand the connection with xdp_xmit.
> > 
> > Since there's we don't modify xdp_xmit path, so even if we peek next
> > ndo_start_xmit can still hit ring full.
> 
> Ah okay. Would you apply the same stop-and-recheck logic in
> tun_xdp_xmit when __ptr_ring_produce() fails to produce it, or is that
> not permitted there?
> 
> Apart from that, as noted in the commit message, since we are using LLTX,
> hitting a full ring is still possible anyway. I could see that especially
> at multiqueue tests with pktgen by looking at the qdisc requeues.
> 
> Thanks

If it's an exceptional rare condition (i.e. a race), it's not a big deal. That's why
NETDEV_TX_BUSY is there in the 1st place. If it's the main modus
operadi - not good.


> > 
> > Thanks
> > 
> >>
> >>>
> >>>>
> >>>> I believe what you are aiming for is to always stop the queue if(ret),
> >>>> which I can agree with. In that case, I would simply change the condition
> >>>> to:
> >>>>
> >>>> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
> >>>>
> >>>>>
> >>>>>> +       }
> >>>>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >>>>>> +
> >>>>>> +       if (ret) {
> >>>>>> +               /* If a qdisc is attached to our virtual device,
> >>>>>> +                * returning NETDEV_TX_BUSY is allowed.
> >>>>>> +                */
> >>>>>> +               if (qdisc_present) {
> >>>>>> +                       rcu_read_unlock();
> >>>>>> +                       return NETDEV_TX_BUSY;
> >>>>>> +               }
> >>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>>>                 goto drop;
> >>>>>>         }
> >>>>>>
> >>>>>>         /* dev->lltx requires to do our own update of trans_start */
> >>>>>> -       queue = netdev_get_tx_queue(dev, txq);
> >>>>>>         txq_trans_cond_update(queue);
> >>>>>>
> >>>>>>         /* Notify and wake up reader process */
> >>>>>> --
> >>>>>> 2.43.0
> >>>>>>
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> >