[v2] veth: Fix TXQ stall race condition and add recovery

[PATCH net V2 2/2] veth: more robust handing of race to avoid txq getting stuck

Posted by Jesper Dangaard Brouer 5 months, 1 week ago

Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") introduced a race condition that can lead to a permanently
stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
Max).

The race occurs in veth_xmit(). The producer observes a full ptr_ring and
stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
intended to re-wake the queue if the consumer had just emptied it (if
(__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
"lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
traffic halts.

This failure is caused by an incorrect use of the __ptr_ring_empty() API
from the producer side. As noted in kernel comments, this check is not
guaranteed to be correct if a consumer is operating on another CPU. The
empty test is based on ptr_ring->consumer_head, making it reliable only for
the consumer. Using this check from the producer side is fundamentally racy.

This patch fixes the race by adopting the more robust logic from an earlier
version V4 of the patchset, which always flushed the peer:

(1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
are removed. Instead, after stopping the queue, we unconditionally call
__veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
making it solely responsible for re-waking the TXQ.

(2) On the consumer side, the logic for waking the peer TXQ is moved out of
veth_xdp_rcv() and placed at the end of the veth_poll() function. This
placement is part of fixing the race, as the netif_tx_queue_stopped() check
must occur after rx_notify_masked is potentially set to false during NAPI
completion.
 This handles the race where veth_poll() consumes all packets and completes
NAPI before veth_xmit() on the producer side has called netif_tx_stop_queue().
In this state, the producer's __veth_xdp_flush(rq) call will see
rx_notify_masked is false and reschedule NAPI. This new NAPI poll, even if it
processes no packets, is now guaranteed to run the netif_tx_queue_stopped()
check, see the stopped queue, and wake it up, allowing veth_xmit() to proceed.

(3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
about to complete (napi_complete_done), it now also checks if the peer TXQ
is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
reschedule itself. This prevents a new race where the producer stops the
queue just as the consumer is finishing its poll, ensuring the wakeup is not
missed.

Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 drivers/net/veth.c |   43 ++++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7b1a9805b270..828b62916f50 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 		/* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
 		__skb_push(skb, ETH_HLEN);
-		/* Depend on prior success packets started NAPI consumer via
-		 * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
-		 * paired with empty check in veth_poll().
-		 */
 		netif_tx_stop_queue(txq);
-		smp_mb__after_atomic();
-		if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
-			netif_tx_wake_queue(txq);
+		/* Makes sure NAPI peer consumer runs. Consumer is responsible
+		 * for starting txq again, until then ndo_start_xmit (this
+		 * function) will not be invoked by the netstack again.
+		 */
+		__veth_xdp_flush(rq);
 		break;
 	case NET_RX_DROP: /* same as NET_XMIT_DROP */
 drop:
@@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
 			struct veth_stats *stats)
 {
-	struct veth_priv *priv = netdev_priv(rq->dev);
-	int queue_idx = rq->xdp_rxq.queue_index;
-	struct netdev_queue *peer_txq;
-	struct net_device *peer_dev;
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
 
-	/* NAPI functions as RCU section */
-	peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
-	peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
-
 	for (i = 0; i < budget; i++) {
 		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
 
@@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 	rq->stats.vs.xdp_packets += done;
 	u64_stats_update_end(&rq->stats.syncp);
 
-	if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
-		txq_trans_cond_update(peer_txq);
-		netif_tx_wake_queue(peer_txq);
-	}
-
 	return done;
 }
 
@@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
 {
 	struct veth_rq *rq =
 		container_of(napi, struct veth_rq, xdp_napi);
+	struct veth_priv *priv = netdev_priv(rq->dev);
+	int queue_idx = rq->xdp_rxq.queue_index;
+	struct netdev_queue *peer_txq;
 	struct veth_stats stats = {};
+	struct net_device *peer_dev;
 	struct veth_xdp_tx_bq bq;
 	int done;
 
 	bq.count = 0;
 
+	/* NAPI functions as RCU section */
+	peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
+	peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
+
 	xdp_set_return_frame_no_direct();
 	done = veth_xdp_rcv(rq, budget, &bq, &stats);
 
@@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
 	if (done < budget && napi_complete_done(napi, done)) {
 		/* Write rx_notify_masked before reading ptr_ring */
 		smp_store_mb(rq->rx_notify_masked, false);
-		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
+		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
+			     (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
 			if (napi_schedule_prep(&rq->xdp_napi)) {
 				WRITE_ONCE(rq->rx_notify_masked, true);
 				__napi_schedule(&rq->xdp_napi);
@@ -998,6 +992,13 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		veth_xdp_flush(rq, &bq);
 	xdp_clear_return_frame_no_direct();
 
+	/* Release backpressure per NAPI poll */
+	smp_rmb(); /* Paired with netif_tx_stop_queue set_bit */
+	if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
+		txq_trans_cond_update(peer_txq);
+		netif_tx_wake_queue(peer_txq);
+	}
+
 	return done;
 }

Re: [PATCH net V2 2/2] veth: more robust handing of race to avoid txq getting stuck

Posted by Paolo Abeni 5 months, 1 week ago

On 10/27/25 9:05 PM, Jesper Dangaard Brouer wrote:
> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
> about to complete (napi_complete_done), it now also checks if the peer TXQ
> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
> reschedule itself. This prevents a new race where the producer stops the
> queue just as the consumer is finishing its poll, ensuring the wakeup is not
> missed.

[...]

> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>  	if (done < budget && napi_complete_done(napi, done)) {
>  		/* Write rx_notify_masked before reading ptr_ring */
>  		smp_store_mb(rq->rx_notify_masked, false);
> -		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> +		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
> +			     (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>  			if (napi_schedule_prep(&rq->xdp_napi)) {
>  				WRITE_ONCE(rq->rx_notify_masked, true);
>  				__napi_schedule(&rq->xdp_napi);

Double checking I'm read the code correctly. The above is supposed to
trigger when something alike the following happens

[producer]				[consumer]
					veth_poll()
					[ring empty]
veth_xmit
  veth_forward_skb
  [NETDEV_TX_BUSY]		
					napi_complete_done()
					
  netif_tx_stop_queue
  __veth_xdp_flush()
  rq->rx_notify_masked == true
					WRITE_ONCE(rq->rx_notify_masked,
						   false);

?

I think the above can't happen, the producer should need to fill the
whole ring in-between the ring check and napi_complete_done().

Am I misreading it?

/P

Re: [PATCH net V2 2/2] veth: more robust handing of race to avoid txq getting stuck

Posted by Jesper Dangaard Brouer 5 months ago


On 30/10/2025 13.28, Paolo Abeni wrote:
> On 10/27/25 9:05 PM, Jesper Dangaard Brouer wrote:
>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>> reschedule itself. This prevents a new race where the producer stops the
>> queue just as the consumer is finishing its poll, ensuring the wakeup is not
>> missed.
> 
> [...]
> 
>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>>   	if (done < budget && napi_complete_done(napi, done)) {
>>   		/* Write rx_notify_masked before reading ptr_ring */
>>   		smp_store_mb(rq->rx_notify_masked, false);
>> -		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>> +		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>> +			     (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>>   			if (napi_schedule_prep(&rq->xdp_napi)) {
>>   				WRITE_ONCE(rq->rx_notify_masked, true);
>>   				__napi_schedule(&rq->xdp_napi);
> 
> Double checking I'm read the code correctly. The above is supposed to
> trigger when something alike the following happens
> 
> [producer]				[consumer]
> 					veth_poll()
> 					[ring empty]
> veth_xmit
>    veth_forward_skb
>    [NETDEV_TX_BUSY]		
> 					napi_complete_done()
> 					
>    netif_tx_stop_queue
>    __veth_xdp_flush()
>    rq->rx_notify_masked == true
> 					WRITE_ONCE(rq->rx_notify_masked,
> 						   false);
> 
> ?
> 
> I think the above can't happen, the producer should need to fill the
> whole ring in-between the ring check and napi_complete_done().

The race I can see is slightly different.  It is centered around the
consumer manage to empty the ring after [NETDEV_TX_BUSY].
We have 256 packets in queue and I observe NAPI packet processing time
of 7.64 usec on a given ARM64 metal. This means it takes 1956 usec or
1.96 ms to empty the queue (which is the time needed for the race to
occur in below during "(something interrupts)").

It would look like this:

  [producer]			[consumer]
				veth_poll() - already running
  veth_xmit
   veth_forward_skb
   [ring full]
   [NETDEV_TX_BUSY]
   (something interrupts)
				veth_poll()
				manage to [empty ring]
				napi_complete_done()
   netif_tx_stop_queue
   __veth_xdp_flush()
    - No effect of flush as:
    - rq->rx_notify_masked == true
				WRITE_ONCE(rq->rx_notify_masked, false)
				[empty ring] don't restart NAPI
				Observe netif_tx_queue_stopped == true


Notice: at end (the consumer) do observe netif_tx_queue_stopped is true.
This is leveraged in the patch by moving the netif_tx_queue_stopped
check to the end of veth_poll(). This now happens after rx_notify_masked
is changed to false, which is the race fix.

Other cases where veth_poll() stop NAPI and exits, is recovered by
__veth_xdp_flush() in producer.

--Jesper

Re: [PATCH net V2 2/2] veth: more robust handing of race to avoid txq getting stuck

Posted by Toshiaki Makita 5 months, 1 week ago

On 2025/10/28 5:05, Jesper Dangaard Brouer wrote:

> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
> are removed. Instead, after stopping the queue, we unconditionally call
> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
> making it solely responsible for re-waking the TXQ.

Maybe another option is to use !ptr_ring_full() instead of ptr_ring_empty()?
I'm not sure which is better. Anyway I'm ok with your approach.

...

> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
> about to complete (napi_complete_done), it now also checks if the peer TXQ
> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
> reschedule itself. This prevents a new race where the producer stops the
> queue just as the consumer is finishing its poll, ensuring the wakeup is not
> missed.
...

> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>   	if (done < budget && napi_complete_done(napi, done)) {
>   		/* Write rx_notify_masked before reading ptr_ring */
>   		smp_store_mb(rq->rx_notify_masked, false);
> -		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> +		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
> +			     (peer_txq && netif_tx_queue_stopped(peer_txq)))) {

Not sure if this is necessary.
 From commitlog, your intention seems to be making sure to wake up the queue,
but you wake up the queue immediately after this hunk in the same function,
so isn't it guaranteed without scheduling another napi?

>   			if (napi_schedule_prep(&rq->xdp_napi)) {
>   				WRITE_ONCE(rq->rx_notify_masked, true);
>   				__napi_schedule(&rq->xdp_napi);
> @@ -998,6 +992,13 @@ static int veth_poll(struct napi_struct *napi, int budget)
>   		veth_xdp_flush(rq, &bq);
>   	xdp_clear_return_frame_no_direct();
>   
> +	/* Release backpressure per NAPI poll */
> +	smp_rmb(); /* Paired with netif_tx_stop_queue set_bit */
> +	if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
> +		txq_trans_cond_update(peer_txq);
> +		netif_tx_wake_queue(peer_txq);
> +	}
> +
>   	return done;
>   }

--
Toshiaki Makita

Re: [PATCH net V2 2/2] veth: more robust handing of race to avoid txq getting stuck

Posted by Jesper Dangaard Brouer 5 months, 1 week ago



On 28/10/2025 15.56, Toshiaki Makita wrote:
> On 2025/10/28 5:05, Jesper Dangaard Brouer wrote:
> 
>> (1) In veth_xmit(), the racy conditional wake-up logic and its memory 
>> barrier
>> are removed. Instead, after stopping the queue, we unconditionally call
>> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is 
>> scheduled,
>> making it solely responsible for re-waking the TXQ.
> 
> Maybe another option is to use !ptr_ring_full() instead of 
> ptr_ring_empty()?

Nope, that will not work.
I think MST will agree.

> I'm not sure which is better. Anyway I'm ok with your approach.
> 
> ...
> 
>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>> reschedule itself. This prevents a new race where the producer stops the
>> queue just as the consumer is finishing its poll, ensuring the wakeup 
>> is not missed.
> ...
> 
>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int 
>> budget)
>>       if (done < budget && napi_complete_done(napi, done)) {
>>           /* Write rx_notify_masked before reading ptr_ring */
>>           smp_store_mb(rq->rx_notify_masked, false);
>> -        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>> +        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>> +                 (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
> 
> Not sure if this is necessary.

How sure are you that this isn't necessary?

>  From commitlog, your intention seems to be making sure to wake up the 
> queue,
> but you wake up the queue immediately after this hunk in the same function,
> so isn't it guaranteed without scheduling another napi?
> 

The above code catches the case, where the ptr_ring is empty and the
tx_queue is stopped.  It feels wrong not to reach in this case, but you
*might* be right that it isn't strictly necessary, because below code
will also call netif_tx_wake_queue() which *should* have a SKB stored
that will *indirectly* trigger a restart of the NAPI.

I will stare some more at the code to see if I can convince myself that
we don't have to catch this case.

Please, also provide "How sure are you that this isn't necessary?"


>>               if (napi_schedule_prep(&rq->xdp_napi)) {
>>                   WRITE_ONCE(rq->rx_notify_masked, true);
>>                   __napi_schedule(&rq->xdp_napi);
>> @@ -998,6 +992,13 @@ static int veth_poll(struct napi_struct *napi, 
>> int budget)
>>           veth_xdp_flush(rq, &bq);
>>       xdp_clear_return_frame_no_direct();
>> +    /* Release backpressure per NAPI poll */
>> +    smp_rmb(); /* Paired with netif_tx_stop_queue set_bit */
>> +    if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
>> +        txq_trans_cond_update(peer_txq);
>> +        netif_tx_wake_queue(peer_txq);
>> +    }
>> +
>>       return done;
>>   }
> 
> -- 
> Toshiaki Makita

Re: [PATCH net V2 2/2] veth: more robust handing of race to avoid txq getting stuck

Posted by Toshiaki Makita 5 months, 1 week ago

On 2025/10/29 19:33, Jesper Dangaard Brouer wrote:
> On 28/10/2025 15.56, Toshiaki Makita wrote:
>> On 2025/10/28 5:05, Jesper Dangaard Brouer wrote:
>>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>>> reschedule itself. This prevents a new race where the producer stops the
>>> queue just as the consumer is finishing its poll, ensuring the wakeup is not missed.
>> ...
>>
>>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>>>       if (done < budget && napi_complete_done(napi, done)) {
>>>           /* Write rx_notify_masked before reading ptr_ring */
>>>           smp_store_mb(rq->rx_notify_masked, false);
>>> -        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>>> +        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>>> +                 (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>>
>> Not sure if this is necessary.
> 
> How sure are you that this isn't necessary?
> 
>>  From commitlog, your intention seems to be making sure to wake up the queue,
>> but you wake up the queue immediately after this hunk in the same function,
>> so isn't it guaranteed without scheduling another napi?
>>
> 
> The above code catches the case, where the ptr_ring is empty and the
> tx_queue is stopped.  It feels wrong not to reach in this case, but you
> *might* be right that it isn't strictly necessary, because below code
> will also call netif_tx_wake_queue() which *should* have a SKB stored
> that will *indirectly* trigger a restart of the NAPI.

I'm a bit confused.
Wrt (3), what you want is waking up the queue, right?
Or, what you want is actually NAPI reschedule itself?

My understanding was the former (wake up the queue).
If it's correct, (3) seems not necessary because you have already woken up the queue 
in the same function.

First NAPI
  veth_poll()
    // ptr_ring_empty() and queue_stopped()
   __napi_schedule() ... schedule second NAPI
   netif_tx_wake_queue() ... wake up the queue if queue_stopped()

Second NAPI
  veth_poll()
   netif_tx_wake_queue() ... this is what you want,
                             but the queue has been woken up in the first NAPI
                             What's the point?

> I will stare some more at the code to see if I can convince myself that
> we don't have to catch this case.
> 
> Please, also provide "How sure are you that this isn't necessary?"

I could not find the case we need (3) as I explained above.

--
Toshiaki Makita

Re: [PATCH net V2 2/2] veth: more robust handing of race to avoid txq getting stuck

Posted by Jesper Dangaard Brouer 5 months, 1 week ago


On 29/10/2025 16.00, Toshiaki Makita wrote:
> On 2025/10/29 19:33, Jesper Dangaard Brouer wrote:
>> On 28/10/2025 15.56, Toshiaki Makita wrote:
>>> On 2025/10/28 5:05, Jesper Dangaard Brouer wrote:
>>>> (3) Finally, the NAPI completion check in veth_poll() is updated. If 
>>>> NAPI is
>>>> about to complete (napi_complete_done), it now also checks if the 
>>>> peer TXQ
>>>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>>>> reschedule itself. This prevents a new race where the producer stops 
>>>> the
>>>> queue just as the consumer is finishing its poll, ensuring the 
>>>> wakeup is not missed.
>>> ...
>>>
>>>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, 
>>>> int budget)
>>>>       if (done < budget && napi_complete_done(napi, done)) {
>>>>           /* Write rx_notify_masked before reading ptr_ring */
>>>>           smp_store_mb(rq->rx_notify_masked, false);
>>>> -        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>>>> +        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>>>> +                 (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>>>
>>> Not sure if this is necessary.
>>
>> How sure are you that this isn't necessary?
>>
>>>  From commitlog, your intention seems to be making sure to wake up 
>>> the queue,
>>> but you wake up the queue immediately after this hunk in the same 
>>> function,
>>> so isn't it guaranteed without scheduling another napi?
>>>
>>
>> The above code catches the case, where the ptr_ring is empty and the
>> tx_queue is stopped.  It feels wrong not to reach in this case, but you
>> *might* be right that it isn't strictly necessary, because below code
>> will also call netif_tx_wake_queue() which *should* have a SKB stored
>> that will *indirectly* trigger a restart of the NAPI.
> 
> I'm a bit confused.
> Wrt (3), what you want is waking up the queue, right?
> Or, what you want is actually NAPI reschedule itself?

I want NAPI to reschedule itself, the queue it woken up later close to
the exit of the function.  Maybe it is unnecessary to for NAPI to
reschedule itself here... and that is what you are objecting to?

> My understanding was the former (wake up the queue).
> If it's correct, (3) seems not necessary because you have already woken 
> up the queue in the same function.
> 
> First NAPI
>   veth_poll()
>     // ptr_ring_empty() and queue_stopped()
>    __napi_schedule() ... schedule second NAPI
>    netif_tx_wake_queue() ... wake up the queue if queue_stopped()
> 
> Second NAPI
>   veth_poll()
>    netif_tx_wake_queue() ... this is what you want,
>                              but the queue has been woken up in the 
> first NAPI
>                              What's the point?
> 

So, yes I agree that there is a potential for restarting NAPI one time
too many.  But only *potential* because if NAPI is already/still running
then the producer will not actually start NAPI.

I guess this is a kind of optimization, to avoid the time it takes to
restart NAPI. When we see that TXQ is stopped and ptr_ring is empty,
then we know that a packet will be sitting in the qdisc requeue queue,
and netif_tx_wake_queue() will very soon fill "produce" a packet into
ptr_ring (via calling ndo_start_xmit/veth_xmit).

As this is a fixes patch I can drop this optimization. It seems both
Paolo and you thinks this isn't necessary.

--Jesper

Re: [PATCH net V2 2/2] veth: more robust handing of race to avoid txq getting stuck

Posted by Toshiaki Makita 5 months ago

On 2025/10/31 4:06, Jesper Dangaard Brouer wrote:
> On 29/10/2025 16.00, Toshiaki Makita wrote:
>> On 2025/10/29 19:33, Jesper Dangaard Brouer wrote:
>>> On 28/10/2025 15.56, Toshiaki Makita wrote:
>>>> On 2025/10/28 5:05, Jesper Dangaard Brouer wrote:
>>>>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>>>>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>>>>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>>>>> reschedule itself. This prevents a new race where the producer stops the
>>>>> queue just as the consumer is finishing its poll, ensuring the wakeup is not 
>>>>> missed.
>>>> ...
>>>>
>>>>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>>>>>       if (done < budget && napi_complete_done(napi, done)) {
>>>>>           /* Write rx_notify_masked before reading ptr_ring */
>>>>>           smp_store_mb(rq->rx_notify_masked, false);
>>>>> -        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>>>>> +        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>>>>> +                 (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>>>>
>>>> Not sure if this is necessary.
>>>
>>> How sure are you that this isn't necessary?
>>>
>>>>  From commitlog, your intention seems to be making sure to wake up the queue,
>>>> but you wake up the queue immediately after this hunk in the same function,
>>>> so isn't it guaranteed without scheduling another napi?
>>>>
>>>
>>> The above code catches the case, where the ptr_ring is empty and the
>>> tx_queue is stopped.  It feels wrong not to reach in this case, but you
>>> *might* be right that it isn't strictly necessary, because below code
>>> will also call netif_tx_wake_queue() which *should* have a SKB stored
>>> that will *indirectly* trigger a restart of the NAPI.
>>
>> I'm a bit confused.
>> Wrt (3), what you want is waking up the queue, right?
>> Or, what you want is actually NAPI reschedule itself?
> 
> I want NAPI to reschedule itself, the queue it woken up later close to
> the exit of the function.  Maybe it is unnecessary to for NAPI to
> reschedule itself here... and that is what you are objecting to?
> 
>> My understanding was the former (wake up the queue).
>> If it's correct, (3) seems not necessary because you have already woken up the 
>> queue in the same function.
>>
>> First NAPI
>>   veth_poll()
>>     // ptr_ring_empty() and queue_stopped()
>>    __napi_schedule() ... schedule second NAPI
>>    netif_tx_wake_queue() ... wake up the queue if queue_stopped()
>>
>> Second NAPI
>>   veth_poll()
>>    netif_tx_wake_queue() ... this is what you want,
>>                              but the queue has been woken up in the first NAPI
>>                              What's the point?
>>
> 
> So, yes I agree that there is a potential for restarting NAPI one time
> too many.  But only *potential* because if NAPI is already/still running
> then the producer will not actually start NAPI.
> 
> I guess this is a kind of optimization, to avoid the time it takes to
> restart NAPI. When we see that TXQ is stopped and ptr_ring is empty,
> then we know that a packet will be sitting in the qdisc requeue queue,
> and netif_tx_wake_queue() will very soon fill "produce" a packet into
> ptr_ring (via calling ndo_start_xmit/veth_xmit).

In some cases it may be an optimization but not in every case because it can 
prematurely start NAPI before tx side fills packets?

> As this is a fixes patch I can drop this optimization. It seems both
> Paolo and you thinks this isn't necessary.

I think it's better to drop (3) as a fix.

Toshiaki Makita

Re: [PATCH net V2 2/2] veth: more robust handing of race to avoid txq getting stuck

Posted by Toke Høiland-Jørgensen 5 months, 1 week ago

Jesper Dangaard Brouer <hawk@kernel.org> writes:

> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") introduced a race condition that can lead to a permanently
> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
> Max).
>
> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
> intended to re-wake the queue if the consumer had just emptied it (if
> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
> traffic halts.
>
> This failure is caused by an incorrect use of the __ptr_ring_empty() API
> from the producer side. As noted in kernel comments, this check is not
> guaranteed to be correct if a consumer is operating on another CPU. The
> empty test is based on ptr_ring->consumer_head, making it reliable only for
> the consumer. Using this check from the producer side is fundamentally racy.
>
> This patch fixes the race by adopting the more robust logic from an earlier
> version V4 of the patchset, which always flushed the peer:
>
> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
> are removed. Instead, after stopping the queue, we unconditionally call
> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
> making it solely responsible for re-waking the TXQ.
>
> (2) On the consumer side, the logic for waking the peer TXQ is moved out of
> veth_xdp_rcv() and placed at the end of the veth_poll() function. This
> placement is part of fixing the race, as the netif_tx_queue_stopped() check
> must occur after rx_notify_masked is potentially set to false during NAPI
> completion.
>  This handles the race where veth_poll() consumes all packets and completes
> NAPI before veth_xmit() on the producer side has called netif_tx_stop_queue().
> In this state, the producer's __veth_xdp_flush(rq) call will see
> rx_notify_masked is false and reschedule NAPI. This new NAPI poll, even if it
> processes no packets, is now guaranteed to run the netif_tx_queue_stopped()
> check, see the stopped queue, and wake it up, allowing veth_xmit() to proceed.
>
> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
> about to complete (napi_complete_done), it now also checks if the peer TXQ
> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
> reschedule itself. This prevents a new race where the producer stops the
> queue just as the consumer is finishing its poll, ensuring the wakeup is not
> missed.
>
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>