Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") introduced a race condition that can lead to a permanently
stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
Max).

The race occurs in veth_xmit(). The producer observes a full ptr_ring and
stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
intended to re-wake the queue if the consumer had just emptied it (if
(__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
"lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
traffic halts.

This failure is caused by an incorrect use of the __ptr_ring_empty() API
from the producer side. As noted in kernel comments, this check is not
guaranteed to be correct if a consumer is operating on another CPU. The
empty test is based on ptr_ring->consumer_head, making it reliable only for
the consumer. Using this check from the producer side is fundamentally racy.

This patch fixes the race by adopting the more robust logic from an earlier
version V4 of the patchset, which always flushed the peer:

(1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
are removed. Instead, after stopping the queue, we unconditionally call
__veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
making it solely responsible for re-waking the TXQ.

(2) On the consumer side, the logic for waking the peer TXQ is centralized.
It is moved out of veth_xdp_rcv() (which processes a batch) and placed at
the end of the veth_poll() function. This ensures netif_tx_wake_queue() is
called once per complete NAPI poll cycle.

(3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
about to complete (napi_complete_done), it now also checks if the peer TXQ
is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
reschedule itself. This prevents a new race where the producer stops the
queue just as the consumer is finishing its poll, ensuring the wakeup is not
missed.

Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 42 +++++++++++++++++++++---------------------
1 file changed, 21 insertions(+), 21 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 3976ddda5fb8..1d70377481eb 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
/* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
__skb_push(skb, ETH_HLEN);
- /* Depend on prior success packets started NAPI consumer via
- * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
- * paired with empty check in veth_poll().
- */
netif_tx_stop_queue(txq);
- smp_mb__after_atomic();
- if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
- netif_tx_wake_queue(txq);
+ /* Handle race: Makes sure NAPI peer consumer runs. Consumer is
+ * responsible for starting txq again, until then ndo_start_xmit
+ * (this function) will not be invoked by the netstack again.
+ */
+ __veth_xdp_flush(rq);
break;
case NET_RX_DROP: /* same as NET_XMIT_DROP */
drop:
@@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
struct veth_stats *stats)
{
- struct veth_priv *priv = netdev_priv(rq->dev);
- int queue_idx = rq->xdp_rxq.queue_index;
- struct netdev_queue *peer_txq;
- struct net_device *peer_dev;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
- /* NAPI functions as RCU section */
- peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
- peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
-
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
- txq_trans_cond_update(peer_txq);
- netif_tx_wake_queue(peer_txq);
- }
-
return done;
}
@@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
{
struct veth_rq *rq =
container_of(napi, struct veth_rq, xdp_napi);
+ struct veth_priv *priv = netdev_priv(rq->dev);
+ int queue_idx = rq->xdp_rxq.queue_index;
+ struct netdev_queue *peer_txq;
struct veth_stats stats = {};
+ struct net_device *peer_dev;
struct veth_xdp_tx_bq bq;
int done;
bq.count = 0;
+ /* NAPI functions as RCU section */
+ peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
+ peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
+
xdp_set_return_frame_no_direct();
done = veth_xdp_rcv(rq, budget, &bq, &stats);
@@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
if (done < budget && napi_complete_done(napi, done)) {
/* Write rx_notify_masked before reading ptr_ring */
smp_store_mb(rq->rx_notify_masked, false);
- if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
+ if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
+ (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
if (napi_schedule_prep(&rq->xdp_napi)) {
WRITE_ONCE(rq->rx_notify_masked, true);
__napi_schedule(&rq->xdp_napi);
@@ -998,6 +992,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
veth_xdp_flush(rq, &bq);
xdp_clear_return_frame_no_direct();
+ /* Release backpressure per NAPI poll */
+ if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
+ txq_trans_cond_update(peer_txq);
+ netif_tx_wake_queue(peer_txq);
+ }
+
return done;
}
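
For illustration, a rough two-CPU timeline of the lost wakeup described in
the commit message above (an editorial sketch of the interleaving, not an
exact trace):

/*
 *   CPU-A: veth_xmit() (producer)            CPU-B: veth_poll() (consumer)
 *   ------------------------------           ------------------------------
 *   __ptr_ring_produce() fails (ring full)
 *                                             __ptr_ring_consume() drains the ring
 *                                             NAPI completes; nothing left to do
 *   netif_tx_stop_queue(txq)
 *   smp_mb__after_atomic()
 *   __ptr_ring_empty() keys off
 *     consumer_head, which can be stale
 *     on this CPU -> reports "not empty"
 *     -> netif_tx_wake_queue() is skipped
 *                                             (NAPI stays idle; with the TXQ
 *                                              stopped no new xmit arrives, so
 *                                              nothing ever wakes the queue)
 *
 * With the patch, veth_xmit() instead calls __veth_xdp_flush(rq) after
 * stopping the queue, so the NAPI consumer is guaranteed to run and becomes
 * solely responsible for netif_tx_wake_queue().
 */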
On 23/10/2025 16.59, Jesper Dangaard Brouer wrote:
[...]
> ---
> drivers/net/veth.c | 42 +++++++++++++++++++++---------------------
> 1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 3976ddda5fb8..1d70377481eb 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
> /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
> __skb_push(skb, ETH_HLEN);
> - /* Depend on prior success packets started NAPI consumer via
> - * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
> - * paired with empty check in veth_poll().
> - */
> netif_tx_stop_queue(txq);
> - smp_mb__after_atomic();
> - if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
> - netif_tx_wake_queue(txq);
> + /* Handle race: Makes sure NAPI peer consumer runs. Consumer is
> + * responsible for starting txq again, until then ndo_start_xmit
> + * (this function) will not be invoked by the netstack again.
> + */
> + __veth_xdp_flush(rq);
> break;
> case NET_RX_DROP: /* same as NET_XMIT_DROP */
> drop:
[...]
> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
> if (done < budget && napi_complete_done(napi, done)) {
> /* Write rx_notify_masked before reading ptr_ring */
> smp_store_mb(rq->rx_notify_masked, false);
> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
> if (napi_schedule_prep(&rq->xdp_napi)) {
> WRITE_ONCE(rq->rx_notify_masked, true);
> __napi_schedule(&rq->xdp_napi);
> @@ -998,6 +992,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
> veth_xdp_flush(rq, &bq);
> xdp_clear_return_frame_no_direct();
>
> + /* Release backpressure per NAPI poll */
> + if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
^^^^^^^^^^^^^^^^^^^^^^
The check netif_tx_queue_stopped() uses a non-atomic test_bit().
Thus, I'm considering adding an smp_rmb() before the if statement, to be
paired with the netif_tx_stop_queue() in veth_xmit() (see the placement
sketch after the quoted hunk below).
> + txq_trans_cond_update(peer_txq);
> + netif_tx_wake_queue(peer_txq);
> + }
> +
> return done;
> }
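
For reference, a sketch of where such a barrier could sit at the end of
veth_poll(); this only illustrates the idea above (hypothetical placement,
not part of the posted patch):

	/* Release backpressure per NAPI poll */
	smp_rmb();	/* intended to pair with netif_tx_stop_queue() in veth_xmit() */
	if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
		txq_trans_cond_update(peer_txq);
		netif_tx_wake_queue(peer_txq);
	}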
--Jesper
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") introduced a race condition that can lead to a permanently
> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
> Max).
>
> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
> intended to re-wake the queue if the consumer had just emptied it (if
> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
> traffic halts.
>
> This failure is caused by an incorrect use of the __ptr_ring_empty() API
> from the producer side. As noted in kernel comments, this check is not
> guaranteed to be correct if a consumer is operating on another CPU. The
> empty test is based on ptr_ring->consumer_head, making it reliable only for
> the consumer. Using this check from the producer side is fundamentally racy.
>
> This patch fixes the race by adopting the more robust logic from an earlier
> version V4 of the patchset, which always flushed the peer:
>
> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
> are removed. Instead, after stopping the queue, we unconditionally call
> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
> making it solely responsible for re-waking the TXQ.
This makes sense.
> (2) On the consumer side, the logic for waking the peer TXQ is centralized.
> It is moved out of veth_xdp_rcv() (which processes a batch) and placed at
> the end of the veth_poll() function. This ensures netif_tx_wake_queue() is
> called once per complete NAPI poll cycle.
So is this second point strictly necessary to fix the race, or is it
more of an optimisation?
> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
> about to complete (napi_complete_done), it now also checks if the peer TXQ
> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
> reschedule itself. This prevents a new race where the producer stops the
> queue just as the consumer is finishing its poll, ensuring the wakeup is not
> missed.
Also makes sense...
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 42 +++++++++++++++++++++---------------------
> 1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 3976ddda5fb8..1d70377481eb 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
> /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
> __skb_push(skb, ETH_HLEN);
> - /* Depend on prior success packets started NAPI consumer via
> - * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
> - * paired with empty check in veth_poll().
> - */
> netif_tx_stop_queue(txq);
> - smp_mb__after_atomic();
> - if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
> - netif_tx_wake_queue(txq);
> + /* Handle race: Makes sure NAPI peer consumer runs. Consumer is
> + * responsible for starting txq again, until then ndo_start_xmit
> + * (this function) will not be invoked by the netstack again.
> + */
> + __veth_xdp_flush(rq);
Nit: I'd lose the "Handle race:" prefix from the comment; the rest of
the comment is clear enough without it, and since there's no explanation
of *which* race is being handled, it is just confusing, IMO.
> break;
> case NET_RX_DROP: /* same as NET_XMIT_DROP */
> drop:
> @@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> struct veth_xdp_tx_bq *bq,
> struct veth_stats *stats)
> {
> - struct veth_priv *priv = netdev_priv(rq->dev);
> - int queue_idx = rq->xdp_rxq.queue_index;
> - struct netdev_queue *peer_txq;
> - struct net_device *peer_dev;
> int i, done = 0, n_xdpf = 0;
> void *xdpf[VETH_XDP_BATCH];
>
> - /* NAPI functions as RCU section */
> - peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
> - peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
> -
> for (i = 0; i < budget; i++) {
> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>
> @@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> rq->stats.vs.xdp_packets += done;
> u64_stats_update_end(&rq->stats.syncp);
>
> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
> - txq_trans_cond_update(peer_txq);
> - netif_tx_wake_queue(peer_txq);
> - }
> -
> return done;
> }
>
> @@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
> {
> struct veth_rq *rq =
> container_of(napi, struct veth_rq, xdp_napi);
> + struct veth_priv *priv = netdev_priv(rq->dev);
> + int queue_idx = rq->xdp_rxq.queue_index;
> + struct netdev_queue *peer_txq;
> struct veth_stats stats = {};
> + struct net_device *peer_dev;
> struct veth_xdp_tx_bq bq;
> int done;
>
> bq.count = 0;
>
> + /* NAPI functions as RCU section */
> + peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
> + peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
> +
> xdp_set_return_frame_no_direct();
> done = veth_xdp_rcv(rq, budget, &bq, &stats);
>
> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
> if (done < budget && napi_complete_done(napi, done)) {
> /* Write rx_notify_masked before reading ptr_ring */
> smp_store_mb(rq->rx_notify_masked, false);
> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
> if (napi_schedule_prep(&rq->xdp_napi)) {
> WRITE_ONCE(rq->rx_notify_masked, true);
> __napi_schedule(&rq->xdp_napi);
> @@ -998,6 +992,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
> veth_xdp_flush(rq, &bq);
> xdp_clear_return_frame_no_direct();
>
> + /* Release backpressure per NAPI poll */
> + if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
> + txq_trans_cond_update(peer_txq);
> + netif_tx_wake_queue(peer_txq);
> + }
> +
> return done;
> }
>
On 24/10/2025 16.33, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>
>> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
>> reduce TX drops") introduced a race condition that can lead to a permanently
>> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
>> Max).
>>
>> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
>> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
>> intended to re-wake the queue if the consumer had just emptied it (if
>> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
>> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
>> traffic halts.
>>
>> This failure is caused by an incorrect use of the __ptr_ring_empty() API
>> from the producer side. As noted in kernel comments, this check is not
>> guaranteed to be correct if a consumer is operating on another CPU. The
>> empty test is based on ptr_ring->consumer_head, making it reliable only for
>> the consumer. Using this check from the producer side is fundamentally racy.
>>
>> This patch fixes the race by adopting the more robust logic from an earlier
>> version V4 of the patchset, which always flushed the peer:
>>
>> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
>> are removed. Instead, after stopping the queue, we unconditionally call
>> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
>> making it solely responsible for re-waking the TXQ.
>
> This makes sense.
>
>> (2) On the consumer side, the logic for waking the peer TXQ is centralized.
>> It is moved out of veth_xdp_rcv() (which processes a batch) and placed at
>> the end of the veth_poll() function. This ensures netif_tx_wake_queue() is
>> called once per complete NAPI poll cycle.
>
> So is this second point strictly necessary to fix the race, or is it
> more of an optimisation?
>
IMHO it is strictly necessary to fix the race. The wakeup check
netif_tx_queue_stopped() in veth_poll() needs to come after the code that
(potentially) writes rx_notify_masked.

This handles the race where veth_xmit() hasn't called
netif_tx_stop_queue() yet, but veth_poll() manages to consume all packets
and stops NAPI. Then we know that __veth_xdp_flush(rq) in veth_xmit()
will see rx_notify_masked==false and start NAPI/veth_poll() again, and
even though there are no packets left to process we still hit the
netif_tx_queue_stopped() check, which starts the txq again and allows
veth_xmit() to run.
I'll see if I can improve the description for (2).
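
To make the interleaving above concrete (an editorial two-CPU sketch, not an
exact trace):

/*
 *   CPU-A: veth_xmit() (producer)        CPU-B: veth_poll() (consumer)
 *   -----------------------------        ------------------------------
 *   __ptr_ring_produce() fails
 *                                        consumes the remaining packets
 *                                        napi_complete_done()
 *                                        smp_store_mb(rx_notify_masked, false)
 *   netif_tx_stop_queue(txq)
 *   __veth_xdp_flush(rq)
 *     -> sees rx_notify_masked == false
 *     -> schedules NAPI again
 *                                        veth_poll() runs again: the ring is
 *                                        empty, but the end-of-poll check still
 *                                        sees the stopped peer TXQ and calls
 *                                        netif_tx_wake_queue()
 *
 * Keeping that check at the end of veth_poll(), after the rx_notify_masked
 * handling, ensures the re-scheduled (empty) poll always reaches it.
 */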
>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>> reschedule itself. This prevents a new race where the producer stops the
>> queue just as the consumer is finishing its poll, ensuring the wakeup is not
>> missed.
>
> Also makes sense...
>
>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>> drivers/net/veth.c | 42 +++++++++++++++++++++---------------------
>> 1 file changed, 21 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 3976ddda5fb8..1d70377481eb 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>> }
>> /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
>> __skb_push(skb, ETH_HLEN);
>> - /* Depend on prior success packets started NAPI consumer via
>> - * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
>> - * paired with empty check in veth_poll().
>> - */
>> netif_tx_stop_queue(txq);
>> - smp_mb__after_atomic();
>> - if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
>> - netif_tx_wake_queue(txq);
>> + /* Handle race: Makes sure NAPI peer consumer runs. Consumer is
>> + * responsible for starting txq again, until then ndo_start_xmit
>> + * (this function) will not be invoked by the netstack again.
>> + */
>> + __veth_xdp_flush(rq);
>
> Nit: I'd lose the "Handle race:" prefix from the comment; the rest of
> the comment is clear enough without it, and since there's no explanation
> of *which* race is being handled, it is just confusing, IMO.
Good point, I will remove the prefix.
>> break;
>> case NET_RX_DROP: /* same as NET_XMIT_DROP */
>> drop:
>> @@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> struct veth_xdp_tx_bq *bq,
>> struct veth_stats *stats)
>> {
>> - struct veth_priv *priv = netdev_priv(rq->dev);
>> - int queue_idx = rq->xdp_rxq.queue_index;
>> - struct netdev_queue *peer_txq;
>> - struct net_device *peer_dev;
>> int i, done = 0, n_xdpf = 0;
>> void *xdpf[VETH_XDP_BATCH];
>>
>> - /* NAPI functions as RCU section */
>> - peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>> - peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>> -
>> for (i = 0; i < budget; i++) {
>> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>>
>> @@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> rq->stats.vs.xdp_packets += done;
>> u64_stats_update_end(&rq->stats.syncp);
>>
>> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
>> - txq_trans_cond_update(peer_txq);
>> - netif_tx_wake_queue(peer_txq);
>> - }
>> -
>> return done;
>> }
>>
>> @@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
>> {
>> struct veth_rq *rq =
>> container_of(napi, struct veth_rq, xdp_napi);
>> + struct veth_priv *priv = netdev_priv(rq->dev);
>> + int queue_idx = rq->xdp_rxq.queue_index;
>> + struct netdev_queue *peer_txq;
>> struct veth_stats stats = {};
>> + struct net_device *peer_dev;
>> struct veth_xdp_tx_bq bq;
>> int done;
>>
>> bq.count = 0;
>>
>> + /* NAPI functions as RCU section */
>> + peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>> + peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>> +
>> xdp_set_return_frame_no_direct();
>> done = veth_xdp_rcv(rq, budget, &bq, &stats);
>>
>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>> if (done < budget && napi_complete_done(napi, done)) {
>> /* Write rx_notify_masked before reading ptr_ring */
>> smp_store_mb(rq->rx_notify_masked, false);
>> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>> if (napi_schedule_prep(&rq->xdp_napi)) {
>> WRITE_ONCE(rq->rx_notify_masked, true);
>> __napi_schedule(&rq->xdp_napi);
>> @@ -998,6 +992,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
>> veth_xdp_flush(rq, &bq);
>> xdp_clear_return_frame_no_direct();
>>
>> + /* Release backpressure per NAPI poll */
>> + if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
>> + txq_trans_cond_update(peer_txq);
>> + netif_tx_wake_queue(peer_txq);
>> + }
>> +
>> return done;
>> }
>>
>
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> On 24/10/2025 16.33, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>>
>>> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
>>> reduce TX drops") introduced a race condition that can lead to a permanently
>>> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
>>> Max).
>>>
>>> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
>>> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
>>> intended to re-wake the queue if the consumer had just emptied it (if
>>> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
>>> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
>>> traffic halts.
>>>
>>> This failure is caused by an incorrect use of the __ptr_ring_empty() API
>>> from the producer side. As noted in kernel comments, this check is not
>>> guaranteed to be correct if a consumer is operating on another CPU. The
>>> empty test is based on ptr_ring->consumer_head, making it reliable only for
>>> the consumer. Using this check from the producer side is fundamentally racy.
>>>
>>> This patch fixes the race by adopting the more robust logic from an earlier
>>> version V4 of the patchset, which always flushed the peer:
>>>
>>> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
>>> are removed. Instead, after stopping the queue, we unconditionally call
>>> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
>>> making it solely responsible for re-waking the TXQ.
>>
>> This makes sense.
>>
>>> (2) On the consumer side, the logic for waking the peer TXQ is centralized.
>>> It is moved out of veth_xdp_rcv() (which processes a batch) and placed at
>>> the end of the veth_poll() function. This ensures netif_tx_wake_queue() is
>>> called once per complete NAPI poll cycle.
>>
>> So is this second point strictly necessary to fix the race, or is it
>> more of an optimisation?
>>
>
> IMHO it is strictly necessary to fix the race. The wakeup check
> netif_tx_queue_stopped() in veth_poll() needs to be after the code that
> (potentially) writes rx_notify_masked.
>
> This handles the race where veth_xmit() haven't called
> netif_tx_stop_queue() yet, but veth_poll() manage to consume all packets
> and stopped NAPI. Then we know that __veth_xdp_flush(rq) in veth_xmit()
> will see rx_notify_masked==false and start NAPI/veth_poll() again, and
> even-though there is no packets left to process we still hit the check
> netif_tx_queue_stopped() which start txq and will allow veth_xmit() to
> run again.
>
> I'll see if I can improve the description for (2).
Right, okay. Yes, adding this reasoning to the commit message would be
good :)
-Toke