The changes introduced in commit dc82a33297fc ("veth: apply qdisc
backpressure on full ptr_ring to reduce TX drops") have been found to cause
a race condition in production environments.
Under specific circumstances, observed exclusively on ARM64 (aarch64)
systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
permanently stalled. This happens when the race condition leads to the TXQ
entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
preventing the attached qdisc from dequeueing packets and causing the
network link to halt.
As a first step towards resolving this issue, this patch introduces a
failsafe mechanism. It enables the net device watchdog by setting a timeout
value and implements the .ndo_tx_timeout callback.
If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
which logs a warning and calls netif_tx_wake_queue() to unstall the queue
and allow traffic to resume.
The log message will look like this:
veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
This provides a necessary recovery mechanism while the underlying race
condition is investigated further. Subsequent patches will address the root
cause and add more robust state handling in ndo_open/ndo_stop.
Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a3046142cb8e..7b1a9805b270 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
+ if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
+ txq_trans_cond_update(peer_txq);
netif_tx_wake_queue(peer_txq);
+ }
return done;
}
@@ -1373,6 +1375,16 @@ static int veth_set_channels(struct net_device *dev,
goto out;
}
+static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+ struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
+
+ netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
+ atomic_long_read(&txq->trans_timeout), txqueue);
+
+ netif_tx_wake_queue(txq);
+}
+
static int veth_open(struct net_device *dev)
{
struct veth_priv *priv = netdev_priv(dev);
@@ -1711,6 +1723,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_bpf = veth_xdp,
.ndo_xdp_xmit = veth_ndo_xdp_xmit,
.ndo_get_peer_dev = veth_peer_dev,
+ .ndo_tx_timeout = veth_tx_timeout,
};
static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
@@ -1749,6 +1762,7 @@ static void veth_setup(struct net_device *dev)
dev->priv_destructor = veth_dev_free;
dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
dev->max_mtu = ETH_MAX_MTU;
+ dev->watchdog_timeo = msecs_to_jiffies(5000);
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
> backpressure on full ptr_ring to reduce TX drops") have been found to cause
> a race condition in production environments.
>
> Under specific circumstances, observed exclusively on ARM64 (aarch64)
> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
> permanently stalled. This happens when the race condition leads to the TXQ
> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
> preventing the attached qdisc from dequeueing packets and causing the
> network link to halt.
>
> As a first step towards resolving this issue, this patch introduces a
> failsafe mechanism. It enables the net device watchdog by setting a timeout
> value and implements the .ndo_tx_timeout callback.
>
> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
> and allow traffic to resume.
>
> The log message will look like this:
>
> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>
> This provides a necessary recovery mechanism while the underlying race
> condition is investigated further. Subsequent patches will address the root
> cause and add more robust state handling in ndo_open/ndo_stop.
>
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index a3046142cb8e..7b1a9805b270 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> rq->stats.vs.xdp_packets += done;
> u64_stats_update_end(&rq->stats.syncp);
>
> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
> + txq_trans_cond_update(peer_txq);
> netif_tx_wake_queue(peer_txq);
> + }
Hmm, seems a bit weird that this call to txq_trans_cond_update() is only
in veth_xdp_rcv(). Shouldn't there (also?) be one in veth_xmit()?
-Toke
On 24/10/2025 15.39, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>
>> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
>> backpressure on full ptr_ring to reduce TX drops") have been found to cause
>> a race condition in production environments.
>>
>> Under specific circumstances, observed exclusively on ARM64 (aarch64)
>> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
>> permanently stalled. This happens when the race condition leads to the TXQ
>> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
>> preventing the attached qdisc from dequeueing packets and causing the
>> network link to halt.
>>
>> As a first step towards resolving this issue, this patch introduces a
>> failsafe mechanism. It enables the net device watchdog by setting a timeout
>> value and implements the .ndo_tx_timeout callback.
>>
>> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
>> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
>> and allow traffic to resume.
>>
>> The log message will look like this:
>>
>> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
>> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>>
>> This provides a necessary recovery mechanism while the underlying race
>> condition is investigated further. Subsequent patches will address the root
>> cause and add more robust state handling in ndo_open/ndo_stop.
>>
>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>> drivers/net/veth.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index a3046142cb8e..7b1a9805b270 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> rq->stats.vs.xdp_packets += done;
>> u64_stats_update_end(&rq->stats.syncp);
>>
>> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
>> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
>> + txq_trans_cond_update(peer_txq);
>> netif_tx_wake_queue(peer_txq);
>> + }
>
> Hmm, seems a bit weird that this call to txq_trans_cond_update() is only
> in veth_xdp_recv(). Shouldn't there (also?) be one in veth_xmit()?
>
The veth_xmit() call (indirectly) *does* update the txq trans_start
timestamp, but only for return code NET_RX_SUCCESS / NETDEV_TX_OK:
since .ndo_start_xmit = veth_xmit, netdev_start_xmit()[1] will call
txq_trans_update() on NETDEV_TX_OK.
This call to txq_trans_cond_update() isn't strictly necessary, as a
later veth_xmit() call will update the timestamp anyway, and the
netif_tx_stop_queue() call also updates trans_start.
I primarily added it because the helper functions of other drivers that
use BQL update the trans_start timestamp. I see the veth implementation
as a simplified form of BQL, one that we can hopefully extend to become
more dynamic like BQL.
Would you prefer that I remove this call to txq_trans_cond_update()?
--Jesper
[1]
https://elixir.bootlin.com/linux/v6.17.5/source/include/linux/netdevice.h#L5222-L5233
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> On 24/10/2025 15.39, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>>
>>> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
>>> backpressure on full ptr_ring to reduce TX drops") have been found to cause
>>> a race condition in production environments.
>>>
>>> Under specific circumstances, observed exclusively on ARM64 (aarch64)
>>> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
>>> permanently stalled. This happens when the race condition leads to the TXQ
>>> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
>>> preventing the attached qdisc from dequeueing packets and causing the
>>> network link to halt.
>>>
>>> As a first step towards resolving this issue, this patch introduces a
>>> failsafe mechanism. It enables the net device watchdog by setting a timeout
>>> value and implements the .ndo_tx_timeout callback.
>>>
>>> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
>>> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
>>> and allow traffic to resume.
>>>
>>> The log message will look like this:
>>>
>>> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
>>> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>>>
>>> This provides a necessary recovery mechanism while the underlying race
>>> condition is investigated further. Subsequent patches will address the root
>>> cause and add more robust state handling in ndo_open/ndo_stop.
>>>
>>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>> ---
>>> drivers/net/veth.c | 16 +++++++++++++++-
>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index a3046142cb8e..7b1a9805b270 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>> rq->stats.vs.xdp_packets += done;
>>> u64_stats_update_end(&rq->stats.syncp);
>>>
>>> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
>>> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
>>> + txq_trans_cond_update(peer_txq);
>>> netif_tx_wake_queue(peer_txq);
>>> + }
>>
>> Hmm, seems a bit weird that this call to txq_trans_cond_update() is only
>> in veth_xdp_recv(). Shouldn't there (also?) be one in veth_xmit()?
>>
>
> The veth_xmit() call (indirectly) *do* update the txq_trans start
> timestamp, but only for return code NET_RX_SUCCESS / NETDEV_TX_OK.
> As .ndo_start_xmit = veth_xmit and netdev_start_xmit[1] will call
> txq_trans_update on NETDEV_TX_OK.
Ah, right; didn't think of checking the caller, thanks for the pointer :)
> This call to txq_trans_cond_update() isn't strictly necessary, as
> veth_xmit() call will update it later, and the netif_tx_stop_queue()
> call also updates trans_start.
>
> I primarily added it because other drivers that use BQL have their
> helper functions update txq_trans. As I see the veth implementation as
> a simplified BQL, that we hopefully can extend to become more dynamic
> like BQL.
>
> Do you prefer that I remove this? (call to txq_trans_cond_update)
Hmm, don't we need it for the XDP path? I.e., if there's no traffic
other than XDP_REDIRECT traffic, ndo_start_xmit() will not get called,
so we need some way other to keep the watchdog from firing, I think?
-Toke
On 27/10/2025 15.09, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>
>> On 24/10/2025 15.39, Toke Høiland-Jørgensen wrote:
>>> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>>>
>>>> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
>>>> backpressure on full ptr_ring to reduce TX drops") have been found to cause
>>>> a race condition in production environments.
>>>>
>>>> Under specific circumstances, observed exclusively on ARM64 (aarch64)
>>>> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
>>>> permanently stalled. This happens when the race condition leads to the TXQ
>>>> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
>>>> preventing the attached qdisc from dequeueing packets and causing the
>>>> network link to halt.
>>>>
>>>> As a first step towards resolving this issue, this patch introduces a
>>>> failsafe mechanism. It enables the net device watchdog by setting a timeout
>>>> value and implements the .ndo_tx_timeout callback.
>>>>
>>>> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
>>>> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
>>>> and allow traffic to resume.
>>>>
>>>> The log message will look like this:
>>>>
>>>> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
>>>> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>>>>
>>>> This provides a necessary recovery mechanism while the underlying race
>>>> condition is investigated further. Subsequent patches will address the root
>>>> cause and add more robust state handling in ndo_open/ndo_stop.
>>>>
>>>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>> ---
>>>> drivers/net/veth.c | 16 +++++++++++++++-
>>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index a3046142cb8e..7b1a9805b270 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>> rq->stats.vs.xdp_packets += done;
>>>> u64_stats_update_end(&rq->stats.syncp);
>>>>
>>>> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
>>>> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
>>>> + txq_trans_cond_update(peer_txq);
>>>> netif_tx_wake_queue(peer_txq);
>>>> + }
>>>
>>> Hmm, seems a bit weird that this call to txq_trans_cond_update() is only
>>> in veth_xdp_recv(). Shouldn't there (also?) be one in veth_xmit()?
>>>
>>
>> The veth_xmit() call (indirectly) *do* update the txq_trans start
>> timestamp, but only for return code NET_RX_SUCCESS / NETDEV_TX_OK.
>> As .ndo_start_xmit = veth_xmit and netdev_start_xmit[1] will call
>> txq_trans_update on NETDEV_TX_OK.
>
> Ah, right; didn't think of checking the caller, thanks for the pointer :)
>
>> This call to txq_trans_cond_update() isn't strictly necessary, as
>> veth_xmit() call will update it later, and the netif_tx_stop_queue()
>> call also updates trans_start.
>>
>> I primarily added it because other drivers that use BQL have their
>> helper functions update txq_trans. As I see the veth implementation as
>> a simplified BQL, that we hopefully can extend to become more dynamic
>> like BQL.
>>
>> Do you prefer that I remove this? (call to txq_trans_cond_update)
>
> Hmm, don't we need it for the XDP path? I.e., if there's no traffic
> other than XDP_REDIRECT traffic, ndo_start_xmit() will not get called,
> so we need some way other to keep the watchdog from firing, I think?
>
Yes, perhaps you are right. Even though the stop call,
netif_tx_stop_queue(), also updates trans_start, with XDP redirect
something else can keep the ptr_ring full. The netif_tx_wake_queue()
call doesn't update trans_start itself (that depends on a successful
netstack packet). So, without this txq_trans_cond_update() call the
timestamp can get very old (out-of-date) if we starve normal network
stack packets.
I'm not 100% sure this would even trigger the watchdog, as the queue
stopped bit should have been cleared by then. Still, it is probably
worth keeping, to avoid the timestamp getting too far out-of-date due
to XDP traffic.
--Jesper