The changes introduced in commit dc82a33297fc ("veth: apply qdisc
backpressure on full ptr_ring to reduce TX drops") have been found to cause
a race condition in production environments.

Under specific circumstances, observed exclusively on ARM64 (aarch64)
systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
permanently stalled. This happens when the race condition leads to the TXQ
entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
preventing the attached qdisc from dequeueing packets and causing the
network link to halt.

As a first step towards resolving this issue, this patch introduces a
failsafe mechanism. It enables the net device watchdog by setting a timeout
value and implements the .ndo_tx_timeout callback.

If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
which logs a warning and calls netif_tx_wake_queue() to unstall the queue
and allow traffic to resume.

The log message will look like this:

veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
veth42: veth backpressure stalled(n:1) TXQ(0) re-enable

This provides a necessary recovery mechanism while the underlying race
condition is investigated further. Subsequent patches will address the root
cause and add more robust state handling.

Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a3046142cb8e..7b1a9805b270 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
+ if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
+ txq_trans_cond_update(peer_txq);
netif_tx_wake_queue(peer_txq);
+ }
return done;
}
@@ -1373,6 +1375,16 @@ static int veth_set_channels(struct net_device *dev,
goto out;
}
+static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+ struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
+
+ netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
+ atomic_long_read(&txq->trans_timeout), txqueue);
+
+ netif_tx_wake_queue(txq);
+}
+
static int veth_open(struct net_device *dev)
{
struct veth_priv *priv = netdev_priv(dev);
@@ -1711,6 +1723,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_bpf = veth_xdp,
.ndo_xdp_xmit = veth_ndo_xdp_xmit,
.ndo_get_peer_dev = veth_peer_dev,
+ .ndo_tx_timeout = veth_tx_timeout,
};
static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
@@ -1749,6 +1762,7 @@ static void veth_setup(struct net_device *dev)
dev->priv_destructor = veth_dev_free;
dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
dev->max_mtu = ETH_MAX_MTU;
+ dev->watchdog_timeo = msecs_to_jiffies(5000);
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
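
For readers less familiar with the TXQ stop/wake machinery, the sketch below
illustrates the generic pattern behind the stall described in the commit
message. It is illustrative only -- not the actual veth code from
dc82a33297fc -- and the example_* names plus the file-scope ring are made up
for this sketch:

#include <linux/netdevice.h>
#include <linux/ptr_ring.h>
#include <linux/skbuff.h>

/* Illustrative sketch only -- NOT the actual veth code.  A single
 * file-scope ring stands in for however the driver reaches the peer's
 * ptr_ring.
 */
static struct ptr_ring example_ring;

/* Producer side (.ndo_start_xmit), running on the sender's CPU. */
static netdev_tx_t example_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct netdev_queue *txq;

	txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
	if (ptr_ring_produce(&example_ring, skb)) {
		/* Ring is full: stop the TXQ (QUEUE_STATE_DRV_XOFF) so the
		 * qdisc stops dequeueing until the consumer wakes us up.
		 */
		netif_tx_stop_queue(txq);
		return NETDEV_TX_BUSY;
	}
	return NETDEV_TX_OK;
}

/* Consumer side (NAPI poll on another CPU), after draining the ring. */
static void example_poll_done(struct netdev_queue *peer_txq)
{
	/* Race window: if the producer's netif_tx_stop_queue() is not yet
	 * visible here, no wake is issued, and with an empty ring no later
	 * poll will issue one either -- the TXQ stays stopped.  The
	 * watchdog added by this patch is the safety net for that case.
	 */
	if (netif_tx_queue_stopped(peer_txq))
		netif_tx_wake_queue(peer_txq);
}
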
On Wed, 05 Nov 2025 18:28:12 +0100 Jesper Dangaard Brouer wrote:
> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
> backpressure on full ptr_ring to reduce TX drops") have been found to cause
> a race condition in production environments.
>
> Under specific circumstances, observed exclusively on ARM64 (aarch64)
> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
> permanently stalled. This happens when the race condition leads to the TXQ
> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
> preventing the attached qdisc from dequeueing packets and causing the
> network link to halt.
>
> As a first step towards resolving this issue, this patch introduces a
> failsafe mechanism. It enables the net device watchdog by setting a timeout
> value and implements the .ndo_tx_timeout callback.
>
> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
> and allow traffic to resume.
>
> The log message will look like this:
>
> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>
> This provides a necessary recovery mechanism while the underlying race
> condition is investigated further. Subsequent patches will address the root
> cause and add more robust state handling.
>
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
I think this belongs in net-next.. Fail safe is not really a bug fix.
I'm slightly worried we're missing a corner case and will cause
timeouts to get printed for someone's config.
> +static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
> +{
> + struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
> +
> + netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
> + atomic_long_read(&txq->trans_timeout), txqueue);
If you think the trans_timeout is useful, let's add it to the message
core prints? And then we can make this msg just veth specific, I don't
think we should be repeating what core already printed.
> + netif_tx_wake_queue(txq);
> +}
> +
> static int veth_open(struct net_device *dev)
> {
> struct veth_priv *priv = netdev_priv(dev);
> @@ -1711,6 +1723,7 @@ static const struct net_device_ops veth_netdev_ops = {
> .ndo_bpf = veth_xdp,
> .ndo_xdp_xmit = veth_ndo_xdp_xmit,
> .ndo_get_peer_dev = veth_peer_dev,
> + .ndo_tx_timeout = veth_tx_timeout,
> };
>
> static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
> @@ -1749,6 +1762,7 @@ static void veth_setup(struct net_device *dev)
> dev->priv_destructor = veth_dev_free;
> dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
> dev->max_mtu = ETH_MAX_MTU;
> + dev->watchdog_timeo = msecs_to_jiffies(5000);
>
> dev->hw_features = VETH_FEATURES;
> dev->hw_enc_features = VETH_FEATURES;
>
>
On 07/11/2025 02.29, Jakub Kicinski wrote:
> On Wed, 05 Nov 2025 18:28:12 +0100 Jesper Dangaard Brouer wrote:
>> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
>> backpressure on full ptr_ring to reduce TX drops") have been found to cause
>> a race condition in production environments.
>>
>> Under specific circumstances, observed exclusively on ARM64 (aarch64)
>> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
>> permanently stalled. This happens when the race condition leads to the TXQ
>> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
>> preventing the attached qdisc from dequeueing packets and causing the
>> network link to halt.
>>
>> As a first step towards resolving this issue, this patch introduces a
>> failsafe mechanism. It enables the net device watchdog by setting a timeout
>> value and implements the .ndo_tx_timeout callback.
>>
>> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
>> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
>> and allow traffic to resume.
>>
>> The log message will look like this:
>>
>> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
>> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>>
>> This provides a necessary recovery mechanism while the underlying race
>> condition is investigated further. Subsequent patches will address the root
>> cause and add more robust state handling.
>>
>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>
> I think this belongs in net-next.. Fail safe is not really a bug fix.
> I'm slightly worried we're missing a corner case and will cause
> timeouts to get printed for someone's config.
>
This is a recovery fix. If the race condition fix isn't 100% then this
patch will allow veth to recover. Thus, to me it makes sense to group
these two patches together.
I'm more worried that we're missing a corner case that we cannot recover
from than about timeouts getting printed for a config where the NAPI
consumer veth_poll() takes more than 5 seconds to run (with a budget of
max 64 packets, that means consuming packets at a rate below
64 / 5 s = 12.8 pps). It might be good to get some warnings if the system
is operating this slowly.
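
(For reference, the usual way drivers close this kind of stop/wake race is
a "stop, barrier, re-check" scheme, roughly like the sketch below. This is
only a generic illustration of the state handling I have in mind, not the
actual race-fix patch in this series, and the example_* helpers are made up:

#include <linux/netdevice.h>
#include <linux/ptr_ring.h>

/* Producer side: called when ptr_ring_produce() failed (ring full).
 * Returns true if the TXQ stays stopped (caller returns NETDEV_TX_BUSY),
 * false if space appeared again and the caller should retry.
 */
static bool example_stop_and_recheck(struct netdev_queue *txq,
				     struct ptr_ring *ring)
{
	netif_tx_stop_queue(txq);
	/* Make the stop visible before re-checking the ring; pairs with
	 * the barrier on the consumer side.
	 */
	smp_mb();
	if (ptr_ring_full(ring))
		return true;
	/* Consumer drained the ring in between and may have missed our
	 * stop, so undo it ourselves and let the caller retry.
	 */
	netif_tx_start_queue(txq);
	return false;
}

/* Consumer side: called after draining entries from the ring. */
static void example_wake_if_stopped(struct netdev_queue *txq,
				    struct ptr_ring *ring)
{
	/* Make the freed ring space visible before checking the TXQ state;
	 * pairs with the barrier on the producer side.
	 */
	smp_mb();
	if (netif_tx_queue_stopped(txq) && !ptr_ring_full(ring))
		netif_tx_wake_queue(txq);
}
)
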
Also remember this is not the default config that most people use.
The code is only activated when attaching a qdisc to veth, which isn't
the default. Plus, NAPI mode needs to be activated, and in normal NAPI
mode the producer and consumer usually run on the same CPU, which makes
it impossible to overflow the ptr_ring. The veth backpressure is
primarily needed when running with threaded NAPI, where it is natural
for the producer and consumer to run on different CPUs. In our
production setup the consumer is always slower than the producer (as
the product inside the namespace has installed too many nftables rules).
>> +static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
>> +{
>> + struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
>> +
>> + netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
>> + atomic_long_read(&txq->trans_timeout), txqueue);
>
> If you think the trans_timeout is useful, let's add it to the message
> core prints? And then we can make this msg just veth specific, I don't
> think we should be repeating what core already printed.
The trans_timeout is a counter for how many times this TXQ has seen a
timeout. It is practical as it directly tells us whether this is a
frequent event (without having to search log files for similar events).
By the same argument, it does make sense to add this to the core message
("NETDEV WATCHDOG"). For physical NICs these logs are present in
production. Looking at logs through Kibana (right now), it would make
my life easier to see the number of times the individual queues have
experienced timeouts. The logs naturally get spaced in time by the
timeout, making it harder to tell the event frequency. Such a patch
would naturally go through net-next.
Do you still want me to remove the frequency counter from this message?
By the same argument, it is practical for me to have it as a single log
line when troubleshooting this in practice. BTW, I've already backported
this watchdog patch to our prod kernel (without the race fix) and I'll
try to reproduce the race in staging/lab on some ARM64 servers. If I
reproduce it, it will be practical to have this counter.
--Jesper
On Fri, 7 Nov 2025 14:42:58 +0100 Jesper Dangaard Brouer wrote:
> > I think this belongs in net-next.. Fail safe is not really a bug fix.
> > I'm slightly worried we're missing a corner case and will cause
> > timeouts to get printed for someone's config.
>
> This is a recovery fix. If the race condition fix isn't 100% then this
> patch will allow veth to recover. Thus, to me it makes sense to group
> these two patches together.
>
> I'm more worried that we we're missing a corner case that we cannot
> recover from. Than triggering timeouts to get printed, for a config
> where NAPI consumer veth_poll() takes more that 5 seconds to run (budget
> max 64 packets this needs to consume packets at a rate less than 12.8
> pps). It might be good to get some warnings if the system is operating
> this slow.
>
> Also remember this is not the default config that most people use.
> The code is only activated if attaching a qdisc to veth, which isn't
> default. Plus, NAPI mode need to be activated, where in normal NAPI mode
> the producer and consumer usually runs on the same CPU, which makes it
> impossible to overflow the ptr_ring. The veth backpressure is primarily
> needed when running with threaded-NAPI, where it is natural that
> producer and consumer runs on different CPUs. In our production setup
> the consumer is always slower than the producer (as the product inside
> the namespace have installed too many nftables rules).
I understand all of this, but IMO the fix is in patch 2.
This is a resiliency improvement, not a fix.
> >> +static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
> >> +{
> >> + struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
> >> +
> >> + netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
> >> + atomic_long_read(&txq->trans_timeout), txqueue);
> >
> > If you think the trans_timeout is useful, let's add it to the message
> > core prints? And then we can make this msg just veth specific, I don't
> > think we should be repeating what core already printed.
>
> The trans_timeout is a counter for how many times this TXQ have seen a
> timeout. It is practical as it directly tell us if this a frequent
> event (without having to search log files for similar events).
>
> It does make sense to add this to the core message ("NETDEV WATCHDOG")
> with the same argument. For physical NICs these logs are present in
> production. Looking at logs through Kibana (right now) and it would make
> my life easier to see the number of times the individual queues have
> experienced timeouts. The logs naturally gets spaced in time by the
> timeout, making it harder to tell the even frequency. Such a patch would
> naturally go though net-next.
Right, I see how it'd be useful. I think it's worth adding in the core.
> Do you still want me to remove the frequency counter from this message?
> By the same argument it is practical for me to have as a single log line
> when troubleshooting this in practice.
IOW it makes it easier to query logs for veth timeouts vs non-veth
timeouts? I'm tempted to suggest adding driver name to the logs in
the core :) but it's fine, I'm already asking you to add the timeout
count in the core.
I'm just trying to make sure everyone can benefit from the good ideas,
rather than hiding them in one driver.
> BTW, I've already backported this watchdog patch to prod kernel
> (without race fix) and I'll try to reproduce the race in staging/lab
> on some ARM64 servers. If I reproduce it will be practical to have
> this counter.
On 08/11/2025 02.54, Jakub Kicinski wrote:
> On Fri, 7 Nov 2025 14:42:58 +0100 Jesper Dangaard Brouer wrote:
>>> I think this belongs in net-next.. Fail safe is not really a bug fix.
>>> I'm slightly worried we're missing a corner case and will cause
>>> timeouts to get printed for someone's config.
>>
>> This is a recovery fix. If the race condition fix isn't 100% then this
>> patch will allow veth to recover. Thus, to me it makes sense to group
>> these two patches together.
>>
>> I'm more worried that we we're missing a corner case that we cannot
>> recover from. Than triggering timeouts to get printed, for a config
>> where NAPI consumer veth_poll() takes more that 5 seconds to run (budget
>> max 64 packets this needs to consume packets at a rate less than 12.8
>> pps). It might be good to get some warnings if the system is operating
>> this slow.
>>
>> Also remember this is not the default config that most people use.
>> The code is only activated if attaching a qdisc to veth, which isn't
>> default. Plus, NAPI mode need to be activated, where in normal NAPI mode
>> the producer and consumer usually runs on the same CPU, which makes it
>> impossible to overflow the ptr_ring. The veth backpressure is primarily
>> needed when running with threaded-NAPI, where it is natural that
>> producer and consumer runs on different CPUs. In our production setup
>> the consumer is always slower than the producer (as the product inside
>> the namespace have installed too many nftables rules).
>
> I understand all of this, but IMO the fix is in patch 2.
> This is a resiliency improvement, not a fix.

As maintainer you have the final say, so I've sent a [V4]. Notice that
doing it this way will cause a merge conflict once net and net-next get
merged.

[V4] https://lore.kernel.org/all/176295319819.307447.6162285688886096284.stgit@firesoul/

--Jesper