* [PATCH net V3 0/2] veth: Fix TXQ stall race condition and add recovery
@ 2025-11-05 17:28 Jesper Dangaard Brouer
2025-11-05 17:28 ` [PATCH net V3 1/2] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
2025-11-05 17:28 ` [PATCH net V3 2/2] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
0 siblings, 2 replies; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-05 17:28 UTC (permalink / raw)
To: netdev, Toke Høiland-Jørgensen
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
This patchset addresses a race condition introduced in commit dc82a33297fc
("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops"). In
production, this has been observed to cause a permanently stalled transmit
queue (TXQ) on ARM64 (Ampere Altra Max) systems, leading to a "lost wakeup"
scenario where the TXQ remains in the QUEUE_STATE_DRV_XOFF state and traffic
halts.
The root cause, fixed in patch 2, is a racy use of the
__ptr_ring_empty() API from the producer side (veth_xmit). The producer
stops the queue and then checks the ptr_ring consumer head, but that check
is not guaranteed to be correct when observed from the producer side while
the NAPI consumer on another CPU has just finished consuming.
This series fixes the bug and makes the driver more resilient, so it can recover.
The patches are ordered to first add recovery mechanisms, then fix the
underlying race.
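For reference, a condensed before/after view of the producer-side logic in
veth_xmit() (trimmed from the patch 2 diff; not the complete code):

	/* before (racy): stop the queue, then peek at the ring from the
	 * producer side; __ptr_ring_empty() is only reliable for the consumer
	 */
	netif_tx_stop_queue(txq);
	smp_mb__after_atomic();
	if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
		netif_tx_wake_queue(txq);

	/* after: unconditionally kick the NAPI consumer, which then owns
	 * re-waking the TXQ
	 */
	netif_tx_stop_queue(txq);
	__veth_xdp_flush(rq);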
V3:
- Don't keep NAPI running when the race is detected, because the end of
veth_poll() will see that the TXQ is stopped anyway and wake the queue;
this makes the producer veth_xmit() responsible for doing a "flush" that
restarts NAPI.
V2: https://lore.kernel.org/all/176159549627.5396.15971398227283515867.stgit@firesoul/
- Drop patch that changed up/down NDOs
- For the race fix, add an smp_rmb() and improve the commit message reasoning for the race cases
V1: https://lore.kernel.org/all/176123150256.2281302.7000617032469740443.stgit@firesoul/
---
Jesper Dangaard Brouer (2):
veth: enable dev_watchdog for detecting stalled TXQs
veth: more robust handling of race to avoid txq getting stuck
drivers/net/veth.c | 50 +++++++++++++++++++++++++++++-----------------
1 file changed, 32 insertions(+), 18 deletions(-)
--
* [PATCH net V3 1/2] veth: enable dev_watchdog for detecting stalled TXQs
2025-11-05 17:28 [PATCH net V3 0/2] veth: Fix TXQ stall race condition and add recovery Jesper Dangaard Brouer
@ 2025-11-05 17:28 ` Jesper Dangaard Brouer
2025-11-07 1:29 ` Jakub Kicinski
2025-11-05 17:28 ` [PATCH net V3 2/2] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
1 sibling, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-05 17:28 UTC (permalink / raw)
To: netdev, Toke Høiland-Jørgensen
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
The changes introduced in commit dc82a33297fc ("veth: apply qdisc
backpressure on full ptr_ring to reduce TX drops") have been found to cause
a race condition in production environments.
Under specific circumstances, observed exclusively on ARM64 (aarch64)
systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
permanently stalled. This happens when the race condition leads to the TXQ
entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
preventing the attached qdisc from dequeueing packets and causing the
network link to halt.
As a first step towards resolving this issue, this patch introduces a
failsafe mechanism. It enables the net device watchdog by setting a timeout
value and implements the .ndo_tx_timeout callback.
If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
which logs a warning and calls netif_tx_wake_queue() to unstall the queue
and allow traffic to resume.
The log message will look like this:
veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
This provides a necessary recovery mechanism while the underlying race
condition is investigated further. Subsequent patches will address the root
cause and add more robust state handling.
Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a3046142cb8e..7b1a9805b270 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
+ if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
+ txq_trans_cond_update(peer_txq);
netif_tx_wake_queue(peer_txq);
+ }
return done;
}
@@ -1373,6 +1375,16 @@ static int veth_set_channels(struct net_device *dev,
goto out;
}
+static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+ struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
+
+ netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
+ atomic_long_read(&txq->trans_timeout), txqueue);
+
+ netif_tx_wake_queue(txq);
+}
+
static int veth_open(struct net_device *dev)
{
struct veth_priv *priv = netdev_priv(dev);
@@ -1711,6 +1723,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_bpf = veth_xdp,
.ndo_xdp_xmit = veth_ndo_xdp_xmit,
.ndo_get_peer_dev = veth_peer_dev,
+ .ndo_tx_timeout = veth_tx_timeout,
};
static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
@@ -1749,6 +1762,7 @@ static void veth_setup(struct net_device *dev)
dev->priv_destructor = veth_dev_free;
dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
dev->max_mtu = ETH_MAX_MTU;
+ dev->watchdog_timeo = msecs_to_jiffies(5000);
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
* [PATCH net V3 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-11-05 17:28 [PATCH net V3 0/2] veth: Fix TXQ stall race condition and add recovery Jesper Dangaard Brouer
2025-11-05 17:28 ` [PATCH net V3 1/2] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
@ 2025-11-05 17:28 ` Jesper Dangaard Brouer
2025-11-06 14:14 ` Toshiaki Makita
1 sibling, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-05 17:28 UTC (permalink / raw)
To: netdev, Toke Høiland-Jørgensen
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") introduced a race condition that can lead to a permanently
stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
Max).
The race occurs in veth_xmit(). The producer observes a full ptr_ring and
stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
intended to re-wake the queue if the consumer had just emptied it (if
(__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
"lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
traffic halts.
This failure is caused by an incorrect use of the __ptr_ring_empty() API
from the producer side. As noted in kernel comments, this check is not
guaranteed to be correct if a consumer is operating on another CPU. The
empty test is based on ptr_ring->consumer_head, making it reliable only for
the consumer. Using this check from the producer side is fundamentally racy.
This patch fixes the race by adopting the more robust logic from an earlier
version V4 of the patchset, which always flushed the peer:
(1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
are removed. Instead, after stopping the queue, we unconditionally call
__veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
making it solely responsible for re-waking the TXQ.
This handles the race where veth_poll() consumes all packets and completes
NAPI *before* veth_xmit() on the producer side has called netif_tx_stop_queue.
The __veth_xdp_flush(rq) will observe rx_notify_masked is false and schedule
NAPI.
(2) On the consumer side, the logic for waking the peer TXQ is moved out of
veth_xdp_rcv() and placed at the end of the veth_poll() function. This
placement is part of fixing the race, as the netif_tx_queue_stopped() check
must occur after rx_notify_masked is potentially set to false during NAPI
completion.
This handles the race where veth_poll() has consumed all packets but has not
yet completed NAPI (rx_notify_masked is still true). The producer veth_xmit()
stops the TXQ, and __veth_xdp_flush(rq) observes that rx_notify_masked is
true, so it does not schedule NAPI. veth_poll() then sets rx_notify_masked to
false and completes NAPI. Before exiting, veth_poll() observes that the TXQ
is stopped and wakes it up.
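As an illustration (not part of the patch), one possible interleaving for
case (2) with the new code:

  CPU A (veth_xmit):  ptr_ring full, produce fails
  CPU B (veth_poll):  veth_xdp_rcv() drains the ring; NAPI not yet
                      complete, rx_notify_masked still true
  CPU A (veth_xmit):  netif_tx_stop_queue(txq)
  CPU A (veth_xmit):  __veth_xdp_flush(rq) sees rx_notify_masked == true,
                      so it does not schedule NAPI
  CPU B (veth_poll):  sets rx_notify_masked to false and completes NAPI
  CPU B (veth_poll):  smp_rmb(); sees the TXQ stopped ->
                      txq_trans_cond_update() + netif_tx_wake_queue()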
Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 40 ++++++++++++++++++++--------------------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7b1a9805b270..127dab275896 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
/* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
__skb_push(skb, ETH_HLEN);
- /* Depend on prior success packets started NAPI consumer via
- * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
- * paired with empty check in veth_poll().
- */
netif_tx_stop_queue(txq);
- smp_mb__after_atomic();
- if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
- netif_tx_wake_queue(txq);
+ /* Makes sure NAPI peer consumer runs. Consumer is responsible
+ * for starting txq again, until then ndo_start_xmit (this
+ * function) will not be invoked by the netstack again.
+ */
+ __veth_xdp_flush(rq);
break;
case NET_RX_DROP: /* same as NET_XMIT_DROP */
drop:
@@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
struct veth_stats *stats)
{
- struct veth_priv *priv = netdev_priv(rq->dev);
- int queue_idx = rq->xdp_rxq.queue_index;
- struct netdev_queue *peer_txq;
- struct net_device *peer_dev;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
- /* NAPI functions as RCU section */
- peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
- peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
-
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
- txq_trans_cond_update(peer_txq);
- netif_tx_wake_queue(peer_txq);
- }
-
return done;
}
@@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
{
struct veth_rq *rq =
container_of(napi, struct veth_rq, xdp_napi);
+ struct veth_priv *priv = netdev_priv(rq->dev);
+ int queue_idx = rq->xdp_rxq.queue_index;
+ struct netdev_queue *peer_txq;
struct veth_stats stats = {};
+ struct net_device *peer_dev;
struct veth_xdp_tx_bq bq;
int done;
bq.count = 0;
+ /* NAPI functions as RCU section */
+ peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
+ peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
+
xdp_set_return_frame_no_direct();
done = veth_xdp_rcv(rq, budget, &bq, &stats);
@@ -998,6 +991,13 @@ static int veth_poll(struct napi_struct *napi, int budget)
veth_xdp_flush(rq, &bq);
xdp_clear_return_frame_no_direct();
+ /* Release backpressure per NAPI poll */
+ smp_rmb(); /* Paired with netif_tx_stop_queue set_bit */
+ if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
+ txq_trans_cond_update(peer_txq);
+ netif_tx_wake_queue(peer_txq);
+ }
+
return done;
}
* Re: [PATCH net V3 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-11-05 17:28 ` [PATCH net V3 2/2] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
@ 2025-11-06 14:14 ` Toshiaki Makita
0 siblings, 0 replies; 7+ messages in thread
From: Toshiaki Makita @ 2025-11-06 14:14 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, Michael S. Tsirkin, makita.toshiaki, bpf,
linux-kernel, linux-arm-kernel, kernel-team,
Toke Høiland-Jørgensen, netdev
On 2025/11/06 2:28, Jesper Dangaard Brouer wrote:
> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") introduced a race condition that can lead to a permanently
> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
> Max).
>
> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
> intended to re-wake the queue if the consumer had just emptied it (if
> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
> traffic halts.
>
> This failure is caused by an incorrect use of the __ptr_ring_empty() API
> from the producer side. As noted in kernel comments, this check is not
> guaranteed to be correct if a consumer is operating on another CPU. The
> empty test is based on ptr_ring->consumer_head, making it reliable only for
> the consumer. Using this check from the producer side is fundamentally racy.
>
> This patch fixes the race by adopting the more robust logic from an earlier
> version V4 of the patchset, which always flushed the peer:
>
> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
> are removed. Instead, after stopping the queue, we unconditionally call
> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
> making it solely responsible for re-waking the TXQ.
> This handles the race where veth_poll() consumes all packets and completes
> NAPI *before* veth_xmit() on the producer side has called netif_tx_stop_queue.
> The __veth_xdp_flush(rq) will observe rx_notify_masked is false and schedule
> NAPI.
>
> (2) On the consumer side, the logic for waking the peer TXQ is moved out of
> veth_xdp_rcv() and placed at the end of the veth_poll() function. This
> placement is part of fixing the race, as the netif_tx_queue_stopped() check
> must occur after rx_notify_masked is potentially set to false during NAPI
> completion.
> This handles the race where veth_poll() has consumed all packets but has not
> yet completed NAPI (rx_notify_masked is still true). The producer veth_xmit()
> stops the TXQ, and __veth_xdp_flush(rq) observes that rx_notify_masked is
> true, so it does not schedule NAPI. veth_poll() then sets rx_notify_masked to
> false and completes NAPI. Before exiting, veth_poll() observes that the TXQ
> is stopped and wakes it up.
>
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
* Re: [PATCH net V3 1/2] veth: enable dev_watchdog for detecting stalled TXQs
2025-11-05 17:28 ` [PATCH net V3 1/2] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
@ 2025-11-07 1:29 ` Jakub Kicinski
2025-11-07 13:42 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 7+ messages in thread
From: Jakub Kicinski @ 2025-11-07 1:29 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, Toke Høiland-Jørgensen, Eric Dumazet,
David S. Miller, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
On Wed, 05 Nov 2025 18:28:12 +0100 Jesper Dangaard Brouer wrote:
> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
> backpressure on full ptr_ring to reduce TX drops") have been found to cause
> a race condition in production environments.
>
> Under specific circumstances, observed exclusively on ARM64 (aarch64)
> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
> permanently stalled. This happens when the race condition leads to the TXQ
> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
> preventing the attached qdisc from dequeueing packets and causing the
> network link to halt.
>
> As a first step towards resolving this issue, this patch introduces a
> failsafe mechanism. It enables the net device watchdog by setting a timeout
> value and implements the .ndo_tx_timeout callback.
>
> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
> and allow traffic to resume.
>
> The log message will look like this:
>
> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>
> This provides a necessary recovery mechanism while the underlying race
> condition is investigated further. Subsequent patches will address the root
> cause and add more robust state handling.
>
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
I think this belongs in net-next.. Fail safe is not really a bug fix.
I'm slightly worried we're missing a corner case and will cause
timeouts to get printed for someone's config.
> +static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
> +{
> + struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
> +
> + netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
> + atomic_long_read(&txq->trans_timeout), txqueue);
If you think the trans_timeout is useful, let's add it to the message
core prints? And then we can make this msg just veth specific, I don't
think we should be repeating what core already printed.
> + netif_tx_wake_queue(txq);
> +}
> +
> static int veth_open(struct net_device *dev)
> {
> struct veth_priv *priv = netdev_priv(dev);
> @@ -1711,6 +1723,7 @@ static const struct net_device_ops veth_netdev_ops = {
> .ndo_bpf = veth_xdp,
> .ndo_xdp_xmit = veth_ndo_xdp_xmit,
> .ndo_get_peer_dev = veth_peer_dev,
> + .ndo_tx_timeout = veth_tx_timeout,
> };
>
> static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
> @@ -1749,6 +1762,7 @@ static void veth_setup(struct net_device *dev)
> dev->priv_destructor = veth_dev_free;
> dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
> dev->max_mtu = ETH_MAX_MTU;
> + dev->watchdog_timeo = msecs_to_jiffies(5000);
>
> dev->hw_features = VETH_FEATURES;
> dev->hw_enc_features = VETH_FEATURES;
>
>
* Re: [PATCH net V3 1/2] veth: enable dev_watchdog for detecting stalled TXQs
2025-11-07 1:29 ` Jakub Kicinski
@ 2025-11-07 13:42 ` Jesper Dangaard Brouer
2025-11-08 1:54 ` Jakub Kicinski
0 siblings, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-07 13:42 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, Toke Høiland-Jørgensen, Eric Dumazet,
David S. Miller, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
On 07/11/2025 02.29, Jakub Kicinski wrote:
> On Wed, 05 Nov 2025 18:28:12 +0100 Jesper Dangaard Brouer wrote:
>> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
>> backpressure on full ptr_ring to reduce TX drops") have been found to cause
>> a race condition in production environments.
>>
>> Under specific circumstances, observed exclusively on ARM64 (aarch64)
>> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
>> permanently stalled. This happens when the race condition leads to the TXQ
>> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
>> preventing the attached qdisc from dequeueing packets and causing the
>> network link to halt.
>>
>> As a first step towards resolving this issue, this patch introduces a
>> failsafe mechanism. It enables the net device watchdog by setting a timeout
>> value and implements the .ndo_tx_timeout callback.
>>
>> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
>> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
>> and allow traffic to resume.
>>
>> The log message will look like this:
>>
>> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
>> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>>
>> This provides a necessary recovery mechanism while the underlying race
>> condition is investigated further. Subsequent patches will address the root
>> cause and add more robust state handling.
>>
>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>
> I think this belongs in net-next.. Fail safe is not really a bug fix.
> I'm slightly worried we're missing a corner case and will cause
> timeouts to get printed for someone's config.
>
This is a recovery fix. If the race condition fix isn't 100% then this
patch will allow veth to recover. Thus, to me it makes sense to group
these two patches together.
I'm more worried that we're missing a corner case that we cannot recover
from than about timeouts getting printed for a config where the NAPI
consumer veth_poll() takes more than 5 seconds to run (with a budget of max
64 packets, that means consuming packets at a rate below 12.8 pps). It
might be good to get some warnings if a system is operating this slowly.
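For reference, that bound is just the NAPI budget divided by the 5 second
watchdog timeout from patch 1: 64 packets / 5 sec = 12.8 packets/sec.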
Also remember this is not the default config that most people use.
The code is only activated when attaching a qdisc to veth, which isn't the
default. Plus, NAPI mode needs to be activated, and in normal NAPI mode
the producer and consumer usually run on the same CPU, which makes it
impossible to overflow the ptr_ring. The veth backpressure is primarily
needed when running with threaded NAPI, where it is natural that the
producer and consumer run on different CPUs. In our production setup
the consumer is always slower than the producer (as the product inside
the namespace has installed too many nftables rules).
>> +static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
>> +{
>> + struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
>> +
>> + netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
>> + atomic_long_read(&txq->trans_timeout), txqueue);
>
> If you think the trans_timeout is useful, let's add it to the message
> core prints? And then we can make this msg just veth specific, I don't
> think we should be repeating what core already printed.
The trans_timeout is a counter for how many times this TXQ has seen a
timeout. It is practical as it directly tells us whether this is a frequent
event (without having to search log files for similar events).
It does make sense to add this to the core message ("NETDEV WATCHDOG")
for the same reason. For physical NICs these logs are present in
production. Looking at logs through Kibana (right now), it would make
my life easier to see the number of times the individual queues have
experienced timeouts. The logs naturally get spaced in time by the
timeout, making it harder to tell the event frequency. Such a patch would
naturally go through net-next.
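Something like this (purely hypothetical, just extending the core log line
quoted in patch 1 with the per-queue counter):

  veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms (timeouts: 3)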
Do you still want me to remove the frequency counter from this message?
By the same argument, it is practical for me to have this as a single log
line when troubleshooting in practice. BTW, I've already backported
this watchdog patch to the prod kernel (without the race fix), and I'll try
to reproduce the race in staging/lab on some ARM64 servers. If I reproduce
it, it will be practical to have this counter.
--Jesper
* Re: [PATCH net V3 1/2] veth: enable dev_watchdog for detecting stalled TXQs
2025-11-07 13:42 ` Jesper Dangaard Brouer
@ 2025-11-08 1:54 ` Jakub Kicinski
0 siblings, 0 replies; 7+ messages in thread
From: Jakub Kicinski @ 2025-11-08 1:54 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, Toke Høiland-Jørgensen, Eric Dumazet,
David S. Miller, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
On Fri, 7 Nov 2025 14:42:58 +0100 Jesper Dangaard Brouer wrote:
> > I think this belongs in net-next.. Fail safe is not really a bug fix.
> > I'm slightly worried we're missing a corner case and will cause
> > timeouts to get printed for someone's config.
>
> This is a recovery fix. If the race condition fix isn't 100% then this
> patch will allow veth to recover. Thus, to me it makes sense to group
> these two patches together.
>
> I'm more worried that we're missing a corner case that we cannot recover
> from than about timeouts getting printed for a config where the NAPI
> consumer veth_poll() takes more than 5 seconds to run (with a budget of max
> 64 packets, that means consuming packets at a rate below 12.8 pps). It
> might be good to get some warnings if a system is operating this slowly.
>
> Also remember this is not the default config that most people use.
> The code is only activated when attaching a qdisc to veth, which isn't the
> default. Plus, NAPI mode needs to be activated, and in normal NAPI mode
> the producer and consumer usually run on the same CPU, which makes it
> impossible to overflow the ptr_ring. The veth backpressure is primarily
> needed when running with threaded NAPI, where it is natural that the
> producer and consumer run on different CPUs. In our production setup
> the consumer is always slower than the producer (as the product inside
> the namespace has installed too many nftables rules).
I understand all of this, but IMO the fix is in patch 2.
This is a resiliency improvement, not a fix.
> >> +static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
> >> +{
> >> + struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
> >> +
> >> + netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
> >> + atomic_long_read(&txq->trans_timeout), txqueue);
> >
> > If you think the trans_timeout is useful, let's add it to the message
> > core prints? And then we can make this msg just veth specific, I don't
> > think we should be repeating what core already printed.
>
> The trans_timeout is a counter for how many times this TXQ have seen a
> timeout. It is practical as it directly tell us if this a frequent
> event (without having to search log files for similar events).
>
> It does make sense to add this to the core message ("NETDEV WATCHDOG")
> with the same argument. For physical NICs these logs are present in
> production. Looking at logs through Kibana (right now) and it would make
> my life easier to see the number of times the individual queues have
> experienced timeouts. The logs naturally gets spaced in time by the
> timeout, making it harder to tell the even frequency. Such a patch would
> naturally go though net-next.
Right, I see how it'd be useful. I think it's worth adding in the core.
> Do you still want me to remove the frequency counter from this message?
> By the same argument it is practical for me to have as a single log line
> when troubleshooting this in practice.
IOW it makes it easier to query logs for veth timeouts vs non-veth
timeouts? I'm tempted to suggest adding driver name to the logs in
the core :) but it's fine, I'm already asking you to add the timeout
count in the core.
I'm just trying to make sure everyone can benefit from the good ideas,
rather than hiding them in one driver.
> BTW, I've already backported this watchdog patch to prod kernel
> (without race fix) and I'll try to reproduce the race in staging/lab
> on some ARM64 servers. If I reproduce it will be practical to have
> this counter.
Thread overview: 7+ messages
2025-11-05 17:28 [PATCH net V3 0/2] veth: Fix TXQ stall race condition and add recovery Jesper Dangaard Brouer
2025-11-05 17:28 ` [PATCH net V3 1/2] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
2025-11-07 1:29 ` Jakub Kicinski
2025-11-07 13:42 ` Jesper Dangaard Brouer
2025-11-08 1:54 ` Jakub Kicinski
2025-11-05 17:28 ` [PATCH net V3 2/2] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
2025-11-06 14:14 ` Toshiaki Makita