* [PATCH net V1 0/3] veth: Fix TXQ stall race condition and add recovery
@ 2025-10-23 14:59 Jesper Dangaard Brouer
2025-10-23 14:59 ` [PATCH net V1 1/3] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-23 14:59 UTC (permalink / raw)
To: netdev, makita.toshiaki
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, toshiaki.makita1, bpf,
linux-kernel, kernel-team
This patchset addresses a race condition introduced in commit dc82a33297fc
("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops"). In
production, this has been observed to cause a permanently stalled transmit
queue (TXQ) on ARM64 (Ampere Altra Max) systems, leading to a "lost wakeup"
scenario where the TXQ remains in the QUEUE_STATE_DRV_XOFF state and traffic
halts.
The root cause, which is fixed in patch 3, is a racy use of the
__ptr_ring_empty() API from the producer side (veth_xmit). The producer
stops the queue and then checks the ptr_ring consumer's head, but this
check is not guaranteed to be correct when observed from the producer
side while the NAPI consumer on another CPU has just finished consuming.
This series fixes the bug and makes the driver more resilient, able to
recover. The patches are ordered to first add recovery mechanisms, then
fix the underlying race.
---
Jesper Dangaard Brouer (3):
veth: enable dev_watchdog for detecting stalled TXQs
veth: stop and start all TX queue in netdev down/up
veth: more robust handing of race to avoid txq getting stuck
drivers/net/veth.c | 59 +++++++++++++++++++++++++++++++---------------
1 file changed, 40 insertions(+), 19 deletions(-)
--
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH net V1 1/3] veth: enable dev_watchdog for detecting stalled TXQs
2025-10-23 14:59 [PATCH net V1 0/3] veth: Fix TXQ stall race condition and add recovery Jesper Dangaard Brouer
@ 2025-10-23 14:59 ` Jesper Dangaard Brouer
2025-10-24 13:39 ` Toke Høiland-Jørgensen
2025-10-23 14:59 ` [PATCH net V1 2/3] veth: stop and start all TX queue in netdev down/up Jesper Dangaard Brouer
2025-10-23 14:59 ` [PATCH net V1 3/3] veth: more robust handing of race to avoid txq getting stuck Jesper Dangaard Brouer
2 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-23 14:59 UTC (permalink / raw)
To: netdev, makita.toshiaki
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, toshiaki.makita1, bpf,
linux-kernel, kernel-team
The changes introduced in commit dc82a33297fc ("veth: apply qdisc
backpressure on full ptr_ring to reduce TX drops") have been found to cause
a race condition in production environments.
Under specific circumstances, observed exclusively on ARM64 (aarch64)
systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
permanently stalled. This happens when the race condition leads to the TXQ
entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
preventing the attached qdisc from dequeueing packets and causing the
network link to halt.
As a first step towards resolving this issue, this patch introduces a
failsafe mechanism. It enables the net device watchdog by setting a timeout
value and implements the .ndo_tx_timeout callback.
If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
which logs a warning and calls netif_tx_wake_queue() to unstall the queue
and allow traffic to resume.
The log message will look like this:
veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
This provides a necessary recovery mechanism while the underlying race
condition is investigated further. Subsequent patches will address the root
cause and add more robust state handling in ndo_open/ndo_stop.
Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a3046142cb8e..7b1a9805b270 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
+ if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
+ txq_trans_cond_update(peer_txq);
netif_tx_wake_queue(peer_txq);
+ }
return done;
}
@@ -1373,6 +1375,16 @@ static int veth_set_channels(struct net_device *dev,
goto out;
}
+static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+ struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
+
+ netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
+ atomic_long_read(&txq->trans_timeout), txqueue);
+
+ netif_tx_wake_queue(txq);
+}
+
static int veth_open(struct net_device *dev)
{
struct veth_priv *priv = netdev_priv(dev);
@@ -1711,6 +1723,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_bpf = veth_xdp,
.ndo_xdp_xmit = veth_ndo_xdp_xmit,
.ndo_get_peer_dev = veth_peer_dev,
+ .ndo_tx_timeout = veth_tx_timeout,
};
static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
@@ -1749,6 +1762,7 @@ static void veth_setup(struct net_device *dev)
dev->priv_destructor = veth_dev_free;
dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
dev->max_mtu = ETH_MAX_MTU;
+ dev->watchdog_timeo = msecs_to_jiffies(5000);
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
* [PATCH net V1 2/3] veth: stop and start all TX queue in netdev down/up
2025-10-23 14:59 [PATCH net V1 0/3] veth: Fix TXQ stall race condition and add recovery Jesper Dangaard Brouer
2025-10-23 14:59 ` [PATCH net V1 1/3] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
@ 2025-10-23 14:59 ` Jesper Dangaard Brouer
2025-10-25 0:54 ` Jakub Kicinski
2025-10-23 14:59 ` [PATCH net V1 3/3] veth: more robust handing of race to avoid txq getting stuck Jesper Dangaard Brouer
2 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-23 14:59 UTC (permalink / raw)
To: netdev, makita.toshiaki
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, toshiaki.makita1, bpf,
linux-kernel, kernel-team
The veth driver started manipulating TXQ states in commit
dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring
to reduce TX drops").
Other drivers that manipulate TXQ states take care of stopping
and starting TXQs in their NDOs. Thus, add the same to veth .ndo_open
and .ndo_stop.
Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7b1a9805b270..3976ddda5fb8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1404,6 +1404,9 @@ static int veth_open(struct net_device *dev)
return err;
}
+ netif_tx_start_all_queues(dev);
+ netif_tx_start_all_queues(peer);
+
if (peer->flags & IFF_UP) {
netif_carrier_on(dev);
netif_carrier_on(peer);
@@ -1423,6 +1426,10 @@ static int veth_close(struct net_device *dev)
if (peer)
netif_carrier_off(peer);
+ netif_tx_stop_all_queues(dev);
+ if (peer)
+ netif_tx_stop_all_queues(peer);
+
if (priv->_xdp_prog)
veth_disable_xdp(dev);
else if (veth_gro_requested(dev))
* [PATCH net V1 3/3] veth: more robust handing of race to avoid txq getting stuck
2025-10-23 14:59 [PATCH net V1 0/3] veth: Fix TXQ stall race condition and add recovery Jesper Dangaard Brouer
2025-10-23 14:59 ` [PATCH net V1 1/3] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
2025-10-23 14:59 ` [PATCH net V1 2/3] veth: stop and start all TX queue in netdev down/up Jesper Dangaard Brouer
@ 2025-10-23 14:59 ` Jesper Dangaard Brouer
2025-10-24 14:33 ` Toke Høiland-Jørgensen
2025-10-27 19:22 ` Jesper Dangaard Brouer
2 siblings, 2 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-23 14:59 UTC (permalink / raw)
To: netdev, makita.toshiaki
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, toshiaki.makita1, bpf,
linux-kernel, kernel-team
Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") introduced a race condition that can lead to a permanently
stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
Max).
The race occurs in veth_xmit(). The producer observes a full ptr_ring and
stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
intended to re-wake the queue if the consumer had just emptied it (if
(__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
"lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
traffic halts.
This failure is caused by an incorrect use of the __ptr_ring_empty() API
from the producer side. As noted in kernel comments, this check is not
guaranteed to be correct if a consumer is operating on another CPU. The
empty test is based on ptr_ring->consumer_head, making it reliable only for
the consumer. Using this check from the producer side is fundamentally racy.
This patch fixes the race by adopting the more robust logic from the
earlier V4 version of the patchset, which always flushed the peer:
(1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
are removed. Instead, after stopping the queue, we unconditionally call
__veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
making it solely responsible for re-waking the TXQ.
(2) On the consumer side, the logic for waking the peer TXQ is centralized.
It is moved out of veth_xdp_rcv() (which processes a batch) and placed at
the end of the veth_poll() function. This ensures netif_tx_wake_queue() is
called once per complete NAPI poll cycle.
(3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
about to complete (napi_complete_done), it now also checks if the peer TXQ
is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
reschedule itself. This prevents a new race where the producer stops the
queue just as the consumer is finishing its poll, ensuring the wakeup is not
missed.
Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 42 +++++++++++++++++++++---------------------
1 file changed, 21 insertions(+), 21 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 3976ddda5fb8..1d70377481eb 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
/* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
__skb_push(skb, ETH_HLEN);
- /* Depend on prior success packets started NAPI consumer via
- * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
- * paired with empty check in veth_poll().
- */
netif_tx_stop_queue(txq);
- smp_mb__after_atomic();
- if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
- netif_tx_wake_queue(txq);
+ /* Handle race: Makes sure NAPI peer consumer runs. Consumer is
+ * responsible for starting txq again, until then ndo_start_xmit
+ * (this function) will not be invoked by the netstack again.
+ */
+ __veth_xdp_flush(rq);
break;
case NET_RX_DROP: /* same as NET_XMIT_DROP */
drop:
@@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
struct veth_stats *stats)
{
- struct veth_priv *priv = netdev_priv(rq->dev);
- int queue_idx = rq->xdp_rxq.queue_index;
- struct netdev_queue *peer_txq;
- struct net_device *peer_dev;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
- /* NAPI functions as RCU section */
- peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
- peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
-
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
- txq_trans_cond_update(peer_txq);
- netif_tx_wake_queue(peer_txq);
- }
-
return done;
}
@@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
{
struct veth_rq *rq =
container_of(napi, struct veth_rq, xdp_napi);
+ struct veth_priv *priv = netdev_priv(rq->dev);
+ int queue_idx = rq->xdp_rxq.queue_index;
+ struct netdev_queue *peer_txq;
struct veth_stats stats = {};
+ struct net_device *peer_dev;
struct veth_xdp_tx_bq bq;
int done;
bq.count = 0;
+ /* NAPI functions as RCU section */
+ peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
+ peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
+
xdp_set_return_frame_no_direct();
done = veth_xdp_rcv(rq, budget, &bq, &stats);
@@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
if (done < budget && napi_complete_done(napi, done)) {
/* Write rx_notify_masked before reading ptr_ring */
smp_store_mb(rq->rx_notify_masked, false);
- if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
+ if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
+ (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
if (napi_schedule_prep(&rq->xdp_napi)) {
WRITE_ONCE(rq->rx_notify_masked, true);
__napi_schedule(&rq->xdp_napi);
@@ -998,6 +992,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
veth_xdp_flush(rq, &bq);
xdp_clear_return_frame_no_direct();
+ /* Release backpressure per NAPI poll */
+ if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
+ txq_trans_cond_update(peer_txq);
+ netif_tx_wake_queue(peer_txq);
+ }
+
return done;
}
* Re: [PATCH net V1 1/3] veth: enable dev_watchdog for detecting stalled TXQs
2025-10-23 14:59 ` [PATCH net V1 1/3] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
@ 2025-10-24 13:39 ` Toke Høiland-Jørgensen
2025-10-27 11:41 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-24 13:39 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, makita.toshiaki
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, toshiaki.makita1, bpf,
linux-kernel, kernel-team
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
> backpressure on full ptr_ring to reduce TX drops") have been found to cause
> a race condition in production environments.
>
> Under specific circumstances, observed exclusively on ARM64 (aarch64)
> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
> permanently stalled. This happens when the race condition leads to the TXQ
> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
> preventing the attached qdisc from dequeueing packets and causing the
> network link to halt.
>
> As a first step towards resolving this issue, this patch introduces a
> failsafe mechanism. It enables the net device watchdog by setting a timeout
> value and implements the .ndo_tx_timeout callback.
>
> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
> and allow traffic to resume.
>
> The log message will look like this:
>
> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>
> This provides a necessary recovery mechanism while the underlying race
> condition is investigated further. Subsequent patches will address the root
> cause and add more robust state handling in ndo_open/ndo_stop.
>
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index a3046142cb8e..7b1a9805b270 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> rq->stats.vs.xdp_packets += done;
> u64_stats_update_end(&rq->stats.syncp);
>
> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
> + txq_trans_cond_update(peer_txq);
> netif_tx_wake_queue(peer_txq);
> + }
Hmm, seems a bit weird that this call to txq_trans_cond_update() is only
in veth_xdp_rcv(). Shouldn't there (also?) be one in veth_xmit()?
-Toke
* Re: [PATCH net V1 3/3] veth: more robust handing of race to avoid txq getting stuck
2025-10-23 14:59 ` [PATCH net V1 3/3] veth: more robust handing of race to avoid txq getting stuck Jesper Dangaard Brouer
@ 2025-10-24 14:33 ` Toke Høiland-Jørgensen
2025-10-27 12:19 ` Jesper Dangaard Brouer
2025-10-27 19:22 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-24 14:33 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, makita.toshiaki
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, toshiaki.makita1, bpf,
linux-kernel, kernel-team
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") introduced a race condition that can lead to a permanently
> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
> Max).
>
> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
> intended to re-wake the queue if the consumer had just emptied it (if
> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
> traffic halts.
>
> This failure is caused by an incorrect use of the __ptr_ring_empty() API
> from the producer side. As noted in kernel comments, this check is not
> guaranteed to be correct if a consumer is operating on another CPU. The
> empty test is based on ptr_ring->consumer_head, making it reliable only for
> the consumer. Using this check from the producer side is fundamentally racy.
>
> This patch fixes the race by adopting the more robust logic from an earlier
> version V4 of the patchset, which always flushed the peer:
>
> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
> are removed. Instead, after stopping the queue, we unconditionally call
> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
> making it solely responsible for re-waking the TXQ.
This makes sense.
> (2) On the consumer side, the logic for waking the peer TXQ is centralized.
> It is moved out of veth_xdp_rcv() (which processes a batch) and placed at
> the end of the veth_poll() function. This ensures netif_tx_wake_queue() is
> called once per complete NAPI poll cycle.
So is this second point strictly necessary to fix the race, or is it
more of an optimisation?
> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
> about to complete (napi_complete_done), it now also checks if the peer TXQ
> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
> reschedule itself. This prevents a new race where the producer stops the
> queue just as the consumer is finishing its poll, ensuring the wakeup is not
> missed.
Also makes sense...
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 42 +++++++++++++++++++++---------------------
> 1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 3976ddda5fb8..1d70377481eb 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
> /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
> __skb_push(skb, ETH_HLEN);
> - /* Depend on prior success packets started NAPI consumer via
> - * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
> - * paired with empty check in veth_poll().
> - */
> netif_tx_stop_queue(txq);
> - smp_mb__after_atomic();
> - if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
> - netif_tx_wake_queue(txq);
> + /* Handle race: Makes sure NAPI peer consumer runs. Consumer is
> + * responsible for starting txq again, until then ndo_start_xmit
> + * (this function) will not be invoked by the netstack again.
> + */
> + __veth_xdp_flush(rq);
Nit: I'd lose the "Handle race:" prefix from the comment; the rest of
the comment is clear enough without it, and since there's no explanation
of *which* race is being handled, it is just confusing, IMO.
> break;
> case NET_RX_DROP: /* same as NET_XMIT_DROP */
> drop:
> @@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> struct veth_xdp_tx_bq *bq,
> struct veth_stats *stats)
> {
> - struct veth_priv *priv = netdev_priv(rq->dev);
> - int queue_idx = rq->xdp_rxq.queue_index;
> - struct netdev_queue *peer_txq;
> - struct net_device *peer_dev;
> int i, done = 0, n_xdpf = 0;
> void *xdpf[VETH_XDP_BATCH];
>
> - /* NAPI functions as RCU section */
> - peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
> - peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
> -
> for (i = 0; i < budget; i++) {
> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>
> @@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> rq->stats.vs.xdp_packets += done;
> u64_stats_update_end(&rq->stats.syncp);
>
> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
> - txq_trans_cond_update(peer_txq);
> - netif_tx_wake_queue(peer_txq);
> - }
> -
> return done;
> }
>
> @@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
> {
> struct veth_rq *rq =
> container_of(napi, struct veth_rq, xdp_napi);
> + struct veth_priv *priv = netdev_priv(rq->dev);
> + int queue_idx = rq->xdp_rxq.queue_index;
> + struct netdev_queue *peer_txq;
> struct veth_stats stats = {};
> + struct net_device *peer_dev;
> struct veth_xdp_tx_bq bq;
> int done;
>
> bq.count = 0;
>
> + /* NAPI functions as RCU section */
> + peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
> + peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
> +
> xdp_set_return_frame_no_direct();
> done = veth_xdp_rcv(rq, budget, &bq, &stats);
>
> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
> if (done < budget && napi_complete_done(napi, done)) {
> /* Write rx_notify_masked before reading ptr_ring */
> smp_store_mb(rq->rx_notify_masked, false);
> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
> if (napi_schedule_prep(&rq->xdp_napi)) {
> WRITE_ONCE(rq->rx_notify_masked, true);
> __napi_schedule(&rq->xdp_napi);
> @@ -998,6 +992,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
> veth_xdp_flush(rq, &bq);
> xdp_clear_return_frame_no_direct();
>
> + /* Release backpressure per NAPI poll */
> + if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
> + txq_trans_cond_update(peer_txq);
> + netif_tx_wake_queue(peer_txq);
> + }
> +
> return done;
> }
>
* Re: [PATCH net V1 2/3] veth: stop and start all TX queue in netdev down/up
2025-10-23 14:59 ` [PATCH net V1 2/3] veth: stop and start all TX queue in netdev down/up Jesper Dangaard Brouer
@ 2025-10-25 0:54 ` Jakub Kicinski
2025-10-27 10:33 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2025-10-25 0:54 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, makita.toshiaki, Eric Dumazet, David S. Miller,
Paolo Abeni, ihor.solodrai, toshiaki.makita1, bpf, linux-kernel,
kernel-team
On Thu, 23 Oct 2025 16:59:37 +0200 Jesper Dangaard Brouer wrote:
> The veth driver started manipulating TXQ states in commit
> dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring
> to reduce TX drops").
>
> Other drivers manipulating TXQ states takes care of stopping
> and starting TXQs in NDOs. Thus, adding this to veth .ndo_open
> and .ndo_stop.
Kinda, but taking a device up or down resets the qdisc, IIRC.
So stopping the qdisc for real drivers is mostly a way to make sure
that there's nothing entering the xmit handler as the driver dismantles
its state.
I'm not sure if this is an official rule, but I'm under the impression
that stopping the queues on carrier loss (and
netif_tx_stop_all_queues(peer) in close() is stopping the peer's Tx
queues on carrier loss) is inadvisable, as it may lead to old packets
getting transmitted when the carrier comes back.
IOW based on the commit msg - I'm not sure this patch is needed..
* Re: [PATCH net V1 2/3] veth: stop and start all TX queue in netdev down/up
2025-10-25 0:54 ` Jakub Kicinski
@ 2025-10-27 10:33 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-27 10:33 UTC (permalink / raw)
To: Jakub Kicinski, Chris Arges
Cc: netdev, makita.toshiaki, Eric Dumazet, David S. Miller,
Paolo Abeni, ihor.solodrai, toshiaki.makita1, bpf, linux-kernel,
kernel-team
On 25/10/2025 02.54, Jakub Kicinski wrote:
> On Thu, 23 Oct 2025 16:59:37 +0200 Jesper Dangaard Brouer wrote:
>> The veth driver started manipulating TXQ states in commit
>> dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring
>> to reduce TX drops").
>>
>> Other drivers manipulating TXQ states takes care of stopping
>> and starting TXQs in NDOs. Thus, adding this to veth .ndo_open
>> and .ndo_stop.
>
> Kinda, but taking a device up or down resets the qdisc, IIRC.
>
> So stopping the qdisc for real drivers is mostly a way to make sure
> that there's nothing entering the xmit handler as the driver dismantles
> its state.
>
> I'm not sure if this is an official rule, but I'm under the impression
> that stopping the queues or carrier loss (and
> netif_tx_stop_all_queues(peer) in close() is stopping peer's Tx queue
> on carrier loss) is inadvisable as it may lead to old packets getting
> transmitted when carrier comes back.
>
> IOW based on the commit msg - I'm not sure this patch is needed..
During the incident, doing ip link set 'down' flushed all packets in
the qdisc, but the TXQs were not reset (started again) on link 'up'.
Thus, the qdisc would fill up again and block all packets on the interface.
Chris also tried to replace the qdisc, but the TXQ remained stopped
in the QUEUE_STATE_DRV_XOFF state.
This was the origin of the patch: we could not recover the machine
from this state. Thus, the idea was that starting all queues on link 'up'
would give us a recovery mechanism. With the dev_watchdog, this change
isn't really needed.
As you mention, this may lead to old packets getting transmitted when
the carrier comes back, which would be a behavior change that we don't
want in a fixes patch. So, I will drop this patch.
--Jesper
* Re: [PATCH net V1 1/3] veth: enable dev_watchdog for detecting stalled TXQs
2025-10-24 13:39 ` Toke Høiland-Jørgensen
@ 2025-10-27 11:41 ` Jesper Dangaard Brouer
2025-10-27 14:09 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-27 11:41 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, netdev, makita.toshiaki
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, toshiaki.makita1, bpf, linux-kernel, kernel-team
On 24/10/2025 15.39, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>
>> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
>> backpressure on full ptr_ring to reduce TX drops") have been found to cause
>> a race condition in production environments.
>>
>> Under specific circumstances, observed exclusively on ARM64 (aarch64)
>> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
>> permanently stalled. This happens when the race condition leads to the TXQ
>> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
>> preventing the attached qdisc from dequeueing packets and causing the
>> network link to halt.
>>
>> As a first step towards resolving this issue, this patch introduces a
>> failsafe mechanism. It enables the net device watchdog by setting a timeout
>> value and implements the .ndo_tx_timeout callback.
>>
>> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
>> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
>> and allow traffic to resume.
>>
>> The log message will look like this:
>>
>> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
>> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>>
>> This provides a necessary recovery mechanism while the underlying race
>> condition is investigated further. Subsequent patches will address the root
>> cause and add more robust state handling in ndo_open/ndo_stop.
>>
>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>> drivers/net/veth.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index a3046142cb8e..7b1a9805b270 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> rq->stats.vs.xdp_packets += done;
>> u64_stats_update_end(&rq->stats.syncp);
>>
>> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
>> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
>> + txq_trans_cond_update(peer_txq);
>> netif_tx_wake_queue(peer_txq);
>> + }
>
> Hmm, seems a bit weird that this call to txq_trans_cond_update() is only
> in veth_xdp_recv(). Shouldn't there (also?) be one in veth_xmit()?
>
The veth_xmit() call (indirectly) *does* update the txq trans_start
timestamp, but only for return code NET_RX_SUCCESS / NETDEV_TX_OK,
since .ndo_start_xmit = veth_xmit and netdev_start_xmit()[1] calls
txq_trans_update() on NETDEV_TX_OK.
This call to txq_trans_cond_update() isn't strictly necessary, as a
later veth_xmit() call will update it, and the netif_tx_stop_queue()
call also updates trans_start.
I primarily added it because other drivers that use BQL have their
helper functions update the txq trans timestamp, and I see the veth
implementation as a simplified BQL that we hopefully can extend to
become more dynamic, like BQL.
Do you prefer that I remove this? (call to txq_trans_cond_update)
--Jesper
[1]
https://elixir.bootlin.com/linux/v6.17.5/source/include/linux/netdevice.h#L5222-L5233
* Re: [PATCH net V1 3/3] veth: more robust handing of race to avoid txq getting stuck
2025-10-24 14:33 ` Toke Høiland-Jørgensen
@ 2025-10-27 12:19 ` Jesper Dangaard Brouer
2025-10-27 14:12 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-27 12:19 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, netdev, makita.toshiaki
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, toshiaki.makita1, bpf, linux-kernel, kernel-team
On 24/10/2025 16.33, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>
>> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
>> reduce TX drops") introduced a race condition that can lead to a permanently
>> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
>> Max).
>>
>> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
>> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
>> intended to re-wake the queue if the consumer had just emptied it (if
>> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
>> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
>> traffic halts.
>>
>> This failure is caused by an incorrect use of the __ptr_ring_empty() API
>> from the producer side. As noted in kernel comments, this check is not
>> guaranteed to be correct if a consumer is operating on another CPU. The
>> empty test is based on ptr_ring->consumer_head, making it reliable only for
>> the consumer. Using this check from the producer side is fundamentally racy.
>>
>> This patch fixes the race by adopting the more robust logic from an earlier
>> version V4 of the patchset, which always flushed the peer:
>>
>> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
>> are removed. Instead, after stopping the queue, we unconditionally call
>> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
>> making it solely responsible for re-waking the TXQ.
>
> This makes sense.
>
>> (2) On the consumer side, the logic for waking the peer TXQ is centralized.
>> It is moved out of veth_xdp_rcv() (which processes a batch) and placed at
>> the end of the veth_poll() function. This ensures netif_tx_wake_queue() is
>> called once per complete NAPI poll cycle.
>
> So is this second point strictly necessary to fix the race, or is it
> more of an optimisation?
>
IMHO it is strictly necessary to fix the race. The wakeup check
netif_tx_queue_stopped() in veth_poll() needs to be after the code that
(potentially) writes rx_notify_masked.
This handles the race where veth_xmit() hasn't called
netif_tx_stop_queue() yet, but veth_poll() manages to consume all packets
and stop NAPI. Then we know that __veth_xdp_flush(rq) in veth_xmit()
will see rx_notify_masked==false and start NAPI/veth_poll() again, and
even though there are no packets left to process we still hit the
netif_tx_queue_stopped() check, which starts the txq and allows
veth_xmit() to run again.
I'll see if I can improve the description for (2).
>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>> reschedule itself. This prevents a new race where the producer stops the
>> queue just as the consumer is finishing its poll, ensuring the wakeup is not
>> missed.
>
> Also makes sense...
>
>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>> drivers/net/veth.c | 42 +++++++++++++++++++++---------------------
>> 1 file changed, 21 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 3976ddda5fb8..1d70377481eb 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>> }
>> /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
>> __skb_push(skb, ETH_HLEN);
>> - /* Depend on prior success packets started NAPI consumer via
>> - * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
>> - * paired with empty check in veth_poll().
>> - */
>> netif_tx_stop_queue(txq);
>> - smp_mb__after_atomic();
>> - if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
>> - netif_tx_wake_queue(txq);
>> + /* Handle race: Makes sure NAPI peer consumer runs. Consumer is
>> + * responsible for starting txq again, until then ndo_start_xmit
>> + * (this function) will not be invoked by the netstack again.
>> + */
>> + __veth_xdp_flush(rq);
>
> Nit: I'd lose the "Handle race:" prefix from the comment; the rest of
> the comment is clear enough without it, and since there's no explanation
> of *which* race is being handled, it is just confusing, IMO.
Good point, I will remove prefix.
>> break;
>> case NET_RX_DROP: /* same as NET_XMIT_DROP */
>> drop:
>> @@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> struct veth_xdp_tx_bq *bq,
>> struct veth_stats *stats)
>> {
>> - struct veth_priv *priv = netdev_priv(rq->dev);
>> - int queue_idx = rq->xdp_rxq.queue_index;
>> - struct netdev_queue *peer_txq;
>> - struct net_device *peer_dev;
>> int i, done = 0, n_xdpf = 0;
>> void *xdpf[VETH_XDP_BATCH];
>>
>> - /* NAPI functions as RCU section */
>> - peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>> - peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>> -
>> for (i = 0; i < budget; i++) {
>> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>>
>> @@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> rq->stats.vs.xdp_packets += done;
>> u64_stats_update_end(&rq->stats.syncp);
>>
>> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
>> - txq_trans_cond_update(peer_txq);
>> - netif_tx_wake_queue(peer_txq);
>> - }
>> -
>> return done;
>> }
>>
>> @@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
>> {
>> struct veth_rq *rq =
>> container_of(napi, struct veth_rq, xdp_napi);
>> + struct veth_priv *priv = netdev_priv(rq->dev);
>> + int queue_idx = rq->xdp_rxq.queue_index;
>> + struct netdev_queue *peer_txq;
>> struct veth_stats stats = {};
>> + struct net_device *peer_dev;
>> struct veth_xdp_tx_bq bq;
>> int done;
>>
>> bq.count = 0;
>>
>> + /* NAPI functions as RCU section */
>> + peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>> + peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>> +
>> xdp_set_return_frame_no_direct();
>> done = veth_xdp_rcv(rq, budget, &bq, &stats);
>>
>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>> if (done < budget && napi_complete_done(napi, done)) {
>> /* Write rx_notify_masked before reading ptr_ring */
>> smp_store_mb(rq->rx_notify_masked, false);
>> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>> if (napi_schedule_prep(&rq->xdp_napi)) {
>> WRITE_ONCE(rq->rx_notify_masked, true);
>> __napi_schedule(&rq->xdp_napi);
>> @@ -998,6 +992,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
>> veth_xdp_flush(rq, &bq);
>> xdp_clear_return_frame_no_direct();
>>
>> + /* Release backpressure per NAPI poll */
>> + if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
>> + txq_trans_cond_update(peer_txq);
>> + netif_tx_wake_queue(peer_txq);
>> + }
>> +
>> return done;
>> }
>>
>
* Re: [PATCH net V1 1/3] veth: enable dev_watchdog for detecting stalled TXQs
2025-10-27 11:41 ` Jesper Dangaard Brouer
@ 2025-10-27 14:09 ` Toke Høiland-Jørgensen
2025-10-27 16:18 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-27 14:09 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, makita.toshiaki
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, toshiaki.makita1, bpf, linux-kernel, kernel-team
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> On 24/10/2025 15.39, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>>
>>> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
>>> backpressure on full ptr_ring to reduce TX drops") have been found to cause
>>> a race condition in production environments.
>>>
>>> Under specific circumstances, observed exclusively on ARM64 (aarch64)
>>> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
>>> permanently stalled. This happens when the race condition leads to the TXQ
>>> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
>>> preventing the attached qdisc from dequeueing packets and causing the
>>> network link to halt.
>>>
>>> As a first step towards resolving this issue, this patch introduces a
>>> failsafe mechanism. It enables the net device watchdog by setting a timeout
>>> value and implements the .ndo_tx_timeout callback.
>>>
>>> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
>>> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
>>> and allow traffic to resume.
>>>
>>> The log message will look like this:
>>>
>>> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
>>> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>>>
>>> This provides a necessary recovery mechanism while the underlying race
>>> condition is investigated further. Subsequent patches will address the root
>>> cause and add more robust state handling in ndo_open/ndo_stop.
>>>
>>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>> ---
>>> drivers/net/veth.c | 16 +++++++++++++++-
>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index a3046142cb8e..7b1a9805b270 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>> rq->stats.vs.xdp_packets += done;
>>> u64_stats_update_end(&rq->stats.syncp);
>>>
>>> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
>>> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
>>> + txq_trans_cond_update(peer_txq);
>>> netif_tx_wake_queue(peer_txq);
>>> + }
>>
>> Hmm, seems a bit weird that this call to txq_trans_cond_update() is only
>> in veth_xdp_recv(). Shouldn't there (also?) be one in veth_xmit()?
>>
>
> The veth_xmit() call (indirectly) *does* update the txq trans_start
> timestamp, but only for return code NET_RX_SUCCESS / NETDEV_TX_OK,
> since .ndo_start_xmit = veth_xmit and netdev_start_xmit[1] calls
> txq_trans_update() on NETDEV_TX_OK.
Ah, right; didn't think of checking the caller, thanks for the pointer :)
> This call to txq_trans_cond_update() isn't strictly necessary, as the
> veth_xmit() call will update it later, and the netif_tx_stop_queue()
> call also updates trans_start.
>
> I primarily added it because other drivers that use BQL have their
> helper functions update txq_trans. I see the veth implementation as
> a simplified BQL that we hopefully can extend to become more dynamic,
> like BQL.
>
> Do you prefer that I remove this? (call to txq_trans_cond_update)
Hmm, don't we need it for the XDP path? I.e., if there's no traffic
other than XDP_REDIRECT traffic, ndo_start_xmit() will not get called,
so we need some way other to keep the watchdog from firing, I think?
-Toke
* Re: [PATCH net V1 3/3] veth: more robust handling of race to avoid txq getting stuck
2025-10-27 12:19 ` Jesper Dangaard Brouer
@ 2025-10-27 14:12 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-27 14:12 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, makita.toshiaki
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, toshiaki.makita1, bpf, linux-kernel, kernel-team
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> On 24/10/2025 16.33, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>>
>>> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
>>> reduce TX drops") introduced a race condition that can lead to a permanently
>>> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
>>> Max).
>>>
>>> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
>>> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
>>> intended to re-wake the queue if the consumer had just emptied it (if
>>> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
>>> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
>>> traffic halts.
>>>
>>> This failure is caused by an incorrect use of the __ptr_ring_empty() API
>>> from the producer side. As noted in kernel comments, this check is not
>>> guaranteed to be correct if a consumer is operating on another CPU. The
>>> empty test is based on ptr_ring->consumer_head, making it reliable only for
>>> the consumer. Using this check from the producer side is fundamentally racy.
>>>
>>> This patch fixes the race by adopting the more robust logic from an earlier
>>> version V4 of the patchset, which always flushed the peer:
>>>
>>> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
>>> are removed. Instead, after stopping the queue, we unconditionally call
>>> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
>>> making it solely responsible for re-waking the TXQ.
>>
>> This makes sense.
>>
>>> (2) On the consumer side, the logic for waking the peer TXQ is centralized.
>>> It is moved out of veth_xdp_rcv() (which processes a batch) and placed at
>>> the end of the veth_poll() function. This ensures netif_tx_wake_queue() is
>>> called once per complete NAPI poll cycle.
>>
>> So is this second point strictly necessary to fix the race, or is it
>> more of an optimisation?
>>
>
> IMHO it is strictly necessary to fix the race. The wakeup check
> netif_tx_queue_stopped() in veth_poll() needs to be after the code that
> (potentially) writes rx_notify_masked.
>
> This handles the race where veth_xmit() hasn't called
> netif_tx_stop_queue() yet, but veth_poll() manages to consume all packets
> and stop NAPI. Then we know that __veth_xdp_flush(rq) in veth_xmit()
> will see rx_notify_masked==false and start NAPI/veth_poll() again, and
> even though there are no packets left to process we still hit the
> netif_tx_queue_stopped() check, which starts the txq and allows
> veth_xmit() to run again.
>
> I'll see if I can improve the description for (2).
Right, okay. Yes, adding this reasoning to the commit message would be
good :)
-Toke
* Re: [PATCH net V1 1/3] veth: enable dev_watchdog for detecting stalled TXQs
2025-10-27 14:09 ` Toke Høiland-Jørgensen
@ 2025-10-27 16:18 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-27 16:18 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, netdev, makita.toshiaki
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, toshiaki.makita1, bpf, linux-kernel, kernel-team
On 27/10/2025 15.09, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>
>> On 24/10/2025 15.39, Toke Høiland-Jørgensen wrote:
>>> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>>>
>>>> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
>>>> backpressure on full ptr_ring to reduce TX drops") have been found to cause
>>>> a race condition in production environments.
>>>>
>>>> Under specific circumstances, observed exclusively on ARM64 (aarch64)
>>>> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
>>>> permanently stalled. This happens when the race condition leads to the TXQ
>>>> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
>>>> preventing the attached qdisc from dequeueing packets and causing the
>>>> network link to halt.
>>>>
>>>> As a first step towards resolving this issue, this patch introduces a
>>>> failsafe mechanism. It enables the net device watchdog by setting a timeout
>>>> value and implements the .ndo_tx_timeout callback.
>>>>
>>>> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
>>>> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
>>>> and allow traffic to resume.
>>>>
>>>> The log message will look like this:
>>>>
>>>> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
>>>> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>>>>
>>>> This provides a necessary recovery mechanism while the underlying race
>>>> condition is investigated further. Subsequent patches will address the root
>>>> cause and add more robust state handling in ndo_open/ndo_stop.
>>>>
>>>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>> ---
>>>> drivers/net/veth.c | 16 +++++++++++++++-
>>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index a3046142cb8e..7b1a9805b270 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>> rq->stats.vs.xdp_packets += done;
>>>> u64_stats_update_end(&rq->stats.syncp);
>>>>
>>>> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
>>>> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
>>>> + txq_trans_cond_update(peer_txq);
>>>> netif_tx_wake_queue(peer_txq);
>>>> + }
>>>
>>> Hmm, seems a bit weird that this call to txq_trans_cond_update() is only
>>> in veth_xdp_recv(). Shouldn't there (also?) be one in veth_xmit()?
>>>
>>
>> The veth_xmit() call (indirectly) *does* update the txq trans_start
>> timestamp, but only for return code NET_RX_SUCCESS / NETDEV_TX_OK,
>> since .ndo_start_xmit = veth_xmit and netdev_start_xmit[1] calls
>> txq_trans_update() on NETDEV_TX_OK.
>
> Ah, right; didn't think of checking the caller, thanks for the pointer :)
>
>> This call to txq_trans_cond_update() isn't strictly necessary, as the
>> veth_xmit() call will update it later, and the netif_tx_stop_queue()
>> call also updates trans_start.
>>
>> I primarily added it because other drivers that use BQL have their
>> helper functions update txq_trans. I see the veth implementation as
>> a simplified BQL that we hopefully can extend to become more dynamic,
>> like BQL.
>>
>> Do you prefer that I remove this? (call to txq_trans_cond_update)
>
> Hmm, don't we need it for the XDP path? I.e., if there's no traffic
> other than XDP_REDIRECT traffic, ndo_start_xmit() will not get called,
> so we need some way other to keep the watchdog from firing, I think?
>
Yes, perhaps you are right. Even though the stop call
netif_tx_stop_queue() also updates trans_start, with XDP redirect
something else can keep the ptr_ring full. The netif_tx_wake_queue()
call doesn't update trans_start itself (it depends on a subsequent
successful netstack packet). So, without this txq_trans update it can
get very stale (out-of-date) if we starve normal network stack packets.
I'm not 100% sure this would even trigger the watchdog, as the queue
stopped bit should have been cleared. It might still be worth keeping,
to avoid trans_start getting too far out-of-date due to XDP traffic.
--Jesper
* Re: [PATCH net V1 3/3] veth: more robust handling of race to avoid txq getting stuck
2025-10-23 14:59 ` [PATCH net V1 3/3] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
2025-10-24 14:33 ` Toke Høiland-Jørgensen
@ 2025-10-27 19:22 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-27 19:22 UTC (permalink / raw)
To: netdev
Cc: Eric Dumazet, makita.toshiaki, David S. Miller, Jakub Kicinski,
Paolo Abeni, ihor.solodrai, toshiaki.makita1, bpf, linux-kernel,
kernel-team
On 23/10/2025 16.59, Jesper Dangaard Brouer wrote:
[...]
> ---
> drivers/net/veth.c | 42 +++++++++++++++++++++---------------------
> 1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 3976ddda5fb8..1d70377481eb 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
> /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
> __skb_push(skb, ETH_HLEN);
> - /* Depend on prior success packets started NAPI consumer via
> - * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
> - * paired with empty check in veth_poll().
> - */
> netif_tx_stop_queue(txq);
> - smp_mb__after_atomic();
> - if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
> - netif_tx_wake_queue(txq);
> + /* Handle race: Makes sure NAPI peer consumer runs. Consumer is
> + * responsible for starting txq again, until then ndo_start_xmit
> + * (this function) will not be invoked by the netstack again.
> + */
> + __veth_xdp_flush(rq);
> break;
> case NET_RX_DROP: /* same as NET_XMIT_DROP */
> drop:
[...]
> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
> if (done < budget && napi_complete_done(napi, done)) {
> /* Write rx_notify_masked before reading ptr_ring */
> smp_store_mb(rq->rx_notify_masked, false);
> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
> if (napi_schedule_prep(&rq->xdp_napi)) {
> WRITE_ONCE(rq->rx_notify_masked, true);
> __napi_schedule(&rq->xdp_napi);
> @@ -998,6 +992,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
> veth_xdp_flush(rq, &bq);
> xdp_clear_return_frame_no_direct();
>
> + /* Release backpressure per NAPI poll */
> + if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
^^^^^^^^^^^^^^^^^^^^^^
The check netif_tx_queue_stopped() uses a non-atomic test_bit().
Thus, I'm considering adding an smp_rmb() before the if statement, to be
paired with the netif_tx_stop_queue() in veth_xmit().
> + txq_trans_cond_update(peer_txq);
> + netif_tx_wake_queue(peer_txq);
> + }
> +
> return done;
> }
--Jesper
end of thread, other threads:[~2025-10-27 19:22 UTC | newest]
Thread overview: 14+ messages
2025-10-23 14:59 [PATCH net V1 0/3] veth: Fix TXQ stall race condition and add recovery Jesper Dangaard Brouer
2025-10-23 14:59 ` [PATCH net V1 1/3] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
2025-10-24 13:39 ` Toke Høiland-Jørgensen
2025-10-27 11:41 ` Jesper Dangaard Brouer
2025-10-27 14:09 ` Toke Høiland-Jørgensen
2025-10-27 16:18 ` Jesper Dangaard Brouer
2025-10-23 14:59 ` [PATCH net V1 2/3] veth: stop and start all TX queue in netdev down/up Jesper Dangaard Brouer
2025-10-25 0:54 ` Jakub Kicinski
2025-10-27 10:33 ` Jesper Dangaard Brouer
2025-10-23 14:59 ` [PATCH net V1 3/3] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
2025-10-24 14:33 ` Toke Høiland-Jørgensen
2025-10-27 12:19 ` Jesper Dangaard Brouer
2025-10-27 14:12 ` Toke Høiland-Jørgensen
2025-10-27 19:22 ` Jesper Dangaard Brouer