linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH net V4] veth: Fix TXQ stall race condition
@ 2025-11-12 13:13 Jesper Dangaard Brouer
  2025-11-12 13:13 ` [PATCH net V4] veth: more robust handing of race to avoid txq getting stuck Jesper Dangaard Brouer
  0 siblings, 1 reply; 3+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-12 13:13 UTC (permalink / raw)
  To: netdev, Toke Høiland-Jørgensen
  Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
	Jakub Kicinski, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
	makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
	linux-arm-kernel, kernel-team

This patchset addresses a race condition introduced in commit dc82a33297fc
("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops"). In
production, this has been observed to cause a permanently stalled transmit
queue (TXQ) on ARM64 (Ampere Altra Max) systems, leading to a "lost wakeup"
scenario where the TXQ remains in the QUEUE_STATE_DRV_XOFF state and traffic
halts.

The root cause is a racy use of the __ptr_ring_empty() API from the producer
side (veth_xmit). The producer stops the queue and then checks the ptr_ring's
consumer head, but that check is not guaranteed to be correct when observed
from the producer side, e.g. when the NAPI consumer on another CPU has just
finished consuming.

This series fixes the race bug. A follow-up patch that makes the driver more
resilient (watchdog-based recovery) is postponed to net-next, as maintainers
don't consider it an actual fix.

V4:
 - Focus on race fix for stable net-tree
 - Watchdog recovery patch is postponed to net-next tree

V3: https://lore.kernel.org/all/176236363962.30034.10275956147958212569.stgit@firesoul/
 - Don't keep NAPI running when detecting the race, because the end of
   veth_poll will see the TXQ is stopped anyway and wake the queue, making it
   the responsibility of the producer veth_xmit to do a "flush" that restarts
   NAPI.

V2: https://lore.kernel.org/all/176159549627.5396.15971398227283515867.stgit@firesoul/
 - Drop patch that changed up/down NDOs
 - For the race fix, add an smp_rmb and improve the commit message reasoning
   for the race cases

V1: https://lore.kernel.org/all/176123150256.2281302.7000617032469740443.stgit@firesoul/

---

Jesper Dangaard Brouer (1):
      veth: more robust handing of race to avoid txq getting stuck


 drivers/net/veth.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

--



^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH net V4] veth: more robust handing of race to avoid txq getting stuck
  2025-11-12 13:13 [PATCH net V4] veth: Fix TXQ stall race condition Jesper Dangaard Brouer
@ 2025-11-12 13:13 ` Jesper Dangaard Brouer
  2025-11-15  2:20   ` patchwork-bot+netdevbpf
  0 siblings, 1 reply; 3+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-12 13:13 UTC (permalink / raw)
  To: netdev, Toke Høiland-Jørgensen
  Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
	Jakub Kicinski, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
	makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
	linux-arm-kernel, kernel-team

Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") introduced a race condition that can lead to a permanently
stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
Max).

The race occurs in veth_xmit(). The producer observes a full ptr_ring and
stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
intended to re-wake the queue if the consumer had just emptied it (if
(__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
"lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
traffic halts.

This failure is caused by an incorrect use of the __ptr_ring_empty() API
from the producer side. As noted in kernel comments, this check is not
guaranteed to be correct if a consumer is operating on another CPU. The
empty test is based on ptr_ring->consumer_head, making it reliable only for
the consumer. Using this check from the producer side is fundamentally racy.

This patch fixes the race by adopting the more robust logic from an earlier
version (V4) of the original patchset, which always flushed the peer:

(1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
are removed. Instead, after stopping the queue, we unconditionally call
__veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
making it solely responsible for re-waking the TXQ.
  This handles the race where veth_poll() consumes all packets and completes
NAPI *before* veth_xmit() on the producer side has called
netif_tx_stop_queue(). The __veth_xdp_flush(rq) call will observe that
rx_notify_masked is false and schedule NAPI.

(2) On the consumer side, the logic for waking the peer TXQ is moved out of
veth_xdp_rcv() and placed at the end of the veth_poll() function. This
placement is part of fixing the race, as the netif_tx_queue_stopped() check
must occur after rx_notify_masked is potentially set to false during NAPI
completion.
  This handles the race where veth_poll() has consumed all packets but hasn't
finished yet (rx_notify_masked is still true). The producer veth_xmit() stops
the TXQ, and __veth_xdp_flush(rq) observes that rx_notify_masked is true,
meaning NAPI is not restarted. Then veth_poll() changes rx_notify_masked to
false and stops NAPI. Before exiting, veth_poll() observes that the TXQ is
stopped and wakes it up.

Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 drivers/net/veth.c |   38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 87a63c4bee77..056f7f79f13d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 		/* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
 		__skb_push(skb, ETH_HLEN);
-		/* Depend on prior success packets started NAPI consumer via
-		 * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
-		 * paired with empty check in veth_poll().
-		 */
 		netif_tx_stop_queue(txq);
-		smp_mb__after_atomic();
-		if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
-			netif_tx_wake_queue(txq);
+		/* Makes sure NAPI peer consumer runs. Consumer is responsible
+		 * for starting txq again, until then ndo_start_xmit (this
+		 * function) will not be invoked by the netstack again.
+		 */
+		__veth_xdp_flush(rq);
 		break;
 	case NET_RX_DROP: /* same as NET_XMIT_DROP */
 drop:
@@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
 			struct veth_stats *stats)
 {
-	struct veth_priv *priv = netdev_priv(rq->dev);
-	int queue_idx = rq->xdp_rxq.queue_index;
-	struct netdev_queue *peer_txq;
-	struct net_device *peer_dev;
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
 
-	/* NAPI functions as RCU section */
-	peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
-	peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
-
 	for (i = 0; i < budget; i++) {
 		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
 
@@ -959,9 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 	rq->stats.vs.xdp_packets += done;
 	u64_stats_update_end(&rq->stats.syncp);
 
-	if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
-		netif_tx_wake_queue(peer_txq);
-
 	return done;
 }
 
@@ -969,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
 {
 	struct veth_rq *rq =
 		container_of(napi, struct veth_rq, xdp_napi);
+	struct veth_priv *priv = netdev_priv(rq->dev);
+	int queue_idx = rq->xdp_rxq.queue_index;
+	struct netdev_queue *peer_txq;
 	struct veth_stats stats = {};
+	struct net_device *peer_dev;
 	struct veth_xdp_tx_bq bq;
 	int done;
 
 	bq.count = 0;
 
+	/* NAPI functions as RCU section */
+	peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
+	peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
+
 	xdp_set_return_frame_no_direct();
 	done = veth_xdp_rcv(rq, budget, &bq, &stats);
 
@@ -996,6 +991,13 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		veth_xdp_flush(rq, &bq);
 	xdp_clear_return_frame_no_direct();
 
+	/* Release backpressure per NAPI poll */
+	smp_rmb(); /* Paired with netif_tx_stop_queue set_bit */
+	if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
+		txq_trans_cond_update(peer_txq);
+		netif_tx_wake_queue(peer_txq);
+	}
+
 	return done;
 }
 




^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH net V4] veth: more robust handing of race to avoid txq getting stuck
  2025-11-12 13:13 ` [PATCH net V4] veth: more robust handing of race to avoid txq getting stuck Jesper Dangaard Brouer
@ 2025-11-15  2:20   ` patchwork-bot+netdevbpf
  0 siblings, 0 replies; 3+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-11-15  2:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, toke, eric.dumazet, davem, kuba, pabeni, ihor.solodrai,
	mst, makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
	linux-arm-kernel, kernel-team

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 12 Nov 2025 14:13:52 +0100 you wrote:
> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") introduced a race condition that can lead to a permanently
> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
> Max).
> 
> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
> intended to re-wake the queue if the consumer had just emptied it (if
> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
> traffic halts.
> 
> [...]

Here is the summary with links:
  - [net,V4] veth: more robust handing of race to avoid txq getting stuck
    https://git.kernel.org/netdev/net/c/5442a9da6978

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-11-15  2:20 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-12 13:13 [PATCH net V4] veth: Fix TXQ stall race condition Jesper Dangaard Brouer
2025-11-12 13:13 ` [PATCH net V4] veth: more robust handing of race to avoid txq getting stuck Jesper Dangaard Brouer
2025-11-15  2:20   ` patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).