* [PATCH net V2 0/2] veth: Fix TXQ stall race condition and add recovery
@ 2025-10-27 20:05 Jesper Dangaard Brouer
2025-10-27 20:05 ` [PATCH net V2 1/2] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
2025-10-27 20:05 ` [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
0 siblings, 2 replies; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-27 20:05 UTC (permalink / raw)
To: netdev, Toke Høiland-Jørgensen
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
This patchset addresses a race condition introduced in commit dc82a33297fc
("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops"). In
production, this has been observed to cause a permanently stalled transmit
queue (TXQ) on ARM64 (Ampere Altra Max) systems, leading to a "lost wakeup"
scenario where the TXQ remains in the QUEUE_STATE_DRV_XOFF state and traffic
halts.
The root cause, which is fixed in patch 2, is a racy use of the
__ptr_ring_empty() API from the producer side (veth_xmit). The producer
stops the queue and then checks the ptr_ring consumer's head, but this check
is not guaranteed to be correct when observed from the producer side,
especially when the NAPI consumer on another CPU has just finished consuming.
This series fixes the bug and makes the driver more resilient so that it can recover.
The patches are ordered to first add recovery mechanisms, then fix the
underlying race.
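For illustration, here is a condensed before/after sketch of the producer-side
backpressure path in veth_xmit() (simplified from the hunks in patch 2; the
surrounding error handling and statistics updates are omitted):

	/* Before (racy): stop the queue, then peek at the ring from the
	 * producer side and conditionally re-wake the queue.
	 */
	netif_tx_stop_queue(txq);
	smp_mb__after_atomic();
	if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
		netif_tx_wake_queue(txq);	/* wakeup can be lost */

	/* After: stop the queue and unconditionally kick the NAPI consumer,
	 * which becomes solely responsible for re-waking the TXQ at the end
	 * of veth_poll().
	 */
	netif_tx_stop_queue(txq);
	__veth_xdp_flush(rq);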
V2:
- Drop patch that changed up/down NDOs
- For the race fix, add an smp_rmb() and improve the commit message reasoning for the race cases
V1: https://lore.kernel.org/all/176123150256.2281302.7000617032469740443.stgit@firesoul/
---
Jesper Dangaard Brouer (2):
veth: enable dev_watchdog for detecting stalled TXQs
veth: more robust handling of race to avoid txq getting stuck
drivers/net/veth.c | 53 +++++++++++++++++++++++++++++-----------------
1 file changed, 34 insertions(+), 19 deletions(-)
--
* [PATCH net V2 1/2] veth: enable dev_watchdog for detecting stalled TXQs
2025-10-27 20:05 [PATCH net V2 0/2] veth: Fix TXQ stall race condition and add recovery Jesper Dangaard Brouer
@ 2025-10-27 20:05 ` Jesper Dangaard Brouer
2025-10-28 9:10 ` Toke Høiland-Jørgensen
2025-10-27 20:05 ` [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
1 sibling, 1 reply; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-27 20:05 UTC (permalink / raw)
To: netdev, Toke Høiland-Jørgensen
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
The changes introduced in commit dc82a33297fc ("veth: apply qdisc
backpressure on full ptr_ring to reduce TX drops") have been found to cause
a race condition in production environments.
Under specific circumstances, observed exclusively on ARM64 (aarch64)
systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
permanently stalled. This happens when the race condition leads to the TXQ
entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
preventing the attached qdisc from dequeueing packets and causing the
network link to halt.
As a first step towards resolving this issue, this patch introduces a
failsafe mechanism. It enables the net device watchdog by setting a timeout
value and implements the .ndo_tx_timeout callback.
If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
which logs a warning and calls netif_tx_wake_queue() to unstall the queue
and allow traffic to resume.
The log message will look like this:
veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
This provides a necessary recovery mechanism while the underlying race
condition is investigated further. Subsequent patches will address the root
cause and add more robust state handling.
Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a3046142cb8e..7b1a9805b270 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -959,8 +959,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
+ if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
+ txq_trans_cond_update(peer_txq);
netif_tx_wake_queue(peer_txq);
+ }
return done;
}
@@ -1373,6 +1375,16 @@ static int veth_set_channels(struct net_device *dev,
goto out;
}
+static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+ struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
+
+ netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
+ atomic_long_read(&txq->trans_timeout), txqueue);
+
+ netif_tx_wake_queue(txq);
+}
+
static int veth_open(struct net_device *dev)
{
struct veth_priv *priv = netdev_priv(dev);
@@ -1711,6 +1723,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_bpf = veth_xdp,
.ndo_xdp_xmit = veth_ndo_xdp_xmit,
.ndo_get_peer_dev = veth_peer_dev,
+ .ndo_tx_timeout = veth_tx_timeout,
};
static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
@@ -1749,6 +1762,7 @@ static void veth_setup(struct net_device *dev)
dev->priv_destructor = veth_dev_free;
dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
dev->max_mtu = ETH_MAX_MTU;
+ dev->watchdog_timeo = msecs_to_jiffies(5000);
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
* [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-10-27 20:05 [PATCH net V2 0/2] veth: Fix TXQ stall race condition and add recovery Jesper Dangaard Brouer
2025-10-27 20:05 ` [PATCH net V2 1/2] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
@ 2025-10-27 20:05 ` Jesper Dangaard Brouer
2025-10-28 9:10 ` Toke Høiland-Jørgensen
` (2 more replies)
1 sibling, 3 replies; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-27 20:05 UTC (permalink / raw)
To: netdev, Toke Høiland-Jørgensen
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") introduced a race condition that can lead to a permanently
stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
Max).
The race occurs in veth_xmit(). The producer observes a full ptr_ring and
stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
intended to re-wake the queue if the consumer had just emptied it (if
(__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
"lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
traffic halts.
This failure is caused by an incorrect use of the __ptr_ring_empty() API
from the producer side. As noted in kernel comments, this check is not
guaranteed to be correct if a consumer is operating on another CPU. The
empty test is based on ptr_ring->consumer_head, making it reliable only for
the consumer. Using this check from the producer side is fundamentally racy.
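For reference, a simplified sketch of what that empty check boils down to
(paraphrased; see include/linux/ptr_ring.h for the authoritative code, which
also handles the zero-size case and batched invalidation of consumed slots):

	/* Conceptual sketch only -- not the exact kernel implementation */
	static inline bool __ptr_ring_empty(struct ptr_ring *r)
	{
		/* Based on consumer_head, which only the consumer maintains;
		 * a producer racing with an active consumer on another CPU
		 * can observe a stale or transient view of this state.
		 */
		return !r->queue[READ_ONCE(r->consumer_head)];
	}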
This patch fixes the race by adopting the more robust logic from an earlier
version V4 of the patchset, which always flushed the peer:
(1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
are removed. Instead, after stopping the queue, we unconditionally call
__veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
making it solely responsible for re-waking the TXQ.
(2) On the consumer side, the logic for waking the peer TXQ is moved out of
veth_xdp_rcv() and placed at the end of the veth_poll() function. This
placement is part of fixing the race, as the netif_tx_queue_stopped() check
must occur after rx_notify_masked is potentially set to false during NAPI
completion.
This handles the race where veth_poll() consumes all packets and completes
NAPI before veth_xmit() on the producer side has called netif_tx_stop_queue().
In this state, the producer's __veth_xdp_flush(rq) call will see
rx_notify_masked is false and reschedule NAPI. This new NAPI poll, even if it
processes no packets, is now guaranteed to run the netif_tx_queue_stopped()
check, see the stopped queue, and wake it up, allowing veth_xmit() to proceed.
(3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
about to complete (napi_complete_done), it now also checks if the peer TXQ
is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
reschedule itself. This prevents a new race where the producer stops the
queue just as the consumer is finishing its poll, ensuring the wakeup is not
missed.
Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 43 ++++++++++++++++++++++---------------------
1 file changed, 22 insertions(+), 21 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7b1a9805b270..828b62916f50 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
/* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
__skb_push(skb, ETH_HLEN);
- /* Depend on prior success packets started NAPI consumer via
- * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
- * paired with empty check in veth_poll().
- */
netif_tx_stop_queue(txq);
- smp_mb__after_atomic();
- if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
- netif_tx_wake_queue(txq);
+ /* Makes sure NAPI peer consumer runs. Consumer is responsible
+ * for starting txq again, until then ndo_start_xmit (this
+ * function) will not be invoked by the netstack again.
+ */
+ __veth_xdp_flush(rq);
break;
case NET_RX_DROP: /* same as NET_XMIT_DROP */
drop:
@@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
struct veth_stats *stats)
{
- struct veth_priv *priv = netdev_priv(rq->dev);
- int queue_idx = rq->xdp_rxq.queue_index;
- struct netdev_queue *peer_txq;
- struct net_device *peer_dev;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
- /* NAPI functions as RCU section */
- peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
- peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
-
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
- txq_trans_cond_update(peer_txq);
- netif_tx_wake_queue(peer_txq);
- }
-
return done;
}
@@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
{
struct veth_rq *rq =
container_of(napi, struct veth_rq, xdp_napi);
+ struct veth_priv *priv = netdev_priv(rq->dev);
+ int queue_idx = rq->xdp_rxq.queue_index;
+ struct netdev_queue *peer_txq;
struct veth_stats stats = {};
+ struct net_device *peer_dev;
struct veth_xdp_tx_bq bq;
int done;
bq.count = 0;
+ /* NAPI functions as RCU section */
+ peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
+ peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
+
xdp_set_return_frame_no_direct();
done = veth_xdp_rcv(rq, budget, &bq, &stats);
@@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
if (done < budget && napi_complete_done(napi, done)) {
/* Write rx_notify_masked before reading ptr_ring */
smp_store_mb(rq->rx_notify_masked, false);
- if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
+ if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
+ (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
if (napi_schedule_prep(&rq->xdp_napi)) {
WRITE_ONCE(rq->rx_notify_masked, true);
__napi_schedule(&rq->xdp_napi);
@@ -998,6 +992,13 @@ static int veth_poll(struct napi_struct *napi, int budget)
veth_xdp_flush(rq, &bq);
xdp_clear_return_frame_no_direct();
+ /* Release backpressure per NAPI poll */
+ smp_rmb(); /* Paired with netif_tx_stop_queue set_bit */
+ if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
+ txq_trans_cond_update(peer_txq);
+ netif_tx_wake_queue(peer_txq);
+ }
+
return done;
}
* Re: [PATCH net V2 1/2] veth: enable dev_watchdog for detecting stalled TXQs
2025-10-27 20:05 ` [PATCH net V2 1/2] veth: enable dev_watchdog for detecting stalled TXQs Jesper Dangaard Brouer
@ 2025-10-28 9:10 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 12+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-28 9:10 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
> backpressure on full ptr_ring to reduce TX drops") have been found to cause
> a race condition in production environments.
>
> Under specific circumstances, observed exclusively on ARM64 (aarch64)
> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
> permanently stalled. This happens when the race condition leads to the TXQ
> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
> preventing the attached qdisc from dequeueing packets and causing the
> network link to halt.
>
> As a first step towards resolving this issue, this patch introduces a
> failsafe mechanism. It enables the net device watchdog by setting a timeout
> value and implements the .ndo_tx_timeout callback.
>
> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
> and allow traffic to resume.
>
> The log message will look like this:
>
> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>
> This provides a necessary recovery mechanism while the underlying race
> condition is investigated further. Subsequent patches will address the root
> cause and add more robust state handling.
>
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
* Re: [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-10-27 20:05 ` [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
@ 2025-10-28 9:10 ` Toke Høiland-Jørgensen
2025-10-28 14:56 ` Toshiaki Makita
2025-10-30 12:28 ` Paolo Abeni
2 siblings, 0 replies; 12+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-28 9:10 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, ihor.solodrai, Michael S. Tsirkin,
makita.toshiaki, toshiaki.makita1, bpf, linux-kernel,
linux-arm-kernel, kernel-team
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") introduced a race condition that can lead to a permanently
> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
> Max).
>
> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
> intended to re-wake the queue if the consumer had just emptied it (if
> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
> traffic halts.
>
> This failure is caused by an incorrect use of the __ptr_ring_empty() API
> from the producer side. As noted in kernel comments, this check is not
> guaranteed to be correct if a consumer is operating on another CPU. The
> empty test is based on ptr_ring->consumer_head, making it reliable only for
> the consumer. Using this check from the producer side is fundamentally racy.
>
> This patch fixes the race by adopting the more robust logic from an earlier
> version V4 of the patchset, which always flushed the peer:
>
> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
> are removed. Instead, after stopping the queue, we unconditionally call
> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
> making it solely responsible for re-waking the TXQ.
>
> (2) On the consumer side, the logic for waking the peer TXQ is moved out of
> veth_xdp_rcv() and placed at the end of the veth_poll() function. This
> placement is part of fixing the race, as the netif_tx_queue_stopped() check
> must occur after rx_notify_masked is potentially set to false during NAPI
> completion.
> This handles the race where veth_poll() consumes all packets and completes
> NAPI before veth_xmit() on the producer side has called netif_tx_stop_queue().
> In this state, the producer's __veth_xdp_flush(rq) call will see
> rx_notify_masked is false and reschedule NAPI. This new NAPI poll, even if it
> processes no packets, is now guaranteed to run the netif_tx_queue_stopped()
> check, see the stopped queue, and wake it up, allowing veth_xmit() to proceed.
>
> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
> about to complete (napi_complete_done), it now also checks if the peer TXQ
> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
> reschedule itself. This prevents a new race where the producer stops the
> queue just as the consumer is finishing its poll, ensuring the wakeup is not
> missed.
>
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
* Re: [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-10-27 20:05 ` [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
2025-10-28 9:10 ` Toke Høiland-Jørgensen
@ 2025-10-28 14:56 ` Toshiaki Makita
2025-10-29 10:33 ` Jesper Dangaard Brouer
2025-10-30 12:28 ` Paolo Abeni
2 siblings, 1 reply; 12+ messages in thread
From: Toshiaki Makita @ 2025-10-28 14:56 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, Michael S. Tsirkin, makita.toshiaki, bpf,
linux-kernel, linux-arm-kernel, kernel-team, netdev,
Toke Høiland-Jørgensen
On 2025/10/28 5:05, Jesper Dangaard Brouer wrote:
> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
> are removed. Instead, after stopping the queue, we unconditionally call
> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
> making it solely responsible for re-waking the TXQ.
Maybe another option is to use !ptr_ring_full() instead of ptr_ring_empty()?
I'm not sure which is better. Anyway I'm ok with your approach.
...
> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
> about to complete (napi_complete_done), it now also checks if the peer TXQ
> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
> reschedule itself. This prevents a new race where the producer stops the
> queue just as the consumer is finishing its poll, ensuring the wakeup is not
> missed.
...
> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
> if (done < budget && napi_complete_done(napi, done)) {
> /* Write rx_notify_masked before reading ptr_ring */
> smp_store_mb(rq->rx_notify_masked, false);
> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
Not sure if this is necessary.
From commitlog, your intention seems to be making sure to wake up the queue,
but you wake up the queue immediately after this hunk in the same function,
so isn't it guaranteed without scheduling another napi?
> if (napi_schedule_prep(&rq->xdp_napi)) {
> WRITE_ONCE(rq->rx_notify_masked, true);
> __napi_schedule(&rq->xdp_napi);
> @@ -998,6 +992,13 @@ static int veth_poll(struct napi_struct *napi, int budget)
> veth_xdp_flush(rq, &bq);
> xdp_clear_return_frame_no_direct();
>
> + /* Release backpressure per NAPI poll */
> + smp_rmb(); /* Paired with netif_tx_stop_queue set_bit */
> + if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
> + txq_trans_cond_update(peer_txq);
> + netif_tx_wake_queue(peer_txq);
> + }
> +
> return done;
> }
--
Toshiaki Makita
* Re: [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-10-28 14:56 ` Toshiaki Makita
@ 2025-10-29 10:33 ` Jesper Dangaard Brouer
2025-10-29 15:00 ` Toshiaki Makita
0 siblings, 1 reply; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-29 10:33 UTC (permalink / raw)
To: Toshiaki Makita
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, Michael S. Tsirkin, makita.toshiaki, bpf,
linux-kernel, linux-arm-kernel, kernel-team, netdev,
Toke Høiland-Jørgensen
On 28/10/2025 15.56, Toshiaki Makita wrote:
> On 2025/10/28 5:05, Jesper Dangaard Brouer wrote:
>
>> (1) In veth_xmit(), the racy conditional wake-up logic and its memory
>> barrier
>> are removed. Instead, after stopping the queue, we unconditionally call
>> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is
>> scheduled,
>> making it solely responsible for re-waking the TXQ.
>
> Maybe another option is to use !ptr_ring_full() instead of
> ptr_ring_empty()?
Nope, that will not work.
I think MST will agree.
> I'm not sure which is better. Anyway I'm ok with your approach.
>
> ...
>
>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>> reschedule itself. This prevents a new race where the producer stops the
>> queue just as the consumer is finishing its poll, ensuring the wakeup
>> is not missed.
> ...
>
>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int
>> budget)
>> if (done < budget && napi_complete_done(napi, done)) {
>> /* Write rx_notify_masked before reading ptr_ring */
>> smp_store_mb(rq->rx_notify_masked, false);
>> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>
> Not sure if this is necessary.
How sure are you that this isn't necessary?
> From commitlog, your intention seems to be making sure to wake up the
> queue,
> but you wake up the queue immediately after this hunk in the same function,
> so isn't it guaranteed without scheduling another napi?
>
The above code catches the case, where the ptr_ring is empty and the
tx_queue is stopped. It feels wrong not to reach in this case, but you
*might* be right that it isn't strictly necessary, because below code
will also call netif_tx_wake_queue() which *should* have a SKB stored
that will *indirectly* trigger a restart of the NAPI.
I will stare some more at the code to see if I can convince myself that
we don't have to catch this case.
Please, also provide "How sure are you that this isn't necessary?"
>> if (napi_schedule_prep(&rq->xdp_napi)) {
>> WRITE_ONCE(rq->rx_notify_masked, true);
>> __napi_schedule(&rq->xdp_napi);
>> @@ -998,6 +992,13 @@ static int veth_poll(struct napi_struct *napi,
>> int budget)
>> veth_xdp_flush(rq, &bq);
>> xdp_clear_return_frame_no_direct();
>> + /* Release backpressure per NAPI poll */
>> + smp_rmb(); /* Paired with netif_tx_stop_queue set_bit */
>> + if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
>> + txq_trans_cond_update(peer_txq);
>> + netif_tx_wake_queue(peer_txq);
>> + }
>> +
>> return done;
>> }
>
> --
> Toshiaki Makita
* Re: [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-10-29 10:33 ` Jesper Dangaard Brouer
@ 2025-10-29 15:00 ` Toshiaki Makita
2025-10-30 19:06 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 12+ messages in thread
From: Toshiaki Makita @ 2025-10-29 15:00 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, Michael S. Tsirkin, makita.toshiaki, bpf,
linux-kernel, linux-arm-kernel, kernel-team, netdev,
Toke Høiland-Jørgensen
On 2025/10/29 19:33, Jesper Dangaard Brouer wrote:
> On 28/10/2025 15.56, Toshiaki Makita wrote:
>> On 2025/10/28 5:05, Jesper Dangaard Brouer wrote:
>>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>>> reschedule itself. This prevents a new race where the producer stops the
>>> queue just as the consumer is finishing its poll, ensuring the wakeup is not missed.
>> ...
>>
>>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>>> if (done < budget && napi_complete_done(napi, done)) {
>>> /* Write rx_notify_masked before reading ptr_ring */
>>> smp_store_mb(rq->rx_notify_masked, false);
>>> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>>> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>>> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>>
>> Not sure if this is necessary.
>
> How sure are you that this isn't necessary?
>
>> From commitlog, your intention seems to be making sure to wake up the queue,
>> but you wake up the queue immediately after this hunk in the same function,
>> so isn't it guaranteed without scheduling another napi?
>>
>
> The above code catches the case, where the ptr_ring is empty and the
> tx_queue is stopped. It feels wrong not to reach in this case, but you
> *might* be right that it isn't strictly necessary, because below code
> will also call netif_tx_wake_queue() which *should* have a SKB stored
> that will *indirectly* trigger a restart of the NAPI.
I'm a bit confused.
Wrt (3), what you want is waking up the queue, right?
Or, what you want is actually NAPI reschedule itself?
My understanding was the former (wake up the queue).
If it's correct, (3) seems not necessary because you have already woken up the queue
in the same function.
First NAPI
veth_poll()
// ptr_ring_empty() and queue_stopped()
__napi_schedule() ... schedule second NAPI
netif_tx_wake_queue() ... wake up the queue if queue_stopped()
Second NAPI
veth_poll()
netif_tx_wake_queue() ... this is what you want,
but the queue has been woken up in the first NAPI
What's the point?
> I will stare some more at the code to see if I can convince myself that
> we don't have to catch this case.
>
> Please, also provide "How sure are you that this isn't necessary?"
I could not find a case where we need (3), as I explained above.
--
Toshiaki Makita
* Re: [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-10-27 20:05 ` [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck Jesper Dangaard Brouer
2025-10-28 9:10 ` Toke Høiland-Jørgensen
2025-10-28 14:56 ` Toshiaki Makita
@ 2025-10-30 12:28 ` Paolo Abeni
2025-11-05 15:54 ` Jesper Dangaard Brouer
2 siblings, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2025-10-30 12:28 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, Toke Høiland-Jørgensen
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, ihor.solodrai,
Michael S. Tsirkin, makita.toshiaki, toshiaki.makita1, bpf,
linux-kernel, linux-arm-kernel, kernel-team
On 10/27/25 9:05 PM, Jesper Dangaard Brouer wrote:
> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
> about to complete (napi_complete_done), it now also checks if the peer TXQ
> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
> reschedule itself. This prevents a new race where the producer stops the
> queue just as the consumer is finishing its poll, ensuring the wakeup is not
> missed.
[...]
> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
> if (done < budget && napi_complete_done(napi, done)) {
> /* Write rx_notify_masked before reading ptr_ring */
> smp_store_mb(rq->rx_notify_masked, false);
> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
> if (napi_schedule_prep(&rq->xdp_napi)) {
> WRITE_ONCE(rq->rx_notify_masked, true);
> __napi_schedule(&rq->xdp_napi);
Double checking that I'm reading the code correctly. The above is supposed to
trigger when something like the following happens
[producer]                           [consumer]
                                     veth_poll()
                                       [ring empty]
veth_xmit
  veth_forward_skb
    [NETDEV_TX_BUSY]
                                     napi_complete_done()
netif_tx_stop_queue
__veth_xdp_flush()
  rq->rx_notify_masked == true
                                     WRITE_ONCE(rq->rx_notify_masked,
                                                false);
?
I think the above can't happen, the producer should need to fill the
whole ring in-between the ring check and napi_complete_done().
Am I misreading it?
/P
* Re: [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-10-29 15:00 ` Toshiaki Makita
@ 2025-10-30 19:06 ` Jesper Dangaard Brouer
2025-11-03 8:41 ` Toshiaki Makita
0 siblings, 1 reply; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-30 19:06 UTC (permalink / raw)
To: Toshiaki Makita
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, Michael S. Tsirkin, makita.toshiaki, bpf,
linux-kernel, linux-arm-kernel, kernel-team, netdev,
Toke Høiland-Jørgensen
On 29/10/2025 16.00, Toshiaki Makita wrote:
> On 2025/10/29 19:33, Jesper Dangaard Brouer wrote:
>> On 28/10/2025 15.56, Toshiaki Makita wrote:
>>> On 2025/10/28 5:05, Jesper Dangaard Brouer wrote:
>>>> (3) Finally, the NAPI completion check in veth_poll() is updated. If
>>>> NAPI is
>>>> about to complete (napi_complete_done), it now also checks if the
>>>> peer TXQ
>>>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>>>> reschedule itself. This prevents a new race where the producer stops
>>>> the
>>>> queue just as the consumer is finishing its poll, ensuring the
>>>> wakeup is not missed.
>>> ...
>>>
>>>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi,
>>>> int budget)
>>>> if (done < budget && napi_complete_done(napi, done)) {
>>>> /* Write rx_notify_masked before reading ptr_ring */
>>>> smp_store_mb(rq->rx_notify_masked, false);
>>>> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>>>> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>>>> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>>>
>>> Not sure if this is necessary.
>>
>> How sure are you that this isn't necessary?
>>
>>> From commitlog, your intention seems to be making sure to wake up
>>> the queue,
>>> but you wake up the queue immediately after this hunk in the same
>>> function,
>>> so isn't it guaranteed without scheduling another napi?
>>>
>>
>> The above code catches the case, where the ptr_ring is empty and the
>> tx_queue is stopped. It feels wrong not to reach in this case, but you
>> *might* be right that it isn't strictly necessary, because below code
>> will also call netif_tx_wake_queue() which *should* have a SKB stored
>> that will *indirectly* trigger a restart of the NAPI.
>
> I'm a bit confused.
> Wrt (3), what you want is waking up the queue, right?
> Or, what you want is actually NAPI reschedule itself?
I want NAPI to reschedule itself; the queue is woken up later, close to
the exit of the function. Maybe it is unnecessary for NAPI to
reschedule itself here... and that is what you are objecting to?
> My understanding was the former (wake up the queue).
> If it's correct, (3) seems not necessary because you have already woken
> up the queue in the same function.
>
> First NAPI
> veth_poll()
> // ptr_ring_empty() and queue_stopped()
> __napi_schedule() ... schedule second NAPI
> netif_tx_wake_queue() ... wake up the queue if queue_stopped()
>
> Second NAPI
> veth_poll()
> netif_tx_wake_queue() ... this is what you want,
> but the queue has been woken up in the
> first NAPI
> What's the point?
>
So, yes I agree that there is a potential for restarting NAPI one time
too many. But only *potential* because if NAPI is already/still running
then the producer will not actually start NAPI.
I guess this is a kind of optimization, to avoid the time it takes to
restart NAPI. When we see that TXQ is stopped and ptr_ring is empty,
then we know that a packet will be sitting in the qdisc requeue queue,
and netif_tx_wake_queue() will very soon fill "produce" a packet into
ptr_ring (via calling ndo_start_xmit/veth_xmit).
As this is a fixes patch I can drop this optimization. It seems both
Paolo and you think this isn't necessary.
--Jesper
* Re: [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-10-30 19:06 ` Jesper Dangaard Brouer
@ 2025-11-03 8:41 ` Toshiaki Makita
0 siblings, 0 replies; 12+ messages in thread
From: Toshiaki Makita @ 2025-11-03 8:41 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
ihor.solodrai, Michael S. Tsirkin, makita.toshiaki, bpf,
linux-kernel, linux-arm-kernel, kernel-team, netdev,
Toke Høiland-Jørgensen
On 2025/10/31 4:06, Jesper Dangaard Brouer wrote:
> On 29/10/2025 16.00, Toshiaki Makita wrote:
>> On 2025/10/29 19:33, Jesper Dangaard Brouer wrote:
>>> On 28/10/2025 15.56, Toshiaki Makita wrote:
>>>> On 2025/10/28 5:05, Jesper Dangaard Brouer wrote:
>>>>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>>>>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>>>>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>>>>> reschedule itself. This prevents a new race where the producer stops the
>>>>> queue just as the consumer is finishing its poll, ensuring the wakeup is not
>>>>> missed.
>>>> ...
>>>>
>>>>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>>>>> if (done < budget && napi_complete_done(napi, done)) {
>>>>> /* Write rx_notify_masked before reading ptr_ring */
>>>>> smp_store_mb(rq->rx_notify_masked, false);
>>>>> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>>>>> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>>>>> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>>>>
>>>> Not sure if this is necessary.
>>>
>>> How sure are you that this isn't necessary?
>>>
>>>> From commitlog, your intention seems to be making sure to wake up the queue,
>>>> but you wake up the queue immediately after this hunk in the same function,
>>>> so isn't it guaranteed without scheduling another napi?
>>>>
>>>
>>> The above code catches the case, where the ptr_ring is empty and the
>>> tx_queue is stopped. It feels wrong not to reach in this case, but you
>>> *might* be right that it isn't strictly necessary, because below code
>>> will also call netif_tx_wake_queue() which *should* have a SKB stored
>>> that will *indirectly* trigger a restart of the NAPI.
>>
>> I'm a bit confused.
>> Wrt (3), what you want is waking up the queue, right?
>> Or, what you want is actually NAPI reschedule itself?
>
> I want NAPI to reschedule itself; the queue is woken up later, close to
> the exit of the function. Maybe it is unnecessary for NAPI to
> reschedule itself here... and that is what you are objecting to?
>
>> My understanding was the former (wake up the queue).
>> If it's correct, (3) seems not necessary because you have already woken up the
>> queue in the same function.
>>
>> First NAPI
>> veth_poll()
>> // ptr_ring_empty() and queue_stopped()
>> __napi_schedule() ... schedule second NAPI
>> netif_tx_wake_queue() ... wake up the queue if queue_stopped()
>>
>> Second NAPI
>> veth_poll()
>> netif_tx_wake_queue() ... this is what you want,
>> but the queue has been woken up in the first NAPI
>> What's the point?
>>
>
> So, yes I agree that there is a potential for restarting NAPI one time
> too many. But only *potential* because if NAPI is already/still running
> then the producer will not actually start NAPI.
>
> I guess this is a kind of optimization, to avoid the time it takes to
> restart NAPI. When we see that TXQ is stopped and ptr_ring is empty,
> then we know that a packet will be sitting in the qdisc requeue queue,
> and netif_tx_wake_queue() will very soon fill "produce" a packet into
> ptr_ring (via calling ndo_start_xmit/veth_xmit).
In some cases it may be an optimization, but not in every case, because it
can prematurely start NAPI before the tx side has filled in packets?
> As this is a fixes patch I can drop this optimization. It seems both
> Paolo and you think this isn't necessary.
I think it's better to drop (3) as a fix.
Toshiaki Makita
* Re: [PATCH net V2 2/2] veth: more robust handling of race to avoid txq getting stuck
2025-10-30 12:28 ` Paolo Abeni
@ 2025-11-05 15:54 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-05 15:54 UTC (permalink / raw)
To: Paolo Abeni, netdev, Toke Høiland-Jørgensen
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, ihor.solodrai,
Michael S. Tsirkin, makita.toshiaki, toshiaki.makita1, bpf,
linux-kernel, linux-arm-kernel, kernel-team
On 30/10/2025 13.28, Paolo Abeni wrote:
> On 10/27/25 9:05 PM, Jesper Dangaard Brouer wrote:
>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>> reschedule itself. This prevents a new race where the producer stops the
>> queue just as the consumer is finishing its poll, ensuring the wakeup is not
>> missed.
>
> [...]
>
>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>> if (done < budget && napi_complete_done(napi, done)) {
>> /* Write rx_notify_masked before reading ptr_ring */
>> smp_store_mb(rq->rx_notify_masked, false);
>> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>> if (napi_schedule_prep(&rq->xdp_napi)) {
>> WRITE_ONCE(rq->rx_notify_masked, true);
>> __napi_schedule(&rq->xdp_napi);
>
> Double checking that I'm reading the code correctly. The above is supposed to
> trigger when something like the following happens
>
> [producer]                           [consumer]
>                                      veth_poll()
>                                        [ring empty]
> veth_xmit
>   veth_forward_skb
>     [NETDEV_TX_BUSY]
>                                      napi_complete_done()
>
> netif_tx_stop_queue
> __veth_xdp_flush()
>   rq->rx_notify_masked == true
>                                      WRITE_ONCE(rq->rx_notify_masked,
>                                                 false);
>
> ?
>
> I think the above can't happen, the producer should need to fill the
> whole ring in-between the ring check and napi_complete_done().
The race I can see is slightly different. It is centered around the
consumer managing to empty the ring after [NETDEV_TX_BUSY].
We have 256 packets in the queue, and I observe a NAPI packet processing time
of 7.64 usec per packet on a given ARM64 bare-metal system. This means it takes
1956 usec, or 1.96 ms, to empty the queue, which is the window during which the
race below can occur (during "(something interrupts)").
It would look like this:
[producer]                           [consumer]
                                     veth_poll() - already running
veth_xmit
  veth_forward_skb
    [ring full]
    [NETDEV_TX_BUSY]
(something interrupts)
                                     veth_poll()
                                       manage to [empty ring]
                                       napi_complete_done()
netif_tx_stop_queue
__veth_xdp_flush()
  - No effect of flush as:
  - rq->rx_notify_masked == true
                                     WRITE_ONCE(rq->rx_notify_masked, false)
                                     [empty ring] don't restart NAPI
                                     Observe netif_tx_queue_stopped == true
Notice: at the end, the consumer does observe that netif_tx_queue_stopped() is
true. This is leveraged in the patch by moving the netif_tx_queue_stopped()
check to the end of veth_poll(). The check now happens after rx_notify_masked
is changed to false, which is the race fix.
Other cases, where veth_poll() stops NAPI and exits, are recovered by
__veth_xdp_flush() on the producer side.
--Jesper