* [PATCH net-next RFC V3 0/2] veth: qdisc backpressure and qdisc check refactor
@ 2025-04-14 15:45 Jesper Dangaard Brouer
2025-04-14 15:45 ` [PATCH net-next RFC V3 1/2] net: sched: generalize check for no-op qdisc on TX queue Jesper Dangaard Brouer
2025-04-14 15:45 ` [PATCH net-next RFC V3 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
0 siblings, 2 replies; 3+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-14 15:45 UTC
To: netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, Toke Høiland-Jørgensen, dsahern,
makita.toshiaki, kernel-team
RFC: The refactor is incorrect - I need help/input from qdisc experts
This patch series addresses TX drops seen on veth devices under load,
particularly when using threaded NAPI, which is our setup in production.
The root cause is that the NAPI consumer often runs on a different CPU
than the producer. Combined with scheduling delays or simply slower
consumption, this increases the chance that the ptr_ring fills up before
packets are drained, resulting in drops from veth_xmit() (ndo_start_xmit()).
To make this easier to reproduce, we’ve created a script that sets up a
test scenario using network namespaces. The script inserts 1000 iptables
rules in the consumer namespace to slow down packet processing and
amplify the issue. Reproducer script:
https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_setup01_NAPI_TX_drops.sh
This series first introduces a helper to detect no-op qdiscs and then
uses it in the veth driver to conditionally apply qdisc-level
backpressure when a real qdisc is attached. The behavior is off by
default and opt-in, ensuring minimal impact and easy activation.
---
V3:
- Reorder patches, generalize check for no-op qdisc as first patch
- RFC: As testing shows, this is incorrect
- Use rcu_dereference(priv->peer) in veth_xdp_rcv(), as this runs in NAPI
  context where rcu_read_lock() is implicit.
- Link to V2: https://lore.kernel.org/all/174412623473.3702169.4235683143719614624.stgit@firesoul/
V2:
- Generalize check for no-op qdisc
- Link to RFC-V1: https://lore.kernel.org/all/174377814192.3376479.16481605648460889310.stgit@firesoul/
Jesper Dangaard Brouer (2):
net: sched: generalize check for no-op qdisc on TX queue
veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
drivers/net/veth.c | 49 ++++++++++++++++++++++++++++++++-------
drivers/net/vrf.c | 4 +---
include/net/sch_generic.h | 7 +++++-
3 files changed, 48 insertions(+), 12 deletions(-)
--
* [PATCH net-next RFC V3 1/2] net: sched: generalize check for no-op qdisc on TX queue
From: Jesper Dangaard Brouer @ 2025-04-14 15:45 UTC
To: netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, Toke Høiland-Jørgensen, dsahern,
makita.toshiaki, kernel-team
WARNING: Testing shows this is NOT correct!
- I need help figuring out why this patch is incorrect.
- Testing against &noop_qdisc is not the same as q->enqueue == NULL
- I copied the test (txq->qdisc == &noop_qdisc) from qdisc_tx_is_noop()
- Q: is the qdisc_tx_is_noop() function itself incorrect?
The vrf driver includes an open-coded check to determine whether a TX
queue (netdev_queue) has a real qdisc attached. This is done by testing
whether qdisc->enqueue is NULL, which was assumed to be functionally
equivalent to checking whether the qdisc is &noop_qdisc; testing (see
the warning above) indicates the two checks are not equivalent.
This equivalence stems from noqueue_init(), which explicitly clears the
enqueue pointer to signal no-op behavior to __dev_queue_xmit().
This patch introduces a shared helper, qdisc_txq_is_noop(), to clarify
intent and make this logic reusable. The vrf driver is updated to use
this new helper.
Subsequent patches will make further use of this helper in other drivers,
such as veth.
This is a non-functional change.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/vrf.c | 4 +---
include/net/sch_generic.h | 7 ++++++-
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 7168b33adadb..f0a24fc85945 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -343,15 +343,13 @@ static int vrf_ifindex_lookup_by_table_id(struct net *net, u32 table_id)
static bool qdisc_tx_is_default(const struct net_device *dev)
{
struct netdev_queue *txq;
- struct Qdisc *qdisc;
if (dev->num_tx_queues > 1)
return false;
txq = netdev_get_tx_queue(dev, 0);
- qdisc = rcu_access_pointer(txq->qdisc);
- return !qdisc->enqueue;
+ return qdisc_txq_is_noop(txq);
}
/* Local traffic destined to local address. Reinsert the packet to rx
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d48c657191cd..a1f5560350a6 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -803,6 +803,11 @@ static inline bool qdisc_tx_changing(const struct net_device *dev)
return false;
}
+static inline bool qdisc_txq_is_noop(const struct netdev_queue *txq)
+{
+ return rcu_access_pointer(txq->qdisc) == &noop_qdisc;
+}
+
/* Is the device using the noop qdisc on all queues? */
static inline bool qdisc_tx_is_noop(const struct net_device *dev)
{
@@ -810,7 +815,7 @@ static inline bool qdisc_tx_is_noop(const struct net_device *dev)
for (i = 0; i < dev->num_tx_queues; i++) {
struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
- if (rcu_access_pointer(txq->qdisc) != &noop_qdisc)
+ if (!qdisc_txq_is_noop(txq))
return false;
}
return true;
* [PATCH net-next RFC V3 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
From: Jesper Dangaard Brouer @ 2025-04-14 15:45 UTC
To: netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, Toke Høiland-Jørgensen, dsahern,
makita.toshiaki, kernel-team
In production, we're seeing TX drops on veth devices when the ptr_ring
fills up. This can occur when NAPI mode is enabled, though it's
relatively rare. However, with threaded NAPI - which we use in
production - the drops become significantly more frequent.
The underlying issue is that with threaded NAPI, the consumer often runs
on a different CPU than the producer. This increases the likelihood of
the ring filling up before the consumer gets scheduled, especially under
load, leading to drops in veth_xmit() (ndo_start_xmit()).
This patch introduces backpressure by returning NETDEV_TX_BUSY when the
ring is full, signaling the qdisc layer to requeue the packet. The txq
(netdev queue) is stopped in this condition and restarted once
veth_poll() drains entries from the ring, ensuring coordination between
NAPI and qdisc.
Backpressure is only enabled when a qdisc is attached. Without a qdisc,
the driver retains its original behavior - dropping packets immediately
when the ring is full. This avoids unexpected behavior changes in setups
without a configured qdisc.
With a qdisc in place (e.g. fq, sfq), this allows Active Queue
Management (AQM) to schedule packets fairly across flows and reduce the
collateral damage caused by elephant flows.
A known limitation of this approach is that the full ring sits in front
of the qdisc layer, effectively forming a FIFO buffer that introduces
base latency. While AQM still improves fairness and mitigates flow
dominance, the latency impact is measurable.
In hardware drivers, this issue is typically addressed using BQL (Byte
Queue Limits), which limits the number of in-flight bytes based on the
physical link rate. However, for virtual drivers like veth, there is no
fixed bandwidth constraint - the bottleneck is CPU availability and the
scheduler's ability to run the NAPI thread. It is unclear how effective
BQL would be in this context.
This patch serves as a first step toward addressing TX drops. Future work
may explore adapting a BQL-like mechanism to better suit virtual devices
like veth.
Reported-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 41 insertions(+), 8 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7bb53961c0ea..3455fca2f2af 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -308,11 +308,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
{
if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
- dev_kfree_skb_any(skb);
- return NET_RX_DROP;
+ return NETDEV_TX_BUSY; /* signal qdisc layer */
}
- return NET_RX_SUCCESS;
+ return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
}
static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
@@ -346,11 +345,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct veth_rq *rq = NULL;
- int ret = NETDEV_TX_OK;
+ struct netdev_queue *txq;
struct net_device *rcv;
int length = skb->len;
bool use_napi = false;
- int rxq;
+ int ret, rxq;
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -373,17 +372,41 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
skb_tx_timestamp(skb);
- if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
+
+ ret = veth_forward_skb(rcv, skb, rq, use_napi);
+ switch (ret) {
+ case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
if (!use_napi)
dev_sw_netstats_tx_add(dev, 1, length);
else
__veth_xdp_flush(rq);
- } else {
+ break;
+ case NETDEV_TX_BUSY:
+ /* If a qdisc is attached to our virtual device, returning
+ * NETDEV_TX_BUSY is allowed.
+ */
+ txq = netdev_get_tx_queue(dev, rxq);
+
+ if (qdisc_txq_is_noop(txq)) {
+ dev_kfree_skb_any(skb);
+ goto drop;
+ }
+ netif_tx_stop_queue(txq);
+ /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
+ __skb_push(skb, ETH_HLEN);
+ if (use_napi)
+ __veth_xdp_flush(rq);
+
+ break;
+ case NET_RX_DROP: /* same as NET_XMIT_DROP */
drop:
atomic64_inc(&priv->dropped);
ret = NET_XMIT_DROP;
+ break;
+ default:
+ net_crit_ratelimited("veth_xmit(%s): invalid return code (%d)\n",
+ dev->name, ret);
}
-
rcu_read_unlock();
return ret;
@@ -874,9 +897,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
struct veth_stats *stats)
{
+ struct veth_priv *priv = netdev_priv(rq->dev);
+ int queue_idx = rq->xdp_rxq.queue_index;
+ struct netdev_queue *peer_txq;
+ struct net_device *peer_dev;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
+ peer_dev = rcu_dereference(priv->peer);
+ peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
+
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -925,6 +955,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
+ if (unlikely(netif_tx_queue_stopped(peer_txq)))
+ netif_tx_wake_queue(peer_txq);
+
return done;
}