[PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support

BPF List

* [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support
@ 2026-06-12  8:35 hawk
  2026-06-12  8:35 ` [PATCH net-next v7 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: hawk @ 2026-06-12  8:35 UTC
  To: netdev
  Cc: kernel-team, simon.schippers, Jesper Dangaard Brouer,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Chris Arges, Mike Freemon,
	Toke Høiland-Jørgensen, Jonas Köppeler,
	Breno Leitao, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, bpf

From: Jesper Dangaard Brouer <hawk@kernel.org>

This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.

Problem: veth's 256-entry ptr_ring acts as a "dark buffer" invisible
to the qdisc's AQM.  Under load the ring fills, adding up to 256
packets of unmanaged latency before the qdisc sees congestion.

Solution: BQL stops the queue before the ring fills, pushing excess
packets into the qdisc where sojourn-based AQM can drop them.
Time-based completion coalescing (ethtool tx-usecs, default 100 us)
lets DQL converge on a limit that bounds actual queuing delay rather
than oscillating at limit=2 with per-packet completion.

Test setup: veth pair, UDP flood, 13000 iptables rules in consumer
namespace (slows NAPI-64 cycle to ~6-7 ms).  Ping measures RTT.

                   BQL off                    BQL on
  fq_codel:  RTT ~22 ms, 4% loss        RTT ~1.3 ms, 0% loss
  sfq:       RTT ~24 ms, 0% loss        RTT ~1.5 ms, 0% loss

BQL reduces ping RTT by ~17x for both qdiscs.  Consumer throughput
unchanged.

Why BQL over configurable ring size (ethtool -G): BQL auto-tunes per
workload via DQL; a static ring size requires per-setup manual tuning
and a too-small ring drops XDP packets in batches of 16-64.

Selftests: https://github.com/netoptimizer/veth-backpressure-performance-testing

Background:
  Mike Freemon reported the veth dark buffer problem internally at
  Cloudflare and showed that recompiling with ptr_ring size 30 made
  fq_codel work dramatically better -- motivating a dynamic BQL
  solution.  Chris Arges wrote the reproducer.  Jonas Koeppeler and
  Simon Schippers provided extensive testing and code review.

  During BQL development we also fixed an unrelated 12-year-old CoDel
  bug (stale first_above_time in empty flows), see
  commit 815980fe6dbb ("net_sched: codel: fix stale state for empty flows in fq_codel").
  BQL remains valuable independently.

Patch overview:
  1. net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  2. veth: implement Byte Queue Limits (BQL) for latency reduction
  3. veth: add tx_timeout watchdog as BQL safety net
  4. net: sched: add timeout count to NETDEV WATCHDOG message
  5. veth: time-based BQL completion coalescing via ethtool tx-usecs

Jesper Dangaard Brouer (4):
  net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  veth: implement Byte Queue Limits (BQL) for latency reduction
  veth: add tx_timeout watchdog as BQL safety net
  net: sched: add timeout count to NETDEV WATCHDOG message

Simon Schippers (1):
  veth: time-based BQL completion coalescing via ethtool tx-usecs

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Chris Arges <carges@cloudflare.com>
Cc: Mike Freemon <mfreemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Cc: Breno Leitao <leitao@debian.org>
Cc: Simon Schippers <simon.schippers@tu-dortmund.de>
Cc: kernel-team@cloudflare.com

---
Changes since V6:
  - Patch 2 (teardown): rework veth_napi_del_range() to drain the ptr_ring
    and balance the peer txq's BQL/DQL by completing the outstanding charges
    via netdev_tx_completed_queue(), instead of netdev_tx_reset_queue().
    dql_reset() races with a concurrent producer, whereas completion is the
    normal single-completer path and is safe once NAPI is gone
    (synchronize_net()) and the producer has stopped charging BQL (it
    observes rq->napi == NULL).  Fixes races reported by sashiko.  The
    peer txq is still woken to clear any leaked DRV_XOFF.
  - Patch 2: document the lltx/DQL locking model at the BQL charge site --
    veth is lltx so the stack skips HARD_TX_LOCK; the ring producer_lock is
    the single-producer lock for dql_queued() (1:1 with the peer txq) and
    the peer NAPI in veth_xdp_rcv() is the single completer.
  - Patch 3 (watchdog): add a VETH_WATCHDOG_TIMEOUT_MS #define for the
    tx_timeout value instead of an inline magic number, and expand the
    math behind the value (64-packet NAPI budget * 250 ms/pkt = 16 s)
    in a comment (Paolo).
  - Patches 2+5: add Co-developed-by for Jonas Koeppeler.
  - Patch 5 (coalescing): flush when n_bql exceeds dql.limit to handle
    BQL starvation.  Removes the empty-ring STACK_XOFF check (and its
    smp_rmb) -- the dql.limit comparison handles it more directly.
  - Patch 5: at teardown, also complete the coalesced bql_state.n_bql
    pending from the last NAPI poll, on top of the patch 2 ring drain.
  - Patch 5: mirror tx_coal_usecs onto the peer in veth_set_coalesce() so
    a veth pair always coalesces symmetrically; the completion path reads
    its own device's value.
  - Patch 5: use WRITE_ONCE/READ_ONCE for tx_coal_usecs (Paolo).
  - Patch 5: init bql_state before napi_enable() to avoid race (sashiko).
  - Patch 5: tx-usecs=0 handled as normal path, no special case.
  - Patch 5: n_bql type changed to uint; explicit if/++ instead of
    implicit bool-to-int addition.
  - Patch 5: call veth_bql_maybe_complete() on every iteration for
    accurate completion intervals with mixed XDP/non-XDP packets.
  - Patch 5: reject tx-usecs above half the tx_timeout watchdog in
    veth_set_coalesce() (return -ERANGE) -- the coalescing window delays
    BQL completions, so an over-large value could trip a false watchdog;
    half the watchdog leaves a generous margin.
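
The tx-usecs cap in the last item can be checked with a few lines of
user-space C.  This is only a sketch of the arithmetic: check_tx_usecs()
and the SKETCH_* macro names are hypothetical, and only the numbers
(64 * 250 ms watchdog, half of it as the cap, -ERANGE on violation) come
from the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Watchdog value from patch 3: 64-packet NAPI budget * 250 ms/pkt = 16 s. */
#define SKETCH_WATCHDOG_TIMEOUT_MS	(64 * 250)
#define SKETCH_USEC_PER_MSEC		1000u

/* Half the watchdog, converted from milliseconds to microseconds. */
#define SKETCH_MAX_TX_USECS \
	(SKETCH_WATCHDOG_TIMEOUT_MS / 2 * SKETCH_USEC_PER_MSEC)

/* 16 s / 2 = 8 s = 8,000,000 us */
static_assert(SKETCH_MAX_TX_USECS == 8000000u, "cap is 8 s expressed in us");

/* Hypothetical helper: 0 if tx_usecs is acceptable, -ERANGE otherwise. */
static int check_tx_usecs(uint32_t tx_usecs)
{
	return tx_usecs > SKETCH_MAX_TX_USECS ? -ERANGE : 0;
}
```

With these values the default of 100 us is far below the cap, and a
value exactly equal to the cap is still accepted because the comparison
is strictly greater-than.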

Prior versions:
  V6: https://lore.kernel.org/all/20260527135418.1166665-1-hawk@kernel.org/
  V5: https://lore.kernel.org/all/20260505132159.241305-1-hawk@kernel.org/
  V4: https://lore.kernel.org/all/20260501071633.644353-1-hawk@kernel.org/
  V3: https://lore.kernel.org/all/20260429172036.1028526-1-hawk@kernel.org/
  V2: https://lore.kernel.org/all/20260413094442.1376022-1-hawk@kernel.org/
  V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/

Changes since V5:
  - Drop patch 1 (OOB txq fix) -- applied to net tree by Paolo as
    08f566e8f83b, already in net-next.
  - New patch 5: time-based BQL completion coalescing via ethtool
    tx-usecs (Simon Schippers).  Resolves the throughput regression
    Paolo flagged on V5: per-packet completion forced DQL to limit=2,
    causing cache-line bouncing between producer/consumer CPUs.
    Coalescing batches completions on a configurable time threshold
    (default 100 us), letting DQL discover a higher useful limit.
  - Patch 2: use __ptr_ring_check_produce() instead of open-coded
    ring-full check (Simon Schippers nit).

Changes since V4:
  - New patch 1: fix OOB txq access in veth_poll() when veth peers have
    asymmetric RX/TX queue counts.  XDP redirect can deliver frames to
    an RX queue index that exceeds the peer's TX queue count, causing
    an out-of-bounds netdev_get_tx_queue() access.  Found by sashiko-bot.
  - Patch 3 (veth BQL): wake stopped peer txqs in veth_napi_del_range()
    to clear DRV_XOFF after NAPI teardown.  A concurrent veth_xmit()
    can set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
    synchronize_net(); with NAPI gone, no veth_poll() clears it.
    Guarded by netif_running() to skip during device close.

Changes since V3:
  - Drop selftest patch (patch 5 from V3) per maintainer request.
  - Rebase on latest net-next.

Changes since V2:
  - Patch 2 (veth BQL): fix syzbot WARNING in veth_napi_del_range():
    clamp BQL reset loop to peer's real_num_tx_queues.  The loop was
    iterating dev->real_num_rx_queues but indexing peer's txq[], which
    goes out of bounds when the peer has fewer TX queues (e.g. veth
    enslaved to a bond with XDP attached).

Changes since V1:
  - Patch 1 (dev->bql flag): add kdoc entry for @bql in struct net_device.
  - Patch 2 (veth BQL): charge fixed VETH_BQL_UNIT (1) per packet instead
    of skb->len.  veth has no link speed; the ptr_ring is packet-indexed.
    Byte-based charging lets small packets sneak many entries into the ring.
    Testing: min-size packet flood causes 3.7x ping RTT degradation with
    skb->len vs no change with fixed-unit charging.
  - Patch 3 (tx_timeout watchdog): fix race with peer NAPI: replace
    netdev_tx_reset_queue() with clear_bit(STACK_XOFF) + netif_tx_wake_queue()
    to avoid dql_reset() racing with concurrent dql_completed().
  - Cover letter: update CoDel fix reference to merged commit in net tree.

 .../networking/net_cachelines/net_device.rst  |   1 +
 drivers/net/veth.c                            | 267 +++++++++++++++++-
 include/linux/netdevice.h                     |   2 +
 net/core/net-sysfs.c                          |   8 +-
 net/sched/sch_generic.c                       |   6 +-
 5 files changed, 270 insertions(+), 14 deletions(-)

-- 
2.43.0



* [PATCH net-next v7 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-06-12  8:35 [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support hawk
@ 2026-06-12  8:35 ` hawk
  2026-06-12  8:35 ` [PATCH net-next v7 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
  2026-06-12 14:10 ` [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support Simon Schippers
  2 siblings, 0 replies; 5+ messages in thread
From: hawk @ 2026-06-12  8:35 UTC
  To: netdev
  Cc: kernel-team, simon.schippers, Jesper Dangaard Brouer,
	Jonas Köppeler, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, linux-kernel, bpf

From: Jesper Dangaard Brouer <hawk@kernel.org>

Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") gave qdiscs control over veth by returning
NETDEV_TX_BUSY when the ptr_ring is full (DRV_XOFF).  That commit noted
a known limitation: the 256-entry ptr_ring sits in front of the qdisc as
a dark buffer, adding base latency because the qdisc has no visibility
into how many bytes are already queued there.

Add BQL support so the qdisc gets feedback and can begin shaping traffic
before the ring fills.  In testing with fq_codel, BQL reduces ping RTT
under UDP load from ~6.61ms to ~0.36ms (18x).

Charge a fixed VETH_BQL_UNIT (1) per packet rather than skb->len, so
the DQL limit tracks packets-in-flight.  Unlike a physical NIC, veth
has no link speed -- the ptr_ring drains at CPU speed and is
packet-indexed, not byte-indexed, so bytes are not the natural unit.
With byte-based charging, small packets sneak many more entries into
the ring before STACK_XOFF fires, deepening the dark buffer under
mixed-size workloads.  Testing with a concurrent min-size packet flood
shows 3.7x ping RTT degradation with skb->len charging versus no
change with fixed-unit charging.
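
To illustrate why the charge unit matters, here is a simplified
user-space model (not driver code).  admitted_before_stop() is a
hypothetical helper, and the 15000-byte limit and the packet sizes are
made-up illustration values, not measurements.  The model only assumes
that the queue stops once the in-flight charge exceeds the limit.

```c
#include <assert.h>

/* Count how many packets are admitted before the in-flight charge
 * exceeds the limit.  'charge' is what each packet adds: its length
 * for byte-based charging, or 1 for fixed-unit charging.
 */
static unsigned int admitted_before_stop(unsigned int limit,
					 unsigned int charge)
{
	unsigned int inflight = 0;
	unsigned int n = 0;

	assert(charge > 0);	/* a zero charge would never stop the queue */

	while (inflight <= limit) {
		inflight += charge;
		n++;
	}

	return n;
}
```

In this model a 15000-byte limit admits 11 packets of 1500 bytes but 235
packets of 64 bytes, which is most of a 256-entry ring.  With a unit
charge and a limit of 10, exactly 11 packets are admitted whatever their
size.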

Charge BQL inside veth_xdp_rx() under the ptr_ring producer_lock, after
confirming the ring is not full.  The charge must precede the produce
because the NAPI consumer can run on another CPU and complete the SKB
the instant it becomes visible in the ring.  Doing both under the same
lock avoids a pre-charge/undo pattern -- BQL is only charged when
produce is guaranteed to succeed.
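
The check/charge/produce ordering can be sketched in user space as
follows.  Everything here is a hypothetical stand-in: struct ring,
ring_produce_charged() and the 'charged' counter are not kernel APIs,
an atomic_flag spinlock plays the role of the producer_lock, and a NULL
slot marks a free entry.  The sketch is exercised single-threaded; the
memory ordering a concurrent consumer would need is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_RING_SIZE 4

struct ring {
	atomic_flag producer_lock;	/* stand-in for ring->producer_lock */
	void *slot[SKETCH_RING_SIZE];	/* NULL means the slot is free */
	unsigned int head;		/* next slot to produce into */
	unsigned int charged;		/* stand-in for the DQL queued count */
};

static void ring_lock(struct ring *r)
{
	while (atomic_flag_test_and_set_explicit(&r->producer_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void ring_unlock(struct ring *r)
{
	atomic_flag_clear_explicit(&r->producer_lock, memory_order_release);
}

/* Returns true on success, false when the ring is full (the driver
 * would return NETDEV_TX_BUSY in that case).
 */
static bool ring_produce_charged(struct ring *r, void *item, bool do_bql)
{
	bool ok = false;

	assert(item != NULL);	/* NULL is reserved for "free slot" */

	ring_lock(r);
	if (r->slot[r->head] == NULL) {
		/* Slot is free, so the produce below cannot fail:
		 * charge first, then make the entry visible.
		 */
		if (do_bql)
			r->charged++;
		r->slot[r->head] = item;
		r->head = (r->head + 1) % SKETCH_RING_SIZE;
		ok = true;
	}
	ring_unlock(r);

	return ok;
}
```

Because the full check and the charge happen under the same lock, a
rejected packet never touches the counter, so no undo path is needed.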

BQL is enabled only when a real qdisc is attached (guarded by
!qdisc_txq_has_no_queue()).  Normally HARD_TX_LOCK serializes txq
modifications such as dql_queued(), but veth is lltx, so the stack does
not take that lock.  The ptr_ring producer_lock provides the needed
serialization instead, and would allow BQL to work correctly even with
noqueue; that combination is not currently enabled, as the netstack
will drop and warn.

Track per-SKB BQL state via a VETH_BQL_FLAG pointer tag in the ptr_ring
entry.  This is necessary because the qdisc can be replaced live while
SKBs are in-flight -- each SKB must carry the charge decision made at
enqueue time rather than re-checking the peer's qdisc at completion.
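
A minimal user-space sketch of the pointer-tag idea, for illustration
only.  struct pkt, pkt_to_ptr(), ptr_is_bql() and ptr_to_pkt() are
hypothetical stand-ins for the driver helpers in the diff below; the
sketch assumes the tagged object is at least 4-byte aligned so that
bits 0 and 1 of its address are free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits mirror the patch: bit 0 marks XDP frames, bit 1 marks BQL. */
#define SKETCH_XDP_FLAG	((uintptr_t)1 << 0)
#define SKETCH_BQL_FLAG	((uintptr_t)1 << 1)

struct pkt {			/* stand-in for struct sk_buff */
	int len;
};

/* Record the BQL decision made at enqueue time in the pointer itself. */
static void *pkt_to_ptr(struct pkt *p, bool bql)
{
	/* Both low bits must be clear, otherwise tagging is ambiguous. */
	assert(((uintptr_t)p & (SKETCH_XDP_FLAG | SKETCH_BQL_FLAG)) == 0);

	return bql ? (void *)((uintptr_t)p | SKETCH_BQL_FLAG) : (void *)p;
}

/* Was this entry charged to BQL when it was enqueued? */
static bool ptr_is_bql(const void *ptr)
{
	return ((uintptr_t)ptr & SKETCH_BQL_FLAG) != 0;
}

/* Strip the tag to recover the original packet pointer. */
static struct pkt *ptr_to_pkt(void *ptr)
{
	return (struct pkt *)((uintptr_t)ptr & ~SKETCH_BQL_FLAG);
}
```

The consumer reads the tag from the ring entry, so it never needs to
look at the qdisc that is attached at completion time.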

Complete per-SKB in veth_xdp_rcv() rather than in bulk, so STACK_XOFF
clears promptly when producer and consumer run on different CPUs.

BQL introduces a second independent queue-stop mechanism (STACK_XOFF)
alongside the existing DRV_XOFF (ring full).  Both must be clear for the
queue to transmit.  At teardown, veth_napi_del_range() drains the
leftover ring entries after synchronize_net() -- once NAPI is gone and
the producer has stopped charging BQL (it observes rq->napi == NULL).
Rather than netdev_tx_reset_queue(), which calls dql_reset() and races
with a concurrent producer, balance the DQL accounting by completing the
outstanding charges via netdev_tx_completed_queue().  The peer txq is
still woken to clear any DRV_XOFF a late veth_xmit() may have set.  Clamp
the loop to the peer's num_tx_queues, since the peer may have fewer TX
queues than the local device has RX queues (e.g. veth enslaved to a bond
with XDP attached).

Co-developed-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 drivers/net/veth.c | 131 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 120 insertions(+), 11 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0cfb19b760dd..a3505627f49e 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -34,9 +34,13 @@
 #define DRV_VERSION	"1.0"
 
 #define VETH_XDP_FLAG		BIT(0)
+#define VETH_BQL_FLAG		BIT(1)
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Fixed BQL charge: DQL limit tracks packets-in-flight, not bytes */
+#define VETH_BQL_UNIT		1
+
 #define VETH_XDP_TX_BULK_SIZE	16
 #define VETH_XDP_BATCH		16
 
@@ -280,6 +284,21 @@ static bool veth_is_xdp_frame(void *ptr)
 	return (unsigned long)ptr & VETH_XDP_FLAG;
 }
 
+static bool veth_ptr_is_bql(void *ptr)
+{
+	return (unsigned long)ptr & VETH_BQL_FLAG;
+}
+
+static struct sk_buff *veth_ptr_to_skb(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~VETH_BQL_FLAG);
+}
+
+static void *veth_skb_to_ptr(struct sk_buff *skb, bool bql)
+{
+	return bql ? (void *)((unsigned long)skb | VETH_BQL_FLAG) : skb;
+}
+
 static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
 {
 	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
@@ -295,7 +314,26 @@ static void veth_ptr_free(void *ptr)
 	if (veth_is_xdp_frame(ptr))
 		xdp_return_frame(veth_ptr_to_xdp(ptr));
 	else
-		kfree_skb(ptr);
+		kfree_skb(veth_ptr_to_skb(ptr));
+}
+
+/* Drain frames left in the ptr_ring at teardown, freeing each one and
+ * returning the number of BQL-charged SKBs.  The caller completes these
+ * via netdev_tx_completed_queue() to balance the DQL accounting, avoiding
+ * the racy netdev_tx_reset_queue()/dql_reset().
+ */
+static unsigned int veth_ptr_ring_drain(struct ptr_ring *ring)
+{
+	unsigned int n_bql = 0;
+	void *ptr;
+
+	while ((ptr = ptr_ring_consume(ring))) {
+		if (veth_ptr_is_bql(ptr))
+			n_bql++;
+		veth_ptr_free(ptr);
+	}
+
+	return n_bql;
 }
 
 static void __veth_xdp_flush(struct veth_rq *rq)
@@ -309,19 +347,39 @@ static void __veth_xdp_flush(struct veth_rq *rq)
 	}
 }
 
-static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb, bool do_bql,
+		       struct netdev_queue *txq)
 {
-	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
+	struct ptr_ring *ring = &rq->xdp_ring;
+
+	spin_lock(&ring->producer_lock);
+	if (unlikely(__ptr_ring_check_produce(ring))) {
+		spin_unlock(&ring->producer_lock);
 		return NETDEV_TX_BUSY; /* signal qdisc layer */
+	}
+
+	/* Charge BQL before produce; the consumer cannot see the entry yet.
+	 * veth is lltx, so the stack skips HARD_TX_LOCK and txq->_xmit_lock
+	 * does not serialise txq->dql here.  This producer_lock is the single
+	 * producer lock for dql_queued() (1:1 with this rq's peer txq), and
+	 * the peer NAPI in veth_xdp_rcv() is the single completer -- the
+	 * two-context model that dql_queued()/dql_completed() require.
+	 */
+	if (do_bql)
+		netdev_tx_sent_queue(txq, VETH_BQL_UNIT);
+
+	__ptr_ring_produce(ring, veth_skb_to_ptr(skb, do_bql));
+	spin_unlock(&ring->producer_lock);
 
 	return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
 }
 
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
-			    struct veth_rq *rq, bool xdp)
+			    struct veth_rq *rq, bool xdp, bool do_bql,
+			    struct netdev_queue *txq)
 {
 	return __dev_forward_skb(dev, skb) ?: xdp ?
-		veth_xdp_rx(rq, skb) :
+		veth_xdp_rx(rq, skb, do_bql, txq) :
 		__netif_rx(skb);
 }
 
@@ -347,11 +405,12 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev,
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct netdev_queue *txq = NULL;
 	struct veth_rq *rq = NULL;
-	struct netdev_queue *txq;
 	struct net_device *rcv;
 	int length = skb->len;
 	bool use_napi = false;
+	bool do_bql = false;
 	int ret, rxq;
 
 	rcu_read_lock();
@@ -375,8 +434,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	skb_tx_timestamp(skb);
-
-	ret = veth_forward_skb(rcv, skb, rq, use_napi);
+	if (rxq < dev->real_num_tx_queues) {
+		txq = netdev_get_tx_queue(dev, rxq);
+		/* BQL charge happens inside veth_xdp_rx() under producer_lock */
+		do_bql = use_napi && !qdisc_txq_has_no_queue(txq);
+	}
+	ret = veth_forward_skb(rcv, skb, rq, use_napi, do_bql, txq);
 	switch (ret) {
 	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
 		if (!use_napi)
@@ -412,6 +475,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		net_crit_ratelimited("%s(%s): Invalid return code(%d)",
 				     __func__, dev->name, ret);
 	}
+
 	rcu_read_unlock();
 
 	return ret;
@@ -900,7 +964,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 
 static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
-			struct veth_stats *stats)
+			struct veth_stats *stats,
+			struct netdev_queue *peer_txq)
 {
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
@@ -928,9 +993,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			}
 		} else {
 			/* ndo_start_xmit */
-			struct sk_buff *skb = ptr;
+			bool bql_charged = veth_ptr_is_bql(ptr);
+			struct sk_buff *skb = veth_ptr_to_skb(ptr);
 
 			stats->xdp_bytes += skb->len;
+			if (peer_txq && bql_charged)
+				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+
 			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
 			if (skb) {
 				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
@@ -976,7 +1045,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		   netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
 
 	xdp_set_return_frame_no_direct();
-	done = veth_xdp_rcv(rq, budget, &bq, &stats);
+	done = veth_xdp_rcv(rq, budget, &bq, &stats, peer_txq);
 
 	if (stats.xdp_redirect > 0)
 		xdp_do_flush();
@@ -1074,6 +1143,7 @@ static int __veth_napi_enable(struct net_device *dev)
 static void veth_napi_del_range(struct net_device *dev, int start, int end)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
 	int i;
 
 	for (i = start; i < end; i++) {
@@ -1085,11 +1155,49 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
 	}
 	synchronize_net();
 
+	/* This rq's frames were BQL-charged on the peer's txq[i]. */
+	peer = rtnl_dereference(priv->peer);
+
 	for (i = start; i < end; i++) {
 		struct veth_rq *rq = &priv->rq[i];
+		struct netdev_queue *txq;
+		unsigned int n_bql;
 
 		rq->rx_notify_masked = false;
+
+		/* Drain leftover ring frames, counting BQL-charged SKBs that
+		 * were charged via netdev_tx_sent_queue() but never consumed.
+		 */
+		n_bql = veth_ptr_ring_drain(&rq->xdp_ring);
 		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
+
+		if (!peer || i >= peer->num_tx_queues)
+			continue;
+
+		txq = netdev_get_tx_queue(peer, i);
+
+		/* Balance the peer txq's DQL accounting by completing the
+		 * outstanding charges instead of netdev_tx_reset_queue():
+		 * dql_reset() races with a concurrent producer, while
+		 * netdev_tx_completed_queue() is the normal single-completer
+		 * path and is safe here -- NAPI is gone (synchronize_net()
+		 * above) and the producer stopped charging BQL once it
+		 * observed rq->napi == NULL.  Completing every charge drives
+		 * DQL inflight to 0 and clears STACK_XOFF.
+		 */
+		if (n_bql)
+			netdev_tx_completed_queue(txq, n_bql,
+						  n_bql * VETH_BQL_UNIT);
+
+		/* DRV_XOFF is independent of BQL/STACK_XOFF: a concurrent
+		 * veth_xmit() may have set it between rcu_assign_pointer(napi,
+		 * NULL) and synchronize_net(); with NAPI gone nothing else
+		 * clears it.  The completion above only clears STACK_XOFF, so
+		 * still wake the txq to clear DRV_XOFF -- but only when the
+		 * device is still up.
+		 */
+		if (netif_running(dev))
+			netif_tx_wake_queue(txq);
 	}
 
 	for (i = start; i < end; i++) {
@@ -1741,6 +1849,7 @@ static void veth_setup(struct net_device *dev)
 	dev->priv_flags |= IFF_PHONY_HEADROOM;
 	dev->priv_flags |= IFF_DISABLE_NETPOLL;
 	dev->lltx = true;
+	dev->bql = true;
 
 	dev->netdev_ops = &veth_netdev_ops;
 	dev->xdp_metadata_ops = &veth_xdp_metadata_ops;
-- 
2.43.0



* [PATCH net-next v7 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-12  8:35 [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support hawk
  2026-06-12  8:35 ` [PATCH net-next v7 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
@ 2026-06-12  8:35 ` hawk
  2026-06-12 14:10 ` [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support Simon Schippers
  2 siblings, 0 replies; 5+ messages in thread
From: hawk @ 2026-06-12  8:35 UTC
  To: netdev
  Cc: kernel-team, simon.schippers, Jesper Dangaard Brouer,
	Jonas Köppeler, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, linux-kernel, bpf

From: Simon Schippers <simon.schippers@tu-dortmund.de>

Per-packet BQL completion forces DQL to converge on limit=2, causing
excessive NAPI scheduling overhead and qdisc requeues.

Accumulate BQL completions and flush them when a configurable time
threshold (tx-usecs) is exceeded, letting DQL discover a limit that
bounds actual queuing delay to the configured interval. Coalescing
state persists across NAPI polls in struct veth_rq so completions can
accumulate beyond a single budget=64 cycle.

The flush condition is:

  state->time + bql_flush_ns <= current_time || state->n_bql > dql.limit

Flushing when n_bql exceeds dql.limit handles BQL starvation.

The comparison is strictly greater-than because netdev_tx_sent_queue()
always lets the producer exceed the limit by one before it stops, so
n_bql == dql.limit is a normal in-flight state. dql.limit lives in
the same cacheline as the completion path, so the check is cheap.
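
The flush rule can be modelled in user space roughly as below.  This is
a sketch, not the driver code: struct coal_state, coal_should_flush()
and coal_account() are hypothetical names, time is passed in as a plain
nanosecond counter instead of sched_clock(), and 'limit' stands in for
dql.limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct coal_state {
	uint64_t window_start_ns;	/* when the current window began */
	unsigned int n_pending;		/* completions not yet reported */
};

/* Flush when the window has expired or pending exceeds the limit. */
static bool coal_should_flush(const struct coal_state *s, uint64_t now_ns,
			      uint64_t flush_ns, unsigned int limit)
{
	if (s->n_pending == 0)
		return false;	/* nothing to complete */

	return s->window_start_ns + flush_ns <= now_ns ||
	       s->n_pending > limit;
}

/* Account one consumed, BQL-charged packet.  Returns the number of
 * completions to report now, or 0 while still coalescing.
 */
static unsigned int coal_account(struct coal_state *s, uint64_t now_ns,
				 uint64_t flush_ns, unsigned int limit)
{
	unsigned int n;

	/* The caller is expected to keep the window start in the past,
	 * as the patch does by clamping the stored timestamp.
	 */
	assert(now_ns >= s->window_start_ns);

	s->n_pending++;
	if (!coal_should_flush(s, now_ns, flush_ns, limit))
		return 0;

	n = s->n_pending;
	s->window_start_ns = now_ns;	/* start a new window */
	s->n_pending = 0;
	return n;
}
```

With flush_ns set to 0 every call returns 1, which corresponds to the
per-packet completion behaviour described for tx-usecs 0.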

Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
setting tx-usecs to 0 disables coalescing and falls back to per-packet
completion.

  ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
  ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)

Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Co-developed-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/veth.c | 123 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 117 insertions(+), 6 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 2473f730734b..c62d87a8402c 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -28,6 +28,7 @@
 #include <linux/bpf_trace.h>
 #include <linux/net_tstamp.h>
 #include <linux/skbuff_ref.h>
+#include <linux/sched/clock.h>
 #include <net/page_pool/helpers.h>
 
 #define DRV_NAME	"veth"
@@ -50,6 +51,7 @@
  * delay => 64 * 250 ms = 16 s.
  */
 #define VETH_WATCHDOG_TIMEOUT_MS	(64 * 250)
+#define VETH_BQL_COAL_TX_USECS	100 /* default tx-usecs for BQL batching */
 
 struct veth_stats {
 	u64	rx_drops;
@@ -69,6 +71,11 @@ struct veth_rq_stats {
 	struct u64_stats_sync	syncp;
 };
 
+struct veth_bql_state {
+	u64	time;	/* sched_clock() when current coalescing window started */
+	uint	n_bql;	/* BQL completions batched in the current window */
+};
+
 struct veth_rq {
 	struct napi_struct	xdp_napi;
 	struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
@@ -76,6 +83,7 @@ struct veth_rq {
 	struct bpf_prog __rcu	*xdp_prog;
 	struct xdp_mem_info	xdp_mem;
 	struct veth_rq_stats	stats;
+	struct veth_bql_state	bql_state;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
 	struct xdp_rxq_info	xdp_rxq;
@@ -88,6 +96,7 @@ struct veth_priv {
 	struct bpf_prog		*_xdp_prog;
 	struct veth_rq		*rq;
 	unsigned int		requested_headroom;
+	unsigned int		tx_coal_usecs;	/* BQL completion coalescing */
 };
 
 struct veth_xdp_tx_bq {
@@ -272,7 +281,56 @@ static void veth_get_channels(struct net_device *dev,
 static int veth_set_channels(struct net_device *dev,
 			     struct ethtool_channels *ch);
 
+static int veth_get_coalesce(struct net_device *dev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	ec->tx_coalesce_usecs = priv->tx_coal_usecs;
+	return 0;
+}
+
+static int veth_set_coalesce(struct net_device *dev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
+
+	/* The coalescing window delays BQL completions, so keep tx-usecs well
+	 * below the tx_timeout watchdog; otherwise a large value could stall a
+	 * stopped queue long enough to trip a false watchdog timeout. Cap at
+	 * half the watchdog to leave a generous safety margin. tx-usecs is
+	 * microseconds, the watchdog is milliseconds.
+	 */
+	if (ec->tx_coalesce_usecs > VETH_WATCHDOG_TIMEOUT_MS / 2 * USEC_PER_MSEC) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "tx-usecs must stay below half the tx_timeout watchdog");
+		return -ERANGE;
+	}
+
+	/* Paired with READ_ONCE in veth_xdp_rcv(). */
+	WRITE_ONCE(priv->tx_coal_usecs, ec->tx_coalesce_usecs);
+
+	/* veth_xdp_rcv() reads each device's own value, so mirror it onto
+	 * the peer to keep the pair symmetric: both directions coalesce
+	 * with the same tx-usecs. Called under RTNL, rtnl_dereference() is safe.
+	 */
+	peer = rtnl_dereference(priv->peer);
+	if (peer) {
+		struct veth_priv *peer_priv = netdev_priv(peer);
+
+		WRITE_ONCE(peer_priv->tx_coal_usecs, ec->tx_coalesce_usecs);
+	}
+
+	return 0;
+}
+
 static const struct ethtool_ops veth_ethtool_ops = {
+	.supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
 	.get_drvinfo		= veth_get_drvinfo,
 	.get_link		= ethtool_op_get_link,
 	.get_strings		= veth_get_strings,
@@ -282,6 +340,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
 	.get_ts_info		= ethtool_op_get_ts_info,
 	.get_channels		= veth_get_channels,
 	.set_channels		= veth_set_channels,
+	.get_coalesce		= veth_get_coalesce,
+	.set_coalesce		= veth_set_coalesce,
 };
 
 /* general routines */
@@ -969,13 +1029,54 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	return NULL;
 }
 
+static void veth_bql_maybe_complete(struct veth_bql_state *state,
+				    struct netdev_queue *peer_txq,
+				    u64 bql_flush_ns)
+{
+	u64 current_time;
+
+	/* There is no reason to complete with 0 and
+	 * peer_txq could go away.
+	 */
+	if (!state->n_bql || !peer_txq)
+		return;
+
+	current_time = sched_clock();
+
+	/* We complete if:
+	 * 1. We reach bql_flush_ns.
+	 * 2. We potentially have BQL starvation.
+	 */
+	if (state->time + bql_flush_ns <= current_time ||
+	    state->n_bql > peer_txq->dql.limit) {
+		netdev_tx_completed_queue(peer_txq, state->n_bql,
+					  state->n_bql * VETH_BQL_UNIT);
+		state->time = current_time;
+		state->n_bql = 0;
+	}
+}
+
 static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
 			struct veth_stats *stats,
 			struct netdev_queue *peer_txq)
 {
+	struct veth_priv *priv = netdev_priv(rq->dev);
+	struct veth_bql_state *state = &rq->bql_state;
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
+	u64 bql_flush_ns;
+
+	/* Mirrored to both peers; paired with WRITE_ONCE() in veth_set_coalesce */
+	bql_flush_ns = (u64)READ_ONCE(priv->tx_coal_usecs) * 1000;
+
+	/* Clamp stored timestamp in case we migrated to a CPU whose
+	 * sched_clock() is behind; tries to reduce late BQL flushes.
+	 */
+	state->time = min(state->time, sched_clock());
+
+	/* Flush completions that timed out since the previous NAPI poll. */
+	veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);
 
 	for (i = 0; i < budget; i++) {
 		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -1000,12 +1101,11 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			}
 		} else {
 			/* ndo_start_xmit */
-			bool bql_charged = veth_ptr_is_bql(ptr);
 			struct sk_buff *skb = veth_ptr_to_skb(ptr);
 
+			if (veth_ptr_is_bql(ptr))
+				state->n_bql++;
 			stats->xdp_bytes += skb->len;
-			if (peer_txq && bql_charged)
-				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
 
 			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
 			if (skb) {
@@ -1015,6 +1115,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 					napi_gro_receive(&rq->xdp_napi, skb);
 			}
 		}
+		veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);
 		done++;
 	}
 
@@ -1123,6 +1224,9 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end)
 	for (i = start; i < end; i++) {
 		struct veth_rq *rq = &priv->rq[i];
 
+		rq->bql_state.time = sched_clock();
+		rq->bql_state.n_bql = 0;
+
 		napi_enable(&rq->xdp_napi);
 		rcu_assign_pointer(priv->rq[i].napi, &priv->rq[i].xdp_napi);
 	}
@@ -1172,11 +1276,15 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
 
 		rq->rx_notify_masked = false;
 
-		/* Drain leftover ring frames, counting BQL-charged SKBs that
-		 * were charged via netdev_tx_sent_queue() but never consumed.
+		/* Drain leftover ring frames, counting BQL-charged SKBs, and
+		 * add the completions still pending in the coalescing window
+		 * (consumed by NAPI but not yet flushed).  Both were charged
+		 * via netdev_tx_sent_queue() and are still outstanding.
 		 */
-		n_bql = veth_ptr_ring_drain(&rq->xdp_ring);
+		n_bql = veth_ptr_ring_drain(&rq->xdp_ring) + rq->bql_state.n_bql;
 		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
+		rq->bql_state.n_bql = 0;
+		rq->bql_state.time = 0;
 
 		if (!peer || i >= peer->num_tx_queues)
 			continue;
@@ -1865,6 +1973,8 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
 
 static void veth_setup(struct net_device *dev)
 {
+	struct veth_priv *priv = netdev_priv(dev);
+
 	ether_setup(dev);
 
 	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
@@ -1889,6 +1999,7 @@ static void veth_setup(struct net_device *dev)
 	dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
 	dev->max_mtu = ETH_MAX_MTU;
 	dev->watchdog_timeo = msecs_to_jiffies(VETH_WATCHDOG_TIMEOUT_MS);
+	priv->tx_coal_usecs = VETH_BQL_COAL_TX_USECS;
 
 	dev->hw_features = VETH_FEATURES;
 	dev->hw_enc_features = VETH_FEATURES;
-- 
2.43.0
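
The accounting that the hunks above rely on can be modelled in userspace.
This is a minimal sketch, not the driver code. It assumes that
veth_bql_maybe_complete() flushes the pending count once the coalescing
window has elapsed, and that teardown returns ring leftovers plus the
pending count. The struct and function names below (bql_state, txq_model,
txq_sent, txq_completed, bql_consume, bql_maybe_complete, bql_teardown)
are hypothetical stand-ins, and the clock is passed in explicitly instead
of being read from sched_clock().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-rq coalescing state (assumption). */
struct bql_state {
	uint64_t time;       /* timestamp of the last flush, in ns */
	unsigned int n_bql;  /* consumed by the poller, not yet reported */
};

/* Hypothetical stand-in for the peer TX queue's BQL accounting. */
struct txq_model {
	unsigned int in_flight;  /* charged on send, not yet completed */
};

/* Producer side: charge one unit, like netdev_tx_sent_queue(). */
void txq_sent(struct txq_model *q)
{
	q->in_flight++;
}

/* Report n completions, like netdev_tx_completed_queue(). */
void txq_completed(struct txq_model *q, unsigned int n)
{
	assert(n <= q->in_flight);  /* never complete more than was charged */
	q->in_flight -= n;
}

/* Consumer side: remember one BQL-charged packet for a later flush. */
void bql_consume(struct bql_state *st)
{
	st->n_bql++;
}

/* Flush pending completions once flush_ns has elapsed since the last
 * flush. Returns true if a flush happened.
 */
bool bql_maybe_complete(struct bql_state *st, struct txq_model *q,
			uint64_t now_ns, uint64_t flush_ns)
{
	if (!st->n_bql || now_ns - st->time < flush_ns)
		return false;

	txq_completed(q, st->n_bql);
	st->n_bql = 0;
	st->time = now_ns;
	return true;
}

/* Teardown: both the ring leftovers and the completions still pending
 * in the coalescing window were charged and must be returned.
 */
unsigned int bql_teardown(struct bql_state *st, struct txq_model *q,
			  unsigned int ring_leftover)
{
	unsigned int n = ring_leftover + st->n_bql;

	st->n_bql = 0;
	st->time = 0;
	txq_completed(q, n);
	return n;
}
```

In this model, dropping the "+ st->n_bql" term in bql_teardown() would
leave in_flight above zero after teardown, which is the imbalance the
updated drain comment describes.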



* Re: [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support
  2026-06-12  8:35 [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support hawk
  2026-06-12  8:35 ` [PATCH net-next v7 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
  2026-06-12  8:35 ` [PATCH net-next v7 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
@ 2026-06-12 14:10 ` Simon Schippers
  2026-06-12 17:21   ` Jonas Köppeler
  2 siblings, 1 reply; 5+ messages in thread
From: Simon Schippers @ 2026-06-12 14:10 UTC (permalink / raw)
  To: hawk, netdev
  Cc: kernel-team, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Chris Arges, Mike Freemon,
	Toke Høiland-Jørgensen, Jonas Köppeler,
	Breno Leitao, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, bpf

On 6/12/26 10:35, hawk@kernel.org wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> This series adds BQL (Byte Queue Limits) to the veth driver, reducing
> latency by dynamically limiting in-flight packets in the ptr_ring and
> moving buffering into the qdisc where AQM algorithms can act on it.

LGTM, thanks for the detailed changelog :)

Maybe we should stop searching for the perfect tx-usecs value.
100us is probably fine for most hardware to not have a performance
regression. And lowering it does not really improve the RTT anyways.
Do you agree?

Nevertheless, I will compile and run the benchmarks again.

I will go on vacation from 15th to 24th of June, so I will not be able
to contribute code or run benchmarks then.

Thanks,
Simon



* Re: [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support
  2026-06-12 14:10 ` [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support Simon Schippers
@ 2026-06-12 17:21   ` Jonas Köppeler
  0 siblings, 0 replies; 5+ messages in thread
From: Jonas Köppeler @ 2026-06-12 17:21 UTC (permalink / raw)
  To: Simon Schippers, hawk, netdev
  Cc: kernel-team, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Chris Arges, Mike Freemon,
	Toke Høiland-Jørgensen, Breno Leitao,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, bpf

On 6/12/26 16:10, Simon Schippers wrote:
> On 6/12/26 10:35, hawk@kernel.org wrote:
>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>
>> This series adds BQL (Byte Queue Limits) to the veth driver, reducing
>> latency by dynamically limiting in-flight packets in the ptr_ring and
>> moving buffering into the qdisc where AQM algorithms can act on it.
> 
> LGTM, thanks for the detailed changelog :)
> 
> Maybe we should stop searching for the perfect tx-usecs value.
> 100us is probably fine for most hardware to not have a performance
> regression. And lowering it does not really improve the RTT anyways.
> Do you agree?

I agree, I already thought that it just might be a very lucky case when
using 50us where something accidentally aligns nicely. Interestingly, I
could also reproduce that 50us was consistently a little better compared
to 100us on an Intel CPU. Maybe if I get the time, I'll have another
look at it, but in general I think 50us or 100us does not really matter.

> 
> Nevertheless, I will compile and run the benchmarks again.
> 
> I will go on vacation from 15th to 24th of June, so I will not be able
> to contribute code or run benchmarks then.
> 
> Thanks,
> Simon
> 

