Netdev List
* [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support
@ 2026-05-27 13:54 hawk
  2026-05-27 13:54 ` [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
  To: netdev
  Cc: Jesper Dangaard Brouer, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Chris Arges,
	Mike Freemon, Toke Høiland-Jørgensen,
	Jonas Köppeler, Breno Leitao, Simon Schippers,
	Simon Schippers, kernel-team

From: Jesper Dangaard Brouer <hawk@kernel.org>

This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.

Problem: veth's 256-entry ptr_ring acts as a "dark buffer" invisible
to the qdisc's AQM.  Under load the ring fills, adding up to 256
packets of unmanaged latency before the qdisc sees congestion.

Solution: BQL stops the queue before the ring fills, pushing excess
packets into the qdisc where sojourn-based AQM can drop them.  V6 adds
time-based completion coalescing (ethtool tx-usecs, default 100 us)
so DQL converges on a limit that bounds actual queuing delay rather
than oscillating at limit=2 with per-packet completion.

Test setup: veth pair, UDP flood, 13000 iptables rules in consumer
namespace (slows NAPI-64 cycle to ~6-7 ms).  Ping measures RTT.

                   BQL off                    BQL on
  fq_codel:  RTT ~22 ms, 4% loss        RTT ~1.3 ms, 0% loss
  sfq:       RTT ~24 ms, 0% loss        RTT ~1.5 ms, 0% loss

BQL reduces ping RTT by ~16-17x for both qdiscs.  Consumer throughput
unchanged.
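
As a rough cross-check of the BQL-off numbers (a back-of-envelope
sketch, not part of the series): if the consumer drains one NAPI budget
of 64 packets per ~6-7 ms cycle, a full 256-entry ring needs about four
cycles to drain, which is the same order as the ~22-24 ms RTT measured
without BQL.  The helper name below is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: time for the consumer to drain a full ring,
 * assuming it handles `budget` packets per NAPI cycle and every cycle
 * takes `cycle_us` microseconds.
 */
static inline unsigned long ring_drain_us(unsigned int ring_size,
					  unsigned int budget,
					  unsigned long cycle_us)
{
	unsigned int cycles;

	assert(budget > 0);
	cycles = (ring_size + budget - 1) / budget;	/* round up */
	return cycles * cycle_us;
}
```

With ring_size=256 and budget=64 this gives 24000 us for a 6 ms cycle
and 28000 us for a 7 ms cycle.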

Why BQL over configurable ring size (ethtool -G): BQL auto-tunes per
workload via DQL; a static ring size requires per-setup manual tuning
and a too-small ring drops XDP packets in batches of 16-64.

Selftests:
  https://github.com/netoptimizer/veth-backpressure-performance-testing

Background:
  Mike Freemon reported the veth dark buffer problem internally at
  Cloudflare and showed that recompiling with ptr_ring size 30 made
  fq_codel work dramatically better -- motivating a dynamic BQL
  solution.  Chris Arges wrote the reproducer.  Jonas Köppeler and
  Simon Schippers provided extensive testing and code review.

  During BQL development we also fixed an unrelated 12-year-old CoDel
  bug (stale first_above_time in empty flows), see
  commit 815980fe6dbb ("net_sched: codel: fix stale state for empty flows in fq_codel").
  BQL remains valuable independently.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Chris Arges <carges@cloudflare.com>
Cc: Mike Freemon <mfreemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Cc: Breno Leitao <leitao@debian.org>
Cc: Simon Schippers <simon.schippers@tu-dortmund.de>
Cc: Simon Schippers <simon@schippers-hamm.de>
Cc: kernel-team@cloudflare.com

Jesper Dangaard Brouer (4):
  net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  veth: implement Byte Queue Limits (BQL) for latency reduction
  veth: add tx_timeout watchdog as BQL safety net
  net: sched: add timeout count to NETDEV WATCHDOG message

Simon Schippers (1):
  veth: time-based BQL completion coalescing via ethtool tx-usecs

 .../networking/net_cachelines/net_device.rst  |   1 +
 drivers/net/veth.c                            | 198 +++++++++++++++++-
 include/linux/netdevice.h                     |   2 +
 net/core/net-sysfs.c                          |   8 +-
 net/sched/sch_generic.c                       |   8 +-
 5 files changed, 201 insertions(+), 16 deletions(-)

---
Changes since V5:
  - Drop patch 1 (OOB txq fix) -- applied to net tree by Paolo as
    08f566e8f83b, already in net-next.
  - New patch 5: time-based BQL completion coalescing via ethtool
    tx-usecs (Simon Schippers).  Resolves the throughput regression
    Paolo flagged on V5: per-packet completion forced DQL to limit=2,
    causing cache-line bouncing between producer/consumer CPUs.
    Coalescing batches completions on a configurable time threshold
    (default 100 us), letting DQL discover a higher useful limit.
  - Patch 2: use __ptr_ring_check_produce() instead of open-coded
    ring-full check (Simon Schippers nit).

Prior versions:
  V5: https://lore.kernel.org/all/20260505132159.241305-1-hawk@kernel.org/
  V4: https://lore.kernel.org/all/20260501071633.644353-1-hawk@kernel.org/
  V3: https://lore.kernel.org/all/20260429172036.1028526-1-hawk@kernel.org/
  V2: https://lore.kernel.org/all/20260413094442.1376022-1-hawk@kernel.org/
  V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/

Changes since V4:
  - New patch 1: fix OOB txq access in veth_poll() when veth peers have
    asymmetric RX/TX queue counts.  XDP redirect can deliver frames to
    an RX queue index that exceeds the peer's TX queue count, causing
    an out-of-bounds netdev_get_tx_queue() access.  Found by sashiko-bot.
  - Patch 3 (veth BQL): wake stopped peer txqs in veth_napi_del_range()
    to clear DRV_XOFF after NAPI teardown.  A concurrent veth_xmit()
    can set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
    synchronize_net(); with NAPI gone, no veth_poll() clears it.
    Guarded by netif_running() to skip during device close.

Changes since V3:
  - Drop selftest patch (patch 5 from V3) per maintainer request.
  - Rebase on latest net-next.

Changes since V2:
  - Patch 2 (veth BQL): fix syzbot WARNING in veth_napi_del_range():
    clamp BQL reset loop to peer's real_num_tx_queues.  The loop was
    iterating dev->real_num_rx_queues but indexing peer's txq[], which
    goes out of bounds when the peer has fewer TX queues (e.g. veth
    enslaved to a bond with XDP attached).

Changes since V1:
  - Patch 1 (dev->bql flag): add kdoc entry for @bql in struct net_device.
  - Patch 2 (veth BQL): charge fixed VETH_BQL_UNIT (1) per packet instead
    of skb->len.  veth has no link speed; the ptr_ring is packet-indexed.
    Byte-based charging lets small packets sneak many entries into the ring.
    Testing: min-size packet flood causes 3.7x ping RTT degradation with
    skb->len vs no change with fixed-unit charging.
  - Patch 3 (tx_timeout watchdog): fix race with peer NAPI: replace
    netdev_tx_reset_queue() with clear_bit(STACK_XOFF) + netif_tx_wake_queue()
    to avoid dql_reset() racing with concurrent dql_completed().
  - Cover letter: update CoDel fix reference to merged commit in net tree.

-- 
2.43.0


* [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
@ 2026-05-27 13:54 ` hawk
  2026-05-27 13:54 ` [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
  To: netdev
  Cc: Jesper Dangaard Brouer, Jonas Köppeler, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, Kuniyuki Iwashima,
	Stanislav Fomichev, Christian Brauner, Krishna Kumar, Yajun Deng,
	linux-doc, linux-kernel

From: Jesper Dangaard Brouer <hawk@kernel.org>

Virtual devices with IFF_NO_QUEUE or lltx are excluded from BQL sysfs
by netdev_uses_bql(), since they traditionally lack real hardware
queues. However, some virtual devices like veth implement a real
ptr_ring FIFO with NAPI processing and benefit from BQL to limit
in-flight bytes and reduce latency.

Add a per-device 'bql' bitfield boolean in the priv_flags_slow section
of struct net_device. When set, it overrides the IFF_NO_QUEUE/lltx
exclusion and exposes BQL sysfs entries (/sys/class/net/<dev>/queues/
tx-<n>/byte_queue_limits/). The flag is still gated on CONFIG_BQL.

This allows drivers that use BQL despite being IFF_NO_QUEUE to opt in
to sysfs visibility for monitoring and debugging.
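
For monitoring, the exposed entries can be read like any other sysfs
attribute.  The following is a minimal userspace sketch, not part of
this patch; the helper names are hypothetical, and "limit" and
"inflight" are assumed to be the attributes of interest under
byte_queue_limits/.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of one BQL attribute of a TX queue.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static inline int bql_sysfs_path(char *buf, size_t len, const char *dev,
				 unsigned int txq, const char *attr)
{
	int n;

	assert(buf && dev && attr);
	n = snprintf(buf, len,
		     "/sys/class/net/%s/queues/tx-%u/byte_queue_limits/%s",
		     dev, txq, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one BQL attribute as an integer; returns -1 on any failure,
 * e.g. when the device does not expose BQL sysfs entries.
 */
static inline long bql_read_attr(const char *dev, unsigned int txq,
				 const char *attr)
{
	char path[256];
	long val;
	FILE *f;

	if (bql_sysfs_path(path, sizeof(path), dev, txq, attr))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

For example, bql_read_attr("veth0", 0, "limit") would return the
current DQL limit of tx-0 on a device named veth0 (a hypothetical
name), or -1 if the directory is absent.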

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 Documentation/networking/net_cachelines/net_device.rst | 1 +
 include/linux/netdevice.h                              | 2 ++
 net/core/net-sysfs.c                                   | 8 +++++++-
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 7b3392553fd6..62df1e09656b 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -170,6 +170,7 @@ unsigned_long:1                     see_all_hwtstamp_requests
 unsigned_long:1                     change_proto_down
 unsigned_long:1                     netns_immutable
 unsigned_long:1                     fcoe_mtu
+unsigned_long:1                     bql                                                                 netdev_uses_bql(net-sysfs.c)
 struct list_head                    net_notifier_list
 struct macsec_ops*                  macsec_ops
 struct udp_tunnel_nic_info*         udp_tunnel_nic_info
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bf3dd9b2c1a7..6b02e82d9625 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2077,6 +2077,7 @@ enum netdev_reg_state {
  *	@change_proto_down: device supports setting carrier via IFLA_PROTO_DOWN
  *	@netns_immutable: interface can't change network namespaces
  *	@fcoe_mtu:	device supports maximum FCoE MTU, 2158 bytes
+ *	@bql:		device uses BQL (DQL sysfs) despite having IFF_NO_QUEUE
  *
  *	@net_notifier_list:	List of per-net netdev notifier block
  *				that follow this device when it is moved
@@ -2491,6 +2492,7 @@ struct net_device {
 	unsigned long		change_proto_down:1;
 	unsigned long		netns_immutable:1;
 	unsigned long		fcoe_mtu:1;
+	unsigned long		bql:1;
 
 	struct list_head	net_notifier_list;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 3318b5666e43..82833e5dae03 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1945,10 +1945,16 @@ static const struct kobj_type netdev_queue_ktype = {
 
 static bool netdev_uses_bql(const struct net_device *dev)
 {
+	if (!IS_ENABLED(CONFIG_BQL))
+		return false;
+
+	if (dev->bql)
+		return true;
+
 	if (dev->lltx || (dev->priv_flags & IFF_NO_QUEUE))
 		return false;
 
-	return IS_ENABLED(CONFIG_BQL);
+	return true;
 }
 
 static int netdev_queue_add_kobject(struct net_device *dev, int index)
-- 
2.43.0



* [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
  2026-05-27 13:54 ` [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
@ 2026-05-27 13:54 ` hawk
  2026-05-28  7:45   ` Jonas Köppeler
  2026-06-04  8:19   ` Paolo Abeni
  2026-05-27 13:54 ` [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net hawk
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 30+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
  To: netdev
  Cc: Jesper Dangaard Brouer, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Stanislav Fomichev, linux-kernel,
	bpf

From: Jesper Dangaard Brouer <hawk@kernel.org>

Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") gave qdiscs control over veth by returning
NETDEV_TX_BUSY when the ptr_ring is full (DRV_XOFF).  That commit noted
a known limitation: the 256-entry ptr_ring sits in front of the qdisc as
a dark buffer, adding base latency because the qdisc has no visibility
into how many bytes are already queued there.

Add BQL support so the qdisc gets feedback and can begin shaping traffic
before the ring fills.  In testing with fq_codel, BQL reduces ping RTT
under UDP load from ~6.61ms to ~0.36ms (18x).

Charge a fixed VETH_BQL_UNIT (1) per packet rather than skb->len, so
the DQL limit tracks packets-in-flight.  Unlike a physical NIC, veth
has no link speed -- the ptr_ring drains at CPU speed and is
packet-indexed, not byte-indexed, so bytes are not the natural unit.
With byte-based charging, small packets sneak many more entries into
the ring before STACK_XOFF fires, deepening the dark buffer under
mixed-size workloads.  Testing with a concurrent min-size packet flood
shows 3.7x ping RTT degradation with skb->len charging versus no
change with fixed-unit charging.
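
The effect can be illustrated with a simplified model (a sketch under
stated assumptions, not the DQL algorithm itself): the queue is taken
to stop once the charged in-flight units reach the limit, and the
30000-byte limit used in the example is a hypothetical value.

```c
#include <assert.h>

#define RING_SIZE 256	/* VETH_RING_SIZE in the patch */

/* Ring entries admitted before a limit of `limit` units is reached when
 * every packet is charged `charge` units (skb->len for byte-based
 * charging, 1 for fixed-unit charging).  Capped at the ring size.
 */
static inline unsigned int entries_before_xoff(unsigned int limit,
					       unsigned int charge)
{
	unsigned int n;

	assert(charge > 0);
	n = (limit + charge - 1) / charge;	/* round up */
	return n < RING_SIZE ? n : RING_SIZE;
}
```

In this model a 30000-byte limit admits 20 packets of 1500 bytes but
fills the whole 256-entry ring with 64-byte packets, whereas a limit of
20 with a fixed charge of 1 admits 20 entries regardless of size.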

Charge BQL inside veth_xdp_rx() under the ptr_ring producer_lock, after
confirming the ring is not full.  The charge must precede the produce
because the NAPI consumer can run on another CPU and complete the SKB
the instant it becomes visible in the ring.  Doing both under the same
lock avoids a pre-charge/undo pattern -- BQL is only charged when
produce is guaranteed to succeed.

BQL is enabled only when a real qdisc is attached (guarded by
!qdisc_txq_has_no_queue), as HARD_TX_LOCK provides serialization
for TXQ modification like dql_queued(). For lltx devices, like veth,
this HARD_TX_LOCK serialization isn't provided.  The ptr_ring
producer_lock provides additional serialization that would allow
BQL to work correctly even with noqueue, though that combination
is not currently enabled, as the netstack will drop and warn.

Track per-SKB BQL state via a VETH_BQL_FLAG pointer tag in the ptr_ring
entry.  This is necessary because the qdisc can be replaced live while
SKBs are in-flight -- each SKB must carry the charge decision made at
enqueue time rather than re-checking the peer's qdisc at completion.
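
The tagging technique itself can be sketched in plain userspace C.
This is an illustration only: the names are hypothetical stand-ins for
the veth_skb_to_ptr(), veth_ptr_is_bql() and veth_ptr_to_skb() helpers
in the diff, and it assumes the object is at least 4-byte aligned so
that bit 1 of its address is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TAG_BQL ((uintptr_t)1 << 1)	/* mirrors VETH_BQL_FLAG = BIT(1) */

/* Store the charge decision in a spare low bit of the pointer. */
static inline void *tag_ptr(void *obj, bool bql)
{
	assert(((uintptr_t)obj & TAG_BQL) == 0);	/* bit must be free */
	return bql ? (void *)((uintptr_t)obj | TAG_BQL) : obj;
}

/* Was this entry charged to BQL at enqueue time? */
static inline bool ptr_is_bql(const void *ptr)
{
	return ((uintptr_t)ptr & TAG_BQL) != 0;
}

/* Recover the original pointer before dereferencing it. */
static inline void *untag_ptr(void *ptr)
{
	return (void *)((uintptr_t)ptr & ~TAG_BQL);
}
```

The consumer must always strip the tag before use, which is why the
diff also routes veth_ptr_free() through veth_ptr_to_skb().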

Complete per-SKB in veth_xdp_rcv() rather than in bulk, so STACK_XOFF
clears promptly when producer and consumer run on different CPUs.

BQL introduces a second independent queue-stop mechanism (STACK_XOFF)
alongside the existing DRV_XOFF (ring full).  Both must be clear for
the queue to transmit.  Reset BQL state in veth_napi_del_range() after
synchronize_net() to avoid racing with in-flight veth_poll() calls.
Clamp the reset loop to the peer's real_num_tx_queues, since the peer
may have fewer TX queues than the local device has RX queues (e.g. when
veth is enslaved to a bond with XDP attached).

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 drivers/net/veth.c | 86 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 75 insertions(+), 11 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0cfb19b760dd..21ff78533943 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -34,9 +34,13 @@
 #define DRV_VERSION	"1.0"
 
 #define VETH_XDP_FLAG		BIT(0)
+#define VETH_BQL_FLAG		BIT(1)
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Fixed BQL charge: DQL limit tracks packets-in-flight, not bytes */
+#define VETH_BQL_UNIT		1
+
 #define VETH_XDP_TX_BULK_SIZE	16
 #define VETH_XDP_BATCH		16
 
@@ -280,6 +284,21 @@ static bool veth_is_xdp_frame(void *ptr)
 	return (unsigned long)ptr & VETH_XDP_FLAG;
 }
 
+static bool veth_ptr_is_bql(void *ptr)
+{
+	return (unsigned long)ptr & VETH_BQL_FLAG;
+}
+
+static struct sk_buff *veth_ptr_to_skb(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~VETH_BQL_FLAG);
+}
+
+static void *veth_skb_to_ptr(struct sk_buff *skb, bool bql)
+{
+	return bql ? (void *)((unsigned long)skb | VETH_BQL_FLAG) : skb;
+}
+
 static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
 {
 	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
@@ -295,7 +314,7 @@ static void veth_ptr_free(void *ptr)
 	if (veth_is_xdp_frame(ptr))
 		xdp_return_frame(veth_ptr_to_xdp(ptr));
 	else
-		kfree_skb(ptr);
+		kfree_skb(veth_ptr_to_skb(ptr));
 }
 
 static void __veth_xdp_flush(struct veth_rq *rq)
@@ -309,19 +328,33 @@ static void __veth_xdp_flush(struct veth_rq *rq)
 	}
 }
 
-static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb, bool do_bql,
+		       struct netdev_queue *txq)
 {
-	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
+	struct ptr_ring *ring = &rq->xdp_ring;
+
+	spin_lock(&ring->producer_lock);
+	if (unlikely(__ptr_ring_check_produce(ring))) {
+		spin_unlock(&ring->producer_lock);
 		return NETDEV_TX_BUSY; /* signal qdisc layer */
+	}
+
+	/* BQL charge before produce; consumer cannot see entry yet */
+	if (do_bql)
+		netdev_tx_sent_queue(txq, VETH_BQL_UNIT);
+
+	__ptr_ring_produce(ring, veth_skb_to_ptr(skb, do_bql));
+	spin_unlock(&ring->producer_lock);
 
 	return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
 }
 
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
-			    struct veth_rq *rq, bool xdp)
+			    struct veth_rq *rq, bool xdp, bool do_bql,
+			    struct netdev_queue *txq)
 {
 	return __dev_forward_skb(dev, skb) ?: xdp ?
-		veth_xdp_rx(rq, skb) :
+		veth_xdp_rx(rq, skb, do_bql, txq) :
 		__netif_rx(skb);
 }
 
@@ -348,10 +381,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct veth_rq *rq = NULL;
-	struct netdev_queue *txq;
+	struct netdev_queue *txq = NULL;
 	struct net_device *rcv;
 	int length = skb->len;
 	bool use_napi = false;
+	bool do_bql = false;
 	int ret, rxq;
 
 	rcu_read_lock();
@@ -375,8 +409,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	skb_tx_timestamp(skb);
-
-	ret = veth_forward_skb(rcv, skb, rq, use_napi);
+	if (rxq < dev->real_num_tx_queues) {
+		txq = netdev_get_tx_queue(dev, rxq);
+		/* BQL charge happens inside veth_xdp_rx() under producer_lock */
+		do_bql = use_napi && !qdisc_txq_has_no_queue(txq);
+	}
+	ret = veth_forward_skb(rcv, skb, rq, use_napi, do_bql, txq);
 	switch (ret) {
 	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
 		if (!use_napi)
@@ -412,6 +450,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		net_crit_ratelimited("%s(%s): Invalid return code(%d)",
 				     __func__, dev->name, ret);
 	}
+
 	rcu_read_unlock();
 
 	return ret;
@@ -900,7 +939,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 
 static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
-			struct veth_stats *stats)
+			struct veth_stats *stats,
+			struct netdev_queue *peer_txq)
 {
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
@@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			}
 		} else {
 			/* ndo_start_xmit */
-			struct sk_buff *skb = ptr;
+			bool bql_charged = veth_ptr_is_bql(ptr);
+			struct sk_buff *skb = veth_ptr_to_skb(ptr);
 
 			stats->xdp_bytes += skb->len;
+			if (peer_txq && bql_charged)
+				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+
 			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
 			if (skb) {
 				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
@@ -976,7 +1020,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		   netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
 
 	xdp_set_return_frame_no_direct();
-	done = veth_xdp_rcv(rq, budget, &bq, &stats);
+	done = veth_xdp_rcv(rq, budget, &bq, &stats, peer_txq);
 
 	if (stats.xdp_redirect > 0)
 		xdp_do_flush();
@@ -1074,6 +1118,7 @@ static int __veth_napi_enable(struct net_device *dev)
 static void veth_napi_del_range(struct net_device *dev, int start, int end)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
 	int i;
 
 	for (i = start; i < end; i++) {
@@ -1092,6 +1137,24 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
 		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
 	}
 
+	/* Reset BQL and wake stopped peer txqs.  A concurrent veth_xmit()
+	 * may have set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
+	 * synchronize_net(), and NAPI can no longer clear it.
+	 * Only wake when the device is still up.
+	 */
+	peer = rtnl_dereference(priv->peer);
+	if (peer) {
+		int peer_end = min_t(int, end, peer->real_num_tx_queues);
+
+		for (i = start; i < peer_end; i++) {
+			struct netdev_queue *txq = netdev_get_tx_queue(peer, i);
+
+			netdev_tx_reset_queue(txq);
+			if (netif_running(dev))
+				netif_tx_wake_queue(txq);
+		}
+	}
+
 	for (i = start; i < end; i++) {
 		page_pool_destroy(priv->rq[i].page_pool);
 		priv->rq[i].page_pool = NULL;
@@ -1741,6 +1804,7 @@ static void veth_setup(struct net_device *dev)
 	dev->priv_flags |= IFF_PHONY_HEADROOM;
 	dev->priv_flags |= IFF_DISABLE_NETPOLL;
 	dev->lltx = true;
+	dev->bql = true;
 
 	dev->netdev_ops = &veth_netdev_ops;
 	dev->xdp_metadata_ops = &veth_xdp_metadata_ops;
-- 
2.43.0



* [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net
  2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
  2026-05-27 13:54 ` [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
  2026-05-27 13:54 ` [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
@ 2026-05-27 13:54 ` hawk
  2026-06-04  8:24   ` Paolo Abeni
  2026-05-27 13:54 ` [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
  2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
  4 siblings, 1 reply; 30+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
  To: netdev
  Cc: Jesper Dangaard Brouer, Jonas Köppeler, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel

From: Jesper Dangaard Brouer <hawk@kernel.org>

With the introduction of BQL (Byte Queue Limits) for veth, there are
now two independent mechanisms that can stop a transmit queue:

 - DRV_XOFF: set by netif_tx_stop_queue() when the ptr_ring is full
 - STACK_XOFF: set by BQL when the in-flight limit is reached (veth
   charges a fixed unit per packet, so the limit counts packets)

If either mechanism stalls without a corresponding wake/completion,
the queue stops permanently. Enable the net device watchdog timer and
implement ndo_tx_timeout as a failsafe recovery.

The timeout handler clears STACK_XOFF directly and wakes the queue
(clearing DRV_XOFF), covering both stop mechanisms. It deliberately
avoids netdev_tx_reset_queue(), because dql_reset() would race with the
peer NAPI calling dql_completed() concurrently. The watchdog fires
after 16 seconds, which accommodates worst-case NAPI processing
(budget=64 packets x 250ms per-packet consumer delay) without false
positives under normal backpressure.
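
The 16 second figure follows directly from the stated worst case; a
tiny sketch of the arithmetic (the helper name is hypothetical):

```c
#include <assert.h>

/* Worst-case duration of one full NAPI poll in milliseconds, assuming
 * every packet of the budget takes `per_pkt_ms` to process.
 */
static inline unsigned int worst_case_poll_ms(unsigned int budget,
					      unsigned int per_pkt_ms)
{
	assert(budget > 0);
	return budget * per_pkt_ms;
}
```

worst_case_poll_ms(64, 250) yields 16000, the value the diff passes to
msecs_to_jiffies() for dev->watchdog_timeo.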

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 drivers/net/veth.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 21ff78533943..d5675d9d5236 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1442,6 +1442,22 @@ static int veth_set_channels(struct net_device *dev,
 	goto out;
 }
 
+static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
+
+	netdev_err(dev,
+		   "veth backpressure(0x%lX) stalled(n:%ld) TXQ(%u) re-enable\n",
+		   txq->state, atomic_long_read(&txq->trans_timeout), txqueue);
+
+	/* Cannot call netdev_tx_reset_queue(): dql_reset() races with
+	 * peer NAPI calling dql_completed() concurrently.
+	 * Just clear the stop bits; the qdisc will re-stop if still stuck.
+	 */
+	clear_bit(__QUEUE_STATE_STACK_XOFF, &txq->state);
+	netif_tx_wake_queue(txq);
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -1780,6 +1796,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
 	.ndo_get_peer_dev	= veth_peer_dev,
+	.ndo_tx_timeout		= veth_tx_timeout,
 };
 
 static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
@@ -1819,6 +1836,7 @@ static void veth_setup(struct net_device *dev)
 	dev->priv_destructor = veth_dev_free;
 	dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
 	dev->max_mtu = ETH_MAX_MTU;
+	dev->watchdog_timeo = msecs_to_jiffies(16000);
 
 	dev->hw_features = VETH_FEATURES;
 	dev->hw_enc_features = VETH_FEATURES;
-- 
2.43.0



* [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message
  2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
                   ` (2 preceding siblings ...)
  2026-05-27 13:54 ` [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net hawk
@ 2026-05-27 13:54 ` hawk
  2026-06-04  8:30   ` Paolo Abeni
  2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
  4 siblings, 1 reply; 30+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
  To: netdev
  Cc: Jesper Dangaard Brouer, Jakub Kicinski, Jonas Köppeler,
	Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, linux-kernel

From: Jesper Dangaard Brouer <hawk@kernel.org>

Add the per-queue timeout counter (trans_timeout) to the core NETDEV
WATCHDOG log message.  This makes it easy to determine how frequently
a particular queue is stalling from a single log line, without having
to search through and correlate spaced-out log entries.

Useful for production monitoring where timeouts are spaced by the
watchdog interval, making frequency hard to judge.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/all/20251107175445.58eba452@kernel.org/
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 net/sched/sch_generic.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 237ee1cd0136..2aebab985c00 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -533,13 +533,12 @@ static void dev_watchdog(struct timer_list *t)
 		    netif_running(dev) &&
 		    netif_carrier_ok(dev)) {
 			unsigned int timedout_ms = 0;
+			struct netdev_queue *txq;
 			unsigned int i;
 			unsigned long trans_start;
 			unsigned long oldest_start = jiffies;
 
 			for (i = 0; i < dev->num_tx_queues; i++) {
-				struct netdev_queue *txq;
-
 				txq = netdev_get_tx_queue(dev, i);
 				if (!netif_xmit_stopped(txq))
 					continue;
@@ -561,9 +560,10 @@ static void dev_watchdog(struct timer_list *t)
 
 			if (unlikely(timedout_ms)) {
 				trace_net_dev_xmit_timeout(dev, i);
-				netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms\n",
+				netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms (n:%ld)\n",
 					    raw_smp_processor_id(),
-					    i, timedout_ms);
+					    i, timedout_ms,
+					    atomic_long_read(&txq->trans_timeout));
 				netif_freeze_queues(dev);
 				dev->netdev_ops->ndo_tx_timeout(dev, i);
 				netif_unfreeze_queues(dev);
-- 
2.43.0



* [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
                   ` (3 preceding siblings ...)
  2026-05-27 13:54 ` [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
@ 2026-05-27 13:54 ` hawk
  2026-05-28  7:46   ` Jonas Köppeler
                     ` (3 more replies)
  4 siblings, 4 replies; 30+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
  To: netdev
  Cc: Simon Schippers, Jesper Dangaard Brouer, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

From: Simon Schippers <simon.schippers@tu-dortmund.de>

Per-packet BQL completion forces DQL to converge on limit=2, causing
excessive NAPI scheduling overhead and qdisc requeues.

Accumulate BQL completions and flush them when a configurable time
threshold is exceeded, letting DQL discover a limit that bounds actual
queuing delay to the configured interval. Coalescing state persists
across NAPI polls in struct veth_rq so completions can accumulate
beyond a single budget=64 cycle.

Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
setting tx-usecs to 0 disables coalescing and falls back to per-packet
completion.

  ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
  ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
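
The core of the coalescing decision can be sketched in userspace with
an injected clock.  This is a simplified model with hypothetical names:
it mirrors the per-packet path (count, then flush once the window has
elapsed) but leaves out the flush at the start of each NAPI poll and
the sched_clock() clamp that the patch also performs.

```c
#include <assert.h>
#include <stdint.h>

struct coal_state {
	uint64_t window_start_ns;	/* when the current window began */
	unsigned int pending;		/* completions batched so far */
};

/* Record one completion at time `now_ns`.  Returns how many completions
 * should be reported to BQL right now: 0 while still accumulating, the
 * whole batch once the window has elapsed.  window_ns == 0 disables
 * coalescing, so every completion is reported immediately.
 */
static inline unsigned int coal_complete_one(struct coal_state *st,
					     uint64_t now_ns,
					     uint64_t window_ns)
{
	unsigned int flushed;

	assert(st);
	if (!window_ns)
		return 1;

	st->pending++;
	if (now_ns < st->window_start_ns + window_ns)
		return 0;

	flushed = st->pending;
	st->pending = 0;
	st->window_start_ns = now_ns;	/* start the next window */
	return flushed;
}
```

With a 100 us window starting at 0, completions at 10 us and 50 us are
held back, and a third one at 100 us flushes all three in one batch.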

Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/veth.c | 100 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 98 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d5675d9d5236..743d17b37223 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -28,6 +28,7 @@
 #include <linux/bpf_trace.h>
 #include <linux/net_tstamp.h>
 #include <linux/skbuff_ref.h>
+#include <linux/sched/clock.h>
 #include <net/page_pool/helpers.h>
 
 #define DRV_NAME	"veth"
@@ -44,6 +45,8 @@
 #define VETH_XDP_TX_BULK_SIZE	16
 #define VETH_XDP_BATCH		16
 
+#define VETH_BQL_COAL_TX_USECS	100 /* default tx-usecs for BQL batching */
+
 struct veth_stats {
 	u64	rx_drops;
 	/* xdp */
@@ -62,6 +65,11 @@ struct veth_rq_stats {
 	struct u64_stats_sync	syncp;
 };
 
+struct veth_bql_state {
+	u64	time;	/* sched_clock() when current coalescing window started */
+	int	n_bql;	/* BQL completions batched in the current window */
+};
+
 struct veth_rq {
 	struct napi_struct	xdp_napi;
 	struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
@@ -69,6 +77,7 @@ struct veth_rq {
 	struct bpf_prog __rcu	*xdp_prog;
 	struct xdp_mem_info	xdp_mem;
 	struct veth_rq_stats	stats;
+	struct veth_bql_state	bql_state;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
 	struct xdp_rxq_info	xdp_rxq;
@@ -81,6 +90,7 @@ struct veth_priv {
 	struct bpf_prog		*_xdp_prog;
 	struct veth_rq		*rq;
 	unsigned int		requested_headroom;
+	unsigned int		tx_coal_usecs;	/* BQL completion coalescing */
 };
 
 struct veth_xdp_tx_bq {
@@ -265,7 +275,30 @@ static void veth_get_channels(struct net_device *dev,
 static int veth_set_channels(struct net_device *dev,
 			     struct ethtool_channels *ch);
 
+static int veth_get_coalesce(struct net_device *dev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	ec->tx_coalesce_usecs = priv->tx_coal_usecs;
+	return 0;
+}
+
+static int veth_set_coalesce(struct net_device *dev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	priv->tx_coal_usecs = ec->tx_coalesce_usecs;
+	return 0;
+}
+
 static const struct ethtool_ops veth_ethtool_ops = {
+	.supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
 	.get_drvinfo		= veth_get_drvinfo,
 	.get_link		= ethtool_op_get_link,
 	.get_strings		= veth_get_strings,
@@ -275,6 +308,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
 	.get_ts_info		= ethtool_op_get_ts_info,
 	.get_channels		= veth_get_channels,
 	.set_channels		= veth_set_channels,
+	.get_coalesce		= veth_get_coalesce,
+	.set_coalesce		= veth_set_coalesce,
 };
 
 /* general routines */
@@ -937,13 +972,45 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	return NULL;
 }
 
+static void veth_bql_complete(struct veth_bql_state *state,
+			      struct netdev_queue *peer_txq)
+{
+	netdev_tx_completed_queue(peer_txq, state->n_bql,
+				  state->n_bql * VETH_BQL_UNIT);
+	state->n_bql = 0;
+	state->time = sched_clock();
+}
+
+static void veth_bql_maybe_complete(struct veth_bql_state *state,
+				    struct netdev_queue *peer_txq,
+				    u64 coalescing_ns)
+{
+	if (state->n_bql && sched_clock() >= state->time + coalescing_ns)
+		veth_bql_complete(state, peer_txq);
+}
+
 static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
 			struct veth_stats *stats,
 			struct netdev_queue *peer_txq)
 {
+	struct veth_bql_state *state = &rq->bql_state;
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
+	struct veth_priv *priv;
+	u64 bql_flush_ns;
+
+	priv = netdev_priv(rq->dev);
+	bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
+
+	/* Clamp the stored timestamp in case we migrated to a CPU whose
+	 * sched_clock() is behind; otherwise the deadline may never fire.
+	 */
+	state->time = min(state->time, sched_clock());
+
+	/* Flush completions that timed out since the previous NAPI poll. */
+	if (peer_txq && bql_flush_ns)
+		veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);
 
 	for (i = 0; i < budget; i++) {
 		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -972,8 +1039,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct sk_buff *skb = veth_ptr_to_skb(ptr);
 
 			stats->xdp_bytes += skb->len;
-			if (peer_txq && bql_charged)
-				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+			if (peer_txq && bql_charged) {
+				if (!bql_flush_ns) {
+					netdev_tx_completed_queue(peer_txq, 1,
+								  VETH_BQL_UNIT);
+				} else {
+					state->n_bql++;
+					veth_bql_maybe_complete(state, peer_txq,
+								bql_flush_ns);
+				}
+			}
 
 			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
 			if (skb) {
@@ -989,6 +1064,18 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 	if (n_xdpf)
 		veth_xdp_rcv_bulk_skb(rq, xdpf, n_xdpf, bq, stats);
 
+	/* If the ring is now empty and the peer TX queue is stalled by DQL
+	 * backpressure, release completions immediately to unblock it.
+	 */
+	if (peer_txq && state->n_bql && __ptr_ring_empty(&rq->xdp_ring)) {
+		/* Pairs with smp_wmb() in __ptr_ring_produce(); ensure ring
+		 * emptiness is observed before reading peer_txq->state.
+		 */
+		smp_rmb();
+		if (test_bit(__QUEUE_STATE_STACK_XOFF, &peer_txq->state))
+			veth_bql_complete(state, peer_txq);
+	}
+
 	u64_stats_update_begin(&rq->stats.syncp);
 	rq->stats.vs.xdp_redirect += stats->xdp_redirect;
 	rq->stats.vs.xdp_bytes += stats->xdp_bytes;
@@ -1093,6 +1180,9 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end)
 
 		napi_enable(&rq->xdp_napi);
 		rcu_assign_pointer(priv->rq[i].napi, &priv->rq[i].xdp_napi);
+
+		rq->bql_state.time = sched_clock();
+		rq->bql_state.n_bql = 0;
 	}
 
 	return 0;
@@ -1134,6 +1224,8 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
 		struct veth_rq *rq = &priv->rq[i];
 
 		rq->rx_notify_masked = false;
+		rq->bql_state.n_bql = 0;
+		rq->bql_state.time = 0;
 		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
 	}
 
@@ -1813,6 +1905,8 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
 
 static void veth_setup(struct net_device *dev)
 {
+	struct veth_priv *priv = netdev_priv(dev);
+
 	ether_setup(dev);
 
 	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
@@ -1838,6 +1932,8 @@ static void veth_setup(struct net_device *dev)
 	dev->max_mtu = ETH_MAX_MTU;
 	dev->watchdog_timeo = msecs_to_jiffies(16000);
 
+	priv->tx_coal_usecs = VETH_BQL_COAL_TX_USECS;
+
 	dev->hw_features = VETH_FEATURES;
 	dev->hw_enc_features = VETH_FEATURES;
 	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
-- 
2.43.0
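
For readers who want to experiment with the coalescing rule outside the
kernel, the following is a minimal userspace sketch of the logic in this
patch. It is not the driver code: struct coal_state, struct fake_txq and
the coal_*() helpers are hypothetical names, time is passed in explicitly
instead of calling sched_clock(), and the "stalled" flag only stands in
for __QUEUE_STATE_STACK_XOFF (the real flag is managed by DQL, not cleared
unconditionally on every flush).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; none of these names exist in veth.c. */
struct coal_state {
	uint64_t window_start_ns; /* when the current coalescing window began */
	int      pending;         /* completions batched in this window */
};

struct fake_txq {
	int  inflight; /* packets charged but not yet completed */
	bool stalled;  /* simplified stand-in for STACK_XOFF */
};

/* Report all batched completions at once and open a new window. */
static void coal_flush(struct coal_state *s, struct fake_txq *q,
		       uint64_t now_ns)
{
	assert(s->pending <= q->inflight);
	q->inflight -= s->pending;
	q->stalled = false; /* simplification: real DQL decides this */
	s->pending = 0;
	s->window_start_ns = now_ns;
}

/* Record one consumed packet; flush only once the window has expired.
 * coal_ns == 0 models "tx-usecs 0", i.e. per-packet completion.
 * Returns true if a flush happened.
 */
static bool coal_complete_one(struct coal_state *s, struct fake_txq *q,
			      uint64_t now_ns, uint64_t coal_ns)
{
	/* Clamp in case the clock appears to have gone backwards. */
	if (s->window_start_ns > now_ns)
		s->window_start_ns = now_ns;

	s->pending++;
	if (coal_ns == 0 || now_ns >= s->window_start_ns + coal_ns) {
		coal_flush(s, q, now_ns);
		return true;
	}
	return false;
}

/* End-of-poll rule: if the ring is empty and the producer is stalled,
 * release the batch immediately instead of waiting for the deadline.
 */
static bool coal_end_of_poll(struct coal_state *s, struct fake_txq *q,
			     bool ring_empty, uint64_t now_ns)
{
	if (s->pending && ring_empty && q->stalled) {
		coal_flush(s, q, now_ns);
		return true;
	}
	return false;
}
```

With a 100 us window, a completion at t=10 us stays pending, one at
t=100 us flushes both, and a later pending completion is released early
only when the ring is empty and the producer side is stalled.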


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-05-27 13:54 ` [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
@ 2026-05-28  7:45   ` Jonas Köppeler
  2026-06-04  8:19   ` Paolo Abeni
  1 sibling, 0 replies; 30+ messages in thread
From: Jonas Köppeler @ 2026-05-28  7:45 UTC (permalink / raw)
  To: hawk, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

On 5/27/26 3:54 PM, hawk@kernel.org wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
>
> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") gave qdiscs control over veth by returning
> NETDEV_TX_BUSY when the ptr_ring is full (DRV_XOFF).  That commit noted
> a known limitation: the 256-entry ptr_ring sits in front of the qdisc as
> a dark buffer, adding base latency because the qdisc has no visibility
> into how many bytes are already queued there.
>
> Add BQL support so the qdisc gets feedback and can begin shaping traffic
> before the ring fills.  In testing with fq_codel, BQL reduces ping RTT
> under UDP load from ~6.61ms to ~0.36ms (18x).
>
> Charge a fixed VETH_BQL_UNIT (1) per packet rather than skb->len, so
> the DQL limit tracks packets-in-flight.  Unlike a physical NIC, veth
> has no link speed -- the ptr_ring drains at CPU speed and is
> packet-indexed, not byte-indexed, so bytes are not the natural unit.
> With byte-based charging, small packets sneak many more entries into
> the ring before STACK_XOFF fires, deepening the dark buffer under
> mixed-size workloads.  Testing with a concurrent min-size packet flood
> shows 3.7x ping RTT degradation with skb->len charging versus no
> change with fixed-unit charging.
>
> Charge BQL inside veth_xdp_rx() under the ptr_ring producer_lock, after
> confirming the ring is not full.  The charge must precede the produce
> because the NAPI consumer can run on another CPU and complete the SKB
> the instant it becomes visible in the ring.  Doing both under the same
> lock avoids a pre-charge/undo pattern -- BQL is only charged when
> produce is guaranteed to succeed.
>
> BQL is enabled only when a real qdisc is attached (guarded by
> !qdisc_txq_has_no_queue), as HARD_TX_LOCK provides serialization
> for TXQ modification like dql_queued(). For lltx devices, like veth,
> this HARD_TX_LOCK serialization isn't provided.  The ptr_ring
> producer_lock provides additional serialization that would allow
> BQL to work correctly even with noqueue, though that combination
> is not currently enabled, as the netstack will drop and warn.
>
> Track per-SKB BQL state via a VETH_BQL_FLAG pointer tag in the ptr_ring
> entry.  This is necessary because the qdisc can be replaced live while
> SKBs are in-flight -- each SKB must carry the charge decision made at
> enqueue time rather than re-checking the peer's qdisc at completion.
>
> Complete per-SKB in veth_xdp_rcv() rather than in bulk, so STACK_XOFF
> clears promptly when producer and consumer run on different CPUs.
>
> BQL introduces a second independent queue-stop mechanism (STACK_XOFF)
> alongside the existing DRV_XOFF (ring full).  Both must be clear for
> the queue to transmit.  Reset BQL state in veth_napi_del_range() after
> synchronize_net() to avoid racing with in-flight veth_poll() calls.
> Clamp the reset loop to the peer's real_num_tx_queues, since the peer
> may have fewer TX queues than the local device has RX queues (e.g. when
> veth is enslaved to a bond with XDP attached).
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>

> ---
>   drivers/net/veth.c | 86 ++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 75 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 0cfb19b760dd..21ff78533943 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -34,9 +34,13 @@
>   #define DRV_VERSION	"1.0"
>   
>   #define VETH_XDP_FLAG		BIT(0)
> +#define VETH_BQL_FLAG		BIT(1)
>   #define VETH_RING_SIZE		256
>   #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
>   
> +/* Fixed BQL charge: DQL limit tracks packets-in-flight, not bytes */
> +#define VETH_BQL_UNIT		1
> +
>   #define VETH_XDP_TX_BULK_SIZE	16
>   #define VETH_XDP_BATCH		16
>   
> @@ -280,6 +284,21 @@ static bool veth_is_xdp_frame(void *ptr)
>   	return (unsigned long)ptr & VETH_XDP_FLAG;
>   }
>   
> +static bool veth_ptr_is_bql(void *ptr)
> +{
> +	return (unsigned long)ptr & VETH_BQL_FLAG;
> +}
> +
> +static struct sk_buff *veth_ptr_to_skb(void *ptr)
> +{
> +	return (void *)((unsigned long)ptr & ~VETH_BQL_FLAG);
> +}
> +
> +static void *veth_skb_to_ptr(struct sk_buff *skb, bool bql)
> +{
> +	return bql ? (void *)((unsigned long)skb | VETH_BQL_FLAG) : skb;
> +}
> +
>   static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
>   {
>   	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
> @@ -295,7 +314,7 @@ static void veth_ptr_free(void *ptr)
>   	if (veth_is_xdp_frame(ptr))
>   		xdp_return_frame(veth_ptr_to_xdp(ptr));
>   	else
> -		kfree_skb(ptr);
> +		kfree_skb(veth_ptr_to_skb(ptr));
>   }
>   
>   static void __veth_xdp_flush(struct veth_rq *rq)
> @@ -309,19 +328,33 @@ static void __veth_xdp_flush(struct veth_rq *rq)
>   	}
>   }
>   
> -static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> +static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb, bool do_bql,
> +		       struct netdev_queue *txq)
>   {
> -	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
> +	struct ptr_ring *ring = &rq->xdp_ring;
> +
> +	spin_lock(&ring->producer_lock);
> +	if (unlikely(__ptr_ring_check_produce(ring))) {
> +		spin_unlock(&ring->producer_lock);
>   		return NETDEV_TX_BUSY; /* signal qdisc layer */
> +	}
> +
> +	/* BQL charge before produce; consumer cannot see entry yet */
> +	if (do_bql)
> +		netdev_tx_sent_queue(txq, VETH_BQL_UNIT);
> +
> +	__ptr_ring_produce(ring, veth_skb_to_ptr(skb, do_bql));
> +	spin_unlock(&ring->producer_lock);
>   
>   	return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>   }
>   
>   static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
> -			    struct veth_rq *rq, bool xdp)
> +			    struct veth_rq *rq, bool xdp, bool do_bql,
> +			    struct netdev_queue *txq)
>   {
>   	return __dev_forward_skb(dev, skb) ?: xdp ?
> -		veth_xdp_rx(rq, skb) :
> +		veth_xdp_rx(rq, skb, do_bql, txq) :
>   		__netif_rx(skb);
>   }
>   
> @@ -348,10 +381,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>   {
>   	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>   	struct veth_rq *rq = NULL;
> -	struct netdev_queue *txq;
> +	struct netdev_queue *txq = NULL;
>   	struct net_device *rcv;
>   	int length = skb->len;
>   	bool use_napi = false;
> +	bool do_bql = false;
>   	int ret, rxq;
>   
>   	rcu_read_lock();
> @@ -375,8 +409,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>   	}
>   
>   	skb_tx_timestamp(skb);
> -
> -	ret = veth_forward_skb(rcv, skb, rq, use_napi);
> +	if (rxq < dev->real_num_tx_queues) {
> +		txq = netdev_get_tx_queue(dev, rxq);
> +		/* BQL charge happens inside veth_xdp_rx() under producer_lock */
> +		do_bql = use_napi && !qdisc_txq_has_no_queue(txq);
> +	}
> +	ret = veth_forward_skb(rcv, skb, rq, use_napi, do_bql, txq);
>   	switch (ret) {
>   	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
>   		if (!use_napi)
> @@ -412,6 +450,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>   		net_crit_ratelimited("%s(%s): Invalid return code(%d)",
>   				     __func__, dev->name, ret);
>   	}
> +
>   	rcu_read_unlock();
>   
>   	return ret;
> @@ -900,7 +939,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
>   
>   static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>   			struct veth_xdp_tx_bq *bq,
> -			struct veth_stats *stats)
> +			struct veth_stats *stats,
> +			struct netdev_queue *peer_txq)
>   {
>   	int i, done = 0, n_xdpf = 0;
>   	void *xdpf[VETH_XDP_BATCH];
> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>   			}
>   		} else {
>   			/* ndo_start_xmit */
> -			struct sk_buff *skb = ptr;
> +			bool bql_charged = veth_ptr_is_bql(ptr);
> +			struct sk_buff *skb = veth_ptr_to_skb(ptr);
>   
>   			stats->xdp_bytes += skb->len;
> +			if (peer_txq && bql_charged)
> +				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
> +
>   			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
>   			if (skb) {
>   				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
> @@ -976,7 +1020,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
>   		   netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>   
>   	xdp_set_return_frame_no_direct();
> -	done = veth_xdp_rcv(rq, budget, &bq, &stats);
> +	done = veth_xdp_rcv(rq, budget, &bq, &stats, peer_txq);
>   
>   	if (stats.xdp_redirect > 0)
>   		xdp_do_flush();
> @@ -1074,6 +1118,7 @@ static int __veth_napi_enable(struct net_device *dev)
>   static void veth_napi_del_range(struct net_device *dev, int start, int end)
>   {
>   	struct veth_priv *priv = netdev_priv(dev);
> +	struct net_device *peer;
>   	int i;
>   
>   	for (i = start; i < end; i++) {
> @@ -1092,6 +1137,24 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
>   		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
>   	}
>   
> +	/* Reset BQL and wake stopped peer txqs.  A concurrent veth_xmit()
> +	 * may have set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
> +	 * synchronize_net(), and NAPI can no longer clear it.
> +	 * Only wake when the device is still up.
> +	 */
> +	peer = rtnl_dereference(priv->peer);
> +	if (peer) {
> +		int peer_end = min_t(int, end, peer->real_num_tx_queues);
> +
> +		for (i = start; i < peer_end; i++) {
> +			struct netdev_queue *txq = netdev_get_tx_queue(peer, i);
> +
> +			netdev_tx_reset_queue(txq);
> +			if (netif_running(dev))
> +				netif_tx_wake_queue(txq);
> +		}
> +	}
> +
>   	for (i = start; i < end; i++) {
>   		page_pool_destroy(priv->rq[i].page_pool);
>   		priv->rq[i].page_pool = NULL;
> @@ -1741,6 +1804,7 @@ static void veth_setup(struct net_device *dev)
>   	dev->priv_flags |= IFF_PHONY_HEADROOM;
>   	dev->priv_flags |= IFF_DISABLE_NETPOLL;
>   	dev->lltx = true;
> +	dev->bql = true;
>   
>   	dev->netdev_ops = &veth_netdev_ops;
>   	dev->xdp_metadata_ops = &veth_xdp_metadata_ops;
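
The per-SKB BQL state above relies on tagging a low pointer bit in the
ptr_ring entry. As a rough illustration only, here is a userspace sketch
of that technique. struct pkt and the helper names are made up for the
example; the only things taken from the patch are the bit positions
(bit 0 for XDP frames, bit 1 for "BQL was charged"). The sketch assumes
the tagged objects are at least 4-byte aligned and checks that at runtime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit positions as VETH_XDP_FLAG / VETH_BQL_FLAG in the patch. */
#define TAG_XDP  ((uintptr_t)1 << 0)
#define TAG_BQL  ((uintptr_t)1 << 1)
#define TAG_MASK (TAG_XDP | TAG_BQL)

/* Hypothetical packet object standing in for struct sk_buff. */
struct pkt {
	int len;
};

/* Encode the charge decision made at enqueue time into the entry. */
static void *pkt_to_entry(struct pkt *p, bool bql_charged)
{
	uintptr_t v = (uintptr_t)p;

	/* The two low bits must be free, i.e. alignment >= 4 bytes. */
	assert((v & TAG_MASK) == 0);
	return (void *)(bql_charged ? (v | TAG_BQL) : v);
}

/* Consumer side: was this entry charged to BQL when it was produced? */
static bool entry_is_bql(const void *entry)
{
	return ((uintptr_t)entry & TAG_BQL) != 0;
}

/* Strip the BQL tag to recover the original pointer. */
static struct pkt *entry_to_pkt(void *entry)
{
	return (struct pkt *)((uintptr_t)entry & ~TAG_BQL);
}
```

The consumer never has to look at the producer's current qdisc: the
decision travels with the entry, which is the property the commit message
asks for when the qdisc is replaced while packets are in flight.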

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
@ 2026-05-28  7:46   ` Jonas Köppeler
  2026-06-01 12:00     ` Simon Schippers
  2026-05-29 14:51   ` Simon Schippers
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Jonas Köppeler @ 2026-05-28  7:46 UTC (permalink / raw)
  To: hawk, netdev
  Cc: Simon Schippers, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, linux-kernel, bpf

On 5/27/26 3:54 PM, hawk@kernel.org wrote:
> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>
> Per-packet BQL completion forces DQL to converge on limit=2, causing
> excessive NAPI scheduling overhead and qdisc requeues.
>
> Accumulate BQL completions and flush them when a configurable time
> threshold is exceeded, letting DQL discover a limit that bounds actual
> queuing delay to the configured interval. Coalescing state persists
> across NAPI polls in struct veth_rq so completions can accumulate
> beyond a single budget=64 cycle.
>
> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
> setting tx-usecs to 0 disables coalescing and falls back to per-packet
> completion.
>
>    ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>    ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>
> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>

Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>

> ---
>   drivers/net/veth.c | 100 ++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 98 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index d5675d9d5236..743d17b37223 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -28,6 +28,7 @@
>   #include <linux/bpf_trace.h>
>   #include <linux/net_tstamp.h>
>   #include <linux/skbuff_ref.h>
> +#include <linux/sched/clock.h>
>   #include <net/page_pool/helpers.h>
>   
>   #define DRV_NAME	"veth"
> @@ -44,6 +45,8 @@
>   #define VETH_XDP_TX_BULK_SIZE	16
>   #define VETH_XDP_BATCH		16
>   
> +#define VETH_BQL_COAL_TX_USECS	100 /* default tx-usecs for BQL batching */
> +
>   struct veth_stats {
>   	u64	rx_drops;
>   	/* xdp */
> @@ -62,6 +65,11 @@ struct veth_rq_stats {
>   	struct u64_stats_sync	syncp;
>   };
>   
> +struct veth_bql_state {
> +	u64	time;	/* sched_clock() when current coalescing window started */
> +	int	n_bql;	/* BQL completions batched in the current window */
> +};
> +
>   struct veth_rq {
>   	struct napi_struct	xdp_napi;
>   	struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
> @@ -69,6 +77,7 @@ struct veth_rq {
>   	struct bpf_prog __rcu	*xdp_prog;
>   	struct xdp_mem_info	xdp_mem;
>   	struct veth_rq_stats	stats;
> +	struct veth_bql_state	bql_state;
>   	bool			rx_notify_masked;
>   	struct ptr_ring		xdp_ring;
>   	struct xdp_rxq_info	xdp_rxq;
> @@ -81,6 +90,7 @@ struct veth_priv {
>   	struct bpf_prog		*_xdp_prog;
>   	struct veth_rq		*rq;
>   	unsigned int		requested_headroom;
> +	unsigned int		tx_coal_usecs;	/* BQL completion coalescing */
>   };
>   
>   struct veth_xdp_tx_bq {
> @@ -265,7 +275,30 @@ static void veth_get_channels(struct net_device *dev,
>   static int veth_set_channels(struct net_device *dev,
>   			     struct ethtool_channels *ch);
>   
> +static int veth_get_coalesce(struct net_device *dev,
> +			     struct ethtool_coalesce *ec,
> +			     struct kernel_ethtool_coalesce *kernel_coal,
> +			     struct netlink_ext_ack *extack)
> +{
> +	struct veth_priv *priv = netdev_priv(dev);
> +
> +	ec->tx_coalesce_usecs = priv->tx_coal_usecs;
> +	return 0;
> +}
> +
> +static int veth_set_coalesce(struct net_device *dev,
> +			     struct ethtool_coalesce *ec,
> +			     struct kernel_ethtool_coalesce *kernel_coal,
> +			     struct netlink_ext_ack *extack)
> +{
> +	struct veth_priv *priv = netdev_priv(dev);
> +
> +	priv->tx_coal_usecs = ec->tx_coalesce_usecs;
> +	return 0;
> +}
> +
>   static const struct ethtool_ops veth_ethtool_ops = {
> +	.supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
>   	.get_drvinfo		= veth_get_drvinfo,
>   	.get_link		= ethtool_op_get_link,
>   	.get_strings		= veth_get_strings,
> @@ -275,6 +308,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
>   	.get_ts_info		= ethtool_op_get_ts_info,
>   	.get_channels		= veth_get_channels,
>   	.set_channels		= veth_set_channels,
> +	.get_coalesce		= veth_get_coalesce,
> +	.set_coalesce		= veth_set_coalesce,
>   };
>   
>   /* general routines */
> @@ -937,13 +972,45 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
>   	return NULL;
>   }
>   
> +static void veth_bql_complete(struct veth_bql_state *state,
> +			      struct netdev_queue *peer_txq)
> +{
> +	netdev_tx_completed_queue(peer_txq, state->n_bql,
> +				  state->n_bql * VETH_BQL_UNIT);
> +	state->n_bql = 0;
> +	state->time = sched_clock();
> +}
> +
> +static void veth_bql_maybe_complete(struct veth_bql_state *state,
> +				    struct netdev_queue *peer_txq,
> +				    u64 coalescing_ns)
> +{
> +	if (state->n_bql && sched_clock() >= state->time + coalescing_ns)
> +		veth_bql_complete(state, peer_txq);
> +}
> +
>   static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>   			struct veth_xdp_tx_bq *bq,
>   			struct veth_stats *stats,
>   			struct netdev_queue *peer_txq)
>   {
> +	struct veth_bql_state *state = &rq->bql_state;
>   	int i, done = 0, n_xdpf = 0;
>   	void *xdpf[VETH_XDP_BATCH];
> +	struct veth_priv *priv;
> +	u64 bql_flush_ns;
> +
> +	priv = netdev_priv(rq->dev);
> +	bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
> +
> +	/* Clamp the stored timestamp in case we migrated to a CPU whose
> +	 * sched_clock() is behind; otherwise the deadline may never fire.
> +	 */
> +	state->time = min(state->time, sched_clock());
> +
> +	/* Flush completions that timed out since the previous NAPI poll. */
> +	if (peer_txq && bql_flush_ns)
> +		veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);
>   
>   	for (i = 0; i < budget; i++) {
>   		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
> @@ -972,8 +1039,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>   			struct sk_buff *skb = veth_ptr_to_skb(ptr);
>   
>   			stats->xdp_bytes += skb->len;
> -			if (peer_txq && bql_charged)
> -				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
> +			if (peer_txq && bql_charged) {
> +				if (!bql_flush_ns) {
> +					netdev_tx_completed_queue(peer_txq, 1,
> +								  VETH_BQL_UNIT);
> +				} else {
> +					state->n_bql++;
> +					veth_bql_maybe_complete(state, peer_txq,
> +								bql_flush_ns);
> +				}
> +			}
>   
>   			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
>   			if (skb) {
> @@ -989,6 +1064,18 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>   	if (n_xdpf)
>   		veth_xdp_rcv_bulk_skb(rq, xdpf, n_xdpf, bq, stats);
>   
> +	/* If the ring is now empty and the peer TX queue is stalled by DQL
> +	 * backpressure, release completions immediately to unblock it.
> +	 */
> +	if (peer_txq && state->n_bql && __ptr_ring_empty(&rq->xdp_ring)) {
> +		/* Pairs with smp_wmb() in __ptr_ring_produce(); ensure ring
> +		 * emptiness is observed before reading peer_txq->state.
> +		 */
> +		smp_rmb();
> +		if (test_bit(__QUEUE_STATE_STACK_XOFF, &peer_txq->state))
> +			veth_bql_complete(state, peer_txq);
> +	}
> +
>   	u64_stats_update_begin(&rq->stats.syncp);
>   	rq->stats.vs.xdp_redirect += stats->xdp_redirect;
>   	rq->stats.vs.xdp_bytes += stats->xdp_bytes;
> @@ -1093,6 +1180,9 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end)
>   
>   		napi_enable(&rq->xdp_napi);
>   		rcu_assign_pointer(priv->rq[i].napi, &priv->rq[i].xdp_napi);
> +
> +		rq->bql_state.time = sched_clock();
> +		rq->bql_state.n_bql = 0;
>   	}
>   
>   	return 0;
> @@ -1134,6 +1224,8 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
>   		struct veth_rq *rq = &priv->rq[i];
>   
>   		rq->rx_notify_masked = false;
> +		rq->bql_state.n_bql = 0;
> +		rq->bql_state.time = 0;
>   		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
>   	}
>   
> @@ -1813,6 +1905,8 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
>   
>   static void veth_setup(struct net_device *dev)
>   {
> +	struct veth_priv *priv = netdev_priv(dev);
> +
>   	ether_setup(dev);
>   
>   	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
> @@ -1838,6 +1932,8 @@ static void veth_setup(struct net_device *dev)
>   	dev->max_mtu = ETH_MAX_MTU;
>   	dev->watchdog_timeo = msecs_to_jiffies(16000);
>   
> +	priv->tx_coal_usecs = VETH_BQL_COAL_TX_USECS;
> +
>   	dev->hw_features = VETH_FEATURES;
>   	dev->hw_enc_features = VETH_FEATURES;
>   	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
  2026-05-28  7:46   ` Jonas Köppeler
@ 2026-05-29 14:51   ` Simon Schippers
  2026-06-04  8:21   ` Paolo Abeni
  2026-06-08 10:38   ` Simon Schippers
  3 siblings, 0 replies; 30+ messages in thread
From: Simon Schippers @ 2026-05-29 14:51 UTC (permalink / raw)
  To: hawk, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

On 5/27/26 15:54, hawk@kernel.org wrote:
> From: Simon Schippers <simon.schippers@tu-dortmund.de>
> +	/* If the ring is now empty and the peer TX queue is stalled by DQL
> +	 * backpressure, release completions immediately to unblock it.
> +	 */
> +	if (peer_txq && state->n_bql && __ptr_ring_empty(&rq->xdp_ring)) {
> +		/* Pairs with smp_wmb() in __ptr_ring_produce(); ensure ring
> +		 * emptiness is observed before reading peer_txq->state.
> +		 */
> +		smp_rmb();
> +		if (test_bit(__QUEUE_STATE_STACK_XOFF, &peer_txq->state))
> +			veth_bql_complete(state, peer_txq);
> +	}

"Does this leave coalesced BQL completions pending indefinitely if the
receive ring is empty but the peer TX queue isn't stalled by DQL? [...]"

Regarding this question raised by [1]: yes, this is the desired
behavior IMO.

As far as I understand, this should not cause a BQL stall
(it still needs to be tested, though).

A minor nit: the minimum amount of slack observed over several
iterations of completion processing (saved in dql->lowest_slack)
becomes inaccurate, so the limit might be set too low once
(at the next completion call).
I do not think it matters, though.


Overall, the AI reviews [1] [2] have some valid points.

Apart from that, I would consider merging patches 2 and 5.

[1] Link: https://sashiko.dev/#/patchset/20260527135418.1166665-1-hawk%40kernel.org
[2] Link: https://netdev-ai.bots.linux.dev/sashiko/#/patchset/20260527135418.1166665-1-hawk%40kernel.org



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-05-28  7:46   ` Jonas Köppeler
@ 2026-06-01 12:00     ` Simon Schippers
  2026-06-01 14:03       ` Jonas Köppeler
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2026-06-01 12:00 UTC (permalink / raw)
  To: Jonas Köppeler, hawk, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

On 5/28/26 09:46, Jonas Köppeler wrote:
> On 5/27/26 3:54 PM, hawk@kernel.org wrote:
>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>
>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>> excessive NAPI scheduling overhead and qdisc requeues.
>>
>> Accumulate BQL completions and flush them when a configurable time
>> threshold is exceeded, letting DQL discover a limit that bounds actual
>> queuing delay to the configured interval. Coalescing state persists
>> across NAPI polls in struct veth_rq so completions can accumulate
>> beyond a single budget=64 cycle.
>>
>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>> completion.
>>
>>    ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>    ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>
>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> 
> Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
> 

Thanks for your testing!

However, I have trouble reproducing your results.
I am running on bare metal (without virtme) with v6 plus your pktgen
patch, on the branch pktgen-and-benchmark at commit
"results: add veth-bql measurements":

1. ping fails with 100% packet loss about 20% of the time with --pktgen.
   When this happens, the average ping RTT of that run is mistakenly
   recorded as 0.0 ms, which distorts the results.
   I worked around it locally by rerunning whenever this happens.

2. pktgen runs at > 3 Mpps even with --nrules 10000, see the log below.
   I see that this is caused by qdisc drops.
   I also tried pfifo and sfq, with the same result.
   I spent quite some time on it, but I have not found a fix.

Do you have an idea?
Thanks!


The raw log:

sudo ./veth_bql_test.sh --pktgen --duration 2 --qdisc fq_codel --no-bpftrace --tx-usecs 100 --nrules 10000
INFO: Setting up veth pair with GRO
INFO: Threaded NAPI enabled
INFO: Installing qdisc: fq_codel
INFO: Loaded 10000 iptables rules in consumer NS
INFO: kernel: 7.1.0-rc4-patched-20260307+
INFO: BQL sysfs found: /sys/class/net/veth_bql0/queues/tx-0/byte_queue_limits
INFO: ethtool tx-usecs set to 100 on veth_bql1 (rx side)
INFO: Starting ping to 10.99.0.2 (5/s) to measure latency under load
INFO: Starting pktgen queue_xmit on veth_bql0 (threads=1 pkt_size=64)
  [5s] BQL inflight=1 limit=17 watchdog=0
  [5s] qdisc fq_codel pkts=27417 drops=6591520 requeues=14115 backlog=0 qlen=0 overlimits=0
  [5s] softnet: processed=27960 time_squeeze=0 multi-CPU(6): cpu0(+5) cpu1(+121) cpu2(+33) cpu3(+116) cpu4(+27641) cpu5(+44)
INFO: Ping loss: 0% packet loss
INFO: Ping summary: rtt min/avg/max/mdev = 0.127/1.818/2.703/0.761 ms
INFO: pktgen results (thread 0):
Params: count 0  min_pkt_size: 64  max_pkt_size: 64
     frags: 0  delay: 0  clone_skb: 0  ifname: veth_bql0@0
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.99.0.2  dst_max: 
     src_min:   src_max: 
     src_mac: 0e:aa:4e:05:95:89 dst_mac: c2:e5:a3:4c:2a:7f
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9999  udp_dst_max: 9999
     src_mac_count: 0  dst_mac_count: 0
     xmit_mode: xmit_queue
     Flags: NO_TIMESTAMP  QUEUE_MAP_CPU  SHARED  
Current:
     pkts-sofar: 6516031  errors: 102854
     started: 12246566004us  stopped: 12248535234us idle: 0us
     seq_num: 6516032  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 10.99.0.1  cur_daddr: 10.99.0.2
     cur_udp_dst: 9999  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 1969229(c1969229+d0) usec, 6516031 (64byte,0frags)
  3308924pps 1694Mb/sec (1694169088bps) errors: 102854
TEST: veth_bql                                                      [ OK ]
INFO: Results: /home/simon/repos/veth-backpressure-performance-testing/results/selftests/2026-06-01T13-24-51
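
As a sanity check on the "Result:" lines above, the reported rates can be
recomputed from the packet count and the elapsed time. The sketch below
is plain arithmetic with hypothetical helper names (rate_pps, rate_bps);
it is consistent with the numbers in this log but is not taken from the
pktgen source. Note that it describes packets handed to the transmit
path, not packets delivered: the qdisc counters in the same log
(pkts=27417, drops=6591520) suggest that most of them were dropped.

```c
#include <assert.h>
#include <stdint.h>

/* Packets per second from a packet count and an elapsed time in usec. */
static uint64_t rate_pps(uint64_t pkts, uint64_t elapsed_us)
{
	assert(elapsed_us > 0);
	return pkts * 1000000ULL / elapsed_us;
}

/* Bits per second for fixed-size packets of pkt_size_bytes each. */
static uint64_t rate_bps(uint64_t pps, uint64_t pkt_size_bytes)
{
	return pps * 8 * pkt_size_bytes;
}
```

With the values from the log, rate_pps(6516031, 1969229) gives 3308924,
and rate_bps(3308924, 64) gives 1694169088, i.e. 1694 Mb/sec after
integer division by 10^6.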


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-01 12:00     ` Simon Schippers
@ 2026-06-01 14:03       ` Jonas Köppeler
  2026-06-01 16:16         ` Simon Schippers
  0 siblings, 1 reply; 30+ messages in thread
From: Jonas Köppeler @ 2026-06-01 14:03 UTC (permalink / raw)
  To: Simon Schippers, hawk, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

On 6/1/26 2:00 PM, Simon Schippers wrote:
> On 5/28/26 09:46, Jonas Köppeler wrote:
>> On 5/27/26 3:54 PM, hawk@kernel.org wrote:
>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>
>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>
>>> Accumulate BQL completions and flush them when a configurable time
>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>> queuing delay to the configured interval. Coalescing state persists
>>> across NAPI polls in struct veth_rq so completions can accumulate
>>> beyond a single budget=64 cycle.
>>>
>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>> completion.
>>>
>>>     ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>     ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>
>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> Tested-by: Jonas Köppeler<j.koeppeler@tu-berlin.de>
>>
> Thanks for your testing!
>
> However, I have issues reproducing.
> I run bare metal (without virtme) with v6 + your pktgen patch
> and I am on the branch pktgen-and-benchmark, commit
> "results: add veth-bql measurements":
>
> 1. ping fails with 100% packet loss ~20% of the times with --pktgen.
>     When this happens the avg ping of this run is mistakenly set
>     to 0.0 ms, which distorts the results.
>     I fixed it locally by rerunning when this happens.
>
> 2. pktgen runs with > 3 Mpps even with --nrules 10000, see log below.
>     I see that this is because of qdisc drops.
>     I also tried pfifo and sfq but with the same result.
>     I spent quite some time on it but I do not know a fix.
>
> Do you have an idea?
> Thanks!

Hi,
yes there are some changes missing in the test script.
I have pushed it now, sorry. This should fix 1.
Regarding 2.: do not look at the pktgen output, in the
new version you will see something like "goodput",
which is the number you should look for.
Pktgen will report at what speed it enqueued packets in
the qdisc.

Let me know if it worked.
Best,
Jonas

> The raw log:
>
> sudo ./veth_bql_test.sh --pktgen --duration 2 --qdisc fq_codel --no-bpftrace --tx-usecs 100 --nrules 10000
> INFO: Setting up veth pair with GRO
> INFO: Threaded NAPI enabled
> INFO: Installing qdisc: fq_codel
> INFO: Loaded 10000 iptables rules in consumer NS
> INFO: kernel: 7.1.0-rc4-patched-20260307+
> INFO: BQL sysfs found: /sys/class/net/veth_bql0/queues/tx-0/byte_queue_limits
> INFO: ethtool tx-usecs set to 100 on veth_bql1 (rx side)
> INFO: Starting ping to 10.99.0.2 (5/s) to measure latency under load
> INFO: Starting pktgen queue_xmit on veth_bql0 (threads=1 pkt_size=64)
>    [5s] BQL inflight=1 limit=17 watchdog=0
>    [5s] qdisc fq_codel pkts=27417 drops=6591520 requeues=14115 backlog=0 qlen=0 overlimits=0
>    [5s] softnet: processed=27960 time_squeeze=0 multi-CPU(6): cpu0(+5) cpu1(+121) cpu2(+33) cpu3(+116) cpu4(+27641) cpu5(+44)
> INFO: Ping loss: 0% packet loss
> INFO: Ping summary: rtt min/avg/max/mdev = 0.127/1.818/2.703/0.761 ms
> INFO: pktgen results (thread 0):
> Params: count 0  min_pkt_size: 64  max_pkt_size: 64
>       frags: 0  delay: 0  clone_skb: 0  ifname: veth_bql0@0
>       flows: 0 flowlen: 0
>       queue_map_min: 0  queue_map_max: 0
>       dst_min: 10.99.0.2  dst_max:
>       src_min:   src_max:
>       src_mac: 0e:aa:4e:05:95:89 dst_mac: c2:e5:a3:4c:2a:7f
>       udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9999  udp_dst_max: 9999
>       src_mac_count: 0  dst_mac_count: 0
>       xmit_mode: xmit_queue
>       Flags: NO_TIMESTAMP  QUEUE_MAP_CPU  SHARED
> Current:
>       pkts-sofar: 6516031  errors: 102854
>       started: 12246566004us  stopped: 12248535234us idle: 0us
>       seq_num: 6516032  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
>       cur_saddr: 10.99.0.1  cur_daddr: 10.99.0.2
>       cur_udp_dst: 9999  cur_udp_src: 9
>       cur_queue_map: 0
>       flows: 0
> Result: OK: 1969229(c1969229+d0) usec, 6516031 (64byte,0frags)
>    3308924pps 1694Mb/sec (1694169088bps) errors: 102854
> TEST: veth_bql                                                      [ OK ]
> INFO: Results: /home/simon/repos/veth-backpressure-performance-testing/results/selftests/2026-06-01T13-24-51
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-01 14:03       ` Jonas Köppeler
@ 2026-06-01 16:16         ` Simon Schippers
  2026-06-02  7:24           ` Jonas Köppeler
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2026-06-01 16:16 UTC (permalink / raw)
  To: Jonas Köppeler, hawk, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

On 6/1/26 16:03, Jonas Köppeler wrote:
> On 6/1/26 2:00 PM, Simon Schippers wrote:
>> On 5/28/26 09:46, Jonas Köppeler wrote:
>>> On 5/27/26 3:54 PM, hawk@kernel.org wrote:
>>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>
>>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>>
>>>> Accumulate BQL completions and flush them when a configurable time
>>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>>> queuing delay to the configured interval. Coalescing state persists
>>>> across NAPI polls in struct veth_rq so completions can accumulate
>>>> beyond a single budget=64 cycle.
>>>>
>>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>>> completion.
>>>>
>>>>     ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>>     ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>>
>>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>> Tested-by: Jonas Köppeler<j.koeppeler@tu-berlin.de>
>>>
>> Thanks for your testing!
>>
>> However, I have issues reproducing.
>> I run bare metal (without virtme) with v6 + your pktgen patch
>> and I am on the branch pktgen-and-benchmark, commit
>> "results: add veth-bql measurements":
>>
>> 1. ping fails with 100% packet loss ~20% of the times with --pktgen.
>>     When this happens the avg ping of this run is mistakenly set
>>     to 0.0 ms, which distorts the results.
>>     I fixed it locally by rerunning when this happens.
>>
>> 2. pktgen runs with > 3 Mpps even with --nrules 10000, see log below.
>>     I see that this is because of qdisc drops.
>>     I also tried pfifo and sfq but with the same result.
>>     I spent quite some time on it but I do not know a fix.
>>
>> Do you have an idea?
>> Thanks!
> 
> Hi,
> yes there are some changes missing in the test script.
> I have pushed it now, sorry. This should fix 1.

I pulled it and ran...

sudo ./veth_bql_sweep.sh --runs 1 --pktgen --duration 20 --qdisc fq_codel --no-bpftrace

... but still 8/32=1/4 of the pings are zero, I do not see
a pattern.


I grabbed the logs from /tmp and this is what a failing
ping looks like:

PING 10.99.0.2 (10.99.0.2) 56(84) bytes of data.

--- 10.99.0.2 ping statistics ---
97 packets transmitted, 0 received, 100% packet loss, time 19967ms


Feels like a race or something..
Can you reproduce with the exact command?
I think you need --runs 1, else it just averages over multiple
runs.

> Regarding 2.: do not look at the pktgen output, in the
> new version you will see something like "goodput",
> which is the number you should look for.
> Pktgen will report at what speed it enqueued packets in
> the qdisc.

Exactly. Now it works. Had a single outlier but apart from that
everything is fine.

Thanks,
Simon

> 
> Let me know if it worked.
> Best,
> Jonas
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-01 16:16         ` Simon Schippers
@ 2026-06-02  7:24           ` Jonas Köppeler
  2026-06-02 15:37             ` Simon Schippers
  0 siblings, 1 reply; 30+ messages in thread
From: Jonas Köppeler @ 2026-06-02  7:24 UTC (permalink / raw)
  To: Simon Schippers, hawk, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

On 6/1/26 6:16 PM, Simon Schippers wrote:

> On 6/1/26 16:03, Jonas Köppeler wrote:
>> On 6/1/26 2:00 PM, Simon Schippers wrote:
>>> On 5/28/26 09:46, Jonas Köppeler wrote:
>>>> On 5/27/26 3:54 PM, hawk@kernel.org wrote:
>>>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>
>>>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>>>
>>>>> Accumulate BQL completions and flush them when a configurable time
>>>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>>>> queuing delay to the configured interval. Coalescing state persists
>>>>> across NAPI polls in struct veth_rq so completions can accumulate
>>>>> beyond a single budget=64 cycle.
>>>>>
>>>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>>>> completion.
>>>>>
>>>>>      ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>>>      ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>>>
>>>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>> Tested-by: Jonas Köppeler<j.koeppeler@tu-berlin.de>
>>>>
>>> Thanks for your testing!
>>>
>>> However, I have issues reproducing.
>>> I run bare metal (without virtme) with v6 + your pktgen patch
>>> and I am on the branch pktgen-and-benchmark, commit
>>> "results: add veth-bql measurements":
>>>
>>> 1. ping fails with 100% packet loss ~20% of the times with --pktgen.
>>>      When this happens the avg ping of this run is mistakenly set
>>>      to 0.0 ms, which distorts the results.
>>>      I fixed it locally by rerunning when this happens.
>>>
>>> 2. pktgen runs with > 3 Mpps even with --nrules 10000, see log below.
>>>      I see that this is because of qdisc drops.
>>>      I also tried pfifo and sfq but with the same result.
>>>      I spent quite some time on it but I do not know a fix.
>>>
>>> Do you have an idea?
>>> Thanks!
>> Hi,
>> yes there are some changes missing in the test script.
>> I have pushed it now, sorry. This should fix 1.
> I pulled it and ran...
>
> sudo ./veth_bql_sweep.sh --runs 1 --pktgen --duration 20 --qdisc fq_codel --no-bpftrace
>
> ... but still 8/32=1/4 of the pings are zero, I do not see
> a pattern.
>
>
> I grabbed the logs from /tmp and this is what a failing
> ping looks like:
>
> PING 10.99.0.2 (10.99.0.2) 56(84) bytes of data.
>
> --- 10.99.0.2 ping statistics ---
> 97 packets transmitted, 0 received, 100% packet loss, time 19967ms
>
>
> Feels like a race or something..
> Can you reproduce with the exact command?
> I think you need --runs 1, else it just averages over multiple
> runs.

Sorry, no I could not reproduce this. I used the exact same
command as you did, and I am using net-next/main + v6 patches.
I have 0% ping loss across all tests. Does the ping loss
happen regardless of the qdisc?

>
>> Regarding 2.: do not look at the pktgen output, in the
>> new version you will see something like "goodput",
>> which is the number you should look for.
>> Pktgen will report at what speed it enqueued packets in
>> the qdisc.
> Exactly. Now it works. Had a single outlier but apart from that
> everything is fine.
>
> Thanks,
> Simon
>
>> Let me know if it worked.
>> Best,
>> Jonas
>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-02  7:24           ` Jonas Köppeler
@ 2026-06-02 15:37             ` Simon Schippers
  2026-06-03  8:28               ` Jonas Köppeler
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2026-06-02 15:37 UTC (permalink / raw)
  To: Jonas Köppeler, hawk, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

On 6/2/26 09:24, Jonas Köppeler wrote:
> On 6/1/26 6:16 PM, Simon Schippers wrote:
> 
>> On 6/1/26 16:03, Jonas Köppeler wrote:
>>> On 6/1/26 2:00 PM, Simon Schippers wrote:
>>>> On 5/28/26 09:46, Jonas Köppeler wrote:
>>>>> On 5/27/26 3:54 PM, hawk@kernel.org wrote:
>>>>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>
>>>>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>>>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>>>>
>>>>>> Accumulate BQL completions and flush them when a configurable time
>>>>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>>>>> queuing delay to the configured interval. Coalescing state persists
>>>>>> across NAPI polls in struct veth_rq so completions can accumulate
>>>>>> beyond a single budget=64 cycle.
>>>>>>
>>>>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>>>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>>>>> completion.
>>>>>>
>>>>>>      ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>>>>      ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>>>>
>>>>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>> Tested-by: Jonas Köppeler<j.koeppeler@tu-berlin.de>
>>>>>
>>>> Thanks for your testing!
>>>>
>>>> However, I have issues reproducing.
>>>> I run bare metal (without virtme) with v6 + your pktgen patch
>>>> and I am on the branch pktgen-and-benchmark, commit
>>>> "results: add veth-bql measurements":
>>>>
>>>> 1. ping fails with 100% packet loss ~20% of the times with --pktgen.
>>>>      When this happens the avg ping of this run is mistakenly set
>>>>      to 0.0 ms, which distorts the results.
>>>>      I fixed it locally by rerunning when this happens.
>>>>
>>>> 2. pktgen runs with > 3 Mpps even with --nrules 10000, see log below.
>>>>      I see that this is because of qdisc drops.
>>>>      I also tried pfifo and sfq but with the same result.
>>>>      I spent quite some time on it but I do not know a fix.
>>>>
>>>> Do you have an idea?
>>>> Thanks!
>>> Hi,
>>> yes there are some changes missing in the test script.
>>> I have pushed it now, sorry. This should fix 1.
>> I pulled it and ran...
>>
>> sudo ./veth_bql_sweep.sh --runs 1 --pktgen --duration 20 --qdisc fq_codel --no-bpftrace
>>
>> ... but still 8/32=1/4 of the pings are zero, I do not see
>> a pattern.
>>
>>
>> I grabbed the logs from /tmp and this is what a failing
>> ping looks like:
>>
>> PING 10.99.0.2 (10.99.0.2) 56(84) bytes of data.
>>
>> --- 10.99.0.2 ping statistics ---
>> 97 packets transmitted, 0 received, 100% packet loss, time 19967ms
>>
>>
>> Feels like a race or something..
>> Can you reproduce with the exact command?
>> I think you need --runs 1, else it just averages over multiple
>> runs.
> 
> Sorry, no I could not reproduce this. I used the exact same
> command as you did, and I am using net-next/main + v6 patches.
> I have 0% ping loss across all tests. Does the ping loss
> happen regardless of the qdisc?
> 

Yes, it happens for each qdisc I tested.
As a fix I changed the script to rerun if this happens.

With that I ran the benchmark and also created a script to have
the result as an ASCII table.
I think it would make sense to include something like this in
the commit message.

Throughput (pps)
==================================================
nrules |   0us | 100us | 1000us | 10000us || stock
-------+-------+-------+--------+---------++------
     0 | 1.65M | 1.75M |  1.74M |   1.74M || 1.73M
   100 |  684K |  755K |   730K |    728K ||  744K
  1000 |  119K |  126K |   126K |    125K ||  126K
 10000 |   13K |   12K |    13K |     13K ||   13K


Ping RTT ms (avg)
==================================================
nrules |   0us | 100us | 1000us | 10000us || stock
-------+-------+-------+--------+---------++------
     0 | 0.016 | 0.138 |  0.137 |   0.135 || 0.133
   100 | 0.029 | 0.185 |  0.310 |   0.315 || 0.310
  1000 | 0.137 | 0.321 |   1.66 |    1.81 ||  1.78
 10000 |  1.22 |  1.87 |   3.02 |    16.0 ||  17.2

>>
>>> Regarding 2.: do not look at the pktgen output, in the
>>> new version you will see something like "goodput",
>>> which is the number you should look for.
>>> Pktgen will report at what speed it enqueued packets in
>>> the qdisc.
>> Exactly. Now it works. Had a single outlier but apart from that
>> everything is fine.
>>
>> Thanks,
>> Simon
>>
>>> Let me know if it worked.
>>> Best,
>>> Jonas
>>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-02 15:37             ` Simon Schippers
@ 2026-06-03  8:28               ` Jonas Köppeler
  0 siblings, 0 replies; 30+ messages in thread
From: Jonas Köppeler @ 2026-06-03  8:28 UTC (permalink / raw)
  To: Simon Schippers, hawk, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

On 6/2/26 17:37, Simon Schippers wrote:
> On 6/2/26 09:24, Jonas Köppeler wrote:
>> On 6/1/26 6:16 PM, Simon Schippers wrote:
>>
>>> On 6/1/26 16:03, Jonas Köppeler wrote:
>>>> On 6/1/26 2:00 PM, Simon Schippers wrote:
>>>>> On 5/28/26 09:46, Jonas Köppeler wrote:
>>>>>> On 5/27/26 3:54 PM, hawk@kernel.org wrote:
>>>>>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>
>>>>>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>>>>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>>>>>
>>>>>>> Accumulate BQL completions and flush them when a configurable time
>>>>>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>>>>>> queuing delay to the configured interval. Coalescing state persists
>>>>>>> across NAPI polls in struct veth_rq so completions can accumulate
>>>>>>> beyond a single budget=64 cycle.
>>>>>>>
>>>>>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>>>>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>>>>>> completion.
>>>>>>>
>>>>>>>       ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>>>>>       ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>>>>>
>>>>>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>> Tested-by: Jonas Köppeler<j.koeppeler@tu-berlin.de>
>>>>>>
>>>>> Thanks for your testing!
>>>>>
>>>>> However, I have issues reproducing.
>>>>> I run bare metal (without virtme) with v6 + your pktgen patch
>>>>> and I am on the branch pktgen-and-benchmark, commit
>>>>> "results: add veth-bql measurements":
>>>>>
>>>>> 1. ping fails with 100% packet loss ~20% of the times with --pktgen.
>>>>>       When this happens the avg ping of this run is mistakenly set
>>>>>       to 0.0 ms, which distorts the results.
>>>>>       I fixed it locally by rerunning when this happens.
>>>>>
>>>>> 2. pktgen runs with > 3 Mpps even with --nrules 10000, see log below.
>>>>>       I see that this is because of qdisc drops.
>>>>>       I also tried pfifo and sfq but with the same result.
>>>>>       I spent quite some time on it but I do not know a fix.
>>>>>
>>>>> Do you have an idea?
>>>>> Thanks!
>>>> Hi,
>>>> yes there are some changes missing in the test script.
>>>> I have pushed it now, sorry. This should fix 1.
>>> I pulled it and ran...
>>>
>>> sudo ./veth_bql_sweep.sh --runs 1 --pktgen --duration 20 --qdisc fq_codel --no-bpftrace
>>>
>>> ... but still 8/32=1/4 of the pings are zero, I do not see
>>> a pattern.
>>>
>>>
>>> I grabbed the logs from /tmp and this is what a failing
>>> ping looks like:
>>>
>>> PING 10.99.0.2 (10.99.0.2) 56(84) bytes of data.
>>>
>>> --- 10.99.0.2 ping statistics ---
>>> 97 packets transmitted, 0 received, 100% packet loss, time 19967ms
>>>
>>>
>>> Feels like a race or something..
>>> Can you reproduce with the exact command?
>>> I think you need --runs 1, else it just averages over multiple
>>> runs.
>> Sorry, no I could not reproduce this. I used the exact same
>> command as you did, and I am using net-next/main + v6 patches.
>> I have 0% ping loss across all tests. Does the ping loss
>> happen regardless of the qdisc?
>>
> Yes, it happens for each qdisc I tested.
> As a fix I changed the script to rerun if this happens.

I was able to reproduce it, and I think it is because you are lacking
support for the qdiscs and thus fall back to pfifo. This explains why
it occasionally drops all packets.

>
> With that I ran the benchmark and also created a script to have
> the result as an ASCII table.
> I think it would make sense to include something like this in
> the commit message.
>
> Throughput (pps)
> ==================================================
> nrules |   0us | 100us | 1000us | 10000us || stock
> -------+-------+-------+--------+---------++------
>       0 | 1.65M | 1.75M |  1.74M |   1.74M || 1.73M
>     100 |  684K |  755K |   730K |    728K ||  744K
>    1000 |  119K |  126K |   126K |    125K ||  126K
>   10000 |   13K |   12K |    13K |     13K ||   13K
>
>
> Ping RTT ms (avg)
> ==================================================
> nrules |   0us | 100us | 1000us | 10000us || stock
> -------+-------+-------+--------+---------++------
>       0 | 0.016 | 0.138 |  0.137 |   0.135 || 0.133
>     100 | 0.029 | 0.185 |  0.310 |   0.315 || 0.310
>    1000 | 0.137 | 0.321 |   1.66 |    1.81 ||  1.78
>   10000 |  1.22 |  1.87 |   3.02 |    16.0 ||  17.2

Yes, good idea. I think we should report p99 + max_ping and not the avg.
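
As a rough illustration of the p99 + max reporting suggested above, here is
a minimal user-space sketch that computes a nearest-rank percentile over
per-ping RTT samples. The function name rtt_percentile() and the choice of
the nearest-rank method are assumptions for illustration; they are not taken
from the test script.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;

	return (x > y) - (x < y);
}

/* Hypothetical helper: sort rtt_ms in place and return the
 * nearest-rank percentile. pct must be in 1..100 and n must be > 0.
 * pct == 100 returns the maximum sample.
 */
static double rtt_percentile(double *rtt_ms, size_t n, unsigned int pct)
{
	size_t rank = (pct * n + 99) / 100;	/* ceil(pct * n / 100) */

	qsort(rtt_ms, n, sizeof(*rtt_ms), cmp_double);
	return rtt_ms[rank - 1];
}
```

With few samples per run, p99 under this method coincides with the maximum,
so both numbers only diverge once a run collects more than 100 pings.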

>
>>>> Regarding 2.: do not look at the pktgen output, in the
>>>> new version you will see something like "goodput",
>>>> which is the number you should look for.
>>>> Pktgen will report at what speed it enqueued packets in
>>>> the qdisc.
>>> Exactly. Now it works. Had a single outlier but apart from that
>>> everything is fine.
>>>
>>> Thanks,
>>> Simon
>>>
>>>> Let me know if it worked.
>>>> Best,
>>>> Jonas
>>>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-05-27 13:54 ` [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
  2026-05-28  7:45   ` Jonas Köppeler
@ 2026-06-04  8:19   ` Paolo Abeni
  2026-06-10 12:21     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-06-04  8:19 UTC (permalink / raw)
  To: hawk, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

On 5/27/26 3:54 PM, hawk@kernel.org wrote:
> @@ -348,10 +381,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>  	struct veth_rq *rq = NULL;
> -	struct netdev_queue *txq;
> +	struct netdev_queue *txq = NULL;

Minor nit: please fix the variables declaration order above.

> @@ -1092,6 +1137,24 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
>  		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
>  	}
>  
> +	/* Reset BQL and wake stopped peer txqs.  A concurrent veth_xmit()
> +	 * may have set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
> +	 * synchronize_net(), and NAPI can no longer clear it.
> +	 * Only wake when the device is still up.
> +	 */
> +	peer = rtnl_dereference(priv->peer);
> +	if (peer) {
> +		int peer_end = min_t(int, end, peer->real_num_tx_queues);

Sashiko noted you may need to additionally complete/reset the tx queue
in the <old_real_num_tx_queues> .. <new_real_num_tx_queue> range, and I
think it's right.

/P


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
  2026-05-28  7:46   ` Jonas Köppeler
  2026-05-29 14:51   ` Simon Schippers
@ 2026-06-04  8:21   ` Paolo Abeni
  2026-06-08 10:38   ` Simon Schippers
  3 siblings, 0 replies; 30+ messages in thread
From: Paolo Abeni @ 2026-06-04  8:21 UTC (permalink / raw)
  To: hawk, netdev
  Cc: Simon Schippers, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, linux-kernel, bpf

On 5/27/26 3:54 PM, hawk@kernel.org wrote:
> +static int veth_set_coalesce(struct net_device *dev,
> +			     struct ethtool_coalesce *ec,
> +			     struct kernel_ethtool_coalesce *kernel_coal,
> +			     struct netlink_ext_ack *extack)
> +{
> +	struct veth_priv *priv = netdev_priv(dev);
> +
> +	priv->tx_coal_usecs = ec->tx_coalesce_usecs;
Sashiko noted that _ONCE() annotations are needed here and for the
lockless reader below.
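
A minimal user-space sketch of what such annotations could look like. The
macros below are simplified stand-ins for the kernel's WRITE_ONCE() and
READ_ONCE(), and struct veth_priv_model plus both helper functions are
hypothetical names used only for illustration, not the code from the patch.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros (assumption: the real
 * definitions in the kernel headers do additional type checking).
 */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the relevant part of struct veth_priv. */
struct veth_priv_model {
	uint32_t tx_coal_usecs;
};

/* Writer side, analogous to the ethtool set_coalesce path. */
static void model_set_tx_usecs(struct veth_priv_model *priv, uint32_t usecs)
{
	WRITE_ONCE(priv->tx_coal_usecs, usecs);
}

/* Lockless reader side: take one snapshot, then derive the flush
 * interval from that snapshot only.
 */
static uint64_t model_flush_ns(struct veth_priv_model *priv)
{
	uint32_t usecs = READ_ONCE(priv->tx_coal_usecs);

	return (uint64_t)usecs * 1000;	/* microseconds to nanoseconds */
}
```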

/P


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net
  2026-05-27 13:54 ` [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net hawk
@ 2026-06-04  8:24   ` Paolo Abeni
  2026-06-10 12:37     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-06-04  8:24 UTC (permalink / raw)
  To: hawk, netdev
  Cc: Jonas Köppeler, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, linux-kernel

On 5/27/26 3:54 PM, hawk@kernel.org wrote:
> @@ -1819,6 +1836,7 @@ static void veth_setup(struct net_device *dev)
>  	dev->priv_destructor = veth_dev_free;
>  	dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
>  	dev->max_mtu = ETH_MAX_MTU;
> +	dev->watchdog_timeo = msecs_to_jiffies(16000);

Since a repost is needed, it could be useful to use a macro for the
above constant and expand the math leading to the actual value.
Also, possibly add + 1 to avoid a very unlikely false positive
on exactly the maximum possible interval timer?

/P


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message
  2026-05-27 13:54 ` [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
@ 2026-06-04  8:30   ` Paolo Abeni
  0 siblings, 0 replies; 30+ messages in thread
From: Paolo Abeni @ 2026-06-04  8:30 UTC (permalink / raw)
  To: hawk, netdev
  Cc: Jakub Kicinski, Jonas Köppeler, Jamal Hadi Salim, Jiri Pirko,
	David S. Miller, Eric Dumazet, Simon Horman, linux-kernel

On 5/27/26 3:54 PM, hawk@kernel.org wrote:
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 237ee1cd0136..2aebab985c00 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -533,13 +533,12 @@ static void dev_watchdog(struct timer_list *t)
>  		    netif_running(dev) &&
>  		    netif_carrier_ok(dev)) {
>  			unsigned int timedout_ms = 0;
> +			struct netdev_queue *txq;
>  			unsigned int i;
>  			unsigned long trans_start;
>  			unsigned long oldest_start = jiffies;
>  
>  			for (i = 0; i < dev->num_tx_queues; i++) {
> -				struct netdev_queue *txq;
> -
>  				txq = netdev_get_tx_queue(dev, i);
>  				if (!netif_xmit_stopped(txq))
>  					continue;
> @@ -561,9 +560,10 @@ static void dev_watchdog(struct timer_list *t)
>  
>  			if (unlikely(timedout_ms)) {
>  				trace_net_dev_xmit_timeout(dev, i);
> -				netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms\n",
> +				netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms (n:%ld)\n",
>  					    raw_smp_processor_id(),
> -					    i, timedout_ms);
> +					    i, timedout_ms,
> +					    atomic_long_read(&txq->trans_timeout));

It looks like txq could be uninitialized here if num_tx_queues is 0. I'm
unsure if some weird/buggy device driver could actually hit that case,
but grep '->num_tx_queues = 0;' has more than 0 hits in the current tree
and I would err on the safe side.
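
One possible way to err on the safe side, shown as a user-space sketch: start
the pointer as NULL and only dereference it when a queue was actually
selected. struct txq_model and first_stopped_timeout_count() are hypothetical
names for illustration and do not mirror dev_watchdog() line by line.

```c
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct netdev_queue. */
struct txq_model {
	int stopped;
	long trans_timeout;
};

/* Return the timeout count of the first stopped queue, or -1 when no
 * queue is stopped. txq starts as NULL, so with num_queues == 0 it is
 * never dereferenced.
 */
static long first_stopped_timeout_count(struct txq_model *queues,
					unsigned int num_queues)
{
	struct txq_model *txq = NULL;
	unsigned int i;

	for (i = 0; i < num_queues; i++) {
		if (queues[i].stopped) {
			txq = &queues[i];
			break;
		}
	}

	return txq ? txq->trans_timeout : -1;
}
```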

/P


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
                     ` (2 preceding siblings ...)
  2026-06-04  8:21   ` Paolo Abeni
@ 2026-06-08 10:38   ` Simon Schippers
  2026-06-08 13:04     ` Jesper Dangaard Brouer
  3 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2026-06-08 10:38 UTC (permalink / raw)
  To: hawk, netdev, Jonas Köppeler
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

[-- Attachment #1: Type: text/plain, Size: 2688 bytes --]

On 5/27/26 15:54, hawk@kernel.org wrote:
> From: Simon Schippers <simon.schippers@tu-dortmund.de>
> 
> Per-packet BQL completion forces DQL to converge on limit=2, causing
> excessive NAPI scheduling overhead and qdisc requeues.
> 
> Accumulate BQL completions and flush them when a configurable time
> threshold is exceeded, letting DQL discover a limit that bounds actual
> queuing delay to the configured interval. Coalescing state persists
> across NAPI polls in struct veth_rq so completions can accumulate
> beyond a single budget=64 cycle.
> 
> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
> setting tx-usecs to 0 disables coalescing and falls back to per-packet
> completion.
> 
>   ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>   ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
> 
> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---

I found an issue: n_bql may grow without bound if producer and
consumer run at the same speed (and tx_usecs is large). This could
trigger a BUG_ON if n_bql grows beyond INT_MAX...
Also, I figured that no hardware BQL driver ever completes more than
BQL-limit-many elements at once.

Therefore, I propose a simpler logic (see attachment) that completes
either on the usual bql_flush_ns or if n_bql > dql.limit.
If n_bql > dql.limit then we either have the case above that the
producer is as fast as the consumer or we have BQL starvation.

if (state->time + bql_flush_ns <= current_time ||
	state->n_bql > peer_txq->dql.limit) {

It must be n_bql *bigger than* dql.limit because the producer will
always exceed the limit before it stops, see netdev_tx_sent_queue().
It is fast because peer_txq->dql.limit is in the cacheline of the
completion path, see dynamic_queue_limits.h.

Another advantage is that we avoid the snippet checking for empty
and BQL stopped, which requires an smp_rmb() and a test_bit().
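
The flush condition above can be modelled in user space as follows. This is
only a sketch: struct bql_state_model and bql_should_flush() are hypothetical
names, dql_limit stands in for peer_txq->dql.limit, and n_bql is assumed to be
counted in the same units as that limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the coalescing state; the field names follow
 * the snippet above, the layout is an assumption.
 */
struct bql_state_model {
	uint64_t time;		/* timestamp of the last flush, in ns */
	unsigned int n_bql;	/* completions accumulated since then */
};

/* Return true when the accumulated completions should be flushed:
 * either the coalescing interval has elapsed, or strictly more than
 * dql_limit completions are pending.
 */
static bool bql_should_flush(const struct bql_state_model *state,
			     uint64_t now_ns, uint64_t bql_flush_ns,
			     unsigned int dql_limit)
{
	return state->time + bql_flush_ns <= now_ns ||
	       state->n_bql > dql_limit;
}
```

In this model bql_flush_ns == 0 makes the time check true on every call once
now_ns has reached state->time, which matches tx_usecs = 0 behaving as
per-packet completion.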

Apart from that I:
- Always call veth_bql_maybe_complete() in the for loop to have
  more accurate completion intervals when having mixed XDP and
  non-XDP packets.
- Made it so tx_usecs = 0 is now also a normal case.
- Changed the type of n_bql to uint instead of int.
- Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
- Moved the bql_state init in __veth_napi_enable_range() in front
  of napi_enable() to avoid a race (Sashiko).
- Moved the bql_state reset in veth_napi_del_range() after the
  ptr_ring_cleanup() (probably does not matter but makes sense to me)

Benchmarks look just fine, see commit message.

WDYT?

Thanks,
Simon

[-- Attachment #2: 0005-veth-time-based-BQL-completion-coalescing-via-ethtoo.patch --]
[-- Type: text/x-patch, Size: 10426 bytes --]

From 59844f703988805ff7913989ed4dcd427ae882af Mon Sep 17 00:00:00 2001
From: Simon Schippers <simon.schippers@tu-dortmund.de>
Date: Wed, 27 May 2026 15:54:16 +0200
Subject: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing
 via ethtool tx-usecs

Per-packet BQL completion forces DQL to converge on limit=2, causing
excessive NAPI scheduling overhead and qdisc requeues.

Accumulate BQL completions and flush them when a configurable time
threshold (tx-usecs) is exceeded, letting DQL discover a limit that
bounds actual queuing delay to the configured interval. Coalescing
state persists across NAPI polls in struct veth_rq so completions can
accumulate beyond a single budget=64 cycle.

The flush condition is:

state->time + bql_flush_ns <= current_time || state->n_bql > dql.limit

Flushing when n_bql exceeds dql.limit handles two cases:
- BQL starvation
- The steady-state case where the producer and consumer run at the
  same speed with a large tx-usecs, which would otherwise allow n_bql
  to grow without bound (and potentially overflow int).

The comparison is strictly greater-than because netdev_tx_sent_queue()
always lets the producer exceed the limit by one before it stops, so
n_bql == dql.limit is a normal in-flight state. dql.limit lives in
the same cacheline as the completion path, so the check is cheap.

Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
setting tx-usecs to 0 disables coalescing and falls back to per-packet
completion.

  ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
  ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)

Benchmarks (10 runs, Ryzen 5 5600X @ 4.3 GHz, SMT off, 3200 MHz RAM):

Throughput (pps)
===========================================================================
nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
     0 | 1.62M | 1.89M | 1.75M | 1.73M |  1.73M |  1.73M |   1.73M || 1.76M
     1 | 1.51M | 1.72M | 1.63M | 1.60M |  1.60M |  1.59M |   1.59M || 1.64M
    10 | 1.33M | 1.52M | 1.47M | 1.41M |  1.41M |  1.41M |   1.41M || 1.45M
   100 |  675K |  748K |  757K |  722K |   722K |   724K |    729K ||  737K
  1000 |  117K |  125K |  125K |  126K |   124K |   124K |    124K ||  126K
 10000 |   13K |   13K |   13K |   13K |    13K |    13K |     13K ||   13K

Ping RTT ms (avg)
===========================================================================
nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
     0 | 0.017 | 0.090 | 0.137 | 0.138 |  0.138 |  0.138 |   0.138 || 0.133
     1 | 0.017 | 0.097 | 0.145 | 0.146 |  0.144 |  0.148 |   0.147 || 0.143
    10 | 0.018 | 0.092 | 0.158 | 0.165 |  0.165 |  0.162 |   0.167 || 0.159
   100 | 0.031 | 0.104 | 0.181 | 0.317 |  0.317 |  0.317 |   0.311 || 0.305
  1000 | 0.142 | 0.198 | 0.314 | 0.991 |   1.69 |   1.82 |    1.82 ||  1.76
 10000 |  1.12 |  1.72 |  1.74 |  1.76 |   2.88 |   9.27 |    15.9 ||  17.4

Ping RTT ms (p99)
===========================================================================
nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
     0 | 0.028 | 0.115 | 0.159 | 0.161 |  0.163 |  0.161 |   0.163 || 0.154
     1 | 0.027 | 0.123 | 0.170 | 0.172 |  0.169 |  0.173 |   0.172 || 0.169
    10 | 0.030 | 0.117 | 0.190 | 0.193 |  0.195 |  0.192 |   0.196 || 0.186
   100 | 0.045 | 0.134 | 0.231 | 0.368 |  0.365 |  0.370 |   0.361 || 0.358
  1000 | 0.230 | 0.300 | 0.408 | 0.989 |   2.11 |   2.12 |    2.13 ||  2.07
 10000 | 0.979 |  1.59 |  1.26 |  2.06 |   3.77 |   9.87 |    20.1 ||  20.3

Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/veth.c | 93 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 90 insertions(+), 3 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d5675d9d5236..b9179de628a6 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -28,6 +28,7 @@
 #include <linux/bpf_trace.h>
 #include <linux/net_tstamp.h>
 #include <linux/skbuff_ref.h>
+#include <linux/sched/clock.h>
 #include <net/page_pool/helpers.h>
 
 #define DRV_NAME	"veth"
@@ -44,6 +45,8 @@
 #define VETH_XDP_TX_BULK_SIZE	16
 #define VETH_XDP_BATCH		16
 
+#define VETH_BQL_COAL_TX_USECS	100 /* default tx-usecs for BQL batching */
+
 struct veth_stats {
 	u64	rx_drops;
 	/* xdp */
@@ -62,6 +65,11 @@ struct veth_rq_stats {
 	struct u64_stats_sync	syncp;
 };
 
+struct veth_bql_state {
+	u64	time;	/* sched_clock() when current coalescing window started */
+	uint	n_bql;	/* BQL completions batched in the current window */
+};
+
 struct veth_rq {
 	struct napi_struct	xdp_napi;
 	struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
@@ -69,6 +77,7 @@ struct veth_rq {
 	struct bpf_prog __rcu	*xdp_prog;
 	struct xdp_mem_info	xdp_mem;
 	struct veth_rq_stats	stats;
+	struct veth_bql_state	bql_state;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
 	struct xdp_rxq_info	xdp_rxq;
@@ -81,6 +90,7 @@ struct veth_priv {
 	struct bpf_prog		*_xdp_prog;
 	struct veth_rq		*rq;
 	unsigned int		requested_headroom;
+	unsigned int		tx_coal_usecs;	/* BQL completion coalescing */
 };
 
 struct veth_xdp_tx_bq {
@@ -265,7 +275,31 @@ static void veth_get_channels(struct net_device *dev,
 static int veth_set_channels(struct net_device *dev,
 			     struct ethtool_channels *ch);
 
+static int veth_get_coalesce(struct net_device *dev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	ec->tx_coalesce_usecs = priv->tx_coal_usecs;
+	return 0;
+}
+
+static int veth_set_coalesce(struct net_device *dev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	/* Paired with READ_ONCE in veth_xdp_rcv(). */
+	WRITE_ONCE(priv->tx_coal_usecs, ec->tx_coalesce_usecs);
+	return 0;
+}
+
 static const struct ethtool_ops veth_ethtool_ops = {
+	.supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
 	.get_drvinfo		= veth_get_drvinfo,
 	.get_link		= ethtool_op_get_link,
 	.get_strings		= veth_get_strings,
@@ -275,6 +309,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
 	.get_ts_info		= ethtool_op_get_ts_info,
 	.get_channels		= veth_get_channels,
 	.set_channels		= veth_set_channels,
+	.get_coalesce		= veth_get_coalesce,
+	.set_coalesce		= veth_set_coalesce,
 };
 
 /* general routines */
@@ -937,13 +973,56 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	return NULL;
 }
 
+static void veth_bql_maybe_complete(struct veth_bql_state *state,
+				    struct netdev_queue *peer_txq,
+				    u64 bql_flush_ns)
+{
+	u64 current_time;
+
+	/* There is no reason to complete with 0 and
+	 * peer_txq could go away.
+	 */
+	if (!state->n_bql || !peer_txq)
+		return;
+
+	current_time = sched_clock();
+
+	/* We complete if:
+	 * 1. We reach bql_flush_ns.
+	 * 2. We potentially have BQL starvation.
+	 */
+	if (state->time + bql_flush_ns <= current_time ||
+	    state->n_bql > peer_txq->dql.limit) {
+		netdev_tx_completed_queue(peer_txq, state->n_bql,
+					  state->n_bql * VETH_BQL_UNIT);
+		state->time = current_time;
+		state->n_bql = 0;
+	}
+}
+
 static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
 			struct veth_stats *stats,
 			struct netdev_queue *peer_txq)
 {
+	struct veth_bql_state *state = &rq->bql_state;
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
+	struct veth_priv *priv;
+	u64 bql_flush_ns;
+
+	priv = netdev_priv(rq->dev);
+
+	/* Paired with WRITE_ONCE() in veth_set_coalesce(). */
+	bql_flush_ns = (u64)READ_ONCE(priv->tx_coal_usecs) * 1000;
+
+	/* Clamp stored timestamp in case we migrated to a CPU with a behind
+	 * sched_clock(); tries to reduce late BQL flushes.
+	 */
+	state->time = min(state->time, sched_clock());
+
+	/* Flush completions that timed out since the previous NAPI poll. */
+	veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);
 
 	for (i = 0; i < budget; i++) {
 		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -968,12 +1047,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			}
 		} else {
 			/* ndo_start_xmit */
-			bool bql_charged = veth_ptr_is_bql(ptr);
 			struct sk_buff *skb = veth_ptr_to_skb(ptr);
 
+			state->n_bql += veth_ptr_is_bql(ptr);
 			stats->xdp_bytes += skb->len;
-			if (peer_txq && bql_charged)
-				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
 
 			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
 			if (skb) {
@@ -983,6 +1060,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 					napi_gro_receive(&rq->xdp_napi, skb);
 			}
 		}
+		veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);
 		done++;
 	}
 
@@ -1091,6 +1169,9 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end)
 	for (i = start; i < end; i++) {
 		struct veth_rq *rq = &priv->rq[i];
 
+		rq->bql_state.time = sched_clock();
+		rq->bql_state.n_bql = 0;
+
 		napi_enable(&rq->xdp_napi);
 		rcu_assign_pointer(priv->rq[i].napi, &priv->rq[i].xdp_napi);
 	}
@@ -1135,6 +1216,8 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
 
 		rq->rx_notify_masked = false;
 		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
+		rq->bql_state.n_bql = 0;
+		rq->bql_state.time = 0;
 	}
 
 	/* Reset BQL and wake stopped peer txqs.  A concurrent veth_xmit()
@@ -1813,6 +1896,8 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
 
 static void veth_setup(struct net_device *dev)
 {
+	struct veth_priv *priv = netdev_priv(dev);
+
 	ether_setup(dev);
 
 	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
@@ -1838,6 +1923,8 @@ static void veth_setup(struct net_device *dev)
 	dev->max_mtu = ETH_MAX_MTU;
 	dev->watchdog_timeo = msecs_to_jiffies(16000);
 
+	priv->tx_coal_usecs = VETH_BQL_COAL_TX_USECS;
+
 	dev->hw_features = VETH_FEATURES;
 	dev->hw_enc_features = VETH_FEATURES;
 	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
-- 
2.43.0



* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-08 10:38   ` Simon Schippers
@ 2026-06-08 13:04     ` Jesper Dangaard Brouer
  2026-06-08 13:13       ` Jonas Köppeler
  0 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2026-06-08 13:04 UTC (permalink / raw)
  To: Simon Schippers, netdev, Jonas Köppeler
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf



On 08/06/2026 12.38, Simon Schippers wrote:
> On 5/27/26 15:54, hawk@kernel.org wrote:
>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>
>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>> excessive NAPI scheduling overhead and qdisc requeues.
>>
>> Accumulate BQL completions and flush them when a configurable time
>> threshold is exceeded, letting DQL discover a limit that bounds actual
>> queuing delay to the configured interval. Coalescing state persists
>> across NAPI polls in struct veth_rq so completions can accumulate
>> beyond a single budget=64 cycle.
>>
>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>> completion.
>>
>>    ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>    ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>
>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
> 
> I found the issue that n_bql may become infinitely large if producer
> and consumer have the same speed (and tx_usecs is large). It could
> cause a potential BUG_ON if n_bql grows beyond INT_MAX...
> Also I figured that no hardware BQL driver ever completes more than
> BQL limit many elements.
> 
> Therefore, I propose a simpler logic (see attachment) that completes
> either on the usual bql_flush_ns or if n_bql > dql.limit.
> If n_bql > dql.limit then we either have the case above that the
> producer is as fast as the consumer or we have BQL starvation.
> 
> if (state->time + bql_flush_ns <= current_time ||
> 	state->n_bql > peer_txq->dql.limit) {
> 
> It must be n_bql *bigger than* dql.limit because the producer will
> always exceed the limit before it stops, see netdev_tx_sent_queue().
> It is fast because peer_txq->dql.limit is in the cacheline of the
> completion path, see dynamic_queue_limits.h.
> 
> Another advantage is that we avoid the snippet checking for empty
> and BQL stopped which requires an smp_rmb() and a test_bit().
> 
> Apart from that I:
> - Always call veth_bql_maybe_complete() in the for loop to have
>    more accurate completion intervals when having mixed XDP and
>    non-XDP packets.
> - Made it so tx_usecs = 0 is now also a normal case.
> - Change the type of n_bql to uint instead of int.
> - Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
> - Moved the bql_state init in __veth_napi_enable_range() in front
>    of napi_enable() to avoid a race (Sashiko).
> - Moved the bql_state reset in veth_napi_del_range() after the
>    ptr_ring_cleanup() (probably does not matter but makes sense to me)
> 
> Benchmarks look just fine, see commit message.
> 
> WDYT?

Looks good to me, I will use this in my V7 patchset.

A bike-shedding issue: We change the coalescing parameters for the veth
net_device, but should this be a TX or RX parameter?

For physical NICs adjusting TX coalescing will affect the BQL as this
affects the TX completion of the transmitted packets. For veth, it is the
veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
is where this patch adds netdev_tx_completed_queue calls for BQL.
Any opinions on the "TX" or "RX" color?

--Jesper


* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-08 13:04     ` Jesper Dangaard Brouer
@ 2026-06-08 13:13       ` Jonas Köppeler
  2026-06-08 14:21         ` Simon Schippers
  0 siblings, 1 reply; 30+ messages in thread
From: Jonas Köppeler @ 2026-06-08 13:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Simon Schippers, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf


On 6/8/26 3:04 PM, Jesper Dangaard Brouer wrote:
> 
> 
> On 08/06/2026 12.38, Simon Schippers wrote:
>> On 5/27/26 15:54, hawk@kernel.org wrote:
>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>
>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>
>>> Accumulate BQL completions and flush them when a configurable time
>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>> queuing delay to the configured interval. Coalescing state persists
>>> across NAPI polls in struct veth_rq so completions can accumulate
>>> beyond a single budget=64 cycle.
>>>
>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>> completion.
>>>
>>>    ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>    ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>
>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>> ---
>>
>> I found the issue that n_bql may become infinitely large if producer
>> and consumer have the same speed (and tx_usecs is large). It could
>> cause a potential BUG_ON if n_bql grows beyond INT_MAX...
>> Also I figured that no hardware BQL driver ever completes more than
>> BQL limit many elements.
>>
>> Therefore, I propose a simpler logic (see attachment) that completes
>> either on the usual bql_flush_ns or if n_bql > dql.limit.
>> If n_bql > dql.limit then we either have the case above that the
>> producer is as fast as the consumer or we have BQL starvation.
>>
>> if (state->time + bql_flush_ns <= current_time ||
>>     state->n_bql > peer_txq->dql.limit) {
>>
>> It must be n_bql *bigger than* dql.limit because the producer will
>> always exceed the limit before it stops, see netdev_tx_sent_queue().
>> It is fast because peer_txq->dql.limit is in the cacheline of the
>> completion path, see dynamic_queue_limits.h.
>>
>> Another advantage is that we avoid the snippet checking for empty
>> and BQL stopped which requires an smp_rmb() and a test_bit().
>>
>> Apart from that I:
>> - Always call veth_bql_maybe_complete() in the for loop to have
>>    more accurate completion intervals when having mixed XDP and
>>    non-XDP packets.
>> - Made it so tx_usecs = 0 is now also a normal case.
>> - Change the type of n_bql to uint instead of int.
>> - Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
>> - Moved the bql_state init in __veth_napi_enable_range() in front
>>    of napi_enable() to avoid a race (Sashiko).
>> - Moved the bql_state reset in veth_napi_del_range() after the
>>    ptr_ring_cleanup() (probably does not matter but makes sense to me)
>>
>> Benchmarks look just fine, see commit message.
>>
>> WDYT?

> 
> Looks good to me, I will use this in my V7 patchset.
> 
> A bike-shedding issue: We change the coalescing parameters for the veth
> net_device, but should this be a TX or RX parameter?
> 

Hi,

The results look a little bit suspicious: in some cases the p99 is
smaller than the average, which can happen if you have really large max
values. It feels like something is off with the methodology. I think we
should drop the avg and include the max value. This would give a better
picture.
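
As a small worked example of how p99 can end up below the average, here
is a sketch that computes the mean and a nearest-rank percentile over an
array of RTT samples. The function names rtt_mean() and rtt_percentile()
are made up for this mail, the nearest-rank definition is my assumption
(the test script may define p99 differently), and the sample values used
below are synthetic, not taken from the benchmark.

```c
#include <stddef.h>
#include <stdlib.h>

/* Numeric comparator for qsort(): orders doubles ascending. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/* Arithmetic mean of n samples (n must be > 0). */
double rtt_mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Nearest-rank percentile, pct in 1..100; sorts v in place. */
double rtt_percentile(double *v, size_t n, unsigned int pct)
{
	size_t rank;

	qsort(v, n, sizeof(*v), cmp_double);
	rank = (n * pct + 99) / 100; /* ceil(n * pct / 100) */
	if (rank < 1)
		rank = 1;
	return v[rank - 1];
}
```

With 99 samples of 1.0 ms and a single 200 ms outlier, the mean is
2.99 ms while the nearest-rank p99 is still 1.0 ms, so p99 < avg alone
does not prove a bug, but it does mean the max dominates the average.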

Is state->n_bql += veth_ptr_is_bql(ptr) something that is commonly done 
in the kernel? It relies on bool normalization which feels a bit 
implicit to me.
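
For reference, a userspace sketch of the two forms side by side. In C a
bool always converts to exactly 0 or 1 in arithmetic, so both loops
count the same thing; the explicit form just does not rely on the reader
knowing that. example_ptr_is_bql() and EXAMPLE_BQL_FLAG are hypothetical
stand-ins, the real veth_ptr_is_bql() and its tag bit may look
different.

```c
#include <stdbool.h>

/* Hypothetical tag bit, an assumption made for this example only. */
#define EXAMPLE_BQL_FLAG 0x2UL

/* Stand-in for veth_ptr_is_bql(): the result is normalized to 0 or 1
 * by the conversion to bool.
 */
bool example_ptr_is_bql(unsigned long ptr)
{
	return ptr & EXAMPLE_BQL_FLAG;
}

/* Implicit form: adds the bool (0 or 1) to the counter. */
unsigned int count_implicit(const unsigned long *ptrs, int n)
{
	unsigned int n_bql = 0;

	for (int i = 0; i < n; i++)
		n_bql += example_ptr_is_bql(ptrs[i]);
	return n_bql;
}

/* Explicit form: same result, no reliance on bool-to-int conversion. */
unsigned int count_explicit(const unsigned long *ptrs, int n)
{
	unsigned int n_bql = 0;

	for (int i = 0; i < n; i++) {
		if (example_ptr_is_bql(ptrs[i]))
			n_bql++;
	}
	return n_bql;
}
```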

> For physical NICs adjusting TX coalescing will affect the BQL as this
> affects the TX completion of the transmitted packets. For veth, it is the
> veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
> is where this patch adds netdev_tx_completed_queue calls for BQL.
> Any opinions on the "TX" or "RX" color?
I think I would prefer to configure it on the tx dev, and the recv side 
gets the value from the peer. Maybe something like this.

@@ -997,11 +998,12 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
         struct veth_bql_state *state = &rq->bql_state;
         int i, done = 0, n_xdpf = 0;
         void *xdpf[VETH_XDP_BATCH];
-       struct veth_priv *priv;
-       u64 bql_flush_ns;
+       u64 bql_flush_ns = 0;

-       priv = netdev_priv(rq->dev);
-       bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
+       if (peer_txq) {
+           struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
+           bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
+       }

         /* Clamp stored timestamp in case we migrated to a CPU with a behind
          * sched_clock(); prevents the deadline from never firing.

- Jonas


* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-08 13:13       ` Jonas Köppeler
@ 2026-06-08 14:21         ` Simon Schippers
  2026-06-09 13:59           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2026-06-08 14:21 UTC (permalink / raw)
  To: Jonas Köppeler, Jesper Dangaard Brouer, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf

On 6/8/26 15:13, Jonas Köppeler wrote:
> 
> On 6/8/26 3:04 PM, Jesper Dangaard Brouer wrote:
>>
>>
>> On 08/06/2026 12.38, Simon Schippers wrote:
>>> On 5/27/26 15:54, hawk@kernel.org wrote:
>>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>
>>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>>
>>>> Accumulate BQL completions and flush them when a configurable time
>>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>>> queuing delay to the configured interval. Coalescing state persists
>>>> across NAPI polls in struct veth_rq so completions can accumulate
>>>> beyond a single budget=64 cycle.
>>>>
>>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>>> completion.
>>>>
>>>>    ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>>    ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>>
>>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>> ---
>>>
>>> I found the issue that n_bql may become infinitely large if producer
>>> and consumer have the same speed (and tx_usecs is large). It could
>>> cause a potential BUG_ON if n_bql grows beyond INT_MAX...
>>> Also I figured that no hardware BQL driver ever completes more than
>>> BQL limit many elements.
>>>
>>> Therefore, I propose a simpler logic (see attachment) that completes
>>> either on the usual bql_flush_ns or if n_bql > dql.limit.
>>> If n_bql > dql.limit then we either have the case above that the
>>> producer is as fast as the consumer or we have BQL starvation.
>>>
>>> if (state->time + bql_flush_ns <= current_time ||
>>>     state->n_bql > peer_txq->dql.limit) {
>>>
>>> It must be n_bql *bigger than* dql.limit because the producer will
>>> always exceed the limit before it stops, see netdev_tx_sent_queue().
>>> It is fast because peer_txq->dql.limit is in the cacheline of the
>>> completion path, see dynamic_queue_limits.h.
>>>
>>> Another advantage is that we avoid the snippet checking for empty
>>> and BQL stopped which requires an smp_rmb() and a test_bit().
>>>
>>> Apart from that I:
>>> - Always call veth_bql_maybe_complete() in the for loop to have
>>>    more accurate completion intervals when having mixed XDP and
>>>    non-XDP packets.
>>> - Made it so tx_usecs = 0 is now also a normal case.
>>> - Change the type of n_bql to uint instead of int.
>>> - Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
>>> - Moved the bql_state init in __veth_napi_enable_range() in front
>>>    of napi_enable() to avoid a race (Sashiko).
>>> - Moved the bql_state reset in veth_napi_del_range() after the
>>>    ptr_ring_cleanup() (probably does not matter but makes sense to me)
>>>
>>> Benchmarks look just fine, see commit message.
>>>
>>> WDYT?
> 
>>
>> Looks good to me, I will use this in my V7 patchset.
>>
>> A bike-shedding issue: We change the coalescing parameters for the veth
>> net_device, but should this be a TX or RX parameter?
>>
> 
> Hi,
> 
> The results look a little bit suspicious: in some cases the p99 is smaller than the average, which can happen if you have really large max values. It feels like something is off with the methodology. I think we should drop the avg and include the max value. This would give a better picture.

Luckily I saved the raw test result from /tmp (the raw ping output
is included).

extract_ping_p99() seems to be broken. It has issues ordering the
numbers, I think when there are 2 or 3 digits after the decimal point.
The avg ping is fine.

Claude Sonnet wrote a script for me to recalculate the p99 pings
from the raw ping output.
Below is the new result. Not 100% sure, but this seems right now.
Most pings are the same.

Ping RTT ms (p99)
===========================================================================
nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
     0 | 0.028 | 0.115 | 0.159 | 0.161 |  0.163 |  0.161 |   0.163 || 0.154
     1 | 0.027 | 0.123 | 0.169 | 0.172 |  0.169 |  0.173 |   0.172 || 0.169
    10 | 0.030 | 0.117 | 0.189 | 0.193 |  0.195 |  0.192 |   0.196 || 0.186
   100 | 0.045 | 0.134 | 0.233 | 0.368 |  0.365 |  0.369 |   0.361 || 0.358
  1000 | 0.230 | 0.300 | 0.412 |  1.26 |   2.11 |   2.12 |    2.13 ||  2.07
 10000 |  2.04 |  2.55 |  2.54 |  2.72 |   3.77 |   12.1 |    20.1 ||  20.3

> 
> Is state->n_bql += veth_ptr_is_bql(ptr) something that is commonly done in the kernel? It relies on bool normalization which feels a bit implicit to me.

Ok, we could swap for something like 

if (veth_ptr_is_bql(ptr))
	state->n_bql++;

> 
>> For physical NICs adjusting TX coalescing will affect the BQL as this
>> affects the TX completion of the transmitted packets. For veth, it is the
>> veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
>> is where this patch adds netdev_tx_completed_queue calls for BQL.
>> Any opinions on the "TX" or "RX" color?

Personally I would also say TX.

> I think I would prefer to configure it on the tx dev, and the recv side gets the value from the peer. Maybe something like this.
> 
> @@ -997,11 +998,12 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>         struct veth_bql_state *state = &rq->bql_state;
>         int i, done = 0, n_xdpf = 0;
>         void *xdpf[VETH_XDP_BATCH];

Why this deletion?

> -       struct veth_priv *priv;
> -       u64 bql_flush_ns;
> +       u64 bql_flush_ns = 0;
> 
> -       priv = netdev_priv(rq->dev);
> -       bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;

I see we should check if peer_txq exists.

> +       if (peer_txq) {
> +           struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
> +           bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
> +       }
> 
>         /* Clamp stored timestamp in case we migrated to a CPU with a behind
>          * sched_clock(); prevents the deadline from never firing.
> 
> - Jonas


* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-08 14:21         ` Simon Schippers
@ 2026-06-09 13:59           ` Jesper Dangaard Brouer
  2026-06-09 15:08             ` Simon Schippers
  0 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2026-06-09 13:59 UTC (permalink / raw)
  To: Simon Schippers, Jonas Köppeler, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf, kernel-team



On 08/06/2026 16.21, Simon Schippers wrote:
> On 6/8/26 15:13, Jonas Köppeler wrote:
>>
>> On 6/8/26 3:04 PM, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 08/06/2026 12.38, Simon Schippers wrote:
>>>> On 5/27/26 15:54, hawk@kernel.org wrote:
>>>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>
>>>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>>>
>>>>> Accumulate BQL completions and flush them when a configurable time
>>>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>>>> queuing delay to the configured interval. Coalescing state persists
>>>>> across NAPI polls in struct veth_rq so completions can accumulate
>>>>> beyond a single budget=64 cycle.
>>>>>
>>>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>>>> completion.
>>>>>
>>>>>     ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>>>     ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>>>
>>>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>> ---
>>>>
>>>> I found the issue that n_bql may become infinitely large if producer
>>>> and consumer have the same speed (and tx_usecs is large). It could
>>>> cause a potential BUG_ON if n_bql grows beyond INT_MAX...
>>>> Also I figured that no hardware BQL driver ever completes more than
>>>> BQL limit many elements.
>>>>
>>>> Therefore, I propose a simpler logic (see attachment) that completes
>>>> either on the usual bql_flush_ns or if n_bql > dql.limit.
>>>> If n_bql > dql.limit then we either have the case above that the
>>>> producer is as fast as the consumer or we have BQL starvation.
>>>>
>>>> if (state->time + bql_flush_ns <= current_time ||
>>>>      state->n_bql > peer_txq->dql.limit) {
>>>>
>>>> It must be n_bql *bigger than* dql.limit because the producer will
>>>> always exceed the limit before it stops, see netdev_tx_sent_queue().
>>>> It is fast because peer_txq->dql.limit is in the cacheline of the
>>>> completion path, see dynamic_queue_limits.h.
>>>>
>>>> Another advantage is that we avoid the snippet checking for empty
>>>> and BQL stopped which requires an smp_rmb() and a test_bit().
>>>>
>>>> Apart from that I:
>>>> - Always call veth_bql_maybe_complete() in the for loop to have
>>>>     more accurate completion intervals when having mixed XDP and
>>>>     non-XDP packets.
>>>> - Made it so tx_usecs = 0 is now also a normal case.
>>>> - Change the type of n_bql to uint instead of int.
>>>> - Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
>>>> - Moved the bql_state init in __veth_napi_enable_range() in front
>>>>     of napi_enable() to avoid a race (Sashiko).
>>>> - Moved the bql_state reset in veth_napi_del_range() after the
>>>>     ptr_ring_cleanup() (probably does not matter but makes sense to me)
>>>>
>>>> Benchmarks look just fine, see commit message.
>>>>
>>>> WDYT?
>>
>>>
>>> Looks good to me, I will use this in my V7 patchset.
>>>
>>> A bike-shedding issue: We change the coalescing parameters for the veth
>>> net_device, but should this be a TX or RX parameter?
>>>
>>
>> Hi,
>>
>> The results look a little bit suspicious: in some cases the p99 is smaller than the average, which can happen if you have really large max values. It feels like something is off with the methodology. I think we should drop the avg and include the max value. This would give a better picture.
> 
> Luckily I saved the raw test result from /tmp (the raw ping output
> is included).
> 
> extract_ping_p99() seems to be broken. It has issues ordering the
> numbers, I think when there are 2 or 3 digits after the decimal point.
> The avg ping is fine.
> 
> Claude Sonnet wrote a script for me to recalculate the p99 pings
> from the raw ping output.

For programming switch to Claude Opus, as Sonnet cannot code IMHO.
(Claude Opus wrote the selftest we placed in [1])

Is your script still based on our github[1] selftest?
If so, then please make some PR(s) against my repo.
I want this to be easily reproducible for others.

[1] https://github.com/netoptimizer/veth-backpressure-performance-testing

> Below is the new result. Not 100% sure but this seems right now.
> Most pings are the same.
> 
> Ping RTT ms (p99)
> ===========================================================================
> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
> -------+-------+-------+-------+-------+--------+--------+---------++------
>      0 | 0.028 | 0.115 | 0.159 | 0.161 |  0.163 |  0.161 |   0.163 || 0.154
>      1 | 0.027 | 0.123 | 0.169 | 0.172 |  0.169 |  0.173 |   0.172 || 0.169
>     10 | 0.030 | 0.117 | 0.189 | 0.193 |  0.195 |  0.192 |   0.196 || 0.186
>    100 | 0.045 | 0.134 | 0.233 | 0.368 |  0.365 |  0.369 |   0.361 || 0.358
>   1000 | 0.230 | 0.300 | 0.412 |  1.26 |   2.11 |   2.12 |    2.13 ||  2.07
>  10000 |  2.04 |  2.55 |  2.54 |  2.72 |   3.77 |   12.1 |    20.1 ||  20.3
> 
>>
>> Is state->n_bql += veth_ptr_is_bql(ptr) something that is commonly done in the kernel? It relies on bool normalization which feels a bit implicit to me.
> 
> Ok, we could swap for something like
> 
> if (veth_ptr_is_bql(ptr))
> 	state->n_bql++;

I'll see if I can adjust the V7 patch to this feedback.


>>
>>> For physical NICs adjusting TX coalescing will affect the BQL as this
>>> affect the TX completion of the transmitted packets. For veth, it is the
>>> veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
>>> is where this patch adds netdev_tx_completed_queue calls for BQL.
>>> Any opinions on the "TX" or "RX" color?
> 
> Personally I would also say TX.
> 

Okay, we seem to have more votes for TX :-)


>> I think I would prefer to configure it on the tx dev, and the recv
>> side gets the value from the peer. Maybe something like this.
>> 

I'm considering whether we should simplify the config by having both
ends of a veth pair share the same tx-coal value. WDYT?


>> @@ -997,11 +998,12 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>          struct veth_bql_state *state = &rq->bql_state;
>>          int i, done = 0, n_xdpf = 0;
>>          void *xdpf[VETH_XDP_BATCH];
> 
> Why this deletion?
> 
>> -       struct veth_priv *priv;
>> -       u64 bql_flush_ns;
>> +       u64 bql_flush_ns = 0;
>>
>> -       priv = netdev_priv(rq->dev);
>> -       bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
> 
> I see we should check if peer_txq exists.
> 
>> +       if (peer_txq) {
>> +           struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
>> +           bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
>> +       }
>>
>>          /* Clamp stored timestamp in case we migrated to a CPU with a behind
>>           * sched_clock(); prevents the deadline from never firing.
>>
>> - Jonas

--Jesper



* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-09 13:59           ` Jesper Dangaard Brouer
@ 2026-06-09 15:08             ` Simon Schippers
  2026-06-10  7:04               ` Simon Schippers
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2026-06-09 15:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Jonas Köppeler, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf, kernel-team

On 6/9/26 15:59, Jesper Dangaard Brouer wrote:
> 
> 
> On 08/06/2026 16.21, Simon Schippers wrote:
>> On 6/8/26 15:13, Jonas Köppeler wrote:
>>>
>>> On 6/8/26 3:04 PM, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 08/06/2026 12.38, Simon Schippers wrote:
>>>>> On 5/27/26 15:54, hawk@kernel.org wrote:
>>>>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>
>>>>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>>>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>>>>
>>>>>> Accumulate BQL completions and flush them when a configurable time
>>>>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>>>>> queuing delay to the configured interval. Coalescing state persists
>>>>>> across NAPI polls in struct veth_rq so completions can accumulate
>>>>>> beyond a single budget=64 cycle.
>>>>>>
>>>>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>>>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>>>>> completion.
>>>>>>
>>>>>>     ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>>>>     ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>>>>
>>>>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>> ---
>>>>>
>>>>> I found the issue that n_bql may become infinitely large if producer
>>>>> and consumer have the same speed (and tx_usecs is large). It could
>>>>> cause a potential BUG_ON if n_bql grows beyond INT_MAX...
>>>>> Also I figured that no hardware BQL driver ever completes more than
>>>>> BQL limit many elements.
>>>>>
>>>>> Therefore, I propose a simpler logic (see attachment) that completes
>>>>> either on the usual bql_flush_ns or if n_bql > dql.limit.
>>>>> If n_bql > dql.limit then we either have the case above that the
>>>>> producer is as fast as the consumer or we have BQL starvation.
>>>>>
>>>>> if (state->time + bql_flush_ns <= current_time ||
>>>>>      state->n_bql > peer_txq->dql.limit) {
>>>>>
>>>>> It must be n_bql *bigger than* dql.limit because the producer will
>>>>> always exceed the limit before it stops, see netdev_tx_sent_queue().
>>>>> It is fast because peer_txq->dql.limit is in the cacheline of the
>>>>> completion path, see dynamic_queue_limits.h.
>>>>>
>>>>> Another advantage is that we avoid the snippet checking for empty
>>>>> and BQL stopped which requires an smp_rmb() and an test_bit().
>>>>>
>>>>> Apart from that I:
>>>>> - Always call veth_bql_maybe_complete() in the for loop to have
>>>>>     more accurate completion intervals when having mixed XDP and
>>>>>     non-XDP packets.
>>>>> - Made it so tx_usecs = 0 is now also a normal case.
>>>>> - Change the type of n_bql to uint instead of int.
>>>>> - Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
>>>>> - Moved the bql_state init in __veth_napi_enable_range() in front
>>>>>     of napi_enable() to avoid a race (Sashiko).
>>>>> - Moved the bql_state reset in veth_napi_del_range() after the
>>>>>     ptr_ring_cleanup() (probably does not matter but makes sense to me)
>>>>>
>>>>> Benchmarks look just fine, see commit message.
>>>>>
>>>>> WDYT?
>>>
>>>>
>>>> Looks good to me, I will use this in my V7 patchset.
>>>>
>>>> A bike-shedding issue: We change the coalescing parameters for the veth
>>>> net_device, but should this be a TX or RX parameter?
>>>>
>>>
>>> Hi,
>>>
>>> The results look a little bit suspicious, that in some cases the p99 is smaller than the average, which can happen if you have really large max values. It feels something is off with the methodology. I think we should drop the avg, and include the max value. This would give a better picture.
>>
>> Luckily I saved the raw test result from /tmp (the raw ping output
>> is included).
>>
>> extract_ping_p99() seems to be broken. It has issues ordering the
>> numbers. So if there are 2 or 3 numbers after the point I think.
>> avg ping is fine.
>>
>> Claude Sonnet wrote a script for me to recalculate the p99 pings
>> from the raw ping output).
> 
> For programming switch to Claude Opus, as Sonnet cannot code IMHO.

Already used 95% of my Copilot Pro+ Tokens this month :P

> (Claude Opus wrote the selftest we placed in [1])
> 
> Is your script still based on our github[1] selftest ?
> If so, then please make some PR(s) against my repo.
> I want this to be easy reproducible for others.
> 
> [1] https://github.com/netoptimizer/veth-backpressure-performance-testing
> 

I forked Jonas' latest branch [1] and ran the tests with that.

But I faced 2 issues that I then fixed:
- Ping fails in some runs (~20%) -> My fix: Retry the run until it
  works.
- extract_ping_p99() has a bug for me which caused wrong results,
  as pointed out by Jonas -> Also fixed it
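
To illustrate the kind of ordering bug described above: assuming the
wrong p99 came from sorting RTT values as text rather than as numbers
(an assumption; the exact cause in extract_ping_p99() is not shown in
this thread), a minimal C sketch of a nearest-rank percentile with a
numeric sort could look like this. The names cmp_rtt_numeric(),
cmp_rtt_text() and rtt_percentile() are hypothetical and are not part
of the selftest.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: order two RTT samples by numeric value. */
static int cmp_rtt_numeric(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/* Hypothetical helper: order two RTT strings byte-wise, the way a
 * plain text sort would. "12.1" sorts before "2.04" here.
 */
static int cmp_rtt_text(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Nearest-rank percentile (pct in 1..100). Sorts rtt[] in place. */
static double rtt_percentile(double *rtt, size_t n, unsigned int pct)
{
	size_t rank;

	assert(n > 0 && pct >= 1 && pct <= 100);
	rank = (n * pct + 99) / 100;	/* ceil(n * pct / 100) */
	qsort(rtt, n, sizeof(*rtt), cmp_rtt_numeric);
	return rtt[rank - 1];
}
```

With the made-up samples 2.04, 12.1, 0.230 and 3.77 ms, a text sort
places "3.77" last, while the numeric sort places 12.1 last, so
rtt_percentile(samples, 4, 99) returns 12.1.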

Because of the inconsistencies I will rerun overnight to have
clean benchmark results.

I will make a PR tomorrow.

[1] https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark

>> Below is the new result. Not 100% sure but this seems right now.
>> Most pings are the same.
>>
>> Ping RTT ms (p99)
>> ===========================================================================
>> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
>> -------+-------+-------+-------+-------+--------+--------+---------++------
>>      0 | 0.028 | 0.115 | 0.159 | 0.161 |  0.163 |  0.161 |   0.163 || 0.154
>>      1 | 0.027 | 0.123 | 0.169 | 0.172 |  0.169 |  0.173 |   0.172 || 0.169
>>     10 | 0.030 | 0.117 | 0.189 | 0.193 |  0.195 |  0.192 |   0.196 || 0.186
>>    100 | 0.045 | 0.134 | 0.233 | 0.368 |  0.365 |  0.369 |   0.361 || 0.358
>>   1000 | 0.230 | 0.300 | 0.412 |  1.26 |   2.11 |   2.12 |    2.13 ||  2.07
>>  10000 |  2.04 |  2.55 |  2.54 |  2.72 |   3.77 |   12.1 |    20.1 ||  20.3
>>
>>>
>>> Is state->n_bql += veth_ptr_is_bql(ptr) something that is commonly done in the kernel? It relies on bool normalization which feels a bit implicit to me.
>>
>> Ok, we could swap for something like
>>
>> if (veth_ptr_is_bql(ptr))
>>     state->n_bql++;
> 
> I'll see if I can adjust the V7 patch to this feedback.

Yes, and also this snippet pointed out by Jonas below:

 -       struct veth_priv *priv;
 -       u64 bql_flush_ns;
 +       u64 bql_flush_ns = 0;

 -       priv = netdev_priv(rq->dev);
 -       bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
 +       if (peer_txq) {
 +           struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
 +           bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
 +       }

Or is peer_txq guaranteed to be non-NULL in veth_xdp_rcv()?


I would also change this in the commit message:

"Flushing when n_bql exceeds dql.limit handles two cases:
- BQL starvation
- The steady-state case where the producer and consumer run at the
  same speed with a large tx-usecs, which would otherwise allow n_bql
  to grow without bound (and potentially overflow int)."

... to just...

"Flushing when n_bql exceeds dql.limit handles BQL starvation."

... because on reflection I overstated the problem.
n_bql should not grow without bound in the v6 version either.
Nevertheless, the new method should be simpler and faster.
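
For readers following along, the flush rule proposed earlier in this
thread (complete when the tx-usecs interval has elapsed, or when n_bql
exceeds dql.limit) can be modeled in plain userspace C. This is only a
sketch of the condition quoted above, not the driver code: struct
bql_coal_state and bql_should_flush() are hypothetical names, and the
DQL limit is passed in as a plain integer instead of being read from
peer_txq->dql.limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the per-queue coalescing state. */
struct bql_coal_state {
	uint64_t time_ns;	/* timestamp of the last flush */
	unsigned int n_bql;	/* completions accumulated since then */
};

/* Return true when the accumulated completions should be reported.
 *
 * Rule 1: the coalescing interval has elapsed. With flush_ns == 0
 *         this is always true, i.e. per-packet completion.
 * Rule 2: more completions are pending than the current DQL limit.
 *         The comparison is strict, because the producer overshoots
 *         the limit before it stops.
 */
static bool bql_should_flush(const struct bql_coal_state *st,
			     uint64_t now_ns, uint64_t flush_ns,
			     unsigned int dql_limit)
{
	return st->time_ns + flush_ns <= now_ns ||
	       st->n_bql > dql_limit;
}
```

For example, with time_ns = 1000, n_bql = 3 and flush_ns = 1000, the
model does not flush at now_ns = 1500 when the limit is 8, but it does
flush at the same time when the limit is 2, because 3 pending
completions exceed it.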

> 
> 
>>>
>>>> For physical NICs adjusting TX coalescing will affect the BQL as this
>>>> affect the TX completion of the transmitted packets. For veth, it is the
>>>> veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
>>>> is where this patch adds netdev_tx_completed_queue calls for BQL.
>>>> Any opinions on the "TX" or "RX" color?
>>
>> Personally I would also say TX.
>>
> 
> Okay, we seem to have more votes for TX :-)
> 
> 
>>> I think I would prefer to configure it on the tx dev, and the recv
>>> side gets the value from the peer. Maybe something like this.
>>>
> 
> I'm considering if we should simplify the config by having veth pairs have the same tx-coal value. WDYT?

I cannot think of a case where a user would want a different
tx-coal value on each side.
So that is fine with me.

> 
> 
>>> @@ -997,11 +998,12 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>          struct veth_bql_state *state = &rq->bql_state;
>>>          int i, done = 0, n_xdpf = 0;
>>>          void *xdpf[VETH_XDP_BATCH];
>>
>> Why this deletion?
>>
>>> -       struct veth_priv *priv;
>>> -       u64 bql_flush_ns;
>>> +       u64 bql_flush_ns = 0;
>>>
>>> -       priv = netdev_priv(rq->dev);
>>> -       bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
>>
>> I see we should check if peer_txq exists.
>>
>>> +       if (peer_txq) {
>>> +           struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
>>> +           bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
>>> +       }
>>>
>>>          /* Clamp stored timestamp in case we migrated to a CPU with a behind
>>>           * sched_clock(); prevents the deadline from never firing.
>>>
>>> - Jonas
> 
> --Jesper
> 


* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-09 15:08             ` Simon Schippers
@ 2026-06-10  7:04               ` Simon Schippers
  2026-06-10 10:15                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2026-06-10  7:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Jonas Köppeler, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf, kernel-team

On 6/9/26 17:08, Simon Schippers wrote:
> On 6/9/26 15:59, Jesper Dangaard Brouer wrote:
>>
>>
>> On 08/06/2026 16.21, Simon Schippers wrote:
>>> On 6/8/26 15:13, Jonas Köppeler wrote:
>>>>
>>>> On 6/8/26 3:04 PM, Jesper Dangaard Brouer wrote:
>>>>>
>>>>>
>>>>> On 08/06/2026 12.38, Simon Schippers wrote:
>>>>>> On 5/27/26 15:54, hawk@kernel.org wrote:
>>>>>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>
>>>>>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>>>>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>>>>>
>>>>>>> Accumulate BQL completions and flush them when a configurable time
>>>>>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>>>>>> queuing delay to the configured interval. Coalescing state persists
>>>>>>> across NAPI polls in struct veth_rq so completions can accumulate
>>>>>>> beyond a single budget=64 cycle.
>>>>>>>
>>>>>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>>>>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>>>>>> completion.
>>>>>>>
>>>>>>>     ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>>>>>     ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>>>>>
>>>>>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>> ---
>>>>>>
>>>>>> I found the issue that n_bql may become infinitely large if producer
>>>>>> and consumer have the same speed (and tx_usecs is large). It could
>>>>>> cause a potential BUG_ON if n_bql grows beyond INT_MAX...
>>>>>> Also I figured that no hardware BQL driver ever completes more than
>>>>>> BQL limit many elements.
>>>>>>
>>>>>> Therefore, I propose a simpler logic (see attachment) that completes
>>>>>> either on the usual bql_flush_ns or if n_bql > dql.limit.
>>>>>> If n_bql > dql.limit then we either have the case above that the
>>>>>> producer is as fast as the consumer or we have BQL starvation.
>>>>>>
>>>>>> if (state->time + bql_flush_ns <= current_time ||
>>>>>>      state->n_bql > peer_txq->dql.limit) {
>>>>>>
>>>>>> It must be n_bql *bigger than* dql.limit because the producer will
>>>>>> always exceed the limit before it stops, see netdev_tx_sent_queue().
>>>>>> It is fast because peer_txq->dql.limit is in the cacheline of the
>>>>>> completion path, see dynamic_queue_limits.h.
>>>>>>
>>>>>> Another advantage is that we avoid the snippet checking for empty
>>>>>> and BQL stopped which requires an smp_rmb() and an test_bit().
>>>>>>
>>>>>> Apart from that I:
>>>>>> - Always call veth_bql_maybe_complete() in the for loop to have
>>>>>>     more accurate completion intervals when having mixed XDP and
>>>>>>     non-XDP packets.
>>>>>> - Made it so tx_usecs = 0 is now also a normal case.
>>>>>> - Change the type of n_bql to uint instead of int.
>>>>>> - Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
>>>>>> - Moved the bql_state init in __veth_napi_enable_range() in front
>>>>>>     of napi_enable() to avoid a race (Sashiko).
>>>>>> - Moved the bql_state reset in veth_napi_del_range() after the
>>>>>>     ptr_ring_cleanup() (probably does not matter but makes sense to me)
>>>>>>
>>>>>> Benchmarks look just fine, see commit message.
>>>>>>
>>>>>> WDYT?
>>>>
>>>>>
>>>>> Looks good to me, I will use this in my V7 patchset.
>>>>>
>>>>> A bike-shedding issue: We change the coalescing parameters for the veth
>>>>> net_device, but should this be a TX or RX parameter?
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> The results look a little bit suspicious, that in some cases the p99 is smaller than the average, which can happen if you have really large max values. It feels something is off with the methodology. I think we should drop the avg, and include the max value. This would give a better picture.
>>>
>>> Luckily I saved the raw test result from /tmp (the raw ping output
>>> is included).
>>>
>>> extract_ping_p99() seems to be broken. It has issues ordering the
>>> numbers. So if there are 2 or 3 numbers after the point I think.
>>> avg ping is fine.
>>>
>>> Claude Sonnet wrote a script for me to recalculate the p99 pings
>>> from the raw ping output).
>>
>> For programming switch to Claude Opus, as Sonnet cannot code IMHO.
> 
> Already used 95% of my Copilot Pro+ Tokens this month :P
> 
>> (Claude Opus wrote the selftest we placed in [1])
>>
>> Is your script still based on our github[1] selftest ?
>> If so, then please make some PR(s) against my repo.
>> I want this to be easy reproducible for others.
>>
>> [1] https://github.com/netoptimizer/veth-backpressure-performance-testing
>>
> 
> I forked Jonas latest branch [1] and ran the tests with that.
> 
> But I faced 2 issues that I then fixed:
> - Ping fails in some runs (~20%) -> My fix: Retry the run until it
>   works.
> - extract_ping_p99() has a bug for me which caused wrong results,
>   as pointed out by Jonas -> Also fixed it
> 
> Because of the inconsistencies I will rerun over night to have
> clean benchmark results.
> 
> I will make a PR tomorrow.
> 
> [1] https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark
> 

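I ran the benchmarks overnight and they look fine,
see the ASCII tables below.
I do not see a regression. A tx_usecs of 50us even outperforms stock:
throughput is higher for up to 100 rules (e.g. 1.89M vs. 1.76M pps at
0 rules) and ping RTT is lower in every row.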

I did a PR, see [1]. Most of the code changes are by Jonas,
because he rewrote the code to allow the usage of pktgen and
added the bql_sweep.sh script. Thanks Jonas!
My part is the 2 bugfixes, the output as ASCII table and
the new benchmark results with my new 'v7' method.

I have not tested bql_sweep.sh inside virtme-ng, I ran my benchmarks
on the host system. I guess you could mount the folder with read+write
in vng and then run bql_sweep.sh there.

[1] https://github.com/netoptimizer/veth-backpressure-performance-testing/pull/9


System information:
Ryzen 5 5600X @ fixed 4.3 GHz, 3200 MHz RAM, CPU mitigations disabled,
SMT disabled, on host system (NOT virtme-ng).

Patched benchmark command:
sudo ./veth_bql_sweep.sh --runs 10 --tx-usecs-list \
"0 50 100 500 1000 5000 10000" --nrules-list "0 1 10 100 1000 10000" \
--pktgen --duration 20 --qdisc fq_codel --no-bpftrace

Stock benchmark command:
sudo ./veth_bql_sweep.sh --runs 10 --tx-usecs-list "0" --nrules-list \
"0 1 10 100 1000 10000" --pktgen --duration 20 --qdisc fq_codel \
--no-bpftrace

Throughput (pps)
===========================================================================
nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
     0 | 1.62M | 1.89M | 1.75M | 1.73M |  1.73M |  1.73M |   1.73M || 1.76M
     1 | 1.51M | 1.72M | 1.63M | 1.60M |  1.60M |  1.60M |   1.60M || 1.65M
    10 | 1.33M | 1.52M | 1.47M | 1.41M |  1.41M |  1.41M |   1.41M || 1.45M
   100 |  681K |  751K |  756K |  726K |   721K |   723K |    730K ||  736K
  1000 |  116K |  123K |  124K |  124K |   124K |   125K |    123K ||  126K
 10000 |   13K |   13K |   13K |   13K |    13K |    13K |     13K ||   13K


Ping RTT ms (avg)
===========================================================================
nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
     0 | 0.016 | 0.089 | 0.136 | 0.137 |  0.138 |  0.134 |   0.135 || 0.132
     1 | 0.018 | 0.097 | 0.144 | 0.146 |  0.146 |  0.150 |   0.147 || 0.142
    10 | 0.019 | 0.093 | 0.156 | 0.164 |  0.164 |  0.164 |   0.167 || 0.158
   100 | 0.029 | 0.104 | 0.180 | 0.315 |  0.312 |  0.314 |   0.311 || 0.307
  1000 | 0.139 | 0.201 | 0.309 | 0.971 |   1.66 |   1.81 |    1.82 ||  1.77
 10000 |  1.12 |  1.74 |  1.72 |  1.83 |   2.90 |   9.14 |    15.9 ||  17.2


Ping RTT ms (p99)
===========================================================================
nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
     0 | 0.026 | 0.122 | 0.161 | 0.164 |  0.164 |  0.159 |   0.161 || 0.155
     1 | 0.028 | 0.123 | 0.169 | 0.173 |  0.173 |  0.175 |   0.173 || 0.165
    10 | 0.032 | 0.119 | 0.187 | 0.194 |  0.193 |  0.193 |   0.197 || 0.185
   100 | 0.047 | 0.134 | 0.232 | 0.368 |  0.369 |  0.367 |   0.359 || 0.361
  1000 | 0.232 | 0.303 | 0.409 |  1.27 |   2.13 |   2.10 |    2.13 ||  2.10
 10000 |  1.94 |  2.52 |  2.59 |  2.69 |   3.87 |   12.0 |    20.0 ||  20.3


>>> Below is the new result. Not 100% sure but this seems right now.
>>> Most pings are the same.
>>>
>>> Ping RTT ms (p99)
>>> ===========================================================================
>>> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
>>> -------+-------+-------+-------+-------+--------+--------+---------++------
>>>      0 | 0.028 | 0.115 | 0.159 | 0.161 |  0.163 |  0.161 |   0.163 || 0.154
>>>      1 | 0.027 | 0.123 | 0.169 | 0.172 |  0.169 |  0.173 |   0.172 || 0.169
>>>     10 | 0.030 | 0.117 | 0.189 | 0.193 |  0.195 |  0.192 |   0.196 || 0.186
>>>    100 | 0.045 | 0.134 | 0.233 | 0.368 |  0.365 |  0.369 |   0.361 || 0.358
>>>   1000 | 0.230 | 0.300 | 0.412 |  1.26 |   2.11 |   2.12 |    2.13 ||  2.07
>>>  10000 |  2.04 |  2.55 |  2.54 |  2.72 |   3.77 |   12.1 |    20.1 ||  20.3
>>>
>>>>
>>>> Is state->n_bql += veth_ptr_is_bql(ptr) something that is commonly done in the kernel? It relies on bool normalization which feels a bit implicit to me.
>>>
>>> Ok, we could swap for something like
>>>
>>> if (veth_ptr_is_bql(ptr))
>>>     state->n_bql++;
>>
>> I'll see if I can adjust the V7 patch to this feedback.
> 
> Yes, and also this snippet pointed out by Jonas below:
> 
>  -       struct veth_priv *priv;
>  -       u64 bql_flush_ns;
>  +       u64 bql_flush_ns = 0;
> 
>  -       priv = netdev_priv(rq->dev);
>  -       bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
>  +       if (peer_txq) {
>  +           struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
>  +           bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
>  +       }
> 
> Or is peer_txq != 0 guaranteed in veth_xdp_rcv()?
> 
> 
> And also I would change in the commit message:
> 
> "Flushing when n_bql exceeds dql.limit handles two cases:
> - BQL starvation
> - The steady-state case where the producer and consumer run at the
>   same speed with a large tx-usecs, which would otherwise allow n_bql
>   to grow without bound (and potentially overflow int)."
> 
> ... to just...
> 
> "Flushing when n_bql exceeds dql.limit handles BQL starvation."
> 
> ... because after rethinking I think I overstated here.
> n_bql should also not grow infinitely for the v6 version.
> Nevertheless the new method should be simpler and faster.
> 
>>
>>
>>>>
>>>>> For physical NICs adjusting TX coalescing will affect the BQL as this
>>>>> affect the TX completion of the transmitted packets. For veth, it is the
>>>>> veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
>>>>> is where this patch adds netdev_tx_completed_queue calls for BQL.
>>>>> Any opinions on the "TX" or "RX" color?
>>>
>>> Personally I would also say TX.
>>>
>>
>> Okay, we seem to have more votes for TX :-)
>>
>>
>>>> I think I would prefer to configure it on the tx dev, and the recv
>>>> side gets the value from the peer. Maybe something like this.
>>>>
>>
>> I'm considering if we should simplify the config by having veth pairs have the same tx-coal value. WDYT?
> 
> I can not think of a case where a user would like to have a different
> tx-coal value on either side..
> So it's fine for me.
> 
>>
>>
>>>> @@ -997,11 +998,12 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>          struct veth_bql_state *state = &rq->bql_state;
>>>>          int i, done = 0, n_xdpf = 0;
>>>>          void *xdpf[VETH_XDP_BATCH];
>>>
>>> Why this deletion?
>>>
>>>> -       struct veth_priv *priv;
>>>> -       u64 bql_flush_ns;
>>>> +       u64 bql_flush_ns = 0;
>>>>
>>>> -       priv = netdev_priv(rq->dev);
>>>> -       bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
>>>
>>> I see we should check if peer_txq exists.
>>>
>>>> +       if (peer_txq) {
>>>> +           struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
>>>> +           bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
>>>> +       }
>>>>
>>>>          /* Clamp stored timestamp in case we migrated to a CPU with a behind
>>>>           * sched_clock(); prevents the deadline from never firing.
>>>>
>>>> - Jonas
>>
>> --Jesper
>>


* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-10  7:04               ` Simon Schippers
@ 2026-06-10 10:15                 ` Jesper Dangaard Brouer
  2026-06-10 12:00                   ` Simon Schippers
  0 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2026-06-10 10:15 UTC (permalink / raw)
  To: Simon Schippers, Jonas Köppeler, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf, kernel-team




On 10/06/2026 09.04, Simon Schippers wrote:
> On 6/9/26 17:08, Simon Schippers wrote:
>> On 6/9/26 15:59, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 08/06/2026 16.21, Simon Schippers wrote:
>>>> On 6/8/26 15:13, Jonas Köppeler wrote:
>>>>>
>>>>> On 6/8/26 3:04 PM, Jesper Dangaard Brouer wrote:
>>>>>>
>>>>>>
>>>>>> On 08/06/2026 12.38, Simon Schippers wrote:
>>>>>>> On 5/27/26 15:54, hawk@kernel.org wrote:
>>>>>>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>
>>>>>>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>>>>>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>>>>>>
>>>>>>>> Accumulate BQL completions and flush them when a configurable time
>>>>>>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>>>>>>> queuing delay to the configured interval. Coalescing state persists
>>>>>>>> across NAPI polls in struct veth_rq so completions can accumulate
>>>>>>>> beyond a single budget=64 cycle.
>>>>>>>>
>>>>>>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>>>>>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>>>>>>> completion.
>>>>>>>>
>>>>>>>>      ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>>>>>>      ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>>>>>>
>>>>>>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>> ---
>>>>>>>
>>>>>>> I found the issue that n_bql may become infinitely large if producer
>>>>>>> and consumer have the same speed (and tx_usecs is large). It could
>>>>>>> cause a potential BUG_ON if n_bql grows beyond INT_MAX...
>>>>>>> Also I figured that no hardware BQL driver ever completes more than
>>>>>>> BQL limit many elements.
>>>>>>>
>>>>>>> Therefore, I propose a simpler logic (see attachment) that completes
>>>>>>> either on the usual bql_flush_ns or if n_bql > dql.limit.
>>>>>>> If n_bql > dql.limit then we either have the case above that the
>>>>>>> producer is as fast as the consumer or we have BQL starvation.
>>>>>>>
>>>>>>> if (state->time + bql_flush_ns <= current_time ||
>>>>>>>       state->n_bql > peer_txq->dql.limit) {
>>>>>>>
>>>>>>> It must be n_bql *bigger than* dql.limit because the producer will
>>>>>>> always exceed the limit before it stops, see netdev_tx_sent_queue().
>>>>>>> It is fast because peer_txq->dql.limit is in the cacheline of the
>>>>>>> completion path, see dynamic_queue_limits.h.
>>>>>>>
>>>>>>> Another advantage is that we avoid the snippet checking for empty
>>>>>>> and BQL stopped which requires an smp_rmb() and an test_bit().
>>>>>>>
>>>>>>> Apart from that I:
>>>>>>> - Always call veth_bql_maybe_complete() in the for loop to have
>>>>>>>      more accurate completion intervals when having mixed XDP and
>>>>>>>      non-XDP packets.
>>>>>>> - Made it so tx_usecs = 0 is now also a normal case.
>>>>>>> - Change the type of n_bql to uint instead of int.
>>>>>>> - Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
>>>>>>> - Moved the bql_state init in __veth_napi_enable_range() in front
>>>>>>>      of napi_enable() to avoid a race (Sashiko).
>>>>>>> - Moved the bql_state reset in veth_napi_del_range() after the
>>>>>>>      ptr_ring_cleanup() (probably does not matter but makes sense to me)
>>>>>>>
>>>>>>> Benchmarks look just fine, see commit message.
>>>>>>>
>>>>>>> WDYT?
>>>>>
>>>>>>
>>>>>> Looks good to me, I will use this in my V7 patchset.
>>>>>>
>>>>>> A bike-shedding issue: We change the coalescing parameters for the veth
>>>>>> net_device, but should this be a TX or RX parameter?
>>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> The results look a little bit suspicious, that in some cases the p99 is smaller than the average, which can happen if you have really large max values. It feels something is off with the methodology. I think we should drop the avg, and include the max value. This would give a better picture.
>>>>
>>>> Luckily I saved the raw test result from /tmp (the raw ping output
>>>> is included).
>>>>
>>>> extract_ping_p99() seems to be broken. It has issues ordering the
>>>> numbers. So if there are 2 or 3 numbers after the point I think.
>>>> avg ping is fine.
>>>>
>>>> Claude Sonnet wrote a script for me to recalculate the p99 pings
>>>> from the raw ping output).
>>>
>>> For programming switch to Claude Opus, as Sonnet cannot code IMHO.
>>
>> Already used 95% of my Copilot Pro+ Tokens this month :P
>>

LOL - I'm lucky to (still) have unlimited tokens.

>>> (Claude Opus wrote the selftest we placed in [1])
>>>
>>> Is your script still based on our github[1] selftest ?
>>> If so, then please make some PR(s) against my repo.
>>> I want this to be easy reproducible for others.
>>>
>>> [1] https://github.com/netoptimizer/veth-backpressure-performance-testing
>>>
>>
>> I forked Jonas latest branch [1] and ran the tests with that.
>>
>> But I faced 2 issues that I then fixed:
>> - Ping fails in some runs (~20%) -> My fix: Retry the run until it
>>    works.
>> - extract_ping_p99() has a bug for me which caused wrong results,
>>    as pointed out by Jonas -> Also fixed it
>>
>> Because of the inconsistencies I will rerun over night to have
>> clean benchmark results.
>>
>> I will make a PR tomorrow.
>>
>> [1] https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark
>>
> 
> I ran the benchmarks over night and they look fine,
> see ASCII table below.
> I do not see a regression, tx_usecs of 50us even outperforms stock.
> 

It is very interesting that tx_usecs 50us generally performs better.
It seems we should "sweep" the range 0-100 usecs to find the best
default from the beginning. (And look at the cool colored graphs that
Jonas generated.)


> I did a PR, see [1]. Most of the code changes are by Jonas,
> because he rewrote the code to allow the usage of pktgen and
> added the bql_sweep.sh script. Thanks Jonas!
> My part is the 2 bugfixes, the output as ASCII table and
> the new benchmark results with my new 'v7' method.
> 
> I have not tested bql_sweep.sh inside virtme-ng, I ran my benchmarks
> on the host system. I guess you could mount the folder with read+write
> in vng and then run bql_sweep.sh there.
> 
> [1] https://github.com/netoptimizer/veth-backpressure-performance-testing/pull/9
> 

I will take a look today.
- Will likely just merge after sanity checking as I want us to iterate faster

> 
> System information:
> Ryzen 5 5600X @ fixed 4.3 GHz, 3200 MHz RAM, CPU mitigations disabled,
> SMT disabled, on host system (NOT virtme-ng).
> 
> Patched benchmark command:
> sudo ./veth_bql_sweep.sh --runs 10 --tx-usecs-list \
> "0 50 100 500 1000 5000 10000" --nrules-list "0 1 10 100 1000 10000" \
> --pktgen --duration 20 --qdisc fq_codel --no-bpftrace
> 
> Stock benchmark command:
> sudo ./veth_bql_sweep.sh --runs 10 --tx-usecs-list "0" --nrules-list \
> "0 1 10 100 1000 10000" --pktgen --duration 20 --qdisc fq_codel \
> --no-bpftrace
> 
> Throughput (pps)
> ===========================================================================
> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
> -------+-------+-------+-------+-------+--------+--------+---------++------
>       0 | 1.62M | 1.89M | 1.75M | 1.73M |  1.73M |  1.73M |   1.73M || 1.76M
>       1 | 1.51M | 1.72M | 1.63M | 1.60M |  1.60M |  1.60M |   1.60M || 1.65M
>      10 | 1.33M | 1.52M | 1.47M | 1.41M |  1.41M |  1.41M |   1.41M || 1.45M
>     100 |  681K |  751K |  756K |  726K |   721K |   723K |    730K ||  736K
>    1000 |  116K |  123K |  124K |  124K |   124K |   125K |    123K ||  126K
>   10000 |   13K |   13K |   13K |   13K |    13K |    13K |     13K ||   13K
> 
> 
> Ping RTT ms (avg)
> ===========================================================================
> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
> -------+-------+-------+-------+-------+--------+--------+---------++------
>       0 | 0.016 | 0.089 | 0.136 | 0.137 |  0.138 |  0.134 |   0.135 || 0.132
>       1 | 0.018 | 0.097 | 0.144 | 0.146 |  0.146 |  0.150 |   0.147 || 0.142
>      10 | 0.019 | 0.093 | 0.156 | 0.164 |  0.164 |  0.164 |   0.167 || 0.158
>     100 | 0.029 | 0.104 | 0.180 | 0.315 |  0.312 |  0.314 |   0.311 || 0.307
>    1000 | 0.139 | 0.201 | 0.309 | 0.971 |   1.66 |   1.81 |    1.82 ||  1.77
>   10000 |  1.12 |  1.74 |  1.72 |  1.83 |   2.90 |   9.14 |    15.9 ||  17.2
> 
> 
> Ping RTT ms (p99)
> ===========================================================================
> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
> -------+-------+-------+-------+-------+--------+--------+---------++------
>       0 | 0.026 | 0.122 | 0.161 | 0.164 |  0.164 |  0.159 |   0.161 || 0.155
>       1 | 0.028 | 0.123 | 0.169 | 0.173 |  0.173 |  0.175 |   0.173 || 0.165
>      10 | 0.032 | 0.119 | 0.187 | 0.194 |  0.193 |  0.193 |   0.197 || 0.185
>     100 | 0.047 | 0.134 | 0.232 | 0.368 |  0.369 |  0.367 |   0.359 || 0.361
>    1000 | 0.232 | 0.303 | 0.409 |  1.27 |   2.13 |   2.10 |    2.13 ||  2.10
>   10000 |  1.94 |  2.52 |  2.59 |  2.69 |   3.87 |   12.0 |    20.0 ||  20.3
> 

Looks like 50us delivers better (lower) latency and higher throughput 
across the range.
I'm strongly considering changing V7 to use 50usec as default.

> 
>>>> Below is the new result. Not 100% sure but this seems right now.
>>>> Most pings are the same.
>>>>
>>>> Ping RTT ms (p99)
>>>> ===========================================================================
>>>> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
>>>> -------+-------+-------+-------+-------+--------+--------+---------++------
>>>>        0 | 0.028 | 0.115 | 0.159 | 0.161 |  0.163 |  0.161 |   0.163 || 0.154
>>>>        1 | 0.027 | 0.123 | 0.169 | 0.172 |  0.169 |  0.173 |   0.172 || 0.169
>>>>       10 | 0.030 | 0.117 | 0.189 | 0.193 |  0.195 |  0.192 |   0.196 || 0.186
>>>>      100 | 0.045 | 0.134 | 0.233 | 0.368 |  0.365 |  0.369 |   0.361 || 0.358
>>>>     1000 | 0.230 | 0.300 | 0.412 |  1.26 |   2.11 |   2.12 |    2.13 ||  2.07
>>>>    10000 |  2.04 |  2.55 |  2.54 |  2.72 |   3.77 |   12.1 |    20.1 ||  20.3
>>>>
>>>>>
>>>>> Is state->n_bql += veth_ptr_is_bql(ptr) something that is commonly done in the kernel? It relies on bool normalization which feels a bit implicit to me.
>>>>
>>>> Ok, we could swap for something like
>>>>
>>>> if (veth_ptr_is_bql(ptr))
>>>>      state->n_bql++;
>>>
>>> I'll see if I can adjust the V7 patch to this feedback.
>>
>> Yes, and also this snippet pointed out by Jonas below:
>>
>>   -       struct veth_priv *priv;
>>   -       u64 bql_flush_ns;
>>   +       u64 bql_flush_ns = 0;
>>
>>   -       priv = netdev_priv(rq->dev);
>>   -       bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
>>   +       if (peer_txq) {
>>   +           struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
>>   +           bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
>>   +       }
>>
>> Or is peer_txq != NULL guaranteed in veth_xdp_rcv()?
>>
>>
>> And also I would change in the commit message:
>>
>> "Flushing when n_bql exceeds dql.limit handles two cases:
>> - BQL starvation
>> - The steady-state case where the producer and consumer run at the
>>    same speed with a large tx-usecs, which would otherwise allow n_bql
>>    to grow without bound (and potentially overflow int)."
>>
>> ... to just...
>>
>> "Flushing when n_bql exceeds dql.limit handles BQL starvation."
>>
>> ... because after rethinking I think I overstated here.
>> n_bql should also not grow infinitely for the v6 version.
>> Nevertheless the new method should be simpler and faster.

Ok, adjusted this for V7 patch.

>>>
>>>
>>>>>
>>>>>> For physical NICs adjusting TX coalescing will affect the BQL as this
>>>>>> affects the TX completion of the transmitted packets. For veth, it is the
>>>>>> veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
>>>>>> is where this patch adds netdev_tx_completed_queue calls for BQL.
>>>>>> Any opinions on the "TX" or "RX" color?
>>>>
>>>> Personally I would also say TX.
>>>>
>>>
>>> Okay, we seem to have more votes for TX :-)
>>>
>>>
>>>>> I think I would prefer to configure it on the tx dev, and the recv
>>>>> side gets the value from the peer. Maybe something like this.
>>>>>
>>>
>>> I'm considering if we should simplify the config by having veth pairs have the same tx-coal value. WDYT?
>>
>> I cannot think of a case where a user would like to have a different
>> tx-coal value on either side.
>> So it's fine for me.

Okay, agreed - switching V7 to have mirrored tx-coal value.
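
A mirrored setting could look roughly like the following userspace sketch.
It is only an illustration of "both ends carry the same value": struct
toy_veth and toy_veth_set_tx_usecs() are hypothetical names, not the driver's
types, and locking and READ_ONCE()/WRITE_ONCE() are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the per-device private data. */
struct toy_veth {
	struct toy_veth *peer;   /* NULL when there is no peer */
	uint32_t tx_coal_usecs;
};

/* Store the same value on both ends so either side can read it locally. */
static void toy_veth_set_tx_usecs(struct toy_veth *dev, uint32_t usecs)
{
	assert(dev);
	dev->tx_coal_usecs = usecs;
	if (dev->peer)
		dev->peer->tx_coal_usecs = usecs;
}
```

With this shape the consumer side would not need to dereference the peer to
find its flush interval, at the cost of one extra store when the value is set.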

--Jesper


* Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
  2026-06-10 10:15                 ` Jesper Dangaard Brouer
@ 2026-06-10 12:00                   ` Simon Schippers
  0 siblings, 0 replies; 30+ messages in thread
From: Simon Schippers @ 2026-06-10 12:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Jonas Köppeler, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf, kernel-team

On 6/10/26 12:15, Jesper Dangaard Brouer wrote:
> 
> 
> 
> On 10/06/2026 09.04, Simon Schippers wrote:
>> On 6/9/26 17:08, Simon Schippers wrote:
>>> On 6/9/26 15:59, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 08/06/2026 16.21, Simon Schippers wrote:
>>>>> On 6/8/26 15:13, Jonas Köppeler wrote:
>>>>>>
>>>>>> On 6/8/26 3:04 PM, Jesper Dangaard Brouer wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 08/06/2026 12.38, Simon Schippers wrote:
>>>>>>>> On 5/27/26 15:54, hawk@kernel.org wrote:
>>>>>>>>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>>
>>>>>>>>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>>>>>>>>> excessive NAPI scheduling overhead and qdisc requeues.
>>>>>>>>>
>>>>>>>>> Accumulate BQL completions and flush them when a configurable time
>>>>>>>>> threshold is exceeded, letting DQL discover a limit that bounds actual
>>>>>>>>> queuing delay to the configured interval. Coalescing state persists
>>>>>>>>> across NAPI polls in struct veth_rq so completions can accumulate
>>>>>>>>> beyond a single budget=64 cycle.
>>>>>>>>>
>>>>>>>>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>>>>>>>>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>>>>>>>>> completion.
>>>>>>>>>
>>>>>>>>>      ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>>>>>>>>      ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>>>>>>>>
>>>>>>>>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>> ---
>>>>>>>>
>>>>>>>> I found the issue that n_bql may become infinitely large if producer
>>>>>>>> and consumer have the same speed (and tx_usecs is large). It could
>>>>>>>> cause a potential BUG_ON if n_bql grows beyond INT_MAX...
>>>>>>>> Also I figured that no hardware BQL driver ever completes more
>>>>>>>> elements than the BQL limit.
>>>>>>>>
>>>>>>>> Therefore, I propose a simpler logic (see attachment) that completes
>>>>>>>> either on the usual bql_flush_ns or if n_bql > dql.limit.
>>>>>>>> If n_bql > dql.limit then we either have the case above that the
>>>>>>>> producer is as fast as the consumer or we have BQL starvation.
>>>>>>>>
>>>>>>>> if (state->time + bql_flush_ns <= current_time ||
>>>>>>>>       state->n_bql > peer_txq->dql.limit) {
>>>>>>>>
>>>>>>>> It must be n_bql *bigger than* dql.limit because the producer will
>>>>>>>> always exceed the limit before it stops, see netdev_tx_sent_queue().
>>>>>>>> It is fast because peer_txq->dql.limit is in the cacheline of the
>>>>>>>> completion path, see dynamic_queue_limits.h.
>>>>>>>>
>>>>>>>> Another advantage is that we avoid the snippet checking for empty
>>>>>>>> and BQL stopped, which requires an smp_rmb() and a test_bit().
>>>>>>>>
>>>>>>>> Apart from that I:
>>>>>>>> - Always call veth_bql_maybe_complete() in the for loop to have
>>>>>>>>      more accurate completion intervals when having mixed XDP and
>>>>>>>>      non-XDP packets.
>>>>>>>> - Made it so tx_usecs = 0 is now also a normal case.
>>>>>>>> - Change the type of n_bql to uint instead of int.
>>>>>>>> - Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
>>>>>>>> - Moved the bql_state init in __veth_napi_enable_range() in front
>>>>>>>>      of napi_enable() to avoid a race (Sashiko).
>>>>>>>> - Moved the bql_state reset in veth_napi_del_range() after the
>>>>>>>>      ptr_ring_cleanup() (probably does not matter but makes sense to me)
>>>>>>>>
>>>>>>>> Benchmarks look just fine, see commit message.
>>>>>>>>
>>>>>>>> WDYT?
>>>>>>
>>>>>>>
>>>>>>> Looks good to me, I will use this in my V7 patchset.
>>>>>>>
>>>>>>> A bike-shedding issue: We change the coalescing parameters for the veth
>>>>>>> net_device, but should this be a TX or RX parameter?
>>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The results look a little suspicious: in some cases the p99 is smaller than the average, which can happen if you have really large max values. It feels like something is off with the methodology. I think we should drop the avg and include the max value. This would give a better picture.
>>>>>
>>>>> Luckily I saved the raw test result from /tmp (the raw ping output
>>>>> is included).
>>>>>
>>>>> extract_ping_p99() seems to be broken. It has issues ordering the
>>>>> numbers, I think when there are 2 or 3 digits after the decimal
>>>>> point. avg ping is fine.
>>>>>
>>>>> (Claude Sonnet wrote a script for me to recalculate the p99 pings
>>>>> from the raw ping output.)
>>>>
>>>> For programming switch to Claude Opus, as Sonnet cannot code IMHO.
>>>
>>> Already used 95% of my Copilot Pro+ Tokens this month :P
>>>
> 
> LOL - I'm lucky to (still) have unlimited tokens.
> 
>>>> (Claude Opus wrote the selftest we placed in [1])
>>>>
>>>> Is your script still based on our GitHub [1] selftest?
>>>> If so, then please make some PR(s) against my repo.
>>>> I want this to be easily reproducible for others.
>>>>
>>>> [1] https://github.com/netoptimizer/veth-backpressure-performance-testing
>>>>
>>>
>>> I forked Jonas' latest branch [1] and ran the tests with that.
>>>
>>> But I faced 2 issues that I then fixed:
>>> - Ping fails in some runs (~20%) -> My fix: Retry the run until it
>>>    works.
>>> - extract_ping_p99() has a bug for me which caused wrong results,
>>>    as pointed out by Jonas -> Also fixed it
>>>
>>> Because of the inconsistencies I will rerun overnight to have
>>> clean benchmark results.
>>>
>>> I will make a PR tomorrow.
>>>
>>> [1] https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark
>>>
>>
>> I ran the benchmarks overnight and they look fine,
>> see ASCII table below.
>> I do not see a regression; tx_usecs of 50us even outperforms stock.
>>
> 
> Very interesting that tx_usecs 50us is generally performing better.
> It seems we should "sweep" the range between 0-100 usecs to find the best
> default from the beginning. (And see the cool colored graphs that Jonas
> generated.)
> 

For quick verification I did 5 runs in the 10-90us range
for nrules=0 with 20s per run. Here are the results:

sudo ./veth_bql_sweep.sh --runs 5 --tx-usecs-list \
"0 10 20 30 40 50 60 70 80 90" --nrules-list "0" --pktgen \
 --duration 20 --qdisc fq_codel --no-bpftrace

Throughput (pps)
================================================================================
nrules |  10us |  20us |  30us |  40us |  50us |  60us |  70us |  80us |  90us |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
     0 | 1.75M | 1.76M | 1.76M | 1.76M | 1.76M | 1.76M | 1.77M | 1.77M | 1.76M |


Ping RTT ms (avg)
================================================================================
nrules |  10us |  20us |  30us |  40us |  50us |  60us |  70us |  80us |  90us |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
     0 | 0.133 | 0.133 | 0.132 | 0.132 | 0.130 | 0.133 | 0.132 | 0.131 | 0.131 |


Ping RTT ms (p99)
================================================================================
nrules |  10us |  20us |  30us |  40us |  50us |  60us |  70us |  80us |  90us |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
     0 | 0.158 | 0.156 | 0.155 | 0.159 | 0.156 | 0.156 | 0.156 | 0.153 | 0.153 |


Now 50us is just as fast as stock, even though I changed *nothing* :/
Nevertheless still great.

Even 10us seems valid. Maybe it just has to be as big as the time
for running a single veth_xdp_rcv() execution? Idk :D
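
To make the role of tx-usecs in that question concrete, here is a userspace
sketch of the flush rule quoted further down in this mail (time threshold
reached, or n_bql above dql.limit). The fields time and n_bql follow the
thread; bql_should_flush() and bql_account() are made-up helper names, and the
sketch assumes n_bql and the limit are counted in the same unit. It is not
the driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the per-rq coalescing state. */
struct bql_state {
	uint64_t time;      /* timestamp of the last flush, in ns */
	unsigned int n_bql; /* completions accumulated since then */
};

/* Flush when the time threshold has passed, or when more completions
 * are pending than the current limit.
 */
static bool bql_should_flush(const struct bql_state *st, uint64_t now_ns,
			     uint64_t flush_ns, unsigned int dql_limit)
{
	assert(st);
	return st->time + flush_ns <= now_ns || st->n_bql > dql_limit;
}

/* Account one consumed entry; returns the number of completions to
 * report now (0 if nothing is flushed yet).
 */
static unsigned int bql_account(struct bql_state *st, bool is_bql,
				uint64_t now_ns, uint64_t flush_ns,
				unsigned int dql_limit)
{
	unsigned int done = 0;

	if (is_bql)
		st->n_bql++;
	if (st->n_bql && bql_should_flush(st, now_ns, flush_ns, dql_limit)) {
		done = st->n_bql;
		st->n_bql = 0;
		st->time = now_ns;
	}
	return done;
}
```

With flush_ns = 0 the time test is always true, so every BQL-charged entry is
completed on its own, which matches tx-usecs 0 meaning per-packet completion.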

> 
>> I did a PR, see [1]. Most of the code changes are by Jonas,
>> because he rewrote the code to allow the usage of pktgen and
>> added the bql_sweep.sh script. Thanks Jonas!
>> My part is the 2 bugfixes, the output as ASCII table and
>> the new benchmark results with my new 'v7' method.
>>
>> I have not tested bql_sweep.sh inside virtme-ng, I ran my benchmarks
>> on the host system. I guess you could mount the folder with read+write
>> in vng and then run bql_sweep.sh there.
>>
>> [1] https://github.com/netoptimizer/veth-backpressure-performance-testing/pull/9
>>
> 
> I will take a look today.
> - Will likely just merge after sanity checking as I want us to iterate faster
> 
>>
>> System information:
>> Ryzen 5 5600X @ fixed 4.3 GHz, 3200 MHz RAM, CPU mitigations disabled,
>> SMT disabled, on host system (NOT virtme-ng).
>>
>> Patched benchmark command:
>> sudo ./veth_bql_sweep.sh --runs 10 --tx-usecs-list \
>> "0 50 100 500 1000 5000 10000" --nrules-list "0 1 10 100 1000 10000" \
>> --pktgen --duration 20 --qdisc fq_codel --no-bpftrace
>>
>> Stock benchmark command:
>> sudo ./veth_bql_sweep.sh --runs 10 --tx-usecs-list "0" --nrules-list \
>> "0 1 10 100 1000 10000" --pktgen --duration 20 --qdisc fq_codel \
>> --no-bpftrace
>>
>> Throughput (pps)
>> ===========================================================================
>> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
>> -------+-------+-------+-------+-------+--------+--------+---------++------
>>       0 | 1.62M | 1.89M | 1.75M | 1.73M |  1.73M |  1.73M |   1.73M || 1.76M
>>       1 | 1.51M | 1.72M | 1.63M | 1.60M |  1.60M |  1.60M |   1.60M || 1.65M
>>      10 | 1.33M | 1.52M | 1.47M | 1.41M |  1.41M |  1.41M |   1.41M || 1.45M
>>     100 |  681K |  751K |  756K |  726K |   721K |   723K |    730K ||  736K
>>    1000 |  116K |  123K |  124K |  124K |   124K |   125K |    123K ||  126K
>>   10000 |   13K |   13K |   13K |   13K |    13K |    13K |     13K ||   13K
>>
>>
>> Ping RTT ms (avg)
>> ===========================================================================
>> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
>> -------+-------+-------+-------+-------+--------+--------+---------++------
>>       0 | 0.016 | 0.089 | 0.136 | 0.137 |  0.138 |  0.134 |   0.135 || 0.132
>>       1 | 0.018 | 0.097 | 0.144 | 0.146 |  0.146 |  0.150 |   0.147 || 0.142
>>      10 | 0.019 | 0.093 | 0.156 | 0.164 |  0.164 |  0.164 |   0.167 || 0.158
>>     100 | 0.029 | 0.104 | 0.180 | 0.315 |  0.312 |  0.314 |   0.311 || 0.307
>>    1000 | 0.139 | 0.201 | 0.309 | 0.971 |   1.66 |   1.81 |    1.82 ||  1.77
>>   10000 |  1.12 |  1.74 |  1.72 |  1.83 |   2.90 |   9.14 |    15.9 ||  17.2
>>
>>
>> Ping RTT ms (p99)
>> ===========================================================================
>> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
>> -------+-------+-------+-------+-------+--------+--------+---------++------
>>       0 | 0.026 | 0.122 | 0.161 | 0.164 |  0.164 |  0.159 |   0.161 || 0.155
>>       1 | 0.028 | 0.123 | 0.169 | 0.173 |  0.173 |  0.175 |   0.173 || 0.165
>>      10 | 0.032 | 0.119 | 0.187 | 0.194 |  0.193 |  0.193 |   0.197 || 0.185
>>     100 | 0.047 | 0.134 | 0.232 | 0.368 |  0.369 |  0.367 |   0.359 || 0.361
>>    1000 | 0.232 | 0.303 | 0.409 |  1.27 |   2.13 |   2.10 |    2.13 ||  2.10
>>   10000 |  1.94 |  2.52 |  2.59 |  2.69 |   3.87 |   12.0 |    20.0 ||  20.3
>>
> 
> Looks like 50us delivers better (lower) latency and higher throughput across the range.
> I'm strongly considering changing V7 to use 50usec as default.
> 
>>
>>>>> Below is the new result. Not 100% sure but this seems right now.
>>>>> Most pings are the same.
>>>>>
>>>>> Ping RTT ms (p99)
>>>>> ===========================================================================
>>>>> nrules |   0us |  50us | 100us | 500us | 1000us | 5000us | 10000us || stock
>>>>> -------+-------+-------+-------+-------+--------+--------+---------++------
>>>>>        0 | 0.028 | 0.115 | 0.159 | 0.161 |  0.163 |  0.161 |   0.163 || 0.154
>>>>>        1 | 0.027 | 0.123 | 0.169 | 0.172 |  0.169 |  0.173 |   0.172 || 0.169
>>>>>       10 | 0.030 | 0.117 | 0.189 | 0.193 |  0.195 |  0.192 |   0.196 || 0.186
>>>>>      100 | 0.045 | 0.134 | 0.233 | 0.368 |  0.365 |  0.369 |   0.361 || 0.358
>>>>>     1000 | 0.230 | 0.300 | 0.412 |  1.26 |   2.11 |   2.12 |    2.13 ||  2.07
>>>>>    10000 |  2.04 |  2.55 |  2.54 |  2.72 |   3.77 |   12.1 |    20.1 ||  20.3
>>>>>
>>>>>>
>>>>>> Is state->n_bql += veth_ptr_is_bql(ptr) something that is commonly done in the kernel? It relies on bool normalization which feels a bit implicit to me.
>>>>>
>>>>> Ok, we could swap for something like
>>>>>
>>>>> if (veth_ptr_is_bql(ptr))
>>>>>      state->n_bql++;
>>>>
>>>> I'll see if I can adjust the V7 patch to this feedback.
>>>
>>> Yes, and also this snippet pointed out by Jonas below:
>>>
>>>   -       struct veth_priv *priv;
>>>   -       u64 bql_flush_ns;
>>>   +       u64 bql_flush_ns = 0;
>>>
>>>   -       priv = netdev_priv(rq->dev);
>>>   -       bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
>>>   +       if (peer_txq) {
>>>   +           struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
>>>   +           bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
>>>   +       }
>>>
>>> Or is peer_txq != NULL guaranteed in veth_xdp_rcv()?
>>>
>>>
>>> And also I would change in the commit message:
>>>
>>> "Flushing when n_bql exceeds dql.limit handles two cases:
>>> - BQL starvation
>>> - The steady-state case where the producer and consumer run at the
>>>    same speed with a large tx-usecs, which would otherwise allow n_bql
>>>    to grow without bound (and potentially overflow int)."
>>>
>>> ... to just...
>>>
>>> "Flushing when n_bql exceeds dql.limit handles BQL starvation."
>>>
>>> ... because after rethinking I think I overstated here.
>>> n_bql should also not grow infinitely for the v6 version.
>>> Nevertheless the new method should be simpler and faster.
> 
> Ok, adjusted this for V7 patch.
> 
>>>>
>>>>
>>>>>>
>>>>>>> For physical NICs adjusting TX coalescing will affect the BQL as this
>>>>>>> affects the TX completion of the transmitted packets. For veth, it is the
>>>>>>> veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
>>>>>>> is where this patch adds netdev_tx_completed_queue calls for BQL.
>>>>>>> Any opinions on the "TX" or "RX" color?
>>>>>
>>>>> Personally I would also say TX.
>>>>>
>>>>
>>>> Okay, we seem to have more votes for TX :-)
>>>>
>>>>
>>>>>> I think I would prefer to configure it on the tx dev, and the recv
>>>>>> side gets the value from the peer. Maybe something like this.
>>>>>>
>>>>
>>>> I'm considering if we should simplify the config by having veth pairs have the same tx-coal value. WDYT?
>>>
>>> I cannot think of a case where a user would like to have a different
>>> tx-coal value on either side.
>>> So it's fine for me.
> 
> Okay, agreed - switching V7 to have mirrored tx-coal value.
> 
> --Jesper


* Re: [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-06-04  8:19   ` Paolo Abeni
@ 2026-06-10 12:21     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2026-06-10 12:21 UTC (permalink / raw)
  To: Paolo Abeni, netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, linux-kernel, bpf



On 04/06/2026 10.19, Paolo Abeni wrote:
> On 5/27/26 3:54 PM, hawk@kernel.org wrote:
>> @@ -348,10 +381,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>>   {
>>   	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>>   	struct veth_rq *rq = NULL;
>> -	struct netdev_queue *txq;
>> +	struct netdev_queue *txq = NULL;
> 
> Minor nit: please fix the variables declaration order above.
> 
>> @@ -1092,6 +1137,24 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
>>   		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
>>   	}
>>   
>> +	/* Reset BQL and wake stopped peer txqs.  A concurrent veth_xmit()
>> +	 * may have set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
>> +	 * synchronize_net(), and NAPI can no longer clear it.
>> +	 * Only wake when the device is still up.
>> +	 */
>> +	peer = rtnl_dereference(priv->peer);
>> +	if (peer) {
>> +		int peer_end = min_t(int, end, peer->real_num_tx_queues);
> 
> Sashiko noted you may need to additionally complete/reset the tx queue
> in the <old_real_num_tx_queues> .. <new_real_num_tx_queue> range, and I
> think it's right.
> 

Trying a new approach in V7 that avoids calling reset. Instead, before
ptr_ring_cleanup(), consume any remaining packets and do the correct BQL
accounting.  Let's see if that pleases Sashiko ;-)
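
Roughly the idea, as a userspace sketch only: drain whatever is still queued
and count how many entries were charged to BQL, so exactly that many
completions can be reported before the ring goes away. struct toy_ring,
struct ring_entry and toy_ring_drain() are hypothetical stand-ins, not the
ptr_ring API, and this is not the V7 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a ring entry: only the "was this charged
 * to BQL?" bit matters for the sketch.
 */
struct ring_entry {
	bool bql_charged;
};

struct toy_ring {
	struct ring_entry *slots;
	size_t head;  /* next entry to consume */
	size_t count; /* entries still queued */
	size_t size;  /* number of slots */
};

/* Consume everything left in the ring and return how many entries had
 * been charged to BQL, so the caller can report that many completions
 * before tearing the ring down.
 */
static unsigned int toy_ring_drain(struct toy_ring *r)
{
	unsigned int n_bql = 0;

	assert(r && r->slots && r->size);
	while (r->count) {
		if (r->slots[r->head].bql_charged)
			n_bql++;
		r->head = (r->head + 1) % r->size;
		r->count--;
	}
	return n_bql;
}
```

Because sent and completed counts stay balanced this way, no separate reset
of the queue state should be needed afterwards (that is the assumption the
sketch illustrates).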

--Jesper


* Re: [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net
  2026-06-04  8:24   ` Paolo Abeni
@ 2026-06-10 12:37     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2026-06-10 12:37 UTC (permalink / raw)
  To: Paolo Abeni, netdev
  Cc: Jonas Köppeler, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, linux-kernel




On 04/06/2026 10.24, Paolo Abeni wrote:
> On 5/27/26 3:54 PM, hawk@kernel.org wrote:
>> @@ -1819,6 +1836,7 @@ static void veth_setup(struct net_device *dev)
>>   	dev->priv_destructor = veth_dev_free;
>>   	dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
>>   	dev->max_mtu = ETH_MAX_MTU;
>> +	dev->watchdog_timeo = msecs_to_jiffies(16000);
> 
> Since a repost is needed it could possibly be useful to use a macro for
> the above constant, expanding the math leading to the actual value.

Makes sense to move to a define.
It might be practical to use in veth_set_coalesce() as I guess we should
enforce tx_coal to be less than the timeout.

> Also possibly an additional + 1 to avoid a very unlikely false positive
> on exactly the maximum possible interval timer?

I don't think the +1 brings much value, because this timer is jiffies
based so I don't expect to hit the exact value anyhow.
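
As a userspace sketch of the define plus the range check mentioned above
(VETH_WATCHDOG_TIMEO_MS, USEC_PER_MSEC_SKETCH and veth_tx_usecs_valid() are
hypothetical names; only the 16000 ms value comes from the quoted hunk):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; 16000 ms is the value from the quoted hunk. */
#define VETH_WATCHDOG_TIMEO_MS	16000U
#define USEC_PER_MSEC_SKETCH	1000U

/* The timeout expressed in usecs must fit in 32 bits. */
static_assert(VETH_WATCHDOG_TIMEO_MS <= UINT32_MAX / USEC_PER_MSEC_SKETCH,
	      "watchdog timeout in usecs overflows u32");

/* Accept a tx-usecs value only if it is strictly below the timeout. */
static bool veth_tx_usecs_valid(uint32_t tx_usecs)
{
	return tx_usecs < VETH_WATCHDOG_TIMEO_MS * USEC_PER_MSEC_SKETCH;
}
```

In the driver the check would presumably sit in veth_set_coalesce() and
reject out-of-range values; the exact error handling is left open here.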

--Jesper


end of thread, other threads:[~2026-06-10 12:37 UTC | newest]

Thread overview: 30+ messages
2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
2026-05-27 13:54 ` [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
2026-05-27 13:54 ` [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
2026-05-28  7:45   ` Jonas Köppeler
2026-06-04  8:19   ` Paolo Abeni
2026-06-10 12:21     ` Jesper Dangaard Brouer
2026-05-27 13:54 ` [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net hawk
2026-06-04  8:24   ` Paolo Abeni
2026-06-10 12:37     ` Jesper Dangaard Brouer
2026-05-27 13:54 ` [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
2026-06-04  8:30   ` Paolo Abeni
2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
2026-05-28  7:46   ` Jonas Köppeler
2026-06-01 12:00     ` Simon Schippers
2026-06-01 14:03       ` Jonas Köppeler
2026-06-01 16:16         ` Simon Schippers
2026-06-02  7:24           ` Jonas Köppeler
2026-06-02 15:37             ` Simon Schippers
2026-06-03  8:28               ` Jonas Köppeler
2026-05-29 14:51   ` Simon Schippers
2026-06-04  8:21   ` Paolo Abeni
2026-06-08 10:38   ` Simon Schippers
2026-06-08 13:04     ` Jesper Dangaard Brouer
2026-06-08 13:13       ` Jonas Köppeler
2026-06-08 14:21         ` Simon Schippers
2026-06-09 13:59           ` Jesper Dangaard Brouer
2026-06-09 15:08             ` Simon Schippers
2026-06-10  7:04               ` Simon Schippers
2026-06-10 10:15                 ` Jesper Dangaard Brouer
2026-06-10 12:00                   ` Simon Schippers
