* [PATCH net-next v5 1/5] veth: fix OOB txq access in veth_poll() with asymmetric queue counts
[not found] <20260505132159.241305-1-hawk@kernel.org>
@ 2026-05-05 13:21 ` hawk
2026-05-07 14:25 ` Paolo Abeni
2026-05-05 13:21 ` [PATCH net-next v5 2/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
` (3 subsequent siblings)
4 siblings, 1 reply; 22+ messages in thread
From: hawk @ 2026-05-05 13:21 UTC (permalink / raw)
To: netdev
Cc: hawk, kernel-team, Sashiko, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Stanislav Fomichev,
Toshiaki Makita, linux-kernel, bpf
From: Jesper Dangaard Brouer <hawk@kernel.org>
XDP redirect into a veth device (via bpf_redirect()) calls
veth_xdp_xmit(), which enqueues frames into the peer's ptr_ring using
smp_processor_id() % peer->real_num_rx_queues
as the ring index. With an asymmetric veth pair where the peer has
fewer TX queues than RX queues, that index can equal or exceed
peer->real_num_tx_queues.
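For reference, the ring selection helper in veth.c is roughly (sketch
for illustration, abridged from the driver):

    static int veth_select_rxq(struct net_device *dev)
    {
            return smp_processor_id() % dev->real_num_rx_queues;
    }

veth_xdp_xmit() uses this to pick the peer rq, so the resulting index
is bounded by the peer's RX queue count, not its TX queue count.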
veth_poll() then resolves peer_txq for the ring via:
peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
where queue_idx = rq->xdp_rxq.queue_index. When queue_idx is greater
than or equal to peer_dev->real_num_tx_queues, this is an out-of-bounds
(OOB) access into the peer's netdev_queue array, triggering
DEBUG_NET_WARN_ON_ONCE in netdev_get_tx_queue().
The normal ndo_start_xmit path is not affected: the stack clamps
skb->queue_mapping via netdev_cap_txqueue() before invoking
ndo_start_xmit, so rxq in veth_xmit() never exceeds real_num_tx_queues.
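For comparison, the clamp in net/core/dev.c looks roughly like this
(sketch from memory, details may differ):

    static u16 netdev_cap_txqueue(struct net_device *dev, u16 queue_index)
    {
            if (unlikely(queue_index >= dev->real_num_tx_queues)) {
                    net_warn_ratelimited("%s selects TX queue %d, but real number of TX queues is %d\n",
                                         dev->name, queue_index,
                                         dev->real_num_tx_queues);
                    return 0;
            }
            return queue_index;
    }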
Fix veth_poll() by clamping: only dereference peer_txq when queue_idx is
within bounds, otherwise set it to NULL. The out-of-range rings are fed
exclusively via XDP redirect (veth_xdp_xmit), never via ndo_start_xmit
(veth_xmit), so the peer txq was never stopped and there is nothing to
wake; NULL is the correct fallback.
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260502071828.616C3C19425@smtp.kernel.org/
Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index e35df717e65e..0cfb19b760dd 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -972,7 +972,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
/* NAPI functions as RCU section */
peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
- peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
+ peer_txq = (peer_dev && queue_idx < peer_dev->real_num_tx_queues) ?
+ netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
xdp_set_return_frame_no_direct();
done = veth_xdp_rcv(rq, budget, &bq, &stats);
--
2.43.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH net-next v5 2/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
[not found] <20260505132159.241305-1-hawk@kernel.org>
2026-05-05 13:21 ` [PATCH net-next v5 1/5] veth: fix OOB txq access in veth_poll() with asymmetric queue counts hawk
@ 2026-05-05 13:21 ` hawk
2026-05-05 13:21 ` [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
` (2 subsequent siblings)
4 siblings, 0 replies; 22+ messages in thread
From: hawk @ 2026-05-05 13:21 UTC (permalink / raw)
To: netdev
Cc: hawk, kernel-team, Jonas Köppeler, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, Kuniyuki Iwashima,
Stanislav Fomichev, Christian Brauner, Yury Norov, Krishna Kumar,
Yajun Deng, linux-doc, linux-kernel
From: Jesper Dangaard Brouer <hawk@kernel.org>
Virtual devices with IFF_NO_QUEUE or lltx are excluded from BQL sysfs
by netdev_uses_bql(), since they traditionally lack real hardware
queues. However, some virtual devices like veth implement a real
ptr_ring FIFO with NAPI processing and benefit from BQL to limit
in-flight bytes and reduce latency.
Add a per-device 'bql' bitfield boolean in the priv_flags_slow section
of struct net_device. When set, it overrides the IFF_NO_QUEUE/lltx
exclusion and exposes BQL sysfs entries (/sys/class/net/<dev>/queues/
tx-<n>/byte_queue_limits/). The flag is still gated on CONFIG_BQL.
This allows drivers that use BQL despite being IFF_NO_QUEUE to opt in
to sysfs visibility for monitoring and debugging.
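A driver opts in at device setup time; a minimal sketch (the veth hunk
in a later patch of this series does exactly this):

    dev->lltx = true;   /* lltx device, normally excluded from BQL sysfs */
    dev->bql  = true;   /* opt in to BQL sysfs despite IFF_NO_QUEUE/lltx */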
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
Documentation/networking/net_cachelines/net_device.rst | 1 +
include/linux/netdevice.h | 2 ++
net/core/net-sysfs.c | 8 +++++++-
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 1c19bb7705df..b775d3235a2d 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -170,6 +170,7 @@ unsigned_long:1 see_all_hwtstamp_requests
unsigned_long:1 change_proto_down
unsigned_long:1 netns_immutable
unsigned_long:1 fcoe_mtu
+unsigned_long:1 bql netdev_uses_bql(net-sysfs.c)
struct list_head net_notifier_list
struct macsec_ops* macsec_ops
struct udp_tunnel_nic_info* udp_tunnel_nic_info
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 744ffa243501..d4c7b020b6a7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2065,6 +2065,7 @@ enum netdev_reg_state {
* @change_proto_down: device supports setting carrier via IFLA_PROTO_DOWN
* @netns_immutable: interface can't change network namespaces
* @fcoe_mtu: device supports maximum FCoE MTU, 2158 bytes
+ * @bql: device uses BQL (DQL sysfs) despite having IFF_NO_QUEUE
*
* @net_notifier_list: List of per-net netdev notifier block
* that follow this device when it is moved
@@ -2479,6 +2480,7 @@ struct net_device {
unsigned long change_proto_down:1;
unsigned long netns_immutable:1;
unsigned long fcoe_mtu:1;
+ unsigned long bql:1;
struct list_head net_notifier_list;
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 3318b5666e43..82833e5dae03 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1945,10 +1945,16 @@ static const struct kobj_type netdev_queue_ktype = {
static bool netdev_uses_bql(const struct net_device *dev)
{
+ if (!IS_ENABLED(CONFIG_BQL))
+ return false;
+
+ if (dev->bql)
+ return true;
+
if (dev->lltx || (dev->priv_flags & IFF_NO_QUEUE))
return false;
- return IS_ENABLED(CONFIG_BQL);
+ return true;
}
static int netdev_queue_add_kobject(struct net_device *dev, int index)
--
2.43.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
[not found] <20260505132159.241305-1-hawk@kernel.org>
2026-05-05 13:21 ` [PATCH net-next v5 1/5] veth: fix OOB txq access in veth_poll() with asymmetric queue counts hawk
2026-05-05 13:21 ` [PATCH net-next v5 2/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
@ 2026-05-05 13:21 ` hawk
2026-05-07 6:54 ` Simon Schippers
2026-05-05 13:21 ` [PATCH net-next v5 4/5] veth: add tx_timeout watchdog as BQL safety net hawk
2026-05-05 13:21 ` [PATCH net-next v5 5/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
4 siblings, 1 reply; 22+ messages in thread
From: hawk @ 2026-05-05 13:21 UTC (permalink / raw)
To: netdev
Cc: hawk, kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
From: Jesper Dangaard Brouer <hawk@kernel.org>
Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") gave qdiscs control over veth by returning
NETDEV_TX_BUSY when the ptr_ring is full (DRV_XOFF). That commit noted
a known limitation: the 256-entry ptr_ring sits in front of the qdisc as
a dark buffer, adding base latency because the qdisc has no visibility
into how many bytes are already queued there.
Add BQL support so the qdisc gets feedback and can begin shaping traffic
before the ring fills. In testing with fq_codel, BQL reduces ping RTT
under UDP load from ~6.61ms to ~0.36ms (18x).
Charge a fixed VETH_BQL_UNIT (1) per packet rather than skb->len, so
the DQL limit tracks packets-in-flight. Unlike a physical NIC, veth
has no link speed -- the ptr_ring drains at CPU speed and is
packet-indexed, not byte-indexed, so bytes are not the natural unit.
With byte-based charging, small packets sneak many more entries into
the ring before STACK_XOFF fires, deepening the dark buffer under
mixed-size workloads. Testing with a concurrent min-size packet flood
shows 3.7x ping RTT degradation with skb->len charging versus no
change with fixed-unit charging.
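To illustrate the two charging schemes (the hunk below uses the
fixed-unit variant; the byte-based call is shown only for contrast):

    /* byte-based, as a physical NIC would charge: */
    netdev_tx_sent_queue(txq, skb->len);

    /* fixed-unit, as done here: every ring slot costs the same */
    netdev_tx_sent_queue(txq, VETH_BQL_UNIT);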
Charge BQL inside veth_xdp_rx() under the ptr_ring producer_lock, after
confirming the ring is not full. The charge must precede the produce
because the NAPI consumer can run on another CPU and complete the SKB
the instant it becomes visible in the ring. Doing both under the same
lock avoids a pre-charge/undo pattern -- BQL is only charged when
produce is guaranteed to succeed.
BQL is enabled only when a real qdisc is attached (guarded by
!qdisc_txq_has_no_queue()). Normally HARD_TX_LOCK serializes TXQ
state updates such as dql_queued(), but for lltx devices like veth
that lock is skipped. The ptr_ring producer_lock supplies the missing
serialization, and would even let BQL work correctly with noqueue,
though that combination is not currently enabled, as the netstack
would drop packets and warn.
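For context, qdisc_txq_has_no_queue() (introduced by the earlier
backpressure series) boils down to detecting the noqueue discipline,
roughly (sketch from memory, may differ in detail):

    static inline bool qdisc_txq_has_no_queue(const struct netdev_queue *txq)
    {
            struct Qdisc *qdisc = rcu_access_pointer(txq->qdisc);

            return qdisc->enqueue == NULL;
    }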
Track per-SKB BQL state via a VETH_BQL_FLAG pointer tag in the ptr_ring
entry. This is necessary because the qdisc can be replaced live while
SKBs are in-flight -- each SKB must carry the charge decision made at
enqueue time rather than re-checking the peer's qdisc at completion.
Complete per-SKB in veth_xdp_rcv() rather than in bulk, so STACK_XOFF
clears promptly when producer and consumer run on different CPUs.
BQL introduces a second independent queue-stop mechanism (STACK_XOFF)
alongside the existing DRV_XOFF (ring full). Both must be clear for
the queue to transmit. Reset BQL state in veth_napi_del_range() after
synchronize_net() to avoid racing with in-flight veth_poll() calls.
Clamp the reset loop to the peer's real_num_tx_queues, since the peer
may have fewer TX queues than the local device has RX queues (e.g. when
veth is enslaved to a bond with XDP attached).
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 86 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 75 insertions(+), 11 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0cfb19b760dd..86b78900c48e 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -34,9 +34,13 @@
#define DRV_VERSION "1.0"
#define VETH_XDP_FLAG BIT(0)
+#define VETH_BQL_FLAG BIT(1)
#define VETH_RING_SIZE 256
#define VETH_XDP_HEADROOM (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+/* Fixed BQL charge: DQL limit tracks packets-in-flight, not bytes */
+#define VETH_BQL_UNIT 1
+
#define VETH_XDP_TX_BULK_SIZE 16
#define VETH_XDP_BATCH 16
@@ -280,6 +284,21 @@ static bool veth_is_xdp_frame(void *ptr)
return (unsigned long)ptr & VETH_XDP_FLAG;
}
+static bool veth_ptr_is_bql(void *ptr)
+{
+ return (unsigned long)ptr & VETH_BQL_FLAG;
+}
+
+static struct sk_buff *veth_ptr_to_skb(void *ptr)
+{
+ return (void *)((unsigned long)ptr & ~VETH_BQL_FLAG);
+}
+
+static void *veth_skb_to_ptr(struct sk_buff *skb, bool bql)
+{
+ return bql ? (void *)((unsigned long)skb | VETH_BQL_FLAG) : skb;
+}
+
static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
{
return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
@@ -295,7 +314,7 @@ static void veth_ptr_free(void *ptr)
if (veth_is_xdp_frame(ptr))
xdp_return_frame(veth_ptr_to_xdp(ptr));
else
- kfree_skb(ptr);
+ kfree_skb(veth_ptr_to_skb(ptr));
}
static void __veth_xdp_flush(struct veth_rq *rq)
@@ -309,19 +328,33 @@ static void __veth_xdp_flush(struct veth_rq *rq)
}
}
-static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb, bool do_bql,
+ struct netdev_queue *txq)
{
- if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
+ struct ptr_ring *ring = &rq->xdp_ring;
+
+ spin_lock(&ring->producer_lock);
+ if (unlikely(!ring->size) || __ptr_ring_full(ring)) {
+ spin_unlock(&ring->producer_lock);
return NETDEV_TX_BUSY; /* signal qdisc layer */
+ }
+
+ /* BQL charge before produce; consumer cannot see entry yet */
+ if (do_bql)
+ netdev_tx_sent_queue(txq, VETH_BQL_UNIT);
+
+ __ptr_ring_produce(ring, veth_skb_to_ptr(skb, do_bql));
+ spin_unlock(&ring->producer_lock);
return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
}
static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
- struct veth_rq *rq, bool xdp)
+ struct veth_rq *rq, bool xdp, bool do_bql,
+ struct netdev_queue *txq)
{
return __dev_forward_skb(dev, skb) ?: xdp ?
- veth_xdp_rx(rq, skb) :
+ veth_xdp_rx(rq, skb, do_bql, txq) :
__netif_rx(skb);
}
@@ -348,10 +381,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct veth_rq *rq = NULL;
- struct netdev_queue *txq;
+ struct netdev_queue *txq = NULL;
struct net_device *rcv;
int length = skb->len;
bool use_napi = false;
+ bool do_bql = false;
int ret, rxq;
rcu_read_lock();
@@ -375,8 +409,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
skb_tx_timestamp(skb);
-
- ret = veth_forward_skb(rcv, skb, rq, use_napi);
+ if (rxq < dev->real_num_tx_queues) {
+ txq = netdev_get_tx_queue(dev, rxq);
+ /* BQL charge happens inside veth_xdp_rx() under producer_lock */
+ do_bql = use_napi && !qdisc_txq_has_no_queue(txq);
+ }
+ ret = veth_forward_skb(rcv, skb, rq, use_napi, do_bql, txq);
switch (ret) {
case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
if (!use_napi)
@@ -412,6 +450,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
net_crit_ratelimited("%s(%s): Invalid return code(%d)",
__func__, dev->name, ret);
}
+
rcu_read_unlock();
return ret;
@@ -900,7 +939,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
- struct veth_stats *stats)
+ struct veth_stats *stats,
+ struct netdev_queue *peer_txq)
{
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
@@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
}
} else {
/* ndo_start_xmit */
- struct sk_buff *skb = ptr;
+ bool bql_charged = veth_ptr_is_bql(ptr);
+ struct sk_buff *skb = veth_ptr_to_skb(ptr);
stats->xdp_bytes += skb->len;
+ if (peer_txq && bql_charged)
+ netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+
skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
if (skb) {
if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
@@ -976,7 +1020,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
xdp_set_return_frame_no_direct();
- done = veth_xdp_rcv(rq, budget, &bq, &stats);
+ done = veth_xdp_rcv(rq, budget, &bq, &stats, peer_txq);
if (stats.xdp_redirect > 0)
xdp_do_flush();
@@ -1074,6 +1118,7 @@ static int __veth_napi_enable(struct net_device *dev)
static void veth_napi_del_range(struct net_device *dev, int start, int end)
{
struct veth_priv *priv = netdev_priv(dev);
+ struct net_device *peer;
int i;
for (i = start; i < end; i++) {
@@ -1092,6 +1137,24 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
}
+ /* Reset BQL and wake stopped peer txqs. A concurrent veth_xmit()
+ * may have set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
+ * synchronize_net(), and NAPI can no longer clear it.
+ * Only wake when the device is still up.
+ */
+ peer = rtnl_dereference(priv->peer);
+ if (peer) {
+ int peer_end = min_t(int, end, peer->real_num_tx_queues);
+
+ for (i = start; i < peer_end; i++) {
+ struct netdev_queue *txq = netdev_get_tx_queue(peer, i);
+
+ netdev_tx_reset_queue(txq);
+ if (netif_running(dev))
+ netif_tx_wake_queue(txq);
+ }
+ }
+
for (i = start; i < end; i++) {
page_pool_destroy(priv->rq[i].page_pool);
priv->rq[i].page_pool = NULL;
@@ -1741,6 +1804,7 @@ static void veth_setup(struct net_device *dev)
dev->priv_flags |= IFF_PHONY_HEADROOM;
dev->priv_flags |= IFF_DISABLE_NETPOLL;
dev->lltx = true;
+ dev->bql = true;
dev->netdev_ops = &veth_netdev_ops;
dev->xdp_metadata_ops = &veth_xdp_metadata_ops;
--
2.43.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH net-next v5 4/5] veth: add tx_timeout watchdog as BQL safety net
[not found] <20260505132159.241305-1-hawk@kernel.org>
` (2 preceding siblings ...)
2026-05-05 13:21 ` [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
@ 2026-05-05 13:21 ` hawk
2026-05-05 13:21 ` [PATCH net-next v5 5/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
4 siblings, 0 replies; 22+ messages in thread
From: hawk @ 2026-05-05 13:21 UTC (permalink / raw)
To: netdev
Cc: hawk, kernel-team, Jonas Köppeler, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-kernel
From: Jesper Dangaard Brouer <hawk@kernel.org>
With the introduction of BQL (Byte Queue Limits) for veth, there are
now two independent mechanisms that can stop a transmit queue:
- DRV_XOFF: set by netif_tx_stop_queue() when the ptr_ring is full
- STACK_XOFF: set by BQL when the byte-in-flight limit is reached
If either mechanism stalls without a corresponding wake/completion,
the queue stops permanently. Enable the net device watchdog timer and
implement ndo_tx_timeout as a failsafe recovery.
The timeout handler resets BQL state (clearing STACK_XOFF) and wakes
the queue (clearing DRV_XOFF), covering both stop mechanisms. The
watchdog fires after 16 seconds, which accommodates worst-case NAPI
processing (budget=64 packets x 250ms per-packet consumer delay)
without false positives under normal backpressure.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
drivers/net/veth.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 86b78900c48e..4103d298aa9b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1442,6 +1442,22 @@ static int veth_set_channels(struct net_device *dev,
goto out;
}
+static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+ struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
+
+ netdev_err(dev,
+ "veth backpressure(0x%lX) stalled(n:%ld) TXQ(%u) re-enable\n",
+ txq->state, atomic_long_read(&txq->trans_timeout), txqueue);
+
+ /* Cannot call netdev_tx_reset_queue(): dql_reset() races with
+ * peer NAPI calling dql_completed() concurrently.
+ * Just clear the stop bits; the qdisc will re-stop if still stuck.
+ */
+ clear_bit(__QUEUE_STATE_STACK_XOFF, &txq->state);
+ netif_tx_wake_queue(txq);
+}
+
static int veth_open(struct net_device *dev)
{
struct veth_priv *priv = netdev_priv(dev);
@@ -1780,6 +1796,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_bpf = veth_xdp,
.ndo_xdp_xmit = veth_ndo_xdp_xmit,
.ndo_get_peer_dev = veth_peer_dev,
+ .ndo_tx_timeout = veth_tx_timeout,
};
static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
@@ -1819,6 +1836,7 @@ static void veth_setup(struct net_device *dev)
dev->priv_destructor = veth_dev_free;
dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
dev->max_mtu = ETH_MAX_MTU;
+ dev->watchdog_timeo = msecs_to_jiffies(16000);
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
--
2.43.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH net-next v5 5/5] net: sched: add timeout count to NETDEV WATCHDOG message
[not found] <20260505132159.241305-1-hawk@kernel.org>
` (3 preceding siblings ...)
2026-05-05 13:21 ` [PATCH net-next v5 4/5] veth: add tx_timeout watchdog as BQL safety net hawk
@ 2026-05-05 13:21 ` hawk
4 siblings, 0 replies; 22+ messages in thread
From: hawk @ 2026-05-05 13:21 UTC (permalink / raw)
To: netdev
Cc: hawk, kernel-team, Jakub Kicinski, Jonas Köppeler,
Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, linux-kernel
From: Jesper Dangaard Brouer <hawk@kernel.org>
Add the per-queue timeout counter (trans_timeout) to the core NETDEV
WATCHDOG log message. This makes it easy to determine how frequently
a particular queue is stalling from a single log line, without having
to search through and correlate spaced-out log entries.
Useful for production monitoring where timeouts are spaced by the
watchdog interval, making frequency hard to judge.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/all/20251107175445.58eba452@kernel.org/
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
net/sched/sch_generic.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index a93321db8fd7..3e2e2e887a86 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -533,13 +533,12 @@ static void dev_watchdog(struct timer_list *t)
netif_running(dev) &&
netif_carrier_ok(dev)) {
unsigned int timedout_ms = 0;
+ struct netdev_queue *txq;
unsigned int i;
unsigned long trans_start;
unsigned long oldest_start = jiffies;
for (i = 0; i < dev->num_tx_queues; i++) {
- struct netdev_queue *txq;
-
txq = netdev_get_tx_queue(dev, i);
if (!netif_xmit_stopped(txq))
continue;
@@ -561,9 +560,10 @@ static void dev_watchdog(struct timer_list *t)
if (unlikely(timedout_ms)) {
trace_net_dev_xmit_timeout(dev, i);
- netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms\n",
+ netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms (n:%ld)\n",
raw_smp_processor_id(),
- i, timedout_ms);
+ i, timedout_ms,
+ atomic_long_read(&txq->trans_timeout));
netif_freeze_queues(dev);
dev->netdev_ops->ndo_tx_timeout(dev, i);
netif_unfreeze_queues(dev);
--
2.43.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-05 13:21 ` [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
@ 2026-05-07 6:54 ` Simon Schippers
2026-05-07 13:21 ` Paolo Abeni
2026-05-07 14:34 ` Paolo Abeni
0 siblings, 2 replies; 22+ messages in thread
From: Simon Schippers @ 2026-05-07 6:54 UTC (permalink / raw)
To: hawk, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
On 5/5/26 15:21, hawk@kernel.org wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
>
> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") gave qdiscs control over veth by returning
> NETDEV_TX_BUSY when the ptr_ring is full (DRV_XOFF). That commit noted
> a known limitation: the 256-entry ptr_ring sits in front of the qdisc as
> a dark buffer, adding base latency because the qdisc has no visibility
> into how many bytes are already queued there.
>
> Add BQL support so the qdisc gets feedback and can begin shaping traffic
> before the ring fills. In testing with fq_codel, BQL reduces ping RTT
> under UDP load from ~6.61ms to ~0.36ms (18x).
>
> Charge a fixed VETH_BQL_UNIT (1) per packet rather than skb->len, so
> the DQL limit tracks packets-in-flight. Unlike a physical NIC, veth
> has no link speed -- the ptr_ring drains at CPU speed and is
> packet-indexed, not byte-indexed, so bytes are not the natural unit.
> With byte-based charging, small packets sneak many more entries into
> the ring before STACK_XOFF fires, deepening the dark buffer under
> mixed-size workloads. Testing with a concurrent min-size packet flood
> shows 3.7x ping RTT degradation with skb->len charging versus no
> change with fixed-unit charging.
>
> Charge BQL inside veth_xdp_rx() under the ptr_ring producer_lock, after
> confirming the ring is not full. The charge must precede the produce
> because the NAPI consumer can run on another CPU and complete the SKB
> the instant it becomes visible in the ring. Doing both under the same
> lock avoids a pre-charge/undo pattern -- BQL is only charged when
> produce is guaranteed to succeed.
>
> BQL is enabled only when a real qdisc is attached (guarded by
> !qdisc_txq_has_no_queue()). Normally HARD_TX_LOCK serializes TXQ
> state updates such as dql_queued(), but for lltx devices like veth
> that lock is skipped. The ptr_ring producer_lock supplies the missing
> serialization, and would even let BQL work correctly with noqueue,
> though that combination is not currently enabled, as the netstack
> would drop packets and warn.
>
> Track per-SKB BQL state via a VETH_BQL_FLAG pointer tag in the ptr_ring
> entry. This is necessary because the qdisc can be replaced live while
> SKBs are in-flight -- each SKB must carry the charge decision made at
> enqueue time rather than re-checking the peer's qdisc at completion.
>
> Complete per-SKB in veth_xdp_rcv() rather than in bulk, so STACK_XOFF
> clears promptly when producer and consumer run on different CPUs.
>
> BQL introduces a second independent queue-stop mechanism (STACK_XOFF)
> alongside the existing DRV_XOFF (ring full). Both must be clear for
> the queue to transmit. Reset BQL state in veth_napi_del_range() after
> synchronize_net() to avoid racing with in-flight veth_poll() calls.
> Clamp the reset loop to the peer's real_num_tx_queues, since the peer
> may have fewer TX queues than the local device has RX queues (e.g. when
> veth is enslaved to a bond with XDP attached).
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 86 ++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 75 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 0cfb19b760dd..86b78900c48e 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -34,9 +34,13 @@
> #define DRV_VERSION "1.0"
>
> #define VETH_XDP_FLAG BIT(0)
> +#define VETH_BQL_FLAG BIT(1)
> #define VETH_RING_SIZE 256
> #define VETH_XDP_HEADROOM (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
>
> +/* Fixed BQL charge: DQL limit tracks packets-in-flight, not bytes */
> +#define VETH_BQL_UNIT 1
> +
> #define VETH_XDP_TX_BULK_SIZE 16
> #define VETH_XDP_BATCH 16
>
> @@ -280,6 +284,21 @@ static bool veth_is_xdp_frame(void *ptr)
> return (unsigned long)ptr & VETH_XDP_FLAG;
> }
>
> +static bool veth_ptr_is_bql(void *ptr)
> +{
> + return (unsigned long)ptr & VETH_BQL_FLAG;
> +}
> +
> +static struct sk_buff *veth_ptr_to_skb(void *ptr)
> +{
> + return (void *)((unsigned long)ptr & ~VETH_BQL_FLAG);
> +}
> +
> +static void *veth_skb_to_ptr(struct sk_buff *skb, bool bql)
> +{
> + return bql ? (void *)((unsigned long)skb | VETH_BQL_FLAG) : skb;
> +}
> +
> static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
> {
> return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
> @@ -295,7 +314,7 @@ static void veth_ptr_free(void *ptr)
> if (veth_is_xdp_frame(ptr))
> xdp_return_frame(veth_ptr_to_xdp(ptr));
> else
> - kfree_skb(ptr);
> + kfree_skb(veth_ptr_to_skb(ptr));
> }
>
> static void __veth_xdp_flush(struct veth_rq *rq)
> @@ -309,19 +328,33 @@ static void __veth_xdp_flush(struct veth_rq *rq)
> }
> }
>
> -static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> +static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb, bool do_bql,
> + struct netdev_queue *txq)
> {
> - if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
> + struct ptr_ring *ring = &rq->xdp_ring;
> +
> + spin_lock(&ring->producer_lock);
> + if (unlikely(!ring->size) || __ptr_ring_full(ring)) {
> + spin_unlock(&ring->producer_lock);
> return NETDEV_TX_BUSY; /* signal qdisc layer */
> + }
> +
> + /* BQL charge before produce; consumer cannot see entry yet */
> + if (do_bql)
> + netdev_tx_sent_queue(txq, VETH_BQL_UNIT);
> +
> + __ptr_ring_produce(ring, veth_skb_to_ptr(skb, do_bql));
> + spin_unlock(&ring->producer_lock);
>
> return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> }
>
> static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
> - struct veth_rq *rq, bool xdp)
> + struct veth_rq *rq, bool xdp, bool do_bql,
> + struct netdev_queue *txq)
> {
> return __dev_forward_skb(dev, skb) ?: xdp ?
> - veth_xdp_rx(rq, skb) :
> + veth_xdp_rx(rq, skb, do_bql, txq) :
> __netif_rx(skb);
> }
>
> @@ -348,10 +381,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> struct veth_rq *rq = NULL;
> - struct netdev_queue *txq;
> + struct netdev_queue *txq = NULL;
> struct net_device *rcv;
> int length = skb->len;
> bool use_napi = false;
> + bool do_bql = false;
> int ret, rxq;
>
> rcu_read_lock();
> @@ -375,8 +409,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
>
> skb_tx_timestamp(skb);
> -
> - ret = veth_forward_skb(rcv, skb, rq, use_napi);
> + if (rxq < dev->real_num_tx_queues) {
> + txq = netdev_get_tx_queue(dev, rxq);
> + /* BQL charge happens inside veth_xdp_rx() under producer_lock */
> + do_bql = use_napi && !qdisc_txq_has_no_queue(txq);
> + }
> + ret = veth_forward_skb(rcv, skb, rq, use_napi, do_bql, txq);
> switch (ret) {
> case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
> if (!use_napi)
> @@ -412,6 +450,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> net_crit_ratelimited("%s(%s): Invalid return code(%d)",
> __func__, dev->name, ret);
> }
> +
> rcu_read_unlock();
>
> return ret;
> @@ -900,7 +939,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
>
> static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> struct veth_xdp_tx_bq *bq,
> - struct veth_stats *stats)
> + struct veth_stats *stats,
> + struct netdev_queue *peer_txq)
> {
> int i, done = 0, n_xdpf = 0;
> void *xdpf[VETH_XDP_BATCH];
> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> }
> } else {
> /* ndo_start_xmit */
> - struct sk_buff *skb = ptr;
> + bool bql_charged = veth_ptr_is_bql(ptr);
> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>
> stats->xdp_bytes += skb->len;
> + if (peer_txq && bql_charged)
> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
In the discussion with Jonas [1], I left a comment explaining why I think
this doesn’t work.
I still think first that adding an option to modify the hard-coded
VETH_RING_SIZE is the way to go.
Thanks!
[1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
> +
> skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
> if (skb) {
> if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
> @@ -976,7 +1020,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
> netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>
> xdp_set_return_frame_no_direct();
> - done = veth_xdp_rcv(rq, budget, &bq, &stats);
> + done = veth_xdp_rcv(rq, budget, &bq, &stats, peer_txq);
>
> if (stats.xdp_redirect > 0)
> xdp_do_flush();
> @@ -1074,6 +1118,7 @@ static int __veth_napi_enable(struct net_device *dev)
> static void veth_napi_del_range(struct net_device *dev, int start, int end)
> {
> struct veth_priv *priv = netdev_priv(dev);
> + struct net_device *peer;
> int i;
>
> for (i = start; i < end; i++) {
> @@ -1092,6 +1137,24 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
> ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
> }
>
> + /* Reset BQL and wake stopped peer txqs. A concurrent veth_xmit()
> + * may have set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
> + * synchronize_net(), and NAPI can no longer clear it.
> + * Only wake when the device is still up.
> + */
> + peer = rtnl_dereference(priv->peer);
> + if (peer) {
> + int peer_end = min_t(int, end, peer->real_num_tx_queues);
> +
> + for (i = start; i < peer_end; i++) {
> + struct netdev_queue *txq = netdev_get_tx_queue(peer, i);
> +
> + netdev_tx_reset_queue(txq);
> + if (netif_running(dev))
> + netif_tx_wake_queue(txq);
> + }
> + }
> +
> for (i = start; i < end; i++) {
> page_pool_destroy(priv->rq[i].page_pool);
> priv->rq[i].page_pool = NULL;
> @@ -1741,6 +1804,7 @@ static void veth_setup(struct net_device *dev)
> dev->priv_flags |= IFF_PHONY_HEADROOM;
> dev->priv_flags |= IFF_DISABLE_NETPOLL;
> dev->lltx = true;
> + dev->bql = true;
>
> dev->netdev_ops = &veth_netdev_ops;
> dev->xdp_metadata_ops = &veth_xdp_metadata_ops;
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-07 6:54 ` Simon Schippers
@ 2026-05-07 13:21 ` Paolo Abeni
2026-05-07 14:34 ` Paolo Abeni
1 sibling, 0 replies; 22+ messages in thread
From: Paolo Abeni @ 2026-05-07 13:21 UTC (permalink / raw)
To: Simon Schippers, hawk, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
On 5/7/26 8:54 AM, Simon Schippers wrote:
>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> }
>> } else {
>> /* ndo_start_xmit */
>> - struct sk_buff *skb = ptr;
>> + bool bql_charged = veth_ptr_is_bql(ptr);
>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>
>> stats->xdp_bytes += skb->len;
>> + if (peer_txq && bql_charged)
>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>
> In the discussion with Jonas [1], I left a comment explaining why I think
> this doesn’t work.
>
> I still think first that adding an option to modify the hard-coded
> VETH_RING_SIZE is the way to go.
Isn't veth_poll() (towards the end of the function) a more natural
place to issue completion events?
/P
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 1/5] veth: fix OOB txq access in veth_poll() with asymmetric queue counts
2026-05-05 13:21 ` [PATCH net-next v5 1/5] veth: fix OOB txq access in veth_poll() with asymmetric queue counts hawk
@ 2026-05-07 14:25 ` Paolo Abeni
0 siblings, 0 replies; 22+ messages in thread
From: Paolo Abeni @ 2026-05-07 14:25 UTC (permalink / raw)
To: hawk, netdev
Cc: kernel-team, Sashiko, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, Toshiaki Makita, linux-kernel,
bpf
On 5/5/26 3:21 PM, hawk@kernel.org wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
>
> XDP redirect into a veth device (via bpf_redirect()) calls
> veth_xdp_xmit(), which enqueues frames into the peer's ptr_ring using
> smp_processor_id() % peer->real_num_rx_queues
> as the ring index. With an asymmetric veth pair where the peer has
> fewer TX queues than RX queues, that index can exceed
> peer->real_num_tx_queues.
>
> veth_poll() then resolves peer_txq for the ring via:
>
> peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>
> where queue_idx = rq->xdp_rxq.queue_index. When queue_idx exceeds
> peer_dev->real_num_tx_queues this is an out-of-bounds (OOB) access
> into the peer's netdev_queue array, triggering DEBUG_NET_WARN_ON_ONCE
> in netdev_get_tx_queue().
>
> The normal ndo_start_xmit path is not affected: the stack clamps
> skb->queue_mapping via netdev_cap_txqueue() before invoking
> ndo_start_xmit, so rxq in veth_xmit() never exceeds real_num_tx_queues.
>
> Fix veth_poll() by clamping: only dereference peer_txq when queue_idx is
> within bounds, otherwise set it to NULL. The out-of-range rings are fed
> exclusively via XDP redirect (veth_xdp_xmit), never via ndo_start_xmit
> (veth_xmit), so the peer txq was never stopped and there is nothing to
> wake; NULL is the correct fallback.
>
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://lore.kernel.org/all/20260502071828.616C3C19425@smtp.kernel.org/
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
This looks fairly uncontroversial, but it's IMHO net material. Let me
apply it there.
/P
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-07 6:54 ` Simon Schippers
2026-05-07 13:21 ` Paolo Abeni
@ 2026-05-07 14:34 ` Paolo Abeni
2026-05-07 14:46 ` Simon Schippers
1 sibling, 1 reply; 22+ messages in thread
From: Paolo Abeni @ 2026-05-07 14:34 UTC (permalink / raw)
To: Simon Schippers, hawk, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
On 5/7/26 8:54 AM, Simon Schippers wrote:
> On 5/5/26 15:21, hawk@kernel.org wrote:
>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> }
>> } else {
>> /* ndo_start_xmit */
>> - struct sk_buff *skb = ptr;
>> + bool bql_charged = veth_ptr_is_bql(ptr);
>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>
>> stats->xdp_bytes += skb->len;
>> + if (peer_txq && bql_charged)
>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>
> In the discussion with Jonas [1], I left a comment explaining why I think
> this doesn’t work.
>
> I still think first that adding an option to modify the hard-coded
> VETH_RING_SIZE is the way to go.
>
> Thanks!
>
> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
In the above discussion a 20% regression is reported, which IMHO can't
be ignored. Still the tput figures in the data are extremely low,
something is possibly off?!? I would expect a few Mpps with pktgen on
top of veth, while the reported data is ~20-30Kpps.
/P
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-07 14:34 ` Paolo Abeni
@ 2026-05-07 14:46 ` Simon Schippers
2026-05-07 19:09 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 22+ messages in thread
From: Simon Schippers @ 2026-05-07 14:46 UTC (permalink / raw)
To: Paolo Abeni, hawk, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
On 5/7/26 16:34, Paolo Abeni wrote:
> On 5/7/26 8:54 AM, Simon Schippers wrote:
>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>> }
>>> } else {
>>> /* ndo_start_xmit */
>>> - struct sk_buff *skb = ptr;
>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>
>>> stats->xdp_bytes += skb->len;
>>> + if (peer_txq && bql_charged)
>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>
>> In the discussion with Jonas [1], I left a comment explaining why I think
>> this doesn’t work.
>>
>> I still think first that adding an option to modify the hard-coded
>> VETH_RING_SIZE is the way to go.
>>
>> Thanks!
>>
>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
> In the above discussion a 20% regression is reported, which IMHO can't
> be ignored. Still the tput figures in the data are extremely low,
> something is possibly off?!? I would expect a few Mpps with pktgen on
> top of veth, while the reported data is ~20-30Kpps.
>
> /P
>
The ~20-30Kpps occur when thousands of iptables rules are applied and
a UDP userspace application is sending.
And there is a 20% pktgen regression (no iptables rules applied).
I am pretty sure the reason is that the BQL limit is stuck at 2
packets (because the completed-queue accounting is always called with
1 packet and not from an interrupt/timer with multiple packets...).
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-07 14:46 ` Simon Schippers
@ 2026-05-07 19:09 ` Jesper Dangaard Brouer
2026-05-07 20:12 ` Simon Schippers
2026-05-09 2:06 ` Jakub Kicinski
0 siblings, 2 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2026-05-07 19:09 UTC (permalink / raw)
To: Simon Schippers, Paolo Abeni, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
[-- Attachment #1: Type: text/plain, Size: 3900 bytes --]
On 07/05/2026 16.46, Simon Schippers wrote:
>
>
> On 5/7/26 16:34, Paolo Abeni wrote:
>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>> }
>>>> } else {
>>>> /* ndo_start_xmit */
>>>> - struct sk_buff *skb = ptr;
>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>
>>>> stats->xdp_bytes += skb->len;
>>>> + if (peer_txq && bql_charged)
>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>
>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>> this doesn’t work.
>>>
I've experimented with doing the "completion" at NAPI-end in
veth_poll(), but that resulted in BQL limit being 128 packets, which
leads to bad latency results (not acceptable).
(See detailed report later)
>>> I still think first that adding an option to modify the hard-coded
>>> VETH_RING_SIZE is the way to go.
>>>
Not against being able to modify VETH_RING_SIZE, but I don't think it is
the solution here.
The simple solution is to configure the BQL limit_min:
`/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
My experiments (below) find that limit_min=8 gives good performance.
We can simply set the default to 8, as this still allows userspace to
change it later if lower latency is preferred.
>>> Thanks!
>>>
>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>
>> In the above discussion a 20% regression is reported, which IMHO can't
>> be ignored. Still the tput figures in the data are extremely low,
>> something is possibly off?!? I would expect a few Mpps with pktgen on
>> top of veth, while the reported data is ~20-30Kpps.
>>
>> /P
>>
>
> The ~20-30Kpps occur when thousands of iptables rules are applied and
> a UDP userspace application is sending.
>
> And there is a 20% pktgen regression (no iptables rules applied).
>
The pktgen test is a little dubious/weird, and Jonas had to modify
pktgen to test this. John Fastabend added a config to pktgen that
allows us to benchmark the egress qdisc path; it might be better to
use that. The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
script is a demo usage.
If redoing the tests, can you adjust limit_min to see the effect?
/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
A 20% throughput regression is of course too much, but I will remind
us that adding a qdisc will "cost" some overhead; that is a
configuration choice. Our purpose here is to reduce bufferbloat and
latency, not to optimize for throughput.
> I am pretty sure the reason is that the BQL limit is stuck at 2
> packets (because the completed-queue accounting is always called with
> 1 packet and not from an interrupt/timer with multiple packets...).
>
I've run a lot of experiments and had AI write up a report on them;
see the attachment. The TL;DR is that the best performance vs latency
tradeoff is defaulting the BQL/DQL limit_min to 8 packets.
I fear this patchset will stall forever if we keep searching for a
perfect solution without any overhead. The qdisc layer will always be
a baseline overhead. The limit=2 packets is actually the optimal
dark-buffer queue size, but I acknowledge that it causes too many qdisc
requeue events (leading to overhead). I suggest that I add another
patch in V6 that defaults limit_min to 8 (a separate patch, to make it
easier to revert/adjust later).
I've talked with Jonas, and we want to experiment with different
solutions to make BQL/DQL work better with virtual devices.
This patchset helps our (production) use case reduce mice-flow
latency under load from approx 22ms to 1.3ms. Because the consumer
namespace is the bottleneck, the requeue overhead is negligible in
comparison.
-Jesper
[-- Attachment #2: PERF-2651-bql-completion-experiment.md --]
[-- Type: text/markdown, Size: 13601 bytes --]
# PERF-2651: BQL Completion Batching Experiment (2026-05-05)
## Background
Simon Schippers and Jonas Koeppeler raised concerns that DQL settles at
limit=2 with veth BQL, citing the netdevice.h comment:
> "Must be called at most once per TX completion round (and not per
> individual packet), so that BQL can adjust its limits appropriately."
And Tom Herbert's original BQL cover letter:
> "BQL accounting is in the transmit path for every packet, and the
> function to recompute the byte limit is run once per transmit completion."
Thread: https://lore.kernel.org/all/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
## Experiment: Batch BQL completion at end of veth_poll
Created stg patch `experiment-batch-bql-completion` that moves
`netdev_tx_completed_queue()` from per-SKB inside `veth_xdp_rcv()` to a
single batched call at the end of `veth_poll()`.
### Code change (drivers/net/veth.c)
In `veth_xdp_rcv()`: replace per-SKB completion with counter accumulation:
```c
// Before (V5, per-packet):
if (peer_txq && bql_charged)
netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
// After (experiment, accumulate):
if (peer_txq && bql_charged)
stats->bql_completed += VETH_BQL_UNIT;
```
In `veth_poll()`: single batched call after veth_xdp_rcv() returns:
```c
if (peer_txq && stats.bql_completed)
netdev_tx_completed_queue(peer_txq, stats.bql_completed,
stats.bql_completed);
```
Note: cannot use `done` (return value of veth_xdp_rcv) because it counts
all consumed ring entries including XDP frames that were never BQL-charged.
Using `done` would over-complete and hit BUG_ON in dql_completed().
## Why DQL settles at limit=2 with per-packet completion
The DQL slack calculation in `dql_completed()` uses:
```c
slack = POSDIFF(limit + prev_ovlimit, 2 * (completed - num_completed));
```
`completed - num_completed` equals the `count` parameter (bytes completed
this call). Per-packet: count=1, so slack = limit + prev_ovlimit - 2.
With limit=2, slack=0, so the algorithm holds steady at 2.
With batched completion: count=~64, slack calculation sees the real batch
size, and DQL converges to limit=128 (~2x NAPI budget).
## Results: nrules=3500 (sfq + tiny-flood)
| Metric        | No BQL    | Per-pkt (limit=2) | limit=4 | limit=8 | limit=16 | Batched (limit=128) |
|---------------|-----------|-------------------|---------|---------|----------|---------------------|
| BQL limit | unlimited | 2 | 4 | 8 | 16 | 128 |
| BQL inflight | 254 | 3 | 5 | 9-17 | 25 | 133 |
| Ping RTT avg | 9.3ms | 0.94ms | 1.07ms | 1.24ms | 1.69ms | 4.0ms |
| requeues | 52K | 454K | 426K | 399K | 356K | 112K |
| NAPI avg_work | 63 | 5 | 15 | 63 | 63 | 63 |
| NAPI polls | ~2.2K | ~27K | ~10.5K | ~2.5K | ~2.3K | ~2.5K |
| Consumer pps | ~26K | ~30K | ~30K | ~30K | ~29K | ~30K |
## Results: nrules=15000 (sfq + tiny-flood, slower consumer)
| Metric        | No BQL     | Per-pkt (limit=2) | limit=4   | limit=8   | limit=16  | Batched (limit=128) |
|---------------|------------|-------------------|-----------|-----------|-----------|---------------------|
| BQL limit | unlimited | 2 | 4 | 8 | 16 | 128 |
| BQL inflight | 211-227 | 3 | 5-12 | 9 | 17 | 132-136 |
| Ping RTT avg | **37.8ms** | **4.5ms** | **5.0ms** | **6.0ms** | **6.4ms** | **20.0ms** |
| Ping RTT min | 27.7ms | 1.4ms | 1.7ms | 2.4ms | 3.0ms | 10.4ms |
| requeues | 12.9K | 93K | 87K | 80K | 86K | 22.8K |
| NAPI avg_work | 61 | 6 | 17 | 60 | 61 | 61 |
| NAPI polls | ~540 | ~4.9K | ~1.9K | ~540 | ~550 | ~540 |
| Consumer pps | ~6.7K | ~6.7K | ~6.9K | ~6.8K | ~7.0K | ~6.7K |
## Analysis
### Batched completion is clearly worse for latency
At nrules=15000, batched completion gives 20ms ping RTT -- only 2x better
than no-BQL (37.8ms). Per-packet gives 4.5ms -- an 8x improvement.
The math confirms this: 128 packets / 6.7K pps = 19ms of uncontrolled
queuing delay. This matches the measured 20ms almost exactly.
### Per-packet completion (limit=2) is correct for veth
Simon's concern that limit=2 is a DQL defect is wrong. limit=2 is the
ideal behavior for dark-buffer elimination:
- Only 2-3 packets in the ptr_ring at any time
- Qdisc gets immediate control over all buffering
- 8x latency reduction vs no-BQL
The DQL comment "once per TX completion round" was written for HW NICs
where interrupt coalescing batches completions naturally. For veth, each
per-SKB completion within a NAPI poll technically violates the letter of
the comment, but the resulting limit=2 is correct for the use case.
The concern with limit=2 is the overhead it introduces:
### Trade-off: NAPI polling overhead
Per-packet (limit=2) causes many more NAPI polls:
- nrules=3500: 27K polls (avg_work=5) vs 2.5K polls (avg_work=63)
- nrules=15000: 4.9K polls (avg_work=6) vs 540 polls (avg_work=61)
This is because with only 2-3 items in the ring, each NAPI poll drains
the ring quickly -> napi_complete_done -> reschedule. More scheduling
overhead, but no throughput impact when the consumer is the bottleneck.
### limit_min tuning via sysfs
DQL limit_min can be set via:
`/sys/class/net/<dev>/queues/tx-0/byte_queue_limits/limit_min`
The selftest `--bql-min-limit N` flag writes to this sysfs.
- **limit_min=4**: half a cache-line (32 bytes of ptr_ring pointers).
avg_work=17, 1.9K polls. Ping 5.0ms -- close to limit=2 (4.5ms).
- **limit_min=8**: one cache-line (64 bytes of ptr_ring pointers).
avg_work=60, 540 polls. Ping 6.0ms -- efficient full-budget polls.
### Dark buffer formula
At consumer rate R (pps) and BQL limit L (packets):
- Dark buffer latency = L / R
- limit=2: 2/6700 = 0.3ms (negligible)
- limit=8: 8/6700 = 1.2ms
- limit=128: 128/6700 = 19ms (matches measured 20ms)
- unlimited (254): 254/6700 = 38ms (matches measured 37.8ms)
## Results: nrules=0 (no consumer overhead, max throughput)
This tests the raw throughput overhead of BQL stop/start oscillation.
All values are averages of 4 runs (VM noise is ~15-20% per-run variance).
| Metric | No BQL | limit=2 | limit=4 | limit=8 | limit=16 |
|-----------------|--------|---------|---------|---------|----------|
| Sink pps (large)| 841K | 759K | 692K | 762K | 736K |
| Sink pps (small)| 950K | 874K | 807K | 874K | 844K |
| qdisc pkts | 48.6M | 44.8M | 40.1M | 45.0M | 44.8M |
| requeues | 311K | 6.1M | 13.4M | 5.8M | 5.2M |
| NAPI avg_work | 22 | 27 | 12 | 19 | 21 |
| Ping RTT avg | 0.17ms | 0.11ms | 0.10ms | 0.085ms | 0.095ms |
| Runs | 4 | 4 | 4 | 4 | 4 |
Observations:
- **limit=2 is NOT the worst** -- limit=4 has higher requeues (13.4M) and
lower throughput (692K sink) due to more stop/start cycles at a less
efficient NAPI batch size (avg_work=12)
- **limit=8 and limit=16 match No-BQL throughput** within noise (~762K vs 841K
sink pps for large pkts, ~3-10% difference)
- **Requeue overhead**: 311K (No BQL) -> 5.2-5.8M (limit=8/16) -> 13.4M (limit=4)
- Latency sub-0.2ms for all settings at this speed -- not a differentiator
## Comparison: limit=8 vs limit=16
Multi-run (4 iterations each, nrules=0) to cut through VM noise:
### limit=8 (4 runs)
| Run | Sink pps (large/small) | qdisc pkts | requeues | avg_work | Ping avg |
|-----|------------------------|-----------|----------|----------|----------|
| 1 | 796K / 911K | 46.2M | 5.6M | 20 | 0.062ms |
| 2 | 796K / 883K | 45.5M | 4.7M | 16 | 0.081ms |
| 3 | 654K / 836K | 43.5M | 8.3M | 22 | 0.100ms |
| 4 | 803K / 865K | 44.8M | 4.4M | 16 | 0.095ms |
| **avg** | **762K / 874K** | **45.0M** | **5.8M** | **19** | **0.085ms** |
### limit=16 (4 runs)
| Run | Sink pps (large/small) | qdisc pkts | requeues | avg_work | Ping avg |
|-----|------------------------|-----------|----------|----------|----------|
| 1 | 844K / 940K | 48.1M | 3.3M | 20 | 0.081ms |
| 2 | 768K / 873K | 45.6M | 4.1M | 15 | 0.097ms |
| 3 | 733K / 804K | 44.8M | 6.5M | 26 | 0.085ms |
| 4 | 597K / 757K | 40.7M | 6.9M | 23 | 0.115ms |
| **avg** | **736K / 844K** | **44.8M** | **5.2M** | **21** | **0.095ms** |
### Averaged comparison (nrules=0, 4 runs)
| Metric | limit=8 | limit=16 |
|---------------------|-----------|-----------|
| Sink pps (large) | 762K | 736K |
| Sink pps (small) | 874K | 844K |
| qdisc pkts | 45.0M | 44.8M |
| requeues | 5.8M | 5.2M |
| avg_work | 19 | 21 |
| Ping RTT avg | 0.085ms | 0.095ms |
At max throughput, limit=8 and limit=16 are within VM noise (~3-4%).
### Cross-load comparison (all averages of 4 runs)
| Metric | limit=8 | limit=16 | Winner |
|---------------|---------|----------|---------------|
| nrules=15000: | | | |
| Ping RTT | 6.73ms | 8.00ms | 8 (+1.3ms) |
| requeues | 71K | 73K | ~same |
| avg_work | 59 | 59 | ~same |
| nrules=3500: | | | |
| Ping RTT | 1.77ms | 2.11ms | 8 (+0.34ms) |
| requeues | 279K | 282K | ~same |
| avg_work | 62 | 62 | ~same |
| nrules=0: | | | |
| Sink pps | 762K | 736K | ~same (noise) |
| requeues | 5.8M | 5.2M | ~same (noise) |
**Verdict: limit=8 is the better default.**
- Consistent latency advantage under load: +1.3ms at nrules=15000,
+0.34ms at nrules=3500 (reproducible across 4 runs each)
- Throughput indistinguishable from limit=16 after averaging
- One cache-line (64 bytes) is a clean hardware alignment
- More conservative -- smaller dark buffer
## Proposed patch: dql_set_min_limit() + veth default min_limit=8
Two-part solution in stg patch `veth-set-bql-min-limit-8`:
### 1. New DQL API helper (include/linux/dynamic_queue_limits.h)
```c
static inline void dql_set_min_limit(struct dql *dql, unsigned int min_limit)
{
dql->min_limit = min_limit;
}
```
Gives drivers a clean API to set a default floor. Currently no driver
sets min_limit -- all rely on the dql_init() default of 0 or user sysfs.
### 2. Veth sets min_limit=8 at device creation (drivers/net/veth.c)
In `veth_init_queues()`, after TX queue setup:
```c
#ifdef CONFIG_BQL
for (i = 0; i < dev->num_tx_queues; i++)
dql_set_min_limit(&netdev_get_tx_queue(dev, i)->dql,
VETH_BQL_UNIT * 8);
#endif
```
Called for both `dev` and `peer` in `veth_newlink()`. Uses
`num_tx_queues` (all pre-allocated queues), not `real_num_tx_queues`,
so channel changes via `ethtool -L` are covered -- no new queues are
ever created at runtime.
### Why min_limit=8
- One cache-line of ptr_ring pointers (8 x 8 = 64 bytes)
- Lowest requeue count at max throughput (5.3M vs 16.9M at limit=2)
- Keeps full-budget NAPI polls (avg_work=63) -- no scheduling overhead
- Latency only 0.3ms worse than limit=2 at moderate load (1.24ms vs 0.94ms)
- Still 6x better latency than no-BQL at heavy load (6ms vs 37.8ms)
- User can lower to 0 or raise via sysfs limit_min at any time
### Verified: driver default works (nrules=15000, --hist)
Tested with `veth-set-bql-min-limit-8` patch applied, no `--bql-min-limit`
sysfs override. BQL limit=8 held stable, ping RTT ~6.5ms (matches sysfs
override results).
BQL inflight histogram (bpftrace, 169K samples):
```
[1] 15 | |
[2, 4) 21193 |@@@@@@@@@@@@@ |
[4, 8) 63615 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16) 80116 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 4709 |@@@ |
```
- Inflight avg=7, max=17 -- ring stays shallow
- Peak at [8,16): inflight near the limit=8 floor most of the time
- [4,8) second: ring draining between NAPI polls
- [16,32) rare: brief producer bursts
- stack_xoff ~15K/5s, drv_xoff=0 -- BQL stops queue well before ring fills
- NAPI avg_work=61, almost all full-budget polls
## Conclusion
Per-packet BQL completion in V5 is the right design. It gives DQL the
information it needs to keep the dark buffer minimal, which is exactly
what we want for latency reduction.
Simon's suggestion to call netdev_tx_completed_queue() once per NAPI poll
would regress ping latency from 4.5ms to 20ms at production-like iptables
rule counts.
The default min_limit=8 (via dql_set_min_limit) is the proposed follow-up
to address the requeue overhead that per-packet completion causes. It
keeps latency close to optimal while reducing the ~10% throughput loss
and 20x requeue increase (6.1M vs 311K) that limit=2 causes at max speed.
Users wanting tighter latency can set limit_min=0 via sysfs to get the
original limit=2 behavior.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-07 19:09 ` Jesper Dangaard Brouer
@ 2026-05-07 20:12 ` Simon Schippers
2026-05-07 20:45 ` Jesper Dangaard Brouer
2026-05-09 2:06 ` Jakub Kicinski
1 sibling, 1 reply; 22+ messages in thread
From: Simon Schippers @ 2026-05-07 20:12 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Paolo Abeni, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
On 5/7/26 21:09, Jesper Dangaard Brouer wrote:
>
>
> On 07/05/2026 16.46, Simon Schippers wrote:
>>
>>
>> On 5/7/26 16:34, Paolo Abeni wrote:
>>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>> }
>>>>> } else {
>>>>> /* ndo_start_xmit */
>>>>> - struct sk_buff *skb = ptr;
>>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>> stats->xdp_bytes += skb->len;
>>>>> + if (peer_txq && bql_charged)
>>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>>
>>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>>> this doesn’t work.
>>>>
>
> I've experimented with doing the "completion" at NAPI-end in
> veth_poll(), but that resulted in the BQL limit being 128 packets, which
> leads to bad latency results (not acceptable).
> (See detailed report later)
>
>
>>>> I still think first that adding an option to modify the hard-coded
>>>> VETH_RING_SIZE is the way to go.
>>>>
>
> Not against being able to modify VETH_RING_SIZE, but I don't think it is
> the solution here.
>
> The simple solution is to configure BQL limit_min:
> `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
>
> My experiments (below) find that limit_min=8 gives good performance.
> We can simply set the default to 8, as this still allows userspace to change
> it later if lower latency is preferred.
>
>>>> Thanks!
>>>>
>>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>>
>>> In the above discussion a 20% regression is reported, which IMHO can't
>>> be ignored. Still the tput figures in the data are extremely low,
>>> something is possibly off?!? I would expect a few Mpps with pktgen on
>>> top of veth, while the reported data is ~20-30Kpps.
>>>
>>> /P
>>>
>>
>> The ~20-30Kpps occur when thousands of iptables rules are applied and
>> a UDP userspace application is sending.
>>
>> And there is a 20% pktgen regression (no iptables rules applied).
>>
>
> The pktgen test is a little dubious/weird and Jonas had to modify pktgen
> to test this. John Fastabend added a config to pktgen that allows us
> to benchmark the egress qdisc path; it might be better to use that.
> The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh script demonstrates the usage.
>
> If redoing the tests, can you adjust limit_min to see the effect?
> /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
>
> A 20% throughput regression is of course too much, but I will
> remind us that adding a qdisc will "cost" some overhead; that is a
> configuration choice. Our purpose here is to reduce bufferbloat and
> latency, not optimize for throughput.
>
>
>> I am pretty sure the reason is that the BQL limit is stuck at 2
>> packets (because the completed queue is always called with 1 packet
>> and not from an interrupt/timer with multiple packets...).
>>
>
> I've run a lot of experiments, which I had AI write a report on; see attachment. The TL;DR is that the best performance-vs-latency tradeoff is defaulting BQL/DQL limit_min to 8 packets.
>
> I fear this patchset will stall forever if we keep searching for a perfect solution without any overhead. The qdisc layer will be a baseline overhead. The limit of 2 packets is actually the optimal dark-buffer queue size, but I acknowledge that this causes too many qdisc requeue events (leading to overhead). I suggest adding another patch in V6 that defaults limit_min to 8 (a separate patch to make it easier to revert/adjust later).
>
> I've talked with Jonas, and we want to experiment with different solutions to make BQL/DQL work better with virtual devices.
>
> This patchset helps our (production) use-case reduce mice-flow latency
> from approx 22ms to 1.3ms for latency under load. Due to the consumer
> namespace being the bottleneck the requeue overhead is negligible in
> comparison.
>
> -Jesper
First of all, thanks for your work; I really see the advantages of
avoiding bufferbloat :)
But the key part of the BQL algorithm, the *dynamic* adaptation of the
limit, is not working. Always calling netdev_completed_queue() with
1 packet results in a static limit of 2 packets (as seen in Jonas's
measurements), which you force up to 8 packets.
So in the end this patchset has the same effect as just setting
VETH_RING_SIZE to 8 (and giving an option to change this value).
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-07 20:12 ` Simon Schippers
@ 2026-05-07 20:45 ` Jesper Dangaard Brouer
2026-05-08 8:01 ` Simon Schippers
0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2026-05-07 20:45 UTC (permalink / raw)
To: Simon Schippers, Paolo Abeni, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
[-- Attachment #1: Type: text/plain, Size: 4859 bytes --]
On 07/05/2026 22.12, Simon Schippers wrote:
> On 5/7/26 21:09, Jesper Dangaard Brouer wrote:
>>
>>
>> On 07/05/2026 16.46, Simon Schippers wrote:
>>>
>>>
>>> On 5/7/26 16:34, Paolo Abeni wrote:
>>>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>>> }
>>>>>> } else {
>>>>>> /* ndo_start_xmit */
>>>>>> - struct sk_buff *skb = ptr;
>>>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>>> stats->xdp_bytes += skb->len;
>>>>>> + if (peer_txq && bql_charged)
>>>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>>>
>>>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>>>> this doesn’t work.
>>>>>
>>
>> I've experimented with doing the "completion" at NAPI-end in
>> veth_poll(), but that resulted in the BQL limit being 128 packets, which
>> leads to bad latency results (not acceptable).
>> (See detailed report later)
>>
>>
>>>>> I still think first that adding an option to modify the hard-coded
>>>>> VETH_RING_SIZE is the way to go.
>>>>>
>>
>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>> the solution here.
>>
>> The simple solution is to configure BQL limit_min:
>> `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
>>
>> My experiments (below) find that limit_min=8 gives good performance.
>> We can simply set the default to 8, as this still allows userspace to change
>> it later if lower latency is preferred.
>>
>>>>> Thanks!
>>>>>
>>>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>>>
>>>> In the above discussion a 20% regression is reported, which IMHO can't
>>>> be ignored. Still the tput figures in the data are extremely low,
>>>> something is possibly off?!? I would expect a few Mpps with pktgen on
>>>> top of veth, while the reported data is ~20-30Kpps.
>>>>
>>>> /P
>>>>
>>>
>>> The ~20-30Kpps occur when thousands of iptables rules are applied and
>>> a UDP userspace application is sending.
>>>
>>> And there is a 20% pktgen regression (no iptables rules applied).
>>>
>>
>> The pktgen test is a little dubious/weird and Jonas had to modify pktgen
>> to test this. John Fastabend added a config to pktgen that allows us
>> to benchmark the egress qdisc path; it might be better to use that.
>> The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh script demonstrates the usage.
>>
>> If redoing the tests, can you adjust limit_min to see the effect?
>> /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
>>
>> A 20% throughput regression is of course too much, but I will
>> remind us that adding a qdisc will "cost" some overhead; that is a
>> configuration choice. Our purpose here is to reduce bufferbloat and
>> latency, not optimize for throughput.
>>
>>
>>> I am pretty sure the reason is that the BQL limit is stuck at 2
>>> packets (because the completed queue is always called with 1 packet
>>> and not from an interrupt/timer with multiple packets...).
>>>
>>
>> I've run a lot of experiments, which I had AI write a report on; see attachment. The TL;DR is that the best performance-vs-latency tradeoff is defaulting BQL/DQL limit_min to 8 packets.
>>
>> I fear this patchset will stall forever if we keep searching for a perfect solution without any overhead. The qdisc layer will be a baseline overhead. The limit of 2 packets is actually the optimal dark-buffer queue size, but I acknowledge that this causes too many qdisc requeue events (leading to overhead). I suggest adding another patch in V6 that defaults limit_min to 8 (a separate patch to make it easier to revert/adjust later).
>>
>> I've talked with Jonas, and we want to experiment with different solutions to make BQL/DQL work better with virtual devices.
>>
>> This patchset helps our (production) use-case reduce mice-flow latency
>> from approx 22ms to 1.3ms for latency under load. Due to the consumer
>> namespace being the bottleneck the requeue overhead is negligible in
>> comparison.
>>
>> -Jesper
>
> First of all, thanks for your work; I really see the advantages of
> avoiding bufferbloat :)
>
> But the key part of the BQL algorithm, the *dynamic* adaptation of the
> limit, is not working. Always calling netdev_completed_queue() with
> 1 packet results in a static limit of 2 packets (as seen in Jonas's
> measurements), which you force up to 8 packets.
>
> So in the end this patchset has the same effect as just setting
> VETH_RING_SIZE to 8 (and giving an option to change this value).
>
I've coded up a time-based BQL implementation; see attachment.
WDYT?
--Jesper
[-- Attachment #2: 09-veth-time-based-bql-coalescing.patch --]
[-- Type: text/x-patch, Size: 5549 bytes --]
veth: time-based BQL completion coalescing via ethtool tx-usecs
From: Jesper Dangaard Brouer <hawk@kernel.org>
Bufferbloat is fundamentally a latency problem -- what matters is the
time packets spend waiting in queues, as perceived by users and
applications. Base BQL completion coalescing on elapsed time rather
than packet counts to directly control queuing delay.
Add ethtool tx-usecs support to veth for tuning BQL completion
coalescing. Instead of completing BQL per-packet (which forces DQL to
limit=2 with high NAPI scheduling overhead) or per-NAPI-poll (which
over-buffers at budget=64), accumulate completions and flush them when
a configurable time threshold is exceeded. This lets DQL discover a
limit that bounds the actual queuing delay to the configured interval.
The default of 10 usecs (VETH_BQL_COAL_TX_USECS) provides a good
balance: DQL converges to a small limit that keeps queuing delay
bounded while allowing efficient NAPI batch processing. Setting
tx-usecs to 0 disables coalescing and falls back to per-packet
completion (limit=2, lowest latency, highest NAPI overhead).
Usage:
ethtool -C <veth-dev> tx-usecs 500 # 500us coalescing
ethtool -C <veth-dev> tx-usecs 0 # per-packet (no coalescing)
Uses local_clock() (rdtsc on x86, ~20ns) for sub-jiffy resolution.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 57 insertions(+), 2 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 4103d298aa9b..6035f7ec92b4 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -43,6 +43,7 @@
#define VETH_XDP_TX_BULK_SIZE 16
#define VETH_XDP_BATCH 16
+#define VETH_BQL_COAL_TX_USECS 10 /* default tx-usecs for BQL batching */
struct veth_stats {
u64 rx_drops;
@@ -81,6 +82,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct veth_rq *rq;
unsigned int requested_headroom;
+ unsigned int tx_coal_usecs; /* BQL completion coalescing */
};
struct veth_xdp_tx_bq {
@@ -265,7 +267,30 @@ static void veth_get_channels(struct net_device *dev,
static int veth_set_channels(struct net_device *dev,
struct ethtool_channels *ch);
+static int veth_get_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+
+ ec->tx_coalesce_usecs = priv->tx_coal_usecs;
+ return 0;
+}
+
+static int veth_set_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+
+ priv->tx_coal_usecs = ec->tx_coalesce_usecs;
+ return 0;
+}
+
static const struct ethtool_ops veth_ethtool_ops = {
+ .supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
.get_drvinfo = veth_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_strings = veth_get_strings,
@@ -275,6 +300,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_channels = veth_get_channels,
.set_channels = veth_set_channels,
+ .get_coalesce = veth_get_coalesce,
+ .set_coalesce = veth_set_coalesce,
};
/* general routines */
@@ -942,8 +969,14 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_stats *stats,
struct netdev_queue *peer_txq)
{
+ u64 bql_flush_ns, bql_flush_time = 0;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
+ struct veth_priv *priv;
+ int n_bql = 0;
+
+ priv = netdev_priv(rq->dev);
+ bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -972,8 +1005,23 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct sk_buff *skb = veth_ptr_to_skb(ptr);
stats->xdp_bytes += skb->len;
- if (peer_txq && bql_charged)
- netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+ if (peer_txq && bql_charged) {
+ if (!bql_flush_ns) {
+ netdev_tx_completed_queue(peer_txq, 1,
+ VETH_BQL_UNIT);
+ } else {
+ u64 now = local_clock();
+ n_bql++;
+ if (!bql_flush_time) {
+ bql_flush_time = now;
+ } else if (now - bql_flush_time >= bql_flush_ns) {
+ netdev_tx_completed_queue(peer_txq, n_bql,
+ n_bql * VETH_BQL_UNIT);
+ n_bql = 0;
+ bql_flush_time = 0;
+ }
+ }
+ }
skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
if (skb) {
@@ -989,6 +1037,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
if (n_xdpf)
veth_xdp_rcv_bulk_skb(rq, xdpf, n_xdpf, bq, stats);
+ if (peer_txq && n_bql)
+ netdev_tx_completed_queue(peer_txq, n_bql, n_bql * VETH_BQL_UNIT);
+
u64_stats_update_begin(&rq->stats.syncp);
rq->stats.vs.xdp_redirect += stats->xdp_redirect;
rq->stats.vs.xdp_bytes += stats->xdp_bytes;
@@ -1813,6 +1864,8 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
static void veth_setup(struct net_device *dev)
{
+ struct veth_priv *priv = netdev_priv(dev);
+
ether_setup(dev);
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
@@ -1838,6 +1891,8 @@ static void veth_setup(struct net_device *dev)
dev->max_mtu = ETH_MAX_MTU;
dev->watchdog_timeo = msecs_to_jiffies(16000);
+ priv->tx_coal_usecs = VETH_BQL_COAL_TX_USECS;
+
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-07 20:45 ` Jesper Dangaard Brouer
@ 2026-05-08 8:01 ` Simon Schippers
2026-05-08 9:20 ` Simon Schippers
0 siblings, 1 reply; 22+ messages in thread
From: Simon Schippers @ 2026-05-08 8:01 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Paolo Abeni, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
On 5/7/26 22:45, Jesper Dangaard Brouer wrote:
>
>
> On 07/05/2026 22.12, Simon Schippers wrote:
>> On 5/7/26 21:09, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 07/05/2026 16.46, Simon Schippers wrote:
>>>>
>>>>
>>>> On 5/7/26 16:34, Paolo Abeni wrote:
>>>>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>>>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>>>> }
>>>>>>> } else {
>>>>>>> /* ndo_start_xmit */
>>>>>>> - struct sk_buff *skb = ptr;
>>>>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>>>> stats->xdp_bytes += skb->len;
>>>>>>> + if (peer_txq && bql_charged)
>>>>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>>>>
>>>>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>>>>> this doesn’t work.
>>>>>>
>>>
>>> I've experimented with doing the "completion" at NAPI-end in
>>> veth_poll(), but that resulted in the BQL limit being 128 packets, which
>>> leads to bad latency results (not acceptable).
>>> (See detailed report later)
>>>
>>>
>>>>>> I still think first that adding an option to modify the hard-coded
>>>>>> VETH_RING_SIZE is the way to go.
>>>>>>
>>>
>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>> the solution here.
>>>
>>> The simple solution is to configure BQL limit_min:
>>> `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
>>>
>>> My experiments (below) find that limit_min=8 gives good performance.
>>> We can simply set the default to 8, as this still allows userspace to change
>>> it later if lower latency is preferred.
>>>
>>>>>> Thanks!
>>>>>>
>>>>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>>>>
>>>>> In the above discussion a 20% regression is reported, which IMHO can't
>>>>> be ignored. Still the tput figures in the data are extremely low,
>>>>> something is possibly off?!? I would expect a few Mpps with pktgen on
>>>>> top of veth, while the reported data is ~20-30Kpps.
>>>>>
>>>>> /P
>>>>>
>>>>
>>>> The ~20-30Kpps occur when thousands of iptables rules are applied and
>>>> a UDP userspace application is sending.
>>>>
>>>> And there is a 20% pktgen regression (no iptables rules applied).
>>>>
>>>
>>> The pktgen test is a little dubious/weird and Jonas had to modify pktgen
>>> to test this. John Fastabend added a config to pktgen that allows us
>>> to benchmark the egress qdisc path; it might be better to use that.
>>> The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh script demonstrates the usage.
>>>
>>> If redoing the tests, can you adjust limit_min to see the effect?
>>> /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
>>>
>>> A 20% throughput regression is of course too much, but I will
>>> remind us that adding a qdisc will "cost" some overhead; that is a
>>> configuration choice. Our purpose here is to reduce bufferbloat and
>>> latency, not optimize for throughput.
>>>
>>>
>>>> I am pretty sure the reason is that the BQL limit is stuck at 2
>>>> packets (because the completed queue is always called with 1 packet
>>>> and not from an interrupt/timer with multiple packets...).
>>>>
>>>
>>> I've run a lot of experiments, which I had AI write a report on; see attachment. The TL;DR is that the best performance-vs-latency tradeoff is defaulting BQL/DQL limit_min to 8 packets.
>>>
>>> I fear this patchset will stall forever if we keep searching for a perfect solution without any overhead. The qdisc layer will be a baseline overhead. The limit of 2 packets is actually the optimal dark-buffer queue size, but I acknowledge that this causes too many qdisc requeue events (leading to overhead). I suggest adding another patch in V6 that defaults limit_min to 8 (a separate patch to make it easier to revert/adjust later).
>>>
>>> I've talked with Jonas, and we want to experiment with different solutions to make BQL/DQL work better with virtual devices.
>>>
>>> This patchset helps our (production) use-case reduce mice-flow latency
>>> from approx 22ms to 1.3ms for latency under load. Due to the consumer
>>> namespace being the bottleneck the requeue overhead is negligible in
>>> comparison.
>>>
>>> -Jesper
>>
>> First of all, thanks for your work; I really see the advantages of
>> avoiding bufferbloat :)
>>
>> But the key part of the BQL algorithm, the *dynamic* adaptation of the
>> limit, is not working. Always calling netdev_completed_queue() with
>> 1 packet results in a static limit of 2 packets (as seen in Jonas's
>> measurements), which you force up to 8 packets.
>>
>> So in the end this patchset has the same effect as just setting
>> VETH_RING_SIZE to 8 (and giving an option to change this value).
>>
>
> I've coded up a time-based BQL implementation; see attachment.
> WDYT?
>
> --Jesper
>
A step in the right direction, but I dislike that you call
netdev_sent_queue() with at least 1 packet (never 0 packets).
I am not sure if it works, and I am not sure about the parameter.
I would propose doing it like other BQL implementations do
(for example usbnet for which I adapted BQL [1] :) ):
Call netdev_sent_queue() with n_bql in a periodic work. n_bql would
still be counted in veth_xdp_rcv() like you currently do (synchronized
with the work via ring.consumer_lock?).
The only weird thing that remains is that BQL's inflight != number of
packets in the ring and BQL's limit != "current ring size". Instead
the BQL limit describes the maximal number of packets allowed between
calls of netdev_sent_queue(), which occur periodically at a somewhat
fixed time interval.
I guess that could be fine, but it surely needs testing.
[1] Link: https://lore.kernel.org/netdev/20251106175615.26948-1-simon.schippers@tu-dortmund.de/
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-08 8:01 ` Simon Schippers
@ 2026-05-08 9:20 ` Simon Schippers
0 siblings, 0 replies; 22+ messages in thread
From: Simon Schippers @ 2026-05-08 9:20 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Paolo Abeni, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
On 5/8/26 10:01, Simon Schippers wrote:
> On 5/7/26 22:45, Jesper Dangaard Brouer wrote:
>>
>>
>> On 07/05/2026 22.12, Simon Schippers wrote:
>>> On 5/7/26 21:09, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 07/05/2026 16.46, Simon Schippers wrote:
>>>>>
>>>>>
>>>>> On 5/7/26 16:34, Paolo Abeni wrote:
>>>>>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>>>>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>>>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>>>>> }
>>>>>>>> } else {
>>>>>>>> /* ndo_start_xmit */
>>>>>>>> - struct sk_buff *skb = ptr;
>>>>>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>>>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>>>>> stats->xdp_bytes += skb->len;
>>>>>>>> + if (peer_txq && bql_charged)
>>>>>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>>>>>
>>>>>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>>>>>> this doesn’t work.
>>>>>>>
>>>>
>>>> I've experimented with doing the "completion" at NAPI-end in
>>>> veth_poll(), but that resulted in the BQL limit being 128 packets, which
>>>> leads to bad latency results (not acceptable).
>>>> (See detailed report later)
>>>>
>>>>
>>>>>>> I still think first that adding an option to modify the hard-coded
>>>>>>> VETH_RING_SIZE is the way to go.
>>>>>>>
>>>>
>>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>>> the solution here.
>>>>
>>>> The simple solution is to configure BQL limit_min:
>>>> `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
>>>>
>>>> My experiments (below) find that limit_min=8 gives good performance.
>>>> We can simply set the default to 8, as this still allows userspace to change
>>>> it later if lower latency is preferred.
>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>>>>>
>>>>>> In the above discussion a 20% regression is reported, which IMHO can't
>>>>>> be ignored. Still the tput figures in the data are extremely low,
>>>>>> something is possibly off?!? I would expect a few Mpps with pktgen on
>>>>>> top of veth, while the reported data is ~20-30Kpps.
>>>>>>
>>>>>> /P
>>>>>>
>>>>>
>>>>> The ~20-30Kpps occur when thousands of iptables rules are applied and
>>>>> a UDP userspace application is sending.
>>>>>
>>>>> And there is a 20% pktgen regression (no iptables rules applied).
>>>>>
>>>>
>>>> The pktgen test is a little dubious/weird and Jonas had to modify pktgen
>>>> to test this. John Fastabend added a config to pktgen that allows us
>>>> to benchmark the egress qdisc path; it might be better to use that.
>>>> The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh script demonstrates the usage.
>>>>
>>>> If redoing the tests, can you adjust limit_min to see the effect?
>>>> /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
>>>>
>>>> A 20% throughput regression is of course too much, but I will
>>>> remind us that adding a qdisc will "cost" some overhead; that is a
>>>> configuration choice. Our purpose here is to reduce bufferbloat and
>>>> latency, not optimize for throughput.
>>>>
>>>>
>>>>> I am pretty sure the reason is that the BQL limit is stuck at 2
>>>>> packets (because the completed queue is always called with 1 packet
>>>>> and not from an interrupt/timer with multiple packets...).
>>>>>
>>>>
>>>> I've run a lot of experiments, which I had AI write a report on; see attachment. The TL;DR is that the best performance-vs-latency tradeoff is defaulting BQL/DQL limit_min to 8 packets.
>>>>
>>>> I fear this patchset will stall forever if we keep searching for a perfect solution without any overhead. The qdisc layer will be a baseline overhead. The limit of 2 packets is actually the optimal dark-buffer queue size, but I acknowledge that this causes too many qdisc requeue events (leading to overhead). I suggest adding another patch in V6 that defaults limit_min to 8 (a separate patch to make it easier to revert/adjust later).
>>>>
>>>> I've talked with Jonas, and we want to experiment with different solutions to make BQL/DQL work better with virtual devices.
>>>>
>>>> This patchset helps our (production) use-case reduce mice-flow latency
>>>> from approx 22ms to 1.3ms for latency under load. Due to the consumer
>>>> namespace being the bottleneck the requeue overhead is negligible in
>>>> comparison.
>>>>
>>>> -Jesper
>>>
>>> First of all, thanks for your work; I really see the advantages of
>>> avoiding bufferbloat :)
>>>
>>> But the key part of the BQL algorithm, the *dynamic* adaptation of the
>>> limit, is not working. Always calling netdev_completed_queue() with
>>> 1 packet results in a static limit of 2 packets (as seen in Jonas's
>>> measurements), which you force up to 8 packets.
>>>
>>> So in the end this patchset has the same effect as just setting
>>> VETH_RING_SIZE to 8 (and giving an option to change this value).
>>>
>>
>> I've coded up a time-based BQL implementation; see attachment.
>> WDYT?
>>
>> --Jesper
>>
>
> A step in the right direction, but I dislike that you call
> netdev_sent_queue() with at least 1 packet (never 0 packets).
> I am not sure if it works, and I am not sure about the parameter.
>
Rethinking it, this could be fine, but it really needs testing because:
the weird thing is that BQL's inflight != number of packets
in the ring and BQL's limit != "current ring size". Instead the BQL
limit describes the maximal number of packets allowed between
calls of netdev_sent_queue().
I messed up in my approach below. Forget it :P
>
> I would propose doing it like other BQL implementations do
> (for example usbnet for which I adapted BQL [1] :) ):
>
> Call netdev_sent_queue() with n_bql in a periodic work. n_bql would
> still be counted in veth_xdp_rcv() like you currently do (synchronized
> with the work via ring.consumer_lock?).
>
> The only weird thing that remains is that BQL's inflight != number of
> packets in the ring and BQL's limit != "current ring size". Instead
> the BQL limit describes the maximal number of packets allowed between
> calls of netdev_sent_queue(), which occur periodically at a somewhat
> fixed time interval.
> I guess that could be fine, but it surely needs testing.
>
> [1] Link: https://lore.kernel.org/netdev/20251106175615.26948-1-simon.schippers@tu-dortmund.de/
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-07 19:09 ` Jesper Dangaard Brouer
2026-05-07 20:12 ` Simon Schippers
@ 2026-05-09 2:06 ` Jakub Kicinski
2026-05-09 9:09 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2026-05-09 2:06 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Simon Schippers, Paolo Abeni, netdev, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Stanislav Fomichev, linux-kernel,
bpf
On Thu, 7 May 2026 21:09:09 +0200 Jesper Dangaard Brouer wrote:
> >>> I still think first that adding an option to modify the hard-coded
> >>> VETH_RING_SIZE is the way to go.
>
> Not against being able to modify VETH_RING_SIZE, but I don't think it is
> the solution here.
Was it evaluated, tho?
It's obviously super easy these days to have AI spew no end of complex
code. So it'd be great to have some solid, ideally production-like
data to back this all up.
VETH_RING_SIZE seems trivial, ethtool set ringparam
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-09 2:06 ` Jakub Kicinski
@ 2026-05-09 9:09 ` Jesper Dangaard Brouer
2026-05-10 15:56 ` Jakub Kicinski
0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2026-05-09 9:09 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Simon Schippers, Paolo Abeni, netdev, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Stanislav Fomichev, linux-kernel,
bpf
On 09/05/2026 04.06, Jakub Kicinski wrote:
> On Thu, 7 May 2026 21:09:09 +0200 Jesper Dangaard Brouer wrote:
>>>>> I still think first that adding an option to modify the hard-coded
>>>>> VETH_RING_SIZE is the way to go.
>>
>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>> the solution here.
>
> Was it evaluated, tho?
>
> It's obviously super easy these days to have AI spew no end of complex
> code. So it'd be great to have some solid, ideally production-like
> data to back this all up.
>
> VETH_RING_SIZE seems trivial, ethtool set ringparam
No, unfortunately we cannot just decrease the VETH_RING_SIZE. The
reason is that XDP-redirect into veth doesn't have any back-pressure and
would simply drop packets if queue size becomes less than the NAPI
budget (64). (Yes, we use both normal path and XDP-redirect in
production). My benchmarking shows that an optimal BQL limit is
dynamically adjusted between 17-55 depending on veth consumer namespace
overhead/speed, when balancing throughput and latency.
--Jesper
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-09 9:09 ` Jesper Dangaard Brouer
@ 2026-05-10 15:56 ` Jakub Kicinski
2026-05-11 8:11 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2026-05-10 15:56 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Simon Schippers, Paolo Abeni, netdev, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Stanislav Fomichev, linux-kernel,
bpf
On Sat, 9 May 2026 11:09:51 +0200 Jesper Dangaard Brouer wrote:
> On 09/05/2026 04.06, Jakub Kicinski wrote:
> > On Thu, 7 May 2026 21:09:09 +0200 Jesper Dangaard Brouer wrote:
> >> Not against being able to modify VETH_RING_SIZE, but I don't think it is
> >> the solution here.
> >
> > Was it evaluated, tho?
> >
> > It's obviously super easy these days to have AI spew no end of complex
> > code. So it'd be great to have some solid, ideally production-like
> > data to back this all up.
> >
> > VETH_RING_SIZE seems trivial, ethtool set ringparam
>
> No, unfortunately we cannot just decrease the VETH_RING_SIZE.
To be clear - I said make it configurable with ethtool -G,
not change the default.
> The reason is that XDP-redirect into veth doesn't have any
> back-pressure and would simply drop packets if queue size becomes
> less than the NAPI budget (64). (Yes, we use both normal path and
> XDP-redirect in production).
Doesn't this mean you have a queue which is not under BQL control?
> My benchmarking shows that an optimal BQL limit is dynamically
> adjusted between 17-55 depending on veth consumer namespace
> overhead/speed, when balancing throughput and latency.
Testing with a prod-approximating traffic pattern and load would be great.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-10 15:56 ` Jakub Kicinski
@ 2026-05-11 8:11 ` Jesper Dangaard Brouer
2026-05-11 9:55 ` Simon Schippers
0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2026-05-11 8:11 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Simon Schippers, Paolo Abeni, netdev, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Stanislav Fomichev, linux-kernel,
bpf
On 10/05/2026 17.56, Jakub Kicinski wrote:
> On Sat, 9 May 2026 11:09:51 +0200 Jesper Dangaard Brouer wrote:
>> On 09/05/2026 04.06, Jakub Kicinski wrote:
>>> On Thu, 7 May 2026 21:09:09 +0200 Jesper Dangaard Brouer wrote:
>>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>>> the solution here.
>>>
>>> Was it evaluated, tho?
>>>
>>> It's obviously super easy these days to have AI spew no end of complex
>>> code. So it'd be great to have some solid, ideally production-like
>>> data to back this all up.
>>>
>>> VETH_RING_SIZE seems trivial, ethtool set ringparam
>>
>> No, unfortunately we cannot just decrease the VETH_RING_SIZE.
>
> To be clear - I said make it configurable with ethtool -G,
> not change the default.
>
Sure, I understand the desire to make VETH_RING_SIZE configurable.
In doing so, we make the Linux network stack harder to tune and set up
correctly. E.g. adding a qdisc to veth would also require changing the
ring size, but if the system also uses XDP then tuning below 64 (likely 128)
will lead to hard-to-find packet drops.
I prefer adding something (like BQL) that auto-tunes how much of the ring
queue we are using. Good queues function as shock absorbers when
concurrent processes in the OS have scheduling noise.
I acknowledge that Simon Schippers found that the BQL implementation was
actually not auto-tuning. We need to work on this, my prototype
implementation [1] [2] works surprisingly well.
- [1]
https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/2-09-veth-time-based-bql-coalescing.patch
- [2]
https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/
>> The reason is that XDP-redirect into veth doesn't have any
>> back-pressure and would simply drop packets if queue size becomes
>> less than the NAPI budget (64). (Yes, we use both normal path and
>> XDP-redirect in production).
>
> Doesn't this mean you have a queue which is not under BQL control?
>
It is a matter of perspective. BQL needs between 17-55 elements of the
256-entry ring. At the same time we handle the ring running full, e.g. due
to a sudden burst of XDP-redirected packets, which pushes packets into
the qdisc layer.
>> My benchmarking shows that an optimal BQL limit is dynamically
>> adjusted between 17-55 depending on veth consumer namespace
>> overhead/speed, when balancing throughput and latency.
>
> Testing with a prod-approximating traffic pattern and load would be great.
That is what I'm doing. I'm testing with a prod-approximating traffic
pattern and changing the number of iptables rules to simulate the
overhead I measured from production. I think I explained this in the
cover letter. We are going to use this in a production environment (to
be clear).
Simon found an issue testing the overload scenario.
--Jesper
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-11 8:11 ` Jesper Dangaard Brouer
@ 2026-05-11 9:55 ` Simon Schippers
2026-05-11 18:08 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 22+ messages in thread
From: Simon Schippers @ 2026-05-11 9:55 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Jakub Kicinski
Cc: Paolo Abeni, netdev, kernel-team, Andrew Lunn, David S. Miller,
Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, linux-kernel, bpf
On 5/11/26 10:11, Jesper Dangaard Brouer wrote:
>
>
> On 10/05/2026 17.56, Jakub Kicinski wrote:
>> On Sat, 9 May 2026 11:09:51 +0200 Jesper Dangaard Brouer wrote:
>>> On 09/05/2026 04.06, Jakub Kicinski wrote:
>>>> On Thu, 7 May 2026 21:09:09 +0200 Jesper Dangaard Brouer wrote:
>>>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>>>> the solution here.
>>>>
>>>> Was it evaluated, tho?
>>>>
>>>> It's obviously super easy these days to have AI spew no end of complex
>>>> code. So it'd be great to have some solid, ideally production-like
>>>> data to back this all up.
>>>>
>>>> VETH_RING_SIZE seems trivial, ethtool set ringparam
>>>
>>> No, unfortunately we cannot just decrease the VETH_RING_SIZE.
>>
>> To be clear - I said make it configurable with ethtool -G,
>> not change the default.
>>
>
> Sure, I understand the desire to make VETH_RING_SIZE configurable.
> In doing so, we make the Linux network stack harder to tune and set up
> correctly. E.g. adding a qdisc to veth would also require changing the
> ring size, but if the system also uses XDP then tuning below 64 (likely 128)
> will lead to hard-to-find packet drops.
I mean, 64 could still be at least a 4x improvement.
>
> I prefer adding something (like BQL) that auto-tunes how much of the ring
> queue we are using. Good queues function as shock absorbers when
> concurrent processes in the OS have scheduling noise.
>
> I acknowledge that Simon Schippers found that the BQL implementation was
> actually not auto-tuning. We need to work on this, my prototype
> implementation [1] [2] works surprisingly well.
>
>
> - [1] https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/2-09-veth-time-based-bql-coalescing.patch
> - [2] https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/
>
>
>>> The reason is that XDP-redirect into veth doesn't have any
>>> back-pressure and would simply drop packets if queue size becomes
>>> less than the NAPI budget (64). (Yes, we use both normal path and
>>> XDP-redirect in production).
>>
>> Doesn't this mean you have a queue which is not under BQL control?
>>
>
> It is a matter of perspective. BQL needs between 17-55 elements of the
> 256-entry ring. At the same time we handle the ring running full, e.g. due
> to a sudden burst of XDP-redirected packets, which pushes packets into
> the qdisc layer.
You are checking inflight/limit in the /sys directory to get the 17-55
number, right?
I think those elements are not really in the queue.
As written before:
The weird thing in this implementation is that BQL's inflight
!= number of packets in the ring and BQL's limit != "current ring size".
Instead the BQL limit describes the maximal number of packets allowed
between calls of netdev_sent_queue().
And in our case we do not complete (i.e. forward) the packets when
netdev_sent_queue() is called; they are forwarded immediately, and
therefore they are no longer in the queue by the time
netdev_sent_queue() is called.
Also, that means the number strongly depends on the
VETH_BQL_COAL_TX_USECS parameter.
For a fixed PPS the limit should be approx.:
Limit = VETH_BQL_COAL_TX_USECS * PPS
Assuming the default 10us coal and a fixed 1 MPPS:
Limit = 0.00001s * 1_000_000 = 10 packets
Can you follow my theory?
Judging from that, I personally think VETH_BQL_COAL_TX_USECS needs to
be bigger, more like 100us/1ms. With 10us the BQL limit gets adjusted
very often, I think.
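Plugging those values into the same formula (still assuming 1 MPPS):
Limit = 0.0001s * 1_000_000 = 100 packets (100us)
Limit = 0.001s * 1_000_000 = 1000 packets (1ms)
So the limit would scale directly with the coalescing interval.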
Thanks.
>
>
>>> My benchmarking shows that an optimal BQL limit is dynamically
>>> adjusted between 17-55 depending on veth consumer namespace
>>> overhead/speed, when balancing throughput and latency.
>>
>> Testing with a prod-approximating traffic pattern and load would be great.
>
> That is what I'm doing. I'm testing with a prod-approximating traffic
> pattern and changing the number of iptables rules to simulate the
> overhead I measured from production. I think I explained this in the
> cover letter. We are going to use this in a production environment (to
> be clear).
>
> Simon found an issue testing the overload scenario.
>
> --Jesper
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-11 9:55 ` Simon Schippers
@ 2026-05-11 18:08 ` Jesper Dangaard Brouer
2026-05-11 20:37 ` Simon Schippers
0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2026-05-11 18:08 UTC (permalink / raw)
To: Simon Schippers, Jakub Kicinski
Cc: Paolo Abeni, netdev, kernel-team, Andrew Lunn, David S. Miller,
Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, linux-kernel, bpf
On 11/05/2026 11.55, Simon Schippers wrote:
> On 5/11/26 10:11, Jesper Dangaard Brouer wrote:
>>
>>
>> On 10/05/2026 17.56, Jakub Kicinski wrote:
>>> On Sat, 9 May 2026 11:09:51 +0200 Jesper Dangaard Brouer wrote:
>>>> On 09/05/2026 04.06, Jakub Kicinski wrote:
>>>>> On Thu, 7 May 2026 21:09:09 +0200 Jesper Dangaard Brouer wrote:
>>>>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>>>>> the solution here.
>>>>>
>>>>> Was it evaluated, tho?
>>>>>
>>>>> It's obviously super easy these days to have AI spew no end of complex
>>>>> code. So it'd be great to have some solid, ideally production-like
>>>>> data to back this all up.
>>>>>
>>>>> VETH_RING_SIZE seems trivial, ethtool set ringparam
>>>>
>>>> No, unfortunately we cannot just decrease the VETH_RING_SIZE.
>>>
>>> To be clear - I said make it configurable with ethtool -G,
>>> not change the default.
>>>
>>
>> Sure, I understand the desire to make VETH_RING_SIZE configurable.
>> In doing so, we make the Linux network stack harder to tune and set up
>> correctly. E.g. adding a qdisc to veth would also require changing the
>> ring size, but if the system also uses XDP then tuning below 64 (likely 128)
>> will lead to hard-to-find packet drops.
>
> I mean, 64 could still be at least a 4x improvement.
>
No, not really; setting it to 64 will give the same (bad) latency as "BQL
off", which this patchset is trying to address.
>>
>> I prefer adding something (like BQL) that auto-tunes how much of the ring
>> queue we are using. Good queues function as shock absorbers when
>> concurrent processes in the OS have scheduling noise.
>>
>> I acknowledge that Simon Schippers found that the BQL implementation was
>> actually not auto-tuning. We need to work on this, my prototype
>> implementation [1] [2] works surprisingly well.
>>
>>
>> - [1] https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/2-09-veth-time-based-bql-coalescing.patch
>> - [2] https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/
>>
>>
>>>> The reason is that XDP-redirect into veth doesn't have any
>>>> back-pressure and would simply drop packets if queue size becomes
>>>> less than the NAPI budget (64). (Yes, we use both normal path and
>>>> XDP-redirect in production).
>>>
>>> Doesn't this mean you have a queue which is not under BQL control?
>>>
>>
>> It is a matter of perspective. BQL needs between 17-55 elements of the
>> 256-entry ring. At the same time we handle the ring running full, e.g. due
>> to a sudden burst of XDP-redirected packets, which pushes packets into
>> the qdisc layer.
>
> You are checking inflight/limit in the /sys directory to get the 17-55
> number, right?
>
Nope, I'm using a bpftrace program to keep track of the inflight/limit
in a BPF hashmap. Reading from /sys will not be accurate.
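Roughly this kind of one-liner, as a sketch (not the exact script; it
assumes kernel BTF is available for the struct dql cast):
  bpftrace -e 'kprobe:dql_completed {
      $dql = (struct dql *)arg0;
      @limit = hist($dql->limit);
      @inflight = hist($dql->num_queued - $dql->num_completed);
  }'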
I moved the selftests into a github repo [1] to allow us to collaborate
and evaluate the changes more easily. I explicitly kept the new BPF
based BQL tracking as a commit[2] for your benefit.
[1]
https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests
[2]
https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977
Sorry for cutting the remainder of the message, but I ran out of time,
as things are a bit challenging/hectic here at Cloudflare at the moment.
--Jesper
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-11 18:08 ` Jesper Dangaard Brouer
@ 2026-05-11 20:37 ` Simon Schippers
0 siblings, 0 replies; 22+ messages in thread
From: Simon Schippers @ 2026-05-11 20:37 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Jakub Kicinski
Cc: Paolo Abeni, netdev, kernel-team, Andrew Lunn, David S. Miller,
Eric Dumazet, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, linux-kernel, bpf
On 5/11/26 20:08, Jesper Dangaard Brouer wrote:
>
>
> On 11/05/2026 11.55, Simon Schippers wrote:
>> On 5/11/26 10:11, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 10/05/2026 17.56, Jakub Kicinski wrote:
>>>> On Sat, 9 May 2026 11:09:51 +0200 Jesper Dangaard Brouer wrote:
>>>>> On 09/05/2026 04.06, Jakub Kicinski wrote:
>>>>>> On Thu, 7 May 2026 21:09:09 +0200 Jesper Dangaard Brouer wrote:
>>>>>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>>>>>> the solution here.
>>>>>>
>>>>>> Was it evaluated, tho?
>>>>>>
>>>>>> It's obviously super easy these days to have AI spew no end of complex
>>>>>> code. So it'd be great to have some solid, ideally production-like
>>>>>> data to back this all up.
>>>>>>
>>>>>> VETH_RING_SIZE seems trivial, ethtool set ringparam
>>>>>
>>>>> No, unfortunately we cannot just decrease the VETH_RING_SIZE.
>>>>
>>>> To be clear - I said make it configurable with ethtool -G,
>>>> not change the default.
>>>>
>>>
>>> Sure, I understand the desire to make VETH_RING_SIZE configurable.
>>> In doing so, we make the Linux network stack harder to tune and set up
>>> correctly. E.g. adding a qdisc to veth would also require changing the
>>> ring size, but if the system also uses XDP then tuning below 64 (likely 128)
>>> will lead to hard-to-find packet drops.
>>
>> I mean, 64 could still be at least a 4x improvement.
>>
>
> No, not really; setting it to 64 will give the same (bad) latency as "BQL
> off", which this patchset is trying to address.
>
>>>
>>> I prefer adding something (like BQL) that auto-tunes how much of the ring
>>> queue we are using. Good queues function as shock absorbers when
>>> concurrent processes in the OS have scheduling noise.
>>>
>>> I acknowledge that Simon Schippers found that the BQL implementation was
>>> actually not auto-tuning. We need to work on this, my prototype
>>> implementation [1] [2] works surprisingly well.
>>>
>>>
>>> - [1] https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/2-09-veth-time-based-bql-coalescing.patch
>>> - [2] https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/
>>>
>>>
>>>>> The reason is that XDP-redirect into veth doesn't have any
>>>>> back-pressure and would simply drop packets if queue size becomes
>>>>> less than the NAPI budget (64). (Yes, we use both normal path and
>>>>> XDP-redirect in production).
>>>>
>>>> Doesn't this mean you have a queue which is not under BQL control?
>>>>
>>>
>>> It is a matter of perspective. BQL needs between 17-55 elements of the
>>> 256-entry ring. At the same time we handle the ring running full, e.g. due
>>> to a sudden burst of XDP-redirected packets, which pushes packets into
>>> the qdisc layer.
>>
>> You are checking inflight/limit in the /sys directory to get the 17-55
>> number, right?
>>
>
> Nope, I'm using a bpftrace program to keep track of the inflight/limit
> in a BPF hashmap. Reading from /sys will not be accurate.
Ah nice.
>
> I moved the selftests into a github repo [1] to allow us to collaborate
> and evaluate the changes more easily. I explicitly kept the new BPF
> based BQL tracking as a commit[2] for your benefit.
>
> [1] https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests
>
> [2] https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977
Thanks for sharing. After some minor issues I was able to set it up
(currently I am just using plain v5, will look at the coalescing patch
when I find the time):
Can confirm the latency reduction with the default settings, in my case
4.888ms to 0.241ms.
With the same script I was also able to see a performance slowdown:
veth_bql_test_virtme.sh --qdisc fq_codel --nrules 0
--> ~510 Kpps
Same with --bql-disable
--> ~570 Kpps
--> 12% faster
>
> Sorry for cutting the remainder of the message, but I ran out of time,
> as things are a bit challenging/hectic here at Cloudflare at the moment.
>
> --Jesper
All good, just ignore it. I think I misunderstood something anyway.
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2026-05-11 20:43 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20260505132159.241305-1-hawk@kernel.org>
2026-05-05 13:21 ` [PATCH net-next v5 1/5] veth: fix OOB txq access in veth_poll() with asymmetric queue counts hawk
2026-05-07 14:25 ` Paolo Abeni
2026-05-05 13:21 ` [PATCH net-next v5 2/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
2026-05-05 13:21 ` [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
2026-05-07 6:54 ` Simon Schippers
2026-05-07 13:21 ` Paolo Abeni
2026-05-07 14:34 ` Paolo Abeni
2026-05-07 14:46 ` Simon Schippers
2026-05-07 19:09 ` Jesper Dangaard Brouer
2026-05-07 20:12 ` Simon Schippers
2026-05-07 20:45 ` Jesper Dangaard Brouer
2026-05-08 8:01 ` Simon Schippers
2026-05-08 9:20 ` Simon Schippers
2026-05-09 2:06 ` Jakub Kicinski
2026-05-09 9:09 ` Jesper Dangaard Brouer
2026-05-10 15:56 ` Jakub Kicinski
2026-05-11 8:11 ` Jesper Dangaard Brouer
2026-05-11 9:55 ` Simon Schippers
2026-05-11 18:08 ` Jesper Dangaard Brouer
2026-05-11 20:37 ` Simon Schippers
2026-05-05 13:21 ` [PATCH net-next v5 4/5] veth: add tx_timeout watchdog as BQL safety net hawk
2026-05-05 13:21 ` [PATCH net-next v5 5/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox