public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support
@ 2026-03-18 13:48 hawk
  2026-03-18 13:48 ` [RFC PATCH net-next 1/6] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: hawk @ 2026-03-18 13:48 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	toke, sdf, j.koeppeler, mfreemon, carges

From: Jesper Dangaard Brouer <hawk@kernel.org>

This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight bytes in the ptr_ring and
moving buffering into the qdisc where AQM algorithms (fq_codel, CAKE)
can act on it.

Problem:
  veth's 256-entry ptr_ring acts as a "dark buffer" -- packets queued
  there are invisible to the qdisc's AQM.  Under load, the ring fills
  completely (DRV_XOFF backpressure), adding up to 256 packets of
  unmanaged latency before the qdisc even sees congestion.

Solution:
  BQL (STACK_XOFF) dynamically limits in-flight bytes, stopping the
  queue before the ring fills.  This keeps the ring shallow and pushes
  excess packets into the qdisc, where sojourn-based AQM can measure
  and drop them.

  Test setup: veth pair, UDP flood, 13000 iptables rules in consumer
  namespace (slows NAPI-64 cycle to ~6-7ms), ping measures RTT under load.

                   BQL off                    BQL on
  fq_codel:  RTT ~22ms, 4% loss         RTT ~1.3ms, 0% loss
  sfq:       RTT ~24ms, 0% loss         RTT ~1.5ms, 0% loss

  BQL reduces ping RTT by ~17x for both qdiscs.  Consumer throughput
  is unchanged (~10K pps) -- BQL adds no measurable overhead.

CoDel bug discovered during BQL development:
  Our original motivation for BQL was fq_codel ping loss observed under
  load (4-26% depending on NAPI cycle time).  Investigating this led us
  to discover a bug in the CoDel implementation: codel_dequeue() does
  not reset vars->first_above_time when a flow goes empty, contrary to
  the reference algorithm.  This causes stale CoDel state to persist
  across empty periods in fq_codel's per-flow queues, penalizing sparse
  flows like ICMP ping.  A fix for this is included as patch 6/6.

  BQL remains valuable independently: it reduces RTT by ~17x by moving
  buffering from the dark ptr_ring into the qdisc.  Additionally, BQL
  clears STACK_XOFF per-SKB as each packet completes, rather than
  batch-waking after 64 packets (DRV_XOFF).  This keeps sojourn times
  below fq_codel's target, preventing CoDel from entering dropping
  state on non-congested flows in the first place.

Key design decisions:
  - Charge before produce: BQL charge is placed before ptr_ring_produce()
    because with threaded NAPI the consumer can complete on another CPU
    before veth_xmit() returns.

  - Per-SKB BQL tracking via pointer tag: A VETH_BQL_FLAG bit in the
    ptr_ring pointer records whether each SKB was BQL-charged.  This is
    necessary because the qdisc can be replaced live (noqueue->sfq or
    vice versa) while SKBs are in-flight -- the completion side must
    know the charge state that was decided at enqueue time.

  - IFF_NO_QUEUE + BQL coexistence: A new dev->bql flag enables BQL
    sysfs exposure for IFF_NO_QUEUE devices that opt in to DQL
    accounting, without changing IFF_NO_QUEUE semantics.

Background and acknowledgments:
  Mike Freemon reported the veth dark buffer problem internally at
  Cloudflare and showed that recompiling the kernel with a ptr_ring
  size of 30 (down from 256) made fq_codel work dramatically better.
  This was the primary motivation for a proper BQL solution that
  achieves the same effect dynamically without a kernel rebuild.

  Chris Arges wrote a reproducer for the dark buffer latency problem:
    https://github.com/netoptimizer/veth-backpressure-performance-testing
  This is where we first observed ping packets being dropped under
  fq_codel, which became our secondary motivation for BQL.  In
  production we switched to SFQ on veth devices as a workaround.

  Jonas Koeppeler provided extensive testing and code review.
  Together we discovered that the fq_codel ping loss was actually a
  12-year-old CoDel bug (stale first_above_time in empty flows), not
  caused by the dark buffer itself.  The fix is included as patch 6/6.

Jesper Dangaard Brouer (5):
  net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  veth: implement Byte Queue Limits (BQL) for latency reduction
  veth: add tx_timeout watchdog as BQL safety net
  net: sched: add timeout count to NETDEV WATCHDOG message
  selftests: net: add veth BQL stress test

Jonas Köppeler (1):
  net_sched: codel: fix stale state for empty flows in fq_codel

 .../networking/net_cachelines/net_device.rst  |   1 +
 drivers/net/veth.c                            |  88 ++-
 include/linux/netdevice.h                     |   1 +
 include/net/codel_impl.h                      |   1 +
 net/core/net-sysfs.c                          |   8 +-
 net/sched/sch_generic.c                       |   8 +-
 tools/testing/selftests/net/veth_bql_test.sh  | 626 ++++++++++++++++++
 .../selftests/net/veth_bql_test_virtme.sh     | 113 ++++
 8 files changed, 829 insertions(+), 17 deletions(-)
 create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
 create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh

-- 
2.43.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH net-next 1/6] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  2026-03-18 13:48 [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support hawk
@ 2026-03-18 13:48 ` hawk
  2026-03-18 13:48 ` [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: hawk @ 2026-03-18 13:48 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	toke, sdf, j.koeppeler, mfreemon, carges

From: Jesper Dangaard Brouer <hawk@kernel.org>

Virtual devices with IFF_NO_QUEUE or lltx are excluded from BQL sysfs
by netdev_uses_bql(), since they traditionally lack real hardware
queues. However, some virtual devices like veth implement a real
ptr_ring FIFO with NAPI processing and benefit from BQL to limit
in-flight bytes and reduce latency.

Add a per-device 'bql' bitfield boolean in the priv_flags_slow section
of struct net_device. When set, it overrides the IFF_NO_QUEUE/lltx
exclusion and exposes BQL sysfs entries (/sys/class/net/<dev>/queues/
tx-<n>/byte_queue_limits/). The flag is still gated on CONFIG_BQL.

This allows drivers that use BQL despite being IFF_NO_QUEUE to opt in
to sysfs visibility for monitoring and debugging.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 Documentation/networking/net_cachelines/net_device.rst | 1 +
 include/linux/netdevice.h                              | 1 +
 net/core/net-sysfs.c                                   | 8 +++++++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 1c19bb7705df..b775d3235a2d 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -170,6 +170,7 @@ unsigned_long:1                     see_all_hwtstamp_requests
 unsigned_long:1                     change_proto_down
 unsigned_long:1                     netns_immutable
 unsigned_long:1                     fcoe_mtu
+unsigned_long:1                     bql                                                                 netdev_uses_bql(net-sysfs.c)
 struct list_head                    net_notifier_list
 struct macsec_ops*                  macsec_ops
 struct udp_tunnel_nic_info*         udp_tunnel_nic_info
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 67e25f6d15a4..09e8c4dc16a0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2463,6 +2463,7 @@ struct net_device {
 	unsigned long		change_proto_down:1;
 	unsigned long		netns_immutable:1;
 	unsigned long		fcoe_mtu:1;
+	unsigned long		bql:1;
 
 	struct list_head	net_notifier_list;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 2ce011fae249..b57f76b7e578 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1945,10 +1945,16 @@ static const struct kobj_type netdev_queue_ktype = {
 
 static bool netdev_uses_bql(const struct net_device *dev)
 {
+	if (!IS_ENABLED(CONFIG_BQL))
+		return false;
+
+	if (dev->bql)
+		return true;
+
 	if (dev->lltx || (dev->priv_flags & IFF_NO_QUEUE))
 		return false;
 
-	return IS_ENABLED(CONFIG_BQL);
+	return true;
 }
 
 static int netdev_queue_add_kobject(struct net_device *dev, int index)
-- 
2.43.0



* [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-03-18 13:48 [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support hawk
  2026-03-18 13:48 ` [RFC PATCH net-next 1/6] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
@ 2026-03-18 13:48 ` hawk
  2026-03-18 14:28   ` Toke Høiland-Jørgensen
  2026-03-18 13:48 ` [RFC PATCH net-next 3/6] veth: add tx_timeout watchdog as BQL safety net hawk
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: hawk @ 2026-03-18 13:48 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	toke, sdf, j.koeppeler, mfreemon, carges

From: Jesper Dangaard Brouer <hawk@kernel.org>

Add BQL support to the veth driver to dynamically limit the number of
bytes queued in the ptr_ring, giving the qdisc earlier feedback to shape
traffic and reduce latency.

The BQL charge (netdev_tx_sent_queue) is placed in veth_xmit() BEFORE
veth_forward_skb() produces the SKB into the ptr_ring. This ordering is
critical: with threaded NAPI the consumer runs on a separate CPU and can
complete the SKB (calling dql_completed) before veth_xmit() returns. If
the charge happened after the produce, the completion could race ahead
of the charge, violating dql_completed()'s invariant that completed
bytes never exceed queued bytes (BUG_ON).

Whether an SKB was BQL-charged is tracked per-SKB using a VETH_BQL_FLAG
bit in the ptr_ring pointer (BIT(1), alongside the existing VETH_XDP_FLAG
BIT(0)). The do_bql flag from veth_xmit() propagates through
veth_forward_skb() and veth_xdp_rx() into the ptr_ring entry. On the
completion side in veth_xdp_rcv(), veth_ptr_is_bql() reads the flag to
decide whether to call netdev_tx_completed_queue(). Per-SKB tracking is
necessary because the qdisc can be replaced live (e.g. noqueue->sfq or
vice versa via 'tc qdisc replace') while SKBs are already in-flight in
the ptr_ring. SKBs charged under the old qdisc must complete correctly
regardless of what qdisc is attached when the consumer runs, so each
SKB carries its own BQL-charged state rather than re-checking the peer's
qdisc at completion time.

Completion is done per-SKB inside veth_xdp_rcv() as each packet is
consumed from the ptr_ring. Per-SKB granularity ensures STACK_XOFF is
cleared promptly, which is important for threaded NAPI where producer
and consumer run on separate CPUs.

The completion byte count uses skb->len + ETH_HLEN to match the charge
value. veth_xmit() captures length = skb->len before veth_forward_skb()
calls eth_type_trans(), which pulls the 14-byte Ethernet header. Without
adding ETH_HLEN back, the per-packet deficit prevents DQL from ever
seeing inprogress == 0 (queue empty), so the DQL limit never increases
from its initial value of 0 and BQL becomes silently ineffective.

Only SKBs from the ndo_start_xmit path are tracked. XDP frames entering
via ndo_xdp_xmit bypass veth_xmit() entirely and are never charged to
BQL, so they are correctly excluded from completion accounting.

BQL state is reset on the peer's txqs in veth_napi_del_range() after
ptr_ring_cleanup() frees remaining items without BQL completion. The
peer dereference is placed after synchronize_net() to ensure no
in-flight veth_poll() calls can race with the reset.

BQL's STACK_XOFF operates independently from the existing DRV_XOFF
backpressure mechanism (ptr_ring full). netif_tx_wake_queue() at the end
of veth_poll() clears only DRV_XOFF, while netdev_tx_completed_queue()
clears only STACK_XOFF. Both must be clear for the queue to transmit.
The existing NAPI scheduling guarantee (every successful veth_xmit()
calls __veth_xdp_flush(), plus the napi_complete_done() ring re-check)
ensures NAPI always runs when items are in the ring, preventing stuck
STACK_XOFF state.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reported-by: Mike Freemon <mfreemon@cloudflare.com>
Reported-by: Chris Arges <carges@cloudflare.com>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 drivers/net/veth.c | 70 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 58 insertions(+), 12 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index e35df717e65e..bad49e0d0bfc 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -34,6 +34,7 @@
 #define DRV_VERSION	"1.0"
 
 #define VETH_XDP_FLAG		BIT(0)
+#define VETH_BQL_FLAG		BIT(1)
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -280,6 +281,21 @@ static bool veth_is_xdp_frame(void *ptr)
 	return (unsigned long)ptr & VETH_XDP_FLAG;
 }
 
+static bool veth_ptr_is_bql(void *ptr)
+{
+	return (unsigned long)ptr & VETH_BQL_FLAG;
+}
+
+static struct sk_buff *veth_ptr_to_skb(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~VETH_BQL_FLAG);
+}
+
+static void *veth_skb_to_ptr(struct sk_buff *skb, bool bql)
+{
+	return bql ? (void *)((unsigned long)skb | VETH_BQL_FLAG) : skb;
+}
+
 static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
 {
 	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
@@ -295,7 +311,7 @@ static void veth_ptr_free(void *ptr)
 	if (veth_is_xdp_frame(ptr))
 		xdp_return_frame(veth_ptr_to_xdp(ptr));
 	else
-		kfree_skb(ptr);
+		kfree_skb(veth_ptr_to_skb(ptr));
 }
 
 static void __veth_xdp_flush(struct veth_rq *rq)
@@ -309,19 +325,20 @@ static void __veth_xdp_flush(struct veth_rq *rq)
 	}
 }
 
-static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb, bool do_bql)
 {
-	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
+	if (unlikely(ptr_ring_produce(&rq->xdp_ring,
+				     veth_skb_to_ptr(skb, do_bql))))
 		return NETDEV_TX_BUSY; /* signal qdisc layer */
 
 	return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
 }
 
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
-			    struct veth_rq *rq, bool xdp)
+			    struct veth_rq *rq, bool xdp, bool do_bql)
 {
 	return __dev_forward_skb(dev, skb) ?: xdp ?
-		veth_xdp_rx(rq, skb) :
+		veth_xdp_rx(rq, skb, do_bql) :
 		__netif_rx(skb);
 }
 
@@ -352,6 +369,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct net_device *rcv;
 	int length = skb->len;
 	bool use_napi = false;
+	bool do_bql = false;
 	int ret, rxq;
 
 	rcu_read_lock();
@@ -375,21 +393,27 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	skb_tx_timestamp(skb);
+	txq = netdev_get_tx_queue(dev, rxq);
 
-	ret = veth_forward_skb(rcv, skb, rq, use_napi);
+	/* BQL charge before forward: consumer may complete on another
+	 * CPU before we return.  BQL is only valid with a real qdisc.
+	 */
+	do_bql = use_napi && !qdisc_txq_has_no_queue(txq);
+	if (do_bql)
+		netdev_tx_sent_queue(txq, length);
+
+	ret = veth_forward_skb(rcv, skb, rq, use_napi, do_bql);
 	switch (ret) {
 	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
 		if (!use_napi)
 			dev_sw_netstats_tx_add(dev, 1, length);
 		else
 			__veth_xdp_flush(rq);
-		break;
+		goto out;
 	case NETDEV_TX_BUSY:
 		/* If a qdisc is attached to our virtual device, returning
 		 * NETDEV_TX_BUSY is allowed.
 		 */
-		txq = netdev_get_tx_queue(dev, rxq);
-
 		if (qdisc_txq_has_no_queue(txq)) {
 			dev_kfree_skb_any(skb);
 			goto drop;
@@ -412,6 +436,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		net_crit_ratelimited("%s(%s): Invalid return code(%d)",
 				     __func__, dev->name, ret);
 	}
+
+	/* Reverse BQL charge -- SKB didn't make it into the ptr_ring */
+	if (do_bql)
+		netdev_tx_completed_queue(txq, 1, length);
+out:
 	rcu_read_unlock();
 
 	return ret;
@@ -900,7 +929,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 
 static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
-			struct veth_stats *stats)
+			struct veth_stats *stats,
+			struct netdev_queue *peer_txq)
 {
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
@@ -928,9 +958,14 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			}
 		} else {
 			/* ndo_start_xmit */
-			struct sk_buff *skb = ptr;
+			bool bql_charged = veth_ptr_is_bql(ptr);
+			struct sk_buff *skb = veth_ptr_to_skb(ptr);
 
 			stats->xdp_bytes += skb->len;
+			if (peer_txq && bql_charged)
+				netdev_tx_completed_queue(peer_txq, 1,
+							 skb->len + ETH_HLEN);
+
 			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
 			if (skb) {
 				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
@@ -975,7 +1010,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
 	peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
 
 	xdp_set_return_frame_no_direct();
-	done = veth_xdp_rcv(rq, budget, &bq, &stats);
+	done = veth_xdp_rcv(rq, budget, &bq, &stats, peer_txq);
 
 	if (stats.xdp_redirect > 0)
 		xdp_do_flush();
@@ -1073,6 +1108,7 @@ static int __veth_napi_enable(struct net_device *dev)
 static void veth_napi_del_range(struct net_device *dev, int start, int end)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
 	int i;
 
 	for (i = start; i < end; i++) {
@@ -1091,6 +1127,15 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
 		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
 	}
 
+	/* Reset BQL on peer's txqs: remaining ring items were freed above
+	 * without BQL completion, so DQL state must be reset.
+	 */
+	peer = rtnl_dereference(priv->peer);
+	if (peer) {
+		for (i = start; i < end; i++)
+			netdev_tx_reset_queue(netdev_get_tx_queue(peer, i));
+	}
+
 	for (i = start; i < end; i++) {
 		page_pool_destroy(priv->rq[i].page_pool);
 		priv->rq[i].page_pool = NULL;
@@ -1740,6 +1785,7 @@ static void veth_setup(struct net_device *dev)
 	dev->priv_flags |= IFF_PHONY_HEADROOM;
 	dev->priv_flags |= IFF_DISABLE_NETPOLL;
 	dev->lltx = true;
+	dev->bql = true;
 
 	dev->netdev_ops = &veth_netdev_ops;
 	dev->xdp_metadata_ops = &veth_xdp_metadata_ops;
-- 
2.43.0



* [RFC PATCH net-next 3/6] veth: add tx_timeout watchdog as BQL safety net
  2026-03-18 13:48 [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support hawk
  2026-03-18 13:48 ` [RFC PATCH net-next 1/6] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
  2026-03-18 13:48 ` [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
@ 2026-03-18 13:48 ` hawk
  2026-03-18 13:48 ` [RFC PATCH net-next 4/6] net: sched: add timeout count to NETDEV WATCHDOG message hawk
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: hawk @ 2026-03-18 13:48 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	toke, sdf, j.koeppeler, mfreemon, carges

From: Jesper Dangaard Brouer <hawk@kernel.org>

With the introduction of BQL (Byte Queue Limits) for veth, there are
now two independent mechanisms that can stop a transmit queue:

 - DRV_XOFF: set by netif_tx_stop_queue() when the ptr_ring is full
 - STACK_XOFF: set by BQL when the byte-in-flight limit is reached

If either mechanism stalls without a corresponding wake/completion,
the queue stops permanently. Enable the net device watchdog timer and
implement ndo_tx_timeout as a failsafe recovery.

The timeout handler resets BQL state (clearing STACK_XOFF) and wakes
the queue (clearing DRV_XOFF), covering both stop mechanisms. The
watchdog fires after 16 seconds, which accommodates worst-case NAPI
processing (budget=64 packets x 250ms per-packet consumer delay)
without false positives under normal backpressure.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 drivers/net/veth.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index bad49e0d0bfc..2717f65fb352 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1423,6 +1423,22 @@ static int veth_set_channels(struct net_device *dev,
 	goto out;
 }
 
+static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
+
+	netdev_err(dev,
+		 "veth backpressure(0x%lX) stalled(n:%ld) TXQ(%u) re-enable\n",
+		   txq->state, atomic_long_read(&txq->trans_timeout), txqueue);
+
+	/* Clear both stop mechanisms:
+	 * - DRV_XOFF: set by netif_tx_stop_queue (ptr_ring backpressure)
+	 * - STACK_XOFF: set by BQL when byte limit is reached
+	 */
+	netdev_tx_reset_queue(txq);
+	netif_tx_wake_queue(txq);
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -1761,6 +1777,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
 	.ndo_get_peer_dev	= veth_peer_dev,
+	.ndo_tx_timeout		= veth_tx_timeout,
 };
 
 static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
@@ -1800,6 +1817,7 @@ static void veth_setup(struct net_device *dev)
 	dev->priv_destructor = veth_dev_free;
 	dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
 	dev->max_mtu = ETH_MAX_MTU;
+	dev->watchdog_timeo = msecs_to_jiffies(16000);
 
 	dev->hw_features = VETH_FEATURES;
 	dev->hw_enc_features = VETH_FEATURES;
-- 
2.43.0



* [RFC PATCH net-next 4/6] net: sched: add timeout count to NETDEV WATCHDOG message
  2026-03-18 13:48 [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support hawk
                   ` (2 preceding siblings ...)
  2026-03-18 13:48 ` [RFC PATCH net-next 3/6] veth: add tx_timeout watchdog as BQL safety net hawk
@ 2026-03-18 13:48 ` hawk
  2026-03-18 13:48 ` [RFC PATCH net-next 5/6] selftests: net: add veth BQL stress test hawk
  2026-03-18 13:48 ` [RFC PATCH net-next 6/6] net_sched: codel: fix stale state for empty flows in fq_codel hawk
  5 siblings, 0 replies; 13+ messages in thread
From: hawk @ 2026-03-18 13:48 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	toke, sdf, j.koeppeler, mfreemon, carges

From: Jesper Dangaard Brouer <hawk@kernel.org>

Add the per-queue timeout counter (trans_timeout) to the core NETDEV
WATCHDOG log message.  This makes it easy to determine how frequently
a particular queue is stalling from a single log line, without having
to search through and correlate spaced-out log entries.

Useful for production monitoring where timeouts are spaced by the
watchdog interval, making frequency hard to judge.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/all/20251107175445.58eba452@kernel.org/
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 net/sched/sch_generic.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 69d5ac4f17d1..da97cda1a1e7 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -533,13 +533,12 @@ static void dev_watchdog(struct timer_list *t)
 		    netif_running(dev) &&
 		    netif_carrier_ok(dev)) {
 			unsigned int timedout_ms = 0;
+			struct netdev_queue *txq;
 			unsigned int i;
 			unsigned long trans_start;
 			unsigned long oldest_start = jiffies;
 
 			for (i = 0; i < dev->num_tx_queues; i++) {
-				struct netdev_queue *txq;
-
 				txq = netdev_get_tx_queue(dev, i);
 				if (!netif_xmit_stopped(txq))
 					continue;
@@ -561,9 +560,10 @@ static void dev_watchdog(struct timer_list *t)
 
 			if (unlikely(timedout_ms)) {
 				trace_net_dev_xmit_timeout(dev, i);
-				netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms\n",
+				netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms (n:%ld)\n",
 					    raw_smp_processor_id(),
-					    i, timedout_ms);
+					    i, timedout_ms,
+					    atomic_long_read(&txq->trans_timeout));
 				netif_freeze_queues(dev);
 				dev->netdev_ops->ndo_tx_timeout(dev, i);
 				netif_unfreeze_queues(dev);
-- 
2.43.0



* [RFC PATCH net-next 5/6] selftests: net: add veth BQL stress test
  2026-03-18 13:48 [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support hawk
                   ` (3 preceding siblings ...)
  2026-03-18 13:48 ` [RFC PATCH net-next 4/6] net: sched: add timeout count to NETDEV WATCHDOG message hawk
@ 2026-03-18 13:48 ` hawk
  2026-03-18 13:48 ` [RFC PATCH net-next 6/6] net_sched: codel: fix stale state for empty flows in fq_codel hawk
  5 siblings, 0 replies; 13+ messages in thread
From: hawk @ 2026-03-18 13:48 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	toke, sdf, j.koeppeler, mfreemon, carges

From: Jesper Dangaard Brouer <hawk@kernel.org>

Add a selftest that exercises veth's BQL (Byte Queue Limits) code path
under sustained UDP load. The test creates a veth pair with GRO enabled
(activating the NAPI path and BQL), attaches a qdisc, optionally loads
iptables rules in the consumer namespace to slow NAPI processing, and
floods UDP packets for a configurable duration.

The test serves two purposes: benchmarking BQL's latency impact under
configurable load (iptables rules, qdisc type and parameters), and
detecting kernel BUG/Oops from DQL accounting mismatches. It monitors
dmesg throughout the run and reports PASS/FAIL via kselftest (lib.sh).

Diagnostic output is printed every 5 seconds:
  - BQL sysfs inflight/limit and watchdog tx_timeout counter
  - qdisc stats: packets, drops, requeues, backlog, qlen, overlimits
  - consumer PPS and NAPI-64 cycle time (shows fq_codel target impact)
  - ping RTT to measure latency under load

Generating enough traffic to fill the 256-entry ptr_ring requires care:
the UDP sendto() path charges each SKB to sk_wmem_alloc, and the SKB
stays charged (via its sock_wfree destructor) until the consumer NAPI thread
finishes processing it -- including any iptables rules in the receive
path. With the default sk_sndbuf (~208KB from wmem_default), only ~93
packets can be in-flight before sendto(MSG_DONTWAIT) returns EAGAIN.
Since 93 < 256 ring entries, the ring never fills and no backpressure
occurs. The test raises wmem_max via sysctl and sets SO_SNDBUF=1MB on
the flood socket to remove this bottleneck. An earlier multi-namespace
routing approach avoided this limit because ip_forward creates new SKBs
detached from the sender's socket.

The --bql-disable option (sets limit_min=1GB) enables A/B comparison.
Typical results with --nrules 6000 --qdisc-opts 'target 2ms interval 20ms':

  fq_codel + BQL disabled:  ping RTT ~10.8ms, 15% loss, 400KB in ptr_ring
  fq_codel + BQL enabled:   ping RTT ~0.6ms,   0% loss, 4KB in ptr_ring

Both cases show identical consumer speed (~20Kpps) and fq_codel drops
(~255K), proving the improvement comes purely from where packets buffer.

BQL moves buffering from the ptr_ring into the qdisc, where AQM
(fq_codel/CAKE) can act on it -- eliminating the "dark buffer" that
hides congestion from the scheduler.

A companion wrapper (veth_bql_test_virtme.sh) launches the test inside
a virtme-ng VM, with .config validation to prevent silent stalls.

Usage:
  sudo ./veth_bql_test.sh [--duration 300] [--nrules 100]
                          [--qdisc sfq] [--qdisc-opts '...']
                          [--bql-disable]

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 tools/testing/selftests/net/veth_bql_test.sh  | 626 ++++++++++++++++++
 .../selftests/net/veth_bql_test_virtme.sh     | 113 ++++
 2 files changed, 739 insertions(+)
 create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
 create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh

diff --git a/tools/testing/selftests/net/veth_bql_test.sh b/tools/testing/selftests/net/veth_bql_test.sh
new file mode 100755
index 000000000000..071294a86cd9
--- /dev/null
+++ b/tools/testing/selftests/net/veth_bql_test.sh
@@ -0,0 +1,626 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Veth BQL (Byte Queue Limits) stress test and A/B benchmarking tool.
+#
+# Creates a veth pair with GRO on and TSO off (ensures all packets use
+# the NAPI/ptr_ring path where BQL operates), attaches a configurable
+# qdisc, optionally loads iptables rules to slow the consumer NAPI
+# processing, and floods UDP packets at maximum rate.
+#
+# Primary uses:
+#   1) A/B comparison of latency with/without BQL (--bql-disable flag)
+#   2) Testing different qdiscs and their parameters (--qdisc, --qdisc-opts)
+#   3) Detecting kernel BUG/Oops from DQL accounting mismatches
+#
+# Key design detail -- SO_SNDBUF and wmem_max:
+#   The UDP sendto() path charges each SKB to the socket's sk_wmem_alloc
+#   counter.  The SKB carries a destructor (sock_wfree) that releases the
+#   charge only after the consumer NAPI thread on the peer veth finishes
+#   processing it -- including any iptables rules in the receive path.
+#   With the default sk_sndbuf (~208KB from wmem_default), only ~93
+#   packets (1442B each) can be in-flight before sendto() returns EAGAIN.
+#   Since 93 < 256 ptr_ring entries, the ring never fills and no qdisc
+#   backpressure occurs.  The test temporarily raises the global wmem_max
+#   sysctl and sets SO_SNDBUF=1MB to allow enough in-flight SKBs to
+#   saturate the ptr_ring.  The original wmem_max is restored on exit.
+#
+# Two TX-stop mechanisms and the dark-buffer problem:
+#   DRV_XOFF backpressure (commit dc82a33297fc) stops the TX queue when
+#   the 256-entry ptr_ring is full.  The queue is released at the end of
+#   veth_poll() (commit 5442a9da6978) after processing up to 64 packets
+#   (NAPI budget).  Without BQL, the entire ring is a FIFO "dark buffer"
+#   in front of the qdisc -- packets there are invisible to AQM.
+#
+#   BQL adds STACK_XOFF, which dynamically limits in-flight bytes and
+#   stops the queue *before* the ring fills.  This keeps the ring
+#   shallow and moves buffering into the qdisc where sojourn-based AQM
+#   (codel, fq_codel, CAKE/COBALT) can measure and drop packets.
+#
+# Sojourn time and NAPI budget interaction:
+#   DRV_XOFF releases backpressure once per NAPI poll (up to 64 pkts).
+#   During that cycle, packets queued in the qdisc accumulate sojourn
+#   time.  With fq_codel's default target of 5ms, the threshold is:
+#     5000us / 64 pkts = 78us/pkt --> ~12,800 pps consumer speed.
+#   Below that rate the NAPI-64 cycle exceeds the target and fq_codel
+#   starts dropping.  Use --nrules and --qdisc-opts to experiment.
+#
+cd "$(dirname -- "$0")" || exit 1
+source lib.sh
+
+# Defaults
+DURATION=300      # seconds; 307M pkts @ ~1Mpps needs ~5min
+NRULES=100        # iptables rules in consumer NS (0 to disable)
+QDISC=sfq         # qdisc to use (sfq, pfifo, fq_codel, etc.)
+QDISC_OPTS=""     # extra qdisc parameters (e.g. "target 1ms interval 10ms")
+BQL_DISABLE=0     # 1 to disable BQL (sets limit_min high)
+QDISC_REPLACE=0   # 1 to test qdisc replacement under active traffic
+VETH_A="veth_bql0"
+VETH_B="veth_bql1"
+IP_A="10.99.0.1"
+IP_B="10.99.0.2"
+PORT=9999
+PKT_SIZE=1400     # large packets: slower producer, bigger BQL charges
+
+usage() {
+    echo "Usage: $0 [OPTIONS]"
+    echo "  --duration SEC   test duration (default: $DURATION)"
+    echo "  --nrules N       iptables rules to slow consumer (default: $NRULES, 0=disable)"
+    echo "  --qdisc NAME     qdisc to install (default: $QDISC)"
+    echo "  --qdisc-opts STR extra qdisc params (e.g. 'target 1ms interval 10ms')"
+    echo "  --bql-disable    disable BQL for A/B comparison"
+    echo "  --qdisc-replace  test qdisc replacement under active traffic"
+    exit 1
+}
+
+while [ $# -gt 0 ]; do
+    case "$1" in
+    --duration)   DURATION="$2"; shift 2 ;;
+    --nrules)     NRULES="$2"; shift 2 ;;
+    --qdisc)      QDISC="$2"; shift 2 ;;
+    --qdisc-opts) QDISC_OPTS="$2"; shift 2 ;;
+    --bql-disable) BQL_DISABLE=1; shift ;;
+    --qdisc-replace) QDISC_REPLACE=1; shift ;;
+    --help|-h)    usage ;;
+    *)            echo "Unknown option: $1" >&2; usage ;;
+    esac
+done
+
+TMPDIR=$(mktemp -d)
+
+FLOOD_PID=""
+SINK_PID=""
+PING_PID=""
+
+cleanup() {
+    [ -n "$FLOOD_PID" ] && kill_process $FLOOD_PID
+    [ -n "$SINK_PID" ] && kill_process $SINK_PID
+    [ -n "$PING_PID" ] && kill_process $PING_PID
+    cleanup_all_ns
+    ip link del "$VETH_A" 2>/dev/null || true
+    [ -n "$ORIG_WMEM_MAX" ] && sysctl -qw net.core.wmem_max="$ORIG_WMEM_MAX"
+    rm -rf "$TMPDIR"
+}
+trap cleanup EXIT
+
+require_command gcc
+require_command ethtool
+require_command tc
+
+# --- Function definitions ---
+
+compile_tools() {
+    echo "--- Compiling UDP flood tool ---"
+cat > "$TMPDIR"/udp_flood.c << 'CEOF'
+#include <arpa/inet.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <time.h>
+#include <unistd.h>
+
+static volatile int running = 1;
+
+static void stop(int sig) { running = 0; }
+
+struct pkt_hdr {
+	struct timespec ts;
+	unsigned long seq;
+};
+
+int main(int argc, char **argv)
+{
+	struct sockaddr_in dst;
+	struct pkt_hdr hdr;
+	unsigned long count = 0;
+	char buf[1500];
+	int sndbuf = 1048576;
+	int pkt_size;
+	int duration;
+	int fd;
+
+	if (argc < 5) {
+		fprintf(stderr, "Usage: %s <ip> <pkt_size> <port> <duration_sec>\n",
+			argv[0]);
+		return 1;
+	}
+
+	pkt_size = atoi(argv[2]);
+	if (pkt_size < (int)sizeof(struct pkt_hdr))
+		pkt_size = sizeof(struct pkt_hdr);
+	if (pkt_size > (int)sizeof(buf))
+		pkt_size = sizeof(buf);
+	duration = atoi(argv[4]);
+
+	memset(&dst, 0, sizeof(dst));
+	dst.sin_family = AF_INET;
+	dst.sin_port = htons(atoi(argv[3]));
+	inet_pton(AF_INET, argv[1], &dst.sin_addr);
+
+	fd = socket(AF_INET, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		return 1;
+	}
+
+	/* Raise send buffer so sk_wmem_alloc limit doesn't cap
+	 * in-flight packets before the ptr_ring (256 entries) fills.
+	 * Default wmem_default ~208K only allows ~93 packets.
+	 */
+	setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
+
+	memset(buf, 0xAA, sizeof(buf));
+	signal(SIGINT, stop);
+	signal(SIGTERM, stop);
+	signal(SIGALRM, stop);
+	alarm(duration);
+
+	while (running) {
+		clock_gettime(CLOCK_MONOTONIC, &hdr.ts);
+		hdr.seq = count;
+		memcpy(buf, &hdr, sizeof(hdr));
+		sendto(fd, buf, pkt_size, MSG_DONTWAIT,
+		       (struct sockaddr *)&dst, sizeof(dst));
+		count++;
+		if (!(count % 10000000))
+			fprintf(stderr, "  sent: %lu M packets\n",
+				count / 1000000);
+	}
+
+	fprintf(stderr, "Total sent: %lu packets (%.1f M)\n",
+		count, (double)count / 1e6);
+	close(fd);
+	return 0;
+}
+CEOF
+gcc -O2 -Wall -o "$TMPDIR"/udp_flood "$TMPDIR"/udp_flood.c || exit $ksft_fail
+
+# UDP sink with latency measurement
+cat > "$TMPDIR"/udp_sink.c << 'CEOF'
+#include <arpa/inet.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <time.h>
+#include <unistd.h>
+
+static volatile int running = 1;
+
+static void stop(int sig) { running = 0; }
+
+struct pkt_hdr {
+	struct timespec ts;
+	unsigned long seq;
+};
+
+static void print_stats(unsigned long count, unsigned long drops,
+			unsigned long reorders,
+			double lat_min, double lat_avg_sum,
+			double lat_max)
+{
+	if (!count)
+		return;
+	fprintf(stderr, "  sink: %lu pkts  drops=%lu  reorders=%lu"
+		"  latency min/avg/max = %.3f/%.3f/%.3f ms\n",
+		count, drops, reorders,
+		lat_min * 1e3, (lat_avg_sum / count) * 1e3,
+		lat_max * 1e3);
+}
+
+int main(int argc, char **argv)
+{
+	unsigned long next_seq = 0, drops = 0, reorders = 0;
+	double lat_min = 1e9, lat_max = 0, lat_sum = 0;
+	unsigned long count = 0;
+	struct sockaddr_in addr;
+	char buf[2048];
+	int fd, one = 1;
+
+	if (argc < 2) {
+		fprintf(stderr, "Usage: %s <port>\n", argv[0]);
+		return 1;
+	}
+
+	fd = socket(AF_INET, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		return 1;
+	}
+	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
+
+	/* Timeout so recv() unblocks periodically to check 'running' flag.
+	 * Needed because glibc signal() sets SA_RESTART, so SIGTERM
+	 * does not interrupt recv().
+	 */
+	struct timeval tv = { .tv_sec = 1 };
+	setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
+
+	memset(&addr, 0, sizeof(addr));
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(atoi(argv[1]));
+	addr.sin_addr.s_addr = INADDR_ANY;
+	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
+		perror("bind");
+		return 1;
+	}
+
+	signal(SIGINT, stop);
+	signal(SIGTERM, stop);
+
+	while (running) {
+		struct pkt_hdr hdr;
+		struct timespec now;
+		ssize_t n;
+		double lat;
+
+		n = recv(fd, buf, sizeof(buf), 0);
+		if (n < (ssize_t)sizeof(struct pkt_hdr))
+			continue;
+
+		clock_gettime(CLOCK_MONOTONIC, &now);
+		memcpy(&hdr, buf, sizeof(hdr));
+
+		/* Track drops (gaps) and reorders (late arrivals) */
+		if (hdr.seq > next_seq)
+			drops += hdr.seq - next_seq;
+		if (hdr.seq < next_seq)
+			reorders++;
+		if (hdr.seq >= next_seq)
+			next_seq = hdr.seq + 1;
+
+		lat = (now.tv_sec - hdr.ts.tv_sec) +
+		      (now.tv_nsec - hdr.ts.tv_nsec) * 1e-9;
+
+		if (lat < lat_min)
+			lat_min = lat;
+		if (lat > lat_max)
+			lat_max = lat;
+		lat_sum += lat;
+		count++;
+
+		if (!(count % 1000000))
+			print_stats(count, drops, reorders,
+				    lat_min, lat_sum, lat_max);
+	}
+
+	print_stats(count, drops, reorders, lat_min, lat_sum, lat_max);
+	close(fd);
+	return 0;
+}
+CEOF
+gcc -O2 -Wall -o "$TMPDIR"/udp_sink "$TMPDIR"/udp_sink.c || exit $ksft_fail
+}
+
+setup_veth() {
+    log_info "Setting up veth pair with GRO"
+    setup_ns NS || exit $ksft_skip
+    ip link add "$VETH_A" type veth peer name "$VETH_B" || \
+        { echo "Failed to create veth pair (need root?)"; exit $ksft_skip; }
+    ip link set "$VETH_B" netns "$NS" || \
+        { echo "Failed to move veth to namespace"; exit $ksft_skip; }
+
+    # Configure IPs
+    ip addr add "${IP_A}/24" dev "$VETH_A"
+    ip link set "$VETH_A" up
+
+    ip -netns "$NS" addr add "${IP_B}/24" dev "$VETH_B"
+    ip -netns "$NS" link set "$VETH_B" up
+
+    # Raise wmem_max so the flood tool's SO_SNDBUF takes effect.
+    # Default 212992 caps in-flight to ~93 packets (sk_wmem_alloc limit),
+    # which is less than the 256-entry ptr_ring and prevents backpressure.
+    ORIG_WMEM_MAX=$(sysctl -n net.core.wmem_max)
+    sysctl -qw net.core.wmem_max=1048576
+
+    # Enable GRO on both ends -- this activates NAPI, the code path where
+    # BQL completions run
+    ethtool -K "$VETH_A" gro on 2>/dev/null || true
+    ip netns exec "$NS" ethtool -K "$VETH_B" gro on 2>/dev/null || true
+
+    # Disable TSO so veth_skb_is_eligible_for_gro() returns true for all
+    # packets, ensuring every SKB takes the NAPI/ptr_ring path.  With TSO
+    # enabled, only SKBs that pass the sock_wfree + GRO feature checks are
+    # eligible; with TSO off, every packet qualifies.
+    ethtool -K "$VETH_A" tso off gso off 2>/dev/null || true
+    ip netns exec "$NS" ethtool -K "$VETH_B" tso off gso off 2>/dev/null || true
+
+    # Enable threaded NAPI -- this is critical: BQL backpressure (STACK_XOFF)
+    # only engages when producer and consumer run on separate CPUs.
+    # Without threaded NAPI, softirq completions happen too fast for BQL
+    # to build up enough in-flight bytes to trigger the limit.
+    echo 1 > /sys/class/net/"$VETH_A"/threaded 2>/dev/null || true
+    ip netns exec "$NS" sh -c "echo 1 > /sys/class/net/$VETH_B/threaded" 2>/dev/null || true
+}
+
+install_qdisc() {
+    local qdisc="${1:-$QDISC}"
+    local opts="${2:-}"
+    # Add a qdisc -- veth defaults to noqueue, but BQL needs a qdisc
+    # because STACK_XOFF is checked by the qdisc layer.
+    # Note: qdisc_create() auto-fixes txqueuelen=0 on IFF_NO_QUEUE devices
+    # to DEFAULT_TX_QUEUE_LEN (commit 84c46dd86538).
+    log_info "Installing qdisc: $qdisc $opts"
+    tc qdisc replace dev "$VETH_A" root $qdisc $opts
+    ip netns exec "$NS" tc qdisc replace dev "$VETH_B" root $qdisc $opts
+}
+
+remove_qdisc() {
+    log_info "Removing qdisc (reverting to noqueue)"
+    tc qdisc del dev "$VETH_A" root 2>/dev/null || true
+    ip netns exec "$NS" tc qdisc del dev "$VETH_B" root 2>/dev/null || true
+}
+
+setup_iptables() {
+    # Bulk-load iptables rules in consumer namespace to slow NAPI processing.
+    # Many rules force per-packet linear rule traversal, increasing consumer
+    # overhead and BQL inflight bytes -- simulates realistic k8s-like workload.
+    if [ "$NRULES" -gt 0 ]; then
+        ip netns exec "$NS" bash -c '
+        iptables-restore < <(
+        echo "*filter"
+        for n in $(seq 1 '"$NRULES"'); do
+          echo "-I INPUT -d '"$IP_B"'"
+        done
+        echo "COMMIT"
+        )
+        ' 2>/dev/null || log_info "iptables not available, skipping rules"
+        log_info "Loaded $NRULES iptables rules in consumer NS"
+    fi
+}
+
+check_bql_sysfs() {
+    BQL_DIR="/sys/class/net/${VETH_A}/queues/tx-0/byte_queue_limits"
+    if [ -d "$BQL_DIR" ]; then
+        log_info "BQL sysfs found: $BQL_DIR"
+        if [ "$BQL_DISABLE" -eq 1 ]; then
+            echo 1073741824 > "$BQL_DIR/limit_min"
+            log_info "BQL effectively disabled (limit_min=1G)"
+        fi
+    else
+        log_info "BQL sysfs absent (veth IFF_NO_QUEUE+lltx, DQL accounting still active)"
+        BQL_DIR=""
+    fi
+}
+
+start_traffic() {
+    # Snapshot dmesg before test
+    DMESG_BEFORE=$(dmesg | wc -l)
+
+    log_info "Starting UDP sink in namespace"
+    ip netns exec "$NS" "$TMPDIR"/udp_sink "$PORT" &
+    SINK_PID=$!
+    sleep 0.2
+
+    log_info "Starting ping to $IP_B (5/s) to measure latency under load"
+    ping -i 0.2 -w "$DURATION" "$IP_B" > "$TMPDIR"/ping.log 2>&1 &
+    PING_PID=$!
+
+    log_info "Flooding ${PKT_SIZE}-byte UDP packets for ${DURATION}s"
+    "$TMPDIR"/udp_flood "$IP_B" "$PKT_SIZE" "$PORT" "$DURATION" &
+    FLOOD_PID=$!
+}
+
+stop_traffic() {
+    [ -n "$FLOOD_PID" ] && kill_process $FLOOD_PID
+    FLOOD_PID=""
+    [ -n "$SINK_PID" ] && kill_process $SINK_PID
+    SINK_PID=""
+    [ -n "$PING_PID" ] && kill_process $PING_PID
+    PING_PID=""
+}
+
+check_dmesg_bug() {
+    if dmesg | tail -n +$((DMESG_BEFORE + 1)) | \
+       grep -qE 'kernel BUG|BUG:|Oops:|dql_completed'; then
+        dmesg | tail -n +$((DMESG_BEFORE + 1)) | \
+            grep -B2 -A20 -E 'kernel BUG|BUG:|Oops:|dql_completed|NETDEV WATCHDOG'
+        return 1
+    fi
+    return 0
+}
+
+print_periodic_stats() {
+    local elapsed="$1"
+
+    # BQL stats and watchdog counter
+    WD_CNT=$(cat /sys/class/net/${VETH_A}/queues/tx-0/tx_timeout \
+        2>/dev/null) || WD_CNT="?"
+    if [ -n "$BQL_DIR" ] && [ -d "$BQL_DIR" ]; then
+        INFLIGHT=$(cat "$BQL_DIR/inflight" 2>/dev/null || echo "?")
+        LIMIT=$(cat "$BQL_DIR/limit" 2>/dev/null || echo "?")
+        echo "  [${elapsed}s] BQL inflight=${INFLIGHT} limit=${LIMIT}" \
+            "watchdog=${WD_CNT}"
+    else
+        echo "  [${elapsed}s] watchdog=${WD_CNT} (no BQL sysfs)"
+    fi
+
+    # Qdisc stats
+    JQ_FMT='"qdisc \(.kind) pkts=\(.packets) drops=\(.drops)'
+    JQ_FMT+=' requeues=\(.requeues) backlog=\(.backlog)'
+    JQ_FMT+=' qlen=\(.qlen) overlimits=\(.overlimits)"'
+    CUR_QPKTS=$(tc -j -s qdisc show dev "$VETH_A" root 2>/dev/null |
+        jq -r '.[0].packets // 0' 2>/dev/null) || CUR_QPKTS=0
+    QSTATS=$(tc -j -s qdisc show dev "$VETH_A" root 2>/dev/null |
+        jq -r ".[0] | $JQ_FMT" 2>/dev/null) &&
+        echo "  [${elapsed}s] $QSTATS" || true
+
+    # Consumer PPS and per-packet processing time
+    if [ "$PREV_QPKTS" -gt 0 ] 2>/dev/null; then
+        DELTA=$((CUR_QPKTS - PREV_QPKTS))
+        PPS=$((DELTA / INTERVAL))
+        if [ "$PPS" -gt 0 ]; then
+            PKT_MS=$(awk "BEGIN {printf \"%.3f\", 1000.0/$PPS}")
+            NAPI_MS=$(awk "BEGIN {printf \"%.1f\", 64000.0/$PPS}")
+            echo "  [${elapsed}s] consumer: ${PPS} pps" \
+                "(~${PKT_MS}ms/pkt, NAPI-64 cycle ~${NAPI_MS}ms)"
+        fi
+    fi
+    PREV_QPKTS=$CUR_QPKTS
+
+    # Ping RTT
+    PING_RTT=$(tail -1 "$TMPDIR"/ping.log 2>/dev/null | grep -oP 'time=\K[0-9.]+') &&
+        echo "  [${elapsed}s] ping RTT=${PING_RTT}ms" || true
+}
+
+monitor_loop() {
+    ELAPSED=0
+    INTERVAL=5
+    PREV_QPKTS=0
+    while kill -0 "$FLOOD_PID" 2>/dev/null; do
+        sleep "$INTERVAL"
+        ELAPSED=$((ELAPSED + INTERVAL))
+
+        if ! check_dmesg_bug; then
+            RET=$ksft_fail
+            retmsg="BUG_ON triggered in dql_completed at ${ELAPSED}s"
+            log_test "veth_bql"
+            exit $EXIT_STATUS
+        fi
+
+        print_periodic_stats "$ELAPSED"
+    done
+    wait "$FLOOD_PID" || true
+    FLOOD_PID=""
+}
+
+# Verify traffic is flowing by checking device tx_packets counter.
+# Works for both qdisc and noqueue modes.
+verify_traffic_flowing() {
+    local label="$1"
+    local prev_tx cur_tx
+    prev_tx=$(cat /sys/class/net/${VETH_A}/statistics/tx_packets \
+        2>/dev/null) || prev_tx=0
+    sleep 2
+    cur_tx=$(cat /sys/class/net/${VETH_A}/statistics/tx_packets \
+        2>/dev/null) || cur_tx=0
+    if [ "$cur_tx" -gt "$prev_tx" ]; then
+        log_info "$label traffic flowing (tx: $prev_tx -> $cur_tx)"
+        return 0
+    fi
+    log_info "$label traffic STALLED (tx: $prev_tx -> $cur_tx)"
+    return 1
+}
+
+collect_results() {
+    local test_name="${1:-veth_bql}"
+
+    # Ping summary
+    wait "$PING_PID" 2>/dev/null || true
+    PING_PID=""
+    if [ -f "$TMPDIR"/ping.log ]; then
+        PING_LOSS=$(grep -o '[0-9.]*% packet loss' "$TMPDIR"/ping.log) &&
+            log_info "Ping loss: $PING_LOSS"
+        PING_SUMMARY=$(tail -1 "$TMPDIR"/ping.log)
+        log_info "Ping summary: $PING_SUMMARY"
+    fi
+
+    # Watchdog summary
+    WD_FINAL=$(cat /sys/class/net/${VETH_A}/queues/tx-0/tx_timeout \
+        2>/dev/null) || WD_FINAL=0
+    if [ "$WD_FINAL" -gt 0 ] 2>/dev/null; then
+        log_info "Watchdog fired ${WD_FINAL} time(s)"
+        dmesg | tail -n +$((DMESG_BEFORE + 1)) | \
+            grep -E 'NETDEV WATCHDOG|veth backpressure' || true
+    fi
+
+    # Final dmesg check -- only upgrade to fail, never override existing fail
+    if ! check_dmesg_bug; then
+        RET=$ksft_fail
+        retmsg="BUG_ON triggered in dql_completed"
+    fi
+    log_test "$test_name"
+    exit $EXIT_STATUS
+}
+
+# --- Test modes ---
+
+test_bql_stress() {
+    RET=$ksft_pass
+    compile_tools
+    setup_veth
+    install_qdisc "$QDISC" "$QDISC_OPTS"
+    setup_iptables
+    log_info "kernel: $(uname -r)"
+    check_bql_sysfs
+    start_traffic
+    monitor_loop
+    collect_results "veth_bql"
+}
+
+# Test qdisc replacement under active traffic.  Cycles through several
+# qdiscs including a transition to noqueue (tc qdisc del) to verify
+# that stale BQL state (STACK_XOFF) is properly reset during qdisc
+# transitions.
+test_qdisc_replace() {
+    local qdiscs=("sfq" "pfifo" "fq_codel")
+    local step=2
+    local elapsed=0
+    local idx
+
+    RET=$ksft_pass
+    compile_tools
+    setup_veth
+    install_qdisc "$QDISC" "$QDISC_OPTS"
+    setup_iptables
+    log_info "kernel: $(uname -r)"
+    check_bql_sysfs
+    start_traffic
+
+    while [ "$elapsed" -lt "$DURATION" ] && kill -0 "$FLOOD_PID" 2>/dev/null; do
+        sleep "$step"
+        elapsed=$((elapsed + step))
+
+        if ! check_dmesg_bug; then
+            RET=$ksft_fail
+            retmsg="BUG_ON during qdisc replacement at ${elapsed}s"
+            break
+        fi
+
+        # Cycle: sfq -> pfifo -> fq_codel -> noqueue -> sfq -> ...
+        idx=$(( (elapsed / step - 1) % (${#qdiscs[@]} + 1) ))
+        if [ "$idx" -eq "${#qdiscs[@]}" ]; then
+            remove_qdisc
+        else
+            install_qdisc "${qdiscs[$idx]}"
+        fi
+
+        if ! verify_traffic_flowing "[${elapsed}s]"; then
+            RET=$ksft_fail
+            retmsg="Traffic stalled after qdisc replacement at ${elapsed}s"
+            break
+        fi
+    done
+
+    stop_traffic
+    collect_results "veth_bql_qdisc_replace"
+}
+
+# --- Main ---
+if [ "$QDISC_REPLACE" -eq 1 ]; then
+    test_qdisc_replace
+else
+    test_bql_stress
+fi
diff --git a/tools/testing/selftests/net/veth_bql_test_virtme.sh b/tools/testing/selftests/net/veth_bql_test_virtme.sh
new file mode 100755
index 000000000000..9409a40feb15
--- /dev/null
+++ b/tools/testing/selftests/net/veth_bql_test_virtme.sh
@@ -0,0 +1,113 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Launch veth BQL test inside virtme-ng
+#
+# Must be run from the kernel build tree root.
+#
+# Options:
+#   --verbose       Show kernel console (vng boot messages) in real time.
+#                   Useful for debugging kernel panics / BUG_ON crashes.
+#   All other options are forwarded to veth_bql_test.sh (see --help there).
+#
+# Examples:
+#   ./tools/testing/selftests/net/veth_bql_test_virtme.sh
+#   ./tools/testing/selftests/net/veth_bql_test_virtme.sh --duration 20 --nrules 1000 --qdisc fq_codel
+#   ./tools/testing/selftests/net/veth_bql_test_virtme.sh --qdisc fq_codel --bql-disable
+#   ./tools/testing/selftests/net/veth_bql_test_virtme.sh --verbose --qdisc-replace --duration 60
+
+set -eu
+
+# Parse --verbose (consumed here, not forwarded to the inner test).
+VERBOSE=""
+INNER_ARGS=()
+for arg in "$@"; do
+    if [ "$arg" = "--verbose" ]; then
+        VERBOSE="--verbose"
+    else
+        INNER_ARGS+=("$arg")
+    fi
+done
+TEST_ARGS=$(printf '%q ' "${INNER_ARGS[@]+${INNER_ARGS[@]}}")
+TEST="tools/testing/selftests/net/veth_bql_test.sh"
+
+if [ ! -f "$TEST" ]; then
+    echo "ERROR: Run this from the kernel tree root" >&2
+    echo "  cd /path/to/kernel && $0" >&2
+    exit 1
+fi
+
+# Verify .config has the options needed for virtme-ng and this test.
+# Without these the VM silently stalls with no output.
+KCONFIG=".config"
+if [ ! -f "$KCONFIG" ]; then
+    echo "ERROR: No .config found -- build the kernel first" >&2
+    exit 1
+fi
+
+MISSING=""
+for opt in CONFIG_VIRTIO CONFIG_VIRTIO_PCI CONFIG_VIRTIO_NET \
+           CONFIG_VIRTIO_CONSOLE CONFIG_NET_9P CONFIG_NET_9P_VIRTIO \
+           CONFIG_9P_FS CONFIG_VETH CONFIG_BQL; do
+    if ! grep -q "^${opt}=[ym]" "$KCONFIG"; then
+        MISSING+="  $opt\n"
+    fi
+done
+if [ -n "$MISSING" ]; then
+    echo "ERROR: .config is missing options required by virtme-ng:" >&2
+    echo -e "$MISSING" >&2
+    echo "Consider: vng --kconfig (or make defconfig + enable above)" >&2
+    exit 1
+fi
+
+TESTDIR="tools/testing/selftests/net"
+TESTNAME="veth_bql_test.sh"
+LOGFILE="veth_bql_test.log"
+LOGPATH="$TESTDIR/$LOGFILE"
+CONSOLELOG="veth_bql_console.log"
+rm -f "$LOGPATH" "$CONSOLELOG"
+
+echo "Starting VM... test output in $LOGPATH, kernel console in $CONSOLELOG"
+echo "(VM is booting, please wait ~30s)"
+
+# Always capture kernel console to a file via a second QEMU serial port.
+# vng claims ttyS0 (mapped to /dev/null); --qemu-opts adds ttyS1 on COM2.
+# earlycon registers COM2's I/O port (0x2f8) as a persistent console.
+# (plain console=ttyS1 does NOT work: the 8250 driver registers once,
+# ttyS0 wins, and ttyS1 is never picked up.)
+# --verbose additionally shows kernel console in real time on the terminal.
+SERIAL_CONSOLE="earlycon=uart8250,io,0x2f8,115200"
+SERIAL_CONSOLE+=" console=uart8250,io,0x2f8,115200"
+set +e
+vng $VERBOSE --cpus 4 --memory 2G \
+    --rwdir "$TESTDIR" \
+    --append "panic=5 loglevel=4 $SERIAL_CONSOLE" \
+    --qemu-opts="-serial file:$CONSOLELOG" \
+    --exec "cd $TESTDIR && \
+        ./$TESTNAME $TEST_ARGS 2>&1 | \
+        tee $LOGFILE; echo EXIT_CODE=\${PIPESTATUS[0]} >> $LOGFILE"
+VNG_RC=$?
+set -e
+
+echo ""
+if [ "$VNG_RC" -ne 0 ]; then
+    echo "***********************************************************"
+    echo "* VM CRASHED -- kernel panic or BUG_ON (vng rc=$VNG_RC)"
+    echo "***********************************************************"
+    if [ -s "$CONSOLELOG" ] && \
+       grep -qiE 'kernel BUG|BUG:|Oops:|panic|dql_completed' "$CONSOLELOG"; then
+        echo ""
+        echo "--- kernel backtrace ($CONSOLELOG) ---"
+        grep -iE -A30 'kernel BUG|BUG:|Oops:|panic|dql_completed' \
+            "$CONSOLELOG" | head -50
+    else
+        echo ""
+        echo "Re-run with --verbose to see the kernel backtrace:"
+        echo "  $0 --verbose ${INNER_ARGS[*]}"
+    fi
+    exit 1
+elif [ ! -f "$LOGPATH" ]; then
+    echo "No log file found -- VM may have crashed before writing output"
+    exit 2
+else
+    echo "=== VM finished ==="
+fi
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH net-next 6/6] net_sched: codel: fix stale state for empty flows in fq_codel
  2026-03-18 13:48 [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support hawk
                   ` (4 preceding siblings ...)
  2026-03-18 13:48 ` [RFC PATCH net-next 5/6] selftests: net: add veth BQL stress test hawk
@ 2026-03-18 13:48 ` hawk
  2026-03-18 14:10   ` Toke Høiland-Jørgensen
  5 siblings, 1 reply; 13+ messages in thread
From: hawk @ 2026-03-18 13:48 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	toke, sdf, j.koeppeler, mfreemon, carges

From: Jonas Köppeler <j.koeppeler@tu-berlin.de>

When codel_dequeue() finds an empty queue, it resets vars->dropping
but does not reset vars->first_above_time.  The reference CoDel
algorithm (Nichols & Jacobson, ACM Queue 2012) resets both:

  dodeque_result codel_queue_t::dodeque(time_t now) {
      ...
      if (r.p == NULL) {
          first_above_time = 0;   // <-- Linux omits this
      }
      ...
  }

Note that codel_should_drop() does reset first_above_time when called
with a NULL skb, but codel_dequeue() returns early before ever calling
codel_should_drop() in the empty-queue case.  The post-drop code paths
do reach codel_should_drop(NULL) and correctly reset the timer, so a
dropped packet breaks the cycle -- but the next delivered packet
re-arms first_above_time and the cycle repeats.

For sparse flows such as ICMP ping (one packet every 200ms-1s), the
first packet arms first_above_time, the flow goes empty, and the
second packet arrives after the interval has elapsed and gets dropped.
The pattern repeats, producing sustained loss on flows that are not
actually congested.

Test: veth pair, fq_codel, BQL disabled, 30000 iptables rules in the
consumer namespace (NAPI-64 cycle ~14ms, well above fq_codel's 5ms
target), ping at 5 pps under UDP flood:

  Before fix:  26% ping packet loss
  After fix:    0% ping packet loss

Fix by resetting first_above_time to zero in the empty-queue path
of codel_dequeue(), matching the reference algorithm.

Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM")
Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Reported-by: Chris Arges <carges@cloudflare.com>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 include/net/codel_impl.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/net/codel_impl.h b/include/net/codel_impl.h
index 78a27ac73070..b2c359c6dd1b 100644
--- a/include/net/codel_impl.h
+++ b/include/net/codel_impl.h
@@ -158,6 +158,7 @@ static struct sk_buff *codel_dequeue(void *ctx,
 	bool drop;
 
 	if (!skb) {
+		vars->first_above_time = 0;
 		vars->dropping = false;
 		return skb;
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH net-next 6/6] net_sched: codel: fix stale state for empty flows in fq_codel
  2026-03-18 13:48 ` [RFC PATCH net-next 6/6] net_sched: codel: fix stale state for empty flows in fq_codel hawk
@ 2026-03-18 14:10   ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2026-03-18 14:10 UTC (permalink / raw)
  To: hawk, netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	sdf, j.koeppeler, mfreemon, carges

hawk@kernel.org writes:

> From: Jonas Köppeler <j.koeppeler@tu-berlin.de>
>
> When codel_dequeue() finds an empty queue, it resets vars->dropping
> but does not reset vars->first_above_time.  The reference CoDel
> algorithm (Nichols & Jacobson, ACM Queue 2012) resets both:
>
>   dodeque_result codel_queue_t::dodeque(time_t now) {
>       ...
>       if (r.p == NULL) {
>           first_above_time = 0;   // <-- Linux omits this
>       }
>       ...
>   }
>
> Note that codel_should_drop() does reset first_above_time when called
> with a NULL skb, but codel_dequeue() returns early before ever calling
> codel_should_drop() in the empty-queue case.  The post-drop code paths
> do reach codel_should_drop(NULL) and correctly reset the timer, so a
> dropped packet breaks the cycle -- but the next delivered packet
> re-arms first_above_time and the cycle repeats.

Nice find! I think this is worth submitting as a separate fix, no?
Probably with a fixes tag:

Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM")

and possibly this one as well to aid stable backports:

Fixes: d068ca2ae2e6 ("codel: split into multiple files")


> For sparse flows such as ICMP ping (one packet every 200ms-1s), the
> first packet arms first_above_time, the flow goes empty, and the
> second packet arrives after the interval has elapsed and gets dropped.
> The pattern repeats, producing sustained loss on flows that are not
> actually congested.

Note that this only happens if the ping flow actually gets into a
dropping state before emptying out. Which rarely happens in practice,
which I'm guessing is why this was not discovered before now :)

> Test: veth pair, fq_codel, BQL disabled, 30000 iptables rules in the
> consumer namespace (NAPI-64 cycle ~14ms, well above fq_codel's 5ms
> target), ping at 5 pps under UDP flood:
>
>   Before fix:  26% ping packet loss
>   After fix:    0% ping packet loss
>
> Fix by resetting first_above_time to zero in the empty-queue path
> of codel_dequeue(), matching the reference algorithm.
>
> Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM")
> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
> Reported-by: Chris Arges <carges@cloudflare.com>
> Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-03-18 13:48 ` [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
@ 2026-03-18 14:28   ` Toke Høiland-Jørgensen
  2026-03-18 16:24     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2026-03-18 14:28 UTC (permalink / raw)
  To: hawk, netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	sdf, j.koeppeler, mfreemon, carges

hawk@kernel.org writes:

> From: Jesper Dangaard Brouer <hawk@kernel.org>
>
> Add BQL support to the veth driver to dynamically limit the number of
> bytes queued in the ptr_ring, giving the qdisc earlier feedback to shape
> traffic and reduce latency.
>
> The BQL charge (netdev_tx_sent_queue) is placed in veth_xmit() BEFORE
> veth_forward_skb() produces the SKB into the ptr_ring. This ordering is
> critical: with threaded NAPI the consumer runs on a separate CPU and can
> complete the SKB (calling dql_completed) before veth_xmit() returns. If
> the charge happened after the produce, the completion could race ahead
> of the charge, violating dql_completed()'s invariant that completed
> bytes never exceed queued bytes (BUG_ON).
>
> Whether an SKB was BQL-charged is tracked per-SKB using a VETH_BQL_FLAG
> bit in the ptr_ring pointer (BIT(1), alongside the existing VETH_XDP_FLAG
> BIT(0)). The do_bql flag from veth_xmit() propagates through
> veth_forward_skb() and veth_xdp_rx() into the ptr_ring entry. On the
> completion side in veth_xdp_rcv(), veth_ptr_is_bql() reads the flag to
> decide whether to call netdev_tx_completed_queue(). Per-SKB tracking is
> necessary because the qdisc can be replaced live (e.g. noqueue->sfq or
> vice versa via 'tc qdisc replace') while SKBs are already in-flight in
> the ptr_ring. SKBs charged under the old qdisc must complete correctly
> regardless of what qdisc is attached when the consumer runs, so each
> SKB carries its own BQL-charged state rather than re-checking the peer's
> qdisc at completion time.

It's not completely obvious to me why BQL can't be active regardless of
whether there's a qdisc installed or not? If there's no qdisc, shouldn't
BQL auto-tune to a higher value because the queue runs empty more?

-Toke


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-03-18 14:28   ` Toke Høiland-Jørgensen
@ 2026-03-18 16:24     ` Jesper Dangaard Brouer
  2026-03-19 10:04       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2026-03-18 16:24 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	sdf, j.koeppeler, mfreemon, carges, kernel-team



On 18/03/2026 15.28, Toke Høiland-Jørgensen wrote:
> hawk@kernel.org writes:
> 
>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>
>> Add BQL support to the veth driver to dynamically limit the number of
>> bytes queued in the ptr_ring, giving the qdisc earlier feedback to shape
>> traffic and reduce latency.
>>
>> The BQL charge (netdev_tx_sent_queue) is placed in veth_xmit() BEFORE
>> veth_forward_skb() produces the SKB into the ptr_ring. This ordering is
>> critical: with threaded NAPI the consumer runs on a separate CPU and can
>> complete the SKB (calling dql_completed) before veth_xmit() returns. If
>> the charge happened after the produce, the completion could race ahead
>> of the charge, violating dql_completed()'s invariant that completed
>> bytes never exceed queued bytes (BUG_ON).
>>
>> Whether an SKB was BQL-charged is tracked per-SKB using a VETH_BQL_FLAG
>> bit in the ptr_ring pointer (BIT(1), alongside the existing VETH_XDP_FLAG
>> BIT(0)). The do_bql flag from veth_xmit() propagates through
>> veth_forward_skb() and veth_xdp_rx() into the ptr_ring entry. On the
>> completion side in veth_xdp_rcv(), veth_ptr_is_bql() reads the flag to
>> decide whether to call netdev_tx_completed_queue(). Per-SKB tracking is
>> necessary because the qdisc can be replaced live (e.g. noqueue->sfq or
>> vice versa via 'tc qdisc replace') while SKBs are already in-flight in
>> the ptr_ring. SKBs charged under the old qdisc must complete correctly
>> regardless of what qdisc is attached when the consumer runs, so each
>> SKB carries its own BQL-charged state rather than re-checking the peer's
>> qdisc at completion time.
> 
> It's not completely obvious to me why BQL can't be active regardless of
> whether there's a qdisc installed or not? If there's no qdisc, shouldn't
> BQL auto-tune to a higher value because the queue runs empty more?
> 

When the net_device doesn't have a qdisc we hit this code path:
  - [0]
https://elixir.bootlin.com/linux/v7.0-rc4/source/net/core/dev.c#L4806-L4852
  - Notice the check if (!netif_xmit_stopped(txq))
  - resulting in "Virtual device %s asks to queue packet!"

We cannot unconditionally track BQL, as calling netdev_tx_sent_queue()
can result in setting STACK_XOFF.  That makes the above code drop
packets and complain: there is no qdisc to requeue and store the
back-pressured packet.

--Jesper


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-03-18 16:24     ` Jesper Dangaard Brouer
@ 2026-03-19 10:04       ` Toke Høiland-Jørgensen
  2026-03-20 14:50         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2026-03-19 10:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	sdf, j.koeppeler, mfreemon, carges, kernel-team

Jesper Dangaard Brouer <hawk@kernel.org> writes:

> On 18/03/2026 15.28, Toke Høiland-Jørgensen wrote:
>> hawk@kernel.org writes:
>> 
>>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>>
>>> Add BQL support to the veth driver to dynamically limit the number of
>>> bytes queued in the ptr_ring, giving the qdisc earlier feedback to shape
>>> traffic and reduce latency.
>>>
>>> The BQL charge (netdev_tx_sent_queue) is placed in veth_xmit() BEFORE
>>> veth_forward_skb() produces the SKB into the ptr_ring. This ordering is
>>> critical: with threaded NAPI the consumer runs on a separate CPU and can
>>> complete the SKB (calling dql_completed) before veth_xmit() returns. If
>>> the charge happened after the produce, the completion could race ahead
>>> of the charge, violating dql_completed()'s invariant that completed
>>> bytes never exceed queued bytes (BUG_ON).
>>>
>>> Whether an SKB was BQL-charged is tracked per-SKB using a VETH_BQL_FLAG
>>> bit in the ptr_ring pointer (BIT(1), alongside the existing VETH_XDP_FLAG
>>> BIT(0)). The do_bql flag from veth_xmit() propagates through
>>> veth_forward_skb() and veth_xdp_rx() into the ptr_ring entry. On the
>>> completion side in veth_xdp_rcv(), veth_ptr_is_bql() reads the flag to
>>> decide whether to call netdev_tx_completed_queue(). Per-SKB tracking is
>>> necessary because the qdisc can be replaced live (e.g. noqueue->sfq or
>>> vice versa via 'tc qdisc replace') while SKBs are already in-flight in
>>> the ptr_ring. SKBs charged under the old qdisc must complete correctly
>>> regardless of what qdisc is attached when the consumer runs, so each
>>> SKB carries its own BQL-charged state rather than re-checking the peer's
>>> qdisc at completion time.
>> 
>> It's not completely obvious to me why BQL can't be active regardless of
>> whether there's a qdisc installed or not? If there's no qdisc, shouldn't
>> BQL auto-tune to a higher value because the queue runs empty more?
>> 
>
> When the net_device doesn't have a qdisc we hit this code path:
>   - [0]
> https://elixir.bootlin.com/linux/v7.0-rc4/source/net/core/dev.c#L4806-L4852
>   - Notice the check if (!netif_xmit_stopped(txq))
>   - resulting in "Virtual device %s asks to queue packet!"
>
> We cannot unconditionally track BQL, as calling netdev_tx_sent_queue()
> can result in setting STACK_XOFF.  That makes the above code drop
> packets and complain: there is no qdisc to requeue and store the
> back-pressured packet.

Ah, right. I realised the packet would be dropped, of course, but I did
not realise the stack would complain. That seems... odd? Why not just
get rid of the complaint instead of having this kludge to work around
it?

-Toke


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-03-19 10:04       ` Toke Høiland-Jørgensen
@ 2026-03-20 14:50         ` Jesper Dangaard Brouer
  2026-03-23 10:10           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2026-03-20 14:50 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	sdf, j.koeppeler, mfreemon, carges, kernel-team



On 19/03/2026 11.04, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
> 
>> On 18/03/2026 15.28, Toke Høiland-Jørgensen wrote:
>>> hawk@kernel.org writes:
>>>
>>>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>
>>>> Add BQL support to the veth driver to dynamically limit the number of
>>>> bytes queued in the ptr_ring, giving the qdisc earlier feedback to shape
>>>> traffic and reduce latency.
>>>>
>>>> The BQL charge (netdev_tx_sent_queue) is placed in veth_xmit() BEFORE
>>>> veth_forward_skb() produces the SKB into the ptr_ring. This ordering is
>>>> critical: with threaded NAPI the consumer runs on a separate CPU and can
>>>> complete the SKB (calling dql_completed) before veth_xmit() returns. If
>>>> the charge happened after the produce, the completion could race ahead
>>>> of the charge, violating dql_completed()'s invariant that completed
>>>> bytes never exceed queued bytes (BUG_ON).
>>>>
>>>> Whether an SKB was BQL-charged is tracked per-SKB using a VETH_BQL_FLAG
>>>> bit in the ptr_ring pointer (BIT(1), alongside the existing VETH_XDP_FLAG
>>>> BIT(0)). The do_bql flag from veth_xmit() propagates through
>>>> veth_forward_skb() and veth_xdp_rx() into the ptr_ring entry. On the
>>>> completion side in veth_xdp_rcv(), veth_ptr_is_bql() reads the flag to
>>>> decide whether to call netdev_tx_completed_queue(). Per-SKB tracking is
>>>> necessary because the qdisc can be replaced live (e.g. noqueue->sfq or
>>>> vice versa via 'tc qdisc replace') while SKBs are already in-flight in
>>>> the ptr_ring. SKBs charged under the old qdisc must complete correctly
>>>> regardless of what qdisc is attached when the consumer runs, so each
>>>> SKB carries its own BQL-charged state rather than re-checking the peer's
>>>> qdisc at completion time.
>>>
>>> It's not completely obvious to me why BQL can't be active regardless of
>>> whether there's a qdisc installed or not? If there's no qdisc, shouldn't
>>> BQL auto-tune to a higher value because the queue runs empty more?
>>>
>>
>> When the net_device doesn't have a qdisc we hit this code path:
>>    - [0]
>> https://elixir.bootlin.com/linux/v7.0-rc4/source/net/core/dev.c#L4806-L4852
>>    - Notice the check if (!netif_xmit_stopped(txq))
>>    - resulting in "Virtual device %s asks to queue packet!"
>>
>> We cannot unconditionally track BQL, as calling netdev_tx_sent_queue()
>> can result in setting STACK_XOFF.  That makes the above code drop
>> packets and complain: there is no qdisc to requeue and store the
>> back-pressured packet.
> 
> Ah, right. I realised the packet would be dropped, of course, but I did
> not realise the stack would complain. That seems... odd? Why not just
> get rid of the complaint instead of having this kludge to work around
> it?
> 

The BQL code manipulates the txq (struct dql) and this requires correct
locking.  Using noqueue on veth doesn't do the correct locking.

In the linked[0] code you were likely fooled by this code line[1]:
  HARD_TX_LOCK(dev, txq, cpu);
It is actually not taking the lock, because veth is an lltx device.
(Don't be fooled by the annotation; it is there for
WARN_CONTEXT_ANALYSIS[2].)

For fun I actually implemented it to see what happened.  And I did
manage to crash the kernel on the DQL-internal BUG_ON.  Analyzing the
BUG_ON I realized that my scheme of charging BQL and then undoing the
charge (if the ptr_ring was full) isn't correct; there is a race.  The
BQL call netdev_tx_completed_queue() is strictly to be used by the
consumer, not by the producer (like I did).

I'm now working on different approaches for the undo...

--Jesper

[1] https://elixir.bootlin.com/linux/v7.0-rc4/source/net/core/dev.c#L4831
[2] https://elixir.bootlin.com/linux/v7.0-rc4/source/include/linux/compiler-context-analysis.h#L166

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction
  2026-03-20 14:50         ` Jesper Dangaard Brouer
@ 2026-03-23 10:10           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2026-03-23 10:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	sdf, j.koeppeler, mfreemon, carges, kernel-team

Jesper Dangaard Brouer <hawk@kernel.org> writes:

> On 19/03/2026 11.04, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>> 
>>> On 18/03/2026 15.28, Toke Høiland-Jørgensen wrote:
>>>> hawk@kernel.org writes:
>>>>
>>>>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>
>>>>> Add BQL support to the veth driver to dynamically limit the number of
>>>>> bytes queued in the ptr_ring, giving the qdisc earlier feedback to shape
>>>>> traffic and reduce latency.
>>>>>
>>>>> The BQL charge (netdev_tx_sent_queue) is placed in veth_xmit() BEFORE
>>>>> veth_forward_skb() produces the SKB into the ptr_ring. This ordering is
>>>>> critical: with threaded NAPI the consumer runs on a separate CPU and can
>>>>> complete the SKB (calling dql_completed) before veth_xmit() returns. If
>>>>> the charge happened after the produce, the completion could race ahead
>>>>> of the charge, violating dql_completed()'s invariant that completed
>>>>> bytes never exceed queued bytes (BUG_ON).
>>>>>
>>>>> Whether an SKB was BQL-charged is tracked per-SKB using a VETH_BQL_FLAG
>>>>> bit in the ptr_ring pointer (BIT(1), alongside the existing VETH_XDP_FLAG
>>>>> BIT(0)). The do_bql flag from veth_xmit() propagates through
>>>>> veth_forward_skb() and veth_xdp_rx() into the ptr_ring entry. On the
>>>>> completion side in veth_xdp_rcv(), veth_ptr_is_bql() reads the flag to
>>>>> decide whether to call netdev_tx_completed_queue(). Per-SKB tracking is
>>>>> necessary because the qdisc can be replaced live (e.g. noqueue->sfq or
>>>>> vice versa via 'tc qdisc replace') while SKBs are already in-flight in
>>>>> the ptr_ring. SKBs charged under the old qdisc must complete correctly
>>>>> regardless of what qdisc is attached when the consumer runs, so each
>>>>> SKB carries its own BQL-charged state rather than re-checking the peer's
>>>>> qdisc at completion time.
>>>>
>>>> It's not completely obvious to me why BQL can't be active regardless of
>>>> whether there's a qdisc installed or not? If there's no qdisc, shouldn't
>>>> BQL auto-tune to a higher value because the queue runs empty more?
>>>>
>>>
>>> When the net_device doesn't have a qdisc we hit this code path:
>>>    - [0]
>>> https://elixir.bootlin.com/linux/v7.0-rc4/source/net/core/dev.c#L4806-L4852
>>>    - Notice the check if (!netif_xmit_stopped(txq))
>>>    - resulting in "Virtual device %s asks to queue packet!"
>>>
>>> We cannot unconditionally track BQL, as calling netdev_tx_sent_queue()
>>> can result in setting STACK_XOFF.  That makes the above code drop
>>> packets and complain: there is no qdisc to requeue and store the
>>> back-pressured packet.
>> 
>> Ah, right. I realised the packet would be dropped, of course, but I did
>> not realise the stack would complain. That seems... odd? Why not just
>> get rid of the complaint instead of having this kludge to work around
>> it?
>> 
>
> The BQL code manipulates the txq (struct dql) and this requires correct
> locking.  Using noqueue on veth doesn't do the correct locking.
>
> In the linked[0] code you were likely fooled by this code line[1]:
>   HARD_TX_LOCK(dev, txq, cpu);
> It is actually not taking the lock, because veth is an lltx device.
> (Don't be fooled by the annotation; it is there for
> WARN_CONTEXT_ANALYSIS[2].)
>
> For fun I actually implemented it to see what happened.  And I did
> manage to crash the kernel on the DQL-internal BUG_ON.  Analyzing the
> BUG_ON I realized that my scheme of charging BQL and then undoing the
> charge (if the ptr_ring was full) isn't correct; there is a race.  The
> BQL call netdev_tx_completed_queue() is strictly to be used by the
> consumer, not by the producer (like I did).
>
> I'm now working on different approaches for the undo...

Alright, good to hear that my obtuse questions turned out to be helpful ;)

-Toke


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-03-23 10:10 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-18 13:48 [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support hawk
2026-03-18 13:48 ` [RFC PATCH net-next 1/6] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
2026-03-18 13:48 ` [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
2026-03-18 14:28   ` Toke Høiland-Jørgensen
2026-03-18 16:24     ` Jesper Dangaard Brouer
2026-03-19 10:04       ` Toke Høiland-Jørgensen
2026-03-20 14:50         ` Jesper Dangaard Brouer
2026-03-23 10:10           ` Toke Høiland-Jørgensen
2026-03-18 13:48 ` [RFC PATCH net-next 3/6] veth: add tx_timeout watchdog as BQL safety net hawk
2026-03-18 13:48 ` [RFC PATCH net-next 4/6] net: sched: add timeout count to NETDEV WATCHDOG message hawk
2026-03-18 13:48 ` [RFC PATCH net-next 5/6] selftests: net: add veth BQL stress test hawk
2026-03-18 13:48 ` [RFC PATCH net-next 6/6] net_sched: codel: fix stale state for empty flows in fq_codel hawk
2026-03-18 14:10   ` Toke Høiland-Jørgensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox