* [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support
@ 2026-05-27 13:54 hawk
2026-05-27 13:54 ` [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
To: netdev
Cc: Jesper Dangaard Brouer, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Chris Arges,
Mike Freemon, Toke Høiland-Jørgensen,
Jonas Köppeler, Breno Leitao, Simon Schippers,
Simon Schippers, kernel-team
From: Jesper Dangaard Brouer <hawk@kernel.org>
This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.
Problem: veth's 256-entry ptr_ring acts as a "dark buffer" invisible
to the qdisc's AQM. Under load the ring fills, adding up to 256
packets of unmanaged latency before the qdisc sees congestion.
Solution: BQL stops the queue before the ring fills, pushing excess
packets into the qdisc where sojourn-based AQM can drop them. V6 adds
time-based completion coalescing (ethtool tx-usecs, default 100 us)
so DQL converges on a limit that bounds actual queuing delay rather
than oscillating at limit=2 with per-packet completion.
Test setup: veth pair, UDP flood, 13000 iptables rules in consumer
namespace (slows NAPI-64 cycle to ~6-7 ms). Ping measures RTT.
BQL off BQL on
fq_codel: RTT ~22 ms, 4% loss RTT ~1.3 ms, 0% loss
sfq: RTT ~24 ms, 0% loss RTT ~1.5 ms, 0% loss
BQL reduces ping RTT by ~17x for both qdiscs. Consumer throughput
unchanged.
Why BQL over configurable ring size (ethtool -G): BQL auto-tunes per
workload via DQL; a static ring size requires per-setup manual tuning
and a too-small ring drops XDP packets in batches of 16-64.
Selftests:
https://github.com/netoptimizer/veth-backpressure-performance-testing
Background:
Mike Freemon reported the veth dark buffer problem internally at
Cloudflare and showed that recompiling with ptr_ring size 30 made
fq_codel work dramatically better -- motivating a dynamic BQL
solution. Chris Arges wrote the reproducer. Jonas Koeppeler and
Simon Schippers provided extensive testing and code review.
During BQL development we also fixed an unrelated 12-year-old CoDel
bug (stale first_above_time in empty flows), see
commit 815980fe6dbb ("net_sched: codel: fix stale state for empty flows in fq_codel").
BQL remains valuable independently.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Chris Arges <carges@cloudflare.com>
Cc: Mike Freemon <mfreemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Cc: Breno Leitao <leitao@debian.org>
Cc: Simon Schippers <simon.schippers@tu-dortmund.de>
Cc: Simon Schippers <simon@schippers-hamm.de>
Cc: kernel-team@cloudflare.com
Jesper Dangaard Brouer (4):
net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
veth: implement Byte Queue Limits (BQL) for latency reduction
veth: add tx_timeout watchdog as BQL safety net
net: sched: add timeout count to NETDEV WATCHDOG message
Simon Schippers (1):
veth: time-based BQL completion coalescing via ethtool tx-usecs
.../networking/net_cachelines/net_device.rst | 1 +
drivers/net/veth.c | 198 +++++++++++++++++-
include/linux/netdevice.h | 2 +
net/core/net-sysfs.c | 8 +-
net/sched/sch_generic.c | 8 +-
5 files changed, 201 insertions(+), 16 deletions(-)
---
Changes since V5:
- Drop patch 1 (OOB txq fix) -- applied to net tree by Paolo as
08f566e8f83b, already in net-next.
- New patch 5: time-based BQL completion coalescing via ethtool
tx-usecs (Simon Schippers). Resolves the throughput regression
Paolo flagged on V5: per-packet completion forced DQL to limit=2,
causing cache-line bouncing between producer/consumer CPUs.
Coalescing batches completions on a configurable time threshold
(default 100 us), letting DQL discover a higher useful limit.
- Patch 2: use __ptr_ring_check_produce() instead of open-coded
ring-full check (Simon Schippers nit).
Prior versions:
V5: https://lore.kernel.org/all/20260505132159.241305-1-hawk@kernel.org/
V4: https://lore.kernel.org/all/20260501071633.644353-1-hawk@kernel.org/
V3: https://lore.kernel.org/all/20260429172036.1028526-1-hawk@kernel.org/
V2: https://lore.kernel.org/all/20260413094442.1376022-1-hawk@kernel.org/
V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/
Changes since V4:
- New patch 1: fix OOB txq access in veth_poll() when veth peers have
asymmetric RX/TX queue counts. XDP redirect can deliver frames to
an RX queue index that exceeds the peer's TX queue count, causing
an out-of-bounds netdev_get_tx_queue() access. Found by sashiko-bot.
- Patch 3 (veth BQL): wake stopped peer txqs in veth_napi_del_range()
to clear DRV_XOFF after NAPI teardown. A concurrent veth_xmit()
can set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
synchronize_net(); with NAPI gone, no veth_poll() clears it.
Guarded by netif_running() to skip during device close.
Changes since V3:
- Drop selftest patch (patch 5 from V3) per maintainer request.
- Rebase on latest net-next.
Changes since V2:
- Patch 2 (veth BQL): fix syzbot WARNING in veth_napi_del_range():
clamp BQL reset loop to peer's real_num_tx_queues. The loop was
iterating dev->real_num_rx_queues but indexing peer's txq[], which
goes out of bounds when the peer has fewer TX queues (e.g. veth
enslaved to a bond with XDP attached).
Changes since V1:
- Patch 1 (dev->bql flag): add kdoc entry for @bql in struct net_device.
- Patch 2 (veth BQL): charge fixed VETH_BQL_UNIT (1) per packet instead
of skb->len. veth has no link speed; the ptr_ring is packet-indexed.
Byte-based charging lets small packets sneak many entries into the ring.
Testing: min-size packet flood causes 3.7x ping RTT degradation with
skb->len vs no change with fixed-unit charging.
- Patch 3 (tx_timeout watchdog): fix race with peer NAPI: replace
netdev_tx_reset_queue() with clear_bit(STACK_XOFF) + netif_tx_wake_queue()
to avoid dql_reset() racing with concurrent dql_completed().
- Cover letter: update CoDel fix reference to merged commit in net tree.
--
2.43.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
@ 2026-05-27 13:54 ` hawk
2026-05-27 13:54 ` [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
To: netdev
Cc: Jesper Dangaard Brouer, Jonas Köppeler, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, Kuniyuki Iwashima,
Stanislav Fomichev, Christian Brauner, Krishna Kumar, Yajun Deng,
linux-doc, linux-kernel
From: Jesper Dangaard Brouer <hawk@kernel.org>
Virtual devices with IFF_NO_QUEUE or lltx are excluded from BQL sysfs
by netdev_uses_bql(), since they traditionally lack real hardware
queues. However, some virtual devices like veth implement a real
ptr_ring FIFO with NAPI processing and benefit from BQL to limit
in-flight bytes and reduce latency.
Add a per-device 'bql' bitfield boolean in the priv_flags_slow section
of struct net_device. When set, it overrides the IFF_NO_QUEUE/lltx
exclusion and exposes BQL sysfs entries (/sys/class/net/<dev>/queues/
tx-<n>/byte_queue_limits/). The flag is still gated on CONFIG_BQL.
This allows drivers that use BQL despite being IFF_NO_QUEUE to opt in
to sysfs visibility for monitoring and debugging.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
Documentation/networking/net_cachelines/net_device.rst | 1 +
include/linux/netdevice.h | 2 ++
net/core/net-sysfs.c | 8 +++++++-
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 7b3392553fd6..62df1e09656b 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -170,6 +170,7 @@ unsigned_long:1 see_all_hwtstamp_requests
unsigned_long:1 change_proto_down
unsigned_long:1 netns_immutable
unsigned_long:1 fcoe_mtu
+unsigned_long:1 bql netdev_uses_bql(net-sysfs.c)
struct list_head net_notifier_list
struct macsec_ops* macsec_ops
struct udp_tunnel_nic_info* udp_tunnel_nic_info
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bf3dd9b2c1a7..6b02e82d9625 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2077,6 +2077,7 @@ enum netdev_reg_state {
* @change_proto_down: device supports setting carrier via IFLA_PROTO_DOWN
* @netns_immutable: interface can't change network namespaces
* @fcoe_mtu: device supports maximum FCoE MTU, 2158 bytes
+ * @bql: device uses BQL (DQL sysfs) despite having IFF_NO_QUEUE
*
* @net_notifier_list: List of per-net netdev notifier block
* that follow this device when it is moved
@@ -2491,6 +2492,7 @@ struct net_device {
unsigned long change_proto_down:1;
unsigned long netns_immutable:1;
unsigned long fcoe_mtu:1;
+ unsigned long bql:1;
struct list_head net_notifier_list;
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 3318b5666e43..82833e5dae03 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1945,10 +1945,16 @@ static const struct kobj_type netdev_queue_ktype = {
static bool netdev_uses_bql(const struct net_device *dev)
{
+ if (!IS_ENABLED(CONFIG_BQL))
+ return false;
+
+ if (dev->bql)
+ return true;
+
if (dev->lltx || (dev->priv_flags & IFF_NO_QUEUE))
return false;
- return IS_ENABLED(CONFIG_BQL);
+ return true;
}
static int netdev_queue_add_kobject(struct net_device *dev, int index)
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
2026-05-27 13:54 ` [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
@ 2026-05-27 13:54 ` hawk
2026-05-27 13:54 ` [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net hawk
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
To: netdev
Cc: Jesper Dangaard Brouer, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Stanislav Fomichev, linux-kernel,
bpf
From: Jesper Dangaard Brouer <hawk@kernel.org>
Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") gave qdiscs control over veth by returning
NETDEV_TX_BUSY when the ptr_ring is full (DRV_XOFF). That commit noted
a known limitation: the 256-entry ptr_ring sits in front of the qdisc as
a dark buffer, adding base latency because the qdisc has no visibility
into how many bytes are already queued there.
Add BQL support so the qdisc gets feedback and can begin shaping traffic
before the ring fills. In testing with fq_codel, BQL reduces ping RTT
under UDP load from ~6.61ms to ~0.36ms (18x).
Charge a fixed VETH_BQL_UNIT (1) per packet rather than skb->len, so
the DQL limit tracks packets-in-flight. Unlike a physical NIC, veth
has no link speed -- the ptr_ring drains at CPU speed and is
packet-indexed, not byte-indexed, so bytes are not the natural unit.
With byte-based charging, small packets sneak many more entries into
the ring before STACK_XOFF fires, deepening the dark buffer under
mixed-size workloads. Testing with a concurrent min-size packet flood
shows 3.7x ping RTT degradation with skb->len charging versus no
change with fixed-unit charging.
Charge BQL inside veth_xdp_rx() under the ptr_ring producer_lock, after
confirming the ring is not full. The charge must precede the produce
because the NAPI consumer can run on another CPU and complete the SKB
the instant it becomes visible in the ring. Doing both under the same
lock avoids a pre-charge/undo pattern -- BQL is only charged when
produce is guaranteed to succeed.
BQL is enabled only when a real qdisc is attached (guarded by
!qdisc_txq_has_no_queue), as HARD_TX_LOCK provides serialization
for TXQ modification like dql_queued(). For lltx devices, like veth,
this HARD_TX_LOCK serialization isn't provided. The ptr_ring
producer_lock provides additional serialization that would allow
BQL to work correctly even with noqueue, though that combination
is not currently enabled, as the netstack will drop and warn.
Track per-SKB BQL state via a VETH_BQL_FLAG pointer tag in the ptr_ring
entry. This is necessary because the qdisc can be replaced live while
SKBs are in-flight -- each SKB must carry the charge decision made at
enqueue time rather than re-checking the peer's qdisc at completion.
Complete per-SKB in veth_xdp_rcv() rather than in bulk, so STACK_XOFF
clears promptly when producer and consumer run on different CPUs.
BQL introduces a second independent queue-stop mechanism (STACK_XOFF)
alongside the existing DRV_XOFF (ring full). Both must be clear for
the queue to transmit. Reset BQL state in veth_napi_del_range() after
synchronize_net() to avoid racing with in-flight veth_poll() calls.
Clamp the reset loop to the peer's real_num_tx_queues, since the peer
may have fewer TX queues than the local device has RX queues (e.g. when
veth is enslaved to a bond with XDP attached).
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 86 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 75 insertions(+), 11 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0cfb19b760dd..21ff78533943 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -34,9 +34,13 @@
#define DRV_VERSION "1.0"
#define VETH_XDP_FLAG BIT(0)
+#define VETH_BQL_FLAG BIT(1)
#define VETH_RING_SIZE 256
#define VETH_XDP_HEADROOM (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+/* Fixed BQL charge: DQL limit tracks packets-in-flight, not bytes */
+#define VETH_BQL_UNIT 1
+
#define VETH_XDP_TX_BULK_SIZE 16
#define VETH_XDP_BATCH 16
@@ -280,6 +284,21 @@ static bool veth_is_xdp_frame(void *ptr)
return (unsigned long)ptr & VETH_XDP_FLAG;
}
+static bool veth_ptr_is_bql(void *ptr)
+{
+ return (unsigned long)ptr & VETH_BQL_FLAG;
+}
+
+static struct sk_buff *veth_ptr_to_skb(void *ptr)
+{
+ return (void *)((unsigned long)ptr & ~VETH_BQL_FLAG);
+}
+
+static void *veth_skb_to_ptr(struct sk_buff *skb, bool bql)
+{
+ return bql ? (void *)((unsigned long)skb | VETH_BQL_FLAG) : skb;
+}
+
static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
{
return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
@@ -295,7 +314,7 @@ static void veth_ptr_free(void *ptr)
if (veth_is_xdp_frame(ptr))
xdp_return_frame(veth_ptr_to_xdp(ptr));
else
- kfree_skb(ptr);
+ kfree_skb(veth_ptr_to_skb(ptr));
}
static void __veth_xdp_flush(struct veth_rq *rq)
@@ -309,19 +328,33 @@ static void __veth_xdp_flush(struct veth_rq *rq)
}
}
-static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb, bool do_bql,
+ struct netdev_queue *txq)
{
- if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
+ struct ptr_ring *ring = &rq->xdp_ring;
+
+ spin_lock(&ring->producer_lock);
+ if (unlikely(__ptr_ring_check_produce(ring))) {
+ spin_unlock(&ring->producer_lock);
return NETDEV_TX_BUSY; /* signal qdisc layer */
+ }
+
+ /* BQL charge before produce; consumer cannot see entry yet */
+ if (do_bql)
+ netdev_tx_sent_queue(txq, VETH_BQL_UNIT);
+
+ __ptr_ring_produce(ring, veth_skb_to_ptr(skb, do_bql));
+ spin_unlock(&ring->producer_lock);
return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
}
static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
- struct veth_rq *rq, bool xdp)
+ struct veth_rq *rq, bool xdp, bool do_bql,
+ struct netdev_queue *txq)
{
return __dev_forward_skb(dev, skb) ?: xdp ?
- veth_xdp_rx(rq, skb) :
+ veth_xdp_rx(rq, skb, do_bql, txq) :
__netif_rx(skb);
}
@@ -348,10 +381,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct veth_rq *rq = NULL;
- struct netdev_queue *txq;
+ struct netdev_queue *txq = NULL;
struct net_device *rcv;
int length = skb->len;
bool use_napi = false;
+ bool do_bql = false;
int ret, rxq;
rcu_read_lock();
@@ -375,8 +409,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
skb_tx_timestamp(skb);
-
- ret = veth_forward_skb(rcv, skb, rq, use_napi);
+ if (rxq < dev->real_num_tx_queues) {
+ txq = netdev_get_tx_queue(dev, rxq);
+ /* BQL charge happens inside veth_xdp_rx() under producer_lock */
+ do_bql = use_napi && !qdisc_txq_has_no_queue(txq);
+ }
+ ret = veth_forward_skb(rcv, skb, rq, use_napi, do_bql, txq);
switch (ret) {
case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
if (!use_napi)
@@ -412,6 +450,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
net_crit_ratelimited("%s(%s): Invalid return code(%d)",
__func__, dev->name, ret);
}
+
rcu_read_unlock();
return ret;
@@ -900,7 +939,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
- struct veth_stats *stats)
+ struct veth_stats *stats,
+ struct netdev_queue *peer_txq)
{
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
@@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
}
} else {
/* ndo_start_xmit */
- struct sk_buff *skb = ptr;
+ bool bql_charged = veth_ptr_is_bql(ptr);
+ struct sk_buff *skb = veth_ptr_to_skb(ptr);
stats->xdp_bytes += skb->len;
+ if (peer_txq && bql_charged)
+ netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+
skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
if (skb) {
if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
@@ -976,7 +1020,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
xdp_set_return_frame_no_direct();
- done = veth_xdp_rcv(rq, budget, &bq, &stats);
+ done = veth_xdp_rcv(rq, budget, &bq, &stats, peer_txq);
if (stats.xdp_redirect > 0)
xdp_do_flush();
@@ -1074,6 +1118,7 @@ static int __veth_napi_enable(struct net_device *dev)
static void veth_napi_del_range(struct net_device *dev, int start, int end)
{
struct veth_priv *priv = netdev_priv(dev);
+ struct net_device *peer;
int i;
for (i = start; i < end; i++) {
@@ -1092,6 +1137,24 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
}
+ /* Reset BQL and wake stopped peer txqs. A concurrent veth_xmit()
+ * may have set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
+ * synchronize_net(), and NAPI can no longer clear it.
+ * Only wake when the device is still up.
+ */
+ peer = rtnl_dereference(priv->peer);
+ if (peer) {
+ int peer_end = min_t(int, end, peer->real_num_tx_queues);
+
+ for (i = start; i < peer_end; i++) {
+ struct netdev_queue *txq = netdev_get_tx_queue(peer, i);
+
+ netdev_tx_reset_queue(txq);
+ if (netif_running(dev))
+ netif_tx_wake_queue(txq);
+ }
+ }
+
for (i = start; i < end; i++) {
page_pool_destroy(priv->rq[i].page_pool);
priv->rq[i].page_pool = NULL;
@@ -1741,6 +1804,7 @@ static void veth_setup(struct net_device *dev)
dev->priv_flags |= IFF_PHONY_HEADROOM;
dev->priv_flags |= IFF_DISABLE_NETPOLL;
dev->lltx = true;
+ dev->bql = true;
dev->netdev_ops = &veth_netdev_ops;
dev->xdp_metadata_ops = &veth_xdp_metadata_ops;
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net
2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
2026-05-27 13:54 ` [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
2026-05-27 13:54 ` [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
@ 2026-05-27 13:54 ` hawk
2026-05-27 13:54 ` [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
4 siblings, 0 replies; 6+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
To: netdev
Cc: Jesper Dangaard Brouer, Jonas Köppeler, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-kernel
From: Jesper Dangaard Brouer <hawk@kernel.org>
With the introduction of BQL (Byte Queue Limits) for veth, there are
now two independent mechanisms that can stop a transmit queue:
- DRV_XOFF: set by netif_tx_stop_queue() when the ptr_ring is full
- STACK_XOFF: set by BQL when the byte-in-flight limit is reached
If either mechanism stalls without a corresponding wake/completion,
the queue stops permanently. Enable the net device watchdog timer and
implement ndo_tx_timeout as a failsafe recovery.
The timeout handler resets BQL state (clearing STACK_XOFF) and wakes
the queue (clearing DRV_XOFF), covering both stop mechanisms. The
watchdog fires after 16 seconds, which accommodates worst-case NAPI
processing (budget=64 packets x 250ms per-packet consumer delay)
without false positives under normal backpressure.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
drivers/net/veth.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 21ff78533943..d5675d9d5236 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1442,6 +1442,22 @@ static int veth_set_channels(struct net_device *dev,
goto out;
}
+static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+ struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
+
+ netdev_err(dev,
+ "veth backpressure(0x%lX) stalled(n:%ld) TXQ(%u) re-enable\n",
+ txq->state, atomic_long_read(&txq->trans_timeout), txqueue);
+
+ /* Cannot call netdev_tx_reset_queue(): dql_reset() races with
+ * peer NAPI calling dql_completed() concurrently.
+ * Just clear the stop bits; the qdisc will re-stop if still stuck.
+ */
+ clear_bit(__QUEUE_STATE_STACK_XOFF, &txq->state);
+ netif_tx_wake_queue(txq);
+}
+
static int veth_open(struct net_device *dev)
{
struct veth_priv *priv = netdev_priv(dev);
@@ -1780,6 +1796,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_bpf = veth_xdp,
.ndo_xdp_xmit = veth_ndo_xdp_xmit,
.ndo_get_peer_dev = veth_peer_dev,
+ .ndo_tx_timeout = veth_tx_timeout,
};
static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
@@ -1819,6 +1836,7 @@ static void veth_setup(struct net_device *dev)
dev->priv_destructor = veth_dev_free;
dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
dev->max_mtu = ETH_MAX_MTU;
+ dev->watchdog_timeo = msecs_to_jiffies(16000);
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message
2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
` (2 preceding siblings ...)
2026-05-27 13:54 ` [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net hawk
@ 2026-05-27 13:54 ` hawk
2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
4 siblings, 0 replies; 6+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
To: netdev
Cc: Jesper Dangaard Brouer, Jakub Kicinski, Jonas Köppeler,
Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, linux-kernel
From: Jesper Dangaard Brouer <hawk@kernel.org>
Add the per-queue timeout counter (trans_timeout) to the core NETDEV
WATCHDOG log message. This makes it easy to determine how frequently
a particular queue is stalling from a single log line, without having
to search through and correlate spaced-out log entries.
Useful for production monitoring where timeouts are spaced by the
watchdog interval, making frequency hard to judge.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/all/20251107175445.58eba452@kernel.org/
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
net/sched/sch_generic.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 237ee1cd0136..2aebab985c00 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -533,13 +533,12 @@ static void dev_watchdog(struct timer_list *t)
netif_running(dev) &&
netif_carrier_ok(dev)) {
unsigned int timedout_ms = 0;
+ struct netdev_queue *txq;
unsigned int i;
unsigned long trans_start;
unsigned long oldest_start = jiffies;
for (i = 0; i < dev->num_tx_queues; i++) {
- struct netdev_queue *txq;
-
txq = netdev_get_tx_queue(dev, i);
if (!netif_xmit_stopped(txq))
continue;
@@ -561,9 +560,10 @@ static void dev_watchdog(struct timer_list *t)
if (unlikely(timedout_ms)) {
trace_net_dev_xmit_timeout(dev, i);
- netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms\n",
+ netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms (n:%ld)\n",
raw_smp_processor_id(),
- i, timedout_ms);
+ i, timedout_ms,
+ atomic_long_read(&txq->trans_timeout));
netif_freeze_queues(dev);
dev->netdev_ops->ndo_tx_timeout(dev, i);
netif_unfreeze_queues(dev);
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
` (3 preceding siblings ...)
2026-05-27 13:54 ` [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
@ 2026-05-27 13:54 ` hawk
4 siblings, 0 replies; 6+ messages in thread
From: hawk @ 2026-05-27 13:54 UTC (permalink / raw)
To: netdev
Cc: Simon Schippers, Jesper Dangaard Brouer, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, linux-kernel, bpf
From: Simon Schippers <simon.schippers@tu-dortmund.de>
Per-packet BQL completion forces DQL to converge on limit=2, causing
excessive NAPI scheduling overhead and qdisc requeues.
Accumulate BQL completions and flush them when a configurable time
threshold is exceeded, letting DQL discover a limit that bounds actual
queuing delay to the configured interval. Coalescing state persists
across NAPI polls in struct veth_rq so completions can accumulate
beyond a single budget=64 cycle.
Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
setting tx-usecs to 0 disables coalescing and falls back to per-packet
completion.
ethtool -C <veth-dev> tx-usecs 500 # 500us coalescing
ethtool -C <veth-dev> tx-usecs 0 # per-packet (no coalescing)
Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
drivers/net/veth.c | 100 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 98 insertions(+), 2 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d5675d9d5236..743d17b37223 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -28,6 +28,7 @@
#include <linux/bpf_trace.h>
#include <linux/net_tstamp.h>
#include <linux/skbuff_ref.h>
+#include <linux/sched/clock.h>
#include <net/page_pool/helpers.h>
#define DRV_NAME "veth"
@@ -44,6 +45,8 @@
#define VETH_XDP_TX_BULK_SIZE 16
#define VETH_XDP_BATCH 16
+#define VETH_BQL_COAL_TX_USECS 100 /* default tx-usecs for BQL batching */
+
struct veth_stats {
u64 rx_drops;
/* xdp */
@@ -62,6 +65,11 @@ struct veth_rq_stats {
struct u64_stats_sync syncp;
};
+struct veth_bql_state {
+ u64 time; /* sched_clock() when current coalescing window started */
+ int n_bql; /* BQL completions batched in the current window */
+};
+
struct veth_rq {
struct napi_struct xdp_napi;
struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
@@ -69,6 +77,7 @@ struct veth_rq {
struct bpf_prog __rcu *xdp_prog;
struct xdp_mem_info xdp_mem;
struct veth_rq_stats stats;
+ struct veth_bql_state bql_state;
bool rx_notify_masked;
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
@@ -81,6 +90,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct veth_rq *rq;
unsigned int requested_headroom;
+ unsigned int tx_coal_usecs; /* BQL completion coalescing */
};
struct veth_xdp_tx_bq {
@@ -265,7 +275,30 @@ static void veth_get_channels(struct net_device *dev,
static int veth_set_channels(struct net_device *dev,
struct ethtool_channels *ch);
+static int veth_get_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+
+ ec->tx_coalesce_usecs = priv->tx_coal_usecs;
+ return 0;
+}
+
+static int veth_set_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+
+ priv->tx_coal_usecs = ec->tx_coalesce_usecs;
+ return 0;
+}
+
static const struct ethtool_ops veth_ethtool_ops = {
+ .supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
.get_drvinfo = veth_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_strings = veth_get_strings,
@@ -275,6 +308,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_channels = veth_get_channels,
.set_channels = veth_set_channels,
+ .get_coalesce = veth_get_coalesce,
+ .set_coalesce = veth_set_coalesce,
};
/* general routines */
@@ -937,13 +972,45 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
return NULL;
}
+static void veth_bql_complete(struct veth_bql_state *state,
+ struct netdev_queue *peer_txq)
+{
+ netdev_tx_completed_queue(peer_txq, state->n_bql,
+ state->n_bql * VETH_BQL_UNIT);
+ state->n_bql = 0;
+ state->time = sched_clock();
+}
+
+static void veth_bql_maybe_complete(struct veth_bql_state *state,
+ struct netdev_queue *peer_txq,
+ u64 coalescing_ns)
+{
+ if (state->n_bql && sched_clock() >= state->time + coalescing_ns)
+ veth_bql_complete(state, peer_txq);
+}
+
static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
struct veth_stats *stats,
struct netdev_queue *peer_txq)
{
+ struct veth_bql_state *state = &rq->bql_state;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
+ struct veth_priv *priv;
+ u64 bql_flush_ns;
+
+ priv = netdev_priv(rq->dev);
+ bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
+
+ /* Clamp stored timestamp in case we migrated to a CPU with a behind
+ * sched_clock(); prevents the deadline from never firing.
+ */
+ state->time = min(state->time, sched_clock());
+
+ /* Flush completions that timed out since the previous NAPI poll. */
+ if (peer_txq && bql_flush_ns)
+ veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -972,8 +1039,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct sk_buff *skb = veth_ptr_to_skb(ptr);
stats->xdp_bytes += skb->len;
- if (peer_txq && bql_charged)
- netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+ if (peer_txq && bql_charged) {
+ if (!bql_flush_ns) {
+ netdev_tx_completed_queue(peer_txq, 1,
+ VETH_BQL_UNIT);
+ } else {
+ state->n_bql++;
+ veth_bql_maybe_complete(state, peer_txq,
+ bql_flush_ns);
+ }
+ }
skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
if (skb) {
@@ -989,6 +1064,18 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
if (n_xdpf)
veth_xdp_rcv_bulk_skb(rq, xdpf, n_xdpf, bq, stats);
+ /* If the ring is now empty and the peer TX queue is stalled by DQL
+ * backpressure, release completions immediately to unblock it.
+ */
+ if (peer_txq && state->n_bql && __ptr_ring_empty(&rq->xdp_ring)) {
+ /* Pairs with smp_wmb() in __ptr_ring_produce(); ensure ring
+ * emptiness is observed before reading peer_txq->state.
+ */
+ smp_rmb();
+ if (test_bit(__QUEUE_STATE_STACK_XOFF, &peer_txq->state))
+ veth_bql_complete(state, peer_txq);
+ }
+
u64_stats_update_begin(&rq->stats.syncp);
rq->stats.vs.xdp_redirect += stats->xdp_redirect;
rq->stats.vs.xdp_bytes += stats->xdp_bytes;
@@ -1093,6 +1180,9 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end)
napi_enable(&rq->xdp_napi);
rcu_assign_pointer(priv->rq[i].napi, &priv->rq[i].xdp_napi);
+
+ rq->bql_state.time = sched_clock();
+ rq->bql_state.n_bql = 0;
}
return 0;
@@ -1134,6 +1224,8 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
struct veth_rq *rq = &priv->rq[i];
rq->rx_notify_masked = false;
+ rq->bql_state.n_bql = 0;
+ rq->bql_state.time = 0;
ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
}
@@ -1813,6 +1905,8 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
static void veth_setup(struct net_device *dev)
{
+ struct veth_priv *priv = netdev_priv(dev);
+
ether_setup(dev);
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
@@ -1838,6 +1932,8 @@ static void veth_setup(struct net_device *dev)
dev->max_mtu = ETH_MAX_MTU;
dev->watchdog_timeo = msecs_to_jiffies(16000);
+ priv->tx_coal_usecs = VETH_BQL_COAL_TX_USECS;
+
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-05-27 13:54 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-27 13:54 [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support hawk
2026-05-27 13:54 ` [PATCH net-next v6 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
2026-05-27 13:54 ` [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
2026-05-27 13:54 ` [PATCH net-next v6 3/5] veth: add tx_timeout watchdog as BQL safety net hawk
2026-05-27 13:54 ` [PATCH net-next v6 4/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox