From: hawk@kernel.org
To: netdev@vger.kernel.org
Cc: "Jesper Dangaard Brouer" <hawk@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
"Eric Dumazet" <edumazet@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
"Simon Horman" <horms@kernel.org>,
"Chris Arges" <carges@cloudflare.com>,
"Mike Freemon" <mfreemon@cloudflare.com>,
"Toke Høiland-Jørgensen" <toke@toke.dk>,
"Jonas Köppeler" <j.koeppeler@tu-berlin.de>,
"Breno Leitao" <leitao@debian.org>,
"Simon Schippers" <simon.schippers@tu-dortmund.de>,
"Simon Schippers" <simon@schippers-hamm.de>,
kernel-team@cloudflare.com
Subject: [PATCH net-next v6 0/5] veth: add Byte Queue Limits (BQL) support
Date: Wed, 27 May 2026 15:54:11 +0200
Message-ID: <20260527135418.1166665-1-hawk@kernel.org>
From: Jesper Dangaard Brouer <hawk@kernel.org>
This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.
Problem: veth's 256-entry ptr_ring acts as a "dark buffer" invisible
to the qdisc's AQM. Under load the ring fills, adding up to 256
packets of unmanaged latency before the qdisc sees congestion.
Solution: BQL stops the queue before the ring fills, pushing excess
packets into the qdisc where sojourn-based AQM can drop them. V6 adds
time-based completion coalescing (ethtool tx-usecs, default 100 us)
so DQL converges on a limit that bounds actual queuing delay rather
than oscillating at limit=2 with per-packet completion.
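The dark-buffer effect described above can be illustrated with a toy model. This is only a sketch, not the veth implementation: the arrival and drain rates are made-up numbers, and the in-flight limit is a fixed value here, whereas real DQL adjusts it dynamically.

```python
from collections import deque

RING_SIZE = 256  # veth ptr_ring entries, per the cover letter


def simulate(limit, arrivals_per_tick=100, drain_per_tick=64, ticks=50):
    """Toy producer/consumer model (hypothetical rates, fixed limit).

    limit=None models "BQL off": the queue only stops when the ring is full.
    Returns (peak ring occupancy, packets left in the qdisc).
    """
    ring = deque()
    qdisc = deque()
    peak_ring = 0
    cap = RING_SIZE if limit is None else min(limit, RING_SIZE)
    for tick in range(ticks):
        # New packets always enter the qdisc first.
        for _ in range(arrivals_per_tick):
            qdisc.append(tick)
        # The qdisc feeds the ring until the queue is stopped.
        while qdisc and len(ring) < cap:
            ring.append(qdisc.popleft())
        peak_ring = max(peak_ring, len(ring))
        # Consumer: one NAPI-like poll drains up to drain_per_tick packets.
        for _ in range(min(drain_per_tick, len(ring))):
            ring.popleft()
    return peak_ring, len(qdisc)


peak_off, backlog_off = simulate(limit=None)
peak_on, backlog_on = simulate(limit=64)
print(f"BQL off: ring peak {peak_off}, qdisc backlog {backlog_off}")
print(f"BQL on:  ring peak {peak_on}, qdisc backlog {backlog_on}")
```

With these illustrative rates the ring peaks at 256 entries without a limit and at 64 with one, and the excess backlog sits in the qdisc, where an AQM could act on it.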
Test setup: veth pair, UDP flood, 13000 iptables rules in consumer
namespace (slows NAPI-64 cycle to ~6-7 ms). Ping measures RTT.
             BQL off                 BQL on
  fq_codel:  RTT ~22 ms, 4% loss     RTT ~1.3 ms, 0% loss
  sfq:       RTT ~24 ms, 0% loss     RTT ~1.5 ms, 0% loss
BQL reduces ping RTT by ~17x for both qdiscs. Consumer throughput
unchanged.
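As a rough sanity check on these numbers, the arithmetic can be written out. This is a back-of-the-envelope sketch: it assumes "NAPI-64 cycle" means one poll that consumes up to 64 packets, and it uses the rounded RTT values from the table.

```python
RING_ENTRIES = 256      # ptr_ring size from the cover letter
NAPI_BUDGET = 64        # assumption: packets consumed per NAPI cycle
CYCLE_MS = (6.0, 7.0)   # reported duration of one slowed NAPI cycle

# Time to drain a completely full ring, in NAPI cycles and in milliseconds.
cycles_to_drain = RING_ENTRIES / NAPI_BUDGET
dark_buffer_ms = tuple(cycles_to_drain * ms for ms in CYCLE_MS)

# RTT improvement ratios computed from the rounded table values.
speedup_fq_codel = 22 / 1.3
speedup_sfq = 24 / 1.5

print(f"full ring drains in {cycles_to_drain:.0f} cycles, "
      f"{dark_buffer_ms[0]:.0f}-{dark_buffer_ms[1]:.0f} ms")
print(f"fq_codel speedup ~{speedup_fq_codel:.1f}x, sfq speedup ~{speedup_sfq:.1f}x")
```

Under that assumption a full ring accounts for roughly 24-28 ms of queuing delay, the same order of magnitude as the RTT measured with BQL off.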
Why BQL over configurable ring size (ethtool -G): BQL auto-tunes per
workload via DQL; a static ring size requires per-setup manual tuning
and a too-small ring drops XDP packets in batches of 16-64.
Selftests:
https://github.com/netoptimizer/veth-backpressure-performance-testing
Background:
Mike Freemon reported the veth dark buffer problem internally at
Cloudflare and showed that recompiling with ptr_ring size 30 made
fq_codel work dramatically better -- motivating a dynamic BQL
solution. Chris Arges wrote the reproducer. Jonas Koeppeler and
Simon Schippers provided extensive testing and code review.
During BQL development we also fixed an unrelated 12-year-old CoDel
bug (stale first_above_time in empty flows), see
commit 815980fe6dbb ("net_sched: codel: fix stale state for empty flows in fq_codel").
BQL remains valuable independently of that fix.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Chris Arges <carges@cloudflare.com>
Cc: Mike Freemon <mfreemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Cc: Breno Leitao <leitao@debian.org>
Cc: Simon Schippers <simon.schippers@tu-dortmund.de>
Cc: Simon Schippers <simon@schippers-hamm.de>
Cc: kernel-team@cloudflare.com
Jesper Dangaard Brouer (4):
  net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  veth: implement Byte Queue Limits (BQL) for latency reduction
  veth: add tx_timeout watchdog as BQL safety net
  net: sched: add timeout count to NETDEV WATCHDOG message

Simon Schippers (1):
  veth: time-based BQL completion coalescing via ethtool tx-usecs

 .../networking/net_cachelines/net_device.rst |   1 +
 drivers/net/veth.c                           | 198 +++++++++++++++++-
 include/linux/netdevice.h                    |   2 +
 net/core/net-sysfs.c                         |   8 +-
 net/sched/sch_generic.c                      |   8 +-
 5 files changed, 201 insertions(+), 16 deletions(-)
---
Changes since V5:
- Drop patch 1 (OOB txq fix) -- applied to net tree by Paolo as
08f566e8f83b, already in net-next.
- New patch 5: time-based BQL completion coalescing via ethtool
tx-usecs (Simon Schippers). Resolves the throughput regression
Paolo flagged on V5: per-packet completion forced DQL to limit=2,
causing cache-line bouncing between producer/consumer CPUs.
Coalescing batches completions on a configurable time threshold
(default 100 us), letting DQL discover a higher useful limit.
- Patch 2: use __ptr_ring_check_produce() instead of open-coded
ring-full check (Simon Schippers nit).
Prior versions:
V5: https://lore.kernel.org/all/20260505132159.241305-1-hawk@kernel.org/
V4: https://lore.kernel.org/all/20260501071633.644353-1-hawk@kernel.org/
V3: https://lore.kernel.org/all/20260429172036.1028526-1-hawk@kernel.org/
V2: https://lore.kernel.org/all/20260413094442.1376022-1-hawk@kernel.org/
V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/
Changes since V4:
- New patch 1: fix OOB txq access in veth_poll() when veth peers have
asymmetric RX/TX queue counts. XDP redirect can deliver frames to
an RX queue index that exceeds the peer's TX queue count, causing
an out-of-bounds netdev_get_tx_queue() access. Found by sashiko-bot.
- Patch 3 (veth BQL): wake stopped peer txqs in veth_napi_del_range()
to clear DRV_XOFF after NAPI teardown. A concurrent veth_xmit()
can set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
synchronize_net(); with NAPI gone, no veth_poll() clears it.
Guarded by netif_running() to skip during device close.
Changes since V3:
- Drop selftest patch (patch 5 from V3) per maintainer request.
- Rebase on latest net-next.
Changes since V2:
- Patch 2 (veth BQL): fix syzbot WARNING in veth_napi_del_range():
clamp BQL reset loop to peer's real_num_tx_queues. The loop was
iterating dev->real_num_rx_queues but indexing peer's txq[], which
goes out of bounds when the peer has fewer TX queues (e.g. veth
enslaved to a bond with XDP attached).
Changes since V1:
- Patch 1 (dev->bql flag): add kdoc entry for @bql in struct net_device.
- Patch 2 (veth BQL): charge fixed VETH_BQL_UNIT (1) per packet instead
of skb->len. veth has no link speed; the ptr_ring is packet-indexed.
Byte-based charging lets small packets sneak many entries into the ring.
Testing: min-size packet flood causes 3.7x ping RTT degradation with
skb->len vs no change with fixed-unit charging.
- Patch 3 (tx_timeout watchdog): fix race with peer NAPI: replace
netdev_tx_reset_queue() with clear_bit(STACK_XOFF) + netif_tx_wake_queue()
to avoid dql_reset() racing with concurrent dql_completed().
- Cover letter: update CoDel fix reference to merged commit in net tree.
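The V1 change from skb->len to fixed-unit charging can be illustrated with a small model. This is a simplified sketch, not the DQL code: the packet sizes and limits are hypothetical, and the stop condition (stop once the in-flight charge reaches the limit) is an approximation of the real accounting.

```python
VETH_BQL_UNIT = 1  # fixed per-packet charge named in the changelog


def entries_admitted(packet_len, limit, charge_bytes):
    """Count ring entries accepted before the in-flight charge reaches limit.

    charge_bytes=True charges packet_len per packet (skb->len style);
    charge_bytes=False charges VETH_BQL_UNIT per packet.
    """
    in_flight = 0
    entries = 0
    while in_flight < limit:
        in_flight += packet_len if charge_bytes else VETH_BQL_UNIT
        entries += 1
    return entries


# Hypothetical byte limit sized for two full-size (1500-byte) packets.
BYTE_LIMIT = 3000
PACKET_LIMIT = 2

for length in (1500, 60):
    by_bytes = entries_admitted(length, BYTE_LIMIT, charge_bytes=True)
    by_unit = entries_admitted(length, PACKET_LIMIT, charge_bytes=False)
    print(f"{length:4d}-byte packets: byte charging admits {by_bytes}, "
          f"unit charging admits {by_unit}")
```

In this model a byte limit that admits 2 full-size packets admits 50 packets of 60 bytes, while the fixed-unit limit admits 2 entries regardless of packet size.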
--
2.43.0