From: hawk@kernel.org
To: netdev@vger.kernel.org
Cc: hawk@kernel.org, andrew+netdev@lunn.ch, davem@davemloft.net,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
horms@kernel.org, jhs@mojatatu.com, jiri@resnulli.us,
j.koeppeler@tu-berlin.de, kernel-team@cloudflare.com,
"Chris Arges" <chris.arges@gmail.com>,
"Mike Freemon" <mike.freemon@cloudflare.com>,
"Toke Høiland-Jørgensen" <toke@toke.dk>
Subject: [PATCH net-next 0/5] veth: add Byte Queue Limits (BQL) support
Date: Tue, 24 Mar 2026 18:46:58 +0100
Message-ID: <20260324174719.1224337-1-hawk@kernel.org>
From: Jesper Dangaard Brouer <hawk@kernel.org>
This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight bytes in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.
Problem:
veth's 256-entry ptr_ring acts as a "dark buffer" -- packets queued
there are invisible to the qdisc's AQM. Under load, the ring fills
completely (DRV_XOFF backpressure), adding up to 256 packets of
unmanaged latency before the qdisc even sees congestion.
Solution:
BQL (STACK_XOFF) dynamically limits in-flight bytes, stopping the
queue before the ring fills. This keeps the ring shallow and pushes
excess packets into the qdisc, where sojourn-based AQM can measure
and drop them.
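To make the mechanism concrete, here is a toy Python model (not the
veth code) of where a burst of packets ends up while the consumer is
stalled. The function name, the 1500-byte packet size and the 6000-byte
limit are assumptions picked for illustration; in the real driver the
limit is adapted dynamically by DQL rather than fixed.

```python
RING_SIZE = 256  # veth ptr_ring entries, as stated in this cover letter


def place_burst(n_pkts, pkt_len, bql_limit=None):
    """Return (in_ring, in_qdisc) for a burst sent while the consumer is stalled.

    bql_limit=None models DRV_XOFF only: the queue stops when the ring is full.
    Otherwise the queue also stops once in-flight bytes reach bql_limit
    (STACK_XOFF), so the remaining packets wait in the qdisc.
    """
    in_ring = 0
    inflight_bytes = 0
    for _ in range(n_pkts):
        if in_ring == RING_SIZE:
            break  # ring full: DRV_XOFF
        if bql_limit is not None and inflight_bytes >= bql_limit:
            break  # byte limit reached: STACK_XOFF
        in_ring += 1
        inflight_bytes += pkt_len
    return in_ring, n_pkts - in_ring


print(place_burst(1000, 1500))                  # no BQL: (256, 744)
print(place_burst(1000, 1500, bql_limit=6000))  # hypothetical limit: (4, 996)
```

With the byte limit in place only a handful of packets sit in the ring,
and the rest stay in the qdisc where AQM can see them.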
Test setup: veth pair, UDP flood, 13000 iptables rules in consumer
namespace (slows NAPI-64 cycle to ~6-7ms), ping measures RTT under load.
             BQL off                BQL on
  fq_codel:  RTT ~22ms, 4% loss     RTT ~1.3ms, 0% loss
  sfq:       RTT ~24ms, 0% loss     RTT ~1.5ms, 0% loss
BQL reduces ping RTT by ~17x for both qdiscs. Consumer throughput
is unchanged (~10K pps) -- BQL adds no measurable overhead in this test.
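A back-of-the-envelope check, using only figures quoted above and
assuming the consumer drains at a steady rate: a standing queue of N
packets adds roughly N / rate of delay. The helper name is made up for
this sketch.

```python
def queue_delay_ms(pkts_queued, drain_pps):
    """Delay added by a standing queue drained at a fixed packet rate."""
    return 1000.0 * pkts_queued / drain_pps


DRAIN_PPS = 10_000  # consumer throughput reported in the test above

print(queue_delay_ms(256, DRAIN_PPS))  # full ptr_ring
print(queue_delay_ms(64, DRAIN_PPS))   # one NAPI budget of 64 packets
```

A full ring works out to about 25.6 ms, the same order as the ~22-24 ms
RTT measured with BQL off, and 64 packets to about 6.4 ms, in line with
the ~6-7 ms NAPI cycle.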
CoDel bug discovered during BQL development:
Our original motivation for BQL was fq_codel ping loss observed under
load (4-26% depending on NAPI cycle time). Investigating this led us
to discover a bug in the CoDel implementation: codel_dequeue() does
not reset vars->first_above_time when a flow goes empty, contrary to
the reference algorithm. This causes stale CoDel state to persist
across empty periods in fq_codel's per-flow queues, penalizing sparse
flows like ICMP ping. A fix for this is submitted separately:
https://lore.kernel.org/netdev/20260323174920.253526-1-hawk@kernel.org
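The effect of the stale state can be illustrated with a heavily
simplified Python model. This is not the kernel's codel_dequeue(): the
control law and the dropping state are left out, all names are
hypothetical, and the function only answers "would CoDel consider this
packet droppable". The 5 ms target and 100 ms interval are the fq_codel
defaults; the 8 ms sojourn and 1 s ping spacing are assumed values.

```python
TARGET_MS = 5      # fq_codel default target
INTERVAL_MS = 100  # fq_codel default interval


class FlowState:
    def __init__(self):
        self.first_above_time = 0  # 0 means "not armed"


def should_drop(state, now_ms, sojourn_ms):
    """Simplified CoDel check for a packet dequeued at now_ms."""
    if sojourn_ms < TARGET_MS:
        state.first_above_time = 0
        return False
    if state.first_above_time == 0:
        # First packet above target: arm the timer, do not drop yet.
        state.first_above_time = now_ms + INTERVAL_MS
        return False
    return now_ms >= state.first_above_time


def flow_went_empty(state, reset_on_empty):
    """Called when the flow's queue drains."""
    if reset_on_empty:
        state.first_above_time = 0


def second_ping_dropped(reset_on_empty):
    """Two pings 1 s apart, each with 8 ms sojourn; the flow empties in between."""
    state = FlowState()
    should_drop(state, now_ms=1000, sojourn_ms=8)  # arms the timer at 1100
    flow_went_empty(state, reset_on_empty)
    return should_drop(state, now_ms=2000, sojourn_ms=8)


print(second_ping_dropped(reset_on_empty=False))  # stale state persists: True
print(second_ping_dropped(reset_on_empty=True))   # state reset on empty: False
```

Without the reset, the second ping inherits a timer that expired long
ago and is treated as if the flow had been above target for the whole
idle period.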
BQL remains valuable independently: it reduces RTT by ~17x by moving
buffering from the dark ptr_ring into the qdisc. Additionally, BQL
clears STACK_XOFF per-SKB as each packet completes, rather than
batch-waking after 64 packets (DRV_XOFF). This keeps sojourn times
below fq_codel's target, preventing CoDel from entering dropping
state on non-congested flows in the first place.
Key design decisions:
- Charge-under-lock in veth_xdp_rx(): The BQL charge must precede
the ptr_ring produce, because the NAPI consumer can run on another
CPU and complete the SKB immediately after it becomes visible. To
avoid a pre-charge/undo pattern, the charge is done under the
ptr_ring producer_lock after confirming the ring is not full. BQL
is only charged when produce is guaranteed to succeed, keeping
num_queued monotonically increasing. HARD_TX_LOCK already
serializes dql_queued() (veth requires a qdisc for BQL); charging
under the ptr_ring lock would additionally let noqueue work correctly.
- Per-SKB BQL tracking via pointer tag: A VETH_BQL_FLAG bit in the
ptr_ring pointer records whether each SKB was BQL-charged. This is
necessary because the qdisc can be replaced live (noqueue->sfq or
vice versa) while SKBs are in-flight -- the completion side must
know the charge state that was decided at enqueue time.
- IFF_NO_QUEUE + BQL coexistence: A new dev->bql flag enables BQL
sysfs exposure for IFF_NO_QUEUE devices that opt in to DQL
accounting, without changing IFF_NO_QUEUE semantics.
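The first two design points can be sketched together in a small Python
model. This is an illustration only, not the ptr_ring or DQL API: the
class, its methods and the tag bit value are assumptions (the actual
VETH_BQL_FLAG bit is not stated in this cover letter), and the demo is
single-threaded, so the lock only marks where the charge happens.

```python
import threading

BQL_TAG = 0x2  # hypothetical tag bit for this sketch


class ToyRing:
    """Toy stand-in for ptr_ring plus DQL byte counters."""

    def __init__(self, size):
        self.size = size
        self.slots = []            # tagged "pointers" (plain ints here)
        self.lens = {}             # untagged pointer -> packet length
        self.producer_lock = threading.Lock()
        self.num_queued = 0        # bytes charged at enqueue
        self.num_completed = 0     # bytes credited at completion

    def produce(self, ptr, length, bql_enabled):
        """Charge BQL only once the produce is known to succeed."""
        assert ptr & BQL_TAG == 0, "pointer must leave the tag bit free"
        with self.producer_lock:
            if len(self.slots) >= self.size:
                return False       # ring full: nothing charged, nothing to undo
            if bql_enabled:
                self.num_queued += length  # charge before the entry is visible
                ptr |= BQL_TAG             # record the decision in the entry
            self.lens[ptr & ~BQL_TAG] = length
            self.slots.append(ptr)
            return True

    def consume(self):
        """Complete one entry; credit BQL only if it was charged at enqueue."""
        tagged = self.slots.pop(0)
        ptr = tagged & ~BQL_TAG
        length = self.lens.pop(ptr)
        if tagged & BQL_TAG:
            self.num_completed += length
        return ptr


ring = ToyRing(size=2)
ring.produce(0x1000, 100, bql_enabled=False)        # sent under noqueue: not charged
ring.produce(0x2000, 200, bql_enabled=True)         # qdisc attached meanwhile: charged
print(ring.produce(0x3000, 300, bql_enabled=True))  # ring full: False, no charge
ring.consume()
ring.consume()
print(ring.num_queued, ring.num_completed)          # 200 200
```

Because the charge decision travels with each entry, the completion
side stays balanced even though the two packets were enqueued under
different qdisc configurations.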
Background and acknowledgments:
Mike Freemon reported the veth dark buffer problem internally at
Cloudflare and showed that recompiling the kernel with a ptr_ring
size of 30 (down from 256) made fq_codel work dramatically better.
This was the primary motivation for a proper BQL solution that
achieves the same effect dynamically without a kernel rebuild.
Chris Arges wrote a reproducer for the dark buffer latency problem:
https://github.com/netoptimizer/veth-backpressure-performance-testing
This is where we first observed ping packets being dropped under
fq_codel, which became our secondary motivation for BQL. In
production we switched to SFQ on veth devices as a workaround.
Jonas Koeppeler provided extensive testing and code review.
Together we discovered that the fq_codel ping loss was actually caused
by a 12-year-old CoDel bug (stale first_above_time in empty flows),
not by the dark buffer itself. A fix is submitted separately:
https://lore.kernel.org/netdev/20260323174920.253526-1-hawk@kernel.org
Patch overview:
1. net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
2. veth: implement Byte Queue Limits (BQL) for latency reduction
3. veth: add tx_timeout watchdog as BQL safety net
4. net: sched: add timeout count to NETDEV WATCHDOG message
5. selftests: net: add veth BQL stress test
Cc: Chris Arges <chris.arges@gmail.com>
Cc: Mike Freemon <mike.freemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Jesper Dangaard Brouer (5):
net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
veth: implement Byte Queue Limits (BQL) for latency reduction
veth: add tx_timeout watchdog as BQL safety net
net: sched: add timeout count to NETDEV WATCHDOG message
selftests: net: add veth BQL stress test
 .../networking/net_cachelines/net_device.rst |   1 +
 drivers/net/veth.c                           |  92 +-
 include/linux/netdevice.h                    |   1 +
 net/core/net-sysfs.c                         |   8 +-
 net/sched/sch_generic.c                      |   8 +-
 tools/testing/selftests/net/veth_bql_test.sh | 784 ++++++++++++++++++
 .../selftests/net/veth_bql_test_virtme.sh    | 124 +++
 7 files changed, 1001 insertions(+), 17 deletions(-)
create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh
--
2.43.0