From: hawk@kernel.org
To: netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
"Chris Arges" <carges@cloudflare.com>,
"Mike Freemon" <mfreemon@cloudflare.com>,
"Toke Høiland-Jørgensen" <toke@toke.dk>,
"Shuah Khan" <shuah@kernel.org>,
linux-kselftest@vger.kernel.org,
"Jonas Köppeler" <j.koeppeler@tu-berlin.de>,
"Alexei Starovoitov" <ast@kernel.org>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"David S. Miller" <davem@davemloft.net>,
"Jakub Kicinski" <kuba@kernel.org>,
"John Fastabend" <john.fastabend@gmail.com>,
"Stanislav Fomichev" <sdf@fomichev.me>,
bpf@vger.kernel.org
Subject: [PATCH net-next v2 0/5] veth: add Byte Queue Limits (BQL) support
Date: Mon, 13 Apr 2026 11:44:33 +0200
Message-ID: <20260413094442.1376022-1-hawk@kernel.org>
From: Jesper Dangaard Brouer <hawk@kernel.org>
This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.
Problem:
veth's 256-entry ptr_ring acts as a "dark buffer" -- packets queued
there are invisible to the qdisc's AQM. Under load, the ring fills
completely (DRV_XOFF backpressure), adding up to 256 packets of
unmanaged latency before the qdisc even sees congestion.
Solution:
BQL (STACK_XOFF) dynamically limits in-flight packets, stopping the
queue before the ring fills. This keeps the ring shallow and pushes
excess packets into the qdisc, where sojourn-based AQM can measure
and drop them.
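The accounting behind this can be illustrated with a toy model (a sketch only, not the kernel's dql implementation in include/linux/dynamic_queue_limits.h: the `limit` here is fixed, whereas real BQL adapts it dynamically, and the struct and helper names are made up for illustration):

```c
#include <stdbool.h>

/* Toy queue-limit state: in-flight units vs. a limit.
 * Real BQL adjusts the limit dynamically; here it is fixed. */
struct toy_bql {
	unsigned int num_queued;    /* charged at enqueue */
	unsigned int num_completed; /* credited at completion */
	unsigned int limit;         /* max in-flight units */
};

/* Charge one unit at enqueue; returns true if the queue should
 * now stop (STACK_XOFF), so further packets stay in the qdisc
 * where AQM can see them. */
static bool toy_bql_queued(struct toy_bql *b)
{
	b->num_queued++;
	return (b->num_queued - b->num_completed) >= b->limit;
}

/* Credit one unit at completion; returns true if a stopped
 * queue may be woken because in-flight fell below the limit. */
static bool toy_bql_completed(struct toy_bql *b)
{
	b->num_completed++;
	return (b->num_queued - b->num_completed) < b->limit;
}
```

The contrast with DRV_XOFF is that the stop threshold is the dynamic limit rather than the full 256-entry ring, and completions can wake the queue per packet rather than per 64-packet batch.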
Test setup: veth pair, UDP flood, 13000 iptables rules in the consumer
namespace (slowing a 64-packet NAPI cycle to ~6-7ms); ping measures RTT
under load.
              BQL off                  BQL on
  fq_codel:   RTT ~22ms, 4% loss      RTT ~1.3ms, 0% loss
  sfq:        RTT ~24ms, 0% loss      RTT ~1.5ms, 0% loss
BQL reduces ping RTT by ~17x for both qdiscs. Consumer throughput
is unchanged (~10K pps) -- BQL adds no overhead.
CoDel bug discovered during BQL development:
Our original motivation for BQL was fq_codel ping loss observed under
load (4-26% depending on NAPI cycle time). Investigating this led us
to discover a bug in the CoDel implementation: codel_dequeue() does
not reset vars->first_above_time when a flow goes empty, contrary to
the reference algorithm. This causes stale CoDel state to persist
across empty periods in fq_codel's per-flow queues, penalizing sparse
flows like ICMP ping. A fix for this has been applied to the net tree:
https://git.kernel.org/netdev/net/c/815980fe6dbb
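The effect of the missing reset can be seen in a simplified sketch of CoDel's above-target tracking (illustrative only; struct fields are a subset of the kernel's codel_vars, and codel_flow_empty() is a made-up name for the reset the fix performs when a flow drains):

```c
#include <stdbool.h>

/* Simplified per-flow CoDel state (subset of struct codel_vars). */
struct codel_vars {
	unsigned long first_above_time; /* 0 == not currently above target */
};

/* Core of CoDel's should_drop test: drop only once sojourn time
 * has stayed above target for a full interval. */
static bool codel_should_drop(struct codel_vars *vars, unsigned long now,
			      unsigned long sojourn, unsigned long target,
			      unsigned long interval)
{
	if (sojourn < target) {
		vars->first_above_time = 0; /* below target: reset */
		return false;
	}
	if (vars->first_above_time == 0) {
		/* first time above target: arm the interval timer */
		vars->first_above_time = now + interval;
		return false;
	}
	return now >= vars->first_above_time;
}

/* The reference algorithm resets state when the queue drains.
 * The bug: without this reset, first_above_time survives empty
 * periods, so a sparse flow (e.g. ping) inherits stale congestion
 * state and gets dropped despite long idle gaps. */
static void codel_flow_empty(struct codel_vars *vars)
{
	vars->first_above_time = 0;
}
```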
BQL remains valuable independently: it reduces RTT by ~17x by moving
buffering from the dark ptr_ring into the qdisc. Additionally, BQL
clears STACK_XOFF per-SKB as each packet completes, rather than
batch-waking after 64 packets (DRV_XOFF). This keeps sojourn times
below fq_codel's target, preventing CoDel from entering dropping
state on non-congested flows in the first place.
Key design decisions:
- Charge-under-lock in veth_xdp_rx(): The BQL charge must precede
the ptr_ring produce, because the NAPI consumer can run on another
CPU and complete the SKB immediately after it becomes visible. To
avoid a pre-charge/undo pattern, the charge is done under the
ptr_ring producer_lock after confirming the ring is not full. BQL
is only charged when produce is guaranteed to succeed, keeping
num_queued monotonically increasing. HARD_TX_LOCK already
serializes dql_queued() (veth requires a qdisc for BQL); taking
the charge under the ptr_ring lock additionally lets noqueue
work correctly.
- Per-SKB BQL tracking via pointer tag: A VETH_BQL_FLAG bit in the
ptr_ring pointer records whether each SKB was BQL-charged. This is
necessary because the qdisc can be replaced live (noqueue->sfq or
vice versa) while SKBs are in-flight -- the completion side must
know the charge state that was decided at enqueue time.
- IFF_NO_QUEUE + BQL coexistence: A new dev->bql flag enables BQL
sysfs exposure for IFF_NO_QUEUE devices that opt in to DQL
accounting, without changing IFF_NO_QUEUE semantics.
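The pointer-tag scheme from the second bullet can be sketched as follows (illustrative helpers only; the real patch manipulates ptr_ring entries inside drivers/net/veth.c, and the flag value here is assumed):

```c
#include <stdint.h>
#include <stdbool.h>

/* SKB pointers are at least word-aligned, so the low pointer bit
 * is free to record "this SKB was BQL-charged" at enqueue time.
 * The completion side must honor that decision even if the qdisc
 * was swapped (noqueue <-> sfq) while the SKB was in flight. */
#define VETH_BQL_FLAG 0x1UL

/* Encode the charge decision into the ring pointer at produce time. */
static void *veth_ptr_tag(void *skb, bool bql_charged)
{
	return (void *)((uintptr_t)skb | (bql_charged ? VETH_BQL_FLAG : 0));
}

/* Completion side: was this SKB charged against BQL? */
static bool veth_ptr_is_charged(void *ptr)
{
	return (uintptr_t)ptr & VETH_BQL_FLAG;
}

/* Recover the real SKB pointer before use. */
static void *veth_ptr_untag(void *ptr)
{
	return (void *)((uintptr_t)ptr & ~VETH_BQL_FLAG);
}
```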
Background and acknowledgments:
Mike Freemon reported the veth dark buffer problem internally at
Cloudflare and showed that recompiling the kernel with a ptr_ring
size of 30 (down from 256) made fq_codel work dramatically better.
This was the primary motivation for a proper BQL solution that
achieves the same effect dynamically without a kernel rebuild.
Chris Arges wrote a reproducer for the dark buffer latency problem:
https://github.com/netoptimizer/veth-backpressure-performance-testing
This is where we first observed ping packets being dropped under
fq_codel, which became our secondary motivation for BQL. In
production we switched to SFQ on veth devices as a workaround.
Jonas Köppeler provided extensive testing and code review.
Together we discovered that the fq_codel ping loss was actually a
12-year-old CoDel bug (stale first_above_time in empty flows), not
caused by the dark buffer itself. A fix has been applied to the net tree:
https://git.kernel.org/netdev/net/c/815980fe6dbb
Patch overview:
1. net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
2. veth: implement Byte Queue Limits (BQL) for latency reduction
3. veth: add tx_timeout watchdog as BQL safety net
4. net: sched: add timeout count to NETDEV WATCHDOG message
5. selftests: net: add veth BQL stress test
Jesper Dangaard Brouer (5):
net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
veth: implement Byte Queue Limits (BQL) for latency reduction
veth: add tx_timeout watchdog as BQL safety net
net: sched: add timeout count to NETDEV WATCHDOG message
selftests: net: add veth BQL stress test
.../networking/net_cachelines/net_device.rst | 1 +
drivers/net/veth.c | 92 +-
include/linux/netdevice.h | 2 +
net/core/net-sysfs.c | 8 +-
net/sched/sch_generic.c | 8 +-
tools/testing/selftests/net/Makefile | 3 +
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/napi_poll_hist.bt | 40 +
tools/testing/selftests/net/veth_bql_test.sh | 821 ++++++++++++++++++
.../selftests/net/veth_bql_test_virtme.sh | 124 +++
10 files changed, 1084 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/net/napi_poll_hist.bt
create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh
V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/
Changes since V1:
- Patch 1 (dev->bql flag): add kdoc entry for @bql in struct net_device.
- Patch 2 (veth BQL): charge fixed VETH_BQL_UNIT (1) per packet instead
of skb->len. veth has no link speed; the ptr_ring is packet-indexed.
Byte-based charging lets small packets sneak many entries into the ring.
Testing: min-size packet flood causes 3.7x ping RTT degradation with
skb->len vs no change with fixed-unit charging.
- Patch 3 (tx_timeout watchdog): fix race with peer NAPI: replace
netdev_tx_reset_queue() with clear_bit(STACK_XOFF) + netif_tx_wake_queue()
to avoid dql_reset() racing with concurrent dql_completed().
- Patch 5 (selftests): fix shellcheck warnings and infos:
- Quote variables passed to kill_process and exit.
- Declare and assign local variables separately (SC2155).
- Use read -r to avoid mangling backslashes (SC2162).
- Add shellcheck disable comments for intentional word splitting
(set -- $line, tc $qdisc $opts) and indirect invocation (trap).
- Make iptables-restore failure a hard FAIL instead of continuing.
- Add veth_bql_test.sh to TEST_PROGS in net/Makefile.
- Add veth_bql_test_virtme.sh to TEST_FILES (needs kernel build tree).
- Add napi_poll_hist.bt to TEST_FILES in net/Makefile.
- Add CONFIG_NET_SCH_SFQ=m to net/config (default qdisc is sfq).
- Reduce default test duration from 300s to 30s for kselftest CI.
- Fix virtme wrapper: empty args bug, check vmlinux instead of test path.
- Cover letter: update CoDel fix reference to merged commit in net tree.
Cc: Chris Arges <carges@cloudflare.com>
Cc: Mike Freemon <mfreemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-kselftest@vger.kernel.org
Cc: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Cc: kernel-team@cloudflare.com
--
2.43.0