From: hawk@kernel.org
To: netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
"Chris Arges" <carges@cloudflare.com>,
"Mike Freemon" <mfreemon@cloudflare.com>,
"Toke Høiland-Jørgensen" <toke@toke.dk>,
"Shuah Khan" <shuah@kernel.org>,
linux-kselftest@vger.kernel.org,
"Jonas Köppeler" <j.koeppeler@tu-berlin.de>,
"Alexei Starovoitov" <ast@kernel.org>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"David S. Miller" <davem@davemloft.net>,
"Jakub Kicinski" <kuba@kernel.org>,
"John Fastabend" <john.fastabend@gmail.com>,
"Stanislav Fomichev" <sdf@fomichev.me>,
bpf@vger.kernel.org
Subject: [PATCH net-next v2 0/5] veth: add Byte Queue Limits (BQL) support
Date: Mon, 13 Apr 2026 11:44:33 +0200
Message-ID: <20260413094442.1376022-1-hawk@kernel.org>
From: Jesper Dangaard Brouer <hawk@kernel.org>
This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.

Problem:
veth's 256-entry ptr_ring acts as a "dark buffer" -- packets queued
there are invisible to the qdisc's AQM. Under load, the ring fills
completely (DRV_XOFF backpressure), adding up to 256 packets of
unmanaged latency before the qdisc even sees congestion.

Solution:
BQL (STACK_XOFF) dynamically limits in-flight packets, stopping the
queue before the ring fills. This keeps the ring shallow and pushes
excess packets into the qdisc, where sojourn-based AQM can measure
and drop them.

Test setup: veth pair, UDP flood, 13000 iptables rules in consumer
namespace (slows NAPI-64 cycle to ~6-7ms), ping measures RTT under load.

                 BQL off                 BQL on
  fq_codel:      RTT ~22ms, 4% loss      RTT ~1.3ms, 0% loss
  sfq:           RTT ~24ms, 0% loss      RTT ~1.5ms, 0% loss

BQL reduces ping RTT by roughly 16-17x for both qdiscs. Consumer
throughput is unchanged (~10K pps), so BQL adds no measurable
throughput overhead in this test.
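
As a rough cross-check of the BQL-off numbers, the time the consumer
needs to drain a full ring can be estimated from the figures quoted
above (256 ring entries, 64 packets per NAPI cycle, ~6-7ms per cycle).
This is a back-of-the-envelope sketch in Python, not part of the series.
The helper names ring_drain_ms and consumer_pps are made up for
illustration, and the model assumes a completely full ring and ignores
every other source of delay.

```python
RING_SIZE = 256    # veth ptr_ring entries (from the problem statement)
NAPI_BUDGET = 64   # packets handled per NAPI poll cycle (from the test setup)


def ring_drain_ms(entries, cycle_ms, budget=NAPI_BUDGET):
    """Estimate how long the consumer needs to drain `entries` packets."""
    return entries / budget * cycle_ms


def consumer_pps(cycle_ms, budget=NAPI_BUDGET):
    """Packets per second when each cycle handles `budget` packets."""
    return budget / (cycle_ms / 1000.0)


# The test setup quotes a NAPI cycle of ~6-7ms; evaluate both ends.
for cycle_ms in (6.0, 7.0):
    print(f"cycle {cycle_ms:.0f}ms: "
          f"~{consumer_pps(cycle_ms):.0f} pps, "
          f"full ring drains in ~{ring_drain_ms(RING_SIZE, cycle_ms):.0f}ms")
```

Under these assumptions a full ring holds roughly 24-28ms worth of
packets, which is the same order of magnitude as the ~22-24ms RTT
measured with BQL off.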

CoDel bug discovered during BQL development:
Our original motivation for BQL was fq_codel ping loss observed under
load (4-26% depending on NAPI cycle time). Investigating this led us
to discover a bug in the CoDel implementation: codel_dequeue() does
not reset vars->first_above_time when a flow goes empty, contrary to
the reference algorithm. This causes stale CoDel state to persist
across empty periods in fq_codel's per-flow queues, penalizing sparse
flows like ICMP ping. A fix for this has been applied to the net tree:
https://git.kernel.org/netdev/net/c/815980fe6dbb
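
The effect of the stale state can be illustrated with a heavily
simplified Python model of CoDel's "above target" tracking. This is a
sketch, not the kernel code: FlowState, should_drop, dequeue_empty and
sparse_flow_verdict are hypothetical names, time is in milliseconds, and
only the first_above_time logic is modelled (no dropping state, no
control law). The 5ms target and 100ms interval are the usual CoDel
defaults.

```python
TARGET_MS = 5.0      # CoDel default target sojourn time
INTERVAL_MS = 100.0  # CoDel default interval


class FlowState:
    """Per-flow CoDel state, reduced to the single field discussed here."""

    def __init__(self):
        self.first_above_time = 0.0  # 0 means "not currently above target"


def should_drop(state, sojourn_ms, now_ms):
    """Return True once sojourn has stayed above target for a full interval."""
    if sojourn_ms < TARGET_MS:
        state.first_above_time = 0.0
        return False
    if state.first_above_time == 0.0:
        # First packet above target: arm the timer, do not drop yet.
        state.first_above_time = now_ms + INTERVAL_MS
        return False
    return now_ms >= state.first_above_time


def dequeue_empty(state, reset_on_empty):
    """Model a dequeue attempt that finds the flow's queue empty."""
    if reset_on_empty:
        state.first_above_time = 0.0  # behaviour of the reference algorithm


def sparse_flow_verdict(reset_on_empty):
    """One slow packet, an empty period, then another slow packet much later."""
    state = FlowState()
    should_drop(state, sojourn_ms=10.0, now_ms=1.0)  # arms the timer
    dequeue_empty(state, reset_on_empty)             # flow goes empty
    return should_drop(state, sojourn_ms=10.0, now_ms=1000.0)


print("without reset on empty:", sparse_flow_verdict(reset_on_empty=False))
print("with reset on empty:   ", sparse_flow_verdict(reset_on_empty=True))
```

In this model, the second packet of a sparse flow is judged against a
timer that was armed before the flow went empty, so it becomes eligible
for dropping although the flow never built a standing queue. With the
reset in place the timer is simply re-armed.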
BQL remains valuable independently: it reduces RTT by ~17x by moving
buffering from the dark ptr_ring into the qdisc. Additionally, BQL
clears STACK_XOFF per-SKB as each packet completes, rather than
batch-waking after 64 packets (DRV_XOFF). This keeps sojourn times
below fq_codel's target, preventing CoDel from entering dropping
state on non-congested flows in the first place.

Key design decisions:
- Charge-under-lock in veth_xdp_rx(): The BQL charge must precede
the ptr_ring produce, because the NAPI consumer can run on another
CPU and complete the SKB immediately after it becomes visible. To
avoid a pre-charge/undo pattern, the charge is done under the
ptr_ring producer_lock after confirming the ring is not full. BQL
is only charged when produce is guaranteed to succeed, keeping
num_queued monotonically increasing. HARD_TX_LOCK already
serializes dql_queued() (veth requires a qdisc for BQL); taking the
ptr_ring lock as well would additionally let noqueue work correctly.
- Per-SKB BQL tracking via pointer tag: A VETH_BQL_FLAG bit in the
ptr_ring pointer records whether each SKB was BQL-charged. This is
necessary because the qdisc can be replaced live (noqueue->sfq or
vice versa) while SKBs are in-flight -- the completion side must
know the charge state that was decided at enqueue time.
- IFF_NO_QUEUE + BQL coexistence: A new dev->bql flag enables BQL
sysfs exposure for IFF_NO_QUEUE devices that opt in to DQL
accounting, without changing IFF_NO_QUEUE semantics.
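
The first two design points above (charge under the producer lock, and
a per-entry tag recording the charge decision) can be sketched with a
small Python toy model. It is not the veth implementation: ToyVethRing
and live_qdisc_switch_demo are hypothetical names, the ring is a plain
deque, "pointers" are aligned integers, and the numeric value chosen for
VETH_BQL_FLAG is an assumption made for illustration only.

```python
import threading
from collections import deque

VETH_BQL_FLAG = 0x2  # tag bit; name from the cover letter, value assumed


class ToyVethRing:
    """Single-producer toy model of the ring plus BQL-style counters."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()
        self.producer_lock = threading.Lock()
        self.num_queued = 0      # charged at enqueue time
        self.num_completed = 0   # credited at completion time

    def produce(self, skb_addr, bql_active):
        assert skb_addr % 8 == 0, "aligned address keeps the low bits free"
        with self.producer_lock:
            if len(self.entries) >= self.size:
                return False     # ring full: nothing charged, nothing to undo
            tagged = skb_addr
            if bql_active:
                self.num_queued += 1     # charge before the entry is visible
                tagged |= VETH_BQL_FLAG  # record the decision in the entry
            self.entries.append(tagged)
            return True

    def consume(self):
        if not self.entries:
            return None
        tagged = self.entries.popleft()
        if tagged & VETH_BQL_FLAG:
            self.num_completed += 1      # complete only what was charged
        return tagged & ~VETH_BQL_FLAG   # strip the tag to recover the address


def live_qdisc_switch_demo():
    """Two charged packets, then one uncharged after a qdisc change."""
    ring = ToyVethRing(size=4)
    ring.produce(0x1000, bql_active=True)
    ring.produce(0x1008, bql_active=True)
    ring.produce(0x1010, bql_active=False)  # e.g. qdisc replaced by noqueue
    addrs = [ring.consume() for _ in range(3)]
    return ring.num_queued, ring.num_completed, addrs


print(live_qdisc_switch_demo())
```

Because the completion side looks only at the tag stored with each
entry, the queued and completed counters stay balanced even though the
charging policy changed while entries were in flight. A full ring
returns early without touching num_queued, so no undo path is needed.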

Background and acknowledgments:
Mike Freemon reported the veth dark buffer problem internally at
Cloudflare and showed that recompiling the kernel with a ptr_ring
size of 30 (down from 256) made fq_codel work dramatically better.
This was the primary motivation for a proper BQL solution that
achieves the same effect dynamically without a kernel rebuild.
Chris Arges wrote a reproducer for the dark buffer latency problem:
https://github.com/netoptimizer/veth-backpressure-performance-testing
This is where we first observed ping packets being dropped under
fq_codel, which became our secondary motivation for BQL. In
production we switched to SFQ on veth devices as a workaround.
Jonas Köppeler provided extensive testing and code review.
Together we discovered that the fq_codel ping loss was actually a
12-year-old CoDel bug (stale first_above_time in empty flows), not
caused by the dark buffer itself. A fix has been applied to the net tree:
https://git.kernel.org/netdev/net/c/815980fe6dbb

Patch overview:
1. net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
2. veth: implement Byte Queue Limits (BQL) for latency reduction
3. veth: add tx_timeout watchdog as BQL safety net
4. net: sched: add timeout count to NETDEV WATCHDOG message
5. selftests: net: add veth BQL stress test
Jesper Dangaard Brouer (5):
net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
veth: implement Byte Queue Limits (BQL) for latency reduction
veth: add tx_timeout watchdog as BQL safety net
net: sched: add timeout count to NETDEV WATCHDOG message
selftests: net: add veth BQL stress test
.../networking/net_cachelines/net_device.rst | 1 +
drivers/net/veth.c | 92 +-
include/linux/netdevice.h | 2 +
net/core/net-sysfs.c | 8 +-
net/sched/sch_generic.c | 8 +-
tools/testing/selftests/net/Makefile | 3 +
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/napi_poll_hist.bt | 40 +
tools/testing/selftests/net/veth_bql_test.sh | 821 ++++++++++++++++++
.../selftests/net/veth_bql_test_virtme.sh | 124 +++
10 files changed, 1084 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/net/napi_poll_hist.bt
create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh
V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/
Changes since V1:
- Patch 1 (dev->bql flag): add kdoc entry for @bql in struct net_device.
- Patch 2 (veth BQL): charge fixed VETH_BQL_UNIT (1) per packet instead
of skb->len. veth has no link speed; the ptr_ring is packet-indexed.
Byte-based charging lets small packets sneak many entries into the ring.
Testing: min-size packet flood causes 3.7x ping RTT degradation with
skb->len vs no change with fixed-unit charging.
- Patch 3 (tx_timeout watchdog): fix race with peer NAPI: replace
netdev_tx_reset_queue() with clear_bit(STACK_XOFF) + netif_tx_wake_queue()
to avoid dql_reset() racing with concurrent dql_completed().
- Patch 5 (selftests): fix shellcheck warnings and infos:
- Quote variables passed to kill_process and exit.
- Declare and assign local variables separately (SC2155).
- Use read -r to avoid mangling backslashes (SC2162).
- Add shellcheck disable comments for intentional word splitting
(set -- $line, tc $qdisc $opts) and indirect invocation (trap).
- Make iptables-restore failure a hard FAIL instead of continuing.
- Add veth_bql_test.sh to TEST_PROGS in net/Makefile.
- Add veth_bql_test_virtme.sh to TEST_FILES (needs kernel build tree).
- Add napi_poll_hist.bt to TEST_FILES in net/Makefile.
- Add CONFIG_NET_SCH_SFQ=m to net/config (default qdisc is sfq).
- Reduce default test duration from 300s to 30s for kselftest CI.
- Fix virtme wrapper: empty args bug, check vmlinux instead of test path.
- Cover letter: update CoDel fix reference to merged commit in net tree.
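
The Patch 2 change listed above (fixed-unit charging instead of
skb->len) can be illustrated with simple arithmetic. The Python sketch
below is not taken from the series: entries_admitted is a hypothetical
helper, both limit values are invented for illustration, and a real DQL
limit adapts dynamically rather than staying fixed.

```python
RING_SIZE = 256     # veth ptr_ring entries (from the cover letter)
VETH_BQL_UNIT = 1   # fixed per-packet charge (from the Patch 2 changelog)


def entries_admitted(limit, charge_per_packet, ring_size=RING_SIZE):
    """Ring entries in flight before a limit of `limit` units is reached."""
    return min(ring_size, limit // charge_per_packet)


BYTE_LIMIT = 30_000  # hypothetical byte-based limit, illustration only
PACKET_LIMIT = 20    # hypothetical packet-based limit, illustration only

for pkt_len in (1500, 64):
    by_len = entries_admitted(BYTE_LIMIT, pkt_len)
    by_unit = entries_admitted(PACKET_LIMIT, VETH_BQL_UNIT)
    print(f"{pkt_len:>4}B packets: skb->len charging admits {by_len} entries, "
          f"fixed-unit charging admits {by_unit} entries")
```

With a byte-based limit, the number of ring entries admitted grows as
packets shrink, up to the full ring for minimum-size packets in this
example. Since the consumer's cost is per ring entry, a per-packet
charge keeps the in-flight depth independent of packet size.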
Cc: Chris Arges <carges@cloudflare.com>
Cc: Mike Freemon <mfreemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-kselftest@vger.kernel.org
Cc: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Cc: kernel-team@cloudflare.com
--
2.43.0