From mboxrd@z Thu Jan 1 00:00:00 1970
From: hawk@kernel.org
To: netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com, Jesper Dangaard Brouer, Chris Arges,
	Mike Freemon, Toke Høiland-Jørgensen, Shuah Khan,
	linux-kselftest@vger.kernel.org, Jonas Köppeler,
	Alexei Starovoitov, Daniel Borkmann, "David S. Miller",
	Jakub Kicinski, John Fastabend, Stanislav Fomichev,
	bpf@vger.kernel.org
Subject: [PATCH net-next v2 0/5] veth: add Byte Queue Limits (BQL) support
Date: Mon, 13 Apr 2026 11:44:33 +0200
Message-ID: <20260413094442.1376022-1-hawk@kernel.org>
X-Mailer: git-send-email 2.43.0
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Jesper Dangaard Brouer

This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.

Problem: veth's 256-entry ptr_ring acts as a "dark buffer" -- packets
queued there are invisible to the qdisc's AQM. Under load, the ring
fills completely (DRV_XOFF backpressure), adding up to 256 packets of
unmanaged latency before the qdisc even sees congestion.

Solution: BQL (STACK_XOFF) dynamically limits in-flight packets,
stopping the queue before the ring fills. This keeps the ring shallow
and pushes excess packets into the qdisc, where sojourn-based AQM can
measure and drop them.

Test setup: veth pair, UDP flood, 13000 iptables rules in the consumer
namespace (slows the NAPI-64 cycle to ~6-7 ms), ping measures RTT
under load.

             BQL off               BQL on
  fq_codel:  RTT ~22ms, 4% loss    RTT ~1.3ms, 0% loss
  sfq:       RTT ~24ms, 0% loss    RTT ~1.5ms, 0% loss

BQL reduces ping RTT by ~17x for both qdiscs. Consumer throughput is
unchanged (~10K pps) -- BQL adds no overhead.
CoDel bug discovered during BQL development: Our original motivation
for BQL was fq_codel ping loss observed under load (4-26% depending on
NAPI cycle time). Investigating this led us to discover a bug in the
CoDel implementation: codel_dequeue() does not reset
vars->first_above_time when a flow goes empty, contrary to the
reference algorithm. This causes stale CoDel state to persist across
empty periods in fq_codel's per-flow queues, penalizing sparse flows
like ICMP ping. A fix for this has been applied to the net tree:
https://git.kernel.org/netdev/net/c/815980fe6dbb

BQL remains valuable independently: it reduces RTT by ~17x by moving
buffering from the dark ptr_ring into the qdisc. Additionally, BQL
clears STACK_XOFF per-SKB as each packet completes, rather than
batch-waking after 64 packets (DRV_XOFF). This keeps sojourn times
below fq_codel's target, preventing CoDel from entering dropping state
on non-congested flows in the first place.

Key design decisions:

- Charge-under-lock in veth_xdp_rx(): The BQL charge must precede the
  ptr_ring produce, because the NAPI consumer can run on another CPU
  and complete the SKB immediately after it becomes visible. To avoid
  a pre-charge/undo pattern, the charge is done under the ptr_ring
  producer_lock after confirming the ring is not full. BQL is only
  charged when the produce is guaranteed to succeed, keeping
  num_queued monotonically increasing. HARD_TX_LOCK already serializes
  dql_queued() (veth requires a qdisc for BQL); charging under the
  ptr_ring lock additionally allows noqueue to work correctly.

- Per-SKB BQL tracking via pointer tag: A VETH_BQL_FLAG bit in the
  ptr_ring pointer records whether each SKB was BQL-charged. This is
  necessary because the qdisc can be replaced live (noqueue->sfq or
  vice versa) while SKBs are in flight -- the completion side must
  know the charge state that was decided at enqueue time.
- IFF_NO_QUEUE + BQL coexistence: A new dev->bql flag enables BQL
  sysfs exposure for IFF_NO_QUEUE devices that opt in to DQL
  accounting, without changing IFF_NO_QUEUE semantics.

Background and acknowledgments:

Mike Freemon reported the veth dark buffer problem internally at
Cloudflare and showed that recompiling the kernel with a ptr_ring size
of 30 (down from 256) made fq_codel work dramatically better. This was
the primary motivation for a proper BQL solution that achieves the
same effect dynamically, without a kernel rebuild.

Chris Arges wrote a reproducer for the dark buffer latency problem:
https://github.com/netoptimizer/veth-backpressure-performance-testing
This is where we first observed ping packets being dropped under
fq_codel, which became our secondary motivation for BQL. In production
we switched to SFQ on veth devices as a workaround.

Jonas Koeppeler provided extensive testing and code review. Together
we discovered that the fq_codel ping loss was actually a 12-year-old
CoDel bug (stale first_above_time in empty flows), not caused by the
dark buffer itself. A fix has been applied to the net tree:
https://git.kernel.org/netdev/net/c/815980fe6dbb

Patch overview:
 1. net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
 2. veth: implement Byte Queue Limits (BQL) for latency reduction
 3. veth: add tx_timeout watchdog as BQL safety net
 4. net: sched: add timeout count to NETDEV WATCHDOG message
 5. selftests: net: add veth BQL stress test

Jesper Dangaard Brouer (5):
  net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  veth: implement Byte Queue Limits (BQL) for latency reduction
  veth: add tx_timeout watchdog as BQL safety net
  net: sched: add timeout count to NETDEV WATCHDOG message
  selftests: net: add veth BQL stress test

 .../networking/net_cachelines/net_device.rst  |   1 +
 drivers/net/veth.c                            |  92 +-
 include/linux/netdevice.h                     |   2 +
 net/core/net-sysfs.c                          |   8 +-
 net/sched/sch_generic.c                       |   8 +-
 tools/testing/selftests/net/Makefile          |   3 +
 tools/testing/selftests/net/config            |   1 +
 tools/testing/selftests/net/napi_poll_hist.bt |  40 +
 tools/testing/selftests/net/veth_bql_test.sh  | 821 ++++++++++++++++++
 .../selftests/net/veth_bql_test_virtme.sh     | 124 +++
 10 files changed, 1084 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/net/napi_poll_hist.bt
 create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
 create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh

V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/

Changes since V1:
- Patch 1 (dev->bql flag): add kdoc entry for @bql in struct
  net_device.
- Patch 2 (veth BQL): charge a fixed VETH_BQL_UNIT (1) per packet
  instead of skb->len. veth has no link speed, and the ptr_ring is
  packet-indexed; byte-based charging lets small packets sneak many
  entries into the ring. Testing: a min-size packet flood causes 3.7x
  ping RTT degradation with skb->len, versus no change with fixed-unit
  charging.
- Patch 3 (tx_timeout watchdog): fix a race with peer NAPI: replace
  netdev_tx_reset_queue() with clear_bit(STACK_XOFF) +
  netif_tx_wake_queue() to avoid dql_reset() racing with a concurrent
  dql_completed().
- Patch 5 (selftests): fix shellcheck warnings and infos:
  - Quote variables passed to kill_process and exit.
  - Declare and assign local variables separately (SC2155).
  - Use read -r to avoid mangling backslashes (SC2162).
  - Add shellcheck disable comments for intentional word splitting
    (set -- $line, tc $qdisc $opts) and indirect invocation (trap).
  - Make iptables-restore failure a hard FAIL instead of continuing.
- Add veth_bql_test.sh to TEST_PROGS in net/Makefile.
- Add veth_bql_test_virtme.sh to TEST_FILES (needs kernel build tree).
- Add napi_poll_hist.bt to TEST_FILES in net/Makefile.
- Add CONFIG_NET_SCH_SFQ=m to net/config (default qdisc is sfq).
- Reduce default test duration from 300s to 30s for kselftest CI.
- Fix virtme wrapper: empty args bug, check vmlinux instead of test
  path.
- Cover letter: update CoDel fix reference to the merged commit in the
  net tree.

Cc: Chris Arges
Cc: Mike Freemon
Cc: Toke Høiland-Jørgensen
Cc: Shuah Khan
Cc: linux-kselftest@vger.kernel.org
Cc: Jonas Köppeler
Cc: kernel-team@cloudflare.com
-- 
2.43.0