From mboxrd@z Thu Jan 1 00:00:00 1970
From: hawk@kernel.org
To: netdev@vger.kernel.org
Cc: edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	davem@davemloft.net, andrew+netdev@lunn.ch, horms@kernel.org,
	jhs@mojatatu.com, jiri@resnulli.us, toke@toke.dk, sdf@fomichev.me,
	j.koeppeler@tu-berlin.de, mfreemon@cloudflare.com, carges@cloudflare.com
Subject: [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support
Date: Wed, 18 Mar 2026 14:48:20 +0100
Message-ID: <20260318134826.1281205-1-hawk@kernel.org>

From: Jesper Dangaard Brouer

This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight bytes in the ptr_ring and
moving buffering into the qdisc, where AQM algorithms (fq_codel, CAKE)
can act on it.

Problem: veth's 256-entry ptr_ring acts as a "dark buffer" -- packets
queued there are invisible to the qdisc's AQM. Under load, the ring
fills completely (DRV_XOFF backpressure), adding up to 256 packets of
unmanaged latency before the qdisc even sees congestion.

Solution: BQL (STACK_XOFF) dynamically limits in-flight bytes, stopping
the queue before the ring fills. This keeps the ring shallow and pushes
excess packets into the qdisc, where sojourn-based AQM can measure and
drop them.

Test setup: veth pair, UDP flood, 13000 iptables rules in the consumer
namespace (slows the NAPI-64 cycle to ~6-7ms), ping measures RTT under
load.

                BQL off                BQL on
  fq_codel:  RTT ~22ms, 4% loss    RTT ~1.3ms, 0% loss
  sfq:       RTT ~24ms, 0% loss    RTT ~1.5ms, 0% loss

BQL reduces ping RTT by ~17x for both qdiscs. Consumer throughput is
unchanged (~10K pps) -- BQL adds no overhead.
CoDel bug discovered during BQL development:

Our original motivation for BQL was fq_codel ping loss observed under
load (4-26% depending on NAPI cycle time). Investigating this led us to
discover a bug in the CoDel implementation: codel_dequeue() does not
reset vars->first_above_time when a flow goes empty, contrary to the
reference algorithm. This causes stale CoDel state to persist across
empty periods in fq_codel's per-flow queues, penalizing sparse flows
like ICMP ping. A fix for this is included as patch 6/6.

BQL remains valuable independently: it reduces RTT by ~17x by moving
buffering from the dark ptr_ring into the qdisc. Additionally, BQL
clears STACK_XOFF per SKB as each packet completes, rather than
batch-waking after 64 packets (DRV_XOFF). This keeps sojourn times
below fq_codel's target, preventing CoDel from entering dropping state
on non-congested flows in the first place.

Key design decisions:

- Charge before produce: the BQL charge is placed before
  ptr_ring_produce() because with threaded NAPI the consumer can
  complete on another CPU before veth_xmit() returns.

- Per-SKB BQL tracking via pointer tag: a VETH_BQL_FLAG bit in the
  ptr_ring pointer records whether each SKB was BQL-charged. This is
  necessary because the qdisc can be replaced live (noqueue->sfq or
  vice versa) while SKBs are in flight -- the completion side must know
  the charge state that was decided at enqueue time.

- IFF_NO_QUEUE + BQL coexistence: a new dev->bql flag enables BQL sysfs
  exposure for IFF_NO_QUEUE devices that opt in to DQL accounting,
  without changing IFF_NO_QUEUE semantics.

Background and acknowledgments:

Mike Freemon reported the veth dark buffer problem internally at
Cloudflare and showed that recompiling the kernel with a ptr_ring size
of 30 (down from 256) made fq_codel work dramatically better. This was
the primary motivation for a proper BQL solution that achieves the same
effect dynamically, without a kernel rebuild.
Chris Arges wrote a reproducer for the dark buffer latency problem:
https://github.com/netoptimizer/veth-backpressure-performance-testing
This is where we first observed ping packets being dropped under
fq_codel, which became our secondary motivation for BQL. In production
we switched to SFQ on veth devices as a workaround.

Jonas Koeppeler provided extensive testing and code review. Together we
discovered that the fq_codel ping loss was actually a 12-year-old CoDel
bug (stale first_above_time in empty flows), not caused by the dark
buffer itself. The fix is included as patch 6/6.

Jesper Dangaard Brouer (5):
  net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  veth: implement Byte Queue Limits (BQL) for latency reduction
  veth: add tx_timeout watchdog as BQL safety net
  net: sched: add timeout count to NETDEV WATCHDOG message
  selftests: net: add veth BQL stress test

Jonas Köppeler (1):
  net_sched: codel: fix stale state for empty flows in fq_codel

 .../networking/net_cachelines/net_device.rst |   1 +
 drivers/net/veth.c                           |  88 ++-
 include/linux/netdevice.h                    |   1 +
 include/net/codel_impl.h                     |   1 +
 net/core/net-sysfs.c                         |   8 +-
 net/sched/sch_generic.c                      |   8 +-
 tools/testing/selftests/net/veth_bql_test.sh | 626 ++++++++++++++++++
 .../selftests/net/veth_bql_test_virtme.sh    | 113 ++++
 8 files changed, 829 insertions(+), 17 deletions(-)
 create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
 create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh

-- 
2.43.0