public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support
@ 2026-03-18 13:48 hawk
  2026-03-18 13:48 ` [RFC PATCH net-next 1/6] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: hawk @ 2026-03-18 13:48 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, kuba, pabeni, davem, andrew+netdev, horms, jhs, jiri,
	toke, sdf, j.koeppeler, mfreemon, carges

From: Jesper Dangaard Brouer <hawk@kernel.org>

This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight bytes in the ptr_ring and
moving buffering into the qdisc where AQM algorithms (fq_codel, CAKE)
can act on it.

Problem:
  veth's 256-entry ptr_ring acts as a "dark buffer" -- packets queued
  there are invisible to the qdisc's AQM.  Under load, the ring fills
  completely (DRV_XOFF backpressure), adding up to 256 packets of
  unmanaged latency before the qdisc even sees congestion.

Solution:
  BQL (STACK_XOFF) dynamically limits in-flight bytes, stopping the
  queue before the ring fills.  This keeps the ring shallow and pushes
  excess packets into the qdisc, where sojourn-based AQM can measure
  and drop them.

  Test setup: veth pair, UDP flood, 13000 iptables rules in consumer
  namespace (slows NAPI-64 cycle to ~6-7ms), ping measures RTT under load.

                   BQL off                    BQL on
  fq_codel:  RTT ~22ms, 4% loss         RTT ~1.3ms, 0% loss
  sfq:       RTT ~24ms, 0% loss          RTT ~1.5ms, 0% loss

  BQL reduces ping RTT by ~17x for both qdiscs.  Consumer throughput
  is unchanged (~10K pps) -- BQL adds no overhead.

CoDel bug discovered during BQL development:
  Our original motivation for BQL was fq_codel ping loss observed under
  load (4-26% depending on NAPI cycle time).  Investigating this led us
  to discover a bug in the CoDel implementation: codel_dequeue() does
  not reset vars->first_above_time when a flow goes empty, contrary to
  the reference algorithm.  This causes stale CoDel state to persist
  across empty periods in fq_codel's per-flow queues, penalizing sparse
  flows like ICMP ping.  A fix for this is included as patch 6/6.

  BQL remains valuable independently: it reduces RTT by ~17x by moving
  buffering from the dark ptr_ring into the qdisc.  Additionally, BQL
  clears STACK_XOFF per-SKB as each packet completes, rather than
  batch-waking after 64 packets (DRV_XOFF).  This keeps sojourn times
  below fq_codel's target, preventing CoDel from entering dropping
  state on non-congested flows in the first place.

Key design decisions:
  - Charge before produce: BQL charge is placed before ptr_ring_produce()
    because with threaded NAPI the consumer can complete on another CPU
    before veth_xmit() returns.

  - Per-SKB BQL tracking via pointer tag: A VETH_BQL_FLAG bit in the
    ptr_ring pointer records whether each SKB was BQL-charged.  This is
    necessary because the qdisc can be replaced live (noqueue->sfq or
    vice versa) while SKBs are in-flight -- the completion side must
    know the charge state that was decided at enqueue time.

  - IFF_NO_QUEUE + BQL coexistence: A new dev->bql flag enables BQL
    sysfs exposure for IFF_NO_QUEUE devices that opt in to DQL
    accounting, without changing IFF_NO_QUEUE semantics.

Background and acknowledgments:
  Mike Freemon reported the veth dark buffer problem internally at
  Cloudflare and showed that recompiling the kernel with a ptr_ring
  size of 30 (down from 256) made fq_codel work dramatically better.
  This was the primary motivation for a proper BQL solution that
  achieves the same effect dynamically without a kernel rebuild.

  Chris Arges wrote a reproducer for the dark buffer latency problem:
    https://github.com/netoptimizer/veth-backpressure-performance-testing
  This is where we first observed ping packets being dropped under
  fq_codel, which became our secondary motivation for BQL.  In
  production we switched to SFQ on veth devices as a workaround.

  Jonas Koeppeler provided extensive testing and code review.
  Together we discovered that the fq_codel ping loss was actually a
  12-year-old CoDel bug (stale first_above_time in empty flows), not
  caused by the dark buffer itself.  The fix is included as patch 6/6.

Jesper Dangaard Brouer (5):
  net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  veth: implement Byte Queue Limits (BQL) for latency reduction
  veth: add tx_timeout watchdog as BQL safety net
  net: sched: add timeout count to NETDEV WATCHDOG message
  selftests: net: add veth BQL stress test

Jonas Köppeler (1):
  net_sched: codel: fix stale state for empty flows in fq_codel

 .../networking/net_cachelines/net_device.rst  |   1 +
 drivers/net/veth.c                            |  88 ++-
 include/linux/netdevice.h                     |   1 +
 include/net/codel_impl.h                      |   1 +
 net/core/net-sysfs.c                          |   8 +-
 net/sched/sch_generic.c                       |   8 +-
 tools/testing/selftests/net/veth_bql_test.sh  | 626 ++++++++++++++++++
 .../selftests/net/veth_bql_test_virtme.sh     | 113 ++++
 8 files changed, 829 insertions(+), 17 deletions(-)
 create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
 create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh

-- 
2.43.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-03-23 10:10 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-18 13:48 [RFC PATCH net-next 0/6] veth: add Byte Queue Limits (BQL) support hawk
2026-03-18 13:48 ` [RFC PATCH net-next 1/6] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
2026-03-18 13:48 ` [RFC PATCH net-next 2/6] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
2026-03-18 14:28   ` Toke Høiland-Jørgensen
2026-03-18 16:24     ` Jesper Dangaard Brouer
2026-03-19 10:04       ` Toke Høiland-Jørgensen
2026-03-20 14:50         ` Jesper Dangaard Brouer
2026-03-23 10:10           ` Toke Høiland-Jørgensen
2026-03-18 13:48 ` [RFC PATCH net-next 3/6] veth: add tx_timeout watchdog as BQL safety net hawk
2026-03-18 13:48 ` [RFC PATCH net-next 4/6] net: sched: add timeout count to NETDEV WATCHDOG message hawk
2026-03-18 13:48 ` [RFC PATCH net-next 5/6] selftests: net: add veth BQL stress test hawk
2026-03-18 13:48 ` [RFC PATCH net-next 6/6] net_sched: codel: fix stale state for empty flows in fq_codel hawk
2026-03-18 14:10   ` Toke Høiland-Jørgensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox