[PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support

From: hawk@kernel.org
To: netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com, simon.schippers@tu-dortmund.de,
	"Jesper Dangaard Brouer" <hawk@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Simon Horman" <horms@kernel.org>,
	"Chris Arges" <carges@cloudflare.com>,
	"Mike Freemon" <mfreemon@cloudflare.com>,
	"Toke Høiland-Jørgensen" <toke@toke.dk>,
	"Jonas Köppeler" <j.koeppeler@tu-berlin.de>,
	"Breno Leitao" <leitao@debian.org>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"Stanislav Fomichev" <sdf@fomichev.me>,
	bpf@vger.kernel.org
Subject: [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support
Date: Fri, 12 Jun 2026 10:35:23 +0200
Message-ID: <20260612083530.1650245-1-hawk@kernel.org>

From: Jesper Dangaard Brouer <hawk@kernel.org>

This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.

Problem: veth's 256-entry ptr_ring acts as a "dark buffer" invisible
to the qdisc's AQM.  Under load the ring fills, adding up to 256
packets of unmanaged latency before the qdisc sees congestion.

Solution: BQL stops the queue before the ring fills, pushing excess
packets into the qdisc where sojourn-based AQM can drop them.
Time-based completion coalescing (ethtool tx-usecs, default 100 us)
lets DQL converge on a limit that bounds actual queuing delay rather
than oscillating at limit=2 with per-packet completion.
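
To make the stop/complete mechanism concrete, here is a toy userspace
model.  It is a sketch only: it is not the kernel's DQL algorithm and
not the veth code, every toy_* name is hypothetical, and the limit is
fixed rather than auto-tuned.  It shows only that once the queue is
stopped, excess packets accumulate in the qdisc instead of the ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: NOT the kernel's DQL algorithm or veth code.
 * All toy_* names are hypothetical. */
struct toy_txq {
	unsigned int limit;     /* stand-in for the DQL limit, in packets */
	unsigned int inflight;  /* packets charged but not yet completed */
	unsigned int qdisc_len; /* packets held back in the qdisc */
	bool stopped;           /* stand-in for the stack XOFF state */
};

/* Producer side: offer one packet to the device. */
static void toy_xmit(struct toy_txq *q)
{
	if (q->stopped) {
		q->qdisc_len++;  /* stays where an AQM could act on it */
		return;
	}
	q->inflight++;           /* fixed charge of 1 unit per packet */
	if (q->inflight >= q->limit)
		q->stopped = true;
}

/* Consumer side: complete n packets, then refill from the qdisc. */
static void toy_complete(struct toy_txq *q, unsigned int n)
{
	assert(n <= q->inflight);
	q->inflight -= n;
	if (q->inflight < q->limit)
		q->stopped = false;
	while (!q->stopped && q->qdisc_len > 0) {
		q->qdisc_len--;
		toy_xmit(q);
	}
}
```

With a limit of 8, offering 100 packets leaves 8 in flight and 92 in
the qdisc, regardless of how large the ring itself is.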

Test setup: veth pair, UDP flood, 13000 iptables rules in the consumer
namespace (slowing each 64-packet NAPI poll cycle to ~6-7 ms).  Ping
measures RTT.

                   BQL off                    BQL on
  fq_codel:  RTT ~22 ms, 4% loss        RTT ~1.3 ms, 0% loss
  sfq:       RTT ~24 ms, 0% loss        RTT ~1.5 ms, 0% loss

BQL reduces ping RTT by ~16-17x (22/1.3 for fq_codel, 24/1.5 for sfq).
Consumer throughput is unchanged.
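
As a rough sanity check on the BQL-off numbers, the sketch below
estimates how long a packet waits behind a full ring.  It assumes the
consumer drains exactly one NAPI budget per poll cycle and that nothing
else adds delay; ring_drain_ms() is a hypothetical helper, not kernel
code.  The inputs (256 entries, 64-packet budget, 6-7 ms per cycle) are
the figures quoted above.

```c
#include <assert.h>

/* Estimated wait, in ms, for a packet queued behind ring_entries
 * packets, assuming napi_budget packets are drained per cycle of
 * cycle_ms.  Partial cycles are rounded up to whole cycles. */
static unsigned int ring_drain_ms(unsigned int ring_entries,
				  unsigned int napi_budget,
				  unsigned int cycle_ms)
{
	unsigned int cycles;

	assert(napi_budget > 0);
	cycles = (ring_entries + napi_budget - 1) / napi_budget;
	return cycles * cycle_ms;
}
```

ring_drain_ms(256, 64, 6) gives 24 and ring_drain_ms(256, 64, 7) gives
28, which is in the same range as the measured ~22-24 ms RTT.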

Why BQL over configurable ring size (ethtool -G): BQL auto-tunes per
workload via DQL; a static ring size requires per-setup manual tuning
and a too-small ring drops XDP packets in batches of 16-64.

Selftests: https://github.com/netoptimizer/veth-backpressure-performance-testing

Background:
  Mike Freemon reported the veth dark buffer problem internally at
  Cloudflare and showed that recompiling with ptr_ring size 30 made
  fq_codel work dramatically better -- motivating a dynamic BQL
  solution.  Chris Arges wrote the reproducer.  Jonas Koeppeler and
  Simon Schippers provided extensive testing and code review.

  During BQL development we also fixed an unrelated 12-year-old CoDel
  bug (stale first_above_time in empty flows), see
  commit 815980fe6dbb ("net_sched: codel: fix stale state for empty flows in fq_codel").
  BQL remains valuable independently.

Patch overview:
  1. net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  2. veth: implement Byte Queue Limits (BQL) for latency reduction
  3. veth: add tx_timeout watchdog as BQL safety net
  4. net: sched: add timeout count to NETDEV WATCHDOG message
  5. veth: time-based BQL completion coalescing via ethtool tx-usecs

Jesper Dangaard Brouer (4):
  net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  veth: implement Byte Queue Limits (BQL) for latency reduction
  veth: add tx_timeout watchdog as BQL safety net
  net: sched: add timeout count to NETDEV WATCHDOG message

Simon Schippers (1):
  veth: time-based BQL completion coalescing via ethtool tx-usecs

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Chris Arges <carges@cloudflare.com>
Cc: Mike Freemon <mfreemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Cc: Breno Leitao <leitao@debian.org>
Cc: Simon Schippers <simon.schippers@tu-dortmund.de>
Cc: kernel-team@cloudflare.com

---
Changes since V6:
  - Patch 2 (teardown): rework veth_napi_del_range() to drain the ptr_ring
    and balance the peer txq's BQL/DQL by completing the outstanding charges
    via netdev_tx_completed_queue(), instead of netdev_tx_reset_queue().
    dql_reset() races with a concurrent producer, whereas completion is the
    normal single-completer path and is safe once NAPI is gone
    (synchronize_net()) and the producer has stopped charging BQL (it
    observes rq->napi == NULL).  Fixes races reported by sashiko.  The
    peer txq is still woken to clear any leaked DRV_XOFF.
  - Patch 2: document the lltx/DQL locking model at the BQL charge site --
    veth is lltx so the stack skips HARD_TX_LOCK; the ring producer_lock is
    the single-producer lock for dql_queued() (1:1 with the peer txq) and
    the peer NAPI in veth_xdp_rcv() is the single completer.
  - Patch 3 (watchdog): add a VETH_WATCHDOG_TIMEOUT_MS #define for the
    tx_timeout value instead of an inline magic number, and expand the
    math behind the value (64-packet NAPI budget * 250 ms/pkt = 16 s)
    in a comment (Paolo).
  - Patches 2+5: add Co-developed-by for Jonas Koeppeler.
  - Patch 5 (coalescing): flush when n_bql exceeds dql.limit to handle
    BQL starvation.  Removes the empty-ring STACK_XOFF check (and its
    smp_rmb) -- the dql.limit comparison handles it more directly.
  - Patch 5: at teardown, also complete the coalesced bql_state.n_bql
    pending from the last NAPI poll, on top of the patch 2 ring drain.
  - Patch 5: mirror tx_coal_usecs onto the peer in veth_set_coalesce() so
    a veth pair always coalesces symmetrically; the completion path reads
    its own device's value.
  - Patch 5: use WRITE_ONCE/READ_ONCE for tx_coal_usecs (Paolo).
  - Patch 5: init bql_state before napi_enable() to avoid race (sashiko).
  - Patch 5: tx-usecs=0 handled as normal path, no special case.
  - Patch 5: n_bql type changed to uint; explicit if/++ instead of
    implicit bool-to-int addition.
  - Patch 5: call veth_bql_maybe_complete() on every iteration for
    accurate completion intervals with mixed XDP/non-XDP packets.
  - Patch 5: reject tx-usecs above half the tx_timeout watchdog in
    veth_set_coalesce() (return -ERANGE) -- the coalescing window delays
    BQL completions, so an over-large value could trip a false watchdog;
    half the watchdog leaves a generous margin.
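
The watchdog arithmetic and the tx-usecs bound described above can be
sketched in userspace C as follows.  Only VETH_WATCHDOG_TIMEOUT_MS and
the -ERANGE return come from this series; the other macro names and
toy_check_tx_usecs() are hypothetical, and the real check lives in
veth_set_coalesce() and may be written differently.

```c
#include <assert.h>
#include <errno.h>

#define TOY_NAPI_BUDGET_PKTS      64   /* hypothetical name */
#define TOY_WORST_CASE_MS_PER_PKT 250  /* hypothetical name */

/* 64-packet NAPI budget * 250 ms/pkt = 16000 ms = 16 s */
#define VETH_WATCHDOG_TIMEOUT_MS \
	(TOY_NAPI_BUDGET_PKTS * TOY_WORST_CASE_MS_PER_PKT)

/* Half the watchdog, converted from ms to us (hypothetical name). */
#define TOY_MAX_TX_COAL_USECS ((VETH_WATCHDOG_TIMEOUT_MS / 2) * 1000U)

static_assert(VETH_WATCHDOG_TIMEOUT_MS == 16000, "64 * 250 ms = 16 s");

/* Returns 0 if usecs is acceptable, -ERANGE if it exceeds half the
 * watchdog timeout. */
static int toy_check_tx_usecs(unsigned int usecs)
{
	return usecs > TOY_MAX_TX_COAL_USECS ? -ERANGE : 0;
}
```

Under these assumptions the largest accepted tx-usecs value is
8000000 (8 s), so the default of 100 us is far inside the bound.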

Prior versions:
  V6: https://lore.kernel.org/all/20260527135418.1166665-1-hawk@kernel.org/
  V5: https://lore.kernel.org/all/20260505132159.241305-1-hawk@kernel.org/
  V4: https://lore.kernel.org/all/20260501071633.644353-1-hawk@kernel.org/
  V3: https://lore.kernel.org/all/20260429172036.1028526-1-hawk@kernel.org/
  V2: https://lore.kernel.org/all/20260413094442.1376022-1-hawk@kernel.org/
  V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/

Changes since V5:
  - Drop patch 1 (OOB txq fix) -- applied to net tree by Paolo as
    08f566e8f83b, already in net-next.
  - New patch 5: time-based BQL completion coalescing via ethtool
    tx-usecs (Simon Schippers).  Resolves the throughput regression
    Paolo flagged on V5: per-packet completion forced DQL to limit=2,
    causing cache-line bouncing between producer/consumer CPUs.
    Coalescing batches completions on a configurable time threshold
    (default 100 us), letting DQL discover a higher useful limit.
  - Patch 2: use __ptr_ring_check_produce() instead of open-coded
    ring-full check (Simon Schippers nit).

Changes since V4:
  - New patch 1: fix OOB txq access in veth_poll() when veth peers have
    asymmetric RX/TX queue counts.  XDP redirect can deliver frames to
    an RX queue index that exceeds the peer's TX queue count, causing
    an out-of-bounds netdev_get_tx_queue() access.  Found by sashiko-bot.
  - Patch 3 (veth BQL): wake stopped peer txqs in veth_napi_del_range()
    to clear DRV_XOFF after NAPI teardown.  A concurrent veth_xmit()
    can set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
    synchronize_net(); with NAPI gone, no veth_poll() clears it.
    Guarded by netif_running() to skip during device close.

Changes since V3:
  - Drop selftest patch (patch 5 from V3) per maintainer request.
  - Rebase on latest net-next.

Changes since V2:
  - Patch 2 (veth BQL): fix syzbot WARNING in veth_napi_del_range():
    clamp BQL reset loop to peer's real_num_tx_queues.  The loop was
    iterating dev->real_num_rx_queues but indexing peer's txq[], which
    goes out of bounds when the peer has fewer TX queues (e.g. veth
    enslaved to a bond with XDP attached).

Changes since V1:
  - Patch 1 (dev->bql flag): add kdoc entry for @bql in struct net_device.
  - Patch 2 (veth BQL): charge fixed VETH_BQL_UNIT (1) per packet instead
    of skb->len.  veth has no link speed; the ptr_ring is packet-indexed.
    Byte-based charging lets small packets sneak many entries into the ring.
    Testing: min-size packet flood causes 3.7x ping RTT degradation with
    skb->len vs no change with fixed-unit charging.
  - Patch 3 (tx_timeout watchdog): fix race with peer NAPI: replace
    netdev_tx_reset_queue() with clear_bit(STACK_XOFF) + netif_tx_wake_queue()
    to avoid dql_reset() racing with concurrent dql_completed().
  - Cover letter: update CoDel fix reference to merged commit in net tree.
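
The V1 change from skb->len to fixed-unit charging can be illustrated
with simple arithmetic.  The sketch below is not driver code:
entries_admitted() is a hypothetical helper, and the 3000-byte limit is
an arbitrary example value rather than a measured DQL limit.

```c
#include <assert.h>

/* Ring entries admitted before a budget of `limit` units is used up,
 * when every packet is charged `charge` units. */
static unsigned int entries_admitted(unsigned int limit,
				     unsigned int charge)
{
	assert(charge > 0);
	return limit / charge;
}
```

With byte-based charging and an example limit of 3000 bytes,
entries_admitted(3000, 1500) is 2 but entries_admitted(3000, 64) is 46,
so small packets occupy far more ring slots.  With a fixed charge of 1
per packet, entries_admitted(2, 1) is 2 whatever the packet size.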

 .../networking/net_cachelines/net_device.rst  |   1 +
 drivers/net/veth.c                            | 267 +++++++++++++++++-
 include/linux/netdevice.h                     |   2 +
 net/core/net-sysfs.c                          |   8 +-
 net/sched/sch_generic.c                       |   6 +-
 5 files changed, 270 insertions(+), 14 deletions(-)

-- 
2.43.0
