From: hawk@kernel.org
To: netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com, simon.schippers@tu-dortmund.de,
	"Jesper Dangaard Brouer" <hawk@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Simon Horman" <horms@kernel.org>,
	"Chris Arges" <carges@cloudflare.com>,
	"Mike Freemon" <mfreemon@cloudflare.com>,
	"Toke Høiland-Jørgensen" <toke@toke.dk>,
	"Jonas Köppeler" <j.koeppeler@tu-berlin.de>,
	"Breno Leitao" <leitao@debian.org>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"Stanislav Fomichev" <sdf@fomichev.me>,
	bpf@vger.kernel.org
Subject: [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support
Date: Fri, 12 Jun 2026 10:35:23 +0200
Message-ID: <20260612083530.1650245-1-hawk@kernel.org>

From: Jesper Dangaard Brouer <hawk@kernel.org>

This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.

Problem: veth's 256-entry ptr_ring acts as a "dark buffer" invisible
to the qdisc's AQM.  Under load the ring fills, adding up to 256
packets of unmanaged latency before the qdisc sees congestion.

Solution: BQL stops the queue before the ring fills, pushing excess
packets into the qdisc where sojourn-based AQM can drop them.
Time-based completion coalescing (ethtool tx-usecs, default 100 us)
lets DQL (Dynamic Queue Limits) converge on a limit that bounds the
actual queuing delay, rather than oscillating at limit=2 as it does
with per-packet completion.
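
To make the stop/start idea concrete, here is a toy user-space model
(not the kernel code).  It assumes a fixed in-flight limit, whereas
real DQL adapts the limit at runtime; the function name, arrival rate
and drain rate are made up for illustration.

```python
from collections import deque

def simulate(limit, ring_size=256, arrivals_per_tick=8,
             completions_per_tick=4, ticks=200):
    """Toy model; returns (max ring occupancy, max qdisc backlog).

    limit=None models "BQL off": only the ring size bounds in-flight
    packets.  A fixed integer limit stands in for the DQL limit.
    """
    ring = deque()    # stands in for the veth ptr_ring
    qdisc = deque()   # stands in for the qdisc backlog
    max_ring = max_qdisc = 0
    cap = ring_size if limit is None else min(limit, ring_size)
    for t in range(ticks):
        # The stack enqueues new packets into the qdisc.
        for _ in range(arrivals_per_tick):
            qdisc.append(t)
        # The qdisc feeds the ring until the (modelled) txq is stopped.
        while qdisc and len(ring) < cap:
            ring.append(qdisc.popleft())
        max_ring = max(max_ring, len(ring))
        max_qdisc = max(max_qdisc, len(qdisc))
        # The consumer (peer NAPI) drains and completes a few packets.
        for _ in range(min(completions_per_tick, len(ring))):
            ring.popleft()
    return max_ring, max_qdisc

print("BQL off:", simulate(None))
print("BQL on: ", simulate(16))
```

In this model the backlog does not disappear with the limit in place;
it moves from the ring into the qdisc, which is where an AQM can see
it and act on it.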

Test setup: veth pair, UDP flood, 13000 iptables rules in the consumer
namespace (slowing each 64-packet NAPI poll cycle to ~6-7 ms).  Ping
measures RTT.

                   BQL off                    BQL on
  fq_codel:  RTT ~22 ms, 4% loss        RTT ~1.3 ms, 0% loss
  sfq:       RTT ~24 ms, 0% loss        RTT ~1.5 ms, 0% loss

BQL reduces ping RTT by ~16-17x for both qdiscs.  Consumer throughput
is unchanged.
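
As a rough sanity check, the numbers above can be tied together with
some back-of-envelope arithmetic.  This assumes a ping packet queued
behind a full ring has to wait for the whole ring to drain; that is a
simplification for illustration, not a measurement.

```python
RING_ENTRIES = 256       # veth ptr_ring size
NAPI_BUDGET = 64         # packets handled per NAPI poll cycle
CYCLE_MS = (6.0, 7.0)    # NAPI-64 cycle time range from the test setup

# Poll cycles needed to drain a full ring, and the resulting delay.
cycles_to_drain = RING_ENTRIES / NAPI_BUDGET
drain_ms = tuple(cycles_to_drain * c for c in CYCLE_MS)

# RTT (BQL off, BQL on) in ms, taken from the results table.
rtt_ms = {"fq_codel": (22.0, 1.3), "sfq": (24.0, 1.5)}
speedup = {q: off / on for q, (off, on) in rtt_ms.items()}

print(f"cycles to drain a full ring: {cycles_to_drain:.0f}")
print(f"estimated dark-buffer delay: {drain_ms[0]:.0f}-{drain_ms[1]:.0f} ms")
for qdisc, factor in speedup.items():
    print(f"{qdisc}: RTT reduced {factor:.1f}x")
```

The estimated drain time of a full ring lands in the same range as the
RTT measured with BQL off.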

Why BQL rather than a configurable ring size (ethtool -G): BQL
auto-tunes per workload via DQL, whereas a static ring size requires
manual tuning per setup, and a too-small ring drops XDP packets in
batches of 16-64.

Selftests: https://github.com/netoptimizer/veth-backpressure-performance-testing

Background:
  Mike Freemon reported the veth dark buffer problem internally at
  Cloudflare and showed that recompiling with ptr_ring size 30 made
  fq_codel work dramatically better -- motivating a dynamic BQL
  solution.  Chris Arges wrote the reproducer.  Jonas Koeppeler and
  Simon Schippers provided extensive testing and code review.

  During BQL development we also fixed an unrelated 12-year-old CoDel
  bug (stale first_above_time in empty flows), see
  commit 815980fe6dbb ("net_sched: codel: fix stale state for empty flows in fq_codel").
  BQL remains valuable independently of that fix.

Patch overview:
  1. net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  2. veth: implement Byte Queue Limits (BQL) for latency reduction
  3. veth: add tx_timeout watchdog as BQL safety net
  4. net: sched: add timeout count to NETDEV WATCHDOG message
  5. veth: time-based BQL completion coalescing via ethtool tx-usecs

Jesper Dangaard Brouer (4):
  net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  veth: implement Byte Queue Limits (BQL) for latency reduction
  veth: add tx_timeout watchdog as BQL safety net
  net: sched: add timeout count to NETDEV WATCHDOG message

Simon Schippers (1):
  veth: time-based BQL completion coalescing via ethtool tx-usecs

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Chris Arges <carges@cloudflare.com>
Cc: Mike Freemon <mfreemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Cc: Breno Leitao <leitao@debian.org>
Cc: Simon Schippers <simon.schippers@tu-dortmund.de>
Cc: kernel-team@cloudflare.com

---
Changes since V6:
  - Patch 2 (teardown): rework veth_napi_del_range() to drain the ptr_ring
    and balance the peer txq's BQL/DQL by completing the outstanding charges
    via netdev_tx_completed_queue(), instead of netdev_tx_reset_queue().
    dql_reset() races with a concurrent producer, whereas completion is the
    normal single-completer path and is safe once NAPI is gone
    (synchronize_net()) and the producer has stopped charging BQL (it
    observes rq->napi == NULL).  Fixes races reported by sashiko.  The
    peer txq is still woken to clear any leaked DRV_XOFF.
  - Patch 2: document the lltx/DQL locking model at the BQL charge site --
    veth is lltx so the stack skips HARD_TX_LOCK; the ring producer_lock is
    the single-producer lock for dql_queued() (1:1 with the peer txq) and
    the peer NAPI in veth_xdp_rcv() is the single completer.
  - Patch 3 (watchdog): add a VETH_WATCHDOG_TIMEOUT_MS #define for the
    tx_timeout value instead of an inline magic number, and expand the
    math behind the value (64-packet NAPI budget * 250 ms/pkt = 16 s)
    in a comment (Paolo).
  - Patches 2+5: add Co-developed-by for Jonas Koeppeler.
  - Patch 5 (coalescing): flush when n_bql exceeds dql.limit to handle
    BQL starvation.  Removes the empty-ring STACK_XOFF check (and its
    smp_rmb) -- the dql.limit comparison handles it more directly.
  - Patch 5: at teardown, also complete the coalesced bql_state.n_bql
    pending from the last NAPI poll, on top of the patch 2 ring drain.
  - Patch 5: mirror tx_coal_usecs onto the peer in veth_set_coalesce() so
    a veth pair always coalesces symmetrically; the completion path reads
    its own device's value.
  - Patch 5: use WRITE_ONCE/READ_ONCE for tx_coal_usecs (Paolo).
  - Patch 5: init bql_state before napi_enable() to avoid race (sashiko).
  - Patch 5: tx-usecs=0 handled as normal path, no special case.
  - Patch 5: n_bql type changed to uint; explicit if/++ instead of
    implicit bool-to-int addition.
  - Patch 5: call veth_bql_maybe_complete() on every iteration for
    accurate completion intervals with mixed XDP/non-XDP packets.
  - Patch 5: reject tx-usecs above half the tx_timeout watchdog in
    veth_set_coalesce() (return -ERANGE) -- the coalescing window delays
    BQL completions, so an over-large value could trip a false watchdog;
    half the watchdog leaves a generous margin.
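
The watchdog math and the tx-usecs bound described above can be
sketched as follows.  This is an illustrative user-space model, not
the driver code: check_tx_usecs() and WORST_CASE_MS_PER_PKT are
hypothetical names, and only VETH_WATCHDOG_TIMEOUT_MS and the
-ERANGE return come from the changelog.

```python
import errno

NAPI_BUDGET = 64                 # packets per NAPI poll
WORST_CASE_MS_PER_PKT = 250      # per-packet bound used in the comment
VETH_WATCHDOG_TIMEOUT_MS = NAPI_BUDGET * WORST_CASE_MS_PER_PKT  # 16 s

def check_tx_usecs(tx_usecs):
    """Return 0 if accepted, -ERANGE if above half the watchdog timeout."""
    max_usecs = VETH_WATCHDOG_TIMEOUT_MS * 1000 // 2
    if tx_usecs > max_usecs:
        return -errno.ERANGE
    return 0

print(VETH_WATCHDOG_TIMEOUT_MS)       # watchdog timeout in ms
print(check_tx_usecs(100))            # default coalescing window
print(check_tx_usecs(8_000_001))      # just above half the watchdog
```

Under these assumptions the largest accepted value is 8 seconds
(8000000 us), far above any practical coalescing window.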

Prior versions:
  V6: https://lore.kernel.org/all/20260527135418.1166665-1-hawk@kernel.org/
  V5: https://lore.kernel.org/all/20260505132159.241305-1-hawk@kernel.org/
  V4: https://lore.kernel.org/all/20260501071633.644353-1-hawk@kernel.org/
  V3: https://lore.kernel.org/all/20260429172036.1028526-1-hawk@kernel.org/
  V2: https://lore.kernel.org/all/20260413094442.1376022-1-hawk@kernel.org/
  V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/

Changes since V5:
  - Drop patch 1 (OOB txq fix) -- applied to net tree by Paolo as
    08f566e8f83b, already in net-next.
  - New patch 5: time-based BQL completion coalescing via ethtool
    tx-usecs (Simon Schippers).  Resolves the throughput regression
    Paolo flagged on V5: per-packet completion forced DQL to limit=2,
    causing cache-line bouncing between producer/consumer CPUs.
    Coalescing batches completions on a configurable time threshold
    (default 100 us), letting DQL discover a higher useful limit.
  - Patch 2: use __ptr_ring_check_produce() instead of open-coded
    ring-full check (Simon Schippers nit).
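
A minimal sketch of time-based completion coalescing, for readers who
want to see the batching logic in isolation.  It is a user-space
model under assumptions: the class and method names are hypothetical,
time is passed in explicitly in microseconds, and the "flush when
n_bql exceeds the limit" rule follows the V7 changelog rather than
the actual patch code.

```python
class BqlCoalescer:
    """Batch BQL completions on a time window (toy model)."""

    def __init__(self, tx_usecs, dql_limit):
        self.tx_usecs = tx_usecs      # coalescing window in us
        self.dql_limit = dql_limit    # stands in for dql.limit
        self.n_bql = 0                # consumed but not yet completed
        self.last_flush_us = 0
        self.completions = []         # (time_us, units) batches

    def on_packet_consumed(self, now_us, charged=True):
        # Only packets that were charged to BQL count towards n_bql.
        if charged:
            self.n_bql += 1
        self.maybe_complete(now_us)

    def maybe_complete(self, now_us):
        if self.n_bql == 0:
            return
        window_elapsed = now_us - self.last_flush_us >= self.tx_usecs
        starving = self.n_bql > self.dql_limit
        if window_elapsed or starving:
            # One batched completion instead of one per packet.
            self.completions.append((now_us, self.n_bql))
            self.n_bql = 0
            self.last_flush_us = now_us

c = BqlCoalescer(tx_usecs=100, dql_limit=1000)
for t in range(0, 250, 10):           # one packet every 10 us
    c.on_packet_consumed(t)
print(c.completions, c.n_bql)
```

With tx_usecs=0 the window test is always true, so the model falls
back to per-packet completion without a special case.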

Changes since V4:
  - New patch 1: fix OOB txq access in veth_poll() when veth peers have
    asymmetric RX/TX queue counts.  XDP redirect can deliver frames to
    an RX queue index that exceeds the peer's TX queue count, causing
    an out-of-bounds netdev_get_tx_queue() access.  Found by sashiko-bot.
  - Patch 3 (veth BQL): wake stopped peer txqs in veth_napi_del_range()
    to clear DRV_XOFF after NAPI teardown.  A concurrent veth_xmit()
    can set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
    synchronize_net(); with NAPI gone, no veth_poll() clears it.
    Guarded by netif_running() to skip during device close.

Changes since V3:
  - Drop selftest patch (patch 5 from V3) per maintainer request.
  - Rebase on latest net-next.

Changes since V2:
  - Patch 2 (veth BQL): fix syzbot WARNING in veth_napi_del_range():
    clamp BQL reset loop to peer's real_num_tx_queues.  The loop was
    iterating dev->real_num_rx_queues but indexing peer's txq[], which
    goes out of bounds when the peer has fewer TX queues (e.g. veth
    enslaved to a bond with XDP attached).

Changes since V1:
  - Patch 1 (dev->bql flag): add kdoc entry for @bql in struct net_device.
  - Patch 2 (veth BQL): charge fixed VETH_BQL_UNIT (1) per packet instead
    of skb->len.  veth has no link speed; the ptr_ring is packet-indexed.
    Byte-based charging lets small packets sneak many entries into the ring.
    Testing: min-size packet flood causes 3.7x ping RTT degradation with
    skb->len vs no change with fixed-unit charging.
  - Patch 3 (tx_timeout watchdog): fix race with peer NAPI: replace
    netdev_tx_reset_queue() with clear_bit(STACK_XOFF) + netif_tx_wake_queue()
    to avoid dql_reset() racing with concurrent dql_completed().
  - Cover letter: update CoDel fix reference to merged commit in net tree.
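
The difference between byte-based and fixed-unit charging can be
shown with a small calculation.  The limits below are hypothetical
values picked for illustration (a byte limit worth 30 full-size
packets versus a unit limit of 30), and the stop condition is
simplified compared to real DQL.

```python
def packets_admitted(limit, charge_per_packet, ring_size=256):
    """Packets accepted before the in-flight charge reaches `limit`.

    The count is capped by ring_size because the ring is indexed by
    packets, not bytes.
    """
    n = 0
    inflight = 0
    while inflight < limit and n < ring_size:
        inflight += charge_per_packet
        n += 1
    return n

BYTE_LIMIT = 30 * 1500   # hypothetical: 30 full-size packets in bytes
UNIT_LIMIT = 30          # hypothetical: 30 packets, one unit each

print(packets_admitted(BYTE_LIMIT, 1500))  # full-size, charged by length
print(packets_admitted(BYTE_LIMIT, 64))    # min-size, charged by length
print(packets_admitted(UNIT_LIMIT, 1))     # any size, one unit per packet
```

With byte-based charging, min-size packets fill the entire ring before
the limit is reached; with a fixed unit per packet the ring occupancy
is the same regardless of packet size.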

 .../networking/net_cachelines/net_device.rst  |   1 +
 drivers/net/veth.c                            | 267 +++++++++++++++++-
 include/linux/netdevice.h                     |   2 +
 net/core/net-sysfs.c                          |   8 +-
 net/sched/sch_generic.c                       |   6 +-
 5 files changed, 270 insertions(+), 14 deletions(-)

-- 
2.43.0

