* [PATCH net-next v2 0/5] veth: add Byte Queue Limits (BQL) support
@ 2026-04-13 9:44 hawk
2026-04-13 9:44 ` [PATCH net-next v2 5/5] selftests: net: add veth BQL stress test hawk
2026-04-13 19:49 ` [syzbot ci] Re: veth: add Byte Queue Limits (BQL) support syzbot ci
0 siblings, 2 replies; 6+ messages in thread
From: hawk @ 2026-04-13 9:44 UTC (permalink / raw)
To: netdev
Cc: kernel-team, Jesper Dangaard Brouer, Chris Arges, Mike Freemon,
Toke Høiland-Jørgensen, Shuah Khan, linux-kselftest,
Jonas Köppeler, Alexei Starovoitov, Daniel Borkmann,
David S. Miller, Jakub Kicinski, John Fastabend,
Stanislav Fomichev, bpf
From: Jesper Dangaard Brouer <hawk@kernel.org>
This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.
Problem:
veth's 256-entry ptr_ring acts as a "dark buffer" -- packets queued
there are invisible to the qdisc's AQM. Under load, the ring fills
completely (DRV_XOFF backpressure), adding up to 256 packets of
unmanaged latency before the qdisc even sees congestion.
Solution:
BQL (STACK_XOFF) dynamically limits in-flight packets, stopping the
queue before the ring fills. This keeps the ring shallow and pushes
excess packets into the qdisc, where sojourn-based AQM can measure
and drop them.
Test setup: veth pair, UDP flood, 13000 iptables rules in consumer
namespace (slows NAPI-64 cycle to ~6-7ms), ping measures RTT under load.
                 BQL off                  BQL on
  fq_codel:  RTT ~22ms, 4% loss     RTT ~1.3ms, 0% loss
  sfq:       RTT ~24ms, 0% loss     RTT ~1.5ms, 0% loss
BQL reduces ping RTT by ~17x for both qdiscs. Consumer throughput
is unchanged (~10K pps) -- BQL adds no overhead.
CoDel bug discovered during BQL development:
Our original motivation for BQL was fq_codel ping loss observed under
load (4-26% depending on NAPI cycle time). Investigating this led us
to discover a bug in the CoDel implementation: codel_dequeue() does
not reset vars->first_above_time when a flow goes empty, contrary to
the reference algorithm. This causes stale CoDel state to persist
across empty periods in fq_codel's per-flow queues, penalizing sparse
flows like ICMP ping. A fix for this has been applied to the net tree:
https://git.kernel.org/netdev/net/c/815980fe6dbb
BQL remains valuable independently: it reduces RTT by ~17x by moving
buffering from the dark ptr_ring into the qdisc. Additionally, BQL
clears STACK_XOFF per-SKB as each packet completes, rather than
batch-waking after 64 packets (DRV_XOFF). This keeps sojourn times
below fq_codel's target, preventing CoDel from entering dropping
state on non-congested flows in the first place.
Key design decisions:
- Charge-under-lock in veth_xdp_rx(): The BQL charge must precede
the ptr_ring produce, because the NAPI consumer can run on another
CPU and complete the SKB immediately after it becomes visible. To
avoid a pre-charge/undo pattern, the charge is done under the
ptr_ring producer_lock after confirming the ring is not full. BQL
is only charged when produce is guaranteed to succeed, keeping
num_queued monotonically increasing. HARD_TX_LOCK already
serializes dql_queued() (veth requires a qdisc for BQL); the
ptr_ring lock additionally allows noqueue to work correctly.
- Per-SKB BQL tracking via pointer tag: A VETH_BQL_FLAG bit in the
ptr_ring pointer records whether each SKB was BQL-charged. This is
necessary because the qdisc can be replaced live (noqueue->sfq or
vice versa) while SKBs are in-flight -- the completion side must
know the charge state that was decided at enqueue time.
- IFF_NO_QUEUE + BQL coexistence: A new dev->bql flag enables BQL
sysfs exposure for IFF_NO_QUEUE devices that opt in to DQL
accounting, without changing IFF_NO_QUEUE semantics.
Background and acknowledgments:
Mike Freemon reported the veth dark buffer problem internally at
Cloudflare and showed that recompiling the kernel with a ptr_ring
size of 30 (down from 256) made fq_codel work dramatically better.
This was the primary motivation for a proper BQL solution that
achieves the same effect dynamically without a kernel rebuild.
Chris Arges wrote a reproducer for the dark buffer latency problem:
https://github.com/netoptimizer/veth-backpressure-performance-testing
This is where we first observed ping packets being dropped under
fq_codel, which became our secondary motivation for BQL. In
production we switched to SFQ on veth devices as a workaround.
Jonas Köppeler provided extensive testing and code review.
Together we discovered that the fq_codel ping loss was actually a
12-year-old CoDel bug (stale first_above_time in empty flows), not
caused by the dark buffer itself. A fix has been applied to the net tree:
https://git.kernel.org/netdev/net/c/815980fe6dbb
Patch overview:
1. net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
2. veth: implement Byte Queue Limits (BQL) for latency reduction
3. veth: add tx_timeout watchdog as BQL safety net
4. net: sched: add timeout count to NETDEV WATCHDOG message
5. selftests: net: add veth BQL stress test
Jesper Dangaard Brouer (5):
net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
veth: implement Byte Queue Limits (BQL) for latency reduction
veth: add tx_timeout watchdog as BQL safety net
net: sched: add timeout count to NETDEV WATCHDOG message
selftests: net: add veth BQL stress test
.../networking/net_cachelines/net_device.rst | 1 +
drivers/net/veth.c | 92 +-
include/linux/netdevice.h | 2 +
net/core/net-sysfs.c | 8 +-
net/sched/sch_generic.c | 8 +-
tools/testing/selftests/net/Makefile | 3 +
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/napi_poll_hist.bt | 40 +
tools/testing/selftests/net/veth_bql_test.sh | 821 ++++++++++++++++++
.../selftests/net/veth_bql_test_virtme.sh | 124 +++
10 files changed, 1084 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/net/napi_poll_hist.bt
create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh
V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/
Changes since V1:
- Patch 1 (dev->bql flag): add kdoc entry for @bql in struct net_device.
- Patch 2 (veth BQL): charge fixed VETH_BQL_UNIT (1) per packet instead
of skb->len. veth has no link speed; the ptr_ring is packet-indexed.
Byte-based charging lets small packets sneak many entries into the ring.
Testing: min-size packet flood causes 3.7x ping RTT degradation with
skb->len vs no change with fixed-unit charging.
- Patch 3 (tx_timeout watchdog): fix race with peer NAPI: replace
netdev_tx_reset_queue() with clear_bit(STACK_XOFF) + netif_tx_wake_queue()
to avoid dql_reset() racing with concurrent dql_completed().
- Patch 5 (selftests): fix shellcheck warnings and infos:
- Quote variables passed to kill_process and exit.
- Declare and assign local variables separately (SC2155).
- Use read -r to avoid mangling backslashes (SC2162).
- Add shellcheck disable comments for intentional word splitting
(set -- $line, tc $qdisc $opts) and indirect invocation (trap).
- Make iptables-restore failure a hard FAIL instead of continuing.
- Add veth_bql_test.sh to TEST_PROGS in net/Makefile.
- Add veth_bql_test_virtme.sh to TEST_FILES (needs kernel build tree).
- Add napi_poll_hist.bt to TEST_FILES in net/Makefile.
- Add CONFIG_NET_SCH_SFQ=m to net/config (default qdisc is sfq).
- Reduce default test duration from 300s to 30s for kselftest CI.
- Fix virtme wrapper: empty args bug, check vmlinux instead of test path.
- Cover letter: update CoDel fix reference to merged commit in net tree.
Cc: Chris Arges <carges@cloudflare.com>
Cc: Mike Freemon <mfreemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-kselftest@vger.kernel.org
Cc: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Cc: kernel-team@cloudflare.com
--
2.43.0
* [PATCH net-next v2 5/5] selftests: net: add veth BQL stress test
2026-04-13 9:44 [PATCH net-next v2 0/5] veth: add Byte Queue Limits (BQL) support hawk
@ 2026-04-13 9:44 ` hawk
2026-04-15 11:47 ` Breno Leitao
2026-04-13 19:49 ` [syzbot ci] Re: veth: add Byte Queue Limits (BQL) support syzbot ci
1 sibling, 1 reply; 6+ messages in thread
From: hawk @ 2026-04-13 9:44 UTC (permalink / raw)
To: netdev
Cc: kernel-team, Jesper Dangaard Brouer, Jonas Köppeler,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan, linux-kernel, linux-kselftest
From: Jesper Dangaard Brouer <hawk@kernel.org>
Add a selftest that exercises veth's BQL (Byte Queue Limits) code path
under sustained UDP load. The test creates a veth pair with GRO enabled
(activating the NAPI path and BQL), attaches a qdisc, optionally loads
iptables rules in the consumer namespace to slow NAPI processing, and
floods UDP packets for a configurable duration.
The test serves two purposes: benchmarking BQL's latency impact under
configurable load (iptables rules, qdisc type and parameters), and
detecting kernel BUG/Oops from DQL accounting mismatches. It monitors
dmesg throughout the run and reports PASS/FAIL via kselftest (lib.sh).
Diagnostic output is printed every 5 seconds:
- BQL sysfs inflight/limit and watchdog tx_timeout counter
- qdisc stats: packets, drops, requeues, backlog, qlen, overlimits
- consumer PPS and NAPI-64 cycle time (shows fq_codel target impact)
- sink PPS (per-period delta), latency min/avg/max (stddev at exit)
- ping RTT to measure latency under load
Generating enough traffic to fill the 256-entry ptr_ring requires care:
the UDP sendto() path charges each SKB to sk_wmem_alloc, and the SKB
stays charged (via sock_wfree destructor) until the consumer NAPI thread
finishes processing it -- including any iptables rules in the receive
path. With the default sk_sndbuf (~208KB from wmem_default), only ~93
packets can be in-flight before sendto(MSG_DONTWAIT) returns EAGAIN.
Since 93 < 256 ring entries, the ring never fills and no backpressure
occurs. The test raises wmem_max via sysctl and sets SO_SNDBUF=1MB on
the flood socket to remove this bottleneck. An earlier multi-namespace
routing approach avoided this limit because ip_forward creates new SKBs
detached from the sender's socket.
The --bql-disable option (sets limit_min=1GB) enables A/B comparison.
Typical results with --nrules 6000 --qdisc-opts 'target 2ms interval 20ms':
fq_codel + BQL disabled: ping RTT ~10.8ms, 15% loss, 400KB in ptr_ring
fq_codel + BQL enabled: ping RTT ~0.6ms, 0% loss, 4KB in ptr_ring
Both cases show identical consumer speed (~20Kpps) and fq_codel drops
(~255K), proving the improvement comes purely from where packets buffer.
BQL moves buffering from the ptr_ring into the qdisc, where AQM
(fq_codel/CAKE) can act on it -- eliminating the "dark buffer" that
hides congestion from the scheduler.
The --qdisc-replace mode cycles through sfq/pfifo/fq_codel/noqueue
under active traffic to verify that stale BQL state (STACK_XOFF) is
properly handled during live qdisc transitions.
A companion wrapper (veth_bql_test_virtme.sh) launches the test inside
a virtme-ng VM, with .config validation to prevent silent stalls.
Usage:
sudo ./veth_bql_test.sh [--duration 300] [--nrules 100]
[--qdisc sfq] [--qdisc-opts '...']
[--bql-disable] [--normal-napi]
[--qdisc-replace]
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
tools/testing/selftests/net/Makefile | 3 +
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/napi_poll_hist.bt | 40 +
tools/testing/selftests/net/veth_bql_test.sh | 821 ++++++++++++++++++
.../selftests/net/veth_bql_test_virtme.sh | 124 +++
5 files changed, 989 insertions(+)
create mode 100644 tools/testing/selftests/net/napi_poll_hist.bt
create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 231245a95879..7f6524169b93 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -119,6 +119,7 @@ TEST_PROGS := \
udpgso_bench.sh \
unicast_extensions.sh \
veth.sh \
+ veth_bql_test.sh \
vlan_bridge_binding.sh \
vlan_hw_filter.sh \
vrf-xfrm-tests.sh \
@@ -196,7 +197,9 @@ TEST_FILES := \
fcnal-test.sh \
in_netns.sh \
lib.sh \
+ napi_poll_hist.bt \
settings \
+ veth_bql_test_virtme.sh \
# end of TEST_FILES
# YNL files, must be before "include ..lib.mk"
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 2a390cae41bf..7b1f41421145 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -97,6 +97,7 @@ CONFIG_NET_PKTGEN=m
CONFIG_NET_SCH_ETF=m
CONFIG_NET_SCH_FQ=m
CONFIG_NET_SCH_FQ_CODEL=m
+CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_SCH_NETEM=y
diff --git a/tools/testing/selftests/net/napi_poll_hist.bt b/tools/testing/selftests/net/napi_poll_hist.bt
new file mode 100644
index 000000000000..34d1a43906bf
--- /dev/null
+++ b/tools/testing/selftests/net/napi_poll_hist.bt
@@ -0,0 +1,40 @@
+#!/usr/bin/env bpftrace
+// SPDX-License-Identifier: GPL-2.0
+// napi_poll work histogram for veth BQL testing.
+// Shows how many packets each NAPI poll processes (0..64).
+// Full-budget (64) polls mean more work is pending; partial (<64) means
+// the ring drained before the budget was exhausted.
+//
+// Usage: bpftrace napi_poll_hist.bt
+// Interval output is a single compact line for easy script parsing.
+
+tracepoint:napi:napi_poll
+/str(args->dev_name, 8) == "veth_bql"/
+{
+ @work = lhist(args->work, 0, 65, 1);
+ @total++;
+ @sum += args->work;
+ if (args->work == args->budget) {
+ @full++;
+ }
+}
+
+interval:s:5
+{
+ $avg = @total > 0 ? @sum / @total : 0;
+ printf("napi_poll: polls=%llu full_budget=%llu partial=%llu avg_work=%llu\n",
+ @total, @full, @total - @full, $avg);
+ clear(@total);
+ clear(@full);
+ clear(@sum);
+}
+
+END
+{
+ printf("\n--- napi_poll work histogram (lifetime) ---\n");
+ print(@work);
+ clear(@work);
+ clear(@total);
+ clear(@full);
+ clear(@sum);
+}
diff --git a/tools/testing/selftests/net/veth_bql_test.sh b/tools/testing/selftests/net/veth_bql_test.sh
new file mode 100755
index 000000000000..bfbbb3432a8f
--- /dev/null
+++ b/tools/testing/selftests/net/veth_bql_test.sh
@@ -0,0 +1,821 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Veth BQL (Byte Queue Limits) stress test and A/B benchmarking tool.
+#
+# Creates a veth pair with GRO on and TSO off (ensures all packets use
+# the NAPI/ptr_ring path where BQL operates), attaches a configurable
+# qdisc, optionally loads iptables rules to slow the consumer NAPI
+# processing, and floods UDP packets at maximum rate.
+#
+# Primary uses:
+# 1) A/B comparison of latency with/without BQL (--bql-disable flag)
+# 2) Testing different qdiscs and their parameters (--qdisc, --qdisc-opts)
+# 3) Detecting kernel BUG/Oops from DQL accounting mismatches
+#
+# Key design detail -- SO_SNDBUF and wmem_max:
+# The UDP sendto() path charges each SKB to the socket's sk_wmem_alloc
+# counter. The SKB carries a destructor (sock_wfree) that releases the
+# charge only after the consumer NAPI thread on the peer veth finishes
+# processing it -- including any iptables rules in the receive path.
+# With the default sk_sndbuf (~208KB from wmem_default), only ~93
+# packets (1442B each) can be in-flight before sendto() returns EAGAIN.
+# Since 93 < 256 ptr_ring entries, the ring never fills and no qdisc
+# backpressure occurs. The test temporarily raises the global wmem_max
+# sysctl and sets SO_SNDBUF=1MB to allow enough in-flight SKBs to
+# saturate the ptr_ring. The original wmem_max is restored on exit.
+#
+# Two TX-stop mechanisms and the dark-buffer problem:
+# DRV_XOFF backpressure (commit dc82a33297fc) stops the TX queue when
+# the 256-entry ptr_ring is full. The queue is released at the end of
+# veth_poll() (commit 5442a9da6978) after processing up to 64 packets
+# (NAPI budget). Without BQL, the entire ring is a FIFO "dark buffer"
+# in front of the qdisc -- packets there are invisible to AQM.
+#
+# BQL adds STACK_XOFF, which dynamically limits in-flight bytes and
+# stops the queue *before* the ring fills. This keeps the ring
+# shallow and moves buffering into the qdisc where sojourn-based AQM
+# (codel, fq_codel, CAKE/COBALT) can measure and drop packets.
+#
+# Sojourn time and NAPI budget interaction:
+# DRV_XOFF releases backpressure once per NAPI poll (up to 64 pkts).
+# During that cycle, packets queued in the qdisc accumulate sojourn
+# time. With fq_codel's default target of 5ms, the threshold is:
+# 5000us / 64 pkts = 78us/pkt --> ~12,800 pps consumer speed.
+# Below that rate the NAPI-64 cycle exceeds the target and fq_codel
+# starts dropping. Use --nrules and --qdisc-opts to experiment.
+#
+cd "$(dirname -- "$0")" || exit 1
+source lib.sh
+
+# Defaults
+DURATION=30 # seconds; use longer --duration to reach DQL counter wrap
+NRULES=3500 # iptables rules in consumer NS (0 to disable)
+QDISC=sfq # qdisc to use (sfq, pfifo, fq_codel, etc.)
+QDISC_OPTS="" # extra qdisc parameters (e.g. "target 1ms interval 10ms")
+BQL_DISABLE=0 # 1 to disable BQL (sets limit_min high)
+NORMAL_NAPI=0 # 1 to use normal softirq NAPI (skip threaded NAPI)
+QDISC_REPLACE=0 # 1 to test qdisc replacement under active traffic
+TINY_FLOOD=0 # 1 to add 2nd UDP thread with min-size packets
+VETH_A="veth_bql0"
+VETH_B="veth_bql1"
+IP_A="10.99.0.1"
+IP_B="10.99.0.2"
+PORT=9999
+PKT_SIZE=1400 # large packets: slower producer, bigger BQL charges
+
+usage() {
+ echo "Usage: $0 [OPTIONS]"
+ echo " --duration SEC test duration (default: $DURATION)"
+ echo " --nrules N iptables rules to slow consumer (default: $NRULES, 0=disable)"
+ echo " --qdisc NAME qdisc to install (default: $QDISC)"
+ echo " --qdisc-opts STR extra qdisc params (e.g. 'target 1ms interval 10ms')"
+ echo " --bql-disable disable BQL for A/B comparison"
+ echo " --normal-napi use softirq NAPI instead of threaded NAPI"
+ echo " --qdisc-replace test qdisc replacement under active traffic"
+ echo " --tiny-flood add 2nd UDP thread with min-size packets (stress BQL bytes)"
+ exit 1
+}
+
+while [ $# -gt 0 ]; do
+ case "$1" in
+ --duration) DURATION="$2"; shift 2 ;;
+ --nrules) NRULES="$2"; shift 2 ;;
+ --qdisc) QDISC="$2"; shift 2 ;;
+ --qdisc-opts) QDISC_OPTS="$2"; shift 2 ;;
+ --bql-disable) BQL_DISABLE=1; shift ;;
+ --normal-napi) NORMAL_NAPI=1; shift ;;
+ --qdisc-replace) QDISC_REPLACE=1; shift ;;
+ --tiny-flood) TINY_FLOOD=1; shift ;;
+ --help|-h) usage ;;
+ *) echo "Unknown option: $1" >&2; usage ;;
+ esac
+done
+
+TMPDIR=$(mktemp -d)
+
+FLOOD_PID=""
+FLOOD2_PID=""
+SINK_PID=""
+PING_PID=""
+BPFTRACE_PID=""
+
+# shellcheck disable=SC2329 # cleanup is invoked indirectly via trap
+cleanup() {
+ [ -n "$BPFTRACE_PID" ] && kill_process "$BPFTRACE_PID"
+ [ -n "$FLOOD_PID" ] && kill_process "$FLOOD_PID"
+ [ -n "$FLOOD2_PID" ] && kill_process "$FLOOD2_PID"
+ [ -n "$SINK_PID" ] && kill_process "$SINK_PID"
+ [ -n "$PING_PID" ] && kill_process "$PING_PID"
+ cleanup_all_ns
+ ip link del "$VETH_A" 2>/dev/null || true
+ [ -n "$ORIG_WMEM_MAX" ] && sysctl -qw net.core.wmem_max="$ORIG_WMEM_MAX"
+ rm -rf "$TMPDIR"
+}
+trap cleanup EXIT
+
+require_command gcc
+require_command ethtool
+require_command tc
+
+# --- Function definitions ---
+
+compile_tools() {
+ echo "--- Compiling UDP flood tool ---"
+cat > "$TMPDIR"/udp_flood.c << 'CEOF'
+#include <arpa/inet.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <time.h>
+#include <unistd.h>
+
+static volatile int running = 1;
+
+static void stop(int sig) { running = 0; }
+
+struct pkt_hdr {
+ struct timespec ts;
+ unsigned long seq;
+};
+
+int main(int argc, char **argv)
+{
+ struct sockaddr_in dst;
+ struct pkt_hdr hdr;
+ unsigned long count = 0;
+ char buf[1500];
+ int sndbuf = 1048576;
+ int pkt_size, max_pkt_size;
+ int cur_size;
+ int duration;
+ int fd;
+
+ if (argc < 5) {
+ fprintf(stderr, "Usage: %s <ip> <pkt_size> <port> <duration> [max_pkt_size]\n",
+ argv[0]);
+ return 1;
+ }
+
+ pkt_size = atoi(argv[2]);
+ if (pkt_size < (int)sizeof(struct pkt_hdr))
+ pkt_size = sizeof(struct pkt_hdr);
+ if (pkt_size > (int)sizeof(buf))
+ pkt_size = sizeof(buf);
+ max_pkt_size = (argc > 5) ? atoi(argv[5]) : pkt_size;
+ if (max_pkt_size < pkt_size)
+ max_pkt_size = pkt_size;
+ if (max_pkt_size > (int)sizeof(buf))
+ max_pkt_size = sizeof(buf);
+ duration = atoi(argv[4]);
+
+ memset(&dst, 0, sizeof(dst));
+ dst.sin_family = AF_INET;
+ dst.sin_port = htons(atoi(argv[3]));
+ inet_pton(AF_INET, argv[1], &dst.sin_addr);
+
+ fd = socket(AF_INET, SOCK_DGRAM, 0);
+ if (fd < 0) {
+ perror("socket");
+ return 1;
+ }
+
+ /* Raise send buffer so sk_wmem_alloc limit doesn't cap
+ * in-flight packets before the ptr_ring (256 entries) fills.
+ * Default wmem_default ~208K only allows ~93 packets.
+ */
+ setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
+
+ memset(buf, 0xAA, sizeof(buf));
+ signal(SIGINT, stop);
+ signal(SIGTERM, stop);
+ signal(SIGALRM, stop);
+ alarm(duration);
+
+ while (running) {
+ if (max_pkt_size > pkt_size)
+ cur_size = pkt_size + (rand() % (max_pkt_size - pkt_size + 1));
+ else
+ cur_size = pkt_size;
+ clock_gettime(CLOCK_MONOTONIC, &hdr.ts);
+ hdr.seq = count;
+ memcpy(buf, &hdr, sizeof(hdr));
+ sendto(fd, buf, cur_size, MSG_DONTWAIT,
+ (struct sockaddr *)&dst, sizeof(dst));
+ count++;
+ if (!(count % 10000000))
+ fprintf(stderr, " sent: %lu M packets\n",
+ count / 1000000);
+ }
+
+ fprintf(stderr, "Total sent: %lu packets (%.1f M)\n",
+ count, (double)count / 1e6);
+ close(fd);
+ return 0;
+}
+CEOF
+gcc -O2 -Wall -o "$TMPDIR"/udp_flood "$TMPDIR"/udp_flood.c || exit $ksft_fail
+
+# UDP sink with latency measurement
+cat > "$TMPDIR"/udp_sink.c << 'CEOF'
+#include <arpa/inet.h>
+#include <math.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <time.h>
+#include <unistd.h>
+
+static volatile int running = 1;
+
+static void stop(int sig) { running = 0; }
+
+struct pkt_hdr {
+ struct timespec ts;
+ unsigned long seq;
+};
+
+static void print_periodic(unsigned long count, unsigned long delta_count,
+ double delta_sec, unsigned long drops,
+ unsigned long reorders,
+ double lat_min, double lat_sum,
+ double lat_max)
+{
+ unsigned long pps;
+
+ if (!count)
+ return;
+ pps = delta_sec > 0 ? (unsigned long)(delta_count / delta_sec) : 0;
+ fprintf(stderr, " sink: %lu pkts (%lu pps) drops=%lu reorders=%lu"
+ " latency min/avg/max = %.3f/%.3f/%.3f ms\n",
+ count, pps, drops, reorders,
+ lat_min * 1e3, (lat_sum / count) * 1e3,
+ lat_max * 1e3);
+}
+
+static void print_final(unsigned long count, double elapsed_sec,
+ unsigned long drops, unsigned long reorders,
+ double lat_min, double lat_sum,
+ double lat_sum_sq, double lat_max)
+{
+ unsigned long pps;
+ double avg, stddev;
+
+ if (!count)
+ return;
+ pps = elapsed_sec > 0 ? (unsigned long)(count / elapsed_sec) : 0;
+ avg = lat_sum / count;
+ stddev = sqrt(lat_sum_sq / count - avg * avg);
+ fprintf(stderr, " sink: %lu pkts (%lu avg pps) drops=%lu reorders=%lu"
+ " latency min/avg/max/stddev = %.3f/%.3f/%.3f/%.3f ms\n",
+ count, pps, drops, reorders,
+ lat_min * 1e3, avg * 1e3,
+ lat_max * 1e3, stddev * 1e3);
+}
+
+int main(int argc, char **argv)
+{
+ unsigned long next_seq = 0, drops = 0, reorders = 0;
+ double lat_min = 1e9, lat_max = 0, lat_sum = 0, lat_sum_sq = 0;
+ unsigned long count = 0, last_count = 0;
+ struct sockaddr_in addr;
+ char buf[2048];
+ int fd, one = 1;
+
+ if (argc < 2) {
+ fprintf(stderr, "Usage: %s <port>\n", argv[0]);
+ return 1;
+ }
+
+ fd = socket(AF_INET, SOCK_DGRAM, 0);
+ if (fd < 0) {
+ perror("socket");
+ return 1;
+ }
+ setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
+
+ /* Timeout so recv() unblocks periodically to check 'running' flag.
+ * Needed because glibc signal() sets SA_RESTART, so SIGTERM
+ * does not interrupt recv().
+ */
+ struct timeval tv = { .tv_sec = 1 };
+ setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
+
+ memset(&addr, 0, sizeof(addr));
+ addr.sin_family = AF_INET;
+ addr.sin_port = htons(atoi(argv[1]));
+ addr.sin_addr.s_addr = INADDR_ANY;
+ if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
+ perror("bind");
+ return 1;
+ }
+
+ signal(SIGINT, stop);
+ signal(SIGTERM, stop);
+
+ struct timespec t_start, t_last_print;
+
+ clock_gettime(CLOCK_MONOTONIC, &t_start);
+ t_last_print = t_start;
+
+ while (running) {
+ struct pkt_hdr hdr;
+ struct timespec now;
+ ssize_t n;
+ double lat;
+
+ n = recv(fd, buf, sizeof(buf), 0);
+ if (n < (ssize_t)sizeof(struct pkt_hdr))
+ continue;
+
+ clock_gettime(CLOCK_MONOTONIC, &now);
+ memcpy(&hdr, buf, sizeof(hdr));
+
+ /* Track drops (gaps) and reorders (late arrivals) */
+ if (hdr.seq > next_seq)
+ drops += hdr.seq - next_seq;
+ if (hdr.seq < next_seq)
+ reorders++;
+ if (hdr.seq >= next_seq)
+ next_seq = hdr.seq + 1;
+
+ lat = (now.tv_sec - hdr.ts.tv_sec) +
+ (now.tv_nsec - hdr.ts.tv_nsec) * 1e-9;
+
+ if (lat < lat_min)
+ lat_min = lat;
+ if (lat > lat_max)
+ lat_max = lat;
+ lat_sum += lat;
+ lat_sum_sq += lat * lat;
+ count++;
+
+ {
+ double since_print;
+
+ since_print = (now.tv_sec - t_last_print.tv_sec) +
+ (now.tv_nsec - t_last_print.tv_nsec) * 1e-9;
+ if (since_print >= 5.0) {
+ print_periodic(count, count - last_count,
+ since_print, drops,
+ reorders, lat_min,
+ lat_sum, lat_max);
+ last_count = count;
+ t_last_print = now;
+ }
+ }
+ }
+
+ {
+ struct timespec t_now;
+ double elapsed;
+
+ clock_gettime(CLOCK_MONOTONIC, &t_now);
+ elapsed = (t_now.tv_sec - t_start.tv_sec) +
+ (t_now.tv_nsec - t_start.tv_nsec) * 1e-9;
+ print_final(count, elapsed, drops, reorders,
+ lat_min, lat_sum, lat_sum_sq, lat_max);
+ }
+ close(fd);
+ return 0;
+}
+CEOF
+gcc -O2 -Wall -o "$TMPDIR"/udp_sink "$TMPDIR"/udp_sink.c -lm || exit $ksft_fail
+}
+
+setup_veth() {
+ log_info "Setting up veth pair with GRO"
+ setup_ns NS || exit $ksft_skip
+ ip link add "$VETH_A" type veth peer name "$VETH_B" || \
+ { echo "Failed to create veth pair (need root?)"; exit $ksft_skip; }
+ ip link set "$VETH_B" netns "$NS" || \
+ { echo "Failed to move veth to namespace"; exit $ksft_skip; }
+
+ # Configure IPs
+ ip addr add "${IP_A}/24" dev "$VETH_A"
+ ip link set "$VETH_A" up
+
+ ip -netns "$NS" addr add "${IP_B}/24" dev "$VETH_B"
+ ip -netns "$NS" link set "$VETH_B" up
+
+ # Raise wmem_max so the flood tool's SO_SNDBUF takes effect.
+ # Default 212992 caps in-flight to ~93 packets (sk_wmem_alloc limit),
+ # which is less than the 256-entry ptr_ring and prevents backpressure.
+ ORIG_WMEM_MAX=$(sysctl -n net.core.wmem_max)
+ sysctl -qw net.core.wmem_max=1048576
+
+ # Enable GRO on both ends -- activates NAPI -- BQL code path
+ ethtool -K "$VETH_A" gro on 2>/dev/null || true
+ ip netns exec "$NS" ethtool -K "$VETH_B" gro on 2>/dev/null || true
+
+ # Disable TSO so veth_skb_is_eligible_for_gro() returns true for all
+ # packets, ensuring every SKB takes the NAPI/ptr_ring path. With TSO
+ # enabled, only packets matching sock_wfree + GRO features are eligible;
+ # disabling TSO removes that filter unconditionally.
+ ethtool -K "$VETH_A" tso off gso off 2>/dev/null || true
+ ip netns exec "$NS" ethtool -K "$VETH_B" tso off gso off 2>/dev/null || true
+
+ # Enable threaded NAPI -- this is critical: BQL backpressure (STACK_XOFF)
+ # only engages when producer and consumer run on separate CPUs.
+ # Without threaded NAPI, softirq completions happen too fast for BQL
+ # to build up enough in-flight bytes to trigger the limit.
+ if [ "$NORMAL_NAPI" -eq 0 ]; then
+ echo 1 > /sys/class/net/"$VETH_A"/threaded 2>/dev/null || true
+ ip netns exec "$NS" sh -c "echo 1 > /sys/class/net/$VETH_B/threaded" 2>/dev/null || true
+ log_info "Threaded NAPI enabled"
+ else
+ log_info "Using normal softirq NAPI (threaded NAPI disabled)"
+ fi
+}
+
+install_qdisc() {
+ local qdisc="${1:-$QDISC}"
+ local opts="${2:-}"
+ # Add a qdisc -- veth defaults to noqueue, but BQL needs a qdisc
+ # because STACK_XOFF is checked by the qdisc layer.
+ # Note: qdisc_create() auto-fixes txqueuelen=0 on IFF_NO_QUEUE devices
+ # to DEFAULT_TX_QUEUE_LEN (commit 84c46dd86538).
+ log_info "Installing qdisc: $qdisc $opts"
+ # shellcheck disable=SC2086 # $opts must word-split for tc arguments
+ tc qdisc replace dev "$VETH_A" root $qdisc $opts
+ # shellcheck disable=SC2086
+ ip netns exec "$NS" tc qdisc replace dev "$VETH_B" root $qdisc $opts
+}
+
+remove_qdisc() {
+ log_info "Removing qdisc (reverting to noqueue)"
+ tc qdisc del dev "$VETH_A" root 2>/dev/null || true
+ ip netns exec "$NS" tc qdisc del dev "$VETH_B" root 2>/dev/null || true
+}
+
+setup_iptables() {
+ # Bulk-load iptables rules in consumer namespace to slow NAPI processing.
+ # Many rules force per-packet linear rule traversal, increasing consumer
+ # overhead and BQL inflight bytes -- simulates realistic k8s-like workload.
+ if [ "$NRULES" -gt 0 ]; then
+ # shellcheck disable=SC2016 # single quotes intentional
+ ip netns exec "$NS" bash -c '
+ iptables-restore < <(
+ echo "*filter"
+ for n in $(seq 1 '"$NRULES"'); do
+ echo "-I INPUT -d '"$IP_B"'"
+ done
+ echo "COMMIT"
+ )
+ ' 2>/dev/null || { RET=$ksft_fail retmsg="iptables not available" \
+ log_test "iptables"; exit "$EXIT_STATUS"; }
+ log_info "Loaded $NRULES iptables rules in consumer NS"
+ fi
+}
+
+check_bql_sysfs() {
+ BQL_DIR="/sys/class/net/${VETH_A}/queues/tx-0/byte_queue_limits"
+ if [ -d "$BQL_DIR" ]; then
+ log_info "BQL sysfs found: $BQL_DIR"
+ if [ "$BQL_DISABLE" -eq 1 ]; then
+ echo 1073741824 > "$BQL_DIR/limit_min"
+ log_info "BQL effectively disabled (limit_min=1G)"
+ fi
+ else
+ log_info "BQL sysfs absent (veth IFF_NO_QUEUE+lltx, DQL accounting still active)"
+ BQL_DIR=""
+ fi
+}
+
+start_traffic() {
+ # Snapshot dmesg before test
+ DMESG_BEFORE=$(dmesg | wc -l)
+
+ log_info "Starting UDP sink in namespace"
+ ip netns exec "$NS" "$TMPDIR"/udp_sink "$PORT" &
+ SINK_PID=$!
+ sleep 0.2
+
+ log_info "Starting ping to $IP_B (5/s) to measure latency under load"
+ ping -i 0.2 -w "$DURATION" "$IP_B" > "$TMPDIR"/ping.log 2>&1 &
+ PING_PID=$!
+
+ log_info "Flooding ${PKT_SIZE}-byte UDP packets for ${DURATION}s"
+ "$TMPDIR"/udp_flood "$IP_B" "$PKT_SIZE" "$PORT" "$DURATION" &
+ FLOOD_PID=$!
+
+ # Optional: 2nd UDP thread with tiny packets to stress byte-based BQL.
+ # Small packets charge few BQL bytes, letting many more into the
+ # ptr_ring before STACK_XOFF fires -- exposing the dark buffer.
+ if [ "$TINY_FLOOD" -eq 1 ]; then
+ local port2=$((PORT + 1))
+ ip netns exec "$NS" "$TMPDIR"/udp_sink "$port2" &
+ log_info "Starting 2nd UDP flood (min-size pkts) on port $port2"
+ "$TMPDIR"/udp_flood "$IP_B" 24 "$port2" "$DURATION" &
+ FLOOD2_PID=$!
+ fi
+
+ # Optional: start bpftrace napi_poll histogram (best-effort)
+ local bt_script
+ bt_script="$(dirname -- "$0")/napi_poll_hist.bt"
+ if command -v bpftrace >/dev/null 2>&1 && [ -f "$bt_script" ]; then
+ bpftrace "$bt_script" > "$TMPDIR"/napi_poll.log 2>&1 &
+ BPFTRACE_PID=$!
+ log_info "bpftrace napi_poll histogram started (pid=$BPFTRACE_PID)"
+ fi
+}
+
+stop_traffic() {
+ [ -n "$FLOOD_PID" ] && kill_process "$FLOOD_PID"
+ FLOOD_PID=""
+ [ -n "$FLOOD2_PID" ] && kill_process "$FLOOD2_PID"
+ FLOOD2_PID=""
+ [ -n "$SINK_PID" ] && kill_process "$SINK_PID"
+ SINK_PID=""
+ [ -n "$PING_PID" ] && kill_process "$PING_PID"
+ PING_PID=""
+ [ -n "$BPFTRACE_PID" ] && kill_process "$BPFTRACE_PID"
+ BPFTRACE_PID=""
+}
+
+check_dmesg_bug() {
+ local bug_pattern='kernel BUG|BUG:|Oops:|dql_completed'
+ local warn_pattern='WARNING:|asks to queue packet|NETDEV WATCHDOG'
+ if dmesg | tail -n +$((DMESG_BEFORE + 1)) | \
+ grep -qE "$bug_pattern"; then
+ dmesg | tail -n +$((DMESG_BEFORE + 1)) | \
+ grep -B2 -A20 -E "$bug_pattern|$warn_pattern"
+ return 1
+ fi
+ # Log new warnings since last check (don't repeat old ones)
+ local cur_lines
+ cur_lines=$(dmesg | wc -l)
+ if [ "$cur_lines" -gt "${DMESG_WARN_SEEN:-$DMESG_BEFORE}" ]; then
+ local new_warns
+ new_warns=$(dmesg | tail -n +$(("${DMESG_WARN_SEEN:-$DMESG_BEFORE}" + 1)) | \
+ grep -E "$warn_pattern") || true
+ if [ -n "$new_warns" ]; then
+ local cnt
+ cnt=$(echo "$new_warns" | wc -l)
+ echo " WARN: $cnt new kernel warning(s):"
+ echo "$new_warns" | tail -5
+ fi
+ fi
+ DMESG_WARN_SEEN=$cur_lines
+ return 0
+}
+
+print_periodic_stats() {
+ local elapsed="$1"
+
+ # BQL stats and watchdog counter
+ WD_CNT=$(cat /sys/class/net/${VETH_A}/queues/tx-0/tx_timeout \
+ 2>/dev/null) || WD_CNT="?"
+ if [ -n "$BQL_DIR" ] && [ -d "$BQL_DIR" ]; then
+ INFLIGHT=$(cat "$BQL_DIR/inflight" 2>/dev/null || echo "?")
+ LIMIT=$(cat "$BQL_DIR/limit" 2>/dev/null || echo "?")
+ echo " [${elapsed}s] BQL inflight=${INFLIGHT} limit=${LIMIT}" \
+ "watchdog=${WD_CNT}"
+ else
+ echo " [${elapsed}s] watchdog=${WD_CNT} (no BQL sysfs)"
+ fi
+
+ # Qdisc stats
+ JQ_FMT='"qdisc \(.kind) pkts=\(.packets) drops=\(.drops)'
+ JQ_FMT+=' requeues=\(.requeues) backlog=\(.backlog)'
+ JQ_FMT+=' qlen=\(.qlen) overlimits=\(.overlimits)"'
+ CUR_QPKTS=$(tc -j -s qdisc show dev "$VETH_A" root 2>/dev/null |
+ jq -r '.[0].packets // 0' 2>/dev/null) || CUR_QPKTS=0
+ QSTATS=$(tc -j -s qdisc show dev "$VETH_A" root 2>/dev/null |
+ jq -r ".[0] | $JQ_FMT" 2>/dev/null) &&
+ echo " [${elapsed}s] $QSTATS" || true
+
+ # Consumer PPS and per-packet processing time
+ if [ "$PREV_QPKTS" -gt 0 ] 2>/dev/null; then
+ DELTA=$((CUR_QPKTS - PREV_QPKTS))
+ PPS=$((DELTA / INTERVAL))
+ if [ "$PPS" -gt 0 ]; then
+ PKT_MS=$(awk "BEGIN {printf \"%.3f\", 1000.0/$PPS}")
+ NAPI_MS=$(awk "BEGIN {printf \"%.1f\", 64000.0/$PPS}")
+ echo " [${elapsed}s] consumer: ${PPS} pps" \
+ "(~${PKT_MS}ms/pkt, NAPI-64 cycle ~${NAPI_MS}ms)"
+ fi
+ fi
+ PREV_QPKTS=$CUR_QPKTS
+
+ # softnet_stat: per-CPU tracking to detect same-CPU vs multi-CPU NAPI
+ # /proc/net/softnet_stat columns: processed, dropped, time_squeeze (hex, per-CPU)
+ local cpu=0 total_proc=0 total_sq=0 active_cpus=""
+ while read -r line; do
+ # shellcheck disable=SC2086 # word splitting on $line is intentional
+ set -- $line
+ local cur_p=$((0x${1})) cur_sq=$((0x${3}))
+ if [ -f "$TMPDIR/softnet_cpu${cpu}" ]; then
+ read -r prev_p prev_sq < "$TMPDIR/softnet_cpu${cpu}"
+ local dp=$((cur_p - prev_p)) dsq=$((cur_sq - prev_sq))
+ total_proc=$((total_proc + dp))
+ total_sq=$((total_sq + dsq))
+ [ "$dp" -gt 0 ] && active_cpus="${active_cpus} cpu${cpu}(+${dp})"
+ fi
+ echo "$cur_p $cur_sq" > "$TMPDIR/softnet_cpu${cpu}"
+ cpu=$((cpu + 1))
+ done < /proc/net/softnet_stat
+ local n_active
+ n_active=$(echo "$active_cpus" | wc -w)
+ local cpu_mode="single-CPU"
+ [ "$n_active" -gt 1 ] && cpu_mode="multi-CPU(${n_active})"
+ if [ "$total_sq" -gt 0 ] && [ "$INTERVAL" -gt 0 ]; then
+ echo " [${elapsed}s] softnet: processed=${total_proc}" \
+ "time_squeeze=${total_sq} (${total_sq}/${INTERVAL}s)" \
+ "${cpu_mode}:${active_cpus}"
+ else
+ echo " [${elapsed}s] softnet: processed=${total_proc}" \
+ "time_squeeze=${total_sq}" \
+ "${cpu_mode}:${active_cpus}"
+ fi
+
+ # napi_poll histogram (from bpftrace, if running)
+ if [ -n "$BPFTRACE_PID" ] && [ -f "$TMPDIR"/napi_poll.log ]; then
+ local napi_line
+ napi_line=$(grep '^napi_poll:' "$TMPDIR"/napi_poll.log | tail -1)
+ [ -n "$napi_line" ] && echo " [${elapsed}s] $napi_line"
+ fi
+
+ # Ping RTT
+ PING_RTT=$(tail -1 "$TMPDIR"/ping.log 2>/dev/null | grep -oP 'time=\K[0-9.]+') &&
+ echo " [${elapsed}s] ping RTT=${PING_RTT}ms" || true
+}
+
+monitor_loop() {
+ ELAPSED=0
+ INTERVAL=5
+ PREV_QPKTS=0
+ # Seed per-CPU softnet baselines
+ local cpu=0
+ while read -r line; do
+ # shellcheck disable=SC2086 # word splitting on $line is intentional
+ set -- $line
+ echo "$((0x${1})) $((0x${3}))" > "$TMPDIR/softnet_cpu${cpu}"
+ cpu=$((cpu + 1))
+ done < /proc/net/softnet_stat
+ while kill -0 "$FLOOD_PID" 2>/dev/null; do
+ sleep "$INTERVAL"
+ ELAPSED=$((ELAPSED + INTERVAL))
+
+ if ! check_dmesg_bug; then
+ RET=$ksft_fail
+ retmsg="BUG_ON triggered in dql_completed at ${ELAPSED}s"
+ log_test "veth_bql"
+ exit "$EXIT_STATUS"
+ fi
+
+ print_periodic_stats "$ELAPSED"
+ done
+ wait "$FLOOD_PID" || true
+ FLOOD_PID=""
+}
+
+# Verify traffic is flowing by checking device tx_packets counter.
+# Works for both qdisc and noqueue modes.
+verify_traffic_flowing() {
+ local label="$1"
+ local prev_tx cur_tx
+
+ # Skip check if flood producer already exited (not a stall)
+ if [ -n "$FLOOD_PID" ] && ! kill -0 "$FLOOD_PID" 2>/dev/null; then
+ log_info "$label flood producer exited (duration reached)"
+ return 0
+ fi
+
+ prev_tx=$(cat /sys/class/net/${VETH_A}/statistics/tx_packets \
+ 2>/dev/null) || prev_tx=0
+ sleep 0.5
+ cur_tx=$(cat /sys/class/net/${VETH_A}/statistics/tx_packets \
+ 2>/dev/null) || cur_tx=0
+ if [ "$cur_tx" -gt "$prev_tx" ]; then
+ log_info "$label traffic flowing (tx: $prev_tx -> $cur_tx)"
+ return 0
+ fi
+ log_info "$label traffic STALLED (tx: $prev_tx -> $cur_tx)"
+ return 1
+}
+
+collect_results() {
+ local test_name="${1:-veth_bql}"
+
+ # Ping summary
+ wait "$PING_PID" 2>/dev/null || true
+ PING_PID=""
+ if [ -f "$TMPDIR"/ping.log ]; then
+ PING_LOSS=$(grep -o '[0-9.]*% packet loss' "$TMPDIR"/ping.log) &&
+ log_info "Ping loss: $PING_LOSS"
+ PING_SUMMARY=$(tail -1 "$TMPDIR"/ping.log)
+ log_info "Ping summary: $PING_SUMMARY"
+ fi
+
+ # Watchdog summary
+ WD_FINAL=$(cat /sys/class/net/${VETH_A}/queues/tx-0/tx_timeout \
+ 2>/dev/null) || WD_FINAL=0
+ if [ "$WD_FINAL" -gt 0 ] 2>/dev/null; then
+ log_info "Watchdog fired ${WD_FINAL} time(s)"
+ dmesg | tail -n +$((DMESG_BEFORE + 1)) | \
+ grep -E 'NETDEV WATCHDOG|veth backpressure' || true
+ fi
+
+ # Final dmesg check -- only upgrade to fail, never override existing fail
+ if ! check_dmesg_bug; then
+ RET=$ksft_fail
+ retmsg="BUG_ON triggered in dql_completed"
+ fi
+ log_test "$test_name"
+ exit "$EXIT_STATUS"
+}
+
+# --- Test modes ---
+
+test_bql_stress() {
+ RET=$ksft_pass
+ compile_tools
+ setup_veth
+ install_qdisc "$QDISC" "$QDISC_OPTS"
+ setup_iptables
+ log_info "kernel: $(uname -r)"
+ check_bql_sysfs
+ start_traffic
+ monitor_loop
+ collect_results "veth_bql"
+}
+
+# Test qdisc replacement under active traffic. Cycles through several
+# qdiscs including a transition to noqueue (tc qdisc del) to verify
+# that stale BQL state (STACK_XOFF) is properly reset during qdisc
+# transitions.
+test_qdisc_replace() {
+ local qdiscs=("sfq" "pfifo" "fq_codel")
+ local step=2
+ local elapsed=0
+ local idx
+
+ RET=$ksft_pass
+ compile_tools
+ setup_veth
+ install_qdisc "$QDISC" "$QDISC_OPTS"
+ setup_iptables
+ log_info "kernel: $(uname -r)"
+ check_bql_sysfs
+ start_traffic
+
+ while [ "$elapsed" -lt "$DURATION" ] && kill -0 "$FLOOD_PID" 2>/dev/null; do
+ sleep "$step"
+ elapsed=$((elapsed + step))
+
+ if ! check_dmesg_bug; then
+ RET=$ksft_fail
+ retmsg="BUG_ON during qdisc replacement at ${elapsed}s"
+ break
+ fi
+
+ # Cycle: sfq -> pfifo -> fq_codel -> noqueue -> sfq -> ...
+ idx=$(( (elapsed / step - 1) % (${#qdiscs[@]} + 1) ))
+ if [ "$idx" -eq "${#qdiscs[@]}" ]; then
+ remove_qdisc
+ else
+ install_qdisc "${qdiscs[$idx]}"
+ fi
+
+ # Print BQL and qdisc stats after each replacement
+ if [ -n "$BQL_DIR" ] && [ -d "$BQL_DIR" ]; then
+ local inflight limit limit_min limit_max holding
+ inflight=$(cat "$BQL_DIR/inflight" 2>/dev/null || echo "?")
+ limit=$(cat "$BQL_DIR/limit" 2>/dev/null || echo "?")
+ limit_min=$(cat "$BQL_DIR/limit_min" 2>/dev/null || echo "?")
+ limit_max=$(cat "$BQL_DIR/limit_max" 2>/dev/null || echo "?")
+ holding=$(cat "$BQL_DIR/holding_time" 2>/dev/null || echo "?")
+ echo " [${elapsed}s] BQL inflight=${inflight} limit=${limit}" \
+ "limit_min=${limit_min} limit_max=${limit_max}" \
+ "holding=${holding}"
+ fi
+ local cur_qdisc
+ cur_qdisc=$(tc qdisc show dev "$VETH_A" root 2>/dev/null | \
+ awk '{print $2}') || cur_qdisc="none"
+ local txq_state
+ txq_state=$(cat /sys/class/net/${VETH_A}/queues/tx-0/tx_timeout \
+ 2>/dev/null) || txq_state="?"
+ echo " [${elapsed}s] qdisc=${cur_qdisc} watchdog=${txq_state}"
+
+ if ! verify_traffic_flowing "[${elapsed}s]"; then
+ RET=$ksft_fail
+ retmsg="Traffic stalled after qdisc replacement at ${elapsed}s"
+ break
+ fi
+ done
+
+ stop_traffic
+ collect_results "veth_bql_qdisc_replace"
+}
+
+# --- Main ---
+if [ "$QDISC_REPLACE" -eq 1 ]; then
+ test_qdisc_replace
+else
+ test_bql_stress
+fi
diff --git a/tools/testing/selftests/net/veth_bql_test_virtme.sh b/tools/testing/selftests/net/veth_bql_test_virtme.sh
new file mode 100755
index 000000000000..bb8dde0f6c00
--- /dev/null
+++ b/tools/testing/selftests/net/veth_bql_test_virtme.sh
@@ -0,0 +1,124 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Launch veth BQL test inside virtme-ng
+#
+# Must be run from the kernel build tree root.
+#
+# Options:
+# --verbose Show kernel console (vng boot messages) in real time.
+# Useful for debugging kernel panics / BUG_ON crashes.
+# All other options are forwarded to veth_bql_test.sh (see --help there).
+#
+# Examples (run from kernel tree root):
+# ./tools/testing/selftests/net/veth_bql_test_virtme.sh [OPTIONS]
+# --duration 20 --nrules 1000
+# --qdisc fq_codel --bql-disable
+# --verbose --qdisc-replace --duration 60
+
+set -eu
+
+# Parse --verbose (consumed here, not forwarded to the inner test).
+VERBOSE=""
+INNER_ARGS=()
+for arg in "$@"; do
+ if [ "$arg" = "--verbose" ]; then
+ VERBOSE="--verbose"
+ else
+ INNER_ARGS+=("$arg")
+ fi
+done
+TEST_ARGS=""
+[ ${#INNER_ARGS[@]} -gt 0 ] && TEST_ARGS=$(printf '%q ' "${INNER_ARGS[@]}")
+
+if [ ! -f "vmlinux" ]; then
+ echo "ERROR: virtme-ng needs vmlinux; run from a compiled kernel tree:" >&2
+ echo " cd /path/to/kernel && $0" >&2
+ exit 1
+fi
+
+# Verify .config has the options needed for virtme-ng and this test.
+# Without these the VM silently stalls with no output.
+KCONFIG=".config"
+if [ ! -f "$KCONFIG" ]; then
+ echo "ERROR: No .config found -- build the kernel first" >&2
+ exit 1
+fi
+
+MISSING=""
+for opt in CONFIG_VIRTIO CONFIG_VIRTIO_PCI CONFIG_VIRTIO_NET \
+ CONFIG_VIRTIO_CONSOLE CONFIG_NET_9P CONFIG_NET_9P_VIRTIO \
+ CONFIG_9P_FS CONFIG_VETH CONFIG_BQL; do
+ if ! grep -q "^${opt}=[ym]" "$KCONFIG"; then
+ MISSING+=" $opt\n"
+ fi
+done
+if [ -n "$MISSING" ]; then
+ echo "ERROR: .config is missing options required by virtme-ng:" >&2
+ echo -e "$MISSING" >&2
+ echo "Consider: vng --kconfig (or make defconfig + enable above)" >&2
+ exit 1
+fi
+
+TESTDIR="tools/testing/selftests/net"
+TESTNAME="veth_bql_test.sh"
+LOGFILE="veth_bql_test.log"
+LOGPATH="$TESTDIR/$LOGFILE"
+CONSOLELOG="veth_bql_console.log"
+rm -f "$LOGPATH" "$CONSOLELOG"
+
+echo "Starting VM... test output in $LOGPATH, kernel console in $CONSOLELOG"
+echo "(VM is booting, please wait ~30s)"
+
+# Always capture kernel console to a file via a second QEMU serial port.
+# vng claims ttyS0 (mapped to /dev/null); --qemu-opts adds ttyS1 on COM2.
+# earlycon registers COM2's I/O port (0x2f8) as a persistent console.
+# (plain console=ttyS1 does NOT work: the 8250 driver registers once,
+# ttyS0 wins, and ttyS1 is never picked up.)
+# --verbose additionally shows kernel console in real time on the terminal.
+SERIAL_CONSOLE="earlycon=uart8250,io,0x2f8,115200"
+SERIAL_CONSOLE+=" console=uart8250,io,0x2f8,115200"
+set +e
+vng $VERBOSE --cpus 4 --memory 2G \
+ --rwdir "$TESTDIR" \
+ --append "panic=5 loglevel=4 $SERIAL_CONSOLE" \
+ --qemu-opts="-serial file:$CONSOLELOG" \
+ --exec "cd $TESTDIR && \
+ ./$TESTNAME $TEST_ARGS 2>&1 | \
+ tee $LOGFILE; echo EXIT_CODE=\$? >> $LOGFILE"
+VNG_RC=$?
+set -e
+
+echo ""
+if [ "$VNG_RC" -ne 0 ]; then
+ echo "***********************************************************"
+ echo "* VM CRASHED -- kernel panic or BUG_ON (vng rc=$VNG_RC)"
+ echo "***********************************************************"
+ if [ -s "$CONSOLELOG" ] && \
+ grep -qiE 'kernel BUG|BUG:|Oops:|panic|dql_completed' "$CONSOLELOG"; then
+ echo ""
+ echo "--- kernel backtrace ($CONSOLELOG) ---"
+ grep -iE -A30 'kernel BUG|BUG:|Oops:|panic|dql_completed' \
+ "$CONSOLELOG" | head -50
+ else
+ echo ""
+ echo "Re-run with --verbose to see the kernel backtrace:"
+ echo " $0 --verbose ${INNER_ARGS[*]}"
+ fi
+ exit 1
+elif [ ! -f "$LOGPATH" ]; then
+ echo "No log file found -- VM may have crashed before writing output"
+ exit 2
+else
+ echo "=== VM finished ==="
+fi
+
+# Scan console log for unexpected kernel warnings (even on clean exit)
+if [ -s "$CONSOLELOG" ]; then
+ WARN_PATTERN='kernel BUG|BUG:|Oops:|dql_completed|WARNING:|asks to queue packet|NETDEV WATCHDOG'
+ WARN_LINES=$(grep -cE "$WARN_PATTERN" "$CONSOLELOG" 2>/dev/null) || WARN_LINES=0
+ if [ "$WARN_LINES" -gt 0 ]; then
+ echo ""
+ echo "*** kernel warnings in $CONSOLELOG ($WARN_LINES lines) ***"
+ grep -E "$WARN_PATTERN" "$CONSOLELOG" | head -20
+ fi
+fi
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [syzbot ci] Re: veth: add Byte Queue Limits (BQL) support
2026-04-13 9:44 [PATCH net-next v2 0/5] veth: add Byte Queue Limits (BQL) support hawk
2026-04-13 9:44 ` [PATCH net-next v2 5/5] selftests: net: add veth BQL stress test hawk
@ 2026-04-13 19:49 ` syzbot ci
2026-04-14 8:06 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 6+ messages in thread
From: syzbot ci @ 2026-04-13 19:49 UTC (permalink / raw)
To: andrew, ast, bpf, corbet, daniel, davem, edumazet, frederic, hawk,
horms, j.koeppeler, jhs, jiri, john.fastabend, kernel-team,
krikku, kuba, kuniyu, linux-doc, linux-kernel, linux-kselftest,
netdev, pabeni, sdf, shuah, skhan, yajun.deng
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v2] veth: add Byte Queue Limits (BQL) support
https://lore.kernel.org/all/20260413094442.1376022-1-hawk@kernel.org
* [PATCH net-next v2 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
* [PATCH net-next v2 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
* [PATCH net-next v2 3/5] veth: add tx_timeout watchdog as BQL safety net
* [PATCH net-next v2 4/5] net: sched: add timeout count to NETDEV WATCHDOG message
* [PATCH net-next v2 5/5] selftests: net: add veth BQL stress test
and found the following issue:
WARNING in veth_napi_del_range
Full report is available here:
https://ci.syzbot.org/series/ee732006-8545-4abd-a105-b4b1592a7baf
***
WARNING in veth_napi_del_range
tree: net-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base: 8806d502e0a7e7d895b74afbd24e8550a65a2b17
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/90743a26-f003-44cf-abcc-5991c47588b2/config
syz repro: https://ci.syzbot.org/findings/d068bfb2-9f8b-466a-95b4-cd7e7b00006c/syz_repro
------------[ cut here ]------------
index >= dev->num_tx_queues
WARNING: ./include/linux/netdevice.h:2672 at netdev_get_tx_queue include/linux/netdevice.h:2672 [inline], CPU#0: syz.1.27/6002
WARNING: ./include/linux/netdevice.h:2672 at veth_napi_del_range+0x3b7/0x4e0 drivers/net/veth.c:1142, CPU#0: syz.1.27/6002
Modules linked in:
CPU: 0 UID: 0 PID: 6002 Comm: syz.1.27 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:netdev_get_tx_queue include/linux/netdevice.h:2672 [inline]
RIP: 0010:veth_napi_del_range+0x3b7/0x4e0 drivers/net/veth.c:1142
Code: 00 e8 ad 96 69 fe 44 39 6c 24 10 74 5e e8 41 61 44 fb 41 ff c5 49 bc 00 00 00 00 00 fc ff df e9 6d ff ff ff e8 2a 61 44 fb 90 <0f> 0b 90 42 80 3c 23 00 75 8e eb 94 48 8b 0c 24 80 e1 07 80 c1 03
RSP: 0018:ffffc90003adf918 EFLAGS: 00010293
RAX: ffffffff86814ec6 RBX: 1ffff110227a6c03 RCX: ffff888103a857c0
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000002
RBP: 1ffff110227a6c9a R08: ffff888113f01ab7 R09: 0000000000000000
R10: ffff888113f01a98 R11: ffffed10227e0357 R12: dffffc0000000000
R13: 0000000000000002 R14: 0000000000000002 R15: ffff888113d36018
FS: 000055555ea16500(0000) GS:ffff88818de4a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007efc287456b8 CR3: 000000010cdd0000 CR4: 00000000000006f0
Call Trace:
<TASK>
veth_napi_del drivers/net/veth.c:1153 [inline]
veth_disable_xdp+0x1b0/0x310 drivers/net/veth.c:1255
veth_xdp_set drivers/net/veth.c:1693 [inline]
veth_xdp+0x48e/0x730 drivers/net/veth.c:1717
dev_xdp_propagate+0x125/0x260 net/core/dev_api.c:348
bond_xdp_set drivers/net/bonding/bond_main.c:5715 [inline]
bond_xdp+0x3ca/0x830 drivers/net/bonding/bond_main.c:5761
dev_xdp_install+0x42c/0x600 net/core/dev.c:10387
dev_xdp_detach_link net/core/dev.c:10579 [inline]
bpf_xdp_link_release+0x362/0x540 net/core/dev.c:10595
bpf_link_free+0x103/0x480 kernel/bpf/syscall.c:3292
bpf_link_put_direct kernel/bpf/syscall.c:3344 [inline]
bpf_link_release+0x6b/0x80 kernel/bpf/syscall.c:3351
__fput+0x44f/0xa70 fs/file_table.c:469
task_work_run+0x1d9/0x270 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0xed/0x480 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x32d/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5bda39c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffdca2969e8 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4
RAX: 0000000000000000 RBX: 00007f5bda617da0 RCX: 00007f5bda39c819
RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003
RBP: 00007f5bda617da0 R08: 00007f5bda616128 R09: 0000000000000000
R10: 000000000003fd78 R11: 0000000000000246 R12: 0000000000010fb8
R13: 00007f5bda61609c R14: 0000000000010cdd R15: 00007ffdca296af0
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).
The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [syzbot ci] Re: veth: add Byte Queue Limits (BQL) support
2026-04-13 19:49 ` [syzbot ci] Re: veth: add Byte Queue Limits (BQL) support syzbot ci
@ 2026-04-14 8:06 ` Jesper Dangaard Brouer
2026-04-14 8:08 ` syzbot ci
0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2026-04-14 8:06 UTC (permalink / raw)
To: syzbot ci, andrew, ast, bpf, corbet, daniel, davem, edumazet,
frederic, horms, j.koeppeler, john.fastabend, kernel-team, kuba,
linux-doc, linux-kernel, linux-kselftest, netdev, pabeni, sdf,
shuah
Cc: syzbot, syzkaller-bugs
[-- Attachment #1: Type: text/plain, Size: 4594 bytes --]
On 13/04/2026 21.49, syzbot ci wrote:
> syzbot ci has tested the following series
>
> [v2] veth: add Byte Queue Limits (BQL) support
> https://lore.kernel.org/all/20260413094442.1376022-1-hawk@kernel.org
> * [PATCH net-next v2 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
> * [PATCH net-next v2 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
> * [PATCH net-next v2 3/5] veth: add tx_timeout watchdog as BQL safety net
> * [PATCH net-next v2 4/5] net: sched: add timeout count to NETDEV WATCHDOG message
> * [PATCH net-next v2 5/5] selftests: net: add veth BQL stress test
>
> and found the following issue:
> WARNING in veth_napi_del_range
>
> Full report is available here:
> https://ci.syzbot.org/series/ee732006-8545-4abd-a105-b4b1592a7baf
>
> ***
>
> WARNING in veth_napi_del_range
>
I have attached a standalone reproducer myself.
I have V3 ready -- see below for the diff.
> tree: net-next
> URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
> base: 8806d502e0a7e7d895b74afbd24e8550a65a2b17
> arch: amd64
> compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
> config: https://ci.syzbot.org/builds/90743a26-f003-44cf-abcc-5991c47588b2/config
> syz repro: https://ci.syzbot.org/findings/d068bfb2-9f8b-466a-95b4-cd7e7b00006c/syz_repro
>
> ------------[ cut here ]------------
> index >= dev->num_tx_queues
> WARNING: ./include/linux/netdevice.h:2672 at netdev_get_tx_queue include/linux/netdevice.h:2672 [inline], CPU#0: syz.1.27/6002
> WARNING: ./include/linux/netdevice.h:2672 at veth_napi_del_range+0x3b7/0x4e0 drivers/net/veth.c:1142, CPU#0: syz.1.27/6002
> Modules linked in:
> CPU: 0 UID: 0 PID: 6002 Comm: syz.1.27 Not tainted syzkaller #0 PREEMPT(full)
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> RIP: 0010:netdev_get_tx_queue include/linux/netdevice.h:2672 [inline]
> RIP: 0010:veth_napi_del_range+0x3b7/0x4e0 drivers/net/veth.c:1142
> Code: 00 e8 ad 96 69 fe 44 39 6c 24 10 74 5e e8 41 61 44 fb 41 ff c5 49 bc 00 00 00 00 00 fc ff df e9 6d ff ff ff e8 2a 61 44 fb 90 <0f> 0b 90 42 80 3c 23 00 75 8e eb 94 48 8b 0c 24 80 e1 07 80 c1 03
> RSP: 0018:ffffc90003adf918 EFLAGS: 00010293
> RAX: ffffffff86814ec6 RBX: 1ffff110227a6c03 RCX: ffff888103a857c0
> RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000002
> RBP: 1ffff110227a6c9a R08: ffff888113f01ab7 R09: 0000000000000000
> R10: ffff888113f01a98 R11: ffffed10227e0357 R12: dffffc0000000000
> R13: 0000000000000002 R14: 0000000000000002 R15: ffff888113d36018
> FS: 000055555ea16500(0000) GS:ffff88818de4a000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007efc287456b8 CR3: 000000010cdd0000 CR4: 00000000000006f0
> Call Trace:
> <TASK>
> veth_napi_del drivers/net/veth.c:1153 [inline]
> veth_disable_xdp+0x1b0/0x310 drivers/net/veth.c:1255
> veth_xdp_set drivers/net/veth.c:1693 [inline]
> veth_xdp+0x48e/0x730 drivers/net/veth.c:1717
> dev_xdp_propagate+0x125/0x260 net/core/dev_api.c:348
> bond_xdp_set drivers/net/bonding/bond_main.c:5715 [inline]
> bond_xdp+0x3ca/0x830 drivers/net/bonding/bond_main.c:5761
> dev_xdp_install+0x42c/0x600 net/core/dev.c:10387
> dev_xdp_detach_link net/core/dev.c:10579 [inline]
> bpf_xdp_link_release+0x362/0x540 net/core/dev.c:10595
> bpf_link_free+0x103/0x480 kernel/bpf/syscall.c:3292
> bpf_link_put_direct kernel/bpf/syscall.c:3344 [inline]
> bpf_link_release+0x6b/0x80 kernel/bpf/syscall.c:3351
> __fput+0x44f/0xa70 fs/file_table.c:469
> task_work_run+0x1d9/0x270 kernel/task_work.c:233
The BQL reset loop in veth_napi_del_range() iterates over
dev->real_num_rx_queues entries but indexes into the peer's TX queues,
which goes out of bounds when the peer has fewer TX queues than the
device has RX queues (e.g. a veth enslaved to a bond with XDP).
The fix is to clamp the loop to the peer's real_num_tx_queues.
Will be included in the V3 submission.
#syz test
---
drivers/net/veth.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 911e7e36e166..9d7b085c9548 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1138,7 +1138,9 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
*/
peer = rtnl_dereference(priv->peer);
if (peer) {
- for (i = start; i < end; i++)
+ int peer_end = min(end, (int)peer->real_num_tx_queues);
+
+ for (i = start; i < peer_end; i++)
netdev_tx_reset_queue(netdev_get_tx_queue(peer, i));
}
[-- Attachment #2: repro-syzbot-veth-bql.sh --]
[-- Type: application/x-shellscript, Size: 2967 bytes --]
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: Re: [syzbot ci] Re: veth: add Byte Queue Limits (BQL) support
2026-04-14 8:06 ` Jesper Dangaard Brouer
@ 2026-04-14 8:08 ` syzbot ci
0 siblings, 0 replies; 6+ messages in thread
From: syzbot ci @ 2026-04-14 8:08 UTC (permalink / raw)
To: hawk
Cc: andrew, ast, bpf, corbet, daniel, davem, edumazet, frederic, hawk,
horms, j.koeppeler, john.fastabend, kernel-team, kuba, linux-doc,
linux-kernel, linux-kselftest, netdev, pabeni, sdf, shuah, syzbot,
syzkaller-bugs
Please attach the patch to act upon.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next v2 5/5] selftests: net: add veth BQL stress test
2026-04-13 9:44 ` [PATCH net-next v2 5/5] selftests: net: add veth BQL stress test hawk
@ 2026-04-15 11:47 ` Breno Leitao
0 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-04-15 11:47 UTC (permalink / raw)
To: hawk
Cc: netdev, kernel-team, Jonas Köppeler, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Shuah Khan, linux-kernel, linux-kselftest
On Mon, Apr 13, 2026 at 11:44:38AM +0200, hawk@kernel.org wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
>
> Add a selftest that exercises veth's BQL (Byte Queue Limits) code path
> under sustained UDP load. The test creates a veth pair with GRO enabled
> (activating the NAPI path and BQL), attaches a qdisc, optionally loads
> iptables rules in the consumer namespace to slow NAPI processing, and
> floods UDP packets for a configurable duration.
>
> The test serves two purposes: benchmarking BQL's latency impact under
> configurable load (iptables rules, qdisc type and parameters), and
> detecting kernel BUG/Oops from DQL accounting mismatches. It monitors
> dmesg throughout the run and reports PASS/FAIL via kselftest (lib.sh).
>
> Diagnostic output is printed every 5 seconds:
> - BQL sysfs inflight/limit and watchdog tx_timeout counter
> - qdisc stats: packets, drops, requeues, backlog, qlen, overlimits
> - consumer PPS and NAPI-64 cycle time (shows fq_codel target impact)
> - sink PPS (per-period delta), latency min/avg/max (stddev at exit)
> - ping RTT to measure latency under load
>
> Generating enough traffic to fill the 256-entry ptr_ring requires care:
> the UDP sendto() path charges each SKB to sk_wmem_alloc, and the SKB
> stays charged (via sock_wfree destructor) until the consumer NAPI thread
> finishes processing it -- including any iptables rules in the receive
> path. With the default sk_sndbuf (~208KB from wmem_default), only ~93
> packets can be in-flight before sendto(MSG_DONTWAIT) returns EAGAIN.
> Since 93 < 256 ring entries, the ring never fills and no backpressure
> occurs. The test raises wmem_max via sysctl and sets SO_SNDBUF=1MB on
> the flood socket to remove this bottleneck. An earlier multi-namespace
> routing approach avoided this limit because ip_forward creates new SKBs
> detached from the sender's socket.
>
> The --bql-disable option (sets limit_min=1GB) enables A/B comparison.
> Typical results with --nrules 6000 --qdisc-opts 'target 2ms interval 20ms':
>
> fq_codel + BQL disabled: ping RTT ~10.8ms, 15% loss, 400KB in ptr_ring
> fq_codel + BQL enabled: ping RTT ~0.6ms, 0% loss, 4KB in ptr_ring
>
> Both cases show identical consumer speed (~20Kpps) and fq_codel drops
> (~255K), proving the improvement comes purely from where packets buffer.
>
> BQL moves buffering from the ptr_ring into the qdisc, where AQM
> (fq_codel/CAKE) can act on it -- eliminating the "dark buffer" that
> hides congestion from the scheduler.
>
> The --qdisc-replace mode cycles through sfq/pfifo/fq_codel/noqueue
> under active traffic to verify that stale BQL state (STACK_XOFF) is
> properly handled during live qdisc transitions.
>
> A companion wrapper (veth_bql_test_virtme.sh) launches the test inside
> a virtme-ng VM, with .config validation to prevent silent stalls.
>
> Usage:
> sudo ./veth_bql_test.sh [--duration 300] [--nrules 100]
> [--qdisc sfq] [--qdisc-opts '...']
> [--bql-disable] [--normal-napi]
> [--qdisc-replace]
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Tested-by: Breno Leitao <leitao@debian.org>
> diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
> index 2a390cae41bf..7b1f41421145 100644
> --- a/tools/testing/selftests/net/config
> +++ b/tools/testing/selftests/net/config
> @@ -97,6 +97,7 @@ CONFIG_NET_PKTGEN=m
> CONFIG_NET_SCH_ETF=m
> CONFIG_NET_SCH_FQ=m
> CONFIG_NET_SCH_FQ_CODEL=m
> +CONFIG_NET_SCH_SFQ=m
nit: This breaks the alphabetical ordering of the config file.
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-04-15 11:47 UTC | newest]
Thread overview: 6+ messages
2026-04-13 9:44 [PATCH net-next v2 0/5] veth: add Byte Queue Limits (BQL) support hawk
2026-04-13 9:44 ` [PATCH net-next v2 5/5] selftests: net: add veth BQL stress test hawk
2026-04-15 11:47 ` Breno Leitao
2026-04-13 19:49 ` [syzbot ci] Re: veth: add Byte Queue Limits (BQL) support syzbot ci
2026-04-14 8:06 ` Jesper Dangaard Brouer
2026-04-14 8:08 ` syzbot ci