Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

Netdev List
 help / color / mirror / Atom feed

From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Simon Schippers <simon@schippers-hamm.de>,
	Paolo Abeni <pabeni@redhat.com>,
	netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com, Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	John Fastabend <john.fastabend@gmail.com>,
	Stanislav Fomichev <sdf@fomichev.me>,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
Date: Thu, 7 May 2026 21:09:09 +0200	[thread overview]
Message-ID: <6a597dbd-70bf-4b14-b495-2f7248fd3220@kernel.org> (raw)
In-Reply-To: <e3a91545-13cd-4f87-8375-d707865bdbca@schippers-hamm.de>

[-- Attachment #1: Type: text/plain, Size: 3900 bytes --]



On 07/05/2026 16.46, Simon Schippers wrote:
> 
> 
> On 5/7/26 16:34, Paolo Abeni wrote:
>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>   			}
>>>>   		} else {
>>>>   			/* ndo_start_xmit */
>>>> -			struct sk_buff *skb = ptr;
>>>> +			bool bql_charged = veth_ptr_is_bql(ptr);
>>>> +			struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>   
>>>>   			stats->xdp_bytes += skb->len;
>>>> +			if (peer_txq && bql_charged)
>>>> +				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>
>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>> this doesn’t work.
>>>

I've experimented with doing the "completion" at NAPI-end in
veth_poll(), but that resulted in BQL limit being 128 packets, which
leads to bad latency results (not acceptable).
(See detailed report later)


>>> I still think first that adding an option to modify the hard-coded
>>> VETH_RING_SIZE is the way to go.
>>>

Not against being able to modify VETH_RING_SIZE, but I don't think it is
the solution here.

The simply solution is the configure BQL limit_min:
  `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`

My experiments (below) find that limit_min=8 is gives good performance.
We can simply set default to 8 as this still allows userspace to change
this later if lower latency is preferred.

>>> Thanks!
>>>
>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
 >>
>> In the above discussion a 20% regression is reported, which IMHO can't
>> be ignored. Still the tput figures in the data are extremely low,
>> something is possibly off?!? I would expect a few Mpps with pktgen on
>> top of veth, while the reported data is ~20-30Kpps.
>>
>> /P
>>
> 
> The ~20-30Kpps occur when thousands of iptables rules are applied and
> an UDP userspace application is sending.
> 
> And there is a 20% pktgen regression (no iptables rules applied).
> 

The pktgen test is a little dubious/weird and Jonas had to modify pktgen
to test this.   John Fastabend added a config to pktgen that allows us
to benchmarking egress qdisc path, this might be better to use this.
The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh is a demo usage.

If redoing the tests, can you adjust limit_min to see the effect?
  /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min

20% throughput performance regression is of-cause too much, but I will
remind us, that adding a qdisc will "cost" some overhead, that is a
configuration choice.  Our purpose here is to reduce bufferbloat and
latency, not optimize for throughput.


> I am pretty sure the reason is because the BQL limit is stuck at 2
> packets (because the completed queue is always called with 1 packet
> and not in a interrupt/timer with multiple packets...).
> 

I've run a lot of experiments, which I made AI write a report over, see 
attachment.  The TL;DR is that best performance vs latency tradeoff is 
defaulting BQL/DQL limit_min to be 8 packets.

I fear this patchset will stall forever, if we keep searching for a 
perfect solution without any overhead.  The qdisc layer will be a 
baseline overhead.  The limit=2 packets is actually the optimal 
darkbuffer queue size, but I acknowledge that this causes too many qdisc 
requeue events (leading to overhead).  I suggest that I add another 
patch in V6, that defaults limit_min to 8 (separate patch to make it 
easier to revert/adjust later).

I've talked with Jonas, and we want to experiment with different 
solutions to make BQL/DQL work better with virtual devices.

This patchset helps our (production) use-case reduce mice-flow latency
from approx 22ms to 1.3ms for latency under-load.  Due to the consumer
namespace being the bottleneck the requeue overhead is negligible in
comparison.

-Jesper

[-- Attachment #2: PERF-2651-bql-completion-experiment.md --]
[-- Type: text/markdown, Size: 13601 bytes --]

# PERF-2651: BQL Completion Batching Experiment (2026-05-05)

## Background

Simon Schippers and Jonas Koeppeler raised concerns that DQL settles at
limit=2 with veth BQL, citing the netdevice.h comment:

> "Must be called at most once per TX completion round (and not per
>  individual packet), so that BQL can adjust its limits appropriately."

And Tom Herbert's original BQL cover letter:

> "BQL accounting is in the transmit path for every packet, and the
>  function to recompute the byte limit is run once per transmit completion."

Thread: https://lore.kernel.org/all/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/

## Experiment: Batch BQL completion at end of veth_poll

Created stg patch `experiment-batch-bql-completion` that moves
`netdev_tx_completed_queue()` from per-SKB inside `veth_xdp_rcv()` to a
single batched call at the end of `veth_poll()`.

### Code change (drivers/net/veth.c)

In `veth_xdp_rcv()`: replace per-SKB completion with counter accumulation:
```c
// Before (V5, per-packet):
if (peer_txq && bql_charged)
    netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);

// After (experiment, accumulate):
if (peer_txq && bql_charged)
    stats->bql_completed += VETH_BQL_UNIT;
```

In `veth_poll()`: single batched call after veth_xdp_rcv() returns:
```c
if (peer_txq && stats.bql_completed)
    netdev_tx_completed_queue(peer_txq, stats.bql_completed,
                              stats.bql_completed);
```

Note: cannot use `done` (return value of veth_xdp_rcv) because it counts
all consumed ring entries including XDP frames that were never BQL-charged.
Using `done` would over-complete and hit BUG_ON in dql_completed().

## Why DQL settles at limit=2 with per-packet completion

The DQL slack calculation in `dql_completed()` uses:
```c
slack = POSDIFF(limit + prev_ovlimit, 2 * (completed - num_completed));
```

`completed - num_completed` equals the `count` parameter (bytes completed
this call). Per-packet: count=1, so slack = limit + prev_ovlimit - 2.
With limit=2, slack=0, so the algorithm holds steady at 2.

With batched completion: count=~64, slack calculation sees the real batch
size, and DQL converges to limit=128 (~2x NAPI budget).

## Results: nrules=3500 (sfq + tiny-flood)

| Metric        | No BQL    | Per-pkt   | limit=4 | limit=8 | limit=16 | Batched     |
|               |           | (limit=2) |         |         |          | (limit=128) |
|---------------|-----------|-----------|---------|---------|----------|-------------|
| BQL limit     | unlimited | 2         | 4       | 8       | 16       | 128         |
| BQL inflight  | 254       | 3         | 5       | 9-17    | 25       | 133         |
| Ping RTT avg  | 9.3ms     | 0.94ms    | 1.07ms  | 1.24ms  | 1.69ms   | 4.0ms       |
| requeues      | 52K       | 454K      | 426K    | 399K    | 356K     | 112K        |
| NAPI avg_work | 63        | 5         | 15      | 63      | 63       | 63          |
| NAPI polls    | ~2.2K     | ~27K      | ~10.5K  | ~2.5K   | ~2.3K    | ~2.5K       |
| Consumer pps  | ~26K      | ~30K      | ~30K    | ~30K    | ~29K     | ~30K        |

## Results: nrules=15000 (sfq + tiny-flood, slower consumer)

| Metric        | No BQL     | Per-pkt   | limit=4   | limit=8   | limit=16  | Batched     |
|               |            | (limit=2) |           |           |           | (limit=128) |
|---------------|------------|-----------|-----------|-----------|-----------|-------------|
| BQL limit     | unlimited  | 2         | 4         | 8         | 16        | 128         |
| BQL inflight  | 211-227    | 3         | 5-12      | 9         | 17        | 132-136     |
| Ping RTT avg  | **37.8ms** | **4.5ms** | **5.0ms** | **6.0ms** | **6.4ms** | **20.0ms**  |
| Ping RTT min  | 27.7ms     | 1.4ms     | 1.7ms     | 2.4ms     | 3.0ms     | 10.4ms      |
| requeues      | 12.9K      | 93K       | 87K       | 80K       | 86K       | 22.8K       |
| NAPI avg_work | 61         | 6         | 17        | 60        | 61        | 61          |
| NAPI polls    | ~540       | ~4.9K     | ~1.9K     | ~540      | ~550      | ~540        |
| Consumer pps  | ~6.7K      | ~6.7K     | ~6.9K     | ~6.8K     | ~7.0K     | ~6.7K       |

## Analysis

### Batched completion is clearly worse for latency

At nrules=15000, batched completion gives 20ms ping RTT -- only 2x better
than no-BQL (37.8ms). Per-packet gives 4.5ms -- an 8x improvement.

The math confirms this: 128 packets / 6.7K pps = 19ms of uncontrolled
queuing delay. This matches the measured 20ms almost exactly.

### Per-packet completion (limit=2) is correct for veth

Simon's concern that limit=2 is a DQL defect is wrong. limit=2 is the
ideal behavior for dark-buffer elimination:
- Only 2-3 packets in the ptr_ring at any time
- Qdisc gets immediate control over all buffering
- 8x latency reduction vs no-BQL

The DQL comment "once per TX completion round" was written for HW NICs
where interrupt coalescing batches completions naturally. For veth, each
per-SKB completion within a NAPI poll technically violates the letter of
the comment, but the resulting limit=2 is correct for the use case.

The concern with limit=2 is the overhead it introduces:

### Trade-off: NAPI polling overhead

Per-packet (limit=2) causes many more NAPI polls:
- nrules=3500: 27K polls (avg_work=5) vs 2.5K polls (avg_work=63)
- nrules=15000: 4.9K polls (avg_work=6) vs 540 polls (avg_work=61)

This is because with only 2-3 items in the ring, each NAPI poll drains
the ring quickly -> napi_complete_done -> reschedule. More scheduling
overhead, but no throughput impact when consumer is the bottleneck.

### limit_min tuning via sysfs

DQL limit_min can be set via:
`/sys/class/net/<dev>/queues/tx-0/byte_queue_limits/limit_min`

The selftest `--bql-min-limit N` flag writes to this sysfs.

- **limit_min=4**: half a cache-line (32 bytes of ptr_ring pointers).
  avg_work=17, 1.9K polls. Ping 5.0ms -- close to limit=2 (4.5ms).
- **limit_min=8**: one cache-line (64 bytes of ptr_ring pointers).
  avg_work=60, 540 polls. Ping 6.0ms -- efficient full-budget polls.

### Dark buffer formula

At consumer rate R (pps) and BQL limit L (packets):
- Dark buffer latency = L / R
- limit=2: 2/6700 = 0.3ms (negligible)
- limit=8: 8/6700 = 1.2ms
- limit=128: 128/6700 = 19ms (matches measured 20ms)
- unlimited (254): 254/6700 = 38ms (matches measured 37.8ms)

## Results: nrules=0 (no consumer overhead, max throughput)

This tests the raw throughput overhead of BQL stop/start oscillation.
All values are averages of 4 runs (VM noise is ~15-20% per-run variance).

| Metric          | No BQL | limit=2 | limit=4 | limit=8 | limit=16 |
|-----------------|--------|---------|---------|---------|----------|
| Sink pps (large)| 841K   | 759K    | 692K    | 762K    | 736K     |
| Sink pps (small)| 950K   | 874K    | 807K    | 874K    | 844K     |
| qdisc pkts      | 48.6M  | 44.8M   | 40.1M   | 45.0M   | 44.8M    |
| requeues        | 311K   | 6.1M    | 13.4M   | 5.8M    | 5.2M     |
| NAPI avg_work   | 22     | 27      | 12      | 19      | 21       |
| Ping RTT avg    | 0.17ms | 0.11ms  | 0.10ms  | 0.085ms | 0.095ms  |
| Runs            | 4      | 4       | 4       | 4       | 4        |

Observations:
- **limit=2 is NOT the worst** -- limit=4 has higher requeues (13.4M) and
  lower throughput (692K sink) due to more stop/start cycles at a less
  efficient NAPI batch size (avg_work=12)
- **limit=8 and limit=16 match No-BQL throughput** within noise (~762K vs 841K
  sink pps for large pkts, ~3-10% difference)
- **Requeue overhead**: 311K (No BQL) -> 5.2-5.8M (limit=8/16) -> 13.4M (limit=4)
- Latency sub-0.2ms for all settings at this speed -- not a differentiator

## Comparison: limit=8 vs limit=16

Multi-run (4 iterations each, nrules=0) to cut through VM noise:

### limit=8 (4 runs)

| Run | Sink pps (large/small) | qdisc pkts | requeues | avg_work | Ping avg |
|-----|------------------------|-----------|----------|----------|----------|
| 1   | 796K / 911K            | 46.2M     | 5.6M     | 20       | 0.062ms  |
| 2   | 796K / 883K            | 45.5M     | 4.7M     | 16       | 0.081ms  |
| 3   | 654K / 836K            | 43.5M     | 8.3M     | 22       | 0.100ms  |
| 4   | 803K / 865K            | 44.8M     | 4.4M     | 16       | 0.095ms  |
| **avg** | **762K / 874K**    | **45.0M** | **5.8M** | **19**   | **0.085ms** |

### limit=16 (4 runs)

| Run | Sink pps (large/small) | qdisc pkts | requeues | avg_work | Ping avg |
|-----|------------------------|-----------|----------|----------|----------|
| 1   | 844K / 940K            | 48.1M     | 3.3M     | 20       | 0.081ms  |
| 2   | 768K / 873K            | 45.6M     | 4.1M     | 15       | 0.097ms  |
| 3   | 733K / 804K            | 44.8M     | 6.5M     | 26       | 0.085ms  |
| 4   | 597K / 757K            | 40.7M     | 6.9M     | 23       | 0.115ms  |
| **avg** | **736K / 844K**    | **44.8M** | **5.2M** | **21**   | **0.095ms** |

### Averaged comparison (nrules=0, 4 runs)

| Metric              | limit=8   | limit=16  |
|---------------------|-----------|-----------|
| Sink pps (large)    | 762K      | 736K      |
| Sink pps (small)    | 874K      | 844K      |
| qdisc pkts          | 45.0M     | 44.8M     |
| requeues            | 5.8M      | 5.2M      |
| avg_work            | 19        | 21        |
| Ping RTT avg        | 0.085ms   | 0.095ms   |

At max throughput, limit=8 and limit=16 are within VM noise (~3-4%).

### Cross-load comparison (all averages of 4 runs)

| Metric        | limit=8 | limit=16 | Winner        |
|---------------|---------|----------|---------------|
| nrules=15000: |         |          |               |
|   Ping RTT    | 6.73ms  | 8.00ms   | 8 (+1.3ms)    |
|   requeues    | 71K     | 73K      | ~same         |
|   avg_work    | 59      | 59       | ~same         |
| nrules=3500:  |         |          |               |
|   Ping RTT    | 1.77ms  | 2.11ms   | 8 (+0.34ms)   |
|   requeues    | 279K    | 282K     | ~same         |
|   avg_work    | 62      | 62       | ~same         |
| nrules=0:     |         |          |               |
|   Sink pps    | 762K    | 736K     | ~same (noise) |
|   requeues    | 5.8M    | 5.2M     | ~same (noise) |

**Verdict: limit=8 is the better default.**
- Consistent latency advantage under load: +1.3ms at nrules=15000,
  +0.34ms at nrules=3500 (reproducible across 4 runs each)
- Throughput indistinguishable from limit=16 after averaging
- One cache-line (64 bytes) is a clean hardware alignment
- More conservative -- smaller dark buffer

## Proposed patch: dql_set_min_limit() + veth default min_limit=8

Two-part solution in stg patch `veth-set-bql-min-limit-8`:

### 1. New DQL API helper (include/linux/dynamic_queue_limits.h)

```c
static inline void dql_set_min_limit(struct dql *dql, unsigned int min_limit)
{
    dql->min_limit = min_limit;
}
```

Gives drivers a clean API to set a default floor.  Currently no driver
sets min_limit -- all rely on the dql_init() default of 0 or user sysfs.

### 2. Veth sets min_limit=8 at device creation (drivers/net/veth.c)

In `veth_init_queues()`, after TX queue setup:
```c
#ifdef CONFIG_BQL
    for (i = 0; i < dev->num_tx_queues; i++)
        dql_set_min_limit(&netdev_get_tx_queue(dev, i)->dql,
                          VETH_BQL_UNIT * 8);
#endif
```

Called for both `dev` and `peer` in `veth_newlink()`.  Uses
`num_tx_queues` (all pre-allocated queues), not `real_num_tx_queues`,
so channel changes via `ethtool -L` are covered -- no new queues are
ever created at runtime.

### Why min_limit=8

- One cache-line of ptr_ring pointers (8 x 8 = 64 bytes)
- Lowest requeue count at max throughput (5.3M vs 16.9M at limit=2)
- Keeps full-budget NAPI polls (avg_work=63) -- no scheduling overhead
- Latency only 0.3ms worse than limit=2 at moderate load (1.24ms vs 0.94ms)
- Still 6x better latency than no-BQL at heavy load (6ms vs 37.8ms)
- User can lower to 0 or raise via sysfs limit_min at any time

### Verified: driver default works (nrules=15000, --hist)

Tested with `veth-set-bql-min-limit-8` patch applied, no `--bql-min-limit`
sysfs override.  BQL limit=8 held stable, ping RTT ~6.5ms (matches sysfs
override results).

BQL inflight histogram (bpftrace, 169K samples):
```
[1]              15 |                                  |
[2, 4)        21193 |@@@@@@@@@@@@@                     |
[4, 8)        63615 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)       80116 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)       4709 |@@@                               |
```

- Inflight avg=7, max=17 -- ring stays shallow
- Peak at [8,16): inflight near the limit=8 floor most of the time
- [4,8) second: ring draining between NAPI polls
- [16,32) rare: brief producer bursts
- stack_xoff ~15K/5s, drv_xoff=0 -- BQL stops queue well before ring fills
- NAPI avg_work=61, almost all full-budget polls

## Conclusion

Per-packet BQL completion in V5 is the right design. It gives DQL the
information it needs to keep the dark buffer minimal, which is exactly
what we want for latency reduction.

Simon's suggestion to call netdev_tx_completed_queue() once per NAPI poll
would regress ping latency from 4.5ms to 20ms at production-like iptables
rule counts.

The default min_limit=8 (via dql_set_min_limit) is the proposed follow-up
to address the requeue overhead that per-packet completion causes.  It
keeps latency close to optimal while reducing the ~10% throughput loss
and 20x requeue increase (6.1M vs 311K) that limit=2 causes at max speed.
Users wanting tighter latency can set limit_min=0 via sysfs to get the
original limit=2 behavior.

next prev parent reply	other threads:[~2026-05-07 19:09 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-05 13:21 [PATCH net-next v5 0/5] veth: add Byte Queue Limits (BQL) support hawk
2026-05-05 13:21 ` [PATCH net-next v5 1/5] veth: fix OOB txq access in veth_poll() with asymmetric queue counts hawk
2026-05-07 14:25   ` Paolo Abeni
2026-05-05 13:21 ` [PATCH net-next v5 2/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
2026-05-05 13:21 ` [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
2026-05-07  6:54   ` Simon Schippers
2026-05-07 13:21     ` Paolo Abeni
2026-05-07 14:34     ` Paolo Abeni
2026-05-07 14:46       ` Simon Schippers
2026-05-07 19:09         ` Jesper Dangaard Brouer [this message]
2026-05-07 20:12           ` Simon Schippers
2026-05-07 20:45             ` Jesper Dangaard Brouer
2026-05-08  8:01               ` Simon Schippers
2026-05-08  9:20                 ` Simon Schippers
2026-05-09  2:06           ` Jakub Kicinski
2026-05-09  9:09             ` Jesper Dangaard Brouer
2026-05-10 15:56               ` Jakub Kicinski
2026-05-11  8:11                 ` Jesper Dangaard Brouer
2026-05-11  9:55                   ` Simon Schippers
2026-05-11 18:08                     ` Jesper Dangaard Brouer
2026-05-11 20:37                       ` Simon Schippers
2026-05-05 13:21 ` [PATCH net-next v5 4/5] veth: add tx_timeout watchdog as BQL safety net hawk
2026-05-05 13:21 ` [PATCH net-next v5 5/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
2026-05-07 14:30 ` [PATCH net-next v5 0/5] veth: add Byte Queue Limits (BQL) support patchwork-bot+netdevbpf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6a597dbd-70bf-4b14-b495-2f7248fd3220@kernel.org \
    --to=hawk@kernel.org \
    --cc=andrew+netdev@lunn.ch \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=kernel-team@cloudflare.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    --cc=simon@schippers-hamm.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox