From: Simon Schippers <simon@schippers-hamm.de>
To: Jesper Dangaard Brouer <hawk@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com, Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
John Fastabend <john.fastabend@gmail.com>,
Stanislav Fomichev <sdf@fomichev.me>,
linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
Date: Fri, 8 May 2026 11:20:05 +0200 [thread overview]
Message-ID: <ab56d050-33c9-4f65-94e6-64e1bc2b03b4@schippers-hamm.de> (raw)
In-Reply-To: <21d639fc-e244-486e-8368-8891b3c43215@schippers-hamm.de>
On 5/8/26 10:01, Simon Schippers wrote:
> On 5/7/26 22:45, Jesper Dangaard Brouer wrote:
>>
>>
>> On 07/05/2026 22.12, Simon Schippers wrote:
>>> On 5/7/26 21:09, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 07/05/2026 16.46, Simon Schippers wrote:
>>>>>
>>>>>
>>>>> On 5/7/26 16:34, Paolo Abeni wrote:
>>>>>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>>>>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>>>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>>>>> }
>>>>>>>> } else {
>>>>>>>> /* ndo_start_xmit */
>>>>>>>> - struct sk_buff *skb = ptr;
>>>>>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>>>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>>>>> stats->xdp_bytes += skb->len;
>>>>>>>> + if (peer_txq && bql_charged)
>>>>>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>>>>>
>>>>>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>>>>>> this doesn’t work.
>>>>>>>
>>>>
>>>> I've experimented with doing the "completion" at NAPI-end in
>>>> veth_poll(), but that resulted in the BQL limit being 128 packets, which
>>>> leads to bad latency results (not acceptable).
>>>> (See detailed report later)
>>>>
>>>>
>>>>>>> I still think that adding an option to modify the hard-coded
>>>>>>> VETH_RING_SIZE is the way to go.
>>>>>>>
>>>>
>>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>>> the solution here.
>>>>
>>>> The simple solution is to configure the BQL limit_min:
>>>> `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
>>>>
>>>> My experiments (below) find that limit_min=8 gives good performance.
>>>> We can simply set the default to 8, as this still allows userspace to change
>>>> it later if lower latency is preferred.
>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>>>>>
>>>>>> In the above discussion a 20% regression is reported, which IMHO can't
>>>>>> be ignored. Still the tput figures in the data are extremely low,
>>>>>> something is possibly off?!? I would expect a few Mpps with pktgen on
>>>>>> top of veth, while the reported data is ~20-30Kpps.
>>>>>>
>>>>>> /P
>>>>>>
>>>>>
>>>>> The ~20-30Kpps occur when thousands of iptables rules are applied and
>>>>> a UDP userspace application is sending.
>>>>>
>>>>> And there is a 20% pktgen regression (no iptables rules applied).
>>>>>
>>>>
>>>> The pktgen test is a little dubious/weird, and Jonas had to modify pktgen
>>>> to test this. John Fastabend added a config to pktgen that allows us
>>>> to benchmark the egress qdisc path; it might be better to use that.
>>>> The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh script shows a demo usage.
>>>>
>>>> If redoing the tests, can you adjust limit_min to see the effect?
>>>> /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
>>>>
>>>> A 20% throughput regression is of course too much, but I will
>>>> remind us that adding a qdisc will "cost" some overhead; that is a
>>>> configuration choice. Our purpose here is to reduce bufferbloat and
>>>> latency, not to optimize for throughput.
>>>>
>>>>
>>>>> I am pretty sure the reason is that the BQL limit is stuck at 2
>>>>> packets (because the completed-queue call is always made with 1 packet
>>>>> and not from an interrupt/timer with multiple packets...).
>>>>>
>>>>
>>>> I've run a lot of experiments and had an AI write a report on them, see attachment. The TL;DR is that the best performance vs. latency tradeoff is defaulting BQL/DQL limit_min to 8 packets.
>>>>
>>>> I fear this patchset will stall forever if we keep searching for a perfect solution without any overhead. The qdisc layer will be a baseline overhead. The limit of 2 packets is actually the optimal darkbuffer queue size, but I acknowledge that this causes too many qdisc requeue events (leading to overhead). I suggest that I add another patch in V6 that defaults limit_min to 8 (a separate patch to make it easier to revert/adjust later).
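
A minimal sketch of what such a default could look like (hypothetical
helper, not code from this series; assumes CONFIG_BQL, since txq->dql
only exists in that case, and reuses the VETH_BQL_UNIT constant from
this patchset as the per-packet byte unit):

/* Hypothetical sketch: default the BQL lower bound to 8 packets per TX
 * queue. This is equivalent to writing 8 * VETH_BQL_UNIT into
 * /sys/class/net/<dev>/queues/tx-<i>/byte_queue_limits/limit_min,
 * so userspace can still override it later via sysfs.
 */
static void veth_set_default_bql_limit_min(struct net_device *dev)
{
#ifdef CONFIG_BQL
	unsigned int i;

	for (i = 0; i < dev->real_num_tx_queues; i++) {
		struct netdev_queue *txq = netdev_get_tx_queue(dev, i);

		txq->dql.min_limit = 8 * VETH_BQL_UNIT;
	}
#endif
}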
>>>>
>>>> I've talked with Jonas, and we want to experiment with different solutions to make BQL/DQL work better with virtual devices.
>>>>
>>>> This patchset helps our (production) use-case reduce mice-flow latency
>>>> under load from approx. 22 ms to 1.3 ms. Because the consumer
>>>> namespace is the bottleneck, the requeue overhead is negligible in
>>>> comparison.
>>>>
>>>> -Jesper
>>>
>>> First of all, thanks for your work, and I really see the advantages of
>>> avoiding bufferbloat :)
>>>
>>> But the key part of the BQL algorithm, the *dynamic* adaptation of the
>>> limit, is not working. Always calling netdev_completed_queue() with
>>> 1 packet results in a static limit of 2 packets (as seen in Jonas'
>>> measurements), which you then force up to 8 packets.
>>>
>>> So in the end this patchset has the same effect as just setting
>>> VETH_RING_SIZE to 8 (and giving an option to change this value).
>>>
>>
>> I've coded up a time-based BQL implementation, see the attachment.
>> WDYT?
>>
>> --Jesper
>>
>
> A step in the right direction, but I dislike that you call
> netdev_sent_queue() with at least 1 packet (never 0 packets).
> I am not sure if it works, and I am not sure about the parameter.
>
Rethinking it, this could be fine, but it really needs testing because:
the weird thing is that BQL's inflight != number of packets
in the ring and BQL's limit != "current ring size". Instead, the BQL
limit describes the maximum number of packets allowed between
calls of netdev_sent_queue().
I messed up in my approach below. Forget it :P
>
> I would propose doing it like other BQL implementations do
> (for example usbnet for which I adapted BQL [1] :) ):
>
> Call netdev_sent_queue() with n_bql in a periodic work. n_bql would
> still be counted in veth_xdp_rcv() like you currently do (synchronized
> with the work via ring.consumer_lock?).
>
> The only weird thing that remains is that BQL's inflight != number of
> packets in the ring and BQL's limit != "current ring size". Instead,
> the BQL limit describes the maximum number of packets allowed between
> calls of netdev_sent_queue(), which occur periodically at a somewhat
> fixed time interval.
> I guess that could be fine, but it surely needs testing.
>
> [1] Link: https://lore.kernel.org/netdev/20251106175615.26948-1-simon.schippers@tu-dortmund.de/
>
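
To make the batching idea quoted above concrete, here is a very rough,
untested sketch (hypothetical struct/function names, not code from the
series): the consumer side only bumps counters under the ring's consumer
lock, and a periodic delayed work reports them to BQL in one batch via
the netdev_tx_completed_queue() call from the patch hunk above, instead
of once per packet.

/* Rough sketch of periodic, batched BQL accounting for veth
 * (hypothetical names). veth_xdp_rcv() would increment pkts/bytes
 * under the lock; this work item flushes them to BQL in one call.
 */
struct veth_bql_batch {
	struct delayed_work	work;
	struct netdev_queue	*peer_txq;
	spinlock_t		lock;	/* could be ring.consumer_lock instead */
	unsigned int		pkts;
	unsigned int		bytes;
};

static void veth_bql_flush_work(struct work_struct *w)
{
	struct veth_bql_batch *b =
		container_of(to_delayed_work(w), struct veth_bql_batch, work);
	unsigned int pkts, bytes;

	spin_lock_bh(&b->lock);
	pkts = b->pkts;
	bytes = b->bytes;
	b->pkts = 0;
	b->bytes = 0;
	spin_unlock_bh(&b->lock);

	if (pkts)
		netdev_tx_completed_queue(b->peer_txq, pkts, bytes);

	/* The re-arm interval is an assumption, not a measured value. */
	schedule_delayed_work(&b->work, msecs_to_jiffies(1));
}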
Thread overview: 24+ messages
2026-05-05 13:21 [PATCH net-next v5 0/5] veth: add Byte Queue Limits (BQL) support hawk
2026-05-05 13:21 ` [PATCH net-next v5 1/5] veth: fix OOB txq access in veth_poll() with asymmetric queue counts hawk
2026-05-07 14:25 ` Paolo Abeni
2026-05-05 13:21 ` [PATCH net-next v5 2/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices hawk
2026-05-05 13:21 ` [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
2026-05-07 6:54 ` Simon Schippers
2026-05-07 13:21 ` Paolo Abeni
2026-05-07 14:34 ` Paolo Abeni
2026-05-07 14:46 ` Simon Schippers
2026-05-07 19:09 ` Jesper Dangaard Brouer
2026-05-07 20:12 ` Simon Schippers
2026-05-07 20:45 ` Jesper Dangaard Brouer
2026-05-08 8:01 ` Simon Schippers
2026-05-08 9:20 ` Simon Schippers [this message]
2026-05-09 2:06 ` Jakub Kicinski
2026-05-09 9:09 ` Jesper Dangaard Brouer
2026-05-10 15:56 ` Jakub Kicinski
2026-05-11 8:11 ` Jesper Dangaard Brouer
2026-05-11 9:55 ` Simon Schippers
2026-05-11 18:08 ` Jesper Dangaard Brouer
2026-05-11 20:37 ` Simon Schippers
2026-05-05 13:21 ` [PATCH net-next v5 4/5] veth: add tx_timeout watchdog as BQL safety net hawk
2026-05-05 13:21 ` [PATCH net-next v5 5/5] net: sched: add timeout count to NETDEV WATCHDOG message hawk
2026-05-07 14:30 ` [PATCH net-next v5 0/5] veth: add Byte Queue Limits (BQL) support patchwork-bot+netdevbpf