From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Simon Schippers <simon@schippers-hamm.de>,
Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>,
netdev@vger.kernel.org, kernel-team@cloudflare.com,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
John Fastabend <john.fastabend@gmail.com>,
Stanislav Fomichev <sdf@fomichev.me>,
linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
Date: Tue, 12 May 2026 15:54:50 +0200
Message-ID: <18855e57-f050-411f-9958-d4babcc81ba3@kernel.org>
In-Reply-To: <873511fa-4316-4411-a76b-ec4c5805abd3@schippers-hamm.de>

On 11/05/2026 22.37, Simon Schippers wrote:
> On 5/11/26 20:08, Jesper Dangaard Brouer wrote:
>>
>>
>> On 11/05/2026 11.55, Simon Schippers wrote:
>>> On 5/11/26 10:11, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 10/05/2026 17.56, Jakub Kicinski wrote:
>>>>> On Sat, 9 May 2026 11:09:51 +0200 Jesper Dangaard Brouer wrote:
>>>>>> On 09/05/2026 04.06, Jakub Kicinski wrote:
>>>>>>> On Thu, 7 May 2026 21:09:09 +0200 Jesper Dangaard Brouer wrote:
>>>>>>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>>>>>>> the solution here.
>>>>>>>
>>>>>>> Was it evaluated, tho?
>>>>>>>
>>>>>>> It's obviously super easy these days to have AI spew no end of
>>>>>>> complex code. So it'd be great to have some solid, ideally
>>>>>>> production-like data to back this all up.
>>>>>>>
>>>>>>> VETH_RING_SIZE seems trivial, ethtool set ringparam
>>>>>>
>>>>>> No, unfortunately we cannot just decrease the VETH_RING_SIZE.
>>>>>
>>>>> To be clear - I said make it configurable with ethtool -G,
>>>>> not change the default.
>>>>>
>>>>
>>>> Sure, I understand the desire to make VETH_RING_SIZE configurable.
>>>> But in doing so, we make the Linux network stack harder to tune and
>>>> set up correctly. E.g. adding a qdisc to veth would also require
>>>> changing the ring size, but if the system also uses XDP, then tuning
>>>> below 64 (likely 128) will lead to hard-to-find packet drops.
>>>
>>> I mean, 64 could still be at least a 4x improvement.
>>>
>>
>> No, not really; setting it to 64 will give the same (bad) latency as
>> "BQL off", which this patchset is trying to address.
>>
>>>>
>>>> I prefer adding something (like BQL) that auto-tunes how much of the
>>>> ring queue we are using. Good queues function as shock absorbers when
>>>> concurrent processes in the OS have scheduling noise.
>>>>
>>>> I acknowledge that Simon Schippers found that the BQL implementation
>>>> was actually not auto-tuning. We need to work on this; my prototype
>>>> implementation [1][2] works surprisingly well.
>>>>
>>>>
>>>> - [1] https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/2-09-veth-time-based-bql-coalescing.patch
>>>> - [2] https://lore.kernel.org/all/3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org/
>>>>
>>>>
>>>>>> The reason is that XDP-redirect into veth doesn't have any
>>>>>> back-pressure and would simply drop packets if the queue size
>>>>>> becomes less than the NAPI budget (64). (Yes, we use both the
>>>>>> normal path and XDP-redirect in production.)
>>>>>
>>>>> Doesn't this mean you have a queue which is not under BQL control?
>>>>>
>>>>
>>>> It is a matter of perspective. BQL needs between 17 and 55 elements
>>>> of the 256-entry ring. At the same time we handle the case where the
>>>> ring runs full, e.g. due to a sudden burst of XDP-redirected packets,
>>>> which pushes packets into the qdisc layer.
>>>
>>> You are checking inflight/limit in the /sys directory to get the
>>> 17-55 numbers, right?
>>>
>>
>> Nope, I'm using a bpftrace program to keep track of the inflight/limit
>> in a BPF hashmap. Reading from /sys will not be accurate.
>
> Ah nice.
Add the option --hist to have both NAPI and BQL histograms printed when
the script ends. This will give you an accurate picture of how inflight
and limit evolve.
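
BTW, if you want to reproduce the tracking without the repo, the core
idea can be sketched roughly as below. This is a minimal hypothetical
version (not the exact script from commit [2] above), assuming a kernel
with BTF so bpftrace can resolve struct dql. It hooks dql_completed(),
which runs on TX completion for every BQL queue, so it traces all BQL
users system-wide (a per-device filter is left out for brevity):

  #!/usr/bin/env bpftrace
  /* Minimal sketch (not the actual script): track BQL inflight/limit.
   * dql_completed() fires on TX completion for every BQL queue. */
  kprobe:dql_completed
  {
      $dql = (struct dql *)arg0;
      /* inflight = queued but not yet completed (in BQL units) */
      $inflight = $dql->num_queued - $dql->num_completed;
      @inflight = lhist($inflight, 0, 128, 1);
      @limit_val = lhist($dql->limit, 0, 128, 1);
      @inflight_stats = stats($inflight);
      @limit_stats = stats($dql->limit);
  }

Output comes in the same lhist format as the histograms below my
signature.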
>>
>> I moved the selftests into a github repo [1] to allow us to collaborate
>> and evaluate the changes more easily. I explicitly kept the new
>> BPF-based BQL tracking as a commit [2] for your benefit.
>>
>> [1] https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests
>>
>> [2] https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977
>
> Thanks for sharing. After some minor issues I was able to set it up
> (currently I am just using plain v5; I will look at the coalescing
> patch when I find the time):
>
> I can confirm the latency reduction with the default settings, in my
> case from 4.888 ms to 0.241 ms.
>
> With the same script I was also able to see a performance slowdown:
> veth_bql_test_virtme.sh --qdisc fq_codel --nrules 0
> --> ~510 Kpps
> Same with --bql-disable
> --> ~570 Kpps
> --> 12% faster
>
Thanks for running these benchmarks.
Notice that --nrules 0 can easily result in no queuing (on average),
because the veth NAPI consumer is faster than the producer. You will
likely see BQL inflight=1 and a very low avg latency reported by the
sink (remember, it is okay that the sink gets a high latency penalty as
long as ping latency remains low, as that shows the AQM is working).
There is an important gotcha. We actually have micro-bursts of queuing
(likely due to scheduling noise). Reading BQL stats from /sys will show
BQL inflight=1, but with the option --hist it is visible that @inflight
has a long tail (see below my signature). The "qdisc" output line also
shows this happening via the requeues counter increasing (approx 17/sec
in a test with 567 Kpps). (This was with the time-based BQL
implementation.)
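
To illustrate why sampling /sys misses this: a small bpftrace sketch
along the lines below (again hypothetical, same dql_completed() hook
and BTF assumption as the tracking sketch above) counts once per second
how often a completion sees inflight above a threshold:

  #!/usr/bin/env bpftrace
  /* Hypothetical sketch: count micro-burst events per second.
   * The threshold of 8 is arbitrary; periodic /sys reads mostly
   * observe inflight=1 and miss these short-lived spikes. */
  kprobe:dql_completed
  {
      $dql = (struct dql *)arg0;
      if ($dql->num_queued - $dql->num_completed > 8) {
          @bursts = count();
      }
  }
  interval:s:1
  {
      print(@bursts);
      clear(@bursts);
  }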
>>
>> Sorry for cutting the remainder of the message, but I ran out of time,
>> as things are a bit challenging/hectic here at Cloudflare at the moment.
>>
>> --Jesper
>
> All good, just ignore it. I think I misunderstood something anyway.
Okay, I'll ignore it as I couldn't make sense of it ;-)
--Jesper
--- BQL inflight histogram (VETH_BQL_UNIT=1, values = packets) ---
@inflight:
[0, 1)      306565 |@                                                   |
[1, 2)     9250454 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 3)     5561919 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[3, 4)      354341 |@                                                   |
[4, 5)       50137 |                                                    |
[5, 6)       16771 |                                                    |
[6, 7)        6001 |                                                    |
[7, 8)        3076 |                                                    |
[8, 9)        1949 |                                                    |
[9, 10)      1965 |                                                    |
[10, 11)     1954 |                                                    |
[11, 12)     1914 |                                                    |
[12, 13)     1732 |                                                    |
[13, 14)     1559 |                                                    |
[14, 15)     1405 |                                                    |
[15, 16)     1269 |                                                    |
[16, 17)     1194 |                                                    |
[17, 18)     1190 |                                                    |
[18, 19)     1148 |                                                    |
[19, 20)     1079 |                                                    |
[20, 21)     1008 |                                                    |
[21, 22)      951 |                                                    |
[22, 23)      870 |                                                    |
[23, 24)      826 |                                                    |
[24, 25)      775 |                                                    |
[25, 26)      764 |                                                    |
[26, 27)      740 |                                                    |
[27, 28)      714 |                                                    |
[28, 29)      665 |                                                    |
[29, 30)      626 |                                                    |
[30, 31)      607 |                                                    |
[31, 32)      601 |                                                    |
[32, 33)      583 |                                                    |
[33, 34)      593 |                                                    |
[34, 35)      574 |                                                    |
[35, 36)      562 |                                                    |
[36, 37)      554 |                                                    |
[37, 38)      538 |                                                    |
[38, 39)      528 |                                                    |
[39, 40)      525 |                                                    |
[40, 41)      512 |                                                    |
[41, 42)      542 |                                                    |
[42, 43)      529 |                                                    |
[43, 44)      526 |                                                    |
[44, 45)      513 |                                                    |
[45, 46)      503 |                                                    |
[46, 47)      485 |                                                    |
[47, 48)      480 |                                                    |
[48, 49)      473 |                                                    |
[49, 50)      474 |                                                    |
[50, 51)      476 |                                                    |
[51, 52)      476 |                                                    |
[52, 53)      465 |                                                    |
[53, 54)      454 |                                                    |
[54, 55)      446 |                                                    |
[55, 56)      430 |                                                    |
[56, 57)      425 |                                                    |
[57, 58)      425 |                                                    |
[58, 59)      422 |                                                    |
[59, 60)      407 |                                                    |
[60, 61)      390 |                                                    |
[61, 62)      370 |                                                    |
[62, 63)      354 |                                                    |
[63, 64)      343 |                                                    |
[64, 65)      325 |                                                    |
[65, 66)      303 |                                                    |
[66, 67)      158 |                                                    |
[67, 68)      136 |                                                    |
[68, 69)      124 |                                                    |
[69, 70)      110 |                                                    |
[70, 71)       99 |                                                    |
[71, 72)       94 |                                                    |
[72, 73)       82 |                                                    |
[73, 74)       74 |                                                    |
[74, 75)       58 |                                                    |
[75, 76)       52 |                                                    |
[76, 77)       45 |                                                    |
[77, 78)       40 |                                                    |
[78, 79)       39 |                                                    |
[79, 80)       38 |                                                    |
[80, 81)       21 |                                                    |
[81, 82)        4 |                                                    |
[82, 83)        4 |                                                    |
[83, 84)        4 |                                                    |
[84, 85)        2 |                                                    |
[85, 86)        2 |                                                    |
[86, 87)        2 |                                                    |
[87, 88)        2 |                                                    |
[88, 89)        1 |                                                    |
--- BQL limit histogram (auto-tuned, values = packets) ---
@limit_val:
[61, 62)   221346 |@                                                   |
[62, 63)        0 |                                                    |
[63, 64)   772169 |@@@                                                 |
[64, 65) 10053949 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[65, 66)        0 |                                                    |
[66, 67)        0 |                                                    |
[67, 68)        0 |                                                    |
[68, 69)        0 |                                                    |
[69, 70)        0 |                                                    |
[70, 71)   457838 |@@                                                  |
[71, 72)        0 |                                                    |
[72, 73)   610198 |@@@                                                 |
[73, 74)        0 |                                                    |
[74, 75)        0 |                                                    |
[75, 76)        0 |                                                    |
[76, 77)        0 |                                                    |
[77, 78)        0 |                                                    |
[78, 79)  2328284 |@@@@@@@@@@@@                                        |
[79, 80)  1150181 |@@@@@                                               |
@inflight_stats: count 15593965, average 1, total 23078061
@limit_stats: count 15593965, average 67, total 1054054856