From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Simon Schippers <simon@schippers-hamm.de>,
Paolo Abeni <pabeni@redhat.com>,
netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com, Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
John Fastabend <john.fastabend@gmail.com>,
Stanislav Fomichev <sdf@fomichev.me>,
linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
Date: Thu, 7 May 2026 22:45:22 +0200
Message-ID: <3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org>
In-Reply-To: <68223314-1a44-4aee-8207-57437ef9f3ab@schippers-hamm.de>
[-- Attachment #1: Type: text/plain, Size: 4859 bytes --]
On 07/05/2026 22.12, Simon Schippers wrote:
> On 5/7/26 21:09, Jesper Dangaard Brouer wrote:
>>
>>
>> On 07/05/2026 16.46, Simon Schippers wrote:
>>>
>>>
>>> On 5/7/26 16:34, Paolo Abeni wrote:
>>>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>>> }
>>>>>> } else {
>>>>>> /* ndo_start_xmit */
>>>>>> - struct sk_buff *skb = ptr;
>>>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>>> stats->xdp_bytes += skb->len;
>>>>>> + if (peer_txq && bql_charged)
>>>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>>>
>>>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>>>> this doesn’t work.
>>>>>
>>
>> I've experimented with doing the "completion" at NAPI-end in
>> veth_poll(), but that resulted in the BQL limit being 128 packets,
>> which leads to bad latency results (not acceptable).
>> (See detailed report later)
>>
>>
>>>>> I still think first that adding an option to modify the hard-coded
>>>>> VETH_RING_SIZE is the way to go.
>>>>>
>>
>> I'm not against being able to modify VETH_RING_SIZE, but I don't
>> think it is the solution here.
>>
>> The simple solution is to configure the BQL limit_min:
>> `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
>>
>> My experiments (below) find that limit_min=8 gives good performance.
>> We can simply set the default to 8, as this still allows userspace to
>> change it later if lower latency is preferred.
>>
>>>>> Thanks!
>>>>>
>>>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>>>
>>>> In the above discussion a 20% regression is reported, which IMHO can't
>>>> be ignored. Still the tput figures in the data are extremely low,
>>>> something is possibly off?!? I would expect a few Mpps with pktgen on
>>>> top of veth, while the reported data is ~20-30Kpps.
>>>>
>>>> /P
>>>>
>>>
>>> The ~20-30Kpps occur when thousands of iptables rules are applied and
>>> an UDP userspace application is sending.
>>>
>>> And there is a 20% pktgen regression (no iptables rules applied).
>>>
>>
>> The pktgen test is a little dubious/weird, and Jonas had to modify
>> pktgen to test this. John Fastabend added a pktgen config that allows
>> benchmarking the egress qdisc path; it might be better to use that.
>> See samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh for a demo usage.
>>
>> If redoing the tests, can you adjust limit_min to see the effect?
>> /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
>>
>> A 20% throughput regression is of course too much, but I will remind
>> us that adding a qdisc will "cost" some overhead; that is a
>> configuration choice. Our purpose here is to reduce bufferbloat and
>> latency, not to optimize for throughput.
>>
>>
>>> I am pretty sure the reason is that the BQL limit is stuck at 2
>>> packets (because the completed queue is always called with 1 packet
>>> and not in an interrupt/timer with multiple packets...).
>>>
>>
>> I've run a lot of experiments and had AI write a report on them; see
>> the attachment. The TL;DR is that the best performance vs latency
>> tradeoff is defaulting the BQL/DQL limit_min to 8 packets.
>>
>> I fear this patchset will stall forever if we keep searching for a
>> perfect solution without any overhead. The qdisc layer will be a
>> baseline overhead. The limit=2 packets is actually the optimal
>> darkbuffer queue size, but I acknowledge that this causes too many
>> qdisc requeue events (leading to overhead). I suggest that I add
>> another patch in V6 that defaults limit_min to 8 (a separate patch,
>> to make it easier to revert/adjust later).
>>
>> I've talked with Jonas, and we want to experiment with different
>> solutions to make BQL/DQL work better with virtual devices.
>>
>> This patchset helps our (production) use-case reduce mice-flow latency
>> from approx 22ms to 1.3ms for latency under load. Due to the consumer
>> namespace being the bottleneck, the requeue overhead is negligible in
>> comparison.
>>
>> -Jesper
>
> First of all, thanks for your work; I really see the advantages of
> avoiding bufferbloat :)
>
> But the key of the BQL algorithm, which is the *dynamic* adaptation of
> the limit, is not working. Always calling netdev_completed_queue() with
> 1 packet results in a static limit of 2 packets (as seen in Jonas'
> measurements), which you force up to 8 packets.
>
> So in the end this patchset has the same effect as just setting
> VETH_RING_SIZE to 8 (and giving an option to change this value).
>
I've coded up a time-based BQL implementation; see the attachment.
WDYT?
--Jesper
[-- Attachment #2: 09-veth-time-based-bql-coalescing.patch --]
[-- Type: text/x-patch, Size: 5549 bytes --]
veth: time-based BQL completion coalescing via ethtool tx-usecs
From: Jesper Dangaard Brouer <hawk@kernel.org>
Bufferbloat is fundamentally a latency problem -- what matters is the
time packets spend waiting in queues, as perceived by users and
applications. Base BQL completion coalescing on elapsed time rather
than packet counts to directly control queuing delay.
Add ethtool tx-usecs support to veth for tuning BQL completion
coalescing. Instead of completing BQL per-packet (which forces DQL to
limit=2 with high NAPI scheduling overhead) or per-NAPI-poll (which
over-buffers at budget=64), accumulate completions and flush them when
a configurable time threshold is exceeded. This lets DQL discover a
limit that bounds the actual queuing delay to the configured interval.
The default of 10 usecs (VETH_BQL_COAL_TX_USECS) provides a good
balance: DQL converges to a small limit that keeps queuing delay
bounded while allowing efficient NAPI batch processing. Setting
tx-usecs to 0 disables coalescing and falls back to per-packet
completion (limit=2, lowest latency, highest NAPI overhead).
Usage:
ethtool -C <veth-dev> tx-usecs 500 # 500us coalescing
ethtool -C <veth-dev> tx-usecs 0 # per-packet (no coalescing)
Uses local_clock() (rdtsc on x86, ~20ns) for sub-jiffy resolution.
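For completeness, the two tuning knobs discussed in this thread can be
exercised from a shell. This is only a sketch: "veth0" and the tx-0
queue index are placeholders, not names from this patchset.

```shell
DEV=veth0   # placeholder; substitute one end of your veth pair

# BQL floor discussed earlier in the thread (set per TX queue):
echo 8 > "/sys/class/net/$DEV/queues/tx-0/byte_queue_limits/limit_min"

# Inspect the coalescing interval configured via ethtool -C above:
ethtool -c "$DEV" | grep -i 'tx-usecs'

# Watch the DQL limit converge while the device is under load:
cat "/sys/class/net/$DEV/queues/tx-0/byte_queue_limits/limit"
```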
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 57 insertions(+), 2 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 4103d298aa9b..6035f7ec92b4 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -43,6 +43,7 @@
#define VETH_XDP_TX_BULK_SIZE 16
#define VETH_XDP_BATCH 16
+#define VETH_BQL_COAL_TX_USECS 10 /* default tx-usecs for BQL batching */
struct veth_stats {
u64 rx_drops;
@@ -81,6 +82,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct veth_rq *rq;
unsigned int requested_headroom;
+ unsigned int tx_coal_usecs; /* BQL completion coalescing */
};
struct veth_xdp_tx_bq {
@@ -265,7 +267,30 @@ static void veth_get_channels(struct net_device *dev,
static int veth_set_channels(struct net_device *dev,
struct ethtool_channels *ch);
+static int veth_get_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+
+ ec->tx_coalesce_usecs = priv->tx_coal_usecs;
+ return 0;
+}
+
+static int veth_set_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+
+ priv->tx_coal_usecs = ec->tx_coalesce_usecs;
+ return 0;
+}
+
static const struct ethtool_ops veth_ethtool_ops = {
+ .supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
.get_drvinfo = veth_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_strings = veth_get_strings,
@@ -275,6 +300,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_channels = veth_get_channels,
.set_channels = veth_set_channels,
+ .get_coalesce = veth_get_coalesce,
+ .set_coalesce = veth_set_coalesce,
};
/* general routines */
@@ -942,8 +969,14 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_stats *stats,
struct netdev_queue *peer_txq)
{
+ u64 bql_flush_ns, bql_flush_time = 0;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
+ struct veth_priv *priv;
+ int n_bql = 0;
+
+ priv = netdev_priv(rq->dev);
+ bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -972,8 +1005,23 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct sk_buff *skb = veth_ptr_to_skb(ptr);
stats->xdp_bytes += skb->len;
- if (peer_txq && bql_charged)
- netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+ if (peer_txq && bql_charged) {
+ if (!bql_flush_ns) {
+ netdev_tx_completed_queue(peer_txq, 1,
+ VETH_BQL_UNIT);
+ } else {
+ u64 now = local_clock();
+ n_bql++;
+ if (!bql_flush_time) {
+ bql_flush_time = now;
+ } else if (now - bql_flush_time >= bql_flush_ns) {
+ netdev_tx_completed_queue(peer_txq, n_bql,
+ n_bql * VETH_BQL_UNIT);
+ n_bql = 0;
+ bql_flush_time = 0;
+ }
+ }
+ }
skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
if (skb) {
@@ -989,6 +1037,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
if (n_xdpf)
veth_xdp_rcv_bulk_skb(rq, xdpf, n_xdpf, bq, stats);
+ if (peer_txq && n_bql)
+ netdev_tx_completed_queue(peer_txq, n_bql, n_bql * VETH_BQL_UNIT);
+
u64_stats_update_begin(&rq->stats.syncp);
rq->stats.vs.xdp_redirect += stats->xdp_redirect;
rq->stats.vs.xdp_bytes += stats->xdp_bytes;
@@ -1813,6 +1864,8 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
static void veth_setup(struct net_device *dev)
{
+ struct veth_priv *priv = netdev_priv(dev);
+
ether_setup(dev);
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
@@ -1838,6 +1891,8 @@ static void veth_setup(struct net_device *dev)
dev->max_mtu = ETH_MAX_MTU;
dev->watchdog_timeo = msecs_to_jiffies(16000);
+ priv->tx_coal_usecs = VETH_BQL_COAL_TX_USECS;
+
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;