Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs

From: Jesper Dangaard Brouer <hawk@kernel.org>
To: "Simon Schippers" <simon.schippers@tu-dortmund.de>,
	netdev@vger.kernel.org,
	"Jonas Köppeler" <j.koeppeler@tu-berlin.de>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	John Fastabend <john.fastabend@gmail.com>,
	Stanislav Fomichev <sdf@fomichev.me>,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
Date: Mon, 8 Jun 2026 15:04:52 +0200
Message-ID: <7fcb9f02-db61-416c-a6f5-b737d74110ec@kernel.org>
In-Reply-To: <db2aa3b9-09d4-46d8-aafc-0a859ee8b635@tu-dortmund.de>



On 08/06/2026 12.38, Simon Schippers wrote:
> On 5/27/26 15:54, hawk@kernel.org wrote:
>> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>>
>> Per-packet BQL completion forces DQL to converge on limit=2, causing
>> excessive NAPI scheduling overhead and qdisc requeues.
>>
>> Accumulate BQL completions and flush them when a configurable time
>> threshold is exceeded, letting DQL discover a limit that bounds actual
>> queuing delay to the configured interval. Coalescing state persists
>> across NAPI polls in struct veth_rq so completions can accumulate
>> beyond a single budget=64 cycle.
>>
>> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
>> setting tx-usecs to 0 disables coalescing and falls back to per-packet
>> completion.
>>
>>    ethtool -C <veth-dev> tx-usecs 500  # 500us coalescing
>>    ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)
>>
>> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
> 
> I found the issue that n_bql may grow without bound if producer
> and consumer run at the same speed (and tx_usecs is large). It could
> potentially trigger a BUG_ON if n_bql grows beyond INT_MAX...
> Also, I figured that no hardware BQL driver ever completes more
> elements than the BQL limit.
> 
> Therefore, I propose a simpler logic (see attachment) that completes
> either on the usual bql_flush_ns or if n_bql > dql.limit.
> If n_bql > dql.limit then we either have the case above that the
> producer is as fast as the consumer or we have BQL starvation.
> 
> if (state->time + bql_flush_ns <= current_time ||
> 	state->n_bql > peer_txq->dql.limit) {
> 
> It must be n_bql *bigger than* dql.limit because the producer will
> always exceed the limit before it stops, see netdev_tx_sent_queue().
> It is fast because peer_txq->dql.limit is in the cacheline of the
> completion path, see dynamic_queue_limits.h.
> 
> Another advantage is that we avoid the snippet checking for empty
> and BQL stopped, which requires an smp_rmb() and a test_bit().
> 
> Apart from that I:
> - Now always call veth_bql_maybe_complete() in the for loop to have
>    more accurate completion intervals when having mixed XDP and
>    non-XDP packets.
> - Made it so tx_usecs = 0 is now also a normal case.
> - Changed the type of n_bql from int to uint.
> - Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
> - Moved the bql_state init in __veth_napi_enable_range() in front
>    of napi_enable() to avoid a race (Sashiko).
> - Moved the bql_state reset in veth_napi_del_range() after the
>    ptr_ring_cleanup() (probably does not matter but makes sense to me)
> 
> Benchmarks look just fine, see commit message.
> 
> WDYT?

Looks good to me, I will use this in my V7 patchset.
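
For readers following the thread, the proposed flush rule can be sketched
in plain userspace C roughly as below. This is only an illustration, not
the code from the attachment: struct bql_state here is a simplified
stand-in, bql_should_flush() and bql_account() are hypothetical helper
names, and dql_limit is passed in as a plain number instead of being read
from peer_txq->dql.limit. The sketch assumes n_bql and the limit are
counted in the same unit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-queue coalescing state. */
struct bql_state {
	uint64_t time;       /* timestamp (ns) of the last flush */
	unsigned int n_bql;  /* completions accumulated since the last flush */
};

/*
 * Flush when the coalescing interval has elapsed, or when more has been
 * accumulated than the current limit allows. The comparison is "greater
 * than" because the producer exceeds the limit before it stops.
 */
static bool bql_should_flush(const struct bql_state *state, uint64_t now_ns,
			     uint64_t bql_flush_ns, unsigned int dql_limit)
{
	return state->time + bql_flush_ns <= now_ns ||
	       state->n_bql > dql_limit;
}

/*
 * Account one consumed element. Returns the number of completions that
 * would be reported in this call, or 0 if they stay accumulated.
 */
static unsigned int bql_account(struct bql_state *state, uint64_t now_ns,
				uint64_t bql_flush_ns, unsigned int dql_limit)
{
	unsigned int flushed = 0;

	state->n_bql++;
	if (bql_should_flush(state, now_ns, bql_flush_ns, dql_limit)) {
		flushed = state->n_bql;
		state->n_bql = 0;
		state->time = now_ns;
	}
	return flushed;
}
```

With bql_flush_ns = 0 the time check is true on every call (as long as
the clock does not go backwards), so the sketch degrades to per-packet
completion without needing a special case.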

A bike-shedding issue: We change the coalescing parameters for the veth
net_device, but should this be a TX or RX parameter?

For physical NICs, adjusting TX coalescing affects BQL because it
affects the TX completion of the transmitted packets. For veth, it is the
veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
is where this patch adds netdev_tx_completed_queue() calls for BQL.
Any opinions on the "TX" or "RX" color?

--Jesper
