Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs

BPF List
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: hawk@kernel.org
Cc: bpf@vger.kernel.org
Subject: Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
Date: Sat, 30 May 2026 00:50:10 +0000	[thread overview]
Message-ID: <20260530005011.3F4E51F00899@smtp.kernel.org> (raw)
In-Reply-To: <20260527135418.1166665-6-hawk@kernel.org>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Coalesced BQL completions are left pending indefinitely when the receive ring is empty and NAPI yields, corrupting DQL timing.
- [Medium] Disabling BQL coalescing at runtime (tx-usecs = 0) orphans previously accumulated completions, permanently skewing DQL state.
- [Low] Data race on `priv->tx_coal_usecs` between ethtool configuration and the NAPI poll data path.
--

commit d710fd75a7c5d61400890032084a8df84bc9498d
Author: Simon Schippers <simon.schippers@tu-dortmund.de>

    veth: time-based BQL completion coalescing via ethtool tx-usecs

This patch implements time-based batching of BQL completions for the veth
driver to reduce NAPI scheduling overhead. It accumulates BQL completions
and flushes them when a configurable time threshold is exceeded, letting
DQL discover a limit that bounds actual queuing delay to the configured
interval.

> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index d5675d9d52361..743d17b37223e 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c

[ ... ]

> @@ -265,7 +275,30 @@ static void veth_get_channels(struct net_device *dev,
>  static int veth_set_channels(struct net_device *dev,
>  			     struct ethtool_channels *ch);
>  
> +static int veth_get_coalesce(struct net_device *dev,
> +			     struct ethtool_coalesce *ec,
> +			     struct kernel_ethtool_coalesce *kernel_coal,
> +			     struct netlink_ext_ack *extack)
> +{
> +	struct veth_priv *priv = netdev_priv(dev);
> +
> +	ec->tx_coalesce_usecs = priv->tx_coal_usecs;
> +	return 0;
> +}
> +
> +static int veth_set_coalesce(struct net_device *dev,
> +			     struct ethtool_coalesce *ec,
> +			     struct kernel_ethtool_coalesce *kernel_coal,
> +			     struct netlink_ext_ack *extack)
> +{
> +	struct veth_priv *priv = netdev_priv(dev);
> +
> +	priv->tx_coal_usecs = ec->tx_coalesce_usecs;

[Severity: Low]
Are we missing a WRITE_ONCE() here? priv->tx_coal_usecs is modified under
the RTNL lock but read by veth_xdp_rcv() in softirq context during NAPI
polling without any synchronization. Using WRITE_ONCE() and READ_ONCE()
would prevent compiler reordering and load tearing.

> +	return 0;
> +}

[ ... ]

>  static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>  			struct veth_xdp_tx_bq *bq,
>  			struct veth_stats *stats,
>  			struct netdev_queue *peer_txq)
>  {
> +	struct veth_bql_state *state = &rq->bql_state;
>  	int i, done = 0, n_xdpf = 0;
>  	void *xdpf[VETH_XDP_BATCH];
> +	struct veth_priv *priv;
> +	u64 bql_flush_ns;
> +
> +	priv = netdev_priv(rq->dev);
> +	bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
> +
> +	/* Clamp stored timestamp in case we migrated to a CPU with a behind
> +	 * sched_clock(); prevents the deadline from never firing.
> +	 */
> +	state->time = min(state->time, sched_clock());
> +
> +	/* Flush completions that timed out since the previous NAPI poll. */
> +	if (peer_txq && bql_flush_ns)
> +		veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);

[Severity: Medium]
What happens if the user disables coalescing by setting tx-usecs to 0 while
there are accumulated completions?

Because bql_flush_ns becomes 0, this pre-loop time-based flush check will
evaluate to false, skipping the flush of previously batched completions.
Inside the loop, new packets are completed individually, ignoring the old
state->n_bql count.

If those pending completions remain stranded indefinitely, won't the
in-flight byte count tracked by DQL become permanently inflated,
artificially reducing the available queue capacity?

>  	for (i = 0; i < budget; i++) {
>  		void *ptr = __ptr_ring_consume(&rq->xdp_ring);

[ ... ]

> @@ -989,6 +1064,18 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>  	if (n_xdpf)
>  		veth_xdp_rcv_bulk_skb(rq, xdpf, n_xdpf, bq, stats);
>  
> +	/* If the ring is now empty and the peer TX queue is stalled by DQL
> +	 * backpressure, release completions immediately to unblock it.
> +	 */
> +	if (peer_txq && state->n_bql && __ptr_ring_empty(&rq->xdp_ring)) {
> +		/* Pairs with smp_wmb() in __ptr_ring_produce(); ensure ring
> +		 * emptiness is observed before reading peer_txq->state.
> +		 */
> +		smp_rmb();
> +		if (test_bit(__QUEUE_STATE_STACK_XOFF, &peer_txq->state))
> +			veth_bql_complete(state, peer_txq);
> +	}

[Severity: High]
Does this leave coalesced BQL completions pending indefinitely if the
receive ring is empty but the peer TX queue isn't stalled by DQL?

When the NAPI budget loop finishes and the RX ring is empty, the device may
go idle and NAPI polling stops. The batched completions (state->n_bql) are
only flushed if the TX queue is currently stalled by DQL.

If the queue is not stalled, these completions are left pending indefinitely
without a timer to flush them. Since DQL measures the time between enqueue
and completion, wouldn't this arbitrary delay (until the next packet
arrives) be perceived by DQL as a massive transmission delay?

This could cause it to incorrectly shrink the dynamic limit to its minimum
and severely throttle transmission performance.

> +
>  	u64_stats_update_begin(&rq->stats.syncp);
>  	rq->stats.vs.xdp_redirect += stats->xdp_redirect;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260527135418.1166665-1-hawk@kernel.org?part=5

next prev parent reply	other threads:[~2026-05-30  0:50 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20260527135418.1166665-1-hawk@kernel.org>
2026-05-27 13:54 ` [PATCH net-next v6 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction hawk
2026-05-28  7:45   ` Jonas Köppeler
2026-05-30  0:50   ` sashiko-bot
2026-06-04  8:19   ` Paolo Abeni
2026-06-10 12:21     ` Jesper Dangaard Brouer
2026-05-27 13:54 ` [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs hawk
2026-05-28  7:46   ` Jonas Köppeler
2026-06-01 12:00     ` Simon Schippers
2026-06-01 14:03       ` Jonas Köppeler
2026-06-01 16:16         ` Simon Schippers
2026-06-02  7:24           ` Jonas Köppeler
2026-06-02 15:37             ` Simon Schippers
2026-06-03  8:28               ` Jonas Köppeler
2026-05-29 14:51   ` Simon Schippers
2026-05-30  0:50   ` sashiko-bot [this message]
2026-06-04  8:21   ` Paolo Abeni
2026-06-08 10:38   ` Simon Schippers
2026-06-08 13:04     ` Jesper Dangaard Brouer
2026-06-08 13:13       ` Jonas Köppeler
2026-06-08 14:21         ` Simon Schippers
2026-06-09 13:59           ` Jesper Dangaard Brouer
2026-06-09 15:08             ` Simon Schippers
2026-06-10  7:04               ` Simon Schippers
2026-06-10 10:15                 ` Jesper Dangaard Brouer
2026-06-10 12:00                   ` Simon Schippers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260530005011.3F4E51F00899@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=hawk@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox