From: Simon Horman <horms@kernel.org>
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: netdev@vger.kernel.org, "Jakub Kicinski" <kuba@kernel.org>,
bpf@vger.kernel.org, tom@herbertland.com,
"Eric Dumazet" <eric.dumazet@gmail.com>,
"David S. Miller" <davem@davemloft.net>,
"Paolo Abeni" <pabeni@redhat.com>,
"Toke Høiland-Jørgensen" <toke@toke.dk>,
kernel-team@cloudflare.com
Subject: Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
Date: Mon, 7 Apr 2025 18:02:56 +0100
Message-ID: <20250407170256.GU395307@horms.kernel.org>
In-Reply-To: <174377814192.3376479.16481605648460889310.stgit@firesoul>
On Fri, Apr 04, 2025 at 04:49:01PM +0200, Jesper Dangaard Brouer wrote:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.
>
> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.
>
> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.
>
> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
>
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 58 +++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 50 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
...
> @@ -373,17 +383,39 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
>
> skb_tx_timestamp(skb);
> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
> +
> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
> + switch(ret) {
> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
> if (!use_napi)
> dev_sw_netstats_tx_add(dev, 1, length);
> else
> __veth_xdp_flush(rq);
> - } else {
> + break;
> + case NETDEV_TX_BUSY:
> + /* If a qdisc is attached to our virtual device, returning
> + * NETDEV_TX_BUSY is allowed.
> + */
> + struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);

Hi Jesper,

FYI, a clang 20.1.1 W=1 build reports:

drivers/net/veth.c:399:3: warning: label followed by a declaration is a C23 extension [-Wc23-extensions]
399 | struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);

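
One possible way to silence this, offered only as a sketch: before C23 a
case label must be followed by a statement, and a declaration is not a
statement. Wrapping the body of the case in braces makes the label
precede a compound statement instead. The userspace snippet below shows
the pattern in isolation. The DEMO_* constants and demo_xmit_result()
are hypothetical stand-ins, not code from the patch, and has_qdisc only
mimics the txq_has_qdisc() check.

```c
/* Hypothetical stand-ins for the return codes handled in veth_xmit(). */
enum { DEMO_RX_SUCCESS = 0, DEMO_RX_DROP = 1, DEMO_TX_BUSY = 0x10 };

/* Returns the code the caller would see for a given forward result. */
int demo_xmit_result(int ret, int has_qdisc)
{
	switch (ret) {
	case DEMO_RX_SUCCESS:
		break;
	case DEMO_TX_BUSY: {
		/* The braces open a compound statement, so this
		 * declaration no longer directly follows the label.
		 */
		int stop_queue = has_qdisc;

		if (!stop_queue)
			goto drop;
		break;
	}
	case DEMO_RX_DROP:
drop:
		ret = DEMO_RX_DROP;
		break;
	default:
		break;
	}
	return ret;
}
```

Declaring txq at the top of veth_xmit() alongside the other locals would
be another option, and avoids the extra nesting.
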
> +
> + if (!txq_has_qdisc(txq)) {
> + dev_kfree_skb_any(skb);
> + goto drop;
> + }
> + netif_tx_stop_queue(txq); /* Unconditional netif_txq_try_stop */
> + if (use_napi)
> + __veth_xdp_flush(rq);
> +
> + break;
> + case NET_RX_DROP: /* same as NET_XMIT_DROP */
> drop:
> atomic64_inc(&priv->dropped);
> ret = NET_XMIT_DROP;
> + break;
> + default:
> + net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
> + dev->name, ret);
> }
> -
> rcu_read_unlock();
>
> return ret;
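
For anyone following the stop/restart scheme from the commit message,
here is a minimal single-threaded userspace model of it. It is a sketch
under stated assumptions, not the driver logic: struct demo_queue,
demo_xmit() and demo_poll() are hypothetical names, the ring is a plain
array rather than a ptr_ring, and there are no memory barriers or
concurrency concerns, which the real code has to handle.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_RING_SIZE 4

/* Hypothetical model of one queue: a fixed-size ring plus a flag that
 * stands in for the stopped state of the txq.
 */
struct demo_queue {
	int ring[DEMO_RING_SIZE];
	size_t head;	/* next slot to consume */
	size_t count;	/* entries currently in the ring */
	bool stopped;	/* models a stopped netdev queue */
};

/* Producer: returns true if the packet was queued. On a full ring it
 * stops the queue and returns false, so the caller keeps the packet
 * (the NETDEV_TX_BUSY case described in the commit message).
 */
bool demo_xmit(struct demo_queue *q, int pkt)
{
	if (q->count == DEMO_RING_SIZE) {
		q->stopped = true;
		return false;
	}
	q->ring[(q->head + q->count) % DEMO_RING_SIZE] = pkt;
	q->count++;
	return true;
}

/* Consumer: drains up to budget entries and restarts a stopped queue
 * once at least one entry was drained. Returns the number consumed.
 */
size_t demo_poll(struct demo_queue *q, size_t budget)
{
	size_t done = 0;

	while (done < budget && q->count > 0) {
		q->head = (q->head + 1) % DEMO_RING_SIZE;
		q->count--;
		done++;
	}
	if (done > 0 && q->stopped)
		q->stopped = false;
	return done;
}
```

In this model a rejected packet is never freed by the producer, which is
the point of the backpressure: ownership stays with the caller until the
consumer has made room.
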
...