* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-04 14:49 [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
@ 2025-04-04 19:51 ` Jesper Dangaard Brouer
2025-04-05 13:26 ` kernel test robot
` (3 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-04 19:51 UTC (permalink / raw)
To: netdev, Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, kernel-team
On 04/04/2025 16.49, Jesper Dangaard Brouer wrote:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.
>
> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.
>
> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.
>
> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
>
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 58 +++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 50 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 7bb53961c0ea..fff2b615781e 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -308,11 +308,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> {
> if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
> - dev_kfree_skb_any(skb);
> - return NET_RX_DROP;
> + return NETDEV_TX_BUSY; /* signal qdisc layer */
> }
>
> - return NET_RX_SUCCESS;
> + return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> }
>
> static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
> @@ -342,15 +341,26 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev,
> rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD));
> }
>
> +/* Does specific txq have a real qdisc attached? - see noqueue_init() */
> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
> +{
> + struct Qdisc *q;
> +
> + q = rcu_dereference(txq->qdisc);
> + if (q->enqueue)
> + return true;
> + else
> + return false;
> +}
> +
> static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> struct veth_rq *rq = NULL;
> - int ret = NETDEV_TX_OK;
> struct net_device *rcv;
> int length = skb->len;
> bool use_napi = false;
> - int rxq;
> + int ret, rxq;
>
> rcu_read_lock();
> rcv = rcu_dereference(priv->peer);
> @@ -373,17 +383,39 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
>
> skb_tx_timestamp(skb);
> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
> +
> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
> + switch(ret) {
> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
> if (!use_napi)
> dev_sw_netstats_tx_add(dev, 1, length);
> else
> __veth_xdp_flush(rq);
> - } else {
> + break;
> + case NETDEV_TX_BUSY:
> + /* If a qdisc is attached to our virtual device, returning
> + * NETDEV_TX_BUSY is allowed.
> + */
> + struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);
> +
> + if (!txq_has_qdisc(txq)) {
> + dev_kfree_skb_any(skb);
> + goto drop;
> + }
> + netif_tx_stop_queue(txq); /* Unconditional netif_txq_try_stop */
> + if (use_napi)
> + __veth_xdp_flush(rq);
> +
Found a bug here... I need to skb_push back Ethernet header, because
__dev_forward_skb() via eth_type_trans() pulled it off.
Code fix:
__skb_push(skb, ETH_HLEN)
> + break;
> + case NET_RX_DROP: /* same as NET_XMIT_DROP */
> drop:
> atomic64_inc(&priv->dropped);
> ret = NET_XMIT_DROP;
> + break;
> + default:
> + net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
> + dev->name, ret);
> }
> -
> rcu_read_unlock();
>
> return ret;
> @@ -874,9 +906,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> struct veth_xdp_tx_bq *bq,
> struct veth_stats *stats)
> {
> + struct veth_priv *priv = netdev_priv(rq->dev);
> + int queue_idx = rq->xdp_rxq.queue_index;
> + struct netdev_queue *peer_txq;
> + struct net_device *peer_dev;
> int i, done = 0, n_xdpf = 0;
> void *xdpf[VETH_XDP_BATCH];
>
> + peer_dev = priv->peer;
> + peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
> +
> for (i = 0; i < budget; i++) {
> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>
> @@ -925,6 +964,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> rq->stats.vs.xdp_packets += done;
> u64_stats_update_end(&rq->stats.syncp);
>
> + if (unlikely(netif_tx_queue_stopped(peer_txq)))
> + netif_tx_wake_queue(peer_txq);
> +
> return done;
> }
>
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-04 14:49 [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
2025-04-04 19:51 ` Jesper Dangaard Brouer
@ 2025-04-05 13:26 ` kernel test robot
2025-04-05 13:58 ` kernel test robot
` (2 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2025-04-05 13:26 UTC (permalink / raw)
To: Jesper Dangaard Brouer; +Cc: oe-kbuild-all
Hi Jesper,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Jesper-Dangaard-Brouer/veth-apply-qdisc-backpressure-on-full-ptr_ring-to-reduce-TX-drops/20250404-225209
base: net-next/main
patch link: https://lore.kernel.org/r/174377814192.3376479.16481605648460889310.stgit%40firesoul
patch subject: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
config: arm64-randconfig-002-20250405 (https://download.01.org/0day-ci/archive/20250405/202504052142.TGI9fCwf-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250405/202504052142.TGI9fCwf-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504052142.TGI9fCwf-lkp@intel.com/
All errors (new ones prefixed by >>):
drivers/net/veth.c: In function 'veth_xmit':
>> drivers/net/veth.c:399:3: error: a label can only be part of a statement and a declaration is not a statement
struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);
^~~~~~
vim +399 drivers/net/veth.c
355
356 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
357 {
358 struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
359 struct veth_rq *rq = NULL;
360 struct net_device *rcv;
361 int length = skb->len;
362 bool use_napi = false;
363 int ret, rxq;
364
365 rcu_read_lock();
366 rcv = rcu_dereference(priv->peer);
367 if (unlikely(!rcv) || !pskb_may_pull(skb, ETH_HLEN)) {
368 kfree_skb(skb);
369 goto drop;
370 }
371
372 rcv_priv = netdev_priv(rcv);
373 rxq = skb_get_queue_mapping(skb);
374 if (rxq < rcv->real_num_rx_queues) {
375 rq = &rcv_priv->rq[rxq];
376
377 /* The napi pointer is available when an XDP program is
378 * attached or when GRO is enabled
379 * Don't bother with napi/GRO if the skb can't be aggregated
380 */
381 use_napi = rcu_access_pointer(rq->napi) &&
382 veth_skb_is_eligible_for_gro(dev, rcv, skb);
383 }
384
385 skb_tx_timestamp(skb);
386
387 ret = veth_forward_skb(rcv, skb, rq, use_napi);
388 switch(ret) {
389 case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
390 if (!use_napi)
391 dev_sw_netstats_tx_add(dev, 1, length);
392 else
393 __veth_xdp_flush(rq);
394 break;
395 case NETDEV_TX_BUSY:
396 /* If a qdisc is attached to our virtual device, returning
397 * NETDEV_TX_BUSY is allowed.
398 */
> 399 struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);
400
401 if (!txq_has_qdisc(txq)) {
402 dev_kfree_skb_any(skb);
403 goto drop;
404 }
405 netif_tx_stop_queue(txq); /* Unconditional netif_txq_try_stop */
406 if (use_napi)
407 __veth_xdp_flush(rq);
408
409 break;
410 case NET_RX_DROP: /* same as NET_XMIT_DROP */
411 drop:
412 atomic64_inc(&priv->dropped);
413 ret = NET_XMIT_DROP;
414 break;
415 default:
416 net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
417 dev->name, ret);
418 }
419 rcu_read_unlock();
420
421 return ret;
422 }
423
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-04 14:49 [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
2025-04-04 19:51 ` Jesper Dangaard Brouer
2025-04-05 13:26 ` kernel test robot
@ 2025-04-05 13:58 ` kernel test robot
2025-04-07 9:15 ` Toke Høiland-Jørgensen
2025-04-07 17:02 ` Simon Horman
4 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2025-04-05 13:58 UTC (permalink / raw)
To: Jesper Dangaard Brouer; +Cc: llvm, oe-kbuild-all
Hi Jesper,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:
[auto build test WARNING on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Jesper-Dangaard-Brouer/veth-apply-qdisc-backpressure-on-full-ptr_ring-to-reduce-TX-drops/20250404-225209
base: net-next/main
patch link: https://lore.kernel.org/r/174377814192.3376479.16481605648460889310.stgit%40firesoul
patch subject: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
config: arm-randconfig-001-20250405 (https://download.01.org/0day-ci/archive/20250405/202504052130.3cm740O3-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250405/202504052130.3cm740O3-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504052130.3cm740O3-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/net/veth.c:399:3: warning: label followed by a declaration is a C23 extension [-Wc23-extensions]
399 | struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);
| ^
1 warning generated.
vim +399 drivers/net/veth.c
355
356 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
357 {
358 struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
359 struct veth_rq *rq = NULL;
360 struct net_device *rcv;
361 int length = skb->len;
362 bool use_napi = false;
363 int ret, rxq;
364
365 rcu_read_lock();
366 rcv = rcu_dereference(priv->peer);
367 if (unlikely(!rcv) || !pskb_may_pull(skb, ETH_HLEN)) {
368 kfree_skb(skb);
369 goto drop;
370 }
371
372 rcv_priv = netdev_priv(rcv);
373 rxq = skb_get_queue_mapping(skb);
374 if (rxq < rcv->real_num_rx_queues) {
375 rq = &rcv_priv->rq[rxq];
376
377 /* The napi pointer is available when an XDP program is
378 * attached or when GRO is enabled
379 * Don't bother with napi/GRO if the skb can't be aggregated
380 */
381 use_napi = rcu_access_pointer(rq->napi) &&
382 veth_skb_is_eligible_for_gro(dev, rcv, skb);
383 }
384
385 skb_tx_timestamp(skb);
386
387 ret = veth_forward_skb(rcv, skb, rq, use_napi);
388 switch(ret) {
389 case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
390 if (!use_napi)
391 dev_sw_netstats_tx_add(dev, 1, length);
392 else
393 __veth_xdp_flush(rq);
394 break;
395 case NETDEV_TX_BUSY:
396 /* If a qdisc is attached to our virtual device, returning
397 * NETDEV_TX_BUSY is allowed.
398 */
> 399 struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);
400
401 if (!txq_has_qdisc(txq)) {
402 dev_kfree_skb_any(skb);
403 goto drop;
404 }
405 netif_tx_stop_queue(txq); /* Unconditional netif_txq_try_stop */
406 if (use_napi)
407 __veth_xdp_flush(rq);
408
409 break;
410 case NET_RX_DROP: /* same as NET_XMIT_DROP */
411 drop:
412 atomic64_inc(&priv->dropped);
413 ret = NET_XMIT_DROP;
414 break;
415 default:
416 net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
417 dev->name, ret);
418 }
419 rcu_read_unlock();
420
421 return ret;
422 }
423
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-04 14:49 [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
` (2 preceding siblings ...)
2025-04-05 13:58 ` kernel test robot
@ 2025-04-07 9:15 ` Toke Høiland-Jørgensen
2025-04-07 12:42 ` Jesper Dangaard Brouer
2025-04-07 23:02 ` David Ahern
2025-04-07 17:02 ` Simon Horman
4 siblings, 2 replies; 10+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-04-07 9:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, kernel-team
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.
Right, I definitely agree that this is the right solution; having no
backpressure and a fixed-size ringbuffer is obviously not ideal.
> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.
Not sure I like this bit, though; see below.
> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.
So the BQL algorithm tries to tune the maximum number of outstanding
bytes to be ~twice the maximum that can be completed in one batch. Since
we're not really limited by bytes in the same sense here (as you point
out), an approximate equivalent would be the NAPI budget, I guess? I.e.,
as a first approximation, we could have veth stop the queue once the
ringbuffer has 2x the NAPI budget packets in it?
> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
>
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
[...]
> +/* Does specific txq have a real qdisc attached? - see noqueue_init() */
> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
> +{
> + struct Qdisc *q;
> +
> + q = rcu_dereference(txq->qdisc);
> + if (q->enqueue)
> + return true;
> + else
> + return false;
> +}
This seems like a pretty ugly layering violation, inspecting the qdisc
like this in the driver?
AFAICT, __dev_queue_xmit() turns a stopped queue into drops anyway, but
emits a warning (looks like this, around line 4640 in dev.c):
net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
dev->name);
As this patch shows, it can clearly be appropriate for a virtual device
to stop the queue even if there's no qdisc, so how about we just get rid
of that warning? Then this logic won't be needed at all in the driver..
-Toke
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-07 9:15 ` Toke Høiland-Jørgensen
@ 2025-04-07 12:42 ` Jesper Dangaard Brouer
2025-04-07 23:02 ` David Ahern
1 sibling, 0 replies; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-07 12:42 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, netdev, Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni, kernel-team,
makita.toshiaki, Yan Zhai, Jesse Brandeburg, Michael S. Tsirkin
On 07/04/2025 11.15, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>
>> In production, we're seeing TX drops on veth devices when the ptr_ring
>> fills up. This can occur when NAPI mode is enabled, though it's
>> relatively rare. However, with threaded NAPI - which we use in
>> production - the drops become significantly more frequent.
>>
>> The underlying issue is that with threaded NAPI, the consumer often runs
>> on a different CPU than the producer. This increases the likelihood of
>> the ring filling up before the consumer gets scheduled, especially under
>> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>>
>> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
>> ring is full, signaling the qdisc layer to requeue the packet. The txq
>> (netdev queue) is stopped in this condition and restarted once
>> veth_poll() drains entries from the ring, ensuring coordination between
>> NAPI and qdisc.
>
> Right, I definitely agree that this is the right solution; having no
> backpressure and a fixed-size ringbuffer is obviously not ideal.
>
>> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
>> the driver retains its original behavior - dropping packets immediately
>> when the ring is full. This avoids unexpected behavior changes in setups
>> without a configured qdisc.
>
> Not sure I like this bit, though; see below.
>
I agree, but the reason for being chicken here is my prod deployment plan.
I'm about to roll this change out to production, and it is practical
that we can do a roll-back based on removing the qdisc.
In the future, we (netdev community) should consider changing veth to
automatically get the default qdisc attached, when enabling NAPI mode.
I will propose this when I have more data from production.
The ptr_ring overflow case is very easy to reproduce, even on testlab
servers without any competing traffic.
Together with Yan (Cc) we/I have a reproducer script available here:
https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_setup01_NAPI_TX_drops.sh
>> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
>> (AQM) to fairly schedule packets across flows and reduce collateral
>> damage from elephant flows.
>>
>> A known limitation of this approach is that the full ring sits in front
>> of the qdisc layer, effectively forming a FIFO buffer that introduces
>> base latency. While AQM still improves fairness and mitigates flow
>> dominance, the latency impact is measurable.
>>
>> In hardware drivers, this issue is typically addressed using BQL (Byte
>> Queue Limits), which tracks in-flight bytes needed based on physical link
>> rate. However, for virtual drivers like veth, there is no fixed bandwidth
>> constraint - the bottleneck is CPU availability and the scheduler's ability
>> to run the NAPI thread. It is unclear how effective BQL would be in this
>> context.
>
> So the BQL algorithm tries to tune the maximum number of outstanding
> bytes to be ~twice the maximum that can be completed in one batch. Since
> we're not really limited by bytes in the same sense here (as you point
> out), an approximate equivalent would be the NAPI budget, I guess? I.e.,
> as a first approximation, we could have veth stop the queue once the
> ringbuffer has 2x the NAPI budget packets in it?
>
I like the idea. (Note: The current ptr_ring size is 256.)
To implement this, we could simply reduce the ptr_ring size to 128
(2*64), right?
IMHO would be cooler to have a larger queue, but dynamically stop it
once the ringbuffer has 2x the NAPI budget. but...
There are two subtle issue with the ptr_ring (Cc MST). (So, perhaps we
need choose another ring buffer API for veth?)
(1) We don't have a counter for how many elements are in the queue.
Most hardware drivers tried to stop the TXQ *before* their ring buffer
runs full, such that it avoids returning NETDEV_TX_BUSY.
(This is why it is hard to implement BQL for veth.)
(2) ptr_ring consumer delays making the slots available to the producer,
to avoiding bouncing cache-lines, in the ring-full case. Thus, we
cannot make ptr_ring too small. And I'm unsure how this interacts with
the idea of having two NAPI budgets available.
>> This patch serves as a first step toward addressing TX drops. Future work
>> may explore adapting a BQL-like mechanism to better suit virtual devices
>> like veth.
>>
>> Reported-by: Yan Zhai <yan@cloudflare.com>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>
> [...]
>
>> +/* Does specific txq have a real qdisc attached? - see noqueue_init() */
>> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
>> +{
>> + struct Qdisc *q;
>> +
>> + q = rcu_dereference(txq->qdisc);
>> + if (q->enqueue)
>> + return true;
>> + else
>> + return false;
>> +}
>
> This seems like a pretty ugly layering violation, inspecting the qdisc
> like this in the driver?
>
I agree. (as minimum it should be moved to include/net/sch_generic.h.)
I did considered using qdisc_tx_is_noop() defined in
include/net/sch_generic.h, but IMHO it is too slow for data-(fast)path
code as it walks all the TXQs.
(more below)
> AFAICT, __dev_queue_xmit() turns a stopped queue into drops anyway, but
> emits a warning (looks like this, around line 4640 in dev.c):
>
> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> dev->name);
>
> As this patch shows, it can clearly be appropriate for a virtual device
> to stop the queue even if there's no qdisc, so how about we just get rid
> of that warning? Then this logic won't be needed at all in the driver..
>
Yes, the real reason for the layer violation is to avoid this warning in
_dev_queue_xmit().
The NETDEV_TX_BUSY is (correctly) converted into a drop, when the
net_device TXQ doesn't have a qdisc attached. Thus, we don't need the
driver layer violation, if we can remove this warning.
Does anyone object to removing this net_crit_ratelimited?
--Jesper
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-07 9:15 ` Toke Høiland-Jørgensen
2025-04-07 12:42 ` Jesper Dangaard Brouer
@ 2025-04-07 23:02 ` David Ahern
2025-04-08 11:23 ` Toke Høiland-Jørgensen
1 sibling, 1 reply; 10+ messages in thread
From: David Ahern @ 2025-04-07 23:02 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, netdev,
Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni, kernel-team
On 4/7/25 3:15 AM, Toke Høiland-Jørgensen wrote:
>> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
>> +{
>> + struct Qdisc *q;
>> +
>> + q = rcu_dereference(txq->qdisc);
>> + if (q->enqueue)
>> + return true;
>> + else
>> + return false;
>> +}
>
> This seems like a pretty ugly layering violation, inspecting the qdisc
> like this in the driver?
vrf driver has something very similar - been there since March 2017.
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-07 23:02 ` David Ahern
@ 2025-04-08 11:23 ` Toke Høiland-Jørgensen
2025-04-08 15:20 ` David Ahern
0 siblings, 1 reply; 10+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-04-08 11:23 UTC (permalink / raw)
To: David Ahern, Jesper Dangaard Brouer, netdev, Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni, kernel-team
David Ahern <dsahern@kernel.org> writes:
> On 4/7/25 3:15 AM, Toke Høiland-Jørgensen wrote:
>>> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
>>> +{
>>> + struct Qdisc *q;
>>> +
>>> + q = rcu_dereference(txq->qdisc);
>>> + if (q->enqueue)
>>> + return true;
>>> + else
>>> + return false;
>>> +}
>>
>> This seems like a pretty ugly layering violation, inspecting the qdisc
>> like this in the driver?
>
> vrf driver has something very similar - been there since March 2017.
Doesn't make it any less ugly, though ;)
And AFAICT, vrf is doing more with the information; basically picking a
whole different TX path? Can you elaborate on the reasoning for this (do
people actually install qdiscs on VRF devices in practice)?
-Toke
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-08 11:23 ` Toke Høiland-Jørgensen
@ 2025-04-08 15:20 ` David Ahern
0 siblings, 0 replies; 10+ messages in thread
From: David Ahern @ 2025-04-08 15:20 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, netdev,
Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni, kernel-team
On 4/8/25 5:23 AM, Toke Høiland-Jørgensen wrote:
> David Ahern <dsahern@kernel.org> writes:
>
>> On 4/7/25 3:15 AM, Toke Høiland-Jørgensen wrote:
>>>> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
>>>> +{
>>>> + struct Qdisc *q;
>>>> +
>>>> + q = rcu_dereference(txq->qdisc);
>>>> + if (q->enqueue)
>>>> + return true;
>>>> + else
>>>> + return false;
>>>> +}
>>>
>>> This seems like a pretty ugly layering violation, inspecting the qdisc
>>> like this in the driver?
>>
>> vrf driver has something very similar - been there since March 2017.
>
> Doesn't make it any less ugly, though ;)
in my eyes, it is a thing of beauty.
>
> And AFAICT, vrf is doing more with the information; basically picking a
> whole different TX path? Can you elaborate on the reasoning for this (do
> people actually install qdiscs on VRF devices in practice)?
It is not common AFAIK, but it is possible. I wanted to avoid the
overhead for what I think is a rare configuration.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-04 14:49 [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
` (3 preceding siblings ...)
2025-04-07 9:15 ` Toke Høiland-Jørgensen
@ 2025-04-07 17:02 ` Simon Horman
4 siblings, 0 replies; 10+ messages in thread
From: Simon Horman @ 2025-04-07 17:02 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, Jakub Kicinski, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, Toke Høiland-Jørgensen, kernel-team
On Fri, Apr 04, 2025 at 04:49:01PM +0200, Jesper Dangaard Brouer wrote:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.
>
> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.
>
> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.
>
> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
>
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 58 +++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 50 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
...
> @@ -373,17 +383,39 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
>
> skb_tx_timestamp(skb);
> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
> +
> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
> + switch(ret) {
> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
> if (!use_napi)
> dev_sw_netstats_tx_add(dev, 1, length);
> else
> __veth_xdp_flush(rq);
> - } else {
> + break;
> + case NETDEV_TX_BUSY:
> + /* If a qdisc is attached to our virtual device, returning
> + * NETDEV_TX_BUSY is allowed.
> + */
> + struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);
Hi Toke,
FYI, clang 20.1.1 W=1 build says:
drivers/net/veth.c:399:3: warning: label followed by a declaration is a C23 extension [-Wc23-extensions]
399 | struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);
> +
> + if (!txq_has_qdisc(txq)) {
> + dev_kfree_skb_any(skb);
> + goto drop;
> + }
> + netif_tx_stop_queue(txq); /* Unconditional netif_txq_try_stop */
> + if (use_napi)
> + __veth_xdp_flush(rq);
> +
> + break;
> + case NET_RX_DROP: /* same as NET_XMIT_DROP */
> drop:
> atomic64_inc(&priv->dropped);
> ret = NET_XMIT_DROP;
> + break;
> + default:
> + net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
> + dev->name, ret);
> }
> -
> rcu_read_unlock();
>
> return ret;
...
^ permalink raw reply [flat|nested] 10+ messages in thread