* [PATCH net-next V4 0/2] veth: qdisc backpressure and qdisc check refactor
@ 2025-04-15 13:44 Jesper Dangaard Brouer
2025-04-15 13:44 ` [PATCH net-next V4 1/2] net: sched: generalize check for no-queue qdisc on TX queue Jesper Dangaard Brouer
2025-04-15 13:45 ` [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
0 siblings, 2 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-15 13:44 UTC (permalink / raw)
To: netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, Toke Høiland-Jørgensen, dsahern,
makita.toshiaki, kernel-team, phil
This patch series addresses TX drops seen on veth devices under load,
particularly when using threaded NAPI, which is our setup in production.
The root cause is that the NAPI consumer often runs on a different CPU
than the producer. Combined with scheduling delays or simply slower
consumption, this increases the chance that the ptr_ring fills up before
packets are drained, resulting in drops from veth_xmit() (ndo_start_xmit()).
To make this easier to reproduce, we’ve created a script that sets up a
test scenario using network namespaces. The script inserts 1000 iptables
rules in the consumer namespace to slow down packet processing and
amplify the issue. Reproducer script:
https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_setup01_NAPI_TX_drops.sh
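In rough outline, the reproducer does something along these lines (a
simplified sketch with placeholder names and rules; see the script
above for the real setup):

  ip netns add ns-consumer
  ip link add veth-prod type veth peer name veth-cons netns ns-consumer
  ip link set veth-prod up
  ip -n ns-consumer link set veth-cons up
  # ~1000 non-matching iptables rules in the consumer namespace, purely
  # to add per-packet processing cost and slow down the consumer side
  for i in $(seq 1000); do
      ip netns exec ns-consumer iptables -A INPUT -p tcp \
          --dport $((20000 + i)) -j ACCEPT
  done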
This series first introduces a helper to detect no-queue qdiscs, and
then uses it in the veth driver to conditionally apply qdisc-level
backpressure when a real qdisc is attached. The behavior is off by
default and strictly opt-in (it only takes effect once a qdisc is
attached), ensuring minimal impact on existing setups and easy
activation.
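To opt in, attach a real qdisc to the veth device, for example
(illustrative commands; the device name is a placeholder):

  # enable backpressure by attaching a queueing qdisc, e.g. sfq or fq
  tc qdisc replace dev veth0 root sfq
  # revert to the default no-queue behavior (drop on full ring as before)
  tc qdisc replace dev veth0 root noqueue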
---
V4:
- Check against no-queue instead of no-op qdisc
- Link to V3: https://lore.kernel.org/all/174464549885.20396.6987653753122223942.stgit@firesoul/
V3:
- Reorder patches, generalize check for no-op qdisc as first patch
- RFC: As testing shows this is incorrect
- Use rcu_dereference(priv->peer) in veth_xdp_rcv(); as this runs in
  NAPI context, rcu_read_lock() is implicit.
- Link to V2: https://lore.kernel.org/all/174412623473.3702169.4235683143719614624.stgit@firesoul/
V2:
- Generalize check for no-op qdisc
- Link to RFC-V1: https://lore.kernel.org/all/174377814192.3376479.16481605648460889310.stgit@firesoul/
Jesper Dangaard Brouer (2):
net: sched: generalize check for no-queue qdisc on TX queue
veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
drivers/net/veth.c | 49 ++++++++++++++++++++++++++++++++-------
drivers/net/vrf.c | 4 +---
include/net/sch_generic.h | 8 +++++++
3 files changed, 50 insertions(+), 11 deletions(-)
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH net-next V4 1/2] net: sched: generalize check for no-queue qdisc on TX queue
2025-04-15 13:44 [PATCH net-next V4 0/2] veth: qdisc backpressure and qdisc check refactor Jesper Dangaard Brouer
@ 2025-04-15 13:44 ` Jesper Dangaard Brouer
2025-04-15 15:43 ` David Ahern
2025-04-16 13:32 ` Toke Høiland-Jørgensen
2025-04-15 13:45 ` [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
1 sibling, 2 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-15 13:44 UTC (permalink / raw)
To: netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, Toke Høiland-Jørgensen, dsahern,
makita.toshiaki, kernel-team, phil
The "noqueue" qdisc can either be attached directly, or get attached by
default if the net_device priv_flags has IFF_NO_QUEUE set. In both
cases, the allocated Qdisc structure gets its enqueue function pointer
reset to NULL by noqueue_init() via noqueue_qdisc_ops.
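For reference, the directly-attached case corresponds to an explicit
configuration such as (illustrative; the device name is a placeholder):

  tc qdisc replace dev eth0 root noqueue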
This is a common case for software virtual net_devices. For these
no-queue devices, the transmission path in __dev_queue_xmit() bypasses
the qdisc layer and invokes the device driver's ndo_start_xmit()
directly (via dev_hard_start_xmit()). In this mode the device driver is
not allowed to ask for packets to be queued (either by returning
NETDEV_TX_BUSY or by stopping the TXQ).
The simplest and most reliable way to identify this no-queue case is by
checking if enqueue == NULL.
The vrf driver currently open-codes this check (!qdisc->enqueue). While
functionally correct, this low-level detail is better encapsulated in a
dedicated helper for clarity and long-term maintainability.
To make this behavior more explicit and reusable, this patch introduces
a new helper: qdisc_txq_has_no_queue(). The helper will also be used by
the veth driver in the next patch, which introduces optional
qdisc-based backpressure.
This is a non-functional change.
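As an illustration only (the real user is the veth change in the next
patch), a driver that wants to apply TXQ backpressure would consult the
helper roughly like this (sketch, not part of this patch):

  /* Only push back on the stack when a real qdisc is attached */
  struct netdev_queue *txq = netdev_get_tx_queue(dev, queue_idx);

  if (qdisc_txq_has_no_queue(txq)) {
          /* no-queue: __dev_queue_xmit() bypassed the qdisc layer,
           * so NETDEV_TX_BUSY / netif_tx_stop_queue() must not be
           * used here.
           */
          dev_kfree_skb_any(skb);
          return NET_XMIT_DROP;
  }
  netif_tx_stop_queue(txq);       /* qdisc present: requeue allowed */
  return NETDEV_TX_BUSY;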
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/vrf.c | 4 +---
include/net/sch_generic.h | 8 ++++++++
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 7168b33adadb..9a4beea6ee0c 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -343,15 +343,13 @@ static int vrf_ifindex_lookup_by_table_id(struct net *net, u32 table_id)
static bool qdisc_tx_is_default(const struct net_device *dev)
{
struct netdev_queue *txq;
- struct Qdisc *qdisc;
if (dev->num_tx_queues > 1)
return false;
txq = netdev_get_tx_queue(dev, 0);
- qdisc = rcu_access_pointer(txq->qdisc);
- return !qdisc->enqueue;
+ return qdisc_txq_has_no_queue(txq);
}
/* Local traffic destined to local address. Reinsert the packet to rx
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d48c657191cd..b6c177f7141c 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -803,6 +803,14 @@ static inline bool qdisc_tx_changing(const struct net_device *dev)
return false;
}
+/* "noqueue" qdisc identified by not having any enqueue, see noqueue_init() */
+static inline bool qdisc_txq_has_no_queue(const struct netdev_queue *txq)
+{
+ struct Qdisc *qdisc = rcu_access_pointer(txq->qdisc);
+
+ return qdisc->enqueue == NULL;
+}
+
/* Is the device using the noop qdisc on all queues? */
static inline bool qdisc_tx_is_noop(const struct net_device *dev)
{
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-15 13:44 [PATCH net-next V4 0/2] veth: qdisc backpressure and qdisc check refactor Jesper Dangaard Brouer
2025-04-15 13:44 ` [PATCH net-next V4 1/2] net: sched: generalize check for no-queue qdisc on TX queue Jesper Dangaard Brouer
@ 2025-04-15 13:45 ` Jesper Dangaard Brouer
2025-04-16 12:44 ` Toshiaki Makita
` (2 more replies)
1 sibling, 3 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-15 13:45 UTC (permalink / raw)
To: netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, Toke Høiland-Jørgensen, dsahern,
makita.toshiaki, kernel-team, phil
In production, we're seeing TX drops on veth devices when the ptr_ring
fills up. This can occur when NAPI mode is enabled, though it's
relatively rare. However, with threaded NAPI - which we use in
production - the drops become significantly more frequent.
The underlying issue is that with threaded NAPI, the consumer often runs
on a different CPU than the producer. This increases the likelihood of
the ring filling up before the consumer gets scheduled, especially under
load, leading to drops in veth_xmit() (ndo_start_xmit()).
This patch introduces backpressure by returning NETDEV_TX_BUSY when the
ring is full, signaling the qdisc layer to requeue the packet. The txq
(netdev queue) is stopped in this condition and restarted once
veth_poll() drains entries from the ring, ensuring coordination between
NAPI and qdisc.
Backpressure is only enabled when a qdisc is attached. Without a qdisc,
the driver retains its original behavior - dropping packets immediately
when the ring is full. This avoids unexpected behavior changes in setups
without a configured qdisc.
With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
(AQM) to fairly schedule packets across flows and reduce collateral
damage from elephant flows.
A known limitation of this approach is that the full ring sits in front
of the qdisc layer, effectively forming a FIFO buffer that introduces
base latency. While AQM still improves fairness and mitigates flow
dominance, the latency impact is measurable.
In hardware drivers, this issue is typically addressed using BQL (Byte
Queue Limits), which limits the number of in-flight bytes based on the
rate at which the physical link drains them. However, for virtual
drivers like veth, there is no fixed bandwidth constraint - the
bottleneck is CPU availability and the scheduler's ability to run the
NAPI thread. It is unclear how effective BQL would be in this context.
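For reference, the BQL hooks that a hardware driver would normally use
look roughly like this (a sketch of the existing generic API, not
something this patch adds; whether it fits veth is the open question
above):

  /* producer side (ndo_start_xmit), after handing the packet over */
  netdev_tx_sent_queue(txq, skb->len);

  /* completion side (for veth that would be the NAPI consumer) */
  netdev_tx_completed_queue(txq, pkts_done, bytes_done);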
This patch serves as a first step toward addressing TX drops. Future work
may explore adapting a BQL-like mechanism to better suit virtual devices
like veth.
Reported-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 41 insertions(+), 8 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7bb53961c0ea..a419d5e198d8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -308,11 +308,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
{
if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
- dev_kfree_skb_any(skb);
- return NET_RX_DROP;
+ return NETDEV_TX_BUSY; /* signal qdisc layer */
}
- return NET_RX_SUCCESS;
+ return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
}
static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
@@ -346,11 +345,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct veth_rq *rq = NULL;
- int ret = NETDEV_TX_OK;
+ struct netdev_queue *txq;
struct net_device *rcv;
int length = skb->len;
bool use_napi = false;
- int rxq;
+ int ret, rxq;
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -373,17 +372,41 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
skb_tx_timestamp(skb);
- if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
+
+ ret = veth_forward_skb(rcv, skb, rq, use_napi);
+ switch(ret) {
+ case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
if (!use_napi)
dev_sw_netstats_tx_add(dev, 1, length);
else
__veth_xdp_flush(rq);
- } else {
+ break;
+ case NETDEV_TX_BUSY:
+ /* If a qdisc is attached to our virtual device, returning
+ * NETDEV_TX_BUSY is allowed.
+ */
+ txq = netdev_get_tx_queue(dev, rxq);
+
+ if (qdisc_txq_has_no_queue(txq)) {
+ dev_kfree_skb_any(skb);
+ goto drop;
+ }
+ netif_tx_stop_queue(txq);
+ /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
+ __skb_push(skb, ETH_HLEN);
+ if (use_napi)
+ __veth_xdp_flush(rq);
+
+ break;
+ case NET_RX_DROP: /* same as NET_XMIT_DROP */
drop:
atomic64_inc(&priv->dropped);
ret = NET_XMIT_DROP;
+ break;
+ default:
+ net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
+ dev->name, ret);
}
-
rcu_read_unlock();
return ret;
@@ -874,9 +897,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
struct veth_stats *stats)
{
+ struct veth_priv *priv = netdev_priv(rq->dev);
+ int queue_idx = rq->xdp_rxq.queue_index;
+ struct netdev_queue *peer_txq;
+ struct net_device *peer_dev;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
+ peer_dev = rcu_dereference(priv->peer);
+ peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
+
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -925,6 +955,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
+ if (unlikely(netif_tx_queue_stopped(peer_txq)))
+ netif_tx_wake_queue(peer_txq);
+
return done;
}
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH net-next V4 1/2] net: sched: generalize check for no-queue qdisc on TX queue
2025-04-15 13:44 ` [PATCH net-next V4 1/2] net: sched: generalize check for no-queue qdisc on TX queue Jesper Dangaard Brouer
@ 2025-04-15 15:43 ` David Ahern
2025-04-15 16:30 ` Jesper Dangaard Brouer
2025-04-16 13:32 ` Toke Høiland-Jørgensen
1 sibling, 1 reply; 13+ messages in thread
From: David Ahern @ 2025-04-15 15:43 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, makita.toshiaki, kernel-team,
phil
On 4/15/25 7:44 AM, Jesper Dangaard Brouer wrote:
> The "noqueue" qdisc can either be directly attached, or get default
> attached if net_device priv_flags has IFF_NO_QUEUE. In both cases, the
> allocated Qdisc structure gets it's enqueue function pointer reset to
> NULL by noqueue_init() via noqueue_qdisc_ops.
>
> This is a common case for software virtual net_devices. For these devices
> with no-queue, the transmission path in __dev_queue_xmit() will bypass
> the qdisc layer. Directly invoking device drivers ndo_start_xmit (via
> dev_hard_start_xmit). In this mode the device driver is not allowed to
> ask for packets to be queued (either via returning NETDEV_TX_BUSY or
> stopping the TXQ).
>
> The simplest and most reliable way to identify this no-queue case is by
> checking if enqueue == NULL.
>
> The vrf driver currently open-codes this check (!qdisc->enqueue). While
> functionally correct, this low-level detail is better encapsulated in a
> dedicated helper for clarity and long-term maintainability.
>
> To make this behavior more explicit and reusable, this patch introduce a
> new helper: qdisc_txq_has_no_queue(). Helper will also be used by the
> veth driver in the next patch, which introduces optional qdisc-based
> backpressure.
>
> This is a non-functional change.
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/vrf.c | 4 +---
> include/net/sch_generic.h | 8 ++++++++
> 2 files changed, 9 insertions(+), 3 deletions(-)
>
> /* Local traffic destined to local address. Reinsert the packet to rx
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index d48c657191cd..b6c177f7141c 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -803,6 +803,14 @@ static inline bool qdisc_tx_changing(const struct net_device *dev)
> return false;
> }
>
> +/* "noqueue" qdisc identified by not having any enqueue, see noqueue_init() */
> +static inline bool qdisc_txq_has_no_queue(const struct netdev_queue *txq)
> +{
> + struct Qdisc *qdisc = rcu_access_pointer(txq->qdisc);
> +
> + return qdisc->enqueue == NULL;
Did checkpatch not complain that this should be '!qdisc->enqueue' ?
Reviewed-by: David Ahern <dsahern@kernel.org>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next V4 1/2] net: sched: generalize check for no-queue qdisc on TX queue
2025-04-15 15:43 ` David Ahern
@ 2025-04-15 16:30 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-15 16:30 UTC (permalink / raw)
To: David Ahern, netdev, Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, makita.toshiaki, kernel-team,
phil
On 15/04/2025 17.43, David Ahern wrote:
> On 4/15/25 7:44 AM, Jesper Dangaard Brouer wrote:
>> The "noqueue" qdisc can either be directly attached, or get default
>> attached if net_device priv_flags has IFF_NO_QUEUE. In both cases, the
>> allocated Qdisc structure gets it's enqueue function pointer reset to
>> NULL by noqueue_init() via noqueue_qdisc_ops.
>>
>> This is a common case for software virtual net_devices. For these devices
>> with no-queue, the transmission path in __dev_queue_xmit() will bypass
>> the qdisc layer. Directly invoking device drivers ndo_start_xmit (via
>> dev_hard_start_xmit). In this mode the device driver is not allowed to
>> ask for packets to be queued (either via returning NETDEV_TX_BUSY or
>> stopping the TXQ).
>>
>> The simplest and most reliable way to identify this no-queue case is by
>> checking if enqueue == NULL.
>>
>> The vrf driver currently open-codes this check (!qdisc->enqueue). While
>> functionally correct, this low-level detail is better encapsulated in a
>> dedicated helper for clarity and long-term maintainability.
>>
>> To make this behavior more explicit and reusable, this patch introduce a
>> new helper: qdisc_txq_has_no_queue(). Helper will also be used by the
>> veth driver in the next patch, which introduces optional qdisc-based
>> backpressure.
>>
>> This is a non-functional change.
>>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>> drivers/net/vrf.c | 4 +---
>> include/net/sch_generic.h | 8 ++++++++
>> 2 files changed, 9 insertions(+), 3 deletions(-)
>>
>
>
>> /* Local traffic destined to local address. Reinsert the packet to rx
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index d48c657191cd..b6c177f7141c 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -803,6 +803,14 @@ static inline bool qdisc_tx_changing(const struct net_device *dev)
>> return false;
>> }
>>
>> +/* "noqueue" qdisc identified by not having any enqueue, see noqueue_init() */
>> +static inline bool qdisc_txq_has_no_queue(const struct netdev_queue *txq)
>> +{
>> + struct Qdisc *qdisc = rcu_access_pointer(txq->qdisc);
>> +
>> + return qdisc->enqueue == NULL;
>
> Did checkpatch not complain that this should be '!qdisc->enqueue' ?
>
Nope:
./scripts/checkpatch.pl
patches-veth_pushback_to_qdisc03/01-0001-net-sched-generalize-noop.patch
total: 0 errors, 0 warnings, 30 lines checked
patches-veth_pushback_to_qdisc03/01-0001-net-sched-generalize-noop.patch
has no obvious style problems and is ready for submission.
>
> Reviewed-by: David Ahern <dsahern@kernel.org>
Thx for review :-)
--Jesper
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-15 13:45 ` [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
@ 2025-04-16 12:44 ` Toshiaki Makita
2025-04-17 13:00 ` Jesper Dangaard Brouer
2025-04-16 13:38 ` Jakub Kicinski
2025-04-16 13:56 ` Toke Høiland-Jørgensen
2 siblings, 1 reply; 13+ messages in thread
From: Toshiaki Makita @ 2025-04-16 12:44 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil, netdev, Jakub Kicinski
On 2025/04/15 22:45, Jesper Dangaard Brouer wrote:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.
>
> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.
>
> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.
>
> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
Thank you for the patch.
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 7bb53961c0ea..a419d5e198d8 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -308,11 +308,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> {
> if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
> - dev_kfree_skb_any(skb);
> - return NET_RX_DROP;
> + return NETDEV_TX_BUSY; /* signal qdisc layer */
> }
You don't need these braces any more?
if (...)
return ...;
>
> - return NET_RX_SUCCESS;
> + return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> }
>
> static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
> @@ -346,11 +345,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> struct veth_rq *rq = NULL;
> - int ret = NETDEV_TX_OK;
> + struct netdev_queue *txq;
> struct net_device *rcv;
> int length = skb->len;
> bool use_napi = false;
> - int rxq;
> + int ret, rxq;
>
> rcu_read_lock();
> rcv = rcu_dereference(priv->peer);
> @@ -373,17 +372,41 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
>
> skb_tx_timestamp(skb);
> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
> +
> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
> + switch(ret) {
> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
> if (!use_napi)
> dev_sw_netstats_tx_add(dev, 1, length);
> else
> __veth_xdp_flush(rq);
> - } else {
> + break;
> + case NETDEV_TX_BUSY:
> + /* If a qdisc is attached to our virtual device, returning
> + * NETDEV_TX_BUSY is allowed.
> + */
> + txq = netdev_get_tx_queue(dev, rxq);
> +
> + if (qdisc_txq_has_no_queue(txq)) {
> + dev_kfree_skb_any(skb);
> + goto drop;
> + }
> + netif_tx_stop_queue(txq);
> + /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
> + __skb_push(skb, ETH_HLEN);
> + if (use_napi)
> + __veth_xdp_flush(rq);
You did not add a packet to the ring.
No need for flush here?
Toshiaki Makita
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next V4 1/2] net: sched: generalize check for no-queue qdisc on TX queue
2025-04-15 13:44 ` [PATCH net-next V4 1/2] net: sched: generalize check for no-queue qdisc on TX queue Jesper Dangaard Brouer
2025-04-15 15:43 ` David Ahern
@ 2025-04-16 13:32 ` Toke Høiland-Jørgensen
1 sibling, 0 replies; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-04-16 13:32 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, dsahern, makita.toshiaki, kernel-team, phil
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> The "noqueue" qdisc can either be directly attached, or get default
> attached if net_device priv_flags has IFF_NO_QUEUE. In both cases, the
> allocated Qdisc structure gets it's enqueue function pointer reset to
> NULL by noqueue_init() via noqueue_qdisc_ops.
>
> This is a common case for software virtual net_devices. For these devices
> with no-queue, the transmission path in __dev_queue_xmit() will bypass
> the qdisc layer. Directly invoking device drivers ndo_start_xmit (via
> dev_hard_start_xmit). In this mode the device driver is not allowed to
> ask for packets to be queued (either via returning NETDEV_TX_BUSY or
> stopping the TXQ).
>
> The simplest and most reliable way to identify this no-queue case is by
> checking if enqueue == NULL.
>
> The vrf driver currently open-codes this check (!qdisc->enqueue). While
> functionally correct, this low-level detail is better encapsulated in a
> dedicated helper for clarity and long-term maintainability.
>
> To make this behavior more explicit and reusable, this patch introduce a
> new helper: qdisc_txq_has_no_queue(). Helper will also be used by the
> veth driver in the next patch, which introduces optional qdisc-based
> backpressure.
>
> This is a non-functional change.
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-15 13:45 ` [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
2025-04-16 12:44 ` Toshiaki Makita
@ 2025-04-16 13:38 ` Jakub Kicinski
2025-04-16 13:58 ` Toke Høiland-Jørgensen
2025-04-16 13:56 ` Toke Høiland-Jørgensen
2 siblings, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2025-04-16 13:38 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil
On Tue, 15 Apr 2025 15:45:05 +0200 Jesper Dangaard Brouer wrote:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
It splats:
[ 5319.025772][ C1] dump_stack_lvl (lib/dump_stack.c:123)
[ 5319.025786][ C1] lockdep_rcu_suspicious (kernel/locking/lockdep.c:6866)
[ 5319.025797][ C1] veth_xdp_rcv (drivers/net/veth.c:907 (discriminator 9))
[ 5319.025850][ C1] veth_poll (drivers/net/veth.c:977)
--
pw-bot: cr
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-15 13:45 ` [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
2025-04-16 12:44 ` Toshiaki Makita
2025-04-16 13:38 ` Jakub Kicinski
@ 2025-04-16 13:56 ` Toke Høiland-Jørgensen
2025-04-17 9:32 ` Jesper Dangaard Brouer
2 siblings, 1 reply; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-04-16 13:56 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, dsahern, makita.toshiaki, kernel-team, phil
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.
>
> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.
>
> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.
>
> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
>
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 7bb53961c0ea..a419d5e198d8 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -308,11 +308,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> {
> if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
> - dev_kfree_skb_any(skb);
> - return NET_RX_DROP;
> + return NETDEV_TX_BUSY; /* signal qdisc layer */
> }
>
> - return NET_RX_SUCCESS;
> + return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> }
>
> static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
> @@ -346,11 +345,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> struct veth_rq *rq = NULL;
> - int ret = NETDEV_TX_OK;
> + struct netdev_queue *txq;
> struct net_device *rcv;
> int length = skb->len;
> bool use_napi = false;
> - int rxq;
> + int ret, rxq;
>
> rcu_read_lock();
> rcv = rcu_dereference(priv->peer);
> @@ -373,17 +372,41 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
>
> skb_tx_timestamp(skb);
> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
> +
> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
> + switch(ret) {
> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
> if (!use_napi)
> dev_sw_netstats_tx_add(dev, 1, length);
> else
> __veth_xdp_flush(rq);
> - } else {
> + break;
> + case NETDEV_TX_BUSY:
> + /* If a qdisc is attached to our virtual device, returning
> + * NETDEV_TX_BUSY is allowed.
> + */
> + txq = netdev_get_tx_queue(dev, rxq);
> +
> + if (qdisc_txq_has_no_queue(txq)) {
> + dev_kfree_skb_any(skb);
> + goto drop;
> + }
> + netif_tx_stop_queue(txq);
> + /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
> + __skb_push(skb, ETH_HLEN);
> + if (use_napi)
> + __veth_xdp_flush(rq);
> +
> + break;
> + case NET_RX_DROP: /* same as NET_XMIT_DROP */
> drop:
> atomic64_inc(&priv->dropped);
> ret = NET_XMIT_DROP;
> + break;
> + default:
> + net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
> + dev->name, ret);
> }
> -
> rcu_read_unlock();
>
> return ret;
> @@ -874,9 +897,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> struct veth_xdp_tx_bq *bq,
> struct veth_stats *stats)
> {
> + struct veth_priv *priv = netdev_priv(rq->dev);
> + int queue_idx = rq->xdp_rxq.queue_index;
> + struct netdev_queue *peer_txq;
> + struct net_device *peer_dev;
> int i, done = 0, n_xdpf = 0;
> void *xdpf[VETH_XDP_BATCH];
>
> + peer_dev = rcu_dereference(priv->peer);
> + peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
> +
> for (i = 0; i < budget; i++) {
> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>
> @@ -925,6 +955,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> rq->stats.vs.xdp_packets += done;
> u64_stats_update_end(&rq->stats.syncp);
>
> + if (unlikely(netif_tx_queue_stopped(peer_txq)))
> + netif_tx_wake_queue(peer_txq);
> +
netif_tx_wake_queue() does a test_and_clear_bit() and does nothing if
the bit is not set; so does this optimisation really make any
difference? :)
-Toke
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-16 13:38 ` Jakub Kicinski
@ 2025-04-16 13:58 ` Toke Høiland-Jørgensen
2025-04-17 10:02 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-04-16 13:58 UTC (permalink / raw)
To: Jakub Kicinski, Jesper Dangaard Brouer
Cc: netdev, bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
dsahern, makita.toshiaki, kernel-team, phil
Jakub Kicinski <kuba@kernel.org> writes:
> On Tue, 15 Apr 2025 15:45:05 +0200 Jesper Dangaard Brouer wrote:
>> In production, we're seeing TX drops on veth devices when the ptr_ring
>> fills up. This can occur when NAPI mode is enabled, though it's
>> relatively rare. However, with threaded NAPI - which we use in
>> production - the drops become significantly more frequent.
>
> It splats:
>
> [ 5319.025772][ C1] dump_stack_lvl (lib/dump_stack.c:123)
> [ 5319.025786][ C1] lockdep_rcu_suspicious (kernel/locking/lockdep.c:6866)
> [ 5319.025797][ C1] veth_xdp_rcv (drivers/net/veth.c:907 (discriminator 9))
> [ 5319.025850][ C1] veth_poll (drivers/net/veth.c:977)
I believe the way to silence this one is to use:
rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
instead of just rcu_dereference()
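i.e. in veth_xdp_rcv(), something like (sketch):

  /* NAPI poll runs with BH disabled, which rcu_read_lock_bh_held()
   * covers for lockdep, so no extra locking is needed here.
   */
  peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
  peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);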
-Toke
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-16 13:56 ` Toke Høiland-Jørgensen
@ 2025-04-17 9:32 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-17 9:32 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, netdev, Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni, dsahern,
makita.toshiaki, kernel-team, phil
On 16/04/2025 15.56, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>
>> In production, we're seeing TX drops on veth devices when the ptr_ring
>> fills up. This can occur when NAPI mode is enabled, though it's
>> relatively rare. However, with threaded NAPI - which we use in
>> production - the drops become significantly more frequent.
>>
>> The underlying issue is that with threaded NAPI, the consumer often runs
>> on a different CPU than the producer. This increases the likelihood of
>> the ring filling up before the consumer gets scheduled, especially under
>> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>>
>> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
>> ring is full, signaling the qdisc layer to requeue the packet. The txq
>> (netdev queue) is stopped in this condition and restarted once
>> veth_poll() drains entries from the ring, ensuring coordination between
>> NAPI and qdisc.
>>
>> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
>> the driver retains its original behavior - dropping packets immediately
>> when the ring is full. This avoids unexpected behavior changes in setups
>> without a configured qdisc.
>>
>> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
>> (AQM) to fairly schedule packets across flows and reduce collateral
>> damage from elephant flows.
>>
>> A known limitation of this approach is that the full ring sits in front
>> of the qdisc layer, effectively forming a FIFO buffer that introduces
>> base latency. While AQM still improves fairness and mitigates flow
>> dominance, the latency impact is measurable.
>>
>> In hardware drivers, this issue is typically addressed using BQL (Byte
>> Queue Limits), which tracks in-flight bytes needed based on physical link
>> rate. However, for virtual drivers like veth, there is no fixed bandwidth
>> constraint - the bottleneck is CPU availability and the scheduler's ability
>> to run the NAPI thread. It is unclear how effective BQL would be in this
>> context.
>>
>> This patch serves as a first step toward addressing TX drops. Future work
>> may explore adapting a BQL-like mechanism to better suit virtual devices
>> like veth.
>>
>> Reported-by: Yan Zhai <yan@cloudflare.com>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>> drivers/net/veth.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 41 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 7bb53961c0ea..a419d5e198d8 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
[...]
>> @@ -874,9 +897,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> struct veth_xdp_tx_bq *bq,
>> struct veth_stats *stats)
>> {
>> + struct veth_priv *priv = netdev_priv(rq->dev);
>> + int queue_idx = rq->xdp_rxq.queue_index;
>> + struct netdev_queue *peer_txq;
>> + struct net_device *peer_dev;
>> int i, done = 0, n_xdpf = 0;
>> void *xdpf[VETH_XDP_BATCH];
>>
>> + peer_dev = rcu_dereference(priv->peer);
>> + peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
>> +
>> for (i = 0; i < budget; i++) {
>> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>>
>> @@ -925,6 +955,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> rq->stats.vs.xdp_packets += done;
>> u64_stats_update_end(&rq->stats.syncp);
>>
>> + if (unlikely(netif_tx_queue_stopped(peer_txq)))
>> + netif_tx_wake_queue(peer_txq);
>> +
>
> netif_tx_wake_queue() does a test_and_clear_bit() and does nothing if
> the bit is not set; so does this optimisation really make any
> difference? :)
Yes, it avoids a function call: netif_tx_queue_stopped() inlines the
test_bit() here, while netif_tx_wake_queue() is an exported
(out-of-line) symbol.

I'm being very careful not to slow down the common veth code path with
this change. Your suggestion is a paper-cut, so I'm not taking this
advice :-P
--Jesper
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-16 13:58 ` Toke Høiland-Jørgensen
@ 2025-04-17 10:02 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-17 10:02 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Jakub Kicinski
Cc: netdev, bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
dsahern, makita.toshiaki, kernel-team, phil
On 16/04/2025 15.58, Toke Høiland-Jørgensen wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
>
>> On Tue, 15 Apr 2025 15:45:05 +0200 Jesper Dangaard Brouer wrote:
>>> In production, we're seeing TX drops on veth devices when the ptr_ring
>>> fills up. This can occur when NAPI mode is enabled, though it's
>>> relatively rare. However, with threaded NAPI - which we use in
>>> production - the drops become significantly more frequent.
>>
>> It splats:
>>
>> [ 5319.025772][ C1] dump_stack_lvl (lib/dump_stack.c:123)
>> [ 5319.025786][ C1] lockdep_rcu_suspicious (kernel/locking/lockdep.c:6866)
>> [ 5319.025797][ C1] veth_xdp_rcv (drivers/net/veth.c:907 (discriminator 9))
>> [ 5319.025850][ C1] veth_poll (drivers/net/veth.c:977)
>
> I believe the way to silence this one is to use:
>
> rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>
> instead of just rcu_dereference()
>
Thanks for this suggestion.
Normally, this would indicate that I should add an rcu_read_lock()
section around the for-loop in veth_xdp_rcv(), but this isn't necessary
due to NAPI, right?

For background, is that because (1) NAPI is already running with the
RCU read lock held, or because (2) BH is considered an RCU section?
--Jesper
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-16 12:44 ` Toshiaki Makita
@ 2025-04-17 13:00 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-17 13:00 UTC (permalink / raw)
To: Toshiaki Makita
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil, netdev, Jakub Kicinski
On 16/04/2025 14.44, Toshiaki Makita wrote:
> On 2025/04/15 22:45, Jesper Dangaard Brouer wrote:
>> In production, we're seeing TX drops on veth devices when the ptr_ring
>> fills up. This can occur when NAPI mode is enabled, though it's
>> relatively rare. However, with threaded NAPI - which we use in
>> production - the drops become significantly more frequent.
>>
>> The underlying issue is that with threaded NAPI, the consumer often runs
>> on a different CPU than the producer. This increases the likelihood of
>> the ring filling up before the consumer gets scheduled, especially under
>> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>>
>> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
>> ring is full, signaling the qdisc layer to requeue the packet. The txq
>> (netdev queue) is stopped in this condition and restarted once
>> veth_poll() drains entries from the ring, ensuring coordination between
>> NAPI and qdisc.
>>
>> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
>> the driver retains its original behavior - dropping packets immediately
>> when the ring is full. This avoids unexpected behavior changes in setups
>> without a configured qdisc.
>>
>> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
>> (AQM) to fairly schedule packets across flows and reduce collateral
>> damage from elephant flows.
>>
>> A known limitation of this approach is that the full ring sits in front
>> of the qdisc layer, effectively forming a FIFO buffer that introduces
>> base latency. While AQM still improves fairness and mitigates flow
>> dominance, the latency impact is measurable.
>>
>> In hardware drivers, this issue is typically addressed using BQL (Byte
>> Queue Limits), which tracks in-flight bytes needed based on physical link
>> rate. However, for virtual drivers like veth, there is no fixed bandwidth
>> constraint - the bottleneck is CPU availability and the scheduler's
>> ability
>> to run the NAPI thread. It is unclear how effective BQL would be in this
>> context.
>>
>> This patch serves as a first step toward addressing TX drops. Future work
>> may explore adapting a BQL-like mechanism to better suit virtual devices
>> like veth.
>
> Thank you for the patch.
>
>> Reported-by: Yan Zhai <yan@cloudflare.com>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>> drivers/net/veth.c | 49
>> +++++++++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 41 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 7bb53961c0ea..a419d5e198d8 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -308,11 +308,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
>> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
>> {
>> if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
>> - dev_kfree_skb_any(skb);
>> - return NET_RX_DROP;
>> + return NETDEV_TX_BUSY; /* signal qdisc layer */
>> }
>
> You don't need this braces any more?
>
> if (...)
> return ...;
>
Correct, fixed for V5.
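i.e. presumably something along these lines in V5:

  if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
          return NETDEV_TX_BUSY; /* signal qdisc layer */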
>> - return NET_RX_SUCCESS;
>> + return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>> }
>> static int veth_forward_skb(struct net_device *dev, struct sk_buff
>> *skb,
>> @@ -346,11 +345,11 @@ static netdev_tx_t veth_xmit(struct sk_buff
>> *skb, struct net_device *dev)
>> {
>> struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>> struct veth_rq *rq = NULL;
>> - int ret = NETDEV_TX_OK;
>> + struct netdev_queue *txq;
>> struct net_device *rcv;
>> int length = skb->len;
>> bool use_napi = false;
>> - int rxq;
>> + int ret, rxq;
>> rcu_read_lock();
>> rcv = rcu_dereference(priv->peer);
>> @@ -373,17 +372,41 @@ static netdev_tx_t veth_xmit(struct sk_buff
>> *skb, struct net_device *dev)
>> }
>> skb_tx_timestamp(skb);
>> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) ==
>> NET_RX_SUCCESS)) {
>> +
>> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
>> + switch(ret) {
>> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
>> if (!use_napi)
>> dev_sw_netstats_tx_add(dev, 1, length);
>> else
>> __veth_xdp_flush(rq);
>> - } else {
>> + break;
>> + case NETDEV_TX_BUSY:
>> + /* If a qdisc is attached to our virtual device, returning
>> + * NETDEV_TX_BUSY is allowed.
>> + */
>> + txq = netdev_get_tx_queue(dev, rxq);
>> +
>> + if (qdisc_txq_has_no_queue(txq)) {
>> + dev_kfree_skb_any(skb);
>> + goto drop;
>> + }
>> + netif_tx_stop_queue(txq);
>> + /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
>> + __skb_push(skb, ETH_HLEN);
>> + if (use_napi)
>> + __veth_xdp_flush(rq);
>
> You did not add a packet to the ring.
> No need for flush here?
IMHO we do need a flush here.
This is related to the netif_tx_stop_queue(txq) call, which stops the
TXQ; the TXQ then needs to be started again by the NAPI side.

This is needed to handle a very unlikely race, but if the race happens
it can cause the TXQ to stay stopped (blocking all traffic).

Given we only arrive at NETDEV_TX_BUSY when the ptr_ring is full, it is
very likely that someone else has already called flush and the NAPI
veth_poll is running. Thus, the extra flush will likely be a no-op, as
rx_notify_masked is true.

The race is that, before we call netif_tx_stop_queue(txq), the other
CPU running NAPI veth_poll manages to complete NAPI and empty the
ptr_ring. In this case, the flush avoids the race, as it will actually
have an effect because rx_notify_masked will be false.

Looking closer at the code: there is still a possible race in
veth_poll(), after calling veth_xdp_rcv() and until
rq->rx_notify_masked is set to false (via smp_store_mb()). If
netif_tx_stop_queue(txq) is executed in this window, we still have the
race where the TXQ stays stopped forever. (There is an optional call to
xdp_do_flush() that can widen the race window.)
I'll add something in V5 that also handles the second race window.
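For reference, the conventional way hardware drivers close this kind of
window is the stop-then-recheck pattern on the producer side (a sketch
only, not necessarily what V5 will end up doing):

  netif_tx_stop_queue(txq);
  smp_mb(); /* order the stop against the consumer's check and wake */
  if (!ptr_ring_full(&rq->xdp_ring))
          netif_tx_wake_queue(txq);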
--Jesper
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread
Thread overview: 13+ messages
2025-04-15 13:44 [PATCH net-next V4 0/2] veth: qdisc backpressure and qdisc check refactor Jesper Dangaard Brouer
2025-04-15 13:44 ` [PATCH net-next V4 1/2] net: sched: generalize check for no-queue qdisc on TX queue Jesper Dangaard Brouer
2025-04-15 15:43 ` David Ahern
2025-04-15 16:30 ` Jesper Dangaard Brouer
2025-04-16 13:32 ` Toke Høiland-Jørgensen
2025-04-15 13:45 ` [PATCH net-next V4 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
2025-04-16 12:44 ` Toshiaki Makita
2025-04-17 13:00 ` Jesper Dangaard Brouer
2025-04-16 13:38 ` Jakub Kicinski
2025-04-16 13:58 ` Toke Høiland-Jørgensen
2025-04-17 10:02 ` Jesper Dangaard Brouer
2025-04-16 13:56 ` Toke Høiland-Jørgensen
2025-04-17 9:32 ` Jesper Dangaard Brouer