netdev.vger.kernel.org archive mirror
* [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
@ 2025-04-04 14:49 Jesper Dangaard Brouer
  2025-04-04 19:51 ` Jesper Dangaard Brouer
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-04 14:49 UTC (permalink / raw)
  To: netdev, Jakub Kicinski
  Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
	Paolo Abeni, Toke Høiland-Jørgensen, kernel-team

In production, we're seeing TX drops on veth devices when the ptr_ring
fills up. This can occur when NAPI mode is enabled, though it's
relatively rare. However, with threaded NAPI - which we use in
production - the drops become significantly more frequent.

The underlying issue is that with threaded NAPI, the consumer often runs
on a different CPU than the producer. This increases the likelihood of
the ring filling up before the consumer gets scheduled, especially under
load, leading to drops in veth_xmit() (ndo_start_xmit()).

This patch introduces backpressure by returning NETDEV_TX_BUSY when the
ring is full, signaling the qdisc layer to requeue the packet. The txq
(netdev queue) is stopped in this condition and restarted once
veth_poll() drains entries from the ring, ensuring coordination between
NAPI and qdisc.

Backpressure is only enabled when a qdisc is attached. Without a qdisc,
the driver retains its original behavior - dropping packets immediately
when the ring is full. This avoids unexpected behavior changes in setups
without a configured qdisc.

With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
(AQM) to fairly schedule packets across flows and reduce collateral
damage from elephant flows.

A known limitation of this approach is that the full ring sits in front
of the qdisc layer, effectively forming a FIFO buffer that introduces
base latency. While AQM still improves fairness and mitigates flow
dominance, the latency impact is measurable.

In hardware drivers, this issue is typically addressed using BQL (Byte
Queue Limits), which tracks in-flight bytes needed based on physical link
rate. However, for virtual drivers like veth, there is no fixed bandwidth
constraint - the bottleneck is CPU availability and the scheduler's ability
to run the NAPI thread. It is unclear how effective BQL would be in this
context.

This patch serves as a first step toward addressing TX drops. Future work
may explore adapting a BQL-like mechanism to better suit virtual devices
like veth.

Reported-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 drivers/net/veth.c |   58 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 50 insertions(+), 8 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7bb53961c0ea..fff2b615781e 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -308,11 +308,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
 static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
 	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
-		dev_kfree_skb_any(skb);
-		return NET_RX_DROP;
+		return NETDEV_TX_BUSY; /* signal qdisc layer */
 	}
 
-	return NET_RX_SUCCESS;
+	return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
 }
 
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
@@ -342,15 +341,26 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev,
 		 rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD));
 }
 
+/* Does specific txq have a real qdisc attached? - see noqueue_init() */
+static inline bool txq_has_qdisc(struct netdev_queue *txq)
+{
+	struct Qdisc *q;
+
+	q = rcu_dereference(txq->qdisc);
+	if (q->enqueue)
+		return true;
+	else
+		return false;
+}
+
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct veth_rq *rq = NULL;
-	int ret = NETDEV_TX_OK;
 	struct net_device *rcv;
 	int length = skb->len;
 	bool use_napi = false;
-	int rxq;
+	int ret, rxq;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -373,17 +383,39 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	skb_tx_timestamp(skb);
-	if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
+
+	ret = veth_forward_skb(rcv, skb, rq, use_napi);
+	switch(ret) {
+	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
 		if (!use_napi)
 			dev_sw_netstats_tx_add(dev, 1, length);
 		else
 			__veth_xdp_flush(rq);
-	} else {
+		break;
+	case NETDEV_TX_BUSY:
+		/* If a qdisc is attached to our virtual device, returning
+		 * NETDEV_TX_BUSY is allowed.
+		 */
+		struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);
+
+		if (!txq_has_qdisc(txq)) {
+			dev_kfree_skb_any(skb);
+			goto drop;
+		}
+		netif_tx_stop_queue(txq); /* Unconditional netif_txq_try_stop */
+		if (use_napi)
+			__veth_xdp_flush(rq);
+
+		break;
+	case NET_RX_DROP: /* same as NET_XMIT_DROP */
 drop:
 		atomic64_inc(&priv->dropped);
 		ret = NET_XMIT_DROP;
+		break;
+	default:
+		net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
+				     dev->name, ret);
 	}
-
 	rcu_read_unlock();
 
 	return ret;
@@ -874,9 +906,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
 			struct veth_stats *stats)
 {
+	struct veth_priv *priv = netdev_priv(rq->dev);
+	int queue_idx = rq->xdp_rxq.queue_index;
+	struct netdev_queue *peer_txq;
+	struct net_device *peer_dev;
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
 
+	peer_dev = priv->peer;
+	peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
+
 	for (i = 0; i < budget; i++) {
 		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
 
@@ -925,6 +964,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 	rq->stats.vs.xdp_packets += done;
 	u64_stats_update_end(&rq->stats.syncp);
 
+	if (unlikely(netif_tx_queue_stopped(peer_txq)))
+		netif_tx_wake_queue(peer_txq);
+
 	return done;
 }
 




* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2025-04-04 14:49 [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
@ 2025-04-04 19:51 ` Jesper Dangaard Brouer
  2025-04-07  9:15 ` Toke Høiland-Jørgensen
  2025-04-07 17:02 ` Simon Horman
  2 siblings, 0 replies; 8+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-04 19:51 UTC (permalink / raw)
  To: netdev, Jakub Kicinski
  Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
	Toke Høiland-Jørgensen, kernel-team



On 04/04/2025 16.49, Jesper Dangaard Brouer wrote:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
> 
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
> 
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.
> 
> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.
> 
> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
> 
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
> 
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.
> 
> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
> 
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
>   drivers/net/veth.c |   58 +++++++++++++++++++++++++++++++++++++++++++++-------
>   1 file changed, 50 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 7bb53961c0ea..fff2b615781e 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -308,11 +308,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
>   static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
>   {
>   	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
> -		dev_kfree_skb_any(skb);
> -		return NET_RX_DROP;
> +		return NETDEV_TX_BUSY; /* signal qdisc layer */
>   	}
>   
> -	return NET_RX_SUCCESS;
> +	return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>   }
>   
>   static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
> @@ -342,15 +341,26 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev,
>   		 rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD));
>   }
>   
> +/* Does specific txq have a real qdisc attached? - see noqueue_init() */
> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
> +{
> +	struct Qdisc *q;
> +
> +	q = rcu_dereference(txq->qdisc);
> +	if (q->enqueue)
> +		return true;
> +	else
> +		return false;
> +}
> +
>   static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>   {
>   	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>   	struct veth_rq *rq = NULL;
> -	int ret = NETDEV_TX_OK;
>   	struct net_device *rcv;
>   	int length = skb->len;
>   	bool use_napi = false;
> -	int rxq;
> +	int ret, rxq;
>   
>   	rcu_read_lock();
>   	rcv = rcu_dereference(priv->peer);
> @@ -373,17 +383,39 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>   	}
>   
>   	skb_tx_timestamp(skb);
> -	if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
> +
> +	ret = veth_forward_skb(rcv, skb, rq, use_napi);
> +	switch(ret) {
> +	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
>   		if (!use_napi)
>   			dev_sw_netstats_tx_add(dev, 1, length);
>   		else
>   			__veth_xdp_flush(rq);
> -	} else {
> +		break;
> +	case NETDEV_TX_BUSY:
> +		/* If a qdisc is attached to our virtual device, returning
> +		 * NETDEV_TX_BUSY is allowed.
> +		 */
> +		struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);
> +
> +		if (!txq_has_qdisc(txq)) {
> +			dev_kfree_skb_any(skb);
> +			goto drop;
> +		}
> +		netif_tx_stop_queue(txq); /* Unconditional netif_txq_try_stop */
> +		if (use_napi)
> +			__veth_xdp_flush(rq);
> +

Found a bug here... I need to skb_push the Ethernet header back, because
__dev_forward_skb() via eth_type_trans() pulled it off.

Code fix:
		__skb_push(skb, ETH_HLEN);
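
For context, one possible placement of that fix in the NETDEV_TX_BUSY case
of this RFC (sketch only, not a v2):

	case NETDEV_TX_BUSY:
		struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);

		if (!txq_has_qdisc(txq)) {
			dev_kfree_skb_any(skb);
			goto drop;
		}
		/* __dev_forward_skb() -> eth_type_trans() pulled the Ethernet
		 * header off; push it back before the qdisc requeues the skb.
		 */
		__skb_push(skb, ETH_HLEN);
		netif_tx_stop_queue(txq); /* Unconditional netif_txq_try_stop */
		if (use_napi)
			__veth_xdp_flush(rq);
		break;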


> +		break;
> +	case NET_RX_DROP: /* same as NET_XMIT_DROP */
>   drop:
>   		atomic64_inc(&priv->dropped);
>   		ret = NET_XMIT_DROP;
> +		break;
> +	default:
> +		net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
> +				     dev->name, ret);
>   	}
> -
>   	rcu_read_unlock();
>   
>   	return ret;
> @@ -874,9 +906,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>   			struct veth_xdp_tx_bq *bq,
>   			struct veth_stats *stats)
>   {
> +	struct veth_priv *priv = netdev_priv(rq->dev);
> +	int queue_idx = rq->xdp_rxq.queue_index;
> +	struct netdev_queue *peer_txq;
> +	struct net_device *peer_dev;
>   	int i, done = 0, n_xdpf = 0;
>   	void *xdpf[VETH_XDP_BATCH];
>   
> +	peer_dev = priv->peer;
> +	peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
> +
>   	for (i = 0; i < budget; i++) {
>   		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>   
> @@ -925,6 +964,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>   	rq->stats.vs.xdp_packets += done;
>   	u64_stats_update_end(&rq->stats.syncp);
>   
> +	if (unlikely(netif_tx_queue_stopped(peer_txq)))
> +		netif_tx_wake_queue(peer_txq);
> +
>   	return done;
>   }
>   
> 
> 


* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2025-04-04 14:49 [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
  2025-04-04 19:51 ` Jesper Dangaard Brouer
@ 2025-04-07  9:15 ` Toke Høiland-Jørgensen
  2025-04-07 12:42   ` Jesper Dangaard Brouer
  2025-04-07 23:02   ` David Ahern
  2025-04-07 17:02 ` Simon Horman
  2 siblings, 2 replies; 8+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-04-07  9:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Jakub Kicinski
  Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
	Paolo Abeni, kernel-team

Jesper Dangaard Brouer <hawk@kernel.org> writes:

> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.

Right, I definitely agree that this is the right solution; having no
backpressure and a fixed-size ringbuffer is obviously not ideal.

> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.

Not sure I like this bit, though; see below.

> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.

So the BQL algorithm tries to tune the maximum number of outstanding
bytes to be ~twice the maximum that can be completed in one batch. Since
we're not really limited by bytes in the same sense here (as you point
out), an approximate equivalent would be the NAPI budget, I guess? I.e.,
as a first approximation, we could have veth stop the queue once the
ringbuffer has 2x the NAPI budget packets in it?
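
A rough illustration of that first approximation (purely a sketch;
veth_ring_occupancy() is a hypothetical helper, since ptr_ring does not
expose an element count today):

	/* Stop the peer txq once ~2x the NAPI budget (NAPI_POLL_WEIGHT == 64)
	 * packets are outstanding, instead of letting the 256-entry ring
	 * fill up and returning NETDEV_TX_BUSY.
	 */
	static bool veth_should_stop_txq(const struct veth_rq *rq)
	{
		return veth_ring_occupancy(rq) >= 2 * NAPI_POLL_WEIGHT;
	}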

> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
>
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

[...]

> +/* Does specific txq have a real qdisc attached? - see noqueue_init() */
> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
> +{
> +	struct Qdisc *q;
> +
> +	q = rcu_dereference(txq->qdisc);
> +	if (q->enqueue)
> +		return true;
> +	else
> +		return false;
> +}

This seems like a pretty ugly layering violation, inspecting the qdisc
like this in the driver?

AFAICT, __dev_queue_xmit() turns a stopped queue into drops anyway, but
emits a warning (looks like this, around line 4640 in dev.c):

			net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
					     dev->name);

As this patch shows, it can clearly be appropriate for a virtual device
to stop the queue even if there's no qdisc, so how about we just get rid
of that warning? Then this logic won't be needed at all in the driver..

-Toke



* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2025-04-07  9:15 ` Toke Høiland-Jørgensen
@ 2025-04-07 12:42   ` Jesper Dangaard Brouer
  2025-04-07 23:02   ` David Ahern
  1 sibling, 0 replies; 8+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-07 12:42 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, netdev, Jakub Kicinski
  Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni, kernel-team,
	makita.toshiaki, Yan Zhai, Jesse Brandeburg, Michael S. Tsirkin



On 07/04/2025 11.15, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
> 
>> In production, we're seeing TX drops on veth devices when the ptr_ring
>> fills up. This can occur when NAPI mode is enabled, though it's
>> relatively rare. However, with threaded NAPI - which we use in
>> production - the drops become significantly more frequent.
>>
>> The underlying issue is that with threaded NAPI, the consumer often runs
>> on a different CPU than the producer. This increases the likelihood of
>> the ring filling up before the consumer gets scheduled, especially under
>> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>>
>> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
>> ring is full, signaling the qdisc layer to requeue the packet. The txq
>> (netdev queue) is stopped in this condition and restarted once
>> veth_poll() drains entries from the ring, ensuring coordination between
>> NAPI and qdisc.
> 
> Right, I definitely agree that this is the right solution; having no
> backpressure and a fixed-size ringbuffer is obviously not ideal.
> 
>> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
>> the driver retains its original behavior - dropping packets immediately
>> when the ring is full. This avoids unexpected behavior changes in setups
>> without a configured qdisc.
> 
> Not sure I like this bit, though; see below.
> 

I agree, but the reason for being chicken here is my prod deployment plan.
I'm about to roll this change out to production, and it is practical to be
able to roll it back simply by removing the qdisc.

In the future, we (netdev community) should consider changing veth to
automatically get the default qdisc attached, when enabling NAPI mode.
I will propose this when I have more data from production.

The ptr_ring overflow case is very easy to reproduce, even on testlab
servers without any competing traffic.
Together with Yan (Cc), we have a reproducer script available here:

 
https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_setup01_NAPI_TX_drops.sh


>> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
>> (AQM) to fairly schedule packets across flows and reduce collateral
>> damage from elephant flows.
>>
>> A known limitation of this approach is that the full ring sits in front
>> of the qdisc layer, effectively forming a FIFO buffer that introduces
>> base latency. While AQM still improves fairness and mitigates flow
>> dominance, the latency impact is measurable.
>>
>> In hardware drivers, this issue is typically addressed using BQL (Byte
>> Queue Limits), which tracks in-flight bytes needed based on physical link
>> rate. However, for virtual drivers like veth, there is no fixed bandwidth
>> constraint - the bottleneck is CPU availability and the scheduler's ability
>> to run the NAPI thread. It is unclear how effective BQL would be in this
>> context.
> 
> So the BQL algorithm tries to tune the maximum number of outstanding
> bytes to be ~twice the maximum that can be completed in one batch. Since
> we're not really limited by bytes in the same sense here (as you point
> out), an approximate equivalent would be the NAPI budget, I guess? I.e.,
> as a first approximation, we could have veth stop the queue once the
> ringbuffer has 2x the NAPI budget packets in it?
> 

I like the idea. (Note: The current ptr_ring size is 256.)

To implement this, we could simply reduce the ptr_ring size to 128 
(2*64), right?
IMHO it would be cooler to have a larger queue, but dynamically stop it
once the ring buffer holds 2x the NAPI budget worth of packets. But...

There are two subtle issues with the ptr_ring (Cc MST). (So, perhaps we
need to choose another ring buffer API for veth?)

(1) We don't have a counter for how many elements are in the queue.
Most hardware drivers try to stop the TXQ *before* their ring buffer
runs full, so that they avoid returning NETDEV_TX_BUSY.
(This is why it is hard to implement BQL for veth.)

(2) The ptr_ring consumer delays making slots available to the producer
again, to avoid bouncing cache-lines in the ring-full case.  Thus, we
cannot make the ptr_ring too small.  And I'm unsure how this interacts
with the idea of keeping 2x the NAPI budget of headroom available.
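
For illustration, a driver-side counter could approximate the missing
occupancy information (sketch only; 'ring_occupancy' would be a new
atomic_t in struct veth_rq, and note that sharing it between producer and
consumer reintroduces exactly the cache-line bouncing that (2) tries to
avoid):

	/* producer side, in veth_xdp_rx() after a successful produce: */
	atomic_inc(&rq->ring_occupancy);

	/* consumer side, at the end of veth_xdp_rcv(): */
	atomic_sub(done, &rq->ring_occupancy);

	/* veth_xmit() could then stop the txq before the ring is full: */
	if (atomic_read(&rq->ring_occupancy) >= 2 * NAPI_POLL_WEIGHT)
		netif_tx_stop_queue(txq);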


>> This patch serves as a first step toward addressing TX drops. Future work
>> may explore adapting a BQL-like mechanism to better suit virtual devices
>> like veth.
>>
>> Reported-by: Yan Zhai <yan@cloudflare.com>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> [...]
> 
>> +/* Does specific txq have a real qdisc attached? - see noqueue_init() */
>> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
>> +{
>> +	struct Qdisc *q;
>> +
>> +	q = rcu_dereference(txq->qdisc);
>> +	if (q->enqueue)
>> +		return true;
>> +	else
>> +		return false;
>> +}
> 
> This seems like a pretty ugly layering violation, inspecting the qdisc
> like this in the driver?
> 

I agree. (At minimum, it should be moved to include/net/sch_generic.h.)

I did consider using qdisc_tx_is_noop(), defined in
include/net/sch_generic.h, but IMHO it is too slow for data-(fast)path
code as it walks all the TXQs.
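
For illustration, a per-txq variant in include/net/sch_generic.h could look
roughly like this (name and placement are only suggestions, not an existing
helper):

	/* Like qdisc_tx_is_noop(), but checks a single txq, so it is cheap
	 * enough for the xmit fast path (caller holds rcu_read_lock).
	 */
	static inline bool qdisc_txq_has_enqueue(struct netdev_queue *txq)
	{
		struct Qdisc *q = rcu_dereference(txq->qdisc);

		return q && q->enqueue;
	}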

(more below)

> AFAICT, __dev_queue_xmit() turns a stopped queue into drops anyway, but
> emits a warning (looks like this, around line 4640 in dev.c):
> 
> 			net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> 					     dev->name);
> 
> As this patch shows, it can clearly be appropriate for a virtual device
> to stop the queue even if there's no qdisc, so how about we just get rid
> of that warning? Then this logic won't be needed at all in the driver..
> 

Yes, the real reason for the layering violation is to avoid this warning
in __dev_queue_xmit().

The NETDEV_TX_BUSY return is (correctly) converted into a drop when the
net_device TXQ doesn't have a qdisc attached.  Thus, we don't need the
layering violation in the driver if we can remove this warning.
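
If the warning can go away, the driver side could shrink to something like
this (sketch, assuming the stack keeps converting NETDEV_TX_BUSY into a drop
when no qdisc is attached):

	case NETDEV_TX_BUSY:
		__skb_push(skb, ETH_HLEN);	/* undo eth_type_trans() pull */
		netif_tx_stop_queue(netdev_get_tx_queue(dev, rxq));
		if (use_napi)
			__veth_xdp_flush(rq);
		break;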

Does anyone object to removing this net_crit_ratelimited?

--Jesper


* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2025-04-04 14:49 [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
  2025-04-04 19:51 ` Jesper Dangaard Brouer
  2025-04-07  9:15 ` Toke Høiland-Jørgensen
@ 2025-04-07 17:02 ` Simon Horman
  2 siblings, 0 replies; 8+ messages in thread
From: Simon Horman @ 2025-04-07 17:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Jakub Kicinski, bpf, tom, Eric Dumazet, David S. Miller,
	Paolo Abeni, Toke Høiland-Jørgensen, kernel-team

On Fri, Apr 04, 2025 at 04:49:01PM +0200, Jesper Dangaard Brouer wrote:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
> 
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
> 
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.
> 
> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.
> 
> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
> 
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
> 
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.
> 
> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
> 
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
>  drivers/net/veth.c |   58 +++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 50 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c

...

> @@ -373,17 +383,39 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>  	}
>  
>  	skb_tx_timestamp(skb);
> -	if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
> +
> +	ret = veth_forward_skb(rcv, skb, rq, use_napi);
> +	switch(ret) {
> +	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
>  		if (!use_napi)
>  			dev_sw_netstats_tx_add(dev, 1, length);
>  		else
>  			__veth_xdp_flush(rq);
> -	} else {
> +		break;
> +	case NETDEV_TX_BUSY:
> +		/* If a qdisc is attached to our virtual device, returning
> +		 * NETDEV_TX_BUSY is allowed.
> +		 */
> +		struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);

Hi Jesper,

FYI, clang 20.1.1 W=1 build says:

drivers/net/veth.c:399:3: warning: label followed by a declaration is a C23 extension [-Wc23-extensions]
  399 |                 struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);
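
One way to avoid that would be to give the case its own scope, so the
declaration no longer directly follows the label (illustrative sketch
against the hunk above):

	case NETDEV_TX_BUSY: {
		struct netdev_queue *txq = netdev_get_tx_queue(dev, rxq);

		/* ... rest of the case body unchanged ... */
		break;
	}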

> +
> +		if (!txq_has_qdisc(txq)) {
> +			dev_kfree_skb_any(skb);
> +			goto drop;
> +		}
> +		netif_tx_stop_queue(txq); /* Unconditional netif_txq_try_stop */
> +		if (use_napi)
> +			__veth_xdp_flush(rq);
> +
> +		break;
> +	case NET_RX_DROP: /* same as NET_XMIT_DROP */
>  drop:
>  		atomic64_inc(&priv->dropped);
>  		ret = NET_XMIT_DROP;
> +		break;
> +	default:
> +		net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
> +				     dev->name, ret);
>  	}
> -
>  	rcu_read_unlock();
>  
>  	return ret;

...


* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2025-04-07  9:15 ` Toke Høiland-Jørgensen
  2025-04-07 12:42   ` Jesper Dangaard Brouer
@ 2025-04-07 23:02   ` David Ahern
  2025-04-08 11:23     ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 8+ messages in thread
From: David Ahern @ 2025-04-07 23:02 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, netdev,
	Jakub Kicinski
  Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni, kernel-team

On 4/7/25 3:15 AM, Toke Høiland-Jørgensen wrote:
>> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
>> +{
>> +	struct Qdisc *q;
>> +
>> +	q = rcu_dereference(txq->qdisc);
>> +	if (q->enqueue)
>> +		return true;
>> +	else
>> +		return false;
>> +}
> 
> This seems like a pretty ugly layering violation, inspecting the qdisc
> like this in the driver?

vrf driver has something very similar - been there since March 2017.


* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2025-04-07 23:02   ` David Ahern
@ 2025-04-08 11:23     ` Toke Høiland-Jørgensen
  2025-04-08 15:20       ` David Ahern
  0 siblings, 1 reply; 8+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-04-08 11:23 UTC (permalink / raw)
  To: David Ahern, Jesper Dangaard Brouer, netdev, Jakub Kicinski
  Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni, kernel-team

David Ahern <dsahern@kernel.org> writes:

> On 4/7/25 3:15 AM, Toke Høiland-Jørgensen wrote:
>>> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
>>> +{
>>> +	struct Qdisc *q;
>>> +
>>> +	q = rcu_dereference(txq->qdisc);
>>> +	if (q->enqueue)
>>> +		return true;
>>> +	else
>>> +		return false;
>>> +}
>> 
>> This seems like a pretty ugly layering violation, inspecting the qdisc
>> like this in the driver?
>
> vrf driver has something very similar - been there since March 2017.

Doesn't make it any less ugly, though ;)

And AFAICT, vrf is doing more with the information; basically picking a
whole different TX path? Can you elaborate on the reasoning for this (do
people actually install qdiscs on VRF devices in practice)?

-Toke



* Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2025-04-08 11:23     ` Toke Høiland-Jørgensen
@ 2025-04-08 15:20       ` David Ahern
  0 siblings, 0 replies; 8+ messages in thread
From: David Ahern @ 2025-04-08 15:20 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, netdev,
	Jakub Kicinski
  Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni, kernel-team

On 4/8/25 5:23 AM, Toke Høiland-Jørgensen wrote:
> David Ahern <dsahern@kernel.org> writes:
> 
>> On 4/7/25 3:15 AM, Toke Høiland-Jørgensen wrote:
>>>> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
>>>> +{
>>>> +	struct Qdisc *q;
>>>> +
>>>> +	q = rcu_dereference(txq->qdisc);
>>>> +	if (q->enqueue)
>>>> +		return true;
>>>> +	else
>>>> +		return false;
>>>> +}
>>>
>>> This seems like a pretty ugly layering violation, inspecting the qdisc
>>> like this in the driver?
>>
>> vrf driver has something very similar - been there since March 2017.
> 
> Doesn't make it any less ugly, though ;)

in my eyes, it is a thing of beauty.

> 
> And AFAICT, vrf is doing more with the information; basically picking a
> whole different TX path? Can you elaborate on the reasoning for this (do
> people actually install qdiscs on VRF devices in practice)?

It is not common AFAIK, but it is possible. I wanted to avoid the
overhead for what I think is a rare configuration.

