* [PATCH net-next V7 0/2] veth: qdisc backpressure and qdisc check refactor
@ 2025-04-25 14:55 Jesper Dangaard Brouer
2025-04-25 14:55 ` [PATCH net-next V7 1/2] net: sched: generalize check for no-queue qdisc on TX queue Jesper Dangaard Brouer
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-25 14:55 UTC (permalink / raw)
To: netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, Toke Høiland-Jørgensen, dsahern,
makita.toshiaki, kernel-team, phil
This patch series addresses TX drops seen on veth devices under load,
particularly when using threaded NAPI, which is our setup in production.
The root cause is that the NAPI consumer often runs on a different CPU
than the producer. Combined with scheduling delays or simply slower
consumption, this increases the chance that the ptr_ring fills up before
packets are drained, resulting in drops from veth_xmit() (ndo_start_xmit()).
To make this easier to reproduce, we’ve created a script that sets up a
test scenario using network namespaces. The script inserts 1000 iptables
rules in the consumer namespace to slow down packet processing and
amplify the issue. Reproducer script:
https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_setup01_NAPI_TX_drops.sh
This series first introduces a helper to detect no-queue qdiscs and then
uses it in the veth driver to conditionally apply qdisc-level
backpressure when a real qdisc is attached. The behavior is off by
default and opt-in: it only activates once a real qdisc is attached
(e.g. "tc qdisc add dev veth0 root sfq"), ensuring minimal impact on
existing setups.
---
V7:
- Adjust race handling with smp_mb__after_atomic() for other archs than x86
- Link to V6: https://lore.kernel.org/all/174549933665.608169.392044991754158047.stgit@firesoul/
V6:
- Remove __veth_xdp_flush() and handle race via __ptr_ring_empty instead
- Link to V5: https://lore.kernel.org/all/174489803410.355490.13216831426556849084.stgit@firesoul/
V5:
- use rcu_dereference_check to signal that NAPI is an RCU section
- whitespace fixes reported by checkpatch.pl
- handle unlikely race
- Link to V4 https://lore.kernel.org/all/174472463778.274639.12670590457453196991.stgit@firesoul/
V4:
- Check against no-queue instead of no-op qdisc
- Link to V3: https://lore.kernel.org/all/174464549885.20396.6987653753122223942.stgit@firesoul/
V3:
- Reorder patches, generalize check for no-op qdisc as first patch
- RFC: as testing showed, this approach was incorrect
- Use rcu_dereference(priv->peer) in veth_xdp_rcv(), as this runs in NAPI
  context where rcu_read_lock() is implicit.
- Link to V2: https://lore.kernel.org/all/174412623473.3702169.4235683143719614624.stgit@firesoul/
V2:
- Generalize check for no-op qdisc
- Link to RFC-V1: https://lore.kernel.org/all/174377814192.3376479.16481605648460889310.stgit@firesoul/
---
Jesper Dangaard Brouer (2):
net: sched: generalize check for no-queue qdisc on TX queue
veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
drivers/net/veth.c | 57 ++++++++++++++++++++++++++++++++-------
drivers/net/vrf.c | 4 +--
include/net/sch_generic.h | 8 ++++++
3 files changed, 56 insertions(+), 13 deletions(-)
--
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH net-next V7 1/2] net: sched: generalize check for no-queue qdisc on TX queue
2025-04-25 14:55 [PATCH net-next V7 0/2] veth: qdisc backpressure and qdisc check refactor Jesper Dangaard Brouer
@ 2025-04-25 14:55 ` Jesper Dangaard Brouer
2025-04-25 14:55 ` [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
2025-04-28 22:20 ` [PATCH net-next V7 0/2] veth: qdisc backpressure and qdisc check refactor patchwork-bot+netdevbpf
2 siblings, 0 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-25 14:55 UTC (permalink / raw)
To: netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, Toke Høiland-Jørgensen, dsahern,
makita.toshiaki, kernel-team, phil
The "noqueue" qdisc can either be directly attached, or get default
attached if net_device priv_flags has IFF_NO_QUEUE. In both cases, the
allocated Qdisc structure gets it's enqueue function pointer reset to
NULL by noqueue_init() via noqueue_qdisc_ops.
This is a common case for software virtual net_devices. For these
no-queue devices, the transmission path in __dev_queue_xmit() bypasses
the qdisc layer, directly invoking the device driver's ndo_start_xmit()
(via dev_hard_start_xmit()). In this mode the device driver is not
allowed to ask for packets to be requeued (either by returning
NETDEV_TX_BUSY or by stopping the TXQ).
The simplest and most reliable way to identify this no-queue case is by
checking if enqueue == NULL.
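For reference, the dispatch decision in __dev_queue_xmit() boils down to
the following simplified sketch (illustrative only, with a made-up
function name; not the verbatim upstream code):

  /* Simplified sketch of the __dev_queue_xmit() dispatch decision */
  static int xmit_dispatch_sketch(struct sk_buff *skb,
                                  struct net_device *dev,
                                  struct netdev_queue *txq)
  {
          struct Qdisc *q = rcu_dereference_bh(txq->qdisc);
          int rc;

          if (q->enqueue) {
                  /* Real qdisc attached: go through the qdisc layer,
                   * which can queue the skb and requeue it when the
                   * driver returns NETDEV_TX_BUSY.
                   */
                  rc = __dev_xmit_skb(skb, q, dev, txq);
          } else {
                  /* "noqueue": bypass the qdisc layer and invoke the
                   * driver's ndo_start_xmit() directly. No requeueing
                   * is possible on this path.
                   */
                  skb = dev_hard_start_xmit(skb, dev, txq, &rc);
          }
          return rc;
  }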
The vrf driver currently open-codes this check (!qdisc->enqueue). While
functionally correct, this low-level detail is better encapsulated in a
dedicated helper for clarity and long-term maintainability.
To make this behavior more explicit and reusable, this patch introduces a
new helper: qdisc_txq_has_no_queue(). The helper will also be used by the
veth driver in the next patch, which introduces optional qdisc-based
backpressure.
This is a non-functional change.
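A caller-side sketch of the intended usage ("qidx" is a placeholder
queue index; the veth change in patch 2/2 follows this shape):

  struct netdev_queue *txq = netdev_get_tx_queue(dev, qidx);

  if (qdisc_txq_has_no_queue(txq)) {
          /* No real qdisc attached: the driver must not stop the
           * TXQ or return NETDEV_TX_BUSY, so fall back to dropping.
           */
          dev_kfree_skb_any(skb);
  }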
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/vrf.c | 4 +---
include/net/sch_generic.h | 8 ++++++++
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 7168b33adadb..9a4beea6ee0c 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -343,15 +343,13 @@ static int vrf_ifindex_lookup_by_table_id(struct net *net, u32 table_id)
static bool qdisc_tx_is_default(const struct net_device *dev)
{
struct netdev_queue *txq;
- struct Qdisc *qdisc;
if (dev->num_tx_queues > 1)
return false;
txq = netdev_get_tx_queue(dev, 0);
- qdisc = rcu_access_pointer(txq->qdisc);
- return !qdisc->enqueue;
+ return qdisc_txq_has_no_queue(txq);
}
/* Local traffic destined to local address. Reinsert the packet to rx
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d48c657191cd..b6c177f7141c 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -803,6 +803,14 @@ static inline bool qdisc_tx_changing(const struct net_device *dev)
return false;
}
+/* "noqueue" qdisc identified by not having any enqueue, see noqueue_init() */
+static inline bool qdisc_txq_has_no_queue(const struct netdev_queue *txq)
+{
+ struct Qdisc *qdisc = rcu_access_pointer(txq->qdisc);
+
+ return qdisc->enqueue == NULL;
+}
+
/* Is the device using the noop qdisc on all queues? */
static inline bool qdisc_tx_is_noop(const struct net_device *dev)
{
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-25 14:55 [PATCH net-next V7 0/2] veth: qdisc backpressure and qdisc check refactor Jesper Dangaard Brouer
2025-04-25 14:55 ` [PATCH net-next V7 1/2] net: sched: generalize check for no-queue qdisc on TX queue Jesper Dangaard Brouer
@ 2025-04-25 14:55 ` Jesper Dangaard Brouer
2025-04-26 5:40 ` Toshiaki Makita
2025-06-09 22:09 ` Ihor Solodrai
2025-04-28 22:20 ` [PATCH net-next V7 0/2] veth: qdisc backpressure and qdisc check refactor patchwork-bot+netdevbpf
2 siblings, 2 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2025-04-25 14:55 UTC (permalink / raw)
To: netdev, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf, tom, Eric Dumazet, David S. Miller,
Paolo Abeni, Toke Høiland-Jørgensen, dsahern,
makita.toshiaki, kernel-team, phil
In production, we're seeing TX drops on veth devices when the ptr_ring
fills up. This can occur when NAPI mode is enabled, though it's
relatively rare. However, with threaded NAPI - which we use in
production - the drops become significantly more frequent.
The underlying issue is that with threaded NAPI, the consumer often runs
on a different CPU than the producer. This increases the likelihood of
the ring filling up before the consumer gets scheduled, especially under
load, leading to drops in veth_xmit() (ndo_start_xmit()).
This patch introduces backpressure by returning NETDEV_TX_BUSY when the
ring is full, signaling the qdisc layer to requeue the packet. The txq
(netdev queue) is stopped in this condition and restarted once
veth_poll() drains entries from the ring, ensuring coordination between
NAPI and qdisc.
Backpressure is only enabled when a qdisc is attached. Without a qdisc,
the driver retains its original behavior - dropping packets immediately
when the ring is full. This avoids unexpected behavior changes in setups
without a configured qdisc.
With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
(AQM) to fairly schedule packets across flows and reduce collateral
damage from elephant flows.
A known limitation of this approach is that the full ring sits in front
of the qdisc layer, effectively forming a FIFO buffer that introduces
base latency. While AQM still improves fairness and mitigates flow
dominance, the latency impact is measurable.
In hardware drivers, this issue is typically addressed using BQL (Byte
Queue Limits), which tracks in-flight bytes needed based on physical link
rate. However, for virtual drivers like veth, there is no fixed bandwidth
constraint - the bottleneck is CPU availability and the scheduler's ability
to run the NAPI thread. It is unclear how effective BQL would be in this
context.
This patch serves as a first step toward addressing TX drops. Future work
may explore adapting a BQL-like mechanism to better suit virtual devices
like veth.
Reported-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/veth.c | 57 +++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 47 insertions(+), 10 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7bb53961c0ea..e58a0f1b5c5b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -307,12 +307,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
{
- if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
- dev_kfree_skb_any(skb);
- return NET_RX_DROP;
- }
+ if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
+ return NETDEV_TX_BUSY; /* signal qdisc layer */
- return NET_RX_SUCCESS;
+ return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
}
static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
@@ -346,11 +344,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct veth_rq *rq = NULL;
- int ret = NETDEV_TX_OK;
+ struct netdev_queue *txq;
struct net_device *rcv;
int length = skb->len;
bool use_napi = false;
- int rxq;
+ int ret, rxq;
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -373,17 +371,45 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
skb_tx_timestamp(skb);
- if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
+
+ ret = veth_forward_skb(rcv, skb, rq, use_napi);
+ switch (ret) {
+ case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
if (!use_napi)
dev_sw_netstats_tx_add(dev, 1, length);
else
__veth_xdp_flush(rq);
- } else {
+ break;
+ case NETDEV_TX_BUSY:
+ /* If a qdisc is attached to our virtual device, returning
+ * NETDEV_TX_BUSY is allowed.
+ */
+ txq = netdev_get_tx_queue(dev, rxq);
+
+ if (qdisc_txq_has_no_queue(txq)) {
+ dev_kfree_skb_any(skb);
+ goto drop;
+ }
+ /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
+ __skb_push(skb, ETH_HLEN);
+ /* Depend on prior success packets started NAPI consumer via
+ * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
+ * paired with empty check in veth_poll().
+ */
+ netif_tx_stop_queue(txq);
+ smp_mb__after_atomic();
+ if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
+ netif_tx_wake_queue(txq);
+ break;
+ case NET_RX_DROP: /* same as NET_XMIT_DROP */
drop:
atomic64_inc(&priv->dropped);
ret = NET_XMIT_DROP;
+ break;
+ default:
+ net_crit_ratelimited("%s(%s): Invalid return code(%d)",
+ __func__, dev->name, ret);
}
-
rcu_read_unlock();
return ret;
@@ -874,9 +900,17 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
struct veth_stats *stats)
{
+ struct veth_priv *priv = netdev_priv(rq->dev);
+ int queue_idx = rq->xdp_rxq.queue_index;
+ struct netdev_queue *peer_txq;
+ struct net_device *peer_dev;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
+ /* NAPI functions as RCU section */
+ peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
+ peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
+
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -925,6 +959,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
+ if (unlikely(netif_tx_queue_stopped(peer_txq)))
+ netif_tx_wake_queue(peer_txq);
+
return done;
}
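The producer/consumer coordination in this patch can be condensed into
the following sketch (a summary for review purposes, not additional code
from the patch):

  /* Producer (veth_xmit), on a full ptr_ring with a real qdisc: */
  netif_tx_stop_queue(txq);             /* sets __QUEUE_STATE_DRV_XOFF */
  smp_mb__after_atomic();               /* order stop vs. ring re-check */
  if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
          netif_tx_wake_queue(txq);     /* consumer drained meanwhile */

  /* Consumer (veth_poll -> veth_xdp_rcv), after draining entries: */
  if (unlikely(netif_tx_queue_stopped(peer_txq)))
          netif_tx_wake_queue(peer_txq);

Without the barrier and the empty re-check, the producer could stop the
TXQ just after the consumer's final wake-up, leaving the queue stopped
with no one left to wake it.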
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-25 14:55 ` [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
@ 2025-04-26 5:40 ` Toshiaki Makita
2025-06-09 22:09 ` Ihor Solodrai
1 sibling, 0 replies; 15+ messages in thread
From: Toshiaki Makita @ 2025-04-26 5:40 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil, netdev, Jakub Kicinski
On 2025/04/25 23:55, Jesper Dangaard Brouer wrote:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.
>
> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.
>
> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.
>
> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
>
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next V7 0/2] veth: qdisc backpressure and qdisc check refactor
2025-04-25 14:55 [PATCH net-next V7 0/2] veth: qdisc backpressure and qdisc check refactor Jesper Dangaard Brouer
2025-04-25 14:55 ` [PATCH net-next V7 1/2] net: sched: generalize check for no-queue qdisc on TX queue Jesper Dangaard Brouer
2025-04-25 14:55 ` [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
@ 2025-04-28 22:20 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 15+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-04-28 22:20 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, kuba, bpf, tom, eric.dumazet, davem, pabeni, toke,
dsahern, makita.toshiaki, kernel-team, phil
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 25 Apr 2025 16:55:25 +0200 you wrote:
> This patch series addresses TX drops seen on veth devices under load,
> particularly when using threaded NAPI, which is our setup in production.
>
> The root cause is that the NAPI consumer often runs on a different CPU
> than the producer. Combined with scheduling delays or simply slower
> consumption, this increases the chance that the ptr_ring fills up before
> packets are drained, resulting in drops from veth_xmit() (ndo_start_xmit()).
>
> [...]
Here is the summary with links:
- [net-next,V7,1/2] net: sched: generalize check for no-queue qdisc on TX queue
https://git.kernel.org/netdev/net-next/c/34dd0fecaa02
- [net-next,V7,2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
https://git.kernel.org/netdev/net-next/c/dc82a33297fc
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-04-25 14:55 ` [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
2025-04-26 5:40 ` Toshiaki Makita
@ 2025-06-09 22:09 ` Ihor Solodrai
2025-06-10 11:43 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 15+ messages in thread
From: Ihor Solodrai @ 2025-06-09 22:09 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil
On 4/25/25 7:55 AM, Jesper Dangaard Brouer wrote:
> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.
>
> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.
>
> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.
>
> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
>
> Reported-by: Yan Zhai <yan@cloudflare.com>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> drivers/net/veth.c | 57 +++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 47 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 7bb53961c0ea..e58a0f1b5c5b 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -307,12 +307,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
>
> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> {
> - if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
> - dev_kfree_skb_any(skb);
> - return NET_RX_DROP;
> - }
> + if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
> + return NETDEV_TX_BUSY; /* signal qdisc layer */
>
> - return NET_RX_SUCCESS;
> + return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> }
>
> static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
> @@ -346,11 +344,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> struct veth_rq *rq = NULL;
> - int ret = NETDEV_TX_OK;
> + struct netdev_queue *txq;
> struct net_device *rcv;
> int length = skb->len;
> bool use_napi = false;
> - int rxq;
> + int ret, rxq;
>
> rcu_read_lock();
> rcv = rcu_dereference(priv->peer);
> @@ -373,17 +371,45 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
>
> skb_tx_timestamp(skb);
> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
> +
> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
> + switch (ret) {
> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
> if (!use_napi)
> dev_sw_netstats_tx_add(dev, 1, length);
> else
> __veth_xdp_flush(rq);
> - } else {
> + break;
> + case NETDEV_TX_BUSY:
> + /* If a qdisc is attached to our virtual device, returning
> + * NETDEV_TX_BUSY is allowed.
> + */
> + txq = netdev_get_tx_queue(dev, rxq);
> +
> + if (qdisc_txq_has_no_queue(txq)) {
> + dev_kfree_skb_any(skb);
> + goto drop;
> + }
> + /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
> + __skb_push(skb, ETH_HLEN);
> + /* Depend on prior success packets started NAPI consumer via
> + * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
> + * paired with empty check in veth_poll().
> + */
> + netif_tx_stop_queue(txq);
> + smp_mb__after_atomic();
> + if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
> + netif_tx_wake_queue(txq);
> + break;
> + case NET_RX_DROP: /* same as NET_XMIT_DROP */
> drop:
> atomic64_inc(&priv->dropped);
> ret = NET_XMIT_DROP;
> + break;
> + default:
> + net_crit_ratelimited("%s(%s): Invalid return code(%d)",
> + __func__, dev->name, ret);
> }
> -
> rcu_read_unlock();
>
> return ret;
> @@ -874,9 +900,17 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> struct veth_xdp_tx_bq *bq,
> struct veth_stats *stats)
> {
> + struct veth_priv *priv = netdev_priv(rq->dev);
> + int queue_idx = rq->xdp_rxq.queue_index;
> + struct netdev_queue *peer_txq;
> + struct net_device *peer_dev;
> int i, done = 0, n_xdpf = 0;
> void *xdpf[VETH_XDP_BATCH];
>
> + /* NAPI functions as RCU section */
> + peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
> + peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
> +
> for (i = 0; i < budget; i++) {
> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>
Hi Jesper.
Could you please take a look at the reported call traces and help
understand whether this patch may have introduced a null dereference?
Pasting a snippet, for full logs (2 examples) see the link:
https://lore.kernel.org/bpf/6fd7a5b5-ee26-4cc5-8eb0-449c4e326ccc@linux.dev/
[ 343.217465] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 343.218173] #PF: supervisor read access in kernel mode
[ 343.218644] #PF: error_code(0x0000) - not-present page
[ 343.219128] PGD 0 P4D 0
[ 343.219379] Oops: Oops: 0000 [#1] SMP NOPTI
[ 343.219768] CPU: 1 UID: 0 PID: 7635 Comm: kworker/1:11 Tainted: G W OE 6.15.0-g2b36f2252b0a-dirty #7 PREEMPT(full)
[ 343.220844] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 343.221436] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 343.222356] Workqueue: mld mld_dad_work
[ 343.222730] RIP: 0010:veth_xdp_rcv.constprop.0+0x6b/0x380
[...]
[ 343.231061] Call Trace:
[ 343.231306] <IRQ>
[ 343.231522] veth_poll+0x7b/0x3a0
[ 343.231856] __napi_poll.constprop.0+0x28/0x1d0
[ 343.232297] net_rx_action+0x199/0x350
[ 343.232682] handle_softirqs+0xd3/0x400
[ 343.233057] ? __dev_queue_xmit+0x27b/0x1250
[ 343.233473] do_softirq+0x43/0x90
[ 343.233804] </IRQ>
[ 343.234016] <TASK>
[ 343.234226] __local_bh_enable_ip+0xb5/0xd0
[ 343.234622] ? __dev_queue_xmit+0x27b/0x1250
[ 343.235035] __dev_queue_xmit+0x290/0x1250
[ 343.235431] ? lock_acquire+0xbe/0x2c0
[ 343.235797] ? ip6_finish_output+0x25e/0x540
[ 343.236210] ? mark_held_locks+0x40/0x70
[ 343.236583] ip6_finish_output2+0x38f/0xb80
[ 343.237002] ? lock_release+0xc6/0x290
[ 343.237364] ip6_finish_output+0x25e/0x540
[ 343.237761] mld_sendpack+0x1c1/0x3a0
[ 343.238123] mld_dad_work+0x3e/0x150
[ 343.238473] process_one_work+0x1f8/0x580
[ 343.238859] worker_thread+0x1ce/0x3c0
[ 343.239224] ? __pfx_worker_thread+0x10/0x10
[ 343.239638] kthread+0x128/0x250
[ 343.239954] ? __pfx_kthread+0x10/0x10
[ 343.240320] ? __pfx_kthread+0x10/0x10
[ 343.240691] ret_from_fork+0x15c/0x1b0
[ 343.241056] ? __pfx_kthread+0x10/0x10
[ 343.241418] ret_from_fork_asm+0x1a/0x30
[ 343.241800] </TASK>
[ 343.242021] Modules linked in: bpf_testmod(OE) [last unloaded: est_no_cfi(OE)]
[ 343.242737] CR2: 0000000000000018
[ 343.243064] ---[ end trace 0000000000000000 ]---
Thank you.
> @@ -925,6 +959,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> rq->stats.vs.xdp_packets += done;
> u64_stats_update_end(&rq->stats.syncp);
>
> + if (unlikely(netif_tx_queue_stopped(peer_txq)))
> + netif_tx_wake_queue(peer_txq);
> +
> return done;
> }
>
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-06-09 22:09 ` Ihor Solodrai
@ 2025-06-10 11:43 ` Jesper Dangaard Brouer
2025-06-10 15:56 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-10 11:43 UTC (permalink / raw)
To: Ihor Solodrai, netdev, Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil
On 10/06/2025 00.09, Ihor Solodrai wrote:
> On 4/25/25 7:55 AM, Jesper Dangaard Brouer wrote:
>> In production, we're seeing TX drops on veth devices when the ptr_ring
>> fills up. This can occur when NAPI mode is enabled, though it's
>> relatively rare. However, with threaded NAPI - which we use in
>> production - the drops become significantly more frequent.
>>
>> The underlying issue is that with threaded NAPI, the consumer often runs
>> on a different CPU than the producer. This increases the likelihood of
>> the ring filling up before the consumer gets scheduled, especially under
>> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>>
>> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
>> ring is full, signaling the qdisc layer to requeue the packet. The txq
>> (netdev queue) is stopped in this condition and restarted once
>> veth_poll() drains entries from the ring, ensuring coordination between
>> NAPI and qdisc.
>>
>> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
>> the driver retains its original behavior - dropping packets immediately
>> when the ring is full. This avoids unexpected behavior changes in setups
>> without a configured qdisc.
>>
>> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
>> (AQM) to fairly schedule packets across flows and reduce collateral
>> damage from elephant flows.
>>
>> A known limitation of this approach is that the full ring sits in front
>> of the qdisc layer, effectively forming a FIFO buffer that introduces
>> base latency. While AQM still improves fairness and mitigates flow
>> dominance, the latency impact is measurable.
>>
>> In hardware drivers, this issue is typically addressed using BQL (Byte
>> Queue Limits), which tracks in-flight bytes needed based on physical link
>> rate. However, for virtual drivers like veth, there is no fixed bandwidth
>> constraint - the bottleneck is CPU availability and the scheduler's
>> ability
>> to run the NAPI thread. It is unclear how effective BQL would be in this
>> context.
>>
>> This patch serves as a first step toward addressing TX drops. Future work
>> may explore adapting a BQL-like mechanism to better suit virtual devices
>> like veth.
>>
>> Reported-by: Yan Zhai <yan@cloudflare.com>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>> drivers/net/veth.c | 57
>> +++++++++++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 47 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 7bb53961c0ea..e58a0f1b5c5b 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -307,12 +307,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
>> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
>> {
>> - if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
>> - dev_kfree_skb_any(skb);
>> - return NET_RX_DROP;
>> - }
>> + if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
>> + return NETDEV_TX_BUSY; /* signal qdisc layer */
>> - return NET_RX_SUCCESS;
>> + return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>> }
>> static int veth_forward_skb(struct net_device *dev, struct sk_buff
>> *skb,
>> @@ -346,11 +344,11 @@ static netdev_tx_t veth_xmit(struct sk_buff
>> *skb, struct net_device *dev)
>> {
>> struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>> struct veth_rq *rq = NULL;
>> - int ret = NETDEV_TX_OK;
>> + struct netdev_queue *txq;
>> struct net_device *rcv;
>> int length = skb->len;
>> bool use_napi = false;
>> - int rxq;
>> + int ret, rxq;
>> rcu_read_lock();
>> rcv = rcu_dereference(priv->peer);
>> @@ -373,17 +371,45 @@ static netdev_tx_t veth_xmit(struct sk_buff
>> *skb, struct net_device *dev)
>> }
>> skb_tx_timestamp(skb);
>> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) ==
>> NET_RX_SUCCESS)) {
>> +
>> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
>> + switch (ret) {
>> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
>> if (!use_napi)
>> dev_sw_netstats_tx_add(dev, 1, length);
>> else
>> __veth_xdp_flush(rq);
>> - } else {
>> + break;
>> + case NETDEV_TX_BUSY:
>> + /* If a qdisc is attached to our virtual device, returning
>> + * NETDEV_TX_BUSY is allowed.
>> + */
>> + txq = netdev_get_tx_queue(dev, rxq);
>> +
>> + if (qdisc_txq_has_no_queue(txq)) {
>> + dev_kfree_skb_any(skb);
>> + goto drop;
>> + }
>> + /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
>> + __skb_push(skb, ETH_HLEN);
>> + /* Depend on prior success packets started NAPI consumer via
>> + * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
>> + * paired with empty check in veth_poll().
>> + */
>> + netif_tx_stop_queue(txq);
>> + smp_mb__after_atomic();
>> + if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
>> + netif_tx_wake_queue(txq);
>> + break;
>> + case NET_RX_DROP: /* same as NET_XMIT_DROP */
>> drop:
>> atomic64_inc(&priv->dropped);
>> ret = NET_XMIT_DROP;
>> + break;
>> + default:
>> + net_crit_ratelimited("%s(%s): Invalid return code(%d)",
>> + __func__, dev->name, ret);
>> }
>> -
>> rcu_read_unlock();
>> return ret;
>> @@ -874,9 +900,17 @@ static int veth_xdp_rcv(struct veth_rq *rq, int
>> budget,
>> struct veth_xdp_tx_bq *bq,
>> struct veth_stats *stats)
>> {
>> + struct veth_priv *priv = netdev_priv(rq->dev);
>> + int queue_idx = rq->xdp_rxq.queue_index;
>> + struct netdev_queue *peer_txq;
>> + struct net_device *peer_dev;
>> int i, done = 0, n_xdpf = 0;
>> void *xdpf[VETH_XDP_BATCH];
>> + /* NAPI functions as RCU section */
>> + peer_dev = rcu_dereference_check(priv->peer,
>> rcu_read_lock_bh_held());
>> + peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
>> +
>> for (i = 0; i < budget; i++) {
>> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>>
>
> Hi Jesper.
>
> Could you please take a look at the reported call traces and help
> understand whether this patch may have introduced a null dereference?
I'm investigating... thanks for reporting.
(more below)
> Pasting a snippet, for full logs (2 examples) see the link:
> https://lore.kernel.org/bpf/6fd7a5b5-ee26-4cc5-8eb0-449c4e326ccc@linux.dev/
>
> [ 343.217465] BUG: kernel NULL pointer dereference, address: 0000000000000018
> [ 343.218173] #PF: supervisor read access in kernel mode
> [ 343.218644] #PF: error_code(0x0000) - not-present page
> [ 343.219128] PGD 0 P4D 0
> [ 343.219379] Oops: Oops: 0000 [#1] SMP NOPTI
> [ 343.219768] CPU: 1 UID: 0 PID: 7635 Comm: kworker/1:11 Tainted: G
> W OE 6.15.0-g2b36f2252b0a-dirty #7 PREEMPT(full)
> [ 343.220844] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> [ 343.221436] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 343.222356] Workqueue: mld mld_dad_work
> [ 343.222730] RIP: 0010:veth_xdp_rcv.constprop.0+0x6b/0x380
>
Can you give me the output from below command (on your compiled kernel):
./scripts/faddr2line drivers/net/veth.o veth_xdp_rcv.constprop.0+0x6b
> [...]
>
> [ 343.231061] Call Trace:
> [ 343.231306] <IRQ>
> [ 343.231522] veth_poll+0x7b/0x3a0
> [ 343.231856] __napi_poll.constprop.0+0x28/0x1d0
> [ 343.232297] net_rx_action+0x199/0x350
> [ 343.232682] handle_softirqs+0xd3/0x400
> [ 343.233057] ? __dev_queue_xmit+0x27b/0x1250
> [ 343.233473] do_softirq+0x43/0x90
> [ 343.233804] </IRQ>
> [ 343.234016] <TASK>
> [ 343.234226] __local_bh_enable_ip+0xb5/0xd0
> [ 343.234622] ? __dev_queue_xmit+0x27b/0x1250
> [ 343.235035] __dev_queue_xmit+0x290/0x1250
> [ 343.235431] ? lock_acquire+0xbe/0x2c0
> [ 343.235797] ? ip6_finish_output+0x25e/0x540
> [ 343.236210] ? mark_held_locks+0x40/0x70
> [ 343.236583] ip6_finish_output2+0x38f/0xb80
> [ 343.237002] ? lock_release+0xc6/0x290
> [ 343.237364] ip6_finish_output+0x25e/0x540
> [ 343.237761] mld_sendpack+0x1c1/0x3a0
> [ 343.238123] mld_dad_work+0x3e/0x150
> [ 343.238473] process_one_work+0x1f8/0x580
> [ 343.238859] worker_thread+0x1ce/0x3c0
> [ 343.239224] ? __pfx_worker_thread+0x10/0x10
> [ 343.239638] kthread+0x128/0x250
> [ 343.239954] ? __pfx_kthread+0x10/0x10
> [ 343.240320] ? __pfx_kthread+0x10/0x10
> [ 343.240691] ret_from_fork+0x15c/0x1b0
> [ 343.241056] ? __pfx_kthread+0x10/0x10
> [ 343.241418] ret_from_fork_asm+0x1a/0x30
> [ 343.241800] </TASK>
> [ 343.242021] Modules linked in: bpf_testmod(OE) [last unloaded:
> est_no_cfi(OE)]
> [ 343.242737] CR2: 0000000000000018
> [ 343.243064] ---[ end trace 0000000000000000 ]---
>
>
> Thank you.
>
>
>> @@ -925,6 +959,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int
>> budget,
>> rq->stats.vs.xdp_packets += done;
>> u64_stats_update_end(&rq->stats.syncp);
>> + if (unlikely(netif_tx_queue_stopped(peer_txq)))
>> + netif_tx_wake_queue(peer_txq);
>> +
>> return done;
>> }
>>
>>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-06-10 11:43 ` Jesper Dangaard Brouer
@ 2025-06-10 15:56 ` Jesper Dangaard Brouer
2025-06-10 18:26 ` Ihor Solodrai
0 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-10 15:56 UTC (permalink / raw)
To: Ihor Solodrai, netdev, Jakub Kicinski
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil, Sebastian Andrzej Siewior
On 10/06/2025 13.43, Jesper Dangaard Brouer wrote:
>
> On 10/06/2025 00.09, Ihor Solodrai wrote:
>> On 4/25/25 7:55 AM, Jesper Dangaard Brouer wrote:
[...]
>>> ---
>>> drivers/net/veth.c | 57
>>> +++++++++++++++++++++++++++++++++++++++++++---------
>>> 1 file changed, 47 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 7bb53961c0ea..e58a0f1b5c5b 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -307,12 +307,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
>>> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
>>> {
>>> - if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
>>> - dev_kfree_skb_any(skb);
>>> - return NET_RX_DROP;
>>> - }
>>> + if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
>>> + return NETDEV_TX_BUSY; /* signal qdisc layer */
>>> - return NET_RX_SUCCESS;
>>> + return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>>> }
>>> static int veth_forward_skb(struct net_device *dev, struct sk_buff
>>> *skb,
>>> @@ -346,11 +344,11 @@ static netdev_tx_t veth_xmit(struct sk_buff
>>> *skb, struct net_device *dev)
>>> {
>>> struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>>> struct veth_rq *rq = NULL;
>>> - int ret = NETDEV_TX_OK;
>>> + struct netdev_queue *txq;
>>> struct net_device *rcv;
>>> int length = skb->len;
>>> bool use_napi = false;
>>> - int rxq;
>>> + int ret, rxq;
>>> rcu_read_lock();
>>> rcv = rcu_dereference(priv->peer);
>>> @@ -373,17 +371,45 @@ static netdev_tx_t veth_xmit(struct sk_buff
>>> *skb, struct net_device *dev)
>>> }
>>> skb_tx_timestamp(skb);
>>> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) ==
>>> NET_RX_SUCCESS)) {
>>> +
>>> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
>>> + switch (ret) {
>>> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
>>> if (!use_napi)
>>> dev_sw_netstats_tx_add(dev, 1, length);
>>> else
>>> __veth_xdp_flush(rq);
>>> - } else {
>>> + break;
>>> + case NETDEV_TX_BUSY:
>>> + /* If a qdisc is attached to our virtual device, returning
>>> + * NETDEV_TX_BUSY is allowed.
>>> + */
>>> + txq = netdev_get_tx_queue(dev, rxq);
>>> +
>>> + if (qdisc_txq_has_no_queue(txq)) {
>>> + dev_kfree_skb_any(skb);
>>> + goto drop;
>>> + }
>>> + /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
>>> + __skb_push(skb, ETH_HLEN);
>>> + /* Depend on prior success packets started NAPI consumer via
>>> + * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
>>> + * paired with empty check in veth_poll().
>>> + */
>>> + netif_tx_stop_queue(txq);
>>> + smp_mb__after_atomic();
>>> + if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
>>> + netif_tx_wake_queue(txq);
>>> + break;
>>> + case NET_RX_DROP: /* same as NET_XMIT_DROP */
>>> drop:
>>> atomic64_inc(&priv->dropped);
>>> ret = NET_XMIT_DROP;
>>> + break;
>>> + default:
>>> + net_crit_ratelimited("%s(%s): Invalid return code(%d)",
>>> + __func__, dev->name, ret);
>>> }
>>> -
>>> rcu_read_unlock();
>>> return ret;
>>> @@ -874,9 +900,17 @@ static int veth_xdp_rcv(struct veth_rq *rq, int
>>> budget,
>>> struct veth_xdp_tx_bq *bq,
>>> struct veth_stats *stats)
>>> {
>>> + struct veth_priv *priv = netdev_priv(rq->dev);
>>> + int queue_idx = rq->xdp_rxq.queue_index;
>>> + struct netdev_queue *peer_txq;
>>> + struct net_device *peer_dev;
>>> int i, done = 0, n_xdpf = 0;
>>> void *xdpf[VETH_XDP_BATCH];
>>> + /* NAPI functions as RCU section */
>>> + peer_dev = rcu_dereference_check(priv->peer,
>>> rcu_read_lock_bh_held());
>>> + peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
>>> +
>>> for (i = 0; i < budget; i++) {
>>> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>>>
>>
>> Hi Jesper.
>>
>> Could you please take a look at the reported call traces and help
>> understand whether this patch may have introduced a null dereference?
>
> I'm investigating... thanks for reporting.
> (more below)
>
>> Pasting a snippet, for full logs (2 examples) see the link:
>> https://lore.kernel.org/bpf/6fd7a5b5-ee26-4cc5-8eb0-449c4e326ccc@linux.dev/
Do you have any qdiscs attached to the veth device when reproducing?
Via above link I see that you managed to reproduce by running BPF
selftest "xdp_veth_broadcast_redirect". (Is this correct?)
How often does this happen?
Does this only happen with XDP redirected frames?
(Func veth_xdp_rcv() is also active for SKBs when veth is in GRO mode)
I'm not able to reproduce this running selftests (bpf-next at 5fcf896efe28c)
Below example selecting all "xdp_veth*" related tests:
$ sudo ./test_progs --name=xdp_veth
#630/1 xdp_veth_broadcast_redirect/0/BROADCAST:OK
#630/2 xdp_veth_broadcast_redirect/0/(BROADCAST | EXCLUDE_INGRESS):OK
#630/3 xdp_veth_broadcast_redirect/DRV_MODE/BROADCAST:OK
#630/4 xdp_veth_broadcast_redirect/DRV_MODE/(BROADCAST | EXCLUDE_INGRESS):OK
#630/5 xdp_veth_broadcast_redirect/SKB_MODE/BROADCAST:OK
#630/6 xdp_veth_broadcast_redirect/SKB_MODE/(BROADCAST | EXCLUDE_INGRESS):OK
#630 xdp_veth_broadcast_redirect:OK
#631/1 xdp_veth_egress/0/egress:OK
#631/2 xdp_veth_egress/DRV_MODE/egress:OK
#631/3 xdp_veth_egress/SKB_MODE/egress:OK
#631 xdp_veth_egress:OK
#632/1 xdp_veth_redirect/0:OK
#632/2 xdp_veth_redirect/DRV_MODE:OK
#632/3 xdp_veth_redirect/SKB_MODE:OK
#632 xdp_veth_redirect:OK
Summary: 3/12 PASSED, 0 SKIPPED, 0 FAILED
>> [ 343.217465] BUG: kernel NULL pointer dereference, address:
>> 0000000000000018
>> [ 343.218173] #PF: supervisor read access in kernel mode
>> [ 343.218644] #PF: error_code(0x0000) - not-present page
>> [ 343.219128] PGD 0 P4D 0
>> [ 343.219379] Oops: Oops: 0000 [#1] SMP NOPTI
>> [ 343.219768] CPU: 1 UID: 0 PID: 7635 Comm: kworker/1:11 Tainted: G
>> W OE 6.15.0-g2b36f2252b0a-dirty #7 PREEMPT(full)
^^^^^^^^^^^^
The SHA 2b36f2252b0 doesn't seem to exist.
What upstream kernel commit SHA is this kernel based on?
>> [ 343.220844] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>> [ 343.221436] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX,
>> 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>> [ 343.222356] Workqueue: mld mld_dad_work
>> [ 343.222730] RIP: 0010:veth_xdp_rcv.constprop.0+0x6b/0x380
>>
>
> Can you give me the output from below command (on your compiled kernel):
>
> ./scripts/faddr2line drivers/net/veth.o veth_xdp_rcv.constprop.0+0x6b
>
Still need above data/info please.
--Jesper
>> [...]
>>
>> [ 343.231061] Call Trace:
>> [ 343.231306] <IRQ>
>> [ 343.231522] veth_poll+0x7b/0x3a0
>> [ 343.231856] __napi_poll.constprop.0+0x28/0x1d0
>> [ 343.232297] net_rx_action+0x199/0x350
>> [ 343.232682] handle_softirqs+0xd3/0x400
>> [ 343.233057] ? __dev_queue_xmit+0x27b/0x1250
>> [ 343.233473] do_softirq+0x43/0x90
>> [ 343.233804] </IRQ>
>> [ 343.234016] <TASK>
>> [ 343.234226] __local_bh_enable_ip+0xb5/0xd0
>> [ 343.234622] ? __dev_queue_xmit+0x27b/0x1250
>> [ 343.235035] __dev_queue_xmit+0x290/0x1250
>> [ 343.235431] ? lock_acquire+0xbe/0x2c0
>> [ 343.235797] ? ip6_finish_output+0x25e/0x540
>> [ 343.236210] ? mark_held_locks+0x40/0x70
>> [ 343.236583] ip6_finish_output2+0x38f/0xb80
>> [ 343.237002] ? lock_release+0xc6/0x290
>> [ 343.237364] ip6_finish_output+0x25e/0x540
>> [ 343.237761] mld_sendpack+0x1c1/0x3a0
>> [ 343.238123] mld_dad_work+0x3e/0x150
>> [ 343.238473] process_one_work+0x1f8/0x580
>> [ 343.238859] worker_thread+0x1ce/0x3c0
>> [ 343.239224] ? __pfx_worker_thread+0x10/0x10
>> [ 343.239638] kthread+0x128/0x250
>> [ 343.239954] ? __pfx_kthread+0x10/0x10
>> [ 343.240320] ? __pfx_kthread+0x10/0x10
>> [ 343.240691] ret_from_fork+0x15c/0x1b0
>> [ 343.241056] ? __pfx_kthread+0x10/0x10
>> [ 343.241418] ret_from_fork_asm+0x1a/0x30
>> [ 343.241800] </TASK>
>> [ 343.242021] Modules linked in: bpf_testmod(OE) [last unloaded:
>> est_no_cfi(OE)]
>> [ 343.242737] CR2: 0000000000000018
>> [ 343.243064] ---[ end trace 0000000000000000 ]---
>>
>>
>> Thank you.
>>
>>
>>> @@ -925,6 +959,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int
>>> budget,
>>> rq->stats.vs.xdp_packets += done;
>>> u64_stats_update_end(&rq->stats.syncp);
>>> + if (unlikely(netif_tx_queue_stopped(peer_txq)))
>>> + netif_tx_wake_queue(peer_txq);
>>> +
>>> return done;
>>> }
>>>
>>>
>>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-06-10 15:56 ` Jesper Dangaard Brouer
@ 2025-06-10 18:26 ` Ihor Solodrai
2025-06-10 21:40 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 15+ messages in thread
From: Ihor Solodrai @ 2025-06-10 18:26 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, Jakub Kicinski, Bastien Curutchet
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil, Sebastian Andrzej Siewior
On 6/10/25 8:56 AM, Jesper Dangaard Brouer wrote:
>
>
> On 10/06/2025 13.43, Jesper Dangaard Brouer wrote:
>>
>> On 10/06/2025 00.09, Ihor Solodrai wrote:
>>> On 4/25/25 7:55 AM, Jesper Dangaard Brouer wrote:
> [...]
>>>> ---
>>>> drivers/net/veth.c | 57 ++++++++++++++++++++++++++++++++++++++++
>>>> +++---------
>>>> 1 file changed, 47 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index 7bb53961c0ea..e58a0f1b5c5b 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -307,12 +307,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
>>>> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
>>>> {
>>>> - if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
>>>> - dev_kfree_skb_any(skb);
>>>> - return NET_RX_DROP;
>>>> - }
>>>> + if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
>>>> + return NETDEV_TX_BUSY; /* signal qdisc layer */
>>>> - return NET_RX_SUCCESS;
>>>> + return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>>>> }
>>>> static int veth_forward_skb(struct net_device *dev, struct sk_buff
>>>> *skb,
>>>> @@ -346,11 +344,11 @@ static netdev_tx_t veth_xmit(struct sk_buff
>>>> *skb, struct net_device *dev)
>>>> {
>>>> struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>>>> struct veth_rq *rq = NULL;
>>>> - int ret = NETDEV_TX_OK;
>>>> + struct netdev_queue *txq;
>>>> struct net_device *rcv;
>>>> int length = skb->len;
>>>> bool use_napi = false;
>>>> - int rxq;
>>>> + int ret, rxq;
>>>> rcu_read_lock();
>>>> rcv = rcu_dereference(priv->peer);
>>>> @@ -373,17 +371,45 @@ static netdev_tx_t veth_xmit(struct sk_buff
>>>> *skb, struct net_device *dev)
>>>> }
>>>> skb_tx_timestamp(skb);
>>>> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) ==
>>>> NET_RX_SUCCESS)) {
>>>> +
>>>> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
>>>> + switch (ret) {
>>>> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
>>>> if (!use_napi)
>>>> dev_sw_netstats_tx_add(dev, 1, length);
>>>> else
>>>> __veth_xdp_flush(rq);
>>>> - } else {
>>>> + break;
>>>> + case NETDEV_TX_BUSY:
>>>> + /* If a qdisc is attached to our virtual device, returning
>>>> + * NETDEV_TX_BUSY is allowed.
>>>> + */
>>>> + txq = netdev_get_tx_queue(dev, rxq);
>>>> +
>>>> + if (qdisc_txq_has_no_queue(txq)) {
>>>> + dev_kfree_skb_any(skb);
>>>> + goto drop;
>>>> + }
>>>> + /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
>>>> + __skb_push(skb, ETH_HLEN);
>>>> + /* Depend on prior success packets started NAPI consumer via
>>>> + * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
>>>> + * paired with empty check in veth_poll().
>>>> + */
>>>> + netif_tx_stop_queue(txq);
>>>> + smp_mb__after_atomic();
>>>> + if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
>>>> + netif_tx_wake_queue(txq);
>>>> + break;
>>>> + case NET_RX_DROP: /* same as NET_XMIT_DROP */
>>>> drop:
>>>> atomic64_inc(&priv->dropped);
>>>> ret = NET_XMIT_DROP;
>>>> + break;
>>>> + default:
>>>> + net_crit_ratelimited("%s(%s): Invalid return code(%d)",
>>>> + __func__, dev->name, ret);
>>>> }
>>>> -
>>>> rcu_read_unlock();
>>>> return ret;
>>>> @@ -874,9 +900,17 @@ static int veth_xdp_rcv(struct veth_rq *rq, int
>>>> budget,
>>>> struct veth_xdp_tx_bq *bq,
>>>> struct veth_stats *stats)
>>>> {
>>>> + struct veth_priv *priv = netdev_priv(rq->dev);
>>>> + int queue_idx = rq->xdp_rxq.queue_index;
>>>> + struct netdev_queue *peer_txq;
>>>> + struct net_device *peer_dev;
>>>> int i, done = 0, n_xdpf = 0;
>>>> void *xdpf[VETH_XDP_BATCH];
>>>> + /* NAPI functions as RCU section */
>>>> + peer_dev = rcu_dereference_check(priv->peer,
>>>> rcu_read_lock_bh_held());
>>>> + peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
>>>> +
>>>> for (i = 0; i < budget; i++) {
>>>> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>>>>
>>>
>>> Hi Jesper.
>>>
>>> Could you please take a look at the reported call traces and help
>>> understand whether this patch may have introduced a null dereference?
>>
>> I'm investigating... thanks for reporting.
Thank you for looking into this.
>> (more below)
>>
>>> Pasting a snippet, for full logs (2 examples) see the link:
>>> https://lore.kernel.org/bpf/6fd7a5b5-ee26-4cc5-8eb0-449c4e326ccc@linux.dev/
>
> Do you have any qdisc's attached to the veth device when reproducing?
Looking at the selftest code, I don't see any qdisc set up in the
create_network. But I am not familiar with all this, and could be
wrong.
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/test_xdp_veth.c#n172
>
> Via above link I see that you managed to reproduce by running BPF
> selftest "xdp_veth_broadcast_redirect". (Is this correct?)
>
> How often does this happen?
So far I haven't been able to reproduce it locally.
It does not happen often. There were two failures on BPF CI out of
100+ runs in the past couple of days:
* One on bpf-next:
  * https://github.com/kernel-patches/bpf/commit/e41079f53e8792c99cc8888f545c31bc341ea9ac
  * https://github.com/kernel-patches/bpf/actions/runs/15543380196/job/43759847203
* One on netdev tree (netdev merged into bpf-next):
  * https://github.com/kernel-patches/bpf/commit/6f3b8b4f8c133ff4eaf3fb74865731d2105c77eb
  * https://github.com/kernel-patches/bpf/actions/runs/15563091012/job/43820682289
Click on a gear in top-right and "View raw logs" for full logs.
Note that you can download kbuild-output from the job artifacts.
Also a syzbot report does not involve the selftests:
* https://lore.kernel.org/lkml/683da55e.a00a0220.d8eae.0052.GAE@google.com/
>
> Does this only happen with XDP redirected frames?
> (Func veth_xdp_rcv() also active for SKBs when veth is in GRO mode)
Sorry, I don't know an answer to this... Bastien, could you please comment?
It seems to happen after xdp_veth_broadcast_redirect test successfully
completed, so before/during xdp_veth_egress tests.
>
> I'm not able to reproduce this running selftests (bpf-next at
> 5fcf896efe28c)
> Below example selecting all "xdp_veth*" related tests:
>
> $ sudo ./test_progs --name=xdp_veth
> #630/1 xdp_veth_broadcast_redirect/0/BROADCAST:OK
> #630/2 xdp_veth_broadcast_redirect/0/(BROADCAST | EXCLUDE_INGRESS):OK
> #630/3 xdp_veth_broadcast_redirect/DRV_MODE/BROADCAST:OK
> #630/4 xdp_veth_broadcast_redirect/DRV_MODE/(BROADCAST |
> EXCLUDE_INGRESS):OK
> #630/5 xdp_veth_broadcast_redirect/SKB_MODE/BROADCAST:OK
> #630/6 xdp_veth_broadcast_redirect/SKB_MODE/(BROADCAST |
> EXCLUDE_INGRESS):OK
> #630 xdp_veth_broadcast_redirect:OK
> #631/1 xdp_veth_egress/0/egress:OK
> #631/2 xdp_veth_egress/DRV_MODE/egress:OK
> #631/3 xdp_veth_egress/SKB_MODE/egress:OK
> #631 xdp_veth_egress:OK
> #632/1 xdp_veth_redirect/0:OK
> #632/2 xdp_veth_redirect/DRV_MODE:OK
> #632/3 xdp_veth_redirect/SKB_MODE:OK
> #632 xdp_veth_redirect:OK
> Summary: 3/12 PASSED, 0 SKIPPED, 0 FAILED
>
>
>>> [ 343.217465] BUG: kernel NULL pointer dereference, address:
>>> 0000000000000018
>>> [ 343.218173] #PF: supervisor read access in kernel mode
>>> [ 343.218644] #PF: error_code(0x0000) - not-present page
>>> [ 343.219128] PGD 0 P4D 0
>>> [ 343.219379] Oops: Oops: 0000 [#1] SMP NOPTI
>>> [ 343.219768] CPU: 1 UID: 0 PID: 7635 Comm: kworker/1:11 Tainted: G
>>> W OE 6.15.0-g2b36f2252b0a-dirty #7 PREEMPT(full)
> ^^^^^^^^^^^^
> The SHA 2b36f2252b0 doesn't seem to exist.
> What upstream kernel commit SHA is this kernel based on?
This sha isn't very helpful, unfortunately, because CI adds commits on
top. See the links to github I shared above for exact revisions.
>
>>> [ 343.220844] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>>> [ 343.221436] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX,
>>> 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>>> [ 343.222356] Workqueue: mld mld_dad_work
>>> [ 343.222730] RIP: 0010:veth_xdp_rcv.constprop.0+0x6b/0x380
>>>
>>
>> Can you give me the output from below command (on your compiled kernel):
>>
>> ./scripts/faddr2line drivers/net/veth.o veth_xdp_rcv.constprop.0+0x6b
>>
>
> Still need above data/info please.
root@devvm7589:/ci/workspace# ./scripts/faddr2line ./kout.gcc/drivers/net/veth.o veth_xdp_rcv.constprop.0+0x6b
veth_xdp_rcv.constprop.0+0x6b/0x390:
netdev_get_tx_queue at /ci/workspace/kout.gcc/../include/linux/netdevice.h:2637
(inlined by) veth_xdp_rcv at /ci/workspace/kout.gcc/../drivers/net/veth.c:912
Which is:
veth.c:912
struct veth_priv *priv = netdev_priv(rq->dev);
int queue_idx = rq->xdp_rxq.queue_index;
struct netdev_queue *peer_txq;
struct net_device *peer_dev;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
/* NAPI functions as RCU section */
peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
---> peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
netdevice.h:2637
static inline
struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
unsigned int index)
{
DEBUG_NET_WARN_ON_ONCE(index >= dev->num_tx_queues);
---> return &dev->_tx[index];
}
So the suspect is peer_dev (priv->peer)?
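If peer_dev can indeed be observed as NULL here (e.g. racing with peer
device teardown while NAPI is still running), would a guard along these
lines make sense? A rough, untested sketch:

  peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
  /* Hypothetical guard, assuming priv->peer may be NULL here */
  peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;

  /* ... rest of veth_xdp_rcv() ... */

  if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
          netif_tx_wake_queue(peer_txq);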
Didn't know about faddr2line, thanks.
Let me know if I can help with anything else.
>
> --Jesper
>
>>> [...]
>>>
>>> [ 343.231061] Call Trace:
>>> [ 343.231306] <IRQ>
>>> [ 343.231522] veth_poll+0x7b/0x3a0
>>> [ 343.231856] __napi_poll.constprop.0+0x28/0x1d0
>>> [ 343.232297] net_rx_action+0x199/0x350
>>> [ 343.232682] handle_softirqs+0xd3/0x400
>>> [ 343.233057] ? __dev_queue_xmit+0x27b/0x1250
>>> [ 343.233473] do_softirq+0x43/0x90
>>> [ 343.233804] </IRQ>
>>> [ 343.234016] <TASK>
>>> [ 343.234226] __local_bh_enable_ip+0xb5/0xd0
>>> [ 343.234622] ? __dev_queue_xmit+0x27b/0x1250
>>> [ 343.235035] __dev_queue_xmit+0x290/0x1250
>>> [ 343.235431] ? lock_acquire+0xbe/0x2c0
>>> [ 343.235797] ? ip6_finish_output+0x25e/0x540
>>> [ 343.236210] ? mark_held_locks+0x40/0x70
>>> [ 343.236583] ip6_finish_output2+0x38f/0xb80
>>> [ 343.237002] ? lock_release+0xc6/0x290
>>> [ 343.237364] ip6_finish_output+0x25e/0x540
>>> [ 343.237761] mld_sendpack+0x1c1/0x3a0
>>> [ 343.238123] mld_dad_work+0x3e/0x150
>>> [ 343.238473] process_one_work+0x1f8/0x580
>>> [ 343.238859] worker_thread+0x1ce/0x3c0
>>> [ 343.239224] ? __pfx_worker_thread+0x10/0x10
>>> [ 343.239638] kthread+0x128/0x250
>>> [ 343.239954] ? __pfx_kthread+0x10/0x10
>>> [ 343.240320] ? __pfx_kthread+0x10/0x10
>>> [ 343.240691] ret_from_fork+0x15c/0x1b0
>>> [ 343.241056] ? __pfx_kthread+0x10/0x10
>>> [ 343.241418] ret_from_fork_asm+0x1a/0x30
>>> [ 343.241800] </TASK>
>>> [ 343.242021] Modules linked in: bpf_testmod(OE) [last unloaded:
>>> est_no_cfi(OE)]
>>> [ 343.242737] CR2: 0000000000000018
>>> [ 343.243064] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> Thank you.
>>>
>>>
>>>> @@ -925,6 +959,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int
>>>> budget,
>>>> rq->stats.vs.xdp_packets += done;
>>>> u64_stats_update_end(&rq->stats.syncp);
>>>> + if (unlikely(netif_tx_queue_stopped(peer_txq)))
>>>> + netif_tx_wake_queue(peer_txq);
>>>> +
>>>> return done;
>>>> }
>>>>
>>>>
>>>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-06-10 18:26 ` Ihor Solodrai
@ 2025-06-10 21:40 ` Jesper Dangaard Brouer
2025-06-11 0:25 ` Ihor Solodrai
0 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-10 21:40 UTC (permalink / raw)
To: Ihor Solodrai, netdev, Jakub Kicinski, Bastien Curutchet
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil, Sebastian Andrzej Siewior
On 10/06/2025 20.26, Ihor Solodrai wrote:
> On 6/10/25 8:56 AM, Jesper Dangaard Brouer wrote:
>>
>>
>> On 10/06/2025 13.43, Jesper Dangaard Brouer wrote:
>>>
>>> On 10/06/2025 00.09, Ihor Solodrai wrote:
>>>> On 4/25/25 7:55 AM, Jesper Dangaard Brouer wrote:
>> [...]
>>>>> ---
>>>>> drivers/net/veth.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++----------
>>>>> 1 file changed, 47 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>>> index 7bb53961c0ea..e58a0f1b5c5b 100644
>>>>> --- a/drivers/net/veth.c
>>>>> +++ b/drivers/net/veth.c
>>>>> @@ -307,12 +307,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
>>>>> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
>>>>> {
>>>>> - if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
>>>>> - dev_kfree_skb_any(skb);
>>>>> - return NET_RX_DROP;
>>>>> - }
>>>>> + if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
>>>>> + return NETDEV_TX_BUSY; /* signal qdisc layer */
>>>>> - return NET_RX_SUCCESS;
>>>>> + return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>>>>> }
>>>>> static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
>>>>> @@ -346,11 +344,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>> {
>>>>> struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>>>>> struct veth_rq *rq = NULL;
>>>>> - int ret = NETDEV_TX_OK;
>>>>> + struct netdev_queue *txq;
>>>>> struct net_device *rcv;
>>>>> int length = skb->len;
>>>>> bool use_napi = false;
>>>>> - int rxq;
>>>>> + int ret, rxq;
>>>>> rcu_read_lock();
>>>>> rcv = rcu_dereference(priv->peer);
>>>>> @@ -373,17 +371,45 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>> }
>>>>> skb_tx_timestamp(skb);
>>>>> - if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
>>>>> +
>>>>> + ret = veth_forward_skb(rcv, skb, rq, use_napi);
>>>>> + switch (ret) {
>>>>> + case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
>>>>> if (!use_napi)
>>>>> dev_sw_netstats_tx_add(dev, 1, length);
>>>>> else
>>>>> __veth_xdp_flush(rq);
>>>>> - } else {
>>>>> + break;
>>>>> + case NETDEV_TX_BUSY:
>>>>> + /* If a qdisc is attached to our virtual device, returning
>>>>> + * NETDEV_TX_BUSY is allowed.
>>>>> + */
>>>>> + txq = netdev_get_tx_queue(dev, rxq);
>>>>> +
>>>>> + if (qdisc_txq_has_no_queue(txq)) {
>>>>> + dev_kfree_skb_any(skb);
>>>>> + goto drop;
>>>>> + }
>>>>> + /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
>>>>> + __skb_push(skb, ETH_HLEN);
>>>>> + /* Depend on prior success packets started NAPI consumer via
>>>>> + * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
>>>>> + * paired with empty check in veth_poll().
>>>>> + */
>>>>> + netif_tx_stop_queue(txq);
>>>>> + smp_mb__after_atomic();
>>>>> + if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
>>>>> + netif_tx_wake_queue(txq);
>>>>> + break;
>>>>> + case NET_RX_DROP: /* same as NET_XMIT_DROP */
>>>>> drop:
>>>>> atomic64_inc(&priv->dropped);
>>>>> ret = NET_XMIT_DROP;
>>>>> + break;
>>>>> + default:
>>>>> + net_crit_ratelimited("%s(%s): Invalid return code(%d)",
>>>>> + __func__, dev->name, ret);
>>>>> }
>>>>> -
>>>>> rcu_read_unlock();
>>>>> return ret;
>>>>> @@ -874,9 +900,17 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>> struct veth_xdp_tx_bq *bq,
>>>>> struct veth_stats *stats)
>>>>> {
>>>>> + struct veth_priv *priv = netdev_priv(rq->dev);
>>>>> + int queue_idx = rq->xdp_rxq.queue_index;
>>>>> + struct netdev_queue *peer_txq;
>>>>> + struct net_device *peer_dev;
>>>>> int i, done = 0, n_xdpf = 0;
>>>>> void *xdpf[VETH_XDP_BATCH];
>>>>> + /* NAPI functions as RCU section */
>>>>> + peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>>>>> + peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
>>>>> +
>>>>> for (i = 0; i < budget; i++) {
>>>>> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>>>>>
>>>>
>>>> Hi Jesper.
>>>>
>>>> Could you please take a look at the reported call traces and help
>>>> understand whether this patch may have introduced a null dereference?
>>>
>>> I'm investigating... thanks for reporting.
>
> Thank you for looking into this.
>
>>> (more below)
>>>
>>>> Pasting a snippet, for full logs (2 examples) see the link:
>>>> https://lore.kernel.org/bpf/6fd7a5b5-ee26-4cc5-8eb0-449c4e326ccc@linux.dev/
>>
>> Do you have any qdisc's attached to the veth device when reproducing?
>
> Looking at the selftest code, I don't see any qdisc set up in the
> create_network. But I am not familiar with all this, and could be
> wrong.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/test_xdp_veth.c#n172
>
>>
>> Via above link I see that you managed to reproduce by running BPF
>> selftest "xdp_veth_broadcast_redirect". (Is this correct?)
>>
>> How often does this happen?
>
> So far I wasn't able to reproduce it locally.
>
> It does not happen often. There were two failures on BPF CI out of
> 100+ runs in the past couple of days:
>
> * One on bpf-next:
> * https://github.com/kernel-patches/bpf/commit/e41079f53e8792c99cc8888f545c31bc341ea9ac
> * https://github.com/kernel-patches/bpf/actions/runs/15543380196/job/43759847203
> * One on netdev tree (netdev merged into bpf-next):
> * https://github.com/kernel-patches/bpf/commit/6f3b8b4f8c133ff4eaf3fb74865731d2105c77eb
> * https://github.com/kernel-patches/bpf/actions/runs/15563091012/job/43820682289
>
> Click on a gear in top-right and "View raw logs" for full logs.
> Note that you can download kbuild-output from the job artifacts.
>
> Also a syzbot report does not involve the selftests:
> * https://lore.kernel.org/lkml/683da55e.a00a0220.d8eae.0052.GAE@google.com/
>
>>
>> Does this only happen with XDP redirected frames?
>> (Func veth_xdp_rcv() also active for SKBs when veth is in GRO mode)
>
>
> Sorry, I don't know an answer to this... Bastien, could you please comment?
>
> It seems to happen after xdp_veth_broadcast_redirect test successfully
> completed, so before/during xdp_veth_egress tests.
>
>>
>> I'm not able to reproduce this running selftests (bpf-next at
>> 5fcf896efe28c)
>> Below example selecting all "xdp_veth*" related tests:
>>
>> $ sudo ./test_progs --name=xdp_veth
>> #630/1 xdp_veth_broadcast_redirect/0/BROADCAST:OK
>> #630/2 xdp_veth_broadcast_redirect/0/(BROADCAST | EXCLUDE_INGRESS):OK
>> #630/3 xdp_veth_broadcast_redirect/DRV_MODE/BROADCAST:OK
>> #630/4 xdp_veth_broadcast_redirect/DRV_MODE/(BROADCAST | EXCLUDE_INGRESS):OK
>> #630/5 xdp_veth_broadcast_redirect/SKB_MODE/BROADCAST:OK
>> #630/6 xdp_veth_broadcast_redirect/SKB_MODE/(BROADCAST | EXCLUDE_INGRESS):OK
>> #630 xdp_veth_broadcast_redirect:OK
>> #631/1 xdp_veth_egress/0/egress:OK
>> #631/2 xdp_veth_egress/DRV_MODE/egress:OK
>> #631/3 xdp_veth_egress/SKB_MODE/egress:OK
>> #631 xdp_veth_egress:OK
>> #632/1 xdp_veth_redirect/0:OK
>> #632/2 xdp_veth_redirect/DRV_MODE:OK
>> #632/3 xdp_veth_redirect/SKB_MODE:OK
>> #632 xdp_veth_redirect:OK
>> Summary: 3/12 PASSED, 0 SKIPPED, 0 FAILED
>>
>>
>>>> [ 343.217465] BUG: kernel NULL pointer dereference, address: 0000000000000018
>>>> [ 343.218173] #PF: supervisor read access in kernel mode
>>>> [ 343.218644] #PF: error_code(0x0000) - not-present page
>>>> [ 343.219128] PGD 0 P4D 0
>>>> [ 343.219379] Oops: Oops: 0000 [#1] SMP NOPTI
>>>> [ 343.219768] CPU: 1 UID: 0 PID: 7635 Comm: kworker/1:11 Tainted: G W OE 6.15.0-g2b36f2252b0a-dirty #7 PREEMPT(full)
>> ^^^^^^^^^^^^
>> The SHA 2b36f2252b0 doesn't seem to exist.
>> What upstream kernel commit SHA is this kernel based on?
>
> This sha isn't very helpful, unfortunately, because CI adds commits on
> top. See the links to github I shared above for exact revisions.
>
>>
>>>> [ 343.220844] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>>>> [ 343.221436] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>>>> [ 343.222356] Workqueue: mld mld_dad_work
>>>> [ 343.222730] RIP: 0010:veth_xdp_rcv.constprop.0+0x6b/0x380
>>>>
>>>
>>> Can you give me the output from below command (on your compiled kernel):
>>>
>>> ./scripts/faddr2line drivers/net/veth.o veth_xdp_rcv.constprop.0+0x6b
>>>
>>
>> Still need above data/info please.
>
> root@devvm7589:/ci/workspace# ./scripts/faddr2line ./kout.gcc/drivers/net/veth.o veth_xdp_rcv.constprop.0+0x6b
> veth_xdp_rcv.constprop.0+0x6b/0x390:
> netdev_get_tx_queue at /ci/workspace/kout.gcc/../include/linux/netdevice.h:2637
> (inlined by) veth_xdp_rcv at /ci/workspace/kout.gcc/../drivers/net/veth.c:912
>
> Which is:
>
> veth.c:912
> struct veth_priv *priv = netdev_priv(rq->dev);
> int queue_idx = rq->xdp_rxq.queue_index;
> struct netdev_queue *peer_txq;
> struct net_device *peer_dev;
> int i, done = 0, n_xdpf = 0;
> void *xdpf[VETH_XDP_BATCH];
>
> /* NAPI functions as RCU section */
> peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
> ---> peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
>
> netdevice.h:2637
> static inline
> struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
> unsigned int index)
> {
> DEBUG_NET_WARN_ON_ONCE(index >= dev->num_tx_queues);
> ---> return &dev->_tx[index];
> }
>
> So the suspect is peer_dev (priv->peer)?
Yes, this is the problem!
So, it seems that peer_dev (priv->peer) can become a NULL pointer.
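That also fits the faulting addresses: netdev_get_tx_queue() first touches
a member of the (NULL) net_device, so the oops address is simply that
member's offset from zero. A sketch (offsets vary by config; not the
upstream code):

/* Sketch only: the failing access when the peer was already cleared */
struct net_device *dev = NULL;               /* priv->peer after dellink */
struct netdev_queue *txq = &dev->_tx[index]; /* loads dev->_tx, i.e. a
                                              * read at offsetof(struct
                                              * net_device, _tx) -- a small
                                              * address like the 0x18 /
                                              * 0x438 seen in the oopses */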
Managed to reproduce - via manually deleting the peer device (shell sketch just below):
- ip link delete dev veth42
- while overloading veth41 via XDP redirecting packets into it.
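In shell terms, roughly (a sketch; the XDP redirect setup is elided and
the names are from my local test):

# Rough repro sketch (illustrative; assumes an XDP program elsewhere
# redirecting a heavy packet flow into veth41, e.g. via bpf_redirect_map):
ip link add veth41 type veth peer name veth42
ip link set dev veth41 up
ip link set dev veth42 up
# ... start the XDP-redirect traffic into veth41 ...
ip link delete dev veth42   # peer torn down while NAPI is still polling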
Managed to trigger concurrent crashes on two CPUs (C0 + C3)
- so below output gets interlaced a bit:
[26700.481334][ C0] BUG: kernel NULL pointer dereference, address: 0000000000000438
[26700.489097][ C0] #PF: supervisor read access in kernel mode
[26700.495008][ C0] #PF: error_code(0x0000) - not-present page
[26700.500920][ C0] PGD 0 P4D 0
[26700.504233][ C0] Oops: Oops: 0000 [#1] SMP PTI
[26700.509023][ C0] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O N 6.15.0-bpf-next+ #40 PREEMPT(full)
[26700.520173][ C0] Tainted: [O]=OOT_MODULE, [N]=TEST
[26700.525305][ C0] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
[26700.533582][ C0] RIP: 0010:veth_xdp_rcv.constprop.0+0x37/0x2c0
[26700.539766][ C0] Code: 55 41 89 f5 41 54 55 53 48 89 fb 48 81 ec 90 00 00 00 8b 87 48 03 00 00 48 89 14 24 48 8b 97 f8 01 00 00 48 8b 92 c0 09 00 00 <3b> 82 38 04 00 00 0f 83 6d 02 00 00 4c 8d 3c 80 49 c1 e7 06 4c 03
[26700.559359][ C0] RSP: 0000:ffffc90000003cb0 EFLAGS: 00010282
[26700.565367][ C0] RAX: 0000000000000000 RBX: ffff888160e5c000 RCX: ffffc90000003d78
[26700.573300][ C0] RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff888160e5c000
[26700.581233][ C0] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000101
[26700.589161][ C0] R10: ffffffff830080c0 R11: 0000000000000000 R12: 0000000000000040
[26700.597092][ C0] R13: 0000000000000040 R14: ffffc90000003d78 R15: ffff88887fc2d400
[26700.605022][ C0] FS: 0000000000000000(0000) GS:ffff8888fbe16000(0000) knlGS:0000000000000000
[26700.613909][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[26700.620437][ C0] CR2: 0000000000000438 CR3: 000000010466c003 CR4: 00000000003726f0
[26700.628370][ C0] DR0: ffffffff83f801a4 DR1: ffffffff83f801a5 DR2: ffffffff83f801a2
[26700.636301][ C0] DR3: ffffffff83f801a3 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[26700.644232][ C0] Call Trace:
[26700.647453][ C0] <IRQ>
[26700.650243][ C0] ? bpf_prog_4666c41a92ffb35e_tp_xdp_devmap_xmit_multi+0x129/0x1aa
[26700.658177][ C0] ? bpf_trace_run5+0x78/0xe0
[26700.662788][ C0] veth_poll+0x62/0x210
[26700.666878][ C0] ? mlx5e_napi_poll+0x35c/0x710 [mlx5_core]
[26700.672887][ C0] __napi_poll+0x2b/0x1f0
[26700.677155][ C0] net_rx_action+0x352/0x400
[26700.681682][ C0] handle_softirqs+0xe6/0x2d0
[26700.686300][ C0] irq_exit_rcu+0x9d/0xc0
[26700.690565][ C0] common_interrupt+0x81/0xa0
[26700.695183][ C0] </IRQ>
[26700.698056][ C0] <TASK>
[26700.700936][ C0] asm_common_interrupt+0x22/0x40
[26700.705898][ C0] RIP: 0010:cpuidle_enter_state+0xc2/0x430
[26700.711643][ C0] Code: 00 e8 02 99 35 ff e8 8d f8 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 4b 0c 34 ff 45 84 ff 0f 85 49 02 00 00 fb 45 85 f6 <0f> 88 8a 01 00 00 49 63 ce 4c 8b 14 24 48 8d 04 49 48 8d 14 81 48
[26700.731229][ C0] RSP: 0000:ffffffff83003e48 EFLAGS: 00000202
[26700.737229][ C0] RAX: ffff8888fbe16000 RBX: ffff88887fc347c0 RCX: 000000000000001f
[26700.745165][ C0] RDX: 0000000000000000 RSI: 00000000238e3e11 RDI: 0000000000000000
[26700.753095][ C0] RBP: 0000000000000002 R08: 0000000000000002 R09: 0000000000000006
[26700.761025][ C0] R10: 00000000ffffffff R11: 0000000000000000 R12: ffffffff83350c40
[26700.768956][ C0] R13: 00001848b0b34003 R14: 0000000000000002 R15: 0000000000000000
[26700.776885][ C0] ? cpuidle_enter_state+0xb5/0x430
[26700.782019][ C0] cpuidle_enter+0x29/0x40
[26700.786375][ C0] do_idle+0x171/0x1d0
[26700.790378][ C0] cpu_startup_entry+0x25/0x30
[26700.795076][ C0] rest_init+0xcc/0xd0
[26700.799081][ C0] start_kernel+0x379/0x5b0
[26700.803520][ C0] x86_64_start_reservations+0x14/0x30
[26700.808910][ C0] x86_64_start_kernel+0xcb/0xd0
[26700.813782][ C0] common_startup_64+0x13e/0x148
[26700.818662][ C0] </TASK>
[26700.821628][ C0] Modules linked in: iptable_filter ip_tables x_tables nf_defrag_ipv6 nf_defrag_ipv4 rpcrdma rdma_ucm ib_umad ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib macsec ib_uverbs ib_core mlx5_core sunrpc intel_uncore_frequency intel_uncore_frequency_common coretemp i40e ixgbe bnxt_en igb i2c_i801 igc rapl intel_cstate mei_me psample mdio libie i2c_smbus acpi_ipmi ptp intel_uncore pcspkr mei i2c_algo_bit pps_core wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad bfq sch_fq_codel drm i2c_core dm_multipath nfnetlink zram 842_decompress lz4hc_compress 842_compress zstd_compress hid_generic scsi_dh_rdac fuse [last unloaded: nf_conntrack]
[26700.878187][ C0] CR2: 0000000000000438
[26700.882282][ C0] ---[ end trace 0000000000000000 ]---
[26700.882283][ C3] BUG: kernel NULL pointer dereference, address: 0000000000000438
[26700.890195][ C0] pstore: backend (erst) writing error (-28)
[26700.895228][ C3] #PF: supervisor read access in kernel mode
[26700.901054][ C0] RIP: 0010:veth_xdp_rcv.constprop.0+0x37/0x2c0
[26700.906877][ C3] #PF: error_code(0x0000) - not-present page
[26700.912960][ C0] Code: 55 41 89 f5 41 54 55 53 48 89 fb 48 81 ec 90 00 00 00 8b 87 48 03 00 00 48 89 14 24 48 8b 97 f8 01 00 00 48 8b 92 c0 09 00 00 <3b> 82 38 04 00 00 0f 83 6d 02 00 00 4c 8d 3c 80 49 c1 e7 06 4c 03
[26700.918786][ C3] PGD 0
[26700.938217][ C0] RSP: 0000:ffffc90000003cb0 EFLAGS: 00010282
[26700.938308][ C3] P4D 0
[26700.941016][ C0]
[26700.946927][ C3]
[26700.949629][ C0] RAX: 0000000000000000 RBX: ffff888160e5c000 RCX: ffffc90000003d78
[26700.951815][ C3] Oops: Oops: 0000 [#2] SMP PTI
[26700.953997][ C0] RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff888160e5c000
[26700.961816][ C3] CPU: 3 UID: 0 PID: 0 Comm: swapper/3 Tainted: G D O N 6.15.0-bpf-next+ #40 PREEMPT(full)
[26700.966513][ C0] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000101
[26700.974331][ C3] Tainted: [D]=DIE, [O]=OOT_MODULE, [N]=TEST
[26700.985356][ C0] R10: ffffffff830080c0 R11: 0000000000000000 R12: 0000000000000040
[26700.993173][ C3] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
[26700.998995][ C0] R13: 0000000000000040 R14: ffffc90000003d78 R15: ffff88887fc2d400
[26701.006813][ C3] RIP: 0010:veth_xdp_rcv.constprop.0+0x37/0x2c0
[26701.014977][ C0] FS: 0000000000000000(0000) GS:ffff8888fbe16000(0000) knlGS:0000000000000000
[26701.022796][ C3] Code: 55 41 89 f5 41 54 55 53 48 89 fb 48 81 ec 90 00 00 00 8b 87 48 03 00 00 48 89 14 24 48 8b 97 f8 01 00 00 48 8b 92 c0 09 00 00 <3b> 82 38 04 00 00 0f 83 6d 02 00 00 4c 8d 3c 80 49 c1 e7 06 4c 03
[26701.028879][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[26701.037650][ C3] RSP: 0018:ffffc900001accb0 EFLAGS: 00010282
[26701.057082][ C0] CR2: 0000000000000438 CR3: 000000010466c003 CR4: 00000000003726f0
[26701.063511][ C3]
[26701.069423][ C0] DR0: ffffffff83f801a4 DR1: ffffffff83f801a5 DR2: ffffffff83f801a2
[26701.077241][ C3] RAX: 0000000000000003 RBX: ffff888160e5cb40 RCX: ffffc900001acd78
[26701.079423][ C0] DR3: ffffffff83f801a3 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[26701.087242][ C3] RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff888160e5cb40
[26701.095059][ C0] Kernel panic - not syncing: Fatal exception in interrupt
[26702.123660][ C0] Shutting down cpus with NMI
[26702.138666][ C0] Kernel Offset: disabled
[26702.145524][ C0] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>>
>>>> [...]
>>>>> @@ -925,6 +959,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>> rq->stats.vs.xdp_packets += done;
>>>>> u64_stats_update_end(&rq->stats.syncp);
>>>>> + if (unlikely(netif_tx_queue_stopped(peer_txq)))
>>>>> + netif_tx_wake_queue(peer_txq);
>>>>> +
>>>>> return done;
>>>>> }
A fix could look like this:
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index e58a0f1b5c5b..a3046142cb8e 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -909,7 +909,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
/* NAPI functions as RCU section */
peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
- peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
+ peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -959,7 +959,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (unlikely(netif_tx_queue_stopped(peer_txq)))
+ if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
netif_tx_wake_queue(peer_txq);
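For context, that consumer-side wake pairs with the producer-side stop
this series added in veth_xmit(); condensed to a sketch (not the
verbatim driver code):

/* Producer (veth_xmit), when ptr_ring_produce() reports the ring full: */
netif_tx_stop_queue(txq);              /* stop own TXQ                  */
smp_mb__after_atomic();                /* order stop vs. empty re-check */
if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
        netif_tx_wake_queue(txq);      /* consumer drained it meanwhile */

/* Consumer (veth_xdp_rcv), after draining the ring: */
if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
        netif_tx_wake_queue(peer_txq); /* restart the stopped producer  */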
--Jesper
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-06-10 21:40 ` Jesper Dangaard Brouer
@ 2025-06-11 0:25 ` Ihor Solodrai
2025-06-11 7:40 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 15+ messages in thread
From: Ihor Solodrai @ 2025-06-11 0:25 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, Jakub Kicinski, Bastien Curutchet
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil, Sebastian Andrzej Siewior
On 6/10/25 2:40 PM, Jesper Dangaard Brouer wrote:
>
>
> On 10/06/2025 20.26, Ihor Solodrai wrote:
>> On 6/10/25 8:56 AM, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 10/06/2025 13.43, Jesper Dangaard Brouer wrote:
>>>>
>>>> On 10/06/2025 00.09, Ihor Solodrai wrote:
>>> [...]
>>>>
>>>> Can you give me the output from below command (on your compiled
>>>> kernel):
>>>>
>>>> ./scripts/faddr2line drivers/net/veth.o veth_xdp_rcv.constprop.0+0x6b
>>>>
>>>
>>> Still need above data/info please.
>>
>> root@devvm7589:/ci/workspace# ./scripts/faddr2line ./kout.gcc/drivers/net/veth.o veth_xdp_rcv.constprop.0+0x6b
>> veth_xdp_rcv.constprop.0+0x6b/0x390:
>> netdev_get_tx_queue at /ci/workspace/kout.gcc/../include/linux/netdevice.h:2637
>> (inlined by) veth_xdp_rcv at /ci/workspace/kout.gcc/../drivers/net/veth.c:912
>>
>> Which is:
>>
>> veth.c:912
>> struct veth_priv *priv = netdev_priv(rq->dev);
>> int queue_idx = rq->xdp_rxq.queue_index;
>> struct netdev_queue *peer_txq;
>> struct net_device *peer_dev;
>> int i, done = 0, n_xdpf = 0;
>> void *xdpf[VETH_XDP_BATCH];
>>
>> /* NAPI functions as RCU section */
>> peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>> ---> peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
>>
>> netdevice.h:2637
>> static inline
>> struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
>> unsigned int index)
>> {
>> DEBUG_NET_WARN_ON_ONCE(index >= dev->num_tx_queues);
>> ---> return &dev->_tx[index];
>> }
>>
>> So the suspect is peer_dev (priv->peer)?
>
> Yes, this is the problem!
>
> So, it seems that peer_dev (priv->peer) can become a NULL pointer.
>
> Managed to reproduce - via manually deleting the peer device:
> - ip link delete dev veth42
> - while overloading veth41 via XDP redirecting packets into it.
>
> Managed to trigger concurrent crashes on two CPUs (C0 + C3)
> - so below output gets interlaced a bit:
>
> [...]
>
> A fix could look like this:
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index e58a0f1b5c5b..a3046142cb8e 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -909,7 +909,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>
> /* NAPI functions as RCU section */
> peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
> - peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
> + peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>
> for (i = 0; i < budget; i++) {
> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
> @@ -959,7 +959,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> rq->stats.vs.xdp_packets += done;
> u64_stats_update_end(&rq->stats.syncp);
>
> - if (unlikely(netif_tx_queue_stopped(peer_txq)))
> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
> netif_tx_wake_queue(peer_txq);
>
Great! I presume you will send a patch separately?
>
>
>
> --Jesper
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
2025-06-11 0:25 ` Ihor Solodrai
@ 2025-06-11 7:40 ` Jesper Dangaard Brouer
2025-06-11 12:40 ` [PATCH net V1] veth: prevent NULL pointer dereference in veth_xdp_rcv Jesper Dangaard Brouer
0 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-11 7:40 UTC (permalink / raw)
To: Ihor Solodrai, netdev, Jakub Kicinski, Bastien Curutchet
Cc: bpf, tom, Eric Dumazet, David S. Miller, Paolo Abeni,
Toke Høiland-Jørgensen, dsahern, makita.toshiaki,
kernel-team, phil, Sebastian Andrzej Siewior
On 11/06/2025 02.25, Ihor Solodrai wrote:
> On 6/10/25 2:40 PM, Jesper Dangaard Brouer wrote:
>>
>>
>> On 10/06/2025 20.26, Ihor Solodrai wrote:
>>> On 6/10/25 8:56 AM, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 10/06/2025 13.43, Jesper Dangaard Brouer wrote:
>>>>>
>>>>> On 10/06/2025 00.09, Ihor Solodrai wrote:
>>>> [...]
>>>>>
>>>>> Can you give me the output from below command (on your compiled
>>>>> kernel):
>>>>>
>>>>> ./scripts/faddr2line drivers/net/veth.o veth_xdp_rcv.constprop.0+0x6b
>>>>>
>>>>
>>>> Still need above data/info please.
>>>
>>> root@devvm7589:/ci/workspace# ./scripts/faddr2line ./kout.gcc/drivers/net/veth.o veth_xdp_rcv.constprop.0+0x6b
>>> veth_xdp_rcv.constprop.0+0x6b/0x390:
>>> netdev_get_tx_queue at /ci/workspace/kout.gcc/../include/linux/netdevice.h:2637
>>> (inlined by) veth_xdp_rcv at /ci/workspace/kout.gcc/../drivers/net/veth.c:912
>>>
>>> Which is:
>>>
>>> veth.c:912
>>> struct veth_priv *priv = netdev_priv(rq->dev);
>>> int queue_idx = rq->xdp_rxq.queue_index;
>>> struct netdev_queue *peer_txq;
>>> struct net_device *peer_dev;
>>> int i, done = 0, n_xdpf = 0;
>>> void *xdpf[VETH_XDP_BATCH];
>>>
>>> /* NAPI functions as RCU section */
>>> peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>>> ---> peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
>>>
>>> netdevice.h:2637
>>> static inline
>>> struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
>>> unsigned int index)
>>> {
>>> DEBUG_NET_WARN_ON_ONCE(index >= dev->num_tx_queues);
>>> ---> return &dev->_tx[index];
>>> }
>>>
>>> So the suspect is peer_dev (priv->peer)?
>>
>> Yes, this is the problem!
>>
>> So, it seems that peer_dev (priv->peer) can become a NULL pointer.
>>
>> Managed to reproduce - via manually deleting the peer device:
>> - ip link delete dev veth42
>> - while overloading veth41 via XDP redirecting packets into it.
>>
>> Managed to trigger concurrent crashes on two CPUs (C0 + C3)
>> - so below output gets interlaced a bit:
>>
>> [...]
>>
>> A fix could look like this:
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index e58a0f1b5c5b..a3046142cb8e 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -909,7 +909,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>
>> /* NAPI functions as RCU section */
>> peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>> - peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
>> + peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>>
>> for (i = 0; i < budget; i++) {
>> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>> @@ -959,7 +959,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>> rq->stats.vs.xdp_packets += done;
>> u64_stats_update_end(&rq->stats.syncp);
>>
>> - if (unlikely(netif_tx_queue_stopped(peer_txq)))
>> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
>> netif_tx_wake_queue(peer_txq);
>>
>
> Great! I presume you will send a patch separately?
Yes, I will send this as a separate patch. In a couple of hours (first
some breakfast and a walk with the dog ;-)).
--Jesper
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH net V1] veth: prevent NULL pointer dereference in veth_xdp_rcv
2025-06-11 7:40 ` Jesper Dangaard Brouer
@ 2025-06-11 12:40 ` Jesper Dangaard Brouer
2025-06-11 16:00 ` Ihor Solodrai
2025-06-12 15:20 ` patchwork-bot+netdevbpf
0 siblings, 2 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-11 12:40 UTC (permalink / raw)
To: netdev, ihor.solodrai
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
Jakub Kicinski, Paolo Abeni, bpf, linux-kernel, kernel-team
The veth peer device is RCU protected, but when the peer device gets
deleted (veth_dellink), the peer pointer is set to NULL (via
RCU_INIT_POINTER).
This patch adds a necessary NULL check in veth_xdp_rcv when accessing
the veth peer net_device.
This fixes a bug introduced in commit dc82a33297fc ("veth: apply qdisc
backpressure on full ptr_ring to reduce TX drops"). The bug is a race
and only triggers when there are in-flight packets on a veth device
that is being deleted.
Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Closes: https://lore.kernel.org/all/fecfcad0-7a16-42b8-bff2-66ee83a6e5c4@linux.dev/
Reported-by: syzbot+c4c7bf27f6b0c4bd97fe@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/683da55e.a00a0220.d8eae.0052.GAE@google.com/
Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
Commit dc82a33297fc has reached Linus's tree as v6.16-rc1~132^2~208^2.
Thus, we can correct this before the release of v6.16.
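For reference, the writer side described in the commit message looks
roughly like this; a simplified sketch of veth_dellink(), not the
verbatim upstream code:

static void veth_dellink(struct net_device *dev, struct list_head *head)
{
        struct veth_priv *priv = netdev_priv(dev);
        struct net_device *peer = rtnl_dereference(priv->peer);

        /* After this, RCU readers (e.g. veth_xdp_rcv) can observe a
         * NULL peer until the grace period ends -- hence the check. */
        RCU_INIT_POINTER(priv->peer, NULL);
        unregister_netdevice_queue(dev, head);

        if (peer) {
                priv = netdev_priv(peer);
                RCU_INIT_POINTER(priv->peer, NULL);
                unregister_netdevice_queue(peer, head);
        }
}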
drivers/net/veth.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index e58a0f1b5c5b..a3046142cb8e 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -909,7 +909,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
/* NAPI functions as RCU section */
peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
- peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
+ peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -959,7 +959,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
rq->stats.vs.xdp_packets += done;
u64_stats_update_end(&rq->stats.syncp);
- if (unlikely(netif_tx_queue_stopped(peer_txq)))
+ if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
netif_tx_wake_queue(peer_txq);
return done;
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH net V1] veth: prevent NULL pointer dereference in veth_xdp_rcv
2025-06-11 12:40 ` [PATCH net V1] veth: prevent NULL pointer dereference in veth_xdp_rcv Jesper Dangaard Brouer
@ 2025-06-11 16:00 ` Ihor Solodrai
2025-06-12 15:20 ` patchwork-bot+netdevbpf
1 sibling, 0 replies; 15+ messages in thread
From: Ihor Solodrai @ 2025-06-11 16:00 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni, bpf,
linux-kernel, kernel-team
On 6/11/25 5:40 AM, Jesper Dangaard Brouer wrote:
> The veth peer device is RCU protected, but when the peer device gets
> deleted (veth_dellink), the peer pointer is set to NULL (via
> RCU_INIT_POINTER).
>
> This patch adds a necessary NULL check in veth_xdp_rcv when accessing
> the veth peer net_device.
>
> This fixes a bug introduced in commit dc82a33297fc ("veth: apply qdisc
> backpressure on full ptr_ring to reduce TX drops"). The bug is a race
> and only triggers when there are in-flight packets on a veth device
> that is being deleted.
>
> Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
> Closes: https://lore.kernel.org/all/fecfcad0-7a16-42b8-bff2-66ee83a6e5c4@linux.dev/
> Reported-by: syzbot+c4c7bf27f6b0c4bd97fe@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/683da55e.a00a0220.d8eae.0052.GAE@google.com/
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> Commit dc82a33297fc have reached Linus'es tree in v6.16-rc1~132^2~208^2.
> Thus, we can correct this before the release of v6.16.
>
> drivers/net/veth.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index e58a0f1b5c5b..a3046142cb8e 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -909,7 +909,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>
> /* NAPI functions as RCU section */
> peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
> - peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
> + peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>
> for (i = 0; i < budget; i++) {
> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
> @@ -959,7 +959,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> rq->stats.vs.xdp_packets += done;
> u64_stats_update_end(&rq->stats.syncp);
>
> - if (unlikely(netif_tx_queue_stopped(peer_txq)))
> + if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
> netif_tx_wake_queue(peer_txq);
>
> return done;
>
>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Thank you!
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net V1] veth: prevent NULL pointer dereference in veth_xdp_rcv
2025-06-11 12:40 ` [PATCH net V1] veth: prevent NULL pointer dereference in veth_xdp_rcv Jesper Dangaard Brouer
2025-06-11 16:00 ` Ihor Solodrai
@ 2025-06-12 15:20 ` patchwork-bot+netdevbpf
1 sibling, 0 replies; 15+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-06-12 15:20 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, ihor.solodrai, eric.dumazet, davem, kuba, pabeni, bpf,
linux-kernel, kernel-team
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 11 Jun 2025 14:40:04 +0200 you wrote:
> The veth peer device is RCU protected, but when the peer device gets
> deleted (veth_dellink), the peer pointer is set to NULL (via
> RCU_INIT_POINTER).
>
> This patch adds a necessary NULL check in veth_xdp_rcv when accessing
> the veth peer net_device.
>
> [...]
Here is the summary with links:
- [net,V1] veth: prevent NULL pointer dereference in veth_xdp_rcv
https://git.kernel.org/netdev/net/c/9337c54401a5
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2025-06-12 15:20 UTC | newest]
Thread overview: 15+ messages
-- links below jump to the message on this page --
2025-04-25 14:55 [PATCH net-next V7 0/2] veth: qdisc backpressure and qdisc check refactor Jesper Dangaard Brouer
2025-04-25 14:55 ` [PATCH net-next V7 1/2] net: sched: generalize check for no-queue qdisc on TX queue Jesper Dangaard Brouer
2025-04-25 14:55 ` [PATCH net-next V7 2/2] veth: apply qdisc backpressure on full ptr_ring to reduce TX drops Jesper Dangaard Brouer
2025-04-26 5:40 ` Toshiaki Makita
2025-06-09 22:09 ` Ihor Solodrai
2025-06-10 11:43 ` Jesper Dangaard Brouer
2025-06-10 15:56 ` Jesper Dangaard Brouer
2025-06-10 18:26 ` Ihor Solodrai
2025-06-10 21:40 ` Jesper Dangaard Brouer
2025-06-11 0:25 ` Ihor Solodrai
2025-06-11 7:40 ` Jesper Dangaard Brouer
2025-06-11 12:40 ` [PATCH net V1] veth: prevent NULL pointer dereference in veth_xdp_rcv Jesper Dangaard Brouer
2025-06-11 16:00 ` Ihor Solodrai
2025-06-12 15:20 ` patchwork-bot+netdevbpf
2025-04-28 22:20 ` [PATCH net-next V7 0/2] veth: qdisc backpressure and qdisc check refactor patchwork-bot+netdevbpf