public inbox for virtualization@lists.linux-foundation.org
* [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops
@ 2026-03-12 13:06 Simon Schippers
  2026-03-12 13:06 ` [PATCH net-next v8 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Simon Schippers @ 2026-03-12 13:06 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

This patch series addresses tun/tap & vhost-net, which drop incoming
SKBs whenever their internal ptr_ring buffer is full. With this patch
series, the associated netdev queue is stopped instead - but only when a
qdisc is attached. If no qdisc is present, the existing behavior is
preserved. The series touches tun/tap and vhost-net together because
they share common logic; modifying only one of them would break the
other.
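
As a rough illustration of the policy change, the full-ring behavior can
be sketched in userspace C as follows (the enum and function names are
hypothetical stand-ins, not part of the series):

```c
#include <assert.h>
#include <stdbool.h>

enum tx_result { TX_OK, TX_STOPPED, TX_DROPPED };

/* Hypothetical sketch: when the ptr_ring is full, stop the netdev
 * queue if a qdisc is attached; otherwise keep the existing
 * tail-drop behavior. */
static enum tx_result tx_policy(bool ring_full, bool qdisc_present)
{
	if (!ring_full)
		return TX_OK;              /* entry produced as before */
	return qdisc_present ? TX_STOPPED  /* backpressure via qdisc */
			     : TX_DROPPED; /* legacy tail-drop */
}
```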

By applying proper backpressure, this change allows the connected qdisc to 
operate correctly, as reported in [1], and significantly improves
performance in real-world scenarios, as demonstrated in our paper [2]. For 
example, we observed a 36% TCP throughput improvement for an OpenVPN 
connection between Germany and the USA.

Synthetic pktgen benchmarks indicate a slight regression.
Pktgen benchmarks are provided per commit, with the final commit showing
the overall performance.

Thanks!

[1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
[2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
[3] Link: https://lore.kernel.org/r/174549940981.608169.4363875844729313831.stgit@firesoul
[4] Link: https://lore.kernel.org/r/176295323282.307447.14790015927673763094.stgit@firesoul

---
Changelog:
V8:
- Drop the code changes in drivers/net/tap.c; the code there deals with
  ipvtap/macvtap, which are unrelated to the goal of this patch series
  (something I had not realized before)
-> Greatly simplified logic: 4 commits instead of 9
-> No more duplicated logic or special-casing in vhost required
- Only wake the queue after it stopped and half of the ring was
  consumed, as suggested by MST
-> Performance improvements for TAP, but still slightly slower
- Better benchmarking with pinned threads, an XDP drop program for
  tap+vhost-net and disabled CPU mitigations (on a newer Ryzen 5 5600X
  processor), as suggested by Jason Wang

V7: https://lore.kernel.org/netdev/20260107210448.37851-1-simon.schippers@tu-dortmund.de/
- Switch to an approach similar to veth [3] (excluding the recently fixed 
variant [4]), as suggested by MST, with minor adjustments discussed in V6
- Rename the cover-letter title
- Add multithreaded pktgen and iperf3 benchmarks, as suggested by Jason 
Wang
- Rework __ptr_ring_consume_created_space() so it can also be used after 
batched consume

V6: https://lore.kernel.org/netdev/20251120152914.1127975-1-simon.schippers@tu-dortmund.de/
General:
- Major adjustments to the descriptions. Special thanks to Jon Kohler!
- Fix git bisect by moving most logic into dedicated functions and only 
starting to use them in patch 7.
- Moved the main logic of the coupled producer and consumer into a single 
patch to avoid a chicken-and-egg dependency between commits :-)
- Rebased to 6.18-rc5 and ran the benchmarks again; they now also 
include lost packets (previously I missed a 0, so all benchmark results 
were higher by a factor of 10...).
- Also include the benchmark in patch 7.

Producer:
- Move logic into the new helper tun_ring_produce()
- Added a smp_rmb() paired with the consumer, ensuring freed space of the 
consumer is visible
- Assume that ptr_ring is not full when __ptr_ring_full_next() is called

Consumer:
- Use an unpaired smp_rmb() instead of barrier() to ensure that the 
netdev_tx_queue_stopped() call completes before discarding
- Also wake the netdev queue if it was stopped before discarding and then 
becomes empty
-> Fixes race with producer as identified by MST in V5
-> Waking the netdev queues upon resize is not required anymore
- Use __ptr_ring_consume_created_space() instead of messing with ptr_ring 
internals
-> Batched consume now just calls 
__tun_ring_consume()/__tap_ring_consume() in a loop
- Added an smp_wmb() before waking the netdev queue which is paired with 
the smp_rmb() discussed above

V5: https://lore.kernel.org/netdev/20250922221553.47802-1-simon.schippers@tu-dortmund.de/T/#u
- Stop the netdev queue prior to producing the final fitting ptr_ring entry
-> Ensures the consumer has the latest netdev queue state, making it safe 
to wake the queue
-> Resolves an issue in vhost-net where the netdev queue could remain 
stopped despite being empty
-> For TUN/TAP, the netdev queue no longer needs to be woken in the 
blocking loop
-> Introduces new helpers __ptr_ring_full_next and 
__ptr_ring_will_invalidate for this purpose
- vhost-net now uses wrappers of TUN/TAP for ptr_ring consumption rather 
than maintaining its own rx_ring pointer

V4: https://lore.kernel.org/netdev/20250902080957.47265-1-simon.schippers@tu-dortmund.de/T/#u
- Target net-next instead of net
- Changed to patch series instead of single patch
- Changed the title from the previous
"TUN/TAP: Improving throughput and latency by avoiding SKB drops"
- Wake netdev queue with new helpers wake_netdev_queue when there is any 
spare capacity in the ptr_ring instead of waiting for it to be empty
- Use tun_file instead of tun_struct in tun_ring_recv for more 
consistent logic
- Use smp_wmb() and smp_rmb() barrier pair, which avoids any packet drops 
that happened rarely before
- Use safer logic for vhost-net using RCU read locks to access TUN/TAP data

V3: https://lore.kernel.org/netdev/20250825211832.84901-1-simon.schippers@tu-dortmund.de/T/#u
- Added support for TAP and TAP+vhost-net.

V2: https://lore.kernel.org/netdev/20250811220430.14063-1-simon.schippers@tu-dortmund.de/T/#u
- Removed NETDEV_TX_BUSY return case in tun_net_xmit and removed 
unnecessary netif_tx_wake_queue in tun_ring_recv.

V1: https://lore.kernel.org/netdev/20250808153721.261334-1-simon.schippers@tu-dortmund.de/T/#u
---

Simon Schippers (4):
  tun/tap: add ptr_ring consume helper with netdev queue wakeup
  vhost-net: wake queue of tun/tap after ptr_ring consume
  ptr_ring: move free-space check into separate helper
  tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present

 drivers/net/tun.c        | 91 +++++++++++++++++++++++++++++++++++++---
 drivers/vhost/net.c      | 15 +++++--
 include/linux/if_tun.h   |  3 ++
 include/linux/ptr_ring.h | 14 ++++++-
 4 files changed, 111 insertions(+), 12 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net-next v8 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-03-12 13:06 [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
@ 2026-03-12 13:06 ` Simon Schippers
  2026-03-24  1:47   ` Jason Wang
  2026-03-12 13:06 ` [PATCH net-next v8 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume Simon Schippers
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Simon Schippers @ 2026-03-12 13:06 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Introduce tun_ring_consume() that wraps ptr_ring_consume() and calls
__tun_wake_queue(). The latter wakes the stopped netdev subqueue once
half of the ring capacity has been consumed, tracked via the new
cons_cnt field in tun_file. When the ring is empty the queue is also
woken to handle potential races.
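
In userspace terms, the wake policy can be modeled roughly as follows
(struct model and its fields are illustrative stand-ins for
tfile->tx_ring.size, tfile->cons_cnt and the netdev subqueue state,
not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

struct model {
	int ring_size;  /* total ring capacity */
	int entries;    /* entries currently in the ring */
	int cons_cnt;   /* entries consumed since the last wake */
	bool stopped;   /* netdev subqueue stopped? */
	int wakeups;    /* how often the queue was woken */
};

/* Mirror of the __tun_wake_queue() policy: wake a stopped queue only
 * once half of the ring capacity has been consumed, or immediately
 * when the ring is empty (to handle potential races). */
static void model_wake_queue(struct model *m)
{
	if (m->entries == 0)  /* empty ring: always wake */
		goto wake;

	if (!m->stopped || ++m->cons_cnt < m->ring_size / 2)
		return;       /* not stopped, or below the threshold */

wake:
	m->stopped = false;
	m->wakeups++;
	m->cons_cnt = 0;
}

/* consume one entry, then run the wake policy, as tun_ring_consume()
 * does after a successful ptr_ring_consume() */
static void model_consume(struct model *m)
{
	if (m->entries > 0) {
		m->entries--;
		model_wake_queue(m);
	}
}
```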

Without the corresponding queue stopping (introduced in a subsequent
commit), this patch alone causes no regression for a tap setup sending
to a qemu VM: 1.151 Mpps to 1.153 Mpps.

Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, pktgen sender; Avg over 20 runs @ 100,000,000 packets;
SRSO and spectre v2 mitigations disabled.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tun.c | 40 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c492fda6fc15..a82d665dab5f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -145,6 +145,7 @@ struct tun_file {
 	struct list_head next;
 	struct tun_struct *detached;
 	struct ptr_ring tx_ring;
+	int cons_cnt;
 	struct xdp_rxq_info xdp_rxq;
 };
 
@@ -564,6 +565,7 @@ static void tun_queue_purge(struct tun_file *tfile)
 	while ((ptr = ptr_ring_consume(&tfile->tx_ring)) != NULL)
 		tun_ptr_free(ptr);
 
+	tfile->cons_cnt = 0;
 	skb_queue_purge(&tfile->sk.sk_write_queue);
 	skb_queue_purge(&tfile->sk.sk_error_queue);
 }
@@ -730,6 +732,7 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
 		goto out;
 	}
 
+	tfile->cons_cnt = 0;
 	tfile->queue_index = tun->numqueues;
 	tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
 
@@ -2113,13 +2116,39 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	return total;
 }
 
-static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
+static void __tun_wake_queue(struct tun_struct *tun, struct tun_file *tfile)
+{
+	if (ptr_ring_empty(&tfile->tx_ring))
+		goto wake;
+
+	if (!__netif_subqueue_stopped(tun->dev, tfile->queue_index) ||
+	    ++tfile->cons_cnt < tfile->tx_ring.size / 2)
+		return;
+
+wake:
+	netif_wake_subqueue(tun->dev, tfile->queue_index);
+	tfile->cons_cnt = 0;
+}
+
+static void *tun_ring_consume(struct tun_struct *tun, struct tun_file *tfile)
+{
+	void *ptr;
+
+	ptr = ptr_ring_consume(&tfile->tx_ring);
+	if (ptr)
+		__tun_wake_queue(tun, tfile);
+
+	return ptr;
+}
+
+static void *tun_ring_recv(struct tun_struct *tun, struct tun_file *tfile,
+			   int noblock, int *err)
 {
 	DECLARE_WAITQUEUE(wait, current);
 	void *ptr = NULL;
 	int error = 0;
 
-	ptr = ptr_ring_consume(&tfile->tx_ring);
+	ptr = tun_ring_consume(tun, tfile);
 	if (ptr)
 		goto out;
 	if (noblock) {
@@ -2131,7 +2160,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 
 	while (1) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		ptr = ptr_ring_consume(&tfile->tx_ring);
+		ptr = tun_ring_consume(tun, tfile);
 		if (ptr)
 			break;
 		if (signal_pending(current)) {
@@ -2168,7 +2197,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 
 	if (!ptr) {
 		/* Read frames from ring */
-		ptr = tun_ring_recv(tfile, noblock, &err);
+		ptr = tun_ring_recv(tun, tfile, noblock, &err);
 		if (!ptr)
 			return err;
 	}
@@ -3404,6 +3433,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
 		return -ENOMEM;
 	}
 
+	tfile->cons_cnt = 0;
+
 	mutex_init(&tfile->napi_mutex);
 	RCU_INIT_POINTER(tfile->tun, NULL);
 	tfile->flags = 0;
@@ -3612,6 +3643,7 @@ static int tun_queue_resize(struct tun_struct *tun)
 	for (i = 0; i < tun->numqueues; i++) {
 		tfile = rtnl_dereference(tun->tfiles[i]);
 		rings[i] = &tfile->tx_ring;
+		tfile->cons_cnt = 0;
 	}
 	list_for_each_entry(tfile, &tun->disabled, next)
 		rings[i++] = &tfile->tx_ring;
-- 
2.43.0



* [PATCH net-next v8 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume
  2026-03-12 13:06 [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
  2026-03-12 13:06 ` [PATCH net-next v8 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
@ 2026-03-12 13:06 ` Simon Schippers
  2026-03-12 13:54   ` Michael S. Tsirkin
  2026-03-24  1:47   ` Jason Wang
  2026-03-12 13:06 ` [PATCH net-next v8 3/4] ptr_ring: move free-space check into separate helper Simon Schippers
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Simon Schippers @ 2026-03-12 13:06 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Add tun_wake_queue() to tun.c and export it for use by vhost-net. The
function validates that the file belongs to a tun/tap device,
dereferences the tun_struct under RCU, and delegates to
__tun_wake_queue().

vhost_net_buf_produce() now calls tun_wake_queue() after a successful
batched consume of the ring to allow the netdev subqueue to be woken up.

Without the corresponding queue stopping (introduced in a subsequent
commit), this patch alone causes a slight throughput regression for a
tap+vhost-net setup sending to a qemu VM:
3.948 Mpps to 3.888 Mpps (-1.5%).

Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, XDP drop program active in VM, pktgen sender; Avg over
20 runs @ 100,000,000 packets. SRSO and spectre v2 mitigations disabled.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tun.c      | 21 +++++++++++++++++++++
 drivers/vhost/net.c    | 15 +++++++++++----
 include/linux/if_tun.h |  3 +++
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a82d665dab5f..b86582cc6cb6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3760,6 +3760,27 @@ struct ptr_ring *tun_get_tx_ring(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_tx_ring);
 
+void tun_wake_queue(struct file *file)
+{
+	struct tun_file *tfile;
+	struct tun_struct *tun;
+
+	if (file->f_op != &tun_fops)
+		return;
+	tfile = file->private_data;
+	if (!tfile)
+		return;
+
+	rcu_read_lock();
+
+	tun = rcu_dereference(tfile->tun);
+	if (tun)
+		__tun_wake_queue(tun, tfile);
+
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(tun_wake_queue);
+
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 80965181920c..c8ef804ef28c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -176,13 +176,19 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
 	return ret;
 }
 
-static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_produce(struct sock *sk,
+				 struct vhost_net_virtqueue *nvq)
 {
+	struct file *file = sk->sk_socket->file;
 	struct vhost_net_buf *rxq = &nvq->rxq;
 
 	rxq->head = 0;
 	rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
 					      VHOST_NET_BATCH);
+
+	if (rxq->tail)
+		tun_wake_queue(file);
+
 	return rxq->tail;
 }
 
@@ -209,14 +215,15 @@ static int vhost_net_buf_peek_len(void *ptr)
 	return __skb_array_len_with_tag(ptr);
 }
 
-static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_peek(struct sock *sk,
+			      struct vhost_net_virtqueue *nvq)
 {
 	struct vhost_net_buf *rxq = &nvq->rxq;
 
 	if (!vhost_net_buf_is_empty(rxq))
 		goto out;
 
-	if (!vhost_net_buf_produce(nvq))
+	if (!vhost_net_buf_produce(sk, nvq))
 		return 0;
 
 out:
@@ -995,7 +1002,7 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 	unsigned long flags;
 
 	if (rvq->rx_ring)
-		return vhost_net_buf_peek(rvq);
+		return vhost_net_buf_peek(sk, rvq);
 
 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 80166eb62f41..ab3b4ebca059 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -22,6 +22,7 @@ struct tun_msg_ctl {
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
+void tun_wake_queue(struct file *file);
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -55,6 +56,8 @@ static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
 	return ERR_PTR(-EINVAL);
 }
 
+static inline void tun_wake_queue(struct file *f) {}
+
 static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
-- 
2.43.0



* [PATCH net-next v8 3/4] ptr_ring: move free-space check into separate helper
  2026-03-12 13:06 [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
  2026-03-12 13:06 ` [PATCH net-next v8 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
  2026-03-12 13:06 ` [PATCH net-next v8 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume Simon Schippers
@ 2026-03-12 13:06 ` Simon Schippers
  2026-03-12 13:17   ` Eric Dumazet
  2026-03-12 13:06 ` [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present Simon Schippers
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Simon Schippers @ 2026-03-12 13:06 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

This patch moves the check for available free space for a new entry into
a separate function. As a result, __ptr_ring_produce() remains logically
unchanged, while the new helper allows callers to determine in advance
whether subsequent __ptr_ring_produce() calls will succeed. This
information can, for example, be used to temporarily stop producing until
__ptr_ring_peek() indicates that space is available again.
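
A minimal userspace model of the helper's contract might look like the
following (names shortened; the kernel version also handles the
!r->size case and returns -ENOSPC rather than -1):

```c
#include <assert.h>

#define RING_SIZE 4

/* A slot holding a non-NULL pointer is still owned by the consumer;
 * the producer may only write into NULL slots. */
static void *ring[RING_SIZE];
static int producer;

/* stand-in for __ptr_ring_produce_peek(): would a produce succeed? */
static int produce_peek(void)
{
	return ring[producer] ? -1 : 0;
}

/* stand-in for __ptr_ring_produce(), built on the peek helper so the
 * produce path itself stays logically unchanged */
static int produce(void *ptr)
{
	if (produce_peek())
		return -1;
	ring[producer] = ptr;
	if (++producer >= RING_SIZE)
		producer = 0;
	return 0;
}
```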

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 include/linux/ptr_ring.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 534531807d95..a5a3fa4916d3 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -96,6 +96,14 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
 	return ret;
 }
 
+static inline int __ptr_ring_produce_peek(struct ptr_ring *r)
+{
+	if (unlikely(!r->size) || r->queue[r->producer])
+		return -ENOSPC;
+
+	return 0;
+}
+
 /* Note: callers invoking this in a loop must use a compiler barrier,
  * for example cpu_relax(). Callers must hold producer_lock.
  * Callers are responsible for making sure pointer that is being queued
@@ -103,8 +111,10 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
  */
 static inline int __ptr_ring_produce(struct ptr_ring *r, void *ptr)
 {
-	if (unlikely(!r->size) || r->queue[r->producer])
-		return -ENOSPC;
+	int p = __ptr_ring_produce_peek(r);
+
+	if (p)
+		return p;
 
 	/* Make sure the pointer we are storing points to a valid data. */
 	/* Pairs with the dependency ordering in __ptr_ring_consume. */
-- 
2.43.0



* [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-03-12 13:06 [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (2 preceding siblings ...)
  2026-03-12 13:06 ` [PATCH net-next v8 3/4] ptr_ring: move free-space check into separate helper Simon Schippers
@ 2026-03-12 13:06 ` Simon Schippers
  2026-03-24  1:47   ` Jason Wang
  2026-03-12 13:55 ` [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Michael S. Tsirkin
  2026-03-23 21:49 ` Simon Schippers
  5 siblings, 1 reply; 21+ messages in thread
From: Simon Schippers @ 2026-03-12 13:06 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

This commit prevents tail-drop when a qdisc is present and the ptr_ring
becomes full. Once an entry is successfully produced and the ptr_ring
reaches capacity, the netdev queue is stopped instead of dropping
subsequent packets.

If producing an entry fails anyway due to a race, tun_net_xmit()
returns NETDEV_TX_BUSY, again avoiding a drop. Such races are expected
because LLTX is enabled and the transmit path operates without the
usual locking.

The existing __tun_wake_queue() function wakes the netdev queue. Races
between this wakeup and the queue-stop logic could leave the queue
stopped indefinitely. To prevent this, a memory barrier is enforced
(as discussed in a similar implementation in [1]), followed by a recheck
that wakes the queue if space is already available.
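
Ignoring the actual cross-CPU barrier, the stop-then-recheck pattern
can be sketched like this (ring_full is an illustrative stand-in for
__ptr_ring_produce_peek(), the bool for the netdev queue state):

```c
#include <assert.h>
#include <stdbool.h>

/* Producer-side race avoidance: after producing the last fitting
 * entry, stop the queue, then re-check the ring. If the consumer
 * freed space in between (and may have missed the stop), wake the
 * queue ourselves instead of leaving it stopped indefinitely.
 * In the kernel, smp_mb__after_atomic() orders the stop before the
 * re-read of the ring state. */
static bool stop_and_recheck(bool (*ring_full)(void))
{
	bool stopped = true;     /* netif_tx_stop_queue() */

	/* smp_mb__after_atomic() would sit here */

	if (!ring_full())        /* consumer freed a slot meanwhile? */
		stopped = false; /* netif_tx_wake_queue() */

	return stopped;
}

/* two interleavings, modeled as ring-state callbacks */
static bool still_full(void)      { return true; }
static bool freed_meanwhile(void) { return false; }
```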

If no qdisc is present, the previous tail-drop behavior is preserved.

Benchmarks:
The benchmarks show a slight regression in raw transmission performance,
though no packets are lost anymore.

The previously introduced threshold of waking only after the queue
stopped and half of the ring was consumed proved to be a decent choice:
waking the queue whenever a consume frees space in the ring strongly
degrades performance for tap, while waking only when the ring is empty
is too late and also hurts throughput for both tap and tap+vhost-net.
Other ratios (3/4, 7/8) showed similar results (not shown here), so
1/2 was chosen for the sake of simplicity for both tun/tap and
tun/tap+vhost-net.

Test setup:
AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
mitigations disabled.

Note for tap+vhost-net:
An XDP drop program is active in the VM -> ~2.5x faster than plain tap,
which is slower due to more syscalls (high utilization of
entry_SYSRETQ_unsafe_stack in perf).

+--------------------------+--------------+----------------+----------+
| 1 thread                 | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 1.151 Mpps   | 1.139 Mpps     | -1.1%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 3.606 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 3.948 Mpps   | 3.738 Mpps     | -5.3%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 496.5 Kpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+

+--------------------------+--------------+----------------+----------+
| 2 threads                | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 1.133 Mpps   | 1.109 Mpps     | -2.1%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 8.269 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 3.820 Mpps   | 3.513 Mpps     | -8.0%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 4.961 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+

[1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b86582cc6cb6..9b7daec69acd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct netdev_queue *queue;
 	struct tun_file *tfile;
 	int len = skb->len;
+	bool qdisc_present;
+	int ret;
 
 	rcu_read_lock();
 	tfile = rcu_dereference(tun->tfiles[txq]);
@@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset_ct(skb);
 
-	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+	queue = netdev_get_tx_queue(dev, txq);
+	qdisc_present = !qdisc_txq_has_no_queue(queue);
+
+	spin_lock(&tfile->tx_ring.producer_lock);
+	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
+	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
+		netif_tx_stop_queue(queue);
+		/* Avoid races with queue wake-ups in __tun_wake_queue by
+		 * waking if space is available in a re-check.
+		 * The barrier makes sure that the stop is visible before
+		 * we re-check.
+		 */
+		smp_mb__after_atomic();
+		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
+			netif_tx_wake_queue(queue);
+	}
+	spin_unlock(&tfile->tx_ring.producer_lock);
+
+	if (ret) {
+		/* If a qdisc is attached to our virtual device,
+		 * returning NETDEV_TX_BUSY is allowed.
+		 */
+		if (qdisc_present) {
+			rcu_read_unlock();
+			return NETDEV_TX_BUSY;
+		}
 		drop_reason = SKB_DROP_REASON_FULL_RING;
 		goto drop;
 	}
 
 	/* dev->lltx requires to do our own update of trans_start */
-	queue = netdev_get_tx_queue(dev, txq);
 	txq_trans_cond_update(queue);
 
 	/* Notify and wake up reader process */
-- 
2.43.0



* Re: [PATCH net-next v8 3/4] ptr_ring: move free-space check into separate helper
  2026-03-12 13:06 ` [PATCH net-next v8 3/4] ptr_ring: move free-space check into separate helper Simon Schippers
@ 2026-03-12 13:17   ` Eric Dumazet
  2026-03-12 13:48     ` Michael S. Tsirkin
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2026-03-12 13:17 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Mar 12, 2026 at 2:06 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> This patch moves the check for available free space for a new entry into
> a separate function. As a result, __ptr_ring_produce() remains logically
> unchanged, while the new helper allows callers to determine in advance
> whether subsequent __ptr_ring_produce() calls will succeed. This
> information can, for example, be used to temporarily stop producing until
> __ptr_ring_peek() indicates that space is available again.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  include/linux/ptr_ring.h | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 534531807d95..a5a3fa4916d3 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -96,6 +96,14 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
>         return ret;
>  }
>
> +static inline int __ptr_ring_produce_peek(struct ptr_ring *r)
> +{
> +       if (unlikely(!r->size) || r->queue[r->producer])

I think this should be

       if (unlikely(!r->size) || READ_ONCE(r->queue[r->producer]))

And of course:

@@ -194,7 +194,7 @@ static inline void *__ptr_ring_peek(struct ptr_ring *r)
 static inline bool __ptr_ring_empty(struct ptr_ring *r)
 {
        if (likely(r->size))
-               return !r->queue[READ_ONCE(r->consumer_head)];
+               return !READ_ONCE(r->queue[READ_ONCE(r->consumer_head)]);
        return true;
 }

@@ -256,7 +256,7 @@ static inline void __ptr_ring_zero_tail(struct
ptr_ring *r, int consumer_head)
         * besides the first one until we write out all entries.
         */
        while (likely(head > r->consumer_tail))
-               r->queue[--head] = NULL;
+               WRITE_ONCE(r->queue[--head], NULL);

        r->consumer_tail = consumer_head;
 }


Presumably we should fix this in net tree first.


* Re: [PATCH net-next v8 3/4] ptr_ring: move free-space check into separate helper
  2026-03-12 13:17   ` Eric Dumazet
@ 2026-03-12 13:48     ` Michael S. Tsirkin
  2026-03-12 14:21       ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2026-03-12 13:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Simon Schippers, willemdebruijn.kernel, jasowang, andrew+netdev,
	davem, kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Thu, Mar 12, 2026 at 02:17:16PM +0100, Eric Dumazet wrote:
> On Thu, Mar 12, 2026 at 2:06 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
> >
> > This patch moves the check for available free space for a new entry into
> > a separate function. As a result, __ptr_ring_produce() remains logically
> > unchanged, while the new helper allows callers to determine in advance
> > whether subsequent __ptr_ring_produce() calls will succeed. This
> > information can, for example, be used to temporarily stop producing until
> > __ptr_ring_peek() indicates that space is available again.
> >
> > Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > ---
> >  include/linux/ptr_ring.h | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > index 534531807d95..a5a3fa4916d3 100644
> > --- a/include/linux/ptr_ring.h
> > +++ b/include/linux/ptr_ring.h
> > @@ -96,6 +96,14 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
> >         return ret;
> >  }
> >
> > +static inline int __ptr_ring_produce_peek(struct ptr_ring *r)
> > +{
> > +       if (unlikely(!r->size) || r->queue[r->producer])
> 
> I think this should be
> 
>        if (unlikely(!r->size) || READ_ONCE(r->queue[r->producer]))
> 
> And of course:
> 
> @@ -194,7 +194,7 @@ static inline void *__ptr_ring_peek(struct ptr_ring *r)
>  static inline bool __ptr_ring_empty(struct ptr_ring *r)
>  {
>         if (likely(r->size))
> -               return !r->queue[READ_ONCE(r->consumer_head)];
> +               return !READ_ONCE(r->queue[READ_ONCE(r->consumer_head)]);
>         return true;
>  }


I don't understand why it's necessary. consumer_head etc are
all lock protected.

queue itself is not but we are only checking it for NULL -
it is fine if compiler reads it in many chunks and not all
at once.

> @@ -256,7 +256,7 @@ static inline void __ptr_ring_zero_tail(struct
> ptr_ring *r, int consumer_head)
>          * besides the first one until we write out all entries.
>          */
>         while (likely(head > r->consumer_tail))
> -               r->queue[--head] = NULL;
> +               WRITE_ONCE(r->queue[--head], NULL);
> 
>         r->consumer_tail = consumer_head;
>  }
> 
> 
> Presumably we should fix this in net tree first.


Maybe this one yes but I am not sure at all - KCSAN is happy.


-- 
MST



* Re: [PATCH net-next v8 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume
  2026-03-12 13:06 ` [PATCH net-next v8 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume Simon Schippers
@ 2026-03-12 13:54   ` Michael S. Tsirkin
  2026-03-24  1:47   ` Jason Wang
  1 sibling, 0 replies; 21+ messages in thread
From: Michael S. Tsirkin @ 2026-03-12 13:54 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Thu, Mar 12, 2026 at 02:06:37PM +0100, Simon Schippers wrote:
> Add tun_wake_queue() to tun.c and export it for use by vhost-net. The
> function validates that the file belongs to a tun/tap device,
> dereferences the tun_struct under RCU, and delegates to
> __tun_wake_queue().
> 
> vhost_net_buf_produce() now calls tun_wake_queue() after a successful
> batched consume of the ring to allow the netdev subqueue to be woken up.

A sentence is missing here:
the point is to allow the queue to be stopped when it gets full,
which is required for traffic shaping - implemented
by the following patch,
"tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present"

> Without the corresponding queue stopping (introduced in a subsequent
> commit), this patch alone causes a slight throughput regression for a
> tap+vhost-net setup sending to a qemu VM:
> 3.948 Mpps to 3.888 Mpps (-1.5%).
> 
> Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
> threads, XDP drop program active in VM, pktgen sender; Avg over
> 20 runs @ 100,000,000 packets. SRSO and spectre v2 mitigations disabled.
>

> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  drivers/net/tun.c      | 21 +++++++++++++++++++++
>  drivers/vhost/net.c    | 15 +++++++++++----
>  include/linux/if_tun.h |  3 +++
>  3 files changed, 35 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index a82d665dab5f..b86582cc6cb6 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -3760,6 +3760,27 @@ struct ptr_ring *tun_get_tx_ring(struct file *file)
>  }
>  EXPORT_SYMBOL_GPL(tun_get_tx_ring);
>  
> +void tun_wake_queue(struct file *file)
> +{
> +	struct tun_file *tfile;
> +	struct tun_struct *tun;
> +
> +	if (file->f_op != &tun_fops)
> +		return;
> +	tfile = file->private_data;
> +	if (!tfile)
> +		return;
> +
> +	rcu_read_lock();
> +
> +	tun = rcu_dereference(tfile->tun);
> +	if (tun)
> +		__tun_wake_queue(tun, tfile);
> +
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(tun_wake_queue);
> +
>  module_init(tun_init);
>  module_exit(tun_cleanup);
>  MODULE_DESCRIPTION(DRV_DESCRIPTION);
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 80965181920c..c8ef804ef28c 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -176,13 +176,19 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
>  	return ret;
>  }
>  
> -static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
> +static int vhost_net_buf_produce(struct sock *sk,
> +				 struct vhost_net_virtqueue *nvq)
>  {
> +	struct file *file = sk->sk_socket->file;
>  	struct vhost_net_buf *rxq = &nvq->rxq;
>  
>  	rxq->head = 0;
>  	rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
>  					      VHOST_NET_BATCH);
> +
> +	if (rxq->tail)
> +		tun_wake_queue(file);
> +
>  	return rxq->tail;
>  }
>  
> @@ -209,14 +215,15 @@ static int vhost_net_buf_peek_len(void *ptr)
>  	return __skb_array_len_with_tag(ptr);
>  }
>  
> -static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
> +static int vhost_net_buf_peek(struct sock *sk,
> +			      struct vhost_net_virtqueue *nvq)
>  {
>  	struct vhost_net_buf *rxq = &nvq->rxq;
>  
>  	if (!vhost_net_buf_is_empty(rxq))
>  		goto out;
>  
> -	if (!vhost_net_buf_produce(nvq))
> +	if (!vhost_net_buf_produce(sk, nvq))
>  		return 0;
>  
>  out:
> @@ -995,7 +1002,7 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
>  	unsigned long flags;
>  
>  	if (rvq->rx_ring)
> -		return vhost_net_buf_peek(rvq);
> +		return vhost_net_buf_peek(sk, rvq);
>  
>  	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
>  	head = skb_peek(&sk->sk_receive_queue);
> diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
> index 80166eb62f41..ab3b4ebca059 100644
> --- a/include/linux/if_tun.h
> +++ b/include/linux/if_tun.h
> @@ -22,6 +22,7 @@ struct tun_msg_ctl {
>  #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
>  struct socket *tun_get_socket(struct file *);
>  struct ptr_ring *tun_get_tx_ring(struct file *file);
> +void tun_wake_queue(struct file *file);
>  
>  static inline bool tun_is_xdp_frame(void *ptr)
>  {
> @@ -55,6 +56,8 @@ static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
>  	return ERR_PTR(-EINVAL);
>  }
>  
> +static inline void tun_wake_queue(struct file *f) {}
> +
>  static inline bool tun_is_xdp_frame(void *ptr)
>  {
>  	return false;
> -- 
> 2.43.0


^ permalink raw reply	[flat|nested] 21+ messages in thread
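
[Editorial note: the stop/wake handshake that the wakeup in this patch enables
(completed by the "avoid ptr_ring tail-drop when a qdisc is present" patch) can
be modeled in a hedged userspace sketch. Names such as txq_xmit() and
txq_consume() are illustrative, not the driver API. The producer stops the
queue when the ring fills instead of dropping, and the consumer wakes it once
half of the ring has been drained, mirroring the wake heuristic described in
the v8 changelog:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4

/* Toy model of a netdev tx queue feeding a fixed-size ring. */
struct txq {
	void *ring[RING_SIZE];
	int head, tail, count;
	bool stopped;		/* models netif_tx_queue_stopped() */
};

/* Producer: stop the queue on the fill that makes the ring full,
 * instead of tail-dropping; the qdisc then holds further packets. */
static int txq_xmit(struct txq *q, void *skb)
{
	if (q->stopped)
		return -1;		/* qdisc keeps the packet */
	q->ring[q->head] = skb;
	if (++q->head >= RING_SIZE)
		q->head = 0;
	if (++q->count == RING_SIZE)
		q->stopped = true;	/* apply backpressure */
	return 0;
}

/* Consumer: wake the queue only after half of the ring was consumed,
 * the heuristic the changelog attributes to MST's suggestion. */
static void *txq_consume(struct txq *q)
{
	void *skb;

	if (!q->count)
		return NULL;
	skb = q->ring[q->tail];
	if (++q->tail >= RING_SIZE)
		q->tail = 0;
	q->count--;
	if (q->stopped && q->count <= RING_SIZE / 2)
		q->stopped = false;	/* models tun_wake_queue() */
	return skb;
}
```

[Waking at the halfway mark instead of on every freed slot trades a little ring
occupancy for fewer wakeups; the real series only does any of this when a qdisc
is attached.]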

* Re: [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2026-03-12 13:06 [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (3 preceding siblings ...)
  2026-03-12 13:06 ` [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present Simon Schippers
@ 2026-03-12 13:55 ` Michael S. Tsirkin
  2026-03-13  9:49   ` Simon Schippers
  2026-03-23 21:49 ` Simon Schippers
  5 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2026-03-12 13:55 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Thu, Mar 12, 2026 at 02:06:35PM +0100, Simon Schippers wrote:
> This patch series deals with tun/tap & vhost-net which drop incoming
> SKBs whenever their internal ptr_ring buffer is full. Instead, with this 
> patch series, the associated netdev queue is stopped - but only when a
> qdisc is attached. If no qdisc is present the existing behavior is
> preserved. This patch series touches tun/tap and vhost-net, as they
> share common logic and must be updated together. Modifying only one of
> them would break the other.
> 
> By applying proper backpressure, this change allows the connected qdisc to 
> operate correctly, as reported in [1], and significantly improves
> performance in real-world scenarios, as demonstrated in our paper [2]. For 
> example, we observed a 36% TCP throughput improvement for an OpenVPN 
> connection between Germany and the USA.
> 
> Synthetic pktgen benchmarks indicate a slight regression.
> Pktgen benchmarks are provided per commit, with the final commit showing
> the overall performance.
> 
> Thanks!

I posted a minor nit on patch 2.

Otherwise LGTM:

Acked-by: Michael S. Tsirkin <mst@redhat.com>

thanks for the work!


> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
> [3] Link: https://lore.kernel.org/r/174549940981.608169.4363875844729313831.stgit@firesoul
> [4] Link: https://lore.kernel.org/r/176295323282.307447.14790015927673763094.stgit@firesoul
> 
> ---
> Changelog:
> V8:
> - Drop code changes in drivers/net/tap.c; The code there deals with
>   ipvtap/macvtap which are unrelated to the goal of this patch series
>   and I did not realize that before
> -> Greatly simplified logic, 4 instead of 9 commits
> -> No more duplicated logic and no case distinction in vhost required
> - Only wake after the queue stopped and half of the ring was consumed
>   as suggested by MST
> -> Performance improvements for TAP, but still slightly slower
> - Better benchmarking with pinned threads, XDP drop program for
>   tap+vhost-net and disabling CPU mitigations (and newer Ryzen 5 5600X
>   processor) as suggested by Jason Wang
> 
> V7: https://lore.kernel.org/netdev/20260107210448.37851-1-simon.schippers@tu-dortmund.de/
> - Switch to an approach similar to veth [3] (excluding the recently fixed 
> variant [4]), as suggested by MST, with minor adjustments discussed in V6
> - Rename the cover-letter title
> - Add multithreaded pktgen and iperf3 benchmarks, as suggested by Jason 
> Wang
> - Rework __ptr_ring_consume_created_space() so it can also be used after 
> batched consume
> 
> V6: https://lore.kernel.org/netdev/20251120152914.1127975-1-simon.schippers@tu-dortmund.de/
> General:
> - Major adjustments to the descriptions. Special thanks to Jon Kohler!
> - Fix git bisect by moving most logic into dedicated functions and only 
> start using them in patch 7.
> - Moved the main logic of the coupled producer and consumer into a single 
> patch to avoid a chicken-and-egg dependency between commits :-)
> - Rebased to 6.18-rc5 and ran benchmarks again that now also include lost 
> packets (previously I missed a 0, so all benchmark results were higher by 
> a factor of 10...).
> - Also include the benchmark in patch 7.
> 
> Producer:
> - Move logic into the new helper tun_ring_produce()
> - Added a smp_rmb() paired with the consumer, ensuring freed space of the 
> consumer is visible
> - Assume that ptr_ring is not full when __ptr_ring_full_next() is called
> 
> Consumer:
> - Use an unpaired smp_rmb() instead of barrier() to ensure that the 
> netdev_tx_queue_stopped() call completes before discarding
> - Also wake the netdev queue if it was stopped before discarding and then 
> becomes empty
> -> Fixes race with producer as identified by MST in V5
> -> Waking the netdev queues upon resize is not required anymore
> - Use __ptr_ring_consume_created_space() instead of messing with ptr_ring 
> internals
> -> Batched consume now just calls 
> __tun_ring_consume()/__tap_ring_consume() in a loop
> - Added an smp_wmb() before waking the netdev queue which is paired with 
> the smp_rmb() discussed above
> 
> V5: https://lore.kernel.org/netdev/20250922221553.47802-1-simon.schippers@tu-dortmund.de/T/#u
> - Stop the netdev queue prior to producing the final fitting ptr_ring entry
> -> Ensures the consumer has the latest netdev queue state, making it safe 
> to wake the queue
> -> Resolves an issue in vhost-net where the netdev queue could remain 
> stopped despite being empty
> -> For TUN/TAP, the netdev queue no longer needs to be woken in the 
> blocking loop
> -> Introduces new helpers __ptr_ring_full_next and 
> __ptr_ring_will_invalidate for this purpose
> - vhost-net now uses wrappers of TUN/TAP for ptr_ring consumption rather 
> than maintaining its own rx_ring pointer
> 
> V4: https://lore.kernel.org/netdev/20250902080957.47265-1-simon.schippers@tu-dortmund.de/T/#u
> - Target net-next instead of net
> - Changed to patch series instead of single patch
> - Changed to new title from old title
> "TUN/TAP: Improving throughput and latency by avoiding SKB drops"
> - Wake netdev queue with new helpers wake_netdev_queue when there is any 
> spare capacity in the ptr_ring instead of waiting for it to be empty
> - Use tun_file instead of tun_struct in tun_ring_recv as a more consistent 
> logic
> - Use smp_wmb() and smp_rmb() barrier pair, which avoids any packet drops 
> that happened rarely before
> - Use safer logic for vhost-net using RCU read locks to access TUN/TAP data
> 
> V3: https://lore.kernel.org/netdev/20250825211832.84901-1-simon.schippers@tu-dortmund.de/T/#u
> - Added support for TAP and TAP+vhost-net.
> 
> V2: https://lore.kernel.org/netdev/20250811220430.14063-1-simon.schippers@tu-dortmund.de/T/#u
> - Removed NETDEV_TX_BUSY return case in tun_net_xmit and removed 
> unnecessary netif_tx_wake_queue in tun_ring_recv.
> 
> V1: https://lore.kernel.org/netdev/20250808153721.261334-1-simon.schippers@tu-dortmund.de/T/#u
> ---
> 
> Simon Schippers (4):
>   tun/tap: add ptr_ring consume helper with netdev queue wakeup
>   vhost-net: wake queue of tun/tap after ptr_ring consume
>   ptr_ring: move free-space check into separate helper
>   tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
> 
>  drivers/net/tun.c        | 91 +++++++++++++++++++++++++++++++++++++---
>  drivers/vhost/net.c      | 15 +++++--
>  include/linux/if_tun.h   |  3 ++
>  include/linux/ptr_ring.h | 14 ++++++-
>  4 files changed, 111 insertions(+), 12 deletions(-)
> 
> -- 
> 2.43.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 3/4] ptr_ring: move free-space check into separate helper
  2026-03-12 13:48     ` Michael S. Tsirkin
@ 2026-03-12 14:21       ` Eric Dumazet
  2026-03-25 11:07         ` Michael S. Tsirkin
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2026-03-12 14:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Simon Schippers, willemdebruijn.kernel, jasowang, andrew+netdev,
	davem, kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Thu, Mar 12, 2026 at 2:48 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Mar 12, 2026 at 02:17:16PM +0100, Eric Dumazet wrote:
> > On Thu, Mar 12, 2026 at 2:06 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> > >
> > > This patch moves the check for available free space for a new entry into
> > > a separate function. As a result, __ptr_ring_produce() remains logically
> > > unchanged, while the new helper allows callers to determine in advance
> > > whether subsequent __ptr_ring_produce() calls will succeed. This
> > > information can, for example, be used to temporarily stop producing until
> > > __ptr_ring_peek() indicates that space is available again.
> > >
> > > Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > > Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > > Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > > ---
> > >  include/linux/ptr_ring.h | 14 ++++++++++++--
> > >  1 file changed, 12 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > index 534531807d95..a5a3fa4916d3 100644
> > > --- a/include/linux/ptr_ring.h
> > > +++ b/include/linux/ptr_ring.h
> > > @@ -96,6 +96,14 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
> > >         return ret;
> > >  }
> > >
> > > +static inline int __ptr_ring_produce_peek(struct ptr_ring *r)
> > > +{
> > > +       if (unlikely(!r->size) || r->queue[r->producer])
> >
> > I think this should be
> >
> >        if (unlikely(!r->size) || READ_ONCE(r->queue[r->producer]))
> >
> > And of course:
> >
> > @@ -194,7 +194,7 @@ static inline void *__ptr_ring_peek(struct ptr_ring *r)
> >  static inline bool __ptr_ring_empty(struct ptr_ring *r)
> >  {
> >         if (likely(r->size))
> > -               return !r->queue[READ_ONCE(r->consumer_head)];
> > +               return !READ_ONCE(r->queue[READ_ONCE(r->consumer_head)]);
> >         return true;
> >  }
>
>
> I don't understand why it's necessary. consumer_head etc are
> all lock protected.
>
> queue itself is not but we are only checking it for NULL -
> it is fine if compiler reads it in many chunks and not all
> at once.
>
> > @@ -256,7 +256,7 @@ static inline void __ptr_ring_zero_tail(struct
> > ptr_ring *r, int consumer_head)
> >          * besides the first one until we write out all entries.
> >          */
> >         while (likely(head > r->consumer_tail))
> > -               r->queue[--head] = NULL;
> > +               WRITE_ONCE(r->queue[--head], NULL);
> >
> >         r->consumer_tail = consumer_head;
> >  }
> >
> >
> > Presumably we should fix this in net tree first.
>
>
> Maybe this one yes but I am not sure at all - KCSAN is happy.
>

Hmmm... what about this trace?

BUG: KCSAN: data-race in pfifo_fast_dequeue / pfifo_fast_enqueue

write to 0xffff88811d5ccc00 of 8 bytes by interrupt on cpu 0:
__ptr_ring_zero_tail include/linux/ptr_ring.h:259 [inline]
__ptr_ring_discard_one include/linux/ptr_ring.h:291 [inline]
__ptr_ring_consume include/linux/ptr_ring.h:311 [inline]
__skb_array_consume include/linux/skb_array.h:98 [inline]
pfifo_fast_dequeue+0x770/0x8f0 net/sched/sch_generic.c:770
dequeue_skb net/sched/sch_generic.c:297 [inline]
qdisc_restart net/sched/sch_generic.c:402 [inline]
__qdisc_run+0x189/0xc80 net/sched/sch_generic.c:420
qdisc_run include/net/pkt_sched.h:120 [inline]
net_tx_action+0x379/0x590 net/core/dev.c:5793
handle_softirqs+0xb9/0x280 kernel/softirq.c:622
do_softirq+0x45/0x60 kernel/softirq.c:523
__local_bh_enable_ip+0x70/0x80 kernel/softirq.c:450
local_bh_enable include/linux/bottom_half.h:33 [inline]
bpf_test_run+0x2db/0x620 net/bpf/test_run.c:426
bpf_prog_test_run_skb+0x9a4/0xef0 net/bpf/test_run.c:1159
bpf_prog_test_run+0x204/0x340 kernel/bpf/syscall.c:4721
__sys_bpf+0x52e/0x7e0 kernel/bpf/syscall.c:6246
__do_sys_bpf kernel/bpf/syscall.c:6341 [inline]
__se_sys_bpf kernel/bpf/syscall.c:6339 [inline]
__x64_sys_bpf+0x41/0x50 kernel/bpf/syscall.c:6339
x64_sys_call+0x10cb/0x3020 arch/x86/include/generated/asm/syscalls_64.h:322
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff88811d5ccc00 of 8 bytes by task 22632 on cpu 1:
__ptr_ring_produce include/linux/ptr_ring.h:106 [inline]
ptr_ring_produce include/linux/ptr_ring.h:129 [inline]
skb_array_produce include/linux/skb_array.h:44 [inline]
pfifo_fast_enqueue+0xd5/0x2c0 net/sched/sch_generic.c:741
dev_qdisc_enqueue net/core/dev.c:4144 [inline]
__dev_xmit_skb net/core/dev.c:4188 [inline]
__dev_queue_xmit+0x6a4/0x1f20 net/core/dev.c:4795
dev_queue_xmit include/linux/netdevice.h:3384 [inline]
__bpf_tx_skb net/core/filter.c:2153 [inline]
__bpf_redirect_common net/core/filter.c:2197 [inline]
__bpf_redirect+0x862/0x990 net/core/filter.c:2204
____bpf_clone_redirect net/core/filter.c:2487 [inline]
bpf_clone_redirect+0x20c/0x290 net/core/filter.c:2450
bpf_prog_53f18857bc887b09+0x22/0x2a
bpf_dispatcher_nop_func include/linux/bpf.h:1402 [inline]
__bpf_prog_run include/linux/filter.h:723 [inline]
bpf_prog_run include/linux/filter.h:730 [inline]
bpf_test_run+0x29d/0x620 net/bpf/test_run.c:423
bpf_prog_test_run_skb+0x9a4/0xef0 net/bpf/test_run.c:1159
bpf_prog_test_run+0x204/0x340 kernel/bpf/syscall.c:4721
__sys_bpf+0x52e/0x7e0 kernel/bpf/syscall.c:6246
__do_sys_bpf kernel/bpf/syscall.c:6341 [inline]
__se_sys_bpf kernel/bpf/syscall.c:6339 [inline]
__x64_sys_bpf+0x41/0x50 kernel/bpf/syscall.c:6339
x64_sys_call+0x10cb/0x3020 arch/x86/include/generated/asm/syscalls_64.h:322
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0xffff888104a93a00 -> 0x0000000000000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 22632 Comm: syz.0.4135 Tainted: G W syzkaller #0
PREEMPT(full)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/24/2026

^ permalink raw reply	[flat|nested] 21+ messages in thread
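
[Editorial note: the racy write in the trace above is the batched-invalidation
loop. The consumer zeroes a consumed batch in descending order so that the
oldest slot - the one the producer's full check reads - is cleared last, and a
concurrent producer never observes partial free space. A hedged userspace
sketch of that loop with the WRITE_ONCE() annotation applied (mirroring the
proposed fix, not the kernel source):]

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's WRITE_ONCE(). */
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/*
 * Zero the consumed entries [consumer_tail, head) in descending order.
 * Clearing the oldest slot (consumer_tail) last means a concurrent
 * producer, which tests exactly that slot for NULL, sees the whole
 * batch freed at once rather than a hole in the middle of the ring.
 */
static void zero_tail(void **queue, int consumer_tail, int head)
{
	while (head > consumer_tail)
		WRITE_ONCE(queue[--head], NULL);
}
```

[Without the annotation this is a plain store racing with the producer's read,
which is what KCSAN flags; whether the annotation is strictly needed when the
value is only ever tested against NULL is the open question in the thread.]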

* Re: [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2026-03-12 13:55 ` [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Michael S. Tsirkin
@ 2026-03-13  9:49   ` Simon Schippers
  2026-03-13 10:35     ` Michael S. Tsirkin
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Schippers @ 2026-03-13  9:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 3/12/26 14:55, Michael S. Tsirkin wrote:
> On Thu, Mar 12, 2026 at 02:06:35PM +0100, Simon Schippers wrote:
>> This patch series deals with tun/tap & vhost-net which drop incoming
>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this 
>> patch series, the associated netdev queue is stopped - but only when a
>> qdisc is attached. If no qdisc is present the existing behavior is
>> preserved. This patch series touches tun/tap and vhost-net, as they
>> share common logic and must be updated together. Modifying only one of
>> them would break the other.
>>
>> By applying proper backpressure, this change allows the connected qdisc to 
>> operate correctly, as reported in [1], and significantly improves
>> performance in real-world scenarios, as demonstrated in our paper [2]. For 
>> example, we observed a 36% TCP throughput improvement for an OpenVPN 
>> connection between Germany and the USA.
>>
>> Synthetic pktgen benchmarks indicate a slight regression.
>> Pktgen benchmarks are provided per commit, with the final commit showing
>> the overall performance.
>>
>> Thanks!
> 
> I posted a minor nit on patch 2.
> 
> Otherwise LGTM:
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> thanks for the work!

Thanks!

Should I do a new version for the minor nit?

And how about the ptr_ring race:
I see there is a separate discussion for that now. [1]
Should I wait for that?

Before sending a potential new version I would of course wait for
Jason's take.

[1] Link: https://lore.kernel.org/netdev/CANn89iJLC9p+N=Rqtuj7ZuPRdpSGCATCtZdz1Vi9mbzf3ATekQ@mail.gmail.com/

> 
> 
>> [...]
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2026-03-13  9:49   ` Simon Schippers
@ 2026-03-13 10:35     ` Michael S. Tsirkin
  0 siblings, 0 replies; 21+ messages in thread
From: Michael S. Tsirkin @ 2026-03-13 10:35 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Fri, Mar 13, 2026 at 10:49:53AM +0100, Simon Schippers wrote:
> On 3/12/26 14:55, Michael S. Tsirkin wrote:
> > On Thu, Mar 12, 2026 at 02:06:35PM +0100, Simon Schippers wrote:
> >> This patch series deals with tun/tap & vhost-net which drop incoming
> >> SKBs whenever their internal ptr_ring buffer is full. Instead, with this 
> >> patch series, the associated netdev queue is stopped - but only when a
> >> qdisc is attached. If no qdisc is present the existing behavior is
> >> preserved. This patch series touches tun/tap and vhost-net, as they
> >> share common logic and must be updated together. Modifying only one of
> >> them would break the other.
> >>
> >> By applying proper backpressure, this change allows the connected qdisc to 
> >> operate correctly, as reported in [1], and significantly improves
> >> performance in real-world scenarios, as demonstrated in our paper [2]. For 
> >> example, we observed a 36% TCP throughput improvement for an OpenVPN 
> >> connection between Germany and the USA.
> >>
> >> Synthetic pktgen benchmarks indicate a slight regression.
> >> Pktgen benchmarks are provided per commit, with the final commit showing
> >> the overall performance.
> >>
> >> Thanks!
> > 
> > I posted a minor nit on patch 2.
> > 
> > Otherwise LGTM:
> > 
> > Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > thanks for the work!
> 
> Thanks!
> 
> Should I do a new version for the minor nit?

It's easier if you do, yes.

> And how about the ptr_ring race:
> I see there is a seperate discussion for that now. [1]
> Should I wait for that?

If there's a conflict it will be trivial to resolve, so I'd say no.

> Before sending a potential new version I would of course wait for
> Jason's take.
> 
> [1] Link: https://lore.kernel.org/netdev/CANn89iJLC9p+N=Rqtuj7ZuPRdpSGCATCtZdz1Vi9mbzf3ATekQ@mail.gmail.com/
> 
> > 
> > 
> >> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
> >> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
> >> [3] Link: https://lore.kernel.org/r/174549940981.608169.4363875844729313831.stgit@firesoul
> >> [4] Link: https://lore.kernel.org/r/176295323282.307447.14790015927673763094.stgit@firesoul
> >>
> >> ---
> >> Changelog:
> >> V8:
> >> - Drop code changes in drivers/net/tap.c; The code there deals with
> >>   ipvtap/macvtap which are unrelated to the goal of this patch series
> >>   and I did not realize that before
> >> -> Greatly simplified logic, 4 instead of 9 commits
> >> -> No more duplicated logics and distinction in vhost required
> >> - Only wake after the queue stopped and half of the ring was consumed
> >>   as suggested by MST
> >> -> Performance improvements for TAP, but still slightly slower
> >> - Better benchmarking with pinned threads, XDP drop program for
> >>   tap+vhost-net and disabling CPU mitigations (and newer Ryzen 5 5600X
> >>   processor) as suggested by Jason Wang
> >>
> >> V7: https://lore.kernel.org/netdev/20260107210448.37851-1-simon.schippers@tu-dortmund.de/
> >> - Switch to an approach similar to veth [3] (excluding the recently fixed 
> >> variant [4]), as suggested by MST, with minor adjustments discussed in V6
> >> - Rename the cover-letter title
> >> - Add multithreaded pktgen and iperf3 benchmarks, as suggested by Jason 
> >> Wang
> >> - Rework __ptr_ring_consume_created_space() so it can also be used after 
> >> batched consume
> >>
> >> V6: https://lore.kernel.org/netdev/20251120152914.1127975-1-simon.schippers@tu-dortmund.de/
> >> General:
> >> - Major adjustments to the descriptions. Special thanks to Jon Kohler!
> >> - Fix git bisect by moving most logic into dedicated functions and only 
> >> start using them in patch 7.
> >> - Moved the main logic of the coupled producer and consumer into a single 
> >> patch to avoid a chicken-and-egg dependency between commits :-)
> >> - Rebased to 6.18-rc5 and ran benchmarks again that now also include lost 
> >> packets (previously I missed a 0, so all benchmark results were higher by 
> >> factor 10...).
> >> - Also include the benchmark in patch 7.
> >>
> >> Producer:
> >> - Move logic into the new helper tun_ring_produce()
> >> - Added an smp_rmb() paired with the consumer, ensuring that space freed 
> >> by the consumer is visible
> >> - Assume that ptr_ring is not full when __ptr_ring_full_next() is called
> >>
> >> Consumer:
> >> - Use an unpaired smp_rmb() instead of barrier() to ensure that the 
> >> netdev_tx_queue_stopped() call completes before discarding
> >> - Also wake the netdev queue if it was stopped before discarding and then 
> >> becomes empty
> >> -> Fixes race with producer as identified by MST in V5
> >> -> Waking the netdev queues upon resize is not required anymore
> >> - Use __ptr_ring_consume_created_space() instead of messing with ptr_ring 
> >> internals
> >> -> Batched consume now just calls 
> >> __tun_ring_consume()/__tap_ring_consume() in a loop
> >> - Added an smp_wmb() before waking the netdev queue which is paired with 
> >> the smp_rmb() discussed above
> >>
> >> V5: https://lore.kernel.org/netdev/20250922221553.47802-1-simon.schippers@tu-dortmund.de/T/#u
> >> - Stop the netdev queue prior to producing the final fitting ptr_ring entry
> >> -> Ensures the consumer has the latest netdev queue state, making it safe 
> >> to wake the queue
> >> -> Resolves an issue in vhost-net where the netdev queue could remain 
> >> stopped despite being empty
> >> -> For TUN/TAP, the netdev queue no longer needs to be woken in the 
> >> blocking loop
> >> -> Introduces new helpers __ptr_ring_full_next and 
> >> __ptr_ring_will_invalidate for this purpose
> >> - vhost-net now uses wrappers of TUN/TAP for ptr_ring consumption rather 
> >> than maintaining its own rx_ring pointer
> >>
> >> V4: https://lore.kernel.org/netdev/20250902080957.47265-1-simon.schippers@tu-dortmund.de/T/#u
> >> - Target net-next instead of net
> >> - Changed to patch series instead of single patch
> >> - Changed the title from the old one,
> >> "TUN/TAP: Improving throughput and latency by avoiding SKB drops"
> >> - Wake netdev queue with new helpers wake_netdev_queue when there is any 
> >> spare capacity in the ptr_ring instead of waiting for it to be empty
> >> - Use tun_file instead of tun_struct in tun_ring_recv for more consistent 
> >> logic
> >> - Use smp_wmb() and smp_rmb() barrier pair, which avoids any packet drops 
> >> that happened rarely before
> >> - Use safer logic for vhost-net using RCU read locks to access TUN/TAP data
> >>
> >> V3: https://lore.kernel.org/netdev/20250825211832.84901-1-simon.schippers@tu-dortmund.de/T/#u
> >> - Added support for TAP and TAP+vhost-net.
> >>
> >> V2: https://lore.kernel.org/netdev/20250811220430.14063-1-simon.schippers@tu-dortmund.de/T/#u
> >> - Removed the NETDEV_TX_BUSY return case in tun_net_xmit and the 
> >> unnecessary netif_tx_wake_queue call in tun_ring_recv.
> >>
> >> V1: https://lore.kernel.org/netdev/20250808153721.261334-1-simon.schippers@tu-dortmund.de/T/#u
> >> ---
> >>
> >> Simon Schippers (4):
> >>   tun/tap: add ptr_ring consume helper with netdev queue wakeup
> >>   vhost-net: wake queue of tun/tap after ptr_ring consume
> >>   ptr_ring: move free-space check into separate helper
> >>   tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
> >>
> >>  drivers/net/tun.c        | 91 +++++++++++++++++++++++++++++++++++++---
> >>  drivers/vhost/net.c      | 15 +++++--
> >>  include/linux/if_tun.h   |  3 ++
> >>  include/linux/ptr_ring.h | 14 ++++++-
> >>  4 files changed, 111 insertions(+), 12 deletions(-)
> >>
> >> -- 
> >> 2.43.0
> > 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2026-03-12 13:06 [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (4 preceding siblings ...)
  2026-03-12 13:55 ` [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Michael S. Tsirkin
@ 2026-03-23 21:49 ` Simon Schippers
  5 siblings, 0 replies; 21+ messages in thread
From: Simon Schippers @ 2026-03-23 21:49 UTC (permalink / raw)
  To: jasowang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 3/12/26 14:06, Simon Schippers wrote:
> This patch series deals with tun/tap & vhost-net which drop incoming
> SKBs whenever their internal ptr_ring buffer is full. Instead, with this 
> patch series, the associated netdev queue is stopped - but only when a
> qdisc is attached. If no qdisc is present the existing behavior is
> preserved. This patch series touches tun/tap and vhost-net, as they
> share common logic and must be updated together. Modifying only one of
> them would break the other.

Hi,

@Jason what do you think about the patchset?

Thanks!


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-03-12 13:06 ` [PATCH net-next v8 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
@ 2026-03-24  1:47   ` Jason Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Wang @ 2026-03-24  1:47 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> Introduce tun_ring_consume() that wraps ptr_ring_consume() and calls
> __tun_wake_queue(). The latter wakes the stopped netdev subqueue once
> half of the ring capacity has been consumed, tracked via the new
> cons_cnt field in tun_file. When the ring is empty the queue is also
> woken to handle potential races.
>
> Without the corresponding queue stopping (introduced in a subsequent
> commit), this patch alone causes no regression for a tap setup sending
> to a qemu VM: 1.151 Mpps to 1.153 Mpps.
>

Another call to squash this into next patch.

Thanks


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume
  2026-03-12 13:06 ` [PATCH net-next v8 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume Simon Schippers
  2026-03-12 13:54   ` Michael S. Tsirkin
@ 2026-03-24  1:47   ` Jason Wang
  1 sibling, 0 replies; 21+ messages in thread
From: Jason Wang @ 2026-03-24  1:47 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> Add tun_wake_queue() to tun.c and export it for use by vhost-net. The
> function validates that the file belongs to a tun/tap device,
> dereferences the tun_struct under RCU, and delegates to
> __tun_wake_queue().
>
> vhost_net_buf_produce() now calls tun_wake_queue() after a successful
> batched consume of the ring to allow the netdev subqueue to be woken up.
>
> Without the corresponding queue stopping (introduced in a subsequent
> commit), this patch alone causes a slight throughput regression for a
> tap+vhost-net setup sending to a qemu VM:
> 3.948 Mpps to 3.888 Mpps (-1.5%).
>
> Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
> threads, XDP drop program active in VM, pktgen sender; Avg over
> 20 runs @ 100,000,000 packets. SRSO and spectre v2 mitigations disabled.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  drivers/net/tun.c      | 21 +++++++++++++++++++++
>  drivers/vhost/net.c    | 15 +++++++++++----
>  include/linux/if_tun.h |  3 +++
>  3 files changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index a82d665dab5f..b86582cc6cb6 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -3760,6 +3760,27 @@ struct ptr_ring *tun_get_tx_ring(struct file *file)
>  }
>  EXPORT_SYMBOL_GPL(tun_get_tx_ring);
>
> +void tun_wake_queue(struct file *file)
> +{
> +       struct tun_file *tfile;
> +       struct tun_struct *tun;
> +
> +       if (file->f_op != &tun_fops)
> +               return;
> +       tfile = file->private_data;
> +       if (!tfile)
> +               return;
> +
> +       rcu_read_lock();
> +
> +       tun = rcu_dereference(tfile->tun);
> +       if (tun)
> +               __tun_wake_queue(tun, tfile);
> +
> +       rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(tun_wake_queue);
> +
>  module_init(tun_init);
>  module_exit(tun_cleanup);
>  MODULE_DESCRIPTION(DRV_DESCRIPTION);
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 80965181920c..c8ef804ef28c 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -176,13 +176,19 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
>         return ret;
>  }
>
> -static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
> +static int vhost_net_buf_produce(struct sock *sk,
> +                                struct vhost_net_virtqueue *nvq)
>  {
> +       struct file *file = sk->sk_socket->file;
>         struct vhost_net_buf *rxq = &nvq->rxq;
>
>         rxq->head = 0;
>         rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
>                                               VHOST_NET_BATCH);
> +
> +       if (rxq->tail)
> +               tun_wake_queue(file);
> +
>         return rxq->tail;
>  }
>
> @@ -209,14 +215,15 @@ static int vhost_net_buf_peek_len(void *ptr)
>         return __skb_array_len_with_tag(ptr);
>  }
>
> -static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
> +static int vhost_net_buf_peek(struct sock *sk,
> +                             struct vhost_net_virtqueue *nvq)
>  {
>         struct vhost_net_buf *rxq = &nvq->rxq;
>
>         if (!vhost_net_buf_is_empty(rxq))
>                 goto out;
>
> -       if (!vhost_net_buf_produce(nvq))
> +       if (!vhost_net_buf_produce(sk, nvq))
>                 return 0;
>
>  out:
> @@ -995,7 +1002,7 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
>         unsigned long flags;
>
>         if (rvq->rx_ring)
> -               return vhost_net_buf_peek(rvq);
> +               return vhost_net_buf_peek(sk, rvq);
>
>         spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
>         head = skb_peek(&sk->sk_receive_queue);
> diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
> index 80166eb62f41..ab3b4ebca059 100644
> --- a/include/linux/if_tun.h
> +++ b/include/linux/if_tun.h
> @@ -22,6 +22,7 @@ struct tun_msg_ctl {
>  #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
>  struct socket *tun_get_socket(struct file *);
>  struct ptr_ring *tun_get_tx_ring(struct file *file);
> +void tun_wake_queue(struct file *file);
>
>  static inline bool tun_is_xdp_frame(void *ptr)
>  {
> @@ -55,6 +56,8 @@ static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
>         return ERR_PTR(-EINVAL);
>  }
>
> +static inline void tun_wake_queue(struct file *f) {}
> +
>  static inline bool tun_is_xdp_frame(void *ptr)
>  {
>         return false;
> --
> 2.43.0
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-03-12 13:06 ` [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present Simon Schippers
@ 2026-03-24  1:47   ` Jason Wang
  2026-03-24 10:14     ` Simon Schippers
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Wang @ 2026-03-24  1:47 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> becomes full. Once an entry is successfully produced and the ptr_ring
> reaches capacity, the netdev queue is stopped instead of dropping
> subsequent packets.
>
> If producing an entry fails anyway due to a race, tun_net_xmit returns
> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
> LLTX is enabled and the transmit path operates without the usual locking.
>
> The existing __tun_wake_queue() function wakes the netdev queue. Races
> between this wakeup and the queue-stop logic could leave the queue
> stopped indefinitely. To prevent this, a memory barrier is enforced
> (as discussed in a similar implementation in [1]), followed by a recheck
> that wakes the queue if space is already available.
>
> If no qdisc is present, the previous tail-drop behavior is preserved.

I wonder if we need a dedicated TUN flag to enable this. With this new
flag, we can even prevent TUN from using noqueue (not sure if it's
possible or not).

>
> Benchmarks:
> The benchmarks show a slight regression in raw transmission performance,
> though no packets are lost anymore.
>
> The previously introduced threshold to only wake after the queue stopped
> and half of the ring was consumed proved to be a decent choice:
> Waking the queue whenever a consume made space in the ring strongly
> degrades performance for tap, while waking only when the ring is empty
> is too late and also hurts throughput for tap & tap+vhost-net.
> Other ratios (3/4, 7/8) showed similar results (not shown here), so
> 1/2 was chosen for the sake of simplicity for both tun/tap and
> tun/tap+vhost-net.
>
> Test setup:
> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
> mitigations disabled.
>
> Note for tap+vhost-net:
> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
>
> +--------------------------+--------------+----------------+----------+
> | 1 thread                 | Stock        | Patched with   | diff     |
> | sending                  |              | fq_codel qdisc |          |
> +------------+-------------+--------------+----------------+----------+
> | TAP        | Transmitted | 1.151 Mpps   | 1.139 Mpps     | -1.1%    |
> |            +-------------+--------------+----------------+----------+
> |            | Lost/s      | 3.606 Mpps   | 0 pps          |          |
> +------------+-------------+--------------+----------------+----------+
> | TAP        | Transmitted | 3.948 Mpps   | 3.738 Mpps     | -5.3%    |
> |            +-------------+--------------+----------------+----------+
> | +vhost-net | Lost/s      | 496.5 Kpps   | 0 pps          |          |
> +------------+-------------+--------------+----------------+----------+
>
> +--------------------------+--------------+----------------+----------+
> | 2 threads                | Stock        | Patched with   | diff     |
> | sending                  |              | fq_codel qdisc |          |
> +------------+-------------+--------------+----------------+----------+
> | TAP        | Transmitted | 1.133 Mpps   | 1.109 Mpps     | -2.1%    |
> |            +-------------+--------------+----------------+----------+
> |            | Lost/s      | 8.269 Mpps   | 0 pps          |          |
> +------------+-------------+--------------+----------------+----------+
> | TAP        | Transmitted | 3.820 Mpps   | 3.513 Mpps     | -8.0%    |
> |            +-------------+--------------+----------------+----------+
> | +vhost-net | Lost/s      | 4.961 Mpps   | 0 pps          |          |
> +------------+-------------+--------------+----------------+----------+
>
> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index b86582cc6cb6..9b7daec69acd 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>         struct netdev_queue *queue;
>         struct tun_file *tfile;
>         int len = skb->len;
> +       bool qdisc_present;
> +       int ret;
>
>         rcu_read_lock();
>         tfile = rcu_dereference(tun->tfiles[txq]);
> @@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>
>         nf_reset_ct(skb);
>
> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> +       queue = netdev_get_tx_queue(dev, txq);
> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> +
> +       spin_lock(&tfile->tx_ring.producer_lock);
> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {

So, it's possible that the administrator is switching between noqueue
and another qdisc. So ptr_ring_produce() can fail here, do we need to
check that or not?

> +               netif_tx_stop_queue(queue);
> +               /* Avoid races with queue wake-ups in __tun_wake_queue by
> +                * waking if space is available in a re-check.
> +                * The barrier makes sure that the stop is visible before
> +                * we re-check.
> +                */
> +               smp_mb__after_atomic();

Let's document which barrier is paired with this.

> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> +                       netif_tx_wake_queue(queue);
> +       }
> +       spin_unlock(&tfile->tx_ring.producer_lock);
> +
> +       if (ret) {
> +               /* If a qdisc is attached to our virtual device,
> +                * returning NETDEV_TX_BUSY is allowed.
> +                */
> +               if (qdisc_present) {
> +                       rcu_read_unlock();
> +                       return NETDEV_TX_BUSY;
> +               }
>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>                 goto drop;
>         }
>
>         /* dev->lltx requires to do our own update of trans_start */
> -       queue = netdev_get_tx_queue(dev, txq);
>         txq_trans_cond_update(queue);
>
>         /* Notify and wake up reader process */
> --
> 2.43.0
>

Thanks


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-03-24  1:47   ` Jason Wang
@ 2026-03-24 10:14     ` Simon Schippers
  2026-03-25 14:47       ` Simon Schippers
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Schippers @ 2026-03-24 10:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 3/24/26 02:47, Jason Wang wrote:
> On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>> becomes full. Once an entry is successfully produced and the ptr_ring
>> reaches capacity, the netdev queue is stopped instead of dropping
>> subsequent packets.
>>
>> If producing an entry fails anyway due to a race, tun_net_xmit returns
>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
>> LLTX is enabled and the transmit path operates without the usual locking.
>>
>> The existing __tun_wake_queue() function wakes the netdev queue. Races
>> between this wakeup and the queue-stop logic could leave the queue
>> stopped indefinitely. To prevent this, a memory barrier is enforced
>> (as discussed in a similar implementation in [1]), followed by a recheck
>> that wakes the queue if space is already available.
>>
>> If no qdisc is present, the previous tail-drop behavior is preserved.
> 
> I wonder if we need a dedicated TUN flag to enable this. With this new
> flag, we can even prevent TUN from using noqueue (not sure if it's
> possible or not).
> 

Apart from the slight regressions caused by this patchset, I do not see
a reason for such a flag.

I have never seen a driver prevent noqueue. For example, you can set
noqueue on your Ethernet interface, and under load you soon get

net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
					     dev->name);

followed by -ENETDOWN. And this is not prevented, even though it is
clearly not something a user wants.

>>
>> Benchmarks:
>> The benchmarks show a slight regression in raw transmission performance,
>> though no packets are lost anymore.
>>
>> The previously introduced threshold to only wake after the queue stopped
>> and half of the ring was consumed proved to be a decent choice:
>> Waking the queue whenever a consume made space in the ring strongly
>> degrades performance for tap, while waking only when the ring is empty
>> is too late and also hurts throughput for tap & tap+vhost-net.
>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
>> 1/2 was chosen for the sake of simplicity for both tun/tap and
>> tun/tap+vhost-net.
>>
>> Test setup:
>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
>> Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
>> mitigations disabled.
>>
>> Note for tap+vhost-net:
>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
>>
>> +--------------------------+--------------+----------------+----------+
>> | 1 thread                 | Stock        | Patched with   | diff     |
>> | sending                  |              | fq_codel qdisc |          |
>> +------------+-------------+--------------+----------------+----------+
>> | TAP        | Transmitted | 1.151 Mpps   | 1.139 Mpps     | -1.1%    |
>> |            +-------------+--------------+----------------+----------+
>> |            | Lost/s      | 3.606 Mpps   | 0 pps          |          |
>> +------------+-------------+--------------+----------------+----------+
>> | TAP        | Transmitted | 3.948 Mpps   | 3.738 Mpps     | -5.3%    |
>> |            +-------------+--------------+----------------+----------+
>> | +vhost-net | Lost/s      | 496.5 Kpps   | 0 pps          |          |
>> +------------+-------------+--------------+----------------+----------+
>>
>> +--------------------------+--------------+----------------+----------+
>> | 2 threads                | Stock        | Patched with   | diff     |
>> | sending                  |              | fq_codel qdisc |          |
>> +------------+-------------+--------------+----------------+----------+
>> | TAP        | Transmitted | 1.133 Mpps   | 1.109 Mpps     | -2.1%    |
>> |            +-------------+--------------+----------------+----------+
>> |            | Lost/s      | 8.269 Mpps   | 0 pps          |          |
>> +------------+-------------+--------------+----------------+----------+
>> | TAP        | Transmitted | 3.820 Mpps   | 3.513 Mpps     | -8.0%    |
>> |            +-------------+--------------+----------------+----------+
>> | +vhost-net | Lost/s      | 4.961 Mpps   | 0 pps          |          |
>> +------------+-------------+--------------+----------------+----------+
>>
>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>>  1 file changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index b86582cc6cb6..9b7daec69acd 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>         struct netdev_queue *queue;
>>         struct tun_file *tfile;
>>         int len = skb->len;
>> +       bool qdisc_present;
>> +       int ret;
>>
>>         rcu_read_lock();
>>         tfile = rcu_dereference(tun->tfiles[txq]);
>> @@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>
>>         nf_reset_ct(skb);
>>
>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>> +       queue = netdev_get_tx_queue(dev, txq);
>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
>> +
>> +       spin_lock(&tfile->tx_ring.producer_lock);
>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> 
> So, it's possible that the administrator is switching between noqueue
> and another qdisc. So ptr_ring_produce() can fail here, do we need to
> check that or not?
> 

Is this what you mean? My thoughts:

Switching from noqueue to some qdisc can cause a 

net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
					     dev->name);

followed by a return of -ENETDOWN in __dev_queue_xmit().
This is because tun_net_xmit detects some qdisc with 

qdisc_present = !qdisc_txq_has_no_queue(queue);

and returns NETDEV_TX_BUSY even though __dev_queue_xmit() still
detected noqueue.

I am not sure how to solve this, or whether it needs to be solved.
I do not see a proper way to avoid parallel execution of ndo_start_xmit
and a qdisc change (dev_graft_qdisc only takes the qdisc_skb_head lock).

And from my understanding the veth implementation faces the same issue.


Switching from some qdisc to noqueue is no problem, I think.

>> +               netif_tx_stop_queue(queue);
>> +               /* Avoid races with queue wake-ups in __tun_wake_queue by
>> +                * waking if space is available in a re-check.
>> +                * The barrier makes sure that the stop is visible before
>> +                * we re-check.
>> +                */
>> +               smp_mb__after_atomic();
> 
> Let's document which barrier is paired with this.
> 

I am basically copying the (old) logic of veth [1] proposed by
Jakub Kicinski. I must admit I am not 100% sure what it pairs with.

[1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/

>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>> +                       netif_tx_wake_queue(queue);
>> +       }
>> +       spin_unlock(&tfile->tx_ring.producer_lock);
>> +
>> +       if (ret) {
>> +               /* If a qdisc is attached to our virtual device,
>> +                * returning NETDEV_TX_BUSY is allowed.
>> +                */
>> +               if (qdisc_present) {
>> +                       rcu_read_unlock();
>> +                       return NETDEV_TX_BUSY;
>> +               }
>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>                 goto drop;
>>         }
>>
>>         /* dev->lltx requires to do our own update of trans_start */
>> -       queue = netdev_get_tx_queue(dev, txq);
>>         txq_trans_cond_update(queue);
>>
>>         /* Notify and wake up reader process */
>> --
>> 2.43.0
>>
> 
> Thanks
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 3/4] ptr_ring: move free-space check into separate helper
  2026-03-12 14:21       ` Eric Dumazet
@ 2026-03-25 11:07         ` Michael S. Tsirkin
  0 siblings, 0 replies; 21+ messages in thread
From: Michael S. Tsirkin @ 2026-03-25 11:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Simon Schippers, willemdebruijn.kernel, jasowang, andrew+netdev,
	davem, kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Thu, Mar 12, 2026 at 03:21:25PM +0100, Eric Dumazet wrote:
> On Thu, Mar 12, 2026 at 2:48 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Mar 12, 2026 at 02:17:16PM +0100, Eric Dumazet wrote:
> > > On Thu, Mar 12, 2026 at 2:06 PM Simon Schippers
> > > <simon.schippers@tu-dortmund.de> wrote:
> > > >
> > > > This patch moves the check for available free space for a new entry into
> > > > a separate function. As a result, __ptr_ring_produce() remains logically
> > > > unchanged, while the new helper allows callers to determine in advance
> > > > whether subsequent __ptr_ring_produce() calls will succeed. This
> > > > information can, for example, be used to temporarily stop producing until
> > > > __ptr_ring_peek() indicates that space is available again.
> > > >
> > > > Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > > > Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > > > Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > > > ---
> > > >  include/linux/ptr_ring.h | 14 ++++++++++++--
> > > >  1 file changed, 12 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > > index 534531807d95..a5a3fa4916d3 100644
> > > > --- a/include/linux/ptr_ring.h
> > > > +++ b/include/linux/ptr_ring.h
> > > > @@ -96,6 +96,14 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
> > > >         return ret;
> > > >  }
> > > >
> > > > +static inline int __ptr_ring_produce_peek(struct ptr_ring *r)
> > > > +{
> > > > +       if (unlikely(!r->size) || r->queue[r->producer])
> > >
> > > I think this should be
> > >
> > >        if (unlikely(!r->size) || READ_ONCE(r->queue[r->producer]))
> > >
> > > And of course:
> > >
> > > @@ -194,7 +194,7 @@ static inline void *__ptr_ring_peek(struct ptr_ring *r)
> > >  static inline bool __ptr_ring_empty(struct ptr_ring *r)
> > >  {
> > >         if (likely(r->size))
> > > -               return !r->queue[READ_ONCE(r->consumer_head)];
> > > +               return !READ_ONCE(r->queue[READ_ONCE(r->consumer_head)]);
> > >         return true;
> > >  }
> >
> >
> > I don't understand why it's necessary. consumer_head etc are
> > all lock protected.
> >
> > queue itself is not but we are only checking it for NULL -
> > it is fine if compiler reads it in many chunks and not all
> > at once.
> >
> > > @@ -256,7 +256,7 @@ static inline void __ptr_ring_zero_tail(struct
> > > ptr_ring *r, int consumer_head)
> > >          * besides the first one until we write out all entries.
> > >          */
> > >         while (likely(head > r->consumer_tail))
> > > -               r->queue[--head] = NULL;
> > > +               WRITE_ONCE(r->queue[--head], NULL);
> > >
> > >         r->consumer_tail = consumer_head;
> > >  }
> > >
> > >
> > > Presumably we should fix this in net tree first.
> >
> >
> > Maybe this one yes but I am not sure at all - KCSAN is happy.
> >
> 
> Hmmm.. what about this trace ?
> 
> BUG: KCSAN: data-race in pfifo_fast_dequeue / pfifo_fast_enqueue
> 
> write to 0xffff88811d5ccc00 of 8 bytes by interrupt on cpu 0:
> __ptr_ring_zero_tail include/linux/ptr_ring.h:259 [inline]
> __ptr_ring_discard_one include/linux/ptr_ring.h:291 [inline]
> __ptr_ring_consume include/linux/ptr_ring.h:311 [inline]
> __skb_array_consume include/linux/skb_array.h:98 [inline]
> pfifo_fast_dequeue+0x770/0x8f0 net/sched/sch_generic.c:770
> dequeue_skb net/sched/sch_generic.c:297 [inline]
> qdisc_restart net/sched/sch_generic.c:402 [inline]
> __qdisc_run+0x189/0xc80 net/sched/sch_generic.c:420
> qdisc_run include/net/pkt_sched.h:120 [inline]
> net_tx_action+0x379/0x590 net/core/dev.c:5793
> handle_softirqs+0xb9/0x280 kernel/softirq.c:622
> do_softirq+0x45/0x60 kernel/softirq.c:523
> __local_bh_enable_ip+0x70/0x80 kernel/softirq.c:450
> local_bh_enable include/linux/bottom_half.h:33 [inline]
> bpf_test_run+0x2db/0x620 net/bpf/test_run.c:426
> bpf_prog_test_run_skb+0x9a4/0xef0 net/bpf/test_run.c:1159
> bpf_prog_test_run+0x204/0x340 kernel/bpf/syscall.c:4721
> __sys_bpf+0x52e/0x7e0 kernel/bpf/syscall.c:6246
> __do_sys_bpf kernel/bpf/syscall.c:6341 [inline]
> __se_sys_bpf kernel/bpf/syscall.c:6339 [inline]
> __x64_sys_bpf+0x41/0x50 kernel/bpf/syscall.c:6339
> x64_sys_call+0x10cb/0x3020 arch/x86/include/generated/asm/syscalls_64.h:322
> do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> read to 0xffff88811d5ccc00 of 8 bytes by task 22632 on cpu 1:
> __ptr_ring_produce include/linux/ptr_ring.h:106 [inline]
> ptr_ring_produce include/linux/ptr_ring.h:129 [inline]
> skb_array_produce include/linux/skb_array.h:44 [inline]
> pfifo_fast_enqueue+0xd5/0x2c0 net/sched/sch_generic.c:741
> dev_qdisc_enqueue net/core/dev.c:4144 [inline]
> __dev_xmit_skb net/core/dev.c:4188 [inline]
> __dev_queue_xmit+0x6a4/0x1f20 net/core/dev.c:4795
> dev_queue_xmit include/linux/netdevice.h:3384 [inline]
> __bpf_tx_skb net/core/filter.c:2153 [inline]
> __bpf_redirect_common net/core/filter.c:2197 [inline]
> __bpf_redirect+0x862/0x990 net/core/filter.c:2204
> ____bpf_clone_redirect net/core/filter.c:2487 [inline]
> bpf_clone_redirect+0x20c/0x290 net/core/filter.c:2450
> bpf_prog_53f18857bc887b09+0x22/0x2a
> bpf_dispatcher_nop_func include/linux/bpf.h:1402 [inline]
> __bpf_prog_run include/linux/filter.h:723 [inline]
> bpf_prog_run include/linux/filter.h:730 [inline]
> bpf_test_run+0x29d/0x620 net/bpf/test_run.c:423
> bpf_prog_test_run_skb+0x9a4/0xef0 net/bpf/test_run.c:1159
> bpf_prog_test_run+0x204/0x340 kernel/bpf/syscall.c:4721
> __sys_bpf+0x52e/0x7e0 kernel/bpf/syscall.c:6246
> __do_sys_bpf kernel/bpf/syscall.c:6341 [inline]
> __se_sys_bpf kernel/bpf/syscall.c:6339 [inline]
> __x64_sys_bpf+0x41/0x50 kernel/bpf/syscall.c:6339
> x64_sys_call+0x10cb/0x3020 arch/x86/include/generated/asm/syscalls_64.h:322
> do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> value changed: 0xffff888104a93a00 -> 0x0000000000000000
> 
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 1 UID: 0 PID: 22632 Comm: syz.0.4135 Tainted: G W syzkaller #0
> PREEMPT(full)
> Tainted: [W]=WARN
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 01/24/2026


I'm a bit unhappy to demand "ONCE" when it's actually OK if it is
written in any order (e.g. I had some optimizations doing a memcpy
here in mind).

So I wanted to reproduce this, but couldn't. I did get the following,
so I know KCSAN was enabled:



[  333.381268] kworker/u8:2 (40) used greatest stack depth: 12600 bytes left
Thread 0 done
[  390.104354] ==================================================================
[  390.106378] BUG: KCSAN: data-race in enqueue_hrtimer / hrtimer_interrupt
[  390.108170] 
[  390.108656] write to 0xffff88807dd1c7cc of 1 bytes by interrupt on cpu 1:
[  390.110562]  hrtimer_interrupt+0x3d7/0x400
[  390.111697]  __sysvec_apic_timer_interrupt+0xaf/0x2c0
[  390.113130]  sysvec_apic_timer_interrupt+0x6b/0x80
[  390.114469]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  390.116319]  qdisc_pkt_len_segs_init+0x81/0x370
[  390.117840]  __dev_queue_xmit+0x13e/0x1fe0
[  390.119229]  __bpf_redirect+0x31b/0x5d0
[  390.120572]  bpf_clone_redirect+0x193/0x200
[  390.122127]  bpf_prog_5c0c01093e7cbf07+0x27/0x30
[  390.123762]  bpf_test_run+0x24a/0x520
[  390.125082]  bpf_prog_test_run_skb+0x7f0/0x14d0
[  390.126720]  __sys_bpf+0x1f34/0x3e60
[  390.128077]  __x64_sys_bpf+0x4c/0x70
[  390.129475]  x64_sys_call+0x12eb/0x2520
[  390.130900]  do_syscall_64+0x133/0x530
[  390.132270]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  390.134143] 
[  390.134744] read to 0xffff88807dd1c7cc of 1 bytes by interrupt on cpu 0:
[  390.137213]  enqueue_hrtimer+0x69/0x1c0
[  390.138630]  hrtimer_start_range_ns+0x54b/0x640
[  390.140363]  start_dl_timer+0xb2/0x1a0
[  390.141777]  update_curr_dl_se+0x29c/0x360
[  390.143410]  sched_tick+0xe3/0x2d0
[  390.144773]  update_process_times+0xa2/0x120
[  390.146440]  tick_nohz_handler+0x11a/0x350
[  390.148068]  __hrtimer_run_queues+0x11a/0x680
[  390.149710]  hrtimer_interrupt+0x1f7/0x400
[  390.151244]  __sysvec_apic_timer_interrupt+0xaf/0x2c0
[  390.153143]  sysvec_apic_timer_interrupt+0x6b/0x80
[  390.154604]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  390.156003]  pv_native_safe_halt+0xf/0x20
[  390.157216]  default_idle+0x9/0x10
[  390.158189]  default_idle_call+0x88/0x230
[  390.159369]  do_idle+0x1f9/0x280
[  390.160383]  cpu_startup_entry+0x24/0x30
[  390.161528]  rest_init+0x1be/0x1c0
[  390.162556]  start_kernel+0xa1b/0xa20
[  390.163658]  x86_64_start_reservations+0x24/0x30
[  390.164950]  x86_64_start_kernel+0xd1/0xe0
[  390.166094]  common_startup_64+0x13e/0x148
[  390.167242] 
[  390.167721] Reported by Kernel Concurrency Sanitizer on:
[  390.169462] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.0.0-rc2-dirty #352 PREEMPT(full) 
[  390.171946] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025
[  390.174471] ==================================================================


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-03-24 10:14     ` Simon Schippers
@ 2026-03-25 14:47       ` Simon Schippers
  2026-03-26  2:41         ` Jason Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Schippers @ 2026-03-25 14:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 3/24/26 11:14, Simon Schippers wrote:
> On 3/24/26 02:47, Jason Wang wrote:
>> On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
>> <simon.schippers@tu-dortmund.de> wrote:
>>>
>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>> reaches capacity, the netdev queue is stopped instead of dropping
>>> subsequent packets.
>>>
>>> If producing an entry fails anyway due to a race, tun_net_xmit returns
>>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
>>> LLTX is enabled and the transmit path operates without the usual locking.
>>>
>>> The existing __tun_wake_queue() function wakes the netdev queue. Races
>>> between this wakeup and the queue-stop logic could leave the queue
>>> stopped indefinitely. To prevent this, a memory barrier is enforced
>>> (as discussed in a similar implementation in [1]), followed by a recheck
>>> that wakes the queue if space is already available.
>>>
>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>
>> I wonder if we need a dedicated TUN flag to enable this. With this new
>> flag, we can even prevent TUN from using noqueue (not sure if it's
>> possible or not).
>>
> 
> Except for the slight regressions caused by this patchset, I do not see
> a reason for such a flag.
> 
> I have never seen that the driver prevents noqueue. For example you can
> set noqueue to your ethernet interface and under load you soon get 
> 
> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> 					     dev->name);
> 
> followed by a -ENETDOWN. And this is not prevented even though it is
> clearly not something a user wants.
> 
>>>
>>> Benchmarks:
>>> The benchmarks show a slight regression in raw transmission performance,
>>> though no packets are lost anymore.
>>>
>>> The previously introduced threshold to only wake after the queue stopped
>>> and half of the ring was consumed proved to be a decent choice:
>>> Waking the queue whenever a consume made space in the ring strongly
>>> degrades performance for tap, while waking only when the ring is empty
>>> is too late and also hurts throughput for tap & tap+vhost-net.
>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
>>> tun/tap+vhost-net.
>>>
>>> Test setup:
>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
>>> Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
>>> mitigations disabled.
>>>
>>> Note for tap+vhost-net:
>>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
>>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
>>>
>>> +--------------------------+--------------+----------------+----------+
>>> | 1 thread                 | Stock        | Patched with   | diff     |
>>> | sending                  |              | fq_codel qdisc |          |
>>> +------------+-------------+--------------+----------------+----------+
>>> | TAP        | Transmitted | 1.151 Mpps   | 1.139 Mpps     | -1.1%    |
>>> |            +-------------+--------------+----------------+----------+
>>> |            | Lost/s      | 3.606 Mpps   | 0 pps          |          |
>>> +------------+-------------+--------------+----------------+----------+
>>> | TAP        | Transmitted | 3.948 Mpps   | 3.738 Mpps     | -5.3%    |
>>> |            +-------------+--------------+----------------+----------+
>>> | +vhost-net | Lost/s      | 496.5 Kpps   | 0 pps          |          |
>>> +------------+-------------+--------------+----------------+----------+
>>>
>>> +--------------------------+--------------+----------------+----------+
>>> | 2 threads                | Stock        | Patched with   | diff     |
>>> | sending                  |              | fq_codel qdisc |          |
>>> +------------+-------------+--------------+----------------+----------+
>>> | TAP        | Transmitted | 1.133 Mpps   | 1.109 Mpps     | -2.1%    |
>>> |            +-------------+--------------+----------------+----------+
>>> |            | Lost/s      | 8.269 Mpps   | 0 pps          |          |
>>> +------------+-------------+--------------+----------------+----------+
>>> | TAP        | Transmitted | 3.820 Mpps   | 3.513 Mpps     | -8.0%    |
>>> |            +-------------+--------------+----------------+----------+
>>> | +vhost-net | Lost/s      | 4.961 Mpps   | 0 pps          |          |
>>> +------------+-------------+--------------+----------------+----------+
>>>
>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>>>
>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>> ---
>>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>>>  1 file changed, 28 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>> index b86582cc6cb6..9b7daec69acd 100644
>>> --- a/drivers/net/tun.c
>>> +++ b/drivers/net/tun.c
>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>         struct netdev_queue *queue;
>>>         struct tun_file *tfile;
>>>         int len = skb->len;
>>> +       bool qdisc_present;
>>> +       int ret;
>>>
>>>         rcu_read_lock();
>>>         tfile = rcu_dereference(tun->tfiles[txq]);
>>> @@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>
>>>         nf_reset_ct(skb);
>>>
>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>> +       queue = netdev_get_tx_queue(dev, txq);
>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
>>> +
>>> +       spin_lock(&tfile->tx_ring.producer_lock);
>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>
>> So, it's possible that the administrator is switching between noqueue
>> and another qdisc. So ptr_ring_produce() can fail here, do we need to
>> check that or not?
>>
> 
> Do you mean that? My thoughts:
> 
> Switching from noqueue to some qdisc can cause a 
> 
> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> 					     dev->name);
> 
> followed by a return of -ENETDOWN in __dev_queue_xmit().
> This is because tun_net_xmit detects some qdisc with 
> 
> qdisc_present = !qdisc_txq_has_no_queue(queue);
> 
> and returns NETDEV_TX_BUSY even though __dev_queue_xmit() did still
> detect noqueue.
> 
> I am not sure how to solve this, or whether it has to be solved at all.
> I do not see a proper way to avoid parallel execution of ndo_start_xmit
> and a qdisc change (dev_graft_qdisc only takes qdisc_skb_head lock).
> 
> And from my understanding the veth implementation faces the same issue.

How about rechecking if a qdisc is connected?
This would avoid -ENETDOWN.

diff --git a/net/core/dev.c b/net/core/dev.c
index f48dc299e4b2..2731a1a70732 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4845,10 +4845,17 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
                                if (is_list)
                                        rc = NETDEV_TX_OK;
                        }
+                       bool qdisc_present = !qdisc_txq_has_no_queue(txq);
                        HARD_TX_UNLOCK(dev, txq);
                        if (!skb) /* xmit completed */
                                goto out;
 
+                       /* Maybe a qdisc was connected in the meantime */
+                       if (qdisc_present) {
+                               kfree_skb(skb);
+                               goto out;
+                       }
+
                        net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
                                             dev->name);
                        /* NETDEV_TX_BUSY or queue was stopped */


> 
> 
> Switching from some qdisc to noqueue is no problem I think.
> 
>>> +               netif_tx_stop_queue(queue);
>>> +               /* Avoid races with queue wake-ups in __tun_wake_queue by
>>> +                * waking if space is available in a re-check.
>>> +                * The barrier makes sure that the stop is visible before
>>> +                * we re-check.
>>> +                */
>>> +               smp_mb__after_atomic();
>>
>> Let's document which barrier is paired with this.
>>
> 
> I am basically copying the (old) logic of veth [1] proposed by
> Jakub Kicinski. I must admit I am not 100% sure what it pairs with.
> 
> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> 
>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>> +                       netif_tx_wake_queue(queue);
>>> +       }
>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
>>> +
>>> +       if (ret) {
>>> +               /* If a qdisc is attached to our virtual device,
>>> +                * returning NETDEV_TX_BUSY is allowed.
>>> +                */
>>> +               if (qdisc_present) {
>>> +                       rcu_read_unlock();
>>> +                       return NETDEV_TX_BUSY;
>>> +               }
>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>                 goto drop;
>>>         }
>>>
>>>         /* dev->lltx requires to do our own update of trans_start */
>>> -       queue = netdev_get_tx_queue(dev, txq);
>>>         txq_trans_cond_update(queue);
>>>
>>>         /* Notify and wake up reader process */
>>> --
>>> 2.43.0
>>>
>>
>> Thanks
>>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-03-25 14:47       ` Simon Schippers
@ 2026-03-26  2:41         ` Jason Wang
  2026-03-26 15:30           ` Simon Schippers
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Wang @ 2026-03-26  2:41 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Wed, Mar 25, 2026 at 10:48 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 3/24/26 11:14, Simon Schippers wrote:
> > On 3/24/26 02:47, Jason Wang wrote:
> >> On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
> >> <simon.schippers@tu-dortmund.de> wrote:
> >>>
> >>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>> reaches capacity, the netdev queue is stopped instead of dropping
> >>> subsequent packets.
> >>>
> >>> If producing an entry fails anyway due to a race, tun_net_xmit returns
> >>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
> >>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>
> >>> The existing __tun_wake_queue() function wakes the netdev queue. Races
> >>> between this wakeup and the queue-stop logic could leave the queue
> >>> stopped indefinitely. To prevent this, a memory barrier is enforced
> >>> (as discussed in a similar implementation in [1]), followed by a recheck
> >>> that wakes the queue if space is already available.
> >>>
> >>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>
> >> I wonder if we need a dedicated TUN flag to enable this. With this new
> >> flag, we can even prevent TUN from using noqueue (not sure if it's
> >> possible or not).
> >>
> >
> > Except for the slight regressions caused by this patchset, I do not see
> > a reason for such a flag.
> >
> > I have never seen that the driver prevents noqueue. For example you can
> > set noqueue to your ethernet interface and under load you soon get
> >
> > net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> >                                            dev->name);
> >
> > followed by a -ENETDOWN. And this is not prevented even though it is
> > clearly not something a user wants.
> >
> >>>
> >>> Benchmarks:
> >>> The benchmarks show a slight regression in raw transmission performance,
> >>> though no packets are lost anymore.
> >>>
> >>> The previously introduced threshold to only wake after the queue stopped
> >>> and half of the ring was consumed proved to be a decent choice:
> >>> Waking the queue whenever a consume made space in the ring strongly
> >>> degrades performance for tap, while waking only when the ring is empty
> >>> is too late and also hurts throughput for tap & tap+vhost-net.
> >>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
> >>> 1/2 was chosen for the sake of simplicity for both tun/tap and
> >>> tun/tap+vhost-net.
> >>>
> >>> Test setup:
> >>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> >>> Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
> >>> mitigations disabled.
> >>>
> >>> Note for tap+vhost-net:
> >>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
> >>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
> >>>
> >>> +--------------------------+--------------+----------------+----------+
> >>> | 1 thread                 | Stock        | Patched with   | diff     |
> >>> | sending                  |              | fq_codel qdisc |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>> | TAP        | Transmitted | 1.151 Mpps   | 1.139 Mpps     | -1.1%    |
> >>> |            +-------------+--------------+----------------+----------+
> >>> |            | Lost/s      | 3.606 Mpps   | 0 pps          |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>> | TAP        | Transmitted | 3.948 Mpps   | 3.738 Mpps     | -5.3%    |
> >>> |            +-------------+--------------+----------------+----------+
> >>> | +vhost-net | Lost/s      | 496.5 Kpps   | 0 pps          |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>>
> >>> +--------------------------+--------------+----------------+----------+
> >>> | 2 threads                | Stock        | Patched with   | diff     |
> >>> | sending                  |              | fq_codel qdisc |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>> | TAP        | Transmitted | 1.133 Mpps   | 1.109 Mpps     | -2.1%    |
> >>> |            +-------------+--------------+----------------+----------+
> >>> |            | Lost/s      | 8.269 Mpps   | 0 pps          |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>> | TAP        | Transmitted | 3.820 Mpps   | 3.513 Mpps     | -8.0%    |
> >>> |            +-------------+--------------+----------------+----------+
> >>> | +vhost-net | Lost/s      | 4.961 Mpps   | 0 pps          |          |
> >>> +------------+-------------+--------------+----------------+----------+
> >>>
> >>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>>
> >>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>> ---
> >>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
> >>>  1 file changed, 28 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>> index b86582cc6cb6..9b7daec69acd 100644
> >>> --- a/drivers/net/tun.c
> >>> +++ b/drivers/net/tun.c
> >>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>         struct netdev_queue *queue;
> >>>         struct tun_file *tfile;
> >>>         int len = skb->len;
> >>> +       bool qdisc_present;
> >>> +       int ret;
> >>>
> >>>         rcu_read_lock();
> >>>         tfile = rcu_dereference(tun->tfiles[txq]);
> >>> @@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>
> >>>         nf_reset_ct(skb);
> >>>
> >>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>> +       queue = netdev_get_tx_queue(dev, txq);
> >>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>> +
> >>> +       spin_lock(&tfile->tx_ring.producer_lock);
> >>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>
> >> So, it's possible that the administrator is switching between noqueue
> >> and another qdisc. So ptr_ring_produce() can fail here, do we need to
> >> check that or not?
> >>
> >
> > Do you mean that? My thoughts:
> >
> > Switching from noqueue to some qdisc can cause a
> >
> > net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> >                                            dev->name);
> >
> > followed by a return of -ENETDOWN in __dev_queue_xmit().
> > This is because tun_net_xmit detects some qdisc with
> >
> > qdisc_present = !qdisc_txq_has_no_queue(queue);
> >
> > and returns NETDEV_TX_BUSY even though __dev_queue_xmit() did still
> > detect noqueue.
> >
> > I am not sure how to solve this, or whether it has to be solved at all.
> > I do not see a proper way to avoid parallel execution of ndo_start_xmit
> > and a qdisc change (dev_graft_qdisc only takes qdisc_skb_head lock).
> >
> > And from my understanding the veth implementation faces the same issue.
>
> How about rechecking if a qdisc is connected?
> This would avoid -ENETDOWN.
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f48dc299e4b2..2731a1a70732 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4845,10 +4845,17 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>                                 if (is_list)
>                                         rc = NETDEV_TX_OK;
>                         }
> +                       bool qdisc_present = !qdisc_txq_has_no_queue(txq);
>                         HARD_TX_UNLOCK(dev, txq);
>                         if (!skb) /* xmit completed */
>                                 goto out;
>
> +                       /* Maybe a qdisc was connected in the meantime */
> +                       if (qdisc_present) {
> +                               kfree_skb(skb);
> +                               goto out;
> +                       }
> +
>                         net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>                                              dev->name);
>                         /* NETDEV_TX_BUSY or queue was stopped */
>

Probably not, and we likely won't hit this warning because the qdisc
cannot be changed during ndo_start_xmit().

I meant something like this:

1) set noqueue to tuntap
2) produce packets so tuntap is full
3) set e.g fq_codel to tuntap
4) then we can hit the failure of __ptr_ring_produce()

Rethinking the code, it looks just fine.

>
> >
> >
> > Switching from some qdisc to noqueue is no problem I think.
> >
> >>> +               netif_tx_stop_queue(queue);
> >>> +               /* Avoid races with queue wake-ups in __tun_wake_queue by
> >>> +                * waking if space is available in a re-check.
> >>> +                * The barrier makes sure that the stop is visible before
> >>> +                * we re-check.
> >>> +                */
> >>> +               smp_mb__after_atomic();
> >>
> >> Let's document which barrier is paired with this.
> >>
> >
> > I am basically copying the (old) logic of veth [1] proposed by
> > Jakub Kicinski. I must admit I am not 100% sure what it pairs with.
> >
> > [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/

So it looks like it implicitly tries to pair with tun_ring_consume():

1) spinlock(consumer_lock)
2) store NULL to ptr_ring // STORE
3) spinunlock(consumer_lock) // RELEASE
4) spinlock(consumer_lock) // ACQUIRE
5) check empty
6) spinunlock(consumer_lock)
7) netif_wakeup_queue() // test_and_set() which is an RMW

RELEASE + ACQUIRE implies a full barrier

I see several problems

1) Due to batch consumption, we may get spurious wakeups under heavy
load (we can try disabling batch consumption to see if it helps).
2) So the barriers don't help but would slow down consumption.
3) Two spinlocks are used instead of one, which is another reason we
will see a performance regression.
4) The code is tricky and needs to be understood, or at least requires
a comment tweak.

Note that due to ~IFF_TX_SKB_SHARING, pktgen can't clone skbs, so we
may not notice the real degradation.

> >
> >>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>> +                       netif_tx_wake_queue(queue);
> >>> +       }
> >>> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >>> +
> >>> +       if (ret) {
> >>> +               /* If a qdisc is attached to our virtual device,
> >>> +                * returning NETDEV_TX_BUSY is allowed.
> >>> +                */
> >>> +               if (qdisc_present) {
> >>> +                       rcu_read_unlock();
> >>> +                       return NETDEV_TX_BUSY;
> >>> +               }
> >>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>                 goto drop;
> >>>         }
> >>>
> >>>         /* dev->lltx requires to do our own update of trans_start */
> >>> -       queue = netdev_get_tx_queue(dev, txq);
> >>>         txq_trans_cond_update(queue);
> >>>
> >>>         /* Notify and wake up reader process */
> >>> --
> >>> 2.43.0
> >>>
> >>
> >> Thanks
> >>
>

Thanks


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-03-26  2:41         ` Jason Wang
@ 2026-03-26 15:30           ` Simon Schippers
  0 siblings, 0 replies; 21+ messages in thread
From: Simon Schippers @ 2026-03-26 15:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 3/26/26 03:41, Jason Wang wrote:
> On Wed, Mar 25, 2026 at 10:48 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 3/24/26 11:14, Simon Schippers wrote:
>>> On 3/24/26 02:47, Jason Wang wrote:
>>>> On Thu, Mar 12, 2026 at 9:07 PM Simon Schippers
>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>
>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>>>> reaches capacity, the netdev queue is stopped instead of dropping
>>>>> subsequent packets.
>>>>>
>>>>> If producing an entry fails anyway due to a race, tun_net_xmit returns
>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
>>>>> LLTX is enabled and the transmit path operates without the usual locking.
>>>>>
>>>>> The existing __tun_wake_queue() function wakes the netdev queue. Races
>>>>> between this wakeup and the queue-stop logic could leave the queue
>>>>> stopped indefinitely. To prevent this, a memory barrier is enforced
>>>>> (as discussed in a similar implementation in [1]), followed by a recheck
>>>>> that wakes the queue if space is already available.
>>>>>
>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>>>
>>>> I wonder if we need a dedicated TUN flag to enable this. With this new
>>>> flag, we can even prevent TUN from using noqueue (not sure if it's
>>>> possible or not).
>>>>
>>>
>>> Except for the slight regressions caused by this patchset, I do not see
>>> a reason for such a flag.
>>>
>>> I have never seen that the driver prevents noqueue. For example you can
>>> set noqueue to your ethernet interface and under load you soon get
>>>
>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>>>                                            dev->name);
>>>
>>> followed by a -ENETDOWN. And this is not prevented even though it is
>>> clearly not something a user wants.
>>>
>>>>>
>>>>> Benchmarks:
>>>>> The benchmarks show a slight regression in raw transmission performance,
>>>>> though no packets are lost anymore.
>>>>>
>>>>> The previously introduced threshold to only wake after the queue stopped
>>>>> and half of the ring was consumed proved to be a decent choice:
>>>>> Waking the queue whenever a consume made space in the ring strongly
>>>>> degrades performance for tap, while waking only when the ring is empty
>>>>> is too late and also hurts throughput for tap & tap+vhost-net.
>>>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
>>>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
>>>>> tun/tap+vhost-net.
>>>>>
>>>>> Test setup:
>>>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
>>>>> Average over 20 runs @ 100,000,000 packets. SRSO and spectre v2
>>>>> mitigations disabled.
>>>>>
>>>>> Note for tap+vhost-net:
>>>>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
>>>>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
>>>>>
>>>>> +--------------------------+--------------+----------------+----------+
>>>>> | 1 thread                 | Stock        | Patched with   | diff     |
>>>>> | sending                  |              | fq_codel qdisc |          |
>>>>> +------------+-------------+--------------+----------------+----------+
>>>>> | TAP        | Transmitted | 1.151 Mpps   | 1.139 Mpps     | -1.1%    |
>>>>> |            +-------------+--------------+----------------+----------+
>>>>> |            | Lost/s      | 3.606 Mpps   | 0 pps          |          |
>>>>> +------------+-------------+--------------+----------------+----------+
>>>>> | TAP        | Transmitted | 3.948 Mpps   | 3.738 Mpps     | -5.3%    |
>>>>> |            +-------------+--------------+----------------+----------+
>>>>> | +vhost-net | Lost/s      | 496.5 Kpps   | 0 pps          |          |
>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>
>>>>> +--------------------------+--------------+----------------+----------+
>>>>> | 2 threads                | Stock        | Patched with   | diff     |
>>>>> | sending                  |              | fq_codel qdisc |          |
>>>>> +------------+-------------+--------------+----------------+----------+
>>>>> | TAP        | Transmitted | 1.133 Mpps   | 1.109 Mpps     | -2.1%    |
>>>>> |            +-------------+--------------+----------------+----------+
>>>>> |            | Lost/s      | 8.269 Mpps   | 0 pps          |          |
>>>>> +------------+-------------+--------------+----------------+----------+
>>>>> | TAP        | Transmitted | 3.820 Mpps   | 3.513 Mpps     | -8.0%    |
>>>>> |            +-------------+--------------+----------------+----------+
>>>>> | +vhost-net | Lost/s      | 4.961 Mpps   | 0 pps          |          |
>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>
>>>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>>>>>
>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>> ---
>>>>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>>>>>  1 file changed, 28 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>> index b86582cc6cb6..9b7daec69acd 100644
>>>>> --- a/drivers/net/tun.c
>>>>> +++ b/drivers/net/tun.c
>>>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>         struct netdev_queue *queue;
>>>>>         struct tun_file *tfile;
>>>>>         int len = skb->len;
>>>>> +       bool qdisc_present;
>>>>> +       int ret;
>>>>>
>>>>>         rcu_read_lock();
>>>>>         tfile = rcu_dereference(tun->tfiles[txq]);
>>>>> @@ -1063,13 +1065,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>
>>>>>         nf_reset_ct(skb);
>>>>>
>>>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>>>> +       queue = netdev_get_tx_queue(dev, txq);
>>>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>>> +
>>>>> +       spin_lock(&tfile->tx_ring.producer_lock);
>>>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>>>
>>>> So, it's possible that the administrator is switching between noqueue
>>>> and another qdisc. So ptr_ring_produce() can fail here, do we need to
>>>> check that or not?
>>>>
>>>
>>> Do you mean that? My thoughts:
>>>
>>> Switching from noqueue to some qdisc can cause a
>>>
>>> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>>>                                            dev->name);
>>>
>>> followed by a return of -ENETDOWN in __dev_queue_xmit().
>>> This is because tun_net_xmit detects some qdisc with
>>>
>>> qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>
>>> and returns NETDEV_TX_BUSY even though __dev_queue_xmit() still
>>> detected noqueue.
>>>
>>> I am not sure how to solve this/if this has to be solved.
>>> I do not see a proper way to avoid parallel execution of ndo_start_xmit
>>> and a qdisc change (dev_graft_qdisc only takes qdisc_skb_head lock).
>>>
>>> And from my understanding the veth implementation faces the same issue.
>>
>> How about rechecking if a qdisc is connected?
>> This would avoid -ENETDOWN.
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index f48dc299e4b2..2731a1a70732 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4845,10 +4845,17 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>>                                 if (is_list)
>>                                         rc = NETDEV_TX_OK;
>>                         }
>> +                       bool qdisc_present = !qdisc_txq_has_no_queue(txq);
>>                         HARD_TX_UNLOCK(dev, txq);
>>                         if (!skb) /* xmit completed */
>>                                 goto out;
>>
>> +                       /* Maybe a qdisc was connected in the meantime */
>> +                       if (qdisc_present) {
>> +                               kfree_skb(skb);
>> +                               goto out;
>> +                       }
>> +
>>                         net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>>                                              dev->name);
>>                         /* NETDEV_TX_BUSY or queue was stopped */
>>
> 
> Probably not, and we likely won't hit this warning because the qdisc
> cannot be changed during ndo_start_xmit().

Okay.

> 
> I meant something like this:
> 
> 1) set noqueue to tuntap
> 2) produce packets so tuntap is full
> 3) set e.g fq_codel to tuntap
> 4) then we can hit the failure of __ptr_ring_produce()
> 
> Rethinking the code, it looks just fine.

Yes, in this case it just returns NETDEV_TX_BUSY which is fine with a
qdisc attached.

> 
>>
>>>
>>>
>>> Switching from some qdisc to noqueue is no problem I think.
>>>
>>>>> +               netif_tx_stop_queue(queue);
>>>>> +               /* Avoid races with queue wake-ups in __tun_wake_queue by
>>>>> +                * waking if space is available in a re-check.
>>>>> +                * The barrier makes sure that the stop is visible before
>>>>> +                * we re-check.
>>>>> +                */
>>>>> +               smp_mb__after_atomic();
>>>>
>>>> Let's document which barrier is paired with this.
>>>>
>>>
>>> I am basically copying the (old) logic of veth [1] proposed by
>>> Jakub Kicinski. I must admit I am not 100% sure what it pairs with.
>>>
>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> 
> So it looks like it implicitly tries to pair with tun_ring_consume():
> 
> 1) spinlock(consumer_lock)
> 2) store NULL to ptr_ring // STORE
> 3) spinunlock(consumer_lock) // RELEASE
> 4) spinlock(consumer_lock) // ACQURE
> 5) check empty
> 6) spinunlock(consumer_lock)
> 7) netif_wakeup_queue() // test_and_set() which is an RMW
> 
> RELEASE + ACQUIRE implies a full barrier

Thanks.

> 
> I see several problems
> 
> 1) Due to batch consumption, we may get spurious wakeups under heavy
> load (we can try disabling batch consuming to see if it helps).

I assume you mean that the wake in the producer's re-check happens too
often under heavy load. But that would only cost slightly more producer
CPU, since the SOFTIRQ runs on the producer CPU, and would not slow
down the consumer?

Why would disabling batch consume help here?
Wouldn't it just decrease the consumer speed?

Apart from that I do not see another way to do this re-check.
The ring producer is only safely able to do a !produce_peek (i.e. a
check for !full).

The normal waking (after consuming half of the ring) should be fine IMO.

> 2) So the barriers don't help but would slow down the consuming
> 3) Two spinlocks were used instead of one, this is another reason we
> will see a performance regression

You are right, I can change it to a single spin_lock. Apart from that
I do not see how the barriers/locking could be reduced further.

> 4) Tricky code that needs to be understood or at least requires a comment tweak.
> 
> Note that due to ~IFF_TX_SKB_SHARING, pktgen can't clone skbs, so we
> may not notice the real degradation.

So run pktgen with pg_set SHARED? I am pretty sure the vhost thread was
always at 100% CPU, so pktgen was not the bottleneck. And whenever I
had perf enabled, I saw that in my patched version it was not the
creation of SKBs that took most of the CPU in pktgen but a different
unnamed function (I assume a waiting function).


Thank you!

> 
>>>
>>>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>>>> +                       netif_tx_wake_queue(queue);
>>>>> +       }
>>>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
>>>>> +
>>>>> +       if (ret) {
>>>>> +               /* If a qdisc is attached to our virtual device,
>>>>> +                * returning NETDEV_TX_BUSY is allowed.
>>>>> +                */
>>>>> +               if (qdisc_present) {
>>>>> +                       rcu_read_unlock();
>>>>> +                       return NETDEV_TX_BUSY;
>>>>> +               }
>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>>                 goto drop;
>>>>>         }
>>>>>
>>>>>         /* dev->lltx requires to do our own update of trans_start */
>>>>> -       queue = netdev_get_tx_queue(dev, txq);
>>>>>         txq_trans_cond_update(queue);
>>>>>
>>>>>         /* Notify and wake up reader process */
>>>>> --
>>>>> 2.43.0
>>>>>
>>>>
>>>> Thanks
>>>>
>>
> 
> Thanks
> 


end of thread, other threads:[~2026-03-26 15:31 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-12 13:06 [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
2026-03-12 13:06 ` [PATCH net-next v8 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
2026-03-24  1:47   ` Jason Wang
2026-03-12 13:06 ` [PATCH net-next v8 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume Simon Schippers
2026-03-12 13:54   ` Michael S. Tsirkin
2026-03-24  1:47   ` Jason Wang
2026-03-12 13:06 ` [PATCH net-next v8 3/4] ptr_ring: move free-space check into separate helper Simon Schippers
2026-03-12 13:17   ` Eric Dumazet
2026-03-12 13:48     ` Michael S. Tsirkin
2026-03-12 14:21       ` Eric Dumazet
2026-03-25 11:07         ` Michael S. Tsirkin
2026-03-12 13:06 ` [PATCH net-next v8 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present Simon Schippers
2026-03-24  1:47   ` Jason Wang
2026-03-24 10:14     ` Simon Schippers
2026-03-25 14:47       ` Simon Schippers
2026-03-26  2:41         ` Jason Wang
2026-03-26 15:30           ` Simon Schippers
2026-03-12 13:55 ` [PATCH net-next v8 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Michael S. Tsirkin
2026-03-13  9:49   ` Simon Schippers
2026-03-13 10:35     ` Michael S. Tsirkin
2026-03-23 21:49 ` Simon Schippers
