public inbox for linux-kernel@vger.kernel.org
* [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops
@ 2026-04-28 12:38 Simon Schippers
  2026-04-28 12:38 ` [PATCH net-next v9 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Simon Schippers @ 2026-04-28 12:38 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

This patch series deals with tun/tap & vhost-net, which drop incoming
SKBs whenever their internal ptr_ring buffer is full. With this patch
series, the associated netdev queue is instead stopped, but only when a
qdisc is attached. If no qdisc is present, the existing behavior is
preserved. The series touches tun/tap and vhost-net together because
they share common logic and must be updated in lockstep; modifying only
one of them would break the other.
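
For reference, a qdisc can be attached to a tun/tap device with, for
example (a minimal sketch, assuming a tun0 device):

	tc qdisc replace dev tun0 root fq_codel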

By applying proper backpressure, this change allows the connected qdisc to 
operate correctly, as reported in [1], and significantly improves
performance in real-world scenarios, as demonstrated in our paper [2]. For 
example, we observed a 36% TCP throughput improvement for an OpenVPN 
connection between Germany and the USA.

Synthetic pktgen benchmarks indicate a slight regression, but packet loss
no longer occurs. Pktgen benchmarks are provided per commit, with the final
commit showing the overall performance.

Thanks!

[1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
[2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf

---
Changelog:
V9:
- Addressed minor nit by MST in patches 1 and 2.
- Rebased patch 3 because of commit d748047
  ("ptr_ring: disable KCSAN warnings").
- Documented what the smp_mb__after_atomic() in tun_net_xmit() pairs
  with in tun_ring_consume().
  --> It simply pairs with the test_and_clear_bit() inside
      netif_wake_subqueue().
- Use 1 ptr_ring consumer spinlock instead of 2.
- Ran pktgen benchmarks with pg_set SHARED for 50 iterations on the
  latest kernel
  --> No significant performance difference was observed

V8: https://lore.kernel.org/netdev/20260312130639.138988-1-simon.schippers@tu-dortmund.de/
- Dropped the code changes in drivers/net/tap.c; the code there deals
  with ipvtap/macvtap, which are unrelated to the goal of this patch
  series (something I had not realized before)
-> Greatly simplified logic, 4 instead of 9 commits
-> No more duplicated logic or special-casing in vhost required
- Only wake after the queue has stopped and half of the ring has been
  consumed, as suggested by MST
-> Performance improvements for TAP, but still slightly slower
- Better benchmarking with pinned threads, an XDP drop program for
  tap+vhost-net and disabled CPU mitigations (and a newer Ryzen 5 5600X
  processor), as suggested by Jason Wang

V7: https://lore.kernel.org/netdev/20260107210448.37851-1-simon.schippers@tu-dortmund.de/
- Switch to an approach similar to veth (excluding the recently fixed 
variant), as suggested by MST, with minor adjustments discussed in V6
- Rename the cover-letter title
- Add multithreaded pktgen and iperf3 benchmarks, as suggested by Jason 
Wang
- Rework __ptr_ring_consume_created_space() so it can also be used after 
batched consume

V6: https://lore.kernel.org/netdev/20251120152914.1127975-1-simon.schippers@tu-dortmund.de/
General:
- Major adjustments to the descriptions. Special thanks to Jon Kohler!
- Fixed git bisect by moving most logic into dedicated functions that
only start being used in patch 7.
- Moved the main logic of the coupled producer and consumer into a single
patch to avoid a chicken-and-egg dependency between commits :-)
- Rebased to 6.18-rc5 and reran the benchmarks, which now also include
lost packets (previously I missed a 0, so all benchmark results were too
high by a factor of 10...).
- Also include the benchmark in patch 7.

Producer:
- Move logic into the new helper tun_ring_produce()
- Added an smp_rmb() paired with the consumer, ensuring that space freed
by the consumer is visible
- Assume that ptr_ring is not full when __ptr_ring_full_next() is called

Consumer:
- Use an unpaired smp_rmb() instead of barrier() to ensure that the 
netdev_tx_queue_stopped() call completes before discarding
- Also wake the netdev queue if it was stopped before discarding and then 
becomes empty
-> Fixes race with producer as identified by MST in V5
-> Waking the netdev queues upon resize is not required anymore
- Use __ptr_ring_consume_created_space() instead of messing with ptr_ring 
internals
-> Batched consume now just calls 
__tun_ring_consume()/__tap_ring_consume() in a loop
- Added an smp_wmb() before waking the netdev queue which is paired with 
the smp_rmb() discussed above

V5: https://lore.kernel.org/netdev/20250922221553.47802-1-simon.schippers@tu-dortmund.de/T/#u
- Stop the netdev queue prior to producing the final fitting ptr_ring entry
-> Ensures the consumer has the latest netdev queue state, making it safe 
to wake the queue
-> Resolves an issue in vhost-net where the netdev queue could remain 
stopped despite being empty
-> For TUN/TAP, the netdev queue no longer needs to be woken in the 
blocking loop
-> Introduces new helpers __ptr_ring_full_next and 
__ptr_ring_will_invalidate for this purpose
- vhost-net now uses wrappers of TUN/TAP for ptr_ring consumption rather 
than maintaining its own rx_ring pointer

V4: https://lore.kernel.org/netdev/20250902080957.47265-1-simon.schippers@tu-dortmund.de/T/#u
- Target net-next instead of net
- Changed to patch series instead of single patch
- Changed the title from the old
"TUN/TAP: Improving throughput and latency by avoiding SKB drops"
- Wake the netdev queue with the new wake_netdev_queue helpers when there
is any spare capacity in the ptr_ring instead of waiting for it to be
empty
- Use tun_file instead of tun_struct in tun_ring_recv for more consistent
logic
- Use an smp_wmb()/smp_rmb() barrier pair, which avoids the packet drops
that rarely happened before
- Use safer logic for vhost-net using RCU read locks to access TUN/TAP data

V3: https://lore.kernel.org/netdev/20250825211832.84901-1-simon.schippers@tu-dortmund.de/T/#u
- Added support for TAP and TAP+vhost-net.

V2: https://lore.kernel.org/netdev/20250811220430.14063-1-simon.schippers@tu-dortmund.de/T/#u
- Removed NETDEV_TX_BUSY return case in tun_net_xmit and removed 
unnecessary netif_tx_wake_queue in tun_ring_recv.

V1: https://lore.kernel.org/netdev/20250808153721.261334-1-simon.schippers@tu-dortmund.de/T/#u
---

Simon Schippers (4):
  tun/tap: add ptr_ring consume helper with netdev queue wakeup
  vhost-net: wake queue of tun/tap after ptr_ring consume
  ptr_ring: move free-space check into separate helper
  tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present

 drivers/net/tun.c        | 95 +++++++++++++++++++++++++++++++++++++---
 drivers/vhost/net.c      | 21 ++++++---
 include/linux/if_tun.h   |  3 ++
 include/linux/ptr_ring.h | 14 +++++-
 4 files changed, 119 insertions(+), 14 deletions(-)

-- 
2.43.0



* [PATCH net-next v9 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-04-28 12:38 [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
@ 2026-04-28 12:38 ` Simon Schippers
  2026-04-28 12:38 ` [PATCH net-next v9 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume Simon Schippers
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Simon Schippers @ 2026-04-28 12:38 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Introduce tun_ring_consume(), which wraps ptr_ring_consume() and calls
__tun_wake_queue(). The latter wakes the stopped netdev subqueue once
half of the ring capacity has been consumed, tracked via the new
cons_cnt field in tun_file. When the ring is empty the queue is also
woken to handle potential races. The point is to allow the queue to be
stopped when it gets full, which is required for traffic shaping and is
implemented by the later commit "tun/tap & vhost-net: avoid ptr_ring
tail-drop when a qdisc is present".

Without the corresponding queue stopping, this patch alone causes no
regression for a tap setup sending to a QEMU VM: 1.136 Mpps
to 1.141 Mpps.

Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, pktgen sender; Avg over 50 runs @ 100,000,000 packets;
SRSO and spectre v2 mitigations disabled.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tun.c | 43 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b183189f1853..e6ee2271732f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -145,6 +145,7 @@ struct tun_file {
 	struct list_head next;
 	struct tun_struct *detached;
 	struct ptr_ring tx_ring;
+	int cons_cnt;
 	struct xdp_rxq_info xdp_rxq;
 };
 
@@ -564,6 +565,7 @@ static void tun_queue_purge(struct tun_file *tfile)
 	while ((ptr = ptr_ring_consume(&tfile->tx_ring)) != NULL)
 		tun_ptr_free(ptr);
 
+	tfile->cons_cnt = 0;
 	skb_queue_purge(&tfile->sk.sk_write_queue);
 	skb_queue_purge(&tfile->sk.sk_error_queue);
 }
@@ -730,6 +732,7 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
 		goto out;
 	}
 
+	tfile->cons_cnt = 0;
 	tfile->queue_index = tun->numqueues;
 	tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
 
@@ -2115,13 +2118,42 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	return total;
 }
 
-static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
+/* Callers must hold ring.consumer_lock */
+static void __tun_wake_queue(struct tun_struct *tun, struct tun_file *tfile)
+{
+	if (__ptr_ring_empty(&tfile->tx_ring))
+		goto wake;
+
+	if (!__netif_subqueue_stopped(tun->dev, tfile->queue_index) ||
+	    ++tfile->cons_cnt < tfile->tx_ring.size / 2)
+		return;
+
+wake:
+	netif_wake_subqueue(tun->dev, tfile->queue_index);
+	tfile->cons_cnt = 0;
+}
+
+static void *tun_ring_consume(struct tun_struct *tun, struct tun_file *tfile)
+{
+	void *ptr;
+
+	spin_lock(&tfile->tx_ring.consumer_lock);
+	ptr = __ptr_ring_consume(&tfile->tx_ring);
+	if (ptr)
+		__tun_wake_queue(tun, tfile);
+
+	spin_unlock(&tfile->tx_ring.consumer_lock);
+	return ptr;
+}
+
+static void *tun_ring_recv(struct tun_struct *tun, struct tun_file *tfile,
+			   int noblock, int *err)
 {
 	DECLARE_WAITQUEUE(wait, current);
 	void *ptr = NULL;
 	int error = 0;
 
-	ptr = ptr_ring_consume(&tfile->tx_ring);
+	ptr = tun_ring_consume(tun, tfile);
 	if (ptr)
 		goto out;
 	if (noblock) {
@@ -2133,7 +2165,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 
 	while (1) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		ptr = ptr_ring_consume(&tfile->tx_ring);
+		ptr = tun_ring_consume(tun, tfile);
 		if (ptr)
 			break;
 		if (signal_pending(current)) {
@@ -2170,7 +2202,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 
 	if (!ptr) {
 		/* Read frames from ring */
-		ptr = tun_ring_recv(tfile, noblock, &err);
+		ptr = tun_ring_recv(tun, tfile, noblock, &err);
 		if (!ptr)
 			return err;
 	}
@@ -3406,6 +3438,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
 		return -ENOMEM;
 	}
 
+	tfile->cons_cnt = 0;
+
 	mutex_init(&tfile->napi_mutex);
 	RCU_INIT_POINTER(tfile->tun, NULL);
 	tfile->flags = 0;
@@ -3614,6 +3648,7 @@ static int tun_queue_resize(struct tun_struct *tun)
 	for (i = 0; i < tun->numqueues; i++) {
 		tfile = rtnl_dereference(tun->tfiles[i]);
 		rings[i] = &tfile->tx_ring;
+		tfile->cons_cnt = 0;
 	}
 	list_for_each_entry(tfile, &tun->disabled, next)
 		rings[i++] = &tfile->tx_ring;
-- 
2.43.0



* [PATCH net-next v9 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume
  2026-04-28 12:38 [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
  2026-04-28 12:38 ` [PATCH net-next v9 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
@ 2026-04-28 12:38 ` Simon Schippers
  2026-04-28 12:38 ` [PATCH net-next v9 3/4] ptr_ring: move free-space check into separate helper Simon Schippers
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Simon Schippers @ 2026-04-28 12:38 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Add tun_wake_queue() to tun.c and export it for use by vhost-net. The
function validates that the file belongs to a tun/tap device,
dereferences the tun_struct under RCU, and delegates to
__tun_wake_queue().

vhost_net_buf_produce() now calls tun_wake_queue() after a successful
batched consume of the ring to allow the netdev subqueue to be woken up.
The point is to allow the queue to be stopped when it gets full, which
is required for traffic shaping and is implemented by the later commit
"tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present".

Without the corresponding queue stopping, this patch alone causes no
throughput regression for a tap+vhost-net setup sending to a QEMU VM:
3.858 Mpps to 3.898 Mpps.

Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, XDP drop program active in VM, pktgen sender; Avg over
50 runs @ 100,000,000 packets. SRSO and spectre v2 mitigations disabled.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tun.c      | 22 ++++++++++++++++++++++
 drivers/vhost/net.c    | 21 +++++++++++++++------
 include/linux/if_tun.h |  3 +++
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e6ee2271732f..efe809597622 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3765,6 +3765,28 @@ struct ptr_ring *tun_get_tx_ring(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_tx_ring);
 
+/* Callers must hold ring.consumer_lock */
+void tun_wake_queue(struct file *file)
+{
+	struct tun_file *tfile;
+	struct tun_struct *tun;
+
+	if (file->f_op != &tun_fops)
+		return;
+	tfile = file->private_data;
+	if (!tfile)
+		return;
+
+	rcu_read_lock();
+
+	tun = rcu_dereference(tfile->tun);
+	if (tun)
+		__tun_wake_queue(tun, tfile);
+
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(tun_wake_queue);
+
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 80965181920c..7fba518ac3cd 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -176,13 +176,21 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
 	return ret;
 }
 
-static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_produce(struct sock *sk,
+				 struct vhost_net_virtqueue *nvq)
 {
+	struct file *file = sk->sk_socket->file;
 	struct vhost_net_buf *rxq = &nvq->rxq;
 
 	rxq->head = 0;
-	rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
-					      VHOST_NET_BATCH);
+	spin_lock(&nvq->rx_ring->consumer_lock);
+	rxq->tail = __ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
+					       VHOST_NET_BATCH);
+
+	if (rxq->tail)
+		tun_wake_queue(file);
+
+	spin_unlock(&nvq->rx_ring->consumer_lock);
 	return rxq->tail;
 }
 
@@ -209,14 +217,15 @@ static int vhost_net_buf_peek_len(void *ptr)
 	return __skb_array_len_with_tag(ptr);
 }
 
-static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_peek(struct sock *sk,
+			      struct vhost_net_virtqueue *nvq)
 {
 	struct vhost_net_buf *rxq = &nvq->rxq;
 
 	if (!vhost_net_buf_is_empty(rxq))
 		goto out;
 
-	if (!vhost_net_buf_produce(nvq))
+	if (!vhost_net_buf_produce(sk, nvq))
 		return 0;
 
 out:
@@ -995,7 +1004,7 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 	unsigned long flags;
 
 	if (rvq->rx_ring)
-		return vhost_net_buf_peek(rvq);
+		return vhost_net_buf_peek(sk, rvq);
 
 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 80166eb62f41..ab3b4ebca059 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -22,6 +22,7 @@ struct tun_msg_ctl {
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
+void tun_wake_queue(struct file *file);
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -55,6 +56,8 @@ static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
 	return ERR_PTR(-EINVAL);
 }
 
+static inline void tun_wake_queue(struct file *f) {}
+
 static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
-- 
2.43.0



* [PATCH net-next v9 3/4] ptr_ring: move free-space check into separate helper
  2026-04-28 12:38 [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
  2026-04-28 12:38 ` [PATCH net-next v9 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
  2026-04-28 12:38 ` [PATCH net-next v9 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume Simon Schippers
@ 2026-04-28 12:38 ` Simon Schippers
  2026-04-28 12:38 ` [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present Simon Schippers
  2026-04-29 21:04 ` [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
  4 siblings, 0 replies; 14+ messages in thread
From: Simon Schippers @ 2026-04-28 12:38 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Move the check for free space for a new entry into the separate helper
__ptr_ring_produce_peek(). As a result, __ptr_ring_produce() remains
logically unchanged, while the new helper allows callers to determine in
advance whether a subsequent __ptr_ring_produce() call will succeed. This
information can, for example, be used to temporarily stop producing until
__ptr_ring_peek() indicates that space is available again.
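
A minimal usage sketch (hypothetical caller, not part of this patch;
stop_producing() is a placeholder for backing off, e.g. stopping a
netdev queue):

	spin_lock(&r->producer_lock);
	if (!__ptr_ring_produce_peek(r)) {
		/* A free slot was observed under the producer lock,
		 * so the produce below cannot fail.
		 */
		__ptr_ring_produce(r, ptr);
	} else {
		/* Ring is full: back off until the consumer makes
		 * space.
		 */
		stop_producing();
	}
	spin_unlock(&r->producer_lock);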

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 include/linux/ptr_ring.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index d2c3629bbe45..b51fe4a484be 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -96,6 +96,14 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
 	return ret;
 }
 
+static inline int __ptr_ring_produce_peek(struct ptr_ring *r)
+{
+	if (unlikely(!r->size) || data_race(r->queue[r->producer]))
+		return -ENOSPC;
+
+	return 0;
+}
+
 /* Note: callers invoking this in a loop must use a compiler barrier,
  * for example cpu_relax(). Callers must hold producer_lock.
  * Callers are responsible for making sure pointer that is being queued
@@ -103,8 +111,10 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
  */
 static inline int __ptr_ring_produce(struct ptr_ring *r, void *ptr)
 {
-	if (unlikely(!r->size) || data_race(r->queue[r->producer]))
-		return -ENOSPC;
+	int p = __ptr_ring_produce_peek(r);
+
+	if (p)
+		return p;
 
 	/* Make sure the pointer we are storing points to a valid data. */
 	/* Pairs with the dependency ordering in __ptr_ring_consume. */
-- 
2.43.0



* [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-04-28 12:38 [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (2 preceding siblings ...)
  2026-04-28 12:38 ` [PATCH net-next v9 3/4] ptr_ring: move free-space check into separate helper Simon Schippers
@ 2026-04-28 12:38 ` Simon Schippers
  2026-04-28 12:50   ` Michael S. Tsirkin
  2026-04-29 21:04 ` [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
  4 siblings, 1 reply; 14+ messages in thread
From: Simon Schippers @ 2026-04-28 12:38 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

This commit prevents tail-drop when a qdisc is present and the ptr_ring
becomes full. Once an entry is successfully produced and the ptr_ring
reaches capacity, the netdev queue is stopped instead of dropping
subsequent packets.

If producing an entry fails anyway due to a race, tun_net_xmit() returns
NETDEV_TX_BUSY and the qdisc requeues the packet, again avoiding a drop.
Such races are expected because LLTX is enabled and the transmit path
operates without the usual locking.

If no qdisc is present, the previous tail-drop behavior is preserved.

The existing __tun_wake_queue() function of the consumer races with the
producer for waking/stopping the netdev queue: the consumer may drain
the ring just as the producer stops the queue, leading to a permanent
stall. To avoid this, the producer re-checks the ring after stopping
and wakes the queue itself if space was just made. An
smp_mb__after_atomic() is required so the re-peek of the ring sees any
drain that the consumer performed.
smp_mb__after_atomic() pairs with the test_and_clear_bit() inside of
netif_wake_subqueue():

Consumer CPU                  Producer CPU
========================      =========================
__ptr_ring_consume()
netif_wake_subqueue()         netif_tx_stop_queue()
          /\                  smp_mb__after_atomic()
          ||                  __ptr_ring_produce_peek()
contains RMW operation
 test_and_clear_bit()
          /\
          ||
 "Fully ordered RMW:
smp_mb() before + after"
    - atomic_t.txt

Benchmarks:
The benchmarks show a slight regression in raw transmission performance,
though no packets are lost anymore.

The previously introduced threshold of waking only after the queue has
stopped and half of the ring has been consumed proved to be a decent
choice:
Waking the queue whenever a consume made space in the ring strongly
degrades performance for tap, while waking only when the ring is empty
is too late and also hurts throughput for tap & tap+vhost-net.
Other ratios (3/4, 7/8) showed similar results (not shown here), so
1/2 was chosen for the sake of simplicity for both tun/tap and
tun/tap+vhost-net.

Test setup:
AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
mitigations disabled.

Note for tap+vhost-net:
XDP drop program active in VM -> ~2.5x faster, slower for tap due to
more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)

+--------------------------+--------------+----------------+----------+
| 1 thread                 | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+

+--------------------------+--------------+----------------+----------+
| 2 threads                | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index efe809597622..c2a1618cc9db 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct netdev_queue *queue;
 	struct tun_file *tfile;
 	int len = skb->len;
+	bool qdisc_present;
+	int ret;
 
 	rcu_read_lock();
 	tfile = rcu_dereference(tun->tfiles[txq]);
@@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset_ct(skb);
 
-	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+	queue = netdev_get_tx_queue(dev, txq);
+	qdisc_present = !qdisc_txq_has_no_queue(queue);
+
+	spin_lock(&tfile->tx_ring.producer_lock);
+	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
+	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
+		netif_tx_stop_queue(queue);
+		/* Re-peek and wake if the consumer drained the ring
+		 * concurrently in a race. smp_mb__after_atomic() pairs
+		 * with the test_and_clear_bit() of netif_wake_subqueue()
+		 * in __tun_wake_queue().
+		 */
+		smp_mb__after_atomic();
+		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
+			netif_tx_wake_queue(queue);
+	}
+	spin_unlock(&tfile->tx_ring.producer_lock);
+
+	if (ret) {
+		/* If a qdisc is attached to our virtual device,
+		 * returning NETDEV_TX_BUSY is allowed.
+		 */
+		if (qdisc_present) {
+			rcu_read_unlock();
+			return NETDEV_TX_BUSY;
+		}
 		drop_reason = SKB_DROP_REASON_FULL_RING;
 		goto drop;
 	}
 
 	/* dev->lltx requires to do our own update of trans_start */
-	queue = netdev_get_tx_queue(dev, txq);
 	txq_trans_cond_update(queue);
 
 	/* Notify and wake up reader process */
-- 
2.43.0



* Re: [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-04-28 12:38 ` [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present Simon Schippers
@ 2026-04-28 12:50   ` Michael S. Tsirkin
  2026-04-28 13:10     ` Simon Schippers
  0 siblings, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2026-04-28 12:50 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Tue, Apr 28, 2026 at 02:38:59PM +0200, Simon Schippers wrote:
> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> becomes full. Once an entry is successfully produced and the ptr_ring
> reaches capacity, the netdev queue is stopped instead of dropping
> subsequent packets.
> 
> If producing an entry fails anyway due to a race, tun_net_xmit() returns
> NETDEV_TX_BUSY and the qdisc requeues the packet, again avoiding a drop.
> Such races are expected because LLTX is enabled and the transmit path
> operates without the usual locking.
> 
> If no qdisc is present, the previous tail-drop behavior is preserved.
> 
> The existing __tun_wake_queue() function of the consumer races with the
> producer for waking/stopping the netdev queue: the consumer may drain
> the ring just as the producer stops the queue, leading to a permanent
> stall. To avoid this, the producer re-checks the ring after stopping
> and wakes the queue itself if space was just made. An
> smp_mb__after_atomic() is required so the re-peek of the ring sees any
> drain that the consumer performed.
> smp_mb__after_atomic() pairs with the test_and_clear_bit() inside of
> netif_wake_subqueue():
> 
> Consumer CPU                  Producer CPU
> ========================      =========================
> __ptr_ring_consume()
> netif_wake_subqueue()         netif_tx_stop_queue()
>           /\                  smp_mb__after_atomic()
>           ||                  __ptr_ring_produce_peek()
> contains RMW operation
>  test_and_clear_bit()
>           /\
>           ||
>  "Fully ordered RMW:
> smp_mb() before + after"
>     - atomic_t.txt
> 
> Benchmarks:
> The benchmarks show a slight regression in raw transmission performance,
> though no packets are lost anymore.

Could you include the packets received as well?
To demonstrate the gains/lack of loss. 

> 
> The previously introduced threshold of waking only after the queue has
> stopped and half of the ring has been consumed proved to be a decent
> choice:
> Waking the queue whenever a consume made space in the ring strongly
> degrades performance for tap, while waking only when the ring is empty
> is too late and also hurts throughput for tap & tap+vhost-net.
> Other ratios (3/4, 7/8) showed similar results (not shown here), so
> 1/2 was chosen for the sake of simplicity for both tun/tap and
> tun/tap+vhost-net.
> 
> Test setup:
> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
> mitigations disabled.
> 
> Note for tap+vhost-net:
> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
> 
> +--------------------------+--------------+----------------+----------+
> | 1 thread                 | Stock        | Patched with   | diff     |
> | sending                  |              | fq_codel qdisc |          |
> +------------+-------------+--------------+----------------+----------+
> | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
> |            +-------------+--------------+----------------+----------+
> |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
> +------------+-------------+--------------+----------------+----------+
> | TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
> |            +-------------+--------------+----------------+----------+
> | +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
> +------------+-------------+--------------+----------------+----------+
> 
> +--------------------------+--------------+----------------+----------+
> | 2 threads                | Stock        | Patched with   | diff     |
> | sending                  |              | fq_codel qdisc |          |
> +------------+-------------+--------------+----------------+----------+
> | TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
> |            +-------------+--------------+----------------+----------+
> |            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
> +------------+-------------+--------------+----------------+----------+
> | TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
> |            +-------------+--------------+----------------+----------+
> | +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
> +------------+-------------+--------------+----------------+----------+
> 
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index efe809597622..c2a1618cc9db 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  	struct netdev_queue *queue;
>  	struct tun_file *tfile;
>  	int len = skb->len;
> +	bool qdisc_present;
> +	int ret;
>  
>  	rcu_read_lock();
>  	tfile = rcu_dereference(tun->tfiles[txq]);
> @@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  
>  	nf_reset_ct(skb);
>  
> -	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> +	queue = netdev_get_tx_queue(dev, txq);
> +	qdisc_present = !qdisc_txq_has_no_queue(queue);
> +
> +	spin_lock(&tfile->tx_ring.producer_lock);
> +	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> +	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> +		netif_tx_stop_queue(queue);
> +		/* Re-peek and wake if the consumer drained the ring
> +		 * concurrently in a race. smp_mb__after_atomic() pairs
> +		 * with the test_and_clear_bit() of netif_wake_subqueue()
> +		 * in __tun_wake_queue().
> +		 */
> +		smp_mb__after_atomic();
> +		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> +			netif_tx_wake_queue(queue);
> +	}
> +	spin_unlock(&tfile->tx_ring.producer_lock);
> +
> +	if (ret) {
> +		/* If a qdisc is attached to our virtual device,
> +		 * returning NETDEV_TX_BUSY is allowed.
> +		 */
> +		if (qdisc_present) {
> +			rcu_read_unlock();
> +			return NETDEV_TX_BUSY;
> +		}
>  		drop_reason = SKB_DROP_REASON_FULL_RING;
>  		goto drop;
>  	}
>  
>  	/* dev->lltx requires to do our own update of trans_start */
> -	queue = netdev_get_tx_queue(dev, txq);
>  	txq_trans_cond_update(queue);
>  
>  	/* Notify and wake up reader process */
> -- 
> 2.43.0



* Re: [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-04-28 12:50   ` Michael S. Tsirkin
@ 2026-04-28 13:10     ` Simon Schippers
  2026-04-28 13:22       ` Michael S. Tsirkin
  0 siblings, 1 reply; 14+ messages in thread
From: Simon Schippers @ 2026-04-28 13:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 4/28/26 14:50, Michael S. Tsirkin wrote:
> On Tue, Apr 28, 2026 at 02:38:59PM +0200, Simon Schippers wrote:
>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>> becomes full. Once an entry is successfully produced and the ptr_ring
>> reaches capacity, the netdev queue is stopped instead of dropping
>> subsequent packets.
>>
>> If producing an entry fails anyway due to a race, tun_net_xmit() returns
>> NETDEV_TX_BUSY and the qdisc requeues the packet, again avoiding a drop.
>> Such races are expected because LLTX is enabled and the transmit path
>> operates without the usual locking.
>>
>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>
>> The existing __tun_wake_queue() function of the consumer races with the
>> producer for waking/stopping the netdev queue: the consumer may drain
>> the ring just as the producer stops the queue, leading to a permanent
>> stall. To avoid this, the producer re-checks the ring after stopping
>> and wakes the queue itself if space was just made. An
>> smp_mb__after_atomic() is required so the re-peek of the ring sees any
>> drain that the consumer performed.
>> smp_mb__after_atomic() pairs with the test_and_clear_bit() inside of
>> netif_wake_subqueue():
>>
>> Consumer CPU                  Producer CPU
>> ========================      =========================
>> __ptr_ring_consume()
>> netif_wake_subqueue()         netif_tx_stop_queue()
>>           /\                  smp_mb__after_atomic()
>>           ||                  __ptr_ring_produce_peek()
>> contains RMW operation
>>  test_and_clear_bit()
>>           /\
>>           ||
>>  "Fully ordered RMW:
>> smp_mb() before + after"
>>     - atomic_t.txt
>>
>> Benchmarks:
>> The benchmarks show a slight regression in raw transmission performance,
>> though no packets are lost anymore.
> 
> Could you include the packets received as well?
> To demonstrate the gains/lack of loss. 
> 

Do you mean the number of packets received by the VM?
They should just be the same as the number sent (shown below), right?

I assume they would be visible as RX-DRP for TAP.
For TAP + vhost-net I would have to rewrite the XDP drop
program to count the number of dropped packets...
And I would have to automate it...

>>
>> The previously introduced threshold of waking only after the queue has
>> stopped and half of the ring has been consumed proved to be a decent
>> choice:
>> Waking the queue whenever a consume made space in the ring strongly
>> degrades performance for tap, while waking only when the ring is empty
>> is too late and also hurts throughput for tap & tap+vhost-net.
>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
>> 1/2 was chosen for the sake of simplicity for both tun/tap and
>> tun/tap+vhost-net.
>>
>> Test setup:
>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
>> Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
>> mitigations disabled.
>>
>> Note for tap+vhost-net:
>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
>>
>> +--------------------------+--------------+----------------+----------+
>> | 1 thread                 | Stock        | Patched with   | diff     |
>> | sending                  |              | fq_codel qdisc |          |
>> +------------+-------------+--------------+----------------+----------+
>> | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
>> |            +-------------+--------------+----------------+----------+
>> |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
>> +------------+-------------+--------------+----------------+----------+
>> | TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
>> |            +-------------+--------------+----------------+----------+
>> | +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
>> +------------+-------------+--------------+----------------+----------+
>>
>> +--------------------------+--------------+----------------+----------+
>> | 2 threads                | Stock        | Patched with   | diff     |
>> | sending                  |              | fq_codel qdisc |          |
>> +------------+-------------+--------------+----------------+----------+
>> | TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
>> |            +-------------+--------------+----------------+----------+
>> |            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
>> +------------+-------------+--------------+----------------+----------+
>> | TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
>> |            +-------------+--------------+----------------+----------+
>> | +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
>> +------------+-------------+--------------+----------------+----------+
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>>  1 file changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index efe809597622..c2a1618cc9db 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>  	struct netdev_queue *queue;
>>  	struct tun_file *tfile;
>>  	int len = skb->len;
>> +	bool qdisc_present;
>> +	int ret;
>>  
>>  	rcu_read_lock();
>>  	tfile = rcu_dereference(tun->tfiles[txq]);
>> @@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>  
>>  	nf_reset_ct(skb);
>>  
>> -	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>> +	queue = netdev_get_tx_queue(dev, txq);
>> +	qdisc_present = !qdisc_txq_has_no_queue(queue);
>> +
>> +	spin_lock(&tfile->tx_ring.producer_lock);
>> +	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>> +	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>> +		netif_tx_stop_queue(queue);
>> +		/* Re-peek and wake if the consumer drained the ring
>> +		 * concurrently in a race. smp_mb__after_atomic() pairs
>> +		 * with the test_and_clear_bit() of netif_wake_subqueue()
>> +		 * in __tun_wake_queue().
>> +		 */
>> +		smp_mb__after_atomic();
>> +		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>> +			netif_tx_wake_queue(queue);
>> +	}
>> +	spin_unlock(&tfile->tx_ring.producer_lock);
>> +
>> +	if (ret) {
>> +		/* If a qdisc is attached to our virtual device,
>> +		 * returning NETDEV_TX_BUSY is allowed.
>> +		 */
>> +		if (qdisc_present) {
>> +			rcu_read_unlock();
>> +			return NETDEV_TX_BUSY;
>> +		}
>>  		drop_reason = SKB_DROP_REASON_FULL_RING;
>>  		goto drop;
>>  	}
>>  
>>  	/* dev->lltx requires to do our own update of trans_start */
>> -	queue = netdev_get_tx_queue(dev, txq);
>>  	txq_trans_cond_update(queue);
>>  
>>  	/* Notify and wake up reader process */
>> -- 
>> 2.43.0
> 


* Re: [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-04-28 13:10     ` Simon Schippers
@ 2026-04-28 13:22       ` Michael S. Tsirkin
  2026-04-28 13:41         ` Simon Schippers
  0 siblings, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2026-04-28 13:22 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Tue, Apr 28, 2026 at 03:10:44PM +0200, Simon Schippers wrote:
> On 4/28/26 14:50, Michael S. Tsirkin wrote:
> > On Tue, Apr 28, 2026 at 02:38:59PM +0200, Simon Schippers wrote:
> >> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >> becomes full. Once an entry is successfully produced and the ptr_ring
> >> reaches capacity, the netdev queue is stopped instead of dropping
> >> subsequent packets.
> >>
> >> If producing an entry fails anyway due to a race, tun_net_xmit() returns
> >> NETDEV_TX_BUSY and the qdisc requeues the packet, again avoiding a drop.
> >> Such races are expected because LLTX is enabled and the transmit path
> >> operates without the usual locking.
> >>
> >> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>
> >> The existing __tun_wake_queue() function of the consumer races with the
> >> producer for waking/stopping the netdev queue: the consumer may drain
> >> the ring just as the producer stops the queue, leading to a permanent
> >> stall. To avoid this, the producer re-checks the ring after stopping
> >> and wakes the queue itself if space was just made. An
> >> smp_mb__after_atomic() is required so the re-peek of the ring sees any
> >> drain that the consumer performed.
> >> smp_mb__after_atomic() pairs with the test_and_clear_bit() inside of
> >> netif_wake_subqueue():
> >>
> >> Consumer CPU                  Producer CPU
> >> ========================      =========================
> >> __ptr_ring_consume()
> >> netif_wake_subqueue()         netif_tx_stop_queue()
> >>           /\                  smp_mb__after_atomic()
> >>           ||                  __ptr_ring_produce_peek()
> >> contains RMW operation
> >>  test_and_clear_bit()
> >>           /\
> >>           ||
> >>  "Fully ordered RMW:
> >> smp_mb() before + after"
> >>     - atomic_t.txt
> >>
> >> Benchmarks:
> >> The benchmarks show a slight regression in raw transmission performance,
> >> though no packets are lost anymore.
> > 
> > Could you include the packets received as well?
> > To demonstrate the gains/lack of loss. 
> > 
> 
> Do you mean the number of packets received by the VM?
> They should just be the same as the number sent (shown below), right?

Minus the loss? Which this is about, right?

> I assume they would be visible as RX-DRP for TAP.
> For TAP + vhost-net I would have to rewrite the XDP drop
> program to count the number of dropped packets...
> And I would have to automate it...
> 
> >>
> >> The previously introduced threshold of waking only after the queue has
> >> stopped and half of the ring has been consumed proved to be a decent
> >> choice:
> >> Waking the queue whenever a consume made space in the ring strongly
> >> degrades performance for tap, while waking only when the ring is empty
> >> is too late and also hurts throughput for tap & tap+vhost-net.
> >> Other ratios (3/4, 7/8) showed similar results (not shown here), so
> >> 1/2 was chosen for the sake of simplicity for both tun/tap and
> >> tun/tap+vhost-net.
> >>
> >> Test setup:
> >> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> >> Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
> >> mitigations disabled.
> >>
> >> Note for tap+vhost-net:
> >> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
> >> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
> >>
> >> +--------------------------+--------------+----------------+----------+
> >> | 1 thread                 | Stock        | Patched with   | diff     |
> >> | sending                  |              | fq_codel qdisc |          |
> >> +------------+-------------+--------------+----------------+----------+
> >> | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
> >> |            +-------------+--------------+----------------+----------+
> >> |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
> >> +------------+-------------+--------------+----------------+----------+
> >> | TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
> >> |            +-------------+--------------+----------------+----------+
> >> | +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
> >> +------------+-------------+--------------+----------------+----------+
> >>
> >> +--------------------------+--------------+----------------+----------+
> >> | 2 threads                | Stock        | Patched with   | diff     |
> >> | sending                  |              | fq_codel qdisc |          |
> >> +------------+-------------+--------------+----------------+----------+
> >> | TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
> >> |            +-------------+--------------+----------------+----------+
> >> |            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
> >> +------------+-------------+--------------+----------------+----------+
> >> | TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
> >> |            +-------------+--------------+----------------+----------+
> >> | +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
> >> +------------+-------------+--------------+----------------+----------+
> >>
> >> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >> ---
> >>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
> >>  1 file changed, 28 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index efe809597622..c2a1618cc9db 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>  	struct netdev_queue *queue;
> >>  	struct tun_file *tfile;
> >>  	int len = skb->len;
> >> +	bool qdisc_present;
> >> +	int ret;
> >>  
> >>  	rcu_read_lock();
> >>  	tfile = rcu_dereference(tun->tfiles[txq]);
> >> @@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>  
> >>  	nf_reset_ct(skb);
> >>  
> >> -	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >> +	queue = netdev_get_tx_queue(dev, txq);
> >> +	qdisc_present = !qdisc_txq_has_no_queue(queue);
> >> +
> >> +	spin_lock(&tfile->tx_ring.producer_lock);
> >> +	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >> +	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >> +		netif_tx_stop_queue(queue);
> >> +		/* Re-peek and wake if the consumer drained the ring
> >> +		 * concurrently in a race. smp_mb__after_atomic() pairs
> >> +		 * with the test_and_clear_bit() of netif_wake_subqueue()
> >> +		 * in __tun_wake_queue().
> >> +		 */
> >> +		smp_mb__after_atomic();
> >> +		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >> +			netif_tx_wake_queue(queue);
> >> +	}
> >> +	spin_unlock(&tfile->tx_ring.producer_lock);
> >> +
> >> +	if (ret) {
> >> +		/* If a qdisc is attached to our virtual device,
> >> +		 * returning NETDEV_TX_BUSY is allowed.
> >> +		 */
> >> +		if (qdisc_present) {
> >> +			rcu_read_unlock();
> >> +			return NETDEV_TX_BUSY;
> >> +		}
> >>  		drop_reason = SKB_DROP_REASON_FULL_RING;
> >>  		goto drop;
> >>  	}
> >>  
> >>  	/* dev->lltx requires to do our own update of trans_start */
> >> -	queue = netdev_get_tx_queue(dev, txq);
> >>  	txq_trans_cond_update(queue);
> >>  
> >>  	/* Notify and wake up reader process */
> >> -- 
> >> 2.43.0
> > 



* Re: [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-04-28 13:22       ` Michael S. Tsirkin
@ 2026-04-28 13:41         ` Simon Schippers
  2026-04-28 14:10           ` Michael S. Tsirkin
  0 siblings, 1 reply; 14+ messages in thread
From: Simon Schippers @ 2026-04-28 13:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 4/28/26 15:22, Michael S. Tsirkin wrote:
> On Tue, Apr 28, 2026 at 03:10:44PM +0200, Simon Schippers wrote:
>> On 4/28/26 14:50, Michael S. Tsirkin wrote:
>>> On Tue, Apr 28, 2026 at 02:38:59PM +0200, Simon Schippers wrote:
>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>>> reaches capacity, the netdev queue is stopped instead of dropping
>>>> subsequent packets.
>>>>
>>>> If producing an entry fails anyway due to a race, tun_net_xmit() returns
>>>> NETDEV_TX_BUSY and the qdisc requeues the packet, again avoiding a drop.
>>>> Such races are expected because LLTX is enabled and the transmit path
>>>> operates without the usual locking.
>>>>
>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>>>
>>>> The existing __tun_wake_queue() function of the consumer races with the
>>>> producer for waking/stopping the netdev queue: the consumer may drain
>>>> the ring just as the producer stops the queue, leading to a permanent
>>>> stall. To avoid this, the producer re-checks the ring after stopping
>>>> and wakes the queue itself if space was just made. An
>>>> smp_mb__after_atomic() is required so the re-peek of the ring sees any
>>>> drain that the consumer performed.
>>>> smp_mb__after_atomic() pairs with the test_and_clear_bit() inside of
>>>> netif_wake_subqueue():
>>>>
>>>> Consumer CPU                  Producer CPU
>>>> ========================      =========================
>>>> __ptr_ring_consume()
>>>> netif_wake_subqueue()         netif_tx_stop_queue()
>>>>           /\                  smp_mb__after_atomic()
>>>>           ||                  __ptr_ring_produce_peek()
>>>> contains RMW operation
>>>>  test_and_clear_bit()
>>>>           /\
>>>>           ||
>>>>  "Fully ordered RMW:
>>>> smp_mb() before + after"
>>>>     - atomic_t.txt
>>>>
>>>> Benchmarks:
>>>> The benchmarks show a slight regression in raw transmission performance,
>>>> though no packets are lost anymore.
>>>
>>> Could you include the packets received as well?
>>> To demonstrate the gains/lack of loss. 
>>>
>>
>> Do you mean the number of packets received by the VM?
>> They should just be the same as the number sent (shown below), right?
> 
> Minus the loss? Which this is about, right?

Yes. I simply calculated "Lost/s":

elapsed_time = 100e6 / sent_pps
Lost/s = total_errors / elapsed_time


To recover total_errors, for example for TAP with
1 thread sending:

elapsed_time = 100e6 / 1.136 Mpps ≈ 88 s

3.758 Mpps = total_errors / 88 s
<=> total_errors ≈ 331 million packets

So, out of 431 million packets sent, 100 million were successfully
delivered and 331 million were lost.
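
(Cross-check: 1.136 Mpps transmitted + 3.758 Mpps lost ≈ 4.894 Mpps
offered, and 4.894 Mpps * 88 s ≈ 431 million packets.)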

> 
>> I assume they would be visible as RX-DRP for TAP.
>> For TAP + vhost-net I would have to rewrite the XDP drop
>> program to count the number of dropped packets...
>> And I would have to automate it...
>>
>>>>
>>>> The previously introduced threshold of waking only after the queue has
>>>> stopped and half of the ring has been consumed proved to be a decent
>>>> choice:
>>>> Waking the queue whenever a consume made space in the ring strongly
>>>> degrades performance for tap, while waking only when the ring is empty
>>>> is too late and also hurts throughput for tap & tap+vhost-net.
>>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
>>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
>>>> tun/tap+vhost-net.
>>>>
>>>> Test setup:
>>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
>>>> Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
>>>> mitigations disabled.
>>>>
>>>> Note for tap+vhost-net:
>>>> XDP drop program active in VM -> ~2.5x faster, slower for tap due to
>>>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
>>>>
>>>> +--------------------------+--------------+----------------+----------+
>>>> | 1 thread                 | Stock        | Patched with   | diff     |
>>>> | sending                  |              | fq_codel qdisc |          |
>>>> +------------+-------------+--------------+----------------+----------+
>>>> | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
>>>> |            +-------------+--------------+----------------+----------+
>>>> |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
>>>> +------------+-------------+--------------+----------------+----------+
>>>> | TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
>>>> |            +-------------+--------------+----------------+----------+
>>>> | +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
>>>> +------------+-------------+--------------+----------------+----------+
>>>>
>>>> +--------------------------+--------------+----------------+----------+
>>>> | 2 threads                | Stock        | Patched with   | diff     |
>>>> | sending                  |              | fq_codel qdisc |          |
>>>> +------------+-------------+--------------+----------------+----------+
>>>> | TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
>>>> |            +-------------+--------------+----------------+----------+
>>>> |            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
>>>> +------------+-------------+--------------+----------------+----------+
>>>> | TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
>>>> |            +-------------+--------------+----------------+----------+
>>>> | +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
>>>> +------------+-------------+--------------+----------------+----------+
>>>>
>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>> ---
>>>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>>>>  1 file changed, 28 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index efe809597622..c2a1618cc9db 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>  	struct netdev_queue *queue;
>>>>  	struct tun_file *tfile;
>>>>  	int len = skb->len;
>>>> +	bool qdisc_present;
>>>> +	int ret;
>>>>  
>>>>  	rcu_read_lock();
>>>>  	tfile = rcu_dereference(tun->tfiles[txq]);
>>>> @@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>  
>>>>  	nf_reset_ct(skb);
>>>>  
>>>> -	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>>> +	queue = netdev_get_tx_queue(dev, txq);
>>>> +	qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>> +
>>>> +	spin_lock(&tfile->tx_ring.producer_lock);
>>>> +	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>>> +	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>>> +		netif_tx_stop_queue(queue);
>>>> +		/* Re-peek and wake if the consumer drained the ring
>>>> +		 * concurrently in a race. smp_mb__after_atomic() pairs
>>>> +		 * with the test_and_clear_bit() of netif_wake_subqueue()
>>>> +		 * in __tun_wake_queue().
>>>> +		 */
>>>> +		smp_mb__after_atomic();
>>>> +		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>>> +			netif_tx_wake_queue(queue);
>>>> +	}
>>>> +	spin_unlock(&tfile->tx_ring.producer_lock);
>>>> +
>>>> +	if (ret) {
>>>> +		/* If a qdisc is attached to our virtual device,
>>>> +		 * returning NETDEV_TX_BUSY is allowed.
>>>> +		 */
>>>> +		if (qdisc_present) {
>>>> +			rcu_read_unlock();
>>>> +			return NETDEV_TX_BUSY;
>>>> +		}
>>>>  		drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>  		goto drop;
>>>>  	}
>>>>  
>>>>  	/* dev->lltx requires to do our own update of trans_start */
>>>> -	queue = netdev_get_tx_queue(dev, txq);
>>>>  	txq_trans_cond_update(queue);
>>>>  
>>>>  	/* Notify and wake up reader process */
>>>> -- 
>>>> 2.43.0
>>>
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-04-28 13:41         ` Simon Schippers
@ 2026-04-28 14:10           ` Michael S. Tsirkin
  2026-04-28 14:18             ` Simon Schippers
  0 siblings, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2026-04-28 14:10 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Tue, Apr 28, 2026 at 03:41:20PM +0200, Simon Schippers wrote:
> On 4/28/26 15:22, Michael S. Tsirkin wrote:
> > On Tue, Apr 28, 2026 at 03:10:44PM +0200, Simon Schippers wrote:
> >> On 4/28/26 14:50, Michael S. Tsirkin wrote:
> >>> On Tue, Apr 28, 2026 at 02:38:59PM +0200, Simon Schippers wrote:
> >>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>>> reaches capacity, the netdev queue is stopped instead of dropping
> >>>> subsequent packets.
> >>>>
>>>> If producing an entry fails anyway due to a race, tun_net_xmit returns
> >>>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
> >>>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>>
> >>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>>>
> >>>> The existing __tun_wake_queue() function of the consumer races with the
> >>>> producer for waking/stopping the netdev queue: the consumer may drain
> >>>> the ring just as the producer stops the queue, leading to a permanent
> >>>> stall. To avoid this, the producer re-checks the ring after stopping
> >>>> and wakes the queue itself if space was just made. An
> >>>> smp_mb__after_atomic() is required so the re-peek of the ring sees any
> >>>> drain that the consumer performed.
> >>>> smp_mb__after_atomic() pairs with the test_and_clear_bit() inside of
> >>>> netif_wake_subqueue():
> >>>>
> >>>> Consumer CPU                  Producer CPU
> >>>> ========================      =========================
> >>>> __ptr_ring_consume()
> >>>> netif_wake_subqueue()         netif_tx_stop_queue()
> >>>>           /\                  smp_mb__after_atomic()
> >>>>           ||                  __ptr_ring_produce_peek()
> >>>> contains RMW operation
> >>>>  test_and_clear_bit()
> >>>>           /\
> >>>>           ||
> >>>>  "Fully ordered RMW:
> >>>> smp_mb() before + after"
> >>>>     - atomic_t.txt
> >>>>
> >>>> Benchmarks:
> >>>> The benchmarks show a slight regression in raw transmission performance,
> >>>> though no packets are lost anymore.
> >>>
> >>> Could you include the packets received as well?
> >>> To demonstrate the gains/lack of loss. 
> >>>
> >>
> >> Do you mean the number of packets received by the VM?
> >> They should just be the same as the number sent (shown below), right?
> > 
> > Minus the loss? Which this is about, right?
> 
> Yes. I simply calculated "Lost/s":
> 
> elapsed_time = 100e6 / sent_pps
> Lost/s = total_errors / elapsed_time
> 
> 
> To get back total_errors for example for TAP
> 1 thread sending:
> 
> elapsed_time = 100e6 / 1.136Mpps = 88s
> 
> 3758 Mpps = total_errors / 88s
> <=> total_errors = 331 million packets
> 
> So, out of 431 million packets sent, 100 million were successfully
> delivered and 331 million were lost.

That is my issue.

I kind of have trouble mapping that to the table below.
For example:

 | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
 |            +-------------+--------------+----------------+----------+
 |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |

how can # of lost packets exceed the # of transmitted packets?

Thanks!


> > 
> >> I assume they would be visible as RX-DRP for TAP.
> >> For TAP + vhost-net I would have to rewrite the XDP drop
> >> program to count the number of dropped packets...
> >> And I would have to automate it...
> >>
> >>>>
> >>>> The previously introduced threshold to only wake after the queue stopped
> >>>> and half of the ring was consumed proved to be a decent choice:
> >>>> Waking the queue whenever a consume made space in the ring strongly
> >>>> degrades performance for tap, while waking only when the ring is empty
> >>>> is too late and also hurts throughput for tap & tap+vhost-net.
> >>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
> >>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
> >>>> tun/tap+vhost-net.
> >>>>
> >>>> Test setup:
> >>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> >>>> Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
> >>>> mitigations disabled.
> >>>>
> >>>> Note for tap+vhost-net:
> >>>> XDP drop program active in VM -> ~2.5x faster; for tap alone it is
> >>>> slower due to more syscalls (high utilization of
> >>>> entry_SYSRETQ_unsafe_stack in perf)
> >>>>
> >>>> +--------------------------+--------------+----------------+----------+
> >>>> | 1 thread                 | Stock        | Patched with   | diff     |
> >>>> | sending                  |              | fq_codel qdisc |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>> | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
> >>>> |            +-------------+--------------+----------------+----------+
> >>>> |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>> | TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
> >>>> |            +-------------+--------------+----------------+----------+
> >>>> | +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>>
> >>>> +--------------------------+--------------+----------------+----------+
> >>>> | 2 threads                | Stock        | Patched with   | diff     |
> >>>> | sending                  |              | fq_codel qdisc |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>> | TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
> >>>> |            +-------------+--------------+----------------+----------+
> >>>> |            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>> | TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
> >>>> |            +-------------+--------------+----------------+----------+
> >>>> | +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>>
> >>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>> ---
> >>>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
> >>>>  1 file changed, 28 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>> index efe809597622..c2a1618cc9db 100644
> >>>> --- a/drivers/net/tun.c
> >>>> +++ b/drivers/net/tun.c
> >>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>  	struct netdev_queue *queue;
> >>>>  	struct tun_file *tfile;
> >>>>  	int len = skb->len;
> >>>> +	bool qdisc_present;
> >>>> +	int ret;
> >>>>  
> >>>>  	rcu_read_lock();
> >>>>  	tfile = rcu_dereference(tun->tfiles[txq]);
> >>>> @@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>  
> >>>>  	nf_reset_ct(skb);
> >>>>  
> >>>> -	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>> +	queue = netdev_get_tx_queue(dev, txq);
> >>>> +	qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>> +
> >>>> +	spin_lock(&tfile->tx_ring.producer_lock);
> >>>> +	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>> +	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>> +		netif_tx_stop_queue(queue);
> >>>> +		/* Re-peek and wake if the consumer drained the ring
> >>>> +		 * concurrently in a race. smp_mb__after_atomic() pairs
> >>>> +		 * with the test_and_clear_bit() of netif_wake_subqueue()
> >>>> +		 * in __tun_wake_queue().
> >>>> +		 */
> >>>> +		smp_mb__after_atomic();
> >>>> +		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>> +			netif_tx_wake_queue(queue);
> >>>> +	}
> >>>> +	spin_unlock(&tfile->tx_ring.producer_lock);
> >>>> +
> >>>> +	if (ret) {
> >>>> +		/* If a qdisc is attached to our virtual device,
> >>>> +		 * returning NETDEV_TX_BUSY is allowed.
> >>>> +		 */
> >>>> +		if (qdisc_present) {
> >>>> +			rcu_read_unlock();
> >>>> +			return NETDEV_TX_BUSY;
> >>>> +		}
> >>>>  		drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>  		goto drop;
> >>>>  	}
> >>>>  
> >>>>  	/* dev->lltx requires to do our own update of trans_start */
> >>>> -	queue = netdev_get_tx_queue(dev, txq);
> >>>>  	txq_trans_cond_update(queue);
> >>>>  
> >>>>  	/* Notify and wake up reader process */
> >>>> -- 
> >>>> 2.43.0
> >>>
> > 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-04-28 14:10           ` Michael S. Tsirkin
@ 2026-04-28 14:18             ` Simon Schippers
  2026-04-28 14:32               ` Michael S. Tsirkin
  0 siblings, 1 reply; 14+ messages in thread
From: Simon Schippers @ 2026-04-28 14:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 4/28/26 16:10, Michael S. Tsirkin wrote:
> On Tue, Apr 28, 2026 at 03:41:20PM +0200, Simon Schippers wrote:
>> On 4/28/26 15:22, Michael S. Tsirkin wrote:
>>> On Tue, Apr 28, 2026 at 03:10:44PM +0200, Simon Schippers wrote:
>>>> On 4/28/26 14:50, Michael S. Tsirkin wrote:
>>>>> On Tue, Apr 28, 2026 at 02:38:59PM +0200, Simon Schippers wrote:
>>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>>>>> reaches capacity, the netdev queue is stopped instead of dropping
>>>>>> subsequent packets.
>>>>>>
>>>>>> If producing an entry fails anyway due to a race, tun_net_xmit returns
>>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
>>>>>> LLTX is enabled and the transmit path operates without the usual locking.
>>>>>>
>>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>>>>>
>>>>>> The existing __tun_wake_queue() function of the consumer races with the
>>>>>> producer for waking/stopping the netdev queue: the consumer may drain
>>>>>> the ring just as the producer stops the queue, leading to a permanent
>>>>>> stall. To avoid this, the producer re-checks the ring after stopping
>>>>>> and wakes the queue itself if space was just made. An
>>>>>> smp_mb__after_atomic() is required so the re-peek of the ring sees any
>>>>>> drain that the consumer performed.
>>>>>> smp_mb__after_atomic() pairs with the test_and_clear_bit() inside of
>>>>>> netif_wake_subqueue():
>>>>>>
>>>>>> Consumer CPU                  Producer CPU
>>>>>> ========================      =========================
>>>>>> __ptr_ring_consume()
>>>>>> netif_wake_subqueue()         netif_tx_stop_queue()
>>>>>>           /\                  smp_mb__after_atomic()
>>>>>>           ||                  __ptr_ring_produce_peek()
>>>>>> contains RMW operation
>>>>>>  test_and_clear_bit()
>>>>>>           /\
>>>>>>           ||
>>>>>>  "Fully ordered RMW:
>>>>>> smp_mb() before + after"
>>>>>>     - atomic_t.txt
>>>>>>
>>>>>> Benchmarks:
>>>>>> The benchmarks show a slight regression in raw transmission performance,
>>>>>> though no packets are lost anymore.
>>>>>
>>>>> Could you include the packets received as well?
>>>>> To demonstrate the gains/lack of loss. 
>>>>>
>>>>
>>>> Do you mean the number of packets received by the VM?
>>>> They should just be the same as the number sent (shown below), right?
>>>
>>> Minus the loss? Which this is about, right?
>>
>> Yes. I simply calculated "Lost/s":
>>
>> elapsed_time = 100e6 / sent_pps
>> Lost/s = total_errors / elapsed_time
>>
>>
>> To recover total_errors, for example for TAP with
>> 1 thread sending:
>>
>> elapsed_time = 100e6 / 1.136 Mpps = 88s
>>
>> 3.758 Mpps = total_errors / 88s
>> <=> total_errors = 331 million packets
>>
>> So, out of 431 million packets sent, 100 million were successfully
>> delivered and 331 million were lost.
> 
> That is my issue.
> 
> I kind of have trouble mapping that to the table below.
> For example:
> 
>  | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
>  |            +-------------+--------------+----------------+----------+
>  |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
> 
> how can # of lost packets exceed the # of transmitted packets?
> 
> Thanks!

I just use the sample script [1]:

./pktgen_sample02_multiqueue.sh -n 100000000 ...

... and this runs until 100_000_000 packets were successfully
transmitted, regardless of any lost packets/errors.

[1] Link: https://www.kernel.org/doc/html/latest/networking/pktgen.html#sample-scripts
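
For completeness, an invocation along these lines (the interface,
destination IP/MAC and thread count below are placeholders for my local
setup, not the exact values used):

./pktgen_sample02_multiqueue.sh -i tap0 -t 2 -s 64 -n 100000000 \
	-d 10.0.0.2 -m 52:54:00:12:34:56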

> 
> 
>>>
>>>> I assume they would be visible as RX-DRP for TAP.
>>>> For TAP + vhost-net I would have to rewrite the XDP drop
>>>> program to count the number of dropped packets...
>>>> And I would have to automate it...
>>>>
>>>>>>
>>>>>> The previously introduced threshold to only wake after the queue stopped
>>>>>> and half of the ring was consumed proved to be a decent choice:
>>>>>> Waking the queue whenever a consume made space in the ring strongly
>>>>>> degrades performance for tap, while waking only when the ring is empty
>>>>>> is too late and also hurts throughput for tap & tap+vhost-net.
>>>>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
>>>>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
>>>>>> tun/tap+vhost-net.
>>>>>>
>>>>>> Test setup:
>>>>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
>>>>>> Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
>>>>>> mitigations disabled.
>>>>>>
>>>>>> Note for tap+vhost-net:
>>>>>> XDP drop program active in VM -> ~2.5x faster; for tap alone it is
>>>>>> slower due to more syscalls (high utilization of
>>>>>> entry_SYSRETQ_unsafe_stack in perf)
>>>>>>
>>>>>> +--------------------------+--------------+----------------+----------+
>>>>>> | 1 thread                 | Stock        | Patched with   | diff     |
>>>>>> | sending                  |              | fq_codel qdisc |          |
>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>> | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
>>>>>> |            +-------------+--------------+----------------+----------+
>>>>>> |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>> | TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
>>>>>> |            +-------------+--------------+----------------+----------+
>>>>>> | +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>
>>>>>> +--------------------------+--------------+----------------+----------+
>>>>>> | 2 threads                | Stock        | Patched with   | diff     |
>>>>>> | sending                  |              | fq_codel qdisc |          |
>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>> | TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
>>>>>> |            +-------------+--------------+----------------+----------+
>>>>>> |            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>> | TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
>>>>>> |            +-------------+--------------+----------------+----------+
>>>>>> | +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>
>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>> ---
>>>>>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>>>>>>  1 file changed, 28 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>> index efe809597622..c2a1618cc9db 100644
>>>>>> --- a/drivers/net/tun.c
>>>>>> +++ b/drivers/net/tun.c
>>>>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>  	struct netdev_queue *queue;
>>>>>>  	struct tun_file *tfile;
>>>>>>  	int len = skb->len;
>>>>>> +	bool qdisc_present;
>>>>>> +	int ret;
>>>>>>  
>>>>>>  	rcu_read_lock();
>>>>>>  	tfile = rcu_dereference(tun->tfiles[txq]);
>>>>>> @@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>  
>>>>>>  	nf_reset_ct(skb);
>>>>>>  
>>>>>> -	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>>>>> +	queue = netdev_get_tx_queue(dev, txq);
>>>>>> +	qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>>>> +
>>>>>> +	spin_lock(&tfile->tx_ring.producer_lock);
>>>>>> +	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>>>>> +	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>>>>> +		netif_tx_stop_queue(queue);
>>>>>> +		/* Re-peek and wake if the consumer drained the ring
>>>>>> +		 * concurrently in a race. smp_mb__after_atomic() pairs
>>>>>> +		 * with the test_and_clear_bit() of netif_wake_subqueue()
>>>>>> +		 * in __tun_wake_queue().
>>>>>> +		 */
>>>>>> +		smp_mb__after_atomic();
>>>>>> +		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>>>>> +			netif_tx_wake_queue(queue);
>>>>>> +	}
>>>>>> +	spin_unlock(&tfile->tx_ring.producer_lock);
>>>>>> +
>>>>>> +	if (ret) {
>>>>>> +		/* If a qdisc is attached to our virtual device,
>>>>>> +		 * returning NETDEV_TX_BUSY is allowed.
>>>>>> +		 */
>>>>>> +		if (qdisc_present) {
>>>>>> +			rcu_read_unlock();
>>>>>> +			return NETDEV_TX_BUSY;
>>>>>> +		}
>>>>>>  		drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>>>  		goto drop;
>>>>>>  	}
>>>>>>  
>>>>>>  	/* dev->lltx requires to do our own update of trans_start */
>>>>>> -	queue = netdev_get_tx_queue(dev, txq);
>>>>>>  	txq_trans_cond_update(queue);
>>>>>>  
>>>>>>  	/* Notify and wake up reader process */
>>>>>> -- 
>>>>>> 2.43.0
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-04-28 14:18             ` Simon Schippers
@ 2026-04-28 14:32               ` Michael S. Tsirkin
  2026-04-28 14:55                 ` Simon Schippers
  0 siblings, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2026-04-28 14:32 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Tue, Apr 28, 2026 at 04:18:54PM +0200, Simon Schippers wrote:
> On 4/28/26 16:10, Michael S. Tsirkin wrote:
> > On Tue, Apr 28, 2026 at 03:41:20PM +0200, Simon Schippers wrote:
> >> On 4/28/26 15:22, Michael S. Tsirkin wrote:
> >>> On Tue, Apr 28, 2026 at 03:10:44PM +0200, Simon Schippers wrote:
> >>>> On 4/28/26 14:50, Michael S. Tsirkin wrote:
> >>>>> On Tue, Apr 28, 2026 at 02:38:59PM +0200, Simon Schippers wrote:
> >>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>>>>> reaches capacity, the netdev queue is stopped instead of dropping
> >>>>>> subsequent packets.
> >>>>>>
> >>>>>> If producing an entry fails anyway due to a race, tun_net_xmit returns
> >>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
> >>>>>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>>>>
> >>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>>>>>
> >>>>>> The existing __tun_wake_queue() function of the consumer races with the
> >>>>>> producer for waking/stopping the netdev queue: the consumer may drain
> >>>>>> the ring just as the producer stops the queue, leading to a permanent
> >>>>>> stall. To avoid this, the producer re-checks the ring after stopping
> >>>>>> and wakes the queue itself if space was just made. An
> >>>>>> smp_mb__after_atomic() is required so the re-peek of the ring sees any
> >>>>>> drain that the consumer performed.
> >>>>>> smp_mb__after_atomic() pairs with the test_and_clear_bit() inside of
> >>>>>> netif_wake_subqueue():
> >>>>>>
> >>>>>> Consumer CPU                  Producer CPU
> >>>>>> ========================      =========================
> >>>>>> __ptr_ring_consume()
> >>>>>> netif_wake_subqueue()         netif_tx_stop_queue()
> >>>>>>           /\                  smp_mb__after_atomic()
> >>>>>>           ||                  __ptr_ring_produce_peek()
> >>>>>> contains RMW operation
> >>>>>>  test_and_clear_bit()
> >>>>>>           /\
> >>>>>>           ||
> >>>>>>  "Fully ordered RMW:
> >>>>>> smp_mb() before + after"
> >>>>>>     - atomic_t.txt
> >>>>>>
> >>>>>> Benchmarks:
> >>>>>> The benchmarks show a slight regression in raw transmission performance,
> >>>>>> though no packets are lost anymore.
> >>>>>
> >>>>> Could you include the packets received as well?
> >>>>> To demonstrate the gains/lack of loss. 
> >>>>>
> >>>>
> >>>> Do you mean the number of packets received by the VM?
> >>>> They should just be the same as the number sent (shown below), right?
> >>>
> >>> Minus the loss? Which this is about, right?
> >>
> >> Yes. I simply calculated "Lost/s":
> >>
> >> elapsed_time = 100e6 / sent_pps
> >> Lost/s = total_errors / elapsed_time
> >>
> >>
> >> To recover total_errors, for example for TAP with
> >> 1 thread sending:
> >>
> >> elapsed_time = 100e6 / 1.136 Mpps = 88s
> >>
> >> 3.758 Mpps = total_errors / 88s
> >> <=> total_errors = 331 million packets
> >>
> >> So, out of 431 million packets sent, 100 million were successfully
> >> delivered and 331 million were lost.
> > 
> > That is my issue.
> > 
> > I kind of have trouble mapping that to the table below.
> > For example:
> > 
> >  | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
> >  |            +-------------+--------------+----------------+----------+
> >  |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
> > 
> > how can # of lost packets exceed the # of transmitted packets?
> > 
> > Thanks!
> 
> I just use the sample script [1]:
> 
> ./pktgen_sample02_multiqueue.sh -n 100000000 ...
> 
> ... and this runs until 100_000_000 packets were successfully
> transmitted, regardless of any lost packets/errors.
> 
> [1] Link: https://www.kernel.org/doc/html/latest/networking/pktgen.html#sample-scripts

Confused. Are you saying "transmitted" is actually "received"? And the #
of packets sent is Transmitted + Lost?

> > 
> > 
> >>>
> >>>> I assume they would be visible as RX-DRP for TAP.
> >>>> For TAP + vhost-net I would have to rewrite the XDP drop
> >>>> program to count the number of dropped packets...
> >>>> And I would have to automate it...
> >>>>
> >>>>>>
> >>>>>> The previously introduced threshold to only wake after the queue stopped
> >>>>>> and half of the ring was consumed proved to be a decent choice:
> >>>>>> Waking the queue whenever a consume made space in the ring strongly
> >>>>>> degrades performance for tap, while waking only when the ring is empty
> >>>>>> is too late and also hurts throughput for tap & tap+vhost-net.
> >>>>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
> >>>>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
> >>>>>> tun/tap+vhost-net.
> >>>>>>
> >>>>>> Test setup:
> >>>>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> >>>>>> Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
> >>>>>> mitigations disabled.
> >>>>>>
> >>>>>> Note for tap+vhost-net:
> >>>>>> XDP drop program active in VM -> ~2.5x faster; for tap alone it is
> >>>>>> slower due to more syscalls (high utilization of
> >>>>>> entry_SYSRETQ_unsafe_stack in perf)
> >>>>>>
> >>>>>> +--------------------------+--------------+----------------+----------+
> >>>>>> | 1 thread                 | Stock        | Patched with   | diff     |
> >>>>>> | sending                  |              | fq_codel qdisc |          |
> >>>>>> +------------+-------------+--------------+----------------+----------+
> >>>>>> | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
> >>>>>> |            +-------------+--------------+----------------+----------+
> >>>>>> |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
> >>>>>> +------------+-------------+--------------+----------------+----------+
> >>>>>> | TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
> >>>>>> |            +-------------+--------------+----------------+----------+
> >>>>>> | +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
> >>>>>> +------------+-------------+--------------+----------------+----------+
> >>>>>>
> >>>>>> +--------------------------+--------------+----------------+----------+
> >>>>>> | 2 threads                | Stock        | Patched with   | diff     |
> >>>>>> | sending                  |              | fq_codel qdisc |          |
> >>>>>> +------------+-------------+--------------+----------------+----------+
> >>>>>> | TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
> >>>>>> |            +-------------+--------------+----------------+----------+
> >>>>>> |            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
> >>>>>> +------------+-------------+--------------+----------------+----------+
> >>>>>> | TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
> >>>>>> |            +-------------+--------------+----------------+----------+
> >>>>>> | +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
> >>>>>> +------------+-------------+--------------+----------------+----------+
> >>>>>>
> >>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>> ---
> >>>>>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
> >>>>>>  1 file changed, 28 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>> index efe809597622..c2a1618cc9db 100644
> >>>>>> --- a/drivers/net/tun.c
> >>>>>> +++ b/drivers/net/tun.c
> >>>>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>  	struct netdev_queue *queue;
> >>>>>>  	struct tun_file *tfile;
> >>>>>>  	int len = skb->len;
> >>>>>> +	bool qdisc_present;
> >>>>>> +	int ret;
> >>>>>>  
> >>>>>>  	rcu_read_lock();
> >>>>>>  	tfile = rcu_dereference(tun->tfiles[txq]);
> >>>>>> @@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>  
> >>>>>>  	nf_reset_ct(skb);
> >>>>>>  
> >>>>>> -	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>>>> +	queue = netdev_get_tx_queue(dev, txq);
> >>>>>> +	qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>>>> +
> >>>>>> +	spin_lock(&tfile->tx_ring.producer_lock);
> >>>>>> +	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>>>> +	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>>>> +		netif_tx_stop_queue(queue);
> >>>>>> +		/* Re-peek and wake if the consumer drained the ring
> >>>>>> +		 * concurrently in a race. smp_mb__after_atomic() pairs
> >>>>>> +		 * with the test_and_clear_bit() of netif_wake_subqueue()
> >>>>>> +		 * in __tun_wake_queue().
> >>>>>> +		 */
> >>>>>> +		smp_mb__after_atomic();
> >>>>>> +		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>>>> +			netif_tx_wake_queue(queue);
> >>>>>> +	}
> >>>>>> +	spin_unlock(&tfile->tx_ring.producer_lock);
> >>>>>> +
> >>>>>> +	if (ret) {
> >>>>>> +		/* If a qdisc is attached to our virtual device,
> >>>>>> +		 * returning NETDEV_TX_BUSY is allowed.
> >>>>>> +		 */
> >>>>>> +		if (qdisc_present) {
> >>>>>> +			rcu_read_unlock();
> >>>>>> +			return NETDEV_TX_BUSY;
> >>>>>> +		}
> >>>>>>  		drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>>>  		goto drop;
> >>>>>>  	}
> >>>>>>  
> >>>>>>  	/* dev->lltx requires to do our own update of trans_start */
> >>>>>> -	queue = netdev_get_tx_queue(dev, txq);
> >>>>>>  	txq_trans_cond_update(queue);
> >>>>>>  
> >>>>>>  	/* Notify and wake up reader process */
> >>>>>> -- 
> >>>>>> 2.43.0
> >>>>>
> >>>
> > 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
  2026-04-28 14:32               ` Michael S. Tsirkin
@ 2026-04-28 14:55                 ` Simon Schippers
  0 siblings, 0 replies; 14+ messages in thread
From: Simon Schippers @ 2026-04-28 14:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 4/28/26 16:32, Michael S. Tsirkin wrote:
> On Tue, Apr 28, 2026 at 04:18:54PM +0200, Simon Schippers wrote:
>> On 4/28/26 16:10, Michael S. Tsirkin wrote:
>>> On Tue, Apr 28, 2026 at 03:41:20PM +0200, Simon Schippers wrote:
>>>> On 4/28/26 15:22, Michael S. Tsirkin wrote:
>>>>> On Tue, Apr 28, 2026 at 03:10:44PM +0200, Simon Schippers wrote:
>>>>>> On 4/28/26 14:50, Michael S. Tsirkin wrote:
>>>>>>> On Tue, Apr 28, 2026 at 02:38:59PM +0200, Simon Schippers wrote:
>>>>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>>>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>>>>>>> reaches capacity, the netdev queue is stopped instead of dropping
>>>>>>>> subsequent packets.
>>>>>>>>
>>>>>>>> If producing an entry fails anyway due to a race, tun_net_xmit returns
>>>>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such races are expected because
>>>>>>>> LLTX is enabled and the transmit path operates without the usual locking.
>>>>>>>>
>>>>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>>>>>>>
>>>>>>>> The existing __tun_wake_queue() function of the consumer races with the
>>>>>>>> producer for waking/stopping the netdev queue: the consumer may drain
>>>>>>>> the ring just as the producer stops the queue, leading to a permanent
>>>>>>>> stall. To avoid this, the producer re-checks the ring after stopping
>>>>>>>> and wakes the queue itself if space was just made. An
>>>>>>>> smp_mb__after_atomic() is required so the re-peek of the ring sees any
>>>>>>>> drain that the consumer performed.
>>>>>>>> smp_mb__after_atomic() pairs with the test_and_clear_bit() inside of
>>>>>>>> netif_wake_subqueue():
>>>>>>>>
>>>>>>>> Consumer CPU                  Producer CPU
>>>>>>>> ========================      =========================
>>>>>>>> __ptr_ring_consume()
>>>>>>>> netif_wake_subqueue()         netif_tx_stop_queue()
>>>>>>>>           /\                  smp_mb__after_atomic()
>>>>>>>>           ||                  __ptr_ring_produce_peek()
>>>>>>>> contains RMW operation
>>>>>>>>  test_and_clear_bit()
>>>>>>>>           /\
>>>>>>>>           ||
>>>>>>>>  "Fully ordered RMW:
>>>>>>>> smp_mb() before + after"
>>>>>>>>     - atomic_t.txt
>>>>>>>>
>>>>>>>> Benchmarks:
>>>>>>>> The benchmarks show a slight regression in raw transmission performance,
>>>>>>>> though no packets are lost anymore.
>>>>>>>
>>>>>>> Could you include the packets received as well?
>>>>>>> To demonstrate the gains/lack of loss. 
>>>>>>>
>>>>>>
>>>>>> Do you mean the number of packets received by the VM?
>>>>>> They should just be the same as the number sent (shown below), right?
>>>>>
>>>>> Minus the loss? Which this is about, right?
>>>>
>>>> Yes. I simply calculated "Lost/s":
>>>>
>>>> elapsed_time = 100e6 / sent_pps
>>>> Lost/s = total_errors / elapsed_time
>>>>
>>>>
>>>> To recover total_errors, for example for TAP with
>>>> 1 thread sending:
>>>>
>>>> elapsed_time = 100e6 / 1.136 Mpps = 88s
>>>>
>>>> 3.758 Mpps = total_errors / 88s
>>>> <=> total_errors = 331 million packets
>>>>
>>>> So, out of 431 million packets sent, 100 million were successfully
>>>> delivered and 331 million were lost.
>>>
>>> That is my issue.
>>>
>>> I kind of have trouble mapping that to the table below.
>>> For example:
>>>
>>>  | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
>>>  |            +-------------+--------------+----------------+----------+
>>>  |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
>>>
>>> how can # of lost packets exceed the # of transmitted packets?
>>>
>>> Thanks!
>>
>> I just use the sample script [1]:
>>
>> ./pktgen_sample02_multiqueue.sh -n 100000000 ...
>>
>> ... and this runs until 100_000_000 packets were successfully
>> transmitted, regardless of any lost packets/errors.
>>
>> [1] Link: https://www.kernel.org/doc/html/latest/networking/pktgen.html#sample-scripts
> 
> Confused. Are you saying "transmitted" is actually "received"? And the #
> of packets sent is Transmitted + Lost?

Sorry for my confusing answer.

Yes, "transmitted" in the table should be changed to "received".
And yes, as you said, the real # transmitted then is:
Received + Lost = 1.136 + 3.758 = 4.894 Mpps.
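
A quick sanity check tying the two rows together (stock 1-thread TAP
figures assumed, purely illustrative):

awk 'BEGIN {
	received = 1.136; lost = 3.758		# Mpps, from the table
	sent = received + lost			# 4.894 Mpps offered load
	printf "loss ratio = %.1f%%\n", 100 * lost / sent	# ~76.8%
}'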

> 
>>>
>>>
>>>>>
>>>>>> I assume they would be visible as RX-DRP for TAP.
>>>>>> For TAP + vhost-net I would have to rewrite the XDP drop
>>>>>> program to count the number of dropped packets...
>>>>>> And I would have to automate it...
>>>>>>
>>>>>>>>
>>>>>>>> The previously introduced threshold to only wake after the queue stopped
>>>>>>>> and half of the ring was consumed proved to be a decent choice:
>>>>>>>> Waking the queue whenever a consume made space in the ring strongly
>>>>>>>> degrades performance for tap, while waking only when the ring is empty
>>>>>>>> is too late and also hurts throughput for tap & tap+vhost-net.
>>>>>>>> Other ratios (3/4, 7/8) showed similar results (not shown here), so
>>>>>>>> 1/2 was chosen for the sake of simplicity for both tun/tap and
>>>>>>>> tun/tap+vhost-net.
>>>>>>>>
>>>>>>>> Test setup:
>>>>>>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
>>>>>>>> Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
>>>>>>>> mitigations disabled.
>>>>>>>>
>>>>>>>> Note for tap+vhost-net:
>>>>>>>> XDP drop program active in VM -> ~2.5x faster; for tap alone it is
>>>>>>>> slower due to more syscalls (high utilization of
>>>>>>>> entry_SYSRETQ_unsafe_stack in perf)
>>>>>>>>
>>>>>>>> +--------------------------+--------------+----------------+----------+
>>>>>>>> | 1 thread                 | Stock        | Patched with   | diff     |
>>>>>>>> | sending                  |              | fq_codel qdisc |          |
>>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>>> | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
>>>>>>>> |            +-------------+--------------+----------------+----------+
>>>>>>>> |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
>>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>>> | TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
>>>>>>>> |            +-------------+--------------+----------------+----------+
>>>>>>>> | +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
>>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>>>
>>>>>>>> +--------------------------+--------------+----------------+----------+
>>>>>>>> | 2 threads                | Stock        | Patched with   | diff     |
>>>>>>>> | sending                  |              | fq_codel qdisc |          |
>>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>>> | TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
>>>>>>>> |            +-------------+--------------+----------------+----------+
>>>>>>>> |            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
>>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>>> | TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
>>>>>>>> |            +-------------+--------------+----------------+----------+
>>>>>>>> | +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
>>>>>>>> +------------+-------------+--------------+----------------+----------+
>>>>>>>>
>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>> ---
>>>>>>>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
>>>>>>>>  1 file changed, 28 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>> index efe809597622..c2a1618cc9db 100644
>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>>>  	struct netdev_queue *queue;
>>>>>>>>  	struct tun_file *tfile;
>>>>>>>>  	int len = skb->len;
>>>>>>>> +	bool qdisc_present;
>>>>>>>> +	int ret;
>>>>>>>>  
>>>>>>>>  	rcu_read_lock();
>>>>>>>>  	tfile = rcu_dereference(tun->tfiles[txq]);
>>>>>>>> @@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>>>  
>>>>>>>>  	nf_reset_ct(skb);
>>>>>>>>  
>>>>>>>> -	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>>>>>>> +	queue = netdev_get_tx_queue(dev, txq);
>>>>>>>> +	qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>>>>>> +
>>>>>>>> +	spin_lock(&tfile->tx_ring.producer_lock);
>>>>>>>> +	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>>>>>>> +	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>>>>>>> +		netif_tx_stop_queue(queue);
>>>>>>>> +		/* Re-peek and wake if the consumer drained the ring
>>>>>>>> +		 * concurrently in a race. smp_mb__after_atomic() pairs
>>>>>>>> +		 * with the test_and_clear_bit() of netif_wake_subqueue()
>>>>>>>> +		 * in __tun_wake_queue().
>>>>>>>> +		 */
>>>>>>>> +		smp_mb__after_atomic();
>>>>>>>> +		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>>>>>>> +			netif_tx_wake_queue(queue);
>>>>>>>> +	}
>>>>>>>> +	spin_unlock(&tfile->tx_ring.producer_lock);
>>>>>>>> +
>>>>>>>> +	if (ret) {
>>>>>>>> +		/* If a qdisc is attached to our virtual device,
>>>>>>>> +		 * returning NETDEV_TX_BUSY is allowed.
>>>>>>>> +		 */
>>>>>>>> +		if (qdisc_present) {
>>>>>>>> +			rcu_read_unlock();
>>>>>>>> +			return NETDEV_TX_BUSY;
>>>>>>>> +		}
>>>>>>>>  		drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>>>>>  		goto drop;
>>>>>>>>  	}
>>>>>>>>  
>>>>>>>>  	/* dev->lltx requires to do our own update of trans_start */
>>>>>>>> -	queue = netdev_get_tx_queue(dev, txq);
>>>>>>>>  	txq_trans_cond_update(queue);
>>>>>>>>  
>>>>>>>>  	/* Notify and wake up reader process */
>>>>>>>> -- 
>>>>>>>> 2.43.0
>>>>>>>
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops
  2026-04-28 12:38 [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (3 preceding siblings ...)
  2026-04-28 12:38 ` [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present Simon Schippers
@ 2026-04-29 21:04 ` Simon Schippers
  4 siblings, 0 replies; 14+ messages in thread
From: Simon Schippers @ 2026-04-29 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

I just saw that the Sashiko AI review found some valid regressions.

I will update and test the patch set accordingly and send a new
version...


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2026-04-29 21:04 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-28 12:38 [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
2026-04-28 12:38 ` [PATCH net-next v9 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
2026-04-28 12:38 ` [PATCH net-next v9 2/4] vhost-net: wake queue of tun/tap after ptr_ring consume Simon Schippers
2026-04-28 12:38 ` [PATCH net-next v9 3/4] ptr_ring: move free-space check into separate helper Simon Schippers
2026-04-28 12:38 ` [PATCH net-next v9 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present Simon Schippers
2026-04-28 12:50   ` Michael S. Tsirkin
2026-04-28 13:10     ` Simon Schippers
2026-04-28 13:22       ` Michael S. Tsirkin
2026-04-28 13:41         ` Simon Schippers
2026-04-28 14:10           ` Michael S. Tsirkin
2026-04-28 14:18             ` Simon Schippers
2026-04-28 14:32               ` Michael S. Tsirkin
2026-04-28 14:55                 ` Simon Schippers
2026-04-29 21:04 ` [PATCH net-next v9 0/4] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox