public inbox for virtualization@lists.linux-foundation.org
* [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops
@ 2026-01-07 21:04 Simon Schippers
  2026-01-07 21:04 ` [PATCH net-next v7 1/9] ptr_ring: move free-space check into separate helper Simon Schippers
                   ` (8 more replies)
  0 siblings, 9 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-07 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

This patch series addresses tun/tap and vhost-net, which drop incoming 
SKBs whenever their internal ptr_ring buffer is full. With this patch 
series, the associated netdev queue is instead stopped before the ring 
overflows - but only when a qdisc is attached. If no qdisc is present, 
the existing behavior is preserved.

By applying proper backpressure, this change allows the connected qdisc to 
operate correctly, as reported in [1], and significantly improves 
performance in real-world scenarios, as demonstrated in our paper [2]. For 
example, we observed a 36% TCP throughput improvement for an OpenVPN 
connection between Germany and the USA.

At the same time, synthetic benchmarks (details below) show only a minor 
performance impact:
(1) With the noqueue qdisc, the patched behavior matches the stock 
    implementation, as expected. In both configurations, a significant
    number of packets are dropped.
(2) pktgen benchmarks show a ~5-10% throughput reduction for TAP alone, 
    while no performance impact is observed for TAP + vhost-net. In both 
    cases, zero packet drops are observed.
(3) TCP benchmarks using iperf3 show no performance degradation for either 
    TAP or TAP combined with vhost-net.

This patch series touches tun/tap and vhost-net, as they share common 
logic and must be updated together. Modifying only one of them would break 
the others. The series is therefore structured as follows:
(1-2) ptr_ring:  Introduce new helpers, which are used by patches (3)
                 and (9).
(3)   tun/tap:   add a ptr_ring consume helper with netdev queue wakeup.
(4-8) vhost-net: introduce and switch to the new tun/tap ptr_ring 
                 wrappers with netdev queue wakeup.
(9)   tun/tap:   avoid ptr_ring tail-drop when a qdisc is present by 
                 stopping the netdev queue.

+-------------------------+-----------+---------------+----------------+
| pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
| Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
| 10M packets             |           |               |                |
+-----------+-------------+-----------+---------------+----------------+
| TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
|           +-------------+-----------+---------------+----------------+
|           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
+-----------+-------------+-----------+---------------+----------------+
| TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
|  +        +-------------+-----------+---------------+----------------+
| vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
+-----------+-------------+-----------+---------------+----------------+

+-------------------------+-----------+---------------+----------------+
| pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
| Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
| 10M packets,            |           |               |                |
| *4 threads*             |           |               |                |
+-----------+-------------+-----------+---------------+----------------+
| TAP       | Transmitted | 26 Kpps   | 26 Kpps       | 23 Kpps        |
|           +-------------+-----------+---------------+----------------+
|           | Lost        | 1535 Kpps | 1551 Kpps     | 0              |
+-----------+-------------+-----------+---------------+----------------+
| TAP       | Transmitted | 64 Kpps   | 63 Kpps       | 66 Kpps        |
|  +        +-------------+-----------+---------------+----------------+
| vhost-net | Lost        | 1550 Kpps | 1506 Kpps     | 0              |
+-----------+-------------+-----------+---------------+----------------+

+-----------------------+-------------+---------------+----------------+
| iperf3 TCP benchmarks | Stock       | Patched with  | Patched with   |
| to Debian VM          |             | noqueue qdisc | fq_codel qdisc |
| i5 6300HQ, 120s       |             |               |                |
+-----------------------+-------------+---------------+----------------+
| TAP                   | 1.71 Gbit/s | 1.71 Gbit/s   | 1.71 Gbit/s    |
+-----------------------+-------------+---------------+----------------+
| TAP + vhost-net       | 22.1 Gbit/s | 22.0 Gbit/s   | 22.0 Gbit/s    |
+-----------------------+-------------+---------------+----------------+

[1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
[2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
[3] Link: https://lore.kernel.org/r/174549940981.608169.4363875844729313831.stgit@firesoul
[4] Link: https://lore.kernel.org/r/176295323282.307447.14790015927673763094.stgit@firesoul

---
Changelog:
V7:
- Switch to an approach similar to veth [3] (excluding the recently fixed 
variant [4]), as suggested by MST, with minor adjustments discussed in V6
- Rename the cover-letter title
- Add multithreaded pktgen and iperf3 benchmarks, as suggested by Jason 
Wang
- Rework __ptr_ring_consume_created_space() so it can also be used after 
batched consume

V6: https://lore.kernel.org/netdev/20251120152914.1127975-1-simon.schippers@tu-dortmund.de/
General:
- Major adjustments to the descriptions. Special thanks to Jon Kohler!
- Fix git bisect by moving most logic into dedicated functions, only 
starting to use them in patch 7.
- Moved the main logic of the coupled producer and consumer into a single 
patch to avoid a chicken-and-egg dependency between commits :-)
- Rebased to 6.18-rc5 and re-ran the benchmarks, which now also include 
lost packets (previously I missed a zero, so all benchmark results were 
too high by a factor of 10...).
- Also include the benchmark in patch 7.

Producer:
- Move logic into the new helper tun_ring_produce()
- Added an smp_rmb() paired with the consumer, ensuring that space freed 
by the consumer is visible
- Assume that ptr_ring is not full when __ptr_ring_full_next() is called

Consumer:
- Use an unpaired smp_rmb() instead of barrier() to ensure that the 
netif_tx_queue_stopped() call completes before discarding
- Also wake the netdev queue if it was stopped before discarding and then 
becomes empty
-> Fixes race with producer as identified by MST in V5
-> Waking the netdev queues upon resize is not required anymore
- Use __ptr_ring_consume_created_space() instead of messing with ptr_ring 
internals
-> Batched consume now just calls 
__tun_ring_consume()/__tap_ring_consume() in a loop
- Added an smp_wmb() before waking the netdev queue which is paired with 
the smp_rmb() discussed above

V5: https://lore.kernel.org/netdev/20250922221553.47802-1-simon.schippers@tu-dortmund.de/T/#u
- Stop the netdev queue prior to producing the final fitting ptr_ring entry
-> Ensures the consumer has the latest netdev queue state, making it safe 
to wake the queue
-> Resolves an issue in vhost-net where the netdev queue could remain 
stopped despite being empty
-> For TUN/TAP, the netdev queue no longer needs to be woken in the 
blocking loop
-> Introduces new helpers __ptr_ring_full_next and 
__ptr_ring_will_invalidate for this purpose
- vhost-net now uses wrappers of TUN/TAP for ptr_ring consumption rather 
than maintaining its own rx_ring pointer

V4: https://lore.kernel.org/netdev/20250902080957.47265-1-simon.schippers@tu-dortmund.de/T/#u
- Target net-next instead of net
- Changed to patch series instead of single patch
- Changed the title from the previous 
"TUN/TAP: Improving throughput and latency by avoiding SKB drops"
- Wake netdev queue with new helpers wake_netdev_queue when there is any 
spare capacity in the ptr_ring instead of waiting for it to be empty
- Use tun_file instead of tun_struct in tun_ring_recv for more consistent 
logic
- Use smp_wmb() and smp_rmb() barrier pair, which avoids any packet drops 
that happened rarely before
- Use safer logic for vhost-net using RCU read locks to access TUN/TAP data

V3: https://lore.kernel.org/netdev/20250825211832.84901-1-simon.schippers@tu-dortmund.de/T/#u
- Added support for TAP and TAP+vhost-net.

V2: https://lore.kernel.org/netdev/20250811220430.14063-1-simon.schippers@tu-dortmund.de/T/#u
- Removed NETDEV_TX_BUSY return case in tun_net_xmit and removed 
unnecessary netif_tx_wake_queue in tun_ring_recv.

V1: https://lore.kernel.org/netdev/20250808153721.261334-1-simon.schippers@tu-dortmund.de/T/#u
---

Simon Schippers (9):
  ptr_ring: move free-space check into separate helper
  ptr_ring: add helper to detect newly freed space on consume
  tun/tap: add ptr_ring consume helper with netdev queue wakeup
  tun/tap: add batched ptr_ring consume functions with netdev queue
    wakeup
  tun/tap: add unconsume function for returning entries to ptr_ring
  tun/tap: add helper functions to check file type
  vhost-net: replace rx_ring with tun/tap ring wrappers
  tun/tap: drop get ring exports
  tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present

 drivers/net/tap.c        | 66 ++++++++++++++++++++++++---
 drivers/net/tun.c        | 99 ++++++++++++++++++++++++++++++++++++----
 drivers/vhost/net.c      | 92 ++++++++++++++++++++++++-------------
 include/linux/if_tap.h   | 16 +++++--
 include/linux/if_tun.h   | 18 ++++++--
 include/linux/ptr_ring.h | 27 ++++++++++-
 6 files changed, 263 insertions(+), 55 deletions(-)

--
2.43.0


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 1/9] ptr_ring: move free-space check into separate helper
  2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
@ 2026-01-07 21:04 ` Simon Schippers
  2026-01-07 21:04 ` [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume Simon Schippers
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-07 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Move the check for available free space for a new entry into a separate
helper. As a result, __ptr_ring_produce() remains logically unchanged,
while the new helper allows callers to determine in advance whether
subsequent __ptr_ring_produce() calls will succeed. This information
can, for example, be used to temporarily stop producing until
__ptr_ring_peek() indicates that space is available again.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 include/linux/ptr_ring.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 534531807d95..a5a3fa4916d3 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -96,6 +96,14 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
 	return ret;
 }
 
+static inline int __ptr_ring_produce_peek(struct ptr_ring *r)
+{
+	if (unlikely(!r->size) || r->queue[r->producer])
+		return -ENOSPC;
+
+	return 0;
+}
+
 /* Note: callers invoking this in a loop must use a compiler barrier,
  * for example cpu_relax(). Callers must hold producer_lock.
  * Callers are responsible for making sure pointer that is being queued
@@ -103,8 +111,10 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
  */
 static inline int __ptr_ring_produce(struct ptr_ring *r, void *ptr)
 {
-	if (unlikely(!r->size) || r->queue[r->producer])
-		return -ENOSPC;
+	int p = __ptr_ring_produce_peek(r);
+
+	if (p)
+		return p;
 
 	/* Make sure the pointer we are storing points to a valid data. */
 	/* Pairs with the dependency ordering in __ptr_ring_consume. */
-- 
2.43.0



* [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume
  2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
  2026-01-07 21:04 ` [PATCH net-next v7 1/9] ptr_ring: move free-space check into separate helper Simon Schippers
@ 2026-01-07 21:04 ` Simon Schippers
  2026-01-08  3:23   ` Jason Wang
  2026-01-09  7:22   ` Michael S. Tsirkin
  2026-01-07 21:04 ` [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-07 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

This proposed function checks whether __ptr_ring_zero_tail() was invoked
within the last n calls to __ptr_ring_consume(), which indicates that new
free space was created. Since __ptr_ring_zero_tail() moves the tail to
the head - and no other function modifies either the head or the tail,
aside from the wrap-around case described below - detecting such a
movement is sufficient to detect the invocation of
__ptr_ring_zero_tail().

The implementation detects this movement by checking whether the tail is
at most n positions behind the head. If this condition holds, the shift
of the tail to its current position must have occurred within the last n
calls to __ptr_ring_consume(), indicating that __ptr_ring_zero_tail() was
invoked and that new free space was created.

This logic also correctly handles the wrap-around case in which
__ptr_ring_zero_tail() is invoked and the head and the tail are reset
to 0. Since this reset likewise moves the tail to the head, the same
detection logic applies.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 include/linux/ptr_ring.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index a5a3fa4916d3..7cdae6d1d400 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -438,6 +438,19 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
 	return ret;
 }
 
+/* Returns true if the consume of the last n elements has created space
+ * in the ring buffer (i.e., a new element can be produced).
+ *
+ * Note: Because of batching, a successful call to __ptr_ring_consume() /
+ * __ptr_ring_consume_batched() does not guarantee that the next call to
+ * __ptr_ring_produce() will succeed.
+ */
+static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r,
+						    int n)
+{
+	return r->consumer_head - r->consumer_tail < n;
+}
+
 /* Cast to structure type and call a function without discarding from FIFO.
  * Function must return a value.
  * Callers must take consumer_lock.
-- 
2.43.0



* [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
  2026-01-07 21:04 ` [PATCH net-next v7 1/9] ptr_ring: move free-space check into separate helper Simon Schippers
  2026-01-07 21:04 ` [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume Simon Schippers
@ 2026-01-07 21:04 ` Simon Schippers
  2026-01-08  3:38   ` Jason Wang
  2026-01-07 21:04 ` [PATCH net-next v7 4/9] tun/tap: add batched ptr_ring consume functions " Simon Schippers
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-07 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
and wake the corresponding netdev subqueue when consuming an entry frees
space in the underlying ptr_ring.

Stopping the netdev queue when the ptr_ring is full will be introduced
in an upcoming commit.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c | 23 ++++++++++++++++++++++-
 drivers/net/tun.c | 25 +++++++++++++++++++++++--
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 1197f245e873..2442cf7ac385 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
 	return ret ? ret : total;
 }
 
+static void *tap_ring_consume(struct tap_queue *q)
+{
+	struct ptr_ring *ring = &q->ring;
+	struct net_device *dev;
+	void *ptr;
+
+	spin_lock(&ring->consumer_lock);
+
+	ptr = __ptr_ring_consume(ring);
+	if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
+		rcu_read_lock();
+		dev = rcu_dereference(q->tap)->dev;
+		netif_wake_subqueue(dev, q->queue_index);
+		rcu_read_unlock();
+	}
+
+	spin_unlock(&ring->consumer_lock);
+
+	return ptr;
+}
+
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
 			   int noblock, struct sk_buff *skb)
@@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
 					TASK_INTERRUPTIBLE);
 
 		/* Read frames from the queue */
-		skb = ptr_ring_consume(&q->ring);
+		skb = tap_ring_consume(q);
 		if (skb)
 			break;
 		if (noblock) {
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8192740357a0..7148f9a844a4 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	return total;
 }
 
+static void *tun_ring_consume(struct tun_file *tfile)
+{
+	struct ptr_ring *ring = &tfile->tx_ring;
+	struct net_device *dev;
+	void *ptr;
+
+	spin_lock(&ring->consumer_lock);
+
+	ptr = __ptr_ring_consume(ring);
+	if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
+		rcu_read_lock();
+		dev = rcu_dereference(tfile->tun)->dev;
+		netif_wake_subqueue(dev, tfile->queue_index);
+		rcu_read_unlock();
+	}
+
+	spin_unlock(&ring->consumer_lock);
+
+	return ptr;
+}
+
 static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 {
 	DECLARE_WAITQUEUE(wait, current);
 	void *ptr = NULL;
 	int error = 0;
 
-	ptr = ptr_ring_consume(&tfile->tx_ring);
+	ptr = tun_ring_consume(tfile);
 	if (ptr)
 		goto out;
 	if (noblock) {
@@ -2131,7 +2152,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 
 	while (1) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		ptr = ptr_ring_consume(&tfile->tx_ring);
+		ptr = tun_ring_consume(tfile);
 		if (ptr)
 			break;
 		if (signal_pending(current)) {
-- 
2.43.0



* [PATCH net-next v7 4/9] tun/tap: add batched ptr_ring consume functions with netdev queue wakeup
  2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (2 preceding siblings ...)
  2026-01-07 21:04 ` [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
@ 2026-01-07 21:04 ` Simon Schippers
  2026-01-07 21:04 ` [PATCH net-next v7 5/9] tun/tap: add unconsume function for returning entries to ptr_ring Simon Schippers
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-07 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Add {tun,tap}_ring_consume_batched() that wrap
__ptr_ring_consume_batched() and wake the corresponding netdev subqueue
when consuming the entries frees space in the ptr_ring.

These wrappers will be used by vhost-net in an upcoming commit.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c      | 23 +++++++++++++++++++++++
 drivers/net/tun.c      | 23 +++++++++++++++++++++++
 include/linux/if_tap.h |  6 ++++++
 include/linux/if_tun.h |  7 +++++++
 4 files changed, 59 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 2442cf7ac385..7e3b4eed797c 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -774,6 +774,29 @@ static void *tap_ring_consume(struct tap_queue *q)
 	return ptr;
 }
 
+int tap_ring_consume_batched(struct file *file, void **array, int n)
+{
+	struct tap_queue *q = file->private_data;
+	struct ptr_ring *ring = &q->ring;
+	struct net_device *dev;
+	int i;
+
+	spin_lock(&ring->consumer_lock);
+
+	i = __ptr_ring_consume_batched(ring, array, n);
+	if (__ptr_ring_consume_created_space(ring, i)) {
+		rcu_read_lock();
+		dev = rcu_dereference(q->tap)->dev;
+		netif_wake_subqueue(dev, q->queue_index);
+		rcu_read_unlock();
+	}
+
+	spin_unlock(&ring->consumer_lock);
+
+	return i;
+}
+EXPORT_SYMBOL_GPL(tap_ring_consume_batched);
+
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
 			   int noblock, struct sk_buff *skb)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7148f9a844a4..db3b72025cfb 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3736,6 +3736,29 @@ struct socket *tun_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
+int tun_ring_consume_batched(struct file *file, void **array, int n)
+{
+	struct tun_file *tfile = file->private_data;
+	struct ptr_ring *ring = &tfile->tx_ring;
+	struct net_device *dev;
+	int i;
+
+	spin_lock(&ring->consumer_lock);
+
+	i = __ptr_ring_consume_batched(ring, array, n);
+	if (__ptr_ring_consume_created_space(ring, i)) {
+		rcu_read_lock();
+		dev = rcu_dereference(tfile->tun)->dev;
+		netif_wake_subqueue(dev, tfile->queue_index);
+		rcu_read_unlock();
+	}
+
+	spin_unlock(&ring->consumer_lock);
+
+	return i;
+}
+EXPORT_SYMBOL_GPL(tun_ring_consume_batched);
+
 struct ptr_ring *tun_get_tx_ring(struct file *file)
 {
 	struct tun_file *tfile;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 553552fa635c..cf8b90320b8d 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -11,6 +11,7 @@ struct socket;
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
 struct ptr_ring *tap_get_ptr_ring(struct file *file);
+int tap_ring_consume_batched(struct file *file, void **array, int n);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -22,6 +23,11 @@ static inline struct ptr_ring *tap_get_ptr_ring(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
+static inline int tap_ring_consume_batched(struct file *f,
+					   void **array, int n)
+{
+	return 0;
+}
 #endif /* CONFIG_TAP */
 
 /*
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 80166eb62f41..444dda75a372 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -22,6 +22,7 @@ struct tun_msg_ctl {
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
+int tun_ring_consume_batched(struct file *file, void **array, int n);
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -55,6 +56,12 @@ static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
 	return ERR_PTR(-EINVAL);
 }
 
+static inline int tun_ring_consume_batched(struct file *file,
+					   void **array, int n)
+{
+	return 0;
+}
+
 static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
-- 
2.43.0



* [PATCH net-next v7 5/9] tun/tap: add unconsume function for returning entries to ptr_ring
  2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (3 preceding siblings ...)
  2026-01-07 21:04 ` [PATCH net-next v7 4/9] tun/tap: add batched ptr_ring consume functions " Simon Schippers
@ 2026-01-07 21:04 ` Simon Schippers
  2026-01-08  3:40   ` Jason Wang
  2026-01-07 21:04 ` [PATCH net-next v7 6/9] tun/tap: add helper functions to check file type Simon Schippers
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-07 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Add {tun,tap}_ring_unconsume() wrappers to allow external modules
(e.g. vhost-net) to return previously consumed entries back to the
ptr_ring. The functions delegate to ptr_ring_unconsume() and take a
destroy callback for entries that cannot be returned to the ring.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c      | 10 ++++++++++
 drivers/net/tun.c      | 10 ++++++++++
 include/linux/if_tap.h |  4 ++++
 include/linux/if_tun.h |  5 +++++
 4 files changed, 29 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 7e3b4eed797c..4ffe4e95b5a6 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -797,6 +797,16 @@ int tap_ring_consume_batched(struct file *file, void **array, int n)
 }
 EXPORT_SYMBOL_GPL(tap_ring_consume_batched);
 
+void tap_ring_unconsume(struct file *file, void **batch, int n,
+			void (*destroy)(void *))
+{
+	struct tap_queue *q = file->private_data;
+	struct ptr_ring *ring = &q->ring;
+
+	ptr_ring_unconsume(ring, batch, n, destroy);
+}
+EXPORT_SYMBOL_GPL(tap_ring_unconsume);
+
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
 			   int noblock, struct sk_buff *skb)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index db3b72025cfb..d44d206c65e8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3759,6 +3759,16 @@ int tun_ring_consume_batched(struct file *file, void **array, int n)
 }
 EXPORT_SYMBOL_GPL(tun_ring_consume_batched);
 
+void tun_ring_unconsume(struct file *file, void **batch, int n,
+			void (*destroy)(void *))
+{
+	struct tun_file *tfile = file->private_data;
+	struct ptr_ring *ring = &tfile->tx_ring;
+
+	ptr_ring_unconsume(ring, batch, n, destroy);
+}
+EXPORT_SYMBOL_GPL(tun_ring_unconsume);
+
 struct ptr_ring *tun_get_tx_ring(struct file *file)
 {
 	struct tun_file *tfile;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index cf8b90320b8d..28326a69745a 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -12,6 +12,8 @@ struct socket;
 struct socket *tap_get_socket(struct file *);
 struct ptr_ring *tap_get_ptr_ring(struct file *file);
 int tap_ring_consume_batched(struct file *file, void **array, int n);
+void tap_ring_unconsume(struct file *file, void **batch, int n,
+			void (*destroy)(void *));
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -28,6 +30,8 @@ static inline int tap_ring_consume_batched(struct file *f,
 {
 	return 0;
 }
+static inline void tap_ring_unconsume(struct file *file, void **batch,
+				      int n, void (*destroy)(void *)) {}
 #endif /* CONFIG_TAP */
 
 /*
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 444dda75a372..1274c6b34eb6 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -23,6 +23,8 @@ struct tun_msg_ctl {
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
 int tun_ring_consume_batched(struct file *file, void **array, int n);
+void tun_ring_unconsume(struct file *file, void **batch, int n,
+			void (*destroy)(void *));
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -62,6 +64,9 @@ static inline int tun_ring_consume_batched(struct file *file,
 	return 0;
 }
 
+static inline void tun_ring_unconsume(struct file *file, void **batch,
+				      int n, void (*destroy)(void *)) {}
+
 static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
-- 
2.43.0



* [PATCH net-next v7 6/9] tun/tap: add helper functions to check file type
  2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (4 preceding siblings ...)
  2026-01-07 21:04 ` [PATCH net-next v7 5/9] tun/tap: add unconsume function for returning entries to ptr_ring Simon Schippers
@ 2026-01-07 21:04 ` Simon Schippers
  2026-01-07 21:04 ` [PATCH net-next v7 7/9] vhost-net: replace rx_ring with tun/tap ring wrappers Simon Schippers
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-07 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Add tun_is_tun_file() and tap_is_tap_file() helpers to check whether a
file is a TUN or TAP file; they will be used by vhost-net.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c      | 13 +++++++++++++
 drivers/net/tun.c      | 13 +++++++++++++
 include/linux/if_tap.h |  5 +++++
 include/linux/if_tun.h |  6 ++++++
 4 files changed, 37 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 4ffe4e95b5a6..cf19d7181c2f 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1243,6 +1243,19 @@ struct ptr_ring *tap_get_ptr_ring(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_ptr_ring);
 
+bool tap_is_tap_file(struct file *file)
+{
+	struct tap_queue *q;
+
+	if (file->f_op != &tap_fops)
+		return false;
+	q = file->private_data;
+	if (!q)
+		return false;
+	return true;
+}
+EXPORT_SYMBOL_GPL(tap_is_tap_file);
+
 int tap_queue_resize(struct tap_dev *tap)
 {
 	struct net_device *dev = tap->dev;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d44d206c65e8..9d6f98e00661 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3782,6 +3782,19 @@ struct ptr_ring *tun_get_tx_ring(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_tx_ring);
 
+bool tun_is_tun_file(struct file *file)
+{
+	struct tun_file *tfile;
+
+	if (file->f_op != &tun_fops)
+		return false;
+	tfile = file->private_data;
+	if (!tfile)
+		return false;
+	return true;
+}
+EXPORT_SYMBOL_GPL(tun_is_tun_file);
+
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 28326a69745a..14194342b784 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -14,6 +14,7 @@ struct ptr_ring *tap_get_ptr_ring(struct file *file);
 int tap_ring_consume_batched(struct file *file, void **array, int n);
 void tap_ring_unconsume(struct file *file, void **batch, int n,
 			void (*destroy)(void *));
+bool tap_is_tap_file(struct file *file);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -32,6 +33,10 @@ static inline int tap_ring_consume_batched(struct file *f,
 }
 static inline void tap_ring_unconsume(struct file *file, void **batch,
 				      int n, void (*destroy)(void *)) {}
+static inline bool tap_is_tap_file(struct file *f)
+{
+	return false;
+}
 #endif /* CONFIG_TAP */
 
 /*
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 1274c6b34eb6..0910c6dbac20 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -25,6 +25,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file);
 int tun_ring_consume_batched(struct file *file, void **array, int n);
 void tun_ring_unconsume(struct file *file, void **batch, int n,
 			void (*destroy)(void *));
+bool tun_is_tun_file(struct file *file);
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -67,6 +68,11 @@ static inline int tun_ring_consume_batched(struct file *file,
 static inline void tun_ring_unconsume(struct file *file, void **batch,
 				      int n, void (*destroy)(void *)) {}
 
+static inline bool tun_is_tun_file(struct file *f)
+{
+	return false;
+}
+
 static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers
  2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (5 preceding siblings ...)
  2026-01-07 21:04 ` [PATCH net-next v7 6/9] tun/tap: add helper functions to check file type Simon Schippers
@ 2026-01-07 21:04 ` Simon Schippers
  2026-01-08  4:38   ` Jason Wang
  2026-01-07 21:04 ` [PATCH net-next v7 8/9] tun/tap: drop get ring exports Simon Schippers
  2026-01-07 21:04 ` [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present Simon Schippers
  8 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-07 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Replace the direct use of ptr_ring in the vhost-net virtqueue with
tun/tap ring wrapper helpers. Instead of storing an rx_ring pointer,
the virtqueue now stores the interface type (IF_TUN, IF_TAP, or IF_NONE)
and dispatches to the corresponding tun/tap helpers for ring
produce, consume, and unconsume operations.

Routing ring operations through the tun/tap helpers enables netdev
queue wakeups, which are required for upcoming netdev queue flow
control support shared by tun/tap and vhost-net.

No functional change is intended beyond switching to the wrapper
helpers.
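
The interface-type dispatch described above can be sketched in userspace C. The struct file and the *_is_*_file() helpers below are stubs standing in for the real tun/tap checks, and the produce path is reduced to return codes; this is an illustrative model, not the kernel API:

```c
#include <stdbool.h>

/* Stand-in types: the real code checks file->f_op against tun/tap fops. */
enum if_type { IF_NONE = 0, IF_TUN = 1, IF_TAP = 2 };

struct file { bool is_tun; bool is_tap; };

static bool tun_is_tun_file(const struct file *f) { return f->is_tun; }
static bool tap_is_tap_file(const struct file *f) { return f->is_tap; }

/* Classify the backend once, when it is attached to the virtqueue... */
static enum if_type get_if_type(const struct file *f)
{
	if (tap_is_tap_file(f))
		return IF_TAP;
	if (tun_is_tun_file(f))
		return IF_TUN;
	return IF_NONE;
}

/* ...then every ring operation switches on the stored type instead of
 * holding a raw ptr_ring pointer. Return values are placeholders for
 * the batched-consume results. */
static int buf_produce(enum if_type type)
{
	switch (type) {
	case IF_TUN:
		return 1; /* would call tun_ring_consume_batched() */
	case IF_TAP:
		return 2; /* would call tap_ring_consume_batched() */
	case IF_NONE:
		return 0; /* no backend: nothing to consume */
	}
	return 0;
}
```

Storing the type (rather than a ring pointer) is what lets the wrappers perform the queue wakeups on the tun/tap side.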

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/vhost/net.c | 92 +++++++++++++++++++++++++++++----------------
 1 file changed, 60 insertions(+), 32 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7f886d3dba7d..215556f7cd40 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -90,6 +90,12 @@ enum {
 	VHOST_NET_VQ_MAX = 2,
 };
 
+enum if_type {
+	IF_NONE = 0,
+	IF_TUN = 1,
+	IF_TAP = 2,
+};
+
 struct vhost_net_ubuf_ref {
 	/* refcount follows semantics similar to kref:
 	 *  0: object is released
@@ -127,10 +133,11 @@ struct vhost_net_virtqueue {
 	/* Reference counting for outstanding ubufs.
 	 * Protected by vq mutex. Writers must also take device mutex. */
 	struct vhost_net_ubuf_ref *ubufs;
-	struct ptr_ring *rx_ring;
 	struct vhost_net_buf rxq;
 	/* Batched XDP buffs */
 	struct xdp_buff *xdp;
+	/* Interface type */
+	enum if_type type;
 };
 
 struct vhost_net {
@@ -176,24 +183,50 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
 	return ret;
 }
 
-static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq,
+				 struct sock *sk)
 {
+	struct file *file = sk->sk_socket->file;
 	struct vhost_net_buf *rxq = &nvq->rxq;
 
 	rxq->head = 0;
-	rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
-					      VHOST_NET_BATCH);
+	switch (nvq->type) {
+	case IF_TUN:
+		rxq->tail = tun_ring_consume_batched(file, rxq->queue,
+						     VHOST_NET_BATCH);
+		break;
+	case IF_TAP:
+		rxq->tail = tap_ring_consume_batched(file, rxq->queue,
+						     VHOST_NET_BATCH);
+		break;
+	case IF_NONE:
+		return 0;
+	}
 	return rxq->tail;
 }
 
-static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
+static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq,
+				    struct socket *sk)
 {
 	struct vhost_net_buf *rxq = &nvq->rxq;
-
-	if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
-		ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
-				   vhost_net_buf_get_size(rxq),
-				   tun_ptr_free);
+	struct file *file;
+
+	if (sk && !vhost_net_buf_is_empty(rxq)) {
+		file = sk->file;
+		switch (nvq->type) {
+		case IF_TUN:
+			tun_ring_unconsume(file, rxq->queue + rxq->head,
+					   vhost_net_buf_get_size(rxq),
+					   tun_ptr_free);
+			break;
+		case IF_TAP:
+			tap_ring_unconsume(file, rxq->queue + rxq->head,
+					   vhost_net_buf_get_size(rxq),
+					   tun_ptr_free);
+			break;
+		case IF_NONE:
+			return;
+		}
 		rxq->head = rxq->tail = 0;
 	}
 }
@@ -209,14 +242,15 @@ static int vhost_net_buf_peek_len(void *ptr)
 	return __skb_array_len_with_tag(ptr);
 }
 
-static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq,
+			      struct sock *sk)
 {
 	struct vhost_net_buf *rxq = &nvq->rxq;
 
 	if (!vhost_net_buf_is_empty(rxq))
 		goto out;
 
-	if (!vhost_net_buf_produce(nvq))
+	if (!vhost_net_buf_produce(nvq, sk))
 		return 0;
 
 out:
@@ -996,8 +1030,8 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 	int len = 0;
 	unsigned long flags;
 
-	if (rvq->rx_ring)
-		return vhost_net_buf_peek(rvq);
+	if (rvq->type)
+		return vhost_net_buf_peek(rvq, sk);
 
 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
@@ -1212,7 +1246,7 @@ static void handle_rx(struct vhost_net *net)
 			goto out;
 		}
 		busyloop_intr = false;
-		if (nvq->rx_ring)
+		if (nvq->type)
 			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
 		/* On overrun, truncate and discard */
 		if (unlikely(headcount > UIO_MAXIOV)) {
@@ -1368,7 +1402,6 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		n->vqs[i].batched_xdp = 0;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
-		n->vqs[i].rx_ring = NULL;
 		vhost_net_buf_init(&n->vqs[i].rxq);
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
@@ -1398,8 +1431,8 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 	sock = vhost_vq_get_backend(vq);
 	vhost_net_disable_vq(n, vq);
 	vhost_vq_set_backend(vq, NULL);
-	vhost_net_buf_unproduce(nvq);
-	nvq->rx_ring = NULL;
+	vhost_net_buf_unproduce(nvq, sock);
+	nvq->type = IF_NONE;
 	mutex_unlock(&vq->mutex);
 	return sock;
 }
@@ -1479,18 +1512,13 @@ static struct socket *get_raw_socket(int fd)
 	return ERR_PTR(r);
 }
 
-static struct ptr_ring *get_tap_ptr_ring(struct file *file)
+static enum if_type get_if_type(struct file *file)
 {
-	struct ptr_ring *ring;
-	ring = tun_get_tx_ring(file);
-	if (!IS_ERR(ring))
-		goto out;
-	ring = tap_get_ptr_ring(file);
-	if (!IS_ERR(ring))
-		goto out;
-	ring = NULL;
-out:
-	return ring;
+	if (tap_is_tap_file(file))
+		return IF_TAP;
+	if (tun_is_tun_file(file))
+		return IF_TUN;
+	return IF_NONE;
 }
 
 static struct socket *get_tap_socket(int fd)
@@ -1572,7 +1600,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 
 		vhost_net_disable_vq(n, vq);
 		vhost_vq_set_backend(vq, sock);
-		vhost_net_buf_unproduce(nvq);
+		vhost_net_buf_unproduce(nvq, sock);
 		r = vhost_vq_init_access(vq);
 		if (r)
 			goto err_used;
@@ -1581,9 +1609,9 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 			goto err_used;
 		if (index == VHOST_NET_VQ_RX) {
 			if (sock)
-				nvq->rx_ring = get_tap_ptr_ring(sock->file);
+				nvq->type = get_if_type(sock->file);
 			else
-				nvq->rx_ring = NULL;
+				nvq->type = IF_NONE;
 		}
 
 		oldubufs = nvq->ubufs;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 8/9] tun/tap: drop get ring exports
  2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (6 preceding siblings ...)
  2026-01-07 21:04 ` [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers Simon Schippers
@ 2026-01-07 21:04 ` Simon Schippers
  2026-01-07 21:04 ` [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present Simon Schippers
  8 siblings, 0 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-07 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

tun_get_tx_ring() and tap_get_ptr_ring() no longer have in-tree consumers
and can be dropped.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c      | 13 -------------
 drivers/net/tun.c      | 13 -------------
 include/linux/if_tap.h |  5 -----
 include/linux/if_tun.h |  6 ------
 4 files changed, 37 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index cf19d7181c2f..8821f26d0baa 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1230,19 +1230,6 @@ struct socket *tap_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
 
-struct ptr_ring *tap_get_ptr_ring(struct file *file)
-{
-	struct tap_queue *q;
-
-	if (file->f_op != &tap_fops)
-		return ERR_PTR(-EINVAL);
-	q = file->private_data;
-	if (!q)
-		return ERR_PTR(-EBADFD);
-	return &q->ring;
-}
-EXPORT_SYMBOL_GPL(tap_get_ptr_ring);
-
 bool tap_is_tap_file(struct file *file)
 {
 	struct tap_queue *q;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9d6f98e00661..71b6981d07d7 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3769,19 +3769,6 @@ void tun_ring_unconsume(struct file *file, void **batch, int n,
 }
 EXPORT_SYMBOL_GPL(tun_ring_unconsume);
 
-struct ptr_ring *tun_get_tx_ring(struct file *file)
-{
-	struct tun_file *tfile;
-
-	if (file->f_op != &tun_fops)
-		return ERR_PTR(-EINVAL);
-	tfile = file->private_data;
-	if (!tfile)
-		return ERR_PTR(-EBADFD);
-	return &tfile->tx_ring;
-}
-EXPORT_SYMBOL_GPL(tun_get_tx_ring);
-
 bool tun_is_tun_file(struct file *file)
 {
 	struct tun_file *tfile;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 14194342b784..0e427b979c11 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -10,7 +10,6 @@ struct socket;
 
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
-struct ptr_ring *tap_get_ptr_ring(struct file *file);
 int tap_ring_consume_batched(struct file *file, void **array, int n);
 void tap_ring_unconsume(struct file *file, void **batch, int n,
 			void (*destroy)(void *));
@@ -22,10 +21,6 @@ static inline struct socket *tap_get_socket(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
-static inline struct ptr_ring *tap_get_ptr_ring(struct file *f)
-{
-	return ERR_PTR(-EINVAL);
-}
 static inline int tap_ring_consume_batched(struct file *f,
 					   void **array, int n)
 {
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 0910c6dbac20..80b734173a80 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -21,7 +21,6 @@ struct tun_msg_ctl {
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
-struct ptr_ring *tun_get_tx_ring(struct file *file);
 int tun_ring_consume_batched(struct file *file, void **array, int n);
 void tun_ring_unconsume(struct file *file, void **batch, int n,
 			void (*destroy)(void *));
@@ -54,11 +53,6 @@ static inline struct socket *tun_get_socket(struct file *f)
 	return ERR_PTR(-EINVAL);
 }
 
-static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 static inline int tun_ring_consume_batched(struct file *file,
 					   void **array, int n)
 {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
                   ` (7 preceding siblings ...)
  2026-01-07 21:04 ` [PATCH net-next v7 8/9] tun/tap: drop get ring exports Simon Schippers
@ 2026-01-07 21:04 ` Simon Schippers
  2026-01-08  4:37   ` Jason Wang
  8 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-07 21:04 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

This commit prevents tail-drop when a qdisc is present and the ptr_ring
becomes full. Once an entry is successfully produced and the ptr_ring
reaches capacity, the netdev queue is stopped instead of dropping
subsequent packets.

If producing an entry fails anyway, tun_net_xmit() returns
NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
LLTX is enabled and the transmit path operates without the usual locking;
as a result, concurrent calls to tun_net_xmit() are not prevented.

The existing __{tun,tap}_ring_consume functions free space in the
ptr_ring and wake the netdev queue. Races between this wakeup and the
queue-stop logic could leave the queue stopped indefinitely. To prevent
this, a memory barrier is enforced (as discussed in a similar
implementation in [1]), followed by a recheck that wakes the queue if
space is already available.

If no qdisc is present, the previous tail-drop behavior is preserved.
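
The stop/re-check sequence described above can be modeled with C11 atomics in userspace. The names are illustrative stand-ins for the netdev API (the stores play the roles of netif_tx_stop_queue()/netif_tx_wake_queue()), and the fence stands in for smp_mb__after_atomic(); this is a sketch of the race-avoidance pattern, not the actual driver code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the producer-side stop-then-recheck pattern. */
struct txq_model {
	atomic_bool ring_full;     /* producer sees no free slot */
	atomic_bool queue_stopped; /* models the netdev TX queue state */
};

/* Called by the producer after it observes a full ring. */
static void stop_queue_with_recheck(struct txq_model *m)
{
	atomic_store(&m->queue_stopped, true); /* netif_tx_stop_queue() */

	/* Full barrier: the stop must be globally visible before the
	 * ring state is re-read, so a consumer that freed space in the
	 * meantime cannot have its wake-up lost and leave the queue
	 * stopped indefinitely. */
	atomic_thread_fence(memory_order_seq_cst);

	if (!atomic_load(&m->ring_full))            /* space appeared */
		atomic_store(&m->queue_stopped, false); /* netif_tx_wake_queue() */
}
```

If the ring is still full after the barrier, the queue stays stopped and the consumer's next space-creating consume performs the wake-up.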

+-------------------------+-----------+---------------+----------------+
| pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
| Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
| 10M packets             |           |               |                |
+-----------+-------------+-----------+---------------+----------------+
| TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
|           +-------------+-----------+---------------+----------------+
|           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
+-----------+-------------+-----------+---------------+----------------+
| TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
|  +        +-------------+-----------+---------------+----------------+
| vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
+-----------+-------------+-----------+---------------+----------------+

[1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 71b6981d07d7..74d7fd09e9ba 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct netdev_queue *queue;
 	struct tun_file *tfile;
 	int len = skb->len;
+	bool qdisc_present;
+	int ret;
 
 	rcu_read_lock();
 	tfile = rcu_dereference(tun->tfiles[txq]);
@@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset_ct(skb);
 
-	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+	queue = netdev_get_tx_queue(dev, txq);
+	qdisc_present = !qdisc_txq_has_no_queue(queue);
+
+	spin_lock(&tfile->tx_ring.producer_lock);
+	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
+	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
+		netif_tx_stop_queue(queue);
+		/* Avoid races with queue wake-up in
+		 * __{tun,tap}_ring_consume by waking if space is
+		 * available in a re-check.
+		 * The barrier makes sure that the stop is visible before
+		 * we re-check.
+		 */
+		smp_mb__after_atomic();
+		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
+			netif_tx_wake_queue(queue);
+	}
+	spin_unlock(&tfile->tx_ring.producer_lock);
+
+	if (ret) {
+		/* If a qdisc is attached to our virtual device,
+		 * returning NETDEV_TX_BUSY is allowed.
+		 */
+		if (qdisc_present) {
+			rcu_read_unlock();
+			return NETDEV_TX_BUSY;
+		}
 		drop_reason = SKB_DROP_REASON_FULL_RING;
 		goto drop;
 	}
 
 	/* dev->lltx requires to do our own update of trans_start */
-	queue = netdev_get_tx_queue(dev, txq);
 	txq_trans_cond_update(queue);
 
 	/* Notify and wake up reader process */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume
  2026-01-07 21:04 ` [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume Simon Schippers
@ 2026-01-08  3:23   ` Jason Wang
  2026-01-08  7:20     ` Simon Schippers
  2026-01-09  7:22   ` Michael S. Tsirkin
  1 sibling, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-08  3:23 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> This proposed function checks whether __ptr_ring_zero_tail() was invoked
> within the last n calls to __ptr_ring_consume(), which indicates that new
> free space was created. Since __ptr_ring_zero_tail() moves the tail to
> the head - and no other function modifies either the head or the tail,
> aside from the wrap-around case described below - detecting such a
> movement is sufficient to detect the invocation of
> __ptr_ring_zero_tail().
>
> The implementation detects this movement by checking whether the tail is
> at most n positions behind the head. If this condition holds, the shift
> of the tail to its current position must have occurred within the last n
> calls to __ptr_ring_consume(), indicating that __ptr_ring_zero_tail() was
> invoked and that new free space was created.
>
> This logic also correctly handles the wrap-around case in which
> __ptr_ring_zero_tail() is invoked and the head and the tail are reset
> to 0. Since this reset likewise moves the tail to the head, the same
> detection logic applies.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  include/linux/ptr_ring.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index a5a3fa4916d3..7cdae6d1d400 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -438,6 +438,19 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>         return ret;
>  }
>
> +/* Returns true if the consume of the last n elements has created space
> + * in the ring buffer (i.e., a new element can be produced).
> + *
> + * Note: Because of batching, a successful call to __ptr_ring_consume() /
> + * __ptr_ring_consume_batched() does not guarantee that the next call to
> + * __ptr_ring_produce() will succeed.

This sounds like a bug that needs to be fixed, as it requires the user
to know the implementation details. For example, even if
__ptr_ring_consume_created_space() returns true, __ptr_ring_produce()
may still fail?

Maybe revert fb9de9704775d?

> + */
> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r,
> +                                                   int n)
> +{
> +       return r->consumer_head - r->consumer_tail < n;
> +}
> +
>  /* Cast to structure type and call a function without discarding from FIFO.
>   * Function must return a value.
>   * Callers must take consumer_lock.
> --
> 2.43.0
>

Thanks
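
The detection rule quoted above (`consumer_head - consumer_tail < n`) can be modeled in plain userspace C. The ring model below is an illustrative stand-in for ptr_ring's batched consumer (head advances on every consume, tail catches up only when a batch of slots is zeroed and space is actually freed), not kernel code:

```c
#include <stdbool.h>

/* Minimal model of ptr_ring's batched consume bookkeeping. */
struct ring_model {
	int consumer_head; /* next slot to consume */
	int consumer_tail; /* first consumed slot not yet zeroed */
	int batch;         /* zero entries this many at a time */
};

/* The proposed check: the tail can only be within n slots of the head
 * if the tail moved -- i.e. space was freed for the producer -- during
 * the last n consumes. */
static bool consume_created_space(const struct ring_model *r, int n)
{
	return r->consumer_head - r->consumer_tail < n;
}

static void consume_one(struct ring_model *r)
{
	r->consumer_head++;
	if (r->consumer_head - r->consumer_tail >= r->batch)
		r->consumer_tail = r->consumer_head; /* batch zeroed: space freed */
}
```

With a batch of 4, the first three consumes leave the tail behind (no space freed, the check returns false); the fourth completes the batch, the tail catches up, and the check returns true.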


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-07 21:04 ` [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
@ 2026-01-08  3:38   ` Jason Wang
  2026-01-08  7:40     ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-08  3:38 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> and wake the corresponding netdev subqueue when consuming an entry frees
> space in the underlying ptr_ring.
>
> Stopping of the netdev queue when the ptr_ring is full will be introduced
> in an upcoming commit.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index 1197f245e873..2442cf7ac385 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>         return ret ? ret : total;
>  }
>
> +static void *tap_ring_consume(struct tap_queue *q)
> +{
> +       struct ptr_ring *ring = &q->ring;
> +       struct net_device *dev;
> +       void *ptr;
> +
> +       spin_lock(&ring->consumer_lock);
> +
> +       ptr = __ptr_ring_consume(ring);
> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> +               rcu_read_lock();
> +               dev = rcu_dereference(q->tap)->dev;
> +               netif_wake_subqueue(dev, q->queue_index);
> +               rcu_read_unlock();
> +       }
> +
> +       spin_unlock(&ring->consumer_lock);
> +
> +       return ptr;
> +}
> +
>  static ssize_t tap_do_read(struct tap_queue *q,
>                            struct iov_iter *to,
>                            int noblock, struct sk_buff *skb)
> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>                                         TASK_INTERRUPTIBLE);
>
>                 /* Read frames from the queue */
> -               skb = ptr_ring_consume(&q->ring);
> +               skb = tap_ring_consume(q);
>                 if (skb)
>                         break;
>                 if (noblock) {
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 8192740357a0..7148f9a844a4 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>         return total;
>  }
>
> +static void *tun_ring_consume(struct tun_file *tfile)
> +{
> +       struct ptr_ring *ring = &tfile->tx_ring;
> +       struct net_device *dev;
> +       void *ptr;
> +
> +       spin_lock(&ring->consumer_lock);
> +
> +       ptr = __ptr_ring_consume(ring);
> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {

I guess it's the "bug" I mentioned in the previous patch that leads to
the check of __ptr_ring_consume_created_space() here. If so, that's
another reason to tweak the current API.

> +               rcu_read_lock();
> +               dev = rcu_dereference(tfile->tun)->dev;
> +               netif_wake_subqueue(dev, tfile->queue_index);

This would cause the producer TX_SOFTIRQ to run on the same CPU, which
I'm not sure is what we want.

> +               rcu_read_unlock();
> +       }

Btw, this function duplicates a lot of the logic of tap_ring_consume();
we should consider merging the two.

> +
> +       spin_unlock(&ring->consumer_lock);
> +
> +       return ptr;
> +}
> +
>  static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>  {
>         DECLARE_WAITQUEUE(wait, current);
>         void *ptr = NULL;
>         int error = 0;
>
> -       ptr = ptr_ring_consume(&tfile->tx_ring);
> +       ptr = tun_ring_consume(tfile);

I'm not sure having a separate patch like this helps. For example,
it will introduce a performance regression.

>         if (ptr)
>                 goto out;
>         if (noblock) {
> @@ -2131,7 +2152,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>
>         while (1) {
>                 set_current_state(TASK_INTERRUPTIBLE);
> -               ptr = ptr_ring_consume(&tfile->tx_ring);
> +               ptr = tun_ring_consume(tfile);
>                 if (ptr)
>                         break;
>                 if (signal_pending(current)) {
> --
> 2.43.0
>

Thanks


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 5/9] tun/tap: add unconsume function for returning entries to ptr_ring
  2026-01-07 21:04 ` [PATCH net-next v7 5/9] tun/tap: add unconsume function for returning entries to ptr_ring Simon Schippers
@ 2026-01-08  3:40   ` Jason Wang
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Wang @ 2026-01-08  3:40 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> Add {tun,tap}_ring_unconsume() wrappers to allow external modules
> (e.g. vhost-net) to return previously consumed entries back to the
> ptr_ring.

It would be better to explain why we need such a return.

> The functions delegate to ptr_ring_unconsume() and take a
> destroy callback for entries that cannot be returned to the ring.
>

Thanks


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-07 21:04 ` [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present Simon Schippers
@ 2026-01-08  4:37   ` Jason Wang
  2026-01-08  8:01     ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-08  4:37 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> becomes full. Once an entry is successfully produced and the ptr_ring
> reaches capacity, the netdev queue is stopped instead of dropping
> subsequent packets.
>
> If producing an entry fails anyways, the tun_net_xmit returns
> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> LLTX is enabled and the transmit path operates without the usual locking.
> As a result, concurrent calls to tun_net_xmit() are not prevented.
>
> The existing __{tun,tap}_ring_consume functions free space in the
> ptr_ring and wake the netdev queue. Races between this wakeup and the
> queue-stop logic could leave the queue stopped indefinitely. To prevent
> this, a memory barrier is enforced (as discussed in a similar
> implementation in [1]), followed by a recheck that wakes the queue if
> space is already available.
>
> If no qdisc is present, the previous tail-drop behavior is preserved.
>
> +-------------------------+-----------+---------------+----------------+
> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> | 10M packets             |           |               |                |
> +-----------+-------------+-----------+---------------+----------------+
> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> |           +-------------+-----------+---------------+----------------+
> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> +-----------+-------------+-----------+---------------+----------------+
> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> |  +        +-------------+-----------+---------------+----------------+
> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> +-----------+-------------+-----------+---------------+----------------+
>
> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 71b6981d07d7..74d7fd09e9ba 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>         struct netdev_queue *queue;
>         struct tun_file *tfile;
>         int len = skb->len;
> +       bool qdisc_present;
> +       int ret;
>
>         rcu_read_lock();
>         tfile = rcu_dereference(tun->tfiles[txq]);
> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>
>         nf_reset_ct(skb);
>
> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> +       queue = netdev_get_tx_queue(dev, txq);
> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> +
> +       spin_lock(&tfile->tx_ring.producer_lock);
> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> +               netif_tx_stop_queue(queue);
> +               /* Avoid races with queue wake-up in
> +                * __{tun,tap}_ring_consume by waking if space is
> +                * available in a re-check.
> +                * The barrier makes sure that the stop is visible before
> +                * we re-check.
> +                */
> +               smp_mb__after_atomic();
> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> +                       netif_tx_wake_queue(queue);

I'm not sure I will get here, but I think those should be moved to the
following if (ret) check. If __ptr_ring_produce() succeeds, there's no
need to bother with the queue stop/wake logic?

> +       }
> +       spin_unlock(&tfile->tx_ring.producer_lock);
> +
> +       if (ret) {
> +               /* If a qdisc is attached to our virtual device,
> +                * returning NETDEV_TX_BUSY is allowed.
> +                */
> +               if (qdisc_present) {
> +                       rcu_read_unlock();
> +                       return NETDEV_TX_BUSY;
> +               }
>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>                 goto drop;
>         }
>
>         /* dev->lltx requires to do our own update of trans_start */
> -       queue = netdev_get_tx_queue(dev, txq);
>         txq_trans_cond_update(queue);
>
>         /* Notify and wake up reader process */
> --
> 2.43.0
>

Thanks


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers
  2026-01-07 21:04 ` [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers Simon Schippers
@ 2026-01-08  4:38   ` Jason Wang
  2026-01-08  7:47     ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-08  4:38 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> Replace the direct use of ptr_ring in the vhost-net virtqueue with
> tun/tap ring wrapper helpers. Instead of storing an rx_ring pointer,
> the virtqueue now stores the interface type (IF_TUN, IF_TAP, or IF_NONE)
> and dispatches to the corresponding tun/tap helpers for ring
> produce, consume, and unconsume operations.
>
> Routing ring operations through the tun/tap helpers enables netdev
> queue wakeups, which are required for upcoming netdev queue flow
> control support shared by tun/tap and vhost-net.
>
> No functional change is intended beyond switching to the wrapper
> helpers.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Co-developed by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  drivers/vhost/net.c | 92 +++++++++++++++++++++++++++++----------------
>  1 file changed, 60 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 7f886d3dba7d..215556f7cd40 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -90,6 +90,12 @@ enum {
>         VHOST_NET_VQ_MAX = 2,
>  };
>
> +enum if_type {
> +       IF_NONE = 0,
> +       IF_TUN = 1,
> +       IF_TAP = 2,
> +};

This looks inelegant. Can we simply export the objects we want to use to
vhost, like get_tap_socket()?

Thanks



* [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume
  2026-01-08  3:23   ` Jason Wang
@ 2026-01-08  7:20     ` Simon Schippers
  2026-01-09  6:01       ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-08  7:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/8/26 04:23, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> This proposed function checks whether __ptr_ring_zero_tail() was invoked
>> within the last n calls to __ptr_ring_consume(), which indicates that new
>> free space was created. Since __ptr_ring_zero_tail() moves the tail to
>> the head - and no other function modifies either the head or the tail,
>> aside from the wrap-around case described below - detecting such a
>> movement is sufficient to detect the invocation of
>> __ptr_ring_zero_tail().
>>
>> The implementation detects this movement by checking whether the tail is
>> at most n positions behind the head. If this condition holds, the shift
>> of the tail to its current position must have occurred within the last n
>> calls to __ptr_ring_consume(), indicating that __ptr_ring_zero_tail() was
>> invoked and that new free space was created.
>>
>> This logic also correctly handles the wrap-around case in which
>> __ptr_ring_zero_tail() is invoked and the head and the tail are reset
>> to 0. Since this reset likewise moves the tail to the head, the same
>> detection logic applies.
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>>  include/linux/ptr_ring.h | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>> index a5a3fa4916d3..7cdae6d1d400 100644
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -438,6 +438,19 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>>         return ret;
>>  }
>>
>> +/* Returns true if the consume of the last n elements has created space
>> + * in the ring buffer (i.e., a new element can be produced).
>> + *
>> + * Note: Because of batching, a successful call to __ptr_ring_consume() /
>> + * __ptr_ring_consume_batched() does not guarantee that the next call to
>> + * __ptr_ring_produce() will succeed.
> 
> This sounds like a bug that needs to be fixed, as it requires the user
> to know the implementation details. For example, even if
> __ptr_ring_consume_created_space() returns true, __ptr_ring_produce()
> may still fail?

No, it should not fail in that case.
If you only call consume and after that try to produce, *then* it is
likely to fail because __ptr_ring_zero_tail() is only invoked once per
batch.

> 
> Maybe revert fb9de9704775d?

I disagree, as I consider this to be one of the key features of ptr_ring.

That said, there are several other implementation details that users need
to be aware of.

For example, __ptr_ring_empty() must only be called by the consumer. This
was for example the issue in dc82a33297fc ("veth: apply qdisc
backpressure on full ptr_ring to reduce TX drops") and the reason why
5442a9da6978 ("veth: more robust handing of race to avoid txq getting
stuck") exists.

> 
>> + */
>> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r,
>> +                                                   int n)
>> +{
>> +       return r->consumer_head - r->consumer_tail < n;
>> +}
>> +
>>  /* Cast to structure type and call a function without discarding from FIFO.
>>   * Function must return a value.
>>   * Callers must take consumer_lock.
>> --
>> 2.43.0
>>
> 
> Thanks
> 


* [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-08  3:38   ` Jason Wang
@ 2026-01-08  7:40     ` Simon Schippers
  2026-01-09  6:02       ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-08  7:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/8/26 04:38, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>> and wake the corresponding netdev subqueue when consuming an entry frees
>> space in the underlying ptr_ring.
>>
>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>> in an upcoming commit.
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>> index 1197f245e873..2442cf7ac385 100644
>> --- a/drivers/net/tap.c
>> +++ b/drivers/net/tap.c
>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>         return ret ? ret : total;
>>  }
>>
>> +static void *tap_ring_consume(struct tap_queue *q)
>> +{
>> +       struct ptr_ring *ring = &q->ring;
>> +       struct net_device *dev;
>> +       void *ptr;
>> +
>> +       spin_lock(&ring->consumer_lock);
>> +
>> +       ptr = __ptr_ring_consume(ring);
>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>> +               rcu_read_lock();
>> +               dev = rcu_dereference(q->tap)->dev;
>> +               netif_wake_subqueue(dev, q->queue_index);
>> +               rcu_read_unlock();
>> +       }
>> +
>> +       spin_unlock(&ring->consumer_lock);
>> +
>> +       return ptr;
>> +}
>> +
>>  static ssize_t tap_do_read(struct tap_queue *q,
>>                            struct iov_iter *to,
>>                            int noblock, struct sk_buff *skb)
>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>                                         TASK_INTERRUPTIBLE);
>>
>>                 /* Read frames from the queue */
>> -               skb = ptr_ring_consume(&q->ring);
>> +               skb = tap_ring_consume(q);
>>                 if (skb)
>>                         break;
>>                 if (noblock) {
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 8192740357a0..7148f9a844a4 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>         return total;
>>  }
>>
>> +static void *tun_ring_consume(struct tun_file *tfile)
>> +{
>> +       struct ptr_ring *ring = &tfile->tx_ring;
>> +       struct net_device *dev;
>> +       void *ptr;
>> +
>> +       spin_lock(&ring->consumer_lock);
>> +
>> +       ptr = __ptr_ring_consume(ring);
>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> 
> I guess it's the "bug" I mentioned in the previous patch that leads to
> the check of __ptr_ring_consume_created_space() here. If it's true,
> another call to tweak the current API.
> 
>> +               rcu_read_lock();
>> +               dev = rcu_dereference(tfile->tun)->dev;
>> +               netif_wake_subqueue(dev, tfile->queue_index);
> 
> This would cause the producer TX_SOFTIRQ to run on the same cpu which
> I'm not sure is what we want.

What else would you suggest calling to wake the queue?

> 
>> +               rcu_read_unlock();
>> +       }
> 
> Btw, this function duplicates a lot of logic of tap_ring_consume() we
> should consider to merge the logic.

Yes, it is largely the same approach, but it would require accessing the
net_device each time.

> 
>> +
>> +       spin_unlock(&ring->consumer_lock);
>> +
>> +       return ptr;
>> +}
>> +
>>  static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>>  {
>>         DECLARE_WAITQUEUE(wait, current);
>>         void *ptr = NULL;
>>         int error = 0;
>>
>> -       ptr = ptr_ring_consume(&tfile->tx_ring);
>> +       ptr = tun_ring_consume(tfile);
> 
> I'm not sure having a separate patch like this may help. For example,
> it will introduce performance regression.

I ran benchmarks for the whole patch set with noqueue (where the queue is
not stopped, preserving the old behavior), as described in the cover
letter, and observed no performance regression. This leads me to conclude
that this patch has no performance impact when the queue is not stopped.

> 
>>         if (ptr)
>>                 goto out;
>>         if (noblock) {
>> @@ -2131,7 +2152,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>>
>>         while (1) {
>>                 set_current_state(TASK_INTERRUPTIBLE);
>> -               ptr = ptr_ring_consume(&tfile->tx_ring);
>> +               ptr = tun_ring_consume(tfile);
>>                 if (ptr)
>>                         break;
>>                 if (signal_pending(current)) {
>> --
>> 2.43.0
>>
> 
> Thanks
> 


* [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers
  2026-01-08  4:38   ` Jason Wang
@ 2026-01-08  7:47     ` Simon Schippers
  2026-01-09  6:04       ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-08  7:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/8/26 05:38, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> Replace the direct use of ptr_ring in the vhost-net virtqueue with
>> tun/tap ring wrapper helpers. Instead of storing an rx_ring pointer,
>> the virtqueue now stores the interface type (IF_TUN, IF_TAP, or IF_NONE)
>> and dispatches to the corresponding tun/tap helpers for ring
>> produce, consume, and unconsume operations.
>>
>> Routing ring operations through the tun/tap helpers enables netdev
>> queue wakeups, which are required for upcoming netdev queue flow
>> control support shared by tun/tap and vhost-net.
>>
>> No functional change is intended beyond switching to the wrapper
>> helpers.
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Co-developed by: Jon Kohler <jon@nutanix.com>
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>>  drivers/vhost/net.c | 92 +++++++++++++++++++++++++++++----------------
>>  1 file changed, 60 insertions(+), 32 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 7f886d3dba7d..215556f7cd40 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -90,6 +90,12 @@ enum {
>>         VHOST_NET_VQ_MAX = 2,
>>  };
>>
>> +enum if_type {
>> +       IF_NONE = 0,
>> +       IF_TUN = 1,
>> +       IF_TAP = 2,
>> +};
> 
> This looks not elegant, can we simply export objects we want to use to
> vhost like get_tap_socket()?

No, we cannot do that. We would need access to both the ptr_ring and the
net_device. However, the net_device is protected by an RCU lock.

That is why {tun,tap}_ring_consume_batched() are used:
they take the appropriate locks and handle waking the queue.

> 
> Thanks
> 


* [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-08  4:37   ` Jason Wang
@ 2026-01-08  8:01     ` Simon Schippers
  2026-01-09  6:09       ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-08  8:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/8/26 05:37, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>> becomes full. Once an entry is successfully produced and the ptr_ring
>> reaches capacity, the netdev queue is stopped instead of dropping
>> subsequent packets.
>>
>> If producing an entry fails anyways, the tun_net_xmit returns
>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
>> LLTX is enabled and the transmit path operates without the usual locking.
>> As a result, concurrent calls to tun_net_xmit() are not prevented.
>>
>> The existing __{tun,tap}_ring_consume functions free space in the
>> ptr_ring and wake the netdev queue. Races between this wakeup and the
>> queue-stop logic could leave the queue stopped indefinitely. To prevent
>> this, a memory barrier is enforced (as discussed in a similar
>> implementation in [1]), followed by a recheck that wakes the queue if
>> space is already available.
>>
>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>
>> +-------------------------+-----------+---------------+----------------+
>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
>> | 10M packets             |           |               |                |
>> +-----------+-------------+-----------+---------------+----------------+
>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
>> |           +-------------+-----------+---------------+----------------+
>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
>> +-----------+-------------+-----------+---------------+----------------+
>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
>> |  +        +-------------+-----------+---------------+----------------+
>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
>> +-----------+-------------+-----------+---------------+----------------+
>>
>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 71b6981d07d7..74d7fd09e9ba 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>         struct netdev_queue *queue;
>>         struct tun_file *tfile;
>>         int len = skb->len;
>> +       bool qdisc_present;
>> +       int ret;
>>
>>         rcu_read_lock();
>>         tfile = rcu_dereference(tun->tfiles[txq]);
>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>
>>         nf_reset_ct(skb);
>>
>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>> +       queue = netdev_get_tx_queue(dev, txq);
>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
>> +
>> +       spin_lock(&tfile->tx_ring.producer_lock);
>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>> +               netif_tx_stop_queue(queue);
>> +               /* Avoid races with queue wake-up in
>> +                * __{tun,tap}_ring_consume by waking if space is
>> +                * available in a re-check.
>> +                * The barrier makes sure that the stop is visible before
>> +                * we re-check.
>> +                */
>> +               smp_mb__after_atomic();
>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>> +                       netif_tx_wake_queue(queue);
> 
> I'm not sure I will get here, but I think those should be moved to the
> following if(ret) check. If __ptr_ring_produce() succeed, there's no
> need to bother with those queue stop/wake logic?

There is a need for that: if __ptr_ring_produce_peek() returns -ENOSPC,
we stop the queue proactively.

I believe what you are aiming for is to always stop the queue if (ret),
which I can agree with. In that case, I would simply change the condition
to:

if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))

> 
>> +       }
>> +       spin_unlock(&tfile->tx_ring.producer_lock);
>> +
>> +       if (ret) {
>> +               /* If a qdisc is attached to our virtual device,
>> +                * returning NETDEV_TX_BUSY is allowed.
>> +                */
>> +               if (qdisc_present) {
>> +                       rcu_read_unlock();
>> +                       return NETDEV_TX_BUSY;
>> +               }
>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>                 goto drop;
>>         }
>>
>>         /* dev->lltx requires to do our own update of trans_start */
>> -       queue = netdev_get_tx_queue(dev, txq);
>>         txq_trans_cond_update(queue);
>>
>>         /* Notify and wake up reader process */
>> --
>> 2.43.0
>>
> 
> Thanks
> 


* Re: [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume
  2026-01-08  7:20     ` Simon Schippers
@ 2026-01-09  6:01       ` Jason Wang
  2026-01-09  6:47         ` Michael S. Tsirkin
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-09  6:01 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 8, 2026 at 3:21 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/8/26 04:23, Jason Wang wrote:
> > On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> This proposed function checks whether __ptr_ring_zero_tail() was invoked
> >> within the last n calls to __ptr_ring_consume(), which indicates that new
> >> free space was created. Since __ptr_ring_zero_tail() moves the tail to
> >> the head - and no other function modifies either the head or the tail,
> >> aside from the wrap-around case described below - detecting such a
> >> movement is sufficient to detect the invocation of
> >> __ptr_ring_zero_tail().
> >>
> >> The implementation detects this movement by checking whether the tail is
> >> at most n positions behind the head. If this condition holds, the shift
> >> of the tail to its current position must have occurred within the last n
> >> calls to __ptr_ring_consume(), indicating that __ptr_ring_zero_tail() was
> >> invoked and that new free space was created.
> >>
> >> This logic also correctly handles the wrap-around case in which
> >> __ptr_ring_zero_tail() is invoked and the head and the tail are reset
> >> to 0. Since this reset likewise moves the tail to the head, the same
> >> detection logic applies.
> >>
> >> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >> ---
> >>  include/linux/ptr_ring.h | 13 +++++++++++++
> >>  1 file changed, 13 insertions(+)
> >>
> >> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> >> index a5a3fa4916d3..7cdae6d1d400 100644
> >> --- a/include/linux/ptr_ring.h
> >> +++ b/include/linux/ptr_ring.h
> >> @@ -438,6 +438,19 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
> >>         return ret;
> >>  }
> >>
> >> +/* Returns true if the consume of the last n elements has created space
> >> + * in the ring buffer (i.e., a new element can be produced).
> >> + *
> >> + * Note: Because of batching, a successful call to __ptr_ring_consume() /
> >> + * __ptr_ring_consume_batched() does not guarantee that the next call to
> >> + * __ptr_ring_produce() will succeed.
> >
> > This sounds like a bug that needs to be fixed, as it requires the user
> > to know the implementation details. For example, even if
> > __ptr_ring_consume_created_space() returns true, __ptr_ring_produce()
> > may still fail?
>
> No, it should not fail in that case.
> If you only call consume and after that try to produce, *then* it is
> likely to fail because __ptr_ring_zero_tail() is only invoked once per
> batch.

Well, this makes the helper very hard to use correctly.

So I think at least the documentation should specify the meaning of
'n' here. For example, is it the value returned by
ptr_ring_consume_batched(), and is it required to be called
immediately after ptr_ring_consume_batched()? If so, the API is
tricky to use, and we should consider merging the two helpers into a
single new helper to make this easier for the user.

What's more, there would be false positives. Consider a ring with only
a few entries: just after the first zeroing,
__ptr_ring_consume_created_space() will return true, which will lead to
unnecessary wakeups.

And last, the function will always return true if n is greater than the
batch size.

>
> >
> > Maybe revert fb9de9704775d?
>
> I disagree, as I consider this to be one of the key features of ptr_ring.

Nope, it's just an optimization, and it actually changes behaviour in
a way that might be noticed by the user.

Before the patch, ptr_ring_produce() was guaranteed to succeed after a
ptr_ring_consume(). After that patch, it's not. We haven't seen
complaints because the implication is not obvious (e.g. more packet
drops).

>
> That said, there are several other implementation details that users need
> to be aware of.
>
> For example, __ptr_ring_empty() must only be called by the consumer. This
> was for example the issue in dc82a33297fc ("veth: apply qdisc
> backpressure on full ptr_ring to reduce TX drops") and the reason why
> 5442a9da6978 ("veth: more robust handing of race to avoid txq getting
> stuck") exists.

At least the behaviour is documented. This is not the case for the
implications of fb9de9704775d.

Thanks


>
> >
> >> + */
> >> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r,
> >> +                                                   int n)
> >> +{
> >> +       return r->consumer_head - r->consumer_tail < n;
> >> +}
> >> +
> >>  /* Cast to structure type and call a function without discarding from FIFO.
> >>   * Function must return a value.
> >>   * Callers must take consumer_lock.
> >> --
> >> 2.43.0
> >>
> >
> > Thanks
> >
>



* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-08  7:40     ` Simon Schippers
@ 2026-01-09  6:02       ` Jason Wang
  2026-01-09  9:31         ` Simon Schippers
  2026-01-21  9:32         ` Simon Schippers
  0 siblings, 2 replies; 69+ messages in thread
From: Jason Wang @ 2026-01-09  6:02 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/8/26 04:38, Jason Wang wrote:
> > On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> >> and wake the corresponding netdev subqueue when consuming an entry frees
> >> space in the underlying ptr_ring.
> >>
> >> Stopping of the netdev queue when the ptr_ring is full will be introduced
> >> in an upcoming commit.
> >>
> >> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >> ---
> >>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
> >>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
> >>  2 files changed, 45 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >> index 1197f245e873..2442cf7ac385 100644
> >> --- a/drivers/net/tap.c
> >> +++ b/drivers/net/tap.c
> >> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>         return ret ? ret : total;
> >>  }
> >>
> >> +static void *tap_ring_consume(struct tap_queue *q)
> >> +{
> >> +       struct ptr_ring *ring = &q->ring;
> >> +       struct net_device *dev;
> >> +       void *ptr;
> >> +
> >> +       spin_lock(&ring->consumer_lock);
> >> +
> >> +       ptr = __ptr_ring_consume(ring);
> >> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >> +               rcu_read_lock();
> >> +               dev = rcu_dereference(q->tap)->dev;
> >> +               netif_wake_subqueue(dev, q->queue_index);
> >> +               rcu_read_unlock();
> >> +       }
> >> +
> >> +       spin_unlock(&ring->consumer_lock);
> >> +
> >> +       return ptr;
> >> +}
> >> +
> >>  static ssize_t tap_do_read(struct tap_queue *q,
> >>                            struct iov_iter *to,
> >>                            int noblock, struct sk_buff *skb)
> >> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
> >>                                         TASK_INTERRUPTIBLE);
> >>
> >>                 /* Read frames from the queue */
> >> -               skb = ptr_ring_consume(&q->ring);
> >> +               skb = tap_ring_consume(q);
> >>                 if (skb)
> >>                         break;
> >>                 if (noblock) {
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index 8192740357a0..7148f9a844a4 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>         return total;
> >>  }
> >>
> >> +static void *tun_ring_consume(struct tun_file *tfile)
> >> +{
> >> +       struct ptr_ring *ring = &tfile->tx_ring;
> >> +       struct net_device *dev;
> >> +       void *ptr;
> >> +
> >> +       spin_lock(&ring->consumer_lock);
> >> +
> >> +       ptr = __ptr_ring_consume(ring);
> >> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >
> > I guess it's the "bug" I mentioned in the previous patch that leads to
> > the check of __ptr_ring_consume_created_space() here. If it's true,
> > another call to tweak the current API.
> >
> >> +               rcu_read_lock();
> >> +               dev = rcu_dereference(tfile->tun)->dev;
> >> +               netif_wake_subqueue(dev, tfile->queue_index);
> >
> > This would cause the producer TX_SOFTIRQ to run on the same cpu which
> > I'm not sure is what we want.
>
> What else would you suggest calling to wake the queue?

I don't have a good method in mind; I just want to point out the implications.

>
> >
> >> +               rcu_read_unlock();
> >> +       }
> >
> > Btw, this function duplicates a lot of logic of tap_ring_consume() we
> > should consider to merge the logic.
>
> Yes, it is largely the same approach, but it would require accessing the
> net_device each time.

The problem is that, at least for TUN, the socket is loosely coupled
with the netdev: the netdev can go away while the socket still exists.
That's why vhost only talks to the socket, not the netdev. If we really
want to go this way, we should at least check for the existence of
tun->dev first.

>
> >
> >> +
> >> +       spin_unlock(&ring->consumer_lock);
> >> +
> >> +       return ptr;
> >> +}
> >> +
> >>  static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> >>  {
> >>         DECLARE_WAITQUEUE(wait, current);
> >>         void *ptr = NULL;
> >>         int error = 0;
> >>
> >> -       ptr = ptr_ring_consume(&tfile->tx_ring);
> >> +       ptr = tun_ring_consume(tfile);
> >
> > I'm not sure having a separate patch like this may help. For example,
> > it will introduce performance regression.
>
> I ran benchmarks for the whole patch set with noqueue (where the queue is
> not stopped to preserve the old behavior), as described in the cover
> letter, and observed no performance regression. This leads me to conclude
> that there is no performance impact because of this patch when the queue
> is not stopped.

Have you run a benchmark per patch? Or the regression might just not be
obvious. At the least, this patch introduces more atomic operations; or
the impact might be hidden because TUN doesn't support bursting, so
pktgen can't reach its best PPS.

Thanks


>
> >
> >>         if (ptr)
> >>                 goto out;
> >>         if (noblock) {
> >> @@ -2131,7 +2152,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> >>
> >>         while (1) {
> >>                 set_current_state(TASK_INTERRUPTIBLE);
> >> -               ptr = ptr_ring_consume(&tfile->tx_ring);
> >> +               ptr = tun_ring_consume(tfile);
> >>                 if (ptr)
> >>                         break;
> >>                 if (signal_pending(current)) {
> >> --
> >> 2.43.0
> >>
> >
> > Thanks
> >
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers
  2026-01-08  7:47     ` Simon Schippers
@ 2026-01-09  6:04       ` Jason Wang
  2026-01-09  9:57         ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-09  6:04 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 8, 2026 at 3:48 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/8/26 05:38, Jason Wang wrote:
> > On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> Replace the direct use of ptr_ring in the vhost-net virtqueue with
> >> tun/tap ring wrapper helpers. Instead of storing an rx_ring pointer,
> >> the virtqueue now stores the interface type (IF_TUN, IF_TAP, or IF_NONE)
> >> and dispatches to the corresponding tun/tap helpers for ring
> >> produce, consume, and unconsume operations.
> >>
> >> Routing ring operations through the tun/tap helpers enables netdev
> >> queue wakeups, which are required for upcoming netdev queue flow
> >> control support shared by tun/tap and vhost-net.
> >>
> >> No functional change is intended beyond switching to the wrapper
> >> helpers.
> >>
> >> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Co-developed-by: Jon Kohler <jon@nutanix.com>
> >> Signed-off-by: Jon Kohler <jon@nutanix.com>
> >> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >> ---
> >>  drivers/vhost/net.c | 92 +++++++++++++++++++++++++++++----------------
> >>  1 file changed, 60 insertions(+), 32 deletions(-)
> >>
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> index 7f886d3dba7d..215556f7cd40 100644
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -90,6 +90,12 @@ enum {
> >>         VHOST_NET_VQ_MAX = 2,
> >>  };
> >>
> >> +enum if_type {
> >> +       IF_NONE = 0,
> >> +       IF_TUN = 1,
> >> +       IF_TAP = 2,
> >> +};
> >
> > This looks not elegant, can we simply export objects we want to use to
> > vhost like get_tap_socket()?
>
> No, we cannot do that. We would need access to both the ptr_ring and the
> net_device. However, the net_device is protected by an RCU lock.
>
> That is why {tun,tap}_ring_consume_batched() are used:
> they take the appropriate locks and handle waking the queue.

How about introducing a callback in the ptr_ring itself, so vhost_net
only needs to know about the ptr_ring?

Thanks

>
> >
> > Thanks
> >
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-08  8:01     ` Simon Schippers
@ 2026-01-09  6:09       ` Jason Wang
  2026-01-09 10:14         ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-09  6:09 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/8/26 05:37, Jason Wang wrote:
> > On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >> becomes full. Once an entry is successfully produced and the ptr_ring
> >> reaches capacity, the netdev queue is stopped instead of dropping
> >> subsequent packets.
> >>
> >> If producing an entry fails anyway, tun_net_xmit() returns
> >> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> >> LLTX is enabled and the transmit path operates without the usual locking.
> >> As a result, concurrent calls to tun_net_xmit() are not prevented.
> >>
> >> The existing __{tun,tap}_ring_consume functions free space in the
> >> ptr_ring and wake the netdev queue. Races between this wakeup and the
> >> queue-stop logic could leave the queue stopped indefinitely. To prevent
> >> this, a memory barrier is enforced (as discussed in a similar
> >> implementation in [1]), followed by a recheck that wakes the queue if
> >> space is already available.
> >>
> >> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>
> >> +-------------------------+-----------+---------------+----------------+
> >> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> >> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> >> | 10M packets             |           |               |                |
> >> +-----------+-------------+-----------+---------------+----------------+
> >> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> >> |           +-------------+-----------+---------------+----------------+
> >> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> >> +-----------+-------------+-----------+---------------+----------------+
> >> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> >> |  +        +-------------+-----------+---------------+----------------+
> >> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> >> +-----------+-------------+-----------+---------------+----------------+
> >>
> >> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>
> >> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >> ---
> >>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
> >>  1 file changed, 29 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index 71b6981d07d7..74d7fd09e9ba 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>         struct netdev_queue *queue;
> >>         struct tun_file *tfile;
> >>         int len = skb->len;
> >> +       bool qdisc_present;
> >> +       int ret;
> >>
> >>         rcu_read_lock();
> >>         tfile = rcu_dereference(tun->tfiles[txq]);
> >> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>
> >>         nf_reset_ct(skb);
> >>
> >> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >> +       queue = netdev_get_tx_queue(dev, txq);
> >> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >> +
> >> +       spin_lock(&tfile->tx_ring.producer_lock);
> >> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >> +               netif_tx_stop_queue(queue);
> >> +               /* Avoid races with queue wake-up in
> >> +                * __{tun,tap}_ring_consume by waking if space is
> >> +                * available in a re-check.
> >> +                * The barrier makes sure that the stop is visible before
> >> +                * we re-check.
> >> +                */
> >> +               smp_mb__after_atomic();
> >> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >> +                       netif_tx_wake_queue(queue);
> >
> > I'm not sure I will get here, but I think those should be moved to the
> > following if(ret) check. If __ptr_ring_produce() succeed, there's no
> > need to bother with those queue stop/wake logic?
>
> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
> we stop the queue proactively.

This seems to conflict with the following NETDEV_TX_BUSY. Or is
NETDEV_TX_BUSY prepared for the xdp_xmit?

>
> I believe what you are aiming for is to always stop the queue if(ret),
> which I can agree with. In that case, I would simply change the condition
> to:
>
> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
>
> >
> >> +       }
> >> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >> +
> >> +       if (ret) {
> >> +               /* If a qdisc is attached to our virtual device,
> >> +                * returning NETDEV_TX_BUSY is allowed.
> >> +                */
> >> +               if (qdisc_present) {
> >> +                       rcu_read_unlock();
> >> +                       return NETDEV_TX_BUSY;
> >> +               }
> >>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>                 goto drop;
> >>         }
> >>
> >>         /* dev->lltx requires to do our own update of trans_start */
> >> -       queue = netdev_get_tx_queue(dev, txq);
> >>         txq_trans_cond_update(queue);
> >>
> >>         /* Notify and wake up reader process */
> >> --
> >> 2.43.0
> >>
> >
> > Thanks
> >
>

Thanks


^ permalink raw reply	[flat|nested] 69+ messages in thread
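
As an aside for readers following this subthread: the stop-then-recheck
producer pattern from the hunk above can be modeled in userspace C. This is a
hypothetical single-threaded sketch, so the barrier is a no-op and ring_full is
a plain flag standing in for __ptr_ring_produce_peek(); it is not the driver
code, only an illustration of the control flow:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the producer-side queue state. */
struct txq_model {
	bool stopped;
	bool ring_full;   /* stand-in for __ptr_ring_produce_peek() != 0 */
};

/* No-op here: the model is single-threaded. In the patch this orders the
 * queue stop before the re-check so a concurrent consumer wake-up that
 * races with the stop cannot be lost. */
#define smp_mb__after_atomic() do { } while (0)

static void produce_side_stop(struct txq_model *q, bool qdisc_present)
{
	if (q->ring_full && qdisc_present) {
		q->stopped = true;
		smp_mb__after_atomic();
		/* Re-check: if the consumer freed space between the peek
		 * and the stop, wake the queue again immediately. */
		if (!q->ring_full)
			q->stopped = false;
	}
}
```

Without the re-check, a consumer that drains the ring between the peek and
netif_tx_stop_queue() would leave the queue stopped with nobody left to wake it.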

* Re: [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume
  2026-01-09  6:01       ` Jason Wang
@ 2026-01-09  6:47         ` Michael S. Tsirkin
  0 siblings, 0 replies; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-01-09  6:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: Simon Schippers, willemdebruijn.kernel, andrew+netdev, davem,
	edumazet, kuba, pabeni, eperezma, leiyang, stephen, jon,
	tim.gebauer, netdev, linux-kernel, kvm, virtualization

On Fri, Jan 09, 2026 at 02:01:54PM +0800, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 3:21 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
> >
> > On 1/8/26 04:23, Jason Wang wrote:
> > > On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> > > <simon.schippers@tu-dortmund.de> wrote:
> > >>
> > >> This proposed function checks whether __ptr_ring_zero_tail() was invoked
> > >> within the last n calls to __ptr_ring_consume(), which indicates that new
> > >> free space was created. Since __ptr_ring_zero_tail() moves the tail to
> > >> the head - and no other function modifies either the head or the tail,
> > >> aside from the wrap-around case described below - detecting such a
> > >> movement is sufficient to detect the invocation of
> > >> __ptr_ring_zero_tail().
> > >>
> > >> The implementation detects this movement by checking whether the tail is
> > >> at most n positions behind the head. If this condition holds, the shift
> > >> of the tail to its current position must have occurred within the last n
> > >> calls to __ptr_ring_consume(), indicating that __ptr_ring_zero_tail() was
> > >> invoked and that new free space was created.
> > >>
> > >> This logic also correctly handles the wrap-around case in which
> > >> __ptr_ring_zero_tail() is invoked and the head and the tail are reset
> > >> to 0. Since this reset likewise moves the tail to the head, the same
> > >> detection logic applies.
> > >>
> > >> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > >> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > >> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > >> ---
> > >>  include/linux/ptr_ring.h | 13 +++++++++++++
> > >>  1 file changed, 13 insertions(+)
> > >>
> > >> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > >> index a5a3fa4916d3..7cdae6d1d400 100644
> > >> --- a/include/linux/ptr_ring.h
> > >> +++ b/include/linux/ptr_ring.h
> > >> @@ -438,6 +438,19 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
> > >>         return ret;
> > >>  }
> > >>
> > >> +/* Returns true if the consume of the last n elements has created space
> > >> + * in the ring buffer (i.e., a new element can be produced).
> > >> + *
> > >> + * Note: Because of batching, a successful call to __ptr_ring_consume() /
> > >> + * __ptr_ring_consume_batched() does not guarantee that the next call to
> > >> + * __ptr_ring_produce() will succeed.
> > >
> > > This sounds like a bug that needs to be fixed, as it requires the user
> > > to know the implementation details. For example, even if
> > > __ptr_ring_consume_created_space() returns true, __ptr_ring_produce()
> > > may still fail?
> >
> > No, it should not fail in that case.
> > If you only call consume and after that try to produce, *then* it is
> > likely to fail because __ptr_ring_zero_tail() is only invoked once per
> > batch.
> 
Well, this makes the helper very hard for users to use correctly.
> 
> So I think at least the documentation should specify the meaning of
> 'n' here.

Right. Documenting parameters is good.

> For example, is it the value returned by
> ptr_ring_consume_batched(), and is it required to be called
> immediately after ptr_ring_consume_batched()? If it is, the API is
> kind of tricky to use; we should consider merging the two helpers
> into a single new helper to ease things for the user.

I think you are partially right: it's more a question of documentation and naming.
It's not that it's hard to use - follow-up patches use it
without issues - it's that neither the documentation nor the
naming explains how.

Let's try to document it, then. First of all: if it does not guarantee that
produce will succeed, then what *is* the guarantee this API gives?

> 
> What's more, there would be false positives. When there are not
> many entries in the ring, just after the first zeroing,
> __ptr_ring_consume_created_space() will return true, which will lead to
> unnecessary wakeups.

Well, optimizations are judged on their performance, not on theoretical
analysis. In this instance, this should be rare enough.

> 
> And last, the function will always succeed if n is greater than the batch.
> 
> >
> > >
> > > Maybe revert fb9de9704775d?
> >
> > I disagree, as I consider this to be one of the key features of ptr_ring.
> 
> Nope, it's just an optimization and actually it changes the behaviour
> that might be noticed by the user.
> 
> Before the patch, ptr_ring_produce() is guaranteed to succeed after a
> ptr_ring_consume(). After that patch, it's not. We don't see complaint
> because the implication is not obvious (e.g more packet dropping).
> 
> >
> > That said, there are several other implementation details that users need
> > to be aware of.
> >
> > For example, __ptr_ring_empty() must only be called by the consumer. This
> > was for example the issue in dc82a33297fc ("veth: apply qdisc
> > backpressure on full ptr_ring to reduce TX drops") and the reason why
> > 5442a9da6978 ("veth: more robust handing of race to avoid txq getting
> > stuck") exists.
> 
> At least the behaviour is documented. This is not the case for the
> implications of fb9de9704775d.
> 
> Thanks
> 
> 
> >
> > >
> > >> + */
> > >> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r,
> > >> +                                                   int n)
> > >> +{
> > >> +       return r->consumer_head - r->consumer_tail < n;
> > >> +}
> > >> +
> > >>  /* Cast to structure type and call a function without discarding from FIFO.
> > >>   * Function must return a value.
> > >>   * Callers must take consumer_lock.
> > >> --
> > >> 2.43.0
> > >>
> > >
> > > Thanks
> > >
> >


^ permalink raw reply	[flat|nested] 69+ messages in thread
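
For readers following the discussion above: the batched consumer-index
behaviour being debated can be modeled with plain integers. This is an
illustrative userspace sketch of the arithmetic only - model_consume() is a
hypothetical stand-in for __ptr_ring_consume() with a batched zero-tail, not
the real implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Model: consumer_head advances on every consume; consumer_tail only
 * jumps to consumer_head once a full batch has been consumed - the
 * moment __ptr_ring_zero_tail() would actually free producer space. */
struct ring_model {
	int consumer_head;
	int consumer_tail;
	int batch;
};

static void model_consume(struct ring_model *r)
{
	r->consumer_head++;
	if (r->consumer_head - r->consumer_tail >= r->batch)
		r->consumer_tail = r->consumer_head;
}

/* The proposed helper: did the last n consumes move the tail? */
static bool consume_created_space(const struct ring_model *r, int n)
{
	return r->consumer_head - r->consumer_tail < n;
}
```

The model makes the documented caveat concrete: a single successful consume
does not imply that produce will now succeed, because the tail (and therefore
producer-visible space) moves only once per batch.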

* Re: [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume
  2026-01-07 21:04 ` [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume Simon Schippers
  2026-01-08  3:23   ` Jason Wang
@ 2026-01-09  7:22   ` Michael S. Tsirkin
  2026-01-09  7:35     ` Simon Schippers
  1 sibling, 1 reply; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-01-09  7:22 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Wed, Jan 07, 2026 at 10:04:41PM +0100, Simon Schippers wrote:
> This proposed function checks whether __ptr_ring_zero_tail() was invoked
> within the last n calls to __ptr_ring_consume(), which indicates that new
> free space was created. Since __ptr_ring_zero_tail() moves the tail to
> the head - and no other function modifies either the head or the tail,
> aside from the wrap-around case described below - detecting such a
> movement is sufficient to detect the invocation of
> __ptr_ring_zero_tail().
> 
> The implementation detects this movement by checking whether the tail is
> at most n positions behind the head. If this condition holds, the shift
> of the tail to its current position must have occurred within the last n
> calls to __ptr_ring_consume(), indicating that __ptr_ring_zero_tail() was
> invoked and that new free space was created.
> 
> This logic also correctly handles the wrap-around case in which
> __ptr_ring_zero_tail() is invoked and the head and the tail are reset
> to 0. Since this reset likewise moves the tail to the head, the same
> detection logic applies.
> 
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  include/linux/ptr_ring.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index a5a3fa4916d3..7cdae6d1d400 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -438,6 +438,19 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>  	return ret;
>  }
>  
> +/* Returns true if the consume of the last n elements has created space
> + * in the ring buffer (i.e., a new element can be produced).
> + *
> + * Note: Because of batching, a successful call to __ptr_ring_consume() /
> + * __ptr_ring_consume_batched() does not guarantee that the next call to
> + * __ptr_ring_produce() will succeed.


I think the issue is it does not say what is the actual guarantee.

Another issue is that the "Note" really should be more prominent,
it really is part of explaining what the function does.

Hmm. Maybe we should tell it how many entries have been consumed and
get back an indication of how much space this created?

fundamentally
	 n - (r->consumer_head - r->consumer_tail)?


does the below sound good maybe?

/* Returns the amount of space (number of new elements that can be
 * produced) that calls to ptr_ring_consume created.
 *
 * Getting n entries from calls to ptr_ring_consume() /
 * ptr_ring_consume_batched() does *not* guarantee that the next n calls to
 * ptr_ring_produce() will succeed.
 *
 * Use this function after consuming n entries to get a hint about
 * how much space was actually created.





> + */
> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r,
> +						    int n)
> +{
> +	return r->consumer_head - r->consumer_tail < n;
> +}
> +
>  /* Cast to structure type and call a function without discarding from FIFO.
>   * Function must return a value.
>   * Callers must take consumer_lock.
> -- 
> 2.43.0


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume
  2026-01-09  7:22   ` Michael S. Tsirkin
@ 2026-01-09  7:35     ` Simon Schippers
  2026-01-09  8:31       ` Michael S. Tsirkin
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-09  7:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 1/9/26 08:22, Michael S. Tsirkin wrote:
> On Wed, Jan 07, 2026 at 10:04:41PM +0100, Simon Schippers wrote:
>> This proposed function checks whether __ptr_ring_zero_tail() was invoked
>> within the last n calls to __ptr_ring_consume(), which indicates that new
>> free space was created. Since __ptr_ring_zero_tail() moves the tail to
>> the head - and no other function modifies either the head or the tail,
>> aside from the wrap-around case described below - detecting such a
>> movement is sufficient to detect the invocation of
>> __ptr_ring_zero_tail().
>>
>> The implementation detects this movement by checking whether the tail is
>> at most n positions behind the head. If this condition holds, the shift
>> of the tail to its current position must have occurred within the last n
>> calls to __ptr_ring_consume(), indicating that __ptr_ring_zero_tail() was
>> invoked and that new free space was created.
>>
>> This logic also correctly handles the wrap-around case in which
>> __ptr_ring_zero_tail() is invoked and the head and the tail are reset
>> to 0. Since this reset likewise moves the tail to the head, the same
>> detection logic applies.
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>>  include/linux/ptr_ring.h | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>> index a5a3fa4916d3..7cdae6d1d400 100644
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -438,6 +438,19 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>>  	return ret;
>>  }
>>  
>> +/* Returns true if the consume of the last n elements has created space
>> + * in the ring buffer (i.e., a new element can be produced).
>> + *
>> + * Note: Because of batching, a successful call to __ptr_ring_consume() /
>> + * __ptr_ring_consume_batched() does not guarantee that the next call to
>> + * __ptr_ring_produce() will succeed.
> 
> 
> I think the issue is it does not say what is the actual guarantee.
> 
> Another issue is that the "Note" really should be more prominent,
> it really is part of explaining what the function does.
> 
> Hmm. Maybe we should tell it how many entries have been consumed and
> get back an indication of how much space this created?
> 
> fundamentally
> 	 n - (r->consumer_head - r->consumer_tail)?

No, that is wrong from my POV.

It always creates the same amount of space, which is the batch size or
multiple batch sizes (or something less in the wrap-around case). That is
of course only if __ptr_ring_zero_tail() was executed at least once,
else it creates zero space.

> 
> 
> does the below sound good maybe?
> 
> /* Returns the amount of space (number of new elements that can be
>  * produced) that calls to ptr_ring_consume created.
>  *
>  * Getting n entries from calls to ptr_ring_consume() /
>  * ptr_ring_consume_batched() does *not* guarantee that the next n calls to
>  * ptr_ring_produce() will succeed.
>  *
>  * Use this function after consuming n entries to get a hint about
>  * how much space was actually created.
> 
> 
> 
> 
> 
>> + */
>> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r,
>> +						    int n)
>> +{
>> +	return r->consumer_head - r->consumer_tail < n;
>> +}
>> +
>>  /* Cast to structure type and call a function without discarding from FIFO.
>>   * Function must return a value.
>>   * Callers must take consumer_lock.
>> -- 
>> 2.43.0
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume
  2026-01-09  7:35     ` Simon Schippers
@ 2026-01-09  8:31       ` Michael S. Tsirkin
  2026-01-09  9:06         ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-01-09  8:31 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Fri, Jan 09, 2026 at 08:35:31AM +0100, Simon Schippers wrote:
> On 1/9/26 08:22, Michael S. Tsirkin wrote:
> > On Wed, Jan 07, 2026 at 10:04:41PM +0100, Simon Schippers wrote:
> >> This proposed function checks whether __ptr_ring_zero_tail() was invoked
> >> within the last n calls to __ptr_ring_consume(), which indicates that new
> >> free space was created. Since __ptr_ring_zero_tail() moves the tail to
> >> the head - and no other function modifies either the head or the tail,
> >> aside from the wrap-around case described below - detecting such a
> >> movement is sufficient to detect the invocation of
> >> __ptr_ring_zero_tail().
> >>
> >> The implementation detects this movement by checking whether the tail is
> >> at most n positions behind the head. If this condition holds, the shift
> >> of the tail to its current position must have occurred within the last n
> >> calls to __ptr_ring_consume(), indicating that __ptr_ring_zero_tail() was
> >> invoked and that new free space was created.
> >>
> >> This logic also correctly handles the wrap-around case in which
> >> __ptr_ring_zero_tail() is invoked and the head and the tail are reset
> >> to 0. Since this reset likewise moves the tail to the head, the same
> >> detection logic applies.
> >>
> >> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >> ---
> >>  include/linux/ptr_ring.h | 13 +++++++++++++
> >>  1 file changed, 13 insertions(+)
> >>
> >> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> >> index a5a3fa4916d3..7cdae6d1d400 100644
> >> --- a/include/linux/ptr_ring.h
> >> +++ b/include/linux/ptr_ring.h
> >> @@ -438,6 +438,19 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
> >>  	return ret;
> >>  }
> >>  
> >> +/* Returns true if the consume of the last n elements has created space
> >> + * in the ring buffer (i.e., a new element can be produced).
> >> + *
> >> + * Note: Because of batching, a successful call to __ptr_ring_consume() /
> >> + * __ptr_ring_consume_batched() does not guarantee that the next call to
> >> + * __ptr_ring_produce() will succeed.
> > 
> > 
> > I think the issue is it does not say what is the actual guarantee.
> > 
> > Another issue is that the "Note" really should be more prominent,
> > it really is part of explaining what the function does.
> > 
> > Hmm. Maybe we should tell it how many entries have been consumed and
> > get back an indication of how much space this created?
> > 
> > fundamentally
> > 	 n - (r->consumer_head - r->consumer_tail)?
> 
> No, that is wrong from my POV.
> 
> It always creates the same amount of space which is the batch size or
> multiple batch sizes (or something less in the wrap-around case). That is
> of course only if __ptr_ring_zero_tail() was executed at least once,
> else it creates zero space.

Exactly - and the caller does not know, and now he wants to know, so
we add an API for him to find out?

I feel the fact that it's binary (batch or 0) is an implementation
detail better hidden from the user.



> > 
> > 
> > does the below sound good maybe?
> > 
> > /* Returns the amount of space (number of new elements that can be
> >  * produced) that calls to ptr_ring_consume created.
> >  *
> >  * Getting n entries from calls to ptr_ring_consume() /
> >  * ptr_ring_consume_batched() does *not* guarantee that the next n calls to
> >  * ptr_ring_produce() will succeed.
> >  *
> >  * Use this function after consuming n entries to get a hint about
> >  * how much space was actually created.
> > 
> > 
> > 
> > 
> > 
> >> + */
> >> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r,
> >> +						    int n)
> >> +{
> >> +	return r->consumer_head - r->consumer_tail < n;
> >> +}
> >> +
> >>  /* Cast to structure type and call a function without discarding from FIFO.
> >>   * Function must return a value.
> >>   * Callers must take consumer_lock.
> >> -- 
> >> 2.43.0
> > 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume
  2026-01-09  8:31       ` Michael S. Tsirkin
@ 2026-01-09  9:06         ` Simon Schippers
  2026-01-12 16:29           ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-09  9:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 1/9/26 09:31, Michael S. Tsirkin wrote:
> On Fri, Jan 09, 2026 at 08:35:31AM +0100, Simon Schippers wrote:
>> On 1/9/26 08:22, Michael S. Tsirkin wrote:
>>> On Wed, Jan 07, 2026 at 10:04:41PM +0100, Simon Schippers wrote:
>>>> This proposed function checks whether __ptr_ring_zero_tail() was invoked
>>>> within the last n calls to __ptr_ring_consume(), which indicates that new
>>>> free space was created. Since __ptr_ring_zero_tail() moves the tail to
>>>> the head - and no other function modifies either the head or the tail,
>>>> aside from the wrap-around case described below - detecting such a
>>>> movement is sufficient to detect the invocation of
>>>> __ptr_ring_zero_tail().
>>>>
>>>> The implementation detects this movement by checking whether the tail is
>>>> at most n positions behind the head. If this condition holds, the shift
>>>> of the tail to its current position must have occurred within the last n
>>>> calls to __ptr_ring_consume(), indicating that __ptr_ring_zero_tail() was
>>>> invoked and that new free space was created.
>>>>
>>>> This logic also correctly handles the wrap-around case in which
>>>> __ptr_ring_zero_tail() is invoked and the head and the tail are reset
>>>> to 0. Since this reset likewise moves the tail to the head, the same
>>>> detection logic applies.
>>>>
>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>> ---
>>>>  include/linux/ptr_ring.h | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>>>> index a5a3fa4916d3..7cdae6d1d400 100644
>>>> --- a/include/linux/ptr_ring.h
>>>> +++ b/include/linux/ptr_ring.h
>>>> @@ -438,6 +438,19 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +/* Returns true if the consume of the last n elements has created space
>>>> + * in the ring buffer (i.e., a new element can be produced).
>>>> + *
>>>> + * Note: Because of batching, a successful call to __ptr_ring_consume() /
>>>> + * __ptr_ring_consume_batched() does not guarantee that the next call to
>>>> + * __ptr_ring_produce() will succeed.
>>>
>>>
>>> I think the issue is it does not say what is the actual guarantee.
>>>
>>> Another issue is that the "Note" really should be more prominent,
>>> it really is part of explaining what the functions does.
>>>
>>> Hmm. Maybe we should tell it how many entries have been consumed and
>>> get back an indication of how much space this created?
>>>
>>> fundamentally
>>> 	 n - (r->consumer_head - r->consumer_tail)?
>>
>> No, that is wrong from my POV.
>>
>> It always creates the same amount of space which is the batch size or
>> multiple batch sizes (or something less in the wrap-around case). That is
>> of course only if __ptr_ring_zero_tail() was executed at least once,
>> else it creates zero space.
> 
> exactly, and caller does not know, and now he wants to know so
> we add an API for him to find out?
> 
> I feel the fact it's a binary (batch or 0) is an implementation
> detail better hidden from user.

I agree, and I now understood your logic :)

So it should be:

static inline int __ptr_ring_consume_created_space(struct ptr_ring *r,
						   int n)
{
	return max(n - (r->consumer_head - r->consumer_tail), 0);
}

Right?
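For reference, the arithmetic behind that helper can be modeled in plain C. This is a toy model, not the kernel's ptr_ring: the struct and toy_* names are illustrative, and the batched tail advance stands in for __ptr_ring_zero_tail().

```c
#include <assert.h>

/* Toy model of batched consumption (not the kernel's ptr_ring):
 * consumer_head counts consumed entries, while consumer_tail is only
 * advanced in batches of 'batch' entries - the point at which
 * producer slots actually become free again.  The helper mirrors the
 * proposed semantics: after consuming n entries, the number of freed
 * producer slots is max(n - (head - tail), 0). */
struct toy_ring {
	int consumer_head;
	int consumer_tail;
	int batch;
};

static void toy_consume(struct toy_ring *r)
{
	r->consumer_head++;
	/* Free a whole batch at once, like __ptr_ring_zero_tail(). */
	if (r->consumer_head - r->consumer_tail >= r->batch)
		r->consumer_tail = r->consumer_head;
}

static int toy_consume_created_space(const struct toy_ring *r, int n)
{
	int pending = r->consumer_head - r->consumer_tail;

	return n > pending ? n - pending : 0;
}
```

With batch = 4, a single consume frees nothing (the tail has not moved), while the fourth consume frees the whole batch at once - the binary "batch or 0" behavior discussed above.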

> 
> 
> 
>>>
>>>
>>> does the below sound good maybe?
>>>
>>> /* Returns the amount of space (number of new elements that can be
>>>  * produced) that calls to ptr_ring_consume created.
>>>  *
>>>  * Getting n entries from calls to ptr_ring_consume() /
>>>  * ptr_ring_consume_batched() does *not* guarantee that the next n calls to
>>>  * ptr_ring_produce() will succeed.
>>>  *
>>>  * Use this function after consuming n entries to get a hint about
>>>  * how much space was actually created.
>>>
>>>
>>>
>>>
>>>
>>>> + */
>>>> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r,
>>>> +						    int n)
>>>> +{
>>>> +	return r->consumer_head - r->consumer_tail < n;
>>>> +}
>>>> +
>>>>  /* Cast to structure type and call a function without discarding from FIFO.
>>>>   * Function must return a value.
>>>>   * Callers must take consumer_lock.
>>>> -- 
>>>> 2.43.0
>>>
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-09  6:02       ` Jason Wang
@ 2026-01-09  9:31         ` Simon Schippers
  2026-01-21  9:32         ` Simon Schippers
  1 sibling, 0 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-09  9:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/9/26 07:02, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/8/26 04:38, Jason Wang wrote:
>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>> space in the underlying ptr_ring.
>>>>
>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>> in an upcoming commit.
>>>>
>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>> ---
>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>> index 1197f245e873..2442cf7ac385 100644
>>>> --- a/drivers/net/tap.c
>>>> +++ b/drivers/net/tap.c
>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>         return ret ? ret : total;
>>>>  }
>>>>
>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>> +{
>>>> +       struct ptr_ring *ring = &q->ring;
>>>> +       struct net_device *dev;
>>>> +       void *ptr;
>>>> +
>>>> +       spin_lock(&ring->consumer_lock);
>>>> +
>>>> +       ptr = __ptr_ring_consume(ring);
>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>> +               rcu_read_lock();
>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>> +               rcu_read_unlock();
>>>> +       }
>>>> +
>>>> +       spin_unlock(&ring->consumer_lock);
>>>> +
>>>> +       return ptr;
>>>> +}
>>>> +
>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>                            struct iov_iter *to,
>>>>                            int noblock, struct sk_buff *skb)
>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>                                         TASK_INTERRUPTIBLE);
>>>>
>>>>                 /* Read frames from the queue */
>>>> -               skb = ptr_ring_consume(&q->ring);
>>>> +               skb = tap_ring_consume(q);
>>>>                 if (skb)
>>>>                         break;
>>>>                 if (noblock) {
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index 8192740357a0..7148f9a844a4 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>         return total;
>>>>  }
>>>>
>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>> +{
>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>> +       struct net_device *dev;
>>>> +       void *ptr;
>>>> +
>>>> +       spin_lock(&ring->consumer_lock);
>>>> +
>>>> +       ptr = __ptr_ring_consume(ring);
>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>
>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>> another call to tweak the current API.
>>>
>>>> +               rcu_read_lock();
>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>
>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>> I'm not sure is what we want.
>>
>> What else would you suggest calling to wake the queue?
> 
> I don't have a good method in my mind, just want to point out its implications.

Okay :)
> 
>>
>>>
>>>> +               rcu_read_unlock();
>>>> +       }
>>>
>>> Btw, this function duplicates a lot of logic of tap_ring_consume(); we
>>> should consider merging the logic.
>>
>> Yes, it is largely the same approach, but it would require accessing the
>> net_device each time.
> 
> The problem is that, at least for TUN, the socket is loosely coupled
> with the netdev. It means the netdev can go away while the socket
> might still exist. That's why vhost only talks to the socket, not the
> netdev. If we really want to go this way, here, we should at least
> check the existence of tun->dev first.

You are right, I missed that.

> 
>>
>>>
>>>> +
>>>> +       spin_unlock(&ring->consumer_lock);
>>>> +
>>>> +       return ptr;
>>>> +}
>>>> +
>>>>  static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>>>>  {
>>>>         DECLARE_WAITQUEUE(wait, current);
>>>>         void *ptr = NULL;
>>>>         int error = 0;
>>>>
>>>> -       ptr = ptr_ring_consume(&tfile->tx_ring);
>>>> +       ptr = tun_ring_consume(tfile);
>>>
>>> I'm not sure having a separate patch like this helps. For example,
>>> it will introduce a performance regression.
>>
>> I ran benchmarks for the whole patch set with noqueue (where the queue is
>> not stopped to preserve the old behavior), as described in the cover
>> letter, and observed no performance regression. This leads me to conclude
>> that there is no performance impact because of this patch when the queue
>> is not stopped.
> 
> Have you run a benchmark per patch? Or it might just be because the
> regression is not obvious. But at least this patch would introduce
> more atomic operations, or it might just be because TUN doesn't
> support burst so pktgen can't reach its best PPS.

No, I haven't. I see your point that this patch adds an additional
atomic test_and_clear_bit() (which will always return false without a
queue stop), and I should test that.

> 
> Thanks
> 
> 
>>
>>>
>>>>         if (ptr)
>>>>                 goto out;
>>>>         if (noblock) {
>>>> @@ -2131,7 +2152,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>>>>
>>>>         while (1) {
>>>>                 set_current_state(TASK_INTERRUPTIBLE);
>>>> -               ptr = ptr_ring_consume(&tfile->tx_ring);
>>>> +               ptr = tun_ring_consume(tfile);
>>>>                 if (ptr)
>>>>                         break;
>>>>                 if (signal_pending(current)) {
>>>> --
>>>> 2.43.0
>>>>
>>>
>>> Thanks
>>>
>>
> 


* [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers
  2026-01-09  6:04       ` Jason Wang
@ 2026-01-09  9:57         ` Simon Schippers
  2026-01-12  2:54           ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-09  9:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/9/26 07:04, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 3:48 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/8/26 05:38, Jason Wang wrote:
>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> Replace the direct use of ptr_ring in the vhost-net virtqueue with
>>>> tun/tap ring wrapper helpers. Instead of storing an rx_ring pointer,
>>>> the virtqueue now stores the interface type (IF_TUN, IF_TAP, or IF_NONE)
>>>> and dispatches to the corresponding tun/tap helpers for ring
>>>> produce, consume, and unconsume operations.
>>>>
>>>> Routing ring operations through the tun/tap helpers enables netdev
>>>> queue wakeups, which are required for upcoming netdev queue flow
>>>> control support shared by tun/tap and vhost-net.
>>>>
>>>> No functional change is intended beyond switching to the wrapper
>>>> helpers.
>>>>
>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Co-developed-by: Jon Kohler <jon@nutanix.com>
>>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>> ---
>>>>  drivers/vhost/net.c | 92 +++++++++++++++++++++++++++++----------------
>>>>  1 file changed, 60 insertions(+), 32 deletions(-)
>>>>
>>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>>> index 7f886d3dba7d..215556f7cd40 100644
>>>> --- a/drivers/vhost/net.c
>>>> +++ b/drivers/vhost/net.c
>>>> @@ -90,6 +90,12 @@ enum {
>>>>         VHOST_NET_VQ_MAX = 2,
>>>>  };
>>>>
>>>> +enum if_type {
>>>> +       IF_NONE = 0,
>>>> +       IF_TUN = 1,
>>>> +       IF_TAP = 2,
>>>> +};
>>>
>>> This looks not elegant, can we simply export objects we want to use to
>>> vhost like get_tap_socket()?
>>
>> No, we cannot do that. We would need access to both the ptr_ring and the
>> net_device. However, the net_device is protected by an RCU lock.
>>
>> That is why {tun,tap}_ring_consume_batched() are used:
>> they take the appropriate locks and handle waking the queue.
> 
> How about introducing a callback in the ptr_ring itself, so vhost_net
> only need to know about the ptr_ring?

That would be great, but I'm not sure whether this should be the
responsibility of the ptr_ring.

If the ptr_ring were to keep track of the netdev queue, it could handle
all the management itself - stopping the queue when full and waking it
again once space becomes available.

What would be your idea for implementing this?

> 
> Thanks
> 
>>
>>>
>>> Thanks
>>>
>>
> 


* [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-09  6:09       ` Jason Wang
@ 2026-01-09 10:14         ` Simon Schippers
  2026-01-12  2:22           ` Jason Wang
  2026-01-12  4:33           ` Michael S. Tsirkin
  0 siblings, 2 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-09 10:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/9/26 07:09, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/8/26 05:37, Jason Wang wrote:
>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>>> reaches capacity, the netdev queue is stopped instead of dropping
>>>> subsequent packets.
>>>>
>>>> If producing an entry fails anyway, tun_net_xmit() returns
>>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
>>>> LLTX is enabled and the transmit path operates without the usual locking.
>>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
>>>>
>>>> The existing __{tun,tap}_ring_consume functions free space in the
>>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
>>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
>>>> this, a memory barrier is enforced (as discussed in a similar
>>>> implementation in [1]), followed by a recheck that wakes the queue if
>>>> space is already available.
>>>>
>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>>>
>>>> +-------------------------+-----------+---------------+----------------+
>>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
>>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
>>>> | 10M packets             |           |               |                |
>>>> +-----------+-------------+-----------+---------------+----------------+
>>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
>>>> |           +-------------+-----------+---------------+----------------+
>>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
>>>> +-----------+-------------+-----------+---------------+----------------+
>>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
>>>> |  +        +-------------+-----------+---------------+----------------+
>>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
>>>> +-----------+-------------+-----------+---------------+----------------+
>>>>
>>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>>>>
>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>> ---
>>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index 71b6981d07d7..74d7fd09e9ba 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>         struct netdev_queue *queue;
>>>>         struct tun_file *tfile;
>>>>         int len = skb->len;
>>>> +       bool qdisc_present;
>>>> +       int ret;
>>>>
>>>>         rcu_read_lock();
>>>>         tfile = rcu_dereference(tun->tfiles[txq]);
>>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>
>>>>         nf_reset_ct(skb);
>>>>
>>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>>> +       queue = netdev_get_tx_queue(dev, txq);
>>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>> +
>>>> +       spin_lock(&tfile->tx_ring.producer_lock);
>>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>>> +               netif_tx_stop_queue(queue);
>>>> +               /* Avoid races with queue wake-up in
>>>> +                * __{tun,tap}_ring_consume by waking if space is
>>>> +                * available in a re-check.
>>>> +                * The barrier makes sure that the stop is visible before
>>>> +                * we re-check.
>>>> +                */
>>>> +               smp_mb__after_atomic();
>>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>>> +                       netif_tx_wake_queue(queue);
>>>
>>> I'm not sure I will get here, but I think those should be moved to the
>>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
>>> need to bother with those queue stop/wake logic?
>>
>> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
>> we stop the queue proactively.
> 
> This seems to conflict with the following NETDEV_TX_BUSY. Or is
> NETDEV_TX_BUSY prepared for the xdp_xmit?

Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
And I do not understand the connection with xdp_xmit.
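For context, the stop-then-recheck pattern from the quoted hunk can be modeled in userspace with C11 atomics. This is a sketch with made-up names, not the tun code: 'ring_full' stands in for __ptr_ring_produce_peek() failing, 'queue_stopped' for the netdev queue state, and atomic_thread_fence() for smp_mb__after_atomic().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <assert.h>

/* Userspace model of the producer-side stop/recheck in the patch
 * (a sketch, not drivers/net/tun.c). */
static atomic_bool ring_full;
static atomic_bool queue_stopped;

static void producer_after_produce(void)
{
	if (atomic_load(&ring_full)) {
		atomic_store(&queue_stopped, true);
		/* Full barrier: the stop must be visible before the
		 * re-check, mirroring smp_mb__after_atomic(). */
		atomic_thread_fence(memory_order_seq_cst);
		/* Re-check: the consumer may have freed space between
		 * our first check and the stop, in which case its wakeup
		 * may have raced with the stop and we must wake here. */
		if (!atomic_load(&ring_full))
			atomic_store(&queue_stopped, false);
	}
}

static void consumer_after_free_space(void)
{
	atomic_store(&ring_full, false);
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load(&queue_stopped))
		atomic_store(&queue_stopped, false); /* wake the queue */
}
```

Without the barrier and re-check, a consumer that frees space between the producer's full-check and the stop could miss the stopped flag, leaving the queue stopped indefinitely - the race the quoted comment describes.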

> 
>>
>> I believe what you are aiming for is to always stop the queue if(ret),
>> which I can agree with. In that case, I would simply change the condition
>> to:
>>
>> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
>>
>>>
>>>> +       }
>>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
>>>> +
>>>> +       if (ret) {
>>>> +               /* If a qdisc is attached to our virtual device,
>>>> +                * returning NETDEV_TX_BUSY is allowed.
>>>> +                */
>>>> +               if (qdisc_present) {
>>>> +                       rcu_read_unlock();
>>>> +                       return NETDEV_TX_BUSY;
>>>> +               }
>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>                 goto drop;
>>>>         }
>>>>
>>>>         /* dev->lltx requires to do our own update of trans_start */
>>>> -       queue = netdev_get_tx_queue(dev, txq);
>>>>         txq_trans_cond_update(queue);
>>>>
>>>>         /* Notify and wake up reader process */
>>>> --
>>>> 2.43.0
>>>>
>>>
>>> Thanks
>>>
>>
> 
> Thanks
> 


* Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-09 10:14         ` Simon Schippers
@ 2026-01-12  2:22           ` Jason Wang
  2026-01-12 11:08             ` Simon Schippers
  2026-01-12  4:33           ` Michael S. Tsirkin
  1 sibling, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-12  2:22 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Fri, Jan 9, 2026 at 6:15 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/9/26 07:09, Jason Wang wrote:
> > On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/8/26 05:37, Jason Wang wrote:
> >>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>>> reaches capacity, the netdev queue is stopped instead of dropping
> >>>> subsequent packets.
> >>>>
> >>>> If producing an entry fails anyway, tun_net_xmit() returns
> >>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> >>>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
> >>>>
> >>>> The existing __{tun,tap}_ring_consume functions free space in the
> >>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
> >>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
> >>>> this, a memory barrier is enforced (as discussed in a similar
> >>>> implementation in [1]), followed by a recheck that wakes the queue if
> >>>> space is already available.
> >>>>
> >>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>>>
> >>>> +-------------------------+-----------+---------------+----------------+
> >>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> >>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> >>>> | 10M packets             |           |               |                |
> >>>> +-----------+-------------+-----------+---------------+----------------+
> >>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> >>>> |           +-------------+-----------+---------------+----------------+
> >>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> >>>> +-----------+-------------+-----------+---------------+----------------+
> >>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> >>>> |  +        +-------------+-----------+---------------+----------------+
> >>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> >>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>
> >>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>>>
> >>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>> ---
> >>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
> >>>>  1 file changed, 29 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>> index 71b6981d07d7..74d7fd09e9ba 100644
> >>>> --- a/drivers/net/tun.c
> >>>> +++ b/drivers/net/tun.c
> >>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>         struct netdev_queue *queue;
> >>>>         struct tun_file *tfile;
> >>>>         int len = skb->len;
> >>>> +       bool qdisc_present;
> >>>> +       int ret;
> >>>>
> >>>>         rcu_read_lock();
> >>>>         tfile = rcu_dereference(tun->tfiles[txq]);
> >>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>
> >>>>         nf_reset_ct(skb);
> >>>>
> >>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>> +       queue = netdev_get_tx_queue(dev, txq);
> >>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>> +
> >>>> +       spin_lock(&tfile->tx_ring.producer_lock);
> >>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>> +               netif_tx_stop_queue(queue);
> >>>> +               /* Avoid races with queue wake-up in
> >>>> +                * __{tun,tap}_ring_consume by waking if space is
> >>>> +                * available in a re-check.
> >>>> +                * The barrier makes sure that the stop is visible before
> >>>> +                * we re-check.
> >>>> +                */
> >>>> +               smp_mb__after_atomic();
> >>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>> +                       netif_tx_wake_queue(queue);
> >>>
> >>> I'm not sure I will get here, but I think those should be moved to the
> >>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
> >>> need to bother with those queue stop/wake logic?
> >>
> >> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
> >> we stop the queue proactively.
> >
> > This seems to conflict with the following NETDEV_TX_BUSY. Or is
> > NETDEV_TX_BUSY prepared for the xdp_xmit?
>
> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?

No, I mean I don't understand why we still need to peek since we've
already used NETDEV_TX_BUSY.

> And I do not understand the connection with xdp_xmit.

Since we don't modify the xdp_xmit path, even if we peek, the next
ndo_start_xmit can still hit ring full.

Thanks

>
> >
> >>
> >> I believe what you are aiming for is to always stop the queue if(ret),
> >> which I can agree with. In that case, I would simply change the condition
> >> to:
> >>
> >> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
> >>
> >>>
> >>>> +       }
> >>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >>>> +
> >>>> +       if (ret) {
> >>>> +               /* If a qdisc is attached to our virtual device,
> >>>> +                * returning NETDEV_TX_BUSY is allowed.
> >>>> +                */
> >>>> +               if (qdisc_present) {
> >>>> +                       rcu_read_unlock();
> >>>> +                       return NETDEV_TX_BUSY;
> >>>> +               }
> >>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>                 goto drop;
> >>>>         }
> >>>>
> >>>>         /* dev->lltx requires to do our own update of trans_start */
> >>>> -       queue = netdev_get_tx_queue(dev, txq);
> >>>>         txq_trans_cond_update(queue);
> >>>>
> >>>>         /* Notify and wake up reader process */
> >>>> --
> >>>> 2.43.0
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> >
> > Thanks
> >
>



* Re: [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers
  2026-01-09  9:57         ` Simon Schippers
@ 2026-01-12  2:54           ` Jason Wang
  2026-01-12  4:42             ` Michael S. Tsirkin
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-12  2:54 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Fri, Jan 9, 2026 at 5:57 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/9/26 07:04, Jason Wang wrote:
> > On Thu, Jan 8, 2026 at 3:48 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/8/26 05:38, Jason Wang wrote:
> >>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> Replace the direct use of ptr_ring in the vhost-net virtqueue with
> >>>> tun/tap ring wrapper helpers. Instead of storing an rx_ring pointer,
> >>>> the virtqueue now stores the interface type (IF_TUN, IF_TAP, or IF_NONE)
> >>>> and dispatches to the corresponding tun/tap helpers for ring
> >>>> produce, consume, and unconsume operations.
> >>>>
> >>>> Routing ring operations through the tun/tap helpers enables netdev
> >>>> queue wakeups, which are required for upcoming netdev queue flow
> >>>> control support shared by tun/tap and vhost-net.
> >>>>
> >>>> No functional change is intended beyond switching to the wrapper
> >>>> helpers.
> >>>>
> >>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>> Co-developed-by: Jon Kohler <jon@nutanix.com>
> >>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
> >>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>> ---
> >>>>  drivers/vhost/net.c | 92 +++++++++++++++++++++++++++++----------------
> >>>>  1 file changed, 60 insertions(+), 32 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >>>> index 7f886d3dba7d..215556f7cd40 100644
> >>>> --- a/drivers/vhost/net.c
> >>>> +++ b/drivers/vhost/net.c
> >>>> @@ -90,6 +90,12 @@ enum {
> >>>>         VHOST_NET_VQ_MAX = 2,
> >>>>  };
> >>>>
> >>>> +enum if_type {
> >>>> +       IF_NONE = 0,
> >>>> +       IF_TUN = 1,
> >>>> +       IF_TAP = 2,
> >>>> +};
> >>>
> >>> This looks not elegant, can we simply export objects we want to use to
> >>> vhost like get_tap_socket()?
> >>
> >> No, we cannot do that. We would need access to both the ptr_ring and the
> >> net_device. However, the net_device is protected by an RCU lock.
> >>
> >> That is why {tun,tap}_ring_consume_batched() are used:
> >> they take the appropriate locks and handle waking the queue.
> >
> > How about introducing a callback in the ptr_ring itself, so vhost_net
> > only need to know about the ptr_ring?
>
> That would be great, but I'm not sure whether this should be the
> responsibility of the ptr_ring.
>
> If the ptr_ring were to keep track of the netdev queue, it could handle
> all the management itself - stopping the queue when full and waking it
> again once space becomes available.
>
> What would be your idea for implementing this?

During ptr_ring_init(), register a callback; the callback will be
triggered during ptr_ring_consume() or ptr_ring_consume_batched() when
ptr_ring finds there is space for ptr_ring_produce().
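A minimal userspace sketch of such a callback scheme might look like the following. The cb_ring type and all of its functions are hypothetical illustrations of the idea, not a proposed kernel API; the batched tail advance models where the real consume path frees producer slots.

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical ring with a wake callback registered at init time;
 * the consume path invokes it whenever consuming has actually freed
 * a batch of producer slots. */
struct cb_ring {
	void (*space_cb)(void *ctx);	/* hypothetical wake callback */
	void *cb_ctx;
	int head, tail, batch;
};

static void cb_ring_init(struct cb_ring *r, int batch,
			 void (*cb)(void *ctx), void *ctx)
{
	r->head = 0;
	r->tail = 0;
	r->batch = batch;
	r->space_cb = cb;
	r->cb_ctx = ctx;
}

static void cb_ring_consume(struct cb_ring *r)
{
	r->head++;
	if (r->head - r->tail >= r->batch) {
		/* Batch freed: where the kernel would zero the tail,
		 * and where the registered callback would fire. */
		r->tail = r->head;
		if (r->space_cb)
			r->space_cb(r->cb_ctx);
	}
}

/* Example callback: count wake-ups (this would be something like
 * netif_wake_subqueue() in the tun/tap case). */
static int wake_count;
static void count_wake(void *ctx)
{
	(void)ctx;
	wake_count++;
}
```

With this shape, vhost-net would only need the ptr_ring pointer; the netdev-specific wakeup would live in the callback registered by tun/tap, at the cost of an indirect call on the consume path.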

Thanks

>
> >
> > Thanks
> >
> >>
> >>>
> >>> Thanks
> >>>
> >>
> >
>



* Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-09 10:14         ` Simon Schippers
  2026-01-12  2:22           ` Jason Wang
@ 2026-01-12  4:33           ` Michael S. Tsirkin
  2026-01-12 11:17             ` Simon Schippers
  1 sibling, 1 reply; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-01-12  4:33 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Fri, Jan 09, 2026 at 11:14:54AM +0100, Simon Schippers wrote:
> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?

We jump through a lot of hoops in virtio_net to avoid using
NETDEV_TX_BUSY because that bypasses all the net/ cleverness.
Given your patches aim to improve precisely the ring-full case,
I would say stopping proactively before NETDEV_TX_BUSY
should be a priority.

-- 
MST



* Re: [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers
  2026-01-12  2:54           ` Jason Wang
@ 2026-01-12  4:42             ` Michael S. Tsirkin
  0 siblings, 0 replies; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-01-12  4:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: Simon Schippers, willemdebruijn.kernel, andrew+netdev, davem,
	edumazet, kuba, pabeni, eperezma, leiyang, stephen, jon,
	tim.gebauer, netdev, linux-kernel, kvm, virtualization

On Mon, Jan 12, 2026 at 10:54:15AM +0800, Jason Wang wrote:
> On Fri, Jan 9, 2026 at 5:57 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
> >
> > On 1/9/26 07:04, Jason Wang wrote:
> > > On Thu, Jan 8, 2026 at 3:48 PM Simon Schippers
> > > <simon.schippers@tu-dortmund.de> wrote:
> > >>
> > >> On 1/8/26 05:38, Jason Wang wrote:
> > >>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> > >>> <simon.schippers@tu-dortmund.de> wrote:
> > >>>>
> > >>>> Replace the direct use of ptr_ring in the vhost-net virtqueue with
> > >>>> tun/tap ring wrapper helpers. Instead of storing an rx_ring pointer,
> > >>>> the virtqueue now stores the interface type (IF_TUN, IF_TAP, or IF_NONE)
> > >>>> and dispatches to the corresponding tun/tap helpers for ring
> > >>>> produce, consume, and unconsume operations.
> > >>>>
> > >>>> Routing ring operations through the tun/tap helpers enables netdev
> > >>>> queue wakeups, which are required for upcoming netdev queue flow
> > >>>> control support shared by tun/tap and vhost-net.
> > >>>>
> > >>>> No functional change is intended beyond switching to the wrapper
> > >>>> helpers.
> > >>>>
> > >>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > >>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > >>>> Co-developed-by: Jon Kohler <jon@nutanix.com>
> > >>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
> > >>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > >>>> ---
> > >>>>  drivers/vhost/net.c | 92 +++++++++++++++++++++++++++++----------------
> > >>>>  1 file changed, 60 insertions(+), 32 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > >>>> index 7f886d3dba7d..215556f7cd40 100644
> > >>>> --- a/drivers/vhost/net.c
> > >>>> +++ b/drivers/vhost/net.c
> > >>>> @@ -90,6 +90,12 @@ enum {
> > >>>>         VHOST_NET_VQ_MAX = 2,
> > >>>>  };
> > >>>>
> > >>>> +enum if_type {
> > >>>> +       IF_NONE = 0,
> > >>>> +       IF_TUN = 1,
> > >>>> +       IF_TAP = 2,
> > >>>> +};
> > >>>
> > >>> This looks not elegant, can we simply export objects we want to use to
> > >>> vhost like get_tap_socket()?
> > >>
> > >> No, we cannot do that. We would need access to both the ptr_ring and the
> > >> net_device. However, the net_device is protected by an RCU lock.
> > >>
> > >> That is why {tun,tap}_ring_consume_batched() are used:
> > >> they take the appropriate locks and handle waking the queue.
> > >
> > > How about introducing a callback in the ptr_ring itself, so vhost_net
> > > only needs to know about the ptr_ring?
> >
> > That would be great, but I'm not sure whether this should be the
> > responsibility of the ptr_ring.
> >
> > If the ptr_ring were to keep track of the netdev queue, it could handle
> > all the management itself - stopping the queue when full and waking it
> > again once space becomes available.
> >
> > What would be your idea for implementing this?
> 
> During ptr_ring_init() register a callback; the callback will be
> triggered during ptr_ring_consume() or ptr_ring_consume_batched() when
> ptr_ring finds there's space for ptr_ring_produce().
> 
> Thanks

Not sure the perceived elegance is worth the indirect call overhead.
ptr_ring is trying hard to be low overhead.
What this does is not complex enough to justify that.
We just need decent documentation.

> >
> > >
> > > Thanks
> > >
> > >>
> > >>>
> > >>> Thanks
> > >>>
> > >>
> > >
> >



* [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-12  2:22           ` Jason Wang
@ 2026-01-12 11:08             ` Simon Schippers
  2026-01-12 11:18               ` Michael S. Tsirkin
  2026-01-13  6:26               ` Jason Wang
  0 siblings, 2 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-12 11:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/12/26 03:22, Jason Wang wrote:
> On Fri, Jan 9, 2026 at 6:15 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/9/26 07:09, Jason Wang wrote:
>>> On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 1/8/26 05:37, Jason Wang wrote:
>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>>>>> reaches capacity, the netdev queue is stopped instead of dropping
>>>>>> subsequent packets.
>>>>>>
>>>>>> If producing an entry fails anyway, tun_net_xmit() returns
>>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
>>>>>> LLTX is enabled and the transmit path operates without the usual locking.
>>>>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
>>>>>>
>>>>>> The existing __{tun,tap}_ring_consume functions free space in the
>>>>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
>>>>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
>>>>>> this, a memory barrier is enforced (as discussed in a similar
>>>>>> implementation in [1]), followed by a recheck that wakes the queue if
>>>>>> space is already available.
>>>>>>
>>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
>>>>>>
>>>>>> +-------------------------+-----------+---------------+----------------+
>>>>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
>>>>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
>>>>>> | 10M packets             |           |               |                |
>>>>>> +-----------+-------------+-----------+---------------+----------------+
>>>>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
>>>>>> |           +-------------+-----------+---------------+----------------+
>>>>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
>>>>>> +-----------+-------------+-----------+---------------+----------------+
>>>>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
>>>>>> |  +        +-------------+-----------+---------------+----------------+
>>>>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
>>>>>> +-----------+-------------+-----------+---------------+----------------+
>>>>>>
>>>>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
>>>>>>
>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>> ---
>>>>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
>>>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>> index 71b6981d07d7..74d7fd09e9ba 100644
>>>>>> --- a/drivers/net/tun.c
>>>>>> +++ b/drivers/net/tun.c
>>>>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>         struct netdev_queue *queue;
>>>>>>         struct tun_file *tfile;
>>>>>>         int len = skb->len;
>>>>>> +       bool qdisc_present;
>>>>>> +       int ret;
>>>>>>
>>>>>>         rcu_read_lock();
>>>>>>         tfile = rcu_dereference(tun->tfiles[txq]);
>>>>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>
>>>>>>         nf_reset_ct(skb);
>>>>>>
>>>>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>>>>> +       queue = netdev_get_tx_queue(dev, txq);
>>>>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
>>>>>> +
>>>>>> +       spin_lock(&tfile->tx_ring.producer_lock);
>>>>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>>>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
>>>>>> +               netif_tx_stop_queue(queue);
>>>>>> +               /* Avoid races with queue wake-up in
>>>>>> +                * __{tun,tap}_ring_consume by waking if space is
>>>>>> +                * available in a re-check.
>>>>>> +                * The barrier makes sure that the stop is visible before
>>>>>> +                * we re-check.
>>>>>> +                */
>>>>>> +               smp_mb__after_atomic();
>>>>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>>>>> +                       netif_tx_wake_queue(queue);
>>>>>
>>>>> I'm not sure I will get here, but I think those should be moved to the
>>>>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
>>>>> need to bother with those queue stop/wake logic?
>>>>
>>>> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
>>>> we stop the queue proactively.
>>>
>>> This seems to conflict with the following NETDEV_TX_BUSY. Or is
>>> NETDEV_TX_BUSY prepared for the xdp_xmit?
>>
>> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> 
> No, I mean I don't understand why we still need to peek since we've
> already used NETDEV_TX_BUSY.

Yes, if __ptr_ring_produce() returns -ENOSPC, there is no need to check
__ptr_ring_produce_peek(). I agree with you on this point and will update
the code accordingly. In all other cases, checking
__ptr_ring_produce_peek() is still required in order to proactively stop
the queue.

> 
>> And I do not understand the connection with xdp_xmit.
> 
> Since we don't modify the xdp_xmit path, even if we peek, the next
> ndo_start_xmit can still hit ring full.

Ah okay. Would you apply the same stop-and-recheck logic in
tun_xdp_xmit() when __ptr_ring_produce() fails, or is that not
permitted there?

Apart from that, as noted in the commit message, since we are using LLTX,
hitting a full ring is still possible anyway. I could see this especially
in multiqueue tests with pktgen by looking at the qdisc requeues.

Thanks

> 
> Thanks
> 
>>
>>>
>>>>
>>>> I believe what you are aiming for is to always stop the queue if(ret),
>>>> which I can agree with. In that case, I would simply change the condition
>>>> to:
>>>>
>>>> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
>>>>
>>>>>
>>>>>> +       }
>>>>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
>>>>>> +
>>>>>> +       if (ret) {
>>>>>> +               /* If a qdisc is attached to our virtual device,
>>>>>> +                * returning NETDEV_TX_BUSY is allowed.
>>>>>> +                */
>>>>>> +               if (qdisc_present) {
>>>>>> +                       rcu_read_unlock();
>>>>>> +                       return NETDEV_TX_BUSY;
>>>>>> +               }
>>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>>>                 goto drop;
>>>>>>         }
>>>>>>
>>>>>>         /* dev->lltx requires to do our own update of trans_start */
>>>>>> -       queue = netdev_get_tx_queue(dev, txq);
>>>>>>         txq_trans_cond_update(queue);
>>>>>>
>>>>>>         /* Notify and wake up reader process */
>>>>>> --
>>>>>> 2.43.0
>>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>
>>> Thanks
>>>
>>
> 


* [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-12  4:33           ` Michael S. Tsirkin
@ 2026-01-12 11:17             ` Simon Schippers
  2026-01-12 11:19               ` Michael S. Tsirkin
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-12 11:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 1/12/26 05:33, Michael S. Tsirkin wrote:
> On Fri, Jan 09, 2026 at 11:14:54AM +0100, Simon Schippers wrote:
>> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> 
> We jump through a lot of hoops in virtio_net to avoid using
> NETDEV_TX_BUSY because that bypasses all the net/ cleverness.
> Given your patches aim to improve precisely ring full,
> I would say stopping proactively before NETDEV_TX_BUSY
> should be a priority.
> 

I already proactively stop here with the approach you proposed in
the v6.
Or am I missing something (apart from the xdp path)?

And yes I also dislike returning NETDEV_TX_BUSY but I do not see how
this can be prevented with lltx enabled.


* Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-12 11:08             ` Simon Schippers
@ 2026-01-12 11:18               ` Michael S. Tsirkin
  2026-01-13  6:26               ` Jason Wang
  1 sibling, 0 replies; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-01-12 11:18 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Mon, Jan 12, 2026 at 12:08:14PM +0100, Simon Schippers wrote:
> On 1/12/26 03:22, Jason Wang wrote:
> > On Fri, Jan 9, 2026 at 6:15 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/9/26 07:09, Jason Wang wrote:
> >>> On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 1/8/26 05:37, Jason Wang wrote:
> >>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>>>>> reaches capacity, the netdev queue is stopped instead of dropping
> >>>>>> subsequent packets.
> >>>>>>
> >>>>>> If producing an entry fails anyway, tun_net_xmit() returns
> >>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> >>>>>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
> >>>>>>
> >>>>>> The existing __{tun,tap}_ring_consume functions free space in the
> >>>>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
> >>>>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
> >>>>>> this, a memory barrier is enforced (as discussed in a similar
> >>>>>> implementation in [1]), followed by a recheck that wakes the queue if
> >>>>>> space is already available.
> >>>>>>
> >>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>>>>>
> >>>>>> +-------------------------+-----------+---------------+----------------+
> >>>>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> >>>>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> >>>>>> | 10M packets             |           |               |                |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> >>>>>> |           +-------------+-----------+---------------+----------------+
> >>>>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> >>>>>> |  +        +-------------+-----------+---------------+----------------+
> >>>>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>>
> >>>>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>>>>>
> >>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>> ---
> >>>>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
> >>>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>> index 71b6981d07d7..74d7fd09e9ba 100644
> >>>>>> --- a/drivers/net/tun.c
> >>>>>> +++ b/drivers/net/tun.c
> >>>>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>         struct netdev_queue *queue;
> >>>>>>         struct tun_file *tfile;
> >>>>>>         int len = skb->len;
> >>>>>> +       bool qdisc_present;
> >>>>>> +       int ret;
> >>>>>>
> >>>>>>         rcu_read_lock();
> >>>>>>         tfile = rcu_dereference(tun->tfiles[txq]);
> >>>>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>
> >>>>>>         nf_reset_ct(skb);
> >>>>>>
> >>>>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>>>> +       queue = netdev_get_tx_queue(dev, txq);
> >>>>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>>>> +
> >>>>>> +       spin_lock(&tfile->tx_ring.producer_lock);
> >>>>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>>>> +               netif_tx_stop_queue(queue);
> >>>>>> +               /* Avoid races with queue wake-up in
> >>>>>> +                * __{tun,tap}_ring_consume by waking if space is
> >>>>>> +                * available in a re-check.
> >>>>>> +                * The barrier makes sure that the stop is visible before
> >>>>>> +                * we re-check.
> >>>>>> +                */
> >>>>>> +               smp_mb__after_atomic();
> >>>>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>>>> +                       netif_tx_wake_queue(queue);
> >>>>>
> >>>>> I'm not sure I will get here, but I think those should be moved to the
> >>>>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
> >>>>> need to bother with those queue stop/wake logic?
> >>>>
> >>>> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
> >>>> we stop the queue proactively.
> >>>
> >>> This seems to conflict with the following NETDEV_TX_BUSY. Or is
> >>> NETDEV_TX_BUSY prepared for the xdp_xmit?
> >>
> >> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> > 
> > No, I mean I don't understand why we still need to peek since we've
> > already used NETDEV_TX_BUSY.
> 
> Yes, if __ptr_ring_produce() returns -ENOSPC, there is no need to check
> __ptr_ring_produce_peek(). I agree with you on this point and will update
> the code accordingly. In all other cases, checking
> __ptr_ring_produce_peek() is still required in order to proactively stop
> the queue.
> 
> > 
> >> And I do not understand the connection with xdp_xmit.
> > 
> > Since we don't modify the xdp_xmit path, even if we peek, the next
> > ndo_start_xmit can still hit ring full.
> 
> Ah okay. Would you apply the same stop-and-recheck logic in
> tun_xdp_xmit() when __ptr_ring_produce() fails, or is that not
> permitted there?
> 
> Apart from that, as noted in the commit message, since we are using LLTX,
> hitting a full ring is still possible anyway. I could see this especially
> in multiqueue tests with pktgen by looking at the qdisc requeues.
> 
> Thanks

If it's an exceptionally rare condition (i.e. a race), it's not a big
deal. That's why NETDEV_TX_BUSY is there in the first place. If it's the
main modus operandi - not good.


> > 
> > Thanks
> > 
> >>
> >>>
> >>>>
> >>>> I believe what you are aiming for is to always stop the queue if(ret),
> >>>> which I can agree with. In that case, I would simply change the condition
> >>>> to:
> >>>>
> >>>> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
> >>>>
> >>>>>
> >>>>>> +       }
> >>>>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >>>>>> +
> >>>>>> +       if (ret) {
> >>>>>> +               /* If a qdisc is attached to our virtual device,
> >>>>>> +                * returning NETDEV_TX_BUSY is allowed.
> >>>>>> +                */
> >>>>>> +               if (qdisc_present) {
> >>>>>> +                       rcu_read_unlock();
> >>>>>> +                       return NETDEV_TX_BUSY;
> >>>>>> +               }
> >>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>>>                 goto drop;
> >>>>>>         }
> >>>>>>
> >>>>>>         /* dev->lltx requires to do our own update of trans_start */
> >>>>>> -       queue = netdev_get_tx_queue(dev, txq);
> >>>>>>         txq_trans_cond_update(queue);
> >>>>>>
> >>>>>>         /* Notify and wake up reader process */
> >>>>>> --
> >>>>>> 2.43.0
> >>>>>>
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> > 



* Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-12 11:17             ` Simon Schippers
@ 2026-01-12 11:19               ` Michael S. Tsirkin
  2026-01-12 11:28                 ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-01-12 11:19 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Mon, Jan 12, 2026 at 12:17:12PM +0100, Simon Schippers wrote:
> On 1/12/26 05:33, Michael S. Tsirkin wrote:
> > On Fri, Jan 09, 2026 at 11:14:54AM +0100, Simon Schippers wrote:
> >> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> > 
> > We jump through a lot of hoops in virtio_net to avoid using
> > NETDEV_TX_BUSY because that bypasses all the net/ cleverness.
> > Given your patches aim to improve precisely ring full,
> > I would say stopping proactively before NETDEV_TX_BUSY
> > should be a priority.
> > 
> 
> I already proactively stop here with the approach you proposed in
> the v6.
> Or am I missing something (apart from the xdp path)?

Yes, I am just answering the general question you posed.

> 
> And yes I also dislike returning NETDEV_TX_BUSY but I do not see how
> this can be prevented with lltx enabled.

Preventing NETDEV_TX_BUSY 100% of the time is not required. It's there
to handle races.

-- 
MST



* [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-12 11:19               ` Michael S. Tsirkin
@ 2026-01-12 11:28                 ` Simon Schippers
  0 siblings, 0 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-12 11:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 1/12/26 12:19, Michael S. Tsirkin wrote:
> On Mon, Jan 12, 2026 at 12:17:12PM +0100, Simon Schippers wrote:
>> On 1/12/26 05:33, Michael S. Tsirkin wrote:
>>> On Fri, Jan 09, 2026 at 11:14:54AM +0100, Simon Schippers wrote:
>>>> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
>>>
>>> We jump through a lot of hoops in virtio_net to avoid using
>>> NETDEV_TX_BUSY because that bypasses all the net/ cleverness.
>>> Given your patches aim to improve precisely ring full,
>>> I would say stopping proactively before NETDEV_TX_BUSY
>>> should be a priority.
>>>
>>
>> I already proactively stop here with the approach you proposed in
>> the v6.
>> Or am I missing something (apart from the xdp path)?
> 
> Yes, I am just answering the general question you posed.

Ah okay.

> 
>>
>> And yes I also dislike returning NETDEV_TX_BUSY but I do not see how
>> this can be prevented with lltx enabled.
> 
> Preventing NETDEV_TX_BUSY 100% of the time is not required. It's there
> to handle races.

Great to know. Thanks



* [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume
  2026-01-09  9:06         ` Simon Schippers
@ 2026-01-12 16:29           ` Simon Schippers
  0 siblings, 0 replies; 69+ messages in thread
From: Simon Schippers @ 2026-01-12 16:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 1/9/26 10:06, Simon Schippers wrote:
> On 1/9/26 09:31, Michael S. Tsirkin wrote:
>> On Fri, Jan 09, 2026 at 08:35:31AM +0100, Simon Schippers wrote:
>>> On 1/9/26 08:22, Michael S. Tsirkin wrote:
>>>> On Wed, Jan 07, 2026 at 10:04:41PM +0100, Simon Schippers wrote:
>>>>> This proposed function checks whether __ptr_ring_zero_tail() was invoked
>>>>> within the last n calls to __ptr_ring_consume(), which indicates that new
>>>>> free space was created. Since __ptr_ring_zero_tail() moves the tail to
>>>>> the head - and no other function modifies either the head or the tail,
>>>>> aside from the wrap-around case described below - detecting such a
>>>>> movement is sufficient to detect the invocation of
>>>>> __ptr_ring_zero_tail().
>>>>>
>>>>> The implementation detects this movement by checking whether the tail is
>>>>> at most n positions behind the head. If this condition holds, the shift
>>>>> of the tail to its current position must have occurred within the last n
>>>>> calls to __ptr_ring_consume(), indicating that __ptr_ring_zero_tail() was
>>>>> invoked and that new free space was created.
>>>>>
>>>>> This logic also correctly handles the wrap-around case in which
>>>>> __ptr_ring_zero_tail() is invoked and the head and the tail are reset
>>>>> to 0. Since this reset likewise moves the tail to the head, the same
>>>>> detection logic applies.
>>>>>
>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>> ---
>>>>>  include/linux/ptr_ring.h | 13 +++++++++++++
>>>>>  1 file changed, 13 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>>>>> index a5a3fa4916d3..7cdae6d1d400 100644
>>>>> --- a/include/linux/ptr_ring.h
>>>>> +++ b/include/linux/ptr_ring.h
>>>>> @@ -438,6 +438,19 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>>>>>  	return ret;
>>>>>  }
>>>>>  
>>>>> +/* Returns true if the consume of the last n elements has created space
>>>>> + * in the ring buffer (i.e., a new element can be produced).
>>>>> + *
>>>>> + * Note: Because of batching, a successful call to __ptr_ring_consume() /
>>>>> + * __ptr_ring_consume_batched() does not guarantee that the next call to
>>>>> + * __ptr_ring_produce() will succeed.
>>>>
>>>>
>>>> I think the issue is it does not say what is the actual guarantee.
>>>>
>>>> Another issue is that the "Note" really should be more prominent,
>>>> it really is part of explaining what the functions does.
>>>>
>>>> Hmm. Maybe we should tell it how many entries have been consumed and
>>>> get back an indication of how much space this created?
>>>>
>>>> fundamentally
>>>> 	 n - (r->consumer_head - r->consumer_tail)?
>>>
>>> No, that is wrong from my POV.
>>>
>>> It always creates the same amount of space which is the batch size or
>>> multiple batch sizes (or something less in the wrap-around case). That is
>>> of course only if __ptr_ring_zero_tail() was executed at least once,
>>> else it creates zero space.
>>
>> exactly, and caller does not know, and now he wants to know so
>> we add an API for him to find out?
>>
>> I feel the fact it's a binary (batch or 0) is an implementation
>> detail better hidden from user.
> 
> I agree, and I now understood your logic :)
> 
> So it should be:
> 
> static inline int __ptr_ring_consume_created_space(struct ptr_ring *r,
> 						   int n)
> {
> 	return max(n - (r->consumer_head - r->consumer_tail), 0);
> }
> 
> Right?

BTW:

No, that's still not correct. It misses the elements between the tail and
head that existed before the consume operation (called pre_consume_gap in
the code below).

After thinking about it a bit more, the best solution I came up with is:

static inline int __ptr_ring_consume_created_space(struct ptr_ring *r,
						   int n)
{
	int pre_consume_gap = (r->consumer_head - n + r->size) % r->size %
			      r->batch;

	return n - (r->consumer_head - r->consumer_tail) + pre_consume_gap;
}

Here, (r->consumer_head - n) represents the head position before the
consume, but it may be negative in C, so r->size is added before the
first modulo to normalize it to the range [0, size). Applying the modulo
batch to the pre-consume head position then yields the number of elements
that were between the tail and head before the consume.

With this approach, we no longer need max(..., 0), because if
n < (r->consumer_head - r->consumer_tail), the + pre_consume_gap term
cancels it out.


Is this solution viable in terms of performance regarding the modulo
operations?

> 
>>
>>
>>
>>>>
>>>>
>>>> does the below sound good maybe?
>>>>
>>>> /* Returns the amount of space (number of new elements that can be
>>>>  * produced) that calls to ptr_ring_consume created.
>>>>  *
>>>>  * Getting n entries from calls to ptr_ring_consume() /
>>>>  * ptr_ring_consume_batched() does *not* guarantee that the next n calls to
>>>>  * ptr_ring_produce() will succeed.
>>>>  *
>>>>  * Use this function after consuming n entries to get a hint about
>>>>  * how much space was actually created.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> + */
>>>>> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r,
>>>>> +						    int n)
>>>>> +{
>>>>> +	return r->consumer_head - r->consumer_tail < n;
>>>>> +}
>>>>> +
>>>>>  /* Cast to structure type and call a function without discarding from FIFO.
>>>>>   * Function must return a value.
>>>>>   * Callers must take consumer_lock.
>>>>> -- 
>>>>> 2.43.0
>>>>
>>


* Re: [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present
  2026-01-12 11:08             ` Simon Schippers
  2026-01-12 11:18               ` Michael S. Tsirkin
@ 2026-01-13  6:26               ` Jason Wang
  1 sibling, 0 replies; 69+ messages in thread
From: Jason Wang @ 2026-01-13  6:26 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Mon, Jan 12, 2026 at 7:08 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/12/26 03:22, Jason Wang wrote:
> > On Fri, Jan 9, 2026 at 6:15 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/9/26 07:09, Jason Wang wrote:
> >>> On Thu, Jan 8, 2026 at 4:02 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 1/8/26 05:37, Jason Wang wrote:
> >>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
> >>>>>> becomes full. Once an entry is successfully produced and the ptr_ring
> >>>>>> reaches capacity, the netdev queue is stopped instead of dropping
> >>>>>> subsequent packets.
> >>>>>>
> >>>>>> If producing an entry fails anyway, tun_net_xmit() returns
> >>>>>> NETDEV_TX_BUSY, again avoiding a drop. Such failures are expected because
> >>>>>> LLTX is enabled and the transmit path operates without the usual locking.
> >>>>>> As a result, concurrent calls to tun_net_xmit() are not prevented.
> >>>>>>
> >>>>>> The existing __{tun,tap}_ring_consume functions free space in the
> >>>>>> ptr_ring and wake the netdev queue. Races between this wakeup and the
> >>>>>> queue-stop logic could leave the queue stopped indefinitely. To prevent
> >>>>>> this, a memory barrier is enforced (as discussed in a similar
> >>>>>> implementation in [1]), followed by a recheck that wakes the queue if
> >>>>>> space is already available.
> >>>>>>
> >>>>>> If no qdisc is present, the previous tail-drop behavior is preserved.
> >>>>>>
> >>>>>> +-------------------------+-----------+---------------+----------------+
> >>>>>> | pktgen benchmarks to    | Stock     | Patched with  | Patched with   |
> >>>>>> | Debian VM, i5 6300HQ,   |           | noqueue qdisc | fq_codel qdisc |
> >>>>>> | 10M packets             |           |               |                |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>> | TAP       | Transmitted | 196 Kpps  | 195 Kpps      | 185 Kpps       |
> >>>>>> |           +-------------+-----------+---------------+----------------+
> >>>>>> |           | Lost        | 1618 Kpps | 1556 Kpps     | 0              |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>> | TAP       | Transmitted | 577 Kpps  | 582 Kpps      | 578 Kpps       |
> >>>>>> |  +        +-------------+-----------+---------------+----------------+
> >>>>>> | vhost-net | Lost        | 1170 Kpps | 1109 Kpps     | 0              |
> >>>>>> +-----------+-------------+-----------+---------------+----------------+
> >>>>>>
> >>>>>> [1] Link: https://lore.kernel.org/all/20250424085358.75d817ae@kernel.org/
> >>>>>>
> >>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>> ---
> >>>>>>  drivers/net/tun.c | 31 +++++++++++++++++++++++++++++--
> >>>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>> index 71b6981d07d7..74d7fd09e9ba 100644
> >>>>>> --- a/drivers/net/tun.c
> >>>>>> +++ b/drivers/net/tun.c
> >>>>>> @@ -1008,6 +1008,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>         struct netdev_queue *queue;
> >>>>>>         struct tun_file *tfile;
> >>>>>>         int len = skb->len;
> >>>>>> +       bool qdisc_present;
> >>>>>> +       int ret;
> >>>>>>
> >>>>>>         rcu_read_lock();
> >>>>>>         tfile = rcu_dereference(tun->tfiles[txq]);
> >>>>>> @@ -1060,13 +1062,38 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>>
> >>>>>>         nf_reset_ct(skb);
> >>>>>>
> >>>>>> -       if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>>>> +       queue = netdev_get_tx_queue(dev, txq);
> >>>>>> +       qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>>>> +
> >>>>>> +       spin_lock(&tfile->tx_ring.producer_lock);
> >>>>>> +       ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>>>> +       if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>>>> +               netif_tx_stop_queue(queue);
> >>>>>> +               /* Avoid races with queue wake-up in
> >>>>>> +                * __{tun,tap}_ring_consume by waking if space is
> >>>>>> +                * available in a re-check.
> >>>>>> +                * The barrier makes sure that the stop is visible before
> >>>>>> +                * we re-check.
> >>>>>> +                */
> >>>>>> +               smp_mb__after_atomic();
> >>>>>> +               if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>>>> +                       netif_tx_wake_queue(queue);
> >>>>>
> >>>>> I'm not sure I will get here, but I think those should be moved to the
> >>>>> following if(ret) check. If __ptr_ring_produce() succeed, there's no
> >>>>> need to bother with those queue stop/wake logic?
> >>>>
> >>>> There is a need for that. If __ptr_ring_produce_peek() returns -ENOSPC,
> >>>> we stop the queue proactively.
> >>>
> >>> This seems to conflict with the following NETDEV_TX_BUSY. Or is
> >>> NETDEV_TX_BUSY prepared for the xdp_xmit?
> >>
> >> Am I not allowed to stop the queue and then return NETDEV_TX_BUSY?
> >
> > No, I mean I don't understand why we still need to peek since we've
> > already used NETDEV_TX_BUSY.
>
> Yes, if __ptr_ring_produce() returns -ENOSPC, there is no need to check
> __ptr_ring_produce_peek(). I agree with you on this point and will update
> the code accordingly. In all other cases, checking
> __ptr_ring_produce_peek() is still required in order to proactively stop
> the queue.
>
> >
> >> And I do not understand the connection with xdp_xmit.
> >
> > Since we don't modify the xdp_xmit path, even if we peek, the next
> > ndo_start_xmit() can still hit a full ring.
>
> Ah okay. Would you apply the same stop-and-recheck logic in
> tun_xdp_xmit() when __ptr_ring_produce() fails, or is that not
> permitted there?

I think it won't work as there's no qdisc logic implemented in the XDP
xmit path. NETDEV_TX_BUSY for tun_net_xmit() should be sufficient.

>
> Apart from that, as noted in the commit message, since we are using LLTX,
> hitting a full ring is still possible anyway. I could observe this
> especially in multiqueue tests with pktgen by looking at the qdisc
> requeues.
>
> Thanks

Thanks

>
> >
> > Thanks
> >
> >>
> >>>
> >>>>
> >>>> I believe what you are aiming for is to always stop the queue if(ret),
> >>>> which I can agree with. In that case, I would simply change the condition
> >>>> to:
> >>>>
> >>>> if (qdisc_present && (ret || __ptr_ring_produce_peek(&tfile->tx_ring)))
> >>>>
> >>>>>
> >>>>>> +       }
> >>>>>> +       spin_unlock(&tfile->tx_ring.producer_lock);
> >>>>>> +
> >>>>>> +       if (ret) {
> >>>>>> +               /* If a qdisc is attached to our virtual device,
> >>>>>> +                * returning NETDEV_TX_BUSY is allowed.
> >>>>>> +                */
> >>>>>> +               if (qdisc_present) {
> >>>>>> +                       rcu_read_unlock();
> >>>>>> +                       return NETDEV_TX_BUSY;
> >>>>>> +               }
> >>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>>>                 goto drop;
> >>>>>>         }
> >>>>>>
> >>>>>>         /* dev->lltx requires to do our own update of trans_start */
> >>>>>> -       queue = netdev_get_tx_queue(dev, txq);
> >>>>>>         txq_trans_cond_update(queue);
> >>>>>>
> >>>>>>         /* Notify and wake up reader process */
> >>>>>> --
> >>>>>> 2.43.0
> >>>>>>
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> >
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-09  6:02       ` Jason Wang
  2026-01-09  9:31         ` Simon Schippers
@ 2026-01-21  9:32         ` Simon Schippers
  2026-01-22  5:35           ` Jason Wang
  1 sibling, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-21  9:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/9/26 07:02, Jason Wang wrote:
> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/8/26 04:38, Jason Wang wrote:
>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>> space in the underlying ptr_ring.
>>>>
>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>> in an upcoming commit.
>>>>
>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>> ---
>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>> index 1197f245e873..2442cf7ac385 100644
>>>> --- a/drivers/net/tap.c
>>>> +++ b/drivers/net/tap.c
>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>         return ret ? ret : total;
>>>>  }
>>>>
>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>> +{
>>>> +       struct ptr_ring *ring = &q->ring;
>>>> +       struct net_device *dev;
>>>> +       void *ptr;
>>>> +
>>>> +       spin_lock(&ring->consumer_lock);
>>>> +
>>>> +       ptr = __ptr_ring_consume(ring);
>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>> +               rcu_read_lock();
>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>> +               rcu_read_unlock();
>>>> +       }
>>>> +
>>>> +       spin_unlock(&ring->consumer_lock);
>>>> +
>>>> +       return ptr;
>>>> +}
>>>> +
>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>                            struct iov_iter *to,
>>>>                            int noblock, struct sk_buff *skb)
>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>                                         TASK_INTERRUPTIBLE);
>>>>
>>>>                 /* Read frames from the queue */
>>>> -               skb = ptr_ring_consume(&q->ring);
>>>> +               skb = tap_ring_consume(q);
>>>>                 if (skb)
>>>>                         break;
>>>>                 if (noblock) {
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index 8192740357a0..7148f9a844a4 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>         return total;
>>>>  }
>>>>
>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>> +{
>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>> +       struct net_device *dev;
>>>> +       void *ptr;
>>>> +
>>>> +       spin_lock(&ring->consumer_lock);
>>>> +
>>>> +       ptr = __ptr_ring_consume(ring);
>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>
>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>> another call to tweak the current API.
>>>
>>>> +               rcu_read_lock();
>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>
>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>> I'm not sure is what we want.
>>
>> What else would you suggest calling to wake the queue?
> 
> I don't have a good method in my mind, just want to point out its implications.

I have to admit I'm a bit stuck at this point, particularly with this
aspect.

What is the correct way to pass the producer CPU ID to the consumer?
Would it make sense to store smp_processor_id() in the tfile inside
tun_net_xmit(), or should it instead be stored in the skb (similar to the
XDP bit)? In the latter case, my concern is that this information may
already be significantly outdated by the time it is used.

Based on that, my idea would be for the consumer to wake the producer by
invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
smp_call_function_single().
Is this a reasonable approach?
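
Concretely, the proposal could look roughly like this. This is a
hypothetical, non-compilable sketch: the producer_cpu field and
tun_wake_queue() do not exist in the series.

```c
/* Producer side, in tun_net_xmit(): remember the submitting CPU. */
WRITE_ONCE(tfile->producer_cpu, smp_processor_id());

/* Consumer side, replacing the direct netif_wake_subqueue() call:
 * run the wake-up on the remembered CPU.  Each wake-up then costs
 * an IPI, and smp_call_function_single() has its own calling-context
 * restrictions. */
static void tun_wake_queue(void *data)
{
	struct tun_file *tfile = data;
	struct tun_struct *tun;

	rcu_read_lock();
	tun = rcu_dereference(tfile->tun);
	if (tun)
		netif_wake_subqueue(tun->dev, tfile->queue_index);
	rcu_read_unlock();
}

smp_call_function_single(READ_ONCE(tfile->producer_cpu),
			 tun_wake_queue, tfile, false);
```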

More generally, would triggering TX_SOFTIRQ on the consumer CPU be
considered a deal-breaker for the patch set?

Thanks!

> 
>>
>>>
>>>> +               rcu_read_unlock();
>>>> +       }
>>>
>>> Btw, this function duplicates a lot of logic of tap_ring_consume() we
>>> should consider to merge the logic.
>>
>> Yes, it is largely the same approach, but it would require accessing the
>> net_device each time.
> 
> The problem is that, at least for TUN, the socket is loosely coupled
> with the netdev. It means the netdev can go away while the socket
> might still exist. That's why vhost only talks to the socket, not the
> netdev. If we really want to go this way, here, we should at least
> check the existence of tun->dev first.
> 
>>
>>>
>>>> +
>>>> +       spin_unlock(&ring->consumer_lock);
>>>> +
>>>> +       return ptr;
>>>> +}
>>>> +
>>>>  static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>>>>  {
>>>>         DECLARE_WAITQUEUE(wait, current);
>>>>         void *ptr = NULL;
>>>>         int error = 0;
>>>>
>>>> -       ptr = ptr_ring_consume(&tfile->tx_ring);
>>>> +       ptr = tun_ring_consume(tfile);
>>>
>>> I'm not sure having a separate patch like this may help. For example,
>>> it will introduce a performance regression.
>>
>> I ran benchmarks for the whole patch set with noqueue (where the queue is
>> not stopped to preserve the old behavior), as described in the cover
>> letter, and observed no performance regression. This leads me to conclude
>> that there is no performance impact because of this patch when the queue
>> is not stopped.
> 
> Have you run a benchmark per patch? Or it might just be that the
> regression is not obvious. But at least this patch would introduce
> more atomic operations, or it might just be because TUN doesn't
> support bursts, so pktgen can't reach its best PPS.
> 
> Thanks
> 
> 
>>
>>>
>>>>         if (ptr)
>>>>                 goto out;
>>>>         if (noblock) {
>>>> @@ -2131,7 +2152,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>>>>
>>>>         while (1) {
>>>>                 set_current_state(TASK_INTERRUPTIBLE);
>>>> -               ptr = ptr_ring_consume(&tfile->tx_ring);
>>>> +               ptr = tun_ring_consume(tfile);
>>>>                 if (ptr)
>>>>                         break;
>>>>                 if (signal_pending(current)) {
>>>> --
>>>> 2.43.0
>>>>
>>>
>>> Thanks
>>>
>>
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-21  9:32         ` Simon Schippers
@ 2026-01-22  5:35           ` Jason Wang
  2026-01-23  3:05             ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-22  5:35 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/9/26 07:02, Jason Wang wrote:
> > On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/8/26 04:38, Jason Wang wrote:
> >>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> >>>> and wake the corresponding netdev subqueue when consuming an entry frees
> >>>> space in the underlying ptr_ring.
> >>>>
> >>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
> >>>> in an upcoming commit.
> >>>>
> >>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>> ---
> >>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
> >>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
> >>>>  2 files changed, 45 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>> index 1197f245e873..2442cf7ac385 100644
> >>>> --- a/drivers/net/tap.c
> >>>> +++ b/drivers/net/tap.c
> >>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>         return ret ? ret : total;
> >>>>  }
> >>>>
> >>>> +static void *tap_ring_consume(struct tap_queue *q)
> >>>> +{
> >>>> +       struct ptr_ring *ring = &q->ring;
> >>>> +       struct net_device *dev;
> >>>> +       void *ptr;
> >>>> +
> >>>> +       spin_lock(&ring->consumer_lock);
> >>>> +
> >>>> +       ptr = __ptr_ring_consume(ring);
> >>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>> +               rcu_read_lock();
> >>>> +               dev = rcu_dereference(q->tap)->dev;
> >>>> +               netif_wake_subqueue(dev, q->queue_index);
> >>>> +               rcu_read_unlock();
> >>>> +       }
> >>>> +
> >>>> +       spin_unlock(&ring->consumer_lock);
> >>>> +
> >>>> +       return ptr;
> >>>> +}
> >>>> +
> >>>>  static ssize_t tap_do_read(struct tap_queue *q,
> >>>>                            struct iov_iter *to,
> >>>>                            int noblock, struct sk_buff *skb)
> >>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
> >>>>                                         TASK_INTERRUPTIBLE);
> >>>>
> >>>>                 /* Read frames from the queue */
> >>>> -               skb = ptr_ring_consume(&q->ring);
> >>>> +               skb = tap_ring_consume(q);
> >>>>                 if (skb)
> >>>>                         break;
> >>>>                 if (noblock) {
> >>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>> index 8192740357a0..7148f9a844a4 100644
> >>>> --- a/drivers/net/tun.c
> >>>> +++ b/drivers/net/tun.c
> >>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>         return total;
> >>>>  }
> >>>>
> >>>> +static void *tun_ring_consume(struct tun_file *tfile)
> >>>> +{
> >>>> +       struct ptr_ring *ring = &tfile->tx_ring;
> >>>> +       struct net_device *dev;
> >>>> +       void *ptr;
> >>>> +
> >>>> +       spin_lock(&ring->consumer_lock);
> >>>> +
> >>>> +       ptr = __ptr_ring_consume(ring);
> >>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>
> >>> I guess it's the "bug" I mentioned in the previous patch that leads to
> >>> the check of __ptr_ring_consume_created_space() here. If it's true,
> >>> another call to tweak the current API.
> >>>
> >>>> +               rcu_read_lock();
> >>>> +               dev = rcu_dereference(tfile->tun)->dev;
> >>>> +               netif_wake_subqueue(dev, tfile->queue_index);
> >>>
> >>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
> >>> I'm not sure is what we want.
> >>
> >> What else would you suggest calling to wake the queue?
> >
> > I don't have a good method in my mind, just want to point out its implications.
>
> I have to admit I'm a bit stuck at this point, particularly with this
> aspect.
>
> What is the correct way to pass the producer CPU ID to the consumer?
> Would it make sense to store smp_processor_id() in the tfile inside
> tun_net_xmit(), or should it instead be stored in the skb (similar to the
> XDP bit)? In the latter case, my concern is that this information may
> already be significantly outdated by the time it is used.
>
> Based on that, my idea would be for the consumer to wake the producer by
> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
> smp_call_function_single().
> Is this a reasonable approach?

I'm not sure but it would introduce costs like IPI.

>
> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
> considered a deal-breaker for the patch set?

It depends on whether or not it has effects on the performance.
Especially when vhost is pinned.

Thanks

>
> Thanks!
>
> >
> >>
> >>>
> >>>> +               rcu_read_unlock();
> >>>> +       }
> >>>
> >>> Btw, this function duplicates a lot of logic of tap_ring_consume() we
> >>> should consider to merge the logic.
> >>
> >> Yes, it is largely the same approach, but it would require accessing the
> >> net_device each time.
> >
> > The problem is that, at least for TUN, the socket is loosely coupled
> > with the netdev. It means the netdev can go away while the socket
> > might still exist. That's why vhost only talks to the socket, not the
> > netdev. If we really want to go this way, here, we should at least
> > check the existence of tun->dev first.
> >
> >>
> >>>
> >>>> +
> >>>> +       spin_unlock(&ring->consumer_lock);
> >>>> +
> >>>> +       return ptr;
> >>>> +}
> >>>> +
> >>>>  static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> >>>>  {
> >>>>         DECLARE_WAITQUEUE(wait, current);
> >>>>         void *ptr = NULL;
> >>>>         int error = 0;
> >>>>
> >>>> -       ptr = ptr_ring_consume(&tfile->tx_ring);
> >>>> +       ptr = tun_ring_consume(tfile);
> >>>
> >>> I'm not sure having a separate patch like this may help. For example,
> >>> it will introduce a performance regression.
> >>
> >> I ran benchmarks for the whole patch set with noqueue (where the queue is
> >> not stopped to preserve the old behavior), as described in the cover
> >> letter, and observed no performance regression. This leads me to conclude
> >> that there is no performance impact because of this patch when the queue
> >> is not stopped.
> >
> > Have you run a benchmark per patch? Or it might just be that the
> > regression is not obvious. But at least this patch would introduce
> > more atomic operations, or it might just be because TUN doesn't
> > support bursts, so pktgen can't reach its best PPS.
> >
> > Thanks
> >
> >
> >>
> >>>
> >>>>         if (ptr)
> >>>>                 goto out;
> >>>>         if (noblock) {
> >>>> @@ -2131,7 +2152,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> >>>>
> >>>>         while (1) {
> >>>>                 set_current_state(TASK_INTERRUPTIBLE);
> >>>> -               ptr = ptr_ring_consume(&tfile->tx_ring);
> >>>> +               ptr = tun_ring_consume(tfile);
> >>>>                 if (ptr)
> >>>>                         break;
> >>>>                 if (signal_pending(current)) {
> >>>> --
> >>>> 2.43.0
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> >
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-22  5:35           ` Jason Wang
@ 2026-01-23  3:05             ` Jason Wang
  2026-01-23  9:54               ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-23  3:05 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
> >
> > On 1/9/26 07:02, Jason Wang wrote:
> > > On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> > > <simon.schippers@tu-dortmund.de> wrote:
> > >>
> > >> On 1/8/26 04:38, Jason Wang wrote:
> > >>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> > >>> <simon.schippers@tu-dortmund.de> wrote:
> > >>>>
> > >>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> > >>>> and wake the corresponding netdev subqueue when consuming an entry frees
> > >>>> space in the underlying ptr_ring.
> > >>>>
> > >>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
> > >>>> in an upcoming commit.
> > >>>>
> > >>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > >>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > >>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > >>>> ---
> > >>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
> > >>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
> > >>>>  2 files changed, 45 insertions(+), 3 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> > >>>> index 1197f245e873..2442cf7ac385 100644
> > >>>> --- a/drivers/net/tap.c
> > >>>> +++ b/drivers/net/tap.c
> > >>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
> > >>>>         return ret ? ret : total;
> > >>>>  }
> > >>>>
> > >>>> +static void *tap_ring_consume(struct tap_queue *q)
> > >>>> +{
> > >>>> +       struct ptr_ring *ring = &q->ring;
> > >>>> +       struct net_device *dev;
> > >>>> +       void *ptr;
> > >>>> +
> > >>>> +       spin_lock(&ring->consumer_lock);
> > >>>> +
> > >>>> +       ptr = __ptr_ring_consume(ring);
> > >>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> > >>>> +               rcu_read_lock();
> > >>>> +               dev = rcu_dereference(q->tap)->dev;
> > >>>> +               netif_wake_subqueue(dev, q->queue_index);
> > >>>> +               rcu_read_unlock();
> > >>>> +       }
> > >>>> +
> > >>>> +       spin_unlock(&ring->consumer_lock);
> > >>>> +
> > >>>> +       return ptr;
> > >>>> +}
> > >>>> +
> > >>>>  static ssize_t tap_do_read(struct tap_queue *q,
> > >>>>                            struct iov_iter *to,
> > >>>>                            int noblock, struct sk_buff *skb)
> > >>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
> > >>>>                                         TASK_INTERRUPTIBLE);
> > >>>>
> > >>>>                 /* Read frames from the queue */
> > >>>> -               skb = ptr_ring_consume(&q->ring);
> > >>>> +               skb = tap_ring_consume(q);
> > >>>>                 if (skb)
> > >>>>                         break;
> > >>>>                 if (noblock) {
> > >>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > >>>> index 8192740357a0..7148f9a844a4 100644
> > >>>> --- a/drivers/net/tun.c
> > >>>> +++ b/drivers/net/tun.c
> > >>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> > >>>>         return total;
> > >>>>  }
> > >>>>
> > >>>> +static void *tun_ring_consume(struct tun_file *tfile)
> > >>>> +{
> > >>>> +       struct ptr_ring *ring = &tfile->tx_ring;
> > >>>> +       struct net_device *dev;
> > >>>> +       void *ptr;
> > >>>> +
> > >>>> +       spin_lock(&ring->consumer_lock);
> > >>>> +
> > >>>> +       ptr = __ptr_ring_consume(ring);
> > >>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> > >>>
> > >>> I guess it's the "bug" I mentioned in the previous patch that leads to
> > >>> the check of __ptr_ring_consume_created_space() here. If it's true,
> > >>> another call to tweak the current API.
> > >>>
> > >>>> +               rcu_read_lock();
> > >>>> +               dev = rcu_dereference(tfile->tun)->dev;
> > >>>> +               netif_wake_subqueue(dev, tfile->queue_index);
> > >>>
> > >>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
> > >>> I'm not sure is what we want.
> > >>
> > >> What else would you suggest calling to wake the queue?
> > >
> > > I don't have a good method in my mind, just want to point out its implications.
> >
> > I have to admit I'm a bit stuck at this point, particularly with this
> > aspect.
> >
> > What is the correct way to pass the producer CPU ID to the consumer?
> > Would it make sense to store smp_processor_id() in the tfile inside
> > tun_net_xmit(), or should it instead be stored in the skb (similar to the
> > XDP bit)? In the latter case, my concern is that this information may
> > already be significantly outdated by the time it is used.
> >
> > Based on that, my idea would be for the consumer to wake the producer by
> > invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
> > smp_call_function_single().
> > Is this a reasonable approach?
>
> I'm not sure but it would introduce costs like IPI.
>
> >
> > More generally, would triggering TX_SOFTIRQ on the consumer CPU be
> > considered a deal-breaker for the patch set?
>
> It depends on whether or not it has effects on the performance.
> Especially when vhost is pinned.

I meant we can benchmark to see the impact. For example, pin vhost to
a specific CPU and then try to see the impact of the TX_SOFTIRQ.

Thanks


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-23  3:05             ` Jason Wang
@ 2026-01-23  9:54               ` Simon Schippers
  2026-01-27 16:47                 ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-23  9:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/23/26 04:05, Jason Wang wrote:
> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>> <simon.schippers@tu-dortmund.de> wrote:
>>>
>>> On 1/9/26 07:02, Jason Wang wrote:
>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>
>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>
>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>> space in the underlying ptr_ring.
>>>>>>>
>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>> in an upcoming commit.
>>>>>>>
>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>> ---
>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>> --- a/drivers/net/tap.c
>>>>>>> +++ b/drivers/net/tap.c
>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>         return ret ? ret : total;
>>>>>>>  }
>>>>>>>
>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>> +{
>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>> +       struct net_device *dev;
>>>>>>> +       void *ptr;
>>>>>>> +
>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>> +
>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>> +               rcu_read_lock();
>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>> +               rcu_read_unlock();
>>>>>>> +       }
>>>>>>> +
>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>> +
>>>>>>> +       return ptr;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>                            struct iov_iter *to,
>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>                                         TASK_INTERRUPTIBLE);
>>>>>>>
>>>>>>>                 /* Read frames from the queue */
>>>>>>> -               skb = ptr_ring_consume(&q->ring);
>>>>>>> +               skb = tap_ring_consume(q);
>>>>>>>                 if (skb)
>>>>>>>                         break;
>>>>>>>                 if (noblock) {
>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>> --- a/drivers/net/tun.c
>>>>>>> +++ b/drivers/net/tun.c
>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>         return total;
>>>>>>>  }
>>>>>>>
>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>> +{
>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>> +       struct net_device *dev;
>>>>>>> +       void *ptr;
>>>>>>> +
>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>> +
>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>
>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>> another call to tweak the current API.
>>>>>>
>>>>>>> +               rcu_read_lock();
>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>
>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>> I'm not sure is what we want.
>>>>>
>>>>> What else would you suggest calling to wake the queue?
>>>>
>>>> I don't have a good method in my mind, just want to point out its implications.
>>>
>>> I have to admit I'm a bit stuck at this point, particularly with this
>>> aspect.
>>>
>>> What is the correct way to pass the producer CPU ID to the consumer?
>>> Would it make sense to store smp_processor_id() in the tfile inside
>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>> XDP bit)? In the latter case, my concern is that this information may
>>> already be significantly outdated by the time it is used.
>>>
>>> Based on that, my idea would be for the consumer to wake the producer by
>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>> smp_call_function_single().
>>> Is this a reasonable approach?
>>
>> I'm not sure but it would introduce costs like IPI.
>>
>>>
>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>> considered a deal-breaker for the patch set?
>>
>> It depends on whether or not it has effects on the performance.
>> Especially when vhost is pinned.
> 
> I meant we can benchmark to see the impact. For example, pin vhost to
> a specific CPU and then try to see the impact of the TX_SOFTIRQ.
> 
> Thanks
> 

I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
for both the stock and patched versions. The benchmarks were run with
the full patch series applied, since testing only patches 1-3 would not
be meaningful: the queue is never stopped in that case, so no
TX_SOFTIRQ is triggered.
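
For reproducibility, the pinning step can be scripted. This is a rough
sketch only; the "qemu-system" process name and the "vhost-<pid>"
kernel-thread naming are assumptions about this particular setup:

```shell
# Pin all vhost worker threads of a running QEMU instance to CPU 0.
# Assumes a single QEMU process and the usual vhost-<pid> thread names.
QEMU_PID=$(pgrep -o -f qemu-system || true)
if [ -n "$QEMU_PID" ]; then
	for TID in $(pgrep -f "vhost-$QEMU_PID"); do
		taskset -p -c 0 "$TID"
	done
else
	echo "no qemu-system process found"
fi
```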

Compared to the non-pinned CPU benchmarks in the cover letter,
performance is lower for pktgen with a single thread but higher with
four threads. The results show no regression for the patched version,
with even slight performance improvements observed:

+-------------------------+-----------+----------------+
| pktgen benchmarks to    | Stock     | Patched with   |
| Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
| 100M packets            |           |                |
| vhost pinned to core 0  |           |                |
+-----------+-------------+-----------+----------------+
| TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
|  +        +-------------+-----------+----------------+
| vhost-net | Lost        | 1154 Kpps | 0              |
+-----------+-------------+-----------+----------------+

+-------------------------+-----------+----------------+
| pktgen benchmarks to    | Stock     | Patched with   |
| Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
| 100M packets            |           |                |
| vhost pinned to core 0  |           |                |
| *4 threads*             |           |                |
+-----------+-------------+-----------+----------------+
| TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
|  +        +-------------+-----------+----------------+
| vhost-net | Lost        | 1527 Kpps | 0              |
+-----------+-------------+-----------+----------------+

+------------------------+-------------+----------------+
| iperf3 TCP benchmarks  | Stock       | Patched with   |
| to Debian VM 120s      |             | fq_codel qdisc |
| vhost pinned to core 0 |             |                |
+------------------------+-------------+----------------+
| TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
|  +                     |             |                |
| vhost-net              |             |                |
+------------------------+-------------+----------------+

+---------------------------+-------------+----------------+
| iperf3 TCP benchmarks     | Stock       | Patched with   |
| to Debian VM 120s         |             | fq_codel qdisc |
| vhost pinned to core 0    |             |                |
| *4 iperf3 client threads* |             |                |
+---------------------------+-------------+----------------+
| TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
|  +                        |             |                |
| vhost-net                 |             |                |
+---------------------------+-------------+----------------+

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-23  9:54               ` Simon Schippers
@ 2026-01-27 16:47                 ` Simon Schippers
  2026-01-28  7:03                   ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-27 16:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/23/26 10:54, Simon Schippers wrote:
> On 1/23/26 04:05, Jason Wang wrote:
>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 1/9/26 07:02, Jason Wang wrote:
>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>
>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>>> space in the underlying ptr_ring.
>>>>>>>>
>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>>> in an upcoming commit.
>>>>>>>>
>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>> ---
>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>         return ret ? ret : total;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>>> +{
>>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>>> +       struct net_device *dev;
>>>>>>>> +       void *ptr;
>>>>>>>> +
>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>> +
>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>> +               rcu_read_lock();
>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>>> +               rcu_read_unlock();
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>>> +
>>>>>>>> +       return ptr;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>                            struct iov_iter *to,
>>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>                                         TASK_INTERRUPTIBLE);
>>>>>>>>
>>>>>>>>                 /* Read frames from the queue */
>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
>>>>>>>> +               skb = tap_ring_consume(q);
>>>>>>>>                 if (skb)
>>>>>>>>                         break;
>>>>>>>>                 if (noblock) {
>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>         return total;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>>> +{
>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>>> +       struct net_device *dev;
>>>>>>>> +       void *ptr;
>>>>>>>> +
>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>> +
>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>
>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>>> another call to tweak the current API.
>>>>>>>
>>>>>>>> +               rcu_read_lock();
>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>>
>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>>> I'm not sure is what we want.
>>>>>>
>>>>>> What else would you suggest calling to wake the queue?
>>>>>
>>>>> I don't have a good method in my mind, just want to point out its implications.
>>>>
>>>> I have to admit I'm a bit stuck at this point, particularly with this
>>>> aspect.
>>>>
>>>> What is the correct way to pass the producer CPU ID to the consumer?
>>>> Would it make sense to store smp_processor_id() in the tfile inside
>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>>> XDP bit)? In the latter case, my concern is that this information may
>>>> already be significantly outdated by the time it is used.
>>>>
>>>> Based on that, my idea would be for the consumer to wake the producer by
>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>>> smp_call_function_single().
>>>> Is this a reasonable approach?
>>>
>>> I'm not sure but it would introduce costs like IPI.
>>>
>>>>
>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>>> considered a deal-breaker for the patch set?
>>>
>>> It depends on whether or not it has effects on the performance.
>>> Especially when vhost is pinned.
>>
>> I meant we can benchmark to see the impact. For example, pin vhost to
>> a specific CPU and then try to see the impact of the TX_SOFTIRQ.
>>
>> Thanks
>>
> 
> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
> for both the stock and patched versions. The benchmarks were run with
> the full patch series applied, since testing only patches 1-3 would not
> be meaningful - the queue is never stopped in that case, so no
> TX_SOFTIRQ is triggered.
> 
> Compared to the non-pinned CPU benchmarks in the cover letter,
> performance is lower for pktgen with a single thread but higher with
> four threads. The results show no regression for the patched version,
> with even slight performance improvements observed:
> 
> +-------------------------+-----------+----------------+
> | pktgen benchmarks to    | Stock     | Patched with   |
> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> | 100M packets            |           |                |
> | vhost pinned to core 0  |           |                |
> +-----------+-------------+-----------+----------------+
> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
> |  +        +-------------+-----------+----------------+
> | vhost-net | Lost        | 1154 Kpps | 0              |
> +-----------+-------------+-----------+----------------+
> 
> +-------------------------+-----------+----------------+
> | pktgen benchmarks to    | Stock     | Patched with   |
> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> | 100M packets            |           |                |
> | vhost pinned to core 0  |           |                |
> | *4 threads*             |           |                |
> +-----------+-------------+-----------+----------------+
> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
> |  +        +-------------+-----------+----------------+
> | vhost-net | Lost        | 1527 Kpps | 0              |
> +-----------+-------------+-----------+----------------+
> 
> +------------------------+-------------+----------------+
> | iperf3 TCP benchmarks  | Stock       | Patched with   |
> | to Debian VM 120s      |             | fq_codel qdisc |
> | vhost pinned to core 0 |             |                |
> +------------------------+-------------+----------------+
> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
> |  +                     |             |                |
> | vhost-net              |             |                |
> +------------------------+-------------+----------------+
> 
> +---------------------------+-------------+----------------+
> | iperf3 TCP benchmarks     | Stock       | Patched with   |
> | to Debian VM 120s         |             | fq_codel qdisc |
> | vhost pinned to core 0    |             |                |
> | *4 iperf3 client threads* |             |                |
> +---------------------------+-------------+----------------+
> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
> |  +                        |             |                |
> | vhost-net                 |             |                |
> +---------------------------+-------------+----------------+

What are your thoughts on this?

Thanks!


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-27 16:47                 ` Simon Schippers
@ 2026-01-28  7:03                   ` Jason Wang
  2026-01-28  7:53                     ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-28  7:03 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/23/26 10:54, Simon Schippers wrote:
> > On 1/23/26 04:05, Jason Wang wrote:
> >> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>
> >>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 1/9/26 07:02, Jason Wang wrote:
> >>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> On 1/8/26 04:38, Jason Wang wrote:
> >>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>
> >>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> >>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
> >>>>>>>> space in the underlying ptr_ring.
> >>>>>>>>
> >>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
> >>>>>>>> in an upcoming commit.
> >>>>>>>>
> >>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>>>> ---
> >>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
> >>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
> >>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>>>>>> index 1197f245e873..2442cf7ac385 100644
> >>>>>>>> --- a/drivers/net/tap.c
> >>>>>>>> +++ b/drivers/net/tap.c
> >>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>>>>>         return ret ? ret : total;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
> >>>>>>>> +{
> >>>>>>>> +       struct ptr_ring *ring = &q->ring;
> >>>>>>>> +       struct net_device *dev;
> >>>>>>>> +       void *ptr;
> >>>>>>>> +
> >>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>> +
> >>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>> +               rcu_read_lock();
> >>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
> >>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
> >>>>>>>> +               rcu_read_unlock();
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       spin_unlock(&ring->consumer_lock);
> >>>>>>>> +
> >>>>>>>> +       return ptr;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>                            struct iov_iter *to,
> >>>>>>>>                            int noblock, struct sk_buff *skb)
> >>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>                                         TASK_INTERRUPTIBLE);
> >>>>>>>>
> >>>>>>>>                 /* Read frames from the queue */
> >>>>>>>> -               skb = ptr_ring_consume(&q->ring);
> >>>>>>>> +               skb = tap_ring_consume(q);
> >>>>>>>>                 if (skb)
> >>>>>>>>                         break;
> >>>>>>>>                 if (noblock) {
> >>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>>>> index 8192740357a0..7148f9a844a4 100644
> >>>>>>>> --- a/drivers/net/tun.c
> >>>>>>>> +++ b/drivers/net/tun.c
> >>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>>>>>         return total;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
> >>>>>>>> +{
> >>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
> >>>>>>>> +       struct net_device *dev;
> >>>>>>>> +       void *ptr;
> >>>>>>>> +
> >>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>> +
> >>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>
> >>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
> >>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
> >>>>>>> another call to tweak the current API.
> >>>>>>>
> >>>>>>>> +               rcu_read_lock();
> >>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
> >>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
> >>>>>>>
> >>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
> >>>>>>> I'm not sure is what we want.
> >>>>>>
> >>>>>> What else would you suggest calling to wake the queue?
> >>>>>
> >>>>> I don't have a good method in my mind, just want to point out its implications.
> >>>>
> >>>> I have to admit I'm a bit stuck at this point, particularly with this
> >>>> aspect.
> >>>>
> >>>> What is the correct way to pass the producer CPU ID to the consumer?
> >>>> Would it make sense to store smp_processor_id() in the tfile inside
> >>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
> >>>> XDP bit)? In the latter case, my concern is that this information may
> >>>> already be significantly outdated by the time it is used.
> >>>>
> >>>> Based on that, my idea would be for the consumer to wake the producer by
> >>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
> >>>> smp_call_function_single().
> >>>> Is this a reasonable approach?
> >>>
> >>> I'm not sure but it would introduce costs like IPI.
> >>>
> >>>>
> >>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
> >>>> considered a deal-breaker for the patch set?
> >>>
> >>> It depends on whether or not it has effects on the performance.
> >>> Especially when vhost is pinned.
> >>
> >> I meant we can benchmark to see the impact. For example, pin vhost to
> >> a specific CPU and then try to see the impact of the TX_SOFTIRQ.
> >>
> >> Thanks
> >>
> >
> > I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
> > for both the stock and patched versions. The benchmarks were run with
> > the full patch series applied, since testing only patches 1-3 would not
> > be meaningful - the queue is never stopped in that case, so no
> > TX_SOFTIRQ is triggered.
> >
> > Compared to the non-pinned CPU benchmarks in the cover letter,
> > performance is lower for pktgen with a single thread but higher with
> > four threads. The results show no regression for the patched version,
> > with even slight performance improvements observed:
> >
> > +-------------------------+-----------+----------------+
> > | pktgen benchmarks to    | Stock     | Patched with   |
> > | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> > | 100M packets            |           |                |
> > | vhost pinned to core 0  |           |                |
> > +-----------+-------------+-----------+----------------+
> > | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
> > |  +        +-------------+-----------+----------------+
> > | vhost-net | Lost        | 1154 Kpps | 0              |
> > +-----------+-------------+-----------+----------------+
> >
> > +-------------------------+-----------+----------------+
> > | pktgen benchmarks to    | Stock     | Patched with   |
> > | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> > | 100M packets            |           |                |
> > | vhost pinned to core 0  |           |                |
> > | *4 threads*             |           |                |
> > +-----------+-------------+-----------+----------------+
> > | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
> > |  +        +-------------+-----------+----------------+
> > | vhost-net | Lost        | 1527 Kpps | 0              |
> > +-----------+-------------+-----------+----------------+

The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
the guest or an XDP program that does XDP_DROP in the guest.
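
Such a guest-side drop program is tiny. A minimal sketch (function and
file names are illustrative; compile with clang -O2 -target bpf and
attach with ip link set ... xdp obj ...):

```c
/* Minimal XDP program that drops every packet it sees.  Attaching it
 * to the guest NIC takes the guest RX path out of the measurement so
 * the host-side TX rate alone is benchmarked.  Sketch only.
 */
#include <linux/bpf.h>

#ifndef SEC
#define SEC(name) __attribute__((section(name), used))
#endif

SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
	(void)ctx; /* no packet parsing needed, drop unconditionally */
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";
```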

> >
> > +------------------------+-------------+----------------+
> > | iperf3 TCP benchmarks  | Stock       | Patched with   |
> > | to Debian VM 120s      |             | fq_codel qdisc |
> > | vhost pinned to core 0 |             |                |
> > +------------------------+-------------+----------------+
> > | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
> > |  +                     |             |                |
> > | vhost-net              |             |                |
> > +------------------------+-------------+----------------+
> >
> > +---------------------------+-------------+----------------+
> > | iperf3 TCP benchmarks     | Stock       | Patched with   |
> > | to Debian VM 120s         |             | fq_codel qdisc |
> > | vhost pinned to core 0    |             |                |
> > | *4 iperf3 client threads* |             |                |
> > +---------------------------+-------------+----------------+
> > | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
> > |  +                        |             |                |
> > | vhost-net                 |             |                |
> > +---------------------------+-------------+----------------+
>
> What are your thoughts on this?
>
> Thanks!
>
>

Thanks


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-28  7:03                   ` Jason Wang
@ 2026-01-28  7:53                     ` Simon Schippers
  2026-01-29  1:14                       ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-28  7:53 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/28/26 08:03, Jason Wang wrote:
> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/23/26 10:54, Simon Schippers wrote:
>>> On 1/23/26 04:05, Jason Wang wrote:
>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> On 1/9/26 07:02, Jason Wang wrote:
>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>
>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>
>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>>>>> space in the underlying ptr_ring.
>>>>>>>>>>
>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>>>>> in an upcoming commit.
>>>>>>>>>>
>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>>> ---
>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>>>         return ret ? ret : total;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>>>>> +{
>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>> +       void *ptr;
>>>>>>>>>> +
>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>> +
>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>>>>> +               rcu_read_unlock();
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>>>>> +
>>>>>>>>>> +       return ptr;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>                            struct iov_iter *to,
>>>>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
>>>>>>>>>>
>>>>>>>>>>                 /* Read frames from the queue */
>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
>>>>>>>>>> +               skb = tap_ring_consume(q);
>>>>>>>>>>                 if (skb)
>>>>>>>>>>                         break;
>>>>>>>>>>                 if (noblock) {
>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>>>         return total;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>>>>> +{
>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>> +       void *ptr;
>>>>>>>>>> +
>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>> +
>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>
>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>>>>> another call to tweak the current API.
>>>>>>>>>
>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>>>>
>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>>>>> I'm not sure is what we want.
>>>>>>>>
>>>>>>>> What else would you suggest calling to wake the queue?
>>>>>>>
>>>>>>> I don't have a good method in my mind, just want to point out its implications.
>>>>>>
>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
>>>>>> aspect.
>>>>>>
>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>>>>> XDP bit)? In the latter case, my concern is that this information may
>>>>>> already be significantly outdated by the time it is used.
>>>>>>
>>>>>> Based on that, my idea would be for the consumer to wake the producer by
>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>>>>> smp_call_function_single().
>>>>>> Is this a reasonable approach?
>>>>>
>>>>> I'm not sure but it would introduce costs like IPI.
>>>>>
>>>>>>
>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>>>>> considered a deal-breaker for the patch set?
>>>>>
>>>>> It depends on whether or not it has effects on the performance.
>>>>> Especially when vhost is pinned.
>>>>
>>>> I meant we can benchmark to see the impact. For example, pin vhost to
>>>> a specific CPU and then try to see the impact of the TX_SOFTIRQ.
>>>>
>>>> Thanks
>>>>
>>>
>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
>>> for both the stock and patched versions. The benchmarks were run with
>>> the full patch series applied, since testing only patches 1-3 would not
>>> be meaningful - the queue is never stopped in that case, so no
>>> TX_SOFTIRQ is triggered.
>>>
>>> Compared to the non-pinned CPU benchmarks in the cover letter,
>>> performance is lower for pktgen with a single thread but higher with
>>> four threads. The results show no regression for the patched version,
>>> with even slight performance improvements observed:
>>>
>>> +-------------------------+-----------+----------------+
>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>> | 100M packets            |           |                |
>>> | vhost pinned to core 0  |           |                |
>>> +-----------+-------------+-----------+----------------+
>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
>>> |  +        +-------------+-----------+----------------+
>>> | vhost-net | Lost        | 1154 Kpps | 0              |
>>> +-----------+-------------+-----------+----------------+
>>>
>>> +-------------------------+-----------+----------------+
>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>> | 100M packets            |           |                |
>>> | vhost pinned to core 0  |           |                |
>>> | *4 threads*             |           |                |
>>> +-----------+-------------+-----------+----------------+
>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
>>> |  +        +-------------+-----------+----------------+
>>> | vhost-net | Lost        | 1527 Kpps | 0              |
>>> +-----------+-------------+-----------+----------------+
> 
> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
> the guest or an XDP program that does XDP_DROP in the guest.

I forgot to mention that these PPS values are per thread.
So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
respectively. For packet loss, that comes out to 1527 Kpps * 4 =
6108 Kpps and 0, respectively.
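
The per-thread to aggregate conversion can be double-checked with a
quick sketch (per-thread transmit rates taken from the 4-thread pktgen
table above):

```python
threads = 4

# per-thread transmitted rates from the 4-thread pktgen table (Kpps)
per_thread_kpps = {"stock": 71, "patched": 79}

# aggregate rate across all pktgen threads
aggregate = {name: rate * threads for name, rate in per_thread_kpps.items()}
print(aggregate)  # {'stock': 284, 'patched': 316}
```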

Sorry about that!

The pktgen benchmarks with a single thread look fine, right?

I'll still look into using an XDP program that does XDP_DROP in the
guest.

Thanks!

> 
>>>
>>> +------------------------+-------------+----------------+
>>> | iperf3 TCP benchmarks  | Stock       | Patched with   |
>>> | to Debian VM 120s      |             | fq_codel qdisc |
>>> | vhost pinned to core 0 |             |                |
>>> +------------------------+-------------+----------------+
>>> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
>>> |  +                     |             |                |
>>> | vhost-net              |             |                |
>>> +------------------------+-------------+----------------+
>>>
>>> +---------------------------+-------------+----------------+
>>> | iperf3 TCP benchmarks     | Stock       | Patched with   |
>>> | to Debian VM 120s         |             | fq_codel qdisc |
>>> | vhost pinned to core 0    |             |                |
>>> | *4 iperf3 client threads* |             |                |
>>> +---------------------------+-------------+----------------+
>>> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
>>> |  +                        |             |                |
>>> | vhost-net                 |             |                |
>>> +---------------------------+-------------+----------------+
>>
>> What are your thoughts on this?
>>
>> Thanks!
>>
>>
> 
> Thanks
> 


* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-28  7:53                     ` Simon Schippers
@ 2026-01-29  1:14                       ` Jason Wang
  2026-01-29  9:24                         ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-29  1:14 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/28/26 08:03, Jason Wang wrote:
> > On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/23/26 10:54, Simon Schippers wrote:
> >>> On 1/23/26 04:05, Jason Wang wrote:
> >>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>
> >>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> On 1/9/26 07:02, Jason Wang wrote:
> >>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> >>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>
> >>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
> >>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> >>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
> >>>>>>>>>> space in the underlying ptr_ring.
> >>>>>>>>>>
> >>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
> >>>>>>>>>> in an upcoming commit.
> >>>>>>>>>>
> >>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>>>>>> ---
> >>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
> >>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
> >>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
> >>>>>>>>>> --- a/drivers/net/tap.c
> >>>>>>>>>> +++ b/drivers/net/tap.c
> >>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>>>>>>>         return ret ? ret : total;
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
> >>>>>>>>>> +{
> >>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
> >>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>> +       void *ptr;
> >>>>>>>>>> +
> >>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>> +
> >>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
> >>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
> >>>>>>>>>> +               rcu_read_unlock();
> >>>>>>>>>> +       }
> >>>>>>>>>> +
> >>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
> >>>>>>>>>> +
> >>>>>>>>>> +       return ptr;
> >>>>>>>>>> +}
> >>>>>>>>>> +
> >>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>                            struct iov_iter *to,
> >>>>>>>>>>                            int noblock, struct sk_buff *skb)
> >>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>                                         TASK_INTERRUPTIBLE);
> >>>>>>>>>>
> >>>>>>>>>>                 /* Read frames from the queue */
> >>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
> >>>>>>>>>> +               skb = tap_ring_consume(q);
> >>>>>>>>>>                 if (skb)
> >>>>>>>>>>                         break;
> >>>>>>>>>>                 if (noblock) {
> >>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
> >>>>>>>>>> --- a/drivers/net/tun.c
> >>>>>>>>>> +++ b/drivers/net/tun.c
> >>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>>>>>>>         return total;
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
> >>>>>>>>>> +{
> >>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
> >>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>> +       void *ptr;
> >>>>>>>>>> +
> >>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>> +
> >>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>
> >>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
> >>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
> >>>>>>>>> another call to tweak the current API.
> >>>>>>>>>
> >>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
> >>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
> >>>>>>>>>
> >>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
> >>>>>>>>> I'm not sure is what we want.
> >>>>>>>>
> >>>>>>>> What else would you suggest calling to wake the queue?
> >>>>>>>
> >>>>>>> I don't have a good method in my mind, just want to point out its implications.
> >>>>>>
> >>>>>> I have to admit I'm a bit stuck at this point, particularly with this
> >>>>>> aspect.
> >>>>>>
> >>>>>> What is the correct way to pass the producer CPU ID to the consumer?
> >>>>>> Would it make sense to store smp_processor_id() in the tfile inside
> >>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
> >>>>>> XDP bit)? In the latter case, my concern is that this information may
> >>>>>> already be significantly outdated by the time it is used.
> >>>>>>
> >>>>>> Based on that, my idea would be for the consumer to wake the producer by
> >>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
> >>>>>> smp_call_function_single().
> >>>>>> Is this a reasonable approach?
> >>>>>
> >>>>> I'm not sure but it would introduce costs like IPI.
> >>>>>
> >>>>>>
> >>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
> >>>>>> considered a deal-breaker for the patch set?
> >>>>>
> >>>>> It depends on whether or not it has effects on the performance.
> >>>>> Especially when vhost is pinned.
> >>>>
> >>>> I meant we can benchmark to see the impact. For example, pin vhost to
> >>>> a specific CPU and then try to see the impact of the TX_SOFTIRQ.
> >>>>
> >>>> Thanks
> >>>>
> >>>
> >>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
> >>> for both the stock and patched versions. The benchmarks were run with
> >>> the full patch series applied, since testing only patches 1-3 would not
> >>> be meaningful - the queue is never stopped in that case, so no
> >>> TX_SOFTIRQ is triggered.
> >>>
> >>> Compared to the non-pinned CPU benchmarks in the cover letter,
> >>> performance is lower for pktgen with a single thread but higher with
> >>> four threads. The results show no regression for the patched version,
> >>> with even slight performance improvements observed:
> >>>
> >>> +-------------------------+-----------+----------------+
> >>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>> | 100M packets            |           |                |
> >>> | vhost pinned to core 0  |           |                |
> >>> +-----------+-------------+-----------+----------------+
> >>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
> >>> |  +        +-------------+-----------+----------------+
> >>> | vhost-net | Lost        | 1154 Kpps | 0              |
> >>> +-----------+-------------+-----------+----------------+
> >>>
> >>> +-------------------------+-----------+----------------+
> >>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>> | 100M packets            |           |                |
> >>> | vhost pinned to core 0  |           |                |
> >>> | *4 threads*             |           |                |
> >>> +-----------+-------------+-----------+----------------+
> >>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
> >>> |  +        +-------------+-----------+----------------+
> >>> | vhost-net | Lost        | 1527 Kpps | 0              |
> >>> +-----------+-------------+-----------+----------------+
> >
> > The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
> > the guest or an xdp program that did XDP_DROP in the guest.
>
> I forgot to mention that these PPS values are per thread.
> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
> respectively. For packet loss, that comes out to 1527 Kpps * 4 =
> 6108 Kpps and 0, respectively.
>
> Sorry about that!
>
> The pktgen benchmarks with a single thread look fine, right?

Still looks very low. E.g., I just did a run of pktgen (using
pktgen_sample03_burst_single_flow.sh) without XDP_DROP in the guest,
and I can get 1 Mpps.

>
> I'll still look into using an XDP program that does XDP_DROP in the
> guest.
>
> Thanks!

Thanks

>
> >
> >>>
> >>> +------------------------+-------------+----------------+
> >>> | iperf3 TCP benchmarks  | Stock       | Patched with   |
> >>> | to Debian VM 120s      |             | fq_codel qdisc |
> >>> | vhost pinned to core 0 |             |                |
> >>> +------------------------+-------------+----------------+
> >>> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
> >>> |  +                     |             |                |
> >>> | vhost-net              |             |                |
> >>> +------------------------+-------------+----------------+
> >>>
> >>> +---------------------------+-------------+----------------+
> >>> | iperf3 TCP benchmarks     | Stock       | Patched with   |
> >>> | to Debian VM 120s         |             | fq_codel qdisc |
> >>> | vhost pinned to core 0    |             |                |
> >>> | *4 iperf3 client threads* |             |                |
> >>> +---------------------------+-------------+----------------+
> >>> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
> >>> |  +                        |             |                |
> >>> | vhost-net                 |             |                |
> >>> +---------------------------+-------------+----------------+
> >>
> >> What are your thoughts on this?
> >>
> >> Thanks!
> >>
> >>
> >
> > Thanks
> >
>



* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-29  1:14                       ` Jason Wang
@ 2026-01-29  9:24                         ` Simon Schippers
  2026-01-30  1:51                           ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-01-29  9:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/29/26 02:14, Jason Wang wrote:
> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/28/26 08:03, Jason Wang wrote:
>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 1/23/26 10:54, Simon Schippers wrote:
>>>>> On 1/23/26 04:05, Jason Wang wrote:
>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>
>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>
>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>
>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>>>>>>> space in the underlying ptr_ring.
>>>>>>>>>>>>
>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>>>>>>> in an upcoming commit.
>>>>>>>>>>>>
>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>>>>>         return ret ? ret : total;
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>> +
>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>> +
>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>>>>>>> +               rcu_read_unlock();
>>>>>>>>>>>> +       }
>>>>>>>>>>>> +
>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>>>>>>> +
>>>>>>>>>>>> +       return ptr;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>                            struct iov_iter *to,
>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
>>>>>>>>>>>>
>>>>>>>>>>>>                 /* Read frames from the queue */
>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
>>>>>>>>>>>> +               skb = tap_ring_consume(q);
>>>>>>>>>>>>                 if (skb)
>>>>>>>>>>>>                         break;
>>>>>>>>>>>>                 if (noblock) {
>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>>>>>         return total;
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>> +
>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>> +
>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>
>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>>>>>>> another call to tweak the current API.
>>>>>>>>>>>
>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>>>>>>
>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>>>>>>> I'm not sure is what we want.
>>>>>>>>>>
>>>>>>>>>> What else would you suggest calling to wake the queue?
>>>>>>>>>
>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
>>>>>>>>
>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
>>>>>>>> aspect.
>>>>>>>>
>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
>>>>>>>> already be significantly outdated by the time it is used.
>>>>>>>>
>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>>>>>>> smp_call_function_single().
>>>>>>>> Is this a reasonable approach?
>>>>>>>
>>>>>>> I'm not sure but it would introduce costs like IPI.
>>>>>>>
>>>>>>>>
>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>>>>>>> considered a deal-breaker for the patch set?
>>>>>>>
>>>>>>> It depends on whether or not it has effects on the performance.
>>>>>>> Especially when vhost is pinned.
>>>>>>
>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
>>>>>> a specific CPU and then try to see the impact of the TX_SOFTIRQ.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
>>>>> for both the stock and patched versions. The benchmarks were run with
>>>>> the full patch series applied, since testing only patches 1-3 would not
>>>>> be meaningful - the queue is never stopped in that case, so no
>>>>> TX_SOFTIRQ is triggered.
>>>>>
>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
>>>>> performance is lower for pktgen with a single thread but higher with
>>>>> four threads. The results show no regression for the patched version,
>>>>> with even slight performance improvements observed:
>>>>>
>>>>> +-------------------------+-----------+----------------+
>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>> | 100M packets            |           |                |
>>>>> | vhost pinned to core 0  |           |                |
>>>>> +-----------+-------------+-----------+----------------+
>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
>>>>> |  +        +-------------+-----------+----------------+
>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
>>>>> +-----------+-------------+-----------+----------------+
>>>>>
>>>>> +-------------------------+-----------+----------------+
>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>> | 100M packets            |           |                |
>>>>> | vhost pinned to core 0  |           |                |
>>>>> | *4 threads*             |           |                |
>>>>> +-----------+-------------+-----------+----------------+
>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
>>>>> |  +        +-------------+-----------+----------------+
>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
>>>>> +-----------+-------------+-----------+----------------+
>>>
>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
>>> the guest or an xdp program that did XDP_DROP in the guest.
>>
>> I forgot to mention that these PPS values are per thread.
>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
>> respectively. For packet loss, that comes out to 1527 Kpps * 4 =
>> 6108 Kpps and 0, respectively.
>>
>> Sorry about that!
>>
>> The pktgen benchmarks with a single thread look fine, right?
> 
> Still looks very low. E.g., I just did a run of pktgen (using
> pktgen_sample03_burst_single_flow.sh) without XDP_DROP in the guest,
> and I can get 1 Mpps.

Keep in mind that I am using an older CPU (i5-6300HQ). For the
single-threaded tests I always used pktgen_sample01_simple.sh, and for
the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.

Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
though the same parameters work fine for sample01 and sample02):

samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
/samples/pktgen/functions.sh: line 79: echo: write error: Operation not
supported
ERROR: Write error(1) occurred
cmd: "burst 32 > /proc/net/pktgen/tap0@0"

...and I do not know what I am doing wrong, even after looking at
Documentation/networking/pktgen.rst. Every burst size except 1 fails.
Any clues?

Thanks!

> 
>>
>> I'll still look into using an XDP program that does XDP_DROP in the
>> guest.
>>
>> Thanks!
> 
> Thanks
> 
>>
>>>
>>>>>
>>>>> +------------------------+-------------+----------------+
>>>>> | iperf3 TCP benchmarks  | Stock       | Patched with   |
>>>>> | to Debian VM 120s      |             | fq_codel qdisc |
>>>>> | vhost pinned to core 0 |             |                |
>>>>> +------------------------+-------------+----------------+
>>>>> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
>>>>> |  +                     |             |                |
>>>>> | vhost-net              |             |                |
>>>>> +------------------------+-------------+----------------+
>>>>>
>>>>> +---------------------------+-------------+----------------+
>>>>> | iperf3 TCP benchmarks     | Stock       | Patched with   |
>>>>> | to Debian VM 120s         |             | fq_codel qdisc |
>>>>> | vhost pinned to core 0    |             |                |
>>>>> | *4 iperf3 client threads* |             |                |
>>>>> +---------------------------+-------------+----------------+
>>>>> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
>>>>> |  +                        |             |                |
>>>>> | vhost-net                 |             |                |
>>>>> +---------------------------+-------------+----------------+
>>>>
>>>> What are your thoughts on this?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>
>>> Thanks
>>>
>>
> 


* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-29  9:24                         ` Simon Schippers
@ 2026-01-30  1:51                           ` Jason Wang
  2026-02-01 20:19                             ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-01-30  1:51 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/29/26 02:14, Jason Wang wrote:
> > On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/28/26 08:03, Jason Wang wrote:
> >>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 1/23/26 10:54, Simon Schippers wrote:
> >>>>> On 1/23/26 04:05, Jason Wang wrote:
> >>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
> >>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>
> >>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
> >>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> >>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
> >>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> >>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
> >>>>>>>>>>>> space in the underlying ptr_ring.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
> >>>>>>>>>>>> in an upcoming commit.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
> >>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
> >>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
> >>>>>>>>>>>> --- a/drivers/net/tap.c
> >>>>>>>>>>>> +++ b/drivers/net/tap.c
> >>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>>>>>>>>>         return ret ? ret : total;
> >>>>>>>>>>>>  }
> >>>>>>>>>>>>
> >>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
> >>>>>>>>>>>> +{
> >>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
> >>>>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>>>> +       void *ptr;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
> >>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
> >>>>>>>>>>>> +               rcu_read_unlock();
> >>>>>>>>>>>> +       }
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +       return ptr;
> >>>>>>>>>>>> +}
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>>>                            struct iov_iter *to,
> >>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
> >>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
> >>>>>>>>>>>>
> >>>>>>>>>>>>                 /* Read frames from the queue */
> >>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
> >>>>>>>>>>>> +               skb = tap_ring_consume(q);
> >>>>>>>>>>>>                 if (skb)
> >>>>>>>>>>>>                         break;
> >>>>>>>>>>>>                 if (noblock) {
> >>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
> >>>>>>>>>>>> --- a/drivers/net/tun.c
> >>>>>>>>>>>> +++ b/drivers/net/tun.c
> >>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>>>>>>>>>         return total;
> >>>>>>>>>>>>  }
> >>>>>>>>>>>>
> >>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
> >>>>>>>>>>>> +{
> >>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
> >>>>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>>>> +       void *ptr;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>>>
> >>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
> >>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
> >>>>>>>>>>> another call to tweak the current API.
> >>>>>>>>>>>
> >>>>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
> >>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
> >>>>>>>>>>>
> >>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
> >>>>>>>>>>> I'm not sure is what we want.
> >>>>>>>>>>
> >>>>>>>>>> What else would you suggest calling to wake the queue?
> >>>>>>>>>
> >>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
> >>>>>>>>
> >>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
> >>>>>>>> aspect.
> >>>>>>>>
> >>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
> >>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
> >>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
> >>>>>>>> XDP bit)? In the latter case, my concern is that this information may
> >>>>>>>> already be significantly outdated by the time it is used.
> >>>>>>>>
> >>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
> >>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
> >>>>>>>> smp_call_function_single().
> >>>>>>>> Is this a reasonable approach?
> >>>>>>>
> >>>>>>> I'm not sure but it would introduce costs like IPI.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
> >>>>>>>> considered a deal-breaker for the patch set?
> >>>>>>>
> >>>>>>> It depends on whether or not it has effects on the performance.
> >>>>>>> Especially when vhost is pinned.
> >>>>>>
> >>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
> >>>>>> a specific CPU and then try to see the impact of the TX_SOFTIRQ.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>
> >>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
> >>>>> for both the stock and patched versions. The benchmarks were run with
> >>>>> the full patch series applied, since testing only patches 1-3 would not
> >>>>> be meaningful - the queue is never stopped in that case, so no
> >>>>> TX_SOFTIRQ is triggered.
> >>>>>
> >>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
> >>>>> performance is lower for pktgen with a single thread but higher with
> >>>>> four threads. The results show no regression for the patched version,
> >>>>> with even slight performance improvements observed:
> >>>>>
> >>>>> +-------------------------+-----------+----------------+
> >>>>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>>>> | 100M packets            |           |                |
> >>>>> | vhost pinned to core 0  |           |                |
> >>>>> +-----------+-------------+-----------+----------------+
> >>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
> >>>>> |  +        +-------------+-----------+----------------+
> >>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
> >>>>> +-----------+-------------+-----------+----------------+
> >>>>>
> >>>>> +-------------------------+-----------+----------------+
> >>>>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>>>> | 100M packets            |           |                |
> >>>>> | vhost pinned to core 0  |           |                |
> >>>>> | *4 threads*             |           |                |
> >>>>> +-----------+-------------+-----------+----------------+
> >>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
> >>>>> |  +        +-------------+-----------+----------------+
> >>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
> >>>>> +-----------+-------------+-----------+----------------+
> >>>
> >>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
> >>> the guest or an xdp program that did XDP_DROP in the guest.
> >>
> >> I forgot to mention that these PPS values are per thread.
> >> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
> >> respectively. For packet loss, that comes out to 1527 Kpps * 4 =
> >> 6108 Kpps and 0, respectively.
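Spelled out, the aggregation is just per-thread rate times thread count; a quick sketch using the per-thread figures from the 4-thread table above:

```shell
# Aggregate per-thread pktgen rates across the 4 worker threads.
# The per-thread values are the figures from the 4-thread table above.
threads=4
echo "stock transmitted:   $(( 71 * threads )) Kpps"
echo "patched transmitted: $(( 79 * threads )) Kpps"
echo "stock lost:          $(( 1527 * threads )) Kpps"
```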
> >>
> >> Sorry about that!
> >>
> >> The pktgen benchmarks with a single thread look fine, right?
> >
> > Still looks very low. E.g I just have a run of pktgen (using
> > pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest,
> > I can get 1Mpps.
>
> Keep in mind that I am using an older CPU (i5-6300HQ). For the
> single-threaded tests I always used pktgen_sample01_simple.sh, and for
> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
>
> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
> though the same parameters work fine for sample01 and sample02):
>
> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
> supported
> ERROR: Write error(1) occurred
> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
>
> ...and I do not know what I am doing wrong, even after looking at
> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
> Any clues?

Please use -b 0, and I'm Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.
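For reference, the failing invocation above with bursting disabled would look like this (a sketch only; the interface, MAC, and destination IP are the values quoted earlier in this thread, and the command is printed rather than run since it needs root and the pktgen module loaded):

```shell
# Build the pktgen_sample03 command line with bursting disabled (-b 0).
# tap0, the MAC, and 10.0.0.2 are the values from the failing run above.
IFACE=tap0
DST_MAC=52:54:00:12:34:56
DST_IP=10.0.0.2
NPKTS=100000000
CMD="samples/pktgen/pktgen_sample03_burst_single_flow.sh -i $IFACE -m $DST_MAC -d $DST_IP -n $NPKTS -b 0"
# Print instead of executing: the script needs root and the pktgen module.
echo "$CMD"
```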

Another thing I can think of is to disable

1) mitigations in both guest and host
2) any kernel debug features in both host and guest

Thanks

>
> Thanks!
>
> >
> >>
> >> I'll still look into using an XDP program that does XDP_DROP in the
> >> guest.
> >>
> >> Thanks!
> >
> > Thanks
> >
> >>
> >>>
> >>>>>
> >>>>> +------------------------+-------------+----------------+
> >>>>> | iperf3 TCP benchmarks  | Stock       | Patched with   |
> >>>>> | to Debian VM 120s      |             | fq_codel qdisc |
> >>>>> | vhost pinned to core 0 |             |                |
> >>>>> +------------------------+-------------+----------------+
> >>>>> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
> >>>>> |  +                     |             |                |
> >>>>> | vhost-net              |             |                |
> >>>>> +------------------------+-------------+----------------+
> >>>>>
> >>>>> +---------------------------+-------------+----------------+
> >>>>> | iperf3 TCP benchmarks     | Stock       | Patched with   |
> >>>>> | to Debian VM 120s         |             | fq_codel qdisc |
> >>>>> | vhost pinned to core 0    |             |                |
> >>>>> | *4 iperf3 client threads* |             |                |
> >>>>> +---------------------------+-------------+----------------+
> >>>>> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
> >>>>> |  +                        |             |                |
> >>>>> | vhost-net                 |             |                |
> >>>>> +---------------------------+-------------+----------------+
> >>>>
> >>>> What are your thoughts on this?
> >>>>
> >>>> Thanks!
> >>>>
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> >
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-01-30  1:51                           ` Jason Wang
@ 2026-02-01 20:19                             ` Simon Schippers
  2026-02-03  3:48                               ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-02-01 20:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 1/30/26 02:51, Jason Wang wrote:
> On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/29/26 02:14, Jason Wang wrote:
>>> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 1/28/26 08:03, Jason Wang wrote:
>>>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> On 1/23/26 10:54, Simon Schippers wrote:
>>>>>>> On 1/23/26 04:05, Jason Wang wrote:
>>>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>
>>>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
>>>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>>>>>>>>> space in the underlying ptr_ring.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>>>>>>>>> in an upcoming commit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>>>>>>>         return ret ? ret : total;
>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>>>>>>>>> +               rcu_read_unlock();
>>>>>>>>>>>>>> +       }
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       return ptr;
>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>                            struct iov_iter *to,
>>>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                 /* Read frames from the queue */
>>>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
>>>>>>>>>>>>>> +               skb = tap_ring_consume(q);
>>>>>>>>>>>>>>                 if (skb)
>>>>>>>>>>>>>>                         break;
>>>>>>>>>>>>>>                 if (noblock) {
>>>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>>>>>>>         return total;
>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>>>>>>>>> another call to tweak the current API.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>>>>>>>>
>>>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>>>>>>>>> I'm not sure is what we want.
>>>>>>>>>>>>
>>>>>>>>>>>> What else would you suggest calling to wake the queue?
>>>>>>>>>>>
>>>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
>>>>>>>>>>
>>>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
>>>>>>>>>> aspect.
>>>>>>>>>>
>>>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
>>>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
>>>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
>>>>>>>>>> already be significantly outdated by the time it is used.
>>>>>>>>>>
>>>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
>>>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>>>>>>>>> smp_call_function_single().
>>>>>>>>>> Is this a reasonable approach?
>>>>>>>>>
>>>>>>>>> I'm not sure but it would introduce costs like IPI.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>>>>>>>>> considered a deal-breaker for the patch set?
>>>>>>>>>
>>>>>>>>> It depends on whether or not it has effects on the performance.
>>>>>>>>> Especially when vhost is pinned.
>>>>>>>>
>>>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
>>>>>>>> a specific CPU and the try to see the impact of the TX_SOFTIRQ.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
>>>>>>> for both the stock and patched versions. The benchmarks were run with
>>>>>>> the full patch series applied, since testing only patches 1-3 would not
>>>>>>> be meaningful - the queue is never stopped in that case, so no
>>>>>>> TX_SOFTIRQ is triggered.
>>>>>>>
>>>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
>>>>>>> performance is lower for pktgen with a single thread but higher with
>>>>>>> four threads. The results show no regression for the patched version,
>>>>>>> with even slight performance improvements observed:
>>>>>>>
>>>>>>> +-------------------------+-----------+----------------+
>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>> | 100M packets            |           |                |
>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>
>>>>>>> +-------------------------+-----------+----------------+
>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>> | 100M packets            |           |                |
>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>> | *4 threads*             |           |                |
>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>
>>>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
>>>>> the guest or an xdp program that did XDP_DROP in the guest.
>>>>
>>>> I forgot to mention that these PPS values are per thread.
>>>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
>>>> respectively. For packet loss, that comes out to 1527 Kpps * 4 =
>>>> 6108 Kpps and 0, respectively.
>>>>
>>>> Sorry about that!
>>>>
>>>> The pktgen benchmarks with a single thread look fine, right?
>>>
>>> Still looks very low. E.g I just have a run of pktgen (using
>>> pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest,
>>> I can get 1Mpps.
>>
>> Keep in mind that I am using an older CPU (i5-6300HQ). For the
>> single-threaded tests I always used pktgen_sample01_simple.sh, and for
>> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
>>
>> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
>> though the same parameters work fine for sample01 and sample02):
>>
>> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
>> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
>> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
>> supported
>> ERROR: Write error(1) occurred
>> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
>>
>> ...and I do not know what I am doing wrong, even after looking at
>> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
>> Any clues?
> 
> Please use -b 0, and I'm Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.

I tried using "-b 0", and while it worked, there was no noticeable
performance improvement.

> 
> Another thing I can think of is to disable
> 
> 1) mitigations in both guest and host
> 2) any kernel debug features in both host and guest

I also rebuilt the kernel with everything disabled under
"Kernel hacking", but that didn’t make any difference either.

Because of this, I ran "pktgen_sample01_simple.sh" and
"pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
with very similar performance between the stock and patched kernels.

Personally, I think the low performance comes down to the hardware.

Thanks!

> 
> Thanks
> 
>>
>> Thanks!
>>
>>>
>>>>
>>>> I'll still look into using an XDP program that does XDP_DROP in the
>>>> guest.
>>>>
>>>> Thanks!
>>>
>>> Thanks
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> +------------------------+-------------+----------------+
>>>>>>> | iperf3 TCP benchmarks  | Stock       | Patched with   |
>>>>>>> | to Debian VM 120s      |             | fq_codel qdisc |
>>>>>>> | vhost pinned to core 0 |             |                |
>>>>>>> +------------------------+-------------+----------------+
>>>>>>> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
>>>>>>> |  +                     |             |                |
>>>>>>> | vhost-net              |             |                |
>>>>>>> +------------------------+-------------+----------------+
>>>>>>>
>>>>>>> +---------------------------+-------------+----------------+
>>>>>>> | iperf3 TCP benchmarks     | Stock       | Patched with   |
>>>>>>> | to Debian VM 120s         |             | fq_codel qdisc |
>>>>>>> | vhost pinned to core 0    |             |                |
>>>>>>> | *4 iperf3 client threads* |             |                |
>>>>>>> +---------------------------+-------------+----------------+
>>>>>>> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
>>>>>>> |  +                        |             |                |
>>>>>>> | vhost-net                 |             |                |
>>>>>>> +---------------------------+-------------+----------------+
>>>>>>
>>>>>> What are your thoughts on this?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>
>>
> 


* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-01 20:19                             ` Simon Schippers
@ 2026-02-03  3:48                               ` Jason Wang
  2026-02-04 15:43                                 ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-02-03  3:48 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Mon, Feb 2, 2026 at 4:19 AM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 1/30/26 02:51, Jason Wang wrote:
> > On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/29/26 02:14, Jason Wang wrote:
> >>> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 1/28/26 08:03, Jason Wang wrote:
> >>>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> On 1/23/26 10:54, Simon Schippers wrote:
> >>>>>>> On 1/23/26 04:05, Jason Wang wrote:
> >>>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
> >>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
> >>>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> >>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
> >>>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> >>>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
> >>>>>>>>>>>>>> space in the underlying ptr_ring.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
> >>>>>>>>>>>>>> in an upcoming commit.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
> >>>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
> >>>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
> >>>>>>>>>>>>>> --- a/drivers/net/tap.c
> >>>>>>>>>>>>>> +++ b/drivers/net/tap.c
> >>>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>>>>>>>>>>>         return ret ? ret : total;
> >>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
> >>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
> >>>>>>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>>>>>> +       void *ptr;
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
> >>>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
> >>>>>>>>>>>>>> +               rcu_read_unlock();
> >>>>>>>>>>>>>> +       }
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +       return ptr;
> >>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>>>>>                            struct iov_iter *to,
> >>>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
> >>>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>                 /* Read frames from the queue */
> >>>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
> >>>>>>>>>>>>>> +               skb = tap_ring_consume(q);
> >>>>>>>>>>>>>>                 if (skb)
> >>>>>>>>>>>>>>                         break;
> >>>>>>>>>>>>>>                 if (noblock) {
> >>>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
> >>>>>>>>>>>>>> --- a/drivers/net/tun.c
> >>>>>>>>>>>>>> +++ b/drivers/net/tun.c
> >>>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>>>>>>>>>>>         return total;
> >>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
> >>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
> >>>>>>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>>>>>> +       void *ptr;
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
> >>>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
> >>>>>>>>>>>>> another call to tweak the current API.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
> >>>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
> >>>>>>>>>>>>> I'm not sure is what we want.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What else would you suggest calling to wake the queue?
> >>>>>>>>>>>
> >>>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
> >>>>>>>>>>
> >>>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
> >>>>>>>>>> aspect.
> >>>>>>>>>>
> >>>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
> >>>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
> >>>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
> >>>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
> >>>>>>>>>> already be significantly outdated by the time it is used.
> >>>>>>>>>>
> >>>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
> >>>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
> >>>>>>>>>> smp_call_function_single().
> >>>>>>>>>> Is this a reasonable approach?
> >>>>>>>>>
> >>>>>>>>> I'm not sure but it would introduce costs like IPI.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
> >>>>>>>>>> considered a deal-breaker for the patch set?
> >>>>>>>>>
> >>>>>>>>> It depends on whether or not it has effects on the performance.
> >>>>>>>>> Especially when vhost is pinned.
> >>>>>>>>
> >>>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
> >>>>>>>> a specific CPU and the try to see the impact of the TX_SOFTIRQ.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>
> >>>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
> >>>>>>> for both the stock and patched versions. The benchmarks were run with
> >>>>>>> the full patch series applied, since testing only patches 1-3 would not
> >>>>>>> be meaningful - the queue is never stopped in that case, so no
> >>>>>>> TX_SOFTIRQ is triggered.
> >>>>>>>
> >>>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
> >>>>>>> performance is lower for pktgen with a single thread but higher with
> >>>>>>> four threads. The results show no regression for the patched version,
> >>>>>>> with even slight performance improvements observed:
> >>>>>>>
> >>>>>>> +-------------------------+-----------+----------------+
> >>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>>>>>> | 100M packets            |           |                |
> >>>>>>> | vhost pinned to core 0  |           |                |
> >>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
> >>>>>>> |  +        +-------------+-----------+----------------+
> >>>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
> >>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>
> >>>>>>> +-------------------------+-----------+----------------+
> >>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>>>>>> | 100M packets            |           |                |
> >>>>>>> | vhost pinned to core 0  |           |                |
> >>>>>>> | *4 threads*             |           |                |
> >>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
> >>>>>>> |  +        +-------------+-----------+----------------+
> >>>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
> >>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>
> >>>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
> >>>>> the guest or an xdp program that did XDP_DROP in the guest.
> >>>>
> >>>> I forgot to mention that these PPS values are per thread.
> >>>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
> >>>> respectively. For packet loss, that comes out to 1527 Kpps * 4 =
> >>>> 6108 Kpps and 0, respectively.
> >>>>
> >>>> Sorry about that!
> >>>>
> >>>> The pktgen benchmarks with a single thread look fine, right?
> >>>
> >>> Still looks very low. E.g I just have a run of pktgen (using
> >>> pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest,
> >>> I can get 1Mpps.
> >>
> >> Keep in mind that I am using an older CPU (i5-6300HQ). For the
> >> single-threaded tests I always used pktgen_sample01_simple.sh, and for
> >> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
> >>
> >> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
> >> though the same parameters work fine for sample01 and sample02):
> >>
> >> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
> >> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
> >> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
> >> supported
> >> ERROR: Write error(1) occurred
> >> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
> >>
> >> ...and I do not know what I am doing wrong, even after looking at
> >> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
> >> Any clues?
> >
> > Please use -b 0, and I'm Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.
>
> I tried using "-b 0", and while it worked, there was no noticeable
> performance improvement.
>
> >
> > Another thing I can think of is to disable
> >
> > 1) mitigations in both guest and host
> > 2) any kernel debug features in both host and guest
>
> I also rebuilt the kernel with everything disabled under
> "Kernel hacking", but that didn’t make any difference either.
>
> Because of this, I ran "pktgen_sample01_simple.sh" and
> "pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
> results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
> with very similar performance between the stock and patched kernels.
>
> Personally, I think the low performance comes down to the hardware.

Let's double-check this by:

1) making sure pktgen is using 100% CPU
2) checking that perf doesn't show anything strange for the pktgen thread
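A minimal sketch of those two checks; kpktgend_<N> is pktgen's standard per-CPU kernel thread name, and pidstat/perf are assumed to be installed. The commands are printed rather than executed so the sketch is safe to run anywhere:

```shell
# Commands for the two sanity checks above, printed rather than run.
# 1) kpktgend_<N> should sit at ~100% CPU while generating.
CPU_CHECK='pidstat -t 1 5 | grep kpktgend'
# 2) Profile the pktgen kthread and look for unexpected hot spots.
PROFILE='perf top -t $(pgrep -f kpktgend_0 | head -n1)'
printf '%s\n%s\n' "$CPU_CHECK" "$PROFILE"
```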

Thanks

>
> Thanks!
>
> >
> > Thanks
> >
> >>
> >> Thanks!
> >>
> >>>
> >>>>
> >>>> I'll still look into using an XDP program that does XDP_DROP in the
> >>>> guest.
> >>>>
> >>>> Thanks!
> >>>
> >>> Thanks
> >>>
> >>>>
> >>>>>
> >>>>>>>
> >>>>>>> +------------------------+-------------+----------------+
> >>>>>>> | iperf3 TCP benchmarks  | Stock       | Patched with   |
> >>>>>>> | to Debian VM 120s      |             | fq_codel qdisc |
> >>>>>>> | vhost pinned to core 0 |             |                |
> >>>>>>> +------------------------+-------------+----------------+
> >>>>>>> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
> >>>>>>> |  +                     |             |                |
> >>>>>>> | vhost-net              |             |                |
> >>>>>>> +------------------------+-------------+----------------+
> >>>>>>>
> >>>>>>> +---------------------------+-------------+----------------+
> >>>>>>> | iperf3 TCP benchmarks     | Stock       | Patched with   |
> >>>>>>> | to Debian VM 120s         |             | fq_codel qdisc |
> >>>>>>> | vhost pinned to core 0    |             |                |
> >>>>>>> | *4 iperf3 client threads* |             |                |
> >>>>>>> +---------------------------+-------------+----------------+
> >>>>>>> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
> >>>>>>> |  +                        |             |                |
> >>>>>>> | vhost-net                 |             |                |
> >>>>>>> +---------------------------+-------------+----------------+
> >>>>>>
> >>>>>> What are your thoughts on this?
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>
> >>
> >
>



* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-03  3:48                               ` Jason Wang
@ 2026-02-04 15:43                                 ` Simon Schippers
  2026-02-05  3:59                                   ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-02-04 15:43 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 2/3/26 04:48, Jason Wang wrote:
> On Mon, Feb 2, 2026 at 4:19 AM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 1/30/26 02:51, Jason Wang wrote:
>>> On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 1/29/26 02:14, Jason Wang wrote:
>>>>> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> On 1/28/26 08:03, Jason Wang wrote:
>>>>>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>
>>>>>>>> On 1/23/26 10:54, Simon Schippers wrote:
>>>>>>>>> On 1/23/26 04:05, Jason Wang wrote:
>>>>>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>>>>>>>>>>> space in the underlying ptr_ring.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>>>>>>>>>>> in an upcoming commit.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>>>>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>>>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>>>>>>>>>         return ret ? ret : total;
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>>>>>>>>>>> +               rcu_read_unlock();
>>>>>>>>>>>>>>>> +       }
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +       return ptr;
>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>>>                            struct iov_iter *to,
>>>>>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                 /* Read frames from the queue */
>>>>>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
>>>>>>>>>>>>>>>> +               skb = tap_ring_consume(q);
>>>>>>>>>>>>>>>>                 if (skb)
>>>>>>>>>>>>>>>>                         break;
>>>>>>>>>>>>>>>>                 if (noblock) {
>>>>>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>>>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>>>>>>>>>         return total;
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>>>>>>>>>>> another call to tweak the current API.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>>>>>>>>>>> I'm not sure is what we want.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What else would you suggest calling to wake the queue?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
>>>>>>>>>>>>
>>>>>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
>>>>>>>>>>>> aspect.
>>>>>>>>>>>>
>>>>>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
>>>>>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
>>>>>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>>>>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
>>>>>>>>>>>> already be significantly outdated by the time it is used.
>>>>>>>>>>>>
>>>>>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
>>>>>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>>>>>>>>>>> smp_call_function_single().
>>>>>>>>>>>> Is this a reasonable approach?
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure but it would introduce costs like IPI.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>>>>>>>>>>> considered a deal-breaker for the patch set?
>>>>>>>>>>>
>>>>>>>>>>> It depends on whether or not it has effects on the performance.
>>>>>>>>>>> Especially when vhost is pinned.
>>>>>>>>>>
>>>>>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
>>>>>>>>>> a specific CPU and the try to see the impact of the TX_SOFTIRQ.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
>>>>>>>>> for both the stock and patched versions. The benchmarks were run with
>>>>>>>>> the full patch series applied, since testing only patches 1-3 would not
>>>>>>>>> be meaningful - the queue is never stopped in that case, so no
>>>>>>>>> TX_SOFTIRQ is triggered.
>>>>>>>>>
>>>>>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
>>>>>>>>> performance is lower for pktgen with a single thread but higher with
>>>>>>>>> four threads. The results show no regression for the patched version,
>>>>>>>>> with even slight performance improvements observed:
>>>>>>>>>
>>>>>>>>> +-------------------------+-----------+----------------+
>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>>>> | 100M packets            |           |                |
>>>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
>>>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>
>>>>>>>>> +-------------------------+-----------+----------------+
>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>>>> | 100M packets            |           |                |
>>>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>>>> | *4 threads*             |           |                |
>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
>>>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>
>>>>>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
>>>>>>> the guest or an xdp program that did XDP_DROP in the guest.
>>>>>>
>>>>>> I forgot to mention that these PPS values are per thread.
>>>>>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
>>>>>> respectively. For packet loss, that comes out to 1154 Kpps * 4 =
>>>>>> 4616 Kpps and 0, respectively.
>>>>>>
>>>>>> Sorry about that!
>>>>>>
>>>>>> The pktgen benchmarks with a single thread look fine, right?
>>>>>
>>>>> Still looks very low. E.g I just have a run of pktgen (using
>>>>> pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest,
>>>>> I can get 1Mpps.
>>>>
>>>> Keep in mind that I am using an older CPU (i5-6300HQ). For the
>>>> single-threaded tests I always used pktgen_sample01_simple.sh, and for
>>>> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
>>>>
>>>> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
>>>> though the same parameters work fine for sample01 and sample02):
>>>>
>>>> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
>>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
>>>> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
>>>> supported
>>>> ERROR: Write error(1) occurred
>>>> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
>>>>
>>>> ...and I do not know what I am doing wrong, even after looking at
>>>> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
>>>> Any clues?
>>>
>>> Please use -b 0, and I'm Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.
>>
>> I tried using "-b 0", and while it worked, there was no noticeable
>> performance improvement.
>>
>>>
>>> Another thing I can think of is to disable
>>>
>>> 1) mitigations in both guest and host
>>> 2) any kernel debug features in both host and guest
>>
>> I also rebuilt the kernel with everything disabled under
>> "Kernel hacking", but that didn’t make any difference either.
>>
>> Because of this, I ran "pktgen_sample01_simple.sh" and
>> "pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
>> results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
>> with very similar performance between the stock and patched kernels.
>>
>> Personally, I think the hardware is to blame for the low performance.
> 
> Let's double confirm this by:
> 
> 1) make sure pktgen is using 100% CPU
> 2) Perf doesn't show anything strange for pktgen thread
> 
> Thanks
> 

I ran pktgen using pktgen_sample01_simple.sh and, in parallel, started a
100-second perf stat measurement covering all kpktgend threads.

Across all configurations, a single CPU was fully utilized.

Apart from that, the patched variants show a higher branch frequency and
a slightly increased number of context switches.


The detailed results are provided below:

Processor: Ryzen 5 5600X

pktgen command:
sudo perf stat samples/pktgen/pktgen_sample01_simple.sh -i tap0 -m
52:54:00:12:34:56 -d 10.0.0.2 -n 10000000000

perf stat command:
sudo perf stat --timeout 100000 -p $(pgrep kpktgend | tr '\n' ,) -o X.txt


Results:
Stock TAP:
            46.997      context-switches                 #    467,2 cs/sec  cs_per_second     
                 0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
                 0      page-faults                      #      0,0 faults/sec  page_faults_per_second
        100.587,69 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
     8.491.586.483      branch-misses                    #     10,9 %  branch_miss_rate         (50,24%)
    77.734.761.406      branches                         #    772,8 M/sec  branch_frequency     (66,85%)
   382.420.291.585      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
   377.612.185.141      instructions                     #      1,0 instructions  insn_per_cycle  (66,85%)
    84.012.185.936      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)

     100,100414494 seconds time elapsed


Stock TAP+vhost-net:
            47.087      context-switches                 #    468,1 cs/sec  cs_per_second     
                 0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
                 0      page-faults                      #      0,0 faults/sec  page_faults_per_second
        100.594,09 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
     8.034.703.613      branch-misses                    #     11,1 %  branch_miss_rate         (50,24%)
    72.477.989.922      branches                         #    720,5 M/sec  branch_frequency     (66,86%)
   382.218.276.832      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
   349.555.577.281      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
    83.917.644.262      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)

     100,100520402 seconds time elapsed


Patched TAP:
            47.862      context-switches                 #    475,8 cs/sec  cs_per_second     
                 0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
                 0      page-faults                      #      0,0 faults/sec  page_faults_per_second
        100.589,30 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
     9.337.258.794      branch-misses                    #      9,4 %  branch_miss_rate         (50,19%)
    99.518.421.676      branches                         #    989,4 M/sec  branch_frequency     (66,85%)
   382.508.244.894      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
   312.582.270.975      instructions                     #      0,8 instructions  insn_per_cycle  (66,85%)
    76.338.503.984      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,39%)

     100,101262454 seconds time elapsed


Patched TAP+vhost-net:
            47.892      context-switches                 #    476,1 cs/sec  cs_per_second     
                 0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
                 0      page-faults                      #      0,0 faults/sec  page_faults_per_second
        100.581,95 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
     9.083.588.313      branch-misses                    #     10,1 %  branch_miss_rate         (50,28%)
    90.300.124.712      branches                         #    897,8 M/sec  branch_frequency     (66,85%)
   382.374.510.376      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
   340.089.181.199      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
    78.151.408.955      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,31%)

     100,101212911 seconds time elapsed
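(Aside, for anyone re-processing the counters above: perf printed them with a
German locale, i.e. '.' as the thousands separator and ',' as the decimal
mark. A small, hypothetical helper to normalize them:)

```python
def parse_perf_count(s: str) -> float:
    """Parse a perf-stat counter printed with a German locale,
    where '.' groups thousands and ',' marks the decimal point.
    e.g. '382.420.291.585' -> 382420291585.0, '467,2' -> 467.2
    """
    return float(s.replace(".", "").replace(",", "."))

# Example: recompute the branch miss rate of the "Stock TAP" run above
branches = parse_perf_count("77.734.761.406")
misses = parse_perf_count("8.491.586.483")
print(f"{100 * misses / branches:.1f} %")  # ~10.9 %, matching perf's own ratio
```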

>>
>> Thanks!
>>
>>>
>>> Thanks
>>>
>>>>
>>>> Thanks!
>>>>
>>>>>
>>>>>>
>>>>>> I'll still look into using an XDP program that does XDP_DROP in the
>>>>>> guest.
>>>>>>
>>>>>> Thanks!
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> +------------------------+-------------+----------------+
>>>>>>>>> | iperf3 TCP benchmarks  | Stock       | Patched with   |
>>>>>>>>> | to Debian VM 120s      |             | fq_codel qdisc |
>>>>>>>>> | vhost pinned to core 0 |             |                |
>>>>>>>>> +------------------------+-------------+----------------+
>>>>>>>>> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
>>>>>>>>> |  +                     |             |                |
>>>>>>>>> | vhost-net              |             |                |
>>>>>>>>> +------------------------+-------------+----------------+
>>>>>>>>>
>>>>>>>>> +---------------------------+-------------+----------------+
>>>>>>>>> | iperf3 TCP benchmarks     | Stock       | Patched with   |
>>>>>>>>> | to Debian VM 120s         |             | fq_codel qdisc |
>>>>>>>>> | vhost pinned to core 0    |             |                |
>>>>>>>>> | *4 iperf3 client threads* |             |                |
>>>>>>>>> +---------------------------+-------------+----------------+
>>>>>>>>> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
>>>>>>>>> |  +                        |             |                |
>>>>>>>>> | vhost-net                 |             |                |
>>>>>>>>> +---------------------------+-------------+----------------+
>>>>>>>>
>>>>>>>> What are your thoughts on this?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-04 15:43                                 ` Simon Schippers
@ 2026-02-05  3:59                                   ` Jason Wang
  2026-02-05 22:28                                     ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-02-05  3:59 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Wed, Feb 4, 2026 at 11:44 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 2/3/26 04:48, Jason Wang wrote:
> > On Mon, Feb 2, 2026 at 4:19 AM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 1/30/26 02:51, Jason Wang wrote:
> >>> On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 1/29/26 02:14, Jason Wang wrote:
> >>>>> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> On 1/28/26 08:03, Jason Wang wrote:
> >>>>>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
> >>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>
> >>>>>>>> On 1/23/26 10:54, Simon Schippers wrote:
> >>>>>>>>> On 1/23/26 04:05, Jason Wang wrote:
> >>>>>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
> >>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
> >>>>>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> >>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
> >>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> >>>>>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
> >>>>>>>>>>>>>>>> space in the underlying ptr_ring.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
> >>>>>>>>>>>>>>>> in an upcoming commit.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
> >>>>>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
> >>>>>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>>>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
> >>>>>>>>>>>>>>>> --- a/drivers/net/tap.c
> >>>>>>>>>>>>>>>> +++ b/drivers/net/tap.c
> >>>>>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>>>>>>>>>>>>>         return ret ? ret : total;
> >>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
> >>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
> >>>>>>>>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>>>>>>>> +       void *ptr;
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
> >>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
> >>>>>>>>>>>>>>>> +               rcu_read_unlock();
> >>>>>>>>>>>>>>>> +       }
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +       return ptr;
> >>>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>>>>>>>                            struct iov_iter *to,
> >>>>>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
> >>>>>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>                 /* Read frames from the queue */
> >>>>>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
> >>>>>>>>>>>>>>>> +               skb = tap_ring_consume(q);
> >>>>>>>>>>>>>>>>                 if (skb)
> >>>>>>>>>>>>>>>>                         break;
> >>>>>>>>>>>>>>>>                 if (noblock) {
> >>>>>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
> >>>>>>>>>>>>>>>> --- a/drivers/net/tun.c
> >>>>>>>>>>>>>>>> +++ b/drivers/net/tun.c
> >>>>>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>>>>>>>>>>>>>         return total;
> >>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
> >>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
> >>>>>>>>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>>>>>>>> +       void *ptr;
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
> >>>>>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
> >>>>>>>>>>>>>>> another call to tweak the current API.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
> >>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
> >>>>>>>>>>>>>>> I'm not sure is what we want.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> What else would you suggest calling to wake the queue?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
> >>>>>>>>>>>> aspect.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
> >>>>>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
> >>>>>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
> >>>>>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
> >>>>>>>>>>>> already be significantly outdated by the time it is used.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
> >>>>>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
> >>>>>>>>>>>> smp_call_function_single().
> >>>>>>>>>>>> Is this a reasonable approach?
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not sure but it would introduce costs like IPI.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
> >>>>>>>>>>>> considered a deal-breaker for the patch set?
> >>>>>>>>>>>
> >>>>>>>>>>> It depends on whether or not it has effects on the performance.
> >>>>>>>>>>> Especially when vhost is pinned.
> >>>>>>>>>>
> >>>>>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
> >>>>>>>>>> a specific CPU and the try to see the impact of the TX_SOFTIRQ.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
> >>>>>>>>> for both the stock and patched versions. The benchmarks were run with
> >>>>>>>>> the full patch series applied, since testing only patches 1-3 would not
> >>>>>>>>> be meaningful - the queue is never stopped in that case, so no
> >>>>>>>>> TX_SOFTIRQ is triggered.
> >>>>>>>>>
> >>>>>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
> >>>>>>>>> performance is lower for pktgen with a single thread but higher with
> >>>>>>>>> four threads. The results show no regression for the patched version,
> >>>>>>>>> with even slight performance improvements observed:
> >>>>>>>>>
> >>>>>>>>> +-------------------------+-----------+----------------+
> >>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>>>>>>>> | 100M packets            |           |                |
> >>>>>>>>> | vhost pinned to core 0  |           |                |
> >>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
> >>>>>>>>> |  +        +-------------+-----------+----------------+
> >>>>>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
> >>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>>
> >>>>>>>>> +-------------------------+-----------+----------------+
> >>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>>>>>>>> | 100M packets            |           |                |
> >>>>>>>>> | vhost pinned to core 0  |           |                |
> >>>>>>>>> | *4 threads*             |           |                |
> >>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
> >>>>>>>>> |  +        +-------------+-----------+----------------+
> >>>>>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
> >>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>
> >>>>>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
> >>>>>>> the guest or an xdp program that did XDP_DROP in the guest.
> >>>>>>
> >>>>>> I forgot to mention that these PPS values are per thread.
> >>>>>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
> >>>>>> respectively. For packet loss, that comes out to 1154 Kpps * 4 =
> >>>>>> 4616 Kpps and 0, respectively.
> >>>>>>
> >>>>>> Sorry about that!
> >>>>>>
> >>>>>> The pktgen benchmarks with a single thread look fine, right?
> >>>>>
> >>>>> Still looks very low. E.g I just have a run of pktgen (using
> >>>>> pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest,
> >>>>> I can get 1Mpps.
> >>>>
> >>>> Keep in mind that I am using an older CPU (i5-6300HQ). For the
> >>>> single-threaded tests I always used pktgen_sample01_simple.sh, and for
> >>>> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
> >>>>
> >>>> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
> >>>> though the same parameters work fine for sample01 and sample02):
> >>>>
> >>>> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
> >>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
> >>>> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
> >>>> supported
> >>>> ERROR: Write error(1) occurred
> >>>> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
> >>>>
> >>>> ...and I do not know what I am doing wrong, even after looking at
> >>>> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
> >>>> Any clues?
> >>>
> >>> Please use -b 0, and I'm Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.
> >>
> >> I tried using "-b 0", and while it worked, there was no noticeable
> >> performance improvement.
> >>
> >>>
> >>> Another thing I can think of is to disable
> >>>
> >>> 1) mitigations in both guest and host
> >>> 2) any kernel debug features in both host and guest
> >>
> >> I also rebuilt the kernel with everything disabled under
> >> "Kernel hacking", but that didn’t make any difference either.
> >>
> >> Because of this, I ran "pktgen_sample01_simple.sh" and
> >> "pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
> >> results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
> >> with very similar performance between the stock and patched kernels.
> >>
> >> Personally, I think the hardware is to blame for the low performance.
> >
> > Let's double confirm this by:
> >
> > 1) make sure pktgen is using 100% CPU
> > 2) Perf doesn't show anything strange for pktgen thread
> >
> > Thanks
> >
>
> I ran pktgen using pktgen_sample01_simple.sh and, in parallel, started a
> 100-second perf stat measurement covering all kpktgend threads.
>
> Across all configurations, a single CPU was fully utilized.
>
> Apart from that, the patched variants show a higher branch frequency and
> a slightly increased number of context switches.
>
>
> The detailed results are provided below:
>
> Processor: Ryzen 5 5600X
>
> pktgen command:
> sudo perf stat samples/pktgen/pktgen_sample01_simple.sh -i tap0 -m
> 52:54:00:12:34:56 -d 10.0.0.2 -n 10000000000
>
> perf stat command:
> sudo perf stat --timeout 100000 -p $(pgrep kpktgend | tr '\n' ,) -o X.txt
>
>
> Results:
> Stock TAP:
>             46.997      context-switches                 #    467,2 cs/sec  cs_per_second
>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>         100.587,69 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>      8.491.586.483      branch-misses                    #     10,9 %  branch_miss_rate         (50,24%)
>     77.734.761.406      branches                         #    772,8 M/sec  branch_frequency     (66,85%)
>    382.420.291.585      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>    377.612.185.141      instructions                     #      1,0 instructions  insn_per_cycle  (66,85%)
>     84.012.185.936      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
>
>      100,100414494 seconds time elapsed
>
>
> Stock TAP+vhost-net:
>             47.087      context-switches                 #    468,1 cs/sec  cs_per_second
>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>         100.594,09 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>      8.034.703.613      branch-misses                    #     11,1 %  branch_miss_rate         (50,24%)
>     72.477.989.922      branches                         #    720,5 M/sec  branch_frequency     (66,86%)
>    382.218.276.832      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>    349.555.577.281      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
>     83.917.644.262      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
>
>      100,100520402 seconds time elapsed
>
>
> Patched TAP:
>             47.862      context-switches                 #    475,8 cs/sec  cs_per_second
>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>         100.589,30 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>      9.337.258.794      branch-misses                    #      9,4 %  branch_miss_rate         (50,19%)
>     99.518.421.676      branches                         #    989,4 M/sec  branch_frequency     (66,85%)
>    382.508.244.894      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>    312.582.270.975      instructions                     #      0,8 instructions  insn_per_cycle  (66,85%)
>     76.338.503.984      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,39%)
>
>      100,101262454 seconds time elapsed
>
>
> Patched TAP+vhost-net:
>             47.892      context-switches                 #    476,1 cs/sec  cs_per_second
>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>         100.581,95 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>      9.083.588.313      branch-misses                    #     10,1 %  branch_miss_rate         (50,28%)
>     90.300.124.712      branches                         #    897,8 M/sec  branch_frequency     (66,85%)
>    382.374.510.376      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>    340.089.181.199      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
>     78.151.408.955      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,31%)
>
>      100,101212911 seconds time elapsed

Thanks for sharing. I have more questions:

1) The number of CPUs and vCPUs
2) If you pin vhost or vCPU threads
3) What does perf top look like, or perf top -p $pid_of_vhost

>
> >>
> >> Thanks!
> >>
> >>>
> >>> Thanks
> >>>
> >>>>
> >>>> Thanks!
> >>>>
> >>>>>
> >>>>>>
> >>>>>> I'll still look into using an XDP program that does XDP_DROP in the
> >>>>>> guest.
> >>>>>>
> >>>>>> Thanks!
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>> +------------------------+-------------+----------------+
> >>>>>>>>> | iperf3 TCP benchmarks  | Stock       | Patched with   |
> >>>>>>>>> | to Debian VM 120s      |             | fq_codel qdisc |
> >>>>>>>>> | vhost pinned to core 0 |             |                |
> >>>>>>>>> +------------------------+-------------+----------------+
> >>>>>>>>> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
> >>>>>>>>> |  +                     |             |                |
> >>>>>>>>> | vhost-net              |             |                |
> >>>>>>>>> +------------------------+-------------+----------------+
> >>>>>>>>>
> >>>>>>>>> +---------------------------+-------------+----------------+
> >>>>>>>>> | iperf3 TCP benchmarks     | Stock       | Patched with   |
> >>>>>>>>> | to Debian VM 120s         |             | fq_codel qdisc |
> >>>>>>>>> | vhost pinned to core 0    |             |                |
> >>>>>>>>> | *4 iperf3 client threads* |             |                |
> >>>>>>>>> +---------------------------+-------------+----------------+
> >>>>>>>>> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
> >>>>>>>>> |  +                        |             |                |
> >>>>>>>>> | vhost-net                 |             |                |
> >>>>>>>>> +---------------------------+-------------+----------------+
> >>>>>>>>
> >>>>>>>> What are your thoughts on this?
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-05  3:59                                   ` Jason Wang
@ 2026-02-05 22:28                                     ` Simon Schippers
  2026-02-06  3:21                                       ` Jason Wang
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-02-05 22:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 2/5/26 04:59, Jason Wang wrote:
> On Wed, Feb 4, 2026 at 11:44 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 2/3/26 04:48, Jason Wang wrote:
>>> On Mon, Feb 2, 2026 at 4:19 AM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 1/30/26 02:51, Jason Wang wrote:
>>>>> On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> On 1/29/26 02:14, Jason Wang wrote:
>>>>>>> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>
>>>>>>>> On 1/28/26 08:03, Jason Wang wrote:
>>>>>>>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>
>>>>>>>>>> On 1/23/26 10:54, Simon Schippers wrote:
>>>>>>>>>>> On 1/23/26 04:05, Jason Wang wrote:
>>>>>>>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>>>>>>>>>>>>> space in the underlying ptr_ring.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>>>>>>>>>>>>> in an upcoming commit.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>>>>>>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>>>>>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>         return ret ? ret : total;
>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>>>>>>>>>>>>> +               rcu_read_unlock();
>>>>>>>>>>>>>>>>>> +       }
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> +       return ptr;
>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>                            struct iov_iter *to,
>>>>>>>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>                 /* Read frames from the queue */
>>>>>>>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
>>>>>>>>>>>>>>>>>> +               skb = tap_ring_consume(q);
>>>>>>>>>>>>>>>>>>                 if (skb)
>>>>>>>>>>>>>>>>>>                         break;
>>>>>>>>>>>>>>>>>>                 if (noblock) {
>>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>>>>>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>>>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>>>>>>>>>>>         return total;
>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>>>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>>>>>>>>>>>>> another call to tweak the current API.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>>>>>>>>>>>>> I'm not sure is what we want.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What else would you suggest calling to wake the queue?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
>>>>>>>>>>>>>> aspect.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
>>>>>>>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
>>>>>>>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>>>>>>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
>>>>>>>>>>>>>> already be significantly outdated by the time it is used.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
>>>>>>>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>>>>>>>>>>>>> smp_call_function_single().
>>>>>>>>>>>>>> Is this a reasonable approach?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure but it would introduce costs like IPI.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>>>>>>>>>>>>> considered a deal-breaker for the patch set?
>>>>>>>>>>>>>
>>>>>>>>>>>>> It depends on whether or not it has effects on the performance.
>>>>>>>>>>>>> Especially when vhost is pinned.
>>>>>>>>>>>>
>>>>>>>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
>>>>>>>>>>>> a specific CPU and the try to see the impact of the TX_SOFTIRQ.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
>>>>>>>>>>> for both the stock and patched versions. The benchmarks were run with
>>>>>>>>>>> the full patch series applied, since testing only patches 1-3 would not
>>>>>>>>>>> be meaningful - the queue is never stopped in that case, so no
>>>>>>>>>>> TX_SOFTIRQ is triggered.
>>>>>>>>>>>
>>>>>>>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
>>>>>>>>>>> performance is lower for pktgen with a single thread but higher with
>>>>>>>>>>> four threads. The results show no regression for the patched version,
>>>>>>>>>>> with even slight performance improvements observed:
>>>>>>>>>>>
>>>>>>>>>>> +-------------------------+-----------+----------------+
>>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>>>>>> | 100M packets            |           |                |
>>>>>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
>>>>>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>
>>>>>>>>>>> +-------------------------+-----------+----------------+
>>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>>>>>> | 100M packets            |           |                |
>>>>>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>>>>>> | *4 threads*             |           |                |
>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
>>>>>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>
> >>>>>>>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
> >>>>>>>>> the guest or an XDP program that does XDP_DROP in the guest.
>>>>>>>>
> >>>>>>>> I forgot to mention that these PPS values are per thread.
> >>>>>>>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 =
> >>>>>>>> 316 Kpps, respectively. For packet loss, that comes out to
> >>>>>>>> 1527 Kpps * 4 = 6108 Kpps and 0, respectively.
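(As an aside, the per-thread to aggregate conversion above is plain multiplication; a quick sanity check, using the per-thread values from the 4-thread pinned-vhost table quoted earlier in the thread:)

```python
# Sanity-check the aggregate pktgen rates: four worker threads,
# per-thread figures taken from the 4-thread pinned-vhost table.
THREADS = 4

transmitted_stock = 71_000    # pps per thread, stock kernel
transmitted_patched = 79_000  # pps per thread, patched kernel
lost_stock = 1_527_000        # pps per thread lost, stock kernel

print(transmitted_stock * THREADS)    # 284000 pps aggregate
print(transmitted_patched * THREADS)  # 316000 pps aggregate
print(lost_stock * THREADS)           # 6108000 pps aggregate
```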
>>>>>>>>
>>>>>>>> Sorry about that!
>>>>>>>>
>>>>>>>> The pktgen benchmarks with a single thread look fine, right?
>>>>>>>
> >>>>>>> Still looks very low. E.g., I just had a run of pktgen (using
> >>>>>>> pktgen_sample03_burst_single_flow.sh) without XDP_DROP in the guest,
> >>>>>>> and I can get 1 Mpps.
>>>>>>
>>>>>> Keep in mind that I am using an older CPU (i5-6300HQ). For the
>>>>>> single-threaded tests I always used pktgen_sample01_simple.sh, and for
>>>>>> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
>>>>>>
>>>>>> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
>>>>>> though the same parameters work fine for sample01 and sample02):
>>>>>>
>>>>>> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
>>>>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
>>>>>> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
>>>>>> supported
>>>>>> ERROR: Write error(1) occurred
>>>>>> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
>>>>>>
>>>>>> ...and I do not know what I am doing wrong, even after looking at
>>>>>> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
>>>>>> Any clues?
>>>>>
> >>>>> Please use -b 0; my CPU is an Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.
>>>>
>>>> I tried using "-b 0", and while it worked, there was no noticeable
>>>> performance improvement.
>>>>
>>>>>
>>>>> Another thing I can think of is to disable
>>>>>
>>>>> 1) mitigations in both guest and host
>>>>> 2) any kernel debug features in both host and guest
>>>>
>>>> I also rebuilt the kernel with everything disabled under
>>>> "Kernel hacking", but that didn’t make any difference either.
>>>>
>>>> Because of this, I ran "pktgen_sample01_simple.sh" and
>>>> "pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
>>>> results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
>>>> with very similar performance between the stock and patched kernels.
>>>>
> >>>> Personally, I think the hardware is to blame for the low performance.
>>>
>>> Let's double confirm this by:
>>>
>>> 1) make sure pktgen is using 100% CPU
>>> 2) Perf doesn't show anything strange for pktgen thread
>>>
>>> Thanks
>>>
>>
>> I ran pktgen using pktgen_sample01_simple.sh and, in parallel, started a
>> 100-second perf stat measurement covering all kpktgend threads.
>>
>> Across all configurations, a single CPU was fully utilized.
>>
>> Apart from that, the patched variants show a higher branch frequency and
>> a slightly increased number of context switches.
>>
>>
>> The detailed results are provided below:
>>
>> Processor: Ryzen 5 5600X
>>
>> pktgen command:
>> sudo perf stat samples/pktgen/pktgen_sample01_simple.sh -i tap0 -m
>> 52:54:00:12:34:56 -d 10.0.0.2 -n 10000000000
>>
>> perf stat command:
>> sudo perf stat --timeout 100000 -p $(pgrep kpktgend | tr '\n' ,) -o X.txt
>>
>>
>> Results:
>> Stock TAP:
>>             46.997      context-switches                 #    467,2 cs/sec  cs_per_second
>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>         100.587,69 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>      8.491.586.483      branch-misses                    #     10,9 %  branch_miss_rate         (50,24%)
>>     77.734.761.406      branches                         #    772,8 M/sec  branch_frequency     (66,85%)
>>    382.420.291.585      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>    377.612.185.141      instructions                     #      1,0 instructions  insn_per_cycle  (66,85%)
>>     84.012.185.936      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
>>
>>      100,100414494 seconds time elapsed
>>
>>
>> Stock TAP+vhost-net:
>>             47.087      context-switches                 #    468,1 cs/sec  cs_per_second
>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>         100.594,09 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>      8.034.703.613      branch-misses                    #     11,1 %  branch_miss_rate         (50,24%)
>>     72.477.989.922      branches                         #    720,5 M/sec  branch_frequency     (66,86%)
>>    382.218.276.832      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>    349.555.577.281      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
>>     83.917.644.262      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
>>
>>      100,100520402 seconds time elapsed
>>
>>
>> Patched TAP:
>>             47.862      context-switches                 #    475,8 cs/sec  cs_per_second
>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>         100.589,30 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>      9.337.258.794      branch-misses                    #      9,4 %  branch_miss_rate         (50,19%)
>>     99.518.421.676      branches                         #    989,4 M/sec  branch_frequency     (66,85%)
>>    382.508.244.894      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>    312.582.270.975      instructions                     #      0,8 instructions  insn_per_cycle  (66,85%)
>>     76.338.503.984      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,39%)
>>
>>      100,101262454 seconds time elapsed
>>
>>
>> Patched TAP+vhost-net:
>>             47.892      context-switches                 #    476,1 cs/sec  cs_per_second
>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>         100.581,95 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>      9.083.588.313      branch-misses                    #     10,1 %  branch_miss_rate         (50,28%)
>>     90.300.124.712      branches                         #    897,8 M/sec  branch_frequency     (66,85%)
>>    382.374.510.376      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>    340.089.181.199      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
>>     78.151.408.955      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,31%)
>>
>>      100,101212911 seconds time elapsed
> 
> Thanks for sharing. I have more questions:
> 
> 1) The number of CPUs and vCPUs

QEMU runs with a single vCPU, and my host system is now a Ryzen 5 5600X
with 6 cores / 12 threads.
This is my command for TAP+vhost-net:

sudo qemu-system-x86_64 -hda debian.qcow2
-netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no,vhost=on
-device virtio-net-pci,netdev=mynet0 -m 1024 -enable-kvm

For TAP only it is the same but without vhost=on.

> 2) If you pin vhost or vCPU threads

Not in the previously shown benchmark. I pinned vhost in other
benchmarks, but since there is only a minor PPS difference I omitted
them for the sake of simplicity.

> 3) What does perf top look like, or perf top -p $pid_of_vhost

The perf reports for the vhost PID from pktgen_sample01_simple.sh
with TAP+vhost-net (not pinned, pktgen single queue, fq_codel) are shown
below. I cannot see a huge difference between stock and patched.

I also included perf reports for the kpktgend PIDs. I find them more
interesting because tun_net_xmit shows less overhead in the patched
kernel. I assume that is due to the stopped netdev queue.

I have now benchmarked pretty much all possible combinations (with a
script) of TAP/TAP+vhost-net, single/multi-queue pktgen, vhost
pinned/not pinned, with/without -b 0, fq_codel/noqueue... all of that
with perf records.
I could share them if you want, but I feel this is getting out of hand.
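(For reference, the derived columns in the perf stat output quoted above follow directly from the raw counters; a minimal re-derivation using the Stock TAP numbers. Note that perf printed them with a German locale, i.e. '.' as thousands separator and ',' as decimal mark:)

```python
# Re-derive perf stat's computed metrics from the raw counters of the
# "Stock TAP" run quoted above.
task_clock_s = 100_587.69 / 1000.0   # "100.587,69 msec task-clock"
branches = 77_734_761_406
branch_misses = 8_491_586_483
cycles = 382_420_291_585
instructions = 377_612_185_141

branch_freq = branches / task_clock_s / 1e6   # ~772.8 M/sec
miss_rate = branch_misses / branches * 100    # ~10.9 %
ipc = instructions / cycles                   # ~1.0 insn per cycle
clock = cycles / task_clock_s / 1e9           # ~3.8 GHz

print(f"{branch_freq:.1f} M/sec, {miss_rate:.1f} %, {ipc:.2f} IPC, {clock:.1f} GHz")
```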


Stock:
sudo perf record -p "$vhost_pid"
...
# Overhead  Command          Shared Object               Symbol                                    
# ........  ...............  ..........................  ..........................................
#
     5.97%  vhost-4874       [kernel.kallsyms]           [k] _copy_to_iter
     2.68%  vhost-4874       [kernel.kallsyms]           [k] tun_do_read
     2.23%  vhost-4874       [kernel.kallsyms]           [k] native_write_msr
     1.93%  vhost-4874       [kernel.kallsyms]           [k] __check_object_size
     1.61%  vhost-4874       [kernel.kallsyms]           [k] __slab_free.isra.0
     1.56%  vhost-4874       [kernel.kallsyms]           [k] __get_user_nocheck_2
     1.54%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_zero
     1.45%  vhost-4874       [kernel.kallsyms]           [k] kmem_cache_free
     1.43%  vhost-4874       [kernel.kallsyms]           [k] tun_recvmsg
     1.24%  vhost-4874       [kernel.kallsyms]           [k] sk_skb_reason_drop
     1.12%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_safe_ret
     1.07%  vhost-4874       [kernel.kallsyms]           [k] native_read_msr
     0.76%  vhost-4874       [kernel.kallsyms]           [k] simple_copy_to_iter
     0.75%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_return_thunk
     0.69%  vhost-4874       [vhost]                     [k] 0x0000000000002e70
     0.59%  vhost-4874       [kernel.kallsyms]           [k] skb_release_data
     0.59%  vhost-4874       [kernel.kallsyms]           [k] __skb_datagram_iter
     0.53%  vhost-4874       [vhost]                     [k] 0x0000000000002e5f
     0.51%  vhost-4874       [kernel.kallsyms]           [k] slab_update_freelist.isra.0
     0.46%  vhost-4874       [kernel.kallsyms]           [k] kfree_skbmem
     0.44%  vhost-4874       [kernel.kallsyms]           [k] skb_copy_datagram_iter
     0.43%  vhost-4874       [kernel.kallsyms]           [k] skb_free_head
     0.37%  qemu-system-x86  [unknown]                   [k] 0xffffffffba898b1b
     0.35%  vhost-4874       [vhost]                     [k] 0x0000000000002e6b
     0.33%  vhost-4874       [vhost_net]                 [k] 0x000000000000357d
     0.28%  vhost-4874       [kernel.kallsyms]           [k] __check_heap_object
     0.27%  vhost-4874       [vhost_net]                 [k] 0x00000000000035f3
     0.26%  vhost-4874       [vhost_net]                 [k] 0x00000000000030f6
     0.26%  vhost-4874       [kernel.kallsyms]           [k] __virt_addr_valid
     0.24%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_advance
     0.22%  vhost-4874       [kernel.kallsyms]           [k] perf_event_update_userpage
     0.22%  vhost-4874       [kernel.kallsyms]           [k] check_stack_object
     0.19%  qemu-system-x86  [unknown]                   [k] 0xffffffffba2a68cd
     0.19%  vhost-4874       [kernel.kallsyms]           [k] dequeue_entities
     0.19%  vhost-4874       [vhost_net]                 [k] 0x0000000000003237
     0.18%  vhost-4874       [vhost_net]                 [k] 0x0000000000003550
     0.18%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_del
     0.18%  vhost-4874       [vhost_net]                 [k] 0x00000000000034a0
     0.17%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_disable_all
     0.16%  vhost-4874       [vhost_net]                 [k] 0x0000000000003523
     0.16%  vhost-4874       [kernel.kallsyms]           [k] amd_pmu_addr_offset
...


sudo perf record -p "$kpktgend_pids":
...
# Overhead  Command      Shared Object      Symbol                                         
# ........  ...........  .................  ...............................................
#
    10.98%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
    10.45%  kpktgend_0   [kernel.kallsyms]  [k] memset
     8.40%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
     6.31%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
     3.13%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_safe_ret
     2.40%  kpktgend_0   [kernel.kallsyms]  [k] sk_skb_reason_drop
     2.11%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_return_thunk
     1.76%  kpktgend_0   [kernel.kallsyms]  [k] __netdev_alloc_skb
     1.74%  kpktgend_0   [kernel.kallsyms]  [k] __get_random_u32_below
     1.67%  kpktgend_0   [kernel.kallsyms]  [k] kmalloc_reserve
     1.61%  kpktgend_0   [pktgen]           [k] 0x0000000000003305
     1.57%  kpktgend_0   [pktgen]           [k] 0x00000000000032ff
     1.56%  kpktgend_0   [kernel.kallsyms]  [k] sock_def_readable
     1.49%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_free
     1.48%  kpktgend_0   [kernel.kallsyms]  [k] chacha_permute
     1.39%  kpktgend_0   [kernel.kallsyms]  [k] get_random_u32
     1.12%  kpktgend_0   [pktgen]           [k] 0x0000000000003334
     1.09%  kpktgend_0   [pktgen]           [k] 0x000000000000332a
     0.99%  kpktgend_0   [pktgen]           [k] 0x0000000000003116
     0.92%  kpktgend_0   [kernel.kallsyms]  [k] skb_release_data
     0.91%  kpktgend_0   [kernel.kallsyms]  [k] skb_put
     0.88%  kpktgend_0   [pktgen]           [k] 0x0000000000004121
     0.77%  kpktgend_0   [pktgen]           [k] 0x0000000000003427
     0.75%  kpktgend_0   [pktgen]           [k] 0x0000000000004337
     0.70%  kpktgend_0   [pktgen]           [k] 0x00000000000021b9
     0.68%  kpktgend_0   [pktgen]           [k] 0x0000000000002447
     0.68%  kpktgend_0   [pktgen]           [k] 0x0000000000003919
     0.65%  kpktgend_0   [kernel.kallsyms]  [k] __local_bh_enable_ip
     0.63%  kpktgend_0   [kernel.kallsyms]  [k] skb_free_head
     0.63%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skbmem
     0.61%  kpktgend_0   [pktgen]           [k] 0x0000000000003257
     0.60%  kpktgend_0   [pktgen]           [k] 0x000000000000243a
     0.59%  kpktgend_0   [pktgen]           [k] 0x000000000000413d
     0.58%  kpktgend_0   [pktgen]           [k] 0x00000000000040eb
     0.58%  kpktgend_0   [pktgen]           [k] 0x000000000000435f
     0.51%  kpktgend_0   [kernel.kallsyms]  [k] _raw_spin_lock
     0.50%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_unlock
     0.45%  kpktgend_0   [pktgen]           [k] 0x000000000000330d
     0.45%  kpktgend_0   [pktgen]           [k] 0x0000000000004124
     0.44%  kpktgend_0   [pktgen]           [k] 0x000000000000433c
     0.43%  kpktgend_0   [pktgen]           [k] 0x0000000000003111


====================================================================
Patched: 
sudo perf record -p "$vhost_pid"
...
# Overhead  Command          Shared Object               Symbol                                    
# ........  ...............  ..........................  ..........................................
#
     5.85%  vhost-7042       [kernel.kallsyms]           [k] _copy_to_iter
     2.75%  vhost-7042       [kernel.kallsyms]           [k] tun_do_read
     2.37%  vhost-7042       [kernel.kallsyms]           [k] __check_object_size
     2.28%  vhost-7042       [kernel.kallsyms]           [k] native_write_msr
     1.74%  vhost-7042       [kernel.kallsyms]           [k] __slab_free.isra.0
     1.61%  vhost-7042       [kernel.kallsyms]           [k] iov_iter_zero
     1.54%  vhost-7042       [kernel.kallsyms]           [k] kmem_cache_free
     1.53%  vhost-7042       [kernel.kallsyms]           [k] tun_recvmsg
     1.33%  vhost-7042       [kernel.kallsyms]           [k] __get_user_nocheck_2
     1.28%  vhost-7042       [kernel.kallsyms]           [k] sk_skb_reason_drop
     1.09%  vhost-7042       [kernel.kallsyms]           [k] native_read_msr
     1.04%  vhost-7042       [kernel.kallsyms]           [k] srso_alias_safe_ret
     0.92%  vhost-7042       [kernel.kallsyms]           [k] simple_copy_to_iter
     0.84%  vhost-7042       [kernel.kallsyms]           [k] skb_release_data
     0.75%  vhost-7042       [kernel.kallsyms]           [k] srso_alias_return_thunk
     0.72%  vhost-7042       [kernel.kallsyms]           [k] __skb_datagram_iter
     0.70%  vhost-7042       [vhost]                     [k] 0x0000000000002e70
     0.53%  vhost-7042       [vhost]                     [k] 0x0000000000002e5f
     0.52%  vhost-7042       [kernel.kallsyms]           [k] slab_update_freelist.isra.0
     0.45%  vhost-7042       [kernel.kallsyms]           [k] skb_free_head
     0.44%  vhost-7042       [kernel.kallsyms]           [k] skb_copy_datagram_iter
     0.44%  vhost-7042       [kernel.kallsyms]           [k] kfree_skbmem
     0.34%  vhost-7042       [vhost_net]                 [k] 0x00000000000033e6
     0.33%  vhost-7042       [kernel.kallsyms]           [k] iov_iter_advance
     0.33%  vhost-7042       [vhost]                     [k] 0x0000000000002e6b
     0.31%  qemu-system-x86  [unknown]                   [k] 0xffffffffaa898b1b
     0.28%  vhost-7042       [vhost_net]                 [k] 0x00000000000033b9
     0.27%  vhost-7042       [vhost_net]                 [k] 0x000000000000345c
     0.27%  vhost-7042       [vhost_net]                 [k] 0x00000000000035c6
     0.27%  vhost-7042       [kernel.kallsyms]           [k] __check_heap_object
     0.25%  vhost-7042       [kernel.kallsyms]           [k] perf_event_update_userpage
     0.23%  vhost-7042       [kernel.kallsyms]           [k] __virt_addr_valid
     0.19%  vhost-7042       [kernel.kallsyms]           [k] x86_pmu_disable_all
...


sudo perf record -p "$kpktgend_pids":
...
# Overhead  Command      Shared Object      Symbol                                   
# ........  ...........  .................  .........................................
#
     5.98%  kpktgend_0   [pktgen]           [k] 0x0000000000003305
     5.94%  kpktgend_0   [pktgen]           [k] 0x00000000000032ff
     5.93%  kpktgend_0   [kernel.kallsyms]  [k] memset
     5.13%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
     5.00%  kpktgend_0   [pktgen]           [k] 0x000000000000330d
     4.68%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
     4.22%  kpktgend_0   [pktgen]           [k] 0x0000000000003334
     3.51%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_safe_ret
     3.46%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
     2.57%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_return_thunk
     2.49%  kpktgend_0   [pktgen]           [k] 0x000000000000332a
     2.02%  kpktgend_0   [pktgen]           [k] 0x0000000000003927
     1.94%  kpktgend_0   [kernel.kallsyms]  [k] __local_bh_enable_ip
     1.92%  kpktgend_0   [pktgen]           [k] 0x000000000000332f
     1.83%  kpktgend_0   [pktgen]           [k] 0x0000000000003116
     1.65%  kpktgend_0   [kernel.kallsyms]  [k] sock_def_readable
     1.51%  kpktgend_0   [pktgen]           [k] 0x00000000000032fd
     1.35%  kpktgend_0   [pktgen]           [k] 0x00000000000030bd
     1.35%  kpktgend_0   [pktgen]           [k] 0x0000000000003919
     1.20%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_lock
     1.14%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_unlock
     1.06%  kpktgend_0   [kernel.kallsyms]  [k] kthread_should_stop
     0.89%  kpktgend_0   [kernel.kallsyms]  [k] kmalloc_reserve
     0.88%  kpktgend_0   [kernel.kallsyms]  [k] __get_random_u32_below
     0.83%  kpktgend_0   [kernel.kallsyms]  [k] __netdev_alloc_skb
     0.83%  kpktgend_0   [kernel.kallsyms]  [k] chacha_permute
     0.74%  kpktgend_0   [kernel.kallsyms]  [k] get_random_u32
     0.72%  kpktgend_0   [pktgen]           [k] 0x00000000000030c5
     0.71%  kpktgend_0   [pktgen]           [k] 0x00000000000030c1
     0.70%  kpktgend_0   [pktgen]           [k] 0x00000000000030ce
     0.68%  kpktgend_0   [pktgen]           [k] 0x00000000000030d1
     0.68%  kpktgend_0   [pktgen]           [k] 0x000000000000391e
     0.63%  kpktgend_0   [pktgen]           [k] 0x000000000000311f
     0.62%  kpktgend_0   [pktgen]           [k] 0x000000000000312c
     0.61%  kpktgend_0   [pktgen]           [k] 0x0000000000003131
     0.61%  kpktgend_0   [pktgen]           [k] 0x0000000000003124
     0.57%  kpktgend_0   [pktgen]           [k] 0x0000000000003111
     0.56%  kpktgend_0   [kernel.kallsyms]  [k] skb_put
     0.55%  kpktgend_0   [pktgen]           [k] 0x00000000000030b8
     0.44%  kpktgend_0   [pktgen]           [k] 0x0000000000004337
     0.43%  kpktgend_0   [pktgen]           [k] 0x0000000000004121
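
To compare a given symbol's overhead between the stock and patched listings above, the percentage can be pulled out of perf's --stdio report text. This is an editor's illustrative sketch (the symbol_overhead helper name is invented here), assuming the standard "perf report --stdio" column layout shown in the listings:

```shell
# Extract the overhead percentage for a given symbol from a
# "perf report --stdio"-style listing like the ones quoted above.
# The last field is the symbol name; the first field is "NN.NN%".
symbol_overhead() {
	symbol="$1"; report="$2"
	printf '%s\n' "$report" | \
		awk -v sym="$symbol" '$NF == sym { gsub("%", "", $1); print $1; exit }'
}

# Demo on one line of the stock kpktgend report quoted earlier:
symbol_overhead tun_net_xmit '    10.98%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit'
```

Against a live report this might be invoked as, e.g., `symbol_overhead tun_net_xmit "$(sudo perf report --stdio)"`.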

Thanks :)

> 
>>
>>>>
>>>> Thanks!
>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I'll still look into using an XDP program that does XDP_DROP in the
>>>>>>>> guest.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> +------------------------+-------------+----------------+
>>>>>>>>>>> | iperf3 TCP benchmarks  | Stock       | Patched with   |
>>>>>>>>>>> | to Debian VM 120s      |             | fq_codel qdisc |
>>>>>>>>>>> | vhost pinned to core 0 |             |                |
>>>>>>>>>>> +------------------------+-------------+----------------+
>>>>>>>>>>> | TAP                    | 22.0 Gbit/s | 22.0 Gbit/s    |
>>>>>>>>>>> |  +                     |             |                |
>>>>>>>>>>> | vhost-net              |             |                |
>>>>>>>>>>> +------------------------+-------------+----------------+
>>>>>>>>>>>
>>>>>>>>>>> +---------------------------+-------------+----------------+
>>>>>>>>>>> | iperf3 TCP benchmarks     | Stock       | Patched with   |
>>>>>>>>>>> | to Debian VM 120s         |             | fq_codel qdisc |
>>>>>>>>>>> | vhost pinned to core 0    |             |                |
>>>>>>>>>>> | *4 iperf3 client threads* |             |                |
>>>>>>>>>>> +---------------------------+-------------+----------------+
>>>>>>>>>>> | TAP                       | 21.4 Gbit/s | 21.5 Gbit/s    |
>>>>>>>>>>> |  +                        |             |                |
>>>>>>>>>>> | vhost-net                 |             |                |
>>>>>>>>>>> +---------------------------+-------------+----------------+
>>>>>>>>>>
>>>>>>>>>> What are your thoughts on this?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-05 22:28                                     ` Simon Schippers
@ 2026-02-06  3:21                                       ` Jason Wang
  2026-02-08 18:18                                         ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-02-06  3:21 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Fri, Feb 6, 2026 at 6:28 AM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 2/5/26 04:59, Jason Wang wrote:
> > On Wed, Feb 4, 2026 at 11:44 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 2/3/26 04:48, Jason Wang wrote:
> >>> On Mon, Feb 2, 2026 at 4:19 AM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 1/30/26 02:51, Jason Wang wrote:
> >>>>> On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> On 1/29/26 02:14, Jason Wang wrote:
> >>>>>>> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
> >>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>
> >>>>>>>> On 1/28/26 08:03, Jason Wang wrote:
> >>>>>>>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
> >>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 1/23/26 10:54, Simon Schippers wrote:
> >>>>>>>>>>> On 1/23/26 04:05, Jason Wang wrote:
> >>>>>>>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
> >>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
> >>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> >>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
> >>>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> >>>>>>>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
> >>>>>>>>>>>>>>>>>> space in the underlying ptr_ring.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
> >>>>>>>>>>>>>>>>>> in an upcoming commit.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
> >>>>>>>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
> >>>>>>>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>>>>>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
> >>>>>>>>>>>>>>>>>> --- a/drivers/net/tap.c
> >>>>>>>>>>>>>>>>>> +++ b/drivers/net/tap.c
> >>>>>>>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>>>>>>>>>>>>>>>         return ret ? ret : total;
> >>>>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
> >>>>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
> >>>>>>>>>>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>>>>>>>>>> +       void *ptr;
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
> >>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
> >>>>>>>>>>>>>>>>>> +               rcu_read_unlock();
> >>>>>>>>>>>>>>>>>> +       }
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>> +       return ptr;
> >>>>>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>>>>>>>>>                            struct iov_iter *to,
> >>>>>>>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
> >>>>>>>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>                 /* Read frames from the queue */
> >>>>>>>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
> >>>>>>>>>>>>>>>>>> +               skb = tap_ring_consume(q);
> >>>>>>>>>>>>>>>>>>                 if (skb)
> >>>>>>>>>>>>>>>>>>                         break;
> >>>>>>>>>>>>>>>>>>                 if (noblock) {
> >>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>>>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
> >>>>>>>>>>>>>>>>>> --- a/drivers/net/tun.c
> >>>>>>>>>>>>>>>>>> +++ b/drivers/net/tun.c
> >>>>>>>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>>>>>>>>>>>>>>>         return total;
> >>>>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
> >>>>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
> >>>>>>>>>>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>>>>>>>>>> +       void *ptr;
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
> >>>>>>>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
> >>>>>>>>>>>>>>>>> another call to tweak the current API.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
> >>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
> >>>>>>>>>>>>>>>>> I'm not sure is what we want.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> What else would you suggest calling to wake the queue?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
> >>>>>>>>>>>>>> aspect.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
> >>>>>>>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
> >>>>>>>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
> >>>>>>>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
> >>>>>>>>>>>>>> already be significantly outdated by the time it is used.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
> >>>>>>>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
> >>>>>>>>>>>>>> smp_call_function_single().
> >>>>>>>>>>>>>> Is this a reasonable approach?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm not sure but it would introduce costs like IPI.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
> >>>>>>>>>>>>>> considered a deal-breaker for the patch set?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It depends on whether or not it has effects on the performance.
> >>>>>>>>>>>>> Especially when vhost is pinned.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
> >>>>>>>>>>>> a specific CPU and the try to see the impact of the TX_SOFTIRQ.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
> >>>>>>>>>>> for both the stock and patched versions. The benchmarks were run with
> >>>>>>>>>>> the full patch series applied, since testing only patches 1-3 would not
> >>>>>>>>>>> be meaningful - the queue is never stopped in that case, so no
> >>>>>>>>>>> TX_SOFTIRQ is triggered.
> >>>>>>>>>>>
> >>>>>>>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
> >>>>>>>>>>> performance is lower for pktgen with a single thread but higher with
> >>>>>>>>>>> four threads. The results show no regression for the patched version,
> >>>>>>>>>>> with even slight performance improvements observed:
> >>>>>>>>>>>
> >>>>>>>>>>> +-------------------------+-----------+----------------+
> >>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>>>>>>>>>> | 100M packets            |           |                |
> >>>>>>>>>>> | vhost pinned to core 0  |           |                |
> >>>>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
> >>>>>>>>>>> |  +        +-------------+-----------+----------------+
> >>>>>>>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
> >>>>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>>>>
> >>>>>>>>>>> +-------------------------+-----------+----------------+
> >>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>>>>>>>>>> | 100M packets            |           |                |
> >>>>>>>>>>> | vhost pinned to core 0  |           |                |
> >>>>>>>>>>> | *4 threads*             |           |                |
> >>>>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
> >>>>>>>>>>> |  +        +-------------+-----------+----------------+
> >>>>>>>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
> >>>>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>>
> >>>>>>>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
> >>>>>>>>> the guest or an xdp program that did XDP_DROP in the guest.
> >>>>>>>>
> >>>>>>>> I forgot to mention that these PPS values are per thread.
> >>>>>>>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
> >>>>>>>> respectively. For packet loss, that comes out to 1527 Kpps * 4 =
> >>>>>>>> 6108 Kpps and 0, respectively.
> >>>>>>>>
> >>>>>>>> Sorry about that!
> >>>>>>>>
> >>>>>>>> The pktgen benchmarks with a single thread look fine, right?
> >>>>>>>
> >>>>>>> Still looks very low. E.g I just have a run of pktgen (using
> >>>>>>> pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest,
> >>>>>>> I can get 1Mpps.
> >>>>>>
> >>>>>> Keep in mind that I am using an older CPU (i5-6300HQ). For the
> >>>>>> single-threaded tests I always used pktgen_sample01_simple.sh, and for
> >>>>>> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
> >>>>>>
> >>>>>> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
> >>>>>> though the same parameters work fine for sample01 and sample02):
> >>>>>>
> >>>>>> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
> >>>>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
> >>>>>> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
> >>>>>> supported
> >>>>>> ERROR: Write error(1) occurred
> >>>>>> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
> >>>>>>
> >>>>>> ...and I do not know what I am doing wrong, even after looking at
> >>>>>> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
> >>>>>> Any clues?
> >>>>>
> >>>>> Please use -b 0, and I'm Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.
> >>>>
> >>>> I tried using "-b 0", and while it worked, there was no noticeable
> >>>> performance improvement.
> >>>>
> >>>>>
> >>>>> Another thing I can think of is to disable
> >>>>>
> >>>>> 1) mitigations in both guest and host
> >>>>> 2) any kernel debug features in both host and guest
> >>>>
> >>>> I also rebuilt the kernel with everything disabled under
> >>>> "Kernel hacking", but that didn’t make any difference either.
> >>>>
> >>>> Because of this, I ran "pktgen_sample01_simple.sh" and
> >>>> "pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
> >>>> results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
> >>>> with very similar performance between the stock and patched kernels.
> >>>>
> >>>> Personally, I think the hardware is to blame for the low performance.
> >>>
> >>> Let's double confirm this by:
> >>>
> >>> 1) make sure pktgen is using 100% CPU
> >>> 2) Perf doesn't show anything strange for pktgen thread
> >>>
> >>> Thanks
> >>>
> >>
> >> I ran pktgen using pktgen_sample01_simple.sh and, in parallel, started a
> >> 100 second perf stat measurement covering all kpktgend threads.
> >>
> >> Across all configurations, a single CPU was fully utilized.
> >>
> >> Apart from that, the patched variants show a higher branch frequency and
> >> a slightly increased number of context switches.
> >>
> >>
> >> The detailed results are provided below:
> >>
> >> Processor: Ryzen 5 5600X
> >>
> >> pktgen command:
> >> sudo perf stat samples/pktgen/pktgen_sample01_simple.sh -i tap0 -m
> >> 52:54:00:12:34:56 -d 10.0.0.2 -n 10000000000
> >>
> >> perf stat command:
> >> sudo perf stat --timeout 100000 -p $(pgrep kpktgend | tr '\n' ,) -o X.txt
> >>
> >>
> >> Results:
> >> Stock TAP:
> >>             46.997      context-switches                 #    467,2 cs/sec  cs_per_second
> >>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
> >>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
> >>         100.587,69 msec task-clock                       #      1,0 CPUs  CPUs_utilized
> >>      8.491.586.483      branch-misses                    #     10,9 %  branch_miss_rate         (50,24%)
> >>     77.734.761.406      branches                         #    772,8 M/sec  branch_frequency     (66,85%)
> >>    382.420.291.585      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
> >>    377.612.185.141      instructions                     #      1,0 instructions  insn_per_cycle  (66,85%)
> >>     84.012.185.936      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
> >>
> >>      100,100414494 seconds time elapsed
> >>
> >>
> >> Stock TAP+vhost-net:
> >>             47.087      context-switches                 #    468,1 cs/sec  cs_per_second
> >>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
> >>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
> >>         100.594,09 msec task-clock                       #      1,0 CPUs  CPUs_utilized
> >>      8.034.703.613      branch-misses                    #     11,1 %  branch_miss_rate         (50,24%)
> >>     72.477.989.922      branches                         #    720,5 M/sec  branch_frequency     (66,86%)
> >>    382.218.276.832      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
> >>    349.555.577.281      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
> >>     83.917.644.262      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
> >>
> >>      100,100520402 seconds time elapsed
> >>
> >>
> >> Patched TAP:
> >>             47.862      context-switches                 #    475,8 cs/sec  cs_per_second
> >>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
> >>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
> >>         100.589,30 msec task-clock                       #      1,0 CPUs  CPUs_utilized
> >>      9.337.258.794      branch-misses                    #      9,4 %  branch_miss_rate         (50,19%)
> >>     99.518.421.676      branches                         #    989,4 M/sec  branch_frequency     (66,85%)
> >>    382.508.244.894      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
> >>    312.582.270.975      instructions                     #      0,8 instructions  insn_per_cycle  (66,85%)
> >>     76.338.503.984      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,39%)
> >>
> >>      100,101262454 seconds time elapsed
> >>
> >>
> >> Patched TAP+vhost-net:
> >>             47.892      context-switches                 #    476,1 cs/sec  cs_per_second
> >>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
> >>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
> >>         100.581,95 msec task-clock                       #      1,0 CPUs  CPUs_utilized
> >>      9.083.588.313      branch-misses                    #     10,1 %  branch_miss_rate         (50,28%)
> >>     90.300.124.712      branches                         #    897,8 M/sec  branch_frequency     (66,85%)
> >>    382.374.510.376      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
> >>    340.089.181.199      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
> >>     78.151.408.955      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,31%)
> >>
> >>      100,101212911 seconds time elapsed
> >
> > Thanks for sharing. I have more questions:
> >
> > 1) The number of CPU and vCPUs
>
> QEMU runs with a single vCPU, and my host system is now a Ryzen 5 5600X
> with 6 cores and 12 threads.
> This is my command for TAP+vhost-net:
>
> sudo qemu-system-x86_64 -hda debian.qcow2
> -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no,vhost=on
> -device virtio-net-pci,netdev=mynet0 -m 1024 -enable-kvm
>
> For TAP only it is the same but without vhost=on.
>
> > 2) If you pin vhost or vCPU threads
>
> Not in the previously shown benchmark. I pinned vhost in other benchmarks,
> but since there is only a minor PPS difference I omitted it for the sake
> of simplicity.
>
> > 3) what does perf top looks like or perf top -p $pid_of_vhost
>
> The perf reports for the pid_of_vhost from pktgen_sample01_simple.sh
> with TAP+vhost-net (not pinned, pktgen single queue, fq_codel) are shown
> below. I cannot see a huge difference between stock and patched.
>
> Also I included perf reports from the pktgen_pids. I find them more
> interesting because tun_net_xmit shows less overhead for the patched kernel.
> I assume that is due to the stopped netdev queue.
>
> I have now benchmarked pretty much all possible combinations (with a
> script) of TAP/TAP+vhost-net, single/multi-queue pktgen, vhost
> pinned/not pinned, with/without -b 0, fq_codel/noqueue... All of that
> with perf records...
> I could share them if you want, but I feel this is getting out of hand.
>
>
> Stock:
> sudo perf record -p "$vhost_pid"
> ...
> # Overhead  Command          Shared Object               Symbol
> # ........  ...............  ..........................  ..........................................
> #
>      5.97%  vhost-4874       [kernel.kallsyms]           [k] _copy_to_iter
>      2.68%  vhost-4874       [kernel.kallsyms]           [k] tun_do_read
>      2.23%  vhost-4874       [kernel.kallsyms]           [k] native_write_msr
>      1.93%  vhost-4874       [kernel.kallsyms]           [k] __check_object_size

Let's disable CONFIG_HARDENED_USERCOPY and retry.
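
(__check_object_size is the hardened-usercopy bounds check.) A quick way to confirm whether the option is built into the running kernel — this is an editor's sketch, and the config source varies by distro (/proc/config.gz needs CONFIG_IKCONFIG_PROC; /boot/config-$(uname -r) is common otherwise):

```shell
# Report whether a kernel config option is enabled, given the text of a
# kernel config. The check is a pure function of the config text, so on a
# live host the second argument would be
#   "$(zcat /proc/config.gz 2>/dev/null || cat /boot/config-$(uname -r))".
config_state() {
	option="$1"; config_text="$2"
	if printf '%s\n' "$config_text" | grep -q "^${option}=[ym]"; then
		echo "enabled"
	elif printf '%s\n' "$config_text" | grep -q "^# ${option} is not set"; then
		echo "disabled"
	else
		echo "unknown"
	fi
}

# Demo on a one-line config fragment:
config_state CONFIG_HARDENED_USERCOPY "# CONFIG_HARDENED_USERCOPY is not set"
```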

>      1.61%  vhost-4874       [kernel.kallsyms]           [k] __slab_free.isra.0
>      1.56%  vhost-4874       [kernel.kallsyms]           [k] __get_user_nocheck_2
>      1.54%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_zero
>      1.45%  vhost-4874       [kernel.kallsyms]           [k] kmem_cache_free
>      1.43%  vhost-4874       [kernel.kallsyms]           [k] tun_recvmsg
>      1.24%  vhost-4874       [kernel.kallsyms]           [k] sk_skb_reason_drop
>      1.12%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_safe_ret
>      1.07%  vhost-4874       [kernel.kallsyms]           [k] native_read_msr
>      0.76%  vhost-4874       [kernel.kallsyms]           [k] simple_copy_to_iter
>      0.75%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_return_thunk
>      0.69%  vhost-4874       [vhost]                     [k] 0x0000000000002e70
>      0.59%  vhost-4874       [kernel.kallsyms]           [k] skb_release_data
>      0.59%  vhost-4874       [kernel.kallsyms]           [k] __skb_datagram_iter
>      0.53%  vhost-4874       [vhost]                     [k] 0x0000000000002e5f
>      0.51%  vhost-4874       [kernel.kallsyms]           [k] slab_update_freelist.isra.0
>      0.46%  vhost-4874       [kernel.kallsyms]           [k] kfree_skbmem
>      0.44%  vhost-4874       [kernel.kallsyms]           [k] skb_copy_datagram_iter
>      0.43%  vhost-4874       [kernel.kallsyms]           [k] skb_free_head
>      0.37%  qemu-system-x86  [unknown]                   [k] 0xffffffffba898b1b
>      0.35%  vhost-4874       [vhost]                     [k] 0x0000000000002e6b
>      0.33%  vhost-4874       [vhost_net]                 [k] 0x000000000000357d
>      0.28%  vhost-4874       [kernel.kallsyms]           [k] __check_heap_object
>      0.27%  vhost-4874       [vhost_net]                 [k] 0x00000000000035f3
>      0.26%  vhost-4874       [vhost_net]                 [k] 0x00000000000030f6
>      0.26%  vhost-4874       [kernel.kallsyms]           [k] __virt_addr_valid
>      0.24%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_advance
>      0.22%  vhost-4874       [kernel.kallsyms]           [k] perf_event_update_userpage
>      0.22%  vhost-4874       [kernel.kallsyms]           [k] check_stack_object
>      0.19%  qemu-system-x86  [unknown]                   [k] 0xffffffffba2a68cd
>      0.19%  vhost-4874       [kernel.kallsyms]           [k] dequeue_entities
>      0.19%  vhost-4874       [vhost_net]                 [k] 0x0000000000003237
>      0.18%  vhost-4874       [vhost_net]                 [k] 0x0000000000003550
>      0.18%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_del
>      0.18%  vhost-4874       [vhost_net]                 [k] 0x00000000000034a0
>      0.17%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_disable_all
>      0.16%  vhost-4874       [vhost_net]                 [k] 0x0000000000003523
>      0.16%  vhost-4874       [kernel.kallsyms]           [k] amd_pmu_addr_offset
> ...
>
>
> sudo perf record -p "$kpktgend_pids":
> ...
> # Overhead  Command      Shared Object      Symbol
> # ........  ...........  .................  ...............................................
> #
>     10.98%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
>     10.45%  kpktgend_0   [kernel.kallsyms]  [k] memset
>      8.40%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
>      6.31%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
>      3.13%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_safe_ret
>      2.40%  kpktgend_0   [kernel.kallsyms]  [k] sk_skb_reason_drop
>      2.11%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_return_thunk

This is a hint that the SRSO mitigation is enabled.

Have you disabled CPU_MITIGATIONS via either Kconfig or the kernel
command line (mitigations=off) for both host and guest?
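
Whether mitigations are off on the running kernel can be confirmed from /proc/cmdline (and the per-vulnerability status lives under /sys/devices/system/cpu/vulnerabilities/). A minimal sketch, with the parsing factored out so it works on any command-line string — the mitigations_state name is invented here:

```shell
# Report whether "mitigations=off" appears on a kernel command line.
# On a live system the argument would be "$(cat /proc/cmdline)".
mitigations_state() {
	case " $1 " in
	*" mitigations=off "*) echo "off" ;;
	*) echo "on (default)" ;;
	esac
}

# Demo on a typical command line:
mitigations_state "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet mitigations=off"
```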

Thanks


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-06  3:21                                       ` Jason Wang
@ 2026-02-08 18:18                                         ` Simon Schippers
  2026-02-12  0:12                                           ` Simon Schippers
  2026-02-12  8:14                                           ` Jason Wang
  0 siblings, 2 replies; 69+ messages in thread
From: Simon Schippers @ 2026-02-08 18:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 2/6/26 04:21, Jason Wang wrote:
> On Fri, Feb 6, 2026 at 6:28 AM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 2/5/26 04:59, Jason Wang wrote:
>>> On Wed, Feb 4, 2026 at 11:44 PM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 2/3/26 04:48, Jason Wang wrote:
>>>>> On Mon, Feb 2, 2026 at 4:19 AM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> On 1/30/26 02:51, Jason Wang wrote:
>>>>>>> On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>
>>>>>>>> On 1/29/26 02:14, Jason Wang wrote:
>>>>>>>>> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>
>>>>>>>>>> On 1/28/26 08:03, Jason Wang wrote:
>>>>>>>>>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 1/23/26 10:54, Simon Schippers wrote:
>>>>>>>>>>>>> On 1/23/26 04:05, Jason Wang wrote:
>>>>>>>>>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
>>>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>>>>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>>>>>>>>>>>>>>> space in the underlying ptr_ring.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>>>>>>>>>>>>>>> in an upcoming commit.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>>>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>>>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>>>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>>>>>>>>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>>>>>>>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>>>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>>>         return ret ? ret : total;
>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>>>>>>>>>>>>>>> +               rcu_read_unlock();
>>>>>>>>>>>>>>>>>>>> +       }
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +       return ptr;
>>>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>>>                            struct iov_iter *to,
>>>>>>>>>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>>>>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>                 /* Read frames from the queue */
>>>>>>>>>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
>>>>>>>>>>>>>>>>>>>> +               skb = tap_ring_consume(q);
>>>>>>>>>>>>>>>>>>>>                 if (skb)
>>>>>>>>>>>>>>>>>>>>                         break;
>>>>>>>>>>>>>>>>>>>>                 if (noblock) {
>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>>>>>>>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>>>>>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>>>>>>>>>>>>>         return total;
>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>>>>>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>>>>>>>>>>>>>>> another call to tweak the current API.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>>>>>>>>>>>>>>> I'm not sure is what we want.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What else would you suggest calling to wake the queue?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
>>>>>>>>>>>>>>>> aspect.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
>>>>>>>>>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
>>>>>>>>>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>>>>>>>>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
>>>>>>>>>>>>>>>> already be significantly outdated by the time it is used.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
>>>>>>>>>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>>>>>>>>>>>>>>> smp_call_function_single().
>>>>>>>>>>>>>>>> Is this a reasonable approach?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure but it would introduce costs like IPI.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>>>>>>>>>>>>>>> considered a deal-breaker for the patch set?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It depends on whether or not it has effects on the performance.
>>>>>>>>>>>>>>> Especially when vhost is pinned.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
>>>>>>>>>>>>>> a specific CPU and the try to see the impact of the TX_SOFTIRQ.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
>>>>>>>>>>>>> for both the stock and patched versions. The benchmarks were run with
>>>>>>>>>>>>> the full patch series applied, since testing only patches 1-3 would not
>>>>>>>>>>>>> be meaningful - the queue is never stopped in that case, so no
>>>>>>>>>>>>> TX_SOFTIRQ is triggered.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
>>>>>>>>>>>>> performance is lower for pktgen with a single thread but higher with
>>>>>>>>>>>>> four threads. The results show no regression for the patched version,
>>>>>>>>>>>>> with even slight performance improvements observed:
>>>>>>>>>>>>>
>>>>>>>>>>>>> +-------------------------+-----------+----------------+
>>>>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>>>>>>>> | 100M packets            |           |                |
>>>>>>>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
>>>>>>>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>>>>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>>
>>>>>>>>>>>>> +-------------------------+-----------+----------------+
>>>>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>>>>>>>> | 100M packets            |           |                |
>>>>>>>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>>>>>>>> | *4 threads*             |           |                |
>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
>>>>>>>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>>>>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>
>>>>>>>>>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
>>>>>>>>>>> the guest or an xdp program that did XDP_DROP in the guest.
>>>>>>>>>>
>>>>>>>>>> I forgot to mention that these PPS values are per thread.
>>>>>>>>>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 326 Kpps,
>>>>>>>>>> respectively. For packet loss, that comes out to 1154 Kpps * 4 =
>>>>>>>>>> 4616 Kpps and 0, respectively.
>>>>>>>>>>
>>>>>>>>>> Sorry about that!
>>>>>>>>>>
>>>>>>>>>> The pktgen benchmarks with a single thread look fine, right?
>>>>>>>>>
>>>>>>>>> Still looks very low. E.g I just have a run of pktgen (using
>>>>>>>>> pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest,
>>>>>>>>> I can get 1Mpps.
>>>>>>>>
>>>>>>>> Keep in mind that I am using an older CPU (i5-6300HQ). For the
>>>>>>>> single-threaded tests I always used pktgen_sample01_simple.sh, and for
>>>>>>>> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
>>>>>>>>
>>>>>>>> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
>>>>>>>> though the same parameters work fine for sample01 and sample02):
>>>>>>>>
>>>>>>>> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
>>>>>>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
>>>>>>>> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
>>>>>>>> supported
>>>>>>>> ERROR: Write error(1) occurred
>>>>>>>> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
>>>>>>>>
>>>>>>>> ...and I do not know what I am doing wrong, even after looking at
>>>>>>>> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
>>>>>>>> Any clues?
>>>>>>>
>>>>>>> Please use -b 0, and I'm Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.
>>>>>>
>>>>>> I tried using "-b 0", and while it worked, there was no noticeable
>>>>>> performance improvement.
>>>>>>
>>>>>>>
>>>>>>> Another thing I can think of is to disable
>>>>>>>
>>>>>>> 1) mitigations in both guest and host
>>>>>>> 2) any kernel debug features in both host and guest
>>>>>>
>>>>>> I also rebuilt the kernel with everything disabled under
>>>>>> "Kernel hacking", but that didn’t make any difference either.
>>>>>>
>>>>>> Because of this, I ran "pktgen_sample01_simple.sh" and
>>>>>> "pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
>>>>>> results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
>>>>>> with very similar performance between the stock and patched kernels.
>>>>>>
>>>>>> Personally, I think the hardware is to blame for the low performance.
>>>>>
>>>>> Let's double confirm this by:
>>>>>
>>>>> 1) make sure pktgen is using 100% CPU
>>>>> 2) Perf doesn't show anything strange for pktgen thread
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>> I ran pktgen using pktgen_sample01_simple.sh and, in parallel, started a
>>>> 100 second perf stat measurement covering all kpktgend threads.
>>>>
>>>> Across all configurations, a single CPU was fully utilized.
>>>>
>>>> Apart from that, the patched variants show a higher branch frequency and
>>>> a slightly increased number of context switches.
>>>>
>>>>
>>>> The detailed results are provided below:
>>>>
>>>> Processor: Ryzen 5 5600X
>>>>
>>>> pktgen command:
>>>> sudo perf stat samples/pktgen/pktgen_sample01_simple.sh -i tap0 -m
>>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 10000000000
>>>>
>>>> perf stat command:
>>>> sudo perf stat --timeout 100000 -p $(pgrep kpktgend | tr '\n' ,) -o X.txt
>>>>
>>>>
>>>> Results:
>>>> Stock TAP:
>>>>             46.997      context-switches                 #    467,2 cs/sec  cs_per_second
>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>         100.587,69 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>      8.491.586.483      branch-misses                    #     10,9 %  branch_miss_rate         (50,24%)
>>>>     77.734.761.406      branches                         #    772,8 M/sec  branch_frequency     (66,85%)
>>>>    382.420.291.585      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>    377.612.185.141      instructions                     #      1,0 instructions  insn_per_cycle  (66,85%)
>>>>     84.012.185.936      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
>>>>
>>>>      100,100414494 seconds time elapsed
>>>>
>>>>
>>>> Stock TAP+vhost-net:
>>>>             47.087      context-switches                 #    468,1 cs/sec  cs_per_second
>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>         100.594,09 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>      8.034.703.613      branch-misses                    #     11,1 %  branch_miss_rate         (50,24%)
>>>>     72.477.989.922      branches                         #    720,5 M/sec  branch_frequency     (66,86%)
>>>>    382.218.276.832      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>    349.555.577.281      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
>>>>     83.917.644.262      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
>>>>
>>>>      100,100520402 seconds time elapsed
>>>>
>>>>
>>>> Patched TAP:
>>>>             47.862      context-switches                 #    475,8 cs/sec  cs_per_second
>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>         100.589,30 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>      9.337.258.794      branch-misses                    #      9,4 %  branch_miss_rate         (50,19%)
>>>>     99.518.421.676      branches                         #    989,4 M/sec  branch_frequency     (66,85%)
>>>>    382.508.244.894      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>    312.582.270.975      instructions                     #      0,8 instructions  insn_per_cycle  (66,85%)
>>>>     76.338.503.984      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,39%)
>>>>
>>>>      100,101262454 seconds time elapsed
>>>>
>>>>
>>>> Patched TAP+vhost-net:
>>>>             47.892      context-switches                 #    476,1 cs/sec  cs_per_second
>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>         100.581,95 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>      9.083.588.313      branch-misses                    #     10,1 %  branch_miss_rate         (50,28%)
>>>>     90.300.124.712      branches                         #    897,8 M/sec  branch_frequency     (66,85%)
>>>>    382.374.510.376      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>    340.089.181.199      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
>>>>     78.151.408.955      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,31%)
>>>>
>>>>      100,101212911 seconds time elapsed
>>>
>>> Thanks for sharing. I have more questions:
>>>
>>> 1) The number of CPU and vCPUs
>>
>> qemu runs with a single core. And my host system is now a Ryzen 5 5600x
>> with 6 cores, 12 threads.
>> This is my command for TAP+vhost-net:
>>
>> sudo qemu-system-x86_64 -hda debian.qcow2
>> -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no,vhost=on
>> -device virtio-net-pci,netdev=mynet0 -m 1024 -enable-kvm
>>
>> For TAP only it is the same but without vhost=on.
>>
>>> 2) If you pin vhost or vCPU threads
>>
>> Not in the previous shown benchmark. I pinned vhost in other benchmarks
>> but since there is only minor PPS difference I omitted for the sake of
>> simplicity.
>>
>>> 3) what does perf top looks like or perf top -p $pid_of_vhost
>>
>> The perf reports for the pid_of_vhost from pktgen_sample01_simple.sh
>> with TAP+vhost-net (not pinned, pktgen single queue, fq_codel) are shown
>> below. I can not see a huge difference between stock and patched.
>>
>> Also I included perf reports from the pktgen_pids. I find them more
>> interesting because tun_net_xmit shows less overhead for the patched.
>> I assume that is due to the stopped netdev queue.
>>
>> I have now benchmarked pretty much all possible combinations (with a
>> script) of TAP/TAP+vhost-net, single/multi-queue pktgen, vhost
>> pinned/not pinned, with/without -b 0, fq_codel/noqueue... All of that
>> with perf records..
>> I could share them if you want but I feel this is getting out of hand.
>>
>>
>> Stock:
>> sudo perf record -p "$vhost_pid"
>> ...
>> # Overhead  Command          Shared Object               Symbol
>> # ........  ...............  ..........................  ..........................................
>> #
>>      5.97%  vhost-4874       [kernel.kallsyms]           [k] _copy_to_iter
>>      2.68%  vhost-4874       [kernel.kallsyms]           [k] tun_do_read
>>      2.23%  vhost-4874       [kernel.kallsyms]           [k] native_write_msr
>>      1.93%  vhost-4874       [kernel.kallsyms]           [k] __check_object_size
> 
> Let's disable CONFIG_HARDENED_USERCOPY and retry.
> 
>>      1.61%  vhost-4874       [kernel.kallsyms]           [k] __slab_free.isra.0
>>      1.56%  vhost-4874       [kernel.kallsyms]           [k] __get_user_nocheck_2
>>      1.54%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_zero
>>      1.45%  vhost-4874       [kernel.kallsyms]           [k] kmem_cache_free
>>      1.43%  vhost-4874       [kernel.kallsyms]           [k] tun_recvmsg
>>      1.24%  vhost-4874       [kernel.kallsyms]           [k] sk_skb_reason_drop
>>      1.12%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_safe_ret
>>      1.07%  vhost-4874       [kernel.kallsyms]           [k] native_read_msr
>>      0.76%  vhost-4874       [kernel.kallsyms]           [k] simple_copy_to_iter
>>      0.75%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_return_thunk
>>      0.69%  vhost-4874       [vhost]                     [k] 0x0000000000002e70
>>      0.59%  vhost-4874       [kernel.kallsyms]           [k] skb_release_data
>>      0.59%  vhost-4874       [kernel.kallsyms]           [k] __skb_datagram_iter
>>      0.53%  vhost-4874       [vhost]                     [k] 0x0000000000002e5f
>>      0.51%  vhost-4874       [kernel.kallsyms]           [k] slab_update_freelist.isra.0
>>      0.46%  vhost-4874       [kernel.kallsyms]           [k] kfree_skbmem
>>      0.44%  vhost-4874       [kernel.kallsyms]           [k] skb_copy_datagram_iter
>>      0.43%  vhost-4874       [kernel.kallsyms]           [k] skb_free_head
>>      0.37%  qemu-system-x86  [unknown]                   [k] 0xffffffffba898b1b
>>      0.35%  vhost-4874       [vhost]                     [k] 0x0000000000002e6b
>>      0.33%  vhost-4874       [vhost_net]                 [k] 0x000000000000357d
>>      0.28%  vhost-4874       [kernel.kallsyms]           [k] __check_heap_object
>>      0.27%  vhost-4874       [vhost_net]                 [k] 0x00000000000035f3
>>      0.26%  vhost-4874       [vhost_net]                 [k] 0x00000000000030f6
>>      0.26%  vhost-4874       [kernel.kallsyms]           [k] __virt_addr_valid
>>      0.24%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_advance
>>      0.22%  vhost-4874       [kernel.kallsyms]           [k] perf_event_update_userpage
>>      0.22%  vhost-4874       [kernel.kallsyms]           [k] check_stack_object
>>      0.19%  qemu-system-x86  [unknown]                   [k] 0xffffffffba2a68cd
>>      0.19%  vhost-4874       [kernel.kallsyms]           [k] dequeue_entities
>>      0.19%  vhost-4874       [vhost_net]                 [k] 0x0000000000003237
>>      0.18%  vhost-4874       [vhost_net]                 [k] 0x0000000000003550
>>      0.18%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_del
>>      0.18%  vhost-4874       [vhost_net]                 [k] 0x00000000000034a0
>>      0.17%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_disable_all
>>      0.16%  vhost-4874       [vhost_net]                 [k] 0x0000000000003523
>>      0.16%  vhost-4874       [kernel.kallsyms]           [k] amd_pmu_addr_offset
>> ...
>>
>>
>> sudo perf record -p "$kpktgend_pids":
>> ...
>> # Overhead  Command      Shared Object      Symbol
>> # ........  ...........  .................  ...............................................
>> #
>>     10.98%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
>>     10.45%  kpktgend_0   [kernel.kallsyms]  [k] memset
>>      8.40%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
>>      6.31%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
>>      3.13%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_safe_ret
>>      2.40%  kpktgend_0   [kernel.kallsyms]  [k] sk_skb_reason_drop
>>      2.11%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_return_thunk
> 
> This is a hint that the SRSO mitigation is enabled.
> 
> Have you disabled CPU_MITIGATIONS via either Kconfig or kernel command
> line (mitigations=off) for both host and guest?
> 
> Thanks
> 

Both of your suggested changes really boosted the performance, especially
for TAP.

I disabled the SRSO mitigation with spec_rstack_overflow=off, which took
the host from "Mitigation: Safe RET" to "Vulnerable". The VM already
reported "Not affected", but I applied spec_rstack_overflow=off there anyway.

Here are some new benchmarks for pktgen_sample01_simple.sh:
(I also have others available and can share them if you want.)

+-------------------------+-----------+----------------+
| pktgen benchmarks to    | Stock     | Patched with   |
| Debian VM, R5 5600X,    |           | fq_codel qdisc |
| 100M packets            |           |                |
| CPU not pinned          |           |                |
+-----------+-------------+-----------+----------------+
| TAP       | Transmitted | 1330 Kpps | 1033 Kpps      |
|           +-------------+-----------+----------------+
|           | Lost        | 3895 Kpps | 0              |
+-----------+-------------+-----------+----------------+
| TAP       | Transmitted | 1408 Kpps | 1420 Kpps      |
|  +        +-------------+-----------+----------------+
| vhost-net | Lost        | 3712 Kpps | 0              |
+-----------+-------------+-----------+----------------+

I do not understand why there is a regression for TAP but not for
TAP+vhost-net...


The perf report of pktgen and perf stat for TAP & TAP+vhost-net are
below. I also included perf reports & perf stats of vhost for
TAP+vhost-net.

=========================================================================

TAP stock:
perf report of pktgen:

# Overhead  Command      Shared Object      Symbol                                        
# ........  ...........  .................  ..............................................
#
    22.39%  kpktgend_0   [kernel.kallsyms]  [k] memset
    10.59%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
     7.56%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
     5.74%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
     4.76%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_free
     3.23%  kpktgend_0   [kernel.kallsyms]  [k] chacha_permute
     2.55%  kpktgend_0   [pktgen]           [k] 0x0000000000003255
     2.49%  kpktgend_0   [pktgen]           [k] 0x000000000000324f
     2.48%  kpktgend_0   [pktgen]           [k] 0x000000000000325d
     2.44%  kpktgend_0   [kernel.kallsyms]  [k] get_random_u32
     2.21%  kpktgend_0   [kernel.kallsyms]  [k] skb_put
     1.46%  kpktgend_0   [kernel.kallsyms]  [k] sk_skb_reason_drop
     1.36%  kpktgend_0   [kernel.kallsyms]  [k] ip_send_check
     1.17%  kpktgend_0   [kernel.kallsyms]  [k] __local_bh_enable_ip
     1.09%  kpktgend_0   [kernel.kallsyms]  [k] _raw_spin_lock
     1.01%  kpktgend_0   [kernel.kallsyms]  [k] kmalloc_reserve
     0.85%  kpktgend_0   [kernel.kallsyms]  [k] skb_release_data
     0.83%  kpktgend_0   [kernel.kallsyms]  [k] __netdev_alloc_skb
     0.71%  kpktgend_0   [pktgen]           [k] 0x000000000000324d
     0.68%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_unlock
     0.64%  kpktgend_0   [kernel.kallsyms]  [k] skb_tx_error
     0.59%  kpktgend_0   [kernel.kallsyms]  [k] __get_random_u32_below
     0.58%  kpktgend_0   [kernel.kallsyms]  [k] sock_def_readable
     0.51%  kpktgend_0   [pktgen]           [k] 0x000000000000422e
     0.50%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_lock
     0.48%  kpktgend_0   [kernel.kallsyms]  [k] _get_random_bytes
     0.46%  kpktgend_0   [pktgen]           [k] 0x0000000000004220
     0.46%  kpktgend_0   [pktgen]           [k] 0x0000000000004229
     0.45%  kpktgend_0   [kernel.kallsyms]  [k] skb_release_head_state
     0.44%  kpktgend_0   [pktgen]           [k] 0x000000000000211d
...


perf stat of pktgen:
 Performance counter stats for process id '4740,4741,4742,4743,4744,4745,4746,4747,4748,4749,4750,4751,4752,4753,4754,4755,4756,4757,4758,4759,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787':

            35.436      context-switches                 #    469,7 cs/sec  cs_per_second     
                 0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
                 0      page-faults                      #      0,0 faults/sec  page_faults_per_second
         75.443,67 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
       548.187.113      branch-misses                    #      0,5 %  branch_miss_rate         (50,18%)
   119.270.991.801      branches                         #   1580,9 M/sec  branch_frequency     (66,79%)
   347.803.953.690      cpu-cycles                       #      4,6 GHz  cycles_frequency       (66,79%)
   689.142.448.524      instructions                     #      2,0 instructions  insn_per_cycle  (66,79%)
    11.063.715.152      stalled-cycles-frontend          #     0,03 frontend_cycles_idle        (66,43%)

      75,698467362 seconds time elapsed


=========================================================================

TAP patched:
perf report of pktgen:

# Overhead  Command      Shared Object      Symbol                                        
# ........  ...........  .................  ..............................................
#
    16.18%  kpktgend_0   [pktgen]           [k] 0x0000000000003255
    16.11%  kpktgend_0   [pktgen]           [k] 0x000000000000324f
    16.10%  kpktgend_0   [pktgen]           [k] 0x000000000000325d
     4.78%  kpktgend_0   [kernel.kallsyms]  [k] memset
     4.54%  kpktgend_0   [kernel.kallsyms]  [k] __local_bh_enable_ip
     2.62%  kpktgend_0   [pktgen]           [k] 0x000000000000324d
     2.42%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
     1.89%  kpktgend_0   [kernel.kallsyms]  [k] kthread_should_stop
     1.77%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_unlock
     1.66%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
     1.53%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_lock
     1.44%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
     1.42%  kpktgend_0   [kernel.kallsyms]  [k] __cond_resched
     0.91%  kpktgend_0   [pktgen]           [k] 0x0000000000003877
     0.91%  kpktgend_0   [pktgen]           [k] 0x0000000000003284
     0.89%  kpktgend_0   [pktgen]           [k] 0x000000000000327f
     0.75%  kpktgend_0   [kernel.kallsyms]  [k] chacha_permute
     0.64%  kpktgend_0   [pktgen]           [k] 0x0000000000003061
     0.61%  kpktgend_0   [kernel.kallsyms]  [k] get_random_u32
     0.57%  kpktgend_0   [kernel.kallsyms]  [k] sock_def_readable
     0.52%  kpktgend_0   [kernel.kallsyms]  [k] skb_put
     0.48%  kpktgend_0   [pktgen]           [k] 0x000000000000326d
     0.47%  kpktgend_0   [pktgen]           [k] 0x0000000000003265
     0.47%  kpktgend_0   [pktgen]           [k] 0x0000000000003864
     0.45%  kpktgend_0   [pktgen]           [k] 0x0000000000003008
     0.35%  kpktgend_0   [pktgen]           [k] 0x000000000000449b
     0.34%  kpktgend_0   [pktgen]           [k] 0x0000000000003242
     0.32%  kpktgend_0   [pktgen]           [k] 0x00000000000030a6
     0.32%  kpktgend_0   [pktgen]           [k] 0x000000000000308b
     0.32%  kpktgend_0   [pktgen]           [k] 0x0000000000003869
     0.31%  kpktgend_0   [pktgen]           [k] 0x00000000000030c2
...

perf stat of pktgen:

 Performance counter stats for process id '3257,3258,3259,3260,3261,3262,3263,3264,3265,3266,3267,3268,3269,3270,3271,3272,3273,3274,3275,3276,3277,3278,3279,3280,3281,3282,3283,3284,3285,3286,3287,3288,3289,3290,3291,3292,3293,3294,3295,3296,3297,3298,3299,3300,3301,3302,3303,3304':

            45.545      context-switches                 #    468,9 cs/sec  cs_per_second     
                 0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
                 0      page-faults                      #      0,0 faults/sec  page_faults_per_second
         97.130,77 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
       237.212.098      branch-misses                    #      0,1 %  branch_miss_rate         (50,12%)
   172.088.418.840      branches                         #   1771,7 M/sec  branch_frequency     (66,78%)
   447.219.346.605      cpu-cycles                       #      4,6 GHz  cycles_frequency       (66,79%)
   619.203.459.603      instructions                     #      1,4 instructions  insn_per_cycle  (66,79%)
     5.821.044.711      stalled-cycles-frontend          #     0,01 frontend_cycles_idle        (66,48%)

      97,353332168 seconds time elapsed

=========================================================================

TAP+vhost-net stock:

perf report of pktgen:

# Overhead  Command      Shared Object      Symbol                                        
# ........  ...........  .................  ..............................................
#
    22.25%  kpktgend_0   [kernel.kallsyms]  [k] memset
    10.73%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
     7.69%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
     5.71%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
     4.66%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_free
     3.20%  kpktgend_0   [kernel.kallsyms]  [k] chacha_permute
     2.50%  kpktgend_0   [pktgen]           [k] 0x000000000000325d
     2.48%  kpktgend_0   [pktgen]           [k] 0x0000000000003255
     2.45%  kpktgend_0   [pktgen]           [k] 0x000000000000324f
     2.41%  kpktgend_0   [kernel.kallsyms]  [k] get_random_u32
     2.22%  kpktgend_0   [kernel.kallsyms]  [k] skb_put
     1.44%  kpktgend_0   [kernel.kallsyms]  [k] sk_skb_reason_drop
     1.34%  kpktgend_0   [kernel.kallsyms]  [k] ip_send_check
     1.22%  kpktgend_0   [kernel.kallsyms]  [k] __local_bh_enable_ip
     1.06%  kpktgend_0   [kernel.kallsyms]  [k] _raw_spin_lock
     1.04%  kpktgend_0   [kernel.kallsyms]  [k] kmalloc_reserve
     0.85%  kpktgend_0   [kernel.kallsyms]  [k] skb_release_data
     0.83%  kpktgend_0   [kernel.kallsyms]  [k] __netdev_alloc_skb
     0.72%  kpktgend_0   [pktgen]           [k] 0x000000000000324d
     0.70%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_unlock
     0.62%  kpktgend_0   [kernel.kallsyms]  [k] skb_tx_error
     0.61%  kpktgend_0   [kernel.kallsyms]  [k] __get_random_u32_below
     0.60%  kpktgend_0   [kernel.kallsyms]  [k] sock_def_readable
     0.52%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_lock
     0.47%  kpktgend_0   [kernel.kallsyms]  [k] _get_random_bytes
     0.47%  kpktgend_0   [pktgen]           [k] 0x000000000000422e
     0.45%  kpktgend_0   [pktgen]           [k] 0x0000000000004229
     0.44%  kpktgend_0   [pktgen]           [k] 0x0000000000004220
     0.43%  kpktgend_0   [kernel.kallsyms]  [k] skb_release_head_state
     0.42%  kpktgend_0   [kernel.kallsyms]  [k] netdev_core_stats_inc
     0.42%  kpktgend_0   [pktgen]           [k] 0x0000000000002119
...

perf stat of pktgen:

 Performance counter stats for process id '4740,4741,4742,4743,4744,4745,4746,4747,4748,4749,4750,4751,4752,4753,4754,4755,4756,4757,4758,4759,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787':

            34.830      context-switches                 #    489,0 cs/sec  cs_per_second     
                 0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
                 0      page-faults                      #      0,0 faults/sec  page_faults_per_second
         71.224,77 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
       506.905.400      branch-misses                    #      0,5 %  branch_miss_rate         (50,15%)
   110.207.563.428      branches                         #   1547,3 M/sec  branch_frequency     (66,78%)
   324.745.594.771      cpu-cycles                       #      4,6 GHz  cycles_frequency       (66,77%)
   635.181.893.816      instructions                     #      2,0 instructions  insn_per_cycle  (66,77%)
    10.450.586.633      stalled-cycles-frontend          #     0,03 frontend_cycles_idle        (66,46%)

      71,547831150 seconds time elapsed


perf report of vhost:

# Overhead  Command          Shared Object               Symbol                                          
# ........  ...............  ..........................  ................................................
#
     8.66%  vhost-14592      [kernel.kallsyms]           [k] _copy_to_iter
     2.76%  vhost-14592      [kernel.kallsyms]           [k] native_write_msr
     2.57%  vhost-14592      [kernel.kallsyms]           [k] __get_user_nocheck_2
     2.03%  vhost-14592      [kernel.kallsyms]           [k] iov_iter_zero
     1.21%  vhost-14592      [kernel.kallsyms]           [k] native_read_msr
     0.89%  vhost-14592      [kernel.kallsyms]           [k] kmem_cache_free
     0.85%  vhost-14592      [kernel.kallsyms]           [k] __slab_free.isra.0
     0.84%  vhost-14592      [vhost]                     [k] 0x0000000000002e3a
     0.83%  vhost-14592      [kernel.kallsyms]           [k] tun_do_read
     0.74%  vhost-14592      [kernel.kallsyms]           [k] tun_recvmsg
     0.72%  vhost-14592      [kernel.kallsyms]           [k] slab_update_freelist.isra.0
     0.49%  vhost-14592      [vhost]                     [k] 0x0000000000002e29
     0.45%  vhost-14592      [vhost]                     [k] 0x0000000000002e35
     0.43%  qemu-system-x86  [unknown]                   [k] 0xffffffffb5298b1b
     0.26%  vhost-14592      [kernel.kallsyms]           [k] __skb_datagram_iter
     0.24%  vhost-14592      [kernel.kallsyms]           [k] skb_release_data
     0.24%  qemu-system-x86  [unknown]                   [k] 0xffffffffb4ca68cd
     0.24%  vhost-14592      [kernel.kallsyms]           [k] iov_iter_advance
     0.22%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eb79c
     0.22%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba58
     0.14%  vhost-14592      [kernel.kallsyms]           [k] sk_skb_reason_drop
     0.14%  vhost-14592      [kernel.kallsyms]           [k] amd_pmu_addr_offset
     0.13%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba54
     0.13%  vhost-14592      [kernel.kallsyms]           [k] skb_free_head
     0.12%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba50
     0.12%  vhost-14592      [kernel.kallsyms]           [k] skb_release_head_state
     0.10%  qemu-system-x86  [kernel.kallsyms]           [k] native_write_msr
     0.09%  vhost-14592      [kernel.kallsyms]           [k] event_sched_out
     0.09%  vhost-14592      [kernel.kallsyms]           [k] x86_pmu_del
     0.09%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eb798
     0.09%  vhost-14592      [kernel.kallsyms]           [k] put_cpu_partial
...


perf stat of vhost:

 Performance counter stats for process id '14592':

         1.576.207      context-switches                 #  15070,7 cs/sec  cs_per_second     
               459      cpu-migrations                   #      4,4 migrations/sec  migrations_per_second
                 2      page-faults                      #      0,0 faults/sec  page_faults_per_second
        104.587,77 msec task-clock                       #      1,5 CPUs  CPUs_utilized       
       401.899.188      branch-misses                    #      0,2 %  branch_miss_rate         (49,91%)
   174.642.296.972      branches                         #   1669,8 M/sec  branch_frequency     (66,71%)
   453.598.103.128      cpu-cycles                       #      4,3 GHz  cycles_frequency       (66,98%)
   957.886.719.689      instructions                     #      2,1 instructions  insn_per_cycle  (66,77%)
    11.834.633.090      stalled-cycles-frontend          #     0,03 frontend_cycles_idle        (66,54%)

      71,561336447 seconds time elapsed


=========================================================================

TAP+vhost-net patched:

perf report of pktgen:

# Overhead  Command      Shared Object      Symbol                                        
# ........  ...........  .................  ..............................................
#
    16.83%  kpktgend_0   [pktgen]           [k] 0x000000000000324f
    16.81%  kpktgend_0   [pktgen]           [k] 0x0000000000003255
    16.74%  kpktgend_0   [pktgen]           [k] 0x000000000000325d
     5.96%  kpktgend_0   [kernel.kallsyms]  [k] memset
     3.87%  kpktgend_0   [kernel.kallsyms]  [k] __local_bh_enable_ip
     2.87%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
     1.77%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
     1.72%  kpktgend_0   [pktgen]           [k] 0x000000000000324d
     1.68%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
     1.63%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_unlock
     1.56%  kpktgend_0   [kernel.kallsyms]  [k] kthread_should_stop
     1.41%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_lock
     1.19%  kpktgend_0   [kernel.kallsyms]  [k] __cond_resched
     0.83%  kpktgend_0   [kernel.kallsyms]  [k] chacha_permute
     0.79%  kpktgend_0   [pktgen]           [k] 0x000000000000327f
     0.78%  kpktgend_0   [pktgen]           [k] 0x0000000000003284
     0.77%  kpktgend_0   [pktgen]           [k] 0x0000000000003877
     0.69%  kpktgend_0   [kernel.kallsyms]  [k] sock_def_readable
     0.66%  kpktgend_0   [kernel.kallsyms]  [k] get_random_u32
     0.56%  kpktgend_0   [kernel.kallsyms]  [k] skb_put
     0.54%  kpktgend_0   [pktgen]           [k] 0x0000000000003061
     0.41%  kpktgend_0   [pktgen]           [k] 0x0000000000003864
     0.41%  kpktgend_0   [pktgen]           [k] 0x0000000000003265
     0.40%  kpktgend_0   [pktgen]           [k] 0x0000000000003008
     0.39%  kpktgend_0   [pktgen]           [k] 0x000000000000326d
     0.37%  kpktgend_0   [kernel.kallsyms]  [k] ip_send_check
     0.36%  kpktgend_0   [pktgen]           [k] 0x000000000000422e
     0.32%  kpktgend_0   [pktgen]           [k] 0x0000000000004220
     0.30%  kpktgend_0   [pktgen]           [k] 0x0000000000004229
     0.29%  kpktgend_0   [kernel.kallsyms]  [k] kmalloc_reserve
     0.28%  kpktgend_0   [kernel.kallsyms]  [k] _raw_spin_lock
...

perf stat of pktgen:

 Performance counter stats for process id '3257,3258,3259,3260,3261,3262,3263,3264,3265,3266,3267,3268,3269,3270,3271,3272,3273,3274,3275,3276,3277,3278,3279,3280,3281,3282,3283,3284,3285,3286,3287,3288,3289,3290,3291,3292,3293,3294,3295,3296,3297,3298,3299,3300,3301,3302,3303,3304':

            34.525      context-switches                 #    489,1 cs/sec  cs_per_second     
                 0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
                 0      page-faults                      #      0,0 faults/sec  page_faults_per_second
         70.593,02 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
       225.587.357      branch-misses                    #      0,2 %  branch_miss_rate         (50,15%)
   135.486.264.836      branches                         #   1919,3 M/sec  branch_frequency     (66,77%)
   324.131.813.682      cpu-cycles                       #      4,6 GHz  cycles_frequency       (66,77%)
   501.960.610.999      instructions                     #      1,5 instructions  insn_per_cycle  (66,77%)
     2.689.294.657      stalled-cycles-frontend          #     0,01 frontend_cycles_idle        (66,46%)

      70,928052784 seconds time elapsed


perf report of vhost:

# Overhead  Command          Shared Object               Symbol                                          
# ........  ...............  ..........................  ................................................
#
     8.95%  vhost-12220      [kernel.kallsyms]           [k] _copy_to_iter
     4.03%  vhost-12220      [kernel.kallsyms]           [k] native_write_msr
     2.44%  vhost-12220      [kernel.kallsyms]           [k] __get_user_nocheck_2
     2.12%  vhost-12220      [kernel.kallsyms]           [k] iov_iter_zero
     1.74%  vhost-12220      [kernel.kallsyms]           [k] native_read_msr
     0.92%  vhost-12220      [kernel.kallsyms]           [k] kmem_cache_free
     0.87%  vhost-12220      [vhost]                     [k] 0x0000000000002e3a
     0.86%  vhost-12220      [kernel.kallsyms]           [k] __slab_free.isra.0
     0.82%  vhost-12220      [kernel.kallsyms]           [k] tun_recvmsg
     0.82%  vhost-12220      [kernel.kallsyms]           [k] tun_do_read
     0.73%  vhost-12220      [kernel.kallsyms]           [k] slab_update_freelist.isra.0
     0.51%  vhost-12220      [vhost]                     [k] 0x0000000000002e29
     0.47%  vhost-12220      [vhost]                     [k] 0x0000000000002e35
     0.40%  qemu-system-x86  [unknown]                   [k] 0xffffffff97e98b1b
     0.28%  vhost-12220      [kernel.kallsyms]           [k] __skb_datagram_iter
     0.26%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba58
     0.24%  vhost-12220      [kernel.kallsyms]           [k] iov_iter_advance
     0.22%  qemu-system-x86  [unknown]                   [k] 0xffffffff978a68cd
     0.22%  vhost-12220      [kernel.kallsyms]           [k] skb_release_data
     0.21%  vhost-12220      [kernel.kallsyms]           [k] amd_pmu_addr_offset
     0.19%  vhost-12220      [kernel.kallsyms]           [k] tun_ring_consume_batched
     0.18%  vhost-12220      [kernel.kallsyms]           [k] __rcu_read_unlock
     0.14%  vhost-12220      [kernel.kallsyms]           [k] sk_skb_reason_drop
     0.13%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eb79c
     0.13%  vhost-12220      [kernel.kallsyms]           [k] skb_release_head_state
     0.13%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba54
     0.13%  vhost-12220      [kernel.kallsyms]           [k] psi_group_change
     0.13%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba50
     0.11%  vhost-12220      [kernel.kallsyms]           [k] skb_free_head
     0.10%  vhost-12220      [kernel.kallsyms]           [k] __update_load_avg_cfs_rq
     0.10%  vhost-12220      [kernel.kallsyms]           [k] update_load_avg
...


perf stat of vhost:

 Performance counter stats for process id '12220':

         2.841.331      context-switches                 #  26120,3 cs/sec  cs_per_second     
             1.902      cpu-migrations                   #     17,5 migrations/sec  migrations_per_second
                 2      page-faults                      #      0,0 faults/sec  page_faults_per_second
        108.778,75 msec task-clock                       #      1,5 CPUs  CPUs_utilized       
       422.032.153      branch-misses                    #      0,2 %  branch_miss_rate         (49,95%)
   177.051.281.496      branches                         #   1627,6 M/sec  branch_frequency     (66,59%)
   458.977.136.165      cpu-cycles                       #      4,2 GHz  cycles_frequency       (66,47%)
   968.869.747.208      instructions                     #      2,1 instructions  insn_per_cycle  (66,70%)
    12.748.378.886      stalled-cycles-frontend          #     0,03 frontend_cycles_idle        (66,76%)

      70,946778111 seconds time elapsed


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-08 18:18                                         ` Simon Schippers
@ 2026-02-12  0:12                                           ` Simon Schippers
  2026-02-12  7:06                                             ` Michael S. Tsirkin
  2026-02-12  8:14                                           ` Jason Wang
  1 sibling, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-02-12  0:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 2/8/26 19:18, Simon Schippers wrote:
> On 2/6/26 04:21, Jason Wang wrote:
>> On Fri, Feb 6, 2026 at 6:28 AM Simon Schippers
>> <simon.schippers@tu-dortmund.de> wrote:
>>>
>>> On 2/5/26 04:59, Jason Wang wrote:
>>>> On Wed, Feb 4, 2026 at 11:44 PM Simon Schippers
>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>
>>>>> On 2/3/26 04:48, Jason Wang wrote:
>>>>>> On Mon, Feb 2, 2026 at 4:19 AM Simon Schippers
>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>
>>>>>>> On 1/30/26 02:51, Jason Wang wrote:
>>>>>>>> On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>
>>>>>>>>> On 1/29/26 02:14, Jason Wang wrote:
>>>>>>>>>> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 1/28/26 08:03, Jason Wang wrote:
>>>>>>>>>>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 1/23/26 10:54, Simon Schippers wrote:
>>>>>>>>>>>>>> On 1/23/26 04:05, Jason Wang wrote:
>>>>>>>>>>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
>>>>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>>>>>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>>>>>>>>>>>>>>>> space in the underlying ptr_ring.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>>>>>>>>>>>>>>>> in an upcoming commit.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>>>>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>>>>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>>>>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>>>>>>>>>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>>>>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>>>>         return ret ? ret : total;
>>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>>>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>>>>>>>>>>>>>>>> +               rcu_read_unlock();
>>>>>>>>>>>>>>>>>>>>> +       }
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> +       return ptr;
>>>>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>>>>                            struct iov_iter *to,
>>>>>>>>>>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>>>>>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>                 /* Read frames from the queue */
>>>>>>>>>>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
>>>>>>>>>>>>>>>>>>>>> +               skb = tap_ring_consume(q);
>>>>>>>>>>>>>>>>>>>>>                 if (skb)
>>>>>>>>>>>>>>>>>>>>>                         break;
>>>>>>>>>>>>>>>>>>>>>                 if (noblock) {
>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>>>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>>>>>>>>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>>>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>>>>>>>>>>>>>>         return total;
>>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>>>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>>>>>>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>>>>>>>>>>>>>>>> another call to tweak the current API.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>>>>>>>>>>>>>>>> I'm not sure is what we want.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> What else would you suggest calling to wake the queue?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
>>>>>>>>>>>>>>>>> aspect.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
>>>>>>>>>>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
>>>>>>>>>>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>>>>>>>>>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
>>>>>>>>>>>>>>>>> already be significantly outdated by the time it is used.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
>>>>>>>>>>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>>>>>>>>>>>>>>>> smp_call_function_single().
>>>>>>>>>>>>>>>>> Is this a reasonable approach?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure but it would introduce costs like IPI.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>>>>>>>>>>>>>>>> considered a deal-breaker for the patch set?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It depends on whether or not it has effects on the performance.
>>>>>>>>>>>>>>>> Especially when vhost is pinned.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
>>>>>>>>>>>>>>> a specific CPU and then try to see the impact of the TX_SOFTIRQ.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
>>>>>>>>>>>>>> for both the stock and patched versions. The benchmarks were run with
>>>>>>>>>>>>>> the full patch series applied, since testing only patches 1-3 would not
>>>>>>>>>>>>>> be meaningful - the queue is never stopped in that case, so no
>>>>>>>>>>>>>> TX_SOFTIRQ is triggered.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
>>>>>>>>>>>>>> performance is lower for pktgen with a single thread but higher with
>>>>>>>>>>>>>> four threads. The results show no regression for the patched version,
>>>>>>>>>>>>>> with even slight performance improvements observed:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +-------------------------+-----------+----------------+
>>>>>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>>>>>>>>> | 100M packets            |           |                |
>>>>>>>>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
>>>>>>>>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>>>>>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
>>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +-------------------------+-----------+----------------+
>>>>>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>>>>>>>>> | 100M packets            |           |                |
>>>>>>>>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>>>>>>>>> | *4 threads*             |           |                |
>>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
>>>>>>>>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>>>>>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
>>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>
>>>>>>>>>>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
>>>>>>>>>>>> the guest or an xdp program that did XDP_DROP in the guest.
>>>>>>>>>>>
>>>>>>>>>>> I forgot to mention that these PPS values are per thread.
>>>>>>>>>>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
>>>>>>>>>>> respectively. For packet loss, that comes out to 1154 Kpps * 4 =
>>>>>>>>>>> 4616 Kpps and 0, respectively.
>>>>>>>>>>>
>>>>>>>>>>> Sorry about that!
>>>>>>>>>>>
>>>>>>>>>>> The pktgen benchmarks with a single thread look fine, right?
>>>>>>>>>>
>>>>>>>>>> Still looks very low. E.g., I just did a run of pktgen (using
>>>>>>>>>> pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest,
>>>>>>>>>> I can get 1Mpps.
>>>>>>>>>
>>>>>>>>> Keep in mind that I am using an older CPU (i5-6300HQ). For the
>>>>>>>>> single-threaded tests I always used pktgen_sample01_simple.sh, and for
>>>>>>>>> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
>>>>>>>>>
>>>>>>>>> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
>>>>>>>>> though the same parameters work fine for sample01 and sample02):
>>>>>>>>>
>>>>>>>>> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
>>>>>>>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
>>>>>>>>> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
>>>>>>>>> supported
>>>>>>>>> ERROR: Write error(1) occurred
>>>>>>>>> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
>>>>>>>>>
>>>>>>>>> ...and I do not know what I am doing wrong, even after looking at
>>>>>>>>> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
>>>>>>>>> Any clues?
>>>>>>>>
>>>>>>>> Please use -b 0; I'm on an Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.
>>>>>>>
>>>>>>> I tried using "-b 0", and while it worked, there was no noticeable
>>>>>>> performance improvement.
>>>>>>>
>>>>>>>>
>>>>>>>> Another thing I can think of is to disable
>>>>>>>>
>>>>>>>> 1) mitigations in both guest and host
>>>>>>>> 2) any kernel debug features in both host and guest
>>>>>>>
>>>>>>> I also rebuilt the kernel with everything disabled under
>>>>>>> "Kernel hacking", but that didn’t make any difference either.
>>>>>>>
>>>>>>> Because of this, I ran "pktgen_sample01_simple.sh" and
>>>>>>> "pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
>>>>>>> results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
>>>>>>> with very similar performance between the stock and patched kernels.
>>>>>>>
>>>>>>> Personally, I think the hardware is to blame for the low performance.
>>>>>>
>>>>>> Let's double confirm this by:
>>>>>>
>>>>>> 1) make sure pktgen is using 100% CPU
>>>>>> 2) Perf doesn't show anything strange for pktgen thread
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>> I ran pktgen using pktgen_sample01_simple.sh and, in parallel, started a
>>>>> 100 second perf stat measurement covering all kpktgend threads.
>>>>>
>>>>> Across all configurations, a single CPU was fully utilized.
>>>>>
>>>>> Apart from that, the patched variants show a higher branch frequency and
>>>>> a slightly increased number of context switches.
>>>>>
>>>>>
>>>>> The detailed results are provided below:
>>>>>
>>>>> Processor: Ryzen 5 5600X
>>>>>
>>>>> pktgen command:
>>>>> sudo perf stat samples/pktgen/pktgen_sample01_simple.sh -i tap0 -m
>>>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 10000000000
>>>>>
>>>>> perf stat command:
>>>>> sudo perf stat --timeout 100000 -p $(pgrep kpktgend | tr '\n' ,) -o X.txt
>>>>>
>>>>>
>>>>> Results:
>>>>> Stock TAP:
>>>>>             46.997      context-switches                 #    467,2 cs/sec  cs_per_second
>>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>>         100.587,69 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>>      8.491.586.483      branch-misses                    #     10,9 %  branch_miss_rate         (50,24%)
>>>>>     77.734.761.406      branches                         #    772,8 M/sec  branch_frequency     (66,85%)
>>>>>    382.420.291.585      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>>    377.612.185.141      instructions                     #      1,0 instructions  insn_per_cycle  (66,85%)
>>>>>     84.012.185.936      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
>>>>>
>>>>>      100,100414494 seconds time elapsed
>>>>>
>>>>>
>>>>> Stock TAP+vhost-net:
>>>>>             47.087      context-switches                 #    468,1 cs/sec  cs_per_second
>>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>>         100.594,09 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>>      8.034.703.613      branch-misses                    #     11,1 %  branch_miss_rate         (50,24%)
>>>>>     72.477.989.922      branches                         #    720,5 M/sec  branch_frequency     (66,86%)
>>>>>    382.218.276.832      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>>    349.555.577.281      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
>>>>>     83.917.644.262      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
>>>>>
>>>>>      100,100520402 seconds time elapsed
>>>>>
>>>>>
>>>>> Patched TAP:
>>>>>             47.862      context-switches                 #    475,8 cs/sec  cs_per_second
>>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>>         100.589,30 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>>      9.337.258.794      branch-misses                    #      9,4 %  branch_miss_rate         (50,19%)
>>>>>     99.518.421.676      branches                         #    989,4 M/sec  branch_frequency     (66,85%)
>>>>>    382.508.244.894      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>>    312.582.270.975      instructions                     #      0,8 instructions  insn_per_cycle  (66,85%)
>>>>>     76.338.503.984      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,39%)
>>>>>
>>>>>      100,101262454 seconds time elapsed
>>>>>
>>>>>
>>>>> Patched TAP+vhost-net:
>>>>>             47.892      context-switches                 #    476,1 cs/sec  cs_per_second
>>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>>         100.581,95 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>>      9.083.588.313      branch-misses                    #     10,1 %  branch_miss_rate         (50,28%)
>>>>>     90.300.124.712      branches                         #    897,8 M/sec  branch_frequency     (66,85%)
>>>>>    382.374.510.376      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>>    340.089.181.199      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
>>>>>     78.151.408.955      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,31%)
>>>>>
>>>>>      100,101212911 seconds time elapsed
>>>>
>>>> Thanks for sharing. I have more questions:
>>>>
>>>> 1) The number of CPU and vCPUs
>>>
>>> qemu runs with a single core. And my host system is now a Ryzen 5 5600x
>>> with 6 cores, 12 threads.
>>> This is my command for TAP+vhost-net:
>>>
>>> sudo qemu-system-x86_64 -hda debian.qcow2
>>> -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no,vhost=on
>>> -device virtio-net-pci,netdev=mynet0 -m 1024 -enable-kvm
>>>
>>> For TAP only it is the same but without vhost=on.
>>>
>>>> 2) If you pin vhost or vCPU threads
>>>
>>> Not in the previously shown benchmark. I pinned vhost in other
>>> benchmarks, but since there is only a minor PPS difference, I omitted it
>>> for the sake of simplicity.
>>>
>>>> 3) what does perf top looks like or perf top -p $pid_of_vhost
>>>
>>> The perf reports for the pid_of_vhost from pktgen_sample01_simple.sh
>>> with TAP+vhost-net (not pinned, pktgen single queue, fq_codel) are shown
>>> below. I cannot see a significant difference between stock and patched.
>>>
>>> I also included perf reports for the pktgen_pids. I find them more
>>> interesting because tun_net_xmit shows less overhead in the patched
>>> version. I assume that is due to the stopped netdev queue.
>>>
>>> I have now benchmarked pretty much every possible combination (with a
>>> script) of TAP/TAP+vhost-net, single/multi-queue pktgen, vhost
>>> pinned/not pinned, with/without -b 0, and fq_codel/noqueue, all with
>>> perf records.
>>> I could share them if you want, but I feel this is getting out of hand.
>>>
>>>
>>> Stock:
>>> sudo perf record -p "$vhost_pid"
>>> ...
>>> # Overhead  Command          Shared Object               Symbol
>>> # ........  ...............  ..........................  ..........................................
>>> #
>>>      5.97%  vhost-4874       [kernel.kallsyms]           [k] _copy_to_iter
>>>      2.68%  vhost-4874       [kernel.kallsyms]           [k] tun_do_read
>>>      2.23%  vhost-4874       [kernel.kallsyms]           [k] native_write_msr
>>>      1.93%  vhost-4874       [kernel.kallsyms]           [k] __check_object_size
>>
>> Let's disable CONFIG_HARDENED_USERCOPY and retry.
>>
>>>      1.61%  vhost-4874       [kernel.kallsyms]           [k] __slab_free.isra.0
>>>      1.56%  vhost-4874       [kernel.kallsyms]           [k] __get_user_nocheck_2
>>>      1.54%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_zero
>>>      1.45%  vhost-4874       [kernel.kallsyms]           [k] kmem_cache_free
>>>      1.43%  vhost-4874       [kernel.kallsyms]           [k] tun_recvmsg
>>>      1.24%  vhost-4874       [kernel.kallsyms]           [k] sk_skb_reason_drop
>>>      1.12%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_safe_ret
>>>      1.07%  vhost-4874       [kernel.kallsyms]           [k] native_read_msr
>>>      0.76%  vhost-4874       [kernel.kallsyms]           [k] simple_copy_to_iter
>>>      0.75%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_return_thunk
>>>      0.69%  vhost-4874       [vhost]                     [k] 0x0000000000002e70
>>>      0.59%  vhost-4874       [kernel.kallsyms]           [k] skb_release_data
>>>      0.59%  vhost-4874       [kernel.kallsyms]           [k] __skb_datagram_iter
>>>      0.53%  vhost-4874       [vhost]                     [k] 0x0000000000002e5f
>>>      0.51%  vhost-4874       [kernel.kallsyms]           [k] slab_update_freelist.isra.0
>>>      0.46%  vhost-4874       [kernel.kallsyms]           [k] kfree_skbmem
>>>      0.44%  vhost-4874       [kernel.kallsyms]           [k] skb_copy_datagram_iter
>>>      0.43%  vhost-4874       [kernel.kallsyms]           [k] skb_free_head
>>>      0.37%  qemu-system-x86  [unknown]                   [k] 0xffffffffba898b1b
>>>      0.35%  vhost-4874       [vhost]                     [k] 0x0000000000002e6b
>>>      0.33%  vhost-4874       [vhost_net]                 [k] 0x000000000000357d
>>>      0.28%  vhost-4874       [kernel.kallsyms]           [k] __check_heap_object
>>>      0.27%  vhost-4874       [vhost_net]                 [k] 0x00000000000035f3
>>>      0.26%  vhost-4874       [vhost_net]                 [k] 0x00000000000030f6
>>>      0.26%  vhost-4874       [kernel.kallsyms]           [k] __virt_addr_valid
>>>      0.24%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_advance
>>>      0.22%  vhost-4874       [kernel.kallsyms]           [k] perf_event_update_userpage
>>>      0.22%  vhost-4874       [kernel.kallsyms]           [k] check_stack_object
>>>      0.19%  qemu-system-x86  [unknown]                   [k] 0xffffffffba2a68cd
>>>      0.19%  vhost-4874       [kernel.kallsyms]           [k] dequeue_entities
>>>      0.19%  vhost-4874       [vhost_net]                 [k] 0x0000000000003237
>>>      0.18%  vhost-4874       [vhost_net]                 [k] 0x0000000000003550
>>>      0.18%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_del
>>>      0.18%  vhost-4874       [vhost_net]                 [k] 0x00000000000034a0
>>>      0.17%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_disable_all
>>>      0.16%  vhost-4874       [vhost_net]                 [k] 0x0000000000003523
>>>      0.16%  vhost-4874       [kernel.kallsyms]           [k] amd_pmu_addr_offset
>>> ...
>>>
>>>
>>> sudo perf record -p "$kpktgend_pids":
>>> ...
>>> # Overhead  Command      Shared Object      Symbol
>>> # ........  ...........  .................  ...............................................
>>> #
>>>     10.98%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
>>>     10.45%  kpktgend_0   [kernel.kallsyms]  [k] memset
>>>      8.40%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
>>>      6.31%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
>>>      3.13%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_safe_ret
>>>      2.40%  kpktgend_0   [kernel.kallsyms]  [k] sk_skb_reason_drop
>>>      2.11%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_return_thunk
>>
>> This is a hint that SRSO mitigation is enabled.
>>
>> Have you disabled CPU_MITIGATIONS via either Kconfig or the kernel
>> command line (mitigations=off) for both host and guest?
>>
>> Thanks
>>
> 
> Both of your suggested changes really boosted the performance,
> especially for TAP.
> 
> I disabled SRSO mitigation with spec_rstack_overflow=off and went from
> "Mitigation: Safe RET" to "Vulnerable" on the host. The VM showed "Not
> affected" but I applied spec_rstack_overflow=off anyway.
> 
> Here are some new benchmarks for pktgen_sample01_simple.sh:
> (I also have others available and can share them if you want.)
> 
> +-------------------------+-----------+----------------+
> | pktgen benchmarks to    | Stock     | Patched with   |
> | Debian VM, R5 5600X,    |           | fq_codel qdisc |
> | 100M packets            |           |                |
> | CPU not pinned          |           |                |
> +-----------+-------------+-----------+----------------+
> | TAP       | Transmitted | 1330 Kpps | 1033 Kpps      |
> |           +-------------+-----------+----------------+
> |           | Lost        | 3895 Kpps | 0              |
> +-----------+-------------+-----------+----------------+
> | TAP       | Transmitted | 1408 Kpps | 1420 Kpps      |
> |  +        +-------------+-----------+----------------+
> | vhost-net | Lost        | 3712 Kpps | 0              |
> +-----------+-------------+-----------+----------------+
> 
> I do not understand why there is a regression for TAP but not for
> TAP+vhost-net...
> 
> 
> The perf reports of pktgen and perf stats for TAP & TAP+vhost-net are
> below. I also included perf reports & perf stats of vhost for
> TAP+vhost-net.
> 
> =========================================================================
> 
> TAP stock:
> perf report of pktgen:
> 
> # Overhead  Command      Shared Object      Symbol                                        
> # ........  ...........  .................  ..............................................
> #
>     22.39%  kpktgend_0   [kernel.kallsyms]  [k] memset
>     10.59%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
>      7.56%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
>      5.74%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
>      4.76%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_free
>      3.23%  kpktgend_0   [kernel.kallsyms]  [k] chacha_permute
>      2.55%  kpktgend_0   [pktgen]           [k] 0x0000000000003255
>      2.49%  kpktgend_0   [pktgen]           [k] 0x000000000000324f
>      2.48%  kpktgend_0   [pktgen]           [k] 0x000000000000325d
>      2.44%  kpktgend_0   [kernel.kallsyms]  [k] get_random_u32
>      2.21%  kpktgend_0   [kernel.kallsyms]  [k] skb_put
>      1.46%  kpktgend_0   [kernel.kallsyms]  [k] sk_skb_reason_drop
>      1.36%  kpktgend_0   [kernel.kallsyms]  [k] ip_send_check
>      1.17%  kpktgend_0   [kernel.kallsyms]  [k] __local_bh_enable_ip
>      1.09%  kpktgend_0   [kernel.kallsyms]  [k] _raw_spin_lock
>      1.01%  kpktgend_0   [kernel.kallsyms]  [k] kmalloc_reserve
>      0.85%  kpktgend_0   [kernel.kallsyms]  [k] skb_release_data
>      0.83%  kpktgend_0   [kernel.kallsyms]  [k] __netdev_alloc_skb
>      0.71%  kpktgend_0   [pktgen]           [k] 0x000000000000324d
>      0.68%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_unlock
>      0.64%  kpktgend_0   [kernel.kallsyms]  [k] skb_tx_error
>      0.59%  kpktgend_0   [kernel.kallsyms]  [k] __get_random_u32_below
>      0.58%  kpktgend_0   [kernel.kallsyms]  [k] sock_def_readable
>      0.51%  kpktgend_0   [pktgen]           [k] 0x000000000000422e
>      0.50%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_lock
>      0.48%  kpktgend_0   [kernel.kallsyms]  [k] _get_random_bytes
>      0.46%  kpktgend_0   [pktgen]           [k] 0x0000000000004220
>      0.46%  kpktgend_0   [pktgen]           [k] 0x0000000000004229
>      0.45%  kpktgend_0   [kernel.kallsyms]  [k] skb_release_head_state
>      0.44%  kpktgend_0   [pktgen]           [k] 0x000000000000211d
> ...
> 
> 
> perf stat of pktgen:
>  Performance counter stats for process id '4740,4741,4742,4743,4744,4745,4746,4747,4748,4749,4750,4751,4752,4753,4754,4755,4756,4757,4758,4759,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787':
> 
>             35.436      context-switches                 #    469,7 cs/sec  cs_per_second     
>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>          75.443,67 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
>        548.187.113      branch-misses                    #      0,5 %  branch_miss_rate         (50,18%)
>    119.270.991.801      branches                         #   1580,9 M/sec  branch_frequency     (66,79%)
>    347.803.953.690      cpu-cycles                       #      4,6 GHz  cycles_frequency       (66,79%)
>    689.142.448.524      instructions                     #      2,0 instructions  insn_per_cycle  (66,79%)
>     11.063.715.152      stalled-cycles-frontend          #     0,03 frontend_cycles_idle        (66,43%)
> 
>       75,698467362 seconds time elapsed
> 
> 
> =========================================================================
> 
> TAP patched:
> perf report of pktgen:
> 
> # Overhead  Command      Shared Object      Symbol                                        
> # ........  ...........  .................  ..............................................
> #
>     16.18%  kpktgend_0   [pktgen]           [k] 0x0000000000003255
>     16.11%  kpktgend_0   [pktgen]           [k] 0x000000000000324f
>     16.10%  kpktgend_0   [pktgen]           [k] 0x000000000000325d
>      4.78%  kpktgend_0   [kernel.kallsyms]  [k] memset
>      4.54%  kpktgend_0   [kernel.kallsyms]  [k] __local_bh_enable_ip
>      2.62%  kpktgend_0   [pktgen]           [k] 0x000000000000324d
>      2.42%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
>      1.89%  kpktgend_0   [kernel.kallsyms]  [k] kthread_should_stop
>      1.77%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_unlock
>      1.66%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
>      1.53%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_lock
>      1.44%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
>      1.42%  kpktgend_0   [kernel.kallsyms]  [k] __cond_resched
>      0.91%  kpktgend_0   [pktgen]           [k] 0x0000000000003877
>      0.91%  kpktgend_0   [pktgen]           [k] 0x0000000000003284
>      0.89%  kpktgend_0   [pktgen]           [k] 0x000000000000327f
>      0.75%  kpktgend_0   [kernel.kallsyms]  [k] chacha_permute
>      0.64%  kpktgend_0   [pktgen]           [k] 0x0000000000003061
>      0.61%  kpktgend_0   [kernel.kallsyms]  [k] get_random_u32
>      0.57%  kpktgend_0   [kernel.kallsyms]  [k] sock_def_readable
>      0.52%  kpktgend_0   [kernel.kallsyms]  [k] skb_put
>      0.48%  kpktgend_0   [pktgen]           [k] 0x000000000000326d
>      0.47%  kpktgend_0   [pktgen]           [k] 0x0000000000003265
>      0.47%  kpktgend_0   [pktgen]           [k] 0x0000000000003864
>      0.45%  kpktgend_0   [pktgen]           [k] 0x0000000000003008
>      0.35%  kpktgend_0   [pktgen]           [k] 0x000000000000449b
>      0.34%  kpktgend_0   [pktgen]           [k] 0x0000000000003242
>      0.32%  kpktgend_0   [pktgen]           [k] 0x00000000000030a6
>      0.32%  kpktgend_0   [pktgen]           [k] 0x000000000000308b
>      0.32%  kpktgend_0   [pktgen]           [k] 0x0000000000003869
>      0.31%  kpktgend_0   [pktgen]           [k] 0x00000000000030c2
> ...
> 
> perf stat of pktgen:
> 
>  Performance counter stats for process id '3257,3258,3259,3260,3261,3262,3263,3264,3265,3266,3267,3268,3269,3270,3271,3272,3273,3274,3275,3276,3277,3278,3279,3280,3281,3282,3283,3284,3285,3286,3287,3288,3289,3290,3291,3292,3293,3294,3295,3296,3297,3298,3299,3300,3301,3302,3303,3304':
> 
>             45.545      context-switches                 #    468,9 cs/sec  cs_per_second     
>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>          97.130,77 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
>        237.212.098      branch-misses                    #      0,1 %  branch_miss_rate         (50,12%)
>    172.088.418.840      branches                         #   1771,7 M/sec  branch_frequency     (66,78%)
>    447.219.346.605      cpu-cycles                       #      4,6 GHz  cycles_frequency       (66,79%)
>    619.203.459.603      instructions                     #      1,4 instructions  insn_per_cycle  (66,79%)
>      5.821.044.711      stalled-cycles-frontend          #     0,01 frontend_cycles_idle        (66,48%)
> 
>       97,353332168 seconds time elapsed
> 
> =========================================================================
> 
> TAP+vhost-net stock:
> 
> perf report of pktgen:
> 
> # Overhead  Command      Shared Object      Symbol                                        
> # ........  ...........  .................  ..............................................
> #
>     22.25%  kpktgend_0   [kernel.kallsyms]  [k] memset
>     10.73%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
>      7.69%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
>      5.71%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
>      4.66%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_free
>      3.20%  kpktgend_0   [kernel.kallsyms]  [k] chacha_permute
>      2.50%  kpktgend_0   [pktgen]           [k] 0x000000000000325d
>      2.48%  kpktgend_0   [pktgen]           [k] 0x0000000000003255
>      2.45%  kpktgend_0   [pktgen]           [k] 0x000000000000324f
>      2.41%  kpktgend_0   [kernel.kallsyms]  [k] get_random_u32
>      2.22%  kpktgend_0   [kernel.kallsyms]  [k] skb_put
>      1.44%  kpktgend_0   [kernel.kallsyms]  [k] sk_skb_reason_drop
>      1.34%  kpktgend_0   [kernel.kallsyms]  [k] ip_send_check
>      1.22%  kpktgend_0   [kernel.kallsyms]  [k] __local_bh_enable_ip
>      1.06%  kpktgend_0   [kernel.kallsyms]  [k] _raw_spin_lock
>      1.04%  kpktgend_0   [kernel.kallsyms]  [k] kmalloc_reserve
>      0.85%  kpktgend_0   [kernel.kallsyms]  [k] skb_release_data
>      0.83%  kpktgend_0   [kernel.kallsyms]  [k] __netdev_alloc_skb
>      0.72%  kpktgend_0   [pktgen]           [k] 0x000000000000324d
>      0.70%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_unlock
>      0.62%  kpktgend_0   [kernel.kallsyms]  [k] skb_tx_error
>      0.61%  kpktgend_0   [kernel.kallsyms]  [k] __get_random_u32_below
>      0.60%  kpktgend_0   [kernel.kallsyms]  [k] sock_def_readable
>      0.52%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_lock
>      0.47%  kpktgend_0   [kernel.kallsyms]  [k] _get_random_bytes
>      0.47%  kpktgend_0   [pktgen]           [k] 0x000000000000422e
>      0.45%  kpktgend_0   [pktgen]           [k] 0x0000000000004229
>      0.44%  kpktgend_0   [pktgen]           [k] 0x0000000000004220
>      0.43%  kpktgend_0   [kernel.kallsyms]  [k] skb_release_head_state
>      0.42%  kpktgend_0   [kernel.kallsyms]  [k] netdev_core_stats_inc
>      0.42%  kpktgend_0   [pktgen]           [k] 0x0000000000002119
> ...
> 
> perf stat of pktgen:
> 
>  Performance counter stats for process id '4740,4741,4742,4743,4744,4745,4746,4747,4748,4749,4750,4751,4752,4753,4754,4755,4756,4757,4758,4759,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787':
> 
>             34.830      context-switches                 #    489,0 cs/sec  cs_per_second     
>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>          71.224,77 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
>        506.905.400      branch-misses                    #      0,5 %  branch_miss_rate         (50,15%)
>    110.207.563.428      branches                         #   1547,3 M/sec  branch_frequency     (66,78%)
>    324.745.594.771      cpu-cycles                       #      4,6 GHz  cycles_frequency       (66,77%)
>    635.181.893.816      instructions                     #      2,0 instructions  insn_per_cycle  (66,77%)
>     10.450.586.633      stalled-cycles-frontend          #     0,03 frontend_cycles_idle        (66,46%)
> 
>       71,547831150 seconds time elapsed
> 
> 
> perf report of vhost:
> 
> # Overhead  Command          Shared Object               Symbol                                          
> # ........  ...............  ..........................  ................................................
> #
>      8.66%  vhost-14592      [kernel.kallsyms]           [k] _copy_to_iter
>      2.76%  vhost-14592      [kernel.kallsyms]           [k] native_write_msr
>      2.57%  vhost-14592      [kernel.kallsyms]           [k] __get_user_nocheck_2
>      2.03%  vhost-14592      [kernel.kallsyms]           [k] iov_iter_zero
>      1.21%  vhost-14592      [kernel.kallsyms]           [k] native_read_msr
>      0.89%  vhost-14592      [kernel.kallsyms]           [k] kmem_cache_free
>      0.85%  vhost-14592      [kernel.kallsyms]           [k] __slab_free.isra.0
>      0.84%  vhost-14592      [vhost]                     [k] 0x0000000000002e3a
>      0.83%  vhost-14592      [kernel.kallsyms]           [k] tun_do_read
>      0.74%  vhost-14592      [kernel.kallsyms]           [k] tun_recvmsg
>      0.72%  vhost-14592      [kernel.kallsyms]           [k] slab_update_freelist.isra.0
>      0.49%  vhost-14592      [vhost]                     [k] 0x0000000000002e29
>      0.45%  vhost-14592      [vhost]                     [k] 0x0000000000002e35
>      0.43%  qemu-system-x86  [unknown]                   [k] 0xffffffffb5298b1b
>      0.26%  vhost-14592      [kernel.kallsyms]           [k] __skb_datagram_iter
>      0.24%  vhost-14592      [kernel.kallsyms]           [k] skb_release_data
>      0.24%  qemu-system-x86  [unknown]                   [k] 0xffffffffb4ca68cd
>      0.24%  vhost-14592      [kernel.kallsyms]           [k] iov_iter_advance
>      0.22%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eb79c
>      0.22%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba58
>      0.14%  vhost-14592      [kernel.kallsyms]           [k] sk_skb_reason_drop
>      0.14%  vhost-14592      [kernel.kallsyms]           [k] amd_pmu_addr_offset
>      0.13%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba54
>      0.13%  vhost-14592      [kernel.kallsyms]           [k] skb_free_head
>      0.12%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba50
>      0.12%  vhost-14592      [kernel.kallsyms]           [k] skb_release_head_state
>      0.10%  qemu-system-x86  [kernel.kallsyms]           [k] native_write_msr
>      0.09%  vhost-14592      [kernel.kallsyms]           [k] event_sched_out
>      0.09%  vhost-14592      [kernel.kallsyms]           [k] x86_pmu_del
>      0.09%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eb798
>      0.09%  vhost-14592      [kernel.kallsyms]           [k] put_cpu_partial
> ...
> 
> 
> perf stat of vhost:
> 
>  Performance counter stats for process id '14592':
> 
>          1.576.207      context-switches                 #  15070,7 cs/sec  cs_per_second     
>                459      cpu-migrations                   #      4,4 migrations/sec  migrations_per_second
>                  2      page-faults                      #      0,0 faults/sec  page_faults_per_second
>         104.587,77 msec task-clock                       #      1,5 CPUs  CPUs_utilized       
>        401.899.188      branch-misses                    #      0,2 %  branch_miss_rate         (49,91%)
>    174.642.296.972      branches                         #   1669,8 M/sec  branch_frequency     (66,71%)
>    453.598.103.128      cpu-cycles                       #      4,3 GHz  cycles_frequency       (66,98%)
>    957.886.719.689      instructions                     #      2,1 instructions  insn_per_cycle  (66,77%)
>     11.834.633.090      stalled-cycles-frontend          #     0,03 frontend_cycles_idle        (66,54%)
> 
>       71,561336447 seconds time elapsed
> 
> 
> =========================================================================
> 
> TAP+vhost-net patched:
> 
> perf report of pktgen:
> 
> # Overhead  Command      Shared Object      Symbol                                        
> # ........  ...........  .................  ..............................................
> #
>     16.83%  kpktgend_0   [pktgen]           [k] 0x000000000000324f
>     16.81%  kpktgend_0   [pktgen]           [k] 0x0000000000003255
>     16.74%  kpktgend_0   [pktgen]           [k] 0x000000000000325d
>      5.96%  kpktgend_0   [kernel.kallsyms]  [k] memset
>      3.87%  kpktgend_0   [kernel.kallsyms]  [k] __local_bh_enable_ip
>      2.87%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
>      1.77%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
>      1.72%  kpktgend_0   [pktgen]           [k] 0x000000000000324d
>      1.68%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
>      1.63%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_unlock
>      1.56%  kpktgend_0   [kernel.kallsyms]  [k] kthread_should_stop
>      1.41%  kpktgend_0   [kernel.kallsyms]  [k] __rcu_read_lock
>      1.19%  kpktgend_0   [kernel.kallsyms]  [k] __cond_resched
>      0.83%  kpktgend_0   [kernel.kallsyms]  [k] chacha_permute
>      0.79%  kpktgend_0   [pktgen]           [k] 0x000000000000327f
>      0.78%  kpktgend_0   [pktgen]           [k] 0x0000000000003284
>      0.77%  kpktgend_0   [pktgen]           [k] 0x0000000000003877
>      0.69%  kpktgend_0   [kernel.kallsyms]  [k] sock_def_readable
>      0.66%  kpktgend_0   [kernel.kallsyms]  [k] get_random_u32
>      0.56%  kpktgend_0   [kernel.kallsyms]  [k] skb_put
>      0.54%  kpktgend_0   [pktgen]           [k] 0x0000000000003061
>      0.41%  kpktgend_0   [pktgen]           [k] 0x0000000000003864
>      0.41%  kpktgend_0   [pktgen]           [k] 0x0000000000003265
>      0.40%  kpktgend_0   [pktgen]           [k] 0x0000000000003008
>      0.39%  kpktgend_0   [pktgen]           [k] 0x000000000000326d
>      0.37%  kpktgend_0   [kernel.kallsyms]  [k] ip_send_check
>      0.36%  kpktgend_0   [pktgen]           [k] 0x000000000000422e
>      0.32%  kpktgend_0   [pktgen]           [k] 0x0000000000004220
>      0.30%  kpktgend_0   [pktgen]           [k] 0x0000000000004229
>      0.29%  kpktgend_0   [kernel.kallsyms]  [k] kmalloc_reserve
>      0.28%  kpktgend_0   [kernel.kallsyms]  [k] _raw_spin_lock
> ...
> 
> perf stat of pktgen:
> 
>  Performance counter stats for process id '3257,3258,3259,3260,3261,3262,3263,3264,3265,3266,3267,3268,3269,3270,3271,3272,3273,3274,3275,3276,3277,3278,3279,3280,3281,3282,3283,3284,3285,3286,3287,3288,3289,3290,3291,3292,3293,3294,3295,3296,3297,3298,3299,3300,3301,3302,3303,3304':
> 
>             34.525      context-switches                 #    489,1 cs/sec  cs_per_second     
>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>          70.593,02 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
>        225.587.357      branch-misses                    #      0,2 %  branch_miss_rate         (50,15%)
>    135.486.264.836      branches                         #   1919,3 M/sec  branch_frequency     (66,77%)
>    324.131.813.682      cpu-cycles                       #      4,6 GHz  cycles_frequency       (66,77%)
>    501.960.610.999      instructions                     #      1,5 instructions  insn_per_cycle  (66,77%)
>      2.689.294.657      stalled-cycles-frontend          #     0,01 frontend_cycles_idle        (66,46%)
> 
>       70,928052784 seconds time elapsed
> 
> 
> perf report of vhost:
> 
> # Overhead  Command          Shared Object               Symbol                                          
> # ........  ...............  ..........................  ................................................
> #
>      8.95%  vhost-12220      [kernel.kallsyms]           [k] _copy_to_iter
>      4.03%  vhost-12220      [kernel.kallsyms]           [k] native_write_msr
>      2.44%  vhost-12220      [kernel.kallsyms]           [k] __get_user_nocheck_2
>      2.12%  vhost-12220      [kernel.kallsyms]           [k] iov_iter_zero
>      1.74%  vhost-12220      [kernel.kallsyms]           [k] native_read_msr
>      0.92%  vhost-12220      [kernel.kallsyms]           [k] kmem_cache_free
>      0.87%  vhost-12220      [vhost]                     [k] 0x0000000000002e3a
>      0.86%  vhost-12220      [kernel.kallsyms]           [k] __slab_free.isra.0
>      0.82%  vhost-12220      [kernel.kallsyms]           [k] tun_recvmsg
>      0.82%  vhost-12220      [kernel.kallsyms]           [k] tun_do_read
>      0.73%  vhost-12220      [kernel.kallsyms]           [k] slab_update_freelist.isra.0
>      0.51%  vhost-12220      [vhost]                     [k] 0x0000000000002e29
>      0.47%  vhost-12220      [vhost]                     [k] 0x0000000000002e35
>      0.40%  qemu-system-x86  [unknown]                   [k] 0xffffffff97e98b1b
>      0.28%  vhost-12220      [kernel.kallsyms]           [k] __skb_datagram_iter
>      0.26%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba58
>      0.24%  vhost-12220      [kernel.kallsyms]           [k] iov_iter_advance
>      0.22%  qemu-system-x86  [unknown]                   [k] 0xffffffff978a68cd
>      0.22%  vhost-12220      [kernel.kallsyms]           [k] skb_release_data
>      0.21%  vhost-12220      [kernel.kallsyms]           [k] amd_pmu_addr_offset
>      0.19%  vhost-12220      [kernel.kallsyms]           [k] tun_ring_consume_batched
>      0.18%  vhost-12220      [kernel.kallsyms]           [k] __rcu_read_unlock
>      0.14%  vhost-12220      [kernel.kallsyms]           [k] sk_skb_reason_drop
>      0.13%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eb79c
>      0.13%  vhost-12220      [kernel.kallsyms]           [k] skb_release_head_state
>      0.13%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba54
>      0.13%  vhost-12220      [kernel.kallsyms]           [k] psi_group_change
>      0.13%  qemu-system-x86  qemu-system-x86_64          [.] 0x00000000008eba50
>      0.11%  vhost-12220      [kernel.kallsyms]           [k] skb_free_head
>      0.10%  vhost-12220      [kernel.kallsyms]           [k] __update_load_avg_cfs_rq
>      0.10%  vhost-12220      [kernel.kallsyms]           [k] update_load_avg
> ...
> 
> 
> perf stat of vhost:
> 
>  Performance counter stats for process id '12220':
> 
>          2.841.331      context-switches                 #  26120,3 cs/sec  cs_per_second     
>              1.902      cpu-migrations                   #     17,5 migrations/sec  migrations_per_second
>                  2      page-faults                      #      0,0 faults/sec  page_faults_per_second
>         108.778,75 msec task-clock                       #      1,5 CPUs  CPUs_utilized       
>        422.032.153      branch-misses                    #      0,2 %  branch_miss_rate         (49,95%)
>    177.051.281.496      branches                         #   1627,6 M/sec  branch_frequency     (66,59%)
>    458.977.136.165      cpu-cycles                       #      4,2 GHz  cycles_frequency       (66,47%)
>    968.869.747.208      instructions                     #      2,1 instructions  insn_per_cycle  (66,70%)
>     12.748.378.886      stalled-cycles-frontend          #     0,03 frontend_cycles_idle        (66,76%)
> 
>       70,946778111 seconds time elapsed
> 

Hi, what do you think?

Thanks!

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-12  0:12                                           ` Simon Schippers
@ 2026-02-12  7:06                                             ` Michael S. Tsirkin
  2026-02-12  8:03                                               ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-02-12  7:06 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization


Simon, is there a reason you drop the "Re: " subject prefix when
replying? Each time I am thinking it's a new version
only to find out it's this endless thread where people
quote >1000 lines of context to add 2 lines at the end.

-- 
MST


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-12  7:06                                             ` Michael S. Tsirkin
@ 2026-02-12  8:03                                               ` Simon Schippers
  0 siblings, 0 replies; 69+ messages in thread
From: Simon Schippers @ 2026-02-12  8:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 2/12/26 08:06, Michael S. Tsirkin wrote:
> 
> Simon, is there a reason you drop the "Re: " subject prefix when
> replying? Each time I am thinking it's a new version
> only to find out it's this endless thread where people
> quote >1000 lines of context to add 2 lines at the end.
> 

No, there is no reason for that. Sorry, I will keep the
"Re: " from now on.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-08 18:18                                         ` Simon Schippers
  2026-02-12  0:12                                           ` Simon Schippers
@ 2026-02-12  8:14                                           ` Jason Wang
  2026-02-14 17:13                                             ` Simon Schippers
  1 sibling, 1 reply; 69+ messages in thread
From: Jason Wang @ 2026-02-12  8:14 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On Mon, Feb 9, 2026 at 2:18 AM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 2/6/26 04:21, Jason Wang wrote:
> > On Fri, Feb 6, 2026 at 6:28 AM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 2/5/26 04:59, Jason Wang wrote:
> >>> On Wed, Feb 4, 2026 at 11:44 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 2/3/26 04:48, Jason Wang wrote:
> >>>>> On Mon, Feb 2, 2026 at 4:19 AM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> On 1/30/26 02:51, Jason Wang wrote:
> >>>>>>> On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
> >>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>
> >>>>>>>> On 1/29/26 02:14, Jason Wang wrote:
> >>>>>>>>> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
> >>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 1/28/26 08:03, Jason Wang wrote:
> >>>>>>>>>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
> >>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 1/23/26 10:54, Simon Schippers wrote:
> >>>>>>>>>>>>> On 1/23/26 04:05, Jason Wang wrote:
> >>>>>>>>>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
> >>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
> >>>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
> >>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
> >>>>>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
> >>>>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
> >>>>>>>>>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
> >>>>>>>>>>>>>>>>>>>> space in the underlying ptr_ring.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
> >>>>>>>>>>>>>>>>>>>> in an upcoming commit.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >>>>>>>>>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >>>>>>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
> >>>>>>>>>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
> >>>>>>>>>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>>>>>>>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
> >>>>>>>>>>>>>>>>>>>> --- a/drivers/net/tap.c
> >>>>>>>>>>>>>>>>>>>> +++ b/drivers/net/tap.c
> >>>>>>>>>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>>>>>>>>>>>>>>>>>         return ret ? ret : total;
> >>>>>>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
> >>>>>>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
> >>>>>>>>>>>>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>>>>>>>>>>>> +       void *ptr;
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>>>>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
> >>>>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
> >>>>>>>>>>>>>>>>>>>> +               rcu_read_unlock();
> >>>>>>>>>>>>>>>>>>>> +       }
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +       return ptr;
> >>>>>>>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>>>>>>>>>>>                            struct iov_iter *to,
> >>>>>>>>>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
> >>>>>>>>>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
> >>>>>>>>>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>                 /* Read frames from the queue */
> >>>>>>>>>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
> >>>>>>>>>>>>>>>>>>>> +               skb = tap_ring_consume(q);
> >>>>>>>>>>>>>>>>>>>>                 if (skb)
> >>>>>>>>>>>>>>>>>>>>                         break;
> >>>>>>>>>>>>>>>>>>>>                 if (noblock) {
> >>>>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>>>>>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
> >>>>>>>>>>>>>>>>>>>> --- a/drivers/net/tun.c
> >>>>>>>>>>>>>>>>>>>> +++ b/drivers/net/tun.c
> >>>>>>>>>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>>>>>>>>>>>>>>>>>         return total;
> >>>>>>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
> >>>>>>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
> >>>>>>>>>>>>>>>>>>>> +       struct net_device *dev;
> >>>>>>>>>>>>>>>>>>>> +       void *ptr;
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
> >>>>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
> >>>>>>>>>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
> >>>>>>>>>>>>>>>>>>> another call to tweak the current API.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> +               rcu_read_lock();
> >>>>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
> >>>>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
> >>>>>>>>>>>>>>>>>>> I'm not sure is what we want.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> What else would you suggest calling to wake the queue?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
> >>>>>>>>>>>>>>>> aspect.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
> >>>>>>>>>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
> >>>>>>>>>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
> >>>>>>>>>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
> >>>>>>>>>>>>>>>> already be significantly outdated by the time it is used.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
> >>>>>>>>>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
> >>>>>>>>>>>>>>>> smp_call_function_single().
> >>>>>>>>>>>>>>>> Is this a reasonable approach?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm not sure but it would introduce costs like IPI.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
> >>>>>>>>>>>>>>>> considered a deal-breaker for the patch set?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It depends on whether or not it has effects on the performance.
> >>>>>>>>>>>>>>> Especially when vhost is pinned.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
> >>>>>>>>>>>>>> a specific CPU and the try to see the impact of the TX_SOFTIRQ.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
> >>>>>>>>>>>>> for both the stock and patched versions. The benchmarks were run with
> >>>>>>>>>>>>> the full patch series applied, since testing only patches 1-3 would not
> >>>>>>>>>>>>> be meaningful - the queue is never stopped in that case, so no
> >>>>>>>>>>>>> TX_SOFTIRQ is triggered.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
> >>>>>>>>>>>>> performance is lower for pktgen with a single thread but higher with
> >>>>>>>>>>>>> four threads. The results show no regression for the patched version,
> >>>>>>>>>>>>> with even slight performance improvements observed:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> +-------------------------+-----------+----------------+
> >>>>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>>>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>>>>>>>>>>>> | 100M packets            |           |                |
> >>>>>>>>>>>>> | vhost pinned to core 0  |           |                |
> >>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
> >>>>>>>>>>>>> |  +        +-------------+-----------+----------------+
> >>>>>>>>>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
> >>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> +-------------------------+-----------+----------------+
> >>>>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
> >>>>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
> >>>>>>>>>>>>> | 100M packets            |           |                |
> >>>>>>>>>>>>> | vhost pinned to core 0  |           |                |
> >>>>>>>>>>>>> | *4 threads*             |           |                |
> >>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
> >>>>>>>>>>>>> |  +        +-------------+-----------+----------------+
> >>>>>>>>>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
> >>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
> >>>>>>>>>>>
> >>>>>>>>>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
> >>>>>>>>>>> the guest or an xdp program that did XDP_DROP in the guest.
> >>>>>>>>>>
> >>>>>>>>>> I forgot to mention that these PPS values are per thread.
> >>>>>>>>>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 326 Kpps,
> >>>>>>>>>> respectively. For packet loss, that comes out to 1154 Kpps * 4 =
> >>>>>>>>>> 4616 Kpps and 0, respectively.
> >>>>>>>>>>
> >>>>>>>>>> Sorry about that!
> >>>>>>>>>>
> >>>>>>>>>> The pktgen benchmarks with a single thread look fine, right?
> >>>>>>>>>
> >>>>>>>>> Still looks very low. E.g I just have a run of pktgen (using
> >>>>>>>>> pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest,
> >>>>>>>>> I can get 1Mpps.
> >>>>>>>>
> >>>>>>>> Keep in mind that I am using an older CPU (i5-6300HQ). For the
> >>>>>>>> single-threaded tests I always used pktgen_sample01_simple.sh, and for
> >>>>>>>> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
> >>>>>>>>
> >>>>>>>> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
> >>>>>>>> though the same parameters work fine for sample01 and sample02):
> >>>>>>>>
> >>>>>>>> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
> >>>>>>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
> >>>>>>>> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
> >>>>>>>> supported
> >>>>>>>> ERROR: Write error(1) occurred
> >>>>>>>> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
> >>>>>>>>
> >>>>>>>> ...and I do not know what I am doing wrong, even after looking at
> >>>>>>>> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
> >>>>>>>> Any clues?
> >>>>>>>
> >>>>>>> Please use -b 0, and I'm Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.
> >>>>>>
> >>>>>> I tried using "-b 0", and while it worked, there was no noticeable
> >>>>>> performance improvement.
> >>>>>>
> >>>>>>>
> >>>>>>> Another thing I can think of is to disable
> >>>>>>>
> >>>>>>> 1) mitigations in both guest and host
> >>>>>>> 2) any kernel debug features in both host and guest
> >>>>>>
> >>>>>> I also rebuilt the kernel with everything disabled under
> >>>>>> "Kernel hacking", but that didn’t make any difference either.
> >>>>>>
> >>>>>> Because of this, I ran "pktgen_sample01_simple.sh" and
> >>>>>> "pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
> >>>>>> results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
> >>>>>> with very similar performance between the stock and patched kernels.
> >>>>>>
> >>>>>> Personally, I think the hardware is to blame for the low performance.
> >>>>>
> >>>>> Let's double confirm this by:
> >>>>>
> >>>>> 1) make sure pktgen is using 100% CPU
> >>>>> 2) Perf doesn't show anything strange for pktgen thread
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>> I ran pktgen using pktgen_sample01_simple.sh and, in parallel, started a
> >>>> 100-second perf stat measurement covering all kpktgend threads.
> >>>>
> >>>> Across all configurations, a single CPU was fully utilized.
> >>>>
> >>>> Apart from that, the patched variants show a higher branch frequency and
> >>>> a slightly increased number of context switches.
> >>>>
> >>>>
> >>>> The detailed results are provided below:
> >>>>
> >>>> Processor: Ryzen 5 5600X
> >>>>
> >>>> pktgen command:
> >>>> sudo perf stat samples/pktgen/pktgen_sample01_simple.sh -i tap0 -m
> >>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 10000000000
> >>>>
> >>>> perf stat command:
> >>>> sudo perf stat --timeout 100000 -p $(pgrep kpktgend | tr '\n' ,) -o X.txt
> >>>>
> >>>>
> >>>> Results:
> >>>> Stock TAP:
> >>>>             46.997      context-switches                 #    467,2 cs/sec  cs_per_second
> >>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
> >>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
> >>>>         100.587,69 msec task-clock                       #      1,0 CPUs  CPUs_utilized
> >>>>      8.491.586.483      branch-misses                    #     10,9 %  branch_miss_rate         (50,24%)
> >>>>     77.734.761.406      branches                         #    772,8 M/sec  branch_frequency     (66,85%)
> >>>>    382.420.291.585      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
> >>>>    377.612.185.141      instructions                     #      1,0 instructions  insn_per_cycle  (66,85%)
> >>>>     84.012.185.936      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
> >>>>
> >>>>      100,100414494 seconds time elapsed
> >>>>
> >>>>
> >>>> Stock TAP+vhost-net:
> >>>>             47.087      context-switches                 #    468,1 cs/sec  cs_per_second
> >>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
> >>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
> >>>>         100.594,09 msec task-clock                       #      1,0 CPUs  CPUs_utilized
> >>>>      8.034.703.613      branch-misses                    #     11,1 %  branch_miss_rate         (50,24%)
> >>>>     72.477.989.922      branches                         #    720,5 M/sec  branch_frequency     (66,86%)
> >>>>    382.218.276.832      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
> >>>>    349.555.577.281      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
> >>>>     83.917.644.262      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
> >>>>
> >>>>      100,100520402 seconds time elapsed
> >>>>
> >>>>
> >>>> Patched TAP:
> >>>>             47.862      context-switches                 #    475,8 cs/sec  cs_per_second
> >>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
> >>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
> >>>>         100.589,30 msec task-clock                       #      1,0 CPUs  CPUs_utilized
> >>>>      9.337.258.794      branch-misses                    #      9,4 %  branch_miss_rate         (50,19%)
> >>>>     99.518.421.676      branches                         #    989,4 M/sec  branch_frequency     (66,85%)
> >>>>    382.508.244.894      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
> >>>>    312.582.270.975      instructions                     #      0,8 instructions  insn_per_cycle  (66,85%)
> >>>>     76.338.503.984      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,39%)
> >>>>
> >>>>      100,101262454 seconds time elapsed
> >>>>
> >>>>
> >>>> Patched TAP+vhost-net:
> >>>>             47.892      context-switches                 #    476,1 cs/sec  cs_per_second
> >>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
> >>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
> >>>>         100.581,95 msec task-clock                       #      1,0 CPUs  CPUs_utilized
> >>>>      9.083.588.313      branch-misses                    #     10,1 %  branch_miss_rate         (50,28%)
> >>>>     90.300.124.712      branches                         #    897,8 M/sec  branch_frequency     (66,85%)
> >>>>    382.374.510.376      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
> >>>>    340.089.181.199      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
> >>>>     78.151.408.955      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,31%)
> >>>>
> >>>>      100,101212911 seconds time elapsed
> >>>
> >>> Thanks for sharing. I have more questions:
> >>>
> >>> 1) The number of CPU and vCPUs
> >>
>> QEMU runs with a single core. And my host system is now a Ryzen 5 5600X
>> with 6 cores, 12 threads.
> >> This is my command for TAP+vhost-net:
> >>
> >> sudo qemu-system-x86_64 -hda debian.qcow2
> >> -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no,vhost=on
> >> -device virtio-net-pci,netdev=mynet0 -m 1024 -enable-kvm
> >>
> >> For TAP only it is the same but without vhost=on.
> >>
> >>> 2) If you pin vhost or vCPU threads
> >>
>> Not in the previously shown benchmark. I pinned vhost in other benchmarks,
>> but since there is only a minor PPS difference I omitted them for the sake
>> of simplicity.
> >>
> >>> 3) what does perf top looks like or perf top -p $pid_of_vhost
> >>
> >> The perf reports for the pid_of_vhost from pktgen_sample01_simple.sh
> >> with TAP+vhost-net (not pinned, pktgen single queue, fq_codel) are shown
>> below. I cannot see a huge difference between stock and patched.
> >>
> >> Also I included perf reports from the pktgen_pids. I find them more
>> interesting because tun_net_xmit shows less overhead for the patched.
> >> I assume that is due to the stopped netdev queue.
> >>
> >> I have now benchmarked pretty much all possible combinations (with a
> >> script) of TAP/TAP+vhost-net, single/multi-queue pktgen, vhost
> >> pinned/not pinned, with/without -b 0, fq_codel/noqueue... All of that
>> with perf records.
> >> I could share them if you want but I feel this is getting out of hand.
> >>
> >>
> >> Stock:
> >> sudo perf record -p "$vhost_pid"
> >> ...
> >> # Overhead  Command          Shared Object               Symbol
> >> # ........  ...............  ..........................  ..........................................
> >> #
> >>      5.97%  vhost-4874       [kernel.kallsyms]           [k] _copy_to_iter
> >>      2.68%  vhost-4874       [kernel.kallsyms]           [k] tun_do_read
> >>      2.23%  vhost-4874       [kernel.kallsyms]           [k] native_write_msr
> >>      1.93%  vhost-4874       [kernel.kallsyms]           [k] __check_object_size
> >
> > Let's disable CONFIG_HARDENED_USERCOPY and retry.
> >
> >>      1.61%  vhost-4874       [kernel.kallsyms]           [k] __slab_free.isra.0
> >>      1.56%  vhost-4874       [kernel.kallsyms]           [k] __get_user_nocheck_2
> >>      1.54%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_zero
> >>      1.45%  vhost-4874       [kernel.kallsyms]           [k] kmem_cache_free
> >>      1.43%  vhost-4874       [kernel.kallsyms]           [k] tun_recvmsg
> >>      1.24%  vhost-4874       [kernel.kallsyms]           [k] sk_skb_reason_drop
> >>      1.12%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_safe_ret
> >>      1.07%  vhost-4874       [kernel.kallsyms]           [k] native_read_msr
> >>      0.76%  vhost-4874       [kernel.kallsyms]           [k] simple_copy_to_iter
> >>      0.75%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_return_thunk
> >>      0.69%  vhost-4874       [vhost]                     [k] 0x0000000000002e70
> >>      0.59%  vhost-4874       [kernel.kallsyms]           [k] skb_release_data
> >>      0.59%  vhost-4874       [kernel.kallsyms]           [k] __skb_datagram_iter
> >>      0.53%  vhost-4874       [vhost]                     [k] 0x0000000000002e5f
> >>      0.51%  vhost-4874       [kernel.kallsyms]           [k] slab_update_freelist.isra.0
> >>      0.46%  vhost-4874       [kernel.kallsyms]           [k] kfree_skbmem
> >>      0.44%  vhost-4874       [kernel.kallsyms]           [k] skb_copy_datagram_iter
> >>      0.43%  vhost-4874       [kernel.kallsyms]           [k] skb_free_head
> >>      0.37%  qemu-system-x86  [unknown]                   [k] 0xffffffffba898b1b
> >>      0.35%  vhost-4874       [vhost]                     [k] 0x0000000000002e6b
> >>      0.33%  vhost-4874       [vhost_net]                 [k] 0x000000000000357d
> >>      0.28%  vhost-4874       [kernel.kallsyms]           [k] __check_heap_object
> >>      0.27%  vhost-4874       [vhost_net]                 [k] 0x00000000000035f3
> >>      0.26%  vhost-4874       [vhost_net]                 [k] 0x00000000000030f6
> >>      0.26%  vhost-4874       [kernel.kallsyms]           [k] __virt_addr_valid
> >>      0.24%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_advance
> >>      0.22%  vhost-4874       [kernel.kallsyms]           [k] perf_event_update_userpage
> >>      0.22%  vhost-4874       [kernel.kallsyms]           [k] check_stack_object
> >>      0.19%  qemu-system-x86  [unknown]                   [k] 0xffffffffba2a68cd
> >>      0.19%  vhost-4874       [kernel.kallsyms]           [k] dequeue_entities
> >>      0.19%  vhost-4874       [vhost_net]                 [k] 0x0000000000003237
> >>      0.18%  vhost-4874       [vhost_net]                 [k] 0x0000000000003550
> >>      0.18%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_del
> >>      0.18%  vhost-4874       [vhost_net]                 [k] 0x00000000000034a0
> >>      0.17%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_disable_all
> >>      0.16%  vhost-4874       [vhost_net]                 [k] 0x0000000000003523
> >>      0.16%  vhost-4874       [kernel.kallsyms]           [k] amd_pmu_addr_offset
> >> ...
> >>
> >>
> >> sudo perf record -p "$kpktgend_pids":
> >> ...
> >> # Overhead  Command      Shared Object      Symbol
> >> # ........  ...........  .................  ...............................................
> >> #
> >>     10.98%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
> >>     10.45%  kpktgend_0   [kernel.kallsyms]  [k] memset
> >>      8.40%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
> >>      6.31%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
> >>      3.13%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_safe_ret
> >>      2.40%  kpktgend_0   [kernel.kallsyms]  [k] sk_skb_reason_drop
> >>      2.11%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_return_thunk
> >
> > This is a hint that the SRSO mitigation is enabled.
> >
> > Have you disabled CPU_MITIGATIONS via either Kconfig or kernel command
> > line (mitigations=off) for both host and guest?
> >
> > Thanks
> >
>
> Both of your suggested changes really boosted the performance,
> especially for TAP.

Good to know that.

>
> I disabled SRSO mitigation with spec_rstack_overflow=off and went from
> "Mitigation: Safe RET" to "Vulnerable" on the host. The VM showed "Not
> affected" but I applied spec_rstack_overflow=off anyway.

I think we need to find the root cause of the regression.

>
> Here are some new benchmarks for pktgen_sample01_simple.sh:
> (I also have other available and I can share them if you want.)
>

It's a little hard to compare the two outputs by eye, maybe you can do perf diff.

(Btw, the Spring Festival public holiday in China is near, so
replies might be slow).

Thanks
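
[Editor's note: a perf diff workflow along these lines could make the
stock-vs-patched comparison mechanical. This is only a sketch; the file
names stock.data/patched.data and the VHOST_PID variable are hypothetical
placeholders, not something taken from this thread. The commands are
printed rather than executed so the sequence is visible at a glance.]

```shell
# Hypothetical stock-vs-patched comparison using perf diff.
# Record the same benchmark once per kernel, then let perf line up the
# per-symbol overhead deltas instead of eyeballing two perf reports.
perf_diff_workflow() {
	echo 'perf record -o stock.data   -p "$VHOST_PID" -- sleep 100  # on the stock kernel'
	echo 'perf record -o patched.data -p "$VHOST_PID" -- sleep 100  # on the patched kernel'
	echo 'perf diff stock.data patched.data                         # per-symbol delta'
}
perf_diff_workflow
```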


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-12  8:14                                           ` Jason Wang
@ 2026-02-14 17:13                                             ` Simon Schippers
  2026-02-14 18:18                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Schippers @ 2026-02-14 17:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, leiyang, stephen, jon, tim.gebauer, netdev,
	linux-kernel, kvm, virtualization

On 2/12/26 09:14, Jason Wang wrote:
> On Mon, Feb 9, 2026 at 2:18 AM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 2/6/26 04:21, Jason Wang wrote:
>>> On Fri, Feb 6, 2026 at 6:28 AM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 2/5/26 04:59, Jason Wang wrote:
>>>>> On Wed, Feb 4, 2026 at 11:44 PM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> On 2/3/26 04:48, Jason Wang wrote:
>>>>>>> On Mon, Feb 2, 2026 at 4:19 AM Simon Schippers
>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>
>>>>>>>> On 1/30/26 02:51, Jason Wang wrote:
>>>>>>>>> On Thu, Jan 29, 2026 at 5:25 PM Simon Schippers
>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>
>>>>>>>>>> On 1/29/26 02:14, Jason Wang wrote:
>>>>>>>>>>> On Wed, Jan 28, 2026 at 3:54 PM Simon Schippers
>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 1/28/26 08:03, Jason Wang wrote:
>>>>>>>>>>>>> On Wed, Jan 28, 2026 at 12:48 AM Simon Schippers
>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 1/23/26 10:54, Simon Schippers wrote:
>>>>>>>>>>>>>>> On 1/23/26 04:05, Jason Wang wrote:
>>>>>>>>>>>>>>>> On Thu, Jan 22, 2026 at 1:35 PM Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2026 at 5:33 PM Simon Schippers
>>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 1/9/26 07:02, Jason Wang wrote:
>>>>>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 3:41 PM Simon Schippers
>>>>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 1/8/26 04:38, Jason Wang wrote:
>>>>>>>>>>>>>>>>>>>>> On Thu, Jan 8, 2026 at 5:06 AM Simon Schippers
>>>>>>>>>>>>>>>>>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Introduce {tun,tap}_ring_consume() helpers that wrap __ptr_ring_consume()
>>>>>>>>>>>>>>>>>>>>>> and wake the corresponding netdev subqueue when consuming an entry frees
>>>>>>>>>>>>>>>>>>>>>> space in the underlying ptr_ring.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Stopping of the netdev queue when the ptr_ring is full will be introduced
>>>>>>>>>>>>>>>>>>>>>> in an upcoming commit.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>  drivers/net/tap.c | 23 ++++++++++++++++++++++-
>>>>>>>>>>>>>>>>>>>>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>>>>>>>>>>>>>>>>>>>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>>>>>>>>>>>>>>>> index 1197f245e873..2442cf7ac385 100644
>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>>>>>>>>>>>>>>>> @@ -753,6 +753,27 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>>>>>         return ret ? ret : total;
>>>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> +static void *tap_ring_consume(struct tap_queue *q)
>>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &q->ring;
>>>>>>>>>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(q->tap)->dev;
>>>>>>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, q->queue_index);
>>>>>>>>>>>>>>>>>>>>>> +               rcu_read_unlock();
>>>>>>>>>>>>>>>>>>>>>> +       }
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +       spin_unlock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +       return ptr;
>>>>>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>  static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>>>>>                            struct iov_iter *to,
>>>>>>>>>>>>>>>>>>>>>>                            int noblock, struct sk_buff *skb)
>>>>>>>>>>>>>>>>>>>>>> @@ -774,7 +795,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>>>>>>>>>>>>>>>>>>>>>                                         TASK_INTERRUPTIBLE);
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>                 /* Read frames from the queue */
>>>>>>>>>>>>>>>>>>>>>> -               skb = ptr_ring_consume(&q->ring);
>>>>>>>>>>>>>>>>>>>>>> +               skb = tap_ring_consume(q);
>>>>>>>>>>>>>>>>>>>>>>                 if (skb)
>>>>>>>>>>>>>>>>>>>>>>                         break;
>>>>>>>>>>>>>>>>>>>>>>                 if (noblock) {
>>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>>>>>>>>>>>>>>> index 8192740357a0..7148f9a844a4 100644
>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>>>>>>>>>>>>>>> @@ -2113,13 +2113,34 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>>>>>>>>>>>>>>>         return total;
>>>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> +static void *tun_ring_consume(struct tun_file *tfile)
>>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>>> +       struct ptr_ring *ring = &tfile->tx_ring;
>>>>>>>>>>>>>>>>>>>>>> +       struct net_device *dev;
>>>>>>>>>>>>>>>>>>>>>> +       void *ptr;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +       spin_lock(&ring->consumer_lock);
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +       ptr = __ptr_ring_consume(ring);
>>>>>>>>>>>>>>>>>>>>>> +       if (unlikely(ptr && __ptr_ring_consume_created_space(ring, 1))) {
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I guess it's the "bug" I mentioned in the previous patch that leads to
>>>>>>>>>>>>>>>>>>>>> the check of __ptr_ring_consume_created_space() here. If it's true,
>>>>>>>>>>>>>>>>>>>>> another call to tweak the current API.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> +               rcu_read_lock();
>>>>>>>>>>>>>>>>>>>>>> +               dev = rcu_dereference(tfile->tun)->dev;
>>>>>>>>>>>>>>>>>>>>>> +               netif_wake_subqueue(dev, tfile->queue_index);
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This would cause the producer TX_SOFTIRQ to run on the same cpu which
>>>>>>>>>>>>>>>>>>>>> I'm not sure is what we want.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> What else would you suggest calling to wake the queue?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't have a good method in my mind, just want to point out its implications.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have to admit I'm a bit stuck at this point, particularly with this
>>>>>>>>>>>>>>>>>> aspect.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What is the correct way to pass the producer CPU ID to the consumer?
>>>>>>>>>>>>>>>>>> Would it make sense to store smp_processor_id() in the tfile inside
>>>>>>>>>>>>>>>>>> tun_net_xmit(), or should it instead be stored in the skb (similar to the
>>>>>>>>>>>>>>>>>> XDP bit)? In the latter case, my concern is that this information may
>>>>>>>>>>>>>>>>>> already be significantly outdated by the time it is used.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Based on that, my idea would be for the consumer to wake the producer by
>>>>>>>>>>>>>>>>>> invoking a new function (e.g., tun_wake_queue()) on the producer CPU via
>>>>>>>>>>>>>>>>>> smp_call_function_single().
>>>>>>>>>>>>>>>>>> Is this a reasonable approach?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm not sure but it would introduce costs like IPI.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> More generally, would triggering TX_SOFTIRQ on the consumer CPU be
>>>>>>>>>>>>>>>>>> considered a deal-breaker for the patch set?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It depends on whether or not it has effects on the performance.
>>>>>>>>>>>>>>>>> Especially when vhost is pinned.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I meant we can benchmark to see the impact. For example, pin vhost to
>>>>>>>>>>>>>>>> a specific CPU and the try to see the impact of the TX_SOFTIRQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I ran benchmarks with vhost pinned to CPU 0 using taskset -p -c 0 ...
>>>>>>>>>>>>>>> for both the stock and patched versions. The benchmarks were run with
>>>>>>>>>>>>>>> the full patch series applied, since testing only patches 1-3 would not
>>>>>>>>>>>>>>> be meaningful - the queue is never stopped in that case, so no
>>>>>>>>>>>>>>> TX_SOFTIRQ is triggered.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Compared to the non-pinned CPU benchmarks in the cover letter,
>>>>>>>>>>>>>>> performance is lower for pktgen with a single thread but higher with
>>>>>>>>>>>>>>> four threads. The results show no regression for the patched version,
>>>>>>>>>>>>>>> with even slight performance improvements observed:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +-------------------------+-----------+----------------+
>>>>>>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>>>>>>>>>> | 100M packets            |           |                |
>>>>>>>>>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>>>> | TAP       | Transmitted | 452 Kpps  | 454 Kpps       |
>>>>>>>>>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>>>>>>>>>> | vhost-net | Lost        | 1154 Kpps | 0              |
>>>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +-------------------------+-----------+----------------+
>>>>>>>>>>>>>>> | pktgen benchmarks to    | Stock     | Patched with   |
>>>>>>>>>>>>>>> | Debian VM, i5 6300HQ,   |           | fq_codel qdisc |
>>>>>>>>>>>>>>> | 100M packets            |           |                |
>>>>>>>>>>>>>>> | vhost pinned to core 0  |           |                |
>>>>>>>>>>>>>>> | *4 threads*             |           |                |
>>>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>>>> | TAP       | Transmitted | 71 Kpps   | 79 Kpps        |
>>>>>>>>>>>>>>> |  +        +-------------+-----------+----------------+
>>>>>>>>>>>>>>> | vhost-net | Lost        | 1527 Kpps | 0              |
>>>>>>>>>>>>>>> +-----------+-------------+-----------+----------------+
>>>>>>>>>>>>>
>>>>>>>>>>>>> The PPS seems to be low. I'd suggest using testpmd (rxonly) mode in
>>>>>>>>>>>>> the guest or an xdp program that did XDP_DROP in the guest.
>>>>>>>>>>>>
>>>>>>>>>>>> I forgot to mention that these PPS values are per thread.
>>>>>>>>>>>> So overall we have 71 Kpps * 4 = 284 Kpps and 79 Kpps * 4 = 316 Kpps,
>>>>>>>>>>>> respectively. For packet loss, that comes out to 1154 Kpps * 4 =
>>>>>>>>>>>> 4616 Kpps and 0, respectively.
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry about that!
>>>>>>>>>>>>
>>>>>>>>>>>> The pktgen benchmarks with a single thread look fine, right?
>>>>>>>>>>>
>>>>>>>>>>> Still looks very low. E.g I just have a run of pktgen (using
>>>>>>>>>>> pktgen_sample03_burst_single_flow.sh) without a XDP_DROP in the guest,
>>>>>>>>>>> I can get 1Mpps.
>>>>>>>>>>
>>>>>>>>>> Keep in mind that I am using an older CPU (i5-6300HQ). For the
>>>>>>>>>> single-threaded tests I always used pktgen_sample01_simple.sh, and for
>>>>>>>>>> the multi-threaded tests I always used pktgen_sample02_multiqueue.sh.
>>>>>>>>>>
>>>>>>>>>> Using pktgen_sample03_burst_single_flow.sh as you did fails for me (even
>>>>>>>>>> though the same parameters work fine for sample01 and sample02):
>>>>>>>>>>
>>>>>>>>>> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i tap0 -m
>>>>>>>>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 100000000
>>>>>>>>>> /samples/pktgen/functions.sh: line 79: echo: write error: Operation not
>>>>>>>>>> supported
>>>>>>>>>> ERROR: Write error(1) occurred
>>>>>>>>>> cmd: "burst 32 > /proc/net/pktgen/tap0@0"
>>>>>>>>>>
>>>>>>>>>> ...and I do not know what I am doing wrong, even after looking at
>>>>>>>>>> Documentation/networking/pktgen.rst. Every burst size except 1 fails.
>>>>>>>>>> Any clues?
>>>>>>>>>
>>>>>>>>> Please use -b 0, and I'm Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.
>>>>>>>>
>>>>>>>> I tried using "-b 0", and while it worked, there was no noticeable
>>>>>>>> performance improvement.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Another thing I can think of is to disable
>>>>>>>>>
>>>>>>>>> 1) mitigations in both guest and host
>>>>>>>>> 2) any kernel debug features in both host and guest
>>>>>>>>
>>>>>>>> I also rebuilt the kernel with everything disabled under
>>>>>>>> "Kernel hacking", but that didn’t make any difference either.
>>>>>>>>
>>>>>>>> Because of this, I ran "pktgen_sample01_simple.sh" and
>>>>>>>> "pktgen_sample02_multiqueue.sh" on my AMD Ryzen 5 5600X system. The
>>>>>>>> results were about 374 Kpps with TAP and 1192 Kpps with TAP+vhost_net,
>>>>>>>> with very similar performance between the stock and patched kernels.
>>>>>>>>
>>>>>>>> Personally, I think the hardware is to blame for the low performance.
>>>>>>>
>>>>>>> Let's double confirm this by:
>>>>>>>
>>>>>>> 1) make sure pktgen is using 100% CPU
>>>>>>> 2) Perf doesn't show anything strange for pktgen thread
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>
>>>>>> I ran pktgen using pktgen_sample01_simple.sh and, in parallel, started a
>>>>>> 100-second perf stat measurement covering all kpktgend threads.
>>>>>>
>>>>>> Across all configurations, a single CPU was fully utilized.
>>>>>>
>>>>>> Apart from that, the patched variants show a higher branch frequency and
>>>>>> a slightly increased number of context switches.
>>>>>>
>>>>>>
>>>>>> The detailed results are provided below:
>>>>>>
>>>>>> Processor: Ryzen 5 5600X
>>>>>>
>>>>>> pktgen command:
>>>>>> sudo perf stat samples/pktgen/pktgen_sample01_simple.sh -i tap0 -m
>>>>>> 52:54:00:12:34:56 -d 10.0.0.2 -n 10000000000
>>>>>>
>>>>>> perf stat command:
>>>>>> sudo perf stat --timeout 100000 -p $(pgrep kpktgend | tr '\n' ,) -o X.txt
>>>>>>
>>>>>>
>>>>>> Results:
>>>>>> Stock TAP:
>>>>>>             46.997      context-switches                 #    467,2 cs/sec  cs_per_second
>>>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>>>         100.587,69 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>>>      8.491.586.483      branch-misses                    #     10,9 %  branch_miss_rate         (50,24%)
>>>>>>     77.734.761.406      branches                         #    772,8 M/sec  branch_frequency     (66,85%)
>>>>>>    382.420.291.585      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>>>    377.612.185.141      instructions                     #      1,0 instructions  insn_per_cycle  (66,85%)
>>>>>>     84.012.185.936      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
>>>>>>
>>>>>>      100,100414494 seconds time elapsed
>>>>>>
>>>>>>
>>>>>> Stock TAP+vhost-net:
>>>>>>             47.087      context-switches                 #    468,1 cs/sec  cs_per_second
>>>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>>>         100.594,09 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>>>      8.034.703.613      branch-misses                    #     11,1 %  branch_miss_rate         (50,24%)
>>>>>>     72.477.989.922      branches                         #    720,5 M/sec  branch_frequency     (66,86%)
>>>>>>    382.218.276.832      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>>>    349.555.577.281      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
>>>>>>     83.917.644.262      stalled-cycles-frontend          #     0,22 frontend_cycles_idle        (66,35%)
>>>>>>
>>>>>>      100,100520402 seconds time elapsed
>>>>>>
>>>>>>
>>>>>> Patched TAP:
>>>>>>             47.862      context-switches                 #    475,8 cs/sec  cs_per_second
>>>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>>>         100.589,30 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>>>      9.337.258.794      branch-misses                    #      9,4 %  branch_miss_rate         (50,19%)
>>>>>>     99.518.421.676      branches                         #    989,4 M/sec  branch_frequency     (66,85%)
>>>>>>    382.508.244.894      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>>>    312.582.270.975      instructions                     #      0,8 instructions  insn_per_cycle  (66,85%)
>>>>>>     76.338.503.984      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,39%)
>>>>>>
>>>>>>      100,101262454 seconds time elapsed
>>>>>>
>>>>>>
>>>>>> Patched TAP+vhost-net:
>>>>>>             47.892      context-switches                 #    476,1 cs/sec  cs_per_second
>>>>>>                  0      cpu-migrations                   #      0,0 migrations/sec  migrations_per_second
>>>>>>                  0      page-faults                      #      0,0 faults/sec  page_faults_per_second
>>>>>>         100.581,95 msec task-clock                       #      1,0 CPUs  CPUs_utilized
>>>>>>      9.083.588.313      branch-misses                    #     10,1 %  branch_miss_rate         (50,28%)
>>>>>>     90.300.124.712      branches                         #    897,8 M/sec  branch_frequency     (66,85%)
>>>>>>    382.374.510.376      cpu-cycles                       #      3,8 GHz  cycles_frequency       (66,85%)
>>>>>>    340.089.181.199      instructions                     #      0,9 instructions  insn_per_cycle  (66,85%)
>>>>>>     78.151.408.955      stalled-cycles-frontend          #     0,20 frontend_cycles_idle        (66,31%)
>>>>>>
>>>>>>      100,101212911 seconds time elapsed
>>>>>
>>>>> Thanks for sharing. I have more questions:
>>>>>
>>>>> 1) The number of CPU and vCPUs
>>>>
>>>> QEMU runs with a single vCPU, and my host system is now a Ryzen 5 5600X
>>>> with 6 cores, 12 threads.
>>>> This is my command for TAP+vhost-net:
>>>>
>>>> sudo qemu-system-x86_64 -hda debian.qcow2
>>>> -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no,vhost=on
>>>> -device virtio-net-pci,netdev=mynet0 -m 1024 -enable-kvm
>>>>
>>>> For TAP only it is the same but without vhost=on.
>>>>
>>>>> 2) If you pin vhost or vCPU threads
>>>>
>>>> Not in the previously shown benchmark. I pinned vhost in other benchmarks,
>>>> but since there is only a minor PPS difference, I omitted them for the
>>>> sake of simplicity.
>>>>
>>>>> 3) what does perf top looks like or perf top -p $pid_of_vhost
>>>>
>>>> The perf reports for the pid_of_vhost from pktgen_sample01_simple.sh
>>>> with TAP+vhost-net (not pinned, pktgen single queue, fq_codel) are shown
>>>> below. I cannot see a huge difference between stock and patched.
>>>>
>>>> Also, I included perf reports from the pktgen_pids. I find them more
>>>> interesting because tun_net_xmit shows less overhead in the patched version.
>>>> I assume that is due to the stopped netdev queue.
>>>>
>>>> I have now benchmarked pretty much all possible combinations (with a
>>>> script) of TAP/TAP+vhost-net, single/multi-queue pktgen, vhost
>>>> pinned/not pinned, with/without -b 0, fq_codel/noqueue... All of that
>>>> with perf records.
>>>> I could share them if you want, but I feel this is getting out of hand.
>>>>
>>>>
>>>> Stock:
>>>> sudo perf record -p "$vhost_pid"
>>>> ...
>>>> # Overhead  Command          Shared Object               Symbol
>>>> # ........  ...............  ..........................  ..........................................
>>>> #
>>>>      5.97%  vhost-4874       [kernel.kallsyms]           [k] _copy_to_iter
>>>>      2.68%  vhost-4874       [kernel.kallsyms]           [k] tun_do_read
>>>>      2.23%  vhost-4874       [kernel.kallsyms]           [k] native_write_msr
>>>>      1.93%  vhost-4874       [kernel.kallsyms]           [k] __check_object_size
>>>
>>> Let's disable CONFIG_HARDENED_USERCOPY and retry.
>>>
>>>>      1.61%  vhost-4874       [kernel.kallsyms]           [k] __slab_free.isra.0
>>>>      1.56%  vhost-4874       [kernel.kallsyms]           [k] __get_user_nocheck_2
>>>>      1.54%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_zero
>>>>      1.45%  vhost-4874       [kernel.kallsyms]           [k] kmem_cache_free
>>>>      1.43%  vhost-4874       [kernel.kallsyms]           [k] tun_recvmsg
>>>>      1.24%  vhost-4874       [kernel.kallsyms]           [k] sk_skb_reason_drop
>>>>      1.12%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_safe_ret
>>>>      1.07%  vhost-4874       [kernel.kallsyms]           [k] native_read_msr
>>>>      0.76%  vhost-4874       [kernel.kallsyms]           [k] simple_copy_to_iter
>>>>      0.75%  vhost-4874       [kernel.kallsyms]           [k] srso_alias_return_thunk
>>>>      0.69%  vhost-4874       [vhost]                     [k] 0x0000000000002e70
>>>>      0.59%  vhost-4874       [kernel.kallsyms]           [k] skb_release_data
>>>>      0.59%  vhost-4874       [kernel.kallsyms]           [k] __skb_datagram_iter
>>>>      0.53%  vhost-4874       [vhost]                     [k] 0x0000000000002e5f
>>>>      0.51%  vhost-4874       [kernel.kallsyms]           [k] slab_update_freelist.isra.0
>>>>      0.46%  vhost-4874       [kernel.kallsyms]           [k] kfree_skbmem
>>>>      0.44%  vhost-4874       [kernel.kallsyms]           [k] skb_copy_datagram_iter
>>>>      0.43%  vhost-4874       [kernel.kallsyms]           [k] skb_free_head
>>>>      0.37%  qemu-system-x86  [unknown]                   [k] 0xffffffffba898b1b
>>>>      0.35%  vhost-4874       [vhost]                     [k] 0x0000000000002e6b
>>>>      0.33%  vhost-4874       [vhost_net]                 [k] 0x000000000000357d
>>>>      0.28%  vhost-4874       [kernel.kallsyms]           [k] __check_heap_object
>>>>      0.27%  vhost-4874       [vhost_net]                 [k] 0x00000000000035f3
>>>>      0.26%  vhost-4874       [vhost_net]                 [k] 0x00000000000030f6
>>>>      0.26%  vhost-4874       [kernel.kallsyms]           [k] __virt_addr_valid
>>>>      0.24%  vhost-4874       [kernel.kallsyms]           [k] iov_iter_advance
>>>>      0.22%  vhost-4874       [kernel.kallsyms]           [k] perf_event_update_userpage
>>>>      0.22%  vhost-4874       [kernel.kallsyms]           [k] check_stack_object
>>>>      0.19%  qemu-system-x86  [unknown]                   [k] 0xffffffffba2a68cd
>>>>      0.19%  vhost-4874       [kernel.kallsyms]           [k] dequeue_entities
>>>>      0.19%  vhost-4874       [vhost_net]                 [k] 0x0000000000003237
>>>>      0.18%  vhost-4874       [vhost_net]                 [k] 0x0000000000003550
>>>>      0.18%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_del
>>>>      0.18%  vhost-4874       [vhost_net]                 [k] 0x00000000000034a0
>>>>      0.17%  vhost-4874       [kernel.kallsyms]           [k] x86_pmu_disable_all
>>>>      0.16%  vhost-4874       [vhost_net]                 [k] 0x0000000000003523
>>>>      0.16%  vhost-4874       [kernel.kallsyms]           [k] amd_pmu_addr_offset
>>>> ...
>>>>
>>>>
>>>> sudo perf record -p "$kpktgend_pids":
>>>> ...
>>>> # Overhead  Command      Shared Object      Symbol
>>>> # ........  ...........  .................  ...............................................
>>>> #
>>>>     10.98%  kpktgend_0   [kernel.kallsyms]  [k] tun_net_xmit
>>>>     10.45%  kpktgend_0   [kernel.kallsyms]  [k] memset
>>>>      8.40%  kpktgend_0   [kernel.kallsyms]  [k] __alloc_skb
>>>>      6.31%  kpktgend_0   [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
>>>>      3.13%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_safe_ret
>>>>      2.40%  kpktgend_0   [kernel.kallsyms]  [k] sk_skb_reason_drop
>>>>      2.11%  kpktgend_0   [kernel.kallsyms]  [k] srso_alias_return_thunk
>>>
>>> This is a hint that SRSO mitigation is enabled.
>>>
>>> Have you disabled CPU_MITIGATIONS via either Kconfig or the kernel command
>>> line (mitigations=off) for both host and guest?
>>>
>>> Thanks
>>>
>>
>> Both of your suggested changes really boosted the performance, especially
>> for TAP.
> 
> Good to know that.
> 
>>
>> I disabled SRSO mitigation with spec_rstack_overflow=off and went from
>> "Mitigation: Safe RET" to "Vulnerable" on the host. The VM showed "Not
>> affected" but I applied spec_rstack_overflow=off anyway.
> 
> I think we need to find the root cause of the regression.
> 
>>
>> Here are some new benchmarks for pktgen_sample01_simple.sh:
>> (I also have other available and I can share them if you want.)
>>
> 
> It's a little hard to compare the diff, maybe you can do perf diff.

I ran perf diff for the pktgen perf records of TAP and TAP+vhost-net
(both single queue, not cpu pinned).

With the help of that perf diff (results below) I found that functions
related to the queue wakeup (e.g. __local_bh_enable_ip) account for a
noticeable share of the cycles. Because my patch already wakes the queue
as soon as __ptr_ring_consume_created_space() reports freed space, I
suspect this leads to very frequent stop -> wake -> stop -> wake cycles.

Therefore, I also compiled a new variant that wakes on
__ptr_ring_empty() instead. The idea is that netif_tx_wake_queue() is
invoked less frequently.

The pktgen results:

+-------------------------+-----------+-----------+---------------+
| pktgen benchmarks to    | Stock     | Patched   | Wake on       |
| Debian VM, R5 5600X,    |           |           | empty Variant |
| 100M packets            |           |           |               |
| CPU not pinned          |           |           |               |
+-----------+-------------+-----------+-----------+---------------+
| TAP       | Transmitted | 1293 Kpps | 989 Kpps  | 1248 Kpps     |
|           +-------------+-----------+-----------+---------------+
|           | Lost        | 3918 Kpps | 0         | 0             |
+-----------+-------------+-----------+-----------+---------------+
| TAP       | Transmitted | 1411 Kpps | 1410 Kpps | 1379 Kpps     |
|  +        +-------------+-----------+-----------+---------------+
| vhost-net | Lost        | 3659 Kpps | 0         | 0             |
+-----------+-------------+-----------+-----------+---------------+


My conclusions are:

Patched: Waking on __ptr_ring_consume_created_space() is too early. The
         stop/wake cycle occurs too frequently, which slows down
         performance, as can be seen for TAP.

Wake on empty variant: Waking on __ptr_ring_empty() is (slightly) too
                       late. The consumer starves because the producer
                       first has to produce packets again. This slows
                       down performance as well, as can be seen for TAP
                       and TAP+vhost-net (both down ~30-40 Kpps).

I think something in between should be used.
The wake should be done as late as possible to keep the number of
NET_TX_SOFTIRQs low, but early enough that there are still consumable
packets remaining so the consumer does not starve.

However, I cannot think of a proper way to implement this right now.
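One hypothetical direction, purely as a sketch (the struct, field, and
threshold choice below are invented for illustration and are not part of
this series): have the consumer count freed slots since the queue was
stopped and wake only once a threshold, e.g. a quarter of the ring, has
been freed:

```c
#include <stdbool.h>

/* Invented state for the sketch; the real tun/tap code would track this
 * differently (e.g. inside the ptr_ring consumer path). */
struct ring_state {
	int size;		/* ring capacity                          */
	int freed_since_stop;	/* slots freed since the queue stopped    */
	bool stopped;		/* is the netdev queue currently stopped? */
};

/* Consumer side: called after each successful consume. Returns true once
 * at least size/4 slots have been freed - late enough to batch wakeups,
 * early enough that the consumer still has size - size/4 packets left to
 * work on while the producer refills. */
static bool consume_should_wake(struct ring_state *r)
{
	if (!r->stopped)
		return false;
	if (++r->freed_since_stop < r->size / 4)
		return false;
	r->stopped = false;
	r->freed_since_stop = 0;
	return true;
}
```

For a 256-slot ring this would wake the queue once 64 slots are free;
whether a quarter of the ring is the right fraction would of course need
benchmarking.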

Thanks!


=========================================================================
TAP:

# Event 'cpu/cycles/P'
#
# Data files:
#  [0] stock_pktgen.data (Baseline)
#  [1] patched_pktgen.data 
#  [2] wake_on_empty_variant_pktgen.data 
#
# Baseline/0  Delta Abs/1  Delta Abs/2  Shared Object      Symbol                                                  
# ..........  ...........  ...........  .................  ........................................................
#
      24.49%      +43.46%      +47.09%  [pktgen]           [k] 0x0000000000000e30
      22.27%      -17.03%      -16.76%  [kernel.kallsyms]  [k] memset
      10.59%       -7.72%       -8.06%  [kernel.kallsyms]  [k] __alloc_skb
       7.50%       -5.34%       -6.00%  [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
       1.20%       +4.08%       +2.82%  [kernel.kallsyms]  [k] __local_bh_enable_ip
       5.76%       -4.04%       -4.07%  [kernel.kallsyms]  [k] tun_net_xmit
       3.15%       -2.23%       -2.37%  [kernel.kallsyms]  [k] chacha_permute
       0.22%       +1.87%       +1.41%  [kernel.kallsyms]  [k] kthread_should_stop
       2.36%       -1.55%       -1.57%  [kernel.kallsyms]  [k] get_random_u32
       2.19%       -1.51%       -1.74%  [kernel.kallsyms]  [k] skb_put
       0.18%       +1.33%       +1.04%  [kernel.kallsyms]  [k] __cond_resched
       0.68%       +1.32%       +0.87%  [kernel.kallsyms]  [k] __rcu_read_unlock
       0.49%       +1.17%       +0.85%  [kernel.kallsyms]  [k] __rcu_read_lock
       1.40%       -1.15%       -1.24%  [kernel.kallsyms]  [k] sk_skb_reason_drop
       1.34%       -1.12%       -1.16%  [kernel.kallsyms]  [k] ip_send_check
       1.10%       -0.80%       -0.85%  [kernel.kallsyms]  [k] _raw_spin_lock
       1.04%       -0.71%       -0.81%  [kernel.kallsyms]  [k] kmalloc_reserve
       0.86%       -0.51%       -0.66%  [kernel.kallsyms]  [k] __netdev_alloc_skb
       0.62%       -0.41%       -0.46%  [kernel.kallsyms]  [k] __get_random_u32_below
       0.50%       -0.34%       -0.38%  [kernel.kallsyms]  [k] _get_random_bytes
       0.37%       -0.26%       -0.28%  [kernel.kallsyms]  [k] crng_fast_key_erasure
       0.33%       -0.19%       -0.25%  [kernel.kallsyms]  [k] chacha_block_generic
       0.24%       -0.15%       -0.18%  [kernel.kallsyms]  [k] skb_clone_tx_timestamp
       0.31%       -0.13%       -0.23%  [kernel.kallsyms]  [k] skb_push
       0.30%       -0.11%       -0.22%  [kernel.kallsyms]  [k] _raw_spin_unlock
       0.25%       -0.10%       -0.14%  [kernel.kallsyms]  [k] __x86_indirect_thunk_array
       0.56%       +0.08%       +0.15%  [kernel.kallsyms]  [k] sock_def_readable
       0.12%       -0.08%       -0.09%  [kernel.kallsyms]  [k] memcpy
       0.25%       +0.06%       +0.01%  [kernel.kallsyms]  [k] ___slab_alloc
       0.31%       +0.05%       -0.03%  [kernel.kallsyms]  [k] native_write_msr
       0.01%       +0.05%       +0.96%  [kernel.kallsyms]  [k] clear_page_erms
       0.07%       +0.03%       -0.03%  [kernel.kallsyms]  [k] get_partial_node
       0.08%       -0.03%       -0.05%  [kernel.kallsyms]  [k] crng_make_state
       0.02%       +0.02%       -0.01%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
       0.02%       -0.02%       -0.01%  [kernel.kallsyms]  [k] read_tsc
       0.02%       +0.02%       +0.00%  [kernel.kallsyms]  [k] __slab_alloc.isra.0
       0.00%       +0.02%       +0.10%  [kernel.kallsyms]  [k] get_page_from_freelist
       0.08%       +0.01%       -0.06%  [kernel.kallsyms]  [k] put_cpu_partial
       0.00%       +0.01%       +0.25%  [kernel.kallsyms]  [k] allocate_slab
       0.01%       +0.01%       -0.00%  [kernel.kallsyms]  [k] x86_schedule_events
       0.01%       +0.01%       -0.01%  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
       0.00%       +0.01%       +0.00%  [kernel.kallsyms]  [k] perf_assign_events
       0.00%       +0.01%       -0.00%  [amdgpu]           [k] 0x00000000000662f4
       0.02%       +0.01%       +0.00%  [kernel.kallsyms]  [k] amd_pmu_addr_offset


=========================================================================
TAP+vhost-net:

# Event 'cpu/cycles/P'
#
# Data files:
#  [0] stock_pktgen.data (Baseline)
#  [1] patched_pktgen.data 
#  [2] wake_on_empty_variant_pktgen.data 
#
# Baseline/0  Delta Abs/1  Delta Abs/2  Shared Object      Symbol                                        
# ..........  ...........  ...........  .................  ..............................................
#
      24.35%      +47.04%      +45.59%  [pktgen]           [k] 0x0000000000000e30
      22.06%      -16.02%      -16.19%  [kernel.kallsyms]  [k] memset
      10.72%       -7.84%       -7.84%  [kernel.kallsyms]  [k] __alloc_skb
       7.59%       -5.82%       -5.79%  [kernel.kallsyms]  [k] kmem_cache_alloc_node_noprof
       5.69%       -3.98%       -4.08%  [kernel.kallsyms]  [k] tun_net_xmit
       1.22%       +2.74%       +2.65%  [kernel.kallsyms]  [k] __local_bh_enable_ip
       3.18%       -2.33%       -2.30%  [kernel.kallsyms]  [k] chacha_permute
       2.47%       -1.78%       -1.49%  [kernel.kallsyms]  [k] get_random_u32
       2.16%       -1.58%       -1.66%  [kernel.kallsyms]  [k] skb_put
       0.24%       +1.36%       +1.33%  [kernel.kallsyms]  [k] kthread_should_stop
       1.47%       -1.28%       -1.29%  [kernel.kallsyms]  [k] sk_skb_reason_drop
       0.18%       +1.05%       +1.01%  [kernel.kallsyms]  [k] __cond_resched
       0.69%       +0.88%       +0.84%  [kernel.kallsyms]  [k] __rcu_read_unlock
       1.23%       -0.87%       -1.04%  [kernel.kallsyms]  [k] ip_send_check
       0.52%       +0.83%       +0.84%  [kernel.kallsyms]  [k] __rcu_read_lock
       1.09%       -0.80%       -0.78%  [kernel.kallsyms]  [k] _raw_spin_lock
       1.03%       -0.73%       -0.75%  [kernel.kallsyms]  [k] kmalloc_reserve
       0.83%       -0.61%       -0.58%  [kernel.kallsyms]  [k] __netdev_alloc_skb
       0.63%       -0.47%       -0.45%  [kernel.kallsyms]  [k] __get_random_u32_below
       0.47%       -0.34%       -0.33%  [kernel.kallsyms]  [k] _get_random_bytes
       0.36%       -0.26%       -0.25%  [kernel.kallsyms]  [k] crng_fast_key_erasure
       0.34%       -0.25%       -0.24%  [kernel.kallsyms]  [k] chacha_block_generic
       0.32%       -0.22%       -0.23%  [kernel.kallsyms]  [k] skb_push
       0.31%       -0.21%       -0.22%  [kernel.kallsyms]  [k] _raw_spin_unlock
       0.25%       -0.19%       -0.18%  [kernel.kallsyms]  [k] skb_clone_tx_timestamp
       0.28%       -0.15%       -0.16%  [kernel.kallsyms]  [k] __x86_indirect_thunk_array
       0.11%       -0.09%       -0.08%  [kernel.kallsyms]  [k] memcpy
       0.10%       -0.08%       -0.08%  [kernel.kallsyms]  [k] crng_make_state
       0.29%       -0.03%       -0.02%  [kernel.kallsyms]  [k] native_write_msr
       0.13%       -0.03%       -0.01%  [kernel.kallsyms]  [k] native_read_msr
       0.66%       +0.02%       +0.09%  [kernel.kallsyms]  [k] sock_def_readable
       0.27%       +0.01%       +0.03%  [kernel.kallsyms]  [k] ___slab_alloc
       0.03%       -0.01%       -0.01%  [kernel.kallsyms]  [k] __slab_alloc.isra.0
       0.03%       -0.01%       -0.00%  [kernel.kallsyms]  [k] amd_pmu_addr_offset
       0.08%       -0.01%       -0.03%  [kernel.kallsyms]  [k] get_partial_node
       0.09%       -0.01%       +1.02%  [kernel.kallsyms]  [k] clear_page_erms
       0.00%       +0.00%       +0.00%  [kernel.kallsyms]  [k] x86_schedule_events
       0.01%       +0.00%       +0.01%  [kernel.kallsyms]  [k] x86_pmu_add
       0.01%       +0.00%       +0.10%  [kernel.kallsyms]  [k] get_page_from_freelist
       0.01%       -0.00%       +0.00%  [kernel.kallsyms]  [k] x86_pmu_del
       0.01%       -0.00%       -0.00%  [kernel.kallsyms]  [k] x86_pmu_disable_all
       0.00%       +0.00%       -0.00%  [kernel.kallsyms]  [k] perf_assign_events
       0.01%       -0.00%       -0.00%  [kernel.kallsyms]  [k] group_sched_out
       0.01%       -0.00%       -0.00%  [kernel.kallsyms]  [k] read_tsc

> 
> (Btw, it's near the public holiday of the Spring Festival in China, so
> the reply might be slow).

Good to know, I won't ping you then.

> 
> Thanks
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-14 17:13                                             ` Simon Schippers
@ 2026-02-14 18:18                                               ` Michael S. Tsirkin
  2026-02-14 19:51                                                 ` Simon Schippers
  0 siblings, 1 reply; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-02-14 18:18 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Sat, Feb 14, 2026 at 06:13:14PM +0100, Simon Schippers wrote:

...

> Patched: Waking on __ptr_ring_produce_created_space() is too early. The
>          stop/wake cycle occurs too frequently which slows down
>          performance as can be seen for TAP.
> 
> Wake on empty variant: Waking on __ptr_ring_empty() is (slightly) too
>                        late. The consumer starves because the producer
>                        first has to produce packets again. This slows
>                        down performance as well, as can be seen for TAP
>                        and TAP+vhost-net (both down ~30-40 Kpps).
> 
> I think something in between should be used.
> The wake should be done as late as possible to have as few
> NET_TX_SOFTIRQs as possible but early enough that there are still
> consumable packets remaining to not starve the consumer.
> 
> However, I can not think of a proper way to implement this right now.
> 
> Thanks!

What is the difficulty?

Your patches check __ptr_ring_consume_created_space(..., 1).

How about __ptr_ring_consume_created_space(..., 8) then? 16?

-- 
MST


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-14 18:18                                               ` Michael S. Tsirkin
@ 2026-02-14 19:51                                                 ` Simon Schippers
  2026-02-14 23:49                                                   ` Michael S. Tsirkin
  2026-02-15 10:38                                                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 69+ messages in thread
From: Simon Schippers @ 2026-02-14 19:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 2/14/26 19:18, Michael S. Tsirkin wrote:
> On Sat, Feb 14, 2026 at 06:13:14PM +0100, Simon Schippers wrote:
> 
> ...
> 
>> Patched: Waking on __ptr_ring_produce_created_space() is too early. The
>>          stop/wake cycle occurs too frequently which slows down
>>          performance as can be seen for TAP.
>>
>> Wake on empty variant: Waking on __ptr_ring_empty() is (slightly) too
>>                        late. The consumer starves because the producer
>>                        first has to produce packets again. This slows
>>                        down performance as well, as can be seen for TAP
>>                        and TAP+vhost-net (both down ~30-40 Kpps).
>>
>> I think something in between should be used.
>> The wake should be done as late as possible to have as few
>> NET_TX_SOFTIRQs as possible but early enough that there are still
>> consumable packets remaining to not starve the consumer.
>>
>> However, I can not think of a proper way to implement this right now.
>>
>> Thanks!
> 
> What is the difficulty?

There is no way to tell how many entries are currently in the ring.

> 
> Your patches check __ptr_ring_consume_created_space(..., 1).

Yes, and this returns whether zero space or a batch-sized amount of
space was created.
(In the current implementation it would be false or true, but as
discussed earlier this can be changed.)
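To illustrate the batch semantics described above, here is a toy model in
plain C (the struct and function names are illustrative, not the real
ptr_ring API): consumed slots only become visible to the producer once a
whole batch of them has been consumed, so a single consume "creates"
either zero slots of space or a full batch at once.

```c
#include <stddef.h>

/*
 * Toy model of ptr_ring's batched invalidation. In the real ptr_ring,
 * consumed slots are zeroed out in batches of ring->batch to reduce
 * cache-line bouncing between producer and consumer; here we just
 * count consumed-but-not-yet-freed slots.
 */
struct toy_ring {
	size_t batch;	/* analogue of ring->batch */
	size_t pending;	/* consumed but not yet freed slots */
};

/* Returns the producer-visible space created by this consume: 0 or batch. */
size_t toy_consume_created_space(struct toy_ring *r)
{
	r->pending++;
	if (r->pending >= r->batch) {
		r->pending = 0;
		return r->batch;	/* a whole batch frees up at once */
	}
	return 0;	/* no producer-visible space yet */
}
```

This is why intermediate consumes report no new space, and why checking
for "8 or 16 entries of created space" on a single-entry consume path
does not directly answer how full the ring currently is.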

> 
> How about __ptr_ring_consume_created_space(..., 8) then? 16?
> 

This would return how much space the last 8/16 consume operations
created. But in tap_ring_consume() we only consume a single entry.

Maybe we could avoid __ptr_ring_consume_created_space with this:
1. Wait for the queue to stop with netif_tx_queue_stopped()
2. Then count the number of consumes we did after the queue stopped
3. Wake the queue if count >= threshold with threshold >= ring->batch

I would say that such a threshold could be something like ring->size/2.
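The three steps above can be sketched in plain C as follows (a standalone
simulation under the stated assumptions; `wake_state`, `queue_stop()` and
`consume_and_maybe_wake()` are hypothetical names, not the actual tun/tap
or netdev helpers):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative state; the names are made up, not the netdev/tun API. */
struct wake_state {
	size_t ring_size;	/* total ring capacity */
	size_t consumed;	/* consumes since the queue was stopped */
	bool stopped;		/* analogue of netif_tx_queue_stopped() */
};

/* Producer side: called when the ring is full (step 1 arms the counter). */
void queue_stop(struct wake_state *s)
{
	s->stopped = true;
	s->consumed = 0;
}

/*
 * Consumer side: called after each single-entry consume. Counts consumes
 * while the queue is stopped (step 2) and returns true when the queue
 * should be woken, i.e. once at least threshold entries have been
 * drained (step 3, with threshold = ring->size / 2 as proposed above).
 */
bool consume_and_maybe_wake(struct wake_state *s)
{
	if (!s->stopped)
		return false;	/* step 1: only count after a stop */
	if (++s->consumed >= s->ring_size / 2) {
		s->stopped = false;
		return true;	/* step 3: threshold reached, wake queue */
	}
	return false;
}
```

With threshold = size/2, each wake leaves the producer half a ring of
headroom before it can fill the ring and stop the queue again.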


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-14 19:51                                                 ` Simon Schippers
@ 2026-02-14 23:49                                                   ` Michael S. Tsirkin
  2026-02-15 10:38                                                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-02-14 23:49 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Sat, Feb 14, 2026 at 08:51:53PM +0100, Simon Schippers wrote:
> On 2/14/26 19:18, Michael S. Tsirkin wrote:
> > On Sat, Feb 14, 2026 at 06:13:14PM +0100, Simon Schippers wrote:
> > 
> > ...
> > 
> >> Patched: Waking on __ptr_ring_produce_created_space() is too early. The
> >>          stop/wake cycle occurs too frequently which slows down
> >>          performance as can be seen for TAP.
> >>
> >> Wake on empty variant: Waking on __ptr_ring_empty() is (slightly) too
> >>                        late. The consumer starves because the producer
> >>                        first has to produce packets again. This slows
> >>                        down performance aswell as can be seen for TAP
> >>                        down performance as well, as can be seen for TAP
> >>                        and TAP+vhost-net (both down ~30-40 Kpps).
> >> I think something inbetween should be used.
> >> The wake should be done as late as possible to have as few
> >> NET_TX_SOFTIRQs as possible but early enough that there are still
> >> consumable packets remaining to not starve the consumer.
> >>
> >> However, I can not think of a proper way to implement this right now.
> >>
> >> Thanks!
> > 
> > What is the difficulty?
> 
> There is no way to tell how many entries are currently in the ring.
> 
> > 
> > Your patches check __ptr_ring_consume_created_space(..., 1).
> 
> Yes, and this returns whether zero space or a batch-sized amount of
> space was created.
> (In the current implementation it would be false or true, but as
> discussed earlier this can be changed.)
> 
> > 
> > How about __ptr_ring_consume_created_space(..., 8) then? 16?
> > 
> 
> This would return how much space the last 8/16 consume operations
> created. But in tap_ring_consume() we only consume a single entry.
> 
> Maybe we could avoid __ptr_ring_consume_created_space with this:
> 1. Wait for the queue to stop with netif_tx_queue_stopped()
> 2. Then count the number of consumes we did after the queue stopped
> 3. Wake the queue if count >= threshold with threshold >= ring->batch
> 
> I would say that such a threshold could be something like ring->size/2.

OK


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-14 19:51                                                 ` Simon Schippers
  2026-02-14 23:49                                                   ` Michael S. Tsirkin
@ 2026-02-15 10:38                                                   ` Michael S. Tsirkin
  2026-02-16 13:27                                                     ` Simon Schippers
  1 sibling, 1 reply; 69+ messages in thread
From: Michael S. Tsirkin @ 2026-02-15 10:38 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On Sat, Feb 14, 2026 at 08:51:53PM +0100, Simon Schippers wrote:
> On 2/14/26 19:18, Michael S. Tsirkin wrote:
> > On Sat, Feb 14, 2026 at 06:13:14PM +0100, Simon Schippers wrote:
> > 
> > ...
> > 
> >> Patched: Waking on __ptr_ring_produce_created_space() is too early. The
> >>          stop/wake cycle occurs too frequently which slows down
> >>          performance as can be seen for TAP.
> >>
> >> Wake on empty variant: Waking on __ptr_ring_empty() is (slightly) too
> >>                        late. The consumer starves because the producer
> >>                        first has to produce packets again. This slows
> >>                        down performance as well, as can be seen for TAP
> >>                        and TAP+vhost-net (both down ~30-40 Kpps).
> >>
> >> I think something in between should be used.
> >> The wake should be done as late as possible to have as few
> >> NET_TX_SOFTIRQs as possible but early enough that there are still
> >> consumable packets remaining to not starve the consumer.
> >>
> >> However, I can not think of a proper way to implement this right now.
> >>
> >> Thanks!
> > 
> > What is the difficulty?
> 
> There is no way to tell how many entries are currently in the ring.
> 
> > 
> > Your patches check __ptr_ring_consume_created_space(..., 1).
> 
> Yes, and this returns whether zero space or a batch-sized amount of
> space was created.
> (In the current implementation it would be false or true, but as
> discussed earlier this can be changed.)
> 
> > 
> > How about __ptr_ring_consume_created_space(..., 8) then? 16?
> > 
> 
> This would return how much space the last 8/16 consume operations
> created. But in tap_ring_consume() we only consume a single entry.
> 
> Maybe we could avoid __ptr_ring_consume_created_space with this:
> 1. Wait for the queue to stop with netif_tx_queue_stopped()
> 2. Then count the number of consumes we did after the queue stopped
> 3. Wake the queue if count >= threshold with threshold >= ring->batch
> 
> I would say that such a threshold could be something like ring->size/2.


To add to what I wrote, size/2 means:
leave half the ring for the consumer, half for the producer.

If one of the two is more bursty, we might want a different
balance. Offhand, the kernel is less bursty and userspace is
more bursty.

So it's an interesting question but size/2 is a good start.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup
  2026-02-15 10:38                                                   ` Michael S. Tsirkin
@ 2026-02-16 13:27                                                     ` Simon Schippers
  0 siblings, 0 replies; 69+ messages in thread
From: Simon Schippers @ 2026-02-16 13:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization

On 2/15/26 11:38, Michael S. Tsirkin wrote:
> On Sat, Feb 14, 2026 at 08:51:53PM +0100, Simon Schippers wrote:
>> On 2/14/26 19:18, Michael S. Tsirkin wrote:
>>> On Sat, Feb 14, 2026 at 06:13:14PM +0100, Simon Schippers wrote:
>>>
>>> ...
>>>
>>>> Patched: Waking on __ptr_ring_produce_created_space() is too early. The
>>>>          stop/wake cycle occurs too frequently which slows down
>>>>          performance as can be seen for TAP.
>>>>
>>>> Wake on empty variant: Waking on __ptr_ring_empty() is (slightly) too
>>>>                        late. The consumer starves because the producer
>>>>                        first has to produce packets again. This slows
>>>>                        down performance as well, as can be seen for TAP
>>>>                        and TAP+vhost-net (both down ~30-40 Kpps).
>>>>
>>>> I think something in between should be used.
>>>> The wake should be done as late as possible to have as few
>>>> NET_TX_SOFTIRQs as possible but early enough that there are still
>>>> consumable packets remaining to not starve the consumer.
>>>>
>>>> However, I can not think of a proper way to implement this right now.
>>>>
>>>> Thanks!
>>>
>>> What is the difficulty?
>>
>> There is no way to tell how many entries are currently in the ring.
>>
>>>
>>> Your patches check __ptr_ring_consume_created_space(..., 1).
>>
>> Yes, and this returns whether zero space or a batch-sized amount of
>> space was created.
>> (In the current implementation it would be false or true, but as
>> discussed earlier this can be changed.)
>>
>>>
>>> How about __ptr_ring_consume_created_space(..., 8) then? 16?
>>>
>>
>> This would return how much space the last 8/16 consume operations
>> created. But in tap_ring_consume() we only consume a single entry.
>>
>> Maybe we could avoid __ptr_ring_consume_created_space with this:
>> 1. Wait for the queue to stop with netif_tx_queue_stopped()
>> 2. Then count the number of consumes we did after the queue stopped
>> 3. Wake the queue if count >= threshold with threshold >= ring->batch
>>
>> I would say that such a threshold could be something like ring->size/2.
> 
> 
> To add to what I wrote, size/2 means:
> leave half the ring for the consumer, half for the producer.
> 
> If one of the two is more bursty, we might want a different
> balance. Offhand, the kernel is less bursty and userspace is
> more bursty.
> 
> So it's an interesting question but size/2 is a good start.
> 

I implemented this (I can post the implementation if you want)
and I got:
- 1216 Kpps for TAP --> worse performance than stock (1293 Kpps) and
  also worse than wake on empty (1248 Kpps)
- 1408 Kpps for TAP+vhost-net --> pretty much the same performance as
  stock (1411 Kpps)

I also tried 7/8 for the producer, 1/8 for the consumer; the results
did not really get better:
- 1227 Kpps for TAP --> worse performance than stock (1293 Kpps) and
  also worse than wake on empty (1248 Kpps); better than the 1/2
  split
- 1350 Kpps for TAP+vhost-net --> worse performance than everything

So my theory of using something in between did not hold up here.
Judging from my benchmarking, the best solution would be to use:
- Wake on empty for TAP --> 1248 Kpps (1293 Kpps stock, 3% worse)
- Wake on __ptr_ring_consume_created_space() for TAP+vhost-net
  --> 1410 Kpps (1411 Kpps stock, 0% worse)

This would also keep the implementation simple.


^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2026-02-16 13:28 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-07 21:04 [PATCH net-next v7 0/9] tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 1/9] ptr_ring: move free-space check into separate helper Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 2/9] ptr_ring: add helper to detect newly freed space on consume Simon Schippers
2026-01-08  3:23   ` Jason Wang
2026-01-08  7:20     ` Simon Schippers
2026-01-09  6:01       ` Jason Wang
2026-01-09  6:47         ` Michael S. Tsirkin
2026-01-09  7:22   ` Michael S. Tsirkin
2026-01-09  7:35     ` Simon Schippers
2026-01-09  8:31       ` Michael S. Tsirkin
2026-01-09  9:06         ` Simon Schippers
2026-01-12 16:29           ` Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 3/9] tun/tap: add ptr_ring consume helper with netdev queue wakeup Simon Schippers
2026-01-08  3:38   ` Jason Wang
2026-01-08  7:40     ` Simon Schippers
2026-01-09  6:02       ` Jason Wang
2026-01-09  9:31         ` Simon Schippers
2026-01-21  9:32         ` Simon Schippers
2026-01-22  5:35           ` Jason Wang
2026-01-23  3:05             ` Jason Wang
2026-01-23  9:54               ` Simon Schippers
2026-01-27 16:47                 ` Simon Schippers
2026-01-28  7:03                   ` Jason Wang
2026-01-28  7:53                     ` Simon Schippers
2026-01-29  1:14                       ` Jason Wang
2026-01-29  9:24                         ` Simon Schippers
2026-01-30  1:51                           ` Jason Wang
2026-02-01 20:19                             ` Simon Schippers
2026-02-03  3:48                               ` Jason Wang
2026-02-04 15:43                                 ` Simon Schippers
2026-02-05  3:59                                   ` Jason Wang
2026-02-05 22:28                                     ` Simon Schippers
2026-02-06  3:21                                       ` Jason Wang
2026-02-08 18:18                                         ` Simon Schippers
2026-02-12  0:12                                           ` Simon Schippers
2026-02-12  7:06                                             ` Michael S. Tsirkin
2026-02-12  8:03                                               ` Simon Schippers
2026-02-12  8:14                                           ` Jason Wang
2026-02-14 17:13                                             ` Simon Schippers
2026-02-14 18:18                                               ` Michael S. Tsirkin
2026-02-14 19:51                                                 ` Simon Schippers
2026-02-14 23:49                                                   ` Michael S. Tsirkin
2026-02-15 10:38                                                   ` Michael S. Tsirkin
2026-02-16 13:27                                                     ` Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 4/9] tun/tap: add batched ptr_ring consume functions " Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 5/9] tun/tap: add unconsume function for returning entries to ptr_ring Simon Schippers
2026-01-08  3:40   ` Jason Wang
2026-01-07 21:04 ` [PATCH net-next v7 6/9] tun/tap: add helper functions to check file type Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 7/9] vhost-net: vhost-net: replace rx_ring with tun/tap ring wrappers Simon Schippers
2026-01-08  4:38   ` Jason Wang
2026-01-08  7:47     ` Simon Schippers
2026-01-09  6:04       ` Jason Wang
2026-01-09  9:57         ` Simon Schippers
2026-01-12  2:54           ` Jason Wang
2026-01-12  4:42             ` Michael S. Tsirkin
2026-01-07 21:04 ` [PATCH net-next v7 8/9] tun/tap: drop get ring exports Simon Schippers
2026-01-07 21:04 ` [PATCH net-next v7 9/9] tun/tap & vhost-net: avoid ptr_ring tail-drop when qdisc is present Simon Schippers
2026-01-08  4:37   ` Jason Wang
2026-01-08  8:01     ` Simon Schippers
2026-01-09  6:09       ` Jason Wang
2026-01-09 10:14         ` Simon Schippers
2026-01-12  2:22           ` Jason Wang
2026-01-12 11:08             ` Simon Schippers
2026-01-12 11:18               ` Michael S. Tsirkin
2026-01-13  6:26               ` Jason Wang
2026-01-12  4:33           ` Michael S. Tsirkin
2026-01-12 11:17             ` Simon Schippers
2026-01-12 11:19               ` Michael S. Tsirkin
2026-01-12 11:28                 ` Simon Schippers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox