* [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
@ 2025-11-20 15:29 Simon Schippers
  2025-11-20 15:29 ` [PATCH net-next v6 1/8] ptr_ring: add __ptr_ring_full_next() to predict imminent fullness Simon Schippers
                   ` (10 more replies)
  0 siblings, 11 replies; 30+ messages in thread
From: Simon Schippers @ 2025-11-20 15:29 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, jon, tim.gebauer, simon.schippers,
	netdev, linux-kernel, kvm, virtualization

This patch series deals with tun/tap and vhost-net, which drop incoming
SKBs whenever their internal ptr_ring buffer is full. With this patch
series, the associated netdev queue is instead stopped before that
happens. This allows the connected qdisc to function correctly, as
reported in [1], and improves application-layer performance; see our
paper [2]. Meanwhile, raw pktgen throughput differs only slightly:

+--------------------------------+-----------+----------+
| pktgen benchmarks to Debian VM | Stock     | Patched  |
| i5 6300HQ, 20M packets         |           |          |
+-----------------+--------------+-----------+----------+
| TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
|                 +--------------+-----------+----------+
|                 | Lost         | 1615 Kpps | 0 pps    |
+-----------------+--------------+-----------+----------+
| TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
|                 +--------------+-----------+----------+
|                 | Lost         | 1164 Kpps | 0 pps    |
+-----------------+--------------+-----------+----------+
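
Condensed, the mechanism looks as follows (taken from patches 3 and 7;
locking, RCU and the memory barriers are omitted here, see the patches
for the complete logic):

	/* producer side: tun_net_xmit() -> tun_ring_produce() */
	if (!__ptr_ring_full(ring) && __ptr_ring_full_next(ring))
		netif_tx_stop_queue(queue);	/* stop before using the last free slot */
	ret = __ptr_ring_produce(ring, skb);	/* no tail drop while stopped */

	/* consumer side: tun_ring_recv() -> __tun_ring_consume() */
	ptr = __ptr_ring_peek(ring);
	stopped = netif_tx_queue_stopped(txq);
	__ptr_ring_discard_one(ring);
	if (stopped && (__ptr_ring_consume_created_space(ring) ||
			__ptr_ring_empty(ring)))
		netif_tx_wake_queue(txq);	/* producer may transmit again */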

This patch series covers tun/tap as well as vhost-net because they share 
logic; adjusting only one of them would break the others. Therefore, the 
patch series is structured as follows:
1+2: new ptr_ring helpers for 3
3: tun/tap: add synchronized ring produce/consume with queue 
management
4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by 
vhost-net
7: tun/tap & vhost-net: only now start using the previously implemented 
functions so git bisect is not broken
8: tun/tap: drop get ring exports (not used anymore)

Possible future work:
- Introduction of Byte Queue Limits as suggested by Stephen Hemminger
- Adaptation of the netdev queue flow control for ipvtap & macvtap

[1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
[2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf

Changelog:
V6:
General:
- Major adjustments to the descriptions. Special thanks to Jon Kohler!
- Fix git bisect by moving most logic into dedicated functions and only 
starting to use them in patch 7.
- Moved the main logic of the coupled producer and consumer into a single 
patch to avoid a chicken-and-egg dependency between commits :-)
- Rebased to 6.18-rc5 and ran the benchmarks again; they now also include 
lost packets (previously I missed a 0, so all benchmark results were 
higher by a factor of 10...).
- Also include the benchmark in patch 7.

Producer:
- Move logic into the new helper tun_ring_produce()
- Added an smp_rmb() paired with the consumer, ensuring that space freed 
by the consumer is visible
- Assume that ptr_ring is not full when __ptr_ring_full_next() is called

Consumer:
- Use an unpaired smp_rmb() instead of barrier() to ensure that the 
netif_tx_queue_stopped() call completes before discarding
- Also wake the netdev queue if it was stopped before discarding and then 
becomes empty
-> Fixes race with producer as identified by MST in V5
-> Waking the netdev queues upon resize is not required anymore
- Use __ptr_ring_consume_created_space() instead of messing with ptr_ring 
internals
-> Batched consume now just calls 
__tun_ring_consume()/__tap_ring_consume() in a loop
- Added an smp_wmb() before waking the netdev queue which is paired with 
the smp_rmb() discussed above

V5: https://lore.kernel.org/netdev/20250922221553.47802-1-simon.schippers@tu-dortmund.de/T/#u
- Stop the netdev queue prior to producing the final fitting ptr_ring entry
-> Ensures the consumer has the latest netdev queue state, making it safe 
to wake the queue
-> Resolves an issue in vhost-net where the netdev queue could remain 
stopped despite being empty
-> For TUN/TAP, the netdev queue no longer needs to be woken in the 
blocking loop
-> Introduces new helpers __ptr_ring_full_next and 
__ptr_ring_will_invalidate for this purpose
- vhost-net now uses wrappers of TUN/TAP for ptr_ring consumption rather 
than maintaining its own rx_ring pointer

V4: https://lore.kernel.org/netdev/20250902080957.47265-1-simon.schippers@tu-dortmund.de/T/#u
- Target net-next instead of net
- Changed to patch series instead of single patch
- Changed to a new title; the old title was 
"TUN/TAP: Improving throughput and latency by avoiding SKB drops"
- Wake netdev queue with new helpers wake_netdev_queue when there is any 
spare capacity in the ptr_ring instead of waiting for it to be empty
- Use tun_file instead of tun_struct in tun_ring_recv for more consistent 
logic
- Use an smp_wmb()/smp_rmb() barrier pair, which avoids the rare packet 
drops that happened before
- Use safer logic for vhost-net using RCU read locks to access TUN/TAP data

V3: https://lore.kernel.org/netdev/20250825211832.84901-1-simon.schippers@tu-dortmund.de/T/#u
- Added support for TAP and TAP+vhost-net.

V2: https://lore.kernel.org/netdev/20250811220430.14063-1-simon.schippers@tu-dortmund.de/T/#u
- Removed NETDEV_TX_BUSY return case in tun_net_xmit and removed 
unnecessary netif_tx_wake_queue in tun_ring_recv.

V1: https://lore.kernel.org/netdev/20250808153721.261334-1-simon.schippers@tu-dortmund.de/T/#u

Thanks,
Simon :)

Simon Schippers (8):
  ptr_ring: add __ptr_ring_full_next() to predict imminent fullness
  ptr_ring: add helper to check if consume created space
  tun/tap: add synchronized ring produce/consume with queue management
  tun/tap: add batched ring consume function
  tun/tap: add unconsume function for returning entries to ring
  tun/tap: add helper functions to check file type
  tun/tap & vhost-net: use {tun|tap}_ring_{consume|produce} to avoid
    tail drops
  tun/tap: drop get ring exports

 drivers/net/tap.c        | 106 +++++++++++++++++++++++++--
 drivers/net/tun.c        | 154 +++++++++++++++++++++++++++++++++++----
 drivers/vhost/net.c      |  92 +++++++++++++++--------
 include/linux/if_tap.h   |  16 +++-
 include/linux/if_tun.h   |  18 ++++-
 include/linux/ptr_ring.h |  42 +++++++++++
 6 files changed, 372 insertions(+), 56 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 1/8] ptr_ring: add __ptr_ring_full_next() to predict imminent fullness
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
@ 2025-11-20 15:29 ` Simon Schippers
  2025-11-25 14:54   ` Michael S. Tsirkin
  2025-11-20 15:29 ` [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space Simon Schippers
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2025-11-20 15:29 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, jon, tim.gebauer, simon.schippers,
	netdev, linux-kernel, kvm, virtualization

Introduce the __ptr_ring_full_next() helper, which lets callers check
if the ptr_ring will become full after the next insertion. This is useful
for proactively managing capacity before the ring is actually full.
Callers must ensure the ring is not already full before using this
helper: __ptr_ring_discard_one() may zero entries in reverse order, so
the slot after the current producer position may be cleared before the
current one. Callers must account for this when using this check.

Note: This function is especially relevant when paired with the memory
ordering guarantees of __ptr_ring_produce() (smp_wmb()), allowing for
safe producer/consumer coordination.
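
To illustrate the caller obligation above, a minimal producer-side
sketch (producer lock held; stopping a netdev queue is just one possible
reaction, see the tun/tap patches later in this series):

	if (!__ptr_ring_full(ring) && __ptr_ring_full_next(ring))
		netif_tx_stop_queue(txq);	/* the next insertion fills the ring */
	ret = __ptr_ring_produce(ring, skb);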

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 include/linux/ptr_ring.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 534531807d95..da141cc8b075 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -96,6 +96,31 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
 	return ret;
 }
 
+/*
+ * Checks if the ptr_ring will become full after the next insertion.
+ *
+ * Note: Callers must ensure that the ptr_ring is not full before calling
+ * this function, as __ptr_ring_discard_one invalidates entries in
+ * reverse order. Because the next entry (rather than the current one)
+ * may be zeroed after an insertion, failing to account for this can
+ * cause false negatives when checking whether the ring will become full
+ * on the next insertion.
+ */
+static inline bool __ptr_ring_full_next(struct ptr_ring *r)
+{
+	int p;
+
+	if (unlikely(r->size <= 1))
+		return true;
+
+	p = r->producer + 1;
+
+	if (unlikely(p >= r->size))
+		p = 0;
+
+	return r->queue[p];
+}
+
 /* Note: callers invoking this in a loop must use a compiler barrier,
  * for example cpu_relax(). Callers must hold producer_lock.
  * Callers are responsible for making sure pointer that is being queued
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
  2025-11-20 15:29 ` [PATCH net-next v6 1/8] ptr_ring: add __ptr_ring_full_next() to predict imminent fullness Simon Schippers
@ 2025-11-20 15:29 ` Simon Schippers
  2025-11-25 15:01   ` Michael S. Tsirkin
  2025-11-30 18:16   ` Willem de Bruijn
  2025-11-20 15:29 ` [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management Simon Schippers
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 30+ messages in thread
From: Simon Schippers @ 2025-11-20 15:29 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, jon, tim.gebauer, simon.schippers,
	netdev, linux-kernel, kvm, virtualization

Add __ptr_ring_consume_created_space() to check whether the previous
__ptr_ring_consume() call successfully consumed an element and created
space in the ring buffer. This enables callers to conditionally notify
producers when space becomes available.

The function is only valid immediately after a single consume operation
and should not be used after calling __ptr_ring_consume_batched().
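
A minimal sketch of the intended use (consumer lock held; waking a
netdev queue is just one possible producer notification, see the tun/tap
patches later in this series):

	ptr = __ptr_ring_consume(ring);
	if (ptr && __ptr_ring_consume_created_space(ring))
		netif_tx_wake_queue(txq);	/* a slot was actually freed */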

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 include/linux/ptr_ring.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index da141cc8b075..76d6840b45a3 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -453,6 +453,23 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
 	return ret;
 }
 
+/*
+ * Check if the previous consume operation created space
+ *
+ * Returns true if the last call to __ptr_ring_consume() has created
+ * space in the ring buffer (i.e., an element was consumed).
+ *
+ * Note: This function is only valid immediately after a single call to
+ * __ptr_ring_consume(). If multiple calls to ptr_ring_consume*() have
+ * been made, this check must be performed after each call individually.
+ * Likewise, do not use this function after calling
+ * __ptr_ring_consume_batched().
+ */
+static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r)
+{
+	return r->consumer_tail >= r->consumer_head;
+}
+
 /* Cast to structure type and call a function without discarding from FIFO.
  * Function must return a value.
  * Callers must take consumer_lock.
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
  2025-11-20 15:29 ` [PATCH net-next v6 1/8] ptr_ring: add __ptr_ring_full_next() to predict imminent fullness Simon Schippers
  2025-11-20 15:29 ` [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space Simon Schippers
@ 2025-11-20 15:29 ` Simon Schippers
  2025-11-25 16:54   ` Michael S. Tsirkin
  2025-11-20 15:29 ` [PATCH net-next v6 4/8] tun/tap: add batched ring consume function Simon Schippers
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2025-11-20 15:29 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, jon, tim.gebauer, simon.schippers,
	netdev, linux-kernel, kvm, virtualization

Implement new ring buffer produce and consume functions for tun and tap
drivers that provide lockless producer-consumer synchronization and
netdev queue management to prevent ptr_ring tail drop and permanent
starvation.

- tun_ring_produce(): Produces packets to the ptr_ring with proper memory
  barriers and proactively stops the netdev queue when the ring is about
  to become full.

- __tun_ring_consume() / __tap_ring_consume(): Internal consume functions
  that check if the netdev queue was stopped due to a full ring, and wake
  it when space becomes available. Uses memory barriers to ensure proper
  ordering between producer and consumer.

- tun_ring_consume() / tap_ring_consume(): Wrapper functions that acquire
  the consumer lock before calling the internal consume functions.

Key features:
- Proactive queue stopping using __ptr_ring_full_next() to stop the queue
  before it becomes completely full.
- Not stopping the queue when the ptr_ring is full already, because if
  the consumer empties all entries in the meantime, stopping the queue
  would cause permanent starvation.
- Conditional queue waking using __ptr_ring_consume_created_space() to
  wake the queue only when space is actually created in the ring.
- Prevents permanent starvation by ensuring the queue is also woken when
  the ring becomes empty, which can happen when racing the producer.

NB: __always_unused is applied to the functions that are not used yet and 
is removed later in the series so bisectability is not broken.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c |  63 +++++++++++++++++++++++++++++
 drivers/net/tun.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 164 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 1197f245e873..c370a02789eb 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -753,6 +753,69 @@ static ssize_t tap_put_user(struct tap_queue *q,
 	return ret ? ret : total;
 }
 
+/*
+ * Consume a packet from the transmit ring. Callers must hold
+ * the consumer_lock of the ptr_ring. If the ring was full and the
+ * queue was stopped, this may wake up the queue if space is created.
+ */
+static void *__tap_ring_consume(struct tap_queue *q)
+{
+	struct ptr_ring *ring = &q->ring;
+	struct netdev_queue *txq;
+	struct net_device *dev;
+	bool stopped;
+	void *ptr;
+
+	ptr = __ptr_ring_peek(ring);
+	if (!ptr)
+		return ptr;
+
+	/* Paired with smp_wmb() in the ring producer path. Ensures we
+	 * see any updated netdev queue state caused by a full ring.
+	 * Needed for proper synchronization between the ring and the
+	 * netdev queue.
+	 */
+	smp_rmb();
+	rcu_read_lock();
+	dev = rcu_dereference(q->tap)->dev;
+	txq = netdev_get_tx_queue(dev, q->queue_index);
+	stopped = netif_tx_queue_stopped(txq);
+
+	/* Ensures the read for a stopped queue completes before the
+	 * discard, so that we don't miss the window to wake the queue if
+	 * needed.
+	 */
+	smp_rmb();
+	__ptr_ring_discard_one(ring);
+
+	/* If the queue was stopped (meaning the producer couldn't have
+	 * inserted new entries just now), and we have actually created
+	 * space in the ring, or the ring is now empty (due to a race
+	 * with the producer), then it is now safe to wake the queue.
+	 */
+	if (unlikely(stopped &&
+		     (__ptr_ring_consume_created_space(ring) ||
+		      __ptr_ring_empty(ring)))) {
+		/* Paired with smp_rmb() in tun_ring_produce. */
+		smp_wmb();
+		netif_tx_wake_queue(txq);
+	}
+	rcu_read_unlock();
+
+	return ptr;
+}
+
+static __always_unused void *tap_ring_consume(struct tap_queue *q)
+{
+	void *ptr;
+
+	spin_lock(&q->ring.consumer_lock);
+	ptr = __tap_ring_consume(q);
+	spin_unlock(&q->ring.consumer_lock);
+
+	return ptr;
+}
+
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
 			   int noblock, struct sk_buff *skb)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8192740357a0..3b9d8d406ff5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -999,6 +999,107 @@ static unsigned int run_ebpf_filter(struct tun_struct *tun,
 	return len;
 }
 
+/* Produce a packet into the transmit ring. If the ring becomes full, the
+ * netdev queue is stopped until the consumer wakes it again.
+ */
+static __always_unused int tun_ring_produce(struct ptr_ring *ring,
+					    struct netdev_queue *queue,
+					    struct sk_buff *skb)
+{
+	int ret;
+
+	spin_lock(&ring->producer_lock);
+
+	/* Pairs with smp_wmb() in __tun_ring_consume/__tap_ring_consume.
+	 * Ensures that freed space by the consumer is visible.
+	 */
+	smp_rmb();
+
+	/* Do not stop the netdev queue if the ptr_ring is full already.
+	 * The consumer could empty out the ptr_ring in the meantime
+	 * without noticing the stopped netdev queue, resulting in a
+	 * stopped netdev queue and an empty ptr_ring. In this case the
+	 * netdev queue would stay stopped forever.
+	 */
+	if (unlikely(!__ptr_ring_full(ring) &&
+		     __ptr_ring_full_next(ring)))
+		netif_tx_stop_queue(queue);
+
+	/* Note: __ptr_ring_produce has an internal smp_wmb() to synchronize the
+	 * state with the consumer. This ensures that after adding an entry to
+	 * the ring, any stopped queue state is visible to the consumer after
+	 * dequeueing.
+	 */
+	ret = __ptr_ring_produce(ring, skb);
+
+	spin_unlock(&ring->producer_lock);
+
+	return ret;
+}
+
+/*
+ * Consume a packet from the transmit ring. Callers must hold
+ * the consumer_lock of the ptr_ring. If the ring was full and the
+ * queue was stopped, this may wake up the queue if space is created.
+ */
+static void *__tun_ring_consume(struct tun_file *tfile)
+{
+	struct ptr_ring *ring = &tfile->tx_ring;
+	struct netdev_queue *txq;
+	struct net_device *dev;
+	bool stopped;
+	void *ptr;
+
+	ptr = __ptr_ring_peek(ring);
+	if (!ptr)
+		return ptr;
+
+	/* Paired with smp_wmb() in the ring producer path. Ensures we
+	 * see any updated netdev queue state caused by a full ring.
+	 * Needed for proper synchronization between the ring and the
+	 * netdev queue.
+	 */
+	smp_rmb();
+	rcu_read_lock();
+	dev = rcu_dereference(tfile->tun)->dev;
+	txq = netdev_get_tx_queue(dev, tfile->queue_index);
+	stopped = netif_tx_queue_stopped(txq);
+
+	/* Ensures the read for a stopped queue completes before the
+	 * discard, so that we don't miss the window to wake the queue if
+	 * needed.
+	 */
+	smp_rmb();
+	__ptr_ring_discard_one(ring);
+
+	/* If the queue was stopped (meaning the producer couldn't have
+	 * inserted new entries just now), and we have actually created
+	 * space in the ring, or the ring is now empty (due to a race
+	 * with the producer), then it is now safe to wake the queue.
+	 */
+	if (unlikely(stopped &&
+		     (__ptr_ring_consume_created_space(ring) ||
+		      __ptr_ring_empty(ring)))) {
+		/* Paired with smp_rmb() in tun_ring_produce. */
+		smp_wmb();
+		netif_tx_wake_queue(txq);
+	}
+	rcu_read_unlock();
+
+	return ptr;
+}
+
+static void __always_unused *tun_ring_consume(struct tun_file *tfile)
+{
+	void *ptr;
+
+	spin_lock(&tfile->tx_ring.consumer_lock);
+	ptr = __tun_ring_consume(tfile);
+	spin_unlock(&tfile->tx_ring.consumer_lock);
+
+	return ptr;
+}
+
 /* Net device start xmit */
 static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 4/8] tun/tap: add batched ring consume function
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
                   ` (2 preceding siblings ...)
  2025-11-20 15:29 ` [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management Simon Schippers
@ 2025-11-20 15:29 ` Simon Schippers
  2025-11-20 15:29 ` [PATCH net-next v6 5/8] tun/tap: add unconsume function for returning entries to ring Simon Schippers
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Simon Schippers @ 2025-11-20 15:29 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, jon, tim.gebauer, simon.schippers,
	netdev, linux-kernel, kvm, virtualization

Add tun_ring_consume_batched() and tap_ring_consume_batched() to allow
consuming multiple items from the respective ring buffer in a single lock
acquisition. Heavily inspired by ptr_ring_consume_batched() and will be
used for bulk dequeue operations (e.g. vhost-net).
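
A minimal sketch of a bulk dequeue as vhost-net will perform it later in
this series (handle_entry() is a placeholder for the caller's per-entry
processing):

	void *batch[VHOST_NET_BATCH];
	int i, n;

	n = tun_ring_consume_batched(file, batch, VHOST_NET_BATCH);
	for (i = 0; i < n; i++)
		handle_entry(batch[i]);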

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c      | 21 +++++++++++++++++++++
 drivers/net/tun.c      | 21 +++++++++++++++++++++
 include/linux/if_tap.h |  6 ++++++
 include/linux/if_tun.h |  7 +++++++
 4 files changed, 55 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index c370a02789eb..01717c8fd284 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -816,6 +816,27 @@ static __always_unused void *tap_ring_consume(struct tap_queue *q)
 	return ptr;
 }
 
+int tap_ring_consume_batched(struct file *file, void **array, int n)
+{
+	struct tap_queue *q = file->private_data;
+	void *ptr;
+	int i;
+
+	spin_lock(&q->ring.consumer_lock);
+
+	for (i = 0; i < n; i++) {
+		ptr = __tap_ring_consume(q);
+		if (!ptr)
+			break;
+		array[i] = ptr;
+	}
+
+	spin_unlock(&q->ring.consumer_lock);
+
+	return i;
+}
+EXPORT_SYMBOL_GPL(tap_ring_consume_batched);
+
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
 			   int noblock, struct sk_buff *skb)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3b9d8d406ff5..42df185341ad 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3816,6 +3816,27 @@ struct socket *tun_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
+int tun_ring_consume_batched(struct file *file, void **array, int n)
+{
+	struct tun_file *tfile = file->private_data;
+	void *ptr;
+	int i;
+
+	spin_lock(&tfile->tx_ring.consumer_lock);
+
+	for (i = 0; i < n; i++) {
+		ptr = __tun_ring_consume(tfile);
+		if (!ptr)
+			break;
+		array[i] = ptr;
+	}
+
+	spin_unlock(&tfile->tx_ring.consumer_lock);
+
+	return i;
+}
+EXPORT_SYMBOL_GPL(tun_ring_consume_batched);
+
 struct ptr_ring *tun_get_tx_ring(struct file *file)
 {
 	struct tun_file *tfile;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 553552fa635c..cf8b90320b8d 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -11,6 +11,7 @@ struct socket;
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
 struct ptr_ring *tap_get_ptr_ring(struct file *file);
+int tap_ring_consume_batched(struct file *file, void **array, int n);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -22,6 +23,11 @@ static inline struct ptr_ring *tap_get_ptr_ring(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
+static inline int tap_ring_consume_batched(struct file *f,
+					   void **array, int n)
+{
+	return 0;
+}
 #endif /* CONFIG_TAP */
 
 /*
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 80166eb62f41..444dda75a372 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -22,6 +22,7 @@ struct tun_msg_ctl {
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
+int tun_ring_consume_batched(struct file *file, void **array, int n);
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -55,6 +56,12 @@ static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
 	return ERR_PTR(-EINVAL);
 }
 
+static inline int tun_ring_consume_batched(struct file *file,
+					   void **array, int n)
+{
+	return 0;
+}
+
 static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 5/8] tun/tap: add unconsume function for returning entries to ring
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
                   ` (3 preceding siblings ...)
  2025-11-20 15:29 ` [PATCH net-next v6 4/8] tun/tap: add batched ring consume function Simon Schippers
@ 2025-11-20 15:29 ` Simon Schippers
  2025-11-20 15:29 ` [PATCH net-next v6 6/8] tun/tap: add helper functions to check file type Simon Schippers
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Simon Schippers @ 2025-11-20 15:29 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, jon, tim.gebauer, simon.schippers,
	netdev, linux-kernel, kvm, virtualization

Add tun_ring_unconsume() and tap_ring_unconsume() wrappers to allow
external modules (e.g. vhost-net) to return previously consumed entries
back to the ring. This complements tun_ring_consume_batched() and
tap_ring_consume_batched() and enables proper error handling when
consumed packets need to be rolled back.

The functions delegate to ptr_ring_unconsume() and take a destroy
callback for entries that cannot be returned to the ring.
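
A minimal sketch of the intended rollback (mirroring what
vhost_net_buf_unproduce() does later in this series; the error condition
is a placeholder):

	n = tun_ring_consume_batched(file, batch, VHOST_NET_BATCH);
	if (processing_failed)	/* placeholder error condition */
		tun_ring_unconsume(file, batch, n, tun_ptr_free);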

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c      | 10 ++++++++++
 drivers/net/tun.c      | 10 ++++++++++
 include/linux/if_tap.h |  4 ++++
 include/linux/if_tun.h |  5 +++++
 4 files changed, 29 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 01717c8fd284..0069e2f177f4 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -837,6 +837,16 @@ int tap_ring_consume_batched(struct file *file, void **array, int n)
 }
 EXPORT_SYMBOL_GPL(tap_ring_consume_batched);
 
+void tap_ring_unconsume(struct file *file, void **batch, int n,
+			void (*destroy)(void *))
+{
+	struct tap_queue *q = file->private_data;
+	struct ptr_ring *ring = &q->ring;
+
+	ptr_ring_unconsume(ring, batch, n, destroy);
+}
+EXPORT_SYMBOL_GPL(tap_ring_unconsume);
+
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
 			   int noblock, struct sk_buff *skb)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 42df185341ad..bf109440d2c7 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3837,6 +3837,16 @@ int tun_ring_consume_batched(struct file *file, void **array, int n)
 }
 EXPORT_SYMBOL_GPL(tun_ring_consume_batched);
 
+void tun_ring_unconsume(struct file *file, void **batch, int n,
+			void (*destroy)(void *))
+{
+	struct tun_file *tfile = file->private_data;
+	struct ptr_ring *ring = &tfile->tx_ring;
+
+	ptr_ring_unconsume(ring, batch, n, destroy);
+}
+EXPORT_SYMBOL_GPL(tun_ring_unconsume);
+
 struct ptr_ring *tun_get_tx_ring(struct file *file)
 {
 	struct tun_file *tfile;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index cf8b90320b8d..28326a69745a 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -12,6 +12,8 @@ struct socket;
 struct socket *tap_get_socket(struct file *);
 struct ptr_ring *tap_get_ptr_ring(struct file *file);
 int tap_ring_consume_batched(struct file *file, void **array, int n);
+void tap_ring_unconsume(struct file *file, void **batch, int n,
+			void (*destroy)(void *));
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -28,6 +30,8 @@ static inline int tap_ring_consume_batched(struct file *f,
 {
 	return 0;
 }
+static inline void tap_ring_unconsume(struct file *file, void **batch,
+				      int n, void (*destroy)(void *)) {}
 #endif /* CONFIG_TAP */
 
 /*
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 444dda75a372..1274c6b34eb6 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -23,6 +23,8 @@ struct tun_msg_ctl {
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
 int tun_ring_consume_batched(struct file *file, void **array, int n);
+void tun_ring_unconsume(struct file *file, void **batch, int n,
+			void (*destroy)(void *));
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -62,6 +64,9 @@ static inline int tun_ring_consume_batched(struct file *file,
 	return 0;
 }
 
+static inline void tun_ring_unconsume(struct file *file, void **batch,
+				      int n, void (*destroy)(void *)) {}
+
 static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 6/8] tun/tap: add helper functions to check file type
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
                   ` (4 preceding siblings ...)
  2025-11-20 15:29 ` [PATCH net-next v6 5/8] tun/tap: add unconsume function for returning entries to ring Simon Schippers
@ 2025-11-20 15:29 ` Simon Schippers
  2025-11-20 15:29 ` [PATCH net-next v6 7/8] tun/tap & vhost-net: use {tun|tap}_ring_{consume|produce} to avoid tail drops Simon Schippers
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Simon Schippers @ 2025-11-20 15:29 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, jon, tim.gebauer, simon.schippers,
	netdev, linux-kernel, kvm, virtualization

Add tun_is_tun_file() and tap_is_tap_file() helper functions to check if
a file is a TUN or TAP file, which will be utilized by vhost-net.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c      | 13 +++++++++++++
 drivers/net/tun.c      | 13 +++++++++++++
 include/linux/if_tap.h |  5 +++++
 include/linux/if_tun.h |  6 ++++++
 4 files changed, 37 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 0069e2f177f4..56b8fe376e4a 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1283,6 +1283,19 @@ struct ptr_ring *tap_get_ptr_ring(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_ptr_ring);
 
+bool tap_is_tap_file(struct file *file)
+{
+	struct tap_queue *q;
+
+	if (file->f_op != &tap_fops)
+		return false;
+	q = file->private_data;
+	if (!q)
+		return false;
+	return true;
+}
+EXPORT_SYMBOL_GPL(tap_is_tap_file);
+
 int tap_queue_resize(struct tap_dev *tap)
 {
 	struct net_device *dev = tap->dev;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bf109440d2c7..dc2d267d30d7 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3860,6 +3860,19 @@ struct ptr_ring *tun_get_tx_ring(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_tx_ring);
 
+bool tun_is_tun_file(struct file *file)
+{
+	struct tun_file *tfile;
+
+	if (file->f_op != &tun_fops)
+		return false;
+	tfile = file->private_data;
+	if (!tfile)
+		return false;
+	return true;
+}
+EXPORT_SYMBOL_GPL(tun_is_tun_file);
+
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 28326a69745a..14194342b784 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -14,6 +14,7 @@ struct ptr_ring *tap_get_ptr_ring(struct file *file);
 int tap_ring_consume_batched(struct file *file, void **array, int n);
 void tap_ring_unconsume(struct file *file, void **batch, int n,
 			void (*destroy)(void *));
+bool tap_is_tap_file(struct file *file);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -32,6 +33,10 @@ static inline int tap_ring_consume_batched(struct file *f,
 }
 static inline void tap_ring_unconsume(struct file *file, void **batch,
 				      int n, void (*destroy)(void *)) {}
+static inline bool tap_is_tap_file(struct file *f)
+{
+	return false;
+}
 #endif /* CONFIG_TAP */
 
 /*
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 1274c6b34eb6..0910c6dbac20 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -25,6 +25,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file);
 int tun_ring_consume_batched(struct file *file, void **array, int n);
 void tun_ring_unconsume(struct file *file, void **batch, int n,
 			void (*destroy)(void *));
+bool tun_is_tun_file(struct file *file);
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -67,6 +68,11 @@ static inline int tun_ring_consume_batched(struct file *file,
 static inline void tun_ring_unconsume(struct file *file, void **batch,
 				      int n, void (*destroy)(void *)) {}
 
+static inline bool tun_is_tun_file(struct file *f)
+{
+	return false;
+}
+
 static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 7/8] tun/tap & vhost-net: use {tun|tap}_ring_{consume|produce} to avoid tail drops
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
                   ` (5 preceding siblings ...)
  2025-11-20 15:29 ` [PATCH net-next v6 6/8] tun/tap: add helper functions to check file type Simon Schippers
@ 2025-11-20 15:29 ` Simon Schippers
  2025-11-20 15:29 ` [PATCH net-next v6 7/8] tun/tap/vhost: " Simon Schippers
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Simon Schippers @ 2025-11-20 15:29 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, jon, tim.gebauer, simon.schippers,
	netdev, linux-kernel, kvm, virtualization

Switch to {tun|tap}_ring_{consume|produce} in both tun/tap as well as
vhost_net to avoid ptr_ring tail drops.

For tun, disable dev->lltx to ensure that tun_net_xmit is not called even
though the netdev queue is stopped. Consequently, the update of
trans_start in tun_net_xmit is also removed.

Instead of the rx_ring pointer, the vhost-net virtqueue now stores the 
interface type (IF_TUN, IF_TAP, or IF_NONE) and uses it to call the 
tun/tap wrappers.

+--------------------------------+-----------+----------+
| pktgen benchmarks to Debian VM | Stock     | Patched  |
| i5 6300HQ, 20M packets         |           |          |
+-----------------+--------------+-----------+----------+
| TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
|                 +--------------+-----------+----------+
|                 | Lost         | 1615 Kpps | 0 pps    |
+-----------------+--------------+-----------+----------+
| TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
|                 +--------------+-----------+----------+
|                 | Lost         | 1164 Kpps | 0 pps    |
+-----------------+--------------+-----------+----------+

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c   |  4 +-
 drivers/net/tun.c   | 20 ++++------
 drivers/vhost/net.c | 92 ++++++++++++++++++++++++++++++---------------
 3 files changed, 71 insertions(+), 45 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 56b8fe376e4a..2847db4e3cc7 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -805,7 +805,7 @@ static void *__tap_ring_consume(struct tap_queue *q)
 	return ptr;
 }
 
-static __always_unused void *tap_ring_consume(struct tap_queue *q)
+static void *tap_ring_consume(struct tap_queue *q)
 {
 	void *ptr;
 
@@ -868,7 +868,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
 					TASK_INTERRUPTIBLE);
 
 		/* Read frames from the queue */
-		skb = ptr_ring_consume(&q->ring);
+		skb = tap_ring_consume(q);
 		if (skb)
 			break;
 		if (noblock) {
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index dc2d267d30d7..9da6e794a80f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -931,7 +931,6 @@ static int tun_net_init(struct net_device *dev)
 	dev->vlan_features = dev->features &
 			     ~(NETIF_F_HW_VLAN_CTAG_TX |
 			       NETIF_F_HW_VLAN_STAG_TX);
-	dev->lltx = true;
 
 	tun->flags = (tun->flags & ~TUN_FEATURES) |
 		      (ifr->ifr_flags & TUN_FEATURES);
@@ -1002,9 +1001,9 @@ static unsigned int run_ebpf_filter(struct tun_struct *tun,
 /* Produce a packet into the transmit ring. If the ring becomes full, the
  * netdev queue is stopped until the consumer wakes it again.
  */
-static __always_unused int tun_ring_produce(struct ptr_ring *ring,
-					    struct netdev_queue *queue,
-					    struct sk_buff *skb)
+static int tun_ring_produce(struct ptr_ring *ring,
+			    struct netdev_queue *queue,
+			    struct sk_buff *skb)
 {
 	int ret;
 
@@ -1089,7 +1088,7 @@ static void *__tun_ring_consume(struct tun_file *tfile)
 	return ptr;
 }
 
-static void __always_unused *tun_ring_consume(struct tun_file *tfile)
+static void *tun_ring_consume(struct tun_file *tfile)
 {
 	void *ptr;
 
@@ -1161,15 +1160,12 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset_ct(skb);
 
-	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+	queue = netdev_get_tx_queue(dev, txq);
+	if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
 		drop_reason = SKB_DROP_REASON_FULL_RING;
 		goto drop;
 	}
 
-	/* dev->lltx requires to do our own update of trans_start */
-	queue = netdev_get_tx_queue(dev, txq);
-	txq_trans_cond_update(queue);
-
 	/* Notify and wake up reader process */
 	if (tfile->flags & TUN_FASYNC)
 		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
@@ -2220,7 +2216,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 	void *ptr = NULL;
 	int error = 0;
 
-	ptr = ptr_ring_consume(&tfile->tx_ring);
+	ptr = tun_ring_consume(tfile);
 	if (ptr)
 		goto out;
 	if (noblock) {
@@ -2232,7 +2228,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 
 	while (1) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		ptr = ptr_ring_consume(&tfile->tx_ring);
+		ptr = tun_ring_consume(tfile);
 		if (ptr)
 			break;
 		if (signal_pending(current)) {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 35ded4330431..022efca1d4af 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -90,6 +90,12 @@ enum {
 	VHOST_NET_VQ_MAX = 2,
 };
 
+enum if_type {
+	IF_NONE = 0,
+	IF_TUN = 1,
+	IF_TAP = 2,
+};
+
 struct vhost_net_ubuf_ref {
 	/* refcount follows semantics similar to kref:
 	 *  0: object is released
@@ -131,6 +137,8 @@ struct vhost_net_virtqueue {
 	struct vhost_net_buf rxq;
 	/* Batched XDP buffs */
 	struct xdp_buff *xdp;
+	/* Interface type */
+	enum if_type type;
 };
 
 struct vhost_net {
@@ -176,24 +184,50 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
 	return ret;
 }
 
-static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq,
+				 struct sock *sk)
 {
+	struct file *file = sk->sk_socket->file;
 	struct vhost_net_buf *rxq = &nvq->rxq;
 
 	rxq->head = 0;
-	rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
-					      VHOST_NET_BATCH);
+	switch (nvq->type) {
+	case IF_TUN:
+		rxq->tail = tun_ring_consume_batched(file, rxq->queue,
+						     VHOST_NET_BATCH);
+		break;
+	case IF_TAP:
+		rxq->tail = tap_ring_consume_batched(file, rxq->queue,
+						     VHOST_NET_BATCH);
+		break;
+	case IF_NONE:
+		return 0;
+	}
 	return rxq->tail;
 }
 
-static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
+static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq,
+				    struct socket *sk)
 {
 	struct vhost_net_buf *rxq = &nvq->rxq;
-
-	if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
-		ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
-				   vhost_net_buf_get_size(rxq),
-				   tun_ptr_free);
+	struct file *file;
+
+	if (sk && !vhost_net_buf_is_empty(rxq)) {
+		file = sk->file;
+		switch (nvq->type) {
+		case IF_TUN:
+			tun_ring_unconsume(file, rxq->queue + rxq->head,
+					   vhost_net_buf_get_size(rxq),
+					   tun_ptr_free);
+			break;
+		case IF_TAP:
+			tap_ring_unconsume(file, rxq->queue + rxq->head,
+					   vhost_net_buf_get_size(rxq),
+					   tun_ptr_free);
+			break;
+		case IF_NONE:
+			return;
+		}
 		rxq->head = rxq->tail = 0;
 	}
 }
@@ -209,14 +243,15 @@ static int vhost_net_buf_peek_len(void *ptr)
 	return __skb_array_len_with_tag(ptr);
 }
 
-static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq,
+			      struct sock *sk)
 {
 	struct vhost_net_buf *rxq = &nvq->rxq;
 
 	if (!vhost_net_buf_is_empty(rxq))
 		goto out;
 
-	if (!vhost_net_buf_produce(nvq))
+	if (!vhost_net_buf_produce(nvq, sk))
 		return 0;
 
 out:
@@ -991,8 +1026,8 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 	int len = 0;
 	unsigned long flags;
 
-	if (rvq->rx_ring)
-		return vhost_net_buf_peek(rvq);
+	if (rvq->type)
+		return vhost_net_buf_peek(rvq, sk);
 
 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
@@ -1201,7 +1236,7 @@ static void handle_rx(struct vhost_net *net)
 			goto out;
 		}
 		busyloop_intr = false;
-		if (nvq->rx_ring)
+		if (nvq->type)
 			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
 		/* On overrun, truncate and discard */
 		if (unlikely(headcount > UIO_MAXIOV)) {
@@ -1357,7 +1392,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		n->vqs[i].batched_xdp = 0;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
-		n->vqs[i].rx_ring = NULL;
+		n->vqs[i].type = IF_NONE;
 		vhost_net_buf_init(&n->vqs[i].rxq);
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
@@ -1387,8 +1422,8 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 	sock = vhost_vq_get_backend(vq);
 	vhost_net_disable_vq(n, vq);
 	vhost_vq_set_backend(vq, NULL);
-	vhost_net_buf_unproduce(nvq);
-	nvq->rx_ring = NULL;
+	vhost_net_buf_unproduce(nvq, sock);
+	nvq->type = IF_NONE;
 	mutex_unlock(&vq->mutex);
 	return sock;
 }
@@ -1468,18 +1503,13 @@ static struct socket *get_raw_socket(int fd)
 	return ERR_PTR(r);
 }
 
-static struct ptr_ring *get_tap_ptr_ring(struct file *file)
+static enum if_type get_if_type(struct file *file)
 {
-	struct ptr_ring *ring;
-	ring = tun_get_tx_ring(file);
-	if (!IS_ERR(ring))
-		goto out;
-	ring = tap_get_ptr_ring(file);
-	if (!IS_ERR(ring))
-		goto out;
-	ring = NULL;
-out:
-	return ring;
+	if (tap_is_tap_file(file))
+		return IF_TAP;
+	if (tun_is_tun_file(file))
+		return IF_TUN;
+	return IF_NONE;
 }
 
 static struct socket *get_tap_socket(int fd)
@@ -1561,7 +1591,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 
 		vhost_net_disable_vq(n, vq);
 		vhost_vq_set_backend(vq, sock);
-		vhost_net_buf_unproduce(nvq);
+		vhost_net_buf_unproduce(nvq, sock);
 		r = vhost_vq_init_access(vq);
 		if (r)
 			goto err_used;
@@ -1570,9 +1600,9 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 			goto err_used;
 		if (index == VHOST_NET_VQ_RX) {
 			if (sock)
-				nvq->rx_ring = get_tap_ptr_ring(sock->file);
+				nvq->type = get_if_type(sock->file);
 			else
-				nvq->rx_ring = NULL;
+				nvq->type = IF_NONE;
 		}
 
 		oldubufs = nvq->ubufs;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 7/8] tun/tap/vhost: use {tun|tap}_ring_{consume|produce} to avoid tail drops
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
                   ` (6 preceding siblings ...)
  2025-11-20 15:29 ` [PATCH net-next v6 7/8] tun/tap & vhost-net: use {tun|tap}_ring_{consume|produce} to avoid tail drops Simon Schippers
@ 2025-11-20 15:29 ` Simon Schippers
  2025-11-20 15:29 ` [PATCH net-next v6 8/8] tun/tap: drop get ring exports Simon Schippers
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Simon Schippers @ 2025-11-20 15:29 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, jon, tim.gebauer, simon.schippers,
	netdev, linux-kernel, kvm, virtualization

Switch to {tun|tap}_ring_{consume|produce} in both tun/tap as well as
vhost_net to avoid ptr_ring tail drops.

For tun, disable dev->lltx to ensure that tun_net_xmit is not called even 
though the netdev queue is stopped (which can happen after an unconsume 
or a queue resize). Consequently, the update of trans_start in 
tun_net_xmit is also removed.

Instead of the rx_ring pointer, the virtqueue now stores the interface 
type (IF_TUN, IF_TAP, or IF_NONE) and uses it to call the tun/tap 
wrappers.

+--------------------------------+-----------+----------+
| pktgen benchmarks to Debian VM | Stock     | Patched  |
| i5 6300HQ, 20M packets         |           |          |
+-----------------+--------------+-----------+----------+
| TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
|                 +--------------+-----------+----------+
|                 | Lost         | 1615 Kpps | 0 pps    |
+-----------------+--------------+-----------+----------+
| TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
|                 +--------------+-----------+----------+
|                 | Lost         | 1164 Kpps | 0 pps    |
+-----------------+--------------+-----------+----------+

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c   |  4 +-
 drivers/net/tun.c   | 20 ++++------
 drivers/vhost/net.c | 92 ++++++++++++++++++++++++++++++---------------
 3 files changed, 71 insertions(+), 45 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 56b8fe376e4a..2847db4e3cc7 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -805,7 +805,7 @@ static void *__tap_ring_consume(struct tap_queue *q)
 	return ptr;
 }
 
-static __always_unused void *tap_ring_consume(struct tap_queue *q)
+static void *tap_ring_consume(struct tap_queue *q)
 {
 	void *ptr;
 
@@ -868,7 +868,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
 					TASK_INTERRUPTIBLE);
 
 		/* Read frames from the queue */
-		skb = ptr_ring_consume(&q->ring);
+		skb = tap_ring_consume(q);
 		if (skb)
 			break;
 		if (noblock) {
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index dc2d267d30d7..9da6e794a80f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -931,7 +931,6 @@ static int tun_net_init(struct net_device *dev)
 	dev->vlan_features = dev->features &
 			     ~(NETIF_F_HW_VLAN_CTAG_TX |
 			       NETIF_F_HW_VLAN_STAG_TX);
-	dev->lltx = true;
 
 	tun->flags = (tun->flags & ~TUN_FEATURES) |
 		      (ifr->ifr_flags & TUN_FEATURES);
@@ -1002,9 +1001,9 @@ static unsigned int run_ebpf_filter(struct tun_struct *tun,
 /* Produce a packet into the transmit ring. If the ring becomes full, the
  * netdev queue is stopped until the consumer wakes it again.
  */
-static __always_unused int tun_ring_produce(struct ptr_ring *ring,
-					    struct netdev_queue *queue,
-					    struct sk_buff *skb)
+static int tun_ring_produce(struct ptr_ring *ring,
+			    struct netdev_queue *queue,
+			    struct sk_buff *skb)
 {
 	int ret;
 
@@ -1089,7 +1088,7 @@ static void *__tun_ring_consume(struct tun_file *tfile)
 	return ptr;
 }
 
-static void __always_unused *tun_ring_consume(struct tun_file *tfile)
+static void *tun_ring_consume(struct tun_file *tfile)
 {
 	void *ptr;
 
@@ -1161,15 +1160,12 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset_ct(skb);
 
-	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+	queue = netdev_get_tx_queue(dev, txq);
+	if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
 		drop_reason = SKB_DROP_REASON_FULL_RING;
 		goto drop;
 	}
 
-	/* dev->lltx requires to do our own update of trans_start */
-	queue = netdev_get_tx_queue(dev, txq);
-	txq_trans_cond_update(queue);
-
 	/* Notify and wake up reader process */
 	if (tfile->flags & TUN_FASYNC)
 		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
@@ -2220,7 +2216,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 	void *ptr = NULL;
 	int error = 0;
 
-	ptr = ptr_ring_consume(&tfile->tx_ring);
+	ptr = tun_ring_consume(tfile);
 	if (ptr)
 		goto out;
 	if (noblock) {
@@ -2232,7 +2228,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 
 	while (1) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		ptr = ptr_ring_consume(&tfile->tx_ring);
+		ptr = tun_ring_consume(tfile);
 		if (ptr)
 			break;
 		if (signal_pending(current)) {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 35ded4330431..022efca1d4af 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -90,6 +90,12 @@ enum {
 	VHOST_NET_VQ_MAX = 2,
 };
 
+enum if_type {
+	IF_NONE = 0,
+	IF_TUN = 1,
+	IF_TAP = 2,
+};
+
 struct vhost_net_ubuf_ref {
 	/* refcount follows semantics similar to kref:
 	 *  0: object is released
@@ -131,6 +137,8 @@ struct vhost_net_virtqueue {
 	struct vhost_net_buf rxq;
 	/* Batched XDP buffs */
 	struct xdp_buff *xdp;
+	/* Interface type */
+	enum if_type type;
 };
 
 struct vhost_net {
@@ -176,24 +184,50 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
 	return ret;
 }
 
-static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq,
+				 struct sock *sk)
 {
+	struct file *file = sk->sk_socket->file;
 	struct vhost_net_buf *rxq = &nvq->rxq;
 
 	rxq->head = 0;
-	rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
-					      VHOST_NET_BATCH);
+	switch (nvq->type) {
+	case IF_TUN:
+		rxq->tail = tun_ring_consume_batched(file, rxq->queue,
+						     VHOST_NET_BATCH);
+		break;
+	case IF_TAP:
+		rxq->tail = tap_ring_consume_batched(file, rxq->queue,
+						     VHOST_NET_BATCH);
+		break;
+	case IF_NONE:
+		return 0;
+	}
 	return rxq->tail;
 }
 
-static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
+static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq,
+				    struct socket *sk)
 {
 	struct vhost_net_buf *rxq = &nvq->rxq;
-
-	if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
-		ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
-				   vhost_net_buf_get_size(rxq),
-				   tun_ptr_free);
+	struct file *file;
+
+	if (sk && !vhost_net_buf_is_empty(rxq)) {
+		file = sk->file;
+		switch (nvq->type) {
+		case IF_TUN:
+			tun_ring_unconsume(file, rxq->queue + rxq->head,
+					   vhost_net_buf_get_size(rxq),
+					   tun_ptr_free);
+			break;
+		case IF_TAP:
+			tap_ring_unconsume(file, rxq->queue + rxq->head,
+					   vhost_net_buf_get_size(rxq),
+					   tun_ptr_free);
+			break;
+		case IF_NONE:
+			return;
+		}
 		rxq->head = rxq->tail = 0;
 	}
 }
@@ -209,14 +243,15 @@ static int vhost_net_buf_peek_len(void *ptr)
 	return __skb_array_len_with_tag(ptr);
 }
 
-static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq,
+			      struct sock *sk)
 {
 	struct vhost_net_buf *rxq = &nvq->rxq;
 
 	if (!vhost_net_buf_is_empty(rxq))
 		goto out;
 
-	if (!vhost_net_buf_produce(nvq))
+	if (!vhost_net_buf_produce(nvq, sk))
 		return 0;
 
 out:
@@ -991,8 +1026,8 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 	int len = 0;
 	unsigned long flags;
 
-	if (rvq->rx_ring)
-		return vhost_net_buf_peek(rvq);
+	if (rvq->type)
+		return vhost_net_buf_peek(rvq, sk);
 
 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
@@ -1201,7 +1236,7 @@ static void handle_rx(struct vhost_net *net)
 			goto out;
 		}
 		busyloop_intr = false;
-		if (nvq->rx_ring)
+		if (nvq->type)
 			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
 		/* On overrun, truncate and discard */
 		if (unlikely(headcount > UIO_MAXIOV)) {
@@ -1357,7 +1392,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		n->vqs[i].batched_xdp = 0;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
-		n->vqs[i].rx_ring = NULL;
+		n->vqs[i].type = IF_NONE;
 		vhost_net_buf_init(&n->vqs[i].rxq);
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
@@ -1387,8 +1422,8 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 	sock = vhost_vq_get_backend(vq);
 	vhost_net_disable_vq(n, vq);
 	vhost_vq_set_backend(vq, NULL);
-	vhost_net_buf_unproduce(nvq);
-	nvq->rx_ring = NULL;
+	vhost_net_buf_unproduce(nvq, sock);
+	nvq->type = IF_NONE;
 	mutex_unlock(&vq->mutex);
 	return sock;
 }
@@ -1468,18 +1503,13 @@ static struct socket *get_raw_socket(int fd)
 	return ERR_PTR(r);
 }
 
-static struct ptr_ring *get_tap_ptr_ring(struct file *file)
+static enum if_type get_if_type(struct file *file)
 {
-	struct ptr_ring *ring;
-	ring = tun_get_tx_ring(file);
-	if (!IS_ERR(ring))
-		goto out;
-	ring = tap_get_ptr_ring(file);
-	if (!IS_ERR(ring))
-		goto out;
-	ring = NULL;
-out:
-	return ring;
+	if (tap_is_tap_file(file))
+		return IF_TAP;
+	if (tun_is_tun_file(file))
+		return IF_TUN;
+	return IF_NONE;
 }
 
 static struct socket *get_tap_socket(int fd)
@@ -1561,7 +1591,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 
 		vhost_net_disable_vq(n, vq);
 		vhost_vq_set_backend(vq, sock);
-		vhost_net_buf_unproduce(nvq);
+		vhost_net_buf_unproduce(nvq, sock);
 		r = vhost_vq_init_access(vq);
 		if (r)
 			goto err_used;
@@ -1570,9 +1600,9 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 			goto err_used;
 		if (index == VHOST_NET_VQ_RX) {
 			if (sock)
-				nvq->rx_ring = get_tap_ptr_ring(sock->file);
+				nvq->type = get_if_type(sock->file);
 			else
-				nvq->rx_ring = NULL;
+				nvq->type = IF_NONE;
 		}
 
 		oldubufs = nvq->ubufs;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 8/8] tun/tap: drop get ring exports
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
                   ` (7 preceding siblings ...)
  2025-11-20 15:29 ` [PATCH net-next v6 7/8] tun/tap/vhost: " Simon Schippers
@ 2025-11-20 15:29 ` Simon Schippers
  2025-11-21  6:19 ` [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Jason Wang
  2025-11-21  9:18 ` [syzbot ci] " syzbot ci
  10 siblings, 0 replies; 30+ messages in thread
From: Simon Schippers @ 2025-11-20 15:29 UTC (permalink / raw)
  To: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, mst, eperezma, jon, tim.gebauer, simon.schippers,
	netdev, linux-kernel, kvm, virtualization

tun_get_tx_ring and tap_get_ptr_ring no longer have in-tree consumers and
can be dropped.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
 drivers/net/tap.c      | 13 -------------
 drivers/net/tun.c      | 13 -------------
 include/linux/if_tap.h |  5 -----
 include/linux/if_tun.h |  6 ------
 4 files changed, 37 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 2847db4e3cc7..fd87db829913 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1270,19 +1270,6 @@ struct socket *tap_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
 
-struct ptr_ring *tap_get_ptr_ring(struct file *file)
-{
-	struct tap_queue *q;
-
-	if (file->f_op != &tap_fops)
-		return ERR_PTR(-EINVAL);
-	q = file->private_data;
-	if (!q)
-		return ERR_PTR(-EBADFD);
-	return &q->ring;
-}
-EXPORT_SYMBOL_GPL(tap_get_ptr_ring);
-
 bool tap_is_tap_file(struct file *file)
 {
 	struct tap_queue *q;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9da6e794a80f..32f53e31a5a7 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3843,19 +3843,6 @@ void tun_ring_unconsume(struct file *file, void **batch, int n,
 }
 EXPORT_SYMBOL_GPL(tun_ring_unconsume);
 
-struct ptr_ring *tun_get_tx_ring(struct file *file)
-{
-	struct tun_file *tfile;
-
-	if (file->f_op != &tun_fops)
-		return ERR_PTR(-EINVAL);
-	tfile = file->private_data;
-	if (!tfile)
-		return ERR_PTR(-EBADFD);
-	return &tfile->tx_ring;
-}
-EXPORT_SYMBOL_GPL(tun_get_tx_ring);
-
 bool tun_is_tun_file(struct file *file)
 {
 	struct tun_file *tfile;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 14194342b784..0e427b979c11 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -10,7 +10,6 @@ struct socket;
 
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
-struct ptr_ring *tap_get_ptr_ring(struct file *file);
 int tap_ring_consume_batched(struct file *file, void **array, int n);
 void tap_ring_unconsume(struct file *file, void **batch, int n,
 			void (*destroy)(void *));
@@ -22,10 +21,6 @@ static inline struct socket *tap_get_socket(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
-static inline struct ptr_ring *tap_get_ptr_ring(struct file *f)
-{
-	return ERR_PTR(-EINVAL);
-}
 static inline int tap_ring_consume_batched(struct file *f,
 					   void **array, int n)
 {
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 0910c6dbac20..80b734173a80 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -21,7 +21,6 @@ struct tun_msg_ctl {
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
-struct ptr_ring *tun_get_tx_ring(struct file *file);
 int tun_ring_consume_batched(struct file *file, void **array, int n);
 void tun_ring_unconsume(struct file *file, void **batch, int n,
 			void (*destroy)(void *));
@@ -54,11 +53,6 @@ static inline struct socket *tun_get_socket(struct file *f)
 	return ERR_PTR(-EINVAL);
 }
 
-static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 static inline int tun_ring_consume_batched(struct file *file,
 					   void **array, int n)
 {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
                   ` (8 preceding siblings ...)
  2025-11-20 15:29 ` [PATCH net-next v6 8/8] tun/tap: drop get ring exports Simon Schippers
@ 2025-11-21  6:19 ` Jason Wang
  2025-11-21  9:22   ` Simon Schippers
  2025-11-21  9:18 ` [syzbot ci] " syzbot ci
  10 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-11-21  6:19 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> This patch series deals with tun/tap and vhost-net which drop incoming
> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
> patch series, the associated netdev queue is stopped before this happens.
> This allows the connected qdisc to function correctly as reported by [1]
> and improves application-layer performance, see our paper [2]. Meanwhile
> the theoretical performance differs only slightly:
>
> +--------------------------------+-----------+----------+
> | pktgen benchmarks to Debian VM | Stock     | Patched  |
> | i5 6300HQ, 20M packets         |           |          |
> +-----------------+--------------+-----------+----------+
> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
> |                 +--------------+-----------+----------+
> |                 | Lost         | 1615 Kpps | 0 pps    |
> +-----------------+--------------+-----------+----------+
> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
> |                 +--------------+-----------+----------+
> |                 | Lost         | 1164 Kpps | 0 pps    |
> +-----------------+--------------+-----------+----------+

PPS drops somehow for TAP, any reason for that?

Btw, I had some questions:

1) most of the patches in this series would introduce non-trivial
impact on the performance, we probably need to benchmark each or split
the series. What's more we need to run TCP benchmark
(throughput/latency) as well as pktgen see the real impact

2) I see this:

        if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
                drop_reason = SKB_DROP_REASON_FULL_RING;
                goto drop;
        }

So there could still be packet drop? Or is this related to the XDP path?

3) The LLTX change would have performance implications, but the
benchmark doesn't cover the case where multiple transmission is done in
parallel

4) After the LLTX change, it seems we've lost the synchronization with
the XDP_TX and XDP_REDIRECT path?

5) The series introduces various ptr_ring helpers with lots of
ordering stuff which is complicated, I wonder if we first have a
simple patch to implement the zero packet loss

>
> This patch series includes tun/tap, and vhost-net because they share
> logic. Adjusting only one of them would break the others. Therefore, the
> patch series is structured as follows:
> 1+2: new ptr_ring helpers for 3
> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
> management
> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
> vhost-net
> 7: tun/tap & vhost-net: only now use the previous implemented functions to
> not break git bisect
> 8: tun/tap: drop get ring exports (not used anymore)
>
> Possible future work:
> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger

This seems to be not easy. The tx completion depends on the userspace behaviour.

> - Adaption of the netdev queue flow control for ipvtap & macvtap
>
> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
>

Thanks


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [syzbot ci] Re: tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
  2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
                   ` (9 preceding siblings ...)
  2025-11-21  6:19 ` [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Jason Wang
@ 2025-11-21  9:18 ` syzbot ci
  10 siblings, 0 replies; 30+ messages in thread
From: syzbot ci @ 2025-11-21  9:18 UTC (permalink / raw)
  To: andrew, davem, edumazet, eperezma, jasowang, jon, kuba, kvm,
	linux-kernel, mst, netdev, pabeni, simon.schippers, tim.gebauer,
	virtualization, willemdebruijn.kernel
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v6] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
https://lore.kernel.org/all/20251120152914.1127975-1-simon.schippers@tu-dortmund.de
* [PATCH net-next v6 1/8] ptr_ring: add __ptr_ring_full_next() to predict imminent fullness
* [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space
* [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management
* [PATCH net-next v6 4/8] tun/tap: add batched ring consume function
* [PATCH net-next v6 5/8] tun/tap: add unconsume function for returning entries to ring
* [PATCH net-next v6 6/8] tun/tap: add helper functions to check file type
* [PATCH net-next v6 7/8] tun/tap/vhost: use {tun|tap}_ring_{consume|produce} to avoid tail drops
* [PATCH net-next v6 8/8] tun/tap: drop get ring exports

and found the following issue:
general protection fault in tun_net_xmit

Full report is available here:
https://ci.syzbot.org/series/63c35694-3fa6-48b6-ba11-f893f55bcc1a

***

general protection fault in tun_net_xmit

tree:      net-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base:      45a1cd8346ca245a1ca475b26eb6ceb9d8b7c6f0
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/e1084fb4-2e0a-4c87-8e42-bc8fa70e1c77/config
syz repro: https://ci.syzbot.org/findings/cf1c9121-7e31-4bc8-a254-f9e6c8ee2d26/syz_repro

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
CPU: 1 UID: 0 PID: 13 Comm: kworker/u8:1 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: ipv6_addrconf addrconf_dad_work
RIP: 0010:__ptr_ring_full include/linux/ptr_ring.h:51 [inline]
RIP: 0010:tun_ring_produce drivers/net/tun.c:1023 [inline]
RIP: 0010:tun_net_xmit+0xdf0/0x1840 drivers/net/tun.c:1164
Code: 00 00 00 fc ff df 48 89 44 24 50 0f b6 04 18 84 c0 0f 85 1f 07 00 00 4c 89 7c 24 30 4d 63 37 4f 8d 3c f4 4c 89 f8 48 c1 e8 03 <80> 3c 18 00 74 08 4c 89 ff e8 92 ba e3 fb 49 83 3f 00 74 0a e8 17
RSP: 0018:ffffc90000126f80 EFLAGS: 00010202
RAX: 0000000000000002 RBX: dffffc0000000000 RCX: dffffc0000000000
RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffc90000126f00
RBP: ffffc900001270b0 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffff52000024de0 R12: 0000000000000010
R13: ffff8881730b6a48 R14: 0000000000000000 R15: 0000000000000010
FS:  0000000000000000(0000) GS:ffff8882a9f38000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000002280 CR3: 00000001bb189000 CR4: 0000000000352ef0
Call Trace:
 <TASK>
 __netdev_start_xmit include/linux/netdevice.h:5259 [inline]
 netdev_start_xmit include/linux/netdevice.h:5268 [inline]
 xmit_one net/core/dev.c:3853 [inline]
 dev_hard_start_xmit+0x2d7/0x830 net/core/dev.c:3869
 __dev_queue_xmit+0x172a/0x3740 net/core/dev.c:4811
 neigh_output include/net/neighbour.h:556 [inline]
 ip6_finish_output2+0xfb3/0x1480 net/ipv6/ip6_output.c:136
 NF_HOOK_COND include/linux/netfilter.h:307 [inline]
 ip6_output+0x340/0x550 net/ipv6/ip6_output.c:247
 NF_HOOK include/linux/netfilter.h:318 [inline]
 ndisc_send_skb+0xbce/0x1510 net/ipv6/ndisc.c:512
 addrconf_dad_completed+0x7ae/0xd60 net/ipv6/addrconf.c:4360
 addrconf_dad_work+0xc36/0x14b0 net/ipv6/addrconf.c:-1
 process_one_work kernel/workqueue.c:3263 [inline]
 process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3346
 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3427
 kthread+0x711/0x8a0 kernel/kthread.c:463
 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__ptr_ring_full include/linux/ptr_ring.h:51 [inline]
RIP: 0010:tun_ring_produce drivers/net/tun.c:1023 [inline]
RIP: 0010:tun_net_xmit+0xdf0/0x1840 drivers/net/tun.c:1164
Code: 00 00 00 fc ff df 48 89 44 24 50 0f b6 04 18 84 c0 0f 85 1f 07 00 00 4c 89 7c 24 30 4d 63 37 4f 8d 3c f4 4c 89 f8 48 c1 e8 03 <80> 3c 18 00 74 08 4c 89 ff e8 92 ba e3 fb 49 83 3f 00 74 0a e8 17
RSP: 0018:ffffc90000126f80 EFLAGS: 00010202
RAX: 0000000000000002 RBX: dffffc0000000000 RCX: dffffc0000000000
RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffc90000126f00
RBP: ffffc900001270b0 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffff52000024de0 R12: 0000000000000010
R13: ffff8881730b6a48 R14: 0000000000000000 R15: 0000000000000010
FS:  0000000000000000(0000) GS:ffff8882a9f38000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000002280 CR3: 00000001bb189000 CR4: 0000000000352ef0
----------------
Code disassembly (best guess), 5 bytes skipped:
   0:	df 48 89             	fisttps -0x77(%rax)
   3:	44 24 50             	rex.R and $0x50,%al
   6:	0f b6 04 18          	movzbl (%rax,%rbx,1),%eax
   a:	84 c0                	test   %al,%al
   c:	0f 85 1f 07 00 00    	jne    0x731
  12:	4c 89 7c 24 30       	mov    %r15,0x30(%rsp)
  17:	4d 63 37             	movslq (%r15),%r14
  1a:	4f 8d 3c f4          	lea    (%r12,%r14,8),%r15
  1e:	4c 89 f8             	mov    %r15,%rax
  21:	48 c1 e8 03          	shr    $0x3,%rax
* 25:	80 3c 18 00          	cmpb   $0x0,(%rax,%rbx,1) <-- trapping instruction
  29:	74 08                	je     0x33
  2b:	4c 89 ff             	mov    %r15,%rdi
  2e:	e8 92 ba e3 fb       	call   0xfbe3bac5
  33:	49 83 3f 00          	cmpq   $0x0,(%r15)
  37:	74 0a                	je     0x43
  39:	e8                   	.byte 0xe8
  3a:	17                   	(bad)


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
  2025-11-21  6:19 ` [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Jason Wang
@ 2025-11-21  9:22   ` Simon Schippers
  2025-11-24  1:04     ` Jason Wang
  2025-11-26  7:15     ` Michael S. Tsirkin
  0 siblings, 2 replies; 30+ messages in thread
From: Simon Schippers @ 2025-11-21  9:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On 11/21/25 07:19, Jason Wang wrote:
> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> This patch series deals with tun/tap and vhost-net which drop incoming
>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
>> patch series, the associated netdev queue is stopped before this happens.
>> This allows the connected qdisc to function correctly as reported by [1]
>> and improves application-layer performance, see our paper [2]. Meanwhile
>> the theoretical performance differs only slightly:
>>
>> +--------------------------------+-----------+----------+
>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
>> | i5 6300HQ, 20M packets         |           |          |
>> +-----------------+--------------+-----------+----------+
>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
>> |                 +--------------+-----------+----------+
>> |                 | Lost         | 1615 Kpps | 0 pps    |
>> +-----------------+--------------+-----------+----------+
>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
>> |                 +--------------+-----------+----------+
>> |                 | Lost         | 1164 Kpps | 0 pps    |
>> +-----------------+--------------+-----------+----------+
> 

Hi Jason,

thank you for your reply!

> PPS drops somehow for TAP, any reason for that?

I have no explicit explanation for that except general overheads coming
with this implementation.

> 
> Btw, I had some questions:
> 
> 1) most of the patches in this series would introduce non-trivial
> impact on the performance, we probably need to benchmark each or split
> the series. What's more we need to run TCP benchmark
> (throughput/latency) as well as pktgen see the real impact

What could be done, IMO, is to activate tun_ring_consume() /
tap_ring_consume() before enabling tun_ring_produce(). Then we could see
if this alone drops performance.

For TCP benchmarks, you mean userspace performance like iperf3 between a
host and a guest system?

> 
> 2) I see this:
> 
>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>                 goto drop;
>         }
> 
> So there could still be packet drop? Or is this related to the XDP path?

Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
unconsume. Since those two happen so rarely, I figured we should just
drop in this case.

> 
> 3) The LLTX change would have performance implications, but the
> benchmark doesn't cover the case where multiple transmission is done in
> parallel

Do you mean multiple applications that produce traffic and potentially
run on different CPUs?

> 
> 4) After the LLTX change, it seems we've lost the synchronization with
> the XDP_TX and XDP_REDIRECT path?

I must admit I did not take a look at XDP and cannot really judge if/how
lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
call and I do not see the impact for XDP, which calls its own methods.
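
For reference, my simplified understanding of the core locking around
ndo_start_xmit() (a paraphrase from memory, not the literal kernel
macros; only __netif_tx_lock()/__netif_tx_acquire()/__netif_tx_unlock()
are existing helpers):

/* Sketch only: with LLTX the stack merely "acquires" the queue and
 * leaves serialization to the driver; without LLTX it takes the
 * per-queue lock around the driver's xmit callback.
 */
static netdev_tx_t xmit_one_sketch(struct sk_buff *skb, struct net_device *dev,
				   struct netdev_queue *txq, bool lltx)
{
	netdev_tx_t ret;

	if (lltx)
		__netif_tx_acquire(txq);
	else
		__netif_tx_lock(txq, smp_processor_id());

	ret = dev->netdev_ops->ndo_start_xmit(skb, dev);

	if (!lltx)
		__netif_tx_unlock(txq);

	return ret;
}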
> 
> 5) The series introduces various ptr_ring helpers with lots of
> ordering stuff which is complicated, I wonder if we first have a
> simple patch to implement the zero packet loss

I personally don't see how a simpler patch is possible without using
discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
spin locking between producer and consumer. But I am open for
suggestions :)

> 
>>
>> This patch series includes tun/tap, and vhost-net because they share
>> logic. Adjusting only one of them would break the others. Therefore, the
>> patch series is structured as follows:
>> 1+2: new ptr_ring helpers for 3
>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
>> management
>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
>> vhost-net
>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
>> not break git bisect
>> 8: tun/tap: drop get ring exports (not used anymore)
>>
>> Possible future work:
>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
> 
> This seems to be not easy. The tx completion depends on the userspace behaviour.

I agree, but I really would like to reduce the buffer bloat caused by the
default 500 TUN / 1000 TAP packet queue without losing performance.

> 
>> - Adaption of the netdev queue flow control for ipvtap & macvtap
>>
>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
>>
> 
> Thanks
> 

Thanks! :)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
  2025-11-21  9:22   ` Simon Schippers
@ 2025-11-24  1:04     ` Jason Wang
  2025-11-24  9:19       ` Simon Schippers
  2025-11-26  7:15     ` Michael S. Tsirkin
  1 sibling, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-11-24  1:04 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 11/21/25 07:19, Jason Wang wrote:
> > On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> This patch series deals with tun/tap and vhost-net which drop incoming
> >> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
> >> patch series, the associated netdev queue is stopped before this happens.
> >> This allows the connected qdisc to function correctly as reported by [1]
> >> and improves application-layer performance, see our paper [2]. Meanwhile
> >> the theoretical performance differs only slightly:
> >>
> >> +--------------------------------+-----------+----------+
> >> | pktgen benchmarks to Debian VM | Stock     | Patched  |
> >> | i5 6300HQ, 20M packets         |           |          |
> >> +-----------------+--------------+-----------+----------+
> >> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
> >> |                 +--------------+-----------+----------+
> >> |                 | Lost         | 1615 Kpps | 0 pps    |
> >> +-----------------+--------------+-----------+----------+
> >> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
> >> |                 +--------------+-----------+----------+
> >> |                 | Lost         | 1164 Kpps | 0 pps    |
> >> +-----------------+--------------+-----------+----------+
> >
>
> Hi Jason,
>
> thank you for your reply!
>
> > PPS drops somehow for TAP, any reason for that?
>
> I have no explicit explanation for that except general overheads coming
> with this implementation.

It would be better to fix that.

>
> >
> > Btw, I had some questions:
> >
> > 1) most of the patches in this series would introduce non-trivial
> > impact on the performance, we probably need to benchmark each or split
> > the series. What's more we need to run TCP benchmark
> > (throughput/latency) as well as pktgen see the real impact
>
> What could be done, IMO, is to activate tun_ring_consume() /
> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
> if this alone drops performance.
>
> For TCP benchmarks, you mean userspace performance like iperf3 between a
> host and a guest system?

Yes,

>
> >
> > 2) I see this:
> >
> >         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
> >                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >                 goto drop;
> >         }
> >
> > So there could still be packet drop? Or is this related to the XDP path?
>
> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
> unconsume. Since those two happen so rarely, I figured we should just
> drop in this case.
>
> >
> > 3) The LLTX change would have performance implications, but the
> > benchmark doesn't cover the case where multiple transmission is done in
> > parallel
>
> Do you mean multiple applications that produce traffic and potentially
> run on different CPUs?

Yes.

>
> >
> > 4) After the LLTX change, it seems we've lost the synchronization with
> > the XDP_TX and XDP_REDIRECT path?
>
> I must admit I did not take a look at XDP and cannot really judge if/how
> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
> call and I do not see the impact for XDP, which calls its own methods.

Without LLTX tun_net_xmit is protected by tx lock but it is not the
case of tun_xdp_xmit. This is because, unlike other devices, tun
doesn't have a dedicated TX queue for XDP, so the queue is shared by
both XDP and skb. So XDP xmit path needs to be protected with tx lock
as well, and since we don't have queue discipline for XDP, it means we
could still drop packets when XDP is enabled. I'm not sure this would
defeat the whole idea or not.
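
To make that concrete, a rough sketch of the kind of protection I mean
(the helper name and the queue mapping below are placeholders, not code
from this series):

/* Illustrative only: serialize XDP transmissions with the skb path by
 * taking the same per-queue tx lock that would protect tun_net_xmit()
 * once LLTX is dropped.
 */
static int tun_xdp_xmit_one_sketch(struct net_device *dev,
				   struct ptr_ring *ring, void *frame)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);
	int ret;

	__netif_tx_lock(txq, smp_processor_id());
	/* Same producer state as the skb path, so the same lock applies. */
	ret = __ptr_ring_produce(ring, frame);
	__netif_tx_unlock(txq);

	return ret; /* non-zero: ring full, the XDP frame is dropped */
}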

> >
> > 5) The series introduces various ptr_ring helpers with lots of
> > ordering stuff which is complicated, I wonder if we first have a
> > simple patch to implement the zero packet loss
>
> I personally don't see how a simpler patch is possible without using
> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
> spin locking between producer and consumer. But I am open for
> suggestions :)

I see NETDEV_TX_BUSY is used by veth:

static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
{
        if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
                return NETDEV_TX_BUSY; /* signal qdisc layer */

        return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
}

Maybe it would be simpler to start from that (probably with a new tun->flags?).
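
Roughly along these lines (a sketch only; the nodrop knob stands in for
a hypothetical new TUNSETIFF flag, and only the veth-style
NETDEV_TX_BUSY return is taken from the snippet above):

static netdev_tx_t tun_xmit_nodrop_sketch(struct sk_buff *skb,
					   struct ptr_ring *ring, bool nodrop)
{
	if (ptr_ring_produce(ring, skb)) {
		if (nodrop)
			/* push back on the qdisc, skb is requeued */
			return NETDEV_TX_BUSY;
		/* current behaviour: tail drop (tun returns NET_XMIT_DROP here) */
		kfree_skb(skb);
		return NETDEV_TX_OK;
	}
	return NETDEV_TX_OK;
}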

Thanks

>
> >
> >>
> >> This patch series includes tun/tap, and vhost-net because they share
> >> logic. Adjusting only one of them would break the others. Therefore, the
> >> patch series is structured as follows:
> >> 1+2: new ptr_ring helpers for 3
> >> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
> >> management
> >> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
> >> vhost-net
> >> 7: tun/tap & vhost-net: only now use the previous implemented functions to
> >> not break git bisect
> >> 8: tun/tap: drop get ring exports (not used anymore)
> >>
> >> Possible future work:
> >> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
> >
> > This seems to be not easy. The tx completion depends on the userspace behaviour.
>
> I agree, but I really would like to reduce the buffer bloat caused by the
> default 500 TUN / 1000 TAP packet queue without losing performance.
>
> >
> >> - Adaption of the netdev queue flow control for ipvtap & macvtap
> >>
> >> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
> >> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
> >>
> >
> > Thanks
> >
>
> Thanks! :)
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
  2025-11-24  1:04     ` Jason Wang
@ 2025-11-24  9:19       ` Simon Schippers
  2025-11-25  1:34         ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2025-11-24  9:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On 11/24/25 02:04, Jason Wang wrote:
> On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 11/21/25 07:19, Jason Wang wrote:
>>> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> This patch series deals with tun/tap and vhost-net which drop incoming
>>>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
>>>> patch series, the associated netdev queue is stopped before this happens.
>>>> This allows the connected qdisc to function correctly as reported by [1]
>>>> and improves application-layer performance, see our paper [2]. Meanwhile
>>>> the theoretical performance differs only slightly:
>>>>
>>>> +--------------------------------+-----------+----------+
>>>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
>>>> | i5 6300HQ, 20M packets         |           |          |
>>>> +-----------------+--------------+-----------+----------+
>>>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
>>>> |                 +--------------+-----------+----------+
>>>> |                 | Lost         | 1615 Kpps | 0 pps    |
>>>> +-----------------+--------------+-----------+----------+
>>>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
>>>> |                 +--------------+-----------+----------+
>>>> |                 | Lost         | 1164 Kpps | 0 pps    |
>>>> +-----------------+--------------+-----------+----------+
>>>
>>
>> Hi Jason,
>>
>> thank you for your reply!
>>
>>> PPS drops somehow for TAP, any reason for that?
>>
>> I have no explicit explanation for that except general overheads coming
>> with this implementation.
> 
> It would be better to fix that.
> 
>>
>>>
>>> Btw, I had some questions:
>>>
>>> 1) most of the patches in this series would introduce non-trivial
>>> impact on the performance, we probably need to benchmark each or split
>>> the series. What's more we need to run TCP benchmark
>>> (throughput/latency) as well as pktgen see the real impact
>>
>> What could be done, IMO, is to activate tun_ring_consume() /
>> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
>> if this alone drops performance.
>>
>> For TCP benchmarks, you mean userspace performance like iperf3 between a
>> host and a guest system?
> 
> Yes,
> 
>>
>>>
>>> 2) I see this:
>>>
>>>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>                 goto drop;
>>>         }
>>>
>>> So there could still be packet drop? Or is this related to the XDP path?
>>
>> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
>> unconsume. Since those two happen so rarely, I figured we should just
>> drop in this case.
>>
>>>
>>> 3) The LLTX change would have performance implications, but the
>>> benchmark doesn't cover the case where multiple transmission is done in
>>> parallel
>>
>> Do you mean multiple applications that produce traffic and potentially
>> run on different CPUs?
> 
> Yes.
> 
>>
>>>
>>> 4) After the LLTX change, it seems we've lost the synchronization with
>>> the XDP_TX and XDP_REDIRECT path?
>>
>> I must admit I did not take a look at XDP and cannot really judge if/how
>> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
>> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
>> call and I do not see the impact for XDP, which calls its own methods.
> 
> Without LLTX tun_net_xmit is protected by tx lock but it is not the
> case of tun_xdp_xmit. This is because, unlike other devices, tun
> doesn't have a dedicated TX queue for XDP, so the queue is shared by
> both XDP and skb. So XDP xmit path needs to be protected with tx lock
> as well, and since we don't have queue discipline for XDP, it means we
> could still drop packets when XDP is enabled. I'm not sure this would
> defeat the whole idea or not.

Good point.

> 
>>>
>>> 5) The series introduces various ptr_ring helpers with lots of
>>> ordering stuff which is complicated, I wonder if we first have a
>>> simple patch to implement the zero packet loss
>>
>> I personally don't see how a simpler patch is possible without using
>> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
>> spin locking between producer and consumer. But I am open for
>> suggestions :)
> 
> I see NETDEV_TX_BUSY is used by veth:
> 
> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> {
>         if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
>                 return NETDEV_TX_BUSY; /* signal qdisc layer */
> 
>         return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> }
> 
> Maybe it would be simpler to start from that (probably with a new tun->flags?).
> 
> Thanks

Do you mean that this patchset could be implemented using the same
approach that was used for veth in [1]?
This could then also fix the XDP path.

But is returning NETDEV_TX_BUSY fine in our case?

Do you mean a flag that enables or disables the no-drop behavior?

Thanks!

[1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#u

> 
>>
>>>
>>>>
>>>> This patch series includes tun/tap, and vhost-net because they share
>>>> logic. Adjusting only one of them would break the others. Therefore, the
>>>> patch series is structured as follows:
>>>> 1+2: new ptr_ring helpers for 3
>>>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
>>>> management
>>>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
>>>> vhost-net
>>>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
>>>> not break git bisect
>>>> 8: tun/tap: drop get ring exports (not used anymore)
>>>>
>>>> Possible future work:
>>>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
>>>
>>> This seems to be not easy. The tx completion depends on the userspace behaviour.
>>
>> I agree, but I really would like to reduce the buffer bloat caused by the
>> default 500 TUN / 1000 TAP packet queue without losing performance.
>>
>>>
>>>> - Adaption of the netdev queue flow control for ipvtap & macvtap
>>>>
>>>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
>>>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
>>>>
>>>
>>> Thanks
>>>
>>
>> Thanks! :)
>>
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
  2025-11-24  9:19       ` Simon Schippers
@ 2025-11-25  1:34         ` Jason Wang
  2025-11-25 14:04           ` Simon Schippers
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2025-11-25  1:34 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Mon, Nov 24, 2025 at 5:20 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 11/24/25 02:04, Jason Wang wrote:
> > On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 11/21/25 07:19, Jason Wang wrote:
> >>> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> This patch series deals with tun/tap and vhost-net which drop incoming
> >>>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
> >>>> patch series, the associated netdev queue is stopped before this happens.
> >>>> This allows the connected qdisc to function correctly as reported by [1]
> >>>> and improves application-layer performance, see our paper [2]. Meanwhile
> >>>> the theoretical performance differs only slightly:
> >>>>
> >>>> +--------------------------------+-----------+----------+
> >>>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
> >>>> | i5 6300HQ, 20M packets         |           |          |
> >>>> +-----------------+--------------+-----------+----------+
> >>>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
> >>>> |                 +--------------+-----------+----------+
> >>>> |                 | Lost         | 1615 Kpps | 0 pps    |
> >>>> +-----------------+--------------+-----------+----------+
> >>>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
> >>>> |                 +--------------+-----------+----------+
> >>>> |                 | Lost         | 1164 Kpps | 0 pps    |
> >>>> +-----------------+--------------+-----------+----------+
> >>>
> >>
> >> Hi Jason,
> >>
> >> thank you for your reply!
> >>
> >>> PPS drops somehow for TAP, any reason for that?
> >>
> >> I have no explicit explanation for that except general overheads coming
> >> with this implementation.
> >
> > It would be better to fix that.
> >
> >>
> >>>
> >>> Btw, I had some questions:
> >>>
> >>> 1) most of the patches in this series would introduce non-trivial
> >>> impact on the performance, we probably need to benchmark each or split
> >>> the series. What's more we need to run TCP benchmark
> >>> (throughput/latency) as well as pktgen see the real impact
> >>
> >> What could be done, IMO, is to activate tun_ring_consume() /
> >> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
> >> if this alone drops performance.
> >>
> >> For TCP benchmarks, you mean userspace performance like iperf3 between a
> >> host and a guest system?
> >
> > Yes,
> >
> >>
> >>>
> >>> 2) I see this:
> >>>
> >>>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
> >>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>                 goto drop;
> >>>         }
> >>>
> >>> So there could still be packet drop? Or is this related to the XDP path?
> >>
> >> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
> >> unconsume. Since those two happen so rarely, I figured we should just
> >> drop in this case.
> >>
> >>>
> >>> 3) The LLTX change would have performance implications, but the
> >>> benchmark doesn't cover the case where multiple transmission is done in
> >>> parallel
> >>
> >> Do you mean multiple applications that produce traffic and potentially
> >> run on different CPUs?
> >
> > Yes.
> >
> >>
> >>>
> >>> 4) After the LLTX change, it seems we've lost the synchronization with
> >>> the XDP_TX and XDP_REDIRECT path?
> >>
> >> I must admit I did not take a look at XDP and cannot really judge if/how
> >> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
> >> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
> >> call and I do not see the impact for XDP, which calls its own methods.
> >
> > Without LLTX tun_net_xmit is protected by tx lock but it is not the
> > case of tun_xdp_xmit. This is because, unlike other devices, tun
> > doesn't have a dedicated TX queue for XDP, so the queue is shared by
> > both XDP and skb. So XDP xmit path needs to be protected with tx lock
> > as well, and since we don't have queue discipline for XDP, it means we
> > could still drop packets when XDP is enabled. I'm not sure this would
> > defeat the whole idea or not.
>
> Good point.
>
> >
> >>>
> >>> 5) The series introduces various ptr_ring helpers with lots of
> >>> ordering stuff which is complicated, I wonder if we first have a
> >>> simple patch to implement the zero packet loss
> >>
> >> I personally don't see how a simpler patch is possible without using
> >> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
> >> spin locking between producer and consumer. But I am open for
> >> suggestions :)
> >
> > I see NETDEV_TX_BUSY is used by veth:
> >
> > static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> > {
> >         if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
> >                 return NETDEV_TX_BUSY; /* signal qdisc layer */
> >
> >         return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> > }
> >
> > Maybe it would be simpler to start from that (probably with a new tun->flags?).
> >
> > Thanks
>
> Do you mean that this patchset could be implemented using the same
> approach that was used for veth in [1]?
> This could then also fix the XDP path.

I think so.

>
> But is returning NETDEV_TX_BUSY fine in our case?

If it helps to avoid packet drop. But I'm not sure if qdisc is a must
in your case.

>
> Do you mean a flag that enables or disables the no-drop behavior?

Yes, via a new flag that could be set via TUNSETIFF.

Thanks

>
> Thanks!
>
> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#u
>
> >
> >>
> >>>
> >>>>
> >>>> This patch series includes tun/tap, and vhost-net because they share
> >>>> logic. Adjusting only one of them would break the others. Therefore, the
> >>>> patch series is structured as follows:
> >>>> 1+2: new ptr_ring helpers for 3
> >>>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
> >>>> management
> >>>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
> >>>> vhost-net
> >>>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
> >>>> not break git bisect
> >>>> 8: tun/tap: drop get ring exports (not used anymore)
> >>>>
> >>>> Possible future work:
> >>>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
> >>>
> >>> This seems to be not easy. The tx completion depends on the userspace behaviour.
> >>
> >> I agree, but I really would like to reduce the buffer bloat caused by the
> >> default 500 TUN / 1000 TAP packet queue without losing performance.
> >>
> >>>
> >>>> - Adaption of the netdev queue flow control for ipvtap & macvtap
> >>>>
> >>>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
> >>>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> >> Thanks! :)
> >>
> >
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
  2025-11-25  1:34         ` Jason Wang
@ 2025-11-25 14:04           ` Simon Schippers
  2025-11-26  6:19             ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2025-11-25 14:04 UTC (permalink / raw)
  To: Jason Wang
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On 11/25/25 02:34, Jason Wang wrote:
> On Mon, Nov 24, 2025 at 5:20 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> On 11/24/25 02:04, Jason Wang wrote:
>>> On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>
>>>> On 11/21/25 07:19, Jason Wang wrote:
>>>>> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
>>>>> <simon.schippers@tu-dortmund.de> wrote:
>>>>>>
>>>>>> This patch series deals with tun/tap and vhost-net which drop incoming
>>>>>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
>>>>>> patch series, the associated netdev queue is stopped before this happens.
>>>>>> This allows the connected qdisc to function correctly as reported by [1]
>>>>>> and improves application-layer performance, see our paper [2]. Meanwhile
>>>>>> the theoretical performance differs only slightly:
>>>>>>
>>>>>> +--------------------------------+-----------+----------+
>>>>>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
>>>>>> | i5 6300HQ, 20M packets         |           |          |
>>>>>> +-----------------+--------------+-----------+----------+
>>>>>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
>>>>>> |                 +--------------+-----------+----------+
>>>>>> |                 | Lost         | 1615 Kpps | 0 pps    |
>>>>>> +-----------------+--------------+-----------+----------+
>>>>>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
>>>>>> |                 +--------------+-----------+----------+
>>>>>> |                 | Lost         | 1164 Kpps | 0 pps    |
>>>>>> +-----------------+--------------+-----------+----------+
>>>>>
>>>>
>>>> Hi Jason,
>>>>
>>>> thank you for your reply!
>>>>
>>>>> PPS drops somehow for TAP, any reason for that?
>>>>
>>>> I have no explicit explanation for that except general overheads coming
>>>> with this implementation.
>>>
>>> It would be better to fix that.
>>>
>>>>
>>>>>
>>>>> Btw, I had some questions:
>>>>>
>>>>> 1) most of the patches in this series would introduce non-trivial
>>>>> impact on the performance, we probably need to benchmark each or split
>>>>> the series. What's more we need to run TCP benchmark
>>>>> (throughput/latency) as well as pktgen see the real impact
>>>>
>>>> What could be done, IMO, is to activate tun_ring_consume() /
>>>> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
>>>> if this alone drops performance.
>>>>
>>>> For TCP benchmarks, you mean userspace performance like iperf3 between a
>>>> host and a guest system?
>>>
>>> Yes,
>>>
>>>>
>>>>>
>>>>> 2) I see this:
>>>>>
>>>>>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>>                 goto drop;
>>>>>         }
>>>>>
>>>>> So there could still be packet drop? Or is this related to the XDP path?
>>>>
>>>> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
>>>> unconsume. Since those two happen so rarely, I figured we should just
>>>> drop in this case.
>>>>
>>>>>
>>>>> 3) The LLTX change would have performance implications, but the
>>>>> benchmark doesn't cover the case where multiple transmission is done in
>>>>> parallel
>>>>
>>>> Do you mean multiple applications that produce traffic and potentially
>>>> run on different CPUs?
>>>
>>> Yes.
>>>
>>>>
>>>>>
>>>>> 4) After the LLTX change, it seems we've lost the synchronization with
>>>>> the XDP_TX and XDP_REDIRECT path?
>>>>
>>>> I must admit I did not take a look at XDP and cannot really judge if/how
>>>> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
>>>> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
>>>> call and I do not see the impact for XDP, which calls its own methods.
>>>
>>> Without LLTX tun_net_xmit is protected by tx lock but it is not the
>>> case of tun_xdp_xmit. This is because, unlike other devices, tun
>>> doesn't have a dedicated TX queue for XDP, so the queue is shared by
>>> both XDP and skb. So XDP xmit path needs to be protected with tx lock
>>> as well, and since we don't have queue discipline for XDP, it means we
>>> could still drop packets when XDP is enabled. I'm not sure this would
>>> defeat the whole idea or not.
>>
>> Good point.
>>
>>>
>>>>>
>>>>> 5) The series introduces various ptr_ring helpers with lots of
>>>>> ordering stuff which is complicated, I wonder if we first have a
>>>>> simple patch to implement the zero packet loss
>>>>
>>>> I personally don't see how a simpler patch is possible without using
>>>> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
>>>> spin locking between producer and consumer. But I am open for
>>>> suggestions :)
>>>
>>> I see NETDEV_TX_BUSY is used by veth:
>>>
>>> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
>>> {
>>>         if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
>>>                 return NETDEV_TX_BUSY; /* signal qdisc layer */
>>>
>>>         return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>>> }
>>>
>>> Maybe it would be simpler to start from that (probably with a new tun->flags?).
>>>
>>> Thanks
>>
>> Do you mean that this patchset could be implemented using the same
>> approach that was used for veth in [1]?
>> This could then also fix the XDP path.
> 
> I think so.

Okay, I will do so and submit a v7 when net-next opens again for 6.19.

> 
>>
>> But is returning NETDEV_TX_BUSY fine in our case?
> 
> If it helps to avoid packet drop. But I'm not sure if qdisc is a must
> in your case.

I will try to avoid returning it.

When no qdisc is connected, I will just drop like veth does.

> 
>>
>> Do you mean a flag that enables or disables the no-drop behavior?
> 
> Yes, via a new flag that could be set via TUNSETIFF.
> 
> Thanks

I am not a fan of that, since I can not imagine a use case where
dropping packets is desired. veth does not introduce a flag either.

Of course, if there is a major performance degradation, it makes sense.
But I will benchmark it, and we will see.

Thank you!

> 
>>
>> Thanks!
>>
>> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#u
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> This patch series includes tun/tap, and vhost-net because they share
>>>>>> logic. Adjusting only one of them would break the others. Therefore, the
>>>>>> patch series is structured as follows:
>>>>>> 1+2: new ptr_ring helpers for 3
>>>>>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
>>>>>> management
>>>>>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
>>>>>> vhost-net
>>>>>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
>>>>>> not break git bisect
>>>>>> 8: tun/tap: drop get ring exports (not used anymore)
>>>>>>
>>>>>> Possible future work:
>>>>>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
>>>>>
>>>>> This seems to be not easy. The tx completion depends on the userspace behaviour.
>>>>
>>>> I agree, but I really would like to reduce the buffer bloat caused by the
>>>> default 500 TUN / 1000 TAP packet queue without losing performance.
>>>>
>>>>>
>>>>>> - Adaption of the netdev queue flow control for ipvtap & macvtap
>>>>>>
>>>>>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
>>>>>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
>>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>> Thanks! :)
>>>>
>>>
>>
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 1/8] ptr_ring: add __ptr_ring_full_next() to predict imminent fullness
  2025-11-20 15:29 ` [PATCH net-next v6 1/8] ptr_ring: add __ptr_ring_full_next() to predict imminent fullness Simon Schippers
@ 2025-11-25 14:54   ` Michael S. Tsirkin
  0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2025-11-25 14:54 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Thu, Nov 20, 2025 at 04:29:06PM +0100, Simon Schippers wrote:
> Introduce the __ptr_ring_full_next() helper, which lets callers check
> if the ptr_ring will become full after the next insertion. This is useful
> for proactively managing capacity before the ring is actually full.
> Callers must ensure the ring is not already full before using this
> helper. This is because __ptr_ring_discard_one() may zero entries in
> reverse order, the slot after the current producer position may be
> cleared before the current one. This must be considered when using this
> check.
> 
> Note: This function is especially relevant when paired with the memory
> ordering guarantees of __ptr_ring_produce() (smp_wmb()), allowing for
> safe producer/consumer coordination.
> 
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Co-developed-by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  include/linux/ptr_ring.h | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 534531807d95..da141cc8b075 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -96,6 +96,31 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
>  	return ret;
>  }
>  
> +/*
> + * Checks if the ptr_ring will become full after the next insertion.

Is this for the producer or the consumer? A better name would
reflect that.

> + *
> + * Note: Callers must ensure that the ptr_ring is not full before calling
> + * this function,

how?

> as __ptr_ring_discard_one invalidates entries in
> + * reverse order. Because the next entry (rather than the current one)
> + * may be zeroed after an insertion, failing to account for this can
> + * cause false negatives when checking whether the ring will become full
> + * on the next insertion.

this part confuses more than it clarifies.

> + */
> +static inline bool __ptr_ring_full_next(struct ptr_ring *r)
> +{
> +	int p;
> +
> +	if (unlikely(r->size <= 1))
> +		return true;
> +
> +	p = r->producer + 1;
> +
> +	if (unlikely(p >= r->size))
> +		p = 0;
> +
> +	return r->queue[p];
> +}
> +
>  /* Note: callers invoking this in a loop must use a compiler barrier,
>   * for example cpu_relax(). Callers must hold producer_lock.
>   * Callers are responsible for making sure pointer that is being queued
> -- 
> 2.43.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space
  2025-11-20 15:29 ` [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space Simon Schippers
@ 2025-11-25 15:01   ` Michael S. Tsirkin
  2025-11-25 16:12     ` Simon Schippers
  2025-11-30 18:16   ` Willem de Bruijn
  1 sibling, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2025-11-25 15:01 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Thu, Nov 20, 2025 at 04:29:07PM +0100, Simon Schippers wrote:
> Add __ptr_ring_consume_created_space() to check whether the previous
> __ptr_ring_consume() call successfully consumed an element and created
> space in the ring buffer. This enables callers to conditionally notify
> producers when space becomes available.
> 
> The function is only valid immediately after a single consume operation
> and should not be used after calling __ptr_ring_consume_batched().
> 
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Co-developed-by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  include/linux/ptr_ring.h | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index da141cc8b075..76d6840b45a3 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -453,6 +453,23 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>  	return ret;
>  }
>  
> +/*
> + * Check if the previous consume operation created space

space?

what does this mean?

> + *
> + * Returns true if the last call to __ptr_ring_consume() has created
> + * space in the ring buffer (i.e., an element was consumed).
> + *
> + * Note: This function is only valid immediately after a single call to
> + * __ptr_ring_consume(). If multiple calls to ptr_ring_consume*() have
> + * been made, this check must be performed after each call individually.
> + * Likewise, do not use this function after calling
> + * __ptr_ring_consume_batched().

API-wise, it is a really weird function.  So is 

{
	p = __ptr_ring_consume

	return !!p
}

guaranteed to be equivalent to 

{
	p = __ptr_ring_consume

	return !!__ptr_ring_consume_created_space
}



> + */
> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r)
> +{
> +	return r->consumer_tail >= r->consumer_head;
> +}
> +
>  /* Cast to structure type and call a function without discarding from FIFO.
>   * Function must return a value.
>   * Callers must take consumer_lock.
> -- 
> 2.43.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space
  2025-11-25 15:01   ` Michael S. Tsirkin
@ 2025-11-25 16:12     ` Simon Schippers
  2025-11-25 17:18       ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2025-11-25 16:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On 11/25/25 16:01, Michael S. Tsirkin wrote:
> On Thu, Nov 20, 2025 at 04:29:07PM +0100, Simon Schippers wrote:
>> Add __ptr_ring_consume_created_space() to check whether the previous
>> __ptr_ring_consume() call successfully consumed an element and created
>> space in the ring buffer. This enables callers to conditionally notify
>> producers when space becomes available.
>>
>> The function is only valid immediately after a single consume operation
>> and should not be used after calling __ptr_ring_consume_batched().
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Co-developed-by: Jon Kohler <jon@nutanix.com>
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>>  include/linux/ptr_ring.h | 17 +++++++++++++++++
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>> index da141cc8b075..76d6840b45a3 100644
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -453,6 +453,23 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>>  	return ret;
>>  }
>>  
>> +/*
>> + * Check if the previous consume operation created space
> 
> space?
> 
> what does this mean?
> 
>> + *
>> + * Returns true if the last call to __ptr_ring_consume() has created
>> + * space in the ring buffer (i.e., an element was consumed).
>> + *
>> + * Note: This function is only valid immediately after a single call to
>> + * __ptr_ring_consume(). If multiple calls to ptr_ring_consume*() have
>> + * been made, this check must be performed after each call individually.
>> + * Likewise, do not use this function after calling
>> + * __ptr_ring_consume_batched().
> 
> API-wise, it is a really weird function.  So is 
> 
> {
> 	p = __ptr_ring_consume
> 
> 	return !!p
> }
> 
> guaranteed to be equivalent to 
> 
> {
> 	p = __ptr_ring_consume
> 
> 	return !!__ptr_ring_consume_created_space
> }

I am a bit confused. You were the one recommending this function to me,
see [1].

Maybe the comments need rework here, but the function should be fine.

Thanks

[1] Link: https://lore.kernel.org/netdev/20250922221553.47802-1-simon.schippers@tu-dortmund.de/T/#mb722e8ae4ceb5df24f74305c6145561883d4e987

> 
> 
> 
>> + */
>> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r)
>> +{
>> +	return r->consumer_tail >= r->consumer_head;
>> +}
>> +
>>  /* Cast to structure type and call a function without discarding from FIFO.
>>   * Function must return a value.
>>   * Callers must take consumer_lock.
>> -- 
>> 2.43.0
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management
  2025-11-20 15:29 ` [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management Simon Schippers
@ 2025-11-25 16:54   ` Michael S. Tsirkin
  2025-11-26  9:23     ` Simon Schippers
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2025-11-25 16:54 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Thu, Nov 20, 2025 at 04:29:08PM +0100, Simon Schippers wrote:
> Implement new ring buffer produce and consume functions for tun and tap
> drivers that provide lockless producer-consumer synchronization and
> netdev queue management to prevent ptr_ring tail drop and permanent
> starvation.
> 
> - tun_ring_produce(): Produces packets to the ptr_ring with proper memory
>   barriers and proactively stops the netdev queue when the ring is about
>   to become full.
> 
> - __tun_ring_consume() / __tap_ring_consume(): Internal consume functions
>   that check if the netdev queue was stopped due to a full ring, and wake
>   it when space becomes available. Uses memory barriers to ensure proper
>   ordering between producer and consumer.
> 
> - tun_ring_consume() / tap_ring_consume(): Wrapper functions that acquire
>   the consumer lock before calling the internal consume functions.
> 
> Key features:
> - Proactive queue stopping using __ptr_ring_full_next() to stop the queue
>   before it becomes completely full.
> - Not stopping the queue when the ptr_ring is full already, because if
>   the consumer empties all entries in the meantime, stopping the queue
>   would cause permanent starvation.

what is permanent starvation? this comment seems to answer this
question:


	/* Do not stop the netdev queue if the ptr_ring is full already.
	 * The consumer could empty out the ptr_ring in the meantime
	 * without noticing the stopped netdev queue, resulting in a
	 * stopped netdev queue and an empty ptr_ring. In this case the
	 * netdev queue would stay stopped forever.
	 */


why is having a single entry in
the ring that we never use helpful to address this?




In fact, all your patch does to solve it, is check
netif_tx_queue_stopped on every consumed packet.


I already proposed:

static inline int __ptr_ring_peek_producer(struct ptr_ring *r)
{
        if (unlikely(!r->size) || r->queue[r->producer])
                return -ENOSPC;
        return 0;
}

And with that, why isn't avoiding the race as simple as
just rechecking after stopping the queue?

__ptr_ring_produce();
if (__ptr_ring_peek_producer())
	netif_tx_stop_queue
	if (!__ptr_ring_peek_producer())
		netif_tx_wake_queue(txq);







-- 
MST


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space
  2025-11-25 16:12     ` Simon Schippers
@ 2025-11-25 17:18       ` Michael S. Tsirkin
  0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2025-11-25 17:18 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Tue, Nov 25, 2025 at 05:12:35PM +0100, Simon Schippers wrote:
> On 11/25/25 16:01, Michael S. Tsirkin wrote:
> > On Thu, Nov 20, 2025 at 04:29:07PM +0100, Simon Schippers wrote:
> >> Add __ptr_ring_consume_created_space() to check whether the previous
> >> __ptr_ring_consume() call successfully consumed an element and created
> >> space in the ring buffer. This enables callers to conditionally notify
> >> producers when space becomes available.
> >>
> >> The function is only valid immediately after a single consume operation
> >> and should not be used after calling __ptr_ring_consume_batched().
> >>
> >> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> >> Co-developed-by: Jon Kohler <jon@nutanix.com>
> >> Signed-off-by: Jon Kohler <jon@nutanix.com>
> >> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> >> ---
> >>  include/linux/ptr_ring.h | 17 +++++++++++++++++
> >>  1 file changed, 17 insertions(+)
> >>
> >> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> >> index da141cc8b075..76d6840b45a3 100644
> >> --- a/include/linux/ptr_ring.h
> >> +++ b/include/linux/ptr_ring.h
> >> @@ -453,6 +453,23 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
> >>  	return ret;
> >>  }
> >>  
> >> +/*
> >> + * Check if the previous consume operation created space
> > 
> > space?
> > 
> > what does this mean?
> > 
> >> + *
> >> + * Returns true if the last call to __ptr_ring_consume() has created
> >> + * space in the ring buffer (i.e., an element was consumed).
> >> + *
> >> + * Note: This function is only valid immediately after a single call to
> >> + * __ptr_ring_consume(). If multiple calls to ptr_ring_consume*() have
> >> + * been made, this check must be performed after each call individually.
> >> + * Likewise, do not use this function after calling
> >> + * __ptr_ring_consume_batched().
> > 
> > API-wise, it is a really weird function.  So is 
> > 
> > {
> > 	p = __ptr_ring_consume
> > 
> > 	return !!p
> > }
> > 
> > guaranteed to be equivalent to 
> > 
> > {
> > 	p = __ptr_ring_consume
> > 
> > 	return !!__ptr_ring_consume_created_space
> > }
> 
> I am a bit confused. You were the one recommending this function to me,
> see [1].
> 
> Maybe the comments need rework here, but the function should be fine.
> 
> Thanks
> 
> [1] Link: https://lore.kernel.org/netdev/20250922221553.47802-1-simon.schippers@tu-dortmund.de/T/#mb722e8ae4ceb5df24f74305c6145561883d4e987


I see, the "(an element was consumed)" part confused instead of clarifying.
That is not the question - the caller already knows the element was consumed.



Let me try:

Returns true if the last call to __ptr_ring_consume() has created
space in the ring buffer (i.e., a new element can be produced).


Note: Because of batching, a successful call to __ptr_ring_consume
does not guarantee that the next call to __ptr_ring_produce
will succeed.

Note2: This function is only valid immediately after a single call to
__ptr_ring_consume(). If multiple calls to ptr_ring_consume*() have
been made, and you want to know whether any of them created space,
it is not enough to call this after the last __ptr_ring_consume -
instead, this check must be performed after each call individually.
Likewise, do not use this function after calling
__ptr_ring_consume_batched().
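
It might also be worth showing the intended pattern, something like this
(just a sketch; process() and wake_producer() are placeholders for whatever
the caller does, consumer_lock held):

	while ((ptr = __ptr_ring_consume(r))) {
		process(ptr);
		/* check after every consume, not once at the end */
		if (__ptr_ring_consume_created_space(r))
			wake_producer();
	}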




> > 
> > 
> > 
> >> + */
> >> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r)
> >> +{
> >> +	return r->consumer_tail >= r->consumer_head;
> >> +}
> >> +
> >>  /* Cast to structure type and call a function without discarding from FIFO.
> >>   * Function must return a value.
> >>   * Callers must take consumer_lock.
> >> -- 
> >> 2.43.0
> > 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
  2025-11-25 14:04           ` Simon Schippers
@ 2025-11-26  6:19             ` Jason Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Wang @ 2025-11-26  6:19 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, andrew+netdev, davem, edumazet, kuba,
	pabeni, mst, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Tue, Nov 25, 2025 at 10:05 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> On 11/25/25 02:34, Jason Wang wrote:
> > On Mon, Nov 24, 2025 at 5:20 PM Simon Schippers
> > <simon.schippers@tu-dortmund.de> wrote:
> >>
> >> On 11/24/25 02:04, Jason Wang wrote:
> >>> On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
> >>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>
> >>>> On 11/21/25 07:19, Jason Wang wrote:
> >>>>> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
> >>>>> <simon.schippers@tu-dortmund.de> wrote:
> >>>>>>
> >>>>>> This patch series deals with tun/tap and vhost-net which drop incoming
> >>>>>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
> >>>>>> patch series, the associated netdev queue is stopped before this happens.
> >>>>>> This allows the connected qdisc to function correctly as reported by [1]
> >>>>>> and improves application-layer performance, see our paper [2]. Meanwhile
> >>>>>> the theoretical performance differs only slightly:
> >>>>>>
> >>>>>> +--------------------------------+-----------+----------+
> >>>>>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
> >>>>>> | i5 6300HQ, 20M packets         |           |          |
> >>>>>> +-----------------+--------------+-----------+----------+
> >>>>>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
> >>>>>> |                 +--------------+-----------+----------+
> >>>>>> |                 | Lost         | 1615 Kpps | 0 pps    |
> >>>>>> +-----------------+--------------+-----------+----------+
> >>>>>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
> >>>>>> |                 +--------------+-----------+----------+
> >>>>>> |                 | Lost         | 1164 Kpps | 0 pps    |
> >>>>>> +-----------------+--------------+-----------+----------+
> >>>>>
> >>>>
> >>>> Hi Jason,
> >>>>
> >>>> thank you for your reply!
> >>>>
> >>>>> PPS drops somehow for TAP, any reason for that?
> >>>>
> >>>> I have no explicit explanation for that except general overheads coming
> >>>> with this implementation.
> >>>
> >>> It would be better to fix that.
> >>>
> >>>>
> >>>>>
> >>>>> Btw, I had some questions:
> >>>>>
> >>>>> 1) most of the patches in this series would introduce a non-trivial
> >>>>> impact on performance, so we probably need to benchmark each or split
> >>>>> the series. What's more, we need to run TCP benchmarks
> >>>>> (throughput/latency) as well as pktgen to see the real impact
> >>>>
> >>>> What could be done, IMO, is to activate tun_ring_consume() /
> >>>> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
> >>>> if this alone drops performance.
> >>>>
> >>>> For TCP benchmarks, you mean userspace performance like iperf3 between a
> >>>> host and a guest system?
> >>>
> >>> Yes,
> >>>
> >>>>
> >>>>>
> >>>>> 2) I see this:
> >>>>>
> >>>>>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
> >>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>>                 goto drop;
> >>>>>         }
> >>>>>
> >>>>> So there could still be packet drop? Or is this related to the XDP path?
> >>>>
> >>>> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
> >>>> unconsume. Since those two happen so rarely, I figured we should just
> >>>> drop in this case.
> >>>>
> >>>>>
> >>>>> 3) The LLTX change would have performance implications, but the
> >>>>> benchmark doesn't cover the case where multiple transmissions are done in
> >>>>> parallel
> >>>>
> >>>> Do you mean multiple applications that produce traffic and potentially
> >>>> run on different CPUs?
> >>>
> >>> Yes.
> >>>
> >>>>
> >>>>>
> >>>>> 4) After the LLTX change, it seems we've lost the synchronization with
> >>>>> the XDP_TX and XDP_REDIRECT path?
> >>>>
> >>>> I must admit I did not take a look at XDP and cannot really judge if/how
> >>>> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
> >>>> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
> >>>> call and I do not see the impact for XDP, which calls its own methods.
> >>>
> >>> Without LLTX tun_net_xmit is protected by tx lock but it is not the
> >>> case of tun_xdp_xmit. This is because, unlike other devices, tun
> >>> doesn't have a dedicated TX queue for XDP, so the queue is shared by
> >>> both XDP and skb. So XDP xmit path needs to be protected with tx lock
> >>> as well, and since we don't have queue discipline for XDP, it means we
> >>> could still drop packets when XDP is enabled. I'm not sure this would
> >>> defeat the whole idea or not.
> >>
> >> Good point.
> >>
> >>>
> >>>>>
> >>>>> 5) The series introduces various ptr_ring helpers with lots of
> >>>>> ordering stuff which is complicated, I wonder if we first have a
> >>>>> simple patch to implement the zero packet loss
> >>>>
> >>>> I personally don't see how a simpler patch is possible without using
> >>>> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
> >>>> spin locking between producer and consumer. But I am open for
> >>>> suggestions :)
> >>>
> >>> I see NETDEV_TX_BUSY is used by veth:
> >>>
> >>> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> >>> {
> >>>         if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
> >>>                 return NETDEV_TX_BUSY; /* signal qdisc layer */
> >>>
> >>>         return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> >>> }
> >>>
> >>> Maybe it would be simpler to start from that (probably with a new tun->flags?).
> >>>
> >>> Thanks
> >>
> >> Do you mean that this patchset could be implemented using the same
> >> approach that was used for veth in [1]?
> >> This could then also fix the XDP path.
> >
> > I think so.
>
> Okay, I will do so and submit a v7 when net-next opens again for 6.19.
>
> >
> >>
> >> But is returning NETDEV_TX_BUSY fine in our case?
> >
> > If it helps to avoid packet drop. But I'm not sure if qdisc is a must
> > in your case.
>
> I will try to avoid returning it.
>
> When no qdisc is connected, I will just drop like veth does.
>
> >
> >>
> >> Do you mean a flag that enables or disables the no-drop behavior?
> >
> > Yes, via a new flags that could be set via TUNSETIFF.
> >
> > Thanks
>
> I am not a fan of that, since I cannot imagine a use case where
> dropping packets is desired.

Right, it's just for the case where we see a regression in some specific test.

> veth does not introduce a flag either.
>
> Of course, if there is a major performance degradation, it makes sense.
> But I will benchmark it, and we will see.

Exactly.

Thanks

>
> Thank you!
>
> >
> >>
> >> Thanks!
> >>
> >> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#u
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> This patch series includes tun/tap, and vhost-net because they share
> >>>>>> logic. Adjusting only one of them would break the others. Therefore, the
> >>>>>> patch series is structured as follows:
> >>>>>> 1+2: new ptr_ring helpers for 3
> >>>>>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
> >>>>>> management
> >>>>>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
> >>>>>> vhost-net
> >>>>>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
> >>>>>> not break git bisect
> >>>>>> 8: tun/tap: drop get ring exports (not used anymore)
> >>>>>>
> >>>>>> Possible future work:
> >>>>>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
> >>>>>
> >>>>> This seems to be not easy. The tx completion depends on the userspace behaviour.
> >>>>
> >>>> I agree, but I really would like to reduce the buffer bloat caused by the
> >>>> default 500 TUN / 1000 TAP packet queue without losing performance.
> >>>>
> >>>>>
> >>>>>> - Adaption of the netdev queue flow control for ipvtap & macvtap
> >>>>>>
> >>>>>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
> >>>>>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
> >>>>>>
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>> Thanks! :)
> >>>>
> >>>
> >>
> >
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
  2025-11-21  9:22   ` Simon Schippers
  2025-11-24  1:04     ` Jason Wang
@ 2025-11-26  7:15     ` Michael S. Tsirkin
  2025-11-26  9:24       ` Simon Schippers
  1 sibling, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2025-11-26  7:15 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Fri, Nov 21, 2025 at 10:22:54AM +0100, Simon Schippers wrote:
> I agree, but I really would like to reduce the buffer bloat caused by the
> default 500 TUN / 1000 TAP packet queue without losing performance.

that default is part of the userspace API and can't be changed.
just change whatever userspace is creating your device.

-- 
MST


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management
  2025-11-25 16:54   ` Michael S. Tsirkin
@ 2025-11-26  9:23     ` Simon Schippers
  2025-11-26 15:25       ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2025-11-26  9:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On 11/25/25 17:54, Michael S. Tsirkin wrote:
> On Thu, Nov 20, 2025 at 04:29:08PM +0100, Simon Schippers wrote:
>> Implement new ring buffer produce and consume functions for tun and tap
>> drivers that provide lockless producer-consumer synchronization and
>> netdev queue management to prevent ptr_ring tail drop and permanent
>> starvation.
>>
>> - tun_ring_produce(): Produces packets to the ptr_ring with proper memory
>>   barriers and proactively stops the netdev queue when the ring is about
>>   to become full.
>>
>> - __tun_ring_consume() / __tap_ring_consume(): Internal consume functions
>>   that check if the netdev queue was stopped due to a full ring, and wake
>>   it when space becomes available. Uses memory barriers to ensure proper
>>   ordering between producer and consumer.
>>
>> - tun_ring_consume() / tap_ring_consume(): Wrapper functions that acquire
>>   the consumer lock before calling the internal consume functions.
>>
>> Key features:
>> - Proactive queue stopping using __ptr_ring_full_next() to stop the queue
>>   before it becomes completely full.
>> - Not stopping the queue when the ptr_ring is full already, because if
>>   the consumer empties all entries in the meantime, stopping the queue
>>   would cause permanent starvation.
> 
> what is permanent starvation? this comment seems to answer this
> question:
> 
> 
> 	/* Do not stop the netdev queue if the ptr_ring is full already.
> 	 * The consumer could empty out the ptr_ring in the meantime
> 	 * without noticing the stopped netdev queue, resulting in a
> 	 * stopped netdev queue and an empty ptr_ring. In this case the
> 	 * netdev queue would stay stopped forever.
> 	 */
> 
> 
> why is having a single entry in
> the ring that we never use helpful to address this?
> 
> 
> 
> 
> In fact, all your patch does to solve it, is check
> netif_tx_queue_stopped on every consumed packet.
> 
> 
> I already proposed:
> 
> static inline int __ptr_ring_peek_producer(struct ptr_ring *r)
> {
>         if (unlikely(!r->size) || r->queue[r->producer])
>                 return -ENOSPC;
>         return 0;
> }
> 
> And with that, why isn't avoiding the race as simple as
> just rechecking after stopping the queue?
 
I think you are right and that is quite similar to what veth [1] does.
However, there are two differences:

- Your approach avoids returning NETDEV_TX_BUSY by already stopping
  when the ring becomes full (and not when the ring is full already)
- ...and the recheck of the producer wakes on !full instead of empty.

I like both aspects better than the veth implementation.

Just one thing: like the veth implementation, we probably need a
smp_mb__after_atomic() after netif_tx_stop_queue() as they also discussed
in their v6 [2].
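
Putting it together, the produce path would then look roughly like this
(a sketch only, using your proposed __ptr_ring_peek_producer(); tfile and
txq are the usual tun names, not final code):

	if (unlikely(__ptr_ring_produce(&tfile->tx_ring, skb)))
		goto drop;

	if (unlikely(__ptr_ring_peek_producer(&tfile->tx_ring))) {
		netif_tx_stop_queue(txq);
		/* make the stopped queue visible before re-reading the ring */
		smp_mb__after_atomic();
		/* re-check: the consumer may have freed a slot in between */
		if (!__ptr_ring_peek_producer(&tfile->tx_ring))
			netif_tx_wake_queue(txq);
	}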


On the consumer side, I would then just do:

__ptr_ring_consume();
if (unlikely(__ptr_ring_consume_created_space()))
    netif_tx_wake_queue(txq);

Right?

And for the batched consume method, I would just call this in a loop.
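
Roughly like this (a sketch only; r, batch, n and txq stand in for the
vhost-net/tap names):

	for (i = 0; i < n; i++) {
		batch[i] = __ptr_ring_consume(r);
		if (!batch[i])
			break;
		/* wake the producer as soon as a slot actually frees up */
		if (__ptr_ring_consume_created_space(r) &&
		    netif_tx_queue_stopped(txq))
			netif_tx_wake_queue(txq);
	}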

Thank you!

[1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#m2582fcc48901e2e845b20b89e0e7196951484e5f
[2] Link: https://lore.kernel.org/all/174549933665.608169.392044991754158047.stgit@firesoul/T/#m63f2deb86ffbd9ff3a27e1232077a3775606c14d

> 
> __ptr_ring_produce();
> if (__ptr_ring_peek_producer())
> 	netif_tx_stop_queue

smp_mb__after_atomic(); // Right here

> 	if (!__ptr_ring_peek_producer())
> 		netif_tx_wake_queue(txq);
> 
> 
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop
  2025-11-26  7:15     ` Michael S. Tsirkin
@ 2025-11-26  9:24       ` Simon Schippers
  0 siblings, 0 replies; 30+ messages in thread
From: Simon Schippers @ 2025-11-26  9:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, willemdebruijn.kernel, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On 11/26/25 08:15, Michael S. Tsirkin wrote:
> On Fri, Nov 21, 2025 at 10:22:54AM +0100, Simon Schippers wrote:
>> I agree, but I really would like to reduce the buffer bloat caused by the
>> default 500 TUN / 1000 TAP packet queue without losing performance.
> 
> that default is part of the userspace API and can't be changed.
> just change whatever userspace is creating your device.
> 

Yes, but I’m thinking about introducing a new interface flag like
IFF_BQL. However, as noted earlier, there are significant implementation
challenges.

I think there can be advantages for something like VPNs on mobile
devices, where the throughput varies between a few Mbit/s (a small TUN/TAP
queue is fine) and multiple Gbit/s (a bigger queue is needed).
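
For reference, the BQL hooks themselves would be roughly the usual pair
(just a sketch; where to call the completion side in tun is exactly the
hard part, since it depends on when userspace/vhost actually reads the
packet):

	/* producer side, after the skb is queued to the ptr_ring */
	netdev_tx_sent_queue(txq, skb->len);

	/* consumer side, after the packet has actually left the ring */
	netdev_tx_completed_queue(txq, 1, skb->len);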

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management
  2025-11-26  9:23     ` Simon Schippers
@ 2025-11-26 15:25       ` Michael S. Tsirkin
  2025-11-26 16:04         ` Simon Schippers
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2025-11-26 15:25 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Wed, Nov 26, 2025 at 10:23:50AM +0100, Simon Schippers wrote:
> On 11/25/25 17:54, Michael S. Tsirkin wrote:
> > On Thu, Nov 20, 2025 at 04:29:08PM +0100, Simon Schippers wrote:
> >> Implement new ring buffer produce and consume functions for tun and tap
> >> drivers that provide lockless producer-consumer synchronization and
> >> netdev queue management to prevent ptr_ring tail drop and permanent
> >> starvation.
> >>
> >> - tun_ring_produce(): Produces packets to the ptr_ring with proper memory
> >>   barriers and proactively stops the netdev queue when the ring is about
> >>   to become full.
> >>
> >> - __tun_ring_consume() / __tap_ring_consume(): Internal consume functions
> >>   that check if the netdev queue was stopped due to a full ring, and wake
> >>   it when space becomes available. Uses memory barriers to ensure proper
> >>   ordering between producer and consumer.
> >>
> >> - tun_ring_consume() / tap_ring_consume(): Wrapper functions that acquire
> >>   the consumer lock before calling the internal consume functions.
> >>
> >> Key features:
> >> - Proactive queue stopping using __ptr_ring_full_next() to stop the queue
> >>   before it becomes completely full.
> >> - Not stopping the queue when the ptr_ring is full already, because if
> >>   the consumer empties all entries in the meantime, stopping the queue
> >>   would cause permanent starvation.
> > 
> > what is permanent starvation? this comment seems to answer this
> > question:
> > 
> > 
> > 	/* Do not stop the netdev queue if the ptr_ring is full already.
> > 	 * The consumer could empty out the ptr_ring in the meantime
> > 	 * without noticing the stopped netdev queue, resulting in a
> > 	 * stopped netdev queue and an empty ptr_ring. In this case the
> > 	 * netdev queue would stay stopped forever.
> > 	 */
> > 
> > 
> > why is having a single entry in
> > the ring that we never use helpful to address this?
> > 
> > 
> > 
> > 
> > In fact, all your patch does to solve it, is check
> > netif_tx_queue_stopped on every consumed packet.
> > 
> > 
> > I already proposed:
> > 
> > static inline int __ptr_ring_peek_producer(struct ptr_ring *r)
> > {
> >         if (unlikely(!r->size) || r->queue[r->producer])
> >                 return -ENOSPC;
> >         return 0;
> > }
> > 
> > And with that, why isn't avoiding the race as simple as
> > just rechecking after stopping the queue?
>  
> I think you are right and that is quite similar to what veth [1] does.
> However, there are two differences:
> 
> - Your approach avoids returning NETDEV_TX_BUSY by already stopping
>   when the ring becomes full (and not when the ring is full already)
> - ...and the recheck of the producer wakes on !full instead of empty.
> 
> I like both aspects better than the veth implementation.

Right.

Though frankly, someone should just fix NETDEV_TX_BUSY already
at least with the most popular qdiscs.

It is a common situation and it is just annoying that every driver has
to come up with its own scheme.





> Just one thing: like the veth implementation, we probably need a
> smp_mb__after_atomic() after netif_tx_stop_queue() as they also discussed
> in their v6 [2].

yea makes sense.

> 
> On the consumer side, I would then just do:
> 
> __ptr_ring_consume();
> if (unlikely(__ptr_ring_consume_created_space()))
>     netif_tx_wake_queue(txq);
> 
> Right?
> 
> And for the batched consume method, I would just call this in a loop.

Well tun does not use batched consume does it?


> Thank you!
> 
> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#m2582fcc48901e2e845b20b89e0e7196951484e5f
> [2] Link: https://lore.kernel.org/all/174549933665.608169.392044991754158047.stgit@firesoul/T/#m63f2deb86ffbd9ff3a27e1232077a3775606c14d
> 
> > 
> > __ptr_ring_produce();
> > if (__ptr_ring_peek_producer())
> > 	netif_tx_stop_queue
> 
> smp_mb__after_atomic(); // Right here
> 
> > 	if (!__ptr_ring_peek_producer())
> > 		netif_tx_wake_queue(txq);
> > 
> > 
> > 
> > 
> > 
> > 
> > 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management
  2025-11-26 15:25       ` Michael S. Tsirkin
@ 2025-11-26 16:04         ` Simon Schippers
  2025-11-26 18:16           ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Simon Schippers @ 2025-11-26 16:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On 11/26/25 16:25, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2025 at 10:23:50AM +0100, Simon Schippers wrote:
>> On 11/25/25 17:54, Michael S. Tsirkin wrote:
>>> On Thu, Nov 20, 2025 at 04:29:08PM +0100, Simon Schippers wrote:
>>>> Implement new ring buffer produce and consume functions for tun and tap
>>>> drivers that provide lockless producer-consumer synchronization and
>>>> netdev queue management to prevent ptr_ring tail drop and permanent
>>>> starvation.
>>>>
>>>> - tun_ring_produce(): Produces packets to the ptr_ring with proper memory
>>>>   barriers and proactively stops the netdev queue when the ring is about
>>>>   to become full.
>>>>
>>>> - __tun_ring_consume() / __tap_ring_consume(): Internal consume functions
>>>>   that check if the netdev queue was stopped due to a full ring, and wake
>>>>   it when space becomes available. Uses memory barriers to ensure proper
>>>>   ordering between producer and consumer.
>>>>
>>>> - tun_ring_consume() / tap_ring_consume(): Wrapper functions that acquire
>>>>   the consumer lock before calling the internal consume functions.
>>>>
>>>> Key features:
>>>> - Proactive queue stopping using __ptr_ring_full_next() to stop the queue
>>>>   before it becomes completely full.
>>>> - Not stopping the queue when the ptr_ring is full already, because if
>>>>   the consumer empties all entries in the meantime, stopping the queue
>>>>   would cause permanent starvation.
>>>
>>> what is permanent starvation? this comment seems to answer this
>>> question:
>>>
>>>
>>> 	/* Do not stop the netdev queue if the ptr_ring is full already.
>>> 	 * The consumer could empty out the ptr_ring in the meantime
>>> 	 * without noticing the stopped netdev queue, resulting in a
>>> 	 * stopped netdev queue and an empty ptr_ring. In this case the
>>> 	 * netdev queue would stay stopped forever.
>>> 	 */
>>>
>>>
>>> why is having a single entry in
>>> the ring that we never use helpful to address this?
>>>
>>>
>>>
>>>
>>> In fact, all your patch does to solve it, is check
>>> netif_tx_queue_stopped on every consumed packet.
>>>
>>>
>>> I already proposed:
>>>
>>> static inline int __ptr_ring_peek_producer(struct ptr_ring *r)
>>> {
>>>         if (unlikely(!r->size) || r->queue[r->producer])
>>>                 return -ENOSPC;
>>>         return 0;
>>> }
>>>
>>> And with that, why isn't avoiding the race as simple as
>>> just rechecking after stopping the queue?
>>  
>> I think you are right and that is quite similar to what veth [1] does.
>> However, there are two differences:
>>
>> - Your approach avoids returning NETDEV_TX_BUSY by already stopping
>>   when the ring becomes full (and not when the ring is full already)
>> - ...and the recheck of the producer wakes on !full instead of empty.
>>
>> I like both aspects better than the veth implementation.
> 
> Right.
> 
> Though frankly, someone should just fix NETDEV_TX_BUSY already
> at least with the most popular qdiscs.
> 
> It is a common situation and it is just annoying that every driver has
> to come up with its own scheme.

I cannot judge it, but yes, it would have made this patchset much
simpler.

> 
> 
> 
> 
> 
>> Just one thing: like the veth implementation, we probably need a
>> smp_mb__after_atomic() after netif_tx_stop_queue() as they also discussed
>> in their v6 [2].
> 
> yea makes sense.
> 
>>
>> On the consumer side, I would then just do:
>>
>> __ptr_ring_consume();
>> if (unlikely(__ptr_ring_consume_created_space()))
>>     netif_tx_wake_queue(txq);
>>
>> Right?
>>
>> And for the batched consume method, I would just call this in a loop.
> 
> Well tun does not use batched consume does it?

tun does not but vhost-net does.

Since vhost-net also uses tun_net_xmit() as its ndo_start_xmit in a
tap+vhost-net setup, its consumer must also be changed. Else
tun_net_xmit() would stop the queue, but it would never be woken again.

> 
> 
>> Thank you!
>>
>> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#m2582fcc48901e2e845b20b89e0e7196951484e5f
>> [2] Link: https://lore.kernel.org/all/174549933665.608169.392044991754158047.stgit@firesoul/T/#m63f2deb86ffbd9ff3a27e1232077a3775606c14d
>>
>>>
>>> __ptr_ring_produce();
>>> if (__ptr_ring_peek_producer())
>>> 	netif_tx_stop_queue
>>
>> smp_mb__after_atomic(); // Right here
>>
>>> 	if (!__ptr_ring_peek_producer())
>>> 		netif_tx_wake_queue(txq);
>>>
>>>
>>>
>>>
>>>
>>>
>>>
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management
  2025-11-26 16:04         ` Simon Schippers
@ 2025-11-26 18:16           ` Michael S. Tsirkin
  0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2025-11-26 18:16 UTC (permalink / raw)
  To: Simon Schippers
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, jon, tim.gebauer, netdev, linux-kernel,
	kvm, virtualization

On Wed, Nov 26, 2025 at 05:04:25PM +0100, Simon Schippers wrote:
> On 11/26/25 16:25, Michael S. Tsirkin wrote:
> > On Wed, Nov 26, 2025 at 10:23:50AM +0100, Simon Schippers wrote:
> >> On 11/25/25 17:54, Michael S. Tsirkin wrote:
> >>> On Thu, Nov 20, 2025 at 04:29:08PM +0100, Simon Schippers wrote:
> >>>> Implement new ring buffer produce and consume functions for tun and tap
> >>>> drivers that provide lockless producer-consumer synchronization and
> >>>> netdev queue management to prevent ptr_ring tail drop and permanent
> >>>> starvation.
> >>>>
> >>>> - tun_ring_produce(): Produces packets to the ptr_ring with proper memory
> >>>>   barriers and proactively stops the netdev queue when the ring is about
> >>>>   to become full.
> >>>>
> >>>> - __tun_ring_consume() / __tap_ring_consume(): Internal consume functions
> >>>>   that check if the netdev queue was stopped due to a full ring, and wake
> >>>>   it when space becomes available. Uses memory barriers to ensure proper
> >>>>   ordering between producer and consumer.
> >>>>
> >>>> - tun_ring_consume() / tap_ring_consume(): Wrapper functions that acquire
> >>>>   the consumer lock before calling the internal consume functions.
> >>>>
> >>>> Key features:
> >>>> - Proactive queue stopping using __ptr_ring_full_next() to stop the queue
> >>>>   before it becomes completely full.
> >>>> - Not stopping the queue when the ptr_ring is full already, because if
> >>>>   the consumer empties all entries in the meantime, stopping the queue
> >>>>   would cause permanent starvation.
> >>>
> >>> what is permanent starvation? this comment seems to answer this
> >>> question:
> >>>
> >>>
> >>> 	/* Do not stop the netdev queue if the ptr_ring is full already.
> >>> 	 * The consumer could empty out the ptr_ring in the meantime
> >>> 	 * without noticing the stopped netdev queue, resulting in a
> >>> 	 * stopped netdev queue and an empty ptr_ring. In this case the
> >>> 	 * netdev queue would stay stopped forever.
> >>> 	 */
> >>>
> >>>
> >>> why is having a single entry in
> >>> the ring that we never use helpful to address this?
> >>>
> >>>
> >>>
> >>>
> >>> In fact, all your patch does to solve it, is check
> >>> netif_tx_queue_stopped on every consumed packet.
> >>>
> >>>
> >>> I already proposed:
> >>>
> >>> static inline int __ptr_ring_peek_producer(struct ptr_ring *r)
> >>> {
> >>>         if (unlikely(!r->size) || r->queue[r->producer])
> >>>                 return -ENOSPC;
> >>>         return 0;
> >>> }
> >>>
> >>> And with that, why isn't avoiding the race as simple as
> >>> just rechecking after stopping the queue?
> >>  
> >> I think you are right and that is quite similar to what veth [1] does.
> >> However, there are two differences:
> >>
> >> - Your approach avoids returning NETDEV_TX_BUSY by already stopping
> >>   when the ring becomes full (and not when the ring is full already)
> >> - ...and the recheck of the producer wakes on !full instead of empty.
> >>
> >> I like both aspects better than the veth implementation.
> > 
> > Right.
> > 
> > Though frankly, someone should just fix NETDEV_TX_BUSY already
> > at least with the most popular qdiscs.
> > 
> > It is a common situation and it is just annoying that every driver has
> > to come up with its own scheme.
> 
> I cannot judge it, but yes, it would have made this patchset much
> simpler.
> 
> > 
> > 
> > 
> > 
> > 
> >> Just one thing: like the veth implementation, we probably need a
> >> smp_mb__after_atomic() after netif_tx_stop_queue() as they also discussed
> >> in their v6 [2].
> > 
> > yea makes sense.
> > 
> >>
> >> On the consumer side, I would then just do:
> >>
> >> __ptr_ring_consume();
> >> if (unlikely(__ptr_ring_consume_created_space()))
> >>     netif_tx_wake_queue(txq);
> >>
> >> Right?
> >>
> >> And for the batched consume method, I would just call this in a loop.
> > 
> > Well tun does not use batched consume does it?
> 
> tun does not but vhost-net does.
> 
> Since vhost-net also uses tun_net_xmit() as its ndo_start_xmit in a
> tap+vhost-net setup, its consumer must also be changed. Else
> tun_net_xmit() would stop the queue, but it would never be woken again.


Ah, ok.



> > 
> > 
> >> Thank you!
> >>
> >> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#m2582fcc48901e2e845b20b89e0e7196951484e5f
> >> [2] Link: https://lore.kernel.org/all/174549933665.608169.392044991754158047.stgit@firesoul/T/#m63f2deb86ffbd9ff3a27e1232077a3775606c14d
> >>
> >>>
> >>> __ptr_ring_produce();
> >>> if (__ptr_ring_peek_producer())
> >>> 	netif_tx_stop_queue
> >>
> >> smp_mb__after_atomic(); // Right here
> >>
> >>> 	if (!__ptr_ring_peek_producer())
> >>> 		netif_tx_wake_queue(txq);
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> > 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space
  2025-11-20 15:29 ` [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space Simon Schippers
  2025-11-25 15:01   ` Michael S. Tsirkin
@ 2025-11-30 18:16   ` Willem de Bruijn
  1 sibling, 0 replies; 30+ messages in thread
From: Willem de Bruijn @ 2025-11-30 18:16 UTC (permalink / raw)
  To: Simon Schippers, willemdebruijn.kernel, jasowang, andrew+netdev,
	davem, edumazet, kuba, pabeni, mst, eperezma, jon, tim.gebauer,
	simon.schippers, netdev, linux-kernel, kvm, virtualization

Simon Schippers wrote:
> Add __ptr_ring_consume_created_space() to check whether the previous
> __ptr_ring_consume() call successfully consumed an element and created
> space in the ring buffer. This enables callers to conditionally notify
> producers when space becomes available.
> 
> The function is only valid immediately after a single consume operation
> and should not be used after calling __ptr_ring_consume_batched().

Please explain why it is only valid in that case.
 
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Co-developed-by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
>  include/linux/ptr_ring.h | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index da141cc8b075..76d6840b45a3 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -453,6 +453,23 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>  	return ret;
>  }
>  
> +/*
> + * Check if the previous consume operation created space
> + *
> + * Returns true if the last call to __ptr_ring_consume() has created
> + * space in the ring buffer (i.e., an element was consumed).
> + *
> + * Note: This function is only valid immediately after a single call to
> + * __ptr_ring_consume(). If multiple calls to ptr_ring_consume*() have
> + * been made, this check must be performed after each call individually.
> + * Likewise, do not use this function after calling
> + * __ptr_ring_consume_batched().
> + */
> +static inline bool __ptr_ring_consume_created_space(struct ptr_ring *r)
> +{
> +	return r->consumer_tail >= r->consumer_head;
> +}
> +
>  /* Cast to structure type and call a function without discarding from FIFO.
>   * Function must return a value.
>   * Callers must take consumer_lock.
> -- 
> 2.43.0
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-11-30 18:16 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-20 15:29 [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
2025-11-20 15:29 ` [PATCH net-next v6 1/8] ptr_ring: add __ptr_ring_full_next() to predict imminent fullness Simon Schippers
2025-11-25 14:54   ` Michael S. Tsirkin
2025-11-20 15:29 ` [PATCH net-next v6 2/8] ptr_ring: add helper to check if consume created space Simon Schippers
2025-11-25 15:01   ` Michael S. Tsirkin
2025-11-25 16:12     ` Simon Schippers
2025-11-25 17:18       ` Michael S. Tsirkin
2025-11-30 18:16   ` Willem de Bruijn
2025-11-20 15:29 ` [PATCH net-next v6 3/8] tun/tap: add synchronized ring produce/consume with queue management Simon Schippers
2025-11-25 16:54   ` Michael S. Tsirkin
2025-11-26  9:23     ` Simon Schippers
2025-11-26 15:25       ` Michael S. Tsirkin
2025-11-26 16:04         ` Simon Schippers
2025-11-26 18:16           ` Michael S. Tsirkin
2025-11-20 15:29 ` [PATCH net-next v6 4/8] tun/tap: add batched ring consume function Simon Schippers
2025-11-20 15:29 ` [PATCH net-next v6 5/8] tun/tap: add unconsume function for returning entries to ring Simon Schippers
2025-11-20 15:29 ` [PATCH net-next v6 6/8] tun/tap: add helper functions to check file type Simon Schippers
2025-11-20 15:29 ` [PATCH net-next v6 7/8] tun/tap & vhost-net: use {tun|tap}_ring_{consume|produce} to avoid tail drops Simon Schippers
2025-11-20 15:29 ` [PATCH net-next v6 7/8] tun/tap/vhost: " Simon Schippers
2025-11-20 15:29 ` [PATCH net-next v6 8/8] tun/tap: drop get ring exports Simon Schippers
2025-11-21  6:19 ` [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr_ring tail drop Jason Wang
2025-11-21  9:22   ` Simon Schippers
2025-11-24  1:04     ` Jason Wang
2025-11-24  9:19       ` Simon Schippers
2025-11-25  1:34         ` Jason Wang
2025-11-25 14:04           ` Simon Schippers
2025-11-26  6:19             ` Jason Wang
2025-11-26  7:15     ` Michael S. Tsirkin
2025-11-26  9:24       ` Simon Schippers
2025-11-21  9:18 ` [syzbot ci] " syzbot ci

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).