* [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop
@ 2025-09-02 8:09 Simon Schippers
2025-09-02 8:09 ` [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available Simon Schippers
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Simon Schippers @ 2025-09-02 8:09 UTC (permalink / raw)
To: willemdebruijn.kernel, jasowang, mst, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm
Cc: Simon Schippers
This patch series deals with TUN/TAP and vhost_net, which drop incoming
SKBs whenever their internal ptr_ring buffer is full. Instead, with this
patch series, the associated netdev queue is stopped before this happens.
This allows the connected qdisc to function correctly, as reported in [1],
and improves application-layer performance; see the benchmarks below.
This patch series includes TUN, TAP, and vhost_net because they share
logic. Adjusting only one of them would break the others. Therefore, the
patch series is structured as follows (the common stop/wake pattern is
sketched after this list):
1. New ptr_ring_spare helper to check if the ptr_ring has spare capacity
2. Netdev queue flow control for TUN: Logic for stopping the queue upon
full ptr_ring and waking the queue if ptr_ring has spare capacity
3. Additions for TAP: Similar logic for waking the queue
4. Additions for vhost_net: Calling TUN/TAP methods for waking the queue
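A condensed sketch of the common stop/wake pattern (illustrative only;
see the individual patches for the exact placement and the required
memory barriers):

	/* Producer side, in ndo_start_xmit: stop the queue once the
	 * ring is full, instead of tail-dropping the next SKB.
	 */
	if (ptr_ring_full(&tfile->tx_ring))
		netif_tx_stop_queue(queue);

	/* Consumer side, in the read path: wake the queue again as
	 * soon as at least one slot is free.
	 */
	if (netif_tx_queue_stopped(txq) &&
	    ptr_ring_spare(&tfile->tx_ring, 1))
		netif_tx_wake_queue(txq);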
Benchmarks ([2] & [3]):
- TUN: TCP throughput over a real-world 120 ms RTT OpenVPN connection
improved by 36% (117 Mbit/s vs 185 Mbit/s)
- TAP: TCP throughput to a local qemu VM stays the same (2.2 Gbit/s), an
improvement by a factor of 2 at an emulated 120 ms RTT (98 Mbit/s vs
198 Mbit/s)
- TAP+vhost_net: TCP throughput to a local qemu VM approx. the same
(23.4 Gbit/s vs 23.9 Gbit/s), same performance at an emulated 120 ms RTT
(200 Mbit/s)
- TUN/TAP/TAP+vhost_net: the ptr_ring size can be reduced to ~10 packets
without losing performance
Possible future work:
- Introduction of Byte Queue Limits (BQL), as suggested by Stephen
Hemminger; a rough sketch follows this list
- Adaptation of the netdev queue flow control for ipvtap & macvtap
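For the BQL item, a rough sketch of what such an integration could look
like, based on the existing dynamic queue limit helpers (not part of
this series; placement is illustrative):

	/* in tun_net_xmit, after the SKB was queued successfully */
	netdev_tx_sent_queue(queue, skb->len);

	/* in tun_ring_recv, after consuming an SKB of len bytes */
	netdev_tx_completed_queue(queue, 1, len);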
[1] Link:
https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
[2] Link:
https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
[3] Link: https://github.com/tudo-cni/nodrop
Links to previous versions:
V3:
https://lore.kernel.org/netdev/20250825211832.84901-1-simon.schippers@tu-dortmund.de/T/#u
V2:
https://lore.kernel.org/netdev/20250811220430.14063-1-simon.schippers@tu-dortmund.de/T/#u
V1:
https://lore.kernel.org/netdev/20250808153721.261334-1-simon.schippers@tu-dortmund.de/T/#u
Changelog:
V3 -> V4:
- Target net-next instead of net
- Changed to a patch series instead of a single patch
- Changed the title from the old
"TUN/TAP: Improving throughput and latency by avoiding SKB drops"
- Wake the netdev queue with the new helper wake_netdev_queue when there
is any spare capacity in the ptr_ring instead of waiting for it to be
empty
- Use tun_file instead of tun_struct in tun_ring_recv for more consistent
logic
- Use an smp_wmb()/smp_rmb() barrier pair, which avoids the rare packet
drops seen before
- Use safer logic for vhost_net using RCU read locks to access TUN/TAP data
V2 -> V3: Added support for TAP and TAP+vhost_net.
V1 -> V2: Removed NETDEV_TX_BUSY return case in tun_net_xmit and removed
unnecessary netif_tx_wake_queue in tun_ring_recv.
Simon Schippers (4):
ptr_ring_spare: Helper to check if spare capacity of size cnt is
available
netdev queue flow control for TUN
netdev queue flow control for TAP
netdev queue flow control for vhost_net
drivers/net/tap.c | 28 ++++++++++++++++
drivers/net/tun.c | 39 ++++++++++++++++++++--
drivers/vhost/net.c | 34 +++++++++++++++----
include/linux/if_tap.h | 2 ++
include/linux/if_tun.h | 3 ++
include/linux/ptr_ring.h | 71 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 168 insertions(+), 9 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available
2025-09-02 8:09 [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
@ 2025-09-02 8:09 ` Simon Schippers
2025-09-02 21:13 ` Willem de Bruijn
2025-09-03 13:05 ` Michael S. Tsirkin
2025-09-02 8:09 ` [PATCH 2/4] netdev queue flow control for TUN Simon Schippers
` (3 subsequent siblings)
4 siblings, 2 replies; 24+ messages in thread
From: Simon Schippers @ 2025-09-02 8:09 UTC (permalink / raw)
To: willemdebruijn.kernel, jasowang, mst, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm
Cc: Simon Schippers, Tim Gebauer
The implementation is inspired by ptr_ring_empty.
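For reference, __ptr_ring_empty in include/linux/ptr_ring.h reads:

	static inline bool __ptr_ring_empty(struct ptr_ring *r)
	{
		if (likely(r->size))
			return !r->queue[READ_ONCE(r->consumer_head)];
		return true;
	}

ptr_ring_spare generalizes this check from the slot at consumer_head to
the slot cnt entries behind it.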
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
include/linux/ptr_ring.h | 71 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 71 insertions(+)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 551329220e4f..6b8cfaecf478 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -243,6 +243,77 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
return ret;
}
+/*
+ * Check if a spare capacity of cnt is available without taking any locks.
+ *
+ * If cnt==0 or cnt > r->size it acts the same as __ptr_ring_empty.
+ *
+ * The same requirements apply as described for __ptr_ring_empty.
+ */
+static inline bool __ptr_ring_spare(struct ptr_ring *r, int cnt)
+{
+ int size = r->size;
+ int to_check;
+
+ if (unlikely(!size || cnt < 0))
+ return true;
+
+ if (cnt > size)
+ cnt = 0;
+
+ to_check = READ_ONCE(r->consumer_head) - cnt;
+
+ if (to_check < 0)
+ to_check += size;
+
+ return !r->queue[to_check];
+}
+
+static inline bool ptr_ring_spare(struct ptr_ring *r, int cnt)
+{
+ bool ret;
+
+ spin_lock(&r->consumer_lock);
+ ret = __ptr_ring_spare(r, cnt);
+ spin_unlock(&r->consumer_lock);
+
+ return ret;
+}
+
+static inline bool ptr_ring_spare_irq(struct ptr_ring *r, int cnt)
+{
+ bool ret;
+
+ spin_lock_irq(&r->consumer_lock);
+ ret = __ptr_ring_spare(r, cnt);
+ spin_unlock_irq(&r->consumer_lock);
+
+ return ret;
+}
+
+static inline bool ptr_ring_spare_any(struct ptr_ring *r, int cnt)
+{
+ unsigned long flags;
+ bool ret;
+
+ spin_lock_irqsave(&r->consumer_lock, flags);
+ ret = __ptr_ring_spare(r, cnt);
+ spin_unlock_irqrestore(&r->consumer_lock, flags);
+
+ return ret;
+}
+
+static inline bool ptr_ring_spare_bh(struct ptr_ring *r, int cnt)
+{
+ bool ret;
+
+ spin_lock_bh(&r->consumer_lock);
+ ret = __ptr_ring_spare(r, cnt);
+ spin_unlock_bh(&r->consumer_lock);
+
+ return ret;
+}
+
/* Must only be called after __ptr_ring_peek returned !NULL */
static inline void __ptr_ring_discard_one(struct ptr_ring *r)
{
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 2/4] netdev queue flow control for TUN
2025-09-02 8:09 [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
2025-09-02 8:09 ` [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available Simon Schippers
@ 2025-09-02 8:09 ` Simon Schippers
2025-09-02 21:20 ` Willem de Bruijn
` (4 more replies)
2025-09-02 8:09 ` [PATCH 3/4] netdev queue flow control for TAP Simon Schippers
` (2 subsequent siblings)
4 siblings, 5 replies; 24+ messages in thread
From: Simon Schippers @ 2025-09-02 8:09 UTC (permalink / raw)
To: willemdebruijn.kernel, jasowang, mst, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm
Cc: Simon Schippers, Tim Gebauer
The netdev queue is stopped in tun_net_xmit after inserting an SKB into
the ring buffer if the ring buffer became full because of that. If the
insertion into the ptr_ring fails, the netdev queue is also stopped and
the SKB is dropped. However, this never happened in my testing. To ensure
that the ptr_ring change is available to the consumer before the netdev
queue stop, an smp_wmb() is used.
Then in tun_ring_recv, the new helper wake_netdev_queue is called in the
blocking wait queue and after consuming an SKB from the ptr_ring. This
helper first checks if the netdev queue has stopped. Then with the paired
smp_rmb() it is known that tun_net_xmit will not produce SKBs anymore.
With that knowledge, the helper can then wake the netdev queue if there is
at least a single spare slot in the ptr_ring by calling ptr_ring_spare
with cnt=1.
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
drivers/net/tun.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cc6c50180663..735498e221d8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1060,13 +1060,21 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
nf_reset_ct(skb);
- if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+ queue = netdev_get_tx_queue(dev, txq);
+ if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
+ /* Paired with smp_rmb() in wake_netdev_queue. */
+ smp_wmb();
+ netif_tx_stop_queue(queue);
drop_reason = SKB_DROP_REASON_FULL_RING;
goto drop;
}
+ if (ptr_ring_full(&tfile->tx_ring)) {
+ /* Paired with smp_rmb() in wake_netdev_queue. */
+ smp_wmb();
+ netif_tx_stop_queue(queue);
+ }
/* dev->lltx requires to do our own update of trans_start */
- queue = netdev_get_tx_queue(dev, txq);
txq_trans_cond_update(queue);
/* Notify and wake up reader process */
@@ -2110,6 +2118,24 @@ static ssize_t tun_put_user(struct tun_struct *tun,
return total;
}
+static inline void wake_netdev_queue(struct tun_file *tfile)
+{
+ struct netdev_queue *txq;
+ struct net_device *dev;
+
+ rcu_read_lock();
+ dev = rcu_dereference(tfile->tun)->dev;
+ txq = netdev_get_tx_queue(dev, tfile->queue_index);
+
+ if (netif_tx_queue_stopped(txq)) {
+ /* Paired with smp_wmb() in tun_net_xmit. */
+ smp_rmb();
+ if (ptr_ring_spare(&tfile->tx_ring, 1))
+ netif_tx_wake_queue(txq);
+ }
+ rcu_read_unlock();
+}
+
static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
{
DECLARE_WAITQUEUE(wait, current);
@@ -2139,7 +2165,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
error = -EFAULT;
break;
}
-
+ wake_netdev_queue(tfile);
schedule();
}
@@ -2147,6 +2173,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
remove_wait_queue(&tfile->socket.wq.wait, &wait);
out:
+ wake_netdev_queue(tfile);
*err = error;
return ptr;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 3/4] netdev queue flow control for TAP
2025-09-02 8:09 [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
2025-09-02 8:09 ` [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available Simon Schippers
2025-09-02 8:09 ` [PATCH 2/4] netdev queue flow control for TUN Simon Schippers
@ 2025-09-02 8:09 ` Simon Schippers
2025-09-02 8:09 ` [PATCH 4/4] netdev queue flow control for vhost_net Simon Schippers
2025-09-03 4:00 ` [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop Jason Wang
4 siblings, 0 replies; 24+ messages in thread
From: Simon Schippers @ 2025-09-02 8:09 UTC (permalink / raw)
To: willemdebruijn.kernel, jasowang, mst, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm
Cc: Simon Schippers, Tim Gebauer
Stopping the netdev queue is done in tun_net_xmit, as TAP uses this method
as its ndo_start_xmit.
To wake the queue, the new helper wake_netdev_queue is called in
tap_do_read. This is done in the blocking wait queue and after consuming
an SKB from the ptr_ring. This helper first checks if the netdev queue has
stopped. Then with the smp_rmb(), which is paired with the smp_wmb() of
tun_net_xmit, it is known that tun_net_xmit will not produce SKBs anymore.
With that knowledge, the helper can then wake the netdev queue if there is
at least a single spare slot. This check is done by calling the method
ptr_ring_spare.
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
drivers/net/tap.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 1197f245e873..4d874672bcd7 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -753,6 +753,24 @@ static ssize_t tap_put_user(struct tap_queue *q,
return ret ? ret : total;
}
+static inline void wake_netdev_queue(struct tap_queue *q)
+{
+ struct netdev_queue *txq;
+ struct net_device *dev;
+
+ rcu_read_lock();
+ dev = rcu_dereference(q->tap)->dev;
+ txq = netdev_get_tx_queue(dev, q->queue_index);
+
+ if (netif_tx_queue_stopped(txq)) {
+ /* Paired with smp_wmb() in tun_net_xmit. */
+ smp_rmb();
+ if (ptr_ring_spare(&q->ring, 1))
+ netif_tx_wake_queue(txq);
+ }
+ rcu_read_unlock();
+}
+
static ssize_t tap_do_read(struct tap_queue *q,
struct iov_iter *to,
int noblock, struct sk_buff *skb)
@@ -785,12 +803,16 @@ static ssize_t tap_do_read(struct tap_queue *q,
ret = -ERESTARTSYS;
break;
}
+ wake_netdev_queue(q);
+
/* Nothing to read, let's sleep */
schedule();
}
if (!noblock)
finish_wait(sk_sleep(&q->sk), &wait);
+ wake_netdev_queue(q);
+
put:
if (skb) {
ret = tap_put_user(q, skb, to);
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 4/4] netdev queue flow control for vhost_net
2025-09-02 8:09 [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
` (2 preceding siblings ...)
2025-09-02 8:09 ` [PATCH 3/4] netdev queue flow control for TAP Simon Schippers
@ 2025-09-02 8:09 ` Simon Schippers
2025-09-02 21:31 ` Willem de Bruijn
2025-09-03 4:00 ` [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop Jason Wang
4 siblings, 1 reply; 24+ messages in thread
From: Simon Schippers @ 2025-09-02 8:09 UTC (permalink / raw)
To: willemdebruijn.kernel, jasowang, mst, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm
Cc: Simon Schippers, Tim Gebauer
Stopping the queue is done in tun_net_xmit.
Waking the queue is done by calling one of the helpers,
tun_wake_netdev_queue and tap_wake_netdev_queue. For that, in
get_wake_netdev_queue, the correct method is determined and saved in the
function pointer wake_netdev_queue of the vhost_net_virtqueue. Then, each
time after consuming a batch in vhost_net_buf_produce, wake_netdev_queue
is called.
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
drivers/net/tap.c | 6 ++++++
drivers/net/tun.c | 6 ++++++
drivers/vhost/net.c | 34 ++++++++++++++++++++++++++++------
include/linux/if_tap.h | 2 ++
include/linux/if_tun.h | 3 +++
5 files changed, 45 insertions(+), 6 deletions(-)
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 4d874672bcd7..0bad9e3d59af 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1198,6 +1198,12 @@ struct socket *tap_get_socket(struct file *file)
}
EXPORT_SYMBOL_GPL(tap_get_socket);
+void tap_wake_netdev_queue(struct file *file)
+{
+ wake_netdev_queue(file->private_data);
+}
+EXPORT_SYMBOL_GPL(tap_wake_netdev_queue);
+
struct ptr_ring *tap_get_ptr_ring(struct file *file)
{
struct tap_queue *q;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 735498e221d8..e85589b596ac 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3739,6 +3739,12 @@ struct socket *tun_get_socket(struct file *file)
}
EXPORT_SYMBOL_GPL(tun_get_socket);
+void tun_wake_netdev_queue(struct file *file)
+{
+ wake_netdev_queue(file->private_data);
+}
+EXPORT_SYMBOL_GPL(tun_wake_netdev_queue);
+
struct ptr_ring *tun_get_tx_ring(struct file *file)
{
struct tun_file *tfile;
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 6edac0c1ba9b..e837d3a334f1 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -130,6 +130,7 @@ struct vhost_net_virtqueue {
struct vhost_net_buf rxq;
/* Batched XDP buffs */
struct xdp_buff *xdp;
+ void (*wake_netdev_queue)(struct file *f);
};
struct vhost_net {
@@ -175,13 +176,16 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
return ret;
}
-static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq,
+ struct sock *sk)
{
+ struct file *file = sk->sk_socket->file;
struct vhost_net_buf *rxq = &nvq->rxq;
rxq->head = 0;
rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
VHOST_NET_BATCH);
+ nvq->wake_netdev_queue(file);
return rxq->tail;
}
@@ -208,14 +212,15 @@ static int vhost_net_buf_peek_len(void *ptr)
return __skb_array_len_with_tag(ptr);
}
-static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
+static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq,
+ struct sock *sk)
{
struct vhost_net_buf *rxq = &nvq->rxq;
if (!vhost_net_buf_is_empty(rxq))
goto out;
- if (!vhost_net_buf_produce(nvq))
+ if (!vhost_net_buf_produce(nvq, sk))
return 0;
out:
@@ -994,7 +999,7 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
unsigned long flags;
if (rvq->rx_ring)
- return vhost_net_buf_peek(rvq);
+ return vhost_net_buf_peek(rvq, sk);
spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
head = skb_peek(&sk->sk_receive_queue);
@@ -1499,6 +1504,19 @@ static struct socket *get_tap_socket(int fd)
return sock;
}
+static void (*get_wake_netdev_queue(struct file *file))(struct file *file)
+{
+ struct ptr_ring *ring;
+
+ ring = tun_get_tx_ring(file);
+ if (!IS_ERR(ring))
+ return tun_wake_netdev_queue;
+ ring = tap_get_ptr_ring(file);
+ if (!IS_ERR(ring))
+ return tap_wake_netdev_queue;
+ return NULL;
+}
+
static struct socket *get_socket(int fd)
{
struct socket *sock;
@@ -1570,10 +1588,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
if (r)
goto err_used;
if (index == VHOST_NET_VQ_RX) {
- if (sock)
+ if (sock) {
nvq->rx_ring = get_tap_ptr_ring(sock->file);
- else
+ nvq->wake_netdev_queue =
+ get_wake_netdev_queue(sock->file);
+ } else {
nvq->rx_ring = NULL;
+ nvq->wake_netdev_queue = NULL;
+ }
}
oldubufs = nvq->ubufs;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 553552fa635c..02b2809784b5 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -10,6 +10,7 @@ struct socket;
#if IS_ENABLED(CONFIG_TAP)
struct socket *tap_get_socket(struct file *);
+void tap_wake_netdev_queue(struct file *file);
struct ptr_ring *tap_get_ptr_ring(struct file *file);
#else
#include <linux/err.h>
@@ -18,6 +19,7 @@ static inline struct socket *tap_get_socket(struct file *f)
{
return ERR_PTR(-EINVAL);
}
+static inline void tap_wake_netdev_queue(struct file *f) {}
static inline struct ptr_ring *tap_get_ptr_ring(struct file *f)
{
return ERR_PTR(-EINVAL);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 80166eb62f41..04c504bb1954 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -21,6 +21,7 @@ struct tun_msg_ctl {
#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
struct socket *tun_get_socket(struct file *);
+void tun_wake_netdev_queue(struct file *file);
struct ptr_ring *tun_get_tx_ring(struct file *file);
static inline bool tun_is_xdp_frame(void *ptr)
@@ -50,6 +51,8 @@ static inline struct socket *tun_get_socket(struct file *f)
return ERR_PTR(-EINVAL);
}
+static inline void tun_wake_netdev_queue(struct file *f) {}
+
static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
{
return ERR_PTR(-EINVAL);
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available
2025-09-02 8:09 ` [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available Simon Schippers
@ 2025-09-02 21:13 ` Willem de Bruijn
2025-09-03 3:13 ` Jason Wang
2025-09-03 13:05 ` Michael S. Tsirkin
1 sibling, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2025-09-02 21:13 UTC (permalink / raw)
To: Simon Schippers, willemdebruijn.kernel, jasowang, mst, eperezma,
stephen, netdev, linux-kernel, virtualization, kvm
Cc: Simon Schippers, Tim Gebauer
Simon Schippers wrote:
> The implementation is inspired by ptr_ring_empty.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
> include/linux/ptr_ring.h | 71 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 71 insertions(+)
>
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 551329220e4f..6b8cfaecf478 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -243,6 +243,77 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
> return ret;
> }
>
> +/*
> + * Check if a spare capacity of cnt is available without taking any locks.
> + *
> + * If cnt==0 or cnt > r->size it acts the same as __ptr_ring_empty.
cnt >= r->size?
> + *
> + * The same requirements apply as described for __ptr_ring_empty.
> + */
> +static inline bool __ptr_ring_spare(struct ptr_ring *r, int cnt)
> +{
> + int size = r->size;
> + int to_check;
> +
> + if (unlikely(!size || cnt < 0))
> + return true;
Does !size ever happen? Also no need for preconditions for trivial
errors that never happen, like passing negative values. Or prefer
an unsigned type.
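I.e., something like (sketch):

	static inline bool __ptr_ring_spare(struct ptr_ring *r,
					    unsigned int cnt)

which would also make the cnt < 0 check unnecessary.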
> +
> + if (cnt > size)
> + cnt = 0;
> +
> + to_check = READ_ONCE(r->consumer_head) - cnt;
> +
> + if (to_check < 0)
> + to_check += size;
> +
> + return !r->queue[to_check];
> +}
> +
> +static inline bool ptr_ring_spare(struct ptr_ring *r, int cnt)
> +{
> + bool ret;
> +
> + spin_lock(&r->consumer_lock);
> + ret = __ptr_ring_spare(r, cnt);
> + spin_unlock(&r->consumer_lock);
> +
> + return ret;
> +}
> +
> +static inline bool ptr_ring_spare_irq(struct ptr_ring *r, int cnt)
> +{
> + bool ret;
> +
> + spin_lock_irq(&r->consumer_lock);
> + ret = __ptr_ring_spare(r, cnt);
> + spin_unlock_irq(&r->consumer_lock);
> +
> + return ret;
> +}
> +
> +static inline bool ptr_ring_spare_any(struct ptr_ring *r, int cnt)
> +{
> + unsigned long flags;
> + bool ret;
> +
> + spin_lock_irqsave(&r->consumer_lock, flags);
> + ret = __ptr_ring_spare(r, cnt);
> + spin_unlock_irqrestore(&r->consumer_lock, flags);
> +
> + return ret;
> +}
> +
> +static inline bool ptr_ring_spare_bh(struct ptr_ring *r, int cnt)
> +{
> + bool ret;
> +
> + spin_lock_bh(&r->consumer_lock);
> + ret = __ptr_ring_spare(r, cnt);
> + spin_unlock_bh(&r->consumer_lock);
> +
> + return ret;
> +}
Please only introduce the variants actually used.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/4] netdev queue flow control for TUN
2025-09-02 8:09 ` [PATCH 2/4] netdev queue flow control for TUN Simon Schippers
@ 2025-09-02 21:20 ` Willem de Bruijn
2025-09-03 18:35 ` Simon Schippers
2025-09-02 21:31 ` Willem de Bruijn
` (3 subsequent siblings)
4 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2025-09-02 21:20 UTC (permalink / raw)
To: Simon Schippers, willemdebruijn.kernel, jasowang, mst, eperezma,
stephen, netdev, linux-kernel, virtualization, kvm
Cc: Simon Schippers, Tim Gebauer
Simon Schippers wrote:
> The netdev queue is stopped in tun_net_xmit after inserting an SKB into
> the ring buffer if the ring buffer became full because of that. If the
> insertion into the ptr_ring fails, the netdev queue is also stopped and
> the SKB is dropped. However, this never happened in my testing.
Indeed, since the last successful insertion will always pause the
queue before this can happen. Since this cannot be reached, no need
to add the code defensively. If in doubt, maybe add a
DEBUG_NET_WARN_ON_ONCE.
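I.e., roughly (a sketch, keeping the drop path but removing the
defensive queue stop):

	if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
		/* Should be unreachable: the last successful produce
		 * stopped the queue when the ring became full.
		 */
		DEBUG_NET_WARN_ON_ONCE(1);
		drop_reason = SKB_DROP_REASON_FULL_RING;
		goto drop;
	}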
> To ensure
> that the ptr_ring change is available to the consumer before the netdev
> queue stop, an smp_wmb() is used.
>
> Then in tun_ring_recv, the new helper wake_netdev_queue is called in the
> blocking wait queue and after consuming an SKB from the ptr_ring. This
> helper first checks if the netdev queue has stopped. Then with the paired
> smp_rmb() it is known that tun_net_xmit will not produce SKBs anymore.
> With that knowledge, the helper can then wake the netdev queue if there is
> at least a single spare slot in the ptr_ring by calling ptr_ring_spare
> with cnt=1.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
> drivers/net/tun.c | 33 ++++++++++++++++++++++++++++++---
> 1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index cc6c50180663..735498e221d8 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1060,13 +1060,21 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>
> nf_reset_ct(skb);
>
> - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> + queue = netdev_get_tx_queue(dev, txq);
> + if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
> + /* Paired with smp_rmb() in wake_netdev_queue. */
> + smp_wmb();
> + netif_tx_stop_queue(queue);
> drop_reason = SKB_DROP_REASON_FULL_RING;
> goto drop;
> }
> + if (ptr_ring_full(&tfile->tx_ring)) {
> + /* Paired with smp_rmb() in wake_netdev_queue. */
> + smp_wmb();
> + netif_tx_stop_queue(queue);
> + }
>
> /* dev->lltx requires to do our own update of trans_start */
> - queue = netdev_get_tx_queue(dev, txq);
> txq_trans_cond_update(queue);
>
> /* Notify and wake up reader process */
> @@ -2110,6 +2118,24 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> return total;
> }
>
> +static inline void wake_netdev_queue(struct tun_file *tfile)
> +{
> + struct netdev_queue *txq;
> + struct net_device *dev;
> +
> + rcu_read_lock();
> + dev = rcu_dereference(tfile->tun)->dev;
> + txq = netdev_get_tx_queue(dev, tfile->queue_index);
> +
> + if (netif_tx_queue_stopped(txq)) {
> + /* Paired with smp_wmb() in tun_net_xmit. */
> + smp_rmb();
> + if (ptr_ring_spare(&tfile->tx_ring, 1))
> + netif_tx_wake_queue(txq);
> + }
> + rcu_read_unlock();
> +}
> +
> static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> {
> DECLARE_WAITQUEUE(wait, current);
> @@ -2139,7 +2165,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> error = -EFAULT;
> break;
> }
> -
> + wake_netdev_queue(tfile);
Why wake when no entry was consumed?
Also keep the empty line.
> schedule();
> }
>
> @@ -2147,6 +2173,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> remove_wait_queue(&tfile->socket.wq.wait, &wait);
>
> out:
> + wake_netdev_queue(tfile);
> *err = error;
> return ptr;
> }
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] netdev queue flow control for vhost_net
2025-09-02 8:09 ` [PATCH 4/4] netdev queue flow control for vhost_net Simon Schippers
@ 2025-09-02 21:31 ` Willem de Bruijn
2025-09-03 3:42 ` Jason Wang
0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2025-09-02 21:31 UTC (permalink / raw)
To: Simon Schippers, willemdebruijn.kernel, jasowang, mst, eperezma,
stephen, netdev, linux-kernel, virtualization, kvm
Cc: Simon Schippers, Tim Gebauer
Simon Schippers wrote:
> Stopping the queue is done in tun_net_xmit.
>
> Waking the queue is done by calling one of the helpers,
> tun_wake_netdev_queue and tap_wake_netdev_queue. For that, in
> get_wake_netdev_queue, the correct method is determined and saved in the
> function pointer wake_netdev_queue of the vhost_net_virtqueue. Then, each
> time after consuming a batch in vhost_net_buf_produce, wake_netdev_queue
> is called.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
> drivers/net/tap.c | 6 ++++++
> drivers/net/tun.c | 6 ++++++
> drivers/vhost/net.c | 34 ++++++++++++++++++++++++++++------
> include/linux/if_tap.h | 2 ++
> include/linux/if_tun.h | 3 +++
> 5 files changed, 45 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index 4d874672bcd7..0bad9e3d59af 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -1198,6 +1198,12 @@ struct socket *tap_get_socket(struct file *file)
> }
> EXPORT_SYMBOL_GPL(tap_get_socket);
>
> +void tap_wake_netdev_queue(struct file *file)
> +{
> + wake_netdev_queue(file->private_data);
> +}
> +EXPORT_SYMBOL_GPL(tap_wake_netdev_queue);
> +
> struct ptr_ring *tap_get_ptr_ring(struct file *file)
> {
> struct tap_queue *q;
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 735498e221d8..e85589b596ac 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -3739,6 +3739,12 @@ struct socket *tun_get_socket(struct file *file)
> }
> EXPORT_SYMBOL_GPL(tun_get_socket);
>
> +void tun_wake_netdev_queue(struct file *file)
> +{
> + wake_netdev_queue(file->private_data);
> +}
> +EXPORT_SYMBOL_GPL(tun_wake_netdev_queue);
Having multiple functions with the same name is a tad annoying from a
cscope PoV; better to call the internal functions
__tun_wake_netdev_queue, etc.
> +
> struct ptr_ring *tun_get_tx_ring(struct file *file)
> {
> struct tun_file *tfile;
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 6edac0c1ba9b..e837d3a334f1 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -130,6 +130,7 @@ struct vhost_net_virtqueue {
> struct vhost_net_buf rxq;
> /* Batched XDP buffs */
> struct xdp_buff *xdp;
> + void (*wake_netdev_queue)(struct file *f);
Indirect function calls are expensive post-Spectre. Probably
preferable to just have a branch.
A branch on `file->f_op != &tun_fops` would still be expensive, as it
may touch a cold cacheline.
How about adding a bit in struct ptr_ring itself? Pahole shows plenty
of holes. Jason, WDYT?
> };
>
> struct vhost_net {
> @@ -175,13 +176,16 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
> return ret;
> }
>
> -static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
> +static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq,
> + struct sock *sk)
> {
> + struct file *file = sk->sk_socket->file;
> struct vhost_net_buf *rxq = &nvq->rxq;
>
> rxq->head = 0;
> rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
> VHOST_NET_BATCH);
> + nvq->wake_netdev_queue(file);
> return rxq->tail;
> }
>
> @@ -208,14 +212,15 @@ static int vhost_net_buf_peek_len(void *ptr)
> return __skb_array_len_with_tag(ptr);
> }
>
> -static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
> +static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq,
> + struct sock *sk)
odd indentation
> {
> struct vhost_net_buf *rxq = &nvq->rxq;
>
> if (!vhost_net_buf_is_empty(rxq))
> goto out;
>
> - if (!vhost_net_buf_produce(nvq))
> + if (!vhost_net_buf_produce(nvq, sk))
> return 0;
>
> out:
> @@ -994,7 +999,7 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
> unsigned long flags;
>
> if (rvq->rx_ring)
> - return vhost_net_buf_peek(rvq);
> + return vhost_net_buf_peek(rvq, sk);
>
> spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> head = skb_peek(&sk->sk_receive_queue);
> @@ -1499,6 +1504,19 @@ static struct socket *get_tap_socket(int fd)
> return sock;
> }
>
> +static void (*get_wake_netdev_queue(struct file *file))(struct file *file)
> +{
> + struct ptr_ring *ring;
> +
> + ring = tun_get_tx_ring(file);
> + if (!IS_ERR(ring))
> + return tun_wake_netdev_queue;
> + ring = tap_get_ptr_ring(file);
> + if (!IS_ERR(ring))
> + return tap_wake_netdev_queue;
> + return NULL;
> +}
> +
> static struct socket *get_socket(int fd)
> {
> struct socket *sock;
> @@ -1570,10 +1588,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> if (r)
> goto err_used;
> if (index == VHOST_NET_VQ_RX) {
> - if (sock)
> + if (sock) {
> nvq->rx_ring = get_tap_ptr_ring(sock->file);
> - else
> + nvq->wake_netdev_queue =
> + get_wake_netdev_queue(sock->file);
> + } else {
> nvq->rx_ring = NULL;
> + nvq->wake_netdev_queue = NULL;
> + }
> }
>
> oldubufs = nvq->ubufs;
> diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
> index 553552fa635c..02b2809784b5 100644
> --- a/include/linux/if_tap.h
> +++ b/include/linux/if_tap.h
> @@ -10,6 +10,7 @@ struct socket;
>
> #if IS_ENABLED(CONFIG_TAP)
> struct socket *tap_get_socket(struct file *);
> +void tap_wake_netdev_queue(struct file *file);
> struct ptr_ring *tap_get_ptr_ring(struct file *file);
> #else
> #include <linux/err.h>
> @@ -18,6 +19,7 @@ static inline struct socket *tap_get_socket(struct file *f)
> {
> return ERR_PTR(-EINVAL);
> }
> +static inline void tap_wake_netdev_queue(struct file *f) {}
> static inline struct ptr_ring *tap_get_ptr_ring(struct file *f)
> {
> return ERR_PTR(-EINVAL);
> diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
> index 80166eb62f41..04c504bb1954 100644
> --- a/include/linux/if_tun.h
> +++ b/include/linux/if_tun.h
> @@ -21,6 +21,7 @@ struct tun_msg_ctl {
>
> #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
> struct socket *tun_get_socket(struct file *);
> +void tun_wake_netdev_queue(struct file *file);
> struct ptr_ring *tun_get_tx_ring(struct file *file);
>
> static inline bool tun_is_xdp_frame(void *ptr)
> @@ -50,6 +51,8 @@ static inline struct socket *tun_get_socket(struct file *f)
> return ERR_PTR(-EINVAL);
> }
>
> +static inline void tun_wake_netdev_queue(struct file *f) {}
> +
> static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
> {
> return ERR_PTR(-EINVAL);
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/4] netdev queue flow control for TUN
2025-09-02 8:09 ` [PATCH 2/4] netdev queue flow control for TUN Simon Schippers
2025-09-02 21:20 ` Willem de Bruijn
@ 2025-09-02 21:31 ` Willem de Bruijn
2025-09-03 3:27 ` Jason Wang
` (2 subsequent siblings)
4 siblings, 0 replies; 24+ messages in thread
From: Willem de Bruijn @ 2025-09-02 21:31 UTC (permalink / raw)
To: Simon Schippers, willemdebruijn.kernel, jasowang, mst, eperezma,
stephen, netdev, linux-kernel, virtualization, kvm
Cc: Simon Schippers, Tim Gebauer
Simon Schippers wrote:
> The netdev queue is stopped in tun_net_xmit after inserting an SKB into
> the ring buffer if the ring buffer became full because of that. If the
> insertion into the ptr_ring fails, the netdev queue is also stopped and
> the SKB is dropped. However, this never happened in my testing. To ensure
> that the ptr_ring change is available to the consumer before the netdev
> queue stop, an smp_wmb() is used.
>
> Then in tun_ring_recv, the new helper wake_netdev_queue is called in the
> blocking wait queue and after consuming an SKB from the ptr_ring. This
> helper first checks if the netdev queue has stopped. Then with the paired
> smp_rmb() it is known that tun_net_xmit will not produce SKBs anymore.
> With that knowledge, the helper can then wake the netdev queue if there is
> at least a single spare slot in the ptr_ring by calling ptr_ring_spare
> with cnt=1.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
> drivers/net/tun.c | 33 ++++++++++++++++++++++++++++++---
> 1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index cc6c50180663..735498e221d8 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1060,13 +1060,21 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>
> nf_reset_ct(skb);
>
> - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> + queue = netdev_get_tx_queue(dev, txq);
> + if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
> + /* Paired with smp_rmb() in wake_netdev_queue. */
> + smp_wmb();
> + netif_tx_stop_queue(queue);
> drop_reason = SKB_DROP_REASON_FULL_RING;
> goto drop;
> }
> + if (ptr_ring_full(&tfile->tx_ring)) {
> + /* Paired with smp_rmb() in wake_netdev_queue. */
> + smp_wmb();
> + netif_tx_stop_queue(queue);
> + }
>
> /* dev->lltx requires to do our own update of trans_start */
> - queue = netdev_get_tx_queue(dev, txq);
> txq_trans_cond_update(queue);
>
> /* Notify and wake up reader process */
> @@ -2110,6 +2118,24 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> return total;
> }
>
> +static inline void wake_netdev_queue(struct tun_file *tfile)
No inline keyword in .c files; let the compiler decide.
> +{
> + struct netdev_queue *txq;
> + struct net_device *dev;
> +
> + rcu_read_lock();
> + dev = rcu_dereference(tfile->tun)->dev;
> + txq = netdev_get_tx_queue(dev, tfile->queue_index);
> +
> + if (netif_tx_queue_stopped(txq)) {
> + /* Paired with smp_wmb() in tun_net_xmit. */
> + smp_rmb();
> + if (ptr_ring_spare(&tfile->tx_ring, 1))
> + netif_tx_wake_queue(txq);
> + }
> + rcu_read_unlock();
> +}
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available
2025-09-02 21:13 ` Willem de Bruijn
@ 2025-09-03 3:13 ` Jason Wang
0 siblings, 0 replies; 24+ messages in thread
From: Jason Wang @ 2025-09-03 3:13 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Simon Schippers, mst, eperezma, stephen, netdev, linux-kernel,
virtualization, kvm, Tim Gebauer
On Wed, Sep 3, 2025 at 5:13 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Simon Schippers wrote:
> > The implementation is inspired by ptr_ring_empty.
> >
> > Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > ---
> > include/linux/ptr_ring.h | 71 ++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 71 insertions(+)
> >
> > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > index 551329220e4f..6b8cfaecf478 100644
> > --- a/include/linux/ptr_ring.h
> > +++ b/include/linux/ptr_ring.h
> > @@ -243,6 +243,77 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
> > return ret;
> > }
> >
> > +/*
> > + * Check if a spare capacity of cnt is available without taking any locks.
> > + *
> > + * If cnt==0 or cnt > r->size it acts the same as __ptr_ring_empty.
>
> cnt >= r->size?
>
> > + *
> > + * The same requirements apply as described for __ptr_ring_empty.
> > + */
> > +static inline bool __ptr_ring_spare(struct ptr_ring *r, int cnt)
> > +{
> > + int size = r->size;
> > + int to_check;
> > +
> > + if (unlikely(!size || cnt < 0))
> > + return true;
>
> Does !size ever happen?
Yes, see 982fb490c298 ("ptr_ring: support zero length ring"). The
reason is that tun reuses dev->tx_queue_len as the ptr_ring size.
> Also no need for preconditions for trivial
> errors that never happen, like passing negative values. Or prefer
> an unsigned type.
+1.
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/4] netdev queue flow control for TUN
2025-09-02 8:09 ` [PATCH 2/4] netdev queue flow control for TUN Simon Schippers
2025-09-02 21:20 ` Willem de Bruijn
2025-09-02 21:31 ` Willem de Bruijn
@ 2025-09-03 3:27 ` Jason Wang
2025-09-03 18:41 ` Simon Schippers
2025-09-03 13:10 ` Michael S. Tsirkin
2025-09-03 13:13 ` Michael S. Tsirkin
4 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2025-09-03 3:27 UTC (permalink / raw)
To: Simon Schippers
Cc: willemdebruijn.kernel, mst, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm, Tim Gebauer
On Tue, Sep 2, 2025 at 4:10 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> The netdev queue is stopped in tun_net_xmit after inserting an SKB into
> the ring buffer if the ring buffer became full because of that. If the
> insertion into the ptr_ring fails, the netdev queue is also stopped and
> the SKB is dropped. However, this never happened in my testing.
You can reach this by using pktgen on TUN.
> To ensure
> that the ptr_ring change is available to the consumer before the netdev
> queue stop, an smp_wmb() is used.
>
> Then in tun_ring_recv, the new helper wake_netdev_queue is called in the
> blocking wait queue and after consuming an SKB from the ptr_ring. This
> helper first checks if the netdev queue has stopped. Then with the paired
> smp_rmb() it is known that tun_net_xmit will not produce SKBs anymore.
> With that knowledge, the helper can then wake the netdev queue if there is
> at least a single spare slot in the ptr_ring by calling ptr_ring_spare
> with cnt=1.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
> drivers/net/tun.c | 33 ++++++++++++++++++++++++++++++---
> 1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index cc6c50180663..735498e221d8 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1060,13 +1060,21 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>
> nf_reset_ct(skb);
>
> - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> + queue = netdev_get_tx_queue(dev, txq);
> + if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
> + /* Paired with smp_rmb() in wake_netdev_queue. */
> + smp_wmb();
> + netif_tx_stop_queue(queue);
The barrier looks odd since it requires the driver to care about the
ordering; can you elaborate more on this?
There's a WRITE_ONCE + mb() in netif_tx_stop_queue already:
static __always_inline void netif_tx_stop_queue(struct netdev_queue *dev_queue)
{
/* Paired with READ_ONCE() from dev_watchdog() */
WRITE_ONCE(dev_queue->trans_start, jiffies);
/* This barrier is paired with smp_mb() from dev_watchdog() */
smp_mb__before_atomic();
/* Must be an atomic op see netif_txq_try_stop() */
set_bit(__QUEUE_STATE_DRV_XOFF, &dev_queue->state);
}
> drop_reason = SKB_DROP_REASON_FULL_RING;
> goto drop;
> }
> + if (ptr_ring_full(&tfile->tx_ring)) {
> + /* Paired with smp_rmb() in wake_netdev_queue. */
> + smp_wmb();
> + netif_tx_stop_queue(queue);
> + }
>
> /* dev->lltx requires to do our own update of trans_start */
> - queue = netdev_get_tx_queue(dev, txq);
> txq_trans_cond_update(queue);
>
> /* Notify and wake up reader process */
> @@ -2110,6 +2118,24 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> return total;
> }
>
> +static inline void wake_netdev_queue(struct tun_file *tfile)
Let's rename this to tun_wake_xxx.
> +{
> + struct netdev_queue *txq;
> + struct net_device *dev;
> +
> + rcu_read_lock();
> + dev = rcu_dereference(tfile->tun)->dev;
> + txq = netdev_get_tx_queue(dev, tfile->queue_index);
> +
> + if (netif_tx_queue_stopped(txq)) {
> + /* Paired with smp_wmb() in tun_net_xmit. */
> + smp_rmb();
> + if (ptr_ring_spare(&tfile->tx_ring, 1))
I wonder if there would be a case that uses cnt > 1. If not, a
ptr_ring_can_produce() should be sufficient.
> + netif_tx_wake_queue(txq);
> + }
> + rcu_read_unlock();
> +}
> +
> static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> {
> DECLARE_WAITQUEUE(wait, current);
> @@ -2139,7 +2165,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> error = -EFAULT;
> break;
> }
> -
> + wake_netdev_queue(tfile);
> schedule();
> }
>
> @@ -2147,6 +2173,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> remove_wait_queue(&tfile->socket.wq.wait, &wait);
>
> out:
> + wake_netdev_queue(tfile);
> *err = error;
> return ptr;
> }
> --
> 2.43.0
>
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] netdev queue flow control for vhost_net
2025-09-02 21:31 ` Willem de Bruijn
@ 2025-09-03 3:42 ` Jason Wang
2025-09-03 13:51 ` Willem de Bruijn
0 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2025-09-03 3:42 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Simon Schippers, mst, eperezma, stephen, netdev, linux-kernel,
virtualization, kvm, Tim Gebauer
On Wed, Sep 3, 2025 at 5:31 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Simon Schippers wrote:
> > Stopping the queue is done in tun_net_xmit.
> >
> > Waking the queue is done by calling one of the helpers,
> > tun_wake_netdev_queue and tap_wake_netdev_queue. For that, in
> > get_wake_netdev_queue, the correct method is determined and saved in the
> > function pointer wake_netdev_queue of the vhost_net_virtqueue. Then, each
> > time after consuming a batch in vhost_net_buf_produce, wake_netdev_queue
> > is called.
> >
> > Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > ---
> > drivers/net/tap.c | 6 ++++++
> > drivers/net/tun.c | 6 ++++++
> > drivers/vhost/net.c | 34 ++++++++++++++++++++++++++++------
> > include/linux/if_tap.h | 2 ++
> > include/linux/if_tun.h | 3 +++
> > 5 files changed, 45 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> > index 4d874672bcd7..0bad9e3d59af 100644
> > --- a/drivers/net/tap.c
> > +++ b/drivers/net/tap.c
> > @@ -1198,6 +1198,12 @@ struct socket *tap_get_socket(struct file *file)
> > }
> > EXPORT_SYMBOL_GPL(tap_get_socket);
> >
> > +void tap_wake_netdev_queue(struct file *file)
> > +{
> > + wake_netdev_queue(file->private_data);
> > +}
> > +EXPORT_SYMBOL_GPL(tap_wake_netdev_queue);
> > +
> > struct ptr_ring *tap_get_ptr_ring(struct file *file)
> > {
> > struct tap_queue *q;
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 735498e221d8..e85589b596ac 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -3739,6 +3739,12 @@ struct socket *tun_get_socket(struct file *file)
> > }
> > EXPORT_SYMBOL_GPL(tun_get_socket);
> >
> > +void tun_wake_netdev_queue(struct file *file)
> > +{
> > + wake_netdev_queue(file->private_data);
> > +}
> > +EXPORT_SYMBOL_GPL(tun_wake_netdev_queue);
>
> Having multiple functions with the same name is tad annoying from a
> cscape PoV, better to call the internal functions
> __tun_wake_netdev_queue, etc.
>
> > +
> > struct ptr_ring *tun_get_tx_ring(struct file *file)
> > {
> > struct tun_file *tfile;
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 6edac0c1ba9b..e837d3a334f1 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -130,6 +130,7 @@ struct vhost_net_virtqueue {
> > struct vhost_net_buf rxq;
> > /* Batched XDP buffs */
> > struct xdp_buff *xdp;
> > + void (*wake_netdev_queue)(struct file *f);
>
> Indirect function calls are expensive post spectre. Probably
> preferable to just have a branch.
>
> A branch in `file->f_op != &tun_fops` would be expensive still as it
> may touch a cold cacheline.
>
> How about adding a bit in struct ptr_ring itself. Pahole shows plenty
> of holes. Jason, WDYT?
>
I'm not sure I get the idea; did you mean a bit for classifying TUN
and TAP? If so, I'm not sure it's a good idea, as ptr_ring should
have no knowledge of its users.
Considering there are still indirect calls to sock->ops, maybe we can
start with the branch, e.g. the sketch below.
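E.g. (a sketch; is_tap would be a new field set in
vhost_net_set_backend):

	static void vhost_net_wake_netdev_queue(struct vhost_net_virtqueue *nvq,
						struct file *file)
	{
		if (nvq->is_tap)
			tap_wake_netdev_queue(file);
		else
			tun_wake_netdev_queue(file);
	}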
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop
2025-09-02 8:09 [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
` (3 preceding siblings ...)
2025-09-02 8:09 ` [PATCH 4/4] netdev queue flow control for vhost_net Simon Schippers
@ 2025-09-03 4:00 ` Jason Wang
2025-09-03 18:55 ` Simon Schippers
4 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2025-09-03 4:00 UTC (permalink / raw)
To: Simon Schippers
Cc: willemdebruijn.kernel, mst, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm
On Tue, Sep 2, 2025 at 4:10 PM Simon Schippers
<simon.schippers@tu-dortmund.de> wrote:
>
> This patch series deals with TUN/TAP and vhost_net which drop incoming
> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
> patch series, the associated netdev queue is stopped before this happens.
> This allows the connected qdisc to function correctly as reported by [1]
> and improves application-layer performance, see benchmarks.
>
> This patch series includes TUN, TAP, and vhost_net because they share
> logic. Adjusting only one of them would break the others. Therefore, the
> patch series is structured as follows:
> 1. New ptr_ring_spare helper to check if the ptr_ring has spare capacity
> 2. Netdev queue flow control for TUN: Logic for stopping the queue upon
> full ptr_ring and waking the queue if ptr_ring has spare capacity
> 3. Additions for TAP: Similar logic for waking the queue
> 4. Additions for vhost_net: Calling TUN/TAP methods for waking the queue
>
> Benchmarks ([2] & [3]):
> - TUN: TCP throughput over real-world 120ms RTT OpenVPN connection
> improved by 36% (117Mbit/s vs 185 Mbit/s)
> - TAP: TCP throughput to local qemu VM stays the same (2.2Gbit/s), an
> improvement by factor 2 at emulated 120ms RTT (98Mbit/s vs 198Mbit/s)
> - TAP+vhost_net: TCP throughput to local qemu VM approx. the same
> (23.4Gbit/s vs 23.9Gbit/s), same performance at emulated 120ms RTT
> (200Mbit/s)
> - TUN/TAP/TAP+vhost_net: Reduction of ptr_ring size to ~10 packets
> possible without losing performance
>
> Possible future work:
> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
> - Adaption of the netdev queue flow control for ipvtap & macvtap
Could you please run pktgen on TUN as well to see the difference?
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available
2025-09-02 8:09 ` [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available Simon Schippers
2025-09-02 21:13 ` Willem de Bruijn
@ 2025-09-03 13:05 ` Michael S. Tsirkin
2025-09-03 18:29 ` Simon Schippers
1 sibling, 1 reply; 24+ messages in thread
From: Michael S. Tsirkin @ 2025-09-03 13:05 UTC (permalink / raw)
To: Simon Schippers
Cc: willemdebruijn.kernel, jasowang, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm, Tim Gebauer
On Tue, Sep 02, 2025 at 10:09:54AM +0200, Simon Schippers wrote:
> The implementation is inspired by ptr_ring_empty.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
> include/linux/ptr_ring.h | 71 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 71 insertions(+)
>
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 551329220e4f..6b8cfaecf478 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -243,6 +243,77 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
> return ret;
> }
>
> +/*
> + * Check if a spare capacity of cnt is available without taking any locks.
Not sure what "spare" means here. I think you mean
Check if the ring has enough space to produce a given
number of entries.
> + *
> + * If cnt==0 or cnt > r->size it acts the same as __ptr_ring_empty.
Logically, cnt = 0 should always be true, cnt > size should always be
false then?
Why do you want it to act as __ptr_ring_empty?
> + *
> + * The same requirements apply as described for __ptr_ring_empty.
Which is:
* However, if some other CPU consumes ring entries at the same time, the value
* returned is not guaranteed to be correct.
but it's not right here yes? consuming entries will just add more
space ...
Also:
* In this case - to avoid incorrectly detecting the ring
* as empty - the CPU consuming the ring entries is responsible
* for either consuming all ring entries until the ring is empty,
* or synchronizing with some other CPU and causing it to
* re-test __ptr_ring_empty and/or consume the ring enteries
* after the synchronization point.
how would you apply this here?
> + */
> +static inline bool __ptr_ring_spare(struct ptr_ring *r, int cnt)
> +{
> + int size = r->size;
> + int to_check;
> +
> + if (unlikely(!size || cnt < 0))
> + return true;
> +
> + if (cnt > size)
> + cnt = 0;
> +
> + to_check = READ_ONCE(r->consumer_head) - cnt;
> +
> + if (to_check < 0)
> + to_check += size;
> +
> + return !r->queue[to_check];
> +}
> +
I will have to look at how this is used to understand if it's
correct. But I think we need better documentation.
> +static inline bool ptr_ring_spare(struct ptr_ring *r, int cnt)
> +{
> + bool ret;
> +
> + spin_lock(&r->consumer_lock);
> + ret = __ptr_ring_spare(r, cnt);
> + spin_unlock(&r->consumer_lock);
> +
> + return ret;
I don't understand why you take the consumer lock here.
If a producer is running it will make the value wrong;
if the consumer is running it will just create more space.
> +}
> +
> +static inline bool ptr_ring_spare_irq(struct ptr_ring *r, int cnt)
> +{
> + bool ret;
> +
> + spin_lock_irq(&r->consumer_lock);
> + ret = __ptr_ring_spare(r, cnt);
> + spin_unlock_irq(&r->consumer_lock);
> +
> + return ret;
> +}
> +
> +static inline bool ptr_ring_spare_any(struct ptr_ring *r, int cnt)
> +{
> + unsigned long flags;
> + bool ret;
> +
> + spin_lock_irqsave(&r->consumer_lock, flags);
> + ret = __ptr_ring_spare(r, cnt);
> + spin_unlock_irqrestore(&r->consumer_lock, flags);
> +
> + return ret;
> +}
> +
> +static inline bool ptr_ring_spare_bh(struct ptr_ring *r, int cnt)
> +{
> + bool ret;
> +
> + spin_lock_bh(&r->consumer_lock);
> + ret = __ptr_ring_spare(r, cnt);
> + spin_unlock_bh(&r->consumer_lock);
> +
> + return ret;
> +}
> +
> /* Must only be called after __ptr_ring_peek returned !NULL */
> static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> {
> --
> 2.43.0
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/4] netdev queue flow control for TUN
2025-09-02 8:09 ` [PATCH 2/4] netdev queue flow control for TUN Simon Schippers
` (2 preceding siblings ...)
2025-09-03 3:27 ` Jason Wang
@ 2025-09-03 13:10 ` Michael S. Tsirkin
2025-09-03 18:45 ` Simon Schippers
2025-09-03 13:13 ` Michael S. Tsirkin
4 siblings, 1 reply; 24+ messages in thread
From: Michael S. Tsirkin @ 2025-09-03 13:10 UTC (permalink / raw)
To: Simon Schippers
Cc: willemdebruijn.kernel, jasowang, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm, Tim Gebauer
On Tue, Sep 02, 2025 at 10:09:55AM +0200, Simon Schippers wrote:
> The netdev queue is stopped in tun_net_xmit after inserting an SKB into
> the ring buffer if the ring buffer became full because of that. If the
> insertion into the ptr_ring fails, the netdev queue is also stopped and
> the SKB is dropped. However, this never happened in my testing. To ensure
> that the ptr_ring change is available to the consumer before the netdev
> queue stop, an smp_wmb() is used.
>
> Then in tun_ring_recv, the new helper wake_netdev_queue is called in the
> blocking wait queue and after consuming an SKB from the ptr_ring. This
> helper first checks if the netdev queue has stopped. Then with the paired
> smp_rmb() it is known that tun_net_xmit will not produce SKBs anymore.
> With that knowledge, the helper can then wake the netdev queue if there is
> at least a single spare slot in the ptr_ring by calling ptr_ring_spare
> with cnt=1.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Oh you just want to know if produce will succeed?
Kind of a version of peek but for producer?
So all this cuteness of looking at the consumer is actually not necessary,
and bad for cache.
You just want this:
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 551329220e4f..de25fe81dd4e 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -96,6 +96,14 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
return ret;
}
+static inline int __ptr_ring_produce_peek(struct ptr_ring *r)
+{
+ if (unlikely(!r->size) || r->queue[r->producer])
+ return -ENOSPC;
+
+ return 0;
+}
+
/* Note: callers invoking this in a loop must use a compiler barrier,
* for example cpu_relax(). Callers must hold producer_lock.
* Callers are responsible for making sure pointer that is being queued
@@ -103,8 +111,10 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
*/
static inline int __ptr_ring_produce(struct ptr_ring *r, void *ptr)
{
- if (unlikely(!r->size) || r->queue[r->producer])
- return -ENOSPC;
+ int r = __ptr_ring_produce_peek(r);
+
+ if (r)
+ return r;
/* Make sure the pointer we are storing points to a valid data. */
/* Pairs with the dependency ordering in __ptr_ring_consume. */
Add some docs, and call this, then wake. No?
--
MST
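As a sketch of the documentation MST asks for (wording assumed, not part
of the posted diff):

	/*
	 * __ptr_ring_produce_peek - check whether a produce would succeed
	 * @r: the ring
	 *
	 * Returns 0 if the next __ptr_ring_produce on @r is guaranteed to
	 * succeed, -ENOSPC if the ring is full or has size 0.  Callers
	 * must hold the producer_lock, or otherwise ensure that no
	 * producer runs concurrently.  A consumer running in parallel can
	 * only turn -ENOSPC into success, never the other way around.
	 */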
* Re: [PATCH 2/4] netdev queue flow control for TUN
2025-09-02 8:09 ` [PATCH 2/4] netdev queue flow control for TUN Simon Schippers
` (3 preceding siblings ...)
2025-09-03 13:10 ` Michael S. Tsirkin
@ 2025-09-03 13:13 ` Michael S. Tsirkin
2025-09-03 18:49 ` Simon Schippers
4 siblings, 1 reply; 24+ messages in thread
From: Michael S. Tsirkin @ 2025-09-03 13:13 UTC (permalink / raw)
To: Simon Schippers
Cc: willemdebruijn.kernel, jasowang, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm, Tim Gebauer
On Tue, Sep 02, 2025 at 10:09:55AM +0200, Simon Schippers wrote:
> The netdev queue is stopped in tun_net_xmit after inserting an SKB into
> the ring buffer if the ring buffer became full because of that. If the
> insertion into the ptr_ring fails, the netdev queue is also stopped and
> the SKB is dropped. However, this never happened in my testing. To ensure
> that the ptr_ring change is available to the consumer before the netdev
> queue stop, an smp_wmb() is used.
I think the stop -> wake bounce involves enough barriers already,
no need for us to get cute.
> Then in tun_ring_recv, the new helper wake_netdev_queue is called in the
> blocking wait queue and after consuming an SKB from the ptr_ring. This
> helper first checks if the netdev queue has stopped. Then with the paired
> smp_rmb() it is known that tun_net_xmit will not produce SKBs anymore.
> With that knowledge, the helper can then wake the netdev queue if there is
> at least a single spare slot in the ptr_ring by calling ptr_ring_spare
> with cnt=1.
>
> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
> drivers/net/tun.c | 33 ++++++++++++++++++++++++++++++---
> 1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index cc6c50180663..735498e221d8 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1060,13 +1060,21 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>
> nf_reset_ct(skb);
>
> - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> + queue = netdev_get_tx_queue(dev, txq);
> + if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
> + /* Paired with smp_rmb() in wake_netdev_queue. */
> + smp_wmb();
> + netif_tx_stop_queue(queue);
> drop_reason = SKB_DROP_REASON_FULL_RING;
> goto drop;
> }
> + if (ptr_ring_full(&tfile->tx_ring)) {
> + /* Paired with smp_rmb() in wake_netdev_queue. */
> + smp_wmb();
> + netif_tx_stop_queue(queue);
> + }
>
> /* dev->lltx requires to do our own update of trans_start */
> - queue = netdev_get_tx_queue(dev, txq);
> txq_trans_cond_update(queue);
>
> /* Notify and wake up reader process */
> @@ -2110,6 +2118,24 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> return total;
> }
>
> +static inline void wake_netdev_queue(struct tun_file *tfile)
> +{
> + struct netdev_queue *txq;
> + struct net_device *dev;
> +
> + rcu_read_lock();
> + dev = rcu_dereference(tfile->tun)->dev;
> + txq = netdev_get_tx_queue(dev, tfile->queue_index);
> +
> + if (netif_tx_queue_stopped(txq)) {
> + /* Paired with smp_wmb() in tun_net_xmit. */
> + smp_rmb();
> + if (ptr_ring_spare(&tfile->tx_ring, 1))
> + netif_tx_wake_queue(txq);
> + }
> + rcu_read_unlock();
> +}
> +
> static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> {
> DECLARE_WAITQUEUE(wait, current);
> @@ -2139,7 +2165,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> error = -EFAULT;
> break;
> }
> -
> + wake_netdev_queue(tfile);
> schedule();
> }
>
> @@ -2147,6 +2173,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> remove_wait_queue(&tfile->socket.wq.wait, &wait);
>
> out:
> + wake_netdev_queue(tfile);
> *err = error;
> return ptr;
> }
> --
> 2.43.0
* Re: [PATCH 4/4] netdev queue flow control for vhost_net
2025-09-03 3:42 ` Jason Wang
@ 2025-09-03 13:51 ` Willem de Bruijn
2025-09-04 2:47 ` Jason Wang
0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2025-09-03 13:51 UTC (permalink / raw)
To: Jason Wang, Willem de Bruijn
Cc: Simon Schippers, mst, eperezma, stephen, netdev, linux-kernel,
virtualization, kvm, Tim Gebauer
Jason Wang wrote:
> On Wed, Sep 3, 2025 at 5:31 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Simon Schippers wrote:
> > > Stopping the queue is done in tun_net_xmit.
> > >
> > > Waking the queue is done by calling one of the helpers,
> > > tun_wake_netdev_queue and tap_wake_netdev_queue. For that, in
> > > get_wake_netdev_queue, the correct method is determined and saved in the
> > > function pointer wake_netdev_queue of the vhost_net_virtqueue. Then, each
> > > time after consuming a batch in vhost_net_buf_produce, wake_netdev_queue
> > > is called.
> > >
> > > Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > > Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > > Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > > ---
> > > drivers/net/tap.c | 6 ++++++
> > > drivers/net/tun.c | 6 ++++++
> > > drivers/vhost/net.c | 34 ++++++++++++++++++++++++++++------
> > > include/linux/if_tap.h | 2 ++
> > > include/linux/if_tun.h | 3 +++
> > > 5 files changed, 45 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> > > index 4d874672bcd7..0bad9e3d59af 100644
> > > --- a/drivers/net/tap.c
> > > +++ b/drivers/net/tap.c
> > > @@ -1198,6 +1198,12 @@ struct socket *tap_get_socket(struct file *file)
> > > }
> > > EXPORT_SYMBOL_GPL(tap_get_socket);
> > >
> > > +void tap_wake_netdev_queue(struct file *file)
> > > +{
> > > + wake_netdev_queue(file->private_data);
> > > +}
> > > +EXPORT_SYMBOL_GPL(tap_wake_netdev_queue);
> > > +
> > > struct ptr_ring *tap_get_ptr_ring(struct file *file)
> > > {
> > > struct tap_queue *q;
> > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > > index 735498e221d8..e85589b596ac 100644
> > > --- a/drivers/net/tun.c
> > > +++ b/drivers/net/tun.c
> > > @@ -3739,6 +3739,12 @@ struct socket *tun_get_socket(struct file *file)
> > > }
> > > EXPORT_SYMBOL_GPL(tun_get_socket);
> > >
> > > +void tun_wake_netdev_queue(struct file *file)
> > > +{
> > > + wake_netdev_queue(file->private_data);
> > > +}
> > > +EXPORT_SYMBOL_GPL(tun_wake_netdev_queue);
> >
> > > Having multiple functions with the same name is a tad annoying from a
> > > cscope PoV; better to call the internal functions
> > > __tun_wake_netdev_queue, etc.
> >
> > > +
> > > struct ptr_ring *tun_get_tx_ring(struct file *file)
> > > {
> > > struct tun_file *tfile;
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 6edac0c1ba9b..e837d3a334f1 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -130,6 +130,7 @@ struct vhost_net_virtqueue {
> > > struct vhost_net_buf rxq;
> > > /* Batched XDP buffs */
> > > struct xdp_buff *xdp;
> > > + void (*wake_netdev_queue)(struct file *f);
> >
> > Indirect function calls are expensive post spectre. Probably
> > preferable to just have a branch.
> >
> > A branch on `file->f_op != &tun_fops` would still be expensive, as it
> > may touch a cold cacheline.
> >
> > How about adding a bit in struct ptr_ring itself. Pahole shows plenty
> > of holes. Jason, WDYT?
> >
>
> I'm not sure I get the idea; did you mean a bit for classifying TUN
> and TAP? If so, I'm not sure it's a good idea, as ptr_ring should
> have no knowledge of its user.
That is what I meant.
> Consider there were still indirect calls to sock->ops, maybe we can
> start from the branch.
What do you mean?
Tangential: if indirect calls really are needed in a hot path, e.g.,
to maintain this isolation of ptr_ring from its users, then
INDIRECT_CALL wrappers can be used to avoid the cost.
That too effectively breaks the isolation between caller and callee.
But only for the most important N callees that are listed in the
INDIRECT_CALL_? wrapper.
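For that tangent, a minimal sketch of how the wrapper could be applied
here, assuming the tun_/tap_ wake helpers from patch 4 and a hypothetical
vhost_net_wake helper:

	#include <linux/indirect_call_wrapper.h>

	/* INDIRECT_CALL_2 compares the function pointer against the two
	 * listed callees and emits direct calls for them; only an unknown
	 * pointer falls back to a true indirect call.
	 */
	static void vhost_net_wake(struct vhost_net_virtqueue *nvq,
				   struct file *file)
	{
		INDIRECT_CALL_2(nvq->wake_netdev_queue,
				tap_wake_netdev_queue,
				tun_wake_netdev_queue,
				file);
	}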
* [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available
2025-09-03 13:05 ` Michael S. Tsirkin
@ 2025-09-03 18:29 ` Simon Schippers
0 siblings, 0 replies; 24+ messages in thread
From: Simon Schippers @ 2025-09-03 18:29 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: willemdebruijn.kernel, jasowang, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm, Tim Gebauer
Michael S. Tsirkin wrote:
> On Tue, Sep 02, 2025 at 10:09:54AM +0200, Simon Schippers wrote:
>> The implementation is inspired by ptr_ring_empty.
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>> include/linux/ptr_ring.h | 71 ++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 71 insertions(+)
>>
>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>> index 551329220e4f..6b8cfaecf478 100644
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -243,6 +243,77 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
>> return ret;
>> }
>>
>> +/*
>> + * Check if a spare capacity of cnt is available without taking any locks.
>
> Not sure what "spare" means here. I think you mean
>
> Check if the ring has enough space to produce a given
> number of entries.
>
>> + *
>> + * If cnt==0 or cnt > r->size it acts the same as __ptr_ring_empty.
>
> Logically, cnt = 0 should always be true, cnt > size should always be
> false then?
>
> Why do you want it to act as __ptr_ring_empty?
>
>
>> + *
>> + * The same requirements apply as described for __ptr_ring_empty.
>
>
> Which is:
>
> * However, if some other CPU consumes ring entries at the same time, the value
> * returned is not guaranteed to be correct.
>
>
> but it's not right here yes? consuming entries will just add more
> space ...
>
> Also:
> * In this case - to avoid incorrectly detecting the ring
> * as empty - the CPU consuming the ring entries is responsible
> * for either consuming all ring entries until the ring is empty,
> * or synchronizing with some other CPU and causing it to
> * re-test __ptr_ring_empty and/or consume the ring entries
> * after the synchronization point.
>
> how would you apply this here?
>
>
>> + */
>> +static inline bool __ptr_ring_spare(struct ptr_ring *r, int cnt)
>> +{
>> + int size = r->size;
>> + int to_check;
>> +
>> + if (unlikely(!size || cnt < 0))
>> + return true;
>> +
>> + if (cnt > size)
>> + cnt = 0;
>> +
>> + to_check = READ_ONCE(r->consumer_head) - cnt;
>> +
>> + if (to_check < 0)
>> + to_check += size;
>> +
>> + return !r->queue[to_check];
>> +}
>> +
>
> I will have to look at how this is used to understand if it's
> correct. But I think we need better documentation.
>
>
>> +static inline bool ptr_ring_spare(struct ptr_ring *r, int cnt)
>> +{
>> + bool ret;
>> +
>> + spin_lock(&r->consumer_lock);
>> + ret = __ptr_ring_spare(r, cnt);
>> + spin_unlock(&r->consumer_lock);
>> +
>> + return ret;
>
>
> I don't understand why you take the consumer lock here.
> If a producer is running it will make the value wrong,
> if consumer is running it will just create more space.
>
>
I agree, I messed up the ptr_ring helper.
Your proposed approach is way superior, and I will use it instead.
The idea behind cnt was to have an option for producers that may produce
multiple entries at once, like tap_handle_frame with GSO. But of course
that belongs in a different patch, since tap_handle_frame, which is used
by ipvtap and macvtap, is not covered by this patch series.
* [PATCH 2/4] netdev queue flow control for TUN
2025-09-02 21:20 ` Willem de Bruijn
@ 2025-09-03 18:35 ` Simon Schippers
0 siblings, 0 replies; 24+ messages in thread
From: Simon Schippers @ 2025-09-03 18:35 UTC (permalink / raw)
To: Willem de Bruijn, jasowang, mst, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm
Cc: Tim Gebauer
Willem de Bruijn wrote:
> Simon Schippers wrote:
>> The netdev queue is stopped in tun_net_xmit after inserting an SKB into
>> the ring buffer if the ring buffer became full because of that. If the
>> insertion into the ptr_ring fails, the netdev queue is also stopped and
>> the SKB is dropped. However, this never happened in my testing.
>
> Indeed, since the last successful insertion will always pause the
> queue before this can happen. Since this cannot be reached, no need
> to add the code defensively. If in doubt, maybe add a
> DEBUG_NET_WARN_ON_ONCE.
>
>> To ensure
>> that the ptr_ring change is available to the consumer before the netdev
>> queue stop, an smp_wmb() is used.
>>
>> Then in tun_ring_recv, the new helper wake_netdev_queue is called in the
>> blocking wait queue and after consuming an SKB from the ptr_ring. This
>> helper first checks if the netdev queue has stopped. Then with the paired
>> smp_rmb() it is known that tun_net_xmit will not produce SKBs anymore.
>> With that knowledge, the helper can then wake the netdev queue if there is
>> at least a single spare slot in the ptr_ring by calling ptr_ring_spare
>> with cnt=1.
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>> drivers/net/tun.c | 33 ++++++++++++++++++++++++++++++---
>> 1 file changed, 30 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index cc6c50180663..735498e221d8 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1060,13 +1060,21 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>
>> nf_reset_ct(skb);
>>
>> - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>> + queue = netdev_get_tx_queue(dev, txq);
>> + if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
>> + /* Paired with smp_rmb() in wake_netdev_queue. */
>> + smp_wmb();
>> + netif_tx_stop_queue(queue);
>> drop_reason = SKB_DROP_REASON_FULL_RING;
>> goto drop;
>> }
>> + if (ptr_ring_full(&tfile->tx_ring)) {
>> + /* Paired with smp_rmb() in wake_netdev_queue. */
>> + smp_wmb();
>> + netif_tx_stop_queue(queue);
>> + }
>>
>> /* dev->lltx requires to do our own update of trans_start */
>> - queue = netdev_get_tx_queue(dev, txq);
>> txq_trans_cond_update(queue);
>>
>> /* Notify and wake up reader process */
>> @@ -2110,6 +2118,24 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>> return total;
>> }
>>
>> +static inline void wake_netdev_queue(struct tun_file *tfile)
>> +{
>> + struct netdev_queue *txq;
>> + struct net_device *dev;
>> +
>> + rcu_read_lock();
>> + dev = rcu_dereference(tfile->tun)->dev;
>> + txq = netdev_get_tx_queue(dev, tfile->queue_index);
>> +
>> + if (netif_tx_queue_stopped(txq)) {
>> + /* Paired with smp_wmb() in tun_net_xmit. */
>> + smp_rmb();
>> + if (ptr_ring_spare(&tfile->tx_ring, 1))
>> + netif_tx_wake_queue(txq);
>> + }
>> + rcu_read_unlock();
>> +}
>> +
>> static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>> {
>> DECLARE_WAITQUEUE(wait, current);
>> @@ -2139,7 +2165,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>> error = -EFAULT;
>> break;
>> }
>> -
>> + wake_netdev_queue(tfile);
>
> Why wake when no entry was consumed?
I do it because the queue may not have been woken after the last SKB was
consumed. I am not sure whether it is still strictly necessary after all
the changes to the code, but I think it is wise to keep it so the reader
can never get stuck in the wait queue.
>
> Also keep the empty line.
>
Okay :)
>> schedule();
>> }
>>
>> @@ -2147,6 +2173,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>> remove_wait_queue(&tfile->socket.wq.wait, &wait);
>>
>> out:
>> + wake_netdev_queue(tfile);
>> *err = error;
>> return ptr;
>> }
>> --
>> 2.43.0
>>
>
>
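A sketch of what Willem's suggested defensive check could look like in
tun_net_xmit (macro per include/net/net_debug.h; placement assumed):

	queue = netdev_get_tx_queue(dev, txq);
	if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
		/* Expected to be unreachable: the produce that filled
		 * the ring should already have stopped the queue.
		 */
		DEBUG_NET_WARN_ON_ONCE(1);
		netif_tx_stop_queue(queue);
		drop_reason = SKB_DROP_REASON_FULL_RING;
		goto drop;
	}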
* [PATCH 2/4] netdev queue flow control for TUN
2025-09-03 3:27 ` Jason Wang
@ 2025-09-03 18:41 ` Simon Schippers
0 siblings, 0 replies; 24+ messages in thread
From: Simon Schippers @ 2025-09-03 18:41 UTC (permalink / raw)
To: Jason Wang
Cc: willemdebruijn.kernel, mst, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm, Tim Gebauer
Jason Wang wrote:
> On Tue, Sep 2, 2025 at 4:10 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> The netdev queue is stopped in tun_net_xmit after inserting an SKB into
>> the ring buffer if the ring buffer became full because of that. If the
>> insertion into the ptr_ring fails, the netdev queue is also stopped and
>> the SKB is dropped. However, this never happened in my testing.
>
> You can reach this by using pktgen on TUN.
>
Yes, and I think it could also be reached after ptr_ring_unconsume is
called in vhost_net.
>> To ensure
>> that the ptr_ring change is available to the consumer before the netdev
>> queue stop, an smp_wmb() is used.
>>
>> Then in tun_ring_recv, the new helper wake_netdev_queue is called in the
>> blocking wait queue and after consuming an SKB from the ptr_ring. This
>> helper first checks if the netdev queue has stopped. Then with the paired
>> smp_rmb() it is known that tun_net_xmit will not produce SKBs anymore.
>> With that knowledge, the helper can then wake the netdev queue if there is
>> at least a single spare slot in the ptr_ring by calling ptr_ring_spare
>> with cnt=1.
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>> drivers/net/tun.c | 33 ++++++++++++++++++++++++++++++---
>> 1 file changed, 30 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index cc6c50180663..735498e221d8 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1060,13 +1060,21 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>
>> nf_reset_ct(skb);
>>
>> - if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>> + queue = netdev_get_tx_queue(dev, txq);
>> + if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
>> + /* Paired with smp_rmb() in wake_netdev_queue. */
>> + smp_wmb();
>> + netif_tx_stop_queue(queue);
>
> The barrier looks odd since it requires the driver to care about the
> ordering; can you elaborate more on this?
>
> There's a WRITE_ONCE + mb() in netif_tx_stop_queue already:
>
> static __always_inline void netif_tx_stop_queue(struct netdev_queue *dev_queue)
> {
> /* Paired with READ_ONCE() from dev_watchdog() */
> WRITE_ONCE(dev_queue->trans_start, jiffies);
>
> /* This barrier is paired with smp_mb() from dev_watchdog() */
> smp_mb__before_atomic();
>
> /* Must be an atomic op see netif_txq_try_stop() */
> set_bit(__QUEUE_STATE_DRV_XOFF, &dev_queue->state);
> }
>
>> drop_reason = SKB_DROP_REASON_FULL_RING;
>> goto drop;
>> }
>> + if (ptr_ring_full(&tfile->tx_ring)) {
>> + /* Paired with smp_rmb() in wake_netdev_queue. */
>> + smp_wmb();
>> + netif_tx_stop_queue(queue);
>> + }
>>
>> /* dev->lltx requires to do our own update of trans_start */
>> - queue = netdev_get_tx_queue(dev, txq);
>> txq_trans_cond_update(queue);
>>
>> /* Notify and wake up reader process */
>> @@ -2110,6 +2118,24 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>> return total;
>> }
>>
>> +static inline void wake_netdev_queue(struct tun_file *tfile)
>
> Let's rename this to tun_wake_xxx.
>
Okay. I will rename wake_netdev_queue to tun_wake_netdev_queue and
tap_wake_netdev_queue respectively and remove inline. Then vhost_net just
calls these methods in vhost_net_buf_produce with file->private_data.
>> +{
>> + struct netdev_queue *txq;
>> + struct net_device *dev;
>> +
>> + rcu_read_lock();
>> + dev = rcu_dereference(tfile->tun)->dev;
>> + txq = netdev_get_tx_queue(dev, tfile->queue_index);
>> +
>> + if (netif_tx_queue_stopped(txq)) {
>> + /* Paired with smp_wmb() in tun_net_xmit. */
>> + smp_rmb();
>> + if (ptr_ring_spare(&tfile->tx_ring, 1))
>
> I wonder if there would be a case that will use cnt > 1. If not a
> ptr_ring_can_produce() should be sufficient.
>
>> + netif_tx_wake_queue(txq);
>> + }
>> + rcu_read_unlock();
>> +}
>> +
>> static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>> {
>> DECLARE_WAITQUEUE(wait, current);
>> @@ -2139,7 +2165,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>> error = -EFAULT;
>> break;
>> }
>> -
>> + wake_netdev_queue(tfile);
>> schedule();
>> }
>>
>> @@ -2147,6 +2173,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>> remove_wait_queue(&tfile->socket.wq.wait, &wait);
>>
>> out:
>> + wake_netdev_queue(tfile);
>> *err = error;
>> return ptr;
>> }
>> --
>> 2.43.0
>>
>
> Thanks
>
* Re: [PATCH 2/4] netdev queue flow control for TUN
2025-09-03 13:10 ` Michael S. Tsirkin
@ 2025-09-03 18:45 ` Simon Schippers
0 siblings, 0 replies; 24+ messages in thread
From: Simon Schippers @ 2025-09-03 18:45 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: willemdebruijn.kernel, jasowang, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm, Tim Gebauer
Michael S. Tsirkin wrote:
> On Tue, Sep 02, 2025 at 10:09:55AM +0200, Simon Schippers wrote:
>> The netdev queue is stopped in tun_net_xmit after inserting an SKB into
>> the ring buffer if the ring buffer became full because of that. If the
>> insertion into the ptr_ring fails, the netdev queue is also stopped and
>> the SKB is dropped. However, this never happened in my testing. To ensure
>> that the ptr_ring change is available to the consumer before the netdev
>> queue stop, an smp_wmb() is used.
>>
>> Then in tun_ring_recv, the new helper wake_netdev_queue is called in the
>> blocking wait queue and after consuming an SKB from the ptr_ring. This
>> helper first checks if the netdev queue has stopped. Then with the paired
>> smp_rmb() it is known that tun_net_xmit will not produce SKBs anymore.
>> With that knowledge, the helper can then wake the netdev queue if there is
>> at least a single spare slot in the ptr_ring by calling ptr_ring_spare
>> with cnt=1.
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>
>
> Oh you just want to know if produce will succeed?
> Kind of a version of peek but for producer?
>
> So all this cuteness of looking at the consumer is actually not necessary,
> and bad for cache.
>
> You just want this:
>
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 551329220e4f..de25fe81dd4e 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -96,6 +96,14 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
> return ret;
> }
>
> +static inline int __ptr_ring_produce_peek(struct ptr_ring *r)
> +{
> + if (unlikely(!r->size) || r->queue[r->producer])
> + return -ENOSPC;
> +
> + return 0;
> +}
> +
> /* Note: callers invoking this in a loop must use a compiler barrier,
> * for example cpu_relax(). Callers must hold producer_lock.
> * Callers are responsible for making sure pointer that is being queued
> @@ -103,8 +111,10 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
> */
> static inline int __ptr_ring_produce(struct ptr_ring *r, void *ptr)
> {
> - if (unlikely(!r->size) || r->queue[r->producer])
> - return -ENOSPC;
> + int r = __ptr_ring_produce_peek(r);
> +
> + if (r)
> + return r;
>
> /* Make sure the pointer we are storing points to a valid data. */
> /* Pairs with the dependency ordering in __ptr_ring_consume. */
>
>
>
> Add some docs, and call this, then wake. No?
>
Yes, this looks great! I like that it does not need any further logic :)
I will just call this method, instead of my own approach, in
wake_netdev_queue without taking any locks. That should be fine because
at that point the producer is known to be stopped, since the netdev
queue is stopped.
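Put together, the reworked helper might look like this (a sketch of what
is described here, not the posted code; it mirrors the patch 2 helper
with the consumer-side check swapped for MST's producer-side peek):

	static void tun_wake_netdev_queue(struct tun_file *tfile)
	{
		struct netdev_queue *txq;
		struct net_device *dev;

		rcu_read_lock();
		dev = rcu_dereference(tfile->tun)->dev;
		txq = netdev_get_tx_queue(dev, tfile->queue_index);

		/* No producer_lock needed: a stopped queue means
		 * tun_net_xmit cannot run, so the peek is stable.
		 */
		if (netif_tx_queue_stopped(txq) &&
		    !__ptr_ring_produce_peek(&tfile->tx_ring))
			netif_tx_wake_queue(txq);
		rcu_read_unlock();
	}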
* [PATCH 2/4] netdev queue flow control for TUN
2025-09-03 13:13 ` Michael S. Tsirkin
@ 2025-09-03 18:49 ` Simon Schippers
0 siblings, 0 replies; 24+ messages in thread
From: Simon Schippers @ 2025-09-03 18:49 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: willemdebruijn.kernel, jasowang, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm, Tim Gebauer
Michael S. Tsirkin wrote:
> On Tue, Sep 02, 2025 at 10:09:55AM +0200, Simon Schippers wrote:
>> The netdev queue is stopped in tun_net_xmit after inserting an SKB into
>> the ring buffer if the ring buffer became full because of that. If the
>> insertion into the ptr_ring fails, the netdev queue is also stopped and
>> the SKB is dropped. However, this never happened in my testing. To ensure
>> that the ptr_ring change is available to the consumer before the netdev
>> queue stop, an smp_wmb() is used.
>
> I think the stop -> wake bounce involves enough barriers already,
> no need for us to get cute.
>
Yes, you and Jason are correct, it seems to be unnecessary. I just removed
the barriers, and it tests fine!
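With the barriers dropped, the xmit-side stop logic reduces to the
following fragment (a sketch of the described change):

	queue = netdev_get_tx_queue(dev, txq);
	if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
		netif_tx_stop_queue(queue);
		drop_reason = SKB_DROP_REASON_FULL_RING;
		goto drop;
	}
	if (ptr_ring_full(&tfile->tx_ring))
		netif_tx_stop_queue(queue);

	/* netif_tx_stop_queue already issues smp_mb__before_atomic before
	 * setting the XOFF bit, and the wake side re-checks the ring after
	 * observing the stopped queue, so no extra smp_wmb/smp_rmb pair
	 * is required.
	 */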
* [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop
2025-09-03 4:00 ` [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop Jason Wang
@ 2025-09-03 18:55 ` Simon Schippers
0 siblings, 0 replies; 24+ messages in thread
From: Simon Schippers @ 2025-09-03 18:55 UTC (permalink / raw)
To: Jason Wang
Cc: willemdebruijn.kernel, mst, eperezma, stephen, netdev,
linux-kernel, virtualization, kvm
Jason Wang wrote:
> On Tue, Sep 2, 2025 at 4:10 PM Simon Schippers
> <simon.schippers@tu-dortmund.de> wrote:
>>
>> This patch series deals with TUN/TAP and vhost_net which drop incoming
>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
>> patch series, the associated netdev queue is stopped before this happens.
>> This allows the connected qdisc to function correctly as reported by [1]
>> and improves application-layer performance, see benchmarks.
>>
>> This patch series includes TUN, TAP, and vhost_net because they share
>> logic. Adjusting only one of them would break the others. Therefore, the
>> patch series is structured as follows:
>> 1. New ptr_ring_spare helper to check if the ptr_ring has spare capacity
>> 2. Netdev queue flow control for TUN: Logic for stopping the queue upon
>> full ptr_ring and waking the queue if ptr_ring has spare capacity
>> 3. Additions for TAP: Similar logic for waking the queue
>> 4. Additions for vhost_net: Calling TUN/TAP methods for waking the queue
>>
>> Benchmarks ([2] & [3]):
>> - TUN: TCP throughput over real-world 120ms RTT OpenVPN connection
>> improved by 36% (117Mbit/s vs 185 Mbit/s)
>> - TAP: TCP throughput to local qemu VM stays the same (2.2Gbit/s), an
>> improvement by factor 2 at emulated 120ms RTT (98Mbit/s vs 198Mbit/s)
>> - TAP+vhost_net: TCP throughput to local qemu VM approx. the same
>> (23.4Gbit/s vs 23.9Gbit/s), same performance at emulated 120ms RTT
>> (200Mbit/s)
>> - TUN/TAP/TAP+vhost_net: Reduction of ptr_ring size to ~10 packets
>> possible without losing performance
>>
>> Possible future work:
>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
>> - Adaption of the netdev queue flow control for ipvtap & macvtap
>
> Could you please run pktgen on TUN as well to see the difference?
>
> Thanks
>
Yes, I will look into it :)
* Re: [PATCH 4/4] netdev queue flow control for vhost_net
2025-09-03 13:51 ` Willem de Bruijn
@ 2025-09-04 2:47 ` Jason Wang
0 siblings, 0 replies; 24+ messages in thread
From: Jason Wang @ 2025-09-04 2:47 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Simon Schippers, mst, eperezma, stephen, netdev, linux-kernel,
virtualization, kvm, Tim Gebauer
On Wed, Sep 3, 2025 at 9:52 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Wang wrote:
> > On Wed, Sep 3, 2025 at 5:31 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Simon Schippers wrote:
> > > > Stopping the queue is done in tun_net_xmit.
> > > >
> > > > Waking the queue is done by calling one of the helpers,
> > > > tun_wake_netdev_queue and tap_wake_netdev_queue. For that, in
> > > > get_wake_netdev_queue, the correct method is determined and saved in the
> > > > function pointer wake_netdev_queue of the vhost_net_virtqueue. Then, each
> > > > time after consuming a batch in vhost_net_buf_produce, wake_netdev_queue
> > > > is called.
> > > >
> > > > Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > > > Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
> > > > Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> > > > ---
> > > > drivers/net/tap.c | 6 ++++++
> > > > drivers/net/tun.c | 6 ++++++
> > > > drivers/vhost/net.c | 34 ++++++++++++++++++++++++++++------
> > > > include/linux/if_tap.h | 2 ++
> > > > include/linux/if_tun.h | 3 +++
> > > > 5 files changed, 45 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> > > > index 4d874672bcd7..0bad9e3d59af 100644
> > > > --- a/drivers/net/tap.c
> > > > +++ b/drivers/net/tap.c
> > > > @@ -1198,6 +1198,12 @@ struct socket *tap_get_socket(struct file *file)
> > > > }
> > > > EXPORT_SYMBOL_GPL(tap_get_socket);
> > > >
> > > > +void tap_wake_netdev_queue(struct file *file)
> > > > +{
> > > > + wake_netdev_queue(file->private_data);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(tap_wake_netdev_queue);
> > > > +
> > > > struct ptr_ring *tap_get_ptr_ring(struct file *file)
> > > > {
> > > > struct tap_queue *q;
> > > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > > > index 735498e221d8..e85589b596ac 100644
> > > > --- a/drivers/net/tun.c
> > > > +++ b/drivers/net/tun.c
> > > > @@ -3739,6 +3739,12 @@ struct socket *tun_get_socket(struct file *file)
> > > > }
> > > > EXPORT_SYMBOL_GPL(tun_get_socket);
> > > >
> > > > +void tun_wake_netdev_queue(struct file *file)
> > > > +{
> > > > + wake_netdev_queue(file->private_data);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(tun_wake_netdev_queue);
> > >
> > > Having multiple functions with the same name is a tad annoying from a
> > > cscope PoV; better to call the internal functions
> > > __tun_wake_netdev_queue, etc.
> > >
> > > > +
> > > > struct ptr_ring *tun_get_tx_ring(struct file *file)
> > > > {
> > > > struct tun_file *tfile;
> > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > > index 6edac0c1ba9b..e837d3a334f1 100644
> > > > --- a/drivers/vhost/net.c
> > > > +++ b/drivers/vhost/net.c
> > > > @@ -130,6 +130,7 @@ struct vhost_net_virtqueue {
> > > > struct vhost_net_buf rxq;
> > > > /* Batched XDP buffs */
> > > > struct xdp_buff *xdp;
> > > > + void (*wake_netdev_queue)(struct file *f);
> > >
> > > Indirect function calls are expensive post spectre. Probably
> > > preferable to just have a branch.
> > >
> > > A branch on `file->f_op != &tun_fops` would still be expensive, as it
> > > may touch a cold cacheline.
> > >
> > > How about adding a bit in struct ptr_ring itself. Pahole shows plenty
> > > of holes. Jason, WDYT?
> > >
> >
> > I'm not sure I get the idea; did you mean a bit for classifying TUN
> > and TAP? If so, I'm not sure it's a good idea, as ptr_ring should
> > have no knowledge of its user.
>
> That is what I meant.
>
> > Consider there were still indirect calls to sock->ops, maybe we can
> > start from the branch.
>
> What do you mean?
>
> Tangential: if indirect calls really are needed in a hot path, e.g.,
> to maintain this isolation of ptr_ring from its users, then
> INDIRECT_CALL wrappers can be used to avoid the cost.
>
> That too effectively breaks the isolation between caller and callee.
> But only for the most important N callees that are listed in the
> INDIRECT_CALL_? wrapper.
Yes, I mean we can try to store the flag for example vhost_virtqueue struct.
Thanks
>
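A sketch of that suggestion, as the branch-based alternative to the
indirect call discussed above (field name and placement assumed):

	/* New member in the virtqueue struct, set once when the backend
	 * fd is attached:
	 */
	bool wake_is_tap;	/* backend is tap (true) or tun (false) */

	static void vhost_net_wake(struct vhost_net_virtqueue *nvq,
				   struct file *file)
	{
		if (nvq->wake_is_tap)
			tap_wake_netdev_queue(file);
		else
			tun_wake_netdev_queue(file);
	}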
end of thread, other threads:[~2025-09-04 2:47 UTC | newest]
Thread overview: 24+ messages
2025-09-02 8:09 [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop Simon Schippers
2025-09-02 8:09 ` [PATCH 1/4] ptr_ring_spare: Helper to check if spare capacity of size cnt is available Simon Schippers
2025-09-02 21:13 ` Willem de Bruijn
2025-09-03 3:13 ` Jason Wang
2025-09-03 13:05 ` Michael S. Tsirkin
2025-09-03 18:29 ` Simon Schippers
2025-09-02 8:09 ` [PATCH 2/4] netdev queue flow control for TUN Simon Schippers
2025-09-02 21:20 ` Willem de Bruijn
2025-09-03 18:35 ` Simon Schippers
2025-09-02 21:31 ` Willem de Bruijn
2025-09-03 3:27 ` Jason Wang
2025-09-03 18:41 ` Simon Schippers
2025-09-03 13:10 ` Michael S. Tsirkin
2025-09-03 18:45 ` Simon Schippers
2025-09-03 13:13 ` Michael S. Tsirkin
2025-09-03 18:49 ` Simon Schippers
2025-09-02 8:09 ` [PATCH 3/4] netdev queue flow control for TAP Simon Schippers
2025-09-02 8:09 ` [PATCH 4/4] netdev queue flow control for vhost_net Simon Schippers
2025-09-02 21:31 ` Willem de Bruijn
2025-09-03 3:42 ` Jason Wang
2025-09-03 13:51 ` Willem de Bruijn
2025-09-04 2:47 ` Jason Wang
2025-09-03 4:00 ` [PATCH net-next v4 0/4] TUN/TAP & vhost_net: netdev queue flow control to avoid ptr_ring tail drop Jason Wang
2025-09-03 18:55 ` Simon Schippers