* [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache
@ 2025-11-25 20:00 Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 1/9] tun: cleanup out label in tun_xdp_one Jon Kohler
` (9 more replies)
0 siblings, 10 replies; 29+ messages in thread
From: Jon Kohler @ 2025-11-25 20:00 UTC (permalink / raw)
To: netdev, Alexei Starovoitov, Daniel Borkmann, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_)
Cc: Jon Kohler
Use the per-CPU NAPI cache for SKB allocation in most places, and
leverage bulk allocation for tun_xdp_one since the batch size is known
at submission time. Additionally, utilize napi_build_skb and
napi_consume_skb to further benefit from the NAPI cache. This all
improves efficiency by reducing allocation overhead.
Note: This series does not address the large payload path in
tun_alloc_skb, which spans sock.c and skbuff.c. A separate series will
handle privatizing the allocation code in tun and integrating the NAPI
cache for that path.
Results using basic iperf3 UDP test:
TX guest: taskset -c 2 iperf3 -c rx-ip-here -t 30 -p 5200 -b 0 -u -i 30
RX guest: taskset -c 2 iperf3 -s -p 5200 -D
Bitrate
Before: 6.08 Gbits/sec
After : 6.36 Gbits/sec
However, the basic test doesn't tell the whole story. Comparing
flamegraphs from before and after, fewer cycles are spent on the RX
vhost thread in the single-host guest-to-guest case, and also fewer
cycles in the guest-to-guest case across separate hosts, as the host
NIC handlers benefit from these NAPI-allocated SKBs (and deferred free)
as well.
Speaking of deferred free, v2 adds an export of the deferred free
flush from net core and uses it immediately prior to the bulk cache
get. This not only keeps the cache as warm as it can get, but also
prevents a TX-heavy vhost thread from getting IPI'd like it's going out
of style. This approach is similar in concept to what happens in the
NAPI loop in net_rx_action.
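To make the intent concrete, here is a minimal sketch of the pattern
(not the exact diff; the real change lives in patches 8 and 9, and the
surrounding locking already exists in tun_sendmsg):

    /* tun_sendmsg(), TUN_MSG_PTR batch path - conceptual sketch only */
    local_bh_disable();
    rcu_read_lock();

    /* Drain this CPU's deferred free list so the per-CPU NAPI skb
     * cache is refilled locally and the defer list does not grow to
     * the point where remote CPUs IPI this core to flush it.
     */
    skb_defer_free_flush();

    /* Then pull up to n skbs straight out of the (now warm) cache. */
    num_skbs = napi_skb_cache_get_bulk(skbs, n);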
I've also merged this series with a small series about cleaning up
packet drop statistics along the various error paths in tun, as I want
to make sure those all go through kfree_skb_reason(), and we'd have
merge conflicts separating the two. If the maintainers want to take
them separately, happy to break them apart if needed. It is fairly
clean keeping them together otherwise.
Thanks all,
Jon
v2:
- Added drop statistics cleanup series, else merge conflicts abound
- Removed xdp_prog change (Willem)
- Clarified drop scenario in tun_put_user, where it is an extension of
current behavior (Willem comment from v1)
- Export skb_defer_free_flush
- Use deferred skb free to immediately refill cache prior to bulk alloc,
which also prevents IPIs from pounding TX heavy / TX only cores
v1: https://patchwork.kernel.org/project/netdevbpf/cover/20250506145530.2877229-1-jon@nutanix.com/
Jon Kohler (9):
tun: cleanup out label in tun_xdp_one
tun: correct drop statistics in tun_xdp_one
tun: correct drop statistics in tun_put_user
tun: correct drop statistics in tun_get_user
tun: use bulk NAPI cache allocation in tun_xdp_one
tun: use napi_build_skb in __tun_build_skb
tun: use napi_consume_skb() in tun_put_user
net: core: export skb_defer_free_flush
tun: flush deferred skb free list before bulk NAPI cache get
drivers/net/tun.c | 170 +++++++++++++++++++++++++++++------------
include/linux/skbuff.h | 1 +
net/core/dev.c | 3 +-
3 files changed, 126 insertions(+), 48 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH net-next v2 1/9] tun: cleanup out label in tun_xdp_one
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
@ 2025-11-25 20:00 ` Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 2/9] tun: correct drop statistics " Jon Kohler
` (8 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Jon Kohler @ 2025-11-25 20:00 UTC (permalink / raw)
To: netdev, Willem de Bruijn, Jason Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_)
Cc: Jon Kohler
Make all previous jumps to the out label return directly, as the out
label only contained 'return ret;'.
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
drivers/net/tun.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8192740357a0..dcce49a26800 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2435,8 +2435,7 @@ static int tun_xdp_one(struct tun_struct *tun,
build:
skb = build_skb(xdp->data_hard_start, buflen);
if (!skb) {
- ret = -ENOMEM;
- goto out;
+ return -ENOMEM;
}
skb_reserve(skb, xdp->data - xdp->data_hard_start);
@@ -2455,8 +2454,7 @@ static int tun_xdp_one(struct tun_struct *tun,
if (tun_vnet_hdr_tnl_to_skb(tun->flags, features, skb, tnl_hdr)) {
atomic_long_inc(&tun->rx_frame_errors);
kfree_skb(skb);
- ret = -EINVAL;
- goto out;
+ return -EINVAL;
}
skb->protocol = eth_type_trans(skb, tun->dev);
@@ -2467,8 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
if (skb_xdp) {
ret = do_xdp_generic(xdp_prog, &skb);
if (ret != XDP_PASS) {
- ret = 0;
- goto out;
+ return 0;
}
}
@@ -2502,7 +2499,6 @@ static int tun_xdp_one(struct tun_struct *tun,
if (rxhash)
tun_flow_update(tun, rxhash, tfile);
-out:
return ret;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH net-next v2 2/9] tun: correct drop statistics in tun_xdp_one
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 1/9] tun: cleanup out label in tun_xdp_one Jon Kohler
@ 2025-11-25 20:00 ` Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 3/9] tun: correct drop statistics in tun_put_user Jon Kohler
` (7 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Jon Kohler @ 2025-11-25 20:00 UTC (permalink / raw)
To: netdev, Willem de Bruijn, Jason Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_)
Cc: Jon Kohler
Add a drop reason to the kfree_skb calls in tun_xdp_one and ensure
that all failing paths properly increment the drop counter.
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
drivers/net/tun.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index dcce49a26800..68ad46ab04a4 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2392,8 +2392,10 @@ static int tun_xdp_one(struct tun_struct *tun,
bool skb_xdp = false;
struct page *page;
- if (unlikely(datasize < ETH_HLEN))
+ if (unlikely(datasize < ETH_HLEN)) {
+ dev_core_stats_rx_dropped_inc(tun->dev);
return -EINVAL;
+ }
xdp_prog = rcu_dereference(tun->xdp_prog);
if (xdp_prog) {
@@ -2407,6 +2409,7 @@ static int tun_xdp_one(struct tun_struct *tun,
act = bpf_prog_run_xdp(xdp_prog, xdp);
ret = tun_xdp_act(tun, xdp_prog, xdp, act);
if (ret < 0) {
+ /* tun_xdp_act already handles drop statistics */
put_page(virt_to_head_page(xdp->data));
return ret;
}
@@ -2435,6 +2438,7 @@ static int tun_xdp_one(struct tun_struct *tun,
build:
skb = build_skb(xdp->data_hard_start, buflen);
if (!skb) {
+ dev_core_stats_rx_dropped_inc(tun->dev);
return -ENOMEM;
}
@@ -2453,7 +2457,8 @@ static int tun_xdp_one(struct tun_struct *tun,
tnl_hdr = (struct virtio_net_hdr_v1_hash_tunnel *)gso;
if (tun_vnet_hdr_tnl_to_skb(tun->flags, features, skb, tnl_hdr)) {
atomic_long_inc(&tun->rx_frame_errors);
- kfree_skb(skb);
+ kfree_skb_reason(skb, SKB_DROP_REASON_DEV_HDR);
+ dev_core_stats_rx_dropped_inc(tun->dev);
return -EINVAL;
}
@@ -2479,7 +2484,8 @@ static int tun_xdp_one(struct tun_struct *tun,
if (unlikely(tfile->detached)) {
spin_unlock(&queue->lock);
- kfree_skb(skb);
+ kfree_skb_reason(skb, SKB_DROP_REASON_DEV_READY);
+ dev_core_stats_rx_dropped_inc(tun->dev);
return -EBUSY;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH net-next v2 3/9] tun: correct drop statistics in tun_put_user
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 1/9] tun: cleanup out label in tun_xdp_one Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 2/9] tun: correct drop statistics " Jon Kohler
@ 2025-11-25 20:00 ` Jon Kohler
2025-11-29 3:07 ` Willem de Bruijn
2025-11-25 20:00 ` [PATCH net-next v2 4/9] tun: correct drop statistics in tun_get_user Jon Kohler
` (6 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Jon Kohler @ 2025-11-25 20:00 UTC (permalink / raw)
To: netdev, Willem de Bruijn, Jason Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
open list
Cc: Jon Kohler
Fold the kfree_skb and consume_skb calls for tun_put_user into
tun_put_user itself and rework kfree_skb to take a drop reason. Add a
drop reason to all drop sites and ensure that all failing paths
properly increment the drop counter.
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
drivers/net/tun.c | 51 +++++++++++++++++++++++++++++++----------------
1 file changed, 34 insertions(+), 17 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 68ad46ab04a4..e0f5e1fe4bd0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2035,6 +2035,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
struct sk_buff *skb,
struct iov_iter *iter)
{
+ enum skb_drop_reason drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
struct tun_pi pi = { 0, skb->protocol };
ssize_t total;
int vlan_offset = 0;
@@ -2051,8 +2052,11 @@ static ssize_t tun_put_user(struct tun_struct *tun,
total = skb->len + vlan_hlen + vnet_hdr_sz;
if (!(tun->flags & IFF_NO_PI)) {
- if (iov_iter_count(iter) < sizeof(pi))
- return -EINVAL;
+ if (iov_iter_count(iter) < sizeof(pi)) {
+ ret = -EINVAL;
+ drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
+ goto drop;
+ }
total += sizeof(pi);
if (iov_iter_count(iter) < total) {
@@ -2060,8 +2064,11 @@ static ssize_t tun_put_user(struct tun_struct *tun,
pi.flags |= TUN_PKT_STRIP;
}
- if (copy_to_iter(&pi, sizeof(pi), iter) != sizeof(pi))
- return -EFAULT;
+ if (copy_to_iter(&pi, sizeof(pi), iter) != sizeof(pi)) {
+ ret = -EFAULT;
+ drop_reason = SKB_DROP_REASON_SKB_UCOPY_FAULT;
+ goto drop;
+ }
}
if (vnet_hdr_sz) {
@@ -2070,8 +2077,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
ret = tun_vnet_hdr_tnl_from_skb(tun->flags, tun->dev, skb,
&hdr);
- if (ret)
- return ret;
+ if (ret) {
+ drop_reason = SKB_DROP_REASON_DEV_HDR;
+ goto drop;
+ }
/*
* Drop the packet if the configured header size is too small
@@ -2080,8 +2089,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
gso = (struct virtio_net_hdr *)&hdr;
ret = __tun_vnet_hdr_put(vnet_hdr_sz, tun->dev->features,
iter, gso);
- if (ret)
- return ret;
+ if (ret) {
+ drop_reason = SKB_DROP_REASON_DEV_HDR;
+ goto drop;
+ }
}
if (vlan_hlen) {
@@ -2094,23 +2105,33 @@ static ssize_t tun_put_user(struct tun_struct *tun,
vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
ret = skb_copy_datagram_iter(skb, 0, iter, vlan_offset);
- if (ret || !iov_iter_count(iter))
- goto done;
+ if (ret || !iov_iter_count(iter)) {
+ drop_reason = SKB_DROP_REASON_DEV_HDR;
+ goto drop;
+ }
ret = copy_to_iter(&veth, sizeof(veth), iter);
- if (ret != sizeof(veth) || !iov_iter_count(iter))
- goto done;
+ if (ret != sizeof(veth) || !iov_iter_count(iter)) {
+ drop_reason = SKB_DROP_REASON_DEV_HDR;
+ goto drop;
+ }
}
skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset);
-done:
/* caller is in process context, */
preempt_disable();
dev_sw_netstats_tx_add(tun->dev, 1, skb->len + vlan_hlen);
preempt_enable();
+ consume_skb(skb);
+
return total;
+
+drop:
+ dev_core_stats_tx_dropped_inc(tun->dev);
+ kfree_skb_reason(skb, drop_reason);
+ return ret;
}
static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
@@ -2182,10 +2203,6 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
struct sk_buff *skb = ptr;
ret = tun_put_user(tun, tfile, skb, to);
- if (unlikely(ret < 0))
- kfree_skb(skb);
- else
- consume_skb(skb);
}
return ret;
--
2.43.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH net-next v2 4/9] tun: correct drop statistics in tun_get_user
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
` (2 preceding siblings ...)
2025-11-25 20:00 ` [PATCH net-next v2 3/9] tun: correct drop statistics in tun_put_user Jon Kohler
@ 2025-11-25 20:00 ` Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one Jon Kohler
` (5 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Jon Kohler @ 2025-11-25 20:00 UTC (permalink / raw)
To: netdev, Willem de Bruijn, Jason Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_)
Cc: Jon Kohler, Chuang Wang
Improve on commit 4b4f052e2d89 ("net: tun: track dropped skb via
kfree_skb_reason()") and commit ab00af85d2f8 ("net: tun: rebuild error
handling in tun_get_user") by updating all potential drop sites in
tun_get_user with appropriate drop reasons.
Rework goto free_skb into goto drop, so that drop statistics are
updated. Redirect early failures to drop_stats_only, which does not
need to worry about the skb, as it is not allocated yet at that point.
Cc: Chuang Wang <nashuiliang@gmail.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
drivers/net/tun.c | 53 +++++++++++++++++++++++++++++++++++------------
1 file changed, 40 insertions(+), 13 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e0f5e1fe4bd0..97f130bc5fed 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1657,6 +1657,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
}
err = tun_xdp_act(tun, xdp_prog, &xdp, act);
if (err < 0) {
+ /* tun_xdp_act already handles drop statistics */
if (act == XDP_REDIRECT || act == XDP_TX)
put_page(alloc_frag->page);
goto out;
@@ -1720,12 +1721,17 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
gso = (struct virtio_net_hdr *)&hdr;
if (!(tun->flags & IFF_NO_PI)) {
- if (len < sizeof(pi))
- return -EINVAL;
+ if (len < sizeof(pi)) {
+ err = -EINVAL;
+ goto drop_stats_only;
+ }
+
len -= sizeof(pi);
- if (!copy_from_iter_full(&pi, sizeof(pi), from))
- return -EFAULT;
+ if (!copy_from_iter_full(&pi, sizeof(pi), from)) {
+ err = -EFAULT;
+ goto drop_stats_only;
+ }
}
if (tun->flags & IFF_VNET_HDR) {
@@ -1734,16 +1740,20 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
features = tun_vnet_hdr_guest_features(vnet_hdr_sz);
hdr_len = __tun_vnet_hdr_get(vnet_hdr_sz, tun->flags,
features, from, gso);
- if (hdr_len < 0)
- return hdr_len;
+ if (hdr_len < 0) {
+ err = hdr_len;
+ goto drop_stats_only;
+ }
len -= vnet_hdr_sz;
}
if ((tun->flags & TUN_TYPE_MASK) == IFF_TAP) {
align += NET_IP_ALIGN;
- if (unlikely(len < ETH_HLEN || (hdr_len && hdr_len < ETH_HLEN)))
- return -EINVAL;
+ if (unlikely(len < ETH_HLEN || (hdr_len && hdr_len < ETH_HLEN))) {
+ err = -EINVAL;
+ goto drop_stats_only;
+ }
}
good_linear = SKB_MAX_HEAD(align);
@@ -1769,9 +1779,18 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
*/
skb = tun_build_skb(tun, tfile, from, gso, len, &skb_xdp);
err = PTR_ERR_OR_ZERO(skb);
- if (err)
+ if (err) {
+ drop_reason = err == -ENOMEM ?
+ SKB_DROP_REASON_NOMEM :
+ SKB_DROP_REASON_SKB_UCOPY_FAULT;
goto drop;
+ }
if (!skb)
+ /* tun_build_skb can return null with no err ptr
+ * from XDP paths, return total_len and always
+ * appear successful to caller, as drop statistics
+ * are already handled.
+ */
return total_len;
} else {
if (!zerocopy) {
@@ -1796,8 +1815,10 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
}
err = PTR_ERR_OR_ZERO(skb);
- if (err)
+ if (err) {
+ drop_reason = SKB_DROP_REASON_NOMEM;
goto drop;
+ }
if (zerocopy)
err = zerocopy_sg_from_iter(skb, from);
@@ -1814,7 +1835,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
if (tun_vnet_hdr_tnl_to_skb(tun->flags, features, skb, &hdr)) {
atomic_long_inc(&tun->rx_frame_errors);
err = -EINVAL;
- goto free_skb;
+ drop_reason = SKB_DROP_REASON_DEV_HDR;
+ goto drop;
}
switch (tun->flags & TUN_TYPE_MASK) {
@@ -1831,6 +1853,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
break;
default:
err = -EINVAL;
+ drop_reason = SKB_DROP_REASON_INVALID_PROTO;
goto drop;
}
}
@@ -1938,7 +1961,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
spin_unlock_bh(&queue->lock);
rcu_read_unlock();
err = -EBUSY;
- goto free_skb;
+ drop_reason = SKB_DROP_REASON_DEV_READY;
+ goto drop;
}
__skb_queue_tail(queue, skb);
@@ -1969,7 +1993,6 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
if (err != -EAGAIN)
dev_core_stats_rx_dropped_inc(tun->dev);
-free_skb:
if (!IS_ERR_OR_NULL(skb))
kfree_skb_reason(skb, drop_reason);
@@ -1980,6 +2003,10 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
}
return err ?: total_len;
+
+drop_stats_only:
+ dev_core_stats_rx_dropped_inc(tun->dev);
+ return err;
}
static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
--
2.43.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
` (3 preceding siblings ...)
2025-11-25 20:00 ` [PATCH net-next v2 4/9] tun: correct drop statistics in tun_get_user Jon Kohler
@ 2025-11-25 20:00 ` Jon Kohler
2025-11-28 3:02 ` Jason Wang
2025-11-25 20:00 ` [PATCH net-next v2 6/9] tun: use napi_build_skb in __tun_build_skb Jon Kohler
` (4 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Jon Kohler @ 2025-11-25 20:00 UTC (permalink / raw)
To: netdev, Willem de Bruijn, Jason Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_)
Cc: Jon Kohler
Optimize TUN_MSG_PTR batch processing by allocating sk_buff structures
in bulk from the per-CPU NAPI cache using napi_skb_cache_get_bulk.
This reduces allocation overhead and improves efficiency, especially
when IFF_NAPI is enabled and GRO is feeding entries back to the cache.
If bulk allocation cannot fully satisfy the batch, gracefully drop only
the uncovered portion and allow the rest of the batch to proceed; this
matches the existing behavior when build_skb() fails and returns
-ENOMEM.
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
drivers/net/tun.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 97f130bc5fed..64f944cce517 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2420,13 +2420,13 @@ static void tun_put_page(struct tun_page *tpage)
static int tun_xdp_one(struct tun_struct *tun,
struct tun_file *tfile,
struct xdp_buff *xdp, int *flush,
- struct tun_page *tpage)
+ struct tun_page *tpage,
+ struct sk_buff *skb)
{
unsigned int datasize = xdp->data_end - xdp->data;
struct virtio_net_hdr *gso = xdp->data_hard_start;
struct virtio_net_hdr_v1_hash_tunnel *tnl_hdr;
struct bpf_prog *xdp_prog;
- struct sk_buff *skb = NULL;
struct sk_buff_head *queue;
netdev_features_t features;
u32 rxhash = 0, act;
@@ -2437,6 +2437,7 @@ static int tun_xdp_one(struct tun_struct *tun,
struct page *page;
if (unlikely(datasize < ETH_HLEN)) {
+ kfree_skb_reason(skb, SKB_DROP_REASON_PKT_TOO_SMALL);
dev_core_stats_rx_dropped_inc(tun->dev);
return -EINVAL;
}
@@ -2454,6 +2455,7 @@ static int tun_xdp_one(struct tun_struct *tun,
ret = tun_xdp_act(tun, xdp_prog, xdp, act);
if (ret < 0) {
/* tun_xdp_act already handles drop statistics */
+ kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
put_page(virt_to_head_page(xdp->data));
return ret;
}
@@ -2463,6 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
*flush = true;
fallthrough;
case XDP_TX:
+ napi_consume_skb(skb, 1);
return 0;
case XDP_PASS:
break;
@@ -2475,13 +2478,15 @@ static int tun_xdp_one(struct tun_struct *tun,
tpage->page = page;
tpage->count = 1;
}
+ napi_consume_skb(skb, 1);
return 0;
}
}
build:
- skb = build_skb(xdp->data_hard_start, buflen);
+ skb = build_skb_around(skb, xdp->data_hard_start, buflen);
if (!skb) {
+ kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM);
dev_core_stats_rx_dropped_inc(tun->dev);
return -ENOMEM;
}
@@ -2566,9 +2571,11 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
if (m->msg_controllen == sizeof(struct tun_msg_ctl) &&
ctl && ctl->type == TUN_MSG_PTR) {
struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
+ int flush = 0, queued = 0, num_skbs = 0;
struct tun_page tpage;
int n = ctl->num;
- int flush = 0, queued = 0;
+ /* Max size of VHOST_NET_BATCH */
+ void *skbs[64];
memset(&tpage, 0, sizeof(tpage));
@@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
rcu_read_lock();
bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
- for (i = 0; i < n; i++) {
+ num_skbs = napi_skb_cache_get_bulk(skbs, n);
+
+ for (i = 0; i < num_skbs; i++) {
+ struct sk_buff *skb = skbs[i];
xdp = &((struct xdp_buff *)ctl->ptr)[i];
- ret = tun_xdp_one(tun, tfile, xdp, &flush, &tpage);
+ ret = tun_xdp_one(tun, tfile, xdp, &flush, &tpage,
+ skb);
if (ret > 0)
queued += ret;
}
+ /* Handle remaining xdp_buff entries if num_skbs < ctl->num */
+ for (i = num_skbs; i < ctl->num; i++) {
+ xdp = &((struct xdp_buff *)ctl->ptr)[i];
+ dev_core_stats_rx_dropped_inc(tun->dev);
+ put_page(virt_to_head_page(xdp->data));
+ }
+
if (flush)
xdp_do_flush();
--
2.43.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH net-next v2 6/9] tun: use napi_build_skb in __tun_build_skb
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
` (4 preceding siblings ...)
2025-11-25 20:00 ` [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one Jon Kohler
@ 2025-11-25 20:00 ` Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 7/9] tun: use napi_consume_skb() in tun_put_user Jon Kohler
` (3 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Jon Kohler @ 2025-11-25 20:00 UTC (permalink / raw)
To: netdev, Willem de Bruijn, Jason Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
open list
Cc: Jon Kohler
Use napi_build_skb for small payload SKBs that end up using the
tun_build_skb path.
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
drivers/net/tun.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 64f944cce517..27c502786a04 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1538,7 +1538,11 @@ static struct sk_buff *__tun_build_skb(struct tun_file *tfile,
int buflen, int len, int pad,
int metasize)
{
- struct sk_buff *skb = build_skb(buf, buflen);
+ struct sk_buff *skb;
+
+ local_bh_disable();
+ skb = napi_build_skb(buf, buflen);
+ local_bh_enable();
if (!skb)
return ERR_PTR(-ENOMEM);
--
2.43.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH net-next v2 7/9] tun: use napi_consume_skb() in tun_put_user
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
` (5 preceding siblings ...)
2025-11-25 20:00 ` [PATCH net-next v2 6/9] tun: use napi_build_skb in __tun_build_skb Jon Kohler
@ 2025-11-25 20:00 ` Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 8/9] net: core: export skb_defer_free_flush Jon Kohler
` (2 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Jon Kohler @ 2025-11-25 20:00 UTC (permalink / raw)
To: netdev, Willem de Bruijn, Jason Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
open list
Cc: Jon Kohler, Nick Hudson
With build_skb paths now utilizing the local NAPI SKB cache, switch to
using napi_consume_skb() in tun_put_user. This ensures the local SKB
cache is refilled on read, improving memory locality and performance,
especially when RX and TX run on the same worker thread (e.g., vhost
workers). As napi_consume_skb() also performs deferred SKB freeing,
SKBs allocated on different CPUs benefit as well.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Nick Hudson <nhudson@akamai.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
drivers/net/tun.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 27c502786a04..b48a66b39e0a 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2155,7 +2155,9 @@ static ssize_t tun_put_user(struct tun_struct *tun,
dev_sw_netstats_tx_add(tun->dev, 1, skb->len + vlan_hlen);
preempt_enable();
- consume_skb(skb);
+ local_bh_disable();
+ napi_consume_skb(skb, 1);
+ local_bh_enable();
return total;
--
2.43.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH net-next v2 8/9] net: core: export skb_defer_free_flush
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
` (6 preceding siblings ...)
2025-11-25 20:00 ` [PATCH net-next v2 7/9] tun: use napi_consume_skb() in tun_put_user Jon Kohler
@ 2025-11-25 20:00 ` Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 9/9] tun: flush deferred skb free list before bulk NAPI cache get Jon Kohler
2025-11-29 3:08 ` [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Willem de Bruijn
9 siblings, 0 replies; 29+ messages in thread
From: Jon Kohler @ 2025-11-25 20:00 UTC (permalink / raw)
To: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stanislav Fomichev, Kuniyuki Iwashima,
Ahmed Zaki, Samiullah Khawaja, Alexander Lobakin, open list
Cc: Jon Kohler
Export skb_defer_free_flush so that it can be invoked in other modules,
which is helpful in situations where processing the deferred backlog
at specific cache refill points may be preferred.
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
include/linux/skbuff.h | 1 +
net/core/dev.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ff90281ddf90..daa2d7480fbd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1365,6 +1365,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size);
struct sk_buff *build_skb_around(struct sk_buff *skb,
void *data, unsigned int frag_size);
void skb_attempt_defer_free(struct sk_buff *skb);
+void skb_defer_free_flush(void);
u32 napi_skb_cache_get_bulk(void **skbs, u32 n);
struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
diff --git a/net/core/dev.c b/net/core/dev.c
index 9094c0fb8c68..c4f535be7b6d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6774,7 +6774,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
}
EXPORT_SYMBOL(napi_complete_done);
-static void skb_defer_free_flush(void)
+void skb_defer_free_flush(void)
{
struct llist_node *free_list;
struct sk_buff *skb, *next;
@@ -6795,6 +6795,7 @@ static void skb_defer_free_flush(void)
}
}
}
+EXPORT_SYMBOL_GPL(skb_defer_free_flush);
#if defined(CONFIG_NET_RX_BUSY_POLL)
--
2.43.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH net-next v2 9/9] tun: flush deferred skb free list before bulk NAPI cache get
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
` (7 preceding siblings ...)
2025-11-25 20:00 ` [PATCH net-next v2 8/9] net: core: export skb_defer_free_flush Jon Kohler
@ 2025-11-25 20:00 ` Jon Kohler
2025-11-29 3:08 ` [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Willem de Bruijn
9 siblings, 0 replies; 29+ messages in thread
From: Jon Kohler @ 2025-11-25 20:00 UTC (permalink / raw)
To: netdev, Willem de Bruijn, Jason Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
open list
Cc: Jon Kohler
Call skb_defer_free_flush() immediately before invoking
napi_skb_cache_get_bulk() in the XDP batch path. This ensures any
deferred skb frees are processed so that the NAPI skb cache is refilled
just in time for use. Keeping the cache warm helps reduce unnecessary
IPIs during heavy transmit workloads.
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
drivers/net/tun.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b48a66b39e0a..7d7f1ddcb707 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2589,6 +2589,12 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
rcu_read_lock();
bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+ /* Attempt to flush deferred free list immediately
+ * prior to bulk get, which will help repopulate the local
+ * cache and help reduce the amount of IPIs a TX hot core
+ * will receive when the defer list grows high.
+ */
+ skb_defer_free_flush();
num_skbs = napi_skb_cache_get_bulk(skbs, n);
for (i = 0; i < num_skbs; i++) {
--
2.43.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-11-25 20:00 ` [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one Jon Kohler
@ 2025-11-28 3:02 ` Jason Wang
2025-12-02 16:49 ` Jon Kohler
0 siblings, 1 reply; 29+ messages in thread
From: Jason Wang @ 2025-11-28 3:02 UTC (permalink / raw)
To: Jon Kohler
Cc: netdev, Willem de Bruijn, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_)
On Wed, Nov 26, 2025 at 3:19 AM Jon Kohler <jon@nutanix.com> wrote:
>
> Optimize TUN_MSG_PTR batch processing by allocating sk_buff structures
> in bulk from the per-CPU NAPI cache using napi_skb_cache_get_bulk.
> This reduces allocation overhead and improves efficiency, especially
> when IFF_NAPI is enabled and GRO is feeding entries back to the cache.
Does this mean we should only enable this when NAPI is used?
>
> If bulk allocation cannot fully satisfy the batch, gracefully drop only
> the uncovered portion, allowing the rest of the batch to proceed, which
> is what already happens in the previous case where build_skb() would
> fail and return -ENOMEM.
>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
Do we have any benchmark result for this?
> ---
> drivers/net/tun.c | 30 ++++++++++++++++++++++++------
> 1 file changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 97f130bc5fed..64f944cce517 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2420,13 +2420,13 @@ static void tun_put_page(struct tun_page *tpage)
> static int tun_xdp_one(struct tun_struct *tun,
> struct tun_file *tfile,
> struct xdp_buff *xdp, int *flush,
> - struct tun_page *tpage)
> + struct tun_page *tpage,
> + struct sk_buff *skb)
> {
> unsigned int datasize = xdp->data_end - xdp->data;
> struct virtio_net_hdr *gso = xdp->data_hard_start;
> struct virtio_net_hdr_v1_hash_tunnel *tnl_hdr;
> struct bpf_prog *xdp_prog;
> - struct sk_buff *skb = NULL;
> struct sk_buff_head *queue;
> netdev_features_t features;
> u32 rxhash = 0, act;
> @@ -2437,6 +2437,7 @@ static int tun_xdp_one(struct tun_struct *tun,
> struct page *page;
>
> if (unlikely(datasize < ETH_HLEN)) {
> + kfree_skb_reason(skb, SKB_DROP_REASON_PKT_TOO_SMALL);
> dev_core_stats_rx_dropped_inc(tun->dev);
> return -EINVAL;
> }
> @@ -2454,6 +2455,7 @@ static int tun_xdp_one(struct tun_struct *tun,
> ret = tun_xdp_act(tun, xdp_prog, xdp, act);
> if (ret < 0) {
> /* tun_xdp_act already handles drop statistics */
> + kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
This should belong to previous patches?
> put_page(virt_to_head_page(xdp->data));
> return ret;
> }
> @@ -2463,6 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
> *flush = true;
> fallthrough;
> case XDP_TX:
> + napi_consume_skb(skb, 1);
> return 0;
> case XDP_PASS:
> break;
> @@ -2475,13 +2478,15 @@ static int tun_xdp_one(struct tun_struct *tun,
> tpage->page = page;
> tpage->count = 1;
> }
> + napi_consume_skb(skb, 1);
I wonder if this would have any side effects since tun_xdp_one() is
not called by a NAPI.
> return 0;
> }
> }
>
> build:
> - skb = build_skb(xdp->data_hard_start, buflen);
> + skb = build_skb_around(skb, xdp->data_hard_start, buflen);
> if (!skb) {
> + kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM);
> dev_core_stats_rx_dropped_inc(tun->dev);
> return -ENOMEM;
> }
> @@ -2566,9 +2571,11 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
> if (m->msg_controllen == sizeof(struct tun_msg_ctl) &&
> ctl && ctl->type == TUN_MSG_PTR) {
> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> + int flush = 0, queued = 0, num_skbs = 0;
> struct tun_page tpage;
> int n = ctl->num;
> - int flush = 0, queued = 0;
> + /* Max size of VHOST_NET_BATCH */
> + void *skbs[64];
I think we need some tweaks
1) TUN is decoupled from vhost, so it should have its own value (a
macro is better)
2) Provide a way to fail or handle the case when more than 64
>
> memset(&tpage, 0, sizeof(tpage));
>
> @@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
> rcu_read_lock();
> bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>
> - for (i = 0; i < n; i++) {
> + num_skbs = napi_skb_cache_get_bulk(skbs, n);
Its document said:
"""
* Must be called *only* from the BH context.
"""
> +
> + for (i = 0; i < num_skbs; i++) {
> + struct sk_buff *skb = skbs[i];
> xdp = &((struct xdp_buff *)ctl->ptr)[i];
> - ret = tun_xdp_one(tun, tfile, xdp, &flush, &tpage);
> + ret = tun_xdp_one(tun, tfile, xdp, &flush, &tpage,
> + skb);
> if (ret > 0)
> queued += ret;
> }
>
> + /* Handle remaining xdp_buff entries if num_skbs < ctl->num */
> + for (i = num_skbs; i < ctl->num; i++) {
> + xdp = &((struct xdp_buff *)ctl->ptr)[i];
> + dev_core_stats_rx_dropped_inc(tun->dev);
Could we do this in a batch?
> + put_page(virt_to_head_page(xdp->data));
> + }
> +
> if (flush)
> xdp_do_flush();
>
> --
> 2.43.0
>
Thanks
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 3/9] tun: correct drop statistics in tun_put_user
2025-11-25 20:00 ` [PATCH net-next v2 3/9] tun: correct drop statistics in tun_put_user Jon Kohler
@ 2025-11-29 3:07 ` Willem de Bruijn
2025-12-02 16:40 ` Jon Kohler
0 siblings, 1 reply; 29+ messages in thread
From: Willem de Bruijn @ 2025-11-29 3:07 UTC (permalink / raw)
To: Jon Kohler, netdev, Willem de Bruijn, Jason Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
open list
Cc: Jon Kohler
Jon Kohler wrote:
> Fold kfree_skb and consume_skb for tun_put_user into tun_put_user and
> rework kfree_skb to take a drop reason. Add drop reason to all drop
> sites and ensure that all failing paths properly increment drop
> counter.
>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> ---
> drivers/net/tun.c | 51 +++++++++++++++++++++++++++++++----------------
> 1 file changed, 34 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 68ad46ab04a4..e0f5e1fe4bd0 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2035,6 +2035,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> struct sk_buff *skb,
> struct iov_iter *iter)
> {
> + enum skb_drop_reason drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> struct tun_pi pi = { 0, skb->protocol };
> ssize_t total;
> int vlan_offset = 0;
> @@ -2051,8 +2052,11 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> total = skb->len + vlan_hlen + vnet_hdr_sz;
>
> if (!(tun->flags & IFF_NO_PI)) {
> - if (iov_iter_count(iter) < sizeof(pi))
> - return -EINVAL;
> + if (iov_iter_count(iter) < sizeof(pi)) {
> + ret = -EINVAL;
> + drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
PI counts as SKB_DROP_REASON_DEV_HDR?
> + goto drop;
> + }
>
> total += sizeof(pi);
> if (iov_iter_count(iter) < total) {
> @@ -2060,8 +2064,11 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> pi.flags |= TUN_PKT_STRIP;
> }
>
> - if (copy_to_iter(&pi, sizeof(pi), iter) != sizeof(pi))
> - return -EFAULT;
> + if (copy_to_iter(&pi, sizeof(pi), iter) != sizeof(pi)) {
> + ret = -EFAULT;
> + drop_reason = SKB_DROP_REASON_SKB_UCOPY_FAULT;
> + goto drop;
> + }
> }
>
> if (vnet_hdr_sz) {
> @@ -2070,8 +2077,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>
> ret = tun_vnet_hdr_tnl_from_skb(tun->flags, tun->dev, skb,
> &hdr);
> - if (ret)
> - return ret;
> + if (ret) {
> + drop_reason = SKB_DROP_REASON_DEV_HDR;
> + goto drop;
> + }
>
> /*
> * Drop the packet if the configured header size is too small
> @@ -2080,8 +2089,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> gso = (struct virtio_net_hdr *)&hdr;
> ret = __tun_vnet_hdr_put(vnet_hdr_sz, tun->dev->features,
> iter, gso);
> - if (ret)
> - return ret;
> + if (ret) {
> + drop_reason = SKB_DROP_REASON_DEV_HDR;
> + goto drop;
> + }
> }
>
> if (vlan_hlen) {
> @@ -2094,23 +2105,33 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
>
> ret = skb_copy_datagram_iter(skb, 0, iter, vlan_offset);
> - if (ret || !iov_iter_count(iter))
> - goto done;
> + if (ret || !iov_iter_count(iter)) {
> + drop_reason = SKB_DROP_REASON_DEV_HDR;
> + goto drop;
> + }
>
> ret = copy_to_iter(&veth, sizeof(veth), iter);
> - if (ret != sizeof(veth) || !iov_iter_count(iter))
> - goto done;
> + if (ret != sizeof(veth) || !iov_iter_count(iter)) {
> + drop_reason = SKB_DROP_REASON_DEV_HDR;
> + goto drop;
> + }
> }
>
> skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset);
>
> -done:
> /* caller is in process context, */
> preempt_disable();
> dev_sw_netstats_tx_add(tun->dev, 1, skb->len + vlan_hlen);
> preempt_enable();
>
> + consume_skb(skb);
> +
> return total;
> +
> +drop:
> + dev_core_stats_tx_dropped_inc(tun->dev);
> + kfree_skb_reason(skb, drop_reason);
> + return ret;
> }
>
> static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
> @@ -2182,10 +2203,6 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
> struct sk_buff *skb = ptr;
>
> ret = tun_put_user(tun, tfile, skb, to);
> - if (unlikely(ret < 0))
> - kfree_skb(skb);
> - else
> - consume_skb(skb);
> }
>
> return ret;
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
` (8 preceding siblings ...)
2025-11-25 20:00 ` [PATCH net-next v2 9/9] tun: flush deferred skb free list before bulk NAPI cache get Jon Kohler
@ 2025-11-29 3:08 ` Willem de Bruijn
2025-12-02 16:38 ` Jon Kohler
9 siblings, 1 reply; 29+ messages in thread
From: Willem de Bruijn @ 2025-11-29 3:08 UTC (permalink / raw)
To: Jon Kohler, netdev, Alexei Starovoitov, Daniel Borkmann,
David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev,
(open list:XDP \(eXpress Data Path\):Keyword:\(?:\b|_\)xdp\(?:\b|_\))
Cc: Jon Kohler
Jon Kohler wrote:
> Use the per-CPU NAPI cache for SKB allocation in most places, and
> leverage bulk allocation for tun_xdp_one since the batch size is known
> at submission time. Additionally, utilize napi_build_skb and
> napi_consume_skb to further benefit from the NAPI cache. This all
> improves efficiency by reducing allocation overhead.
>
> Note: This series does not address the large payload path in
> tun_alloc_skb, which spans sock.c and skbuff.c,A separate series will
> handle privatizing the allocation code in tun and integrating the NAPI
> cache for that path.
>
> Results using basic iperf3 UDP test:
> TX guest: taskset -c 2 iperf3 -c rx-ip-here -t 30 -p 5200 -b 0 -u -i 30
> RX guest: taskset -c 2 iperf3 -s -p 5200 -D
>
> Bitrate
> Before: 6.08 Gbits/sec
> After : 6.36 Gbits/sec
>
> However, the basic test doesn't tell the whole story. Looking at a
> flamegraph from before and after, less cycles are spent both on RX
> vhost thread in the guest-to-guest on a single host case, but also less
> cycles in the guest-to-guest case when on separate hosts, as the host
> NIC handlers benefit from these NAPI-allocated SKBs (and deferred free)
> as well.
>
> Speaking of deferred free, v2 adds exporting deferred free from net
> core and using immediately prior in tun_put_user. This not only keeps
> the cache as warm as you can get, but also prevents a TX heavy vhost
> thread from getting IPI'd like its going out of style. This approach
> is similar in concept to what happens from NAPI loop in net_rx_action.
>
> I've also merged this series with a small series about cleaning up
> packet drop statistics along the various error paths in tun, as I want
> to make sure those all go through kfree_skb_reason(), and we'd have
> merge conflicts separating the two. If the maintainers want to take
> them separately, happy to break them apart if needed. It is fairly
> clean keeping them together otherwise.
I think it would be preferable to send the cleanup separately, first.
Why would that cause merge conflicts?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache
2025-11-29 3:08 ` [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Willem de Bruijn
@ 2025-12-02 16:38 ` Jon Kohler
0 siblings, 0 replies; 29+ messages in thread
From: Jon Kohler @ 2025-12-02 16:38 UTC (permalink / raw)
To: Willem de Bruijn
Cc: netdev@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev,
(open list:XDP \(eXpress Data Path\):Keyword:\(?:\b|_\)xdp\(?:\b|_\))
> On Nov 28, 2025, at 10:08 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
>
> Jon Kohler wrote:
>> Use the per-CPU NAPI cache for SKB allocation in most places, and
>> leverage bulk allocation for tun_xdp_one since the batch size is known
>> at submission time. Additionally, utilize napi_build_skb and
>> napi_consume_skb to further benefit from the NAPI cache. This all
>> improves efficiency by reducing allocation overhead.
>>
>> Note: This series does not address the large payload path in
>> tun_alloc_skb, which spans sock.c and skbuff.c,A separate series will
>> handle privatizing the allocation code in tun and integrating the NAPI
>> cache for that path.
>>
>> Results using basic iperf3 UDP test:
>> TX guest: taskset -c 2 iperf3 -c rx-ip-here -t 30 -p 5200 -b 0 -u -i 30
>> RX guest: taskset -c 2 iperf3 -s -p 5200 -D
>>
>> Bitrate
>> Before: 6.08 Gbits/sec
>> After : 6.36 Gbits/sec
>>
>> However, the basic test doesn't tell the whole story. Looking at a
>> flamegraph from before and after, less cycles are spent both on RX
>> vhost thread in the guest-to-guest on a single host case, but also less
>> cycles in the guest-to-guest case when on separate hosts, as the host
>> NIC handlers benefit from these NAPI-allocated SKBs (and deferred free)
>> as well.
>>
>> Speaking of deferred free, v2 adds exporting deferred free from net
>> core and using immediately prior in tun_put_user. This not only keeps
>> the cache as warm as you can get, but also prevents a TX heavy vhost
>> thread from getting IPI'd like its going out of style. This approach
>> is similar in concept to what happens from NAPI loop in net_rx_action.
>>
>> I've also merged this series with a small series about cleaning up
>> packet drop statistics along the various error paths in tun, as I want
>> to make sure those all go through kfree_skb_reason(), and we'd have
>> merge conflicts separating the two. If the maintainers want to take
>> them separately, happy to break them apart if needed. It is fairly
>> clean keeping them together otherwise.
>
> I think it would be preferable to send the cleanup separately, first.
Sure, will do
> Why would that cause merge conflicts?
Just from a CI perspective, if I sent them separately, I'm guessing CI
would bark about merge conflicts.
Not a problem, let's nail down the cleanup parts and then we can worry
about the performance parts
Thx!
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 3/9] tun: correct drop statistics in tun_put_user
2025-11-29 3:07 ` Willem de Bruijn
@ 2025-12-02 16:40 ` Jon Kohler
2025-12-02 21:34 ` Willem de Bruijn
0 siblings, 1 reply; 29+ messages in thread
From: Jon Kohler @ 2025-12-02 16:40 UTC (permalink / raw)
To: Willem de Bruijn
Cc: netdev@vger.kernel.org, Jason Wang, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, open list
> On Nov 28, 2025, at 10:07 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> Jon Kohler wrote:
>> Fold kfree_skb and consume_skb for tun_put_user into tun_put_user and
>> rework kfree_skb to take a drop reason. Add drop reason to all drop
>> sites and ensure that all failing paths properly increment drop
>> counter.
>>
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> ---
>> drivers/net/tun.c | 51 +++++++++++++++++++++++++++++++----------------
>> 1 file changed, 34 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 68ad46ab04a4..e0f5e1fe4bd0 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2035,6 +2035,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>> struct sk_buff *skb,
>> struct iov_iter *iter)
>> {
>> + enum skb_drop_reason drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
>> struct tun_pi pi = { 0, skb->protocol };
>> ssize_t total;
>> int vlan_offset = 0;
>> @@ -2051,8 +2052,11 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>> total = skb->len + vlan_hlen + vnet_hdr_sz;
>>
>> if (!(tun->flags & IFF_NO_PI)) {
>> - if (iov_iter_count(iter) < sizeof(pi))
>> - return -EINVAL;
>> + if (iov_iter_count(iter) < sizeof(pi)) {
>> + ret = -EINVAL;
>> + drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>
> PI counts as SKB_DROP_REASON_DEV_HDR?
Are you saying I should change this use case to DEV_HDR?
This one seemed like a pretty straightforward "it's too small" case,
no? Or am I reading too much into what you're saying here?
Happy to take a suggestion if I’ve got the drop reason wired
wrong (or if we need to cook up a brand new drop reason for any of
these)
Jon
>
>> + goto drop;
>> + }
>>
>> total += sizeof(pi);
>> if (iov_iter_count(iter) < total) {
>> @@ -2060,8 +2064,11 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>> pi.flags |= TUN_PKT_STRIP;
>> }
>>
>> - if (copy_to_iter(&pi, sizeof(pi), iter) != sizeof(pi))
>> - return -EFAULT;
>> + if (copy_to_iter(&pi, sizeof(pi), iter) != sizeof(pi)) {
>> + ret = -EFAULT;
>> + drop_reason = SKB_DROP_REASON_SKB_UCOPY_FAULT;
>> + goto drop;
>> + }
>> }
>>
>> if (vnet_hdr_sz) {
>> @@ -2070,8 +2077,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>
>> ret = tun_vnet_hdr_tnl_from_skb(tun->flags, tun->dev, skb,
>> &hdr);
>> - if (ret)
>> - return ret;
>> + if (ret) {
>> + drop_reason = SKB_DROP_REASON_DEV_HDR;
>> + goto drop;
>> + }
>>
>> /*
>> * Drop the packet if the configured header size is too small
>> @@ -2080,8 +2089,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>> gso = (struct virtio_net_hdr *)&hdr;
>> ret = __tun_vnet_hdr_put(vnet_hdr_sz, tun->dev->features,
>> iter, gso);
>> - if (ret)
>> - return ret;
>> + if (ret) {
>> + drop_reason = SKB_DROP_REASON_DEV_HDR;
>> + goto drop;
>> + }
>> }
>>
>> if (vlan_hlen) {
>> @@ -2094,23 +2105,33 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>> vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
>>
>> ret = skb_copy_datagram_iter(skb, 0, iter, vlan_offset);
>> - if (ret || !iov_iter_count(iter))
>> - goto done;
>> + if (ret || !iov_iter_count(iter)) {
>> + drop_reason = SKB_DROP_REASON_DEV_HDR;
>> + goto drop;
>> + }
>>
>> ret = copy_to_iter(&veth, sizeof(veth), iter);
>> - if (ret != sizeof(veth) || !iov_iter_count(iter))
>> - goto done;
>> + if (ret != sizeof(veth) || !iov_iter_count(iter)) {
>> + drop_reason = SKB_DROP_REASON_DEV_HDR;
>> + goto drop;
>> + }
>> }
>>
>> skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset);
>>
>> -done:
>> /* caller is in process context, */
>> preempt_disable();
>> dev_sw_netstats_tx_add(tun->dev, 1, skb->len + vlan_hlen);
>> preempt_enable();
>>
>> + consume_skb(skb);
>> +
>> return total;
>> +
>> +drop:
>> + dev_core_stats_tx_dropped_inc(tun->dev);
>> + kfree_skb_reason(skb, drop_reason);
>> + return ret;
>> }
>>
>> static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>> @@ -2182,10 +2203,6 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>> struct sk_buff *skb = ptr;
>>
>> ret = tun_put_user(tun, tfile, skb, to);
>> - if (unlikely(ret < 0))
>> - kfree_skb(skb);
>> - else
>> - consume_skb(skb);
>> }
>>
>> return ret;
>> --
>> 2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-11-28 3:02 ` Jason Wang
@ 2025-12-02 16:49 ` Jon Kohler
2025-12-02 17:32 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 29+ messages in thread
From: Jon Kohler @ 2025-12-02 16:49 UTC (permalink / raw)
To: Jason Wang
Cc: netdev@vger.kernel.org, Willem de Bruijn, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_)
> On Nov 27, 2025, at 10:02 PM, Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Nov 26, 2025 at 3:19 AM Jon Kohler <jon@nutanix.com> wrote:
>>
>> Optimize TUN_MSG_PTR batch processing by allocating sk_buff structures
>> in bulk from the per-CPU NAPI cache using napi_skb_cache_get_bulk.
>> This reduces allocation overhead and improves efficiency, especially
>> when IFF_NAPI is enabled and GRO is feeding entries back to the cache.
>
> Does this mean we should only enable this when NAPI is used?
No, it does not mean that at all, but I see why that would be
confusing. I can clean up the commit msg on the next go-around
>>
>> If bulk allocation cannot fully satisfy the batch, gracefully drop only
>> the uncovered portion, allowing the rest of the batch to proceed, which
>> is what already happens in the previous case where build_skb() would
>> fail and return -ENOMEM.
>>
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>
> Do we have any benchmark result for this?
Yes, it is in the cover letter:
https://patchwork.kernel.org/project/netdevbpf/cover/20251125200041.1565663-1-jon@nutanix.com/
>> ---
>> drivers/net/tun.c | 30 ++++++++++++++++++++++++------
>> 1 file changed, 24 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 97f130bc5fed..64f944cce517 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2420,13 +2420,13 @@ static void tun_put_page(struct tun_page *tpage)
>> static int tun_xdp_one(struct tun_struct *tun,
>> struct tun_file *tfile,
>> struct xdp_buff *xdp, int *flush,
>> - struct tun_page *tpage)
>> + struct tun_page *tpage,
>> + struct sk_buff *skb)
>> {
>> unsigned int datasize = xdp->data_end - xdp->data;
>> struct virtio_net_hdr *gso = xdp->data_hard_start;
>> struct virtio_net_hdr_v1_hash_tunnel *tnl_hdr;
>> struct bpf_prog *xdp_prog;
>> - struct sk_buff *skb = NULL;
>> struct sk_buff_head *queue;
>> netdev_features_t features;
>> u32 rxhash = 0, act;
>> @@ -2437,6 +2437,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>> struct page *page;
>>
>> if (unlikely(datasize < ETH_HLEN)) {
>> + kfree_skb_reason(skb, SKB_DROP_REASON_PKT_TOO_SMALL);
>> dev_core_stats_rx_dropped_inc(tun->dev);
>> return -EINVAL;
>> }
>> @@ -2454,6 +2455,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>> ret = tun_xdp_act(tun, xdp_prog, xdp, act);
>> if (ret < 0) {
>> /* tun_xdp_act already handles drop statistics */
>> + kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
>
> This should belong to previous patches?
Well, not really, as we did not even have an SKB to free at this point
in the previous code
>
>> put_page(virt_to_head_page(xdp->data));
>> return ret;
>> }
>> @@ -2463,6 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>> *flush = true;
>> fallthrough;
>> case XDP_TX:
>> + napi_consume_skb(skb, 1);
>> return 0;
>> case XDP_PASS:
>> break;
>> @@ -2475,13 +2478,15 @@ static int tun_xdp_one(struct tun_struct *tun,
>> tpage->page = page;
>> tpage->count = 1;
>> }
>> + napi_consume_skb(skb, 1);
>
> I wonder if this would have any side effects since tun_xdp_one() is
> not called by a NAPI.
As far as I can tell, this napi_consume_skb is really just an artifact of
how it was named and how it was traditionally used.
Now this is really just a napi_consume_skb within a bh disable/enable
section, which should meet the requirements of how that interface
should be used (again, AFAICT)
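In other words, roughly this shape (just a sketch of the calling
context, assuming the existing local_bh_disable()/local_bh_enable()
pair in tun_sendmsg() stays wrapped around the batch loop):

    /* tun_sendmsg(), TUN_MSG_PTR path - context sketch */
    local_bh_disable();
    rcu_read_lock();
    ...
        /* In tun_xdp_one(): the skb is unused for XDP_TX/XDP_REDIRECT,
         * so hand it back. With a non-zero budget, napi_consume_skb()
         * may stash it in the per-CPU NAPI cache, which is only safe
         * with BH disabled.
         */
        napi_consume_skb(skb, 1);
    ...
    rcu_read_unlock();
    local_bh_enable();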
>
>> return 0;
>> }
>> }
>>
>> build:
>> - skb = build_skb(xdp->data_hard_start, buflen);
>> + skb = build_skb_around(skb, xdp->data_hard_start, buflen);
>> if (!skb) {
>> + kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM);
Though to your point, I don't think this actually does anything given
that if the skb was somehow nuked as part of build_skb_around, there
would not be an skb to free. Doesn't hurt though, from a
self-documenting code perspective, though?
>> dev_core_stats_rx_dropped_inc(tun->dev);
>> return -ENOMEM;
>> }
>> @@ -2566,9 +2571,11 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>> if (m->msg_controllen == sizeof(struct tun_msg_ctl) &&
>> ctl && ctl->type == TUN_MSG_PTR) {
>> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>> + int flush = 0, queued = 0, num_skbs = 0;
>> struct tun_page tpage;
>> int n = ctl->num;
>> - int flush = 0, queued = 0;
>> + /* Max size of VHOST_NET_BATCH */
>> + void *skbs[64];
>
> I think we need some tweaks
>
> 1) TUN is decoupled from vhost, so it should have its own value (a
> macro is better)
Sure, I can make another constant that does a similar thing
> 2) Provide a way to fail or handle the case when more than 64
What if we simply assert that the maximum here is 64, which I think
is what it actually is in practice?
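Something along these lines, perhaps? (untested sketch; TUN_XDP_BATCH
is a made-up name here, and the failure policy is up for discussion):

    /* Mirrors VHOST_NET_BATCH without tun depending on vhost. */
    #define TUN_XDP_BATCH 64

    void *skbs[TUN_XDP_BATCH];
    int n = ctl->num;

    /* vhost never submits more than 64 today, so just assert; the
     * alternative would be to clamp and process in chunks, or fail
     * the whole sendmsg.
     */
    if (WARN_ON_ONCE(n > TUN_XDP_BATCH))
        return -EINVAL;

    num_skbs = napi_skb_cache_get_bulk(skbs, n);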
>
>>
>> memset(&tpage, 0, sizeof(tpage));
>>
>> @@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>> rcu_read_lock();
>> bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>>
>> - for (i = 0; i < n; i++) {
>> + num_skbs = napi_skb_cache_get_bulk(skbs, n);
>
> Its document said:
>
> """
> * Must be called *only* from the BH context.
> “"”
We’re in a bh_disable section here, is that not good enough?
>
>> +
>> + for (i = 0; i < num_skbs; i++) {
>> + struct sk_buff *skb = skbs[i];
>> xdp = &((struct xdp_buff *)ctl->ptr)[i];
>> - ret = tun_xdp_one(tun, tfile, xdp, &flush, &tpage);
>> + ret = tun_xdp_one(tun, tfile, xdp, &flush, &tpage,
>> + skb);
>> if (ret > 0)
>> queued += ret;
>> }
>>
>> + /* Handle remaining xdp_buff entries if num_skbs < ctl->num */
>> + for (i = num_skbs; i < ctl->num; i++) {
>> + xdp = &((struct xdp_buff *)ctl->ptr)[i];
>> + dev_core_stats_rx_dropped_inc(tun->dev);
>
> Could we do this in a batch?
I suspect this will be a very, very rare case, so I don't think
optimizing it (or complicating it any more) does much good, no?
>
>> + put_page(virt_to_head_page(xdp->data));
>> + }
>> +
>> if (flush)
>> xdp_do_flush();
>>
>> --
>> 2.43.0
>>
>
> Thanks
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-02 16:49 ` Jon Kohler
@ 2025-12-02 17:32 ` Jesper Dangaard Brouer
2025-12-02 17:45 ` Jon Kohler
2025-12-03 8:47 ` Sebastian Andrzej Siewior
0 siblings, 2 replies; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2025-12-02 17:32 UTC (permalink / raw)
To: Jon Kohler, Jason Wang
Cc: netdev@vger.kernel.org, Willem de Bruijn, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_),
Sebastian Andrzej Siewior, Alexander Lobakin
On 02/12/2025 17.49, Jon Kohler wrote:
>
>
>> On Nov 27, 2025, at 10:02 PM, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On Wed, Nov 26, 2025 at 3:19 AM Jon Kohler <jon@nutanix.com> wrote:
>>>
>>> Optimize TUN_MSG_PTR batch processing by allocating sk_buff structures
>>> in bulk from the per-CPU NAPI cache using napi_skb_cache_get_bulk.
>>> This reduces allocation overhead and improves efficiency, especially
>>> when IFF_NAPI is enabled and GRO is feeding entries back to the cache.
>>
>> Does this mean we should only enable this when NAPI is used?
>
> No, it does not mean that at all, but I see what that would be confusing.
> I can clean up the commit msg on the next go around
>
>>>
>>> If bulk allocation cannot fully satisfy the batch, gracefully drop only
>>> the uncovered portion, allowing the rest of the batch to proceed, which
>>> is what already happens in the previous case where build_skb() would
>>> fail and return -ENOMEM.
>>>
>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>>
>> Do we have any benchmark result for this?
>
> Yes, it is in the cover letter:
> https://patchwork.kernel.org/project/netdevbpf/cover/20251125200041.1565663-1-jon@nutanix.com/
>
>>> ---
>>> drivers/net/tun.c | 30 ++++++++++++++++++++++++------
>>> 1 file changed, 24 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>> index 97f130bc5fed..64f944cce517 100644
>>> --- a/drivers/net/tun.c
>>> +++ b/drivers/net/tun.c
[...]
>>> @@ -2454,6 +2455,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>> ret = tun_xdp_act(tun, xdp_prog, xdp, act);
>>> if (ret < 0) {
>>> /* tun_xdp_act already handles drop statistics */
>>> + kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
>>
>> This should belong to previous patches?
>
> Well, not really, as we did not even have an SKB to free at this point
> in the previous code
>>
>>> put_page(virt_to_head_page(xdp->data));
Calling put_page() directly here also looks dubious.
>>> return ret;
>>> }
>>> @@ -2463,6 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>> *flush = true;
>>> fallthrough;
>>> case XDP_TX:
>>> + napi_consume_skb(skb, 1);
>>> return 0;
>>> case XDP_PASS:
>>> break;
>>> @@ -2475,13 +2478,15 @@ static int tun_xdp_one(struct tun_struct *tun,
>>> tpage->page = page;
>>> tpage->count = 1;
>>> }
>>> + napi_consume_skb(skb, 1);
>>
>> I wonder if this would have any side effects since tun_xdp_one() is
>> not called by a NAPI.
>
> As far as I can tell, this napi_consume_skb is really just an artifact of
> how it was named and how it was traditionally used.
>
> Now this is really just a napi_consume_skb within a bh disable/enable
> section, which should meet the requirements of how that interface
> should be used (again, AFAICT)
>
Yikes - this sounds super ugly. Just wrapping napi_consume_skb() in a bh
disable/enable section and then assuming you get the same protection as
NAPI is really dubious.
Cc Sebastian, as he is trying to clean up this kind of use case to make
kernel preemption work.
>>
>>> return 0;
>>> }
>>> }
>>>
>>> build:
>>> - skb = build_skb(xdp->data_hard_start, buflen);
>>> + skb = build_skb_around(skb, xdp->data_hard_start, buflen);
>>> if (!skb) {
>>> + kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM);
>
> Though to your point, I dont think this actually does anything given
> that if the skb was somehow nuked as part of build_skb_around, there
> would not be an skb to free. Doesn’t hurt though, from a self documenting
> code perspective tho?
>
>>> dev_core_stats_rx_dropped_inc(tun->dev);
>>> return -ENOMEM;
>>> }
>>> @@ -2566,9 +2571,11 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>> if (m->msg_controllen == sizeof(struct tun_msg_ctl) &&
>>> ctl && ctl->type == TUN_MSG_PTR) {
>>> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>>> + int flush = 0, queued = 0, num_skbs = 0;
>>> struct tun_page tpage;
>>> int n = ctl->num;
>>> - int flush = 0, queued = 0;
>>> + /* Max size of VHOST_NET_BATCH */
>>> + void *skbs[64];
>>
>> I think we need some tweaks
>>
>> 1) TUN is decoupled from vhost, so it should have its own value (a
>> macro is better)
>
> Sure, I can make another constant that does a similar thing
>
>> 2) Provide a way to fail or handle the case when more than 64
>
> What if we simply assert that the maximum here is 64, which I think
> is what it actually is in practice?
>
>>
>>>
>>> memset(&tpage, 0, sizeof(tpage));
>>>
>>> @@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>> rcu_read_lock();
>>> bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>>>
>>> - for (i = 0; i < n; i++) {
>>> + num_skbs = napi_skb_cache_get_bulk(skbs, n);
>>
>> Its document said:
>>
>> """
>> * Must be called *only* from the BH context.
>> “"”
> We’re in a bh_disable section here, is that not good enough?
Again, this feels very ugly and prone to painting ourselves into a
corner, assuming BH-disabled sections have the same protection as NAPI.
(The napi_skb_cache_get/put functions are operating on per-CPU arrays
without any locking.)
--Jesper
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-02 17:32 ` Jesper Dangaard Brouer
@ 2025-12-02 17:45 ` Jon Kohler
2025-12-03 4:10 ` Jason Wang
2025-12-03 8:47 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 29+ messages in thread
From: Jon Kohler @ 2025-12-02 17:45 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Jason Wang, netdev@vger.kernel.org, Willem de Bruijn, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_),
Sebastian Andrzej Siewior, Alexander Lobakin
> On Dec 2, 2025, at 12:32 PM, Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 02/12/2025 17.49, Jon Kohler wrote:
>>> On Nov 27, 2025, at 10:02 PM, Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On Wed, Nov 26, 2025 at 3:19 AM Jon Kohler <jon@nutanix.com> wrote:
>>>>
>>>> Optimize TUN_MSG_PTR batch processing by allocating sk_buff structures
>>>> in bulk from the per-CPU NAPI cache using napi_skb_cache_get_bulk.
>>>> This reduces allocation overhead and improves efficiency, especially
>>>> when IFF_NAPI is enabled and GRO is feeding entries back to the cache.
>>>
>>> Does this mean we should only enable this when NAPI is used?
>> No, it does not mean that at all, but I see what that would be confusing.
>> I can clean up the commit msg on the next go around
>>>>
>>>> If bulk allocation cannot fully satisfy the batch, gracefully drop only
>>>> the uncovered portion, allowing the rest of the batch to proceed, which
>>>> is what already happens in the previous case where build_skb() would
>>>> fail and return -ENOMEM.
>>>>
>>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>>>
>>> Do we have any benchmark result for this?
>> Yes, it is in the cover letter:
>> https://patchwork.kernel.org/project/netdevbpf/cover/20251125200041.1565663-1-jon@nutanix.com/
>>>> ---
>>>> drivers/net/tun.c | 30 ++++++++++++++++++++++++------
>>>> 1 file changed, 24 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index 97f130bc5fed..64f944cce517 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
> [...]
>>>> @@ -2454,6 +2455,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>>> ret = tun_xdp_act(tun, xdp_prog, xdp, act);
>>>> if (ret < 0) {
>>>> /* tun_xdp_act already handles drop statistics */
>>>> + kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
>>>
>>> This should belong to previous patches?
>> Well, not really, as we did not even have an SKB to free at this point
>> in the previous code
>>>
>>>> put_page(virt_to_head_page(xdp->data));
>
> This calling put_page() directly also looks dubious.
>
>>>> return ret;
>>>> }
>>>> @@ -2463,6 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>>> *flush = true;
>>>> fallthrough;
>>>> case XDP_TX:
>>>> + napi_consume_skb(skb, 1);
>>>> return 0;
>>>> case XDP_PASS:
>>>> break;
>>>> @@ -2475,13 +2478,15 @@ static int tun_xdp_one(struct tun_struct *tun,
>>>> tpage->page = page;
>>>> tpage->count = 1;
>>>> }
>>>> + napi_consume_skb(skb, 1);
>>>
>>> I wonder if this would have any side effects since tun_xdp_one() is
>>> not called by a NAPI.
>> As far as I can tell, this napi_consume_skb is really just an artifact of
>> how it was named and how it was traditionally used.
>> Now this is really just a napi_consume_skb within a bh disable/enable
>> section, which should meet the requirements of how that interface
>> should be used (again, AFAICT)
>
> Yicks - this sounds super ugly. Just wrapping napi_consume_skb() in bh
> disable/enable section and then assuming you get the same protection as
> NAPI is really dubious.
>
> Cc Sebastian as he is trying to cleanup these kind of use-case, to make
> kernel preemption work.
>
>
>>>
>>>> return 0;
>>>> }
>>>> }
>>>>
>>>> build:
>>>> - skb = build_skb(xdp->data_hard_start, buflen);
>>>> + skb = build_skb_around(skb, xdp->data_hard_start, buflen);
>>>> if (!skb) {
>>>> + kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM);
>> Though to your point, I dont think this actually does anything given
>> that if the skb was somehow nuked as part of build_skb_around, there
>> would not be an skb to free. Doesn’t hurt though, from a self documenting
>> code perspective tho?
>>>> dev_core_stats_rx_dropped_inc(tun->dev);
>>>> return -ENOMEM;
>>>> }
>>>> @@ -2566,9 +2571,11 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>>> if (m->msg_controllen == sizeof(struct tun_msg_ctl) &&
>>>> ctl && ctl->type == TUN_MSG_PTR) {
>>>> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>>>> + int flush = 0, queued = 0, num_skbs = 0;
>>>> struct tun_page tpage;
>>>> int n = ctl->num;
>>>> - int flush = 0, queued = 0;
>>>> + /* Max size of VHOST_NET_BATCH */
>>>> + void *skbs[64];
>>>
>>> I think we need some tweaks
>>>
>>> 1) TUN is decoupled from vhost, so it should have its own value (a
>>> macro is better)
>> Sure, I can make another constant that does a similar thing
>>> 2) Provide a way to fail or handle the case when more than 64
>> What if we simply assert that the maximum here is 64, which I think
>> is what it actually is in practice?
>>>
>>>>
>>>> memset(&tpage, 0, sizeof(tpage));
>>>>
>>>> @@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>>> rcu_read_lock();
>>>> bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>>>>
>>>> - for (i = 0; i < n; i++) {
>>>> + num_skbs = napi_skb_cache_get_bulk(skbs, n);
>>>
>>> Its document said:
>>>
>>> """
>>> * Must be called *only* from the BH context.
>>> “"”
>> We’re in a bh_disable section here, is that not good enough?
>
> Again this feels very ugly and prone to painting ourselves into a
> corner, assuming BH-disabled sections have same protection as NAPI.
>
> (The napi_skb_cache_get/put function are operating on per CPU arrays
> without any locking.)
Happy to take suggestions on an alternative approach.
Thoughts:
1. Instead of having IFF_NAPI be an opt-in thing, clean up tun so it
is *always* NAPI’d 100% of the time? Outside of people who have
wired this up in their apps manually, on the virtualization side
there is currently no support from QEMU/Libvirt to enable IFF_NAPI.
Might be a nice simplification/cleanup to just “do it” full time?
Then we can play all these sorts of games under the protection of
NAPI?
2. Some other non-dubious way of protecting this, without refactoring
   for either conditional NAPI (yuck?) or full-time NAPI? This would be
   nice, happy to take tips!
3. ... ?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 3/9] tun: correct drop statistics in tun_put_user
2025-12-02 16:40 ` Jon Kohler
@ 2025-12-02 21:34 ` Willem de Bruijn
2025-12-02 21:36 ` Jon Kohler
0 siblings, 1 reply; 29+ messages in thread
From: Willem de Bruijn @ 2025-12-02 21:34 UTC (permalink / raw)
To: Jon Kohler, Willem de Bruijn
Cc: netdev@vger.kernel.org, Jason Wang, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, open list
Jon Kohler wrote:
>
>
> > On Nov 28, 2025, at 10:07 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jon Kohler wrote:
> >> Fold kfree_skb and consume_skb for tun_put_user into tun_put_user and
> >> rework kfree_skb to take a drop reason. Add drop reason to all drop
> >> sites and ensure that all failing paths properly increment drop
> >> counter.
> >>
> >> Signed-off-by: Jon Kohler <jon@nutanix.com>
> >> ---
> >> drivers/net/tun.c | 51 +++++++++++++++++++++++++++++++----------------
> >> 1 file changed, 34 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index 68ad46ab04a4..e0f5e1fe4bd0 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -2035,6 +2035,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >> struct sk_buff *skb,
> >> struct iov_iter *iter)
> >> {
> >> + enum skb_drop_reason drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> >> struct tun_pi pi = { 0, skb->protocol };
> >> ssize_t total;
> >> int vlan_offset = 0;
> >> @@ -2051,8 +2052,11 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >> total = skb->len + vlan_hlen + vnet_hdr_sz;
> >>
> >> if (!(tun->flags & IFF_NO_PI)) {
> >> - if (iov_iter_count(iter) < sizeof(pi))
> >> - return -EINVAL;
> >> + if (iov_iter_count(iter) < sizeof(pi)) {
> >> + ret = -EINVAL;
> >> + drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
> >
> > PI counts as SKB_DROP_REASON_DEV_HDR?
>
> Are you saying I should change this use case to DEV_HDR?
>
> This one seemed like a pretty straight forward “It’s too small” case,
> no? Or am I misreading into what you’re saying here?
>
> Happy to take a suggestion if I’ve got the drop reason wired
> wrong (or if we need to cook up a brand new drop reason for any of
> these)
I agree that it's a clear case of the buffer being too small. But I
consider PI to be not part of the packet itself, but rather device
headers, so this reads more like a bad device header case. It's
borderline nitpicking. With that context, pick whichever you think fits
best.
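FWIW, taking the DEV_HDR route would just mean flipping the reason in
the hunk quoted above, something like (sketch; the rest of the error
path stays as in the patch):

    if (iov_iter_count(iter) < sizeof(pi)) {
            ret = -EINVAL;
            /* the tun_pi header is device framing, not payload */
            drop_reason = SKB_DROP_REASON_DEV_HDR;
            /* ... error handling continues as in the patch ... */
    }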
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 3/9] tun: correct drop statistics in tun_put_user
2025-12-02 21:34 ` Willem de Bruijn
@ 2025-12-02 21:36 ` Jon Kohler
0 siblings, 0 replies; 29+ messages in thread
From: Jon Kohler @ 2025-12-02 21:36 UTC (permalink / raw)
To: Willem de Bruijn
Cc: netdev@vger.kernel.org, Jason Wang, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, open list
> On Dec 2, 2025, at 4:34 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
>
> Jon Kohler wrote:
>>
>>
>>> On Nov 28, 2025, at 10:07 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>>>
>>> Jon Kohler wrote:
>>>> Fold kfree_skb and consume_skb for tun_put_user into tun_put_user and
>>>> rework kfree_skb to take a drop reason. Add drop reason to all drop
>>>> sites and ensure that all failing paths properly increment drop
>>>> counter.
>>>>
>>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>>>> ---
>>>> drivers/net/tun.c | 51 +++++++++++++++++++++++++++++++----------------
>>>> 1 file changed, 34 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index 68ad46ab04a4..e0f5e1fe4bd0 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -2035,6 +2035,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>> struct sk_buff *skb,
>>>> struct iov_iter *iter)
>>>> {
>>>> + enum skb_drop_reason drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
>>>> struct tun_pi pi = { 0, skb->protocol };
>>>> ssize_t total;
>>>> int vlan_offset = 0;
>>>> @@ -2051,8 +2052,11 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>> total = skb->len + vlan_hlen + vnet_hdr_sz;
>>>>
>>>> if (!(tun->flags & IFF_NO_PI)) {
>>>> - if (iov_iter_count(iter) < sizeof(pi))
>>>> - return -EINVAL;
>>>> + if (iov_iter_count(iter) < sizeof(pi)) {
>>>> + ret = -EINVAL;
>>>> + drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>>>
>>> PI counts as SKB_DROP_REASON_DEV_HDR?
>>
>> Are you saying I should change this use case to DEV_HDR?
>>
>> This one seemed like a pretty straight forward “It’s too small” case,
>> no? Or am I misreading into what you’re saying here?
>>
>> Happy to take a suggestion if I’ve got the drop reason wired
>> wrong (or if we need to cook up a brand new drop reason for any of
>> these)
>
> I agree that it's a clear case of the buffer being too small. But I
> consider PI not part of the packet itself, but bad device headers.
> It's borderline nitpicking. With that context, pick which you see fits
> best.
Yea, that's a fair nuance. I'll chew on it for the next go-around.
Jon
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-02 17:45 ` Jon Kohler
@ 2025-12-03 4:10 ` Jason Wang
2025-12-03 4:34 ` Jon Kohler
0 siblings, 1 reply; 29+ messages in thread
From: Jason Wang @ 2025-12-03 4:10 UTC (permalink / raw)
To: Jon Kohler
Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org, Willem de Bruijn,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_),
Sebastian Andrzej Siewior, Alexander Lobakin
On Wed, Dec 3, 2025 at 1:46 AM Jon Kohler <jon@nutanix.com> wrote:
>
>
>
> > On Dec 2, 2025, at 12:32 PM, Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> >
> >
> >
> > On 02/12/2025 17.49, Jon Kohler wrote:
> >>> On Nov 27, 2025, at 10:02 PM, Jason Wang <jasowang@redhat.com> wrote:
> >>>
> >>> On Wed, Nov 26, 2025 at 3:19 AM Jon Kohler <jon@nutanix.com> wrote:
> >>>>
> >>>> Optimize TUN_MSG_PTR batch processing by allocating sk_buff structures
> >>>> in bulk from the per-CPU NAPI cache using napi_skb_cache_get_bulk.
> >>>> This reduces allocation overhead and improves efficiency, especially
> >>>> when IFF_NAPI is enabled and GRO is feeding entries back to the cache.
> >>>
> >>> Does this mean we should only enable this when NAPI is used?
> >> No, it does not mean that at all, but I see what that would be confusing.
> >> I can clean up the commit msg on the next go around
> >>>>
> >>>> If bulk allocation cannot fully satisfy the batch, gracefully drop only
> >>>> the uncovered portion, allowing the rest of the batch to proceed, which
> >>>> is what already happens in the previous case where build_skb() would
> >>>> fail and return -ENOMEM.
> >>>>
> >>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
> >>>
> >>> Do we have any benchmark result for this?
> >> Yes, it is in the cover letter:
> >> https://patchwork.kernel.org/project/netdevbpf/cover/20251125200041.1565663-1-jon@nutanix.com/
Ok, but it only covers UDP; I think we want to see how it performs for
TCP as well, and latency too. Btw, is the test with IFF_NAPI or not?
> >>>> ---
> >>>> drivers/net/tun.c | 30 ++++++++++++++++++++++++------
> >>>> 1 file changed, 24 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>> index 97f130bc5fed..64f944cce517 100644
> >>>> --- a/drivers/net/tun.c
> >>>> +++ b/drivers/net/tun.c
> > [...]
> >>>> @@ -2454,6 +2455,7 @@ static int tun_xdp_one(struct tun_struct *tun,
> >>>> ret = tun_xdp_act(tun, xdp_prog, xdp, act);
> >>>> if (ret < 0) {
> >>>> /* tun_xdp_act already handles drop statistics */
> >>>> + kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
> >>>
> >>> This should belong to previous patches?
> >> Well, not really, as we did not even have an SKB to free at this point
> >> in the previous code
> >>>
> >>>> put_page(virt_to_head_page(xdp->data));
> >
> > This calling put_page() directly also looks dubious.
> >
> >>>> return ret;
> >>>> }
> >>>> @@ -2463,6 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
> >>>> *flush = true;
> >>>> fallthrough;
> >>>> case XDP_TX:
> >>>> + napi_consume_skb(skb, 1);
> >>>> return 0;
> >>>> case XDP_PASS:
> >>>> break;
> >>>> @@ -2475,13 +2478,15 @@ static int tun_xdp_one(struct tun_struct *tun,
> >>>> tpage->page = page;
> >>>> tpage->count = 1;
> >>>> }
> >>>> + napi_consume_skb(skb, 1);
> >>>
> >>> I wonder if this would have any side effects since tun_xdp_one() is
> >>> not called by a NAPI.
> >> As far as I can tell, this napi_consume_skb is really just an artifact of
> >> how it was named and how it was traditionally used.
> >> Now this is really just a napi_consume_skb within a bh disable/enable
> >> section, which should meet the requirements of how that interface
> >> should be used (again, AFAICT)
> >
> > Yicks - this sounds super ugly. Just wrapping napi_consume_skb() in bh
> > disable/enable section and then assuming you get the same protection as
> > NAPI is really dubious.
> >
> > Cc Sebastian as he is trying to cleanup these kind of use-case, to make
> > kernel preemption work.
> >
> >
> >>>
> >>>> return 0;
> >>>> }
> >>>> }
> >>>>
> >>>> build:
> >>>> - skb = build_skb(xdp->data_hard_start, buflen);
> >>>> + skb = build_skb_around(skb, xdp->data_hard_start, buflen);
> >>>> if (!skb) {
> >>>> + kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM);
> >> Though to your point, I dont think this actually does anything given
> >> that if the skb was somehow nuked as part of build_skb_around, there
> >> would not be an skb to free. Doesn’t hurt though, from a self documenting
> >> code perspective tho?
> >>>> dev_core_stats_rx_dropped_inc(tun->dev);
> >>>> return -ENOMEM;
> >>>> }
> >>>> @@ -2566,9 +2571,11 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
> >>>> if (m->msg_controllen == sizeof(struct tun_msg_ctl) &&
> >>>> ctl && ctl->type == TUN_MSG_PTR) {
> >>>> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> >>>> + int flush = 0, queued = 0, num_skbs = 0;
> >>>> struct tun_page tpage;
> >>>> int n = ctl->num;
> >>>> - int flush = 0, queued = 0;
> >>>> + /* Max size of VHOST_NET_BATCH */
> >>>> + void *skbs[64];
> >>>
> >>> I think we need some tweaks
> >>>
> >>> 1) TUN is decoupled from vhost, so it should have its own value (a
> >>> macro is better)
> >> Sure, I can make another constant that does a similar thing
> >>> 2) Provide a way to fail or handle the case when more than 64
> >> What if we simply assert that the maximum here is 64, which I think
> >> is what it actually is in practice?
I still prefer a fallback.
> >>>
> >>>>
> >>>> memset(&tpage, 0, sizeof(tpage));
> >>>>
> >>>> @@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
> >>>> rcu_read_lock();
> >>>> bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> >>>>
> >>>> - for (i = 0; i < n; i++) {
> >>>> + num_skbs = napi_skb_cache_get_bulk(skbs, n);
> >>>
> >>> Its document said:
> >>>
> >>> """
> >>> * Must be called *only* from the BH context.
> >>> “"”
> >> We’re in a bh_disable section here, is that not good enough?
> >
> > Again this feels very ugly and prone to painting ourselves into a
> > corner, assuming BH-disabled sections have same protection as NAPI.
> >
> > (The napi_skb_cache_get/put function are operating on per CPU arrays
> > without any locking.)
>
> Happy to take suggestions on an alternative approach.
>
> Thoughts:
> 1. Instead of having IFF_NAPI be an opt-in thing, clean up tun so it
> is *always* NAPI’d 100% of the time?
IFF_NAPI has some overhead, and it was introduced basically for
testing, if I'm not wrong.
> Outside of people who have
> wired this up in their apps manually, on the virtualization side
> there is currently no support from QEMU/Libvirt to enable IFF_NAPI.
> Might be a nice simplification/cleanup to just “do it” full time?
> Then we can play all these sorts of games under the protection of
> NAPI?
A full benchmark would need to be run to see.
> 2. (Some other non-dubious way of protecting this, without refactoring
> for either conditional NAPI (yuck?) or refactoring for full time
> NAPI? This would be nice, happy to take tips!
> 3. ... ?
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-03 4:10 ` Jason Wang
@ 2025-12-03 4:34 ` Jon Kohler
2025-12-03 6:40 ` Jason Wang
0 siblings, 1 reply; 29+ messages in thread
From: Jon Kohler @ 2025-12-03 4:34 UTC (permalink / raw)
To: Jason Wang
Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org, Willem de Bruijn,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_),
Sebastian Andrzej Siewior, Alexander Lobakin
> On Dec 2, 2025, at 11:10 PM, Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Dec 3, 2025 at 1:46 AM Jon Kohler <jon@nutanix.com> wrote:
>>
>>
>>
>>> On Dec 2, 2025, at 12:32 PM, Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>>
>>>
>>>
>>> On 02/12/2025 17.49, Jon Kohler wrote:
>>>>> On Nov 27, 2025, at 10:02 PM, Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>> On Wed, Nov 26, 2025 at 3:19 AM Jon Kohler <jon@nutanix.com> wrote:
>>>>>>
>>>>>> Optimize TUN_MSG_PTR batch processing by allocating sk_buff structures
>>>>>> in bulk from the per-CPU NAPI cache using napi_skb_cache_get_bulk.
>>>>>> This reduces allocation overhead and improves efficiency, especially
>>>>>> when IFF_NAPI is enabled and GRO is feeding entries back to the cache.
>>>>>
>>>>> Does this mean we should only enable this when NAPI is used?
>>>> No, it does not mean that at all, but I see what that would be confusing.
>>>> I can clean up the commit msg on the next go around
>>>>>>
>>>>>> If bulk allocation cannot fully satisfy the batch, gracefully drop only
>>>>>> the uncovered portion, allowing the rest of the batch to proceed, which
>>>>>> is what already happens in the previous case where build_skb() would
>>>>>> fail and return -ENOMEM.
>>>>>>
>>>>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>>>>>
>>>>> Do we have any benchmark result for this?
>>>> Yes, it is in the cover letter:
>>>> https://patchwork.kernel.org/project/netdevbpf/cover/20251125200041.1565663-1-jon@nutanix.com/
>
> Ok but it only covers UDP, I think we want to see how it performs for
> TCP as well as latency. Btw is the test for IFF_NAPI or not?
This test was without IFF_NAPI, but I could get the NAPI numbers too.
More on that below.
>
>>>>>> ---
>>>>>> drivers/net/tun.c | 30 ++++++++++++++++++++++++------
>>>>>> 1 file changed, 24 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>> index 97f130bc5fed..64f944cce517 100644
>>>>>> --- a/drivers/net/tun.c
>>>>>> +++ b/drivers/net/tun.c
>>> [...]
>>>>>> @@ -2454,6 +2455,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>>>>> ret = tun_xdp_act(tun, xdp_prog, xdp, act);
>>>>>> if (ret < 0) {
>>>>>> /* tun_xdp_act already handles drop statistics */
>>>>>> + kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
>>>>>
>>>>> This should belong to previous patches?
>>>> Well, not really, as we did not even have an SKB to free at this point
>>>> in the previous code
>>>>>
>>>>>> put_page(virt_to_head_page(xdp->data));
>>>
>>> This calling put_page() directly also looks dubious.
>>>
>>>>>> return ret;
>>>>>> }
>>>>>> @@ -2463,6 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>>>>> *flush = true;
>>>>>> fallthrough;
>>>>>> case XDP_TX:
>>>>>> + napi_consume_skb(skb, 1);
>>>>>> return 0;
>>>>>> case XDP_PASS:
>>>>>> break;
>>>>>> @@ -2475,13 +2478,15 @@ static int tun_xdp_one(struct tun_struct *tun,
>>>>>> tpage->page = page;
>>>>>> tpage->count = 1;
>>>>>> }
>>>>>> + napi_consume_skb(skb, 1);
>>>>>
>>>>> I wonder if this would have any side effects since tun_xdp_one() is
>>>>> not called by a NAPI.
>>>> As far as I can tell, this napi_consume_skb is really just an artifact of
>>>> how it was named and how it was traditionally used.
>>>> Now this is really just a napi_consume_skb within a bh disable/enable
>>>> section, which should meet the requirements of how that interface
>>>> should be used (again, AFAICT)
>>>
>>> Yicks - this sounds super ugly. Just wrapping napi_consume_skb() in bh
>>> disable/enable section and then assuming you get the same protection as
>>> NAPI is really dubious.
>>>
>>> Cc Sebastian as he is trying to cleanup these kind of use-case, to make
>>> kernel preemption work.
>>>
>>>
>>>>>
>>>>>> return 0;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> build:
>>>>>> - skb = build_skb(xdp->data_hard_start, buflen);
>>>>>> + skb = build_skb_around(skb, xdp->data_hard_start, buflen);
>>>>>> if (!skb) {
>>>>>> + kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM);
>>>> Though to your point, I dont think this actually does anything given
>>>> that if the skb was somehow nuked as part of build_skb_around, there
>>>> would not be an skb to free. Doesn’t hurt though, from a self documenting
>>>> code perspective tho?
>>>>>> dev_core_stats_rx_dropped_inc(tun->dev);
>>>>>> return -ENOMEM;
>>>>>> }
>>>>>> @@ -2566,9 +2571,11 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>>>>> if (m->msg_controllen == sizeof(struct tun_msg_ctl) &&
>>>>>> ctl && ctl->type == TUN_MSG_PTR) {
>>>>>> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>>>>>> + int flush = 0, queued = 0, num_skbs = 0;
>>>>>> struct tun_page tpage;
>>>>>> int n = ctl->num;
>>>>>> - int flush = 0, queued = 0;
>>>>>> + /* Max size of VHOST_NET_BATCH */
>>>>>> + void *skbs[64];
>>>>>
>>>>> I think we need some tweaks
>>>>>
>>>>> 1) TUN is decoupled from vhost, so it should have its own value (a
>>>>> macro is better)
>>>> Sure, I can make another constant that does a similar thing
>>>>> 2) Provide a way to fail or handle the case when more than 64
>>>> What if we simply assert that the maximum here is 64, which I think
>>>> is what it actually is in practice?
>
> I still prefer a fallback.
Ack, will chew on that for the next one. Let's settle the larger
elephant in the room, which is the NAPI stuff below, as none of this
goes anywhere without resolving that first.
>
>>>>>
>>>>>>
>>>>>> memset(&tpage, 0, sizeof(tpage));
>>>>>>
>>>>>> @@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>>>>> rcu_read_lock();
>>>>>> bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>>>>>>
>>>>>> - for (i = 0; i < n; i++) {
>>>>>> + num_skbs = napi_skb_cache_get_bulk(skbs, n);
>>>>>
>>>>> Its document said:
>>>>>
>>>>> """
>>>>> * Must be called *only* from the BH context.
>>>>> “"”
>>>> We’re in a bh_disable section here, is that not good enough?
>>>
>>> Again this feels very ugly and prone to painting ourselves into a
>>> corner, assuming BH-disabled sections have same protection as NAPI.
>>>
>>> (The napi_skb_cache_get/put function are operating on per CPU arrays
>>> without any locking.)
>>
>> Happy to take suggestions on an alternative approach.
>>
>> Thoughts:
>> 1. Instead of having IFF_NAPI be an opt-in thing, clean up tun so it
>> is *always* NAPI’d 100% of the time?
>
> IFF_NAPI will have some overheads and it is introduced basically for
> testing if I was not wrong.
IIRC it was originally introduced for testing, but under some circumstances
can be wildly faster, see commit fb3f903769e805221eb19209b3d9128d398038a1
("tun: support NAPI for packets received from batched XDP buffs")
You may be thinking of IFF_NAPI_FRAGS, which seems very much “test only”
at this point.
Anyhow, assuming you are thinking of IFF_NAPI itself:
- Are the overheads you’ve got in mind completely structural/unavoidable?
- Or is that something that would be worthwhile looking at?
As a side note, one thing I did play with that is absolutely silly faster
is using IFF_NAPI with NAPI threads. Under certain scenarios (high tput
that is normally copy bound), gains were nutty (like ~75%+), so the point
is there may be some very interesting juice to squeeze going down that
path.
Coming back to the main path here, the whole reason I'm going down this
patchset is to try to pick up optimizations that are available in other
general-purpose drivers, which are all NAPI-ized. We're at the point now
where tun is getting left behind on things like this because of
non-full-time NAPI.
Said another way, I think it would be an advantage to NAPI-ize tun and
make it more like regular ole network drivers, so that the generic core
work being done will benefit tun by default.
>> Outside of people who have
>> wired this up in their apps manually, on the virtualization side
>> there is currently no support from QEMU/Libvirt to enable IFF_NAPI.
>> Might be a nice simplification/cleanup to just “do it” full time?
>> Then we can play all these sorts of games under the protection of
>> NAPI?
>
> A full benchmark needs to be run for this to see.
Do you have a suggested test/suite of tests you’d prefer me to run
so that I can make sure I’m gathering the data that you’d like to
see?
>> 2. (Some other non-dubious way of protecting this, without refactoring
>> for either conditional NAPI (yuck?) or refactoring for full time
>> NAPI? This would be nice, happy to take tips!
>> 3. ... ?
>>
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-03 4:34 ` Jon Kohler
@ 2025-12-03 6:40 ` Jason Wang
0 siblings, 0 replies; 29+ messages in thread
From: Jason Wang @ 2025-12-03 6:40 UTC (permalink / raw)
To: Jon Kohler
Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org, Willem de Bruijn,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_),
Sebastian Andrzej Siewior, Alexander Lobakin
On Wed, Dec 3, 2025 at 12:35 PM Jon Kohler <jon@nutanix.com> wrote:
>
>
>
> > On Dec 2, 2025, at 11:10 PM, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Dec 3, 2025 at 1:46 AM Jon Kohler <jon@nutanix.com> wrote:
> >>
> >>
> >>
> >>> On Dec 2, 2025, at 12:32 PM, Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> >>>
> >>>
> >>>
> >>> On 02/12/2025 17.49, Jon Kohler wrote:
> >>>>> On Nov 27, 2025, at 10:02 PM, Jason Wang <jasowang@redhat.com> wrote:
> >>>>>
> >>>>> On Wed, Nov 26, 2025 at 3:19 AM Jon Kohler <jon@nutanix.com> wrote:
> >>>>>>
> >>>>>> Optimize TUN_MSG_PTR batch processing by allocating sk_buff structures
> >>>>>> in bulk from the per-CPU NAPI cache using napi_skb_cache_get_bulk.
> >>>>>> This reduces allocation overhead and improves efficiency, especially
> >>>>>> when IFF_NAPI is enabled and GRO is feeding entries back to the cache.
> >>>>>
> >>>>> Does this mean we should only enable this when NAPI is used?
> >>>> No, it does not mean that at all, but I see what that would be confusing.
> >>>> I can clean up the commit msg on the next go around
> >>>>>>
> >>>>>> If bulk allocation cannot fully satisfy the batch, gracefully drop only
> >>>>>> the uncovered portion, allowing the rest of the batch to proceed, which
> >>>>>> is what already happens in the previous case where build_skb() would
> >>>>>> fail and return -ENOMEM.
> >>>>>>
> >>>>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
> >>>>>
> >>>>> Do we have any benchmark result for this?
> >>>> Yes, it is in the cover letter:
> >>>> https://patchwork.kernel.org/project/netdevbpf/cover/20251125200041.1565663-1-jon@nutanix.com/
> >
> > Ok but it only covers UDP, I think we want to see how it performs for
> > TCP as well as latency. Btw is the test for IFF_NAPI or not?
>
> This test was without IFF_NAPI, but I could get the NAPI numbers too
> More on that below
>
> >
> >>>>>> ---
> >>>>>> drivers/net/tun.c | 30 ++++++++++++++++++++++++------
> >>>>>> 1 file changed, 24 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>> index 97f130bc5fed..64f944cce517 100644
> >>>>>> --- a/drivers/net/tun.c
> >>>>>> +++ b/drivers/net/tun.c
> >>> [...]
> >>>>>> @@ -2454,6 +2455,7 @@ static int tun_xdp_one(struct tun_struct *tun,
> >>>>>> ret = tun_xdp_act(tun, xdp_prog, xdp, act);
> >>>>>> if (ret < 0) {
> >>>>>> /* tun_xdp_act already handles drop statistics */
> >>>>>> + kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
> >>>>>
> >>>>> This should belong to previous patches?
> >>>> Well, not really, as we did not even have an SKB to free at this point
> >>>> in the previous code
> >>>>>
> >>>>>> put_page(virt_to_head_page(xdp->data));
> >>>
> >>> This calling put_page() directly also looks dubious.
> >>>
> >>>>>> return ret;
> >>>>>> }
> >>>>>> @@ -2463,6 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
> >>>>>> *flush = true;
> >>>>>> fallthrough;
> >>>>>> case XDP_TX:
> >>>>>> + napi_consume_skb(skb, 1);
> >>>>>> return 0;
> >>>>>> case XDP_PASS:
> >>>>>> break;
> >>>>>> @@ -2475,13 +2478,15 @@ static int tun_xdp_one(struct tun_struct *tun,
> >>>>>> tpage->page = page;
> >>>>>> tpage->count = 1;
> >>>>>> }
> >>>>>> + napi_consume_skb(skb, 1);
> >>>>>
> >>>>> I wonder if this would have any side effects since tun_xdp_one() is
> >>>>> not called by a NAPI.
> >>>> As far as I can tell, this napi_consume_skb is really just an artifact of
> >>>> how it was named and how it was traditionally used.
> >>>> Now this is really just a napi_consume_skb within a bh disable/enable
> >>>> section, which should meet the requirements of how that interface
> >>>> should be used (again, AFAICT)
> >>>
> >>> Yicks - this sounds super ugly. Just wrapping napi_consume_skb() in bh
> >>> disable/enable section and then assuming you get the same protection as
> >>> NAPI is really dubious.
> >>>
> >>> Cc Sebastian as he is trying to cleanup these kind of use-case, to make
> >>> kernel preemption work.
> >>>
> >>>
> >>>>>
> >>>>>> return 0;
> >>>>>> }
> >>>>>> }
> >>>>>>
> >>>>>> build:
> >>>>>> - skb = build_skb(xdp->data_hard_start, buflen);
> >>>>>> + skb = build_skb_around(skb, xdp->data_hard_start, buflen);
> >>>>>> if (!skb) {
> >>>>>> + kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM);
> >>>> Though to your point, I dont think this actually does anything given
> >>>> that if the skb was somehow nuked as part of build_skb_around, there
> >>>> would not be an skb to free. Doesn’t hurt though, from a self documenting
> >>>> code perspective tho?
> >>>>>> dev_core_stats_rx_dropped_inc(tun->dev);
> >>>>>> return -ENOMEM;
> >>>>>> }
> >>>>>> @@ -2566,9 +2571,11 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
> >>>>>> if (m->msg_controllen == sizeof(struct tun_msg_ctl) &&
> >>>>>> ctl && ctl->type == TUN_MSG_PTR) {
> >>>>>> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> >>>>>> + int flush = 0, queued = 0, num_skbs = 0;
> >>>>>> struct tun_page tpage;
> >>>>>> int n = ctl->num;
> >>>>>> - int flush = 0, queued = 0;
> >>>>>> + /* Max size of VHOST_NET_BATCH */
> >>>>>> + void *skbs[64];
> >>>>>
> >>>>> I think we need some tweaks
> >>>>>
> >>>>> 1) TUN is decoupled from vhost, so it should have its own value (a
> >>>>> macro is better)
> >>>> Sure, I can make another constant that does a similar thing
> >>>>> 2) Provide a way to fail or handle the case when more than 64
> >>>> What if we simply assert that the maximum here is 64, which I think
> >>>> is what it actually is in practice?
> >
> > I still prefer a fallback.
>
> Ack, will chew on that for the next one, let’s settle on the larger
> elephant in the room which is the NAPI stuff below, as none of this
> goes anywhere without resolving that first.
>
> >
> >>>>>
> >>>>>>
> >>>>>> memset(&tpage, 0, sizeof(tpage));
> >>>>>>
> >>>>>> @@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
> >>>>>> rcu_read_lock();
> >>>>>> bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> >>>>>>
> >>>>>> - for (i = 0; i < n; i++) {
> >>>>>> + num_skbs = napi_skb_cache_get_bulk(skbs, n);
> >>>>>
> >>>>> Its document said:
> >>>>>
> >>>>> """
> >>>>> * Must be called *only* from the BH context.
> >>>>> “"”
> >>>> We’re in a bh_disable section here, is that not good enough?
> >>>
> >>> Again this feels very ugly and prone to painting ourselves into a
> >>> corner, assuming BH-disabled sections have same protection as NAPI.
> >>>
> >>> (The napi_skb_cache_get/put function are operating on per CPU arrays
> >>> without any locking.)
> >>
> >> Happy to take suggestions on an alternative approach.
> >>
> >> Thoughts:
> >> 1. Instead of having IFF_NAPI be an opt-in thing, clean up tun so it
> >> is *always* NAPI’d 100% of the time?
> >
> > IFF_NAPI will have some overheads and it is introduced basically for
> > testing if I was not wrong.
>
> IIRC it was originally introduced for testing, but under some circumstances
> can be wildly faster, see commit fb3f903769e805221eb19209b3d9128d398038a1
> ("tun: support NAPI for packets received from batched XDP buffs")
>
> You may be thinking of IFF_NAPI_FRAGS, which seems very much “test only”
> at this point.
>
> Anyhow, assuming you are thinking of IFF_NAPI itself:
> - Are the overheads you’ve got in mind completely structural/unavoidable?
NAPI will introduce some latency but I'm not sure if it can be
amortized by the bulking logic you want to introduce. So I think we
need benchmark numbers to decide.
> - Or is that something that would be worth while looking at?
I think so.
>
> As a side note, one thing I did play with that is absolutely silly faster
> is using IFF_NAPI with NAPI threads. Under certain scenarios (high tput
> that is normally copy bound), gains were nutty (like ~75%+), so the point
> is there may be some very interesting juice to squeeze going down that
> path.
I see.
>
> Coming back to the main path here, the whole reason I’m going down this
> patchset is to try to pickup optimizations that are available in other
> general purpose drivers, which are all NAPI-ized. We’re at the point now
> where tun is getting left behind for things like this because of non-full
> time NAPI.
>
> Said another way, I think it would be an advantage to NAPI-ize tun and
> make it more like regular ole network drivers, so that the generic core
> work being done will benefit tun by default.
The change looks non-trivial; maybe it would be easier to start by
optimizing the NAPI path first.
>
> >> Outside of people who have
> >> wired this up in their apps manually, on the virtualization side
> >> there is currently no support from QEMU/Libvirt to enable IFF_NAPI.
> >> Might be a nice simplification/cleanup to just “do it” full time?
> >> Then we can play all these sorts of games under the protection of
> >> NAPI?
> >
> > A full benchmark needs to be run for this to see.
>
> Do you have a suggested test/suite of tests you’d prefer me to run
> so that I can make sure I’m gathering the data that you’d like to
> see?
Just FYI, something like this:
https://lkml.org/lkml/2012/12/7/272
Thanks
>
> >> 2. (Some other non-dubious way of protecting this, without refactoring
> >> for either conditional NAPI (yuck?) or refactoring for full time
> >> NAPI? This would be nice, happy to take tips!
> >> 3. ... ?
> >>
> >
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-02 17:32 ` Jesper Dangaard Brouer
2025-12-02 17:45 ` Jon Kohler
@ 2025-12-03 8:47 ` Sebastian Andrzej Siewior
2025-12-03 15:35 ` Jon Kohler
1 sibling, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-12-03 8:47 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Jon Kohler, Jason Wang, netdev@vger.kernel.org, Willem de Bruijn,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_),
Alexander Lobakin
On 2025-12-02 18:32:23 [+0100], Jesper Dangaard Brouer wrote:
> > > > + napi_consume_skb(skb, 1);
> > >
> > > I wonder if this would have any side effects since tun_xdp_one() is
> > > not called by a NAPI.
> >
> > As far as I can tell, this napi_consume_skb is really just an artifact of
> > how it was named and how it was traditionally used.
> >
> > Now this is really just a napi_consume_skb within a bh disable/enable
> > section, which should meet the requirements of how that interface
> > should be used (again, AFAICT)
> >
>
> Yicks - this sounds super ugly. Just wrapping napi_consume_skb() in bh
> disable/enable section and then assuming you get the same protection as
> NAPI is really dubious.
>
> Cc Sebastian as he is trying to cleanup these kind of use-case, to make
> kernel preemption work.
I am actually done with this.
Wrapping napi_consume_skb(, 1) in bh-disable basically does the trick if
called from an outside-bh section, as long as it is not an IRQ section.
The reason is that the skb head is cached in a per-CPU cache which is
accessed only within softirq/NAPI context.
So you can "return" skbs in NET_TX and have some around in NET_RX.
Otherwise the skb is returned directly to the slab allocator.
If this is about skb recycling, using page_pool might help. That,
however, also expects NAPI/BH-disabled context.
> > > > @@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
> > > > rcu_read_lock();
> > > > bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> > > >
> > > > - for (i = 0; i < n; i++) {
> > > > + num_skbs = napi_skb_cache_get_bulk(skbs, n);
> > >
> > > Its document said:
> > >
> > > """
> > > * Must be called *only* from the BH context.
> > > “"”
> > We’re in a bh_disable section here, is that not good enough?
>
> Again this feels very ugly and prone to painting ourselves into a
> corner, assuming BH-disabled sections have same protection as NAPI.
>
> (The napi_skb_cache_get/put function are operating on per CPU arrays
> without any locking.)
This is okay. NAPI means BH is disabled. Nothing more. There are a few
implications to it.
The default path is
process-context (kernel or userland)
* IRQ *
-> irq is handled via its handler with disabled interrupts
-> handler raises NET_RX aka NAPI
-> irq core is done with IRQ handling and notices softirqs have been
raised. It disables BH and starts handling softirqs with interrupts
enabled, before returning to the point of interruption.
-> softirqs are handled with BH disabled.
* IRQ * fires again.
-> irq is handled as previously and NET_RX is set again.
-> irq core returns to the softirqs it was previously handling
-> Once NET_RX is done, the softirq core would normally be finished and
return, but since it noticed that NET_RX is pending (again) it does
another round.
This is how it normally works. If you disable BH in process context
(either manually via local_bh_disable() or via spin_lock_bh()) then you
enter BH context. There is hardly a difference (in_serving_softirq()
will report a different value, but this should not matter to anyone
outside the core code).
Any IRQ that raises NET_RX here will not lead to handling softirqs
because BH is disabled (this maps to the "IRQ fires again" case from
above). The handling is delayed until local_bh_enable().
Therefore protecting the per-CPU array with local_bh_disable() is okay,
but for PREEMPT_RT reasons the per-CPU data needs a
local_lock_nested_bh() around it (as napi_skb_cache_get/put do).
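To make that concrete, the pattern looks roughly like this (a minimal
sketch with made-up names, not the actual napi_alloc_cache code):

    struct pcpu_skb_stash {
            local_lock_t    bh_lock;        /* needed for PREEMPT_RT */
            void            *slot[64];
            unsigned int    count;
    };
    static DEFINE_PER_CPU(struct pcpu_skb_stash, skb_stash) = {
            .bh_lock = INIT_LOCAL_LOCK(bh_lock),
    };

    static void *stash_get(void)
    {
            struct pcpu_skb_stash *s;
            void *item = NULL;

            local_bh_disable();     /* enter BH context, defer softirqs */
            local_lock_nested_bh(&skb_stash.bh_lock);
            s = this_cpu_ptr(&skb_stash);
            if (s->count)
                    item = s->slot[--s->count];
            local_unlock_nested_bh(&skb_stash.bh_lock);
            local_bh_enable();      /* pending softirqs run here */
            return item;
    }

In the real code the nested-BH lock lives inside napi_skb_cache_get()/
napi_skb_cache_put() themselves, so a caller only needs the outer
local_bh_disable()/local_bh_enable() pair.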
> --Jesper
Sebastian
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-03 8:47 ` Sebastian Andrzej Siewior
@ 2025-12-03 15:35 ` Jon Kohler
2025-12-05 7:58 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 29+ messages in thread
From: Jon Kohler @ 2025-12-03 15:35 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Jesper Dangaard Brouer, Jason Wang, netdev@vger.kernel.org,
Willem de Bruijn, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_),
Alexander Lobakin
> On Dec 3, 2025, at 3:47 AM, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
>
> On 2025-12-02 18:32:23 [+0100], Jesper Dangaard Brouer wrote:
>>>>> + napi_consume_skb(skb, 1);
>>>>
>>>> I wonder if this would have any side effects since tun_xdp_one() is
>>>> not called by a NAPI.
>>>
>>> As far as I can tell, this napi_consume_skb is really just an artifact of
>>> how it was named and how it was traditionally used.
>>>
>>> Now this is really just a napi_consume_skb within a bh disable/enable
>>> section, which should meet the requirements of how that interface
>>> should be used (again, AFAICT)
>>>
>>
>> Yicks - this sounds super ugly. Just wrapping napi_consume_skb() in bh
>> disable/enable section and then assuming you get the same protection as
>> NAPI is really dubious.
>>
>> Cc Sebastian as he is trying to cleanup these kind of use-case, to make
>> kernel preemption work.
>
> I am actually done with this.
>
> Wrapping napi_consume_skb(, 1) in bh-disable basically does the trick if
> called from outside-bh section as long as it is not an IRQ section. The
> reason is that the skb-head is cached in a per-CPU cache which accessed
> only within softirq/ NAPI context.
> So you can "return" skbs in NET_TX and have some around in NET_RX.
> Otherwise skb is returned directly to the slab allocator.
> If this about skb recycling, you using page_pool might help. This
> however also expects NAPI/ BH disabled context.
>
>>>>> @@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>>>> rcu_read_lock();
>>>>> bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>>>>>
>>>>> - for (i = 0; i < n; i++) {
>>>>> + num_skbs = napi_skb_cache_get_bulk(skbs, n);
>>>>
>>>> Its document said:
>>>>
>>>> """
>>>> * Must be called *only* from the BH context.
>>>> “"”
>>> We’re in a bh_disable section here, is that not good enough?
>>
>> Again this feels very ugly and prone to painting ourselves into a
>> corner, assuming BH-disabled sections have same protection as NAPI.
>>
>> (The napi_skb_cache_get/put function are operating on per CPU arrays
>> without any locking.)
>
> This is okay. NAPI means BH is disabled. Nothing more. There are a few
> implications to it.
> The default path is
> process-context (kernel or userland)
> * IRQ *
> -> irq is handled via its handler with disabled interrupts
> -> handler raises NET_RX aka NAPI
> -> irq core is done with IRQ handling and notices softirqs have been
> raised. Disables BH and starts handling softirqs with enabled
> interrupts before returning back before the interruption.
> -> softirqs are handled with with BH disabled.
> * IRQ * fires again.
> -> irq is handled as previously and NET_RX is set again.
> -> irq core returns back to previously handled softirqs
> -> Once NET_RX is done, softirq core would be done and return back
> but since it noticed that NET_RX is pending (again) it does
> another round.
>
> This is how it normally works. If you disable-bh in process context
> (either manually via local_bh_disable() or via spin_lock_bh()) then you
> enter BH context. There is hardly a difference (in_serving_softirq()
> will report a different value but this should not matter to anyone
> outside the core code).
> Any IRQ that raises NET_RX here will not lead to handling softirqs
> because BH is disabled (this maps the "IRQ fires again" case from
> above). This is delayed until local_bh_enable().
>
> Therefore protecting the per-CPU array with local_bh_disable() is okay
> but for PREEMPT_RT reasons, per-CPU data needs this
> local_lock_nested_bh() around it (as napi_skb_cache_get/put does).
Thanks, Sebastian - so if I'm reading this correctly, it *is* fine to do
the following two patterns outside of NAPI:
local_bh_disable();
skb = napi_build_skb(buf, len);
local_bh_enable();
local_bh_disable();
napi_consume_skb(skb, 1);
local_bh_enable();
If so, I wonder if it would be cleaner to have something like
build_skb_bh(buf, len);
consume_skb_bh(skb, 1);
Then have those methods handle the local_bh disable/enable, so that
the toggle is a property of the call, not a requirement on the caller?
Similar in concept to ieee80211_rx_ni() defined in net/mac80211.h
(and there are a few others with this sort of pattern, like
netif_tx_disable())
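i.e. something like this (sketch only; these helpers don't exist today,
the names are just the ones floated above):

    static inline struct sk_buff *build_skb_bh(void *data, unsigned int frag_size)
    {
            struct sk_buff *skb;

            local_bh_disable();
            skb = napi_build_skb(data, frag_size);
            local_bh_enable();

            return skb;
    }

    static inline void consume_skb_bh(struct sk_buff *skb)
    {
            local_bh_disable();
            napi_consume_skb(skb, 1);       /* budget != 0: feed the NAPI cache */
            local_bh_enable();
    }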
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-03 15:35 ` Jon Kohler
@ 2025-12-05 7:58 ` Sebastian Andrzej Siewior
2025-12-05 13:21 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-12-05 7:58 UTC (permalink / raw)
To: Jon Kohler
Cc: Jesper Dangaard Brouer, Jason Wang, netdev@vger.kernel.org,
Willem de Bruijn, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, open list,
open list:XDP (eXpress Data Path):Keyword:(?:b|_)xdp(?:b|_),
Alexander Lobakin
On 2025-12-03 15:35:24 [+0000], Jon Kohler wrote:
> Thanks, Sebastian - so if I’m reading this correct, it *is* fine to do
> the two following patterns, outside of NAPI:
>
> local_bh_disable();
> skb = napi_build_skb(buf, len);
> local_bh_enable();
>
> local_bh_disable();
> napi_consume_skb(skb, 1);
> local_bh_enable();
>
> If so, I wonder if it would be cleaner to have something like
> build_skb_bh(buf, len);
>
> consume_skb_bh(skb, 1);
>
> Then have those methods handle the local_bh enable/disable, so that
> the toggle was a property of a call, not a requirement of the call?
Having budget = 0 would be for non-NAPI users. So passing the 1 is
superfluous. Your goal seems to be to re-use napi_alloc_cache. Right? And
this is better than skb_pool?
There is already napi_alloc_skb() which expects BH to be disabled and
netdev_alloc_skb() (and friends) which do disable BH if needed. I don't
see an equivalent for non-NAPI users. Haven't checked if any of these
could replace your napi_build_skb().
Historically non-NAPI users would be IRQ users and those can't do
local_bh_disable(). Therefore there is dev_kfree_skb_irq_reason() for
them. You need to delay the free for two reasons.
It seems pure software implementations didn't bother so far.
It might make sense to make napi_consume_skb() behave like
__netdev_alloc_skb(), so that budget=0 users also fill the pool, if this
is really a benefit.
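Something in that direction might look like the following (just a
sketch; the helper name is invented, and it only mirrors the checks
dev_consume_skb_any() already does, while routing to the NAPI cache):

    static void consume_skb_cached(struct sk_buff *skb)
    {
            if (in_hardirq() || irqs_disabled()) {
                    /* the per-CPU cache is BH-only, defer the free */
                    dev_consume_skb_irq(skb);
            } else if (in_softirq()) {
                    /* already in BH context (NAPI or a bh-disabled section) */
                    napi_consume_skb(skb, 1);
            } else {
                    local_bh_disable();
                    napi_consume_skb(skb, 1);
                    local_bh_enable();
            }
    }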
Sebastian
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-05 7:58 ` Sebastian Andrzej Siewior
@ 2025-12-05 13:21 ` Jesper Dangaard Brouer
2025-12-05 16:56 ` Jon Kohler
2025-12-08 11:04 ` Sebastian Andrzej Siewior
0 siblings, 2 replies; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2025-12-05 13:21 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Jon Kohler
Cc: Jason Wang, netdev@vger.kernel.org, Willem de Bruijn, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, open list, bpf, Alexander Lobakin
On 05/12/2025 08.58, Sebastian Andrzej Siewior wrote:
> On 2025-12-03 15:35:24 [+0000], Jon Kohler wrote:
>> Thanks, Sebastian - so if I’m reading this correct, it *is* fine to do
>> the two following patterns, outside of NAPI:
>>
>> local_bh_disable();
>> skb = napi_build_skb(buf, len);
>> local_bh_enable();
>>
>> local_bh_disable();
>> napi_consume_skb(skb, 1);
>> local_bh_enable();
>>
>> If so, I wonder if it would be cleaner to have something like
>> build_skb_bh(buf, len);
>>
>> consume_skb_bh(skb, 1);
>>
>> Then have those methods handle the local_bh enable/disable, so that
>> the toggle was a property of a call, not a requirement of the call?
>
> Having budget = 0 would be for non-NAPI users. So passing the 1 is
> superfluous. You goal seems to be to re-use napi_alloc_cache. Right? And
> this is better than skb_pool?
>
> There is already napi_alloc_skb() which expects BH to be disabled and
> netdev_alloc_skb() (and friends) which do disable BH if needed. I don't
> see an equivalent for non-NAPI users. Haven't checked if any of these
> could replace your napi_build_skb().
>
> Historically non-NAPI users would be IRQ users and those can't do
> local_bh_disable(). Therefore there is dev_kfree_skb_irq_reason() for
> them. You need to delay the free for two reasons.
> It seems pure software implementations didn't bother so far.
>
> It might make sense to do napi_consume_skb() similar to
> __netdev_alloc_skb() so that also budget=0 users fill the pool if this
> is really a benefit.
I'm not convinced that this "optimization" will be an actual benefit on
a busy system. Let me explain the side-effect of local_bh_enable().
Calling local_bh_enable() is adding a re-scheduling opportunity, e.g.
for processing softirq. For a benchmark this might not be noticeable as
this is the main workload. If there isn't any pending softirq this is
also not noticeable. In a more mixed workload (or packet storm) this
re-scheduling will allow others to "steal" CPU cycles from you.
Thus, you might not actually save any cycles via this short BH-disabled
section. I remember saving around 19 ns / 68 cycles on a 3.6GHz E5-1650
CPU by using this SKB recycle cache. The cost of a re-scheduling event is
likely more.
My advice is to use the napi_* functions when already running within a
BH-disabled section, as it makes sense to save those cycles there
(essentially reducing the time spent with BH disabled). Wrapping these
napi_* functions with BH-disable just to use them outside NAPI feels
wrong in so many ways.
Another reason why these napi_* functions belong with NAPI is that
netstack NIC drivers will (almost) always do TX completion first, which
frees/consumes some SKBs, and afterwards do RX processing that needs to
allocate SKBs for the incoming data frames. Thus, keeping a cache of SKBs
that were just released/consumed makes sense. (P.S. in the past we always
bulk-freed all SKBs in the napi cache when exiting NAPI, as they would
not be cache hot for the next round.)
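To illustrate that ordering, a stripped-down poll skeleton is sketched
below; foo_clean_tx()/foo_clean_rx() are placeholders for this
illustration, not a real driver:
#include <linux/netdevice.h>
/* Placeholders for this sketch; a real driver supplies these. */
static void foo_clean_tx(struct napi_struct *napi, int budget);
static int foo_clean_rx(struct napi_struct *napi, int budget);
static int foo_napi_poll(struct napi_struct *napi, int budget)
{
	int work_done;
	/* 1) Reap completed TX descriptors first; each freed skb goes
	 *    through napi_consume_skb(skb, budget) into the per-CPU cache. */
	foo_clean_tx(napi, budget);
	/* 2) RX processing then allocates from that still-warm cache via
	 *    napi_build_skb()/napi_alloc_skb(). */
	work_done = foo_clean_rx(napi, budget);
	if (work_done < budget)
		napi_complete_done(napi, work_done);
	return work_done;
}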
--Jesper
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-05 13:21 ` Jesper Dangaard Brouer
@ 2025-12-05 16:56 ` Jon Kohler
2025-12-08 11:04 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 29+ messages in thread
From: Jon Kohler @ 2025-12-05 16:56 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Sebastian Andrzej Siewior, Jason Wang, netdev@vger.kernel.org,
Willem de Bruijn, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, open list,
bpf@vger.kernel.org, Alexander Lobakin
> On Dec 5, 2025, at 8:21 AM, Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 05/12/2025 08.58, Sebastian Andrzej Siewior wrote:
>> On 2025-12-03 15:35:24 [+0000], Jon Kohler wrote:
>>> Thanks, Sebastian - so if I’m reading this correct, it *is* fine to do
>>> the two following patterns, outside of NAPI:
>>>
>>> local_bh_disable();
>>> skb = napi_build_skb(buf, len);
>>> local_bh_enable();
>>>
>>> local_bh_disable();
>>> napi_consume_skb(skb, 1);
>>> local_bh_enable();
>>>
>>> If so, I wonder if it would be cleaner to have something like
>>> build_skb_bh(buf, len);
>>>
>>> consume_skb_bh(skb, 1);
>>>
>>> Then have those methods handle the local_bh enable/disable, so that
>>> the toggle was a property of a call, not a requirement of the call?
>> Having budget = 0 would be for non-NAPI users. So passing the 1 is
>> superfluous. You goal seems to be to re-use napi_alloc_cache. Right? And
>> this is better than skb_pool?
Yes, the goal is to avoid the allocation/deallocation overhead, which is
very visible in flamegraphs, and doing it the way I’ve done it here does
benefit the benchmark I used. However, as Jason pointed out, I need to
look more broadly, so I’ll do that next time around.
>> There is already napi_alloc_skb() which expects BH to be disabled and
>> netdev_alloc_skb() (and friends) which do disable BH if needed. I don't
>> see an equivalent for non-NAPI users. Haven't checked if any of these
>> could replace your napi_build_skb().
>> Historically non-NAPI users would be IRQ users and those can't do
>> local_bh_disable(). Therefore there is dev_kfree_skb_irq_reason() for
>> them. You need to delay the free for two reasons.
>> It seems pure software implementations didn't bother so far.
>> It might make sense to do napi_consume_skb() similar to
>> __netdev_alloc_skb() so that also budget=0 users fill the pool if this
>> is really a benefit.
>
> I'm not convinced that this "optimization" will be an actual benefit on
> a busy system. Let me explain the side-effect of local_bh_enable().
>
> Calling local_bh_enable() is adding a re-scheduling opportunity, e.g.
> for processing softirq. For a benchmark this might not be noticeable as
> this is the main workload. If there isn't any pending softirq this is
> also not noticeable. In a more mixed workload (or packet storm) this
> re-scheduling will allow others to "steal" CPU cycles from you.
>
> Thus, you might not actually save any cycles via this short BH-disable
> section. I remember that I was saving around 19ns / 68cycles on a
> 3.6GHz E5-1650 CPU, by using this SKB recycle cache. The cost of a re-
> scheduling event is like more.
>
> My advice is to use the napi_* function when already running within a
> BH-disabled section, as it makes sense to save those cycles
> (essentially reducing the time spend with BH-disabled). Wrapping these
> napi_* function with BH-disabled just to use them outside NAPI feels
> wrong in so many ways.
>
> The another reason why these napi_* functions belongs with NAPI is that
> netstack NIC drivers will (almost) always do TX completion first, that
> will free/consume some SKBs, and afterwards do RX processing that need
> to allocate SKBs for the incoming data frames. Thus, keeping a cache of
> SKBs just released/consumed makes sense. (p.s. in the past we always
> bulk free'ed all SKBs in the napi cache when exiting NAPI, as they would
> not be cache hot for next round).
>
> --Jesper
Hey Jesper, thanks for the commentary, I appreciate it. I’ll chew on that
for the next revision of this work and see where it lands.
Thanks again - Jon
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
2025-12-05 13:21 ` Jesper Dangaard Brouer
2025-12-05 16:56 ` Jon Kohler
@ 2025-12-08 11:04 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-12-08 11:04 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Jon Kohler, Jason Wang, netdev@vger.kernel.org, Willem de Bruijn,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, open list, bpf, Alexander Lobakin
On 2025-12-05 14:21:51 [+0100], Jesper Dangaard Brouer wrote:
>
>
> On 05/12/2025 08.58, Sebastian Andrzej Siewior wrote:
> > On 2025-12-03 15:35:24 [+0000], Jon Kohler wrote:
> > > Thanks, Sebastian - so if I’m reading this correct, it *is* fine to do
> > > the two following patterns, outside of NAPI:
> > >
> > > local_bh_disable();
> > > skb = napi_build_skb(buf, len);
> > > local_bh_enable();
> > >
> > > local_bh_disable();
> > > napi_consume_skb(skb, 1);
> > > local_bh_enable();
> > >
> > > If so, I wonder if it would be cleaner to have something like
> > > build_skb_bh(buf, len);
> > >
> > > consume_skb_bh(skb, 1);
> > >
> > > Then have those methods handle the local_bh enable/disable, so that
> > > the toggle was a property of a call, not a requirement of the call?
> >
> > Having budget = 0 would be for non-NAPI users. So passing the 1 is
> > superfluous. You goal seems to be to re-use napi_alloc_cache. Right? And
> > this is better than skb_pool?
> >
> > There is already napi_alloc_skb() which expects BH to be disabled and
> > netdev_alloc_skb() (and friends) which do disable BH if needed. I don't
> > see an equivalent for non-NAPI users. Haven't checked if any of these
> > could replace your napi_build_skb().
> >
> > Historically non-NAPI users would be IRQ users and those can't do
> > local_bh_disable(). Therefore there is dev_kfree_skb_irq_reason() for
> > them. You need to delay the free for two reasons.
> > It seems pure software implementations didn't bother so far.
> >
> > It might make sense to do napi_consume_skb() similar to
> > __netdev_alloc_skb() so that also budget=0 users fill the pool if this
> > is really a benefit.
>
> I'm not convinced that this "optimization" will be an actual benefit on
> a busy system. Let me explain the side-effect of local_bh_enable().
I'm not arguing that this is the right thing to do; I am just saying that
it will not break anything as far as I am aware.
> Calling local_bh_enable() is adding a re-scheduling opportunity, e.g.
> for processing softirq. For a benchmark this might not be noticeable as
> this is the main workload. If there isn't any pending softirq this is
> also not noticeable. In a more mixed workload (or packet storm) this
> re-scheduling will allow others to "steal" CPU cycles from you.
If there weren't a bh disable/enable, then the context would be process
context and the softirq would be handled immediately. Now it is "delayed"
until the bh-enable.
The only advantage I see here is that the caller participates in
napi_alloc_cache.
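To make that timing difference concrete, a minimal sketch of the two call
patterns being compared (build_skb() is the plain non-NAPI builder;
buf/len/skb as in the snippets quoted above):
/* Plain process context: a softirq raised by an interrupt during this
 * call runs on return from that interrupt, before we resume here. */
skb = build_skb(buf, len);
/* Wrapped pattern: a softirq raised in this window is held back and only
 * runs once local_bh_enable() is reached on this CPU. */
local_bh_disable();
skb = napi_build_skb(buf, len);
local_bh_enable();	/* pending softirqs execute here */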
> Thus, you might not actually save any cycles via this short BH-disable
> section. I remember that I was saving around 19ns / 68cycles on a
> 3.6GHz E5-1650 CPU, by using this SKB recycle cache. The cost of a re-
> scheduling event is like more.
It might be expensive because you need to branch out, save/restore
interrupts and check a few flags. This is something you wouldn't have to
do if you returned it to the memory allocator.
> My advice is to use the napi_* function when already running within a
> BH-disabled section, as it makes sense to save those cycles
> (essentially reducing the time spend with BH-disabled). Wrapping these
> napi_* function with BH-disabled just to use them outside NAPI feels
> wrong in so many ways.
>
> The another reason why these napi_* functions belongs with NAPI is that
> netstack NIC drivers will (almost) always do TX completion first, that
> will free/consume some SKBs, and afterwards do RX processing that need
> to allocate SKBs for the incoming data frames. Thus, keeping a cache of
> SKBs just released/consumed makes sense. (p.s. in the past we always
> bulk free'ed all SKBs in the napi cache when exiting NAPI, as they would
> not be cache hot for next round).
Right. That is why I asked whether using an skb pool would be an
advantage, since you would have a fixed pool of skbs for TUN/XDP.
> --Jesper
Sebastian
^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2025-12-08 11:04 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-25 20:00 [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 1/9] tun: cleanup out label in tun_xdp_one Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 2/9] tun: correct drop statistics " Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 3/9] tun: correct drop statistics in tun_put_user Jon Kohler
2025-11-29 3:07 ` Willem de Bruijn
2025-12-02 16:40 ` Jon Kohler
2025-12-02 21:34 ` Willem de Bruijn
2025-12-02 21:36 ` Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 4/9] tun: correct drop statistics in tun_get_user Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one Jon Kohler
2025-11-28 3:02 ` Jason Wang
2025-12-02 16:49 ` Jon Kohler
2025-12-02 17:32 ` Jesper Dangaard Brouer
2025-12-02 17:45 ` Jon Kohler
2025-12-03 4:10 ` Jason Wang
2025-12-03 4:34 ` Jon Kohler
2025-12-03 6:40 ` Jason Wang
2025-12-03 8:47 ` Sebastian Andrzej Siewior
2025-12-03 15:35 ` Jon Kohler
2025-12-05 7:58 ` Sebastian Andrzej Siewior
2025-12-05 13:21 ` Jesper Dangaard Brouer
2025-12-05 16:56 ` Jon Kohler
2025-12-08 11:04 ` Sebastian Andrzej Siewior
2025-11-25 20:00 ` [PATCH net-next v2 6/9] tun: use napi_build_skb in __tun_build_skb Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 7/9] tun: use napi_consume_skb() in tun_put_user Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 8/9] net: core: export skb_defer_free_flush Jon Kohler
2025-11-25 20:00 ` [PATCH net-next v2 9/9] tun: flush deferred skb free list before bulk NAPI cache get Jon Kohler
2025-11-29 3:08 ` [PATCH net-next v2 0/9] tun: optimize SKB allocation with NAPI cache Willem de Bruijn
2025-12-02 16:38 ` Jon Kohler
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).