* [PATCH net-next 0/2] xsk: improve performance in copy mode
@ 2025-08-11 13:12 Jason Xing
2025-08-11 13:12 ` [PATCH net-next 1/2] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt Jason Xing
2025-08-11 13:12 ` [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode Jason Xing
0 siblings, 2 replies; 17+ messages in thread
From: Jason Xing @ 2025-08-11 13:12 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
In environments like VMs using virtio_net, not many devices support
advanced features such as multi-buffer or zerocopy, so xsk copy mode
becomes the default choice for kernel-bypass workloads instead of
resorting to DPDK.
Prior to this series, zerocopy mode performed better than copy mode. With
this series, copy mode outperforms zc mode by 12.9%, as measured on the
ixgbe driver with xdpsock.
The idea behind this series is to aggregate packets into small groups,
similar to GSO/GRO, and send each group in one go, grabbing the tx queue
lock and disabling bh only once per batch.
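To make that concrete, here is a conceptual sketch of the fast path (not
the actual patch code, which follows in patch 2; build_skbs_from_tx_ring()
is only a placeholder for the skb construction step):

    n = build_skbs_from_tx_ring(skbs, batch);   /* placeholder step */
    local_bh_disable();
    HARD_TX_LOCK(dev, txq, smp_processor_id());
    for (i = 0; i < n; i++)
        if (netdev_start_xmit(skbs[i], dev, txq, false) != NETDEV_TX_OK)
            break;
    HARD_TX_UNLOCK(dev, txq);
    local_bh_enable();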
Jason Xing (2):
xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
xsk: support generic batch xmit in copy mode
Documentation/networking/af_xdp.rst | 9 ++
include/linux/netdevice.h | 2 +
include/net/xdp_sock.h | 2 +
include/uapi/linux/if_xdp.h | 1 +
net/core/dev.c | 18 ++++
net/xdp/xsk.c | 135 +++++++++++++++++++++++++++-
tools/include/uapi/linux/if_xdp.h | 1 +
7 files changed, 165 insertions(+), 3 deletions(-)
--
2.41.3
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH net-next 1/2] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
2025-08-11 13:12 [PATCH net-next 0/2] xsk: improve performance in copy mode Jason Xing
@ 2025-08-11 13:12 ` Jason Xing
2025-08-12 16:40 ` Maciej Fijalkowski
2025-08-18 6:20 ` Dan Carpenter
2025-08-11 13:12 ` [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode Jason Xing
1 sibling, 2 replies; 17+ messages in thread
From: Jason Xing @ 2025-08-11 13:12 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
This patch prepares for batch xmit in the generic path. Add a new socket
option that provides an alternative way to achieve higher overall
throughput.
skb_batch will be used to store the skbs allocated in one go on the xmit
path.
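For illustration, a userspace application would enable the option roughly
like this (a minimal sketch assuming the uapi header from this series;
xsk_fd is the AF_XDP socket file descriptor):

    unsigned int batch = 32;

    if (setsockopt(xsk_fd, SOL_XDP, XDP_GENERIC_XMIT_BATCH,
                   &batch, sizeof(batch)))
        perror("setsockopt(XDP_GENERIC_XMIT_BATCH)");

Setting the value back to 0 frees the per-socket skb_batch array and falls
back to the non-batched xmit path; values larger than xs->max_tx_budget
are rejected.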
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
Documentation/networking/af_xdp.rst | 9 ++++++++
include/net/xdp_sock.h | 2 ++
include/uapi/linux/if_xdp.h | 1 +
net/xdp/xsk.c | 32 +++++++++++++++++++++++++++++
tools/include/uapi/linux/if_xdp.h | 1 +
5 files changed, 45 insertions(+)
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 50d92084a49c..1194bdfaf61e 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -447,6 +447,15 @@ mode to allow application to tune the per-socket maximum iteration for
better throughput and less frequency of send syscall.
Allowed range is [32, xs->tx->nentries].
+XDP_GENERIC_XMIT_BATCH
+----------------------
+
+This option allows an application to use batch xmit in copy mode. Batch
+processing minimizes how often the tx queue lock is grabbed and released,
+avoiding redundant per-packet work to improve overall throughput, at the
+possible cost of higher per-packet latency. The value must not be larger
+than xs->max_tx_budget.
+
XDP_STATISTICS getsockopt
-------------------------
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ce587a225661..b5a3e37da8db 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -61,6 +61,7 @@ struct xdp_sock {
XSK_BOUND,
XSK_UNBOUND,
} state;
+ struct sk_buff **skb_batch;
struct xsk_queue *tx ____cacheline_aligned_in_smp;
struct list_head tx_list;
@@ -70,6 +71,7 @@ struct xdp_sock {
* preventing other XSKs from being starved.
*/
u32 tx_budget_spent;
+ u32 generic_xmit_batch;
/* Statistics */
u64 rx_dropped;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 23a062781468..44cb72cd328e 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -80,6 +80,7 @@ struct xdp_mmap_offsets {
#define XDP_STATISTICS 7
#define XDP_OPTIONS 8
#define XDP_MAX_TX_SKB_BUDGET 9
+#define XDP_GENERIC_XMIT_BATCH 10
struct xdp_umem_reg {
__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 9c3acecc14b1..7a149f4ac273 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1122,6 +1122,7 @@ static int xsk_release(struct socket *sock)
xskq_destroy(xs->tx);
xskq_destroy(xs->fq_tmp);
xskq_destroy(xs->cq_tmp);
+ kfree(xs->skb_batch);
sock_orphan(sk);
sock->sk = NULL;
@@ -1456,6 +1457,37 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
WRITE_ONCE(xs->max_tx_budget, budget);
return 0;
}
+ case XDP_GENERIC_XMIT_BATCH:
+ {
+ unsigned int batch, batch_alloc_len;
+ struct sk_buff **new;
+
+ if (optlen != sizeof(batch))
+ return -EINVAL;
+ if (copy_from_sockptr(&batch, optval, sizeof(batch)))
+ return -EFAULT;
+ if (batch > xs->max_tx_budget)
+ return -EACCES;
+
+ mutex_lock(&xs->mutex);
+ if (!batch) {
+ kfree(xs->skb_batch);
+ xs->generic_xmit_batch = 0;
+ goto out;
+ }
+ batch_alloc_len = sizeof(struct sk_buff *) * batch;
+ new = kmalloc(batch_alloc_len, GFP_KERNEL);
+		if (!new) {
+			mutex_unlock(&xs->mutex);
+			return -ENOMEM;
+		}
+		kfree(xs->skb_batch);
+ xs->skb_batch = new;
+ xs->generic_xmit_batch = batch;
+out:
+ mutex_unlock(&xs->mutex);
+ return 0;
+ }
default:
break;
}
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 23a062781468..44cb72cd328e 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -80,6 +80,7 @@ struct xdp_mmap_offsets {
#define XDP_STATISTICS 7
#define XDP_OPTIONS 8
#define XDP_MAX_TX_SKB_BUDGET 9
+#define XDP_GENERIC_XMIT_BATCH 10
struct xdp_umem_reg {
__u64 addr; /* Start of packet data area */
--
2.41.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-11 13:12 [PATCH net-next 0/2] xsk: improve performance in copy mode Jason Xing
2025-08-11 13:12 ` [PATCH net-next 1/2] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt Jason Xing
@ 2025-08-11 13:12 ` Jason Xing
2025-08-12 14:30 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-08-11 13:12 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
Zerocopy mode benefits from multi-buffer support, while copy mode has to
transmit skbs one by one like the normal stack path. The latter loses some
of its kernel-bypass advantage because it grabs/releases the same tx queue
lock and disables/enables bh on a per-packet basis. Contending for the
same queue lock makes things even worse.
This patch adds a batch feature: the queue lock is held while up to
generic_xmit_batch packets are sent in one go. To further improve the
result, compared to __dev_direct_xmit(), some code[1] is deliberately left
out of xsk_direct_xmit_batch().
[1]
1. The device checks are hoisted to once per sendto() syscall.
2. Packet validation is removed because it is of no use here.
3. The softnet_data.xmit.recursion handling is removed because it is not
necessary.
4. BQL flow control is removed because it would probably limit the speed.
The ideal scenario is a standalone, clean tx queue used only by xsk; less
contention gives better performance.
Experiments:
1) Tested on virtio_net:
With this patch series applied, the xdpsock[2] throughput goes up by 33%,
from 767743 pps to 1021486 pps. With another thread competing for the same
queue, a 28% increase (from 405466 pps to 521076 pps) can be observed.
2) Tested on ixgbe:
Zerocopy and copy mode reach 1303277 pps and 1187347 pps respectively.
With this socket option in effect, copy mode reaches 1472367 pps, which is
impressively higher than zerocopy mode.
[2]: ./xdpsock -i eth1 -t -S -s 64
It's worth mentioning that batch processing might increase latency in
certain cases. The recommended value is 32.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
include/linux/netdevice.h | 2 +
net/core/dev.c | 18 +++++++
net/xdp/xsk.c | 103 ++++++++++++++++++++++++++++++++++++--
3 files changed, 120 insertions(+), 3 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5e5de4b0a433..27738894daa7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3352,6 +3352,8 @@ u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev);
int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
+int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
+ struct netdev_queue *txq, u32 max_batch, u32 *cur);
static inline int dev_queue_xmit(struct sk_buff *skb)
{
diff --git a/net/core/dev.c b/net/core/dev.c
index 68dc47d7e700..7a512bd38806 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4742,6 +4742,24 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
}
EXPORT_SYMBOL(__dev_queue_xmit);
+int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
+ struct netdev_queue *txq, u32 max_batch, u32 *cur)
+{
+ int ret = NETDEV_TX_BUSY;
+
+ local_bh_disable();
+ HARD_TX_LOCK(dev, txq, smp_processor_id());
+ for (; *cur < max_batch; (*cur)++) {
+ ret = netdev_start_xmit(skb[*cur], dev, txq, false);
+ if (ret != NETDEV_TX_OK)
+ break;
+ }
+ HARD_TX_UNLOCK(dev, txq);
+ local_bh_enable();
+
+ return ret;
+}
+
int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
{
struct net_device *dev = skb->dev;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 7a149f4ac273..92ad82472776 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -780,9 +780,102 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
return ERR_PTR(err);
}
-static int __xsk_generic_xmit(struct sock *sk)
+static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
+{
+ u32 max_batch = READ_ONCE(xs->generic_xmit_batch);
+ struct sk_buff **skb = xs->skb_batch;
+ struct net_device *dev = xs->dev;
+ struct netdev_queue *txq;
+ bool sent_frame = false;
+ struct xdp_desc desc;
+ u32 i = 0, j = 0;
+ u32 max_budget;
+ int err = 0;
+
+ mutex_lock(&xs->mutex);
+
+ /* Since we dropped the RCU read lock, the socket state might have changed. */
+ if (unlikely(!xsk_is_bound(xs))) {
+ err = -ENXIO;
+ goto out;
+ }
+
+ if (xs->queue_id >= dev->real_num_tx_queues)
+ goto out;
+
+ if (unlikely(!netif_running(dev) ||
+ !netif_carrier_ok(dev)))
+ goto out;
+
+ max_budget = READ_ONCE(xs->max_tx_budget);
+ txq = netdev_get_tx_queue(dev, xs->queue_id);
+ do {
+ for (; i < max_batch && xskq_cons_peek_desc(xs->tx, &desc, xs->pool); i++) {
+ if (max_budget-- == 0) {
+ err = -EAGAIN;
+ break;
+ }
+ /* This is the backpressure mechanism for the Tx path.
+ * Reserve space in the completion queue and only proceed
+ * if there is space in it. This avoids having to implement
+ * any buffering in the Tx path.
+ */
+ err = xsk_cq_reserve_addr_locked(xs->pool, desc.addr);
+ if (err) {
+ err = -EAGAIN;
+ break;
+ }
+
+ skb[i] = xsk_build_skb(xs, &desc);
+ if (IS_ERR(skb[i])) {
+ err = PTR_ERR(skb[i]);
+ break;
+ }
+
+ xskq_cons_release(xs->tx);
+
+ if (xp_mb_desc(&desc))
+ xs->skb = skb[i];
+ }
+
+ if (i) {
+ err = xsk_direct_xmit_batch(skb, dev, txq, i, &j);
+ if (err == NETDEV_TX_BUSY) {
+ err = -EAGAIN;
+ } else if (err == NET_XMIT_DROP) {
+ j++;
+ err = -EBUSY;
+ }
+
+ sent_frame = true;
+ xs->skb = NULL;
+ }
+
+ if (err)
+ goto out;
+ i = j = 0;
+ } while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool));
+
+ if (xskq_has_descs(xs->tx)) {
+ if (xs->skb)
+ xsk_drop_skb(xs->skb);
+ xskq_cons_release(xs->tx);
+ }
+
+out:
+ for (; j < i; j++) {
+ xskq_cons_cancel_n(xs->tx, xsk_get_num_desc(skb[j]));
+ xsk_consume_skb(skb[j]);
+ }
+ if (sent_frame)
+ __xsk_tx_release(xs);
+
+ mutex_unlock(&xs->mutex);
+ return err;
+}
+
+static int __xsk_generic_xmit(struct xdp_sock *xs)
{
- struct xdp_sock *xs = xdp_sk(sk);
bool sent_frame = false;
struct xdp_desc desc;
struct sk_buff *skb;
@@ -871,11 +964,15 @@ static int __xsk_generic_xmit(struct sock *sk)
static int xsk_generic_xmit(struct sock *sk)
{
+ struct xdp_sock *xs = xdp_sk(sk);
int ret;
/* Drop the RCU lock since the SKB path might sleep. */
rcu_read_unlock();
- ret = __xsk_generic_xmit(sk);
+ if (READ_ONCE(xs->generic_xmit_batch))
+ ret = __xsk_generic_xmit_batch(xs);
+ else
+ ret = __xsk_generic_xmit(xs);
/* Reaquire RCU lock before going into common code. */
rcu_read_lock();
--
2.41.3
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-11 13:12 ` [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode Jason Xing
@ 2025-08-12 14:30 ` Jesper Dangaard Brouer
2025-08-12 17:49 ` Maciej Fijalkowski
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-12 14:30 UTC (permalink / raw)
To: Jason Xing, davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
On 11/08/2025 15.12, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
>
> Zerocopy mode has a good feature named multi buffer while copy mode has
> to transmit skb one by one like normal flows. The latter might lose the
> bypass power to some extent because of grabbing/releasing the same tx
> queue lock and disabling/enabling bh and stuff on a packet basis.
> Contending the same queue lock will bring a worse result.
>
I actually think that it is worth optimizing the non-zerocopy mode for
AF_XDP. My use-case was virtual net_devices like veth.
> This patch supports batch feature by permitting owning the queue lock to
> send the generic_xmit_batch number of packets at one time. To further
> achieve a better result, some codes[1] are removed on purpose from
> xsk_direct_xmit_batch() as referred to __dev_direct_xmit().
>
> [1]
> 1. advance the device check to granularity of sendto syscall.
> 2. remove validating packets because of its uselessness.
> 3. remove operation of softnet_data.xmit.recursion because it's not
> necessary.
> 4. remove BQL flow control. We don't need to do BQL control because it
> probably limit the speed. An ideal scenario is to use a standalone and
> clean tx queue to send packets only for xsk. Less competition shows
> better performance results.
>
> Experiments:
> 1) Tested on virtio_net:
If you also want to test on veth, then an optimization is to increase
dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
AF_XDP packets getting reallocated by the veth driver. I never completed
upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
[1]
https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
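(For reference, that tweak boils down to something like the following
one-liner in the veth driver's setup path; a sketch of the idea described
above, not the actual commit from that link:)

    dev->needed_headroom = XDP_PACKET_HEADROOM;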
(more below...)
> With this patch series applied, the performance number of xdpsock[2] goes
> up by 33%. Before, it was 767743 pps; while after it was 1021486 pps.
> If we test with another thread competing the same queue, a 28% increase
> (from 405466 pps to 521076 pps) can be observed.
> 2) Tested on ixgbe:
> The results of zerocopy and copy mode are respectively 1303277 pps and
> 1187347 pps. After this socket option took effect, copy mode reaches
> 1472367 which was higher than zerocopy mode impressively.
>
> [2]: ./xdpsock -i eth1 -t -S -s 64
>
> It's worth mentioning batch process might bring high latency in certain
> cases. The recommended value is 32.
>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> include/linux/netdevice.h | 2 +
> net/core/dev.c | 18 +++++++
> net/xdp/xsk.c | 103 ++++++++++++++++++++++++++++++++++++--
> 3 files changed, 120 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5e5de4b0a433..27738894daa7 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3352,6 +3352,8 @@ u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
>
> int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev);
> int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
> +int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
> + struct netdev_queue *txq, u32 max_batch, u32 *cur);
>
> static inline int dev_queue_xmit(struct sk_buff *skb)
> {
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 68dc47d7e700..7a512bd38806 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4742,6 +4742,24 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> }
> EXPORT_SYMBOL(__dev_queue_xmit);
>
> +int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
> + struct netdev_queue *txq, u32 max_batch, u32 *cur)
> +{
> + int ret = NETDEV_TX_BUSY;
> +
> + local_bh_disable();
> + HARD_TX_LOCK(dev, txq, smp_processor_id());
> + for (; *cur < max_batch; (*cur)++) {
> + ret = netdev_start_xmit(skb[*cur], dev, txq, false);
The last argument ('false') to netdev_start_xmit() indicates whether there
are 'more' packets to be sent. This allows the NIC driver to postpone
writing the tail pointer/doorbell. For physical hardware this is a large
performance gain.
If the index has not reached 'max_batch' then we know 'more' packets are true:
bool more = !!(*cur != max_batch);
Can I ask you to do a test with netdev_start_xmit() using the 'more'
boolean?
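(For reference, a minimal, untested sketch of the loop with the xmit_more
hint; note the hint is computed so that it ends up false for the last
entry of the batch, otherwise the final doorbell write could stay
deferred:)

    for (; *cur < max_batch; (*cur)++) {
        bool more = (*cur + 1 < max_batch);

        ret = netdev_start_xmit(skb[*cur], dev, txq, more);
        if (ret != NETDEV_TX_OK)
            break;
    }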
> + if (ret != NETDEV_TX_OK)
> + break;
> + }
> + HARD_TX_UNLOCK(dev, txq);
> + local_bh_enable();
> +
> + return ret;
> +}
> +
> int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
> {
> struct net_device *dev = skb->dev;
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 7a149f4ac273..92ad82472776 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -780,9 +780,102 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> return ERR_PTR(err);
> }
>
> -static int __xsk_generic_xmit(struct sock *sk)
> +static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
> +{
> + u32 max_batch = READ_ONCE(xs->generic_xmit_batch);
> + struct sk_buff **skb = xs->skb_batch;
> + struct net_device *dev = xs->dev;
> + struct netdev_queue *txq;
> + bool sent_frame = false;
> + struct xdp_desc desc;
> + u32 i = 0, j = 0;
> + u32 max_budget;
> + int err = 0;
> +
> + mutex_lock(&xs->mutex);
> +
> + /* Since we dropped the RCU read lock, the socket state might have changed. */
> + if (unlikely(!xsk_is_bound(xs))) {
> + err = -ENXIO;
> + goto out;
> + }
> +
> + if (xs->queue_id >= dev->real_num_tx_queues)
> + goto out;
> +
> + if (unlikely(!netif_running(dev) ||
> + !netif_carrier_ok(dev)))
> + goto out;
> +
> + max_budget = READ_ONCE(xs->max_tx_budget);
> + txq = netdev_get_tx_queue(dev, xs->queue_id);
> + do {
> + for (; i < max_batch && xskq_cons_peek_desc(xs->tx, &desc, xs->pool); i++) {
> + if (max_budget-- == 0) {
> + err = -EAGAIN;
> + break;
> + }
> + /* This is the backpressure mechanism for the Tx path.
> + * Reserve space in the completion queue and only proceed
> + * if there is space in it. This avoids having to implement
> + * any buffering in the Tx path.
> + */
> + err = xsk_cq_reserve_addr_locked(xs->pool, desc.addr);
> + if (err) {
> + err = -EAGAIN;
> + break;
> + }
> +
> + skb[i] = xsk_build_skb(xs, &desc);
There is a missed opportunity for bulk allocating the SKBs here
(via kmem_cache_alloc_bulk).
But this also requires changing the SKB alloc function used by
xsk_build_skb(). As a separate patch, I recommend that you change
sock_alloc_send_skb() to instead use build_skb() (or build_skb_around()).
I expect this will be a large performance improvement on its own.
Can I ask you to benchmark this change before the batch xmit change?
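(A rough sketch of that direction, purely illustrative and not a complete
replacement for xsk_build_skb(); 'hr' and 'len' stand for the headroom and
descriptor length roughly as computed there today, and needed_tailroom is
ignored for brevity:)

    u32 truesize = SKB_DATA_ALIGN(hr + len) +
                   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
    void *buf = netdev_alloc_frag(truesize);
    struct sk_buff *skb;

    if (!buf)
        return ERR_PTR(-ENOMEM);
    skb = build_skb(buf, truesize);
    if (!skb) {
        skb_free_frag(buf);
        return ERR_PTR(-ENOMEM);
    }
    skb_reserve(skb, hr);
    /* then copy the descriptor payload into skb_put(skb, len) as today */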
Opinions needed from other maintainers please (I might be wrong!):
I don't think the socket-level accounting done in sock_alloc_send_skb()
is correct/relevant for AF_XDP/XSK, because of the "backpressure mechanism"
code comment above.
--Jesper
> + if (IS_ERR(skb[i])) {
> + err = PTR_ERR(skb[i]);
> + break;
> + }
> +
> + xskq_cons_release(xs->tx);
> +
> + if (xp_mb_desc(&desc))
> + xs->skb = skb[i];
> + }
> +
> + if (i) {
> + err = xsk_direct_xmit_batch(skb, dev, txq, i, &j);
> + if (err == NETDEV_TX_BUSY) {
> + err = -EAGAIN;
> + } else if (err == NET_XMIT_DROP) {
> + j++;
> + err = -EBUSY;
> + }
> +
> + sent_frame = true;
> + xs->skb = NULL;
> + }
> +
> + if (err)
> + goto out;
> + i = j = 0;
> + } while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool));
> +
> + if (xskq_has_descs(xs->tx)) {
> + if (xs->skb)
> + xsk_drop_skb(xs->skb);
> + xskq_cons_release(xs->tx);
> + }
> +
> +out:
> + for (; j < i; j++) {
> + xskq_cons_cancel_n(xs->tx, xsk_get_num_desc(skb[j]));
> + xsk_consume_skb(skb[j]);
> + }
> + if (sent_frame)
> + __xsk_tx_release(xs);
> +
> + mutex_unlock(&xs->mutex);
> + return err;
> +}
> +
> +static int __xsk_generic_xmit(struct xdp_sock *xs)
> {
> - struct xdp_sock *xs = xdp_sk(sk);
> bool sent_frame = false;
> struct xdp_desc desc;
> struct sk_buff *skb;
> @@ -871,11 +964,15 @@ static int __xsk_generic_xmit(struct sock *sk)
>
> static int xsk_generic_xmit(struct sock *sk)
> {
> + struct xdp_sock *xs = xdp_sk(sk);
> int ret;
>
> /* Drop the RCU lock since the SKB path might sleep. */
> rcu_read_unlock();
> - ret = __xsk_generic_xmit(sk);
> + if (READ_ONCE(xs->generic_xmit_batch))
> + ret = __xsk_generic_xmit_batch(xs);
> + else
> + ret = __xsk_generic_xmit(xs);
> /* Reaquire RCU lock before going into common code. */
> rcu_read_lock();
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 1/2] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
2025-08-11 13:12 ` [PATCH net-next 1/2] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt Jason Xing
@ 2025-08-12 16:40 ` Maciej Fijalkowski
2025-08-12 23:46 ` Jason Xing
2025-08-18 6:20 ` Dan Carpenter
1 sibling, 1 reply; 17+ messages in thread
From: Maciej Fijalkowski @ 2025-08-12 16:40 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
andrew+netdev, bpf, netdev, Jason Xing
On Mon, Aug 11, 2025 at 09:12:35PM +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
>
> This patch is to prepare for later batch xmit in generic path. Add a new
> socket option to provide an alternative to achieve a higher overall
> throughput.
>
> skb_batch will be used to store newly allocated skb at one time in the
> xmit path.
I don't think we need yet another setsockopt. You previously added a knob
for manipulating max tx budget on generic xmit and that should be enough.
I think that we should strive for making the batching approach a default
path in xsk generic xmit.
>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> Documentation/networking/af_xdp.rst | 9 ++++++++
> include/net/xdp_sock.h | 2 ++
> include/uapi/linux/if_xdp.h | 1 +
> net/xdp/xsk.c | 32 +++++++++++++++++++++++++++++
> tools/include/uapi/linux/if_xdp.h | 1 +
> 5 files changed, 45 insertions(+)
>
> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> index 50d92084a49c..1194bdfaf61e 100644
> --- a/Documentation/networking/af_xdp.rst
> +++ b/Documentation/networking/af_xdp.rst
> @@ -447,6 +447,15 @@ mode to allow application to tune the per-socket maximum iteration for
> better throughput and less frequency of send syscall.
> Allowed range is [32, xs->tx->nentries].
>
> +XDP_GENERIC_XMIT_BATCH
> +----------------------
> +
> +It provides an option that allows application to use batch xmit in the copy
> +mode. Batch process minimizes the number of grabbing/releasing queue lock
> +without redundant actions compared to before to gain the overall performance
> +improvement whereas it might increase the latency of per packet. The maximum
> +value shouldn't be larger than xs->max_tx_budget.
> +
> XDP_STATISTICS getsockopt
> -------------------------
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index ce587a225661..b5a3e37da8db 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -61,6 +61,7 @@ struct xdp_sock {
> XSK_BOUND,
> XSK_UNBOUND,
> } state;
> + struct sk_buff **skb_batch;
>
> struct xsk_queue *tx ____cacheline_aligned_in_smp;
> struct list_head tx_list;
> @@ -70,6 +71,7 @@ struct xdp_sock {
> * preventing other XSKs from being starved.
> */
> u32 tx_budget_spent;
> + u32 generic_xmit_batch;
>
> /* Statistics */
> u64 rx_dropped;
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 23a062781468..44cb72cd328e 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -80,6 +80,7 @@ struct xdp_mmap_offsets {
> #define XDP_STATISTICS 7
> #define XDP_OPTIONS 8
> #define XDP_MAX_TX_SKB_BUDGET 9
> +#define XDP_GENERIC_XMIT_BATCH 10
>
> struct xdp_umem_reg {
> __u64 addr; /* Start of packet data area */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 9c3acecc14b1..7a149f4ac273 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -1122,6 +1122,7 @@ static int xsk_release(struct socket *sock)
> xskq_destroy(xs->tx);
> xskq_destroy(xs->fq_tmp);
> xskq_destroy(xs->cq_tmp);
> + kfree(xs->skb_batch);
>
> sock_orphan(sk);
> sock->sk = NULL;
> @@ -1456,6 +1457,37 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> WRITE_ONCE(xs->max_tx_budget, budget);
> return 0;
> }
> + case XDP_GENERIC_XMIT_BATCH:
> + {
> + unsigned int batch, batch_alloc_len;
> + struct sk_buff **new;
> +
> + if (optlen != sizeof(batch))
> + return -EINVAL;
> + if (copy_from_sockptr(&batch, optval, sizeof(batch)))
> + return -EFAULT;
> + if (batch > xs->max_tx_budget)
> + return -EACCES;
> +
> + mutex_lock(&xs->mutex);
> + if (!batch) {
> + kfree(xs->skb_batch);
> + xs->generic_xmit_batch = 0;
> + goto out;
> + }
> + batch_alloc_len = sizeof(struct sk_buff *) * batch;
> + new = kmalloc(batch_alloc_len, GFP_KERNEL);
> + if (!new)
> + return -ENOMEM;
> + if (xs->skb_batch)
> + kfree(xs->skb_batch);
> +
> + xs->skb_batch = new;
> + xs->generic_xmit_batch = batch;
> +out:
> + mutex_unlock(&xs->mutex);
> + return 0;
> + }
> default:
> break;
> }
> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> index 23a062781468..44cb72cd328e 100644
> --- a/tools/include/uapi/linux/if_xdp.h
> +++ b/tools/include/uapi/linux/if_xdp.h
> @@ -80,6 +80,7 @@ struct xdp_mmap_offsets {
> #define XDP_STATISTICS 7
> #define XDP_OPTIONS 8
> #define XDP_MAX_TX_SKB_BUDGET 9
> +#define XDP_GENERIC_XMIT_BATCH 10
>
> struct xdp_umem_reg {
> __u64 addr; /* Start of packet data area */
> --
> 2.41.3
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-12 14:30 ` Jesper Dangaard Brouer
@ 2025-08-12 17:49 ` Maciej Fijalkowski
2025-08-13 1:02 ` Jason Xing
2025-08-13 0:57 ` Jason Xing
2025-08-15 6:44 ` Jason Xing
2 siblings, 1 reply; 17+ messages in thread
From: Maciej Fijalkowski @ 2025-08-12 17:49 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Jason Xing, davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
jonathan.lemon, sdf, ast, daniel, john.fastabend, horms,
andrew+netdev, bpf, netdev, Jason Xing
On Tue, Aug 12, 2025 at 04:30:03PM +0200, Jesper Dangaard Brouer wrote:
>
>
> On 11/08/2025 15.12, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Zerocopy mode has a good feature named multi buffer while copy mode has
> > to transmit skb one by one like normal flows. The latter might lose the
> > bypass power to some extent because of grabbing/releasing the same tx
> > queue lock and disabling/enabling bh and stuff on a packet basis.
> > Contending the same queue lock will bring a worse result.
> >
>
> I actually think that it is worth optimizing the non-zerocopy mode for
> AF_XDP. My use-case was virtual net_devices like veth.
>
>
> > This patch supports batch feature by permitting owning the queue lock to
> > send the generic_xmit_batch number of packets at one time. To further
> > achieve a better result, some codes[1] are removed on purpose from
> > xsk_direct_xmit_batch() as referred to __dev_direct_xmit().
> >
> > [1]
> > 1. advance the device check to granularity of sendto syscall.
> > 2. remove validating packets because of its uselessness.
> > 3. remove operation of softnet_data.xmit.recursion because it's not
> > necessary.
> > 4. remove BQL flow control. We don't need to do BQL control because it
> > probably limit the speed. An ideal scenario is to use a standalone and
> > clean tx queue to send packets only for xsk. Less competition shows
> > better performance results.
> >
> > Experiments:
> > 1) Tested on virtio_net:
>
> If you also want to test on veth, then an optimization is to increase
> dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
> AF_XDP packets getting reallocated by veth driver. I never completed
> upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
>
> [1] https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
>
>
> (more below...)
>
> > With this patch series applied, the performance number of xdpsock[2] goes
> > up by 33%. Before, it was 767743 pps; while after it was 1021486 pps.
> > If we test with another thread competing the same queue, a 28% increase
> > (from 405466 pps to 521076 pps) can be observed.
> > 2) Tested on ixgbe:
> > The results of zerocopy and copy mode are respectively 1303277 pps and
> > 1187347 pps. After this socket option took effect, copy mode reaches
> > 1472367 which was higher than zerocopy mode impressively.
> >
> > [2]: ./xdpsock -i eth1 -t -S -s 64
> >
> > It's worth mentioning batch process might bring high latency in certain
> > cases. The recommended value is 32.
Given the issue I spotted on your ixgbe batching patch, the comparison
against zc performance is probably not reliable.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > include/linux/netdevice.h | 2 +
> > net/core/dev.c | 18 +++++++
> > net/xdp/xsk.c | 103 ++++++++++++++++++++++++++++++++++++--
> > 3 files changed, 120 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 5e5de4b0a433..27738894daa7 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3352,6 +3352,8 @@ u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
> > int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev);
> > int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
> > +int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
> > + struct netdev_queue *txq, u32 max_batch, u32 *cur);
> > static inline int dev_queue_xmit(struct sk_buff *skb)
> > {
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 68dc47d7e700..7a512bd38806 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4742,6 +4742,24 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > }
> > EXPORT_SYMBOL(__dev_queue_xmit);
> > +int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
> > + struct netdev_queue *txq, u32 max_batch, u32 *cur)
> > +{
> > + int ret = NETDEV_TX_BUSY;
> > +
> > + local_bh_disable();
> > + HARD_TX_LOCK(dev, txq, smp_processor_id());
> > + for (; *cur < max_batch; (*cur)++) {
> > + ret = netdev_start_xmit(skb[*cur], dev, txq, false);
>
> The last argument ('false') to netdev_start_xmit() indicate if there are
> 'more' packets to be sent. This allows the NIC driver to postpone
> writing the tail-pointer/doorbell. For physical hardware this is a large
> performance gain.
>
> If index have not reached 'max_batch' then we know 'more' packets are true.
>
> bool more = !!(*cur != max_batch);
>
> Can I ask you to do a test with netdev_start_xmit() using the 'more' boolian
> ?
>
>
> > + if (ret != NETDEV_TX_OK)
> > + break;
> > + }
> > + HARD_TX_UNLOCK(dev, txq);
> > + local_bh_enable();
> > +
> > + return ret;
> > +}
> > +
> > int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
> > {
> > struct net_device *dev = skb->dev;
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 7a149f4ac273..92ad82472776 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -780,9 +780,102 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > return ERR_PTR(err);
> > }
> > -static int __xsk_generic_xmit(struct sock *sk)
> > +static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
> > +{
> > + u32 max_batch = READ_ONCE(xs->generic_xmit_batch);
> > + struct sk_buff **skb = xs->skb_batch;
> > + struct net_device *dev = xs->dev;
> > + struct netdev_queue *txq;
> > + bool sent_frame = false;
> > + struct xdp_desc desc;
> > + u32 i = 0, j = 0;
> > + u32 max_budget;
> > + int err = 0;
> > +
> > + mutex_lock(&xs->mutex);
> > +
> > + /* Since we dropped the RCU read lock, the socket state might have changed. */
> > + if (unlikely(!xsk_is_bound(xs))) {
> > + err = -ENXIO;
> > + goto out;
> > + }
> > +
> > + if (xs->queue_id >= dev->real_num_tx_queues)
> > + goto out;
> > +
> > + if (unlikely(!netif_running(dev) ||
> > + !netif_carrier_ok(dev)))
> > + goto out;
> > +
> > + max_budget = READ_ONCE(xs->max_tx_budget);
> > + txq = netdev_get_tx_queue(dev, xs->queue_id);
> > + do {
> > + for (; i < max_batch && xskq_cons_peek_desc(xs->tx, &desc, xs->pool); i++) {
Here we should think about how to come up with a slightly modified version
of xsk_tx_peek_release_desc_batch() for generic xmit needs, or about what
we could borrow from that approach that would be applicable here.
> > + if (max_budget-- == 0) {
> > + err = -EAGAIN;
> > + break;
> > + }
> > + /* This is the backpressure mechanism for the Tx path.
> > + * Reserve space in the completion queue and only proceed
> > + * if there is space in it. This avoids having to implement
> > + * any buffering in the Tx path.
> > + */
> > + err = xsk_cq_reserve_addr_locked(xs->pool, desc.addr);
> > + if (err) {
> > + err = -EAGAIN;
> > + break;
> > + }
> > +
> > + skb[i] = xsk_build_skb(xs, &desc);
>
> There is a missed opportunity for bulk allocating the SKBs here
> (via kmem_cache_alloc_bulk).
+1
>
> But this also requires changing the SKB alloc function used by
> xsk_build_skb(). As a seperate patch, I recommend that you change the
> sock_alloc_send_skb() to instead use build_skb (or build_skb_around).
> I expect this will be a large performance improvement on it's own.
> Can I ask you to benchmark this change before the batch xmit change?
>
> Opinions needed from other maintainers please (I might be wrong!):
> I don't think the socket level accounting done in sock_alloc_send_skb()
> is correct/relevant for AF_XDP/XSK, because the "backpressure mechanism"
> code comment above.
Thanks for bringing this up, I had the same feeling.
>
> --Jesper
>
> > + if (IS_ERR(skb[i])) {
> > + err = PTR_ERR(skb[i]);
> > + break;
> > + }
> > +
> > + xskq_cons_release(xs->tx);
> > +
> > + if (xp_mb_desc(&desc))
> > + xs->skb = skb[i];
> > + }
> > +
> > + if (i) {
> > + err = xsk_direct_xmit_batch(skb, dev, txq, i, &j);
> > + if (err == NETDEV_TX_BUSY) {
> > + err = -EAGAIN;
> > + } else if (err == NET_XMIT_DROP) {
> > + j++;
> > + err = -EBUSY;
> > + }
> > +
> > + sent_frame = true;
> > + xs->skb = NULL;
> > + }
> > +
> > + if (err)
> > + goto out;
> > + i = j = 0;
> > + } while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool));
From a quick glance I don't follow why you have this call here whilst also
having it in the loop above.
BTW, do we have anything for bulk skb freeing in the kernel? Given we're
eventually going to do kmem_cache_alloc_bulk() for skbs, could we do
kmem_cache_free_bulk() as well?
> > +
> > + if (xskq_has_descs(xs->tx)) {
> > + if (xs->skb)
> > + xsk_drop_skb(xs->skb);
> > + xskq_cons_release(xs->tx);
> > + }
> > +
> > +out:
> > + for (; j < i; j++) {
> > + xskq_cons_cancel_n(xs->tx, xsk_get_num_desc(skb[j]));
> > + xsk_consume_skb(skb[j]);
> > + }
> > + if (sent_frame)
> > + __xsk_tx_release(xs);
> > +
> > + mutex_unlock(&xs->mutex);
> > + return err;
> > +}
> > +
> > +static int __xsk_generic_xmit(struct xdp_sock *xs)
> > {
> > - struct xdp_sock *xs = xdp_sk(sk);
> > bool sent_frame = false;
> > struct xdp_desc desc;
> > struct sk_buff *skb;
> > @@ -871,11 +964,15 @@ static int __xsk_generic_xmit(struct sock *sk)
> > static int xsk_generic_xmit(struct sock *sk)
> > {
> > + struct xdp_sock *xs = xdp_sk(sk);
> > int ret;
> > /* Drop the RCU lock since the SKB path might sleep. */
> > rcu_read_unlock();
> > - ret = __xsk_generic_xmit(sk);
> > + if (READ_ONCE(xs->generic_xmit_batch))
> > + ret = __xsk_generic_xmit_batch(xs);
> > + else
> > + ret = __xsk_generic_xmit(xs);
> > /* Reaquire RCU lock before going into common code. */
> > rcu_read_lock();
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 1/2] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
2025-08-12 16:40 ` Maciej Fijalkowski
@ 2025-08-12 23:46 ` Jason Xing
0 siblings, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-08-12 23:46 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
andrew+netdev, bpf, netdev, Jason Xing
On Wed, Aug 13, 2025 at 12:40 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Mon, Aug 11, 2025 at 09:12:35PM +0800, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > This patch is to prepare for later batch xmit in generic path. Add a new
> > socket option to provide an alternative to achieve a higher overall
> > throughput.
> >
> > skb_batch will be used to store newly allocated skb at one time in the
> > xmit path.
>
> I don't think we need yet another setsockopt. You previously added a knob
> for manipulating max tx budget on generic xmit and that should be enough.
> I think that we should strive for making the batching approach a default
> path in xsk generic xmit.
You're right, that's the direction we should take. But I considered this
as well before cooking the series and then gave up: my experiments show
that in some real cases (not xdpsock) the batch process might increase
latency. It's a side effect. At the time, I was reminded that GRO didn't
become the default either when it was introduced many years ago.
Thanks,
Jason
>
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > Documentation/networking/af_xdp.rst | 9 ++++++++
> > include/net/xdp_sock.h | 2 ++
> > include/uapi/linux/if_xdp.h | 1 +
> > net/xdp/xsk.c | 32 +++++++++++++++++++++++++++++
> > tools/include/uapi/linux/if_xdp.h | 1 +
> > 5 files changed, 45 insertions(+)
> >
> > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> > index 50d92084a49c..1194bdfaf61e 100644
> > --- a/Documentation/networking/af_xdp.rst
> > +++ b/Documentation/networking/af_xdp.rst
> > @@ -447,6 +447,15 @@ mode to allow application to tune the per-socket maximum iteration for
> > better throughput and less frequency of send syscall.
> > Allowed range is [32, xs->tx->nentries].
> >
> > +XDP_GENERIC_XMIT_BATCH
> > +----------------------
> > +
> > +It provides an option that allows application to use batch xmit in the copy
> > +mode. Batch process minimizes the number of grabbing/releasing queue lock
> > +without redundant actions compared to before to gain the overall performance
> > +improvement whereas it might increase the latency of per packet. The maximum
> > +value shouldn't be larger than xs->max_tx_budget.
> > +
> > XDP_STATISTICS getsockopt
> > -------------------------
> >
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index ce587a225661..b5a3e37da8db 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -61,6 +61,7 @@ struct xdp_sock {
> > XSK_BOUND,
> > XSK_UNBOUND,
> > } state;
> > + struct sk_buff **skb_batch;
> >
> > struct xsk_queue *tx ____cacheline_aligned_in_smp;
> > struct list_head tx_list;
> > @@ -70,6 +71,7 @@ struct xdp_sock {
> > * preventing other XSKs from being starved.
> > */
> > u32 tx_budget_spent;
> > + u32 generic_xmit_batch;
> >
> > /* Statistics */
> > u64 rx_dropped;
> > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > index 23a062781468..44cb72cd328e 100644
> > --- a/include/uapi/linux/if_xdp.h
> > +++ b/include/uapi/linux/if_xdp.h
> > @@ -80,6 +80,7 @@ struct xdp_mmap_offsets {
> > #define XDP_STATISTICS 7
> > #define XDP_OPTIONS 8
> > #define XDP_MAX_TX_SKB_BUDGET 9
> > +#define XDP_GENERIC_XMIT_BATCH 10
> >
> > struct xdp_umem_reg {
> > __u64 addr; /* Start of packet data area */
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 9c3acecc14b1..7a149f4ac273 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -1122,6 +1122,7 @@ static int xsk_release(struct socket *sock)
> > xskq_destroy(xs->tx);
> > xskq_destroy(xs->fq_tmp);
> > xskq_destroy(xs->cq_tmp);
> > + kfree(xs->skb_batch);
> >
> > sock_orphan(sk);
> > sock->sk = NULL;
> > @@ -1456,6 +1457,37 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > WRITE_ONCE(xs->max_tx_budget, budget);
> > return 0;
> > }
> > + case XDP_GENERIC_XMIT_BATCH:
> > + {
> > + unsigned int batch, batch_alloc_len;
> > + struct sk_buff **new;
> > +
> > + if (optlen != sizeof(batch))
> > + return -EINVAL;
> > + if (copy_from_sockptr(&batch, optval, sizeof(batch)))
> > + return -EFAULT;
> > + if (batch > xs->max_tx_budget)
> > + return -EACCES;
> > +
> > + mutex_lock(&xs->mutex);
> > + if (!batch) {
> > + kfree(xs->skb_batch);
> > + xs->generic_xmit_batch = 0;
> > + goto out;
> > + }
> > + batch_alloc_len = sizeof(struct sk_buff *) * batch;
> > + new = kmalloc(batch_alloc_len, GFP_KERNEL);
> > + if (!new)
> > + return -ENOMEM;
> > + if (xs->skb_batch)
> > + kfree(xs->skb_batch);
> > +
> > + xs->skb_batch = new;
> > + xs->generic_xmit_batch = batch;
> > +out:
> > + mutex_unlock(&xs->mutex);
> > + return 0;
> > + }
> > default:
> > break;
> > }
> > diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> > index 23a062781468..44cb72cd328e 100644
> > --- a/tools/include/uapi/linux/if_xdp.h
> > +++ b/tools/include/uapi/linux/if_xdp.h
> > @@ -80,6 +80,7 @@ struct xdp_mmap_offsets {
> > #define XDP_STATISTICS 7
> > #define XDP_OPTIONS 8
> > #define XDP_MAX_TX_SKB_BUDGET 9
> > +#define XDP_GENERIC_XMIT_BATCH 10
> >
> > struct xdp_umem_reg {
> > __u64 addr; /* Start of packet data area */
> > --
> > 2.41.3
> >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-12 14:30 ` Jesper Dangaard Brouer
2025-08-12 17:49 ` Maciej Fijalkowski
@ 2025-08-13 0:57 ` Jason Xing
2025-08-15 6:44 ` Jason Xing
2 siblings, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-08-13 0:57 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
On Tue, Aug 12, 2025 at 10:30 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 11/08/2025 15.12, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Zerocopy mode has a good feature named multi buffer while copy mode has
> > to transmit skb one by one like normal flows. The latter might lose the
> > bypass power to some extent because of grabbing/releasing the same tx
> > queue lock and disabling/enabling bh and stuff on a packet basis.
> > Contending the same queue lock will bring a worse result.
> >
>
> I actually think that it is worth optimizing the non-zerocopy mode for
> AF_XDP. My use-case was virtual net_devices like veth.
>
>
> > This patch supports batch feature by permitting owning the queue lock to
> > send the generic_xmit_batch number of packets at one time. To further
> > achieve a better result, some codes[1] are removed on purpose from
> > xsk_direct_xmit_batch() as referred to __dev_direct_xmit().
> >
> > [1]
> > 1. advance the device check to granularity of sendto syscall.
> > 2. remove validating packets because of its uselessness.
> > 3. remove operation of softnet_data.xmit.recursion because it's not
> > necessary.
> > 4. remove BQL flow control. We don't need to do BQL control because it
> > probably limit the speed. An ideal scenario is to use a standalone and
> > clean tx queue to send packets only for xsk. Less competition shows
> > better performance results.
> >
> > Experiments:
> > 1) Tested on virtio_net:
>
> If you also want to test on veth, then an optimization is to increase
> dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
> AF_XDP packets getting reallocated by veth driver. I never completed
> upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
>
> [1]
> https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
Oh, even though I'm not that familiar with veth, I'm willing to learn it
over the next few days. Thanks for sharing this with me!
>
>
> (more below...)
>
> > With this patch series applied, the performance number of xdpsock[2] goes
> > up by 33%. Before, it was 767743 pps; while after it was 1021486 pps.
> > If we test with another thread competing the same queue, a 28% increase
> > (from 405466 pps to 521076 pps) can be observed.
> > 2) Tested on ixgbe:
> > The results of zerocopy and copy mode are respectively 1303277 pps and
> > 1187347 pps. After this socket option took effect, copy mode reaches
> > 1472367 which was higher than zerocopy mode impressively.
> >
> > [2]: ./xdpsock -i eth1 -t -S -s 64
> >
> > It's worth mentioning batch process might bring high latency in certain
> > cases. The recommended value is 32.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > include/linux/netdevice.h | 2 +
> > net/core/dev.c | 18 +++++++
> > net/xdp/xsk.c | 103 ++++++++++++++++++++++++++++++++++++--
> > 3 files changed, 120 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 5e5de4b0a433..27738894daa7 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3352,6 +3352,8 @@ u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
> >
> > int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev);
> > int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
> > +int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
> > + struct netdev_queue *txq, u32 max_batch, u32 *cur);
> >
> > static inline int dev_queue_xmit(struct sk_buff *skb)
> > {
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 68dc47d7e700..7a512bd38806 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4742,6 +4742,24 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > }
> > EXPORT_SYMBOL(__dev_queue_xmit);
> >
> > +int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
> > + struct netdev_queue *txq, u32 max_batch, u32 *cur)
> > +{
> > + int ret = NETDEV_TX_BUSY;
> > +
> > + local_bh_disable();
> > + HARD_TX_LOCK(dev, txq, smp_processor_id());
> > + for (; *cur < max_batch; (*cur)++) {
> > + ret = netdev_start_xmit(skb[*cur], dev, txq, false);
>
> The last argument ('false') to netdev_start_xmit() indicate if there are
> 'more' packets to be sent. This allows the NIC driver to postpone
> writing the tail-pointer/doorbell. For physical hardware this is a large
> performance gain.
>
> If index have not reached 'max_batch' then we know 'more' packets are true.
>
> bool more = !!(*cur != max_batch);
>
> Can I ask you to do a test with netdev_start_xmit() using the 'more'
> boolian ?
Agreed, really insightful information. I'm taking notes here and will get
back with more information soon.
>
>
> > + if (ret != NETDEV_TX_OK)
> > + break;
> > + }
> > + HARD_TX_UNLOCK(dev, txq);
> > + local_bh_enable();
> > +
> > + return ret;
> > +}
> > +
> > int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
> > {
> > struct net_device *dev = skb->dev;
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 7a149f4ac273..92ad82472776 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -780,9 +780,102 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > return ERR_PTR(err);
> > }
> >
> > -static int __xsk_generic_xmit(struct sock *sk)
> > +static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
> > +{
> > + u32 max_batch = READ_ONCE(xs->generic_xmit_batch);
> > + struct sk_buff **skb = xs->skb_batch;
> > + struct net_device *dev = xs->dev;
> > + struct netdev_queue *txq;
> > + bool sent_frame = false;
> > + struct xdp_desc desc;
> > + u32 i = 0, j = 0;
> > + u32 max_budget;
> > + int err = 0;
> > +
> > + mutex_lock(&xs->mutex);
> > +
> > + /* Since we dropped the RCU read lock, the socket state might have changed. */
> > + if (unlikely(!xsk_is_bound(xs))) {
> > + err = -ENXIO;
> > + goto out;
> > + }
> > +
> > + if (xs->queue_id >= dev->real_num_tx_queues)
> > + goto out;
> > +
> > + if (unlikely(!netif_running(dev) ||
> > + !netif_carrier_ok(dev)))
> > + goto out;
> > +
> > + max_budget = READ_ONCE(xs->max_tx_budget);
> > + txq = netdev_get_tx_queue(dev, xs->queue_id);
> > + do {
> > + for (; i < max_batch && xskq_cons_peek_desc(xs->tx, &desc, xs->pool); i++) {
> > + if (max_budget-- == 0) {
> > + err = -EAGAIN;
> > + break;
> > + }
> > + /* This is the backpressure mechanism for the Tx path.
> > + * Reserve space in the completion queue and only proceed
> > + * if there is space in it. This avoids having to implement
> > + * any buffering in the Tx path.
> > + */
> > + err = xsk_cq_reserve_addr_locked(xs->pool, desc.addr);
> > + if (err) {
> > + err = -EAGAIN;
> > + break;
> > + }
> > +
> > + skb[i] = xsk_build_skb(xs, &desc);
>
> There is a missed opportunity for bulk allocating the SKBs here
> (via kmem_cache_alloc_bulk).
>
> But this also requires changing the SKB alloc function used by
> xsk_build_skb(). As a seperate patch, I recommend that you change the
> sock_alloc_send_skb() to instead use build_skb (or build_skb_around).
> I expect this will be a large performance improvement on it's own.
> Can I ask you to benchmark this change before the batch xmit change?
Sure, I will do that.
>
> Opinions needed from other maintainers please (I might be wrong!):
> I don't think the socket level accounting done in sock_alloc_send_skb()
> is correct/relevant for AF_XDP/XSK, because the "backpressure mechanism"
> code comment above.
Point taken. It seems there is no way to remove it in this common helper.
Let me give it more thought :)
Thanks,
Jason
>
> --Jesper
>
> > + if (IS_ERR(skb[i])) {
> > + err = PTR_ERR(skb[i]);
> > + break;
> > + }
> > +
> > + xskq_cons_release(xs->tx);
> > +
> > + if (xp_mb_desc(&desc))
> > + xs->skb = skb[i];
> > + }
> > +
> > + if (i) {
> > + err = xsk_direct_xmit_batch(skb, dev, txq, i, &j);
> > + if (err == NETDEV_TX_BUSY) {
> > + err = -EAGAIN;
> > + } else if (err == NET_XMIT_DROP) {
> > + j++;
> > + err = -EBUSY;
> > + }
> > +
> > + sent_frame = true;
> > + xs->skb = NULL;
> > + }
> > +
> > + if (err)
> > + goto out;
> > + i = j = 0;
> > + } while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool));
> > +
> > + if (xskq_has_descs(xs->tx)) {
> > + if (xs->skb)
> > + xsk_drop_skb(xs->skb);
> > + xskq_cons_release(xs->tx);
> > + }
> > +
> > +out:
> > + for (; j < i; j++) {
> > + xskq_cons_cancel_n(xs->tx, xsk_get_num_desc(skb[j]));
> > + xsk_consume_skb(skb[j]);
> > + }
> > + if (sent_frame)
> > + __xsk_tx_release(xs);
> > +
> > + mutex_unlock(&xs->mutex);
> > + return err;
> > +}
> > +
> > +static int __xsk_generic_xmit(struct xdp_sock *xs)
> > {
> > - struct xdp_sock *xs = xdp_sk(sk);
> > bool sent_frame = false;
> > struct xdp_desc desc;
> > struct sk_buff *skb;
> > @@ -871,11 +964,15 @@ static int __xsk_generic_xmit(struct sock *sk)
> >
> > static int xsk_generic_xmit(struct sock *sk)
> > {
> > + struct xdp_sock *xs = xdp_sk(sk);
> > int ret;
> >
> > /* Drop the RCU lock since the SKB path might sleep. */
> > rcu_read_unlock();
> > - ret = __xsk_generic_xmit(sk);
> > + if (READ_ONCE(xs->generic_xmit_batch))
> > + ret = __xsk_generic_xmit_batch(xs);
> > + else
> > + ret = __xsk_generic_xmit(xs);
> > /* Reaquire RCU lock before going into common code. */
> > rcu_read_lock();
> >
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-12 17:49 ` Maciej Fijalkowski
@ 2025-08-13 1:02 ` Jason Xing
2025-08-13 13:06 ` Jason Xing
0 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-08-13 1:02 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Jesper Dangaard Brouer, davem, edumazet, kuba, pabeni, bjorn,
magnus.karlsson, jonathan.lemon, sdf, ast, daniel, john.fastabend,
horms, andrew+netdev, bpf, netdev, Jason Xing
On Wed, Aug 13, 2025 at 1:49 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Tue, Aug 12, 2025 at 04:30:03PM +0200, Jesper Dangaard Brouer wrote:
> >
> >
> > On 11/08/2025 15.12, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > Zerocopy mode has a good feature named multi buffer while copy mode has
> > > to transmit skb one by one like normal flows. The latter might lose the
> > > bypass power to some extent because of grabbing/releasing the same tx
> > > queue lock and disabling/enabling bh and stuff on a packet basis.
> > > Contending the same queue lock will bring a worse result.
> > >
> >
> > I actually think that it is worth optimizing the non-zerocopy mode for
> > AF_XDP. My use-case was virtual net_devices like veth.
> >
> >
> > > This patch supports batch feature by permitting owning the queue lock to
> > > send the generic_xmit_batch number of packets at one time. To further
> > > achieve a better result, some codes[1] are removed on purpose from
> > > xsk_direct_xmit_batch() as referred to __dev_direct_xmit().
> > >
> > > [1]
> > > 1. advance the device check to granularity of sendto syscall.
> > > 2. remove validating packets because of its uselessness.
> > > 3. remove operation of softnet_data.xmit.recursion because it's not
> > > necessary.
> > > 4. remove BQL flow control. We don't need to do BQL control because it
> > > probably limit the speed. An ideal scenario is to use a standalone and
> > > clean tx queue to send packets only for xsk. Less competition shows
> > > better performance results.
> > >
> > > Experiments:
> > > 1) Tested on virtio_net:
> >
> > If you also want to test on veth, then an optimization is to increase
> > dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
> > AF_XDP packets getting reallocated by veth driver. I never completed
> > upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
> >
> > [1] https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
> >
> >
> > (more below...)
> >
> > > With this patch series applied, the performance number of xdpsock[2] goes
> > > up by 33%. Before, it was 767743 pps; while after it was 1021486 pps.
> > > If we test with another thread competing the same queue, a 28% increase
> > > (from 405466 pps to 521076 pps) can be observed.
> > > 2) Tested on ixgbe:
> > > The results of zerocopy and copy mode are respectively 1303277 pps and
> > > 1187347 pps. After this socket option took effect, copy mode reaches
> > > 1472367 which was higher than zerocopy mode impressively.
> > >
> > > [2]: ./xdpsock -i eth1 -t -S -s 64
> > >
> > > It's worth mentioning batch process might bring high latency in certain
> > > cases. The recommended value is 32.
>
> Given the issue I spotted on your ixgbe batching patch, the comparison
> against zc performance is probably not reliable.
Let me clarify: the zc performance was tested without that series applied,
i.e. without the series the number is 1303277 pps. The command I used is
'./xdpsock -i enp2s0f0np0 -t -q 11 -z -s 64'.
>
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > include/linux/netdevice.h | 2 +
> > > net/core/dev.c | 18 +++++++
> > > net/xdp/xsk.c | 103 ++++++++++++++++++++++++++++++++++++--
> > > 3 files changed, 120 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index 5e5de4b0a433..27738894daa7 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -3352,6 +3352,8 @@ u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
> > > int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev);
> > > int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
> > > +int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
> > > + struct netdev_queue *txq, u32 max_batch, u32 *cur);
> > > static inline int dev_queue_xmit(struct sk_buff *skb)
> > > {
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 68dc47d7e700..7a512bd38806 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -4742,6 +4742,24 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > > }
> > > EXPORT_SYMBOL(__dev_queue_xmit);
> > > +int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
> > > + struct netdev_queue *txq, u32 max_batch, u32 *cur)
> > > +{
> > > + int ret = NETDEV_TX_BUSY;
> > > +
> > > + local_bh_disable();
> > > + HARD_TX_LOCK(dev, txq, smp_processor_id());
> > > + for (; *cur < max_batch; (*cur)++) {
> > > + ret = netdev_start_xmit(skb[*cur], dev, txq, false);
> >
> > The last argument ('false') to netdev_start_xmit() indicate if there are
> > 'more' packets to be sent. This allows the NIC driver to postpone
> > writing the tail-pointer/doorbell. For physical hardware this is a large
> > performance gain.
> >
> > If index have not reached 'max_batch' then we know 'more' packets are true.
> >
> > bool more = !!(*cur != max_batch);
> >
> > Can I ask you to do a test with netdev_start_xmit() using the 'more' boolian
> > ?
> >
> >
> > > + if (ret != NETDEV_TX_OK)
> > > + break;
> > > + }
> > > + HARD_TX_UNLOCK(dev, txq);
> > > + local_bh_enable();
> > > +
> > > + return ret;
> > > +}
> > > +
> > > int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
> > > {
> > > struct net_device *dev = skb->dev;
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 7a149f4ac273..92ad82472776 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -780,9 +780,102 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > return ERR_PTR(err);
> > > }
> > > -static int __xsk_generic_xmit(struct sock *sk)
> > > +static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
> > > +{
> > > + u32 max_batch = READ_ONCE(xs->generic_xmit_batch);
> > > + struct sk_buff **skb = xs->skb_batch;
> > > + struct net_device *dev = xs->dev;
> > > + struct netdev_queue *txq;
> > > + bool sent_frame = false;
> > > + struct xdp_desc desc;
> > > + u32 i = 0, j = 0;
> > > + u32 max_budget;
> > > + int err = 0;
> > > +
> > > + mutex_lock(&xs->mutex);
> > > +
> > > + /* Since we dropped the RCU read lock, the socket state might have changed. */
> > > + if (unlikely(!xsk_is_bound(xs))) {
> > > + err = -ENXIO;
> > > + goto out;
> > > + }
> > > +
> > > + if (xs->queue_id >= dev->real_num_tx_queues)
> > > + goto out;
> > > +
> > > + if (unlikely(!netif_running(dev) ||
> > > + !netif_carrier_ok(dev)))
> > > + goto out;
> > > +
> > > + max_budget = READ_ONCE(xs->max_tx_budget);
> > > + txq = netdev_get_tx_queue(dev, xs->queue_id);
> > > + do {
> > > + for (; i < max_batch && xskq_cons_peek_desc(xs->tx, &desc, xs->pool); i++) {
>
> here we should think how to come up with slightly modified version of
> xsk_tx_peek_release_desc_batch() for generic xmit needs, or what could we
> borrow from this approach that will be applicable here.
Okay, I will dig into it more. Thanks.
>
> > > + if (max_budget-- == 0) {
> > > + err = -EAGAIN;
> > > + break;
> > > + }
> > > + /* This is the backpressure mechanism for the Tx path.
> > > + * Reserve space in the completion queue and only proceed
> > > + * if there is space in it. This avoids having to implement
> > > + * any buffering in the Tx path.
> > > + */
> > > + err = xsk_cq_reserve_addr_locked(xs->pool, desc.addr);
> > > + if (err) {
> > > + err = -EAGAIN;
> > > + break;
> > > + }
> > > +
> > > + skb[i] = xsk_build_skb(xs, &desc);
> >
> > There is a missed opportunity for bulk allocating the SKBs here
> > (via kmem_cache_alloc_bulk).
>
> +1
>
> >
> > But this also requires changing the SKB alloc function used by
> > xsk_build_skb(). As a seperate patch, I recommend that you change the
> > sock_alloc_send_skb() to instead use build_skb (or build_skb_around).
> > I expect this will be a large performance improvement on it's own.
> > Can I ask you to benchmark this change before the batch xmit change?
> >
> > Opinions needed from other maintainers please (I might be wrong!):
> > I don't think the socket level accounting done in sock_alloc_send_skb()
> > is correct/relevant for AF_XDP/XSK, because the "backpressure mechanism"
> > code comment above.
>
> Thanks for bringing this up, I had the same feeling.
>
> >
> > --Jesper
> >
> > > + if (IS_ERR(skb[i])) {
> > > + err = PTR_ERR(skb[i]);
> > > + break;
> > > + }
> > > +
> > > + xskq_cons_release(xs->tx);
> > > +
> > > + if (xp_mb_desc(&desc))
> > > + xs->skb = skb[i];
> > > + }
> > > +
> > > + if (i) {
> > > + err = xsk_direct_xmit_batch(skb, dev, txq, i, &j);
> > > + if (err == NETDEV_TX_BUSY) {
> > > + err = -EAGAIN;
> > > + } else if (err == NET_XMIT_DROP) {
> > > + j++;
> > > + err = -EBUSY;
> > > + }
> > > +
> > > + sent_frame = true;
> > > + xs->skb = NULL;
> > > + }
> > > +
> > > + if (err)
> > > + goto out;
> > > + i = j = 0;
> > > + } while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool));
>
> from the quick glance i don't follow why you have this call here whilst
> having it above in the while loop.
Because it avoids the first unnecessary xskq_cons_peek_desc() check.
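Roughly, the loop shape I have in mind is (sketch only, based on the
patch above):
	do {
		/* The for-loop condition itself peeks the first descriptor
		 * of each round, so the do/while form avoids one extra peek
		 * before the very first round.
		 */
		for (; i < max_batch && xskq_cons_peek_desc(xs->tx, &desc, xs->pool); i++) {
			/* reserve cq space, build skb[i], release the tx entry */
		}
		/* hand the batch to xsk_direct_xmit_batch() */
	} while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool));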
>
> BTW do we have something bulk skb freeing in the kernel? given we're gonna
> eventually do kmem_cache_alloc_bulk for skbs then could we do
> kmem_cache_free_bulk() as well?
Good point. Let me deal with it :)
Thanks,
Jason
>
> > > +
> > > + if (xskq_has_descs(xs->tx)) {
> > > + if (xs->skb)
> > > + xsk_drop_skb(xs->skb);
> > > + xskq_cons_release(xs->tx);
> > > + }
> > > +
> > > +out:
> > > + for (; j < i; j++) {
> > > + xskq_cons_cancel_n(xs->tx, xsk_get_num_desc(skb[j]));
> > > + xsk_consume_skb(skb[j]);
> > > + }
> > > + if (sent_frame)
> > > + __xsk_tx_release(xs);
> > > +
> > > + mutex_unlock(&xs->mutex);
> > > + return err;
> > > +}
> > > +
> > > +static int __xsk_generic_xmit(struct xdp_sock *xs)
> > > {
> > > - struct xdp_sock *xs = xdp_sk(sk);
> > > bool sent_frame = false;
> > > struct xdp_desc desc;
> > > struct sk_buff *skb;
> > > @@ -871,11 +964,15 @@ static int __xsk_generic_xmit(struct sock *sk)
> > > static int xsk_generic_xmit(struct sock *sk)
> > > {
> > > + struct xdp_sock *xs = xdp_sk(sk);
> > > int ret;
> > > /* Drop the RCU lock since the SKB path might sleep. */
> > > rcu_read_unlock();
> > > - ret = __xsk_generic_xmit(sk);
> > > + if (READ_ONCE(xs->generic_xmit_batch))
> > > + ret = __xsk_generic_xmit_batch(xs);
> > > + else
> > > + ret = __xsk_generic_xmit(xs);
> > > /* Reaquire RCU lock before going into common code. */
> > > rcu_read_lock();
> >
> >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-13 1:02 ` Jason Xing
@ 2025-08-13 13:06 ` Jason Xing
2025-08-15 16:40 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-08-13 13:06 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Jesper Dangaard Brouer, davem, edumazet, kuba, pabeni, bjorn,
magnus.karlsson, jonathan.lemon, sdf, ast, daniel, john.fastabend,
horms, andrew+netdev, bpf, netdev, Jason Xing
On Wed, Aug 13, 2025 at 9:02 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Wed, Aug 13, 2025 at 1:49 AM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Tue, Aug 12, 2025 at 04:30:03PM +0200, Jesper Dangaard Brouer wrote:
> > >
> > >
> > > On 11/08/2025 15.12, Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > Zerocopy mode has a good feature named multi buffer while copy mode has
> > > > to transmit skb one by one like normal flows. The latter might lose the
> > > > bypass power to some extent because of grabbing/releasing the same tx
> > > > queue lock and disabling/enabling bh and stuff on a packet basis.
> > > > Contending the same queue lock will bring a worse result.
> > > >
> > >
> > > I actually think that it is worth optimizing the non-zerocopy mode for
> > > AF_XDP. My use-case was virtual net_devices like veth.
> > >
> > >
> > > > This patch supports batch feature by permitting owning the queue lock to
> > > > send the generic_xmit_batch number of packets at one time. To further
> > > > achieve a better result, some codes[1] are removed on purpose from
> > > > xsk_direct_xmit_batch() as referred to __dev_direct_xmit().
> > > >
> > > > [1]
> > > > 1. advance the device check to granularity of sendto syscall.
> > > > 2. remove validating packets because of its uselessness.
> > > > 3. remove operation of softnet_data.xmit.recursion because it's not
> > > > necessary.
> > > > 4. remove BQL flow control. We don't need to do BQL control because it
> > > > probably limit the speed. An ideal scenario is to use a standalone and
> > > > clean tx queue to send packets only for xsk. Less competition shows
> > > > better performance results.
> > > >
> > > > Experiments:
> > > > 1) Tested on virtio_net:
> > >
> > > If you also want to test on veth, then an optimization is to increase
> > > dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
> > > AF_XDP packets getting reallocated by veth driver. I never completed
> > > upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
> > >
> > > [1] https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
> > >
> > >
> > > (more below...)
> > >
> > > > With this patch series applied, the performance number of xdpsock[2] goes
> > > > up by 33%. Before, it was 767743 pps; while after it was 1021486 pps.
> > > > If we test with another thread competing the same queue, a 28% increase
> > > > (from 405466 pps to 521076 pps) can be observed.
> > > > 2) Tested on ixgbe:
> > > > The results of zerocopy and copy mode are respectively 1303277 pps and
> > > > 1187347 pps. After this socket option took effect, copy mode reaches
> > > > 1472367 which was higher than zerocopy mode impressively.
> > > >
> > > > [2]: ./xdpsock -i eth1 -t -S -s 64
> > > >
> > > > It's worth mentioning batch process might bring high latency in certain
> > > > cases. The recommended value is 32.
> >
> > Given the issue I spotted on your ixgbe batching patch, the comparison
> > against zc performance is probably not reliable.
>
> I have to clarify the thing: zc performance was tested without that
> series applied. That means without that series, the number is 1303277
> pps. What I used is './xdpsock -i enp2s0f0np0 -t -q 11 -z -s 64'.
------
@Maciej Fijalkowski
An interesting thing is that copy mode is far worse than zerocopy if
the nic is _i40e_.
With ixgbe, even copy mode reaches nearly 50-60% of the full speed
(which is only 1Gb/sec) while either zc or the batch version of copy
mode reaches 70%.
With i40e, copy mode only reaches nearly 9% while zc mode reaches 70%;
i40e has a much faster link speed (10Gb/sec) comparatively.
Here are some summaries (budget 32, batch 32):
        copy      batch     zc
i40e    1777859   2102579   14880643
ixgbe   1187347   1472367   1303277
For i40e, here are numbers around the batch copy mode (budget is 128):
no batch   batch 64
1825027    2228328
Side note: 2228328 seems to be the max limit in copy mode with this
series applied after testing with different settings.
It turns out that testing on i40e is definitely needed because the
xdpsock test hits the bottleneck on ixgbe.
-----
@Jesper Dangaard Brouer
In terms of the 'more' boolean Jesper suggested, related drivers might
need changes as well, because in the 'for' loop of the batch process in
xsk_direct_xmit_batch() a driver may fail to send and then break out of
the loop, which leaves no chance to kick the hardware (see the sketch
below). Or we could keep trying to send in xsk_direct_xmit_batch()
instead of breaking immediately when the driver becomes busy.
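For reference, a rough and untested sketch of how the batch loop from
this series could pass the 'more' hint to netdev_start_xmit():
int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
			  struct netdev_queue *txq, u32 max_batch, u32 *cur)
{
	int ret = NETDEV_TX_BUSY;

	local_bh_disable();
	HARD_TX_LOCK(dev, txq, smp_processor_id());
	for (; *cur < max_batch; (*cur)++) {
		/* 'more' is false only for the last packet of the batch;
		 * if the driver bails out earlier with NETDEV_TX_BUSY it
		 * has to write the doorbell itself.
		 */
		bool more = (*cur + 1 < max_batch);

		ret = netdev_start_xmit(skb[*cur], dev, txq, more);
		if (ret != NETDEV_TX_OK)
			break;
	}
	HARD_TX_UNLOCK(dev, txq);
	local_bh_enable();

	return ret;
}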
As to ixgbe, the performance doesn't improve as I analyzed (because it
already reaches 70% of full speed).
As to i40e, only adding 'more' logic, the number goes from 2102579 to
2585224 with the 32 batch and 32 budget settings. The number goes from
2200013 to 2697313 with the batch and 64 budget settings. See! 22+%
improvement!
As to virtio_net, no obvious change here probably because the hardirq
logic doesn't have a huge effect.
Thanks,
Jason
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-12 14:30 ` Jesper Dangaard Brouer
2025-08-12 17:49 ` Maciej Fijalkowski
2025-08-13 0:57 ` Jason Xing
@ 2025-08-15 6:44 ` Jason Xing
2025-08-21 17:29 ` Jesper Dangaard Brouer
2 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-08-15 6:44 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
On Tue, Aug 12, 2025 at 10:30 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
...
>
> But this also requires changing the SKB alloc function used by
> xsk_build_skb(). As a seperate patch, I recommend that you change the
> sock_alloc_send_skb() to instead use build_skb (or build_skb_around).
> I expect this will be a large performance improvement on it's own.
> Can I ask you to benchmark this change before the batch xmit change?
>
> Opinions needed from other maintainers please (I might be wrong!):
> I don't think the socket level accounting done in sock_alloc_send_skb()
> is correct/relevant for AF_XDP/XSK, because the "backpressure mechanism"
> code comment above.
Here I'm bringing back the last test result you wanted to see :)
I used alloc_skb() to replace sock_alloc_send_skb() and introduced other
minor changes, say, removing sock_wfree() from xsk_destruct_skb(). It
turns out to be a stable 5% performance improvement on the i40e driver,
and a slight improvement on virtio_net. That's good news.
The bad news is that the above logic has a bug: freeing the skb in napi
poll leads to accessing skb->sk in xsk_destruct_skb(), which triggers a
NULL pointer dereference. How did I spot this one? I removed the BQL
flow control and started two xdpsock instances on different queues, then
I saw a panic[1]... To solve a problem like that, I'm afraid we still
need to charge a certain length into sk_wmem_alloc so that sock_wfree()
is the last one to finally free the socket.
So the socket level accounting mechanism is what keeps the above case safe.
IMHO, we can get rid of the sk_sndbuf limitation but still use
skb_set_owner_w(), which charges the skb against sk_wmem_alloc. If we
insist on removing the whole accounting function, we would probably have
to adjust the position of xsk_cq_submit_locked(), but I reckon that's
not practical for now...
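To make the idea concrete, a rough sketch (untested, and the helper name
is made up) of keeping the charging while dropping the sk_sndbuf limit:
/* Hypothetical helper: allocate the skb without the sk_sndbuf check that
 * sock_alloc_send_skb() enforces, but still charge it to the socket so
 * skb->sk stays valid until tx completion runs the destructor (which
 * ends in sock_wfree()).
 */
static struct sk_buff *xsk_alloc_skb_nolimit(struct xdp_sock *xs,
					     unsigned int len)
{
	struct sk_buff *skb = alloc_skb(len, GFP_KERNEL);

	if (!skb)
		return NULL;
	/* charges skb->truesize to sk_wmem_alloc */
	skb_set_owner_w(skb, &xs->sk);
	return skb;
}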
Any thoughts on this?
[1]
997 [ 133.528449] RIP: 0010:xsk_destruct_skb+0x6a/0x90
998 [ 133.528920] Code: 8b 6c 02 28 48 8b 43 18 4c 8b a0 68 03 00 00
49 8d 9c 24 e8 00 00 00 48 89 df e8 f1 eb 06 00 48 89 c6 49 8b 84 24
88 00 00 00 <48> 8b 50 10 03 2a 48 8b 40 10 48 89 df 89 28 5b 5d
41 5c e9 6e ec
999 [ 133.530526] RSP: 0018:ffffae71c06a0d08 EFLAGS: 00010046
1000 [ 133.531005] RAX: 0000000000000000 RBX: ffff9f42c81c49e8 RCX: 00000000000002e7
1001 [ 133.531631] RDX: 0000000000000001 RSI: 0000000000000286 RDI: ffff9f42c81c49e8
1002 [ 133.532249] RBP: 0000000000000001 R08: 0000000000000008 R09: 0000000000000000
1003 [ 133.532867] R10: ffffffff978080c0 R11: ffffae71c06a0ff8 R12: ffff9f42c81c4900
1004 [ 133.533491] R13: ffffae71c06a0d88 R14: ffff9f42e0f1f900 R15: ffff9f42ce850d80
1005 [ 133.534123] FS: 0000000000000000(0000) GS:ffff9f5227655000(0000) knlGS:0000000000000000
1006 [ 133.534831] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
1007 [ 133.535366] CR2: 0000000000000010 CR3: 000000011c820000 CR4: 00000000003506f0
1008 [ 133.536014] Call Trace:
1009 [ 133.536313] <IRQ>
1010 [ 133.536583] skb_release_head_state+0x20/0x90
1011 [ 133.537021] napi_consume_skb+0x42/0x120
1012 [ 133.537429] __free_old_xmit+0x76/0x170 [virtio_net]
1013 [ 133.537923] free_old_xmit+0x53/0xc0 [virtio_net]
1014 [ 133.538395] virtnet_poll+0xed/0x5d0 [virtio_net]
1015 [ 133.538867] ? blake2s_compress+0x52/0xa0
1016 [ 133.539286] __napi_poll+0x28/0x200
1017 [ 133.539668] net_rx_action+0x319/0x400
1018 [ 133.540068] ? sched_clock_cpu+0xb/0x190
1019 [ 133.540482] ? __run_timers+0x1d1/0x260
1020 [ 133.540906] ? __pfx_dl_task_timer+0x10/0x10
1021 [ 133.541349] ? lock_timer_base+0x72/0x90
1022 [ 133.541767] handle_softirqs+0xce/0x2e0
1023 [ 133.542178] __irq_exit_rcu+0xc6/0xf0
1024 [ 133.542575] common_interrupt+0x81/0xa0
Thanks,
Jason
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-13 13:06 ` Jason Xing
@ 2025-08-15 16:40 ` Jesper Dangaard Brouer
2025-08-16 0:03 ` Jason Xing
0 siblings, 1 reply; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-15 16:40 UTC (permalink / raw)
To: Jason Xing, Maciej Fijalkowski
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
jonathan.lemon, sdf, ast, daniel, john.fastabend, horms,
andrew+netdev, bpf, netdev, Jason Xing
On 13/08/2025 15.06, Jason Xing wrote:
> On Wed, Aug 13, 2025 at 9:02 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>>
>> On Wed, Aug 13, 2025 at 1:49 AM Maciej Fijalkowski
>> <maciej.fijalkowski@intel.com> wrote:
>>>
>>> On Tue, Aug 12, 2025 at 04:30:03PM +0200, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 11/08/2025 15.12, Jason Xing wrote:
>>>>> From: Jason Xing <kernelxing@tencent.com>
>>>>>
>>>>> Zerocopy mode has a good feature named multi buffer while copy mode has
>>>>> to transmit skb one by one like normal flows. The latter might lose the
>>>>> bypass power to some extent because of grabbing/releasing the same tx
>>>>> queue lock and disabling/enabling bh and stuff on a packet basis.
>>>>> Contending the same queue lock will bring a worse result.
>>>>>
>>>>
>>>> I actually think that it is worth optimizing the non-zerocopy mode for
>>>> AF_XDP. My use-case was virtual net_devices like veth.
>>>>
>>>>
>>>>> This patch supports batch feature by permitting owning the queue lock to
>>>>> send the generic_xmit_batch number of packets at one time. To further
>>>>> achieve a better result, some codes[1] are removed on purpose from
>>>>> xsk_direct_xmit_batch() as referred to __dev_direct_xmit().
>>>>>
>>>>> [1]
>>>>> 1. advance the device check to granularity of sendto syscall.
>>>>> 2. remove validating packets because of its uselessness.
>>>>> 3. remove operation of softnet_data.xmit.recursion because it's not
>>>>> necessary.
>>>>> 4. remove BQL flow control. We don't need to do BQL control because it
>>>>> probably limit the speed. An ideal scenario is to use a standalone and
>>>>> clean tx queue to send packets only for xsk. Less competition shows
>>>>> better performance results.
>>>>>
>>>>> Experiments:
>>>>> 1) Tested on virtio_net:
>>>>
>>>> If you also want to test on veth, then an optimization is to increase
>>>> dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
>>>> AF_XDP packets getting reallocated by veth driver. I never completed
>>>> upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
>>>>
>>>> [1] https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
>>>>
>>>>
>>>> (more below...)
>>>>
>>>>> With this patch series applied, the performance number of xdpsock[2] goes
>>>>> up by 33%. Before, it was 767743 pps; while after it was 1021486 pps.
>>>>> If we test with another thread competing the same queue, a 28% increase
>>>>> (from 405466 pps to 521076 pps) can be observed.
>>>>> 2) Tested on ixgbe:
>>>>> The results of zerocopy and copy mode are respectively 1303277 pps and
>>>>> 1187347 pps. After this socket option took effect, copy mode reaches
>>>>> 1472367 which was higher than zerocopy mode impressively.
>>>>>
>>>>> [2]: ./xdpsock -i eth1 -t -S -s 64
>>>>>
>>>>> It's worth mentioning batch process might bring high latency in certain
>>>>> cases. The recommended value is 32.
>>>
>>> Given the issue I spotted on your ixgbe batching patch, the comparison
>>> against zc performance is probably not reliable.
>>
>> I have to clarify the thing: zc performance was tested without that
>> series applied. That means without that series, the number is 1303277
>> pps. What I used is './xdpsock -i enp2s0f0np0 -t -q 11 -z -s 64'.
>
My i40e device is running at 40Gbit/s.
I see significantly higher packets per sec (pps) than you are reporting:
$ sudo ./xdpsock -i i40e2 --txonly -q 2 -z -s 64
sock0@i40e2:2 txonly xdp-drv
pps pkts 1.00
rx 0 0
tx 21,546,859 21,552,896
The "copy" mode (-c/--copy) looks like this:
$ sudo ./xdpsock -i i40e2 --txonly -q 2 --copy -s 64
sock0@i40e2:2 txonly xdp-drv
pps pkts 1.00
rx 0 0
tx 2,135,424 2,136,192
The skb-mode (-S, --xdp-skb) looks like this:
$ sudo ./xdpsock -i i40e2 --txonly -q 2 --xdp-skb -s 64
sock0@i40e2:2 txonly xdp-skb
pps pkts 1.00
rx 0 0
tx 2,187,992 2,188,800
The HUGE performance gap to "xdp-drv" zero-copy mode tells me that
there is a huge potential for improving the performance of the copy
mode, in both the "native" xdp-drv and xdp-skb cases.
Thus, the work you are doing here is important.
> ------
> @Maciej Fijalkowski
> Interesting thing is that copy mode is way worse than zerocopy if the
> nic is _i40e_.
>
> With ixgbe, even copy mode reaches nearly 50-60% of the full speed
> (which is only 1Gb/sec) while either zc or batch version copy mode
> reaches 70%.
> With i40e, copy mode only reaches nearly 9% while zc mode 70%. i40e
> has a much faster speed (10Gb/sec) comparatively.
>
> Here are some summaries (budget 32, batch 32):
> copy batch zc
> i40e 1,777,859 2,102,579 14,880,643
> ixgbe 1,187,347 1,472,367 1,303,277
>
(added thousands separators to make above readable)
Those numbers only make sense if
i40e runs at link speed 10Gbit/s and
ixgbe runs at link speed 1Gbit/s
> For i40e, here are numbers around the batch copy mode (budget is 128):
> no batch batch 64
> 1825027 2228328.
> Side note: 2228328 seems to be the max limit in copy mode with this
> series applied after testing with different settings.
>
> It turns out that testing on i40e is definitely needed because the
> xdpsock test hits the bottleneck on ixgbe.
>
> -----
> @Jesper Dangaard Brouer
> In terms of the 'more' boolean as Jesper said, related drivers might
> need changes like this because in the 'for' loop of the batch process
> in xsk_direct_xmit_batch(), drivers may fail to send and then break
> the 'for' loop, which leads to no chance to kick the hardware.
If sending with the 'more' indicator and the driver fails to send, then it
is the responsibility of the driver to update the tail-ptr/doorbell.
Example for ixgbe driver:
https://elixir.bootlin.com/linux/v6.16/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L8879-L8880
> Or we
> can keep trying to send in xsk_direct_xmit_batch() instead of breaking
> immediately even if the driver becomes busy right now.
>
> As to ixgbe, the performance doesn't improve as I analyzed (because it
> already reaches 70% of full speed).
>
If ixgbe runs at 1Gbit/s then remember Ethernet wire-speed is 1.488
Mpps. Thus, you are much closer than 70% to wire-speed.
> As to i40e, only adding 'more' logic, the number goes from 2102579 to
> 2585224 with the 32 batch and 32 budget settings. The number goes from
> 2200013 to 2697313 with the batch and 64 budget settings. See! 22+%
> improvement!
That is a very big performance gain IMHO. It shows that avoiding the
tail-ptr/doorbell write between each packet has a huge benefit.
> As to virtio_net, no obvious change here probably because the hardirq
> logic doesn't have a huge effect.
>
Perhaps virtio_net doesn't implement the SKB 'more' feature?
--Jesper
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-15 16:40 ` Jesper Dangaard Brouer
@ 2025-08-16 0:03 ` Jason Xing
2025-08-16 13:42 ` Jason Xing
0 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-08-16 0:03 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Maciej Fijalkowski, davem, edumazet, kuba, pabeni, bjorn,
magnus.karlsson, jonathan.lemon, sdf, ast, daniel, john.fastabend,
horms, andrew+netdev, bpf, netdev, Jason Xing
On Sat, Aug 16, 2025 at 12:40 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 13/08/2025 15.06, Jason Xing wrote:
> > On Wed, Aug 13, 2025 at 9:02 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >>
> >> On Wed, Aug 13, 2025 at 1:49 AM Maciej Fijalkowski
> >> <maciej.fijalkowski@intel.com> wrote:
> >>>
> >>> On Tue, Aug 12, 2025 at 04:30:03PM +0200, Jesper Dangaard Brouer wrote:
> >>>>
> >>>>
> >>>> On 11/08/2025 15.12, Jason Xing wrote:
> >>>>> From: Jason Xing <kernelxing@tencent.com>
> >>>>>
> >>>>> Zerocopy mode has a good feature named multi buffer while copy mode has
> >>>>> to transmit skb one by one like normal flows. The latter might lose the
> >>>>> bypass power to some extent because of grabbing/releasing the same tx
> >>>>> queue lock and disabling/enabling bh and stuff on a packet basis.
> >>>>> Contending the same queue lock will bring a worse result.
> >>>>>
> >>>>
> >>>> I actually think that it is worth optimizing the non-zerocopy mode for
> >>>> AF_XDP. My use-case was virtual net_devices like veth.
> >>>>
> >>>>
> >>>>> This patch supports batch feature by permitting owning the queue lock to
> >>>>> send the generic_xmit_batch number of packets at one time. To further
> >>>>> achieve a better result, some codes[1] are removed on purpose from
> >>>>> xsk_direct_xmit_batch() as referred to __dev_direct_xmit().
> >>>>>
> >>>>> [1]
> >>>>> 1. advance the device check to granularity of sendto syscall.
> >>>>> 2. remove validating packets because of its uselessness.
> >>>>> 3. remove operation of softnet_data.xmit.recursion because it's not
> >>>>> necessary.
> >>>>> 4. remove BQL flow control. We don't need to do BQL control because it
> >>>>> probably limit the speed. An ideal scenario is to use a standalone and
> >>>>> clean tx queue to send packets only for xsk. Less competition shows
> >>>>> better performance results.
> >>>>>
> >>>>> Experiments:
> >>>>> 1) Tested on virtio_net:
> >>>>
> >>>> If you also want to test on veth, then an optimization is to increase
> >>>> dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
> >>>> AF_XDP packets getting reallocated by veth driver. I never completed
> >>>> upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
> >>>>
> >>>> [1] https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
> >>>>
> >>>>
> >>>> (more below...)
> >>>>
> >>>>> With this patch series applied, the performance number of xdpsock[2] goes
> >>>>> up by 33%. Before, it was 767743 pps; while after it was 1021486 pps.
> >>>>> If we test with another thread competing the same queue, a 28% increase
> >>>>> (from 405466 pps to 521076 pps) can be observed.
> >>>>> 2) Tested on ixgbe:
> >>>>> The results of zerocopy and copy mode are respectively 1303277 pps and
> >>>>> 1187347 pps. After this socket option took effect, copy mode reaches
> >>>>> 1472367 which was higher than zerocopy mode impressively.
> >>>>>
> >>>>> [2]: ./xdpsock -i eth1 -t -S -s 64
> >>>>>
> >>>>> It's worth mentioning batch process might bring high latency in certain
> >>>>> cases. The recommended value is 32.
> >>>
> >>> Given the issue I spotted on your ixgbe batching patch, the comparison
> >>> against zc performance is probably not reliable.
> >>
> >> I have to clarify the thing: zc performance was tested without that
> >> series applied. That means without that series, the number is 1303277
> >> pps. What I used is './xdpsock -i enp2s0f0np0 -t -q 11 -z -s 64'.
> >
>
> My i40e device is running at 40Gbit/s.
> I see significantly higher packet per sec (pps) that you are reporting:
>
> $ sudo ./xdpsock -i i40e2 --txonly -q 2 -z -s 64
>
> sock0@i40e2:2 txonly xdp-drv
> pps pkts 1.00
> rx 0 0
> tx 21,546,859 21,552,896
>
>
> The "copy" mode (-c/--copy) looks like this:
>
> $ sudo ./xdpsock -i i40e2 --txonly -q 2 --copy -s 64
>
> sock0@i40e2:2 txonly xdp-drv
> pps pkts 1.00
> rx 0 0
> tx 2,135,424 2,136,192
>
>
> The skb-mode (-S, --xdp-skb) looks like this:
>
> $ sudo ./xdpsock -i i40e2 --txonly -q 2 --xdp-skb -s 64
>
> sock0@i40e2:2 txonly xdp-skb
> pps pkts 1.00
> rx 0 0
> tx 2,187,992 2,188,800
>
>
> The HUGE performance gap to "xdp-drv" zero-copy mode, tells me that
> there is a huge potential for improving the performance for the copy
> mode, both "native" xdp-drv and xdp-skb, cases.
> Thus, the work your are doing here is important.
Great! Thanks for your affirmation.
>
>
> > ------
> > @Maciej Fijalkowski
> > Interesting thing is that copy mode is way worse than zerocopy if the
> > nic is _i40e_.
> >
> > With ixgbe, even copy mode reaches nearly 50-60% of the full speed
> > (which is only 1Gb/sec) while either zc or batch version copy mode
> > reaches 70%.
> > With i40e, copy mode only reaches nearly 9% while zc mode 70%. i40e
> > has a much faster speed (10Gb/sec) comparatively.
> >
> > Here are some summaries (budget 32, batch 32):
> > copy batch zc
> > i40e 1,777,859 2,102,579 14,880,643
> > ixgbe 1,187,347 1,472,367 1,303,277
> >
> (added thousands separators to make above readable)
>
> Those number only make sense if
> i40e runs at link speed 10Git/s and
> ixgbe runs at link speed 1Gbit/s
>
>
> > For i40e, here are numbers around the batch copy mode (budget is 128):
> > no batch batch 64
> > 1825027 2228328.
> > Side note: 2228328 seems to be the max limit in copy mode with this
> > series applied after testing with different settings.
> >
> > It turns out that testing on i40e is definitely needed because the
> > xdpsock test hits the bottleneck on ixgbe.
> >
> > -----
> > @Jesper Dangaard Brouer
> > In terms of the 'more' boolean as Jesper said, related drivers might
> > need changes like this because in the 'for' loop of the batch process
> > in xsk_direct_xmit_batch(), drivers may fail to send and then break
> > the 'for' loop, which leads to no chance to kick the hardware.
>
> If sending with 'more' indicator and driver fail to send, then it is the
> responsibility of the driver to update the tail-ptr/doorbell.
> Example for ixgbe driver:
> https://elixir.bootlin.com/linux/v6.16/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L8879-L8880
Thanks for sharing this. I'm learning :)
>
> > Or we
> > can keep trying to send in xsk_direct_xmit_batch() instead of breaking
> > immediately even if the driver becomes busy right now.
> >
> > As to ixgbe, the performance doesn't improve as I analyzed (because it
> > already reaches 70% of full speed).
> >
>
> If ixgbe runs at 1Gbit/s then remember Ethernet wire-speed is 1.488
> Mpps. Thus, you are much closer than 70% to wire-speed.
Right. 70% probably is the maximum number that xdpsock can reach.
>
>
> > As to i40e, only adding 'more' logic, the number goes from 2102579 to
> > 2585224 with the 32 batch and 32 budget settings. The number goes from
> > 2200013 to 2697313 with the batch and 64 budget settings. See! 22+%
> > improvement!
>
> That is a very big performance gain IMHO. Shows that avoiding the tail-
> ptr/doorbell between each packet have a huge benefit.
That's right, and I will include it into the series as you suggested. Thanks!
>
> > As to virtio_net, no obvious change here probably because the hardirq
> > logic doesn't have a huge effect.
> >
>
> Perhaps virtio_net doesn't implement the SKB 'more' feature?
Oh, let me investigate it more deeply.
BTW, I'm not sure if you missed the previous last email, what do you
think of socket level accounting after reading that reply?
Thanks,
Jason
>
> --Jesper
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-16 0:03 ` Jason Xing
@ 2025-08-16 13:42 ` Jason Xing
0 siblings, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-08-16 13:42 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Maciej Fijalkowski, davem, edumazet, kuba, pabeni, bjorn,
magnus.karlsson, jonathan.lemon, sdf, ast, daniel, john.fastabend,
horms, andrew+netdev, bpf, netdev, Jason Xing
On Sat, Aug 16, 2025 at 8:03 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sat, Aug 16, 2025 at 12:40 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> >
> >
> >
> > On 13/08/2025 15.06, Jason Xing wrote:
> > > On Wed, Aug 13, 2025 at 9:02 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >>
> > >> On Wed, Aug 13, 2025 at 1:49 AM Maciej Fijalkowski
> > >> <maciej.fijalkowski@intel.com> wrote:
> > >>>
> > >>> On Tue, Aug 12, 2025 at 04:30:03PM +0200, Jesper Dangaard Brouer wrote:
> > >>>>
> > >>>>
> > >>>> On 11/08/2025 15.12, Jason Xing wrote:
> > >>>>> From: Jason Xing <kernelxing@tencent.com>
> > >>>>>
> > >>>>> Zerocopy mode has a good feature named multi buffer while copy mode has
> > >>>>> to transmit skb one by one like normal flows. The latter might lose the
> > >>>>> bypass power to some extent because of grabbing/releasing the same tx
> > >>>>> queue lock and disabling/enabling bh and stuff on a packet basis.
> > >>>>> Contending the same queue lock will bring a worse result.
> > >>>>>
> > >>>>
> > >>>> I actually think that it is worth optimizing the non-zerocopy mode for
> > >>>> AF_XDP. My use-case was virtual net_devices like veth.
> > >>>>
> > >>>>
> > >>>>> This patch supports batch feature by permitting owning the queue lock to
> > >>>>> send the generic_xmit_batch number of packets at one time. To further
> > >>>>> achieve a better result, some codes[1] are removed on purpose from
> > >>>>> xsk_direct_xmit_batch() as referred to __dev_direct_xmit().
> > >>>>>
> > >>>>> [1]
> > >>>>> 1. advance the device check to granularity of sendto syscall.
> > >>>>> 2. remove validating packets because of its uselessness.
> > >>>>> 3. remove operation of softnet_data.xmit.recursion because it's not
> > >>>>> necessary.
> > >>>>> 4. remove BQL flow control. We don't need to do BQL control because it
> > >>>>> probably limit the speed. An ideal scenario is to use a standalone and
> > >>>>> clean tx queue to send packets only for xsk. Less competition shows
> > >>>>> better performance results.
> > >>>>>
> > >>>>> Experiments:
> > >>>>> 1) Tested on virtio_net:
> > >>>>
> > >>>> If you also want to test on veth, then an optimization is to increase
> > >>>> dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
> > >>>> AF_XDP packets getting reallocated by veth driver. I never completed
> > >>>> upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
> > >>>>
> > >>>> [1] https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
> > >>>>
> > >>>>
> > >>>> (more below...)
> > >>>>
> > >>>>> With this patch series applied, the performance number of xdpsock[2] goes
> > >>>>> up by 33%. Before, it was 767743 pps; while after it was 1021486 pps.
> > >>>>> If we test with another thread competing the same queue, a 28% increase
> > >>>>> (from 405466 pps to 521076 pps) can be observed.
> > >>>>> 2) Tested on ixgbe:
> > >>>>> The results of zerocopy and copy mode are respectively 1303277 pps and
> > >>>>> 1187347 pps. After this socket option took effect, copy mode reaches
> > >>>>> 1472367 which was higher than zerocopy mode impressively.
> > >>>>>
> > >>>>> [2]: ./xdpsock -i eth1 -t -S -s 64
> > >>>>>
> > >>>>> It's worth mentioning batch process might bring high latency in certain
> > >>>>> cases. The recommended value is 32.
> > >>>
> > >>> Given the issue I spotted on your ixgbe batching patch, the comparison
> > >>> against zc performance is probably not reliable.
> > >>
> > >> I have to clarify the thing: zc performance was tested without that
> > >> series applied. That means without that series, the number is 1303277
> > >> pps. What I used is './xdpsock -i enp2s0f0np0 -t -q 11 -z -s 64'.
> > >
> >
> > My i40e device is running at 40Gbit/s.
> > I see significantly higher packet per sec (pps) that you are reporting:
> >
> > $ sudo ./xdpsock -i i40e2 --txonly -q 2 -z -s 64
> >
> > sock0@i40e2:2 txonly xdp-drv
> > pps pkts 1.00
> > rx 0 0
> > tx 21,546,859 21,552,896
> >
> >
> > The "copy" mode (-c/--copy) looks like this:
> >
> > $ sudo ./xdpsock -i i40e2 --txonly -q 2 --copy -s 64
> >
> > sock0@i40e2:2 txonly xdp-drv
> > pps pkts 1.00
> > rx 0 0
> > tx 2,135,424 2,136,192
> >
> >
> > The skb-mode (-S, --xdp-skb) looks like this:
> >
> > $ sudo ./xdpsock -i i40e2 --txonly -q 2 --xdp-skb -s 64
> >
> > sock0@i40e2:2 txonly xdp-skb
> > pps pkts 1.00
> > rx 0 0
> > tx 2,187,992 2,188,800
> >
> >
> > The HUGE performance gap to "xdp-drv" zero-copy mode, tells me that
> > there is a huge potential for improving the performance for the copy
> > mode, both "native" xdp-drv and xdp-skb, cases.
> > Thus, the work your are doing here is important.
>
> Great! Thanks for your affirmation.
>
> >
> >
> > > ------
> > > @Maciej Fijalkowski
> > > Interesting thing is that copy mode is way worse than zerocopy if the
> > > nic is _i40e_.
> > >
> > > With ixgbe, even copy mode reaches nearly 50-60% of the full speed
> > > (which is only 1Gb/sec) while either zc or batch version copy mode
> > > reaches 70%.
> > > With i40e, copy mode only reaches nearly 9% while zc mode 70%. i40e
> > > has a much faster speed (10Gb/sec) comparatively.
> > >
> > > Here are some summaries (budget 32, batch 32):
> > > copy batch zc
> > > i40e 1,777,859 2,102,579 14,880,643
> > > ixgbe 1,187,347 1,472,367 1,303,277
> > >
> > (added thousands separators to make above readable)
> >
> > Those number only make sense if
> > i40e runs at link speed 10Git/s and
> > ixgbe runs at link speed 1Gbit/s
> >
> >
> > > For i40e, here are numbers around the batch copy mode (budget is 128):
> > > no batch batch 64
> > > 1825027 2228328.
> > > Side note: 2228328 seems to be the max limit in copy mode with this
> > > series applied after testing with different settings.
> > >
> > > It turns out that testing on i40e is definitely needed because the
> > > xdpsock test hits the bottleneck on ixgbe.
> > >
> > > -----
> > > @Jesper Dangaard Brouer
> > > In terms of the 'more' boolean as Jesper said, related drivers might
> > > need changes like this because in the 'for' loop of the batch process
> > > in xsk_direct_xmit_batch(), drivers may fail to send and then break
> > > the 'for' loop, which leads to no chance to kick the hardware.
> >
> > If sending with 'more' indicator and driver fail to send, then it is the
> > responsibility of the driver to update the tail-ptr/doorbell.
> > Example for ixgbe driver:
> > https://elixir.bootlin.com/linux/v6.16/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L8879-L8880
>
> Thanks for sharing this. I'm learning :)
>
> >
> > > Or we
> > > can keep trying to send in xsk_direct_xmit_batch() instead of breaking
> > > immediately even if the driver becomes busy right now.
> > >
> > > As to ixgbe, the performance doesn't improve as I analyzed (because it
> > > already reaches 70% of full speed).
> > >
> >
> > If ixgbe runs at 1Gbit/s then remember Ethernet wire-speed is 1.488
> > Mpps. Thus, you are much closer than 70% to wire-speed.
>
> Right. 70% probably is the maximum number that xdpsock can reach.
>
> >
> >
> > > As to i40e, only adding 'more' logic, the number goes from 2102579 to
> > > 2585224 with the 32 batch and 32 budget settings. The number goes from
> > > 2200013 to 2697313 with the batch and 64 budget settings. See! 22+%
> > > improvement!
> >
> > That is a very big performance gain IMHO. Shows that avoiding the tail-
> > ptr/doorbell between each packet have a huge benefit.
>
> That's right, and I will include it into the series as you suggested. Thanks!
>
> >
> > > As to virtio_net, no obvious change here probably because the hardirq
> > > logic doesn't have a huge effect.
> > >
> >
> > Perhaps virtio_net doesn't implement the SKB 'more' feature?
>
> Oh, let me investigate it more deeply.
Here I'm going to update two things:
#1
The virtio_net driver does implement the 'more' feature, see:
https://elixir.bootlin.com/linux/v6.16/source/drivers/net/virtio_net.c#L3352
If the 'more' indicator is set and __QUEUE_STATE_DRV_XOFF is not set,
then the driver will not notify the host.
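The usual driver-side pattern, paraphrased (not virtio_net's exact code),
looks roughly like:
	/* Only kick the device when no more packets are pending in this
	 * batch, or the queue has been stopped.
	 */
	if (!netdev_xmit_more() || netif_xmit_stopped(txq))
		ring_doorbell(ring); /* placeholder for the device-specific kick */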
#2
As to the previous question about xsk possibly failing to sense the
BUSY error from the driver, I double checked and it can indeed be
problematic:
i40e_xmit_frame_ring()
-> if (i40e_maybe_stop_tx(tx_ring, count + 4 + 1)) {
       return NETDEV_TX_BUSY;
It means that on i40e the driver can bail out at the very beginning of
its xmit routine due to the check above, in the middle of sending a
number of packets, even in the normal sending process (see
dev_hard_start_xmit()). The qdisc then misses the opportunity to ask
the driver to kick the hardware. The same situation applies to this
xsk batch xmit.
Note that this case is different from the BQL limit and check in
ixgbe_tx_map() (see the link you mentioned). We cannot prevent it,
because the DQL algorithm might not stop sending packets, but the code
snippet above can. I checked mlx4 and mlx5 and didn't spot any path
where NETDEV_TX_BUSY could be returned in the xmit process.
So I wonder if we should fix the problem from the driver side like
this, regardless of the current batch xmit:
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 048c33039130..81cc2219632b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -3904,6 +3904,7 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	 */
 	if (i40e_maybe_stop_tx(tx_ring, count + 4 + 1)) {
 		tx_ring->tx_stats.tx_busy++;
+		writel(tx_ring->next_to_use, tx_ring->tail);
 		return NETDEV_TX_BUSY;
 	}
If so, a few drivers would need to be adjusted... I'm not sure
if I'm moving in the right direction...
Thanks,
Jason
>
> BTW, I'm not sure if you missed the previous last email, what do you
> think of socket level accounting after reading that reply?
>
> Thanks,
> Jason
>
> >
> > --Jesper
> >
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 1/2] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
2025-08-11 13:12 ` [PATCH net-next 1/2] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt Jason Xing
2025-08-12 16:40 ` Maciej Fijalkowski
@ 2025-08-18 6:20 ` Dan Carpenter
1 sibling, 0 replies; 17+ messages in thread
From: Dan Carpenter @ 2025-08-18 6:20 UTC (permalink / raw)
To: oe-kbuild, Jason Xing, davem, edumazet, kuba, pabeni, bjorn,
magnus.karlsson, maciej.fijalkowski, jonathan.lemon, sdf, ast,
daniel, hawk, john.fastabend, horms, andrew+netdev
Cc: lkp, oe-kbuild-all, bpf, netdev, Jason Xing
Hi Jason,
kernel test robot noticed the following build warnings:
url: https://github.com/intel-lab-lkp/linux/commits/Jason-Xing/xsk-introduce-XDP_GENERIC_XMIT_BATCH-setsockopt/20250811-211509
base: net-next/main
patch link: https://lore.kernel.org/r/20250811131236.56206-2-kerneljasonxing%40gmail.com
patch subject: [PATCH net-next 1/2] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
config: s390-randconfig-r073-20250817 (https://download.01.org/0day-ci/archive/20250817/202508171049.SGYNFbP3-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 8.5.0
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202508171049.SGYNFbP3-lkp@intel.com/
smatch warnings:
net/xdp/xsk.c:1495 xsk_setsockopt() warn: inconsistent returns '&xs->mutex'.
vim +1495 net/xdp/xsk.c
48248366d9de2a Jason Xing 2025-08-11 1460 case XDP_GENERIC_XMIT_BATCH:
48248366d9de2a Jason Xing 2025-08-11 1461 {
48248366d9de2a Jason Xing 2025-08-11 1462 unsigned int batch, batch_alloc_len;
48248366d9de2a Jason Xing 2025-08-11 1463 struct sk_buff **new;
48248366d9de2a Jason Xing 2025-08-11 1464
48248366d9de2a Jason Xing 2025-08-11 1465 if (optlen != sizeof(batch))
48248366d9de2a Jason Xing 2025-08-11 1466 return -EINVAL;
48248366d9de2a Jason Xing 2025-08-11 1467 if (copy_from_sockptr(&batch, optval, sizeof(batch)))
48248366d9de2a Jason Xing 2025-08-11 1468 return -EFAULT;
48248366d9de2a Jason Xing 2025-08-11 1469 if (batch > xs->max_tx_budget)
48248366d9de2a Jason Xing 2025-08-11 1470 return -EACCES;
48248366d9de2a Jason Xing 2025-08-11 1471
48248366d9de2a Jason Xing 2025-08-11 1472 mutex_lock(&xs->mutex);
48248366d9de2a Jason Xing 2025-08-11 1473 if (!batch) {
48248366d9de2a Jason Xing 2025-08-11 1474 kfree(xs->skb_batch);
48248366d9de2a Jason Xing 2025-08-11 1475 xs->generic_xmit_batch = 0;
48248366d9de2a Jason Xing 2025-08-11 1476 goto out;
48248366d9de2a Jason Xing 2025-08-11 1477 }
48248366d9de2a Jason Xing 2025-08-11 1478 batch_alloc_len = sizeof(struct sk_buff *) * batch;
48248366d9de2a Jason Xing 2025-08-11 1479 new = kmalloc(batch_alloc_len, GFP_KERNEL);
48248366d9de2a Jason Xing 2025-08-11 1480 if (!new)
48248366d9de2a Jason Xing 2025-08-11 1481 return -ENOMEM;
mutex_unlock(&xs->mutex); before returning
48248366d9de2a Jason Xing 2025-08-11 1482 if (xs->skb_batch)
48248366d9de2a Jason Xing 2025-08-11 1483 kfree(xs->skb_batch);
48248366d9de2a Jason Xing 2025-08-11 1484
48248366d9de2a Jason Xing 2025-08-11 1485 xs->skb_batch = new;
48248366d9de2a Jason Xing 2025-08-11 1486 xs->generic_xmit_batch = batch;
48248366d9de2a Jason Xing 2025-08-11 1487 out:
48248366d9de2a Jason Xing 2025-08-11 1488 mutex_unlock(&xs->mutex);
48248366d9de2a Jason Xing 2025-08-11 1489 return 0;
48248366d9de2a Jason Xing 2025-08-11 1490 }
c0c77d8fb787cf Björn Töpel 2018-05-02 1491 default:
c0c77d8fb787cf Björn Töpel 2018-05-02 1492 break;
c0c77d8fb787cf Björn Töpel 2018-05-02 1493 }
c0c77d8fb787cf Björn Töpel 2018-05-02 1494
c0c77d8fb787cf Björn Töpel 2018-05-02 @1495 return -ENOPROTOOPT;
c0c77d8fb787cf Björn Töpel 2018-05-02 1496 }
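In other words, the -ENOMEM path presumably needs something like this
(sketch only, not a tested patch):
	new = kmalloc(batch_alloc_len, GFP_KERNEL);
	if (!new) {
		mutex_unlock(&xs->mutex);
		return -ENOMEM;
	}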
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-15 6:44 ` Jason Xing
@ 2025-08-21 17:29 ` Jesper Dangaard Brouer
2025-08-22 1:13 ` Jason Xing
0 siblings, 1 reply; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-21 17:29 UTC (permalink / raw)
To: Jason Xing, Karlsson, Magnus, Björn Töpel
Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
I need some help from the Cc'ed Magnus or Björn to explain why your
changes fail in xsk_destruct_skb().
On 15/08/2025 08.44, Jason Xing wrote:
> On Tue, Aug 12, 2025 at 10:30 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
> ...
>>
>> But this also requires changing the SKB alloc function used by
>> xsk_build_skb(). As a seperate patch, I recommend that you change the
>> sock_alloc_send_skb() to instead use build_skb (or build_skb_around).
>> I expect this will be a large performance improvement on it's own.
>> Can I ask you to benchmark this change before the batch xmit change?
>>
>> Opinions needed from other maintainers please (I might be wrong!):
>> I don't think the socket level accounting done in sock_alloc_send_skb()
>> is correct/relevant for AF_XDP/XSK, because the "backpressure mechanism"
>> code comment above.
>
> Here I'm bringing back the last test you expected to know :)
>
> I use alloc_skb() to replace sock_alloc_send_skb() and introduce other
> minor changes, say, removing sock_wfree() from xsk_destruct_skb(). It
> turns out to be a stable 5% performance improvement on i40e driver.
> slight improvement on virtio_net. That's good news.
>
> Bad news is that the above logic has bugs like freeing skb in the napi
> poll causes accessing skb->sk in xsk_destruct_skb() which triggers a
> NULL pointer issue. How did I spot this one? I removed the BQL flow
> control and started two xdpsock on different queues, then I saw a
> panic[1]... To solve the problem like that, I'm afraid that we still
> need to charge a certain length value into sk_wmem_alloc so that
> sock_wfree(skb) can be the last one to free the socket finally.
>
> So this socket level accounting mechanism keeps its safety in the above case.
>
> IMHO, we can get rid of the limitation of sk_sndbuf but still use
> skb_set_owner_w() that charges the len of skb. If we stick to removing
> the whole accounting function, probably we have to adjust the position
> of xsk_cq_submit_locked(), but I reckon for now it's not practical...
>
> Any thoughts on this?
>
> [1]
> 997 [ 133.528449] RIP: 0010:xsk_destruct_skb+0x6a/0x90
> 998 [ 133.528920] Code: 8b 6c 02 28 48 8b 43 18 4c 8b a0 68 03 00 00
> 49 8d 9c 24 e8 00 00 00 48 89 df e8 f1 eb 06 00 48 89 c6 49 8b 84 24
> 88 00 00 00 <48> 8b 50 10 03 2a 48 8b 40 10 48 89 df 89 28 5b 5d
> 41 5c e9 6e ec
> 999 [ 133.530526] RSP: 0018:ffffae71c06a0d08 EFLAGS: 00010046
> 1000 [ 133.531005] RAX: 0000000000000000 RBX: ffff9f42c81c49e8 RCX:
> 00000000000002e7
> 1001 [ 133.531631] RDX: 0000000000000001 RSI: 0000000000000286 RDI:
> ffff9f42c81c49e8
> 1002 [ 133.532249] RBP: 0000000000000001 R08: 0000000000000008 R09:
> 00000000000000001003 [ 133.532867] R10: ffffffff978080c0 R11:
> ffffae71c06a0ff8 R12: ffff9f42c81c4900
> 1004 [ 133.533491] R13: ffffae71c06a0d88 R14: ffff9f42e0f1f900 R15:
> ffff9f42ce850d801005 [ 133.534123] FS: 0000000000000000(0000)
> GS:ffff9f5227655000(0000) knlGS:00000000000000001006 [ 133.534831]
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 1007 [ 133.535366] CR2: 0000000000000010 CR3: 000000011c820000 CR4:
> 00000000003506f0
> 1008 [ 133.536014] Call Trace:
> 1009 [ 133.536313] <IRQ>
> 1010 [ 133.536583] skb_release_head_state+0x20/0x90
> 1011 [ 133.537021] napi_consume_skb+0x42/0x120
> 1012 [ 133.537429] __free_old_xmit+0x76/0x170 [virtio_net]
> 1013 [ 133.537923] free_old_xmit+0x53/0xc0 [virtio_net]
> 1014 [ 133.538395] virtnet_poll+0xed/0x5d0 [virtio_net]
> 1015 [ 133.538867] ? blake2s_compress+0x52/0xa0
> 1016 [ 133.539286] __napi_poll+0x28/0x200
> 1017 [ 133.539668] net_rx_action+0x319/0x400
> 1018 [ 133.540068] ? sched_clock_cpu+0xb/0x190
> 1019 [ 133.540482] ? __run_timers+0x1d1/0x260
> 1020 [ 133.540906] ? __pfx_dl_task_timer+0x10/0x10
> 1021 [ 133.541349] ? lock_timer_base+0x72/0x90
> 1022 [ 133.541767] handle_softirqs+0xce/0x2e0
> 1023 [ 133.542178] __irq_exit_rcu+0xc6/0xf0
> 1024 [ 133.542575] common_interrupt+0x81/0xa0
>
> Thanks,
> Jason
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
2025-08-21 17:29 ` Jesper Dangaard Brouer
@ 2025-08-22 1:13 ` Jason Xing
0 siblings, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-08-22 1:13 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Karlsson, Magnus, Björn Töpel, davem, edumazet, kuba,
pabeni, maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
On Fri, Aug 22, 2025 at 1:29 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
> I need some help from Cc Magnus or Björn, to explain why you changes
> fails in xsk_destruct_skb().
Oh, I mean the reason for using socket level accounting is that we need
to guarantee safety during the tx completion period. In xsk, that is,
xsk_destruct_skb() needs to fetch the corresponding sk from the skb and
manipulate its ring structure. Without accounting, the socket can be
released and destroyed before the driver calls skb->destructor(). Only
with the accounting protection does the socket stay alive, because of
the following code:
sock_wfree()
-> if (refcount_sub_and_test(len, &sk->sk_wmem_alloc))
__sk_free(sk);
It seems there is no way to get rid of the accounting feature for now
without refactoring the whole logic.
We can probably remove the sk_sndbuf limitation, but I'll do more
investigation :)
Thanks,
Jason
>
>
> On 15/08/2025 08.44, Jason Xing wrote:
> > On Tue, Aug 12, 2025 at 10:30 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> >>
> > ...
> >>
> >> But this also requires changing the SKB alloc function used by
> >> xsk_build_skb(). As a seperate patch, I recommend that you change the
> >> sock_alloc_send_skb() to instead use build_skb (or build_skb_around).
> >> I expect this will be a large performance improvement on it's own.
> >> Can I ask you to benchmark this change before the batch xmit change?
> >>
> >> Opinions needed from other maintainers please (I might be wrong!):
> >> I don't think the socket level accounting done in sock_alloc_send_skb()
> >> is correct/relevant for AF_XDP/XSK, because the "backpressure mechanism"
> >> code comment above.
> >
> > Here I'm bringing back the last test you expected to know :)
> >
> > I use alloc_skb() to replace sock_alloc_send_skb() and introduce other
> > minor changes, say, removing sock_wfree() from xsk_destruct_skb(). It
> > turns out to be a stable 5% performance improvement on i40e driver.
> > slight improvement on virtio_net. That's good news.
> >
> > Bad news is that the above logic has bugs like freeing skb in the napi
> > poll causes accessing skb->sk in xsk_destruct_skb() which triggers a
> > NULL pointer issue. How did I spot this one? I removed the BQL flow
> > control and started two xdpsock on different queues, then I saw a
> > panic[1]... To solve the problem like that, I'm afraid that we still
> > need to charge a certain length value into sk_wmem_alloc so that
> > sock_wfree(skb) can be the last one to free the socket finally.
> >
> > So this socket level accounting mechanism keeps its safety in the above case.
> >
> > IMHO, we can get rid of the limitation of sk_sndbuf but still use
> > skb_set_owner_w() that charges the len of skb. If we stick to removing
> > the whole accounting function, probably we have to adjust the position
> > of xsk_cq_submit_locked(), but I reckon for now it's not practical...
> >
> > Any thoughts on this?
> >
> > [1]
> > 997 [ 133.528449] RIP: 0010:xsk_destruct_skb+0x6a/0x90
> > 998 [ 133.528920] Code: 8b 6c 02 28 48 8b 43 18 4c 8b a0 68 03 00 00
> > 49 8d 9c 24 e8 00 00 00 48 89 df e8 f1 eb 06 00 48 89 c6 49 8b 84 24
> > 88 00 00 00 <48> 8b 50 10 03 2a 48 8b 40 10 48 89 df 89 28 5b 5d
> > 41 5c e9 6e ec
> > 999 [ 133.530526] RSP: 0018:ffffae71c06a0d08 EFLAGS: 00010046
> > 1000 [ 133.531005] RAX: 0000000000000000 RBX: ffff9f42c81c49e8 RCX:
> > 00000000000002e7
> > 1001 [ 133.531631] RDX: 0000000000000001 RSI: 0000000000000286 RDI:
> > ffff9f42c81c49e8
> > 1002 [ 133.532249] RBP: 0000000000000001 R08: 0000000000000008 R09:
> > 00000000000000001003 [ 133.532867] R10: ffffffff978080c0 R11:
> > ffffae71c06a0ff8 R12: ffff9f42c81c4900
> > 1004 [ 133.533491] R13: ffffae71c06a0d88 R14: ffff9f42e0f1f900 R15:
> > ffff9f42ce850d801005 [ 133.534123] FS: 0000000000000000(0000)
> > GS:ffff9f5227655000(0000) knlGS:00000000000000001006 [ 133.534831]
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 1007 [ 133.535366] CR2: 0000000000000010 CR3: 000000011c820000 CR4:
> > 00000000003506f0
> > 1008 [ 133.536014] Call Trace:
> > 1009 [ 133.536313] <IRQ>
> > 1010 [ 133.536583] skb_release_head_state+0x20/0x90
> > 1011 [ 133.537021] napi_consume_skb+0x42/0x120
> > 1012 [ 133.537429] __free_old_xmit+0x76/0x170 [virtio_net]
> > 1013 [ 133.537923] free_old_xmit+0x53/0xc0 [virtio_net]
> > 1014 [ 133.538395] virtnet_poll+0xed/0x5d0 [virtio_net]
> > 1015 [ 133.538867] ? blake2s_compress+0x52/0xa0
> > 1016 [ 133.539286] __napi_poll+0x28/0x200
> > 1017 [ 133.539668] net_rx_action+0x319/0x400
> > 1018 [ 133.540068] ? sched_clock_cpu+0xb/0x190
> > 1019 [ 133.540482] ? __run_timers+0x1d1/0x260
> > 1020 [ 133.540906] ? __pfx_dl_task_timer+0x10/0x10
> > 1021 [ 133.541349] ? lock_timer_base+0x72/0x90
> > 1022 [ 133.541767] handle_softirqs+0xce/0x2e0
> > 1023 [ 133.542178] __irq_exit_rcu+0xc6/0xf0
> > 1024 [ 133.542575] common_interrupt+0x81/0xa0
> >
> > Thanks,
> > Jason
>
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-08-22 1:13 UTC | newest]
Thread overview: 17+ messages
2025-08-11 13:12 [PATCH net-next 0/2] xsk: improvement performance in copy mode Jason Xing
2025-08-11 13:12 ` [PATCH net-next 1/2] xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt Jason Xing
2025-08-12 16:40 ` Maciej Fijalkowski
2025-08-12 23:46 ` Jason Xing
2025-08-18 6:20 ` Dan Carpenter
2025-08-11 13:12 ` [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode Jason Xing
2025-08-12 14:30 ` Jesper Dangaard Brouer
2025-08-12 17:49 ` Maciej Fijalkowski
2025-08-13 1:02 ` Jason Xing
2025-08-13 13:06 ` Jason Xing
2025-08-15 16:40 ` Jesper Dangaard Brouer
2025-08-16 0:03 ` Jason Xing
2025-08-16 13:42 ` Jason Xing
2025-08-13 0:57 ` Jason Xing
2025-08-15 6:44 ` Jason Xing
2025-08-21 17:29 ` Jesper Dangaard Brouer
2025-08-22 1:13 ` Jason Xing