* [RFC v3 Optimizing veth xsk performance 0/9]
@ 2023-08-08 3:19 Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 1/9] veth: Implement ethtool's get_ringparam() callback Albert Huang
` (9 more replies)
0 siblings, 10 replies; 14+ messages in thread
From: Albert Huang @ 2023-08-08 3:19 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
AF_XDP is a kernel-bypass technology that can greatly improve performance.
However, for virtual devices like veth, even with the use of AF_XDP sockets,
there are still many additional software paths that consume CPU resources.
This patch series focuses on optimizing the performance of AF_XDP sockets
for veth virtual devices. Patches 1 to 4 are mainly preparatory work.
Patch 5 introduces a tx queue and tx napi for packet transmission, patch 8
implements batch sending for IPv4 UDP packets, and patch 9 adds support for
the AF_XDP tx need_wakeup feature. These optimizations significantly
shorten the software path and add checksum offload support.
I tested these features with the typical topology shown below:
client(send): server:(recv)
veth<-->veth-peer veth1-peer<--->veth1
1 | | 7
|2 6|
| |
bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
3 4 5
(machine1) (machine2)
An AF_XDP socket is attached to both veth and veth1, and sends packets to the physical NIC (eth0).
veth:(172.17.0.2/24)
bridge:(172.17.0.1/24)
eth0:(192.168.156.66/24)
veth1:(172.17.0.2/24)
bridge1:(172.17.0.1/24)
eth1:(192.168.156.88/24)
After setting up the default route, SNAT, and DNAT, we can run the tests
to get the performance results.
Packets are sent from veth to veth1:
af_xdp test tool:
link:https://github.com/cclinuxer/libxudp
send:(veth)
./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
recv:(veth1)
./objs/xudpperf recv --src 172.17.0.2:6002
udp test tool:iperf3
send:(veth)
iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 0 -u
recv:(veth1)
iperf3 -s -p 6002
Performance (tested with the libxudp library):
UDP : 320 Kpps (with 100% cpu)
AF_XDP no zerocopy + no batch : 480 Kpps (with ksoftirqd 100% cpu)
AF_XDP with batch + zerocopy : 1.5 Mpps (with ksoftirqd 15% cpu)
With AF_XDP batching, the libxudp user-space program becomes the
bottleneck, so the softirq does not reach its limit.
This is just an RFC patch series, and some code details still need
further consideration. Please review this proposal.
v2->v3:
- fix build errors found by the kernel test robot.
v1->v2:
- all the patches pass the checkpatch.pl test (suggested by Simon Horman).
- iperf3 tested with -b 0; updated the test results (suggested by Paolo Abeni).
- refactor code to make code structure clearer.
- delete some useless code logic in the veth_xsk_tx_xmit function.
- add support for the AF_XDP tx need_wakeup feature.
Albert Huang (9):
veth: Implement ethtool's get_ringparam() callback
xsk: add dma_check_skip for skipping dma check
veth: add support for send queue
xsk: add xsk_tx_completed_addr function
veth: use send queue tx napi to xmit xsk tx desc
veth: add ndo_xsk_wakeup callback for veth
sk_buff: add destructor_arg_xsk_pool for zero copy
veth: af_xdp tx batch support for ipv4 udp
veth: add support for AF_XDP tx need_wakeup feature
drivers/net/veth.c | 679 +++++++++++++++++++++++++++++++++++-
include/linux/skbuff.h | 2 +
include/net/xdp_sock_drv.h | 5 +
include/net/xsk_buff_pool.h | 1 +
net/xdp/xsk.c | 6 +
net/xdp/xsk_buff_pool.c | 3 +-
net/xdp/xsk_queue.h | 10 +
7 files changed, 704 insertions(+), 2 deletions(-)
--
2.20.1
^ permalink raw reply [flat|nested] 14+ messages in thread
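For readers unfamiliar with the need_wakeup feature referenced in the cover letter: from the application side the protocol amounts to checking a ring flag before issuing the kick syscall, saving one syscall per send loop when the driver is already polling. A minimal self-contained sketch; the flag value mirrors XDP_RING_NEED_WAKEUP from <linux/if_xdp.h>, but the struct here is an illustrative stand-in, not the real ring layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value from <linux/if_xdp.h>, redefined so this sketch is self-contained. */
#define XDP_RING_NEED_WAKEUP (1 << 0)

/* Stand-in for the mmapped TX ring header; only the flags word matters here. */
struct tx_ring {
	uint32_t flags;
};

/* Application-side need_wakeup check: kick the kernel (sendto() on the xsk
 * fd in a real program) only when the driver has raised the flag. */
static bool must_kick_tx(const struct tx_ring *r)
{
	return (r->flags & XDP_RING_NEED_WAKEUP) != 0;
}
```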
* [RFC v3 Optimizing veth xsk performance 1/9] veth: Implement ethtool's get_ringparam() callback
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
@ 2023-08-08 3:19 ` Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 2/9] xsk: add dma_check_skip for skipping dma check Albert Huang
` (8 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Albert Huang @ 2023-08-08 3:19 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
Some xsk libraries call the get_ringparam() API to get the queue length
in order to initialize the xsk umem.
Implement it in veth so those scenarios can work properly.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..77e12d52ca2b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -255,6 +255,17 @@ static void veth_get_channels(struct net_device *dev,
static int veth_set_channels(struct net_device *dev,
struct ethtool_channels *ch);
+static void veth_get_ringparam(struct net_device *dev,
+ struct ethtool_ringparam *ring,
+ struct kernel_ethtool_ringparam *kernel_ring,
+ struct netlink_ext_ack *extack)
+{
+ ring->rx_max_pending = VETH_RING_SIZE;
+ ring->tx_max_pending = VETH_RING_SIZE;
+ ring->rx_pending = VETH_RING_SIZE;
+ ring->tx_pending = VETH_RING_SIZE;
+}
+
static const struct ethtool_ops veth_ethtool_ops = {
.get_drvinfo = veth_get_drvinfo,
.get_link = ethtool_op_get_link,
@@ -265,6 +276,7 @@ static const struct ethtool_ops veth_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_channels = veth_get_channels,
.set_channels = veth_set_channels,
+ .get_ringparam = veth_get_ringparam,
};
/* general routines */
--
2.20.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
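As a sketch of how a user-space xsk library might consume the get_ringparam() output when sizing its rings: clamp the requested descriptor count to the device maximum, then round down to a power of two (xsk ring sizes must be powers of two). The policy and the values in the test are illustrative assumptions, not libxudp's actual logic:

```c
#include <assert.h>
#include <stdint.h>

/* Pick a usable xsk ring size from a requested size and the device's
 * tx_max_pending as reported by ETHTOOL_GRINGPARAM: clamp to the maximum,
 * then round down to a power of two. */
static uint32_t pick_xsk_ring_size(uint32_t requested, uint32_t tx_max_pending)
{
	uint32_t size = requested < tx_max_pending ? requested : tx_max_pending;
	uint32_t pow2 = 1;

	/* largest power of two <= size (size >= 1 assumed) */
	while (pow2 * 2 <= size)
		pow2 *= 2;
	return pow2;
}
```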
* [RFC v3 Optimizing veth xsk performance 2/9] xsk: add dma_check_skip for skipping dma check
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 1/9] veth: Implement ethtool's get_ringparam() callback Albert Huang
@ 2023-08-08 3:19 ` Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 3/9] veth: add support for send queue Albert Huang
` (7 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Albert Huang @ 2023-08-08 3:19 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
For a virtual net device such as veth, there is no need to do the DMA
check if we support zero copy. Add this flag after 'unaligned', because
there is a 4-byte hole there:
pahole -V ./net/xdp/xsk_buff_pool.o:
-----------
...
/* --- cacheline 3 boundary (192 bytes) --- */
u32 chunk_size; /* 192 4 */
u32 frame_len; /* 196 4 */
u8 cached_need_wakeup; /* 200 1 */
bool uses_need_wakeup; /* 201 1 */
bool dma_need_sync; /* 202 1 */
bool unaligned; /* 203 1 */
/* XXX 4 bytes hole, try to pack */
void * addrs; /* 208 8 */
spinlock_t cq_lock; /* 216 4 */
...
-----------
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
include/net/xsk_buff_pool.h | 1 +
net/xdp/xsk_buff_pool.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index b0bdff26fc88..fe31097dc11b 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -81,6 +81,7 @@ struct xsk_buff_pool {
bool uses_need_wakeup;
bool dma_need_sync;
bool unaligned;
+ bool dma_check_skip;
void *addrs;
/* Mutual exclusion of the completion ring in the SKB mode. Two cases to protect:
* NAPI TX thread and sendmsg error paths in the SKB destructor callback and when
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index b3f7b310811e..ed251b8e8773 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
XDP_PACKET_HEADROOM;
pool->umem = umem;
pool->addrs = umem->addrs;
+ pool->dma_check_skip = false;
INIT_LIST_HEAD(&pool->free_list);
INIT_LIST_HEAD(&pool->xskb_list);
INIT_LIST_HEAD(&pool->xsk_tx_list);
@@ -202,7 +203,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
if (err)
goto err_unreg_pool;
- if (!pool->dma_pages) {
+ if (!pool->dma_pages && !pool->dma_check_skip) {
WARN(1, "Driver did not DMA map zero-copy buffers");
err = -EINVAL;
goto err_unreg_xsk;
--
2.20.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
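The hole pahole reports can be reproduced in user space with a mirror of the struct tail: on LP64 targets the trailing pointer forces 8-byte alignment, so the new bool lands in the padding and the struct size does not change. The offsets below assume 8-byte pointers and a typical C ABI; this is an illustration, not the kernel struct itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the tail of struct xsk_buff_pool before the patch: the three
 * bools end just before a 4-byte hole created by the 8-byte-aligned void *. */
struct pool_before {
	uint32_t chunk_size;
	uint32_t frame_len;
	uint8_t  cached_need_wakeup;
	bool     uses_need_wakeup;
	bool     dma_need_sync;
	bool     unaligned;
	/* 4-byte hole here on LP64 targets */
	void    *addrs;
};

/* Same tail with the patch applied: dma_check_skip consumes one byte
 * of the hole, so sizeof() and the offset of 'addrs' are unchanged. */
struct pool_after {
	uint32_t chunk_size;
	uint32_t frame_len;
	uint8_t  cached_need_wakeup;
	bool     uses_need_wakeup;
	bool     dma_need_sync;
	bool     unaligned;
	bool     dma_check_skip;
	void    *addrs;
};
```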
* [RFC v3 Optimizing veth xsk performance 3/9] veth: add support for send queue
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 1/9] veth: Implement ethtool's get_ringparam() callback Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 2/9] xsk: add dma_check_skip for skipping dma check Albert Huang
@ 2023-08-08 3:19 ` Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 4/9] xsk: add xsk_tx_completed_addr function Albert Huang
` (6 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Albert Huang @ 2023-08-08 3:19 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
In order to support native AF_XDP for veth, we need a send queue for
NAPI tx. The upcoming patches will make use of it.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 77e12d52ca2b..25faba879505 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -56,6 +56,11 @@ struct veth_rq_stats {
struct u64_stats_sync syncp;
};
+struct veth_sq_stats {
+ struct veth_stats vs;
+ struct u64_stats_sync syncp;
+};
+
struct veth_rq {
struct napi_struct xdp_napi;
struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
@@ -69,11 +74,25 @@ struct veth_rq {
struct page_pool *page_pool;
};
+struct veth_sq {
+ struct napi_struct xdp_napi;
+ struct net_device *dev;
+ struct xdp_mem_info xdp_mem;
+ struct veth_sq_stats stats;
+ u32 queue_index;
+ /* for xsk */
+ struct {
+ struct xsk_buff_pool __rcu *pool;
+ u32 last_cpu;
+ } xsk;
+};
+
struct veth_priv {
struct net_device __rcu *peer;
atomic64_t dropped;
struct bpf_prog *_xdp_prog;
struct veth_rq *rq;
+ struct veth_sq *sq;
unsigned int requested_headroom;
};
@@ -1495,6 +1514,15 @@ static int veth_alloc_queues(struct net_device *dev)
u64_stats_init(&priv->rq[i].stats.syncp);
}
+ priv->sq = kcalloc(dev->num_tx_queues, sizeof(*priv->sq), GFP_KERNEL);
+ if (!priv->sq)
+ return -ENOMEM;
+
+ for (i = 0; i < dev->num_tx_queues; i++) {
+ priv->sq[i].dev = dev;
+ u64_stats_init(&priv->sq[i].stats.syncp);
+ }
+
return 0;
}
@@ -1503,6 +1531,7 @@ static void veth_free_queues(struct net_device *dev)
struct veth_priv *priv = netdev_priv(dev);
kfree(priv->rq);
+ kfree(priv->sq);
}
static int veth_dev_init(struct net_device *dev)
--
2.20.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC v3 Optimizing veth xsk performance 4/9] xsk: add xsk_tx_completed_addr function
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
` (2 preceding siblings ...)
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 3/9] veth: add support for send queue Albert Huang
@ 2023-08-08 3:19 ` Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 5/9] veth: use send queue tx napi to xmit xsk tx desc Albert Huang
` (5 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Albert Huang @ 2023-08-08 3:19 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
Return the descriptor to the completion queue (cq) by its address.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
include/net/xdp_sock_drv.h | 5 +++++
net/xdp/xsk.c | 6 ++++++
net/xdp/xsk_queue.h | 10 ++++++++++
3 files changed, 21 insertions(+)
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 1f6fc8c7a84c..de82c596e48f 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -15,6 +15,7 @@
#ifdef CONFIG_XDP_SOCKETS
void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
+void xsk_tx_completed_addr(struct xsk_buff_pool *pool, u64 addr);
bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
void xsk_tx_release(struct xsk_buff_pool *pool);
@@ -188,6 +189,10 @@ static inline void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
{
}
+static inline void xsk_tx_completed_addr(struct xsk_buff_pool *pool, u64 addr)
+{
+}
+
static inline bool xsk_tx_peek_desc(struct xsk_buff_pool *pool,
struct xdp_desc *desc)
{
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4f1e0599146e..b2b8aa7b0bcf 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -396,6 +396,12 @@ void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
}
EXPORT_SYMBOL(xsk_tx_completed);
+void xsk_tx_completed_addr(struct xsk_buff_pool *pool, u64 addr)
+{
+ xskq_prod_submit_addr(pool->cq, addr);
+}
+EXPORT_SYMBOL(xsk_tx_completed_addr);
+
void xsk_tx_release(struct xsk_buff_pool *pool)
{
struct xdp_sock *xs;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 13354a1e4280..3a5e26a81dc2 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -428,6 +428,16 @@ static inline void __xskq_prod_submit(struct xsk_queue *q, u32 idx)
smp_store_release(&q->ring->producer, idx); /* B, matches C */
}
+static inline void xskq_prod_submit_addr(struct xsk_queue *q, u64 addr)
+{
+ struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+ u32 idx = q->ring->producer;
+
+ ring->desc[idx++ & q->ring_mask] = addr;
+
+ __xskq_prod_submit(q, idx);
+}
+
static inline void xskq_prod_submit(struct xsk_queue *q)
{
__xskq_prod_submit(q, q->cached_prod);
--
2.20.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
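The index arithmetic in the new xskq_prod_submit_addr() can be sketched in user space as a masked power-of-two ring. The memory-ordering of the real publish step (smp_store_release() in __xskq_prod_submit()) is elided; this only illustrates the slot/wrap math:

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 4 /* must be a power of two, as xsk rings are */

struct umem_ring {
	uint32_t producer;
	uint32_t ring_mask; /* RING_SIZE - 1 */
	uint64_t desc[RING_SIZE];
};

/* Sketch of xskq_prod_submit_addr(): write one completed address at the
 * current producer slot (masked for wraparound), then publish the new
 * producer index.  The kernel publishes with smp_store_release(). */
static void prod_submit_addr(struct umem_ring *q, uint64_t addr)
{
	uint32_t idx = q->producer;

	q->desc[idx++ & q->ring_mask] = addr;
	q->producer = idx; /* smp_store_release() in the kernel */
}
```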
* [RFC v3 Optimizing veth xsk performance 5/9] veth: use send queue tx napi to xmit xsk tx desc
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
` (3 preceding siblings ...)
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 4/9] xsk: add xsk_tx_completed_addr function Albert Huang
@ 2023-08-08 3:19 ` Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 6/9] veth: add ndo_xsk_wakeup callback for veth Albert Huang
` (4 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Albert Huang @ 2023-08-08 3:19 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
Use the send queue tx napi to transmit xsk tx descriptors.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 230 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 229 insertions(+), 1 deletion(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 25faba879505..28b891dd8dc9 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -27,6 +27,8 @@
#include <linux/bpf_trace.h>
#include <linux/net_tstamp.h>
#include <net/page_pool.h>
+#include <net/xdp_sock_drv.h>
+#include <net/xdp.h>
#define DRV_NAME "veth"
#define DRV_VERSION "1.0"
@@ -1061,6 +1063,141 @@ static int veth_poll(struct napi_struct *napi, int budget)
return done;
}
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ int buflen)
+{
+ struct sk_buff *skb;
+
+ skb = build_skb(head, buflen);
+ if (!skb)
+ return NULL;
+
+ skb_reserve(skb, headroom);
+ skb_put(skb, len);
+
+ return skb;
+}
+
+static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool, int budget)
+{
+ struct veth_priv *priv, *peer_priv;
+ struct net_device *dev, *peer_dev;
+ struct veth_stats stats = {};
+ struct sk_buff *skb = NULL;
+ struct veth_rq *peer_rq;
+ struct xdp_desc desc;
+ int done = 0;
+
+ dev = sq->dev;
+ priv = netdev_priv(dev);
+ peer_dev = priv->peer;
+ peer_priv = netdev_priv(peer_dev);
+
+ /* todo: queue index must set before this */
+ peer_rq = &peer_priv->rq[sq->queue_index];
+
+ /* set xsk wake up flag, to do: where to disable */
+ if (xsk_uses_need_wakeup(xsk_pool))
+ xsk_set_tx_need_wakeup(xsk_pool);
+
+ while (budget-- > 0) {
+ unsigned int truesize = 0;
+ struct page *page;
+ void *vaddr;
+ void *addr;
+
+ if (!xsk_tx_peek_desc(xsk_pool, &desc))
+ break;
+
+ addr = xsk_buff_raw_get_data(xsk_pool, desc.addr);
+
+ /* can not hold all data in a page */
+ truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ truesize += desc.len + xsk_pool->headroom;
+ if (truesize > PAGE_SIZE) {
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+ stats.xdp_drops++;
+ break;
+ }
+
+ page = dev_alloc_page();
+ if (!page) {
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+ stats.xdp_drops++;
+ break;
+ }
+ vaddr = page_to_virt(page);
+
+ memcpy(vaddr + xsk_pool->headroom, addr, desc.len);
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+
+ skb = veth_build_skb(vaddr, xsk_pool->headroom, desc.len, PAGE_SIZE);
+ if (!skb) {
+ put_page(page);
+ stats.xdp_drops++;
+ break;
+ }
+ skb->protocol = eth_type_trans(skb, peer_dev);
+ napi_gro_receive(&peer_rq->xdp_napi, skb);
+
+ stats.xdp_bytes += desc.len;
+ done++;
+ }
+
+ /* release, move consumer,and wakeup the producer */
+ if (done) {
+ napi_schedule(&peer_rq->xdp_napi);
+ xsk_tx_release(xsk_pool);
+ }
+
+ u64_stats_update_begin(&sq->stats.syncp);
+ sq->stats.vs.xdp_packets += done;
+ sq->stats.vs.xdp_bytes += stats.xdp_bytes;
+ sq->stats.vs.xdp_drops += stats.xdp_drops;
+ u64_stats_update_end(&sq->stats.syncp);
+
+ return done;
+}
+
+static int veth_poll_tx(struct napi_struct *napi, int budget)
+{
+ struct veth_sq *sq = container_of(napi, struct veth_sq, xdp_napi);
+ struct xsk_buff_pool *pool;
+ int done = 0;
+
+ sq->xsk.last_cpu = smp_processor_id();
+
+ /* xmit for tx queue */
+ rcu_read_lock();
+ pool = rcu_dereference(sq->xsk.pool);
+ if (pool)
+ done = veth_xsk_tx_xmit(sq, pool, budget);
+
+ rcu_read_unlock();
+
+ if (done < budget) {
+ /* if done < budget, the tx ring is no buffer */
+ napi_complete_done(napi, done);
+ }
+
+ return done;
+}
+
+static int veth_napi_add_tx(struct net_device *dev)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+ int i;
+
+ for (i = 0; i < dev->real_num_rx_queues; i++) {
+ struct veth_sq *sq = &priv->sq[i];
+
+ netif_napi_add(dev, &sq->xdp_napi, veth_poll_tx);
+ napi_enable(&sq->xdp_napi);
+ }
+
+ return 0;
+}
+
static int veth_create_page_pool(struct veth_rq *rq)
{
struct page_pool_params pp_params = {
@@ -1153,6 +1290,19 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
}
}
+static void veth_napi_del_tx(struct net_device *dev)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+ int i;
+
+ for (i = 0; i < dev->real_num_rx_queues; i++) {
+ struct veth_sq *sq = &priv->sq[i];
+
+ napi_disable(&sq->xdp_napi);
+ __netif_napi_del(&sq->xdp_napi);
+ }
+}
+
static void veth_napi_del(struct net_device *dev)
{
veth_napi_del_range(dev, 0, dev->real_num_rx_queues);
@@ -1360,7 +1510,7 @@ static void veth_set_xdp_features(struct net_device *dev)
struct veth_priv *priv_peer = netdev_priv(peer);
xdp_features_t val = NETDEV_XDP_ACT_BASIC |
NETDEV_XDP_ACT_REDIRECT |
- NETDEV_XDP_ACT_RX_SG;
+ NETDEV_XDP_ACT_RX_SG | NETDEV_XDP_ACT_XSK_ZEROCOPY;
if (priv_peer->_xdp_prog || veth_gro_requested(peer))
val |= NETDEV_XDP_ACT_NDO_XMIT |
@@ -1737,11 +1887,89 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
return err;
}
+static int veth_xsk_pool_enable(struct net_device *dev, struct xsk_buff_pool *pool, u16 qid)
+{
+ struct veth_priv *peer_priv;
+ struct veth_priv *priv = netdev_priv(dev);
+ struct net_device *peer_dev = priv->peer;
+ int err = 0;
+
+ if (qid >= dev->real_num_tx_queues)
+ return -EINVAL;
+
+ if (!peer_dev)
+ return -EINVAL;
+
+ /* no dma, so we just skip dma skip in xsk zero copy */
+ pool->dma_check_skip = true;
+
+ peer_priv = netdev_priv(peer_dev);
+
+ /* enable peer tx xdp here, this side
+ * xdp is enable by veth_xdp_set
+ * to do: we need to check whther this side is already enable xdp
+ * maybe it do not have xdp prog
+ */
+ if (!(peer_priv->_xdp_prog) && (!veth_gro_requested(peer_dev))) {
+ /* peer should enable napi*/
+ err = veth_napi_enable(peer_dev);
+ if (err)
+ return err;
+ }
+
+ /* Here is already protected by rtnl_lock, so rcu_assign_pointer
+ * is safe.
+ */
+ rcu_assign_pointer(priv->sq[qid].xsk.pool, pool);
+
+ veth_napi_add_tx(dev);
+
+ return err;
+}
+
+static int veth_xsk_pool_disable(struct net_device *dev, u16 qid)
+{
+ struct veth_priv *peer_priv;
+ struct veth_priv *priv = netdev_priv(dev);
+ struct net_device *peer_dev = priv->peer;
+ int err = 0;
+
+ if (qid >= dev->real_num_tx_queues)
+ return -EINVAL;
+
+ if (!peer_dev)
+ return -EINVAL;
+
+ peer_priv = netdev_priv(peer_dev);
+
+ /* to do: this may be failed */
+ if (!(peer_priv->_xdp_prog) && (!veth_gro_requested(peer_dev))) {
+ /* disable peer napi */
+ veth_napi_del(peer_dev);
+ }
+
+ veth_napi_del_tx(dev);
+
+ rcu_assign_pointer(priv->sq[qid].xsk.pool, NULL);
+ return err;
+}
+
+/* this is for setup xdp */
+static int veth_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp)
+{
+ if (xdp->xsk.pool)
+ return veth_xsk_pool_enable(dev, xdp->xsk.pool, xdp->xsk.queue_id);
+ else
+ return veth_xsk_pool_disable(dev, xdp->xsk.queue_id);
+}
+
static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
{
switch (xdp->command) {
case XDP_SETUP_PROG:
return veth_xdp_set(dev, xdp->prog, xdp->extack);
+ case XDP_SETUP_XSK_POOL:
+ return veth_xsk_pool_setup(dev, xdp);
default:
return -EINVAL;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
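The per-descriptor bound check in veth_xsk_tx_xmit() (descriptor payload plus pool headroom plus the trailing shared info must fit in one page) can be sketched as follows. The shared-info size and cache-line constant are illustrative stand-ins for the kernel's sizeof(struct skb_shared_info) and SMP_CACHE_BYTES:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed constants for illustration; the real values come from the kernel. */
#define PAGE_SIZE        4096
#define SMP_CACHE_BYTES  64
#define SHINFO_SIZE      320 /* hypothetical sizeof(struct skb_shared_info) */
#define SKB_DATA_ALIGN(x) (((x) + (SMP_CACHE_BYTES - 1)) & ~(SMP_CACHE_BYTES - 1))

/* Mirror of the check in veth_xsk_tx_xmit(): the copied descriptor data,
 * the pool headroom and the aligned shared info must all fit in one page,
 * since the skb is built over a single freshly allocated page. */
static bool desc_fits_in_page(size_t desc_len, size_t headroom)
{
	size_t truesize = SKB_DATA_ALIGN(SHINFO_SIZE) + desc_len + headroom;

	return truesize <= PAGE_SIZE;
}
```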
* [RFC v3 Optimizing veth xsk performance 6/9] veth: add ndo_xsk_wakeup callback for veth
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
` (4 preceding siblings ...)
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 5/9] veth: use send queue tx napi to xmit xsk tx desc Albert Huang
@ 2023-08-08 3:19 ` Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 7/9] sk_buff: add destructor_arg_xsk_pool for zero copy Albert Huang
` (3 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Albert Huang @ 2023-08-08 3:19 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
Add the ndo_xsk_wakeup callback for veth; it is used to wake up the
tx napi.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 28b891dd8dc9..ac78d6a87416 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1805,6 +1805,44 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
rcu_read_unlock();
}
+static void veth_xsk_remote_trigger_napi(void *info)
+{
+ struct veth_sq *sq = info;
+
+ napi_schedule(&sq->xdp_napi);
+}
+
+static int veth_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
+{
+ struct veth_priv *priv;
+ struct veth_sq *sq;
+ u32 last_cpu, cur_cpu;
+
+ if (!netif_running(dev))
+ return -ENETDOWN;
+
+ if (qid >= dev->real_num_rx_queues)
+ return -EINVAL;
+
+ priv = netdev_priv(dev);
+ sq = &priv->sq[qid];
+
+ if (napi_if_scheduled_mark_missed(&sq->xdp_napi))
+ return 0;
+
+ last_cpu = sq->xsk.last_cpu;
+ cur_cpu = get_cpu();
+
+ /* raise a napi */
+ if (last_cpu == cur_cpu)
+ napi_schedule(&sq->xdp_napi);
+ else
+ smp_call_function_single(last_cpu, veth_xsk_remote_trigger_napi, sq, true);
+
+ put_cpu();
+ return 0;
+}
+
static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
struct netlink_ext_ack *extack)
{
@@ -2019,6 +2057,7 @@ static const struct net_device_ops veth_netdev_ops = {
.ndo_set_rx_headroom = veth_set_rx_headroom,
.ndo_bpf = veth_xdp,
.ndo_xdp_xmit = veth_ndo_xdp_xmit,
+ .ndo_xsk_wakeup = veth_xsk_wakeup,
.ndo_get_peer_dev = veth_peer_dev,
};
--
2.20.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
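The CPU-affinity decision in veth_xsk_wakeup() reduces to the following: keep the TX NAPI on the CPU it last ran on (so its cache state stays warm) by sending an IPI via smp_call_function_single() when woken from elsewhere, and only schedule locally when we are already on that CPU. A minimal sketch of just the decision:

```c
#include <assert.h>
#include <stdint.h>

enum wake_action {
	WAKE_LOCAL_SCHEDULE, /* napi_schedule() on the current CPU */
	WAKE_REMOTE_IPI,     /* smp_call_function_single() to last_cpu */
};

/* Sketch of the branch in veth_xsk_wakeup(): prefer running the TX NAPI
 * on the CPU that last polled it. */
static enum wake_action pick_wake_action(uint32_t last_cpu, uint32_t cur_cpu)
{
	return last_cpu == cur_cpu ? WAKE_LOCAL_SCHEDULE : WAKE_REMOTE_IPI;
}
```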
* [RFC v3 Optimizing veth xsk performance 7/9] sk_buff: add destructor_arg_xsk_pool for zero copy
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
` (5 preceding siblings ...)
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 6/9] veth: add ndo_xsk_wakeup callback for veth Albert Huang
@ 2023-08-08 3:19 ` Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 8/9] veth: af_xdp tx batch support for ipv4 udp Albert Huang
` (2 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Albert Huang @ 2023-08-08 3:19 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
This member is added for the dummy device to support zero copy.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
include/linux/skbuff.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 16a49ba534e4..db999056022e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -592,6 +592,8 @@ struct skb_shared_info {
/* Intermediate layers must ensure that destructor_arg
* remains valid until skb destructor */
void * destructor_arg;
+ /* just for dummy device xsk zero copy */
+ void *destructor_arg_xsk_pool;
/* must be last field, see pskb_expand_head() */
skb_frag_t frags[MAX_SKB_FRAGS];
--
2.20.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC v3 Optimizing veth xsk performance 8/9] veth: af_xdp tx batch support for ipv4 udp
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
` (6 preceding siblings ...)
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 7/9] sk_buff: add destructor_arg_xsk_pool for zero copy Albert Huang
@ 2023-08-08 3:19 ` Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 9/9] veth: add support for AF_XDP tx need_wakeup feature Albert Huang
2023-08-08 12:01 ` [RFC v3 Optimizing veth xsk performance 0/9] Toke Høiland-Jørgensen
9 siblings, 0 replies; 14+ messages in thread
From: Albert Huang @ 2023-08-08 3:19 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
A typical topology is shown below:
veth<--------veth-peer
1 |
|2
|
bridge<------->eth0(such as mlnx5 NIC)
If you use AF_XDP to send packets from veth to a physical NIC, the
packets must traverse several software paths, so we can borrow from the
kernel's GSO implementation: when AF_XDP sends packets out of veth,
aggregate them and send one large packet from the veth virtual NIC to
the physical NIC.
Performance (tested with the libxudp library):
AF_XDP without batch : 480 Kpps (with ksoftirqd at 100% CPU)
AF_XDP with batch    : 1.5 Mpps (with ksoftirqd at 15% CPU)
With AF_XDP batching, the libxudp user-space program becomes the
bottleneck, so the softirq does not reach its limit.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 408 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 387 insertions(+), 21 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index ac78d6a87416..70489d017b51 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -29,6 +29,7 @@
#include <net/page_pool.h>
#include <net/xdp_sock_drv.h>
#include <net/xdp.h>
+#include <net/udp.h>
#define DRV_NAME "veth"
#define DRV_VERSION "1.0"
@@ -103,6 +104,23 @@ struct veth_xdp_tx_bq {
unsigned int count;
};
+struct veth_batch_tuple {
+ __u8 protocol;
+ __be32 saddr;
+ __be32 daddr;
+ __be16 source;
+ __be16 dest;
+ __be16 batch_size;
+ __be16 batch_segs;
+ bool batch_enable;
+ bool batch_flush;
+};
+
+struct veth_seg_info {
+ u32 segs;
+ u64 desc[] ____cacheline_aligned_in_smp;
+};
+
/*
* ethtool interface
*/
@@ -1078,11 +1096,340 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
return skb;
}
+static void veth_xsk_destruct_skb(struct sk_buff *skb)
+{
+ struct skb_shared_info *si = skb_shinfo(skb);
+ struct xsk_buff_pool *pool = (struct xsk_buff_pool *)si->destructor_arg_xsk_pool;
+ struct veth_seg_info *seg_info = (struct veth_seg_info *)si->destructor_arg;
+ unsigned long flags;
+ u32 index = 0;
+ u64 addr;
+
+ /* release cq */
+ spin_lock_irqsave(&pool->cq_lock, flags);
+ for (index = 0; index < seg_info->segs; index++) {
+ addr = (u64)(long)seg_info->desc[index];
+ xsk_tx_completed_addr(pool, addr);
+ }
+ spin_unlock_irqrestore(&pool->cq_lock, flags);
+
+ kfree(seg_info);
+ si->destructor_arg = NULL;
+ si->destructor_arg_xsk_pool = NULL;
+}
+
+static struct sk_buff *veth_build_gso_head_skb(struct net_device *dev,
+ char *buff, u32 tot_len,
+ u32 headroom, u32 iph_len,
+ u32 th_len)
+{
+ struct sk_buff *skb = NULL;
+ int err = 0;
+
+ skb = alloc_skb(tot_len, GFP_KERNEL);
+ if (unlikely(!skb))
+ return NULL;
+
+ /* header room contains the eth header */
+ skb_reserve(skb, headroom - ETH_HLEN);
+ skb_put(skb, ETH_HLEN + iph_len + th_len);
+ skb_shinfo(skb)->gso_segs = 0;
+
+ err = skb_store_bits(skb, 0, buff, ETH_HLEN + iph_len + th_len);
+ if (unlikely(err)) {
+ kfree_skb(skb);
+ return NULL;
+ }
+
+ skb->protocol = eth_type_trans(skb, dev);
+ skb->network_header = skb->mac_header + ETH_HLEN;
+ skb->transport_header = skb->network_header + iph_len;
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ return skb;
+}
+
+/* only ipv4 udp match
+ * to do: tcp and ipv6
+ */
+static inline bool veth_segment_match(struct veth_batch_tuple *tuple,
+ struct iphdr *iph, struct udphdr *udph)
+{
+ if (tuple->protocol == iph->protocol &&
+ tuple->saddr == iph->saddr &&
+ tuple->daddr == iph->daddr &&
+ tuple->source == udph->source &&
+ tuple->dest == udph->dest &&
+ tuple->batch_size == ntohs(udph->len)) {
+ tuple->batch_flush = false;
+ return true;
+ }
+
+ tuple->batch_flush = true;
+ return false;
+}
+
+static inline void veth_tuple_init(struct veth_batch_tuple *tuple,
+ struct iphdr *iph, struct udphdr *udph)
+{
+ tuple->protocol = iph->protocol;
+ tuple->saddr = iph->saddr;
+ tuple->daddr = iph->daddr;
+ tuple->source = udph->source;
+ tuple->dest = udph->dest;
+ tuple->batch_flush = false;
+ tuple->batch_size = ntohs(udph->len);
+ tuple->batch_segs = 0;
+}
+
+static inline bool veth_batch_ip_check_v4(struct iphdr *iph, u32 len)
+{
+ if (len <= (ETH_HLEN + sizeof(*iph)))
+ return false;
+
+ if (iph->ihl < 5 || iph->version != 4 || len < (iph->ihl * 4 + ETH_HLEN))
+ return false;
+
+ return true;
+}
+
+static struct sk_buff *veth_build_skb_batch_udp(struct net_device *dev,
+ struct xsk_buff_pool *pool,
+ struct xdp_desc *desc,
+ struct veth_batch_tuple *tuple,
+ struct sk_buff *prev_skb)
+{
+ u32 hr, len, ts, index, iph_len, th_len, data_offset, data_len, tot_len;
+ struct veth_seg_info *seg_info;
+ void *buffer;
+ struct udphdr *udph;
+ struct iphdr *iph;
+ struct sk_buff *skb;
+ struct page *page;
+ u32 seg_len = 0;
+ int hh_len = 0;
+ u64 addr;
+
+ addr = desc->addr;
+ len = desc->len;
+
+ /* l2 reserved len */
+ hh_len = LL_RESERVED_SPACE(dev);
+ hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(hh_len));
+
+ /* data points to eth header */
+ buffer = (unsigned char *)xsk_buff_raw_get_data(pool, addr);
+
+ iph = (struct iphdr *)(buffer + ETH_HLEN);
+ iph_len = iph->ihl * 4;
+
+ udph = (struct udphdr *)(buffer + ETH_HLEN + iph_len);
+ th_len = sizeof(struct udphdr);
+
+ if (tuple->batch_flush)
+ veth_tuple_init(tuple, iph, udph);
+
+ ts = pool->unaligned ? len : pool->chunk_size;
+
+ data_offset = offset_in_page(buffer) + ETH_HLEN + iph_len + th_len;
+ data_len = len - (ETH_HLEN + iph_len + th_len);
+
+ /* no head skb yet, or this descriptor starts a new 5-tuple batch */
+ if (!prev_skb || !veth_segment_match(tuple, iph, udph)) {
+ tot_len = hr + iph_len + th_len;
+ skb = veth_build_gso_head_skb(dev, buffer, tot_len, hr, iph_len, th_len);
+ if (!skb) {
+ /* TODO: proper error handling for the failed skb */
+ return NULL;
+ }
+
+ /* store information for gso */
+ seg_len = struct_size(seg_info, desc, MAX_SKB_FRAGS);
+ seg_info = kmalloc(seg_len, GFP_KERNEL);
+ if (!seg_info) {
+ /* TODO: error handling */
+ kfree_skb(skb);
+ return NULL;
+ }
+ } else {
+ skb = prev_skb;
+ skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4 | SKB_GSO_PARTIAL;
+ skb_shinfo(skb)->gso_size = data_len;
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ /* an skb can carry at most MAX_SKB_FRAGS segments */
+ if (skb_shinfo(skb)->gso_segs >= MAX_SKB_FRAGS - 1)
+ tuple->batch_flush = true;
+
+ seg_info = (struct veth_seg_info *)skb_shinfo(skb)->destructor_arg;
+ }
+
+ /* offset in umem pool buffer */
+ addr = buffer - pool->addrs;
+
+ /* get the page of the desc */
+ page = pool->umem->pgs[addr >> PAGE_SHIFT];
+
+ /* hold a page reference so kfree_skb() does not free the umem page */
+ get_page(page);
+
+ /* the descriptor data never spans two pages (checked by veth_batch_desc_check()) */
+ skb_fill_page_desc(skb, skb_shinfo(skb)->gso_segs, page, data_offset, data_len);
+
+ skb->len += data_len;
+ skb->data_len += data_len;
+ skb->truesize += ts;
+ skb->dev = dev;
+
+ /* record this descriptor for completion when the skb is destructed */
+ index = skb_shinfo(skb)->gso_segs;
+ seg_info->desc[index] = desc->addr;
+ seg_info->segs = ++index;
+ skb_shinfo(skb)->gso_segs++;
+
+ skb_shinfo(skb)->destructor_arg = (void *)(long)seg_info;
+ skb_shinfo(skb)->destructor_arg_xsk_pool = (void *)(long)pool;
+ skb->destructor = veth_xsk_destruct_skb;
+
+ /* TODO:
+ * charge the skb to a socket; this may be unnecessary, and since
+ * multiple XSK sockets can share the pool it is hard to tell
+ * which socket is sending the data.
+ * refcount_add(ts, &xs->sk.sk_wmem_alloc);
+ */
+ return skb;
+}
+
+static inline struct sk_buff *veth_build_skb_def(struct net_device *dev,
+ struct xsk_buff_pool *pool, struct xdp_desc *desc)
+{
+ struct sk_buff *skb = NULL;
+ struct page *page;
+ void *buffer;
+ void *vaddr;
+
+ page = dev_alloc_page();
+ if (!page)
+ return NULL;
+
+ buffer = (unsigned char *)xsk_buff_raw_get_data(pool, desc->addr);
+
+ vaddr = page_to_virt(page);
+ memcpy(vaddr + pool->headroom, buffer, desc->len);
+ skb = veth_build_skb(vaddr, pool->headroom, desc->len, PAGE_SIZE);
+ if (!skb) {
+ put_page(page);
+ return NULL;
+ }
+
+ skb->protocol = eth_type_trans(skb, dev);
+
+ return skb;
+}
+
+/* To call the following function, these conditions must be met:
+ * 1. The packet must be a standard Ethernet frame.
+ * 2. The packet must be eligible for batch transmission.
+ */
+static inline struct sk_buff *veth_build_skb_batch_v4(struct net_device *dev,
+ struct xsk_buff_pool *pool,
+ struct xdp_desc *desc,
+ struct veth_batch_tuple *tuple,
+ struct sk_buff *prev_skb)
+{
+ struct iphdr *iph;
+ void *buffer;
+ u64 addr;
+
+ addr = desc->addr;
+ buffer = (unsigned char *)xsk_buff_raw_get_data(pool, addr);
+ iph = (struct iphdr *)(buffer + ETH_HLEN);
+ if (!veth_batch_ip_check_v4(iph, desc->len))
+ goto normal;
+
+ switch (iph->protocol) {
+ case IPPROTO_UDP:
+ return veth_build_skb_batch_udp(dev, pool, desc, tuple, prev_skb);
+ default:
+ break;
+ }
+normal:
+ tuple->batch_enable = false;
+ return veth_build_skb_def(dev, pool, desc);
+}
+
+/* Zero copy must meet the following conditions:
+ * 1. The data of the tx desc must fit within one page.
+ * 2. The tx desc must support batch xmit, as set by userspace.
+ */
+static inline bool veth_batch_desc_check(void *buff, u32 len)
+{
+ u32 offset;
+
+ offset = offset_in_page(buff);
+ if (PAGE_SIZE - offset < len)
+ return false;
+
+ return true;
+}
+
+/* the frame must be an IPv4 or IPv6 packet */
+static inline struct sk_buff *veth_build_skb_batch(struct net_device *dev,
+ struct xsk_buff_pool *pool,
+ struct xdp_desc *desc,
+ struct veth_batch_tuple *tuple,
+ struct sk_buff *prev_skb)
+{
+ const struct ethhdr *eth;
+ void *buffer;
+
+ buffer = xsk_buff_raw_get_data(pool, desc->addr);
+ if (!veth_batch_desc_check(buffer, desc->len))
+ goto normal;
+
+ eth = (struct ethhdr *)buffer;
+ switch (ntohs(eth->h_proto)) {
+ case ETH_P_IP:
+ tuple->batch_enable = true;
+ return veth_build_skb_batch_v4(dev, pool, desc, tuple, prev_skb);
+ /* TODO: IPv6 is not supported yet; just build an skb, no batching */
+ case ETH_P_IPV6:
+ fallthrough;
+ default:
+ break;
+ }
+
+normal:
+ tuple->batch_flush = false;
+ tuple->batch_enable = false;
+ return veth_build_skb_def(dev, pool, desc);
+}
+
+/* Only IPv4 UDP batching is supported.
+ * TODO: IPv4 TCP and IPv6.
+ */
+static inline void veth_skb_batch_checksum(struct sk_buff *skb)
+{
+ struct iphdr *iph = ip_hdr(skb);
+ struct udphdr *uh = udp_hdr(skb);
+ int ip_tot_len = skb->len;
+ int udp_len = skb->len - (skb->transport_header - skb->network_header);
+
+ iph->tot_len = htons(ip_tot_len);
+ ip_send_check(iph);
+ uh->len = htons(udp_len);
+ uh->check = 0;
+
+ udp4_hwcsum(skb, iph->saddr, iph->daddr);
+}
+
static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool, int budget)
{
struct veth_priv *priv, *peer_priv;
struct net_device *dev, *peer_dev;
+ struct veth_batch_tuple tuple;
struct veth_stats stats = {};
+ struct sk_buff *prev_skb = NULL;
struct sk_buff *skb = NULL;
struct veth_rq *peer_rq;
struct xdp_desc desc;
@@ -1093,24 +1440,23 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
peer_dev = priv->peer;
peer_priv = netdev_priv(peer_dev);
- /* todo: queue index must set before this */
+ /* queue_index is set when NAPI is enabled.
+ * TODO: maybe select the rq by 5-tuple or hash.
+ */
peer_rq = &peer_priv->rq[sq->queue_index];
+ memset(&tuple, 0, sizeof(tuple));
+
+ /* set the xsk need_wakeup flag; TODO: where to clear it */
if (xsk_uses_need_wakeup(xsk_pool))
xsk_set_tx_need_wakeup(xsk_pool);
while (budget-- > 0) {
unsigned int truesize = 0;
- struct page *page;
- void *vaddr;
- void *addr;
if (!xsk_tx_peek_desc(xsk_pool, &desc))
break;
- addr = xsk_buff_raw_get_data(xsk_pool, desc.addr);
-
/* can not hold all data in a page */
truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
truesize += desc.len + xsk_pool->headroom;
@@ -1120,30 +1466,50 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
break;
}
- page = dev_alloc_page();
- if (!page) {
+ skb = veth_build_skb_batch(peer_dev, xsk_pool, &desc, &tuple, prev_skb);
+ if (!skb) {
+ stats.rx_drops++;
xsk_tx_completed_addr(xsk_pool, desc.addr);
- stats.xdp_drops++;
- break;
+ if (prev_skb != skb) {
+ napi_gro_receive(&peer_rq->xdp_napi, prev_skb);
+ prev_skb = NULL;
+ }
+ continue;
}
- vaddr = page_to_virt(page);
-
- memcpy(vaddr + xsk_pool->headroom, addr, desc.len);
- xsk_tx_completed_addr(xsk_pool, desc.addr);
- skb = veth_build_skb(vaddr, xsk_pool->headroom, desc.len, PAGE_SIZE);
- if (!skb) {
- put_page(page);
- stats.xdp_drops++;
- break;
+ if (!tuple.batch_enable) {
+ xsk_tx_completed_addr(xsk_pool, desc.addr);
+ /* flush the previous skb first to avoid out-of-order delivery */
+ if (prev_skb != skb && prev_skb) {
+ veth_skb_batch_checksum(prev_skb);
+ napi_gro_receive(&peer_rq->xdp_napi, prev_skb);
+ prev_skb = NULL;
+ }
+ napi_gro_receive(&peer_rq->xdp_napi, skb);
+ skb = NULL;
+ } else {
+ if (prev_skb && tuple.batch_flush) {
+ veth_skb_batch_checksum(prev_skb);
+ napi_gro_receive(&peer_rq->xdp_napi, prev_skb);
+ if (prev_skb == skb)
+ prev_skb = skb = NULL;
+ else
+ prev_skb = skb;
+ } else {
+ prev_skb = skb;
+ }
}
- skb->protocol = eth_type_trans(skb, peer_dev);
- napi_gro_receive(&peer_rq->xdp_napi, skb);
stats.xdp_bytes += desc.len;
done++;
}
+ /* a batched skb is still pending; hand it to peer_rq */
+ if (skb) {
+ veth_skb_batch_checksum(skb);
+ napi_gro_receive(&peer_rq->xdp_napi, skb);
+ }
+
/* release: move the consumer and wake up the producer */
if (done) {
napi_schedule(&peer_rq->xdp_napi);
--
2.20.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC v3 Optimizing veth xsk performance 9/9] veth: add support for AF_XDP tx need_wakup feature
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
` (7 preceding siblings ...)
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 8/9] veth: af_xdp tx batch support for ipv4 udp Albert Huang
@ 2023-08-08 3:19 ` Albert Huang
2023-08-08 12:01 ` [RFC v3 Optimizing veth xsk performance 0/9] Toke Høiland-Jørgensen
9 siblings, 0 replies; 14+ messages in thread
From: Albert Huang @ 2023-08-08 3:19 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
This patch adds support for the tx need_wakeup feature only.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
---
drivers/net/veth.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 70489d017b51..7c60c64ef10b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1447,9 +1447,9 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
memset(&tuple, 0, sizeof(tuple));
- /* set the xsk need_wakeup flag; TODO: where to clear it */
+ /* clear xsk wake up flag */
if (xsk_uses_need_wakeup(xsk_pool))
- xsk_set_tx_need_wakeup(xsk_pool);
+ xsk_clear_tx_need_wakeup(xsk_pool);
while (budget-- > 0) {
unsigned int truesize = 0;
@@ -1539,12 +1539,15 @@ static int veth_poll_tx(struct napi_struct *napi, int budget)
if (pool)
done = veth_xsk_tx_xmit(sq, pool, budget);
- rcu_read_unlock();
-
if (done < budget) {
+ /* set xsk wake up flag */
+ if (xsk_uses_need_wakeup(pool))
+ xsk_set_tx_need_wakeup(pool);
+
/* done < budget means the tx ring has run out of descriptors */
napi_complete_done(napi, done);
}
+ rcu_read_unlock();
return done;
}
--
2.20.1
* Re: [RFC v3 Optimizing veth xsk performance 0/9]
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
` (8 preceding siblings ...)
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 9/9] veth: add support for AF_XDP tx need_wakup feature Albert Huang
@ 2023-08-08 12:01 ` Toke Høiland-Jørgensen
2023-08-09 7:13 ` 黄杰
9 siblings, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-08-08 12:01 UTC (permalink / raw)
To: Albert Huang, davem, edumazet, kuba, pabeni
Cc: Albert Huang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path)
Albert Huang <huangjie.albert@bytedance.com> writes:
> AF_XDP is a kernel bypass technology that can greatly improve performance.
> However,for virtual devices like veth,even with the use of AF_XDP sockets,
> there are still many additional software paths that consume CPU resources.
> This patch series focuses on optimizing the performance of AF_XDP sockets
> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> Patch 5 introduces tx queue and tx napi for packet transmission, while
> patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
> add support for AF_XDP tx need_wakup feature. These optimizations significantly
> reduce the software path and support checksum offload.
>
> I tested those feature with
> A typical topology is shown below:
> client(send): server:(recv)
> veth<-->veth-peer veth1-peer<--->veth1
> 1 | | 7
> |2 6|
> | |
> bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
> 3 4 5
> (machine1) (machine2)
I definitely applaud the effort to improve the performance of af_xdp
over veth, this is something we have flagged as in need of improvement
as well.
However, looking through your patch series, I am less sure that the
approach you're taking here is the right one.
AFAIU (speaking about the TX side here), the main difference between
AF_XDP ZC and the regular transmit mode is that in the regular TX mode
the stack will allocate an skb to hold the frame and push that down the
stack. Whereas in ZC mode, there's a driver NDO that gets called
directly, bypassing the skb allocation entirely.
In this series, you're implementing the ZC mode for veth, but the driver
code ends up allocating an skb anyway. Which seems to be a bit of a
weird midpoint between the two modes, and adds a lot of complexity to
the driver that (at least conceptually) is mostly just a
reimplementation of what the stack does in non-ZC mode (allocate an skb
and push it through the stack).
So my question is, why not optimise the non-zc path in the stack instead
of implementing the zc logic for veth? It seems to me that it would be
quite feasible to apply the same optimisations (bulking, and even GRO)
to that path and achieve the same benefits, without having to add all
this complexity to the veth driver?
-Toke
* Re: Re: [RFC v3 Optimizing veth xsk performance 0/9]
2023-08-08 12:01 ` [RFC v3 Optimizing veth xsk performance 0/9] Toke Høiland-Jørgensen
@ 2023-08-09 7:13 ` 黄杰
2023-08-09 9:06 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 14+ messages in thread
From: 黄杰 @ 2023-08-09 7:13 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Pavel Begunkov, Yunsheng Lin,
Kees Cook, Richard Gobert, open list:NETWORKING DRIVERS,
open list, open list:XDP (eXpress Data Path)
Toke Høiland-Jørgensen <toke@redhat.com> 于2023年8月8日周二 20:01写道:
>
> Albert Huang <huangjie.albert@bytedance.com> writes:
>
> > AF_XDP is a kernel bypass technology that can greatly improve performance.
> > However,for virtual devices like veth,even with the use of AF_XDP sockets,
> > there are still many additional software paths that consume CPU resources.
> > This patch series focuses on optimizing the performance of AF_XDP sockets
> > for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> > Patch 5 introduces tx queue and tx napi for packet transmission, while
> > patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
> > add support for AF_XDP tx need_wakup feature. These optimizations significantly
> > reduce the software path and support checksum offload.
> >
> > I tested those feature with
> > A typical topology is shown below:
> > client(send): server:(recv)
> > veth<-->veth-peer veth1-peer<--->veth1
> > 1 | | 7
> > |2 6|
> > | |
> > bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
> > 3 4 5
> > (machine1) (machine2)
>
> I definitely applaud the effort to improve the performance of af_xdp
> over veth, this is something we have flagged as in need of improvement
> as well.
>
> However, looking through your patch series, I am less sure that the
> approach you're taking here is the right one.
>
> AFAIU (speaking about the TX side here), the main difference between
> AF_XDP ZC and the regular transmit mode is that in the regular TX mode
> the stack will allocate an skb to hold the frame and push that down the
> stack. Whereas in ZC mode, there's a driver NDO that gets called
> directly, bypassing the skb allocation entirely.
>
> In this series, you're implementing the ZC mode for veth, but the driver
> code ends up allocating an skb anyway. Which seems to be a bit of a
> weird midpoint between the two modes, and adds a lot of complexity to
> the driver that (at least conceptually) is mostly just a
> reimplementation of what the stack does in non-ZC mode (allocate an skb
> and push it through the stack).
>
> So my question is, why not optimise the non-zc path in the stack instead
> of implementing the zc logic for veth? It seems to me that it would be
> quite feasible to apply the same optimisations (bulking, and even GRO)
> to that path and achieve the same benefits, without having to add all
> this complexity to the veth driver?
>
> -Toke
>
thanks!
This idea is really good indeed. You've reminded me, and that's
something I overlooked. I will now consider implementing the solution
you've proposed and test the performance enhancement.
Albert.
* Re: Re: [RFC v3 Optimizing veth xsk performance 0/9]
2023-08-09 7:13 ` 黄杰
@ 2023-08-09 9:06 ` Toke Høiland-Jørgensen
2023-08-09 11:09 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-08-09 9:06 UTC (permalink / raw)
To: 黄杰
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Pavel Begunkov, Yunsheng Lin,
Kees Cook, Richard Gobert, open list:NETWORKING DRIVERS,
open list, open list:XDP (eXpress Data Path)
黄杰 <huangjie.albert@bytedance.com> writes:
> Toke Høiland-Jørgensen <toke@redhat.com> 于2023年8月8日周二 20:01写道:
>>
>> Albert Huang <huangjie.albert@bytedance.com> writes:
>>
>> > AF_XDP is a kernel bypass technology that can greatly improve performance.
>> > However,for virtual devices like veth,even with the use of AF_XDP sockets,
>> > there are still many additional software paths that consume CPU resources.
>> > This patch series focuses on optimizing the performance of AF_XDP sockets
>> > for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
>> > Patch 5 introduces tx queue and tx napi for packet transmission, while
>> > patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
>> > add support for AF_XDP tx need_wakup feature. These optimizations significantly
>> > reduce the software path and support checksum offload.
>> >
>> > I tested those feature with
>> > A typical topology is shown below:
>> > client(send): server:(recv)
>> > veth<-->veth-peer veth1-peer<--->veth1
>> > 1 | | 7
>> > |2 6|
>> > | |
>> > bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>> > 3 4 5
>> > (machine1) (machine2)
>>
>> I definitely applaud the effort to improve the performance of af_xdp
>> over veth, this is something we have flagged as in need of improvement
>> as well.
>>
>> However, looking through your patch series, I am less sure that the
>> approach you're taking here is the right one.
>>
>> AFAIU (speaking about the TX side here), the main difference between
>> AF_XDP ZC and the regular transmit mode is that in the regular TX mode
>> the stack will allocate an skb to hold the frame and push that down the
>> stack. Whereas in ZC mode, there's a driver NDO that gets called
>> directly, bypassing the skb allocation entirely.
>>
>> In this series, you're implementing the ZC mode for veth, but the driver
>> code ends up allocating an skb anyway. Which seems to be a bit of a
>> weird midpoint between the two modes, and adds a lot of complexity to
>> the driver that (at least conceptually) is mostly just a
>> reimplementation of what the stack does in non-ZC mode (allocate an skb
>> and push it through the stack).
>>
>> So my question is, why not optimise the non-zc path in the stack instead
>> of implementing the zc logic for veth? It seems to me that it would be
>> quite feasible to apply the same optimisations (bulking, and even GRO)
>> to that path and achieve the same benefits, without having to add all
>> this complexity to the veth driver?
>>
>> -Toke
>>
> thanks!
> This idea is really good indeed. You've reminded me, and that's
> something I overlooked. I will now consider implementing the solution
> you've proposed and test the performance enhancement.
Sounds good, thanks! :)
-Toke
* Re: [RFC v3 Optimizing veth xsk performance 0/9]
2023-08-09 9:06 ` Toke Høiland-Jørgensen
@ 2023-08-09 11:09 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2023-08-09 11:09 UTC (permalink / raw)
To: 黄杰, Björn Töpel, Magnus Karlsson,
Maryam Tahhan
Cc: Toke Høiland-Jørgensen, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Maciej Fijalkowski,
Jonathan Lemon, Pavel Begunkov, Yunsheng Lin, Kees Cook,
Richard Gobert, open list:NETWORKING DRIVERS, open list,
open list:XDP (eXpress Data Path), Donald Hunter, Dave Tucker
On 09/08/2023 11.06, Toke Høiland-Jørgensen wrote:
> 黄杰 <huangjie.albert@bytedance.com> writes:
>
>> Toke Høiland-Jørgensen <toke@redhat.com> 于2023年8月8日周二 20:01写道:
>>>
>>> Albert Huang <huangjie.albert@bytedance.com> writes:
>>>
>>>> AF_XDP is a kernel bypass technology that can greatly improve performance.
>>>> However,for virtual devices like veth,even with the use of AF_XDP sockets,
>>>> there are still many additional software paths that consume CPU resources.
>>>> This patch series focuses on optimizing the performance of AF_XDP sockets
>>>> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
>>>> Patch 5 introduces tx queue and tx napi for packet transmission, while
>>>> patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
>>>> add support for AF_XDP tx need_wakup feature. These optimizations significantly
>>>> reduce the software path and support checksum offload.
>>>>
>>>> I tested those feature with
>>>> A typical topology is shown below:
>>>> client(send): server:(recv)
>>>> veth<-->veth-peer veth1-peer<--->veth1
>>>> 1 | | 7
>>>> |2 6|
>>>> | |
>>>> bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>>>> 3 4 5
>>>> (machine1) (machine2)
>>>
>>> I definitely applaud the effort to improve the performance of af_xdp
>>> over veth, this is something we have flagged as in need of improvement
>>> as well.
>>>
>>> However, looking through your patch series, I am less sure that the
>>> approach you're taking here is the right one.
>>>
>>> AFAIU (speaking about the TX side here), the main difference between
>>> AF_XDP ZC and the regular transmit mode is that in the regular TX mode
>>> the stack will allocate an skb to hold the frame and push that down the
>>> stack. Whereas in ZC mode, there's a driver NDO that gets called
>>> directly, bypassing the skb allocation entirely.
>>>
>>> In this series, you're implementing the ZC mode for veth, but the driver
>>> code ends up allocating an skb anyway. Which seems to be a bit of a
>>> weird midpoint between the two modes, and adds a lot of complexity to
>>> the driver that (at least conceptually) is mostly just a
>>> reimplementation of what the stack does in non-ZC mode (allocate an skb
>>> and push it through the stack).
>>>
>>> So my question is, why not optimise the non-zc path in the stack instead
>>> of implementing the zc logic for veth? It seems to me that it would be
>>> quite feasible to apply the same optimisations (bulking, and even GRO)
>>> to that path and achieve the same benefits, without having to add all
>>> this complexity to the veth driver?
>>>
>>> -Toke
>>>
>> thanks!
>> This idea is really good indeed. You've reminded me, and that's
>> something I overlooked. I will now consider implementing the solution
>> you've proposed and test the performance enhancement.
>
> Sounds good, thanks! :)
Good to hear, that you want to optimize the non-zc TX path of AF_XDP, as
Toke suggests.
There are a number of performance issues for AF_XDP non-zc TX that I've
talked/complained to Magnus and Bjørn about over the years.
I've recently started to work on fixing these myself, in collaboration
with Maryam (cc).
The most obvious is that non-zc TX uses socket memory accounting for the
SKBs that get allocated (ZC TX obviously doesn't). IMHO this doesn't
make sense, as the AF_XDP concept is to pre-allocate memory, so AF_XDP
memory limits are already bounded at setup time. Furthermore,
__xsk_generic_xmit() already has a backpressure mechanism based on
available room in the CQ (Completion Queue). Hint: the call
sock_alloc_send_skb() includes/does socket mem accounting.
When AF_XDP gets combined with veth (or other layered software devices),
the problem gets worse, because:
(1) the SKB that gets allocated by xsk_build_skb() doesn't have enough
headroom to satisfy the XDP headroom requirement (XDP_PACKET_HEADROOM).
(2) the backing memory type from sock_alloc_send_skb() is not
compatible with generic/veth XDP.
Both of these issues mean that when the peer veth device receives the
(AF_XDP) TX packet, it has to reallocate memory+SKB and copy the data *again*.
I'm currently[1] looking into how to fix this and have some PoC patches
to estimate the performance benefit from avoiding the realloc when
entering veth. With packet size 512, the numbers start at 828 Kpps and
afterwards increase to 1002 Kpps (an increase of 20%, or 208 nanoseconds per packet).
[1]
https://github.com/xdp-project/xdp-project/blob/veth-benchmark01/areas/core/veth_benchmark03.org
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
end of thread, other threads:[~2023-08-09 11:09 UTC | newest]
Thread overview: 14+ messages
2023-08-08 3:19 [RFC v3 Optimizing veth xsk performance 0/9] Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 1/9] veth: Implement ethtool's get_ringparam() callback Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 2/9] xsk: add dma_check_skip for skipping dma check Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 3/9] veth: add support for send queue Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 4/9] xsk: add xsk_tx_completed_addr function Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 5/9] veth: use send queue tx napi to xmit xsk tx desc Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 6/9] veth: add ndo_xsk_wakeup callback for veth Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 7/9] sk_buff: add destructor_arg_xsk_pool for zero copy Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 8/9] veth: af_xdp tx batch support for ipv4 udp Albert Huang
2023-08-08 3:19 ` [RFC v3 Optimizing veth xsk performance 9/9] veth: add support for AF_XDP tx need_wakup feature Albert Huang
2023-08-08 12:01 ` [RFC v3 Optimizing veth xsk performance 0/9] Toke Høiland-Jørgensen
2023-08-09 7:13 ` 黄杰
2023-08-09 9:06 ` Toke Høiland-Jørgensen
2023-08-09 11:09 ` Jesper Dangaard Brouer