netdev.vger.kernel.org archive mirror
* [PATCH net-next 00/16] idpf: add XDP support
@ 2025-03-05 16:21 Alexander Lobakin
  2025-03-05 16:21 ` [PATCH net-next 01/16] libeth: convert to netmem Alexander Lobakin
                   ` (16 more replies)
  0 siblings, 17 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

Add XDP support (w/o XSk yet) to the idpf driver using the libeth_xdp
sublib, which will then be reused in at least iavf and ice.

In general, nothing outstanding compared to ice, except performance --
say, up to 2x for .ndo_xdp_xmit() on certain platforms and scenarios.
libeth_xdp doesn't reinvent the wheel: it mostly accumulates and
optimizes what was already done before, so that neither the wheel nor
its bugs get copied over and over again.

idpf doesn't support VLAN Rx offload, so only the hash hint is present
for now.

Alexander Lobakin (12):
  libeth: convert to netmem
  libeth: support native XDP and register memory model
  libeth: add a couple of XDP helpers (libeth_xdp)
  libeth: add XSk helpers
  idpf: fix Rx descriptor ready check barrier in splitq
  idpf: use a saner limit for default number of queues to allocate
  idpf: link NAPIs to queues
  idpf: add support for nointerrupt queues
  idpf: use generic functions to build xdp_buff and skb
  idpf: add support for XDP on Rx
  idpf: add support for .ndo_xdp_xmit()
  idpf: add XDP RSS hash hint

Michal Kubiak (4):
  idpf: make complq cleaning dependent on scheduling mode
  idpf: remove SW marker handling from NAPI
  idpf: prepare structures to support XDP
  idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq

 drivers/net/ethernet/intel/idpf/Kconfig       |    2 +-
 drivers/net/ethernet/intel/libeth/Kconfig     |   10 +-
 drivers/net/ethernet/intel/idpf/Makefile      |    2 +
 drivers/net/ethernet/intel/libeth/Makefile    |    8 +-
 include/net/libeth/types.h                    |  106 +-
 drivers/net/ethernet/intel/idpf/idpf.h        |   35 +-
 .../net/ethernet/intel/idpf/idpf_lan_txrx.h   |    6 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  126 +-
 drivers/net/ethernet/intel/idpf/xdp.h         |  180 ++
 drivers/net/ethernet/intel/libeth/priv.h      |   37 +
 include/net/libeth/rx.h                       |   28 +-
 include/net/libeth/tx.h                       |   36 +-
 include/net/libeth/xdp.h                      | 1869 +++++++++++++++++
 include/net/libeth/xsk.h                      |  685 ++++++
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   |   14 +-
 drivers/net/ethernet/intel/idpf/idpf_dev.c    |   11 +-
 .../net/ethernet/intel/idpf/idpf_ethtool.c    |    6 +-
 drivers/net/ethernet/intel/idpf/idpf_lib.c    |   29 +-
 drivers/net/ethernet/intel/idpf/idpf_main.c   |    1 +
 .../ethernet/intel/idpf/idpf_singleq_txrx.c   |  111 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.c   |  678 +++---
 drivers/net/ethernet/intel/idpf/idpf_vf_dev.c |   11 +-
 .../net/ethernet/intel/idpf/idpf_virtchnl.c   |  113 +-
 drivers/net/ethernet/intel/idpf/xdp.c         |  509 +++++
 drivers/net/ethernet/intel/libeth/rx.c        |   40 +-
 drivers/net/ethernet/intel/libeth/tx.c        |   41 +
 drivers/net/ethernet/intel/libeth/xdp.c       |  449 ++++
 drivers/net/ethernet/intel/libeth/xsk.c       |  269 +++
 28 files changed, 4925 insertions(+), 487 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/idpf/xdp.h
 create mode 100644 drivers/net/ethernet/intel/libeth/priv.h
 create mode 100644 include/net/libeth/xdp.h
 create mode 100644 include/net/libeth/xsk.h
 create mode 100644 drivers/net/ethernet/intel/idpf/xdp.c
 create mode 100644 drivers/net/ethernet/intel/libeth/tx.c
 create mode 100644 drivers/net/ethernet/intel/libeth/xdp.c
 create mode 100644 drivers/net/ethernet/intel/libeth/xsk.c

---
Sending in one batch to introduce/show both the lib and its user.
Let me know if it would be better to split.
-- 
2.48.1



* [PATCH net-next 01/16] libeth: convert to netmem
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-06  0:13   ` Mina Almasry
  2025-03-05 16:21 ` [PATCH net-next 02/16] libeth: support native XDP and register memory model Alexander Lobakin
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

Back when the libeth Rx core was initially written, devmem was a draft
and netmem_ref didn't exist in the mainline. Now that it's here, make
libeth MP-agnostic before introducing any new code or any new library
users.

When the created PP/FQ is known to be for header buffers, use the
faster "unsafe" underscored netmem <--> virt accessors: in that case,
netmem_is_net_iov() is always false, but the check still consumes some
cycles (bit test + true branch).

Misc: replace explicit EXPORT_SYMBOL_NS_GPL("NS") with
DEFAULT_SYMBOL_NAMESPACE.
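
To illustrate why the underscored accessors are safe for header
buffers, here's a minimal hypothetical sketch (not part of the patch;
example_hdr_fqe_to_va() is made up) of getting a virtual address from
a header FQE, assuming header FQs are always backed by system memory:

/* relies on <net/libeth/rx.h>, which pulls in the netmem helpers */
static void *example_hdr_fqe_to_va(const struct libeth_fqe *fqe)
{
	/* header FQs never hold net_iov memory, so skip the
	 * netmem_is_net_iov() bit test via the underscored accessor
	 */
	struct page *page = __netmem_to_page(fqe->netmem);

	return page_address(page) + fqe->offset;
}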

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/libeth/rx.h                       | 22 +++++++------
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   | 14 ++++----
 .../ethernet/intel/idpf/idpf_singleq_txrx.c   |  2 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.c   | 33 +++++++++++--------
 drivers/net/ethernet/intel/libeth/rx.c        | 20 ++++++-----
 5 files changed, 51 insertions(+), 40 deletions(-)

diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h
index ab05024be518..7d5dc58984b1 100644
--- a/include/net/libeth/rx.h
+++ b/include/net/libeth/rx.h
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
-/* Copyright (C) 2024 Intel Corporation */
+/* Copyright (C) 2024-2025 Intel Corporation */
 
 #ifndef __LIBETH_RX_H
 #define __LIBETH_RX_H
@@ -31,7 +31,7 @@
 
 /**
  * struct libeth_fqe - structure representing an Rx buffer (fill queue element)
- * @page: page holding the buffer
+ * @netmem: network memory reference holding the buffer
  * @offset: offset from the page start (to the headroom)
  * @truesize: total space occupied by the buffer (w/ headroom and tailroom)
  *
@@ -40,7 +40,7 @@
  * former, @offset is always 0 and @truesize is always ```PAGE_SIZE```.
  */
 struct libeth_fqe {
-	struct page		*page;
+	netmem_ref		netmem;
 	u32			offset;
 	u32			truesize;
 } __aligned_largest;
@@ -102,15 +102,16 @@ static inline dma_addr_t libeth_rx_alloc(const struct libeth_fq_fp *fq, u32 i)
 	struct libeth_fqe *buf = &fq->fqes[i];
 
 	buf->truesize = fq->truesize;
-	buf->page = page_pool_dev_alloc(fq->pp, &buf->offset, &buf->truesize);
-	if (unlikely(!buf->page))
+	buf->netmem = page_pool_dev_alloc_netmem(fq->pp, &buf->offset,
+						 &buf->truesize);
+	if (unlikely(!buf->netmem))
 		return DMA_MAPPING_ERROR;
 
-	return page_pool_get_dma_addr(buf->page) + buf->offset +
+	return page_pool_get_dma_addr_netmem(buf->netmem) + buf->offset +
 	       fq->pp->p.offset;
 }
 
-void libeth_rx_recycle_slow(struct page *page);
+void libeth_rx_recycle_slow(netmem_ref netmem);
 
 /**
  * libeth_rx_sync_for_cpu - synchronize or recycle buffer post DMA
@@ -126,18 +127,19 @@ void libeth_rx_recycle_slow(struct page *page);
 static inline bool libeth_rx_sync_for_cpu(const struct libeth_fqe *fqe,
 					  u32 len)
 {
-	struct page *page = fqe->page;
+	netmem_ref netmem = fqe->netmem;
 
 	/* Very rare, but possible case. The most common reason:
 	 * the last fragment contained FCS only, which was then
 	 * stripped by the HW.
 	 */
 	if (unlikely(!len)) {
-		libeth_rx_recycle_slow(page);
+		libeth_rx_recycle_slow(netmem);
 		return false;
 	}
 
-	page_pool_dma_sync_for_cpu(page->pp, page, fqe->offset, len);
+	page_pool_dma_sync_netmem_for_cpu(netmem_get_pp(netmem), netmem,
+					  fqe->offset, len);
 
 	return true;
 }
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 422312b8b54a..35d353d38129 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -723,7 +723,7 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
 	for (u32 i = rx_ring->next_to_clean; i != rx_ring->next_to_use; ) {
 		const struct libeth_fqe *rx_fqes = &rx_ring->rx_fqes[i];
 
-		page_pool_put_full_page(rx_ring->pp, rx_fqes->page, false);
+		libeth_rx_recycle_slow(rx_fqes->netmem);
 
 		if (unlikely(++i == rx_ring->count))
 			i = 0;
@@ -1197,10 +1197,11 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
 			     const struct libeth_fqe *rx_buffer,
 			     unsigned int size)
 {
-	u32 hr = rx_buffer->page->pp->p.offset;
+	u32 hr = netmem_get_pp(rx_buffer->netmem)->p.offset;
 
-	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
-			rx_buffer->offset + hr, size, rx_buffer->truesize);
+	skb_add_rx_frag_netmem(skb, skb_shinfo(skb)->nr_frags,
+			       rx_buffer->netmem, rx_buffer->offset + hr,
+			       size, rx_buffer->truesize);
 }
 
 /**
@@ -1214,12 +1215,13 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
 static struct sk_buff *iavf_build_skb(const struct libeth_fqe *rx_buffer,
 				      unsigned int size)
 {
-	u32 hr = rx_buffer->page->pp->p.offset;
+	struct page *buf_page = __netmem_to_page(rx_buffer->netmem);
+	u32 hr = buf_page->pp->p.offset;
 	struct sk_buff *skb;
 	void *va;
 
 	/* prefetch first cache line of first page */
-	va = page_address(rx_buffer->page) + rx_buffer->offset;
+	va = page_address(buf_page) + rx_buffer->offset;
 	net_prefetch(va + hr);
 
 	/* build an skb around the page buffer */
diff --git a/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
index eae1b6f474e6..aeb2ca5f5a0a 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
@@ -1009,7 +1009,7 @@ static int idpf_rx_singleq_clean(struct idpf_rx_queue *rx_q, int budget)
 			break;
 
 skip_data:
-		rx_buf->page = NULL;
+		rx_buf->netmem = 0;
 
 		IDPF_SINGLEQ_BUMP_RING_IDX(rx_q, ntc);
 		cleaned_count++;
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index bdf52cef3891..6254806c2072 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -382,12 +382,12 @@ static int idpf_tx_desc_alloc_all(struct idpf_vport *vport)
  */
 static void idpf_rx_page_rel(struct libeth_fqe *rx_buf)
 {
-	if (unlikely(!rx_buf->page))
+	if (unlikely(!rx_buf->netmem))
 		return;
 
-	page_pool_put_full_page(rx_buf->page->pp, rx_buf->page, false);
+	libeth_rx_recycle_slow(rx_buf->netmem);
 
-	rx_buf->page = NULL;
+	rx_buf->netmem = 0;
 	rx_buf->offset = 0;
 }
 
@@ -3096,10 +3096,10 @@ idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 void idpf_rx_add_frag(struct idpf_rx_buf *rx_buf, struct sk_buff *skb,
 		      unsigned int size)
 {
-	u32 hr = rx_buf->page->pp->p.offset;
+	u32 hr = netmem_get_pp(rx_buf->netmem)->p.offset;
 
-	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buf->page,
-			rx_buf->offset + hr, size, rx_buf->truesize);
+	skb_add_rx_frag_netmem(skb, skb_shinfo(skb)->nr_frags, rx_buf->netmem,
+			       rx_buf->offset + hr, size, rx_buf->truesize);
 }
 
 /**
@@ -3122,16 +3122,20 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
 			     struct libeth_fqe *buf, u32 data_len)
 {
 	u32 copy = data_len <= L1_CACHE_BYTES ? data_len : ETH_HLEN;
+	struct page *hdr_page, *buf_page;
 	const void *src;
 	void *dst;
 
-	if (!libeth_rx_sync_for_cpu(buf, copy))
+	if (unlikely(netmem_is_net_iov(buf->netmem)) ||
+	    !libeth_rx_sync_for_cpu(buf, copy))
 		return 0;
 
-	dst = page_address(hdr->page) + hdr->offset + hdr->page->pp->p.offset;
-	src = page_address(buf->page) + buf->offset + buf->page->pp->p.offset;
-	memcpy(dst, src, LARGEST_ALIGN(copy));
+	hdr_page = __netmem_to_page(hdr->netmem);
+	buf_page = __netmem_to_page(buf->netmem);
+	dst = page_address(hdr_page) + hdr->offset + hdr_page->pp->p.offset;
+	src = page_address(buf_page) + buf->offset + buf_page->pp->p.offset;
 
+	memcpy(dst, src, LARGEST_ALIGN(copy));
 	buf->offset += copy;
 
 	return copy;
@@ -3147,11 +3151,12 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
  */
 struct sk_buff *idpf_rx_build_skb(const struct libeth_fqe *buf, u32 size)
 {
-	u32 hr = buf->page->pp->p.offset;
+	struct page *buf_page = __netmem_to_page(buf->netmem);
+	u32 hr = buf_page->pp->p.offset;
 	struct sk_buff *skb;
 	void *va;
 
-	va = page_address(buf->page) + buf->offset;
+	va = page_address(buf_page) + buf->offset;
 	prefetch(va + hr);
 
 	skb = napi_build_skb(va, buf->truesize);
@@ -3302,7 +3307,7 @@ static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
 			u64_stats_update_end(&rxq->stats_sync);
 		}
 
-		hdr->page = NULL;
+		hdr->netmem = 0;
 
 payload:
 		if (!libeth_rx_sync_for_cpu(rx_buf, pkt_len))
@@ -3318,7 +3323,7 @@ static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
 			break;
 
 skip_data:
-		rx_buf->page = NULL;
+		rx_buf->netmem = 0;
 
 		idpf_rx_post_buf_refill(refillq, buf_id);
 		IDPF_RX_BUMP_NTC(rxq, ntc);
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index 66d1d23b8ad2..aa5d878181f7 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
-/* Copyright (C) 2024 Intel Corporation */
+/* Copyright (C) 2024-2025 Intel Corporation */
+
+#define DEFAULT_SYMBOL_NAMESPACE	"LIBETH"
 
 #include <net/libeth/rx.h>
 
@@ -186,7 +188,7 @@ int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi)
 
 	return -ENOMEM;
 }
-EXPORT_SYMBOL_NS_GPL(libeth_rx_fq_create, "LIBETH");
+EXPORT_SYMBOL_GPL(libeth_rx_fq_create);
 
 /**
  * libeth_rx_fq_destroy - destroy a &page_pool created by libeth
@@ -197,19 +199,19 @@ void libeth_rx_fq_destroy(struct libeth_fq *fq)
 	kvfree(fq->fqes);
 	page_pool_destroy(fq->pp);
 }
-EXPORT_SYMBOL_NS_GPL(libeth_rx_fq_destroy, "LIBETH");
+EXPORT_SYMBOL_GPL(libeth_rx_fq_destroy);
 
 /**
- * libeth_rx_recycle_slow - recycle a libeth page from the NAPI context
- * @page: page to recycle
+ * libeth_rx_recycle_slow - recycle libeth netmem
+ * @netmem: network memory to recycle
  *
  * To be used on exceptions or rare cases not requiring fast inline recycling.
  */
-void libeth_rx_recycle_slow(struct page *page)
+void __cold libeth_rx_recycle_slow(netmem_ref netmem)
 {
-	page_pool_recycle_direct(page->pp, page);
+	page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, false);
 }
-EXPORT_SYMBOL_NS_GPL(libeth_rx_recycle_slow, "LIBETH");
+EXPORT_SYMBOL_GPL(libeth_rx_recycle_slow);
 
 /* Converting abstract packet type numbers into a software structure with
  * the packet parameters to do O(1) lookup on Rx.
@@ -251,7 +253,7 @@ void libeth_rx_pt_gen_hash_type(struct libeth_rx_pt *pt)
 	pt->hash_type |= libeth_rx_pt_xdp_iprot[pt->inner_prot];
 	pt->hash_type |= libeth_rx_pt_xdp_pl[pt->payload_layer];
 }
-EXPORT_SYMBOL_NS_GPL(libeth_rx_pt_gen_hash_type, "LIBETH");
+EXPORT_SYMBOL_GPL(libeth_rx_pt_gen_hash_type);
 
 /* Module */
 
-- 
2.48.1



* [PATCH net-next 02/16] libeth: support native XDP and register memory model
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
  2025-03-05 16:21 ` [PATCH net-next 01/16] libeth: convert to netmem Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-05 16:21 ` [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp) Alexander Lobakin
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

Expand libeth's Page Pool functionality by adding native XDP support.
This means picking the appropriate headroom and DMA direction.
Also, register all the created &page_pools as XDP memory models.
A driver can then call xdp_rxq_info_attach_page_pool() when registering
its RxQ info.
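
For reference, a hypothetical driver-side sketch (rxq, netdev and the
field values are illustrative, error unwinding omitted) of creating an
XDP-enabled FQ and attaching its page_pool to the RxQ info:

	struct libeth_fq fq = {
		.count	= rxq->desc_count,
		.nid	= NUMA_NO_NODE,
		.xdp	= true,	/* XDP headroom + DMA_BIDIRECTIONAL */
	};
	int err;

	err = libeth_rx_fq_create(&fq, &rxq->napi);
	if (err)
		return err;

	err = xdp_rxq_info_reg(&rxq->xdp_rxq, netdev, rxq->idx,
			       rxq->napi.napi_id);
	if (err)
		return err;

	/* let the XDP core know Rx buffers come from this Page Pool */
	xdp_rxq_info_attach_page_pool(&rxq->xdp_rxq, fq.pp);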

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/libeth/rx.h                |  6 +++++-
 drivers/net/ethernet/intel/libeth/rx.c | 20 +++++++++++++++-----
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h
index 7d5dc58984b1..5d991404845e 100644
--- a/include/net/libeth/rx.h
+++ b/include/net/libeth/rx.h
@@ -13,8 +13,10 @@
 
 /* Space reserved in front of each frame */
 #define LIBETH_SKB_HEADROOM	(NET_SKB_PAD + NET_IP_ALIGN)
+#define LIBETH_XDP_HEADROOM	(ALIGN(XDP_PACKET_HEADROOM, NET_SKB_PAD) + \
+				 NET_IP_ALIGN)
 /* Maximum headroom for worst-case calculations */
-#define LIBETH_MAX_HEADROOM	LIBETH_SKB_HEADROOM
+#define LIBETH_MAX_HEADROOM	LIBETH_XDP_HEADROOM
 /* Link layer / L2 overhead: Ethernet, 2 VLAN tags (C + S), FCS */
 #define LIBETH_RX_LL_LEN	(ETH_HLEN + 2 * VLAN_HLEN + ETH_FCS_LEN)
 /* Maximum supported L2-L4 header length */
@@ -66,6 +68,7 @@ enum libeth_fqe_type {
  * @count: number of descriptors/buffers the queue has
  * @type: type of the buffers this queue has
  * @hsplit: flag whether header split is enabled
+ * @xdp: flag indicating whether XDP is enabled
  * @buf_len: HW-writeable length per each buffer
  * @nid: ID of the closest NUMA node with memory
  */
@@ -81,6 +84,7 @@ struct libeth_fq {
 	/* Cold fields */
 	enum libeth_fqe_type	type:2;
 	bool			hsplit:1;
+	bool			xdp:1;
 
 	u32			buf_len;
 	int			nid;
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index aa5d878181f7..c0be9cb043a1 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -70,7 +70,7 @@ static u32 libeth_rx_hw_len_truesize(const struct page_pool_params *pp,
 static bool libeth_rx_page_pool_params(struct libeth_fq *fq,
 				       struct page_pool_params *pp)
 {
-	pp->offset = LIBETH_SKB_HEADROOM;
+	pp->offset = fq->xdp ? LIBETH_XDP_HEADROOM : LIBETH_SKB_HEADROOM;
 	/* HW-writeable / syncable length per one page */
 	pp->max_len = LIBETH_RX_PAGE_LEN(pp->offset);
 
@@ -157,11 +157,12 @@ int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi)
 		.dev		= napi->dev->dev.parent,
 		.netdev		= napi->dev,
 		.napi		= napi,
-		.dma_dir	= DMA_FROM_DEVICE,
 	};
 	struct libeth_fqe *fqes;
 	struct page_pool *pool;
-	bool ret;
+	int ret;
+
+	pp.dma_dir = fq->xdp ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 
 	if (!fq->hsplit)
 		ret = libeth_rx_page_pool_params(fq, &pp);
@@ -175,18 +176,26 @@ int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi)
 		return PTR_ERR(pool);
 
 	fqes = kvcalloc_node(fq->count, sizeof(*fqes), GFP_KERNEL, fq->nid);
-	if (!fqes)
+	if (!fqes) {
+		ret = -ENOMEM;
 		goto err_buf;
+	}
+
+	ret = xdp_reg_page_pool(pool);
+	if (ret)
+		goto err_mem;
 
 	fq->fqes = fqes;
 	fq->pp = pool;
 
 	return 0;
 
+err_mem:
+	kvfree(fqes);
 err_buf:
 	page_pool_destroy(pool);
 
-	return -ENOMEM;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(libeth_rx_fq_create);
 
@@ -196,6 +205,7 @@ EXPORT_SYMBOL_GPL(libeth_rx_fq_create);
  */
 void libeth_rx_fq_destroy(struct libeth_fq *fq)
 {
+	xdp_unreg_page_pool(fq->pp);
 	kvfree(fq->fqes);
 	page_pool_destroy(fq->pp);
 }
-- 
2.48.1



* [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp)
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
  2025-03-05 16:21 ` [PATCH net-next 01/16] libeth: convert to netmem Alexander Lobakin
  2025-03-05 16:21 ` [PATCH net-next 02/16] libeth: support native XDP and register memory model Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-11 14:05   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 04/16] libeth: add XSk helpers Alexander Lobakin
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

"Couple" is a bit humbly... Add the following functionality to libeth:

* XDP shared queues managing
* XDP_TX bulk sending infra
* .ndo_xdp_xmit() infra
* adding buffers to &xdp_buff
* running XDP prog and managing its verdict
* completing XDP Tx buffers
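
As the promised usage sketch (hypothetical driver code, not part of the
patch): the ``XDP_TX`` flush callback a driver provides boils down to
wiring its own prepare/xmit helpers into libeth_xdp_tx_flush_bulk();
my_xdpsq_prep() and my_xmit_desc() are assumed driver callbacks that
fill &libeth_xdpsq and one HW Tx descriptor respectively:

static bool my_xdp_tx_flush_bulk(struct libeth_xdp_tx_bulk *bq, u32 flags)
{
	/* libeth_xdp_tx_fill_buf() builds &libeth_sqe and the abstract
	 * descriptor, my_xmit_desc() translates it into a HW one
	 */
	return libeth_xdp_tx_flush_bulk(bq, flags, my_xdpsq_prep,
					my_xmit_desc);
}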

Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # lots of stuff
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/libeth/Kconfig  |   10 +-
 drivers/net/ethernet/intel/libeth/Makefile |    7 +-
 include/net/libeth/types.h                 |  106 +-
 drivers/net/ethernet/intel/libeth/priv.h   |   26 +
 include/net/libeth/tx.h                    |   30 +-
 include/net/libeth/xdp.h                   | 1827 ++++++++++++++++++++
 drivers/net/ethernet/intel/libeth/tx.c     |   38 +
 drivers/net/ethernet/intel/libeth/xdp.c    |  431 +++++
 8 files changed, 2467 insertions(+), 8 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/libeth/priv.h
 create mode 100644 include/net/libeth/xdp.h
 create mode 100644 drivers/net/ethernet/intel/libeth/tx.c
 create mode 100644 drivers/net/ethernet/intel/libeth/xdp.c

diff --git a/drivers/net/ethernet/intel/libeth/Kconfig b/drivers/net/ethernet/intel/libeth/Kconfig
index 480293b71dbc..d8c4926574fb 100644
--- a/drivers/net/ethernet/intel/libeth/Kconfig
+++ b/drivers/net/ethernet/intel/libeth/Kconfig
@@ -1,9 +1,15 @@
 # SPDX-License-Identifier: GPL-2.0-only
-# Copyright (C) 2024 Intel Corporation
+# Copyright (C) 2024-2025 Intel Corporation
 
 config LIBETH
-	tristate
+	tristate "Common Ethernet library (libeth)" if COMPILE_TEST
 	select PAGE_POOL
 	help
 	  libeth is a common library containing routines shared between several
 	  drivers, but not yet promoted to the generic kernel API.
+
+config LIBETH_XDP
+	tristate "Common XDP library (libeth_xdp)" if COMPILE_TEST
+	select LIBETH
+	help
+	  XDP helpers based on libeth hotpath management.
diff --git a/drivers/net/ethernet/intel/libeth/Makefile b/drivers/net/ethernet/intel/libeth/Makefile
index 52492b081132..51669840ee06 100644
--- a/drivers/net/ethernet/intel/libeth/Makefile
+++ b/drivers/net/ethernet/intel/libeth/Makefile
@@ -1,6 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0-only
-# Copyright (C) 2024 Intel Corporation
+# Copyright (C) 2024-2025 Intel Corporation
 
 obj-$(CONFIG_LIBETH)		+= libeth.o
 
 libeth-y			:= rx.o
+libeth-y			+= tx.o
+
+obj-$(CONFIG_LIBETH_XDP)	+= libeth_xdp.o
+
+libeth_xdp-y			+= xdp.o
diff --git a/include/net/libeth/types.h b/include/net/libeth/types.h
index 603825e45133..cf1d78a9dc38 100644
--- a/include/net/libeth/types.h
+++ b/include/net/libeth/types.h
@@ -1,10 +1,32 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
-/* Copyright (C) 2024 Intel Corporation */
+/* Copyright (C) 2024-2025 Intel Corporation */
 
 #ifndef __LIBETH_TYPES_H
 #define __LIBETH_TYPES_H
 
-#include <linux/types.h>
+#include <linux/workqueue.h>
+
+/* Stats */
+
+/**
+ * struct libeth_rq_napi_stats - "hot" counters to update in Rx polling loop
+ * @packets: received frames counter
+ * @bytes: sum of bytes of received frames above
+ * @fragments: sum of fragments of received S/G frames
+ * @hsplit: number of frames the device performed the header split for
+ * @raw: alias to access all the fields as an array
+ */
+struct libeth_rq_napi_stats {
+	union {
+		struct {
+							u32 packets;
+							u32 bytes;
+							u32 fragments;
+							u32 hsplit;
+		};
+		DECLARE_FLEX_ARRAY(u32, raw);
+	};
+};
 
 /**
  * struct libeth_sq_napi_stats - "hot" counters to update in Tx completion loop
@@ -22,4 +44,84 @@ struct libeth_sq_napi_stats {
 	};
 };
 
+/**
+ * struct libeth_xdpsq_napi_stats - "hot" counters to update in XDP Tx
+ *				    completion loop
+ * @packets: completed frames counter
+ * @bytes: sum of bytes of completed frames above
+ * @fragments: sum of fragments of completed S/G frames
+ * @raw: alias to access all the fields as an array
+ */
+struct libeth_xdpsq_napi_stats {
+	union {
+		struct {
+							u32 packets;
+							u32 bytes;
+							u32 fragments;
+		};
+		DECLARE_FLEX_ARRAY(u32, raw);
+	};
+};
+
+/* XDP */
+
+/*
+ * The following structures should be embedded into driver's queue structure
+ * and passed to the libeth_xdp helpers, never used directly.
+ */
+
+/* XDPSQ sharing */
+
+/**
+ * struct libeth_xdpsq_lock - locking primitive for sharing XDPSQs
+ * @lock: spinlock for locking the queue
+ * @share: whether this particular queue is shared
+ */
+struct libeth_xdpsq_lock {
+	spinlock_t			lock;
+	bool				share;
+};
+
+/* XDPSQ clean-up timers */
+
+/**
+ * struct libeth_xdpsq_timer - timer for cleaning up XDPSQs w/o interrupts
+ * @xdpsq: queue this timer belongs to
+ * @lock: lock for the queue
+ * @dwork: work performing cleanups
+ *
+ * XDPSQs not using interrupts but lazy cleaning, i.e. only when there's no
+ * space for sending the current queued frame/bulk, must fire up timers to
+ * make sure there are no stale buffers to free.
+ */
+struct libeth_xdpsq_timer {
+	void				*xdpsq;
+	struct libeth_xdpsq_lock	*lock;
+
+	struct delayed_work		dwork;
+};
+
+/* Rx polling path */
+
+/**
+ * struct libeth_xdp_buff_stash - struct for stashing &xdp_buff onto a queue
+ * @data: pointer to the start of the frame, xdp_buff.data
+ * @headroom: frame headroom, xdp_buff.data - xdp_buff.data_hard_start
+ * @len: frame linear space length, xdp_buff.data_end - xdp_buff.data
+ * @frame_sz: truesize occupied by the frame, xdp_buff.frame_sz
+ * @flags: xdp_buff.flags
+ *
+ * &xdp_buff is 56 bytes long on x64, &libeth_xdp_buff is 64 bytes. This
+ * structure carries only necessary fields to save/restore a partially built
+ * frame on the queue structure to finish it during the next NAPI poll.
+ */
+struct libeth_xdp_buff_stash {
+	void				*data;
+	u16				headroom;
+	u16				len;
+
+	u32				frame_sz:24;
+	u32				flags:8;
+} __aligned_largest;
+
 #endif /* __LIBETH_TYPES_H */
diff --git a/drivers/net/ethernet/intel/libeth/priv.h b/drivers/net/ethernet/intel/libeth/priv.h
new file mode 100644
index 000000000000..1bd6e2d7a3e7
--- /dev/null
+++ b/drivers/net/ethernet/intel/libeth/priv.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2025 Intel Corporation */
+
+#ifndef __LIBETH_PRIV_H
+#define __LIBETH_PRIV_H
+
+#include <linux/types.h>
+
+/* XDP */
+
+struct skb_shared_info;
+struct xdp_frame_bulk;
+
+struct libeth_xdp_ops {
+	void	(*bulk)(const struct skb_shared_info *sinfo,
+			struct xdp_frame_bulk *bq, bool frags);
+};
+
+void libeth_attach_xdp(const struct libeth_xdp_ops *ops);
+
+static inline void libeth_detach_xdp(void)
+{
+	libeth_attach_xdp(NULL);
+}
+
+#endif /* __LIBETH_PRIV_H */
diff --git a/include/net/libeth/tx.h b/include/net/libeth/tx.h
index 35614f9523f6..c3459917330e 100644
--- a/include/net/libeth/tx.h
+++ b/include/net/libeth/tx.h
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
-/* Copyright (C) 2024 Intel Corporation */
+/* Copyright (C) 2024-2025 Intel Corporation */
 
 #ifndef __LIBETH_TX_H
 #define __LIBETH_TX_H
@@ -12,11 +12,15 @@
 
 /**
  * enum libeth_sqe_type - type of &libeth_sqe to act on Tx completion
- * @LIBETH_SQE_EMPTY: unused/empty, no action required
+ * @LIBETH_SQE_EMPTY: unused/empty OR XDP_TX, no action required
  * @LIBETH_SQE_CTX: context descriptor with empty SQE, no action required
  * @LIBETH_SQE_SLAB: kmalloc-allocated buffer, unmap and kfree()
  * @LIBETH_SQE_FRAG: mapped skb frag, only unmap DMA
  * @LIBETH_SQE_SKB: &sk_buff, unmap and napi_consume_skb(), update stats
+ * @__LIBETH_SQE_XDP_START: separator between skb and XDP types
+ * @LIBETH_SQE_XDP_TX: &skb_shared_info, libeth_xdp_return_buff_bulk(), stats
+ * @LIBETH_SQE_XDP_XMIT: &xdp_frame, unmap and xdp_return_frame_bulk(), stats
+ * @LIBETH_SQE_XDP_XMIT_FRAG: &xdp_frame frag, only unmap DMA
  */
 enum libeth_sqe_type {
 	LIBETH_SQE_EMPTY		= 0U,
@@ -24,6 +28,11 @@ enum libeth_sqe_type {
 	LIBETH_SQE_SLAB,
 	LIBETH_SQE_FRAG,
 	LIBETH_SQE_SKB,
+
+	__LIBETH_SQE_XDP_START,
+	LIBETH_SQE_XDP_TX		= __LIBETH_SQE_XDP_START,
+	LIBETH_SQE_XDP_XMIT,
+	LIBETH_SQE_XDP_XMIT_FRAG,
 };
 
 /**
@@ -32,6 +41,8 @@ enum libeth_sqe_type {
  * @rs_idx: index of the last buffer from the batch this one was sent in
  * @raw: slab buffer to free via kfree()
  * @skb: &sk_buff to consume
+ * @sinfo: skb shared info of an XDP_TX frame
+ * @xdpf: XDP frame from ::ndo_xdp_xmit()
  * @dma: DMA address to unmap
  * @len: length of the mapped region to unmap
  * @nr_frags: number of frags in the frame this buffer belongs to
@@ -46,6 +57,8 @@ struct libeth_sqe {
 	union {
 		void				*raw;
 		struct sk_buff			*skb;
+		struct skb_shared_info		*sinfo;
+		struct xdp_frame		*xdpf;
 	};
 
 	DEFINE_DMA_UNMAP_ADDR(dma);
@@ -71,7 +84,10 @@ struct libeth_sqe {
 /**
  * struct libeth_cq_pp - completion queue poll params
  * @dev: &device to perform DMA unmapping
+ * @bq: XDP frame bulk to combine return operations
  * @ss: onstack NAPI stats to fill
+ * @xss: onstack XDPSQ NAPI stats to fill
+ * @xdp_tx: number of XDP frames processed
  * @napi: whether it's called from the NAPI context
  *
  * libeth uses this structure to access objects needed for performing full
@@ -80,7 +96,13 @@ struct libeth_sqe {
  */
 struct libeth_cq_pp {
 	struct device			*dev;
-	struct libeth_sq_napi_stats	*ss;
+	struct xdp_frame_bulk		*bq;
+
+	union {
+		struct libeth_sq_napi_stats	*ss;
+		struct libeth_xdpsq_napi_stats	*xss;
+	};
+	u32				xdp_tx;
 
 	bool				napi;
 };
@@ -126,4 +148,6 @@ static inline void libeth_tx_complete(struct libeth_sqe *sqe,
 	sqe->type = LIBETH_SQE_EMPTY;
 }
 
+void libeth_tx_complete_any(struct libeth_sqe *sqe, struct libeth_cq_pp *cp);
+
 #endif /* __LIBETH_TX_H */
diff --git a/include/net/libeth/xdp.h b/include/net/libeth/xdp.h
new file mode 100644
index 000000000000..1039cd5d8a56
--- /dev/null
+++ b/include/net/libeth/xdp.h
@@ -0,0 +1,1827 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2025 Intel Corporation */
+
+#ifndef __LIBETH_XDP_H
+#define __LIBETH_XDP_H
+
+#include <linux/bpf_trace.h>
+#include <linux/unroll.h>
+
+#include <net/libeth/rx.h>
+#include <net/libeth/tx.h>
+#include <net/xsk_buff_pool.h>
+
+/* Defined as bits to be able to use them as a mask */
+enum {
+	LIBETH_XDP_PASS			= 0U,
+	LIBETH_XDP_DROP			= BIT(0),
+	LIBETH_XDP_ABORTED		= BIT(1),
+	LIBETH_XDP_TX			= BIT(2),
+	LIBETH_XDP_REDIRECT		= BIT(3),
+};
+
+/*
+ * &xdp_buff_xsk is the largest structure &libeth_xdp_buff gets cast to,
+ * pick maximum pointer-compatible alignment.
+ */
+#define __LIBETH_XDP_BUFF_ALIGN						      \
+	(IS_ALIGNED(sizeof(struct xdp_buff_xsk), 16) ? 16 :		      \
+	 IS_ALIGNED(sizeof(struct xdp_buff_xsk), 8) ? 8 :		      \
+	 sizeof(long))
+
+/**
+ * struct libeth_xdp_buff - libeth extension over &xdp_buff
+ * @base: main &xdp_buff
+ * @data: shortcut for @base.data
+ * @desc: RQ descriptor containing metadata for this buffer
+ * @priv: driver-private scratchspace
+ *
+ * The main reason for this is to have a pointer to the descriptor to be able
+ * to quickly get frame metadata from xdpmo and driver buff-to-xdp callbacks
+ * (as well as bigger alignment).
+ * Pointer/layout-compatible with &xdp_buff and &xdp_buff_xsk.
+ */
+struct libeth_xdp_buff {
+	union {
+		struct xdp_buff		base;
+		void			*data;
+	};
+
+	const void			*desc;
+	unsigned long			priv[]
+					__aligned(__LIBETH_XDP_BUFF_ALIGN);
+} __aligned(__LIBETH_XDP_BUFF_ALIGN);
+static_assert(offsetof(struct libeth_xdp_buff, data) ==
+	      offsetof(struct xdp_buff_xsk, xdp.data));
+static_assert(offsetof(struct libeth_xdp_buff, desc) ==
+	      offsetof(struct xdp_buff_xsk, cb));
+static_assert(IS_ALIGNED(sizeof(struct xdp_buff_xsk),
+			 __alignof(struct libeth_xdp_buff)));
+
+/**
+ * __LIBETH_XDP_ONSTACK_BUFF - declare a &libeth_xdp_buff on the stack
+ * @name: name of the variable to declare
+ * @...: sizeof() of the driver-private data
+ */
+#define __LIBETH_XDP_ONSTACK_BUFF(name, ...)				      \
+	___LIBETH_XDP_ONSTACK_BUFF(name, ##__VA_ARGS__)
+/**
+ * LIBETH_XDP_ONSTACK_BUFF - declare a &libeth_xdp_buff on the stack
+ * @name: name of the variable to declare
+ * @...: type or variable name of the driver-private data
+ */
+#define LIBETH_XDP_ONSTACK_BUFF(name, ...)				      \
+	__LIBETH_XDP_ONSTACK_BUFF(name, __libeth_xdp_priv_sz(__VA_ARGS__))
+
+#define ___LIBETH_XDP_ONSTACK_BUFF(name, ...)				      \
+	_DEFINE_FLEX(struct libeth_xdp_buff, name, priv,		      \
+		     LIBETH_XDP_PRIV_SZ(__VA_ARGS__ + 0),		      \
+		     /* no init */);					      \
+	LIBETH_XDP_ASSERT_PRIV_SZ(__VA_ARGS__ + 0)
+
+#define __libeth_xdp_priv_sz(...)					      \
+	CONCATENATE(__libeth_xdp_psz, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
+
+#define __libeth_xdp_psz0(...)
+#define __libeth_xdp_psz1(...)		sizeof(__VA_ARGS__)
+
+#define LIBETH_XDP_PRIV_SZ(sz)						      \
+	(ALIGN(sz, __alignof(struct libeth_xdp_buff)) / sizeof(long))
+
+/* Performs XSK_CHECK_PRIV_TYPE() */
+#define LIBETH_XDP_ASSERT_PRIV_SZ(sz)					      \
+	static_assert(offsetofend(struct xdp_buff_xsk, cb) >=		      \
+		      struct_size_t(struct libeth_xdp_buff, priv,	      \
+				    LIBETH_XDP_PRIV_SZ(sz)))
+
+/* XDPSQ sharing */
+
+DECLARE_STATIC_KEY_FALSE(libeth_xdpsq_share);
+
+/**
+ * libeth_xdpsq_num - calculate optimal number of XDPSQs for this device + sys
+ * @rxq: current number of active Rx queues
+ * @txq: current number of active Tx queues
+ * @max: maximum number of Tx queues
+ *
+ * Each RQ must have its own XDPSQ for XSk pairs, each CPU must have own XDPSQ
+ * for lockless sending (``XDP_TX``, .ndo_xdp_xmit()). Cap the maximum of these
+ * two with the number of SQs the device can have (minus used ones).
+ *
+ * Return: number of XDP Tx queues the device needs to use.
+ */
+static inline u32 libeth_xdpsq_num(u32 rxq, u32 txq, u32 max)
+{
+	return min(max(nr_cpu_ids, rxq), max - txq);
+}
+
+/**
+ * libeth_xdpsq_shared - whether XDPSQs can be shared between several CPUs
+ * @num: number of active XDPSQs
+ *
+ * Return: true if there's no 1:1 XDPSQ/CPU association, false otherwise.
+ */
+static inline bool libeth_xdpsq_shared(u32 num)
+{
+	return num < nr_cpu_ids;
+}
+
+/**
+ * libeth_xdpsq_id - get XDPSQ index corresponding to this CPU
+ * @num: number of active XDPSQs
+ *
+ * Helper for libeth_xdp routines, do not use in drivers directly.
+ *
+ * Return: XDPSQ index needs to be used on this CPU.
+ */
+static inline u32 libeth_xdpsq_id(u32 num)
+{
+	u32 ret = raw_smp_processor_id();
+
+	if (static_branch_unlikely(&libeth_xdpsq_share) &&
+	    libeth_xdpsq_shared(num))
+		ret %= num;
+
+	return ret;
+}
+
+void __libeth_xdpsq_get(struct libeth_xdpsq_lock *lock,
+			const struct net_device *dev);
+void __libeth_xdpsq_put(struct libeth_xdpsq_lock *lock,
+			const struct net_device *dev);
+
+/**
+ * libeth_xdpsq_get - initialize &libeth_xdpsq_lock
+ * @lock: lock to initialize
+ * @dev: netdev which this lock belongs to
+ * @share: whether XDPSQs can be shared
+ *
+ * Tracks the current XDPSQ association and enables the static lock
+ * if needed.
+ */
+static inline void libeth_xdpsq_get(struct libeth_xdpsq_lock *lock,
+				    const struct net_device *dev,
+				    bool share)
+{
+	if (unlikely(share))
+		__libeth_xdpsq_get(lock, dev);
+}
+
+/**
+ * libeth_xdpsq_put - deinitialize &libeth_xdpsq_lock
+ * @lock: lock to deinitialize
+ * @dev: netdev which this lock belongs to
+ *
+ * Tracks the current XDPSQ association and disables the static lock
+ * if needed.
+ */
+static inline void libeth_xdpsq_put(struct libeth_xdpsq_lock *lock,
+				    const struct net_device *dev)
+{
+	if (static_branch_unlikely(&libeth_xdpsq_share) && lock->share)
+		__libeth_xdpsq_put(lock, dev);
+}
+
+void __libeth_xdpsq_lock(struct libeth_xdpsq_lock *lock);
+void __libeth_xdpsq_unlock(struct libeth_xdpsq_lock *lock);
+
+/**
+ * libeth_xdpsq_lock - grab &libeth_xdpsq_lock if needed
+ * @lock: lock to take
+ *
+ * Touches the underlying spinlock only if the static key is enabled
+ * and the queue itself is marked as shareable.
+ */
+static inline void libeth_xdpsq_lock(struct libeth_xdpsq_lock *lock)
+{
+	if (static_branch_unlikely(&libeth_xdpsq_share) && lock->share)
+		__libeth_xdpsq_lock(lock);
+}
+
+/**
+ * libeth_xdpsq_unlock - free &libeth_xdpsq_lock if needed
+ * @lock: lock to free
+ *
+ * Touches the underlying spinlock only if the static key is enabled
+ * and the queue itself is marked as shareable.
+ */
+static inline void libeth_xdpsq_unlock(struct libeth_xdpsq_lock *lock)
+{
+	if (static_branch_unlikely(&libeth_xdpsq_share) && lock->share)
+		__libeth_xdpsq_unlock(lock);
+}
+
+/* XDPSQ clean-up timers */
+
+void libeth_xdpsq_init_timer(struct libeth_xdpsq_timer *timer, void *xdpsq,
+			     struct libeth_xdpsq_lock *lock,
+			     void (*poll)(struct work_struct *work));
+
+/**
+ * libeth_xdpsq_deinit_timer - deinitialize &libeth_xdpsq_timer
+ * @timer: timer to deinitialize
+ *
+ * Flush and disable the underlying workqueue.
+ */
+static inline void libeth_xdpsq_deinit_timer(struct libeth_xdpsq_timer *timer)
+{
+	cancel_delayed_work_sync(&timer->dwork);
+}
+
+/**
+ * libeth_xdpsq_queue_timer - run &libeth_xdpsq_timer
+ * @timer: timer to queue
+ *
+ * Should be called after the queue was filled and the transmission was run
+ * to complete the pending buffers if no further sending will be done in a
+ * second (-> lazy cleaning won't happen).
+ * If the timer was already run, it will be requeued back to one second
+ * timeout again.
+ */
+static inline void libeth_xdpsq_queue_timer(struct libeth_xdpsq_timer *timer)
+{
+	mod_delayed_work_on(raw_smp_processor_id(), system_bh_highpri_wq,
+			    &timer->dwork, HZ);
+}
+
+/**
+ * libeth_xdpsq_run_timer - wrapper to run a queue clean-up on a timer event
+ * @work: workqueue belonging to the corresponding timer
+ * @poll: driver-specific completion queue poll function
+ *
+ * Run the polling function on the locked queue and requeue the timer if
+ * there's more work to do.
+ * Designed to be used via LIBETH_XDP_DEFINE_TIMER() below.
+ */
+static __always_inline void
+libeth_xdpsq_run_timer(struct work_struct *work,
+		       u32 (*poll)(void *xdpsq, u32 budget))
+{
+	struct libeth_xdpsq_timer *timer = container_of(work, typeof(*timer),
+							dwork.work);
+
+	libeth_xdpsq_lock(timer->lock);
+
+	if (poll(timer->xdpsq, U32_MAX))
+		libeth_xdpsq_queue_timer(timer);
+
+	libeth_xdpsq_unlock(timer->lock);
+}
+
+/* Common Tx bits */
+
+/**
+ * enum - libeth_xdp internal Tx flags
+ * @LIBETH_XDP_TX_BULK: one bulk size at which it will be flushed to the queue
+ * @LIBETH_XDP_TX_BATCH: batch size for which the queue fill loop is unrolled
+ * @LIBETH_XDP_TX_DROP: indicates the send function must drop frames not sent
+ * @LIBETH_XDP_TX_NDO: whether the send function is called from .ndo_xdp_xmit()
+ */
+enum {
+	LIBETH_XDP_TX_BULK		= DEV_MAP_BULK_SIZE,
+	LIBETH_XDP_TX_BATCH		= 8,
+
+	LIBETH_XDP_TX_DROP		= BIT(0),
+	LIBETH_XDP_TX_NDO		= BIT(1),
+};
+
+/**
+ * enum - &libeth_xdp_tx_frame and &libeth_xdp_tx_desc flags
+ * @LIBETH_XDP_TX_LEN: only for ``XDP_TX``, [15:0] of ::len_fl is actual length
+ * @LIBETH_XDP_TX_FIRST: indicates the frag is the first one of the frame
+ * @LIBETH_XDP_TX_LAST: whether the frag is the last one of the frame
+ * @LIBETH_XDP_TX_MULTI: whether the frame contains several frags
+ * @LIBETH_XDP_TX_FLAGS: only for ``XDP_TX``, [31:16] of ::len_fl is flags
+ */
+enum {
+	LIBETH_XDP_TX_LEN		= GENMASK(15, 0),
+
+	LIBETH_XDP_TX_FIRST		= BIT(16),
+	LIBETH_XDP_TX_LAST		= BIT(17),
+	LIBETH_XDP_TX_MULTI		= BIT(18),
+
+	LIBETH_XDP_TX_FLAGS		= GENMASK(31, 16),
+};
+
+/**
+ * struct libeth_xdp_tx_frame - represents one XDP Tx element
+ * @data: frame start pointer for ``XDP_TX``
+ * @len_fl: ``XDP_TX``, combined flags [31:16] and len [15:0] field for speed
+ * @soff: ``XDP_TX``, offset from @data to the start of &skb_shared_info
+ * @frag: one (non-head) frag for ``XDP_TX``
+ * @xdpf: &xdp_frame for the head frag for .ndo_xdp_xmit()
+ * @dma: DMA address of the non-head frag for .ndo_xdp_xmit()
+ * @len: frag length for .ndo_xdp_xmit()
+ * @flags: Tx flags for the above
+ * @opts: combined @len + @flags for the above for speed
+ */
+struct libeth_xdp_tx_frame {
+	union {
+		/* ``XDP_TX`` */
+		struct {
+			void				*data;
+			u32				len_fl;
+			u32				soff;
+		};
+
+		/* ``XDP_TX`` frag */
+		skb_frag_t			frag;
+
+		/* .ndo_xdp_xmit() */
+		struct {
+			union {
+				struct xdp_frame		*xdpf;
+				dma_addr_t			dma;
+			};
+			union {
+				struct {
+					u32				len;
+					u32				flags;
+				};
+				aligned_u64			opts;
+			};
+		};
+	};
+} __aligned(sizeof(struct xdp_desc));
+static_assert(offsetof(struct libeth_xdp_tx_frame, frag.len) ==
+	      offsetof(struct libeth_xdp_tx_frame, len_fl));
+
+/**
+ * struct libeth_xdp_tx_bulk - XDP Tx frame bulk for bulk sending
+ * @prog: corresponding active XDP program, %NULL for .ndo_xdp_xmit()
+ * @dev: &net_device which the frames are transmitted on
+ * @xdpsq: shortcut to the corresponding driver-specific XDPSQ structure
+ * @act_mask: Rx only, mask of all the XDP prog verdicts for that NAPI session
+ * @count: current number of frames in @bulk
+ * @bulk: array of queued frames for bulk Tx
+ *
+ * All XDP Tx operations queue each frame to the bulk first and flush it
+ * when @count reaches the array end. Bulk is always placed on the stack
+ * for performance. One bulk element contains all the data necessary
+ * for sending a frame and then freeing it on completion.
+ */
+struct libeth_xdp_tx_bulk {
+	const struct bpf_prog		*prog;
+	struct net_device		*dev;
+	void				*xdpsq;
+
+	u32				act_mask;
+	u32				count;
+	struct libeth_xdp_tx_frame	bulk[LIBETH_XDP_TX_BULK];
+} __aligned(sizeof(struct libeth_xdp_tx_frame));
+
+/**
+ * struct libeth_xdpsq - abstraction for an XDPSQ
+ * @sqes: array of Tx buffers from the actual queue struct
+ * @descs: opaque pointer to the HW descriptor array
+ * @ntu: pointer to the next free descriptor index
+ * @count: number of descriptors on that queue
+ * @pending: pointer to the number of sent-not-completed descs on that queue
+ * @xdp_tx: pointer to the above
+ * @lock: corresponding XDPSQ lock
+ *
+ * Abstraction for driver-independent implementation of Tx. Placed on the stack
+ * and filled by the driver before the transmission, so that the generic
+ * functions can access and modify driver-specific resources.
+ */
+struct libeth_xdpsq {
+	struct libeth_sqe		*sqes;
+	void				*descs;
+
+	u32				*ntu;
+	u32				count;
+
+	u32				*pending;
+	u32				*xdp_tx;
+	struct libeth_xdpsq_lock	*lock;
+};
+
+/**
+ * struct libeth_xdp_tx_desc - abstraction for an XDP Tx descriptor
+ * @addr: DMA address of the frame
+ * @len: length of the frame
+ * @flags: XDP Tx flags
+ * @opts: combined @len + @flags for speed
+ *
+ * Filled by the generic functions and then passed to driver-specific functions
+ * to fill a HW Tx descriptor, always placed on the [function] stack.
+ */
+struct libeth_xdp_tx_desc {
+	dma_addr_t			addr;
+	union {
+		struct {
+			u32				len;
+			u32				flags;
+		};
+		aligned_u64			opts;
+	};
+} __aligned_largest;
+
+/**
+ * libeth_xdp_ptr_to_priv - convert pointer to a libeth_xdp u64 priv
+ * @ptr: pointer to convert
+ *
+ * The main sending function passes private data as the largest scalar, u64.
+ * Use this helper when you want to pass a pointer there.
+ */
+#define libeth_xdp_ptr_to_priv(ptr) ({					      \
+	typecheck_pointer(ptr);						      \
+	((u64)(uintptr_t)(ptr));					      \
+})
+/**
+ * libeth_xdp_priv_to_ptr - convert libeth_xdp u64 priv to a pointer
+ * @priv: private data to convert
+ *
+ * The main sending function passes private data as the largest scalar, u64.
+ * Use this helper when your callback takes this u64 and you want to convert
+ * it back to a pointer.
+ */
+#define libeth_xdp_priv_to_ptr(priv) ({					      \
+	static_assert(__same_type(priv, u64));				      \
+	((const void *)(uintptr_t)(priv));				      \
+})
+
+/*
+ * On 64-bit systems, assigning one u64 is faster than two u32s. When ::len
+ * occupies lowest 32 bits (LE), whole ::opts can be assigned directly instead.
+ */
+#ifdef __LITTLE_ENDIAN
+#define __LIBETH_WORD_ACCESS		1
+#endif
+#ifdef __LIBETH_WORD_ACCESS
+#define __libeth_xdp_tx_len(flen, ...)					      \
+	.opts = ((flen) | FIELD_PREP(GENMASK_ULL(63, 32), (__VA_ARGS__ + 0)))
+#else
+#define __libeth_xdp_tx_len(flen, ...)					      \
+	.len = (flen), .flags = (__VA_ARGS__ + 0)
+#endif
+
+/**
+ * libeth_xdp_tx_xmit_bulk - main XDP Tx function
+ * @bulk: array of frames to send
+ * @xdpsq: pointer to the driver-specific XDPSQ struct
+ * @n: number of frames to send
+ * @unroll: whether to unroll the queue filling loop for speed
+ * @priv: driver-specific private data
+ * @prep: callback for cleaning the queue and filling abstract &libeth_xdpsq
+ * @fill: internal callback for filling &libeth_sqe and &libeth_xdp_tx_desc
+ * @xmit: callback for filling a HW descriptor with the frame info
+ *
+ * Internal abstraction for placing @n XDP Tx frames on the HW XDPSQ. Used for
+ * all types of frames: ``XDP_TX`` and .ndo_xdp_xmit().
+ * @prep must lock the queue as this function releases it at the end. @unroll
+ * greatly increases the object code size, but also greatly increases
+ * performance.
+ * The compilers inline all those onstack abstractions to direct data accesses.
+ *
+ * Return: number of frames actually placed on the queue, <= @n. The function
+ * can't fail, but can send fewer frames if there aren't enough free descriptors
+ * available. The actual free space is returned by @prep from the driver.
+ */
+static __always_inline u32
+libeth_xdp_tx_xmit_bulk(const struct libeth_xdp_tx_frame *bulk, void *xdpsq,
+			u32 n, bool unroll, u64 priv,
+			u32 (*prep)(void *xdpsq, struct libeth_xdpsq *sq),
+			struct libeth_xdp_tx_desc
+			(*fill)(struct libeth_xdp_tx_frame frm, u32 i,
+				const struct libeth_xdpsq *sq, u64 priv),
+			void (*xmit)(struct libeth_xdp_tx_desc desc, u32 i,
+				     const struct libeth_xdpsq *sq, u64 priv))
+{
+	u32 this, batched, off = 0;
+	struct libeth_xdpsq sq;
+	u32 ntu, i = 0;
+
+	n = min(n, prep(xdpsq, &sq));
+	if (unlikely(!n))
+		goto unlock;
+
+	ntu = *sq.ntu;
+
+	this = sq.count - ntu;
+	if (likely(this > n))
+		this = n;
+
+again:
+	if (!unroll)
+		goto linear;
+
+	batched = ALIGN_DOWN(this, LIBETH_XDP_TX_BATCH);
+
+	for ( ; i < off + batched; i += LIBETH_XDP_TX_BATCH) {
+		u32 base = ntu + i - off;
+
+		unrolled_count(LIBETH_XDP_TX_BATCH)
+		for (u32 j = 0; j < LIBETH_XDP_TX_BATCH; j++)
+			xmit(fill(bulk[i + j], base + j, &sq, priv),
+			     base + j, &sq, priv);
+	}
+
+	if (batched < this) {
+linear:
+		for ( ; i < off + this; i++)
+			xmit(fill(bulk[i], ntu + i - off, &sq, priv),
+			     ntu + i - off, &sq, priv);
+	}
+
+	ntu += this;
+	if (likely(ntu < sq.count))
+		goto out;
+
+	ntu = 0;
+
+	if (i < n) {
+		this = n - i;
+		off = i;
+
+		goto again;
+	}
+
+out:
+	*sq.ntu = ntu;
+	*sq.pending += n;
+	if (sq.xdp_tx)
+		*sq.xdp_tx += n;
+
+unlock:
+	libeth_xdpsq_unlock(sq.lock);
+
+	return n;
+}
+
+/* ``XDP_TX`` bulking */
+
+void libeth_xdp_return_buff_slow(struct libeth_xdp_buff *xdp);
+
+/**
+ * libeth_xdp_tx_queue_head - internal helper for queueing one ``XDP_TX`` head
+ * @bq: XDP Tx bulk to queue the head frag to
+ * @xdp: XDP buffer with the head to queue
+ *
+ * Return: false if it's the only frag of the frame, true if it's an S/G frame.
+ */
+static inline bool libeth_xdp_tx_queue_head(struct libeth_xdp_tx_bulk *bq,
+					    const struct libeth_xdp_buff *xdp)
+{
+	const struct xdp_buff *base = &xdp->base;
+
+	bq->bulk[bq->count++] = (typeof(*bq->bulk)){
+		.data	= xdp->data,
+		.len_fl	= (base->data_end - xdp->data) | LIBETH_XDP_TX_FIRST,
+		.soff	= xdp_data_hard_end(base) - xdp->data,
+	};
+
+	if (!xdp_buff_has_frags(base))
+		return false;
+
+	bq->bulk[bq->count - 1].len_fl |= LIBETH_XDP_TX_MULTI;
+
+	return true;
+}
+
+/**
+ * libeth_xdp_tx_queue_frag - internal helper for queueing one ``XDP_TX`` frag
+ * @bq: XDP Tx bulk to queue the frag to
+ * @frag: frag to queue
+ */
+static inline void libeth_xdp_tx_queue_frag(struct libeth_xdp_tx_bulk *bq,
+					    const skb_frag_t *frag)
+{
+	bq->bulk[bq->count++].frag = *frag;
+}
+
+/**
+ * libeth_xdp_tx_queue_bulk - internal helper for queueing one ``XDP_TX`` frame
+ * @bq: XDP Tx bulk to queue the frame to
+ * @xdp: XDP buffer to queue
+ * @flush_bulk: driver callback to flush the bulk to the HW queue
+ *
+ * Return: true on success, false on flush error.
+ */
+static __always_inline bool
+libeth_xdp_tx_queue_bulk(struct libeth_xdp_tx_bulk *bq,
+			 struct libeth_xdp_buff *xdp,
+			 bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
+					    u32 flags))
+{
+	const struct skb_shared_info *sinfo;
+	bool ret = true;
+	u32 nr_frags;
+
+	if (unlikely(bq->count == LIBETH_XDP_TX_BULK) &&
+	    unlikely(!flush_bulk(bq, 0))) {
+		libeth_xdp_return_buff_slow(xdp);
+		return false;
+	}
+
+	if (!libeth_xdp_tx_queue_head(bq, xdp))
+		goto out;
+
+	sinfo = xdp_get_shared_info_from_buff(&xdp->base);
+	nr_frags = sinfo->nr_frags;
+
+	for (u32 i = 0; i < nr_frags; i++) {
+		if (unlikely(bq->count == LIBETH_XDP_TX_BULK) &&
+		    unlikely(!flush_bulk(bq, 0))) {
+			ret = false;
+			break;
+		}
+
+		libeth_xdp_tx_queue_frag(bq, &sinfo->frags[i]);
+	}
+
+out:
+	bq->bulk[bq->count - 1].len_fl |= LIBETH_XDP_TX_LAST;
+	xdp->data = NULL;
+
+	return ret;
+}
+
+/**
+ * libeth_xdp_tx_fill_stats - fill &libeth_sqe with ``XDP_TX`` frame stats
+ * @sqe: SQ element to fill
+ * @desc: libeth_xdp Tx descriptor
+ * @sinfo: &skb_shared_info for this frame
+ *
+ * Internal helper for filling an SQE with the frame stats, do not use in
+ * drivers. Fills the number of frags and bytes for this frame.
+ */
+#define libeth_xdp_tx_fill_stats(sqe, desc, sinfo)			      \
+	__libeth_xdp_tx_fill_stats(sqe, desc, sinfo, __UNIQUE_ID(sqe_),	      \
+				   __UNIQUE_ID(desc_), __UNIQUE_ID(sinfo_))
+
+#define __libeth_xdp_tx_fill_stats(sqe, desc, sinfo, ue, ud, us) do {	      \
+	const struct libeth_xdp_tx_desc *ud = (desc);			      \
+	const struct skb_shared_info *us;				      \
+	struct libeth_sqe *ue = (sqe);					      \
+									      \
+	ue->nr_frags = 1;						      \
+	ue->bytes = ud->len;						      \
+									      \
+	if (ud->flags & LIBETH_XDP_TX_MULTI) {				      \
+		us = (sinfo);						      \
+		ue->nr_frags += us->nr_frags;				      \
+		ue->bytes += us->xdp_frags_size;			      \
+	}								      \
+} while (0)
+
+/**
+ * libeth_xdp_tx_fill_buf - internal helper to fill one ``XDP_TX`` &libeth_sqe
+ * @frm: XDP Tx frame from the bulk
+ * @i: index on the HW queue
+ * @sq: XDPSQ abstraction for the queue
+ * @priv: private data
+ *
+ * Return: XDP Tx descriptor with the synced DMA and other info to pass to
+ * the driver callback.
+ */
+static inline struct libeth_xdp_tx_desc
+libeth_xdp_tx_fill_buf(struct libeth_xdp_tx_frame frm, u32 i,
+		       const struct libeth_xdpsq *sq, u64 priv)
+{
+	struct libeth_xdp_tx_desc desc;
+	struct skb_shared_info *sinfo;
+	skb_frag_t *frag = &frm.frag;
+	struct libeth_sqe *sqe;
+	netmem_ref netmem;
+
+	if (frm.len_fl & LIBETH_XDP_TX_FIRST) {
+		sinfo = frm.data + frm.soff;
+		skb_frag_fill_netmem_desc(frag, virt_to_netmem(frm.data),
+					  offset_in_page(frm.data),
+					  frm.len_fl);
+	} else {
+		sinfo = NULL;
+	}
+
+	netmem = skb_frag_netmem(frag);
+	desc = (typeof(desc)){
+		.addr	= page_pool_get_dma_addr_netmem(netmem) +
+			  skb_frag_off(frag),
+		.len	= skb_frag_size(frag) & LIBETH_XDP_TX_LEN,
+		.flags	= skb_frag_size(frag) & LIBETH_XDP_TX_FLAGS,
+	};
+
+	if (sinfo || !netmem_is_net_iov(netmem)) {
+		const struct page_pool *pp = __netmem_get_pp(netmem);
+
+		dma_sync_single_for_device(pp->p.dev, desc.addr, desc.len,
+					   DMA_BIDIRECTIONAL);
+	}
+
+	if (!sinfo)
+		return desc;
+
+	sqe = &sq->sqes[i];
+	sqe->type = LIBETH_SQE_XDP_TX;
+	sqe->sinfo = sinfo;
+	libeth_xdp_tx_fill_stats(sqe, &desc, sinfo);
+
+	return desc;
+}
+
+void libeth_xdp_tx_exception(struct libeth_xdp_tx_bulk *bq, u32 sent,
+			     u32 flags);
+
+/**
+ * __libeth_xdp_tx_flush_bulk - internal helper to flush one XDP Tx bulk
+ * @bq: bulk to flush
+ * @flags: XDP TX flags (.ndo_xdp_xmit(), etc.)
+ * @prep: driver-specific callback to prepare the queue for sending
+ * @fill: libeth_xdp callback to fill &libeth_sqe and &libeth_xdp_tx_desc
+ * @xmit: driver callback to fill a HW descriptor
+ *
+ * Internal abstraction to create bulk flush functions for drivers.
+ *
+ * Return: true if anything was sent, false otherwise.
+ */
+static __always_inline bool
+__libeth_xdp_tx_flush_bulk(struct libeth_xdp_tx_bulk *bq, u32 flags,
+			   u32 (*prep)(void *xdpsq, struct libeth_xdpsq *sq),
+			   struct libeth_xdp_tx_desc
+			   (*fill)(struct libeth_xdp_tx_frame frm, u32 i,
+				   const struct libeth_xdpsq *sq, u64 priv),
+			   void (*xmit)(struct libeth_xdp_tx_desc desc, u32 i,
+					const struct libeth_xdpsq *sq,
+					u64 priv))
+{
+	u32 sent, drops;
+	int err = 0;
+
+	sent = libeth_xdp_tx_xmit_bulk(bq->bulk, bq->xdpsq,
+				       min(bq->count, LIBETH_XDP_TX_BULK),
+				       false, 0, prep, fill, xmit);
+	drops = bq->count - sent;
+
+	if (unlikely(drops)) {
+		libeth_xdp_tx_exception(bq, sent, flags);
+		err = -ENXIO;
+	} else {
+		bq->count = 0;
+	}
+
+	trace_xdp_bulk_tx(bq->dev, sent, drops, err);
+
+	return likely(sent);
+}
+
+/**
+ * libeth_xdp_tx_flush_bulk - wrapper to define flush of one ``XDP_TX`` bulk
+ * @bq: bulk to flush
+ * @flags: Tx flags, see above
+ * @prep: driver callback to prepare the queue
+ * @xmit: driver callback to fill a HW descriptor
+ *
+ * Use via LIBETH_XDP_DEFINE_FLUSH_TX() to define an ``XDP_TX`` driver
+ * callback.
+ */
+#define libeth_xdp_tx_flush_bulk(bq, flags, prep, xmit)			      \
+	__libeth_xdp_tx_flush_bulk(bq, flags, prep, libeth_xdp_tx_fill_buf,   \
+				   xmit)
+
+/* .ndo_xdp_xmit() implementation */
+
+/**
+ * libeth_xdp_xmit_init_bulk - internal helper to initialize bulk for XDP xmit
+ * @bq: bulk to initialize
+ * @dev: target &net_device
+ * @xdpsqs: array of driver-specific XDPSQ structs
+ * @num: number of active XDPSQs (the above array length)
+ */
+#define libeth_xdp_xmit_init_bulk(bq, dev, xdpsqs, num)			      \
+	__libeth_xdp_xmit_init_bulk(bq, dev, (xdpsqs)[libeth_xdpsq_id(num)])
+
+static inline void __libeth_xdp_xmit_init_bulk(struct libeth_xdp_tx_bulk *bq,
+					       struct net_device *dev,
+					       void *xdpsq)
+{
+	bq->dev = dev;
+	bq->xdpsq = xdpsq;
+	bq->count = 0;
+}
+
+/**
+ * libeth_xdp_xmit_frame_dma - internal helper to access DMA of an &xdp_frame
+ * @xf: pointer to the XDP frame
+ *
+ * There's no place in &libeth_xdp_tx_frame to store DMA address for an
+ * &xdp_frame head. The headroom is used then, the address is placed right
+ * after the frame struct, naturally aligned.
+ *
+ * Return: pointer to the DMA address to use.
+ */
+#define libeth_xdp_xmit_frame_dma(xf)					      \
+	_Generic((xf),							      \
+		 const struct xdp_frame *:				      \
+			(const dma_addr_t *)__libeth_xdp_xmit_frame_dma(xf),  \
+		 struct xdp_frame *:					      \
+			(dma_addr_t *)__libeth_xdp_xmit_frame_dma(xf)	      \
+	)
+
+static inline void *__libeth_xdp_xmit_frame_dma(const struct xdp_frame *xdpf)
+{
+	void *addr = (void *)(xdpf + 1);
+
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
+	    __alignof(*xdpf) < sizeof(dma_addr_t))
+		addr = PTR_ALIGN(addr, sizeof(dma_addr_t));
+
+	return addr;
+}
+
+/**
+ * libeth_xdp_xmit_queue_head - internal helper for queueing one XDP xmit head
+ * @bq: XDP Tx bulk to queue the head frag to
+ * @xdpf: XDP frame with the head to queue
+ * @dev: device to perform DMA mapping
+ *
+ * Return: ``LIBETH_XDP_DROP`` on DMA mapping error,
+ *	   ``LIBETH_XDP_PASS`` if it's the only frag in the frame,
+ *	   ``LIBETH_XDP_TX`` if it's an S/G frame.
+ */
+static inline u32 libeth_xdp_xmit_queue_head(struct libeth_xdp_tx_bulk *bq,
+					     struct xdp_frame *xdpf,
+					     struct device *dev)
+{
+	dma_addr_t dma;
+
+	dma = dma_map_single(dev, xdpf->data, xdpf->len, DMA_TO_DEVICE);
+	if (dma_mapping_error(dev, dma))
+		return LIBETH_XDP_DROP;
+
+	*libeth_xdp_xmit_frame_dma(xdpf) = dma;
+
+	bq->bulk[bq->count++] = (typeof(*bq->bulk)){
+		.xdpf	= xdpf,
+		__libeth_xdp_tx_len(xdpf->len, LIBETH_XDP_TX_FIRST),
+	};
+
+	if (!xdp_frame_has_frags(xdpf))
+		return LIBETH_XDP_PASS;
+
+	bq->bulk[bq->count - 1].flags |= LIBETH_XDP_TX_MULTI;
+
+	return LIBETH_XDP_TX;
+}
+
+/**
+ * libeth_xdp_xmit_queue_frag - internal helper for queueing one XDP xmit frag
+ * @bq: XDP Tx bulk to queue the frag to
+ * @frag: frag to queue
+ * @dev: device to perform DMA mapping
+ *
+ * Return: true on success, false on DMA mapping error.
+ */
+static inline bool libeth_xdp_xmit_queue_frag(struct libeth_xdp_tx_bulk *bq,
+					      const skb_frag_t *frag,
+					      struct device *dev)
+{
+	dma_addr_t dma;
+
+	dma = skb_frag_dma_map(dev, frag);
+	if (dma_mapping_error(dev, dma))
+		return false;
+
+	bq->bulk[bq->count++] = (typeof(*bq->bulk)){
+		.dma	= dma,
+		__libeth_xdp_tx_len(skb_frag_size(frag)),
+	};
+
+	return true;
+}
+
+/**
+ * libeth_xdp_xmit_queue_bulk - internal helper for queueing one XDP xmit frame
+ * @bq: XDP Tx bulk to queue the frame to
+ * @xdpf: XDP frame to queue
+ * @flush_bulk: driver callback to flush the bulk to the HW queue
+ *
+ * Return: ``LIBETH_XDP_TX`` on success,
+ *	   ``LIBETH_XDP_DROP`` if the frame should be dropped by the stack,
+ *	   ``LIBETH_XDP_ABORTED`` if the frame will be dropped by libeth_xdp.
+ */
+static __always_inline u32
+libeth_xdp_xmit_queue_bulk(struct libeth_xdp_tx_bulk *bq,
+			   struct xdp_frame *xdpf,
+			   bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
+					      u32 flags))
+{
+	u32 head, nr_frags, i, ret = LIBETH_XDP_TX;
+	struct device *dev = bq->dev->dev.parent;
+	const struct skb_shared_info *sinfo;
+
+	if (unlikely(bq->count == LIBETH_XDP_TX_BULK) &&
+	    unlikely(!flush_bulk(bq, LIBETH_XDP_TX_NDO)))
+		return LIBETH_XDP_DROP;
+
+	head = libeth_xdp_xmit_queue_head(bq, xdpf, dev);
+	if (head == LIBETH_XDP_PASS)
+		goto out;
+	else if (head == LIBETH_XDP_DROP)
+		return LIBETH_XDP_DROP;
+
+	sinfo = xdp_get_shared_info_from_frame(xdpf);
+	nr_frags = sinfo->nr_frags;
+
+	for (i = 0; i < nr_frags; i++) {
+		if (unlikely(bq->count == LIBETH_XDP_TX_BULK) &&
+		    unlikely(!flush_bulk(bq, LIBETH_XDP_TX_NDO)))
+			break;
+
+		if (!libeth_xdp_xmit_queue_frag(bq, &sinfo->frags[i], dev))
+			break;
+	}
+
+	if (unlikely(i < nr_frags))
+		ret = LIBETH_XDP_ABORTED;
+
+out:
+	bq->bulk[bq->count - 1].flags |= LIBETH_XDP_TX_LAST;
+
+	return ret;
+}
+
+/**
+ * libeth_xdp_xmit_fill_buf - internal helper to fill one XDP xmit &libeth_sqe
+ * @frm: XDP Tx frame from the bulk
+ * @i: index on the HW queue
+ * @sq: XDPSQ abstraction for the queue
+ * @priv: private data
+ *
+ * Return: XDP Tx descriptor with the mapped DMA and other info to pass to
+ * the driver callback.
+ */
+static inline struct libeth_xdp_tx_desc
+libeth_xdp_xmit_fill_buf(struct libeth_xdp_tx_frame frm, u32 i,
+			 const struct libeth_xdpsq *sq, u64 priv)
+{
+	struct libeth_xdp_tx_desc desc;
+	struct libeth_sqe *sqe;
+	struct xdp_frame *xdpf;
+
+	if (frm.flags & LIBETH_XDP_TX_FIRST) {
+		xdpf = frm.xdpf;
+		desc.addr = *libeth_xdp_xmit_frame_dma(xdpf);
+	} else {
+		xdpf = NULL;
+		desc.addr = frm.dma;
+	}
+	desc.opts = frm.opts;
+
+	sqe = &sq->sqes[i];
+	dma_unmap_addr_set(sqe, dma, desc.addr);
+	dma_unmap_len_set(sqe, len, desc.len);
+
+	if (!xdpf) {
+		sqe->type = LIBETH_SQE_XDP_XMIT_FRAG;
+		return desc;
+	}
+
+	sqe->type = LIBETH_SQE_XDP_XMIT;
+	sqe->xdpf = xdpf;
+	libeth_xdp_tx_fill_stats(sqe, &desc,
+				 xdp_get_shared_info_from_frame(xdpf));
+
+	return desc;
+}
+
+/**
+ * libeth_xdp_xmit_flush_bulk - wrapper to define flush of one XDP xmit bulk
+ * @bq: bulk to flush
+ * @flags: Tx flags, see __libeth_xdp_tx_flush_bulk()
+ * @prep: driver callback to prepare the queue
+ * @xmit: driver callback to fill a HW descriptor
+ *
+ * Use via LIBETH_XDP_DEFINE_FLUSH_XMIT() to define an XDP xmit driver
+ * callback.
+ */
+#define libeth_xdp_xmit_flush_bulk(bq, flags, prep, xmit)		      \
+	__libeth_xdp_tx_flush_bulk(bq, (flags) | LIBETH_XDP_TX_NDO, prep,     \
+				   libeth_xdp_xmit_fill_buf, xmit)
+
+u32 libeth_xdp_xmit_return_bulk(const struct libeth_xdp_tx_frame *bq,
+				u32 count, const struct net_device *dev);
+
+/**
+ * __libeth_xdp_xmit_do_bulk - internal function to implement .ndo_xdp_xmit()
+ * @bq: XDP Tx bulk to queue frames to
+ * @frames: XDP frames passed by the stack
+ * @n: number of frames
+ * @flags: flags passed by the stack
+ * @flush_bulk: driver callback to flush an XDP xmit bulk
+ * @finalize: driver callback to finalize sending XDP Tx frames on the queue
+ *
+ * Perform common checks, map the frags and queue them to the bulk, then flush
+ * the bulk to the XDPSQ. If requested by the stack, finalize the queue.
+ *
+ * Return: number of frames sent or -errno on error.
+ */
+static __always_inline int
+__libeth_xdp_xmit_do_bulk(struct libeth_xdp_tx_bulk *bq,
+			  struct xdp_frame **frames, u32 n, u32 flags,
+			  bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
+					     u32 flags),
+			  void (*finalize)(void *xdpsq, bool sent, bool flush))
+{
+	u32 nxmit = 0;
+
+	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+		return -EINVAL;
+
+	for (u32 i = 0; likely(i < n); i++) {
+		u32 ret;
+
+		ret = libeth_xdp_xmit_queue_bulk(bq, frames[i], flush_bulk);
+		if (unlikely(ret != LIBETH_XDP_TX)) {
+			nxmit += ret == LIBETH_XDP_ABORTED;
+			break;
+		}
+
+		nxmit++;
+	}
+
+	if (bq->count) {
+		flush_bulk(bq, LIBETH_XDP_TX_NDO);
+		if (unlikely(bq->count))
+			nxmit -= libeth_xdp_xmit_return_bulk(bq->bulk,
+							     bq->count,
+							     bq->dev);
+	}
+
+	finalize(bq->xdpsq, nxmit, flags & XDP_XMIT_FLUSH);
+
+	return nxmit;
+}
+
+/**
+ * libeth_xdp_xmit_do_bulk - implement full .ndo_xdp_xmit() in driver
+ * @dev: target &net_device
+ * @n: number of frames to send
+ * @fr: XDP frames to send
+ * @f: flags passed by the stack
+ * @xqs: array of XDPSQs driver structs
+ * @nqs: number of active XDPSQs, the above array length
+ * @fl: driver callback to flush an XDP xmit bulk
+ * @fin: driver callback to finalize the queue
+ *
+ * If the driver has active XDPSQs, perform common checks and send the frames.
+ * Finalize the queue, if requested.
+ *
+ * Return: number of frames sent or -errno on error.
+ */
+#define libeth_xdp_xmit_do_bulk(dev, n, fr, f, xqs, nqs, fl, fin)	      \
+	_libeth_xdp_xmit_do_bulk(dev, n, fr, f, xqs, nqs, fl, fin,	      \
+				 __UNIQUE_ID(bq_), __UNIQUE_ID(ret_),	      \
+				 __UNIQUE_ID(nqs_))
+
+#define _libeth_xdp_xmit_do_bulk(d, n, fr, f, xqs, nqs, fl, fin, ub, ur, un)  \
+({									      \
+	u32 un = (nqs);							      \
+	int ur;								      \
+									      \
+	if (likely(un)) {						      \
+		struct libeth_xdp_tx_bulk ub;				      \
+									      \
+		libeth_xdp_xmit_init_bulk(&ub, d, xqs, un);		      \
+		ur = __libeth_xdp_xmit_do_bulk(&ub, fr, n, f, fl, fin);	      \
+	} else {							      \
+		ur = -ENXIO;						      \
+	}								      \
+									      \
+	ur;								      \
+})
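+
+/*
+ * A sketch of a complete .ndo_xdp_xmit() built on the helpers above. The
+ * drv_* names are hypothetical; the flush callback is assumed to be defined
+ * via LIBETH_XDP_DEFINE_FLUSH_XMIT() further below:
+ *
+ *	static int drv_xdp_xmit(struct net_device *dev, int n,
+ *				struct xdp_frame **frames, u32 flags)
+ *	{
+ *		const struct drv_netdev_priv *priv = netdev_priv(dev);
+ *
+ *		return libeth_xdp_xmit_do_bulk(dev, n, frames, flags,
+ *					       priv->xdpsqs, priv->num_xdpsqs,
+ *					       drv_xmit_flush_bulk,
+ *					       drv_xdp_finalize_sq);
+ *	}
+ */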
+
+/* Rx polling path */
+
+/**
+ * libeth_xdp_tx_init_bulk - initialize an XDP Tx bulk for Rx NAPI poll
+ * @bq: bulk to initialize
+ * @prog: RCU pointer to the XDP program (can be %NULL)
+ * @dev: target &net_device
+ * @xdpsqs: array of driver XDPSQ structs
+ * @num: number of active XDPSQs, the above array length
+ *
+ * Should be called on an onstack XDP Tx bulk before the NAPI polling loop.
+ * Initializes all the needed fields to run libeth_xdp functions. If @num == 0,
+ * assumes XDP is not enabled.
+ */
+#define libeth_xdp_tx_init_bulk(bq, prog, dev, xdpsqs, num)		      \
+	__libeth_xdp_tx_init_bulk(bq, prog, dev, xdpsqs, num,		      \
+				  __UNIQUE_ID(bq_), __UNIQUE_ID(nqs_))
+
+#define __libeth_xdp_tx_init_bulk(bq, pr, d, xdpsqs, num, ub, un) do {	      \
+	typeof(bq) ub = (bq);						      \
+	u32 un = (num);							      \
+									      \
+	rcu_read_lock();						      \
+									      \
+	if (un) {							      \
+		ub->prog = rcu_dereference(pr);				      \
+		ub->dev = (d);						      \
+		ub->xdpsq = (xdpsqs)[libeth_xdpsq_id(un)];		      \
+	} else {							      \
+		ub->prog = NULL;					      \
+	}								      \
+									      \
+	ub->act_mask = 0;						      \
+	ub->count = 0;							      \
+} while (0)
+
+void libeth_xdp_load_stash(struct libeth_xdp_buff *dst,
+			   const struct libeth_xdp_buff_stash *src);
+void libeth_xdp_save_stash(struct libeth_xdp_buff_stash *dst,
+			   const struct libeth_xdp_buff *src);
+void __libeth_xdp_return_stash(struct libeth_xdp_buff_stash *stash);
+
+/**
+ * libeth_xdp_init_buff - initialize a &libeth_xdp_buff for Rx NAPI poll
+ * @dst: onstack buffer to initialize
+ * @src: XDP buffer stash placed on the queue
+ * @rxq: registered &xdp_rxq_info corresponding to this queue
+ *
+ * Should be called before the main NAPI polling loop. Loads the content of
+ * the previously saved stash or initializes the buffer from scratch.
+ */
+static inline void
+libeth_xdp_init_buff(struct libeth_xdp_buff *dst,
+		     const struct libeth_xdp_buff_stash *src,
+		     struct xdp_rxq_info *rxq)
+{
+	if (likely(!src->data))
+		dst->data = NULL;
+	else
+		libeth_xdp_load_stash(dst, src);
+
+	dst->base.rxq = rxq;
+}
+
+/**
+ * libeth_xdp_save_buff - save a partially built buffer on a queue
+ * @dst: XDP buffer stash placed on the queue
+ * @src: onstack buffer to save
+ *
+ * Should be called after the main NAPI polling loop. If the loop exited before
+ * the buffer was finished, saves its content on the queue, so that it can be
+ * completed during the next poll. Otherwise, clears the stash.
+ */
+static inline void libeth_xdp_save_buff(struct libeth_xdp_buff_stash *dst,
+					const struct libeth_xdp_buff *src)
+{
+	if (likely(!src->data))
+		dst->data = NULL;
+	else
+		libeth_xdp_save_stash(dst, src);
+}
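+
+/*
+ * A sketch of how the two helpers above are typically paired with
+ * libeth_xdp_tx_init_bulk() around a NAPI poll. The drv_* names and the
+ * rxq fields are hypothetical; the stash lives on the Rx queue struct:
+ *
+ *	struct libeth_xdp_tx_bulk bq;
+ *	LIBETH_XDP_ONSTACK_BUFF(xdp);
+ *
+ *	libeth_xdp_tx_init_bulk(&bq, rxq->xdp_prog, rxq->netdev, rxq->xdpsqs,
+ *				rxq->num_xdpsqs);
+ *	libeth_xdp_init_buff(xdp, &rxq->xdp_stash, &rxq->xdp_rxq);
+ *
+ *	while (packets < budget) {
+ *		// process one Rx descriptor, see the flow comment below
+ *	}
+ *
+ *	libeth_xdp_save_buff(&rxq->xdp_stash, xdp);
+ *	libeth_xdp_finalize_rx(&bq, drv_xdp_flush_tx, drv_xdp_finalize_sq);
+ */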
+
+/**
+ * libeth_xdp_return_stash - free an XDP buffer stash from a queue
+ * @stash: stash to free
+ *
+ * If the queue is about to be destroyed, but it still has an incomplete
+ * buffer stash, this helper should be called to free it.
+ */
+static inline void libeth_xdp_return_stash(struct libeth_xdp_buff_stash *stash)
+{
+	if (stash->data)
+		__libeth_xdp_return_stash(stash);
+}
+
+static inline void libeth_xdp_return_va(const void *data, bool napi)
+{
+	netmem_ref netmem = virt_to_netmem(data);
+
+	page_pool_put_full_netmem(__netmem_get_pp(netmem), netmem, napi);
+}
+
+static inline void libeth_xdp_return_frags(const struct skb_shared_info *sinfo,
+					   bool napi)
+{
+	for (u32 i = 0; i < sinfo->nr_frags; i++) {
+		netmem_ref netmem = skb_frag_netmem(&sinfo->frags[i]);
+
+		page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, napi);
+	}
+}
+
+/**
+ * libeth_xdp_return_buff - free/recycle &libeth_xdp_buff
+ * @xdp: buffer to free
+ *
+ * Hotpath helper to free a &libeth_xdp_buff. Compared to xdp_return_buff(),
+ * it's faster as it gets inlined and always assumes order-0 pages and safe
+ * direct recycling. Zeroes @xdp->data to avoid UAFs.
+ */
+#define libeth_xdp_return_buff(xdp)	__libeth_xdp_return_buff(xdp, true)
+
+static inline void __libeth_xdp_return_buff(struct libeth_xdp_buff *xdp,
+					    bool napi)
+{
+	if (!xdp_buff_has_frags(&xdp->base))
+		goto out;
+
+	libeth_xdp_return_frags(xdp_get_shared_info_from_buff(&xdp->base),
+				napi);
+
+out:
+	libeth_xdp_return_va(xdp->data, napi);
+	xdp->data = NULL;
+}
+
+bool libeth_xdp_buff_add_frag(struct libeth_xdp_buff *xdp,
+			      const struct libeth_fqe *fqe,
+			      u32 len);
+
+/**
+ * libeth_xdp_prepare_buff - fill &libeth_xdp_buff with head FQE data
+ * @xdp: XDP buffer to attach the head to
+ * @fqe: FQE containing the head buffer
+ * @len: buffer len passed from HW
+ *
+ * Internal, use libeth_xdp_process_buff() instead. Initializes XDP buffer
+ * head with the Rx buffer data: data pointer, length, headroom, and
+ * truesize/tailroom. Zeroes the flags.
+ * Uses faster single u64 write instead of per-field access.
+ */
+static inline void libeth_xdp_prepare_buff(struct libeth_xdp_buff *xdp,
+					   const struct libeth_fqe *fqe,
+					   u32 len)
+{
+	const struct page *page = __netmem_to_page(fqe->netmem);
+
+#ifdef __LIBETH_WORD_ACCESS
+	static_assert(offsetofend(typeof(xdp->base), flags) -
+		      offsetof(typeof(xdp->base), frame_sz) ==
+		      sizeof(u64));
+
+	*(u64 *)&xdp->base.frame_sz = fqe->truesize;
+#else
+	xdp_init_buff(&xdp->base, fqe->truesize, xdp->base.rxq);
+#endif
+	xdp_prepare_buff(&xdp->base, page_address(page) + fqe->offset,
+			 page->pp->p.offset, len, true);
+}
+
+/**
+ * libeth_xdp_process_buff - attach Rx buffer to &libeth_xdp_buff
+ * @xdp: XDP buffer to attach the Rx buffer to
+ * @fqe: Rx buffer to process
+ * @len: received data length from the descriptor
+ *
+ * If the XDP buffer is empty, attaches the Rx buffer as head and initializes
+ * the required fields. Otherwise, attaches the buffer as a frag.
+ * Already performs DMA sync-for-CPU and frame start prefetch
+ * (for head buffers only).
+ *
+ * Return: true on success, false if the descriptor must be skipped (empty or
+ * no space for a new frag).
+ */
+static inline bool libeth_xdp_process_buff(struct libeth_xdp_buff *xdp,
+					   const struct libeth_fqe *fqe,
+					   u32 len)
+{
+	if (!libeth_rx_sync_for_cpu(fqe, len))
+		return false;
+
+	if (xdp->data)
+		return libeth_xdp_buff_add_frag(xdp, fqe, len);
+
+	libeth_xdp_prepare_buff(xdp, fqe, len);
+
+	prefetch(xdp->data);
+
+	return true;
+}
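+
+/*
+ * A per-descriptor usage sketch (drv_* names and the descriptor fields are
+ * hypothetical). Frags are collected into @xdp until the HW signals
+ * end-of-packet, and only then is the buffer run through the XDP prog:
+ *
+ *	fqe = &rxq->fqes[ntc];
+ *	len = le16_get_bits(desc->len_flags, DRV_RX_LEN_M);
+ *
+ *	if (!libeth_xdp_process_buff(xdp, fqe, len))
+ *		goto next;	// nothing to do for this descriptor
+ *
+ *	if (!drv_rx_is_eop(desc))
+ *		goto next;	// frame continues in the next buffer
+ *
+ *	drv_xdp_run(xdp, &bq, napi, &rs, desc);
+ */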
+
+/**
+ * libeth_xdp_buff_stats_frags - update onstack RQ stats with XDP frags info
+ * @ss: onstack stats to update
+ * @xdp: buffer to account
+ *
+ * Internal helper used by __libeth_xdp_run_pass(), do not call directly.
+ * Adds buffer's frags count and total len to the onstack stats.
+ */
+static inline void
+libeth_xdp_buff_stats_frags(struct libeth_rq_napi_stats *ss,
+			    const struct libeth_xdp_buff *xdp)
+{
+	const struct skb_shared_info *sinfo;
+
+	sinfo = xdp_get_shared_info_from_buff(&xdp->base);
+	ss->bytes += sinfo->xdp_frags_size;
+	ss->fragments += sinfo->nr_frags + 1;
+}
+
+u32 libeth_xdp_prog_exception(const struct libeth_xdp_tx_bulk *bq,
+			      struct libeth_xdp_buff *xdp,
+			      enum xdp_action act, int ret);
+
+/**
+ * __libeth_xdp_run_prog - run XDP program on an XDP buffer
+ * @xdp: XDP buffer to run the prog on
+ * @bq: buffer bulk for ``XDP_TX`` queueing
+ *
+ * Internal inline abstraction to run XDP program. Handles ``XDP_DROP``
+ * and ``XDP_REDIRECT`` only, the rest is processed levels up.
+ * Reports an XDP prog exception on errors.
+ *
+ * Return: libeth_xdp prog verdict depending on the prog's verdict.
+ */
+static __always_inline u32
+__libeth_xdp_run_prog(struct libeth_xdp_buff *xdp,
+		      const struct libeth_xdp_tx_bulk *bq)
+{
+	enum xdp_action act;
+
+	act = bpf_prog_run_xdp(bq->prog, &xdp->base);
+	if (unlikely(act < XDP_DROP || act > XDP_REDIRECT))
+		goto out;
+
+	switch (act) {
+	case XDP_PASS:
+		return LIBETH_XDP_PASS;
+	case XDP_DROP:
+		libeth_xdp_return_buff(xdp);
+
+		return LIBETH_XDP_DROP;
+	case XDP_TX:
+		return LIBETH_XDP_TX;
+	case XDP_REDIRECT:
+		if (unlikely(xdp_do_redirect(bq->dev, &xdp->base, bq->prog)))
+			break;
+
+		xdp->data = NULL;
+
+		return LIBETH_XDP_REDIRECT;
+	default:
+		break;
+	}
+
+out:
+	return libeth_xdp_prog_exception(bq, xdp, act, 0);
+}
+
+/**
+ * __libeth_xdp_run_flush - run XDP program and handle ``XDP_TX`` verdict
+ * @xdp: XDP buffer to run the prog on
+ * @bq: buffer bulk for ``XDP_TX`` queueing
+ * @run: internal callback for running XDP program
+ * @queue: internal callback for queuing ``XDP_TX`` frame
+ * @flush_bulk: driver callback for flushing a bulk
+ *
+ * Internal inline abstraction to run XDP program and additionally handle
+ * ``XDP_TX`` verdict.
+ * Do not use directly.
+ *
+ * Return: libeth_xdp prog verdict depending on the prog's verdict.
+ */
+static __always_inline u32
+__libeth_xdp_run_flush(struct libeth_xdp_buff *xdp,
+		       struct libeth_xdp_tx_bulk *bq,
+		       u32 (*run)(struct libeth_xdp_buff *xdp,
+				  const struct libeth_xdp_tx_bulk *bq),
+		       bool (*queue)(struct libeth_xdp_tx_bulk *bq,
+				     struct libeth_xdp_buff *xdp,
+				     bool (*flush_bulk)
+					  (struct libeth_xdp_tx_bulk *bq,
+					   u32 flags)),
+		       bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
+					  u32 flags))
+{
+	u32 act;
+
+	act = run(xdp, bq);
+	if (act == LIBETH_XDP_TX && unlikely(!queue(bq, xdp, flush_bulk)))
+		act = LIBETH_XDP_DROP;
+
+	bq->act_mask |= act;
+
+	return act;
+}
+
+/**
+ * libeth_xdp_run_prog - run XDP program and handle all verdicts
+ * @xdp: XDP buffer to process
+ * @bq: XDP Tx bulk to queue ``XDP_TX`` buffers
+ * @fl: driver ``XDP_TX`` bulk flush callback
+ *
+ * Run the attached XDP program and handle all possible verdicts.
+ * Prefer using it via LIBETH_XDP_DEFINE_RUN{,_PASS,_PROG}().
+ *
+ * Return: true if the buffer should be passed up the stack, false if the poll
+ * should go to the next buffer.
+ */
+#define libeth_xdp_run_prog(xdp, bq, fl)				      \
+	(__libeth_xdp_run_flush(xdp, bq, __libeth_xdp_run_prog,		      \
+				libeth_xdp_tx_queue_bulk,		      \
+				fl) == LIBETH_XDP_PASS)
+
+/**
+ * __libeth_xdp_run_pass - helper to run XDP program and handle the result
+ * @xdp: XDP buffer to process
+ * @bq: XDP Tx bulk to queue ``XDP_TX`` frames
+ * @napi: NAPI to build an skb and pass it up the stack
+ * @rs: onstack libeth RQ stats
+ * @md: metadata that should be filled to the XDP buffer
+ * @prep: callback for filling the metadata
+ * @run: driver wrapper to run XDP program
+ * @populate: driver callback to populate an skb with the HW descriptor data
+ *
+ * Inline abstraction that does the following:
+ * 1) adds frame size and frag number (if needed) to the onstack stats;
+ * 2) fills the descriptor metadata to the onstack &libeth_xdp_buff;
+ * 3) runs XDP program if present;
+ * 4) handles all possible verdicts;
+ * 5) on ``XDP_PASS``, builds an skb from the buffer;
+ * 6) populates it with the descriptor metadata;
+ * 7) passes it up the stack.
+ *
+ * In most cases, step 2 means just writing the pointer to the HW descriptor
+ * to the XDP buffer. If so, please use LIBETH_XDP_DEFINE_RUN{,_PASS}()
+ * wrappers to build a driver function.
+ */
+static __always_inline void
+__libeth_xdp_run_pass(struct libeth_xdp_buff *xdp,
+		      struct libeth_xdp_tx_bulk *bq, struct napi_struct *napi,
+		      struct libeth_rq_napi_stats *rs, const void *md,
+		      void (*prep)(struct libeth_xdp_buff *xdp,
+				   const void *md),
+		      bool (*run)(struct libeth_xdp_buff *xdp,
+				  struct libeth_xdp_tx_bulk *bq),
+		      bool (*populate)(struct sk_buff *skb,
+				       const struct libeth_xdp_buff *xdp,
+				       struct libeth_rq_napi_stats *rs))
+{
+	struct sk_buff *skb;
+
+	rs->bytes += xdp->base.data_end - xdp->data;
+	rs->packets++;
+
+	if (xdp_buff_has_frags(&xdp->base))
+		libeth_xdp_buff_stats_frags(rs, xdp);
+
+	if (prep && (!__builtin_constant_p(!!md) || md))
+		prep(xdp, md);
+
+	if (!bq || !run || !bq->prog)
+		goto build;
+
+	if (!run(xdp, bq))
+		return;
+
+build:
+	skb = xdp_build_skb_from_buff(&xdp->base);
+	if (unlikely(!skb)) {
+		libeth_xdp_return_buff_slow(xdp);
+		return;
+	}
+
+	xdp->data = NULL;
+
+	if (unlikely(!populate(skb, xdp, rs))) {
+		napi_consume_skb(skb, true);
+		return;
+	}
+
+	napi_gro_receive(napi, skb);
+}
+
+static inline void libeth_xdp_prep_desc(struct libeth_xdp_buff *xdp,
+					const void *desc)
+{
+	xdp->desc = desc;
+}
+
+/**
+ * libeth_xdp_run_pass - helper to run XDP program and handle the result
+ * @xdp: XDP buffer to process
+ * @bq: XDP Tx bulk to queue ``XDP_TX`` frames
+ * @napi: NAPI to build an skb and pass it up the stack
+ * @ss: onstack libeth RQ stats
+ * @desc: pointer to the HW descriptor for that frame
+ * @run: driver wrapper to run XDP program
+ * @populate: driver callback to populate an skb with the HW descriptor data
+ *
+ * Wrapper around the underscored version when "fill the descriptor metadata"
+ * means just writing the pointer to the HW descriptor as @xdp->desc.
+ */
+#define libeth_xdp_run_pass(xdp, bq, napi, ss, desc, run, populate)	      \
+	__libeth_xdp_run_pass(xdp, bq, napi, ss, desc, libeth_xdp_prep_desc,  \
+			      run, populate)
+
+/**
+ * libeth_xdp_finalize_rx - finalize XDPSQ after a NAPI polling loop
+ * @bq: ``XDP_TX`` frame bulk
+ * @flush: driver callback to flush the bulk
+ * @finalize: driver callback to start sending the frames and run the timer
+ *
+ * Flush the bulk if there are frames left to send, kick the queue and flush
+ * the XDP maps.
+ */
+#define libeth_xdp_finalize_rx(bq, flush, finalize)			      \
+	__libeth_xdp_finalize_rx(bq, 0, flush, finalize)
+
+static __always_inline void
+__libeth_xdp_finalize_rx(struct libeth_xdp_tx_bulk *bq, u32 flags,
+			 bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
+					    u32 flags),
+			 void (*finalize)(void *xdpsq, bool sent, bool flush))
+{
+	if (bq->act_mask & LIBETH_XDP_TX) {
+		if (bq->count)
+			flush_bulk(bq, flags | LIBETH_XDP_TX_DROP);
+		finalize(bq->xdpsq, true, true);
+	}
+	if (bq->act_mask & LIBETH_XDP_REDIRECT)
+		xdp_do_flush();
+
+	rcu_read_unlock();
+}
+
+/*
+ * Helpers to reduce boilerplate code in drivers.
+ *
+ * Typical driver Rx flow would be (excl. bulk and buff init, frag attach):
+ *
+ * LIBETH_XDP_DEFINE_START();
+ * LIBETH_XDP_DEFINE_FLUSH_TX(static driver_xdp_flush_tx, driver_xdp_tx_prep,
+ *			      driver_xdp_xmit);
+ * LIBETH_XDP_DEFINE_RUN(static driver_xdp_run, driver_xdp_run_prog,
+ *			 driver_xdp_flush_tx, driver_populate_skb);
+ * LIBETH_XDP_DEFINE_FINALIZE(static driver_xdp_finalize_rx,
+ *			      driver_xdp_flush_tx, driver_xdp_finalize_sq);
+ * LIBETH_XDP_DEFINE_END();
+ *
+ * This will build a set of 4 static functions. The compiler is free to decide
+ * whether to inline them.
+ * Then, in the NAPI polling function:
+ *
+ *	while (packets < budget) {
+ *		// ...
+ *		driver_xdp_run(xdp, &bq, napi, &rs, desc);
+ *	}
+ *	driver_xdp_finalize_rx(&bq);
+ */
+
+#define LIBETH_XDP_DEFINE_START()					      \
+	__diag_push();							      \
+	__diag_ignore(GCC, 8, "-Wold-style-declaration",		      \
+		      "Allow specifying \'static\' after the return type")
+
+/**
+ * LIBETH_XDP_DEFINE_TIMER - define a driver XDPSQ cleanup timer callback
+ * @name: name of the function to define
+ * @poll: Tx polling/completion function
+ */
+#define LIBETH_XDP_DEFINE_TIMER(name, poll)				      \
+void name(struct work_struct *work)					      \
+{									      \
+	libeth_xdpsq_run_timer(work, poll);				      \
+}
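+
+/*
+ * Example (hypothetical drv_* names): define the timer callback once,
+ * between LIBETH_XDP_DEFINE_START() and LIBETH_XDP_DEFINE_END(), then hook
+ * it up via libeth_xdpsq_init_timer() when configuring the queue:
+ *
+ *	LIBETH_XDP_DEFINE_TIMER(static drv_xdpsq_clean_timer, drv_xdpsq_poll);
+ *
+ *	// at queue configuration time
+ *	libeth_xdpsq_init_timer(xdpq->timer, xdpq, &xdpq->xdp_lock,
+ *				drv_xdpsq_clean_timer);
+ */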
+
+/**
+ * LIBETH_XDP_DEFINE_FLUSH_TX - define a driver ``XDP_TX`` bulk flush function
+ * @name: name of the function to define
+ * @prep: driver callback to clean an XDPSQ
+ * @xmit: driver callback to write a HW Tx descriptor
+ */
+#define LIBETH_XDP_DEFINE_FLUSH_TX(name, prep, xmit)			      \
+	__LIBETH_XDP_DEFINE_FLUSH_TX(name, prep, xmit, xdp)
+
+#define __LIBETH_XDP_DEFINE_FLUSH_TX(name, prep, xmit, pfx)		      \
+bool name(struct libeth_xdp_tx_bulk *bq, u32 flags)			      \
+{									      \
+	return libeth_##pfx##_tx_flush_bulk(bq, flags, prep, xmit);	      \
+}
+
+/**
+ * LIBETH_XDP_DEFINE_FLUSH_XMIT - define a driver XDP xmit bulk flush function
+ * @name: name of the function to define
+ * @prep: driver callback to clean an XDPSQ
+ * @xmit: driver callback to write a HW Tx descriptor
+ */
+#define LIBETH_XDP_DEFINE_FLUSH_XMIT(name, prep, xmit)			      \
+bool name(struct libeth_xdp_tx_bulk *bq, u32 flags)			      \
+{									      \
+	return libeth_xdp_xmit_flush_bulk(bq, flags, prep, xmit);	      \
+}
+
+/**
+ * LIBETH_XDP_DEFINE_RUN_PROG - define a driver XDP program run function
+ * @name: name of the function to define
+ * @flush: driver callback to flush an ``XDP_TX`` bulk
+ */
+#define LIBETH_XDP_DEFINE_RUN_PROG(name, flush)				      \
+	bool __LIBETH_XDP_DEFINE_RUN_PROG(name, flush, xdp)
+
+#define __LIBETH_XDP_DEFINE_RUN_PROG(name, flush, pfx)			      \
+name(struct libeth_xdp_buff *xdp, struct libeth_xdp_tx_bulk *bq)	      \
+{									      \
+	return libeth_##pfx##_run_prog(xdp, bq, flush);			      \
+}
+
+/**
+ * LIBETH_XDP_DEFINE_RUN_PASS - define a driver buffer process + pass function
+ * @name: name of the function to define
+ * @run: driver callback to run XDP program (above)
+ * @populate: driver callback to fill an skb with HW descriptor info
+ */
+#define LIBETH_XDP_DEFINE_RUN_PASS(name, run, populate)			      \
+	void __LIBETH_XDP_DEFINE_RUN_PASS(name, run, populate, xdp)
+
+#define __LIBETH_XDP_DEFINE_RUN_PASS(name, run, populate, pfx)		      \
+name(struct libeth_xdp_buff *xdp, struct libeth_xdp_tx_bulk *bq,	      \
+     struct napi_struct *napi, struct libeth_rq_napi_stats *ss,		      \
+     const void *desc)							      \
+{									      \
+	return libeth_##pfx##_run_pass(xdp, bq, napi, ss, desc, run,	      \
+				       populate);			      \
+}
+
+/**
+ * LIBETH_XDP_DEFINE_RUN - define a driver buffer process, run + pass function
+ * @name: name of the function to define
+ * @run: name of the XDP prog run function to define
+ * @flush: driver callback to flush an ``XDP_TX`` bulk
+ * @populate: driver callback to fill an skb with HW descriptor info
+ */
+#define LIBETH_XDP_DEFINE_RUN(name, run, flush, populate)		      \
+	__LIBETH_XDP_DEFINE_RUN(name, run, flush, populate, XDP)
+
+#define __LIBETH_XDP_DEFINE_RUN(name, run, flush, populate, pfx)	      \
+	LIBETH_##pfx##_DEFINE_RUN_PROG(static run, flush);		      \
+	LIBETH_##pfx##_DEFINE_RUN_PASS(name, run, populate)
+
+/**
+ * LIBETH_XDP_DEFINE_FINALIZE - define a driver Rx NAPI poll finalize function
+ * @name: name of the function to define
+ * @flush: driver callback to flush an ``XDP_TX`` bulk
+ * @finalize: driver callback to finalize an XDPSQ and run the timer
+ */
+#define LIBETH_XDP_DEFINE_FINALIZE(name, flush, finalize)		      \
+	__LIBETH_XDP_DEFINE_FINALIZE(name, flush, finalize, xdp)
+
+#define __LIBETH_XDP_DEFINE_FINALIZE(name, flush, finalize, pfx)	      \
+void name(struct libeth_xdp_tx_bulk *bq)				      \
+{									      \
+	libeth_##pfx##_finalize_rx(bq, flush, finalize);		      \
+}
+
+#define LIBETH_XDP_DEFINE_END()		__diag_pop()
+
+/* XMO */
+
+/**
+ * libeth_xdp_buff_to_rq - get RQ pointer from an XDP buffer pointer
+ * @xdp: &libeth_xdp_buff corresponding to the queue
+ * @type: typeof() of the driver Rx queue structure
+ * @member: name of &xdp_rxq_info inside @type
+ *
+ * Oftentimes, a pointer to the RQ is needed when reading/filling metadata from
+ * HW descriptors. The helper can be used to quickly jump from an XDP buffer
+ * to the queue corresponding to its &xdp_rxq_info without introducing
+ * additional fields (&libeth_xdp_buff is precisely 1 cacheline long on x64).
+ */
+#define libeth_xdp_buff_to_rq(xdp, type, member)			      \
+	container_of_const((xdp)->base.rxq, type, member)
+
+/**
+ * libeth_xdpmo_rx_hash - convert &libeth_rx_pt to an XDP RSS hash metadata
+ * @hash: pointer to the variable to write the hash to
+ * @rss_type: pointer to the variable to write the hash type to
+ * @val: hash value from the HW descriptor
+ * @pt: libeth parsed packet type
+ *
+ * Handle zeroed/non-available hash and convert libeth parsed packet type to
+ * the corresponding XDP RSS hash type. To be called at the end of
+ * xdp_metadata_ops idpf_xdpmo::xmo_rx_hash() implementation.
+ * Note that if the driver doesn't use a constant packet type lookup table but
+ * generates it at runtime, it must call libeth_rx_pt_gen_hash_type(pt) to
+ * generate XDP RSS hash type for each packet type.
+ *
+ * Return: 0 on success, -ENODATA when the hash is not available.
+ */
+static inline int libeth_xdpmo_rx_hash(u32 *hash,
+				       enum xdp_rss_hash_type *rss_type,
+				       u32 val, struct libeth_rx_pt pt)
+{
+	if (unlikely(!val))
+		return -ENODATA;
+
+	*hash = val;
+	*rss_type = pt.hash_type;
+
+	return 0;
+}
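+
+/*
+ * A sketch of a driver xmo_rx_hash() callback tying the two helpers above
+ * together (drv_* names, the rxq member name and the descriptor fields are
+ * hypothetical):
+ *
+ *	static int drv_xdpmo_rx_hash(const struct xdp_md *ctx, u32 *hash,
+ *				     enum xdp_rss_hash_type *rss_type)
+ *	{
+ *		const struct libeth_xdp_buff *xdp = (const void *)ctx;
+ *		const struct drv_rx_desc *desc = xdp->desc;
+ *		const struct drv_rx_queue *rxq;
+ *		struct libeth_rx_pt pt;
+ *
+ *		rxq = libeth_xdp_buff_to_rq(xdp, typeof(*rxq), xdp_rxq);
+ *		pt = rxq->rx_ptype_lkup[le16_to_cpu(desc->ptype)];
+ *
+ *		return libeth_xdpmo_rx_hash(hash, rss_type,
+ *					    le32_to_cpu(desc->rss_hash), pt);
+ *	}
+ */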
+
+/* Tx buffer completion */
+
+void libeth_xdp_return_buff_bulk(const struct skb_shared_info *sinfo,
+				 struct xdp_frame_bulk *bq, bool frags);
+
+/**
+ * __libeth_xdp_complete_tx - complete sent XDPSQE
+ * @sqe: SQ element / Tx buffer to complete
+ * @cp: Tx polling/completion params
+ * @bulk: internal callback to bulk-free ``XDP_TX`` buffers
+ *
+ * Use the non-underscored version in drivers instead. This one is shared
+ * internally with libeth_tx_complete_any().
+ * Complete an XDPSQE of any type of XDP frame. This includes DMA unmapping
+ * when needed, buffer freeing, stats update, and SQE invalidating.
+ */
+static __always_inline void
+__libeth_xdp_complete_tx(struct libeth_sqe *sqe, struct libeth_cq_pp *cp,
+			 typeof(libeth_xdp_return_buff_bulk) bulk)
+{
+	enum libeth_sqe_type type = sqe->type;
+
+	switch (type) {
+	case LIBETH_SQE_EMPTY:
+		return;
+	case LIBETH_SQE_XDP_XMIT:
+	case LIBETH_SQE_XDP_XMIT_FRAG:
+		dma_unmap_page(cp->dev, dma_unmap_addr(sqe, dma),
+			       dma_unmap_len(sqe, len), DMA_TO_DEVICE);
+		break;
+	default:
+		break;
+	}
+
+	switch (type) {
+	case LIBETH_SQE_XDP_TX:
+		bulk(sqe->sinfo, cp->bq, sqe->nr_frags != 1);
+		break;
+	case LIBETH_SQE_XDP_XMIT:
+		xdp_return_frame_bulk(sqe->xdpf, cp->bq);
+		break;
+	default:
+		break;
+	}
+
+	switch (type) {
+	case LIBETH_SQE_XDP_TX:
+	case LIBETH_SQE_XDP_XMIT:
+		cp->xdp_tx -= sqe->nr_frags;
+
+		cp->xss->packets++;
+		cp->xss->bytes += sqe->bytes;
+		break;
+	default:
+		break;
+	}
+
+	sqe->type = LIBETH_SQE_EMPTY;
+}
+
+static inline void libeth_xdp_complete_tx(struct libeth_sqe *sqe,
+					  struct libeth_cq_pp *cp)
+{
+	__libeth_xdp_complete_tx(sqe, cp, libeth_xdp_return_buff_bulk);
+}
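+
+/*
+ * A sketch of an XDPSQ completion loop using the helper above, where "done"
+ * is the number of completed SQEs reported by the HW and the drv_* names and
+ * queue fields are hypothetical. One &libeth_cq_pp and one XDP frame bulk
+ * serve the whole loop:
+ *
+ *	struct xdp_frame_bulk bq;
+ *	struct libeth_cq_pp cp = {
+ *		.dev	= xdpq->dev,
+ *		.bq	= &bq,
+ *		.xss	= &xdpq->xdp_stats,
+ *		.napi	= true,
+ *	};
+ *
+ *	xdp_frame_bulk_init(&bq);
+ *
+ *	while (done--) {
+ *		libeth_xdp_complete_tx(&xdpq->sqes[ntc], &cp);
+ *		ntc = (ntc + 1) % xdpq->desc_count;
+ *	}
+ *
+ *	xdp_flush_frame_bulk(&bq);
+ */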
+
+/* Misc */
+
+u32 libeth_xdp_queue_threshold(u32 count);
+
+void __libeth_xdp_set_features(struct net_device *dev,
+			       const struct xdp_metadata_ops *xmo);
+void libeth_xdp_set_redirect(struct net_device *dev, bool enable);
+
+/**
+ * libeth_xdp_set_features - set XDP features for netdev
+ * @dev: &net_device to configure
+ * @...: optional params, see __libeth_xdp_set_features()
+ *
+ * Set all the features libeth_xdp supports, including .ndo_xdp_xmit(). As
+ * such, it should be used only when XDPSQs are always available regardless
+ * of whether an XDP prog is attached to @dev.
+ */
+#define libeth_xdp_set_features(dev, ...)				      \
+	CONCATENATE(__libeth_xdp_feat,					      \
+		    COUNT_ARGS(__VA_ARGS__))(dev, ##__VA_ARGS__)
+
+#define __libeth_xdp_feat0(dev)						      \
+	__libeth_xdp_set_features(dev, NULL)
+#define __libeth_xdp_feat1(dev, xmo)					      \
+	__libeth_xdp_set_features(dev, xmo)
+
+/**
+ * libeth_xdp_set_features_noredir - enable all libeth_xdp features w/o redir
+ * @dev: target &net_device
+ * @...: optional params, see __libeth_xdp_set_features()
+ *
+ * Enable everything except the .ndo_xdp_xmit() feature, use when XDPSQs are
+ * not available right after netdev registration.
+ */
+#define libeth_xdp_set_features_noredir(dev, ...)			      \
+	__libeth_xdp_set_features_noredir(dev, __UNIQUE_ID(dev_),	      \
+					  ##__VA_ARGS__)
+
+#define __libeth_xdp_set_features_noredir(dev, ud, ...) do {		      \
+	struct net_device *ud = (dev);					      \
+									      \
+	libeth_xdp_set_features(ud, ##__VA_ARGS__);			      \
+	libeth_xdp_set_redirect(ud, false);				      \
+} while (0)
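+
+/*
+ * Typical usage when XDPSQs exist only while an XDP prog is attached
+ * (the netdev/vport fields and drv_xdpmo are hypothetical):
+ *
+ *	// at netdev registration time
+ *	libeth_xdp_set_features_noredir(netdev, &drv_xdpmo);
+ *
+ *	// after the XDPSQs have been created / before they are destroyed
+ *	libeth_xdp_set_redirect(netdev, vport->num_xdpsqs != 0);
+ */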
+
+#endif /* __LIBETH_XDP_H */
diff --git a/drivers/net/ethernet/intel/libeth/tx.c b/drivers/net/ethernet/intel/libeth/tx.c
new file mode 100644
index 000000000000..227c841ab16a
--- /dev/null
+++ b/drivers/net/ethernet/intel/libeth/tx.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2025 Intel Corporation */
+
+#define DEFAULT_SYMBOL_NAMESPACE	"LIBETH"
+
+#include <net/libeth/xdp.h>
+
+#include "priv.h"
+
+/* Tx buffer completion */
+
+DEFINE_STATIC_CALL_NULL(bulk, libeth_xdp_return_buff_bulk);
+
+/**
+ * libeth_tx_complete_any - perform Tx completion for one SQE of any type
+ * @sqe: Tx buffer to complete
+ * @cp: polling params
+ *
+ * Can be used to complete both regular and XDP SQEs, for example when
+ * destroying queues.
+ * When libeth_xdp is not loaded, XDPSQEs won't be handled.
+ */
+void libeth_tx_complete_any(struct libeth_sqe *sqe, struct libeth_cq_pp *cp)
+{
+	if (sqe->type >= __LIBETH_SQE_XDP_START)
+		__libeth_xdp_complete_tx(sqe, cp, static_call(bulk));
+	else
+		libeth_tx_complete(sqe, cp);
+}
+EXPORT_SYMBOL_GPL(libeth_tx_complete_any);
+
+/* Module */
+
+void libeth_attach_xdp(const struct libeth_xdp_ops *ops)
+{
+	static_call_update(bulk, ops ? ops->bulk : NULL);
+}
+EXPORT_SYMBOL_GPL(libeth_attach_xdp);
diff --git a/drivers/net/ethernet/intel/libeth/xdp.c b/drivers/net/ethernet/intel/libeth/xdp.c
new file mode 100644
index 000000000000..dbede9a696a7
--- /dev/null
+++ b/drivers/net/ethernet/intel/libeth/xdp.c
@@ -0,0 +1,431 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2025 Intel Corporation */
+
+#define DEFAULT_SYMBOL_NAMESPACE	"LIBETH_XDP"
+
+#include <net/libeth/xdp.h>
+
+#include "priv.h"
+
+/* XDPSQ sharing */
+
+DEFINE_STATIC_KEY_FALSE(libeth_xdpsq_share);
+EXPORT_SYMBOL_GPL(libeth_xdpsq_share);
+
+void __libeth_xdpsq_get(struct libeth_xdpsq_lock *lock,
+			const struct net_device *dev)
+{
+	bool warn;
+
+	spin_lock_init(&lock->lock);
+	lock->share = true;
+
+	warn = !static_key_enabled(&libeth_xdpsq_share);
+	static_branch_inc(&libeth_xdpsq_share);
+
+	if (warn && net_ratelimit())
+		netdev_warn(dev, "XDPSQ sharing enabled, possible XDP Tx slowdown\n");
+}
+EXPORT_SYMBOL_GPL(__libeth_xdpsq_get);
+
+void __libeth_xdpsq_put(struct libeth_xdpsq_lock *lock,
+			const struct net_device *dev)
+{
+	static_branch_dec(&libeth_xdpsq_share);
+
+	if (!static_key_enabled(&libeth_xdpsq_share) && net_ratelimit())
+		netdev_notice(dev, "XDPSQ sharing disabled\n");
+
+	lock->share = false;
+}
+EXPORT_SYMBOL_GPL(__libeth_xdpsq_put);
+
+void __acquires(&lock->lock)
+__libeth_xdpsq_lock(struct libeth_xdpsq_lock *lock)
+{
+	spin_lock(&lock->lock);
+}
+EXPORT_SYMBOL_GPL(__libeth_xdpsq_lock);
+
+void __releases(&lock->lock)
+__libeth_xdpsq_unlock(struct libeth_xdpsq_lock *lock)
+{
+	spin_unlock(&lock->lock);
+}
+EXPORT_SYMBOL_GPL(__libeth_xdpsq_unlock);
+
+/* XDPSQ clean-up timers */
+
+/**
+ * libeth_xdpsq_init_timer - initialize an XDPSQ clean-up timer
+ * @timer: timer to initialize
+ * @xdpsq: queue this timer belongs to
+ * @lock: corresponding XDPSQ lock
+ * @poll: queue polling/completion function
+ *
+ * XDPSQ clean-up timers must be set up at queue configuration time, before
+ * the queue is used. Set the required pointers and the cleaning callback.
+ */
+void libeth_xdpsq_init_timer(struct libeth_xdpsq_timer *timer, void *xdpsq,
+			     struct libeth_xdpsq_lock *lock,
+			     void (*poll)(struct work_struct *work))
+{
+	timer->xdpsq = xdpsq;
+	timer->lock = lock;
+
+	INIT_DELAYED_WORK(&timer->dwork, poll);
+}
+EXPORT_SYMBOL_GPL(libeth_xdpsq_init_timer);
+
+/* ``XDP_TX`` bulking */
+
+static void __cold
+libeth_xdp_tx_return_one(const struct libeth_xdp_tx_frame *frm)
+{
+	if (frm->len_fl & LIBETH_XDP_TX_MULTI)
+		libeth_xdp_return_frags(frm->data + frm->soff, true);
+
+	libeth_xdp_return_va(frm->data, true);
+}
+
+static void __cold
+libeth_xdp_tx_return_bulk(const struct libeth_xdp_tx_frame *bq, u32 count)
+{
+	for (u32 i = 0; i < count; i++) {
+		const struct libeth_xdp_tx_frame *frm = &bq[i];
+
+		if (!(frm->len_fl & LIBETH_XDP_TX_FIRST))
+			continue;
+
+		libeth_xdp_tx_return_one(frm);
+	}
+}
+
+static void __cold libeth_trace_xdp_exception(const struct net_device *dev,
+					      const struct bpf_prog *prog,
+					      u32 act)
+{
+	trace_xdp_exception(dev, prog, act);
+}
+
+/**
+ * libeth_xdp_tx_exception - handle Tx exceptions of XDP frames
+ * @bq: XDP Tx frame bulk
+ * @sent: number of frames sent successfully (from this bulk)
+ * @flags: internal libeth_xdp flags (.ndo_xdp_xmit etc.)
+ *
+ * Cold helper used by __libeth_xdp_tx_flush_bulk(), do not call directly.
+ * Reports XDP Tx exceptions, then either frees the frames that won't be sent
+ * or adjusts the Tx bulk to try again later.
+ */
+void __cold libeth_xdp_tx_exception(struct libeth_xdp_tx_bulk *bq, u32 sent,
+				    u32 flags)
+{
+	const struct libeth_xdp_tx_frame *pos = &bq->bulk[sent];
+	u32 left = bq->count - sent;
+
+	if (!(flags & LIBETH_XDP_TX_NDO))
+		libeth_trace_xdp_exception(bq->dev, bq->prog, XDP_TX);
+
+	if (!(flags & LIBETH_XDP_TX_DROP)) {
+		memmove(bq->bulk, pos, left * sizeof(*bq->bulk));
+		bq->count = left;
+
+		return;
+	}
+
+	if (!(flags & LIBETH_XDP_TX_NDO))
+		libeth_xdp_tx_return_bulk(pos, left);
+	else
+		libeth_xdp_xmit_return_bulk(pos, left, bq->dev);
+
+	bq->count = 0;
+}
+EXPORT_SYMBOL_GPL(libeth_xdp_tx_exception);
+
+/* .ndo_xdp_xmit() implementation */
+
+u32 __cold libeth_xdp_xmit_return_bulk(const struct libeth_xdp_tx_frame *bq,
+				       u32 count, const struct net_device *dev)
+{
+	u32 n = 0;
+
+	for (u32 i = 0; i < count; i++) {
+		const struct libeth_xdp_tx_frame *frm = &bq[i];
+		dma_addr_t dma;
+
+		if (frm->flags & LIBETH_XDP_TX_FIRST)
+			dma = *libeth_xdp_xmit_frame_dma(frm->xdpf);
+		else
+			dma = dma_unmap_addr(frm, dma);
+
+		dma_unmap_page(dev->dev.parent, dma, dma_unmap_len(frm, len),
+			       DMA_TO_DEVICE);
+
+		/* Actual xdp_frames are freed by the core */
+		n += !!(frm->flags & LIBETH_XDP_TX_FIRST);
+	}
+
+	return n;
+}
+EXPORT_SYMBOL_GPL(libeth_xdp_xmit_return_bulk);
+
+/* Rx polling path */
+
+/**
+ * libeth_xdp_load_stash - recreate an &xdp_buff from libeth_xdp buffer stash
+ * @dst: target &libeth_xdp_buff to initialize
+ * @src: source stash
+ *
+ * External helper used by libeth_xdp_init_buff(), do not call directly.
+ * Recreate an onstack &libeth_xdp_buff using the stash saved earlier.
+ * The only field untouched (rxq) is initialized later in the
+ * abovementioned function.
+ */
+void libeth_xdp_load_stash(struct libeth_xdp_buff *dst,
+			   const struct libeth_xdp_buff_stash *src)
+{
+	dst->data = src->data;
+	dst->base.data_end = src->data + src->len;
+	dst->base.data_meta = src->data;
+	dst->base.data_hard_start = src->data - src->headroom;
+
+	dst->base.frame_sz = src->frame_sz;
+	dst->base.flags = src->flags;
+}
+EXPORT_SYMBOL_GPL(libeth_xdp_load_stash);
+
+/**
+ * libeth_xdp_save_stash - convert &xdp_buff to a libeth_xdp buffer stash
+ * @dst: target &libeth_xdp_buff_stash to initialize
+ * @src: source XDP buffer
+ *
+ * External helper used by libeth_xdp_save_buff(), do not call directly.
+ * Use the fields from the passed XDP buffer to initialize the stash on the
+ * queue, so that a partially received frame can be finished later during
+ * the next NAPI poll.
+ */
+void libeth_xdp_save_stash(struct libeth_xdp_buff_stash *dst,
+			   const struct libeth_xdp_buff *src)
+{
+	dst->data = src->data;
+	dst->headroom = src->data - src->base.data_hard_start;
+	dst->len = src->base.data_end - src->data;
+
+	dst->frame_sz = src->base.frame_sz;
+	dst->flags = src->base.flags;
+
+	WARN_ON_ONCE(dst->flags != src->base.flags);
+}
+EXPORT_SYMBOL_GPL(libeth_xdp_save_stash);
+
+void __libeth_xdp_return_stash(struct libeth_xdp_buff_stash *stash)
+{
+	LIBETH_XDP_ONSTACK_BUFF(xdp);
+
+	libeth_xdp_load_stash(xdp, stash);
+	libeth_xdp_return_buff_slow(xdp);
+
+	stash->data = NULL;
+}
+EXPORT_SYMBOL_GPL(__libeth_xdp_return_stash);
+
+/**
+ * libeth_xdp_return_buff_slow - free &libeth_xdp_buff
+ * @xdp: buffer to free/return
+ *
+ * Slowpath version of libeth_xdp_return_buff() to be called on exceptions,
+ * queue clean-ups etc., without unwanted inlining.
+ */
+void __cold libeth_xdp_return_buff_slow(struct libeth_xdp_buff *xdp)
+{
+	__libeth_xdp_return_buff(xdp, false);
+}
+EXPORT_SYMBOL_GPL(libeth_xdp_return_buff_slow);
+
+/**
+ * libeth_xdp_buff_add_frag - add frag to XDP buffer
+ * @xdp: head XDP buffer
+ * @fqe: Rx buffer containing the frag
+ * @len: frag length reported by HW
+ *
+ * External helper used by libeth_xdp_process_buff(), do not call directly.
+ * Frees both head and frag buffers on error.
+ *
+ * Return: true on success, false on error (no space for a new frag).
+ */
+bool libeth_xdp_buff_add_frag(struct libeth_xdp_buff *xdp,
+			      const struct libeth_fqe *fqe,
+			      u32 len)
+{
+	netmem_ref netmem = fqe->netmem;
+
+	if (!xdp_buff_add_frag(&xdp->base, netmem,
+			       fqe->offset + netmem_get_pp(netmem)->p.offset,
+			       len, fqe->truesize))
+		goto recycle;
+
+	return true;
+
+recycle:
+	libeth_rx_recycle_slow(netmem);
+	libeth_xdp_return_buff_slow(xdp);
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(libeth_xdp_buff_add_frag);
+
+/**
+ * libeth_xdp_prog_exception - handle XDP prog exceptions
+ * @bq: XDP Tx bulk
+ * @xdp: buffer to process
+ * @act: original XDP prog verdict
+ * @ret: error code if redirect failed
+ *
+ * External helper used by __libeth_xdp_run_prog(), do not call directly.
+ * Reports an invalid @act, emits an XDP exception trace event, and frees
+ * the buffer.
+ *
+ * Return: libeth_xdp XDP prog verdict.
+ */
+u32 __cold libeth_xdp_prog_exception(const struct libeth_xdp_tx_bulk *bq,
+				     struct libeth_xdp_buff *xdp,
+				     enum xdp_action act, int ret)
+{
+	if (act > XDP_REDIRECT)
+		bpf_warn_invalid_xdp_action(bq->dev, bq->prog, act);
+
+	libeth_trace_xdp_exception(bq->dev, bq->prog, act);
+	libeth_xdp_return_buff_slow(xdp);
+
+	return LIBETH_XDP_DROP;
+}
+EXPORT_SYMBOL_GPL(libeth_xdp_prog_exception);
+
+/* Tx buffer completion */
+
+static void libeth_xdp_put_netmem_bulk(netmem_ref netmem,
+				       struct xdp_frame_bulk *bq)
+{
+	if (unlikely(bq->count == XDP_BULK_QUEUE_SIZE))
+		xdp_flush_frame_bulk(bq);
+
+	bq->q[bq->count++] = netmem;
+}
+
+/**
+ * libeth_xdp_return_buff_bulk - free &xdp_buff as part of a bulk
+ * @sinfo: shared info corresponding to the buffer
+ * @bq: XDP frame bulk to store the buffer
+ * @frags: whether the buffer has frags
+ *
+ * Same as xdp_return_frame_bulk(), but for &libeth_xdp_buff. Speeds up Tx
+ * completion of ``XDP_TX`` buffers and allows freeing them in the same bulks
+ * as &xdp_frame buffers.
+ */
+void libeth_xdp_return_buff_bulk(const struct skb_shared_info *sinfo,
+				 struct xdp_frame_bulk *bq, bool frags)
+{
+	if (!frags)
+		goto head;
+
+	for (u32 i = 0; i < sinfo->nr_frags; i++)
+		libeth_xdp_put_netmem_bulk(skb_frag_netmem(&sinfo->frags[i]),
+					   bq);
+
+head:
+	libeth_xdp_put_netmem_bulk(virt_to_netmem(sinfo), bq);
+}
+EXPORT_SYMBOL_GPL(libeth_xdp_return_buff_bulk);
+
+/* Misc */
+
+/**
+ * libeth_xdp_queue_threshold - calculate XDP queue clean/refill threshold
+ * @count: number of descriptors in the queue
+ *
+ * The threshold is the limit at which RQs start to refill (when the number of
+ * empty buffers exceeds it) and SQs get cleaned up (when the number of free
+ * descriptors goes below it). To speed up hotpath processing, the threshold
+ * is always a power of 2, closest to 1/4 of the queue length.
+ * Don't call this on the hotpath; calculate and cache the threshold during
+ * queue initialization.
+ *
+ * Return: the calculated threshold.
+ */
+u32 libeth_xdp_queue_threshold(u32 count)
+{
+	u32 quarter, low, high;
+
+	if (likely(is_power_of_2(count)))
+		return count >> 2;
+
+	quarter = DIV_ROUND_CLOSEST(count, 4);
+	low = rounddown_pow_of_two(quarter);
+	high = roundup_pow_of_two(quarter);
+
+	return high - quarter <= quarter - low ? high : low;
+}
+EXPORT_SYMBOL_GPL(libeth_xdp_queue_threshold);
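+
+/*
+ * For example, a 1000-descriptor queue gives quarter = 250,
+ * rounddown_pow_of_two(250) = 128 and roundup_pow_of_two(250) = 256;
+ * since 256 - 250 <= 250 - 128, the threshold is 256. A power-of-2 queue,
+ * e.g. 512 descriptors, simply gets 512 >> 2 = 128.
+ */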
+
+/**
+ * __libeth_xdp_set_features - set XDP features for netdev
+ * @dev: &net_device to configure
+ * @xmo: XDP metadata ops (Rx hints)
+ *
+ * Set all the features libeth_xdp supports. Only the first argument is
+ * necessary; @xmo may be %NULL when there are no Rx hints to advertise.
+ * Use the non-underscored versions in drivers instead.
+ */
+void __libeth_xdp_set_features(struct net_device *dev,
+			       const struct xdp_metadata_ops *xmo)
+{
+	xdp_set_features_flag(dev,
+			      NETDEV_XDP_ACT_BASIC |
+			      NETDEV_XDP_ACT_REDIRECT |
+			      NETDEV_XDP_ACT_NDO_XMIT |
+			      NETDEV_XDP_ACT_RX_SG |
+			      NETDEV_XDP_ACT_NDO_XMIT_SG);
+	dev->xdp_metadata_ops = xmo;
+}
+EXPORT_SYMBOL_GPL(__libeth_xdp_set_features);
+
+/**
+ * libeth_xdp_set_redirect - toggle the XDP redirect feature
+ * @dev: &net_device to configure
+ * @enable: whether XDP is enabled
+ *
+ * Use this to dynamically enable and disable the redirect feature when
+ * XDPSQs are not always available.
+ */
+void libeth_xdp_set_redirect(struct net_device *dev, bool enable)
+{
+	if (enable)
+		xdp_features_set_redirect_target(dev, true);
+	else
+		xdp_features_clear_redirect_target(dev);
+}
+EXPORT_SYMBOL_GPL(libeth_xdp_set_redirect);
+
+/* Module */
+
+static const struct libeth_xdp_ops xdp_ops __initconst = {
+	.bulk	= libeth_xdp_return_buff_bulk,
+};
+
+static int __init libeth_xdp_module_init(void)
+{
+	libeth_attach_xdp(&xdp_ops);
+
+	return 0;
+}
+module_init(libeth_xdp_module_init);
+
+static void __exit libeth_xdp_module_exit(void)
+{
+	libeth_detach_xdp();
+}
+module_exit(libeth_xdp_module_exit);
+
+MODULE_DESCRIPTION("Common Ethernet library - XDP infra");
+MODULE_IMPORT_NS("LIBETH");
+MODULE_LICENSE("GPL");
-- 
2.48.1



* [PATCH net-next 04/16] libeth: add XSk helpers
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (2 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp) Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-07 10:15   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 05/16] idpf: fix Rx descriptor ready check barrier in splitq Alexander Lobakin
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

Add the following counterparts of functions from libeth_xdp which need
special care on XSk path:

* building &xdp_buff (head and frags);
* running XDP prog and managing all possible verdicts;
* xmit (with S/G and metadata support);
* wakeup via CSD/IPI;
* FQ init/deinit and refilling.

Xmit by default unrolls loops by 8 when filling Tx DMA descriptors.
XDP_REDIRECT verdict is considered default/likely(). Rx frags are
considered unlikely().
It is assumed that Tx/completion queues are not mapped to any
interrupts, thus we clean them only when needed (=> 3/4 of
descriptors is busy) and keep need_wakeup set.
IPI for XSk wakeup showed better performance than triggering an SW
NIC interrupt, though it doesn't respect NIC's interrupt affinity.

Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # lots of stuff
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/libeth/Kconfig  |   2 +-
 drivers/net/ethernet/intel/libeth/Makefile |   1 +
 drivers/net/ethernet/intel/libeth/priv.h   |  11 +
 include/net/libeth/tx.h                    |  10 +-
 include/net/libeth/xdp.h                   |  90 ++-
 include/net/libeth/xsk.h                   | 685 +++++++++++++++++++++
 drivers/net/ethernet/intel/libeth/tx.c     |   5 +-
 drivers/net/ethernet/intel/libeth/xdp.c    |  26 +-
 drivers/net/ethernet/intel/libeth/xsk.c    | 269 ++++++++
 9 files changed, 1067 insertions(+), 32 deletions(-)
 create mode 100644 include/net/libeth/xsk.h
 create mode 100644 drivers/net/ethernet/intel/libeth/xsk.c

diff --git a/drivers/net/ethernet/intel/libeth/Kconfig b/drivers/net/ethernet/intel/libeth/Kconfig
index d8c4926574fb..2445b979c499 100644
--- a/drivers/net/ethernet/intel/libeth/Kconfig
+++ b/drivers/net/ethernet/intel/libeth/Kconfig
@@ -12,4 +12,4 @@ config LIBETH_XDP
 	tristate "Common XDP library (libeth_xdp)" if COMPILE_TEST
 	select LIBETH
 	help
-	  XDP helpers based on libeth hotpath management.
+	  XDP and XSk helpers based on libeth hotpath management.
diff --git a/drivers/net/ethernet/intel/libeth/Makefile b/drivers/net/ethernet/intel/libeth/Makefile
index 51669840ee06..350bc0b38bad 100644
--- a/drivers/net/ethernet/intel/libeth/Makefile
+++ b/drivers/net/ethernet/intel/libeth/Makefile
@@ -9,3 +9,4 @@ libeth-y			+= tx.o
 obj-$(CONFIG_LIBETH_XDP)	+= libeth_xdp.o
 
 libeth_xdp-y			+= xdp.o
+libeth_xdp-y			+= xsk.o
diff --git a/drivers/net/ethernet/intel/libeth/priv.h b/drivers/net/ethernet/intel/libeth/priv.h
index 1bd6e2d7a3e7..9b811d31015c 100644
--- a/drivers/net/ethernet/intel/libeth/priv.h
+++ b/drivers/net/ethernet/intel/libeth/priv.h
@@ -8,12 +8,23 @@
 
 /* XDP */
 
+enum xdp_action;
+struct libeth_xdp_buff;
+struct libeth_xdp_tx_frame;
 struct skb_shared_info;
 struct xdp_frame_bulk;
 
+extern const struct xsk_tx_metadata_ops libeth_xsktmo_slow;
+
+void libeth_xsk_tx_return_bulk(const struct libeth_xdp_tx_frame *bq,
+			       u32 count);
+u32 libeth_xsk_prog_exception(struct libeth_xdp_buff *xdp, enum xdp_action act,
+			      int ret);
+
 struct libeth_xdp_ops {
 	void	(*bulk)(const struct skb_shared_info *sinfo,
 			struct xdp_frame_bulk *bq, bool frags);
+	void	(*xsk)(struct libeth_xdp_buff *xdp);
 };
 
 void libeth_attach_xdp(const struct libeth_xdp_ops *ops);
diff --git a/include/net/libeth/tx.h b/include/net/libeth/tx.h
index c3459917330e..c3db5c6f1641 100644
--- a/include/net/libeth/tx.h
+++ b/include/net/libeth/tx.h
@@ -12,7 +12,7 @@
 
 /**
  * enum libeth_sqe_type - type of &libeth_sqe to act on Tx completion
- * @LIBETH_SQE_EMPTY: unused/empty OR XDP_TX, no action required
+ * @LIBETH_SQE_EMPTY: unused/empty OR XDP_TX/XSk frame, no action required
  * @LIBETH_SQE_CTX: context descriptor with empty SQE, no action required
  * @LIBETH_SQE_SLAB: kmalloc-allocated buffer, unmap and kfree()
  * @LIBETH_SQE_FRAG: mapped skb frag, only unmap DMA
@@ -21,6 +21,8 @@
  * @LIBETH_SQE_XDP_TX: &skb_shared_info, libeth_xdp_return_buff_bulk(), stats
  * @LIBETH_SQE_XDP_XMIT: &xdp_frame, unmap and xdp_return_frame_bulk(), stats
  * @LIBETH_SQE_XDP_XMIT_FRAG: &xdp_frame frag, only unmap DMA
+ * @LIBETH_SQE_XSK_TX: &libeth_xdp_buff on XSk queue, xsk_buff_free(), stats
+ * @LIBETH_SQE_XSK_TX_FRAG: &libeth_xdp_buff frag on XSk queue, xsk_buff_free()
  */
 enum libeth_sqe_type {
 	LIBETH_SQE_EMPTY		= 0U,
@@ -33,6 +35,8 @@ enum libeth_sqe_type {
 	LIBETH_SQE_XDP_TX		= __LIBETH_SQE_XDP_START,
 	LIBETH_SQE_XDP_XMIT,
 	LIBETH_SQE_XDP_XMIT_FRAG,
+	LIBETH_SQE_XSK_TX,
+	LIBETH_SQE_XSK_TX_FRAG,
 };
 
 /**
@@ -43,6 +47,7 @@ enum libeth_sqe_type {
  * @skb: &sk_buff to consume
  * @sinfo: skb shared info of an XDP_TX frame
  * @xdpf: XDP frame from ::ndo_xdp_xmit()
+ * @xsk: XSk Rx frame from XDP_TX action
  * @dma: DMA address to unmap
  * @len: length of the mapped region to unmap
  * @nr_frags: number of frags in the frame this buffer belongs to
@@ -59,6 +64,7 @@ struct libeth_sqe {
 		struct sk_buff			*skb;
 		struct skb_shared_info		*sinfo;
 		struct xdp_frame		*xdpf;
+		struct libeth_xdp_buff		*xsk;
 	};
 
 	DEFINE_DMA_UNMAP_ADDR(dma);
@@ -87,7 +93,7 @@ struct libeth_sqe {
  * @bq: XDP frame bulk to combine return operations
  * @ss: onstack NAPI stats to fill
  * @xss: onstack XDPSQ NAPI stats to fill
- * @xdp_tx: number of XDP frames processed
+ * @xdp_tx: number of XDP-not-XSk frames processed
  * @napi: whether it's called from the NAPI context
  *
  * libeth uses this structure to access objects needed for performing full
diff --git a/include/net/libeth/xdp.h b/include/net/libeth/xdp.h
index 1039cd5d8a56..bef9dda690f0 100644
--- a/include/net/libeth/xdp.h
+++ b/include/net/libeth/xdp.h
@@ -276,6 +276,7 @@ libeth_xdpsq_run_timer(struct work_struct *work,
  * @LIBETH_XDP_TX_BATCH: batch size for which the queue fill loop is unrolled
  * @LIBETH_XDP_TX_DROP: indicates the send function must drop frames not sent
  * @LIBETH_XDP_TX_NDO: whether the send function is called from .ndo_xdp_xmit()
+ * @LIBETH_XDP_TX_XSK: whether the function is called for ``XDP_TX`` for XSk
  */
 enum {
 	LIBETH_XDP_TX_BULK		= DEV_MAP_BULK_SIZE,
@@ -283,11 +284,14 @@ enum {
 
 	LIBETH_XDP_TX_DROP		= BIT(0),
 	LIBETH_XDP_TX_NDO		= BIT(1),
+	LIBETH_XDP_TX_XSK		= BIT(2),
 };
 
 /**
  * enum - &libeth_xdp_tx_frame and &libeth_xdp_tx_desc flags
  * @LIBETH_XDP_TX_LEN: only for ``XDP_TX``, [15:0] of ::len_fl is actual length
+ * @LIBETH_XDP_TX_CSUM: for XSk xmit, enable checksum offload
+ * @LIBETH_XDP_TX_XSKMD: for XSk xmit, mask of the metadata bits
  * @LIBETH_XDP_TX_FIRST: indicates the frag is the first one of the frame
  * @LIBETH_XDP_TX_LAST: whether the frag is the last one of the frame
  * @LIBETH_XDP_TX_MULTI: whether the frame contains several frags
@@ -296,6 +300,9 @@ enum {
 enum {
 	LIBETH_XDP_TX_LEN		= GENMASK(15, 0),
 
+	LIBETH_XDP_TX_CSUM		= XDP_TXMD_FLAGS_CHECKSUM,
+	LIBETH_XDP_TX_XSKMD		= LIBETH_XDP_TX_LEN,
+
 	LIBETH_XDP_TX_FIRST		= BIT(16),
 	LIBETH_XDP_TX_LAST		= BIT(17),
 	LIBETH_XDP_TX_MULTI		= BIT(18),
@@ -311,9 +318,11 @@ enum {
  * @frag: one (non-head) frag for ``XDP_TX``
  * @xdpf: &xdp_frame for the head frag for .ndo_xdp_xmit()
  * @dma: DMA address of the non-head frag for .ndo_xdp_xmit()
- * @len: frag length for .ndo_xdp_xmit()
+ * @xsk: ``XDP_TX`` for XSk, XDP buffer for any frag
+ * @len: frag length for XSk ``XDP_TX`` and .ndo_xdp_xmit()
  * @flags: Tx flags for the above
  * @opts: combined @len + @flags for the above for speed
+ * @desc: XSk xmit descriptor for direct casting
  */
 struct libeth_xdp_tx_frame {
 	union {
@@ -327,11 +336,13 @@ struct libeth_xdp_tx_frame {
 		/* ``XDP_TX`` frag */
 		skb_frag_t			frag;
 
-		/* .ndo_xdp_xmit() */
+		/* .ndo_xdp_xmit(), XSk ``XDP_TX`` */
 		struct {
 			union {
 				struct xdp_frame		*xdpf;
 				dma_addr_t			dma;
+
+				struct libeth_xdp_buff		*xsk;
 			};
 			union {
 				struct {
@@ -341,10 +352,14 @@ struct libeth_xdp_tx_frame {
 				aligned_u64			opts;
 			};
 		};
+
+		/* XSk xmit */
+		struct xdp_desc			desc;
 	};
 } __aligned(sizeof(struct xdp_desc));
 static_assert(offsetof(struct libeth_xdp_tx_frame, frag.len) ==
 	      offsetof(struct libeth_xdp_tx_frame, len_fl));
+static_assert(sizeof(struct libeth_xdp_tx_frame) == sizeof(struct xdp_desc));
 
 /**
  * struct libeth_xdp_tx_bulk - XDP Tx frame bulk for bulk sending
@@ -355,10 +370,13 @@ static_assert(offsetof(struct libeth_xdp_tx_frame, frag.len) ==
  * @count: current number of frames in @bulk
  * @bulk: array of queued frames for bulk Tx
  *
- * All XDP Tx operations queue each frame to the bulk first and flush it
- * when @count reaches the array end. Bulk is always placed on the stack
- * for performance. One bulk element contains all the data necessary
+ * All XDP Tx operations except XSk xmit queue each frame to the bulk first
+ * and flush it when @count reaches the array end. Bulk is always placed on
+ * the stack for performance. One bulk element contains all the data necessary
  * for sending a frame and then freeing it on completion.
+ * For XSk xmit, the Tx descriptor array from &xsk_buff_pool is cast directly
+ * to &libeth_xdp_tx_frame as they are compatible and the bulk structure is
+ * not used.
  */
 struct libeth_xdp_tx_bulk {
 	const struct bpf_prog		*prog;
@@ -372,12 +390,13 @@ struct libeth_xdp_tx_bulk {
 
 /**
  * struct libeth_xdpsq - abstraction for an XDPSQ
+ * @pool: XSk buffer pool for XSk ``XDP_TX`` and xmit
  * @sqes: array of Tx buffers from the actual queue struct
  * @descs: opaque pointer to the HW descriptor array
  * @ntu: pointer to the next free descriptor index
  * @count: number of descriptors on that queue
  * @pending: pointer to the number of sent-not-completed descs on that queue
- * @xdp_tx: pointer to the above
+ * @xdp_tx: pointer to the above, but only for non-XSk-xmit frames
  * @lock: corresponding XDPSQ lock
  *
  * Abstraction for driver-independent implementation of Tx. Placed on the stack
@@ -385,6 +404,7 @@ struct libeth_xdp_tx_bulk {
  * functions can access and modify driver-specific resources.
  */
 struct libeth_xdpsq {
+	struct xsk_buff_pool		*pool;
 	struct libeth_sqe		*sqes;
 	void				*descs;
 
@@ -468,10 +488,11 @@ struct libeth_xdp_tx_desc {
  * @xmit: callback for filling a HW descriptor with the frame info
  *
  * Internal abstraction for placing @n XDP Tx frames on the HW XDPSQ. Used for
- * all types of frames: ``XDP_TX`` and .ndo_xdp_xmit().
+ * all types of frames: ``XDP_TX``, .ndo_xdp_xmit(), XSk ``XDP_TX``, and XSk
+ * xmit.
  * @prep must lock the queue as this function releases it at the end. @unroll
- * greatly increases the object code size, but also greatly increases
- * performance.
+ * greatly increases the object code size, but also greatly increases XSk xmit
+ * performance; for other types of frames, it's not enabled.
  * The compilers inline all those onstack abstractions to direct data accesses.
  *
  * Return: number of frames actually placed on the queue, <= @n. The function
@@ -726,12 +747,13 @@ void libeth_xdp_tx_exception(struct libeth_xdp_tx_bulk *bq, u32 sent,
 /**
  * __libeth_xdp_tx_flush_bulk - internal helper to flush one XDP Tx bulk
  * @bq: bulk to flush
- * @flags: XDP TX flags (.ndo_xdp_xmit(), etc.)
+ * @flags: XDP TX flags (.ndo_xdp_xmit(), XSk etc.)
  * @prep: driver-specific callback to prepare the queue for sending
  * @fill: libeth_xdp callback to fill &libeth_sqe and &libeth_xdp_tx_desc
  * @xmit: driver callback to fill a HW descriptor
  *
- * Internal abstraction to create bulk flush functions for drivers.
+ * Internal abstraction to create bulk flush functions for drivers. Used for
+ * everything except XSk xmit.
  *
  * Return: true if anything was sent, false otherwise.
  */
@@ -1104,18 +1126,19 @@ __libeth_xdp_xmit_do_bulk(struct libeth_xdp_tx_bulk *bq,
  * Should be called on an onstack XDP Tx bulk before the NAPI polling loop.
  * Initializes all the needed fields to run libeth_xdp functions. If @num == 0,
  * assumes XDP is not enabled.
+ * Do not use for XSk; it has its own optimized helper.
  */
 #define libeth_xdp_tx_init_bulk(bq, prog, dev, xdpsqs, num)		      \
-	__libeth_xdp_tx_init_bulk(bq, prog, dev, xdpsqs, num,		      \
+	__libeth_xdp_tx_init_bulk(bq, prog, dev, xdpsqs, num, false,	      \
 				  __UNIQUE_ID(bq_), __UNIQUE_ID(nqs_))
 
-#define __libeth_xdp_tx_init_bulk(bq, pr, d, xdpsqs, num, ub, un) do {	      \
+#define __libeth_xdp_tx_init_bulk(bq, pr, d, xdpsqs, num, xsk, ub, un) do {   \
 	typeof(bq) ub = (bq);						      \
 	u32 un = (num);							      \
 									      \
 	rcu_read_lock();						      \
 									      \
-	if (un) {							      \
+	if (un || (xsk)) {						      \
 		ub->prog = rcu_dereference(pr);				      \
 		ub->dev = (d);						      \
 		ub->xdpsq = (xdpsqs)[libeth_xdpsq_id(un)];		      \
@@ -1141,6 +1164,7 @@ void __libeth_xdp_return_stash(struct libeth_xdp_buff_stash *stash);
  *
  * Should be called before the main NAPI polling loop. Loads the content of
  * the previously saved stash or initializes the buffer from scratch.
+ * Do not use for XSk.
  */
 static inline void
 libeth_xdp_init_buff(struct libeth_xdp_buff *dst,
@@ -1369,7 +1393,7 @@ __libeth_xdp_run_prog(struct libeth_xdp_buff *xdp,
  * @flush_bulk: driver callback for flushing a bulk
  *
  * Internal inline abstraction to run XDP program and additionally handle
- * ``XDP_TX`` verdict.
+ * ``XDP_TX`` verdict. Used by both XDP and XSk, hence @run and @queue.
  * Do not use directly.
  *
  * Return: libeth_xdp prog verdict depending on the prog's verdict.
@@ -1399,12 +1423,13 @@ __libeth_xdp_run_flush(struct libeth_xdp_buff *xdp,
 }
 
 /**
- * libeth_xdp_run_prog - run XDP program and handle all verdicts
+ * libeth_xdp_run_prog - run XDP program (non-XSk path) and handle all verdicts
  * @xdp: XDP buffer to process
  * @bq: XDP Tx bulk to queue ``XDP_TX`` buffers
  * @fl: driver ``XDP_TX`` bulk flush callback
  *
- * Run the attached XDP program and handle all possible verdicts.
+ * Run the attached XDP program and handle all possible verdicts. XSk has its
+ * own version.
  * Prefer using it via LIBETH_XDP_DEFINE_RUN{,_PASS,_PROG}().
  *
  * Return: true if the buffer should be passed up the stack, false if the poll
@@ -1426,7 +1451,7 @@ __libeth_xdp_run_flush(struct libeth_xdp_buff *xdp,
  * @run: driver wrapper to run XDP program
  * @populate: driver callback to populate an skb with the HW descriptor data
  *
- * Inline abstraction that does the following:
+ * Inline abstraction that does the following (non-XSk path):
  * 1) adds frame size and frag number (if needed) to the onstack stats;
  * 2) fills the descriptor metadata to the onstack &libeth_xdp_buff
  * 3) runs XDP program if present;
@@ -1509,7 +1534,7 @@ static inline void libeth_xdp_prep_desc(struct libeth_xdp_buff *xdp,
 			      run, populate)
 
 /**
- * libeth_xdp_finalize_rx - finalize XDPSQ after a NAPI polling loop
+ * libeth_xdp_finalize_rx - finalize XDPSQ after a NAPI polling loop (non-XSk)
  * @bq: ``XDP_TX`` frame bulk
  * @flush: driver callback to flush the bulk
  * @finalize: driver callback to start sending the frames and run the timer
@@ -1717,12 +1742,14 @@ static inline int libeth_xdpmo_rx_hash(u32 *hash,
 
 void libeth_xdp_return_buff_bulk(const struct skb_shared_info *sinfo,
 				 struct xdp_frame_bulk *bq, bool frags);
+void libeth_xsk_buff_free_slow(struct libeth_xdp_buff *xdp);
 
 /**
  * __libeth_xdp_complete_tx - complete sent XDPSQE
  * @sqe: SQ element / Tx buffer to complete
  * @cp: Tx polling/completion params
  * @bulk: internal callback to bulk-free ``XDP_TX`` buffers
+ * @xsk: internal callback to free XSk ``XDP_TX`` buffers
  *
  * Use the non-underscored version in drivers instead. This one is shared
  * internally with libeth_tx_complete_any().
@@ -1731,7 +1758,8 @@ void libeth_xdp_return_buff_bulk(const struct skb_shared_info *sinfo,
  */
 static __always_inline void
 __libeth_xdp_complete_tx(struct libeth_sqe *sqe, struct libeth_cq_pp *cp,
-			 typeof(libeth_xdp_return_buff_bulk) bulk)
+			 typeof(libeth_xdp_return_buff_bulk) bulk,
+			 typeof(libeth_xsk_buff_free_slow) xsk)
 {
 	enum libeth_sqe_type type = sqe->type;
 
@@ -1754,6 +1782,10 @@ __libeth_xdp_complete_tx(struct libeth_sqe *sqe, struct libeth_cq_pp *cp,
 	case LIBETH_SQE_XDP_XMIT:
 		xdp_return_frame_bulk(sqe->xdpf, cp->bq);
 		break;
+	case LIBETH_SQE_XSK_TX:
+	case LIBETH_SQE_XSK_TX_FRAG:
+		xsk(sqe->xsk);
+		break;
 	default:
 		break;
 	}
@@ -1761,6 +1793,7 @@ __libeth_xdp_complete_tx(struct libeth_sqe *sqe, struct libeth_cq_pp *cp,
 	switch (type) {
 	case LIBETH_SQE_XDP_TX:
 	case LIBETH_SQE_XDP_XMIT:
+	case LIBETH_SQE_XSK_TX:
 		cp->xdp_tx -= sqe->nr_frags;
 
 		cp->xss->packets++;
@@ -1776,7 +1809,8 @@ __libeth_xdp_complete_tx(struct libeth_sqe *sqe, struct libeth_cq_pp *cp,
 static inline void libeth_xdp_complete_tx(struct libeth_sqe *sqe,
 					  struct libeth_cq_pp *cp)
 {
-	__libeth_xdp_complete_tx(sqe, cp, libeth_xdp_return_buff_bulk);
+	__libeth_xdp_complete_tx(sqe, cp, libeth_xdp_return_buff_bulk,
+				 libeth_xsk_buff_free_slow);
 }
 
 /* Misc */
@@ -1784,7 +1818,9 @@ static inline void libeth_xdp_complete_tx(struct libeth_sqe *sqe,
 u32 libeth_xdp_queue_threshold(u32 count);
 
 void __libeth_xdp_set_features(struct net_device *dev,
-			       const struct xdp_metadata_ops *xmo);
+			       const struct xdp_metadata_ops *xmo,
+			       u32 zc_segs,
+			       const struct xsk_tx_metadata_ops *tmo);
 void libeth_xdp_set_redirect(struct net_device *dev, bool enable);
 
 /**
@@ -1801,9 +1837,13 @@ void libeth_xdp_set_redirect(struct net_device *dev, bool enable);
 		    COUNT_ARGS(__VA_ARGS__))(dev, ##__VA_ARGS__)
 
 #define __libeth_xdp_feat0(dev)						      \
-	__libeth_xdp_set_features(dev, NULL)
+	__libeth_xdp_set_features(dev, NULL, 0, NULL)
 #define __libeth_xdp_feat1(dev, xmo)					      \
-	__libeth_xdp_set_features(dev, xmo)
+	__libeth_xdp_set_features(dev, xmo, 0, NULL)
+#define __libeth_xdp_feat2(dev, xmo, zc_segs)				      \
+	__libeth_xdp_set_features(dev, xmo, zc_segs, NULL)
+#define __libeth_xdp_feat3(dev, xmo, zc_segs, tmo)			      \
+	__libeth_xdp_set_features(dev, xmo, zc_segs, tmo)
 
 /**
  * libeth_xdp_set_features_noredir - enable all libeth_xdp features w/o redir
@@ -1824,4 +1864,6 @@ void libeth_xdp_set_redirect(struct net_device *dev, bool enable);
 	libeth_xdp_set_redirect(ud, false);				      \
 } while (0)
 
+#define libeth_xsktmo			((const void *)GOLDEN_RATIO_PRIME)
+
 #endif /* __LIBETH_XDP_H */
diff --git a/include/net/libeth/xsk.h b/include/net/libeth/xsk.h
new file mode 100644
index 000000000000..481a7b28e6f2
--- /dev/null
+++ b/include/net/libeth/xsk.h
@@ -0,0 +1,685 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2025 Intel Corporation */
+
+#ifndef __LIBETH_XSK_H
+#define __LIBETH_XSK_H
+
+#include <net/libeth/xdp.h>
+#include <net/xdp_sock_drv.h>
+
+/* ``XDP_TXMD_FLAGS_VALID`` is defined only under ``CONFIG_XDP_SOCKETS`` */
+#ifdef XDP_TXMD_FLAGS_VALID
+static_assert(XDP_TXMD_FLAGS_VALID <= LIBETH_XDP_TX_XSKMD);
+#endif
+
+/* ``XDP_TX`` bulking */
+
+/**
+ * libeth_xsk_tx_queue_head - internal helper for queueing XSk ``XDP_TX`` head
+ * @bq: XDP Tx bulk to queue the head frag to
+ * @xdp: XSk buffer with the head to queue
+ *
+ * Return: false if it's the only frag of the frame, true if it's an S/G frame.
+ */
+static inline bool libeth_xsk_tx_queue_head(struct libeth_xdp_tx_bulk *bq,
+					    struct libeth_xdp_buff *xdp)
+{
+	bq->bulk[bq->count++] = (typeof(*bq->bulk)){
+		.xsk	= xdp,
+		__libeth_xdp_tx_len(xdp->base.data_end - xdp->data,
+				    LIBETH_XDP_TX_FIRST),
+	};
+
+	if (likely(!xdp_buff_has_frags(&xdp->base)))
+		return false;
+
+	bq->bulk[bq->count - 1].flags |= LIBETH_XDP_TX_MULTI;
+
+	return true;
+}
+
+/**
+ * libeth_xsk_tx_queue_frag - internal helper for queueing XSk ``XDP_TX`` frag
+ * @bq: XDP Tx bulk to queue the frag to
+ * @frag: XSk frag to queue
+ */
+static inline void libeth_xsk_tx_queue_frag(struct libeth_xdp_tx_bulk *bq,
+					    struct libeth_xdp_buff *frag)
+{
+	bq->bulk[bq->count++] = (typeof(*bq->bulk)){
+		.xsk	= frag,
+		__libeth_xdp_tx_len(frag->base.data_end - frag->data),
+	};
+}
+
+/**
+ * libeth_xsk_tx_queue_bulk - internal helper for queueing XSk ``XDP_TX`` frame
+ * @bq: XDP Tx bulk to queue the frame to
+ * @xdp: XSk buffer to queue
+ * @flush_bulk: driver callback to flush the bulk to the HW queue
+ *
+ * Return: true on success, false on flush error.
+ */
+static __always_inline bool
+libeth_xsk_tx_queue_bulk(struct libeth_xdp_tx_bulk *bq,
+			 struct libeth_xdp_buff *xdp,
+			 bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
+					    u32 flags))
+{
+	bool ret = true;
+
+	if (unlikely(bq->count == LIBETH_XDP_TX_BULK) &&
+	    unlikely(!flush_bulk(bq, LIBETH_XDP_TX_XSK))) {
+		libeth_xsk_buff_free_slow(xdp);
+		return false;
+	}
+
+	if (!libeth_xsk_tx_queue_head(bq, xdp))
+		goto out;
+
+	for (const struct libeth_xdp_buff *head = xdp; ; ) {
+		xdp = container_of(xsk_buff_get_frag(&head->base),
+				   typeof(*xdp), base);
+		if (!xdp)
+			break;
+
+		if (unlikely(bq->count == LIBETH_XDP_TX_BULK) &&
+		    unlikely(!flush_bulk(bq, LIBETH_XDP_TX_XSK))) {
+			ret = false;
+			break;
+		}
+
+		libeth_xsk_tx_queue_frag(bq, xdp);
+	}
+
+out:
+	bq->bulk[bq->count - 1].flags |= LIBETH_XDP_TX_LAST;
+
+	return ret;
+}
+
+/**
+ * libeth_xsk_tx_fill_buf - internal helper to fill XSk ``XDP_TX`` &libeth_sqe
+ * @frm: XDP Tx frame from the bulk
+ * @i: index on the HW queue
+ * @sq: XDPSQ abstraction for the queue
+ * @priv: private data
+ *
+ * Return: XDP Tx descriptor with the synced DMA and other info to pass to
+ * the driver callback.
+ */
+static inline struct libeth_xdp_tx_desc
+libeth_xsk_tx_fill_buf(struct libeth_xdp_tx_frame frm, u32 i,
+		       const struct libeth_xdpsq *sq, u64 priv)
+{
+	struct libeth_xdp_buff *xdp = frm.xsk;
+	struct libeth_xdp_tx_desc desc = {
+		.addr	= xsk_buff_xdp_get_dma(&xdp->base),
+		.opts	= frm.opts,
+	};
+	struct libeth_sqe *sqe;
+
+	xsk_buff_raw_dma_sync_for_device(sq->pool, desc.addr, desc.len);
+
+	sqe = &sq->sqes[i];
+	sqe->xsk = xdp;
+
+	if (!(desc.flags & LIBETH_XDP_TX_FIRST)) {
+		sqe->type = LIBETH_SQE_XSK_TX_FRAG;
+		return desc;
+	}
+
+	sqe->type = LIBETH_SQE_XSK_TX;
+	libeth_xdp_tx_fill_stats(sqe, &desc,
+				 xdp_get_shared_info_from_buff(&xdp->base));
+
+	return desc;
+}
+
+/**
+ * libeth_xsk_tx_flush_bulk - wrapper to define flush of XSk ``XDP_TX`` bulk
+ * @bq: bulk to flush
+ * @flags: Tx flags, see __libeth_xdp_tx_flush_bulk()
+ * @prep: driver callback to prepare the queue
+ * @xmit: driver callback to fill a HW descriptor
+ *
+ * Use via LIBETH_XSK_DEFINE_FLUSH_TX() to define an XSk ``XDP_TX`` driver
+ * callback.
+ */
+#define libeth_xsk_tx_flush_bulk(bq, flags, prep, xmit)			     \
+	__libeth_xdp_tx_flush_bulk(bq, (flags) | LIBETH_XDP_TX_XSK, prep,    \
+				   libeth_xsk_tx_fill_buf, xmit)
+
+/* XSk TMO */
+
+/**
+ * libeth_xsktmo_req_csum - XSk Tx metadata op to request checksum offload
+ * @csum_start: unused
+ * @csum_offset: unused
+ * @priv: &libeth_xdp_tx_desc from the filling helper
+ *
+ * Generic implementation of ::tmo_request_checksum. Works only when HW doesn't
+ * require filling checksum offsets and other parameters besides the checksum
+ * request bit.
+ * Consider using within @libeth_xsktmo unless the driver requires HW-specific
+ * callbacks.
+ */
+static inline void libeth_xsktmo_req_csum(u16 csum_start, u16 csum_offset,
+					  void *priv)
+{
+	((struct libeth_xdp_tx_desc *)priv)->flags |= LIBETH_XDP_TX_CSUM;
+}
+
+/* Only to inline the callbacks below, use @libeth_xsktmo in drivers instead */
+static const struct xsk_tx_metadata_ops __libeth_xsktmo = {
+	.tmo_request_checksum	= libeth_xsktmo_req_csum,
+};
+
+/**
+ * __libeth_xsk_xmit_fill_buf_md - internal helper to prepare XSk xmit w/meta
+ * @xdesc: &xdp_desc from the XSk buffer pool
+ * @sq: XDPSQ abstraction for the queue
+ * @priv: XSk Tx metadata ops
+ *
+ * Same as __libeth_xsk_xmit_fill_buf(), but requests metadata pointer and
+ * fills additional fields in &libeth_xdp_tx_desc to ask for metadata offload.
+ *
+ * Return: XDP Tx descriptor with the DMA, metadata request bits, and other
+ * info to pass to the driver callback.
+ */
+static __always_inline struct libeth_xdp_tx_desc
+__libeth_xsk_xmit_fill_buf_md(const struct xdp_desc *xdesc,
+			      const struct libeth_xdpsq *sq,
+			      u64 priv)
+{
+	const struct xsk_tx_metadata_ops *tmo = libeth_xdp_priv_to_ptr(priv);
+	struct libeth_xdp_tx_desc desc;
+	struct xdp_desc_ctx ctx;
+
+	ctx = xsk_buff_raw_get_ctx(sq->pool, xdesc->addr);
+	desc = (typeof(desc)){
+		.addr	= ctx.dma,
+		__libeth_xdp_tx_len(xdesc->len),
+	};
+
+	BUILD_BUG_ON(!__builtin_constant_p(tmo == libeth_xsktmo));
+	tmo = tmo == libeth_xsktmo ? &__libeth_xsktmo : tmo;
+
+	xsk_tx_metadata_request(ctx.meta, tmo, &desc);
+
+	return desc;
+}
+
+/* XSk xmit implementation */
+
+/**
+ * __libeth_xsk_xmit_fill_buf - internal helper to prepare XSk xmit w/o meta
+ * @xdesc: &xdp_desc from the XSk buffer pool
+ * @sq: XDPSQ abstraction for the queue
+ *
+ * Return: XDP Tx descriptor with the DMA and other info to pass to
+ * the driver callback.
+ */
+static inline struct libeth_xdp_tx_desc
+__libeth_xsk_xmit_fill_buf(const struct xdp_desc *xdesc,
+			   const struct libeth_xdpsq *sq)
+{
+	return (struct libeth_xdp_tx_desc){
+		.addr	= xsk_buff_raw_get_dma(sq->pool, xdesc->addr),
+		__libeth_xdp_tx_len(xdesc->len),
+	};
+}
+
+/**
+ * libeth_xsk_xmit_fill_buf - internal helper to prepare an XSk xmit
+ * @frm: &xdp_desc from the XSk buffer pool
+ * @i: index on the HW queue
+ * @sq: XDPSQ abstraction for the queue
+ * @priv: XSk Tx metadata ops
+ *
+ * Depending on the metadata ops presence (determined at compile time), calls
+ * the quickest helper to build a libeth XDP Tx descriptor.
+ *
+ * Return: XDP Tx descriptor with the synced DMA, metadata request bits,
+ * and other info to pass to the driver callback.
+ */
+static __always_inline struct libeth_xdp_tx_desc
+libeth_xsk_xmit_fill_buf(struct libeth_xdp_tx_frame frm, u32 i,
+			 const struct libeth_xdpsq *sq, u64 priv)
+{
+	struct libeth_xdp_tx_desc desc;
+
+	if (priv)
+		desc = __libeth_xsk_xmit_fill_buf_md(&frm.desc, sq, priv);
+	else
+		desc = __libeth_xsk_xmit_fill_buf(&frm.desc, sq);
+
+	desc.flags |= xsk_is_eop_desc(&frm.desc) ? LIBETH_XDP_TX_LAST : 0;
+
+	xsk_buff_raw_dma_sync_for_device(sq->pool, desc.addr, desc.len);
+
+	return desc;
+}
+
+/**
+ * libeth_xsk_xmit_do_bulk - send XSk xmit frames
+ * @pool: XSk buffer pool containing the frames to send
+ * @xdpsq: opaque pointer to driver's XDPSQ struct
+ * @budget: maximum number of frames that can be sent
+ * @tmo: optional XSk Tx metadata ops
+ * @prep: driver callback to build a &libeth_xdpsq
+ * @xmit: driver callback to put frames to a HW queue
+ * @finalize: driver callback to start a transmission
+ *
+ * Implements generic XSk xmit. Always turns on XSk Tx wakeup as it's assumed
+ * lazy cleaning is used and interrupts are disabled for the queue.
+ * HW descriptor filling is unrolled by ``LIBETH_XDP_TX_BATCH`` to optimize
+ * writes.
+ * Note that unlike other XDP Tx ops, the queue must be locked and cleaned
+ * prior to calling this function, so that the available @budget is already
+ * known. @prep must only build a &libeth_xdpsq and return ``U32_MAX``.
+ *
+ * Return: false if @budget was exhausted, true otherwise.
+ */
+static __always_inline bool
+libeth_xsk_xmit_do_bulk(struct xsk_buff_pool *pool, void *xdpsq, u32 budget,
+			const struct xsk_tx_metadata_ops *tmo,
+			u32 (*prep)(void *xdpsq, struct libeth_xdpsq *sq),
+			void (*xmit)(struct libeth_xdp_tx_desc desc, u32 i,
+				     const struct libeth_xdpsq *sq, u64 priv),
+			void (*finalize)(void *xdpsq, bool sent, bool flush))
+{
+	const struct libeth_xdp_tx_frame *bulk;
+	bool wake;
+	u32 n;
+
+	wake = xsk_uses_need_wakeup(pool);
+	if (wake)
+		xsk_clear_tx_need_wakeup(pool);
+
+	n = xsk_tx_peek_release_desc_batch(pool, budget);
+	bulk = container_of(&pool->tx_descs[0], typeof(*bulk), desc);
+
+	libeth_xdp_tx_xmit_bulk(bulk, xdpsq, n, true,
+				libeth_xdp_ptr_to_priv(tmo), prep,
+				libeth_xsk_xmit_fill_buf, xmit);
+	finalize(xdpsq, n, true);
+
+	if (wake)
+		xsk_set_tx_need_wakeup(pool);
+
+	return n < budget;
+}
+
+/* Rx polling path */
+
+/**
+ * libeth_xsk_tx_init_bulk - initialize XDP Tx bulk for an XSk Rx NAPI poll
+ * @bq: bulk to initialize
+ * @prog: RCU pointer to the XDP program (never %NULL)
+ * @dev: target &net_device
+ * @xdpsqs: array of driver XDPSQ structs
+ * @num: number of active XDPSQs, the above array length
+ *
+ * Should be called on an onstack XDP Tx bulk before the XSk NAPI polling loop.
+ * Initializes all the needed fields to run libeth_xdp functions.
+ * Never checks if @prog is %NULL or @num == 0 as XDP must always be enabled
+ * when hitting this path.
+ */
+#define libeth_xsk_tx_init_bulk(bq, prog, dev, xdpsqs, num)		     \
+	__libeth_xdp_tx_init_bulk(bq, prog, dev, xdpsqs, num, true,	     \
+				  __UNIQUE_ID(bq_), __UNIQUE_ID(nqs_))
+
+struct libeth_xdp_buff *libeth_xsk_buff_add_frag(struct libeth_xdp_buff *head,
+						 struct libeth_xdp_buff *xdp);
+
+/**
+ * libeth_xsk_process_buff - attach XSk Rx buffer to &libeth_xdp_buff
+ * @head: head XSk buffer to attach the XSk buffer to (or %NULL)
+ * @xdp: XSk buffer to process
+ * @len: received data length from the descriptor
+ *
+ * If @head == %NULL, treats the XSk buffer as head and initializes
+ * the required fields. Otherwise, attaches the buffer as a frag.
+ * Already performs DMA sync-for-CPU and frame start prefetch
+ * (for head buffers only).
+ *
+ * Return: head XSk buffer on success or if the descriptor must be skipped
+ * (empty), %NULL if there is no space for a new frag.
+ */
+static inline struct libeth_xdp_buff *
+libeth_xsk_process_buff(struct libeth_xdp_buff *head,
+			struct libeth_xdp_buff *xdp, u32 len)
+{
+	if (unlikely(!len)) {
+		libeth_xsk_buff_free_slow(xdp);
+		return head;
+	}
+
+	xsk_buff_set_size(&xdp->base, len);
+	xsk_buff_dma_sync_for_cpu(&xdp->base);
+
+	if (head)
+		return libeth_xsk_buff_add_frag(head, xdp);
+
+	prefetch(xdp->data);
+
+	return xdp;
+}
+
+void libeth_xsk_buff_stats_frags(struct libeth_rq_napi_stats *rs,
+				 const struct libeth_xdp_buff *xdp);
+
+u32 __libeth_xsk_run_prog_slow(struct libeth_xdp_buff *xdp,
+			       const struct libeth_xdp_tx_bulk *bq,
+			       enum xdp_action act, int ret);
+
+/**
+ * __libeth_xsk_run_prog - run XDP program on XSk buffer
+ * @xdp: XSk buffer to run the prog on
+ * @bq: buffer bulk for ``XDP_TX`` queueing
+ *
+ * Internal inline abstraction to run XDP program on XSk Rx path. Handles
+ * only the most common ``XDP_REDIRECT`` inline, the rest is processed
+ * externally.
+ * Reports an XDP prog exception on errors.
+ *
+ * Return: libeth_xdp prog verdict depending on the prog's verdict.
+ */
+static __always_inline u32
+__libeth_xsk_run_prog(struct libeth_xdp_buff *xdp,
+		      const struct libeth_xdp_tx_bulk *bq)
+{
+	enum xdp_action act;
+	int ret = 0;
+
+	act = bpf_prog_run_xdp(bq->prog, &xdp->base);
+	if (unlikely(act != XDP_REDIRECT))
+rest:
+		return __libeth_xsk_run_prog_slow(xdp, bq, act, ret);
+
+	ret = xdp_do_redirect(bq->dev, &xdp->base, bq->prog);
+	if (unlikely(ret))
+		goto rest;
+
+	return LIBETH_XDP_REDIRECT;
+}
+
+/**
+ * libeth_xsk_run_prog - run XDP program on XSk path and handle all verdicts
+ * @xdp: XSk buffer to process
+ * @bq: XDP Tx bulk to queue ``XDP_TX`` buffers
+ * @fl: driver ``XDP_TX`` bulk flush callback
+ *
+ * Run the attached XDP program and handle all possible verdicts.
+ * Prefer using it via LIBETH_XSK_DEFINE_RUN{,_PASS,_PROG}().
+ *
+ * Return: libeth_xdp prog verdict depending on the prog's verdict.
+ */
+#define libeth_xsk_run_prog(xdp, bq, fl)				     \
+	__libeth_xdp_run_flush(xdp, bq, __libeth_xsk_run_prog,		     \
+			       libeth_xsk_tx_queue_bulk, fl)
+
+/**
+ * __libeth_xsk_run_pass - helper to run XDP program and handle the result
+ * @xdp: XSk buffer to process
+ * @bq: XDP Tx bulk to queue ``XDP_TX`` frames
+ * @napi: NAPI to build an skb and pass it up the stack
+ * @rs: onstack libeth RQ stats
+ * @md: metadata that should be filled to the XSk buffer
+ * @prep: callback for filling the metadata
+ * @run: driver wrapper to run XDP program
+ * @populate: driver callback to populate an skb with the HW descriptor data
+ *
+ * Inline abstraction, XSk's counterpart of __libeth_xdp_run_pass(), see its
+ * doc for details.
+ *
+ * Return: false if the polling loop must be exited due to lack of free
+ * buffers, true otherwise.
+ */
+static __always_inline bool
+__libeth_xsk_run_pass(struct libeth_xdp_buff *xdp,
+		      struct libeth_xdp_tx_bulk *bq, struct napi_struct *napi,
+		      struct libeth_rq_napi_stats *rs, const void *md,
+		      void (*prep)(struct libeth_xdp_buff *xdp,
+				   const void *md),
+		      u32 (*run)(struct libeth_xdp_buff *xdp,
+				 struct libeth_xdp_tx_bulk *bq),
+		      bool (*populate)(struct sk_buff *skb,
+				       const struct libeth_xdp_buff *xdp,
+				       struct libeth_rq_napi_stats *rs))
+{
+	struct sk_buff *skb;
+	u32 act;
+
+	rs->bytes += xdp->base.data_end - xdp->data;
+	rs->packets++;
+
+	if (unlikely(xdp_buff_has_frags(&xdp->base)))
+		libeth_xsk_buff_stats_frags(rs, xdp);
+
+	if (prep && (!__builtin_constant_p(!!md) || md))
+		prep(xdp, md);
+
+	act = run(xdp, bq);
+	if (likely(act == LIBETH_XDP_REDIRECT))
+		return true;
+
+	if (act != LIBETH_XDP_PASS)
+		return act != LIBETH_XDP_ABORTED;
+
+	skb = xdp_build_skb_from_zc(&xdp->base);
+	if (unlikely(!skb)) {
+		libeth_xsk_buff_free_slow(xdp);
+		return true;
+	}
+
+	if (unlikely(!populate(skb, xdp, rs))) {
+		napi_consume_skb(skb, true);
+		return true;
+	}
+
+	napi_gro_receive(napi, skb);
+
+	return true;
+}
+
+/**
+ * libeth_xsk_run_pass - helper to run XDP program and handle the result
+ * @xdp: XSk buffer to process
+ * @bq: XDP Tx bulk to queue ``XDP_TX`` frames
+ * @napi: NAPI to build an skb and pass it up the stack
+ * @rs: onstack libeth RQ stats
+ * @desc: pointer to the HW descriptor for that frame
+ * @run: driver wrapper to run XDP program
+ * @populate: driver callback to populate an skb with the HW descriptor data
+ *
+ * Wrapper around the underscored version when "fill the descriptor metadata"
+ * means just writing the pointer to the HW descriptor as @xdp->desc.
+ */
+#define libeth_xsk_run_pass(xdp, bq, napi, rs, desc, run, populate)	     \
+	__libeth_xsk_run_pass(xdp, bq, napi, rs, desc, libeth_xdp_prep_desc, \
+			      run, populate)
+
+/**
+ * libeth_xsk_finalize_rx - finalize XDPSQ after an XSk NAPI polling loop
+ * @bq: ``XDP_TX`` frame bulk
+ * @flush: driver callback to flush the bulk
+ * @finalize: driver callback to start sending the frames and run the timer
+ *
+ * Flush the bulk if there are frames left to send, kick the queue and flush
+ * the XDP maps.
+ */
+#define libeth_xsk_finalize_rx(bq, flush, finalize)			     \
+	__libeth_xdp_finalize_rx(bq, LIBETH_XDP_TX_XSK, flush, finalize)
+
+/*
+ * Helpers to reduce boilerplate code in drivers.
+ *
+ * Typical driver XSk Rx flow would be (excl. bulk and buff init, frag attach):
+ *
+ * LIBETH_XDP_DEFINE_START();
+ * LIBETH_XSK_DEFINE_FLUSH_TX(static driver_xsk_flush_tx, driver_xsk_tx_prep,
+ *			      driver_xdp_xmit);
+ * LIBETH_XSK_DEFINE_RUN(static driver_xsk_run, driver_xsk_run_prog,
+ *			 driver_xsk_flush_tx, driver_populate_skb);
+ * LIBETH_XSK_DEFINE_FINALIZE(static driver_xsk_finalize_rx,
+ *			      driver_xsk_flush_tx, driver_xdp_finalize_sq);
+ * LIBETH_XDP_DEFINE_END();
+ *
+ * This will build a set of 4 static functions. The compiler is free to decide
+ * whether to inline them.
+ * Then, in the NAPI polling function:
+ *
+ *	while (packets < budget) {
+ *		// ...
+ *		if (!driver_xsk_run(xdp, &bq, napi, &rs, desc))
+ *			break;
+ *	}
+ *	driver_xsk_finalize_rx(&bq);
+ */
+
+/**
+ * LIBETH_XSK_DEFINE_FLUSH_TX - define a driver XSk ``XDP_TX`` flush function
+ * @name: name of the function to define
+ * @prep: driver callback to clean an XDPSQ
+ * @xmit: driver callback to write a HW Tx descriptor
+ */
+#define LIBETH_XSK_DEFINE_FLUSH_TX(name, prep, xmit)			     \
+	__LIBETH_XDP_DEFINE_FLUSH_TX(name, prep, xmit, xsk)
+
+/**
+ * LIBETH_XSK_DEFINE_RUN_PROG - define a driver XDP program run function
+ * @name: name of the function to define
+ * @flush: driver callback to flush an XSk ``XDP_TX`` bulk
+ */
+#define LIBETH_XSK_DEFINE_RUN_PROG(name, flush)				     \
+	u32 __LIBETH_XDP_DEFINE_RUN_PROG(name, flush, xsk)
+
+/**
+ * LIBETH_XSK_DEFINE_RUN_PASS - define a driver buffer process + pass function
+ * @name: name of the function to define
+ * @run: driver callback to run XDP program (above)
+ * @populate: driver callback to fill an skb with HW descriptor info
+ */
+#define LIBETH_XSK_DEFINE_RUN_PASS(name, run, populate)			     \
+	bool __LIBETH_XDP_DEFINE_RUN_PASS(name, run, populate, xsk)
+
+/**
+ * LIBETH_XSK_DEFINE_RUN - define a driver buffer process, run + pass function
+ * @name: name of the function to define
+ * @run: name of the XDP prog run function to define
+ * @flush: driver callback to flush an XSk ``XDP_TX`` bulk
+ * @populate: driver callback to fill an skb with HW descriptor info
+ */
+#define LIBETH_XSK_DEFINE_RUN(name, run, flush, populate)		     \
+	__LIBETH_XDP_DEFINE_RUN(name, run, flush, populate, XSK)
+
+/**
+ * LIBETH_XSK_DEFINE_FINALIZE - define a driver XSk NAPI poll finalize function
+ * @name: name of the function to define
+ * @flush: driver callback to flush an XSk ``XDP_TX`` bulk
+ * @finalize: driver callback to finalize an XDPSQ and run the timer
+ */
+#define LIBETH_XSK_DEFINE_FINALIZE(name, flush, finalize)		     \
+	__LIBETH_XDP_DEFINE_FINALIZE(name, flush, finalize, xsk)
+
+/* Refilling */
+
+/**
+ * struct libeth_xskfq - structure representing an XSk buffer (fill) queue
+ * @fp: hotpath part of the structure
+ * @pool: &xsk_buff_pool for buffer management
+ * @fqes: array of XSk buffer pointers
+ * @descs: opaque pointer to the HW descriptor array
+ * @ntu: index of the next buffer to poll
+ * @count: number of descriptors/buffers the queue has
+ * @pending: current number of XSkFQEs to refill
+ * @thresh: threshold below which the queue is refilled
+ * @buf_len: HW-writeable length per each buffer
+ * @nid: ID of the closest NUMA node with memory
+ */
+struct libeth_xskfq {
+	struct_group_tagged(libeth_xskfq_fp, fp,
+		struct xsk_buff_pool	*pool;
+		struct libeth_xdp_buff	**fqes;
+		void			*descs;
+
+		u32			ntu;
+		u32			count;
+	);
+
+	/* Cold fields */
+	u32			pending;
+	u32			thresh;
+
+	u32			buf_len;
+	int			nid;
+};
+
+int libeth_xskfq_create(struct libeth_xskfq *fq);
+void libeth_xskfq_destroy(struct libeth_xskfq *fq);
+
+/**
+ * libeth_xsk_buff_xdp_get_dma - get DMA address of XSk &libeth_xdp_buff
+ * @xdp: buffer to get the DMA addr for
+ */
+#define libeth_xsk_buff_xdp_get_dma(xdp)				     \
+	xsk_buff_xdp_get_dma(&(xdp)->base)
+
+/**
+ * libeth_xskfqe_alloc - allocate @n XSk Rx buffers
+ * @fq: hotpath part of the XSkFQ, usually onstack
+ * @n: number of buffers to allocate
+ * @fill: driver callback to write DMA addresses to HW descriptors
+ *
+ * Note that @fq->ntu gets updated, but ::pending must be recalculated
+ * by the caller.
+ *
+ * Return: number of buffers refilled.
+ */
+static __always_inline u32
+libeth_xskfqe_alloc(struct libeth_xskfq_fp *fq, u32 n,
+		    void (*fill)(const struct libeth_xskfq_fp *fq, u32 i))
+{
+	u32 this, ret, done = 0;
+	struct xdp_buff **xskb;
+
+	this = fq->count - fq->ntu;
+	if (likely(this > n))
+		this = n;
+
+again:
+	xskb = (typeof(xskb))&fq->fqes[fq->ntu];
+	ret = xsk_buff_alloc_batch(fq->pool, xskb, this);
+
+	for (u32 i = 0, ntu = fq->ntu; likely(i < ret); i++)
+		fill(fq, ntu + i);
+
+	done += ret;
+	fq->ntu += ret;
+
+	if (likely(fq->ntu < fq->count) || unlikely(ret < this))
+		goto out;
+
+	fq->ntu = 0;
+
+	if (this < n) {
+		this = n - this;
+		goto again;
+	}
+
+out:
+	return done;
+}
+
+/* .ndo_xsk_wakeup */
+
+void libeth_xsk_init_wakeup(call_single_data_t *csd, struct napi_struct *napi);
+void libeth_xsk_wakeup(call_single_data_t *csd, u32 qid);
+
+/* Pool setup */
+
+int libeth_xsk_setup_pool(struct net_device *dev, u32 qid, bool enable);
+
+#endif /* __LIBETH_XSK_H */
diff --git a/drivers/net/ethernet/intel/libeth/tx.c b/drivers/net/ethernet/intel/libeth/tx.c
index 227c841ab16a..e0167f43d2a8 100644
--- a/drivers/net/ethernet/intel/libeth/tx.c
+++ b/drivers/net/ethernet/intel/libeth/tx.c
@@ -10,6 +10,7 @@
 /* Tx buffer completion */
 
 DEFINE_STATIC_CALL_NULL(bulk, libeth_xdp_return_buff_bulk);
+DEFINE_STATIC_CALL_NULL(xsk, libeth_xsk_buff_free_slow);
 
 /**
  * libeth_tx_complete_any - perform Tx completion for one SQE of any type
@@ -23,7 +24,8 @@ DEFINE_STATIC_CALL_NULL(bulk, libeth_xdp_return_buff_bulk);
 void libeth_tx_complete_any(struct libeth_sqe *sqe, struct libeth_cq_pp *cp)
 {
 	if (sqe->type >= __LIBETH_SQE_XDP_START)
-		__libeth_xdp_complete_tx(sqe, cp, static_call(bulk));
+		__libeth_xdp_complete_tx(sqe, cp, static_call(bulk),
+					 static_call(xsk));
 	else
 		libeth_tx_complete(sqe, cp);
 }
@@ -34,5 +36,6 @@ EXPORT_SYMBOL_GPL(libeth_tx_complete_any);
 void libeth_attach_xdp(const struct libeth_xdp_ops *ops)
 {
 	static_call_update(bulk, ops ? ops->bulk : NULL);
+	static_call_update(xsk, ops ? ops->xsk : NULL);
 }
 EXPORT_SYMBOL_GPL(libeth_attach_xdp);
diff --git a/drivers/net/ethernet/intel/libeth/xdp.c b/drivers/net/ethernet/intel/libeth/xdp.c
index dbede9a696a7..b84b9041f02e 100644
--- a/drivers/net/ethernet/intel/libeth/xdp.c
+++ b/drivers/net/ethernet/intel/libeth/xdp.c
@@ -112,7 +112,7 @@ static void __cold libeth_trace_xdp_exception(const struct net_device *dev,
  * libeth_xdp_tx_exception - handle Tx exceptions of XDP frames
  * @bq: XDP Tx frame bulk
  * @sent: number of frames sent successfully (from this bulk)
- * @flags: internal libeth_xdp flags (.ndo_xdp_xmit etc.)
+ * @flags: internal libeth_xdp flags (XSk, .ndo_xdp_xmit etc.)
  *
  * Cold helper used by __libeth_xdp_tx_flush_bulk(), do not call directly.
  * Reports XDP Tx exceptions, frees the frames that won't be sent or adjust
@@ -134,7 +134,9 @@ void __cold libeth_xdp_tx_exception(struct libeth_xdp_tx_bulk *bq, u32 sent,
 		return;
 	}
 
-	if (!(flags & LIBETH_XDP_TX_NDO))
+	if (flags & LIBETH_XDP_TX_XSK)
+		libeth_xsk_tx_return_bulk(pos, left);
+	else if (!(flags & LIBETH_XDP_TX_NDO))
 		libeth_xdp_tx_return_bulk(pos, left);
 	else
 		libeth_xdp_xmit_return_bulk(pos, left, bq->dev);
@@ -282,7 +284,8 @@ EXPORT_SYMBOL_GPL(libeth_xdp_buff_add_frag);
  * @act: original XDP prog verdict
  * @ret: error code if redirect failed
  *
- * External helper used by __libeth_xdp_run_prog(), do not call directly.
+ * External helper used by __libeth_xdp_run_prog() and
+ * __libeth_xsk_run_prog_slow(), do not call directly.
  * Reports invalid @act, XDP exception trace event and frees the buffer.
  *
  * Return: libeth_xdp XDP prog verdict.
@@ -295,6 +298,10 @@ u32 __cold libeth_xdp_prog_exception(const struct libeth_xdp_tx_bulk *bq,
 		bpf_warn_invalid_xdp_action(bq->dev, bq->prog, act);
 
 	libeth_trace_xdp_exception(bq->dev, bq->prog, act);
+
+	if (xdp->base.rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL)
+		return libeth_xsk_prog_exception(xdp, act, ret);
+
 	libeth_xdp_return_buff_slow(xdp);
 
 	return LIBETH_XDP_DROP;
@@ -371,21 +378,31 @@ EXPORT_SYMBOL_GPL(libeth_xdp_queue_threshold);
  * __libeth_xdp_set_features - set XDP features for netdev
  * @dev: &net_device to configure
  * @xmo: XDP metadata ops (Rx hints)
+ * @zc_segs: maximum number of S/G frags the HW can transmit
+ * @tmo: XSk Tx metadata ops (Tx hints)
  *
  * Set all the features libeth_xdp supports. Only the first argument is
  * necessary; without the third one (zero), XSk support won't be advertised.
  * Use the non-underscored versions in drivers instead.
  */
 void __libeth_xdp_set_features(struct net_device *dev,
-			       const struct xdp_metadata_ops *xmo)
+			       const struct xdp_metadata_ops *xmo,
+			       u32 zc_segs,
+			       const struct xsk_tx_metadata_ops *tmo)
 {
 	xdp_set_features_flag(dev,
 			      NETDEV_XDP_ACT_BASIC |
 			      NETDEV_XDP_ACT_REDIRECT |
 			      NETDEV_XDP_ACT_NDO_XMIT |
+			      (zc_segs ? NETDEV_XDP_ACT_XSK_ZEROCOPY : 0) |
 			      NETDEV_XDP_ACT_RX_SG |
 			      NETDEV_XDP_ACT_NDO_XMIT_SG);
 	dev->xdp_metadata_ops = xmo;
+
+	tmo = tmo == libeth_xsktmo ? &libeth_xsktmo_slow : tmo;
+
+	dev->xdp_zc_max_segs = zc_segs ? : 1;
+	dev->xsk_tx_metadata_ops = zc_segs ? tmo : NULL;
 }
 EXPORT_SYMBOL_GPL(__libeth_xdp_set_features);
 
@@ -410,6 +427,7 @@ EXPORT_SYMBOL_GPL(libeth_xdp_set_redirect);
 
 static const struct libeth_xdp_ops xdp_ops __initconst = {
 	.bulk	= libeth_xdp_return_buff_bulk,
+	.xsk	= libeth_xsk_buff_free_slow,
 };
 
 static int __init libeth_xdp_module_init(void)
diff --git a/drivers/net/ethernet/intel/libeth/xsk.c b/drivers/net/ethernet/intel/libeth/xsk.c
new file mode 100644
index 000000000000..9a510a509dcd
--- /dev/null
+++ b/drivers/net/ethernet/intel/libeth/xsk.c
@@ -0,0 +1,269 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2025 Intel Corporation */
+
+#define DEFAULT_SYMBOL_NAMESPACE	"LIBETH_XDP"
+
+#include <net/libeth/xsk.h>
+
+#include "priv.h"
+
+/* ``XDP_TX`` bulking */
+
+void __cold libeth_xsk_tx_return_bulk(const struct libeth_xdp_tx_frame *bq,
+				      u32 count)
+{
+	for (u32 i = 0; i < count; i++)
+		libeth_xsk_buff_free_slow(bq[i].xsk);
+}
+
+/* XSk TMO */
+
+const struct xsk_tx_metadata_ops libeth_xsktmo_slow = {
+	.tmo_request_checksum		= libeth_xsktmo_req_csum,
+};
+
+/* Rx polling path */
+
+/**
+ * libeth_xsk_buff_free_slow - free an XSk Rx buffer
+ * @xdp: buffer to free
+ *
+ * Slowpath version of xsk_buff_free() to be used on exceptions, cleanups etc.
+ * to avoid unwanted inlining.
+ */
+void libeth_xsk_buff_free_slow(struct libeth_xdp_buff *xdp)
+{
+	xsk_buff_free(&xdp->base);
+}
+EXPORT_SYMBOL_GPL(libeth_xsk_buff_free_slow);
+
+/**
+ * libeth_xsk_buff_add_frag - add frag to XSk Rx buffer
+ * @head: head buffer
+ * @xdp: frag buffer
+ *
+ * External helper used by libeth_xsk_process_buff(), do not call directly.
+ * Frees both main and frag buffers on error.
+ *
+ * Return: main buffer with attached frag on success, %NULL on error (no space
+ * for a new frag).
+ */
+struct libeth_xdp_buff *libeth_xsk_buff_add_frag(struct libeth_xdp_buff *head,
+						 struct libeth_xdp_buff *xdp)
+{
+	if (!xsk_buff_add_frag(&head->base, &xdp->base))
+		goto free;
+
+	return head;
+
+free:
+	libeth_xsk_buff_free_slow(xdp);
+	libeth_xsk_buff_free_slow(head);
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(libeth_xsk_buff_add_frag);
+
+/**
+ * libeth_xsk_buff_stats_frags - update onstack RQ stats with XSk frags info
+ * @rs: onstack stats to update
+ * @xdp: buffer to account
+ *
+ * External helper used by __libeth_xsk_run_pass(), do not call directly.
+ * Adds buffer's frags count and total len to the onstack stats.
+ */
+void libeth_xsk_buff_stats_frags(struct libeth_rq_napi_stats *rs,
+				 const struct libeth_xdp_buff *xdp)
+{
+	libeth_xdp_buff_stats_frags(rs, xdp);
+}
+EXPORT_SYMBOL_GPL(libeth_xsk_buff_stats_frags);
+
+/**
+ * __libeth_xsk_run_prog_slow - process the non-``XDP_REDIRECT`` verdicts
+ * @xdp: buffer to process
+ * @bq: Tx bulk for queueing on ``XDP_TX``
+ * @act: verdict to process
+ * @ret: error code if ``XDP_REDIRECT`` failed
+ *
+ * External helper used by __libeth_xsk_run_prog(), do not call directly.
+ * ``XDP_REDIRECT`` is the most common and hottest verdict on XSk, thus
+ * it is processed inline. The rest goes here for out-of-line processing,
+ * together with redirect errors.
+ *
+ * Return: libeth_xdp XDP prog verdict.
+ */
+u32 __libeth_xsk_run_prog_slow(struct libeth_xdp_buff *xdp,
+			       const struct libeth_xdp_tx_bulk *bq,
+			       enum xdp_action act, int ret)
+{
+	switch (act) {
+	case XDP_DROP:
+		xsk_buff_free(&xdp->base);
+
+		return LIBETH_XDP_DROP;
+	case XDP_TX:
+		return LIBETH_XDP_TX;
+	case XDP_PASS:
+		return LIBETH_XDP_PASS;
+	default:
+		break;
+	}
+
+	return libeth_xdp_prog_exception(bq, xdp, act, ret);
+}
+EXPORT_SYMBOL_GPL(__libeth_xsk_run_prog_slow);
+
+/**
+ * libeth_xsk_prog_exception - handle XDP prog exceptions on XSk
+ * @xdp: buffer to process
+ * @act: verdict returned by the prog
+ * @ret: error code if ``XDP_REDIRECT`` failed
+ *
+ * Internal. Frees the buffer and, if the queue uses XSk wakeups, stops the
+ * current NAPI poll when there are no free buffers left.
+ *
+ * Return: libeth_xdp's XDP prog verdict.
+ */
+u32 __cold libeth_xsk_prog_exception(struct libeth_xdp_buff *xdp,
+				     enum xdp_action act, int ret)
+{
+	const struct xdp_buff_xsk *xsk;
+	u32 __ret = LIBETH_XDP_DROP;
+
+	if (act != XDP_REDIRECT)
+		goto drop;
+
+	xsk = container_of(&xdp->base, typeof(*xsk), xdp);
+	if (xsk_uses_need_wakeup(xsk->pool) && ret == -ENOBUFS)
+		__ret = LIBETH_XDP_ABORTED;
+
+drop:
+	libeth_xsk_buff_free_slow(xdp);
+
+	return __ret;
+}
+
+/* Refill */
+
+/**
+ * libeth_xskfq_create - create an XSkFQ
+ * @fq: fill queue to initialize
+ *
+ * Allocates the FQEs and initializes the fields used by libeth_xdp: number
+ * of buffers to refill, refill threshold and buffer len.
+ *
+ * Return: %0 on success, -errno otherwise.
+ */
+int libeth_xskfq_create(struct libeth_xskfq *fq)
+{
+	fq->fqes = kvcalloc_node(fq->count, sizeof(*fq->fqes), GFP_KERNEL,
+				 fq->nid);
+	if (!fq->fqes)
+		return -ENOMEM;
+
+	fq->pending = fq->count;
+	fq->thresh = libeth_xdp_queue_threshold(fq->count);
+	fq->buf_len = xsk_pool_get_rx_frame_size(fq->pool);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(libeth_xskfq_create);
+
+/**
+ * libeth_xskfq_destroy - destroy an XSkFQ
+ * @fq: fill queue to destroy
+ *
+ * Zeroes the used fields and frees the FQEs array.
+ */
+void libeth_xskfq_destroy(struct libeth_xskfq *fq)
+{
+	fq->buf_len = 0;
+	fq->thresh = 0;
+	fq->pending = 0;
+
+	kvfree(fq->fqes);
+}
+EXPORT_SYMBOL_GPL(libeth_xskfq_destroy);
+
+/* .ndo_xsk_wakeup */
+
+static void libeth_xsk_napi_sched(void *info)
+{
+	__napi_schedule_irqoff(info);
+}
+
+/**
+ * libeth_xsk_init_wakeup - initialize libeth XSk wakeup structure
+ * @csd: struct to initialize
+ * @napi: NAPI corresponding to this queue
+ *
+ * libeth_xdp uses inter-processor interrupts to perform XSk wakeups. In order
+ * to do that, the corresponding CSDs must be initialized when creating the
+ * queues.
+ */
+void libeth_xsk_init_wakeup(call_single_data_t *csd, struct napi_struct *napi)
+{
+	INIT_CSD(csd, libeth_xsk_napi_sched, napi);
+}
+EXPORT_SYMBOL_GPL(libeth_xsk_init_wakeup);
+
+/**
+ * libeth_xsk_wakeup - perform an XSk wakeup
+ * @csd: CSD corresponding to the queue
+ * @qid: the stack queue index
+ *
+ * Try to mark the NAPI as missed first, so that it could be rescheduled.
+ * If it's not, schedule it on the corresponding CPU using IPIs (or directly
+ * if already running on it).
+ */
+void libeth_xsk_wakeup(call_single_data_t *csd, u32 qid)
+{
+	struct napi_struct *napi = csd->info;
+
+	if (napi_if_scheduled_mark_missed(napi) ||
+	    unlikely(!napi_schedule_prep(napi)))
+		return;
+
+	if (unlikely(qid >= nr_cpu_ids))
+		qid %= nr_cpu_ids;
+
+	if (qid != raw_smp_processor_id() && cpu_online(qid))
+		smp_call_function_single_async(qid, csd);
+	else
+		__napi_schedule(napi);
+}
+EXPORT_SYMBOL_GPL(libeth_xsk_wakeup);
+
+/* Pool setup */
+
+#define LIBETH_XSK_DMA_ATTR					\
+	(DMA_ATTR_WEAK_ORDERING | DMA_ATTR_SKIP_CPU_SYNC)
+
+/**
+ * libeth_xsk_setup_pool - setup or destroy an XSk pool for a queue
+ * @dev: target &net_device
+ * @qid: stack queue index to configure
+ * @enable: whether to enable or disable the pool
+ *
+ * Check that @qid is valid and then map or unmap the pool.
+ *
+ * Return: %0 on success, -errno otherwise.
+ */
+int libeth_xsk_setup_pool(struct net_device *dev, u32 qid, bool enable)
+{
+	struct xsk_buff_pool *pool;
+
+	pool = xsk_get_pool_from_qid(dev, qid);
+	if (!pool)
+		return -EINVAL;
+
+	if (enable)
+		return xsk_pool_dma_map(pool, dev->dev.parent,
+					LIBETH_XSK_DMA_ATTR);
+	else
+		xsk_pool_dma_unmap(pool, LIBETH_XSK_DMA_ATTR);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(libeth_xsk_setup_pool);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH net-next 05/16] idpf: fix Rx descriptor ready check barrier in splitq
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (3 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 04/16] libeth: add XSk helpers Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-07 10:17   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 06/16] idpf: use a saner limit for default number of queues to allocate Alexander Lobakin
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

No idea what the current barrier position was meant for. At that point,
nothing is read from the descriptor, only the pointer to the actual one
is fetched.
The correct barrier usage here is after the generation check, so that
only the first qword is read if the descriptor is not yet ready and we
need to stop polling. This is debatable on coherent DMA, as the Rx
descriptor size is <= the cacheline size, but either way, the current
barrier position only makes the codegen worse.
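
The resulting clean-loop ordering, sketched with the same fields as in the
hunk below:

	/* only qword 0 is read to test the GEN bit */
	gen_id = le16_get_bits(rx_desc->pktlen_gen_bufq_id,
			       VIRTCHNL2_RX_FLEX_DESC_ADV_GEN_M);
	if (idpf_queue_has(GEN_CHK, rxq) != gen_id)
		break;

	/* the descriptor is ready: order the reads of the remaining fields */
	dma_rmb();

	rxdid = FIELD_GET(VIRTCHNL2_RX_FLEX_DESC_ADV_RXDID_M,
			  rx_desc->rxdid_ucast);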

Fixes: 3a8845af66ed ("idpf: add RX splitq napi poll support")
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/idpf/idpf_txrx.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index 6254806c2072..c15833928ea1 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -3232,18 +3232,14 @@ static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
 		/* get the Rx desc from Rx queue based on 'next_to_clean' */
 		rx_desc = &rxq->rx[ntc].flex_adv_nic_3_wb;
 
-		/* This memory barrier is needed to keep us from reading
-		 * any other fields out of the rx_desc
-		 */
-		dma_rmb();
-
 		/* if the descriptor isn't done, no work yet to do */
 		gen_id = le16_get_bits(rx_desc->pktlen_gen_bufq_id,
 				       VIRTCHNL2_RX_FLEX_DESC_ADV_GEN_M);
-
 		if (idpf_queue_has(GEN_CHK, rxq) != gen_id)
 			break;
 
+		dma_rmb();
+
 		rxdid = FIELD_GET(VIRTCHNL2_RX_FLEX_DESC_ADV_RXDID_M,
 				  rx_desc->rxdid_ucast);
 		if (rxdid != VIRTCHNL2_RXDID_2_FLEX_SPLITQ) {
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH net-next 06/16] idpf: use a saner limit for default number of queues to allocate
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (4 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 05/16] idpf: fix Rx descriptor ready check barrier in splitq Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-07 10:32   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 07/16] idpf: link NAPIs to queues Alexander Lobakin
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

Currently, the maximum number of queues available for one vport is 16.
This is hardcoded, but then the function calculating the optimal number
of queues takes min(16, num_online_cpus()).
In order to be able to allocate more queues, which will then be used for
XDP, stop hardcoding 16 and rely on what the device gives us. Instead of
num_online_cpus(), which is considered suboptimal since at least 2013,
use netif_get_num_default_rss_queues() to still have free queues in the
pool.
nr_cpu_ids Tx queues are needed only for lockless XDP sending; the regular
stack doesn't benefit from that anyhow.
On a 128-thread Xeon, this now gives me 32 regular Tx queues and leaves
224 free for XDP (128 of which will handle XDP_TX, .ndo_xdp_xmit(), and
XSk xmit when enabled).
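
The arithmetic on that box, for illustration (assuming the device exposes
256 Tx queues for the vport; netif_get_num_default_rss_queues() returns
roughly half the physical cores on current kernels):

	128 SMT threads -> 64 cores -> 32 regular Tx queues
	256 - 32 = 224 queues left for XDP
	nr_cpu_ids = 128 of them used as per-CPU XDPSQs
		     (XDP_TX, .ndo_xdp_xmit(), XSk xmit)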

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/idpf/idpf_txrx.c     | 8 +-------
 drivers/net/ethernet/intel/idpf/idpf_virtchnl.c | 2 +-
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index c15833928ea1..2f221c0abad8 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -1234,13 +1234,7 @@ int idpf_vport_calc_total_qs(struct idpf_adapter *adapter, u16 vport_idx,
 		num_req_tx_qs = vport_config->user_config.num_req_tx_qs;
 		num_req_rx_qs = vport_config->user_config.num_req_rx_qs;
 	} else {
-		int num_cpus;
-
-		/* Restrict num of queues to cpus online as a default
-		 * configuration to give best performance. User can always
-		 * override to a max number of queues via ethtool.
-		 */
-		num_cpus = num_online_cpus();
+		u32 num_cpus = netif_get_num_default_rss_queues();
 
 		dflt_splitq_txq_grps = min_t(int, max_q->max_txq, num_cpus);
 		dflt_singleq_txqs = min_t(int, max_q->max_txq, num_cpus);
diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
index 3d2413b8684f..135af3cc243f 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
@@ -937,7 +937,7 @@ int idpf_vport_alloc_max_qs(struct idpf_adapter *adapter,
 	max_tx_q = le16_to_cpu(caps->max_tx_q) / default_vports;
 	if (adapter->num_alloc_vports < default_vports) {
 		max_q->max_rxq = min_t(u16, max_rx_q, IDPF_MAX_Q);
-		max_q->max_txq = min_t(u16, max_tx_q, IDPF_MAX_Q);
+		max_q->max_txq = min_t(u16, max_tx_q, IDPF_LARGE_MAX_Q);
 	} else {
 		max_q->max_rxq = IDPF_MIN_Q;
 		max_q->max_txq = IDPF_MIN_Q;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH net-next 07/16] idpf: link NAPIs to queues
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (5 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 06/16] idpf: use a saner limit for default number of queues to allocate Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-07 10:28   ` Eric Dumazet
  2025-03-07 10:51   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 08/16] idpf: make complq cleaning dependent on scheduling mode Alexander Lobakin
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

Add the missing linking of NAPIs to netdev queues when enabling
interrupt vectors in order to support NAPI configuration and
interfaces requiring get_rx_queue()->napi to be set (like XSk
busy polling).
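
What the linking enables, as a minimal sketch (not part of this patch,
names are illustrative):

	/* e.g. XSk busy polling looking up the NAPI behind Rx queue @qid */
	struct napi_struct *napi;

	napi = netdev_get_rx_queue(vport->netdev, qid)->napi;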

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/idpf/idpf_txrx.c | 30 +++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index 2f221c0abad8..a3f6e8cff7a0 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -3560,8 +3560,11 @@ void idpf_vport_intr_rel(struct idpf_vport *vport)
 static void idpf_vport_intr_rel_irq(struct idpf_vport *vport)
 {
 	struct idpf_adapter *adapter = vport->adapter;
+	bool unlock;
 	int vector;
 
+	unlock = rtnl_trylock();
+
 	for (vector = 0; vector < vport->num_q_vectors; vector++) {
 		struct idpf_q_vector *q_vector = &vport->q_vectors[vector];
 		int irq_num, vidx;
@@ -3573,8 +3576,23 @@ static void idpf_vport_intr_rel_irq(struct idpf_vport *vport)
 		vidx = vport->q_vector_idxs[vector];
 		irq_num = adapter->msix_entries[vidx].vector;
 
+		for (u32 i = 0; i < q_vector->num_rxq; i++)
+			netif_queue_set_napi(vport->netdev,
+					     q_vector->rx[i]->idx,
+					     NETDEV_QUEUE_TYPE_RX,
+					     NULL);
+
+		for (u32 i = 0; i < q_vector->num_txq; i++)
+			netif_queue_set_napi(vport->netdev,
+					     q_vector->tx[i]->idx,
+					     NETDEV_QUEUE_TYPE_TX,
+					     NULL);
+
 		kfree(free_irq(irq_num, q_vector));
 	}
+
+	if (unlock)
+		rtnl_unlock();
 }
 
 /**
@@ -3760,6 +3778,18 @@ static int idpf_vport_intr_req_irq(struct idpf_vport *vport)
 				   "Request_irq failed, error: %d\n", err);
 			goto free_q_irqs;
 		}
+
+		for (u32 i = 0; i < q_vector->num_rxq; i++)
+			netif_queue_set_napi(vport->netdev,
+					     q_vector->rx[i]->idx,
+					     NETDEV_QUEUE_TYPE_RX,
+					     &q_vector->napi);
+
+		for (u32 i = 0; i < q_vector->num_txq; i++)
+			netif_queue_set_napi(vport->netdev,
+					     q_vector->tx[i]->idx,
+					     NETDEV_QUEUE_TYPE_TX,
+					     &q_vector->napi);
 	}
 
 	return 0;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH net-next 08/16] idpf: make complq cleaning dependent on scheduling mode
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (6 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 07/16] idpf: link NAPIs to queues Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-07 11:11   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 09/16] idpf: remove SW marker handling from NAPI Alexander Lobakin
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Michal Kubiak <michal.kubiak@intel.com>

Extend the completion queue cleaning function to support the queue-based
scheduling mode needed for XDP queues.
Add a 4-byte descriptor for queue-based scheduling mode and perform some
refactoring to extract the code common to both scheduling modes.
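
Descriptor size selection then depends on the scheduling mode, roughly as
in the hunks below:

	desc_size = idpf_queue_has(FLOW_SCH_EN, complq) ?
		    sizeof(struct idpf_splitq_tx_compl_desc) :     /* 8 B */
		    sizeof(struct idpf_splitq_4b_tx_compl_desc);   /* 4 B */
	complq->size = array_size(complq->desc_count, desc_size);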

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 .../net/ethernet/intel/idpf/idpf_lan_txrx.h   |   6 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  11 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.c   | 256 +++++++++++-------
 3 files changed, 177 insertions(+), 96 deletions(-)

diff --git a/drivers/net/ethernet/intel/idpf/idpf_lan_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_lan_txrx.h
index 8c7f8ef8f1a1..7f12c7f2e70e 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_lan_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_lan_txrx.h
@@ -186,13 +186,17 @@ struct idpf_base_tx_desc {
 	__le64 qw1; /* type_cmd_offset_bsz_l2tag1 */
 }; /* read used with buffer queues */
 
-struct idpf_splitq_tx_compl_desc {
+struct idpf_splitq_4b_tx_compl_desc {
 	/* qid=[10:0] comptype=[13:11] rsvd=[14] gen=[15] */
 	__le16 qid_comptype_gen;
 	union {
 		__le16 q_head; /* Queue head */
 		__le16 compl_tag; /* Completion tag */
 	} q_head_compl_tag;
+}; /* writeback used with completion queues */
+
+struct idpf_splitq_tx_compl_desc {
+	struct idpf_splitq_4b_tx_compl_desc common;
 	u8 ts[3];
 	u8 rsvd; /* Reserved */
 }; /* writeback used with completion queues */
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index b029f566e57c..9f938301b2c5 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -743,7 +743,9 @@ libeth_cacheline_set_assert(struct idpf_buf_queue, 64, 24, 32);
 
 /**
  * struct idpf_compl_queue - software structure representing a completion queue
- * @comp: completion descriptor array
+ * @comp: 8-byte completion descriptor array
+ * @comp_4b: 4-byte completion descriptor array
+ * @desc_ring: virtual descriptor ring address
  * @txq_grp: See struct idpf_txq_group
  * @flags: See enum idpf_queue_flags_t
  * @desc_count: Number of descriptors
@@ -763,7 +765,12 @@ libeth_cacheline_set_assert(struct idpf_buf_queue, 64, 24, 32);
  */
 struct idpf_compl_queue {
 	__cacheline_group_begin_aligned(read_mostly);
-	struct idpf_splitq_tx_compl_desc *comp;
+	union {
+		struct idpf_splitq_tx_compl_desc *comp;
+		struct idpf_splitq_4b_tx_compl_desc *comp_4b;
+
+		void *desc_ring;
+	};
 	struct idpf_txq_group *txq_grp;
 
 	DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index a3f6e8cff7a0..a240ed115e3e 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -156,8 +156,8 @@ static void idpf_compl_desc_rel(struct idpf_compl_queue *complq)
 		return;
 
 	dma_free_coherent(complq->netdev->dev.parent, complq->size,
-			  complq->comp, complq->dma);
-	complq->comp = NULL;
+			  complq->desc_ring, complq->dma);
+	complq->desc_ring = NULL;
 	complq->next_to_use = 0;
 	complq->next_to_clean = 0;
 }
@@ -284,12 +284,16 @@ static int idpf_tx_desc_alloc(const struct idpf_vport *vport,
 static int idpf_compl_desc_alloc(const struct idpf_vport *vport,
 				 struct idpf_compl_queue *complq)
 {
-	complq->size = array_size(complq->desc_count, sizeof(*complq->comp));
+	u32 desc_size;
 
-	complq->comp = dma_alloc_coherent(complq->netdev->dev.parent,
-					  complq->size, &complq->dma,
-					  GFP_KERNEL);
-	if (!complq->comp)
+	desc_size = idpf_queue_has(FLOW_SCH_EN, complq) ?
+		    sizeof(*complq->comp) : sizeof(*complq->comp_4b);
+	complq->size = array_size(complq->desc_count, desc_size);
+
+	complq->desc_ring = dma_alloc_coherent(complq->netdev->dev.parent,
+					       complq->size, &complq->dma,
+					       GFP_KERNEL);
+	if (!complq->desc_ring)
 		return -ENOMEM;
 
 	complq->next_to_use = 0;
@@ -1921,8 +1925,46 @@ static bool idpf_tx_clean_buf_ring(struct idpf_tx_queue *txq, u16 compl_tag,
 }
 
 /**
- * idpf_tx_handle_rs_completion - clean a single packet and all of its buffers
- * whether on the buffer ring or in the hash table
+ * idpf_parse_compl_desc - Parse the completion descriptor
+ * @desc: completion descriptor to be parsed
+ * @complq: completion queue containing the descriptor
+ * @txq: returns corresponding Tx queue for a given descriptor
+ * @gen_flag: current generation flag in the completion queue
+ *
+ * Return: completion type from descriptor or negative value in case of error:
+ *	   -ENODATA if there is no completion descriptor to be cleaned,
+ *	   -EINVAL if no Tx queue has been found for the completion queue.
+ */
+static int
+idpf_parse_compl_desc(const struct idpf_splitq_4b_tx_compl_desc *desc,
+		      const struct idpf_compl_queue *complq,
+		      struct idpf_tx_queue **txq, bool gen_flag)
+{
+	struct idpf_tx_queue *target;
+	u32 rel_tx_qid, comptype;
+
+	/* if the descriptor isn't done, no work yet to do */
+	comptype = le16_to_cpu(desc->qid_comptype_gen);
+	if (!!(comptype & IDPF_TXD_COMPLQ_GEN_M) != gen_flag)
+		return -ENODATA;
+
+	/* Find necessary info of TX queue to clean buffers */
+	rel_tx_qid = FIELD_GET(IDPF_TXD_COMPLQ_QID_M, comptype);
+	target = likely(rel_tx_qid < complq->txq_grp->num_txq) ?
+		 complq->txq_grp->txqs[rel_tx_qid] : NULL;
+
+	if (!target)
+		return -EINVAL;
+
+	*txq = target;
+
+	/* Determine completion type */
+	return FIELD_GET(IDPF_TXD_COMPLQ_COMPL_TYPE_M, comptype);
+}
+
+/**
+ * idpf_tx_handle_rs_cmpl_qb - clean a single packet and all of its buffers
+ * when the Tx queue uses queue-based scheduling
  * @txq: Tx ring to clean
  * @desc: pointer to completion queue descriptor to extract completion
  * information from
@@ -1931,21 +1973,33 @@ static bool idpf_tx_clean_buf_ring(struct idpf_tx_queue *txq, u16 compl_tag,
  *
  * Returns bytes/packets cleaned
  */
-static void idpf_tx_handle_rs_completion(struct idpf_tx_queue *txq,
-					 struct idpf_splitq_tx_compl_desc *desc,
-					 struct libeth_sq_napi_stats *cleaned,
-					 int budget)
+static void
+idpf_tx_handle_rs_cmpl_qb(struct idpf_tx_queue *txq,
+			  const struct idpf_splitq_4b_tx_compl_desc *desc,
+			  struct libeth_sq_napi_stats *cleaned, int budget)
 {
-	u16 compl_tag;
+	u16 head = le16_to_cpu(desc->q_head_compl_tag.q_head);
 
-	if (!idpf_queue_has(FLOW_SCH_EN, txq)) {
-		u16 head = le16_to_cpu(desc->q_head_compl_tag.q_head);
-
-		idpf_tx_splitq_clean(txq, head, budget, cleaned, false);
-		return;
-	}
+	idpf_tx_splitq_clean(txq, head, budget, cleaned, false);
+}
 
-	compl_tag = le16_to_cpu(desc->q_head_compl_tag.compl_tag);
+/**
+ * idpf_tx_handle_rs_cmpl_fb - clean a single packet and all of its buffers
+ * whether on the buffer ring or in the hash table (flow-based scheduling only)
+ * @txq: Tx ring to clean
+ * @desc: pointer to completion queue descriptor to extract completion
+ * information from
+ * @cleaned: pointer to stats struct to track cleaned packets/bytes
+ * @budget: Used to determine if we are in netpoll
+ *
+ * Returns bytes/packets cleaned
+ */
+static void
+idpf_tx_handle_rs_cmpl_fb(struct idpf_tx_queue *txq,
+			  const struct idpf_splitq_4b_tx_compl_desc *desc,
+			  struct libeth_sq_napi_stats *cleaned, int budget)
+{
+	u16 compl_tag = le16_to_cpu(desc->q_head_compl_tag.compl_tag);
 
 	/* If we didn't clean anything on the ring, this packet must be
 	 * in the hash table. Go clean it there.
@@ -1954,6 +2008,61 @@ static void idpf_tx_handle_rs_completion(struct idpf_tx_queue *txq,
 		idpf_tx_clean_stashed_bufs(txq, compl_tag, cleaned, budget);
 }
 
+/**
+ * idpf_tx_finalize_complq - Finalize completion queue cleaning
+ * @complq: completion queue to finalize
+ * @ntc: next to complete index
+ * @gen_flag: current state of generation flag
+ * @cleaned: returns number of packets cleaned
+ */
+static void idpf_tx_finalize_complq(struct idpf_compl_queue *complq, int ntc,
+				    bool gen_flag, int *cleaned)
+{
+	struct idpf_netdev_priv *np;
+	bool complq_ok = true;
+	int i;
+
+	/* Store the state of the complq to be used later in deciding if a
+	 * TXQ can be started again
+	 */
+	if (unlikely(IDPF_TX_COMPLQ_PENDING(complq->txq_grp) >
+		     IDPF_TX_COMPLQ_OVERFLOW_THRESH(complq)))
+		complq_ok = false;
+
+	np = netdev_priv(complq->netdev);
+	for (i = 0; i < complq->txq_grp->num_txq; ++i) {
+		struct idpf_tx_queue *tx_q = complq->txq_grp->txqs[i];
+		struct netdev_queue *nq;
+		bool dont_wake;
+
+		/* We didn't clean anything on this queue, move along */
+		if (!tx_q->cleaned_bytes)
+			continue;
+
+		*cleaned += tx_q->cleaned_pkts;
+
+		/* Update BQL */
+		nq = netdev_get_tx_queue(tx_q->netdev, tx_q->idx);
+
+		dont_wake = !complq_ok || IDPF_TX_BUF_RSV_LOW(tx_q) ||
+			    np->state != __IDPF_VPORT_UP ||
+			    !netif_carrier_ok(tx_q->netdev);
+		/* Check if the TXQ needs to and can be restarted */
+		__netif_txq_completed_wake(nq, tx_q->cleaned_pkts, tx_q->cleaned_bytes,
+					   IDPF_DESC_UNUSED(tx_q), IDPF_TX_WAKE_THRESH,
+					   dont_wake);
+
+		/* Reset cleaned stats for the next time this queue is
+		 * cleaned
+		 */
+		tx_q->cleaned_bytes = 0;
+		tx_q->cleaned_pkts = 0;
+	}
+
+	complq->next_to_clean = ntc + complq->desc_count;
+	idpf_queue_assign(GEN_CHK, complq, gen_flag);
+}
+
 /**
  * idpf_tx_clean_complq - Reclaim resources on completion queue
  * @complq: Tx ring to clean
@@ -1965,60 +2074,56 @@ static void idpf_tx_handle_rs_completion(struct idpf_tx_queue *txq,
 static bool idpf_tx_clean_complq(struct idpf_compl_queue *complq, int budget,
 				 int *cleaned)
 {
-	struct idpf_splitq_tx_compl_desc *tx_desc;
+	struct idpf_splitq_4b_tx_compl_desc *tx_desc;
 	s16 ntc = complq->next_to_clean;
-	struct idpf_netdev_priv *np;
 	unsigned int complq_budget;
-	bool complq_ok = true;
-	int i;
+	bool flow, gen_flag;
+	u32 pos = ntc;
+
+	flow = idpf_queue_has(FLOW_SCH_EN, complq);
+	gen_flag = idpf_queue_has(GEN_CHK, complq);
 
 	complq_budget = complq->clean_budget;
-	tx_desc = &complq->comp[ntc];
+	tx_desc = flow ? &complq->comp[pos].common : &complq->comp_4b[pos];
 	ntc -= complq->desc_count;
 
 	do {
 		struct libeth_sq_napi_stats cleaned_stats = { };
 		struct idpf_tx_queue *tx_q;
-		int rel_tx_qid;
 		u16 hw_head;
-		u8 ctype;	/* completion type */
-		u16 gen;
-
-		/* if the descriptor isn't done, no work yet to do */
-		gen = le16_get_bits(tx_desc->qid_comptype_gen,
-				    IDPF_TXD_COMPLQ_GEN_M);
-		if (idpf_queue_has(GEN_CHK, complq) != gen)
-			break;
-
-		/* Find necessary info of TX queue to clean buffers */
-		rel_tx_qid = le16_get_bits(tx_desc->qid_comptype_gen,
-					   IDPF_TXD_COMPLQ_QID_M);
-		if (rel_tx_qid >= complq->txq_grp->num_txq ||
-		    !complq->txq_grp->txqs[rel_tx_qid]) {
-			netdev_err(complq->netdev, "TxQ not found\n");
-			goto fetch_next_desc;
-		}
-		tx_q = complq->txq_grp->txqs[rel_tx_qid];
+		int ctype;
 
-		/* Determine completion type */
-		ctype = le16_get_bits(tx_desc->qid_comptype_gen,
-				      IDPF_TXD_COMPLQ_COMPL_TYPE_M);
+		ctype = idpf_parse_compl_desc(tx_desc, complq, &tx_q,
+					      gen_flag);
 		switch (ctype) {
 		case IDPF_TXD_COMPLT_RE:
+			if (unlikely(!flow))
+				goto fetch_next_desc;
+
 			hw_head = le16_to_cpu(tx_desc->q_head_compl_tag.q_head);
 
 			idpf_tx_splitq_clean(tx_q, hw_head, budget,
 					     &cleaned_stats, true);
 			break;
 		case IDPF_TXD_COMPLT_RS:
-			idpf_tx_handle_rs_completion(tx_q, tx_desc,
-						     &cleaned_stats, budget);
+			if (flow)
+				idpf_tx_handle_rs_cmpl_fb(tx_q, tx_desc,
+							  &cleaned_stats,
+							  budget);
+			else
+				idpf_tx_handle_rs_cmpl_qb(tx_q, tx_desc,
+							  &cleaned_stats,
+							  budget);
 			break;
 		case IDPF_TXD_COMPLT_SW_MARKER:
 			idpf_tx_handle_sw_marker(tx_q);
 			break;
+		case -ENODATA:
+			goto exit_clean_complq;
+		case -EINVAL:
+			goto fetch_next_desc;
 		default:
-			netdev_err(tx_q->netdev,
+			netdev_err(complq->netdev,
 				   "Unknown TX completion type: %d\n", ctype);
 			goto fetch_next_desc;
 		}
@@ -2032,59 +2137,24 @@ static bool idpf_tx_clean_complq(struct idpf_compl_queue *complq, int budget,
 		u64_stats_update_end(&tx_q->stats_sync);
 
 fetch_next_desc:
-		tx_desc++;
+		pos++;
 		ntc++;
 		if (unlikely(!ntc)) {
 			ntc -= complq->desc_count;
-			tx_desc = &complq->comp[0];
-			idpf_queue_change(GEN_CHK, complq);
+			pos = 0;
+			gen_flag = !gen_flag;
 		}
 
+		tx_desc = flow ? &complq->comp[pos].common :
+			  &complq->comp_4b[pos];
 		prefetch(tx_desc);
 
 		/* update budget accounting */
 		complq_budget--;
 	} while (likely(complq_budget));
 
-	/* Store the state of the complq to be used later in deciding if a
-	 * TXQ can be started again
-	 */
-	if (unlikely(IDPF_TX_COMPLQ_PENDING(complq->txq_grp) >
-		     IDPF_TX_COMPLQ_OVERFLOW_THRESH(complq)))
-		complq_ok = false;
-
-	np = netdev_priv(complq->netdev);
-	for (i = 0; i < complq->txq_grp->num_txq; ++i) {
-		struct idpf_tx_queue *tx_q = complq->txq_grp->txqs[i];
-		struct netdev_queue *nq;
-		bool dont_wake;
-
-		/* We didn't clean anything on this queue, move along */
-		if (!tx_q->cleaned_bytes)
-			continue;
-
-		*cleaned += tx_q->cleaned_pkts;
-
-		/* Update BQL */
-		nq = netdev_get_tx_queue(tx_q->netdev, tx_q->idx);
-
-		dont_wake = !complq_ok || IDPF_TX_BUF_RSV_LOW(tx_q) ||
-			    np->state != __IDPF_VPORT_UP ||
-			    !netif_carrier_ok(tx_q->netdev);
-		/* Check if the TXQ needs to and can be restarted */
-		__netif_txq_completed_wake(nq, tx_q->cleaned_pkts, tx_q->cleaned_bytes,
-					   IDPF_DESC_UNUSED(tx_q), IDPF_TX_WAKE_THRESH,
-					   dont_wake);
-
-		/* Reset cleaned stats for the next time this queue is
-		 * cleaned
-		 */
-		tx_q->cleaned_bytes = 0;
-		tx_q->cleaned_pkts = 0;
-	}
-
-	ntc += complq->desc_count;
-	complq->next_to_clean = ntc;
+exit_clean_complq:
+	idpf_tx_finalize_complq(complq, ntc, gen_flag, cleaned);
 
 	return !!complq_budget;
 }
-- 
2.48.1



* [PATCH net-next 09/16] idpf: remove SW marker handling from NAPI
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (7 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 08/16] idpf: make complq cleaning dependent on scheduling mode Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-07 11:42   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 10/16] idpf: add support for nointerrupt queues Alexander Lobakin
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Michal Kubiak <michal.kubiak@intel.com>

SW marker descriptors on completion queues are used only when a queue
is about to be destroyed. This is far from the hotpath, so handling it
in the hotpath NAPI poll makes no sense.
Instead, run a simple poller after the virtchnl message for destroying
the queues is sent and wait for the marker completions there. Once the
markers for all of the queues have been received, the Tx pipe drain is
complete and we can proceed with stopping the link.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
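The poller boils down to the following condensed sketch (ring index and
generation-bit bookkeeping trimmed, see the full
idpf_wait_for_sw_marker_completion() added below):

	timeout = jiffies + msecs_to_jiffies(IDPF_WAIT_FOR_MARKER_TIMEO);

	do {
		ctype = idpf_parse_compl_desc(tx_desc, complq, &tx_q,
					      gen_flag);
		if (ctype == -ENODATA) {
			/* nothing written back yet, give HW some time */
			usleep_range(500, 1000);
			continue;
		}

		if (ctype == IDPF_TXD_COMPLT_SW_MARKER) {
			idpf_queue_clear(SW_MARKER, tx_q);
			if (tx_q == txq)
				break;	/* the awaited marker arrived */
		}

		/* advance tx_desc/ntc, flip gen_flag on ring wraparound */
	} while (time_before(jiffies, timeout));
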
 drivers/net/ethernet/intel/idpf/idpf.h        |   7 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.h   |   4 +-
 drivers/net/ethernet/intel/idpf/idpf_lib.c    |   2 -
 drivers/net/ethernet/intel/idpf/idpf_txrx.c   | 108 +++++++++++-------
 .../net/ethernet/intel/idpf/idpf_virtchnl.c   |  34 ++----
 5 files changed, 80 insertions(+), 75 deletions(-)

diff --git a/drivers/net/ethernet/intel/idpf/idpf.h b/drivers/net/ethernet/intel/idpf/idpf.h
index 66544faab710..6b51a5dcc1e0 100644
--- a/drivers/net/ethernet/intel/idpf/idpf.h
+++ b/drivers/net/ethernet/intel/idpf/idpf.h
@@ -36,6 +36,7 @@ struct idpf_vport_max_q;
 #define IDPF_NUM_CHUNKS_PER_MSG(struct_sz, chunk_sz)	\
 	((IDPF_CTLQ_MAX_BUF_LEN - (struct_sz)) / (chunk_sz))
 
+#define IDPF_WAIT_FOR_MARKER_TIMEO	500
 #define IDPF_MAX_WAIT			500
 
 /* available message levels */
@@ -224,13 +225,10 @@ enum idpf_vport_reset_cause {
 /**
  * enum idpf_vport_flags - Vport flags
  * @IDPF_VPORT_DEL_QUEUES: To send delete queues message
- * @IDPF_VPORT_SW_MARKER: Indicate TX pipe drain software marker packets
- *			  processing is done
  * @IDPF_VPORT_FLAGS_NBITS: Must be last
  */
 enum idpf_vport_flags {
 	IDPF_VPORT_DEL_QUEUES,
-	IDPF_VPORT_SW_MARKER,
 	IDPF_VPORT_FLAGS_NBITS,
 };
 
@@ -289,7 +287,6 @@ struct idpf_port_stats {
  * @tx_itr_profile: TX profiles for Dynamic Interrupt Moderation
  * @port_stats: per port csum, header split, and other offload stats
  * @link_up: True if link is up
- * @sw_marker_wq: workqueue for marker packets
  */
 struct idpf_vport {
 	u16 num_txq;
@@ -332,8 +329,6 @@ struct idpf_vport {
 	struct idpf_port_stats port_stats;
 
 	bool link_up;
-
-	wait_queue_head_t sw_marker_wq;
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index 9f938301b2c5..dd6cc3b5cdab 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -286,7 +286,6 @@ struct idpf_ptype_state {
  *			  bit and Q_RFL_GEN is the SW bit.
  * @__IDPF_Q_FLOW_SCH_EN: Enable flow scheduling
  * @__IDPF_Q_SW_MARKER: Used to indicate TX queue marker completions
- * @__IDPF_Q_POLL_MODE: Enable poll mode
  * @__IDPF_Q_CRC_EN: enable CRC offload in singleq mode
  * @__IDPF_Q_HSPLIT_EN: enable header split on Rx (splitq)
  * @__IDPF_Q_FLAGS_NBITS: Must be last
@@ -296,7 +295,6 @@ enum idpf_queue_flags_t {
 	__IDPF_Q_RFL_GEN_CHK,
 	__IDPF_Q_FLOW_SCH_EN,
 	__IDPF_Q_SW_MARKER,
-	__IDPF_Q_POLL_MODE,
 	__IDPF_Q_CRC_EN,
 	__IDPF_Q_HSPLIT_EN,
 
@@ -1044,6 +1042,8 @@ bool idpf_rx_singleq_buf_hw_alloc_all(struct idpf_rx_queue *rxq,
 				      u16 cleaned_count);
 int idpf_tso(struct sk_buff *skb, struct idpf_tx_offload_params *off);
 
+void idpf_wait_for_sw_marker_completion(struct idpf_tx_queue *txq);
+
 static inline bool idpf_tx_maybe_stop_common(struct idpf_tx_queue *tx_q,
 					     u32 needed)
 {
diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
index f3aea7bcdaa3..e17582d15e27 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
@@ -1501,8 +1501,6 @@ void idpf_init_task(struct work_struct *work)
 	index = vport->idx;
 	vport_config = adapter->vport_config[index];
 
-	init_waitqueue_head(&vport->sw_marker_wq);
-
 	spin_lock_init(&vport_config->mac_filter_list_lock);
 
 	INIT_LIST_HEAD(&vport_config->user_config.mac_filter_list);
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index a240ed115e3e..4e3de6031422 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -1626,32 +1626,6 @@ int idpf_vport_queues_alloc(struct idpf_vport *vport)
 	return err;
 }
 
-/**
- * idpf_tx_handle_sw_marker - Handle queue marker packet
- * @tx_q: tx queue to handle software marker
- */
-static void idpf_tx_handle_sw_marker(struct idpf_tx_queue *tx_q)
-{
-	struct idpf_netdev_priv *priv = netdev_priv(tx_q->netdev);
-	struct idpf_vport *vport = priv->vport;
-	int i;
-
-	idpf_queue_clear(SW_MARKER, tx_q);
-	/* Hardware must write marker packets to all queues associated with
-	 * completion queues. So check if all queues received marker packets
-	 */
-	for (i = 0; i < vport->num_txq; i++)
-		/* If we're still waiting on any other TXQ marker completions,
-		 * just return now since we cannot wake up the marker_wq yet.
-		 */
-		if (idpf_queue_has(SW_MARKER, vport->txqs[i]))
-			return;
-
-	/* Drain complete */
-	set_bit(IDPF_VPORT_SW_MARKER, vport->flags);
-	wake_up(&vport->sw_marker_wq);
-}
-
 /**
  * idpf_tx_clean_stashed_bufs - clean bufs that were stored for
  * out of order completions
@@ -2008,6 +1982,19 @@ idpf_tx_handle_rs_cmpl_fb(struct idpf_tx_queue *txq,
 		idpf_tx_clean_stashed_bufs(txq, compl_tag, cleaned, budget);
 }
 
+/**
+ * idpf_tx_update_complq_indexes - update completion queue indexes
+ * @complq: completion queue being updated
+ * @ntc: current "next to clean" index value
+ * @gen_flag: current "generation" flag value
+ */
+static void idpf_tx_update_complq_indexes(struct idpf_compl_queue *complq,
+					  int ntc, bool gen_flag)
+{
+	complq->next_to_clean = ntc + complq->desc_count;
+	idpf_queue_assign(GEN_CHK, complq, gen_flag);
+}
+
 /**
  * idpf_tx_finalize_complq - Finalize completion queue cleaning
  * @complq: completion queue to finalize
@@ -2059,8 +2046,7 @@ static void idpf_tx_finalize_complq(struct idpf_compl_queue *complq, int ntc,
 		tx_q->cleaned_pkts = 0;
 	}
 
-	complq->next_to_clean = ntc + complq->desc_count;
-	idpf_queue_assign(GEN_CHK, complq, gen_flag);
+	idpf_tx_update_complq_indexes(complq, ntc, gen_flag);
 }
 
 /**
@@ -2115,9 +2101,6 @@ static bool idpf_tx_clean_complq(struct idpf_compl_queue *complq, int budget,
 							  &cleaned_stats,
 							  budget);
 			break;
-		case IDPF_TXD_COMPLT_SW_MARKER:
-			idpf_tx_handle_sw_marker(tx_q);
-			break;
 		case -ENODATA:
 			goto exit_clean_complq;
 		case -EINVAL:
@@ -2159,6 +2142,59 @@ static bool idpf_tx_clean_complq(struct idpf_compl_queue *complq, int budget,
 	return !!complq_budget;
 }
 
+/**
+ * idpf_wait_for_sw_marker_completion - wait for SW marker of disabled Tx queue
+ * @txq: disabled Tx queue
+ */
+void idpf_wait_for_sw_marker_completion(struct idpf_tx_queue *txq)
+{
+	struct idpf_compl_queue *complq = txq->txq_grp->complq;
+	struct idpf_splitq_4b_tx_compl_desc *tx_desc;
+	s16 ntc = complq->next_to_clean;
+	unsigned long timeout;
+	bool flow, gen_flag;
+	u32 pos = ntc;
+
+	if (!idpf_queue_has(SW_MARKER, txq))
+		return;
+
+	flow = idpf_queue_has(FLOW_SCH_EN, complq);
+	gen_flag = idpf_queue_has(GEN_CHK, complq);
+
+	timeout = jiffies + msecs_to_jiffies(IDPF_WAIT_FOR_MARKER_TIMEO);
+	tx_desc = flow ? &complq->comp[pos].common : &complq->comp_4b[pos];
+	ntc -= complq->desc_count;
+
+	do {
+		struct idpf_tx_queue *tx_q;
+		int ctype;
+
+		ctype = idpf_parse_compl_desc(tx_desc, complq, &tx_q,
+					      gen_flag);
+		if (ctype == IDPF_TXD_COMPLT_SW_MARKER) {
+			idpf_queue_clear(SW_MARKER, tx_q);
+			if (txq == tx_q)
+				break;
+		} else if (ctype == -ENODATA) {
+			usleep_range(500, 1000);
+			continue;
+		}
+
+		pos++;
+		ntc++;
+		if (unlikely(!ntc)) {
+			ntc -= complq->desc_count;
+			pos = 0;
+			gen_flag = !gen_flag;
+		}
+
+		tx_desc = flow ? &complq->comp[pos].common :
+			  &complq->comp_4b[pos];
+		prefetch(tx_desc);
+	} while (time_before(jiffies, timeout));
+
+	idpf_tx_update_complq_indexes(complq, ntc, gen_flag);
+}
 /**
  * idpf_tx_splitq_build_ctb - populate command tag and size for queue
  * based scheduling descriptors
@@ -4130,15 +4166,7 @@ static int idpf_vport_splitq_napi_poll(struct napi_struct *napi, int budget)
 	else
 		idpf_vport_intr_set_wb_on_itr(q_vector);
 
-	/* Switch to poll mode in the tear-down path after sending disable
-	 * queues virtchnl message, as the interrupts will be disabled after
-	 * that
-	 */
-	if (unlikely(q_vector->num_txq && idpf_queue_has(POLL_MODE,
-							 q_vector->tx[0])))
-		return budget;
-	else
-		return work_done;
+	return work_done;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
index 135af3cc243f..24495e4d6c78 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
@@ -752,21 +752,17 @@ int idpf_recv_mb_msg(struct idpf_adapter *adapter)
  **/
 static int idpf_wait_for_marker_event(struct idpf_vport *vport)
 {
-	int event;
-	int i;
-
-	for (i = 0; i < vport->num_txq; i++)
-		idpf_queue_set(SW_MARKER, vport->txqs[i]);
+	bool markers_rcvd = true;
 
-	event = wait_event_timeout(vport->sw_marker_wq,
-				   test_and_clear_bit(IDPF_VPORT_SW_MARKER,
-						      vport->flags),
-				   msecs_to_jiffies(500));
+	for (u32 i = 0; i < vport->num_txq; i++) {
+		struct idpf_tx_queue *txq = vport->txqs[i];
 
-	for (i = 0; i < vport->num_txq; i++)
-		idpf_queue_clear(POLL_MODE, vport->txqs[i]);
+		idpf_queue_set(SW_MARKER, txq);
+		idpf_wait_for_sw_marker_completion(txq);
+		markers_rcvd &= !idpf_queue_has(SW_MARKER, txq);
+	}
 
-	if (event)
+	if (markers_rcvd)
 		return 0;
 
 	dev_warn(&vport->adapter->pdev->dev, "Failed to receive marker packets\n");
@@ -1993,24 +1989,12 @@ int idpf_send_enable_queues_msg(struct idpf_vport *vport)
  */
 int idpf_send_disable_queues_msg(struct idpf_vport *vport)
 {
-	int err, i;
+	int err;
 
 	err = idpf_send_ena_dis_queues_msg(vport, false);
 	if (err)
 		return err;
 
-	/* switch to poll mode as interrupts will be disabled after disable
-	 * queues virtchnl message is sent
-	 */
-	for (i = 0; i < vport->num_txq; i++)
-		idpf_queue_set(POLL_MODE, vport->txqs[i]);
-
-	/* schedule the napi to receive all the marker packets */
-	local_bh_disable();
-	for (i = 0; i < vport->num_q_vectors; i++)
-		napi_schedule(&vport->q_vectors[i].napi);
-	local_bh_enable();
-
 	return idpf_wait_for_marker_event(vport);
 }
 
-- 
2.48.1



* [PATCH net-next 10/16] idpf: add support for nointerrupt queues
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (8 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 09/16] idpf: remove SW marker handling from NAPI Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-07 12:10   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 11/16] idpf: prepare structures to support XDP Alexander Lobakin
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

Currently, queues are associated 1:1 with interrupt vectors, as it's
assumed queues are always interrupt-driven.
In order to use a queue without an interrupt, idpf still needs a vector
assigned to it to flush descriptors. One such vector is enough for the
whole vport: it can be shared by all of its noirq queues.
Always request one extra vector and configure it in non-interrupt mode
right away when creating the vport, so that it can later be used by
such queues when needed.

Co-developed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
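A condensed view of the vector accounting and the noirq vector setup
from the hunks below (PF register flavour and direct vector indexing
shown for brevity; the VF path uses the VF_INT_DYN_CTLN_* counterparts):

	/* always ask for one more vector than the queues need */
	vec_info.num_req_vecs = max(vport->num_txq, vport->num_rxq) +
				IDPF_RESERVED_VECS;
	...
	/* the last allocated index is not given to any q_vector */
	vport->num_q_vectors = num_alloc_vecs - IDPF_RESERVED_VECS;
	vport->noirq_v_idx = vport->q_vector_idxs[vport->num_q_vectors];

	/* put the vector in non-interrupt (WB-on-ITR) mode right away */
	vport->noirq_dyn_ctl_ena = PF_GLINT_DYN_CTL_WB_ON_ITR_M |
				   PF_GLINT_DYN_CTL_INTENA_MSK_M |
				   FIELD_PREP(PF_GLINT_DYN_CTL_ITR_INDX_M,
					      IDPF_NO_ITR_UPDATE_IDX);

	/* enabled and disabled together with the regular vectors */
	writel(vport->noirq_dyn_ctl_ena, vport->noirq_dyn_ctl);
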
 drivers/net/ethernet/intel/idpf/idpf.h        |  8 +++
 drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  4 ++
 drivers/net/ethernet/intel/idpf/idpf_dev.c    | 11 +++-
 drivers/net/ethernet/intel/idpf/idpf_lib.c    |  2 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.c   |  8 +++
 drivers/net/ethernet/intel/idpf/idpf_vf_dev.c | 11 +++-
 .../net/ethernet/intel/idpf/idpf_virtchnl.c   | 53 +++++++++++++------
 7 files changed, 79 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/intel/idpf/idpf.h b/drivers/net/ethernet/intel/idpf/idpf.h
index 6b51a5dcc1e0..50dde09c525b 100644
--- a/drivers/net/ethernet/intel/idpf/idpf.h
+++ b/drivers/net/ethernet/intel/idpf/idpf.h
@@ -281,6 +281,9 @@ struct idpf_port_stats {
  * @num_q_vectors: Number of IRQ vectors allocated
  * @q_vectors: Array of queue vectors
  * @q_vector_idxs: Starting index of queue vectors
+ * @noirq_dyn_ctl: register to enable/disable the vector for NOIRQ queues
+ * @noirq_dyn_ctl_ena: value to write to the above to enable it
+ * @noirq_v_idx: ID of the NOIRQ vector
  * @max_mtu: device given max possible MTU
  * @default_mac_addr: device will give a default MAC to use
  * @rx_itr_profile: RX profiles for Dynamic Interrupt Moderation
@@ -322,6 +325,11 @@ struct idpf_vport {
 	u16 num_q_vectors;
 	struct idpf_q_vector *q_vectors;
 	u16 *q_vector_idxs;
+
+	void __iomem *noirq_dyn_ctl;
+	u32 noirq_dyn_ctl_ena;
+	u16 noirq_v_idx;
+
 	u16 max_mtu;
 	u8 default_mac_addr[ETH_ALEN];
 	u16 rx_itr_profile[IDPF_DIM_PROFILE_SLOTS];
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index dd6cc3b5cdab..fb3b352d542e 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -57,6 +57,8 @@
 /* Default vector sharing */
 #define IDPF_MBX_Q_VEC		1
 #define IDPF_MIN_Q_VEC		1
+/* Data vector for NOIRQ queues */
+#define IDPF_RESERVED_VECS			1
 
 #define IDPF_DFLT_TX_Q_DESC_COUNT		512
 #define IDPF_DFLT_TX_COMPLQ_DESC_COUNT		512
@@ -288,6 +290,7 @@ struct idpf_ptype_state {
  * @__IDPF_Q_SW_MARKER: Used to indicate TX queue marker completions
  * @__IDPF_Q_CRC_EN: enable CRC offload in singleq mode
  * @__IDPF_Q_HSPLIT_EN: enable header split on Rx (splitq)
+ * @__IDPF_Q_NOIRQ: queue is polling-driven and has no interrupt
  * @__IDPF_Q_FLAGS_NBITS: Must be last
  */
 enum idpf_queue_flags_t {
@@ -297,6 +300,7 @@ enum idpf_queue_flags_t {
 	__IDPF_Q_SW_MARKER,
 	__IDPF_Q_CRC_EN,
 	__IDPF_Q_HSPLIT_EN,
+	__IDPF_Q_NOIRQ,
 
 	__IDPF_Q_FLAGS_NBITS,
 };
diff --git a/drivers/net/ethernet/intel/idpf/idpf_dev.c b/drivers/net/ethernet/intel/idpf/idpf_dev.c
index 41e4bd49402a..5f177933b55c 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_dev.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_dev.c
@@ -73,7 +73,7 @@ static int idpf_intr_reg_init(struct idpf_vport *vport)
 	int num_vecs = vport->num_q_vectors;
 	struct idpf_vec_regs *reg_vals;
 	int num_regs, i, err = 0;
-	u32 rx_itr, tx_itr;
+	u32 rx_itr, tx_itr, val;
 	u16 total_vecs;
 
 	total_vecs = idpf_get_reserved_vecs(vport->adapter);
@@ -117,6 +117,15 @@ static int idpf_intr_reg_init(struct idpf_vport *vport)
 		intr->tx_itr = idpf_get_reg_addr(adapter, tx_itr);
 	}
 
+	/* Data vector for NOIRQ queues */
+
+	val = reg_vals[vport->q_vector_idxs[i] - IDPF_MBX_Q_VEC].dyn_ctl_reg;
+	vport->noirq_dyn_ctl = idpf_get_reg_addr(adapter, val);
+
+	val = PF_GLINT_DYN_CTL_WB_ON_ITR_M | PF_GLINT_DYN_CTL_INTENA_MSK_M |
+	      FIELD_PREP(PF_GLINT_DYN_CTL_ITR_INDX_M, IDPF_NO_ITR_UPDATE_IDX);
+	vport->noirq_dyn_ctl_ena = val;
+
 free_reg_vals:
 	kfree(reg_vals);
 
diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
index e17582d15e27..2594ca38e8ca 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
@@ -1126,7 +1126,7 @@ static struct idpf_vport *idpf_vport_alloc(struct idpf_adapter *adapter,
 	vport->default_vport = adapter->num_alloc_vports <
 			       idpf_get_default_vports(adapter);
 
-	num_max_q = max(max_q->max_txq, max_q->max_rxq);
+	num_max_q = max(max_q->max_txq, max_q->max_rxq) + IDPF_RESERVED_VECS;
 	vport->q_vector_idxs = kcalloc(num_max_q, sizeof(u16), GFP_KERNEL);
 	if (!vport->q_vector_idxs) {
 		kfree(vport);
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index 4e3de6031422..5d51e68c2878 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -3710,6 +3710,8 @@ static void idpf_vport_intr_dis_irq_all(struct idpf_vport *vport)
 	struct idpf_q_vector *q_vector = vport->q_vectors;
 	int q_idx;
 
+	writel(0, vport->noirq_dyn_ctl);
+
 	for (q_idx = 0; q_idx < vport->num_q_vectors; q_idx++)
 		writel(0, q_vector[q_idx].intr_reg.dyn_ctl);
 }
@@ -3963,6 +3965,8 @@ static void idpf_vport_intr_ena_irq_all(struct idpf_vport *vport)
 		if (qv->num_txq || qv->num_rxq)
 			idpf_vport_intr_update_itr_ena_irq(qv);
 	}
+
+	writel(vport->noirq_dyn_ctl_ena, vport->noirq_dyn_ctl);
 }
 
 /**
@@ -4274,6 +4278,8 @@ static int idpf_vport_intr_init_vec_idx(struct idpf_vport *vport)
 		for (i = 0; i < vport->num_q_vectors; i++)
 			vport->q_vectors[i].v_idx = vport->q_vector_idxs[i];
 
+		vport->noirq_v_idx = vport->q_vector_idxs[i];
+
 		return 0;
 	}
 
@@ -4287,6 +4293,8 @@ static int idpf_vport_intr_init_vec_idx(struct idpf_vport *vport)
 	for (i = 0; i < vport->num_q_vectors; i++)
 		vport->q_vectors[i].v_idx = vecids[vport->q_vector_idxs[i]];
 
+	vport->noirq_v_idx = vecids[vport->q_vector_idxs[i]];
+
 	kfree(vecids);
 
 	return 0;
diff --git a/drivers/net/ethernet/intel/idpf/idpf_vf_dev.c b/drivers/net/ethernet/intel/idpf/idpf_vf_dev.c
index aba828abcb17..a6993a01a9b0 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_vf_dev.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_vf_dev.c
@@ -73,7 +73,7 @@ static int idpf_vf_intr_reg_init(struct idpf_vport *vport)
 	int num_vecs = vport->num_q_vectors;
 	struct idpf_vec_regs *reg_vals;
 	int num_regs, i, err = 0;
-	u32 rx_itr, tx_itr;
+	u32 rx_itr, tx_itr, val;
 	u16 total_vecs;
 
 	total_vecs = idpf_get_reserved_vecs(vport->adapter);
@@ -117,6 +117,15 @@ static int idpf_vf_intr_reg_init(struct idpf_vport *vport)
 		intr->tx_itr = idpf_get_reg_addr(adapter, tx_itr);
 	}
 
+	/* Data vector for NOIRQ queues */
+
+	val = reg_vals[vport->q_vector_idxs[i] - IDPF_MBX_Q_VEC].dyn_ctl_reg;
+	vport->noirq_dyn_ctl = idpf_get_reg_addr(adapter, val);
+
+	val = VF_INT_DYN_CTLN_WB_ON_ITR_M | VF_INT_DYN_CTLN_INTENA_MSK_M |
+	      FIELD_PREP(VF_INT_DYN_CTLN_ITR_INDX_M, IDPF_NO_ITR_UPDATE_IDX);
+	vport->noirq_dyn_ctl_ena = val;
+
 free_reg_vals:
 	kfree(reg_vals);
 
diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
index 24495e4d6c78..aa45821f38f1 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
@@ -1871,21 +1871,31 @@ int idpf_send_map_unmap_queue_vector_msg(struct idpf_vport *vport, bool map)
 		struct idpf_txq_group *tx_qgrp = &vport->txq_grps[i];
 
 		for (j = 0; j < tx_qgrp->num_txq; j++, k++) {
+			const struct idpf_tx_queue *txq = tx_qgrp->txqs[j];
+			const struct idpf_q_vector *vec;
+			u32 v_idx, tx_itr_idx;
+
 			vqv[k].queue_type =
 				cpu_to_le32(VIRTCHNL2_QUEUE_TYPE_TX);
-			vqv[k].queue_id = cpu_to_le32(tx_qgrp->txqs[j]->q_id);
+			vqv[k].queue_id = cpu_to_le32(txq->q_id);
 
-			if (idpf_is_queue_model_split(vport->txq_model)) {
-				vqv[k].vector_id =
-				cpu_to_le16(tx_qgrp->complq->q_vector->v_idx);
-				vqv[k].itr_idx =
-				cpu_to_le32(tx_qgrp->complq->q_vector->tx_itr_idx);
+			if (idpf_queue_has(NOIRQ, txq))
+				vec = NULL;
+			else if (idpf_is_queue_model_split(vport->txq_model))
+				vec = txq->txq_grp->complq->q_vector;
+			else
+				vec = txq->q_vector;
+
+			if (vec) {
+				v_idx = vec->v_idx;
+				tx_itr_idx = vec->tx_itr_idx;
 			} else {
-				vqv[k].vector_id =
-				cpu_to_le16(tx_qgrp->txqs[j]->q_vector->v_idx);
-				vqv[k].itr_idx =
-				cpu_to_le32(tx_qgrp->txqs[j]->q_vector->tx_itr_idx);
+				v_idx = vport->noirq_v_idx;
+				tx_itr_idx = VIRTCHNL2_ITR_IDX_1;
 			}
+
+			vqv[k].vector_id = cpu_to_le16(v_idx);
+			vqv[k].itr_idx = cpu_to_le32(tx_itr_idx);
 		}
 	}
 
@@ -1903,6 +1913,7 @@ int idpf_send_map_unmap_queue_vector_msg(struct idpf_vport *vport, bool map)
 
 		for (j = 0; j < num_rxq; j++, k++) {
 			struct idpf_rx_queue *rxq;
+			u32 v_idx, rx_itr_idx;
 
 			if (idpf_is_queue_model_split(vport->rxq_model))
 				rxq = &rx_qgrp->splitq.rxq_sets[j]->rxq;
@@ -1912,8 +1923,17 @@ int idpf_send_map_unmap_queue_vector_msg(struct idpf_vport *vport, bool map)
 			vqv[k].queue_type =
 				cpu_to_le32(VIRTCHNL2_QUEUE_TYPE_RX);
 			vqv[k].queue_id = cpu_to_le32(rxq->q_id);
-			vqv[k].vector_id = cpu_to_le16(rxq->q_vector->v_idx);
-			vqv[k].itr_idx = cpu_to_le32(rxq->q_vector->rx_itr_idx);
+
+			if (idpf_queue_has(NOIRQ, rxq)) {
+				v_idx = vport->noirq_v_idx;
+				rx_itr_idx = VIRTCHNL2_ITR_IDX_0;
+			} else {
+				v_idx = rxq->q_vector->v_idx;
+				rx_itr_idx = rxq->q_vector->rx_itr_idx;
+			}
+
+			vqv[k].vector_id = cpu_to_le16(v_idx);
+			vqv[k].itr_idx = cpu_to_le32(rx_itr_idx);
 		}
 	}
 
@@ -3106,9 +3126,12 @@ int idpf_vport_alloc_vec_indexes(struct idpf_vport *vport)
 {
 	struct idpf_vector_info vec_info;
 	int num_alloc_vecs;
+	u32 req;
+
+	vec_info.num_curr_vecs = vport->num_q_vectors + IDPF_RESERVED_VECS;
+	req = max(vport->num_txq, vport->num_rxq) + IDPF_RESERVED_VECS;
+	vec_info.num_req_vecs = req;
 
-	vec_info.num_curr_vecs = vport->num_q_vectors;
-	vec_info.num_req_vecs = max(vport->num_txq, vport->num_rxq);
 	vec_info.default_vport = vport->default_vport;
 	vec_info.index = vport->idx;
 
@@ -3121,7 +3144,7 @@ int idpf_vport_alloc_vec_indexes(struct idpf_vport *vport)
 		return -EINVAL;
 	}
 
-	vport->num_q_vectors = num_alloc_vecs;
+	vport->num_q_vectors = num_alloc_vecs - IDPF_RESERVED_VECS;
 
 	return 0;
 }
-- 
2.48.1



* [PATCH net-next 11/16] idpf: prepare structures to support XDP
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (9 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 10/16] idpf: add support for nointerrupt queues Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-07  1:12   ` Jakub Kicinski
  2025-03-07 13:27   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 12/16] idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq Alexander Lobakin
                   ` (5 subsequent siblings)
  16 siblings, 2 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Michal Kubiak <michal.kubiak@intel.com>

Extend the basic structures of the driver (e.g. 'idpf_vport',
'idpf_*_queue', 'idpf_vport_user_config_data') by adding the members
necessary to support XDP.
Add extra XDP Tx queues needed to support XDP_TX and XDP_REDIRECT
actions without interfering with regular Tx traffic.
Also add functions dedicated to XDP initialization for the Rx and Tx
queues and call them from the existing queue configuration paths.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
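The resulting Tx queue layout, condensed from the
idpf_vport_calc_total_qs() and idpf_vport_init_num_qs() hunks below:
the XDPSQs are allocated on top of the regular (skb) Tx queues, so the
first XDPSQ starts right after them.

	/* ask the CP for extra Tx queues to be used as XDPSQs */
	num_xdpq = libeth_xdpsq_num(user->num_req_rx_qs,
				    user->num_req_tx_qs,
				    vport_config->max_q.max_txq);
	vport_msg->num_tx_q = cpu_to_le16(user->num_req_tx_qs + num_xdpq);

	/* hence the first XDPSQ index equals the number of skb Tx queues */
	vport->xdp_txq_offset = config_data->num_req_tx_qs;
	vport->num_xdp_txq = le16_to_cpu(vport_msg->num_tx_q) -
			     vport->xdp_txq_offset;
	/* sharing is enabled when there are fewer XDPSQs than CPUs */
	vport->xdpq_share = libeth_xdpsq_shared(vport->num_xdp_txq);
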
 drivers/net/ethernet/intel/idpf/Kconfig       |   2 +-
 drivers/net/ethernet/intel/idpf/Makefile      |   2 +
 drivers/net/ethernet/intel/idpf/idpf.h        |  20 ++
 drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  86 ++++++--
 drivers/net/ethernet/intel/idpf/xdp.h         |  17 ++
 .../net/ethernet/intel/idpf/idpf_ethtool.c    |   6 +-
 drivers/net/ethernet/intel/idpf/idpf_lib.c    |  21 +-
 drivers/net/ethernet/intel/idpf/idpf_main.c   |   1 +
 .../ethernet/intel/idpf/idpf_singleq_txrx.c   |   8 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.c   | 109 +++++++---
 .../net/ethernet/intel/idpf/idpf_virtchnl.c   |  26 +--
 drivers/net/ethernet/intel/idpf/xdp.c         | 189 ++++++++++++++++++
 12 files changed, 415 insertions(+), 72 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/idpf/xdp.h
 create mode 100644 drivers/net/ethernet/intel/idpf/xdp.c

diff --git a/drivers/net/ethernet/intel/idpf/Kconfig b/drivers/net/ethernet/intel/idpf/Kconfig
index 1addd663acad..7207ee4dbae8 100644
--- a/drivers/net/ethernet/intel/idpf/Kconfig
+++ b/drivers/net/ethernet/intel/idpf/Kconfig
@@ -5,7 +5,7 @@ config IDPF
 	tristate "Intel(R) Infrastructure Data Path Function Support"
 	depends on PCI_MSI
 	select DIMLIB
-	select LIBETH
+	select LIBETH_XDP
 	help
 	  This driver supports Intel(R) Infrastructure Data Path Function
 	  devices.
diff --git a/drivers/net/ethernet/intel/idpf/Makefile b/drivers/net/ethernet/intel/idpf/Makefile
index 2ce01a0b5898..c58abe6f8f5d 100644
--- a/drivers/net/ethernet/intel/idpf/Makefile
+++ b/drivers/net/ethernet/intel/idpf/Makefile
@@ -17,3 +17,5 @@ idpf-y := \
 	idpf_vf_dev.o
 
 idpf-$(CONFIG_IDPF_SINGLEQ)	+= idpf_singleq_txrx.o
+
+idpf-y				+= xdp.o
diff --git a/drivers/net/ethernet/intel/idpf/idpf.h b/drivers/net/ethernet/intel/idpf/idpf.h
index 50dde09c525b..4847760744ff 100644
--- a/drivers/net/ethernet/intel/idpf/idpf.h
+++ b/drivers/net/ethernet/intel/idpf/idpf.h
@@ -257,6 +257,10 @@ struct idpf_port_stats {
  * @txq_model: Split queue or single queue queuing model
  * @txqs: Used only in hotpath to get to the right queue very fast
  * @crc_enable: Enable CRC insertion offload
+ * @xdpq_share: whether XDPSQ sharing is enabled
+ * @num_xdp_txq: number of XDPSQs
+ * @xdp_txq_offset: index of the first XDPSQ (== number of regular SQs)
+ * @xdp_prog: installed XDP program
  * @num_rxq: Number of allocated RX queues
  * @num_bufq: Number of allocated buffer queues
  * @rxq_desc_count: RX queue descriptor count. *MUST* have enough descriptors
@@ -303,6 +307,11 @@ struct idpf_vport {
 	struct idpf_tx_queue **txqs;
 	bool crc_enable;
 
+	bool xdpq_share;
+	u16 num_xdp_txq;
+	u16 xdp_txq_offset;
+	struct bpf_prog *xdp_prog;
+
 	u16 num_rxq;
 	u16 num_bufq;
 	u32 rxq_desc_count;
@@ -380,6 +389,7 @@ struct idpf_rss_data {
  *		      ethtool
  * @num_req_rxq_desc: Number of user requested RX queue descriptors through
  *		      ethtool
+ * @xdp_prog: requested XDP program to install
  * @user_flags: User toggled config flags
  * @mac_filter_list: List of MAC filters
  *
@@ -391,6 +401,7 @@ struct idpf_vport_user_config_data {
 	u16 num_req_rx_qs;
 	u32 num_req_txq_desc;
 	u32 num_req_rxq_desc;
+	struct bpf_prog *xdp_prog;
 	DECLARE_BITMAP(user_flags, __IDPF_USER_FLAGS_NBITS);
 	struct list_head mac_filter_list;
 };
@@ -604,6 +615,15 @@ static inline int idpf_is_queue_model_split(u16 q_model)
 	       q_model == VIRTCHNL2_QUEUE_MODEL_SPLIT;
 }
 
+/**
+ * idpf_xdp_is_prog_ena - check if there is an XDP program on adapter
+ * @vport: vport to check
+ */
+static inline bool idpf_xdp_is_prog_ena(const struct idpf_vport *vport)
+{
+	return vport->adapter && vport->xdp_prog;
+}
+
 #define idpf_is_cap_ena(adapter, field, flag) \
 	idpf_is_capability_ena(adapter, false, field, flag)
 #define idpf_is_cap_ena_all(adapter, field, flag) \
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index fb3b352d542e..6d9eb6f4ab38 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -7,8 +7,10 @@
 #include <linux/dim.h>
 
 #include <net/libeth/cache.h>
-#include <net/tcp.h>
+#include <net/libeth/types.h>
 #include <net/netdev_queues.h>
+#include <net/tcp.h>
+#include <net/xdp.h>
 
 #include "idpf_lan_txrx.h"
 #include "virtchnl2_lan_desc.h"
@@ -291,6 +293,7 @@ struct idpf_ptype_state {
  * @__IDPF_Q_CRC_EN: enable CRC offload in singleq mode
  * @__IDPF_Q_HSPLIT_EN: enable header split on Rx (splitq)
  * @__IDPF_Q_NOIRQ: queue is polling-driven and has no interrupt
+ * @__IDPF_Q_XDP: this is an XDP queue
  * @__IDPF_Q_FLAGS_NBITS: Must be last
  */
 enum idpf_queue_flags_t {
@@ -301,6 +304,7 @@ enum idpf_queue_flags_t {
 	__IDPF_Q_CRC_EN,
 	__IDPF_Q_HSPLIT_EN,
 	__IDPF_Q_NOIRQ,
+	__IDPF_Q_XDP,
 
 	__IDPF_Q_FLAGS_NBITS,
 };
@@ -483,19 +487,21 @@ struct idpf_txq_stash {
  * @napi: NAPI instance corresponding to this queue (splitq)
  * @rx_buf: See struct &libeth_fqe
  * @pp: Page pool pointer in singleq mode
- * @netdev: &net_device corresponding to this queue
  * @tail: Tail offset. Used for both queue models single and split.
  * @flags: See enum idpf_queue_flags_t
  * @idx: For RX queue, it is used to index to total RX queue across groups and
  *	 used for skb reporting.
  * @desc_count: Number of descriptors
+ * @num_xdp_txq: total number of XDP Tx queues
+ * @xdpqs: shortcut for XDP Tx queues array
  * @rxdids: Supported RX descriptor ids
+ * @truesize: data buffer truesize in singleq
  * @rx_ptype_lkup: LUT of Rx ptypes
+ * @xdp_rxq: XDP queue info
  * @next_to_use: Next descriptor to use
  * @next_to_clean: Next descriptor to clean
  * @next_to_alloc: RX buffer to allocate at
  * @skb: Pointer to the skb
- * @truesize: data buffer truesize in singleq
  * @stats_sync: See struct u64_stats_sync
  * @q_stats: See union idpf_rx_queue_stats
  * @q_id: Queue id
@@ -525,15 +531,23 @@ struct idpf_rx_queue {
 			struct page_pool *pp;
 		};
 	};
-	struct net_device *netdev;
 	void __iomem *tail;
 
 	DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
 	u16 idx;
 	u16 desc_count;
 
-	u32 rxdids;
+	u32 num_xdp_txq;
+	union {
+		struct idpf_tx_queue **xdpqs;
+		struct {
+			u32 rxdids;
+			u32 truesize;
+		};
+	};
 	const struct libeth_rx_pt *rx_ptype_lkup;
+
+	struct xdp_rxq_info xdp_rxq;
 	__cacheline_group_end_aligned(read_mostly);
 
 	__cacheline_group_begin_aligned(read_write);
@@ -542,7 +556,6 @@ struct idpf_rx_queue {
 	u16 next_to_alloc;
 
 	struct sk_buff *skb;
-	u32 truesize;
 
 	struct u64_stats_sync stats_sync;
 	struct idpf_rx_queue_stats q_stats;
@@ -561,8 +574,11 @@ struct idpf_rx_queue {
 	u16 rx_max_pkt_size;
 	__cacheline_group_end_aligned(cold);
 };
-libeth_cacheline_set_assert(struct idpf_rx_queue, 64,
-			    80 + sizeof(struct u64_stats_sync),
+libeth_cacheline_set_assert(struct idpf_rx_queue,
+			    ALIGN(64, __alignof(struct xdp_rxq_info)) +
+			    sizeof(struct xdp_rxq_info),
+			    72 + offsetof(struct idpf_rx_queue, q_stats) -
+			    offsetofend(struct idpf_rx_queue, skb),
 			    32);
 
 /**
@@ -574,6 +590,7 @@ libeth_cacheline_set_assert(struct idpf_rx_queue, 64,
  * @desc_ring: virtual descriptor ring address
  * @tx_buf: See struct idpf_tx_buf
  * @txq_grp: See struct idpf_txq_group
+ * @complq: corresponding completion queue in XDP mode
  * @dev: Device back pointer for DMA mapping
  * @tail: Tail offset. Used for both queue models single and split
  * @flags: See enum idpf_queue_flags_t
@@ -601,6 +618,7 @@ libeth_cacheline_set_assert(struct idpf_rx_queue, 64,
  *	--------------------------------
  *
  *	This gives us 8*8160 = 65280 possible unique values.
+ * @thresh: XDP queue cleaning threshold
  * @netdev: &net_device corresponding to this queue
  * @next_to_use: Next descriptor to use
  * @next_to_clean: Next descriptor to clean
@@ -619,6 +637,10 @@ libeth_cacheline_set_assert(struct idpf_rx_queue, 64,
  * @compl_tag_bufid_m: Completion tag buffer id mask
  * @compl_tag_cur_gen: Used to keep track of current completion tag generation
  * @compl_tag_gen_max: To determine when compl_tag_cur_gen should be reset
+ * @pending: number of pending descriptors to send in QB
+ * @xdp_tx: number of pending &xdp_buff or &xdp_frame buffers
+ * @timer: timer for XDP Tx queue cleanup
+ * @xdp_lock: lock for XDP Tx queues sharing
  * @stats_sync: See struct u64_stats_sync
  * @q_stats: See union idpf_tx_queue_stats
  * @q_id: Queue id
@@ -637,7 +659,10 @@ struct idpf_tx_queue {
 		void *desc_ring;
 	};
 	struct libeth_sqe *tx_buf;
-	struct idpf_txq_group *txq_grp;
+	union {
+		struct idpf_txq_group *txq_grp;
+		struct idpf_compl_queue *complq;
+	};
 	struct device *dev;
 	void __iomem *tail;
 
@@ -645,8 +670,13 @@ struct idpf_tx_queue {
 	u16 idx;
 	u16 desc_count;
 
-	u16 tx_min_pkt_len;
-	u16 compl_tag_gen_s;
+	union {
+		struct {
+			u16 tx_min_pkt_len;
+			u16 compl_tag_gen_s;
+		};
+		u32 thresh;
+	};
 
 	struct net_device *netdev;
 	__cacheline_group_end_aligned(read_mostly);
@@ -656,17 +686,28 @@ struct idpf_tx_queue {
 	u16 next_to_clean;
 
 	union {
-		u32 cleaned_bytes;
-		u32 clean_budget;
-	};
-	u16 cleaned_pkts;
-
-	u16 tx_max_bufs;
-	struct idpf_txq_stash *stash;
+		struct {
+			union {
+				u32 cleaned_bytes;
+				u32 clean_budget;
+			};
+			u16 cleaned_pkts;
+
+			u16 tx_max_bufs;
+			struct idpf_txq_stash *stash;
+
+			u16 compl_tag_bufid_m;
+			u16 compl_tag_cur_gen;
+			u16 compl_tag_gen_max;
+		};
+		struct {
+			u32 pending;
+			u32 xdp_tx;
 
-	u16 compl_tag_bufid_m;
-	u16 compl_tag_cur_gen;
-	u16 compl_tag_gen_max;
+			struct libeth_xdpsq_timer *timer;
+			struct libeth_xdpsq_lock xdp_lock;
+		};
+	};
 
 	struct u64_stats_sync stats_sync;
 	struct idpf_tx_queue_stats q_stats;
@@ -681,7 +722,8 @@ struct idpf_tx_queue {
 	__cacheline_group_end_aligned(cold);
 };
 libeth_cacheline_set_assert(struct idpf_tx_queue, 64,
-			    88 + sizeof(struct u64_stats_sync),
+			    80 + offsetof(struct idpf_tx_queue, q_stats) -
+			    offsetofend(struct idpf_tx_queue, timer),
 			    24);
 
 /**
diff --git a/drivers/net/ethernet/intel/idpf/xdp.h b/drivers/net/ethernet/intel/idpf/xdp.h
new file mode 100644
index 000000000000..8ace8384f348
--- /dev/null
+++ b/drivers/net/ethernet/intel/idpf/xdp.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2024 Intel Corporation */
+
+#ifndef _IDPF_XDP_H_
+#define _IDPF_XDP_H_
+
+#include <linux/types.h>
+
+struct idpf_vport;
+
+int idpf_xdp_rxq_info_init_all(const struct idpf_vport *vport);
+void idpf_xdp_rxq_info_deinit_all(const struct idpf_vport *vport);
+
+int idpf_vport_xdpq_get(const struct idpf_vport *vport);
+void idpf_vport_xdpq_put(const struct idpf_vport *vport);
+
+#endif /* _IDPF_XDP_H_ */
diff --git a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
index 59b1a1a09996..1ca322bfe92f 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
@@ -186,9 +186,11 @@ static void idpf_get_channels(struct net_device *netdev,
 {
 	struct idpf_netdev_priv *np = netdev_priv(netdev);
 	struct idpf_vport_config *vport_config;
+	const struct idpf_vport *vport;
 	u16 num_txq, num_rxq;
 	u16 combined;
 
+	vport = idpf_netdev_to_vport(netdev);
 	vport_config = np->adapter->vport_config[np->vport_idx];
 
 	num_txq = vport_config->user_config.num_req_tx_qs;
@@ -202,8 +204,8 @@ static void idpf_get_channels(struct net_device *netdev,
 	ch->max_rx = vport_config->max_q.max_rxq;
 	ch->max_tx = vport_config->max_q.max_txq;
 
-	ch->max_other = IDPF_MAX_MBXQ;
-	ch->other_count = IDPF_MAX_MBXQ;
+	ch->max_other = IDPF_MAX_MBXQ + vport->num_xdp_txq;
+	ch->other_count = IDPF_MAX_MBXQ + vport->num_xdp_txq;
 
 	ch->combined_count = combined;
 	ch->rx_count = num_rxq - combined;
diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
index 2594ca38e8ca..0f4edc9cd1ad 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
@@ -3,6 +3,7 @@
 
 #include "idpf.h"
 #include "idpf_virtchnl.h"
+#include "xdp.h"
 
 static const struct net_device_ops idpf_netdev_ops;
 
@@ -888,6 +889,7 @@ static void idpf_vport_stop(struct idpf_vport *vport)
 
 	vport->link_up = false;
 	idpf_vport_intr_deinit(vport);
+	idpf_xdp_rxq_info_deinit_all(vport);
 	idpf_vport_queues_rel(vport);
 	idpf_vport_intr_rel(vport);
 	np->state = __IDPF_VPORT_DOWN;
@@ -1262,13 +1264,13 @@ static void idpf_restore_features(struct idpf_vport *vport)
  */
 static int idpf_set_real_num_queues(struct idpf_vport *vport)
 {
-	int err;
+	int err, txq = vport->num_txq - vport->num_xdp_txq;
 
 	err = netif_set_real_num_rx_queues(vport->netdev, vport->num_rxq);
 	if (err)
 		return err;
 
-	return netif_set_real_num_tx_queues(vport->netdev, vport->num_txq);
+	return netif_set_real_num_tx_queues(vport->netdev, txq);
 }
 
 /**
@@ -1377,20 +1379,29 @@ static int idpf_vport_open(struct idpf_vport *vport)
 	}
 
 	idpf_rx_init_buf_tail(vport);
+
+	err = idpf_xdp_rxq_info_init_all(vport);
+	if (err) {
+		netdev_err(vport->netdev,
+			   "Failed to initialize XDP RxQ info for vport %u: %pe\n",
+			   vport->vport_id, ERR_PTR(err));
+		goto intr_deinit;
+	}
+
 	idpf_vport_intr_ena(vport);
 
 	err = idpf_send_config_queues_msg(vport);
 	if (err) {
 		dev_err(&adapter->pdev->dev, "Failed to configure queues for vport %u, %d\n",
 			vport->vport_id, err);
-		goto intr_deinit;
+		goto rxq_deinit;
 	}
 
 	err = idpf_send_map_unmap_queue_vector_msg(vport, true);
 	if (err) {
 		dev_err(&adapter->pdev->dev, "Failed to map queue vectors for vport %u: %d\n",
 			vport->vport_id, err);
-		goto intr_deinit;
+		goto rxq_deinit;
 	}
 
 	err = idpf_send_enable_queues_msg(vport);
@@ -1438,6 +1449,8 @@ static int idpf_vport_open(struct idpf_vport *vport)
 	idpf_send_disable_queues_msg(vport);
 unmap_queue_vectors:
 	idpf_send_map_unmap_queue_vector_msg(vport, false);
+rxq_deinit:
+	idpf_xdp_rxq_info_deinit_all(vport);
 intr_deinit:
 	idpf_vport_intr_deinit(vport);
 queues_rel:
diff --git a/drivers/net/ethernet/intel/idpf/idpf_main.c b/drivers/net/ethernet/intel/idpf/idpf_main.c
index b6c515d14cbf..5e6e0758e24c 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_main.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_main.c
@@ -9,6 +9,7 @@
 
 MODULE_DESCRIPTION(DRV_SUMMARY);
 MODULE_IMPORT_NS("LIBETH");
+MODULE_IMPORT_NS("LIBETH_XDP");
 MODULE_LICENSE("GPL");
 
 /**
diff --git a/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
index aeb2ca5f5a0a..c81065b4fb24 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
@@ -601,7 +601,7 @@ static void idpf_rx_singleq_csum(struct idpf_rx_queue *rxq,
 	bool ipv4, ipv6;
 
 	/* check if Rx checksum is enabled */
-	if (!libeth_rx_pt_has_checksum(rxq->netdev, decoded))
+	if (!libeth_rx_pt_has_checksum(rxq->xdp_rxq.dev, decoded))
 		return;
 
 	/* check if HW has decoded the packet and checksum */
@@ -740,7 +740,7 @@ static void idpf_rx_singleq_base_hash(struct idpf_rx_queue *rx_q,
 {
 	u64 mask, qw1;
 
-	if (!libeth_rx_pt_has_hash(rx_q->netdev, decoded))
+	if (!libeth_rx_pt_has_hash(rx_q->xdp_rxq.dev, decoded))
 		return;
 
 	mask = VIRTCHNL2_RX_BASE_DESC_FLTSTAT_RSS_HASH_M;
@@ -768,7 +768,7 @@ static void idpf_rx_singleq_flex_hash(struct idpf_rx_queue *rx_q,
 				      const union virtchnl2_rx_desc *rx_desc,
 				      struct libeth_rx_pt decoded)
 {
-	if (!libeth_rx_pt_has_hash(rx_q->netdev, decoded))
+	if (!libeth_rx_pt_has_hash(rx_q->xdp_rxq.dev, decoded))
 		return;
 
 	if (FIELD_GET(VIRTCHNL2_RX_FLEX_DESC_STATUS0_RSS_VALID_M,
@@ -801,7 +801,7 @@ idpf_rx_singleq_process_skb_fields(struct idpf_rx_queue *rx_q,
 	struct libeth_rx_csum csum_bits;
 
 	/* modifies the skb - consumes the enet header */
-	skb->protocol = eth_type_trans(skb, rx_q->netdev);
+	skb->protocol = eth_type_trans(skb, rx_q->xdp_rxq.dev);
 
 	/* Check if we're using base mode descriptor IDs */
 	if (rx_q->rxdids == VIRTCHNL2_RXDID_1_32B_BASE_M) {
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index 5d51e68c2878..97513822d614 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -1,11 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright (C) 2023 Intel Corporation */
 
-#include <net/libeth/rx.h>
-#include <net/libeth/tx.h>
+#include <net/libeth/xdp.h>
 
 #include "idpf.h"
 #include "idpf_virtchnl.h"
+#include "xdp.h"
 
 struct idpf_tx_stash {
 	struct hlist_node hlist;
@@ -78,8 +78,10 @@ static void idpf_tx_buf_rel_all(struct idpf_tx_queue *txq)
 	struct libeth_sq_napi_stats ss = { };
 	struct idpf_buf_lifo *buf_stack;
 	struct idpf_tx_stash *stash;
+	struct xdp_frame_bulk bq;
 	struct libeth_cq_pp cp = {
 		.dev	= txq->dev,
+		.bq	= &bq,
 		.ss	= &ss,
 	};
 	struct hlist_node *tmp;
@@ -89,9 +91,13 @@ static void idpf_tx_buf_rel_all(struct idpf_tx_queue *txq)
 	if (!txq->tx_buf)
 		return;
 
+	xdp_frame_bulk_init(&bq);
+
 	/* Free all the Tx buffer sk_buffs */
 	for (i = 0; i < txq->desc_count; i++)
-		libeth_tx_complete(&txq->tx_buf[i], &cp);
+		libeth_tx_complete_any(&txq->tx_buf[i], &cp);
+
+	xdp_flush_frame_bulk(&bq);
 
 	kfree(txq->tx_buf);
 	txq->tx_buf = NULL;
@@ -133,7 +139,9 @@ static void idpf_tx_buf_rel_all(struct idpf_tx_queue *txq)
 static void idpf_tx_desc_rel(struct idpf_tx_queue *txq)
 {
 	idpf_tx_buf_rel_all(txq);
-	netdev_tx_reset_subqueue(txq->netdev, txq->idx);
+
+	if (!idpf_queue_has(XDP, txq))
+		netdev_tx_reset_subqueue(txq->netdev, txq->idx);
 
 	if (!txq->desc_ring)
 		return;
@@ -331,7 +339,8 @@ static int idpf_tx_desc_alloc_all(struct idpf_vport *vport)
 				goto err_out;
 			}
 
-			if (!idpf_is_queue_model_split(vport->txq_model))
+			if (!idpf_is_queue_model_split(vport->txq_model) ||
+			    idpf_queue_has(XDP, txq))
 				continue;
 
 			txq->compl_tag_cur_gen = 0;
@@ -589,6 +598,7 @@ static int idpf_rx_hdr_buf_alloc_all(struct idpf_buf_queue *bufq)
 	struct libeth_fq fq = {
 		.count	= bufq->desc_count,
 		.type	= LIBETH_FQE_HDR,
+		.xdp	= idpf_xdp_is_prog_ena(bufq->q_vector->vport),
 		.nid	= idpf_q_vector_to_mem(bufq->q_vector),
 	};
 	int ret;
@@ -788,6 +798,7 @@ static int idpf_rx_bufs_init(struct idpf_buf_queue *bufq,
 		.count		= bufq->desc_count,
 		.type		= type,
 		.hsplit		= idpf_queue_has(HSPLIT_EN, bufq),
+		.xdp		= idpf_xdp_is_prog_ena(bufq->q_vector->vport),
 		.nid		= idpf_q_vector_to_mem(bufq->q_vector),
 	};
 	int ret;
@@ -1093,6 +1104,8 @@ void idpf_vport_queues_rel(struct idpf_vport *vport)
 {
 	idpf_tx_desc_rel_all(vport);
 	idpf_rx_desc_rel_all(vport);
+
+	idpf_vport_xdpq_put(vport);
 	idpf_vport_queue_grp_rel_all(vport);
 
 	kfree(vport->txqs);
@@ -1158,6 +1171,18 @@ void idpf_vport_init_num_qs(struct idpf_vport *vport,
 	if (idpf_is_queue_model_split(vport->rxq_model))
 		vport->num_bufq = le16_to_cpu(vport_msg->num_rx_bufq);
 
+	vport->xdp_prog = config_data->xdp_prog;
+	if (idpf_xdp_is_prog_ena(vport)) {
+		vport->xdp_txq_offset = config_data->num_req_tx_qs;
+		vport->num_xdp_txq = le16_to_cpu(vport_msg->num_tx_q) -
+				     vport->xdp_txq_offset;
+		vport->xdpq_share = libeth_xdpsq_shared(vport->num_xdp_txq);
+	} else {
+		vport->xdp_txq_offset = 0;
+		vport->num_xdp_txq = 0;
+		vport->xdpq_share = false;
+	}
+
 	/* Adjust number of buffer queues per Rx queue group. */
 	if (!idpf_is_queue_model_split(vport->rxq_model)) {
 		vport->num_bufqs_per_qgrp = 0;
@@ -1229,9 +1254,10 @@ int idpf_vport_calc_total_qs(struct idpf_adapter *adapter, u16 vport_idx,
 	int dflt_splitq_txq_grps = 0, dflt_singleq_txqs = 0;
 	int dflt_splitq_rxq_grps = 0, dflt_singleq_rxqs = 0;
 	u16 num_req_tx_qs = 0, num_req_rx_qs = 0;
+	struct idpf_vport_user_config_data *user;
 	struct idpf_vport_config *vport_config;
 	u16 num_txq_grps, num_rxq_grps;
-	u32 num_qs;
+	u32 num_qs, num_xdpq;
 
 	vport_config = adapter->vport_config[vport_idx];
 	if (vport_config) {
@@ -1273,6 +1299,24 @@ int idpf_vport_calc_total_qs(struct idpf_adapter *adapter, u16 vport_idx,
 		vport_msg->num_rx_bufq = 0;
 	}
 
+	if (!vport_config)
+		return 0;
+
+	user = &vport_config->user_config;
+	user->num_req_rx_qs = le16_to_cpu(vport_msg->num_rx_q);
+	user->num_req_tx_qs = le16_to_cpu(vport_msg->num_tx_q);
+
+	if (vport_config->user_config.xdp_prog)
+		num_xdpq = libeth_xdpsq_num(user->num_req_rx_qs,
+					    user->num_req_tx_qs,
+					    vport_config->max_q.max_txq);
+	else
+		num_xdpq = 0;
+
+	vport_msg->num_tx_q = cpu_to_le16(user->num_req_tx_qs + num_xdpq);
+	if (idpf_is_queue_model_split(le16_to_cpu(vport_msg->txq_model)))
+		vport_msg->num_tx_complq = vport_msg->num_tx_q;
+
 	return 0;
 }
 
@@ -1322,14 +1366,13 @@ static void idpf_vport_calc_numq_per_grp(struct idpf_vport *vport,
 static void idpf_rxq_set_descids(const struct idpf_vport *vport,
 				 struct idpf_rx_queue *q)
 {
-	if (idpf_is_queue_model_split(vport->rxq_model)) {
-		q->rxdids = VIRTCHNL2_RXDID_2_FLEX_SPLITQ_M;
-	} else {
-		if (vport->base_rxd)
-			q->rxdids = VIRTCHNL2_RXDID_1_32B_BASE_M;
-		else
-			q->rxdids = VIRTCHNL2_RXDID_2_FLEX_SQ_NIC_M;
-	}
+	if (idpf_is_queue_model_split(vport->rxq_model))
+		return;
+
+	if (vport->base_rxd)
+		q->rxdids = VIRTCHNL2_RXDID_1_32B_BASE_M;
+	else
+		q->rxdids = VIRTCHNL2_RXDID_2_FLEX_SQ_NIC_M;
 }
 
 /**
@@ -1545,7 +1588,6 @@ static int idpf_rxq_group_alloc(struct idpf_vport *vport, u16 num_rxq)
 setup_rxq:
 			q->desc_count = vport->rxq_desc_count;
 			q->rx_ptype_lkup = vport->rx_ptype_lkup;
-			q->netdev = vport->netdev;
 			q->bufq_sets = rx_qgrp->splitq.bufq_sets;
 			q->idx = (i * num_rxq) + j;
 			q->rx_buffer_low_watermark = IDPF_LOW_WATERMARK;
@@ -1606,15 +1648,19 @@ int idpf_vport_queues_alloc(struct idpf_vport *vport)
 	if (err)
 		goto err_out;
 
-	err = idpf_tx_desc_alloc_all(vport);
+	err = idpf_vport_init_fast_path_txqs(vport);
 	if (err)
 		goto err_out;
 
-	err = idpf_rx_desc_alloc_all(vport);
+	err = idpf_vport_xdpq_get(vport);
 	if (err)
 		goto err_out;
 
-	err = idpf_vport_init_fast_path_txqs(vport);
+	err = idpf_tx_desc_alloc_all(vport);
+	if (err)
+		goto err_out;
+
+	err = idpf_rx_desc_alloc_all(vport);
 	if (err)
 		goto err_out;
 
@@ -2148,16 +2194,24 @@ static bool idpf_tx_clean_complq(struct idpf_compl_queue *complq, int budget,
  */
 void idpf_wait_for_sw_marker_completion(struct idpf_tx_queue *txq)
 {
-	struct idpf_compl_queue *complq = txq->txq_grp->complq;
 	struct idpf_splitq_4b_tx_compl_desc *tx_desc;
-	s16 ntc = complq->next_to_clean;
+	struct idpf_compl_queue *complq;
 	unsigned long timeout;
 	bool flow, gen_flag;
-	u32 pos = ntc;
+	u32 pos;
+	s16 ntc;
 
 	if (!idpf_queue_has(SW_MARKER, txq))
 		return;
 
+	if (idpf_queue_has(XDP, txq))
+		complq = txq->complq;
+	else
+		complq = txq->txq_grp->complq;
+
+	ntc = complq->next_to_clean;
+	pos = ntc;
+
 	flow = idpf_queue_has(FLOW_SCH_EN, complq);
 	gen_flag = idpf_queue_has(GEN_CHK, complq);
 
@@ -2935,10 +2989,11 @@ static netdev_tx_t idpf_tx_splitq_frame(struct sk_buff *skb,
  */
 netdev_tx_t idpf_tx_start(struct sk_buff *skb, struct net_device *netdev)
 {
-	struct idpf_vport *vport = idpf_netdev_to_vport(netdev);
+	const struct idpf_vport *vport = idpf_netdev_to_vport(netdev);
 	struct idpf_tx_queue *tx_q;
 
-	if (unlikely(skb_get_queue_mapping(skb) >= vport->num_txq)) {
+	if (unlikely(skb_get_queue_mapping(skb) >=
+		     vport->num_txq - vport->num_xdp_txq)) {
 		dev_kfree_skb_any(skb);
 
 		return NETDEV_TX_OK;
@@ -2975,7 +3030,7 @@ idpf_rx_hash(const struct idpf_rx_queue *rxq, struct sk_buff *skb,
 {
 	u32 hash;
 
-	if (!libeth_rx_pt_has_hash(rxq->netdev, decoded))
+	if (!libeth_rx_pt_has_hash(rxq->xdp_rxq.dev, decoded))
 		return;
 
 	hash = le16_to_cpu(rx_desc->hash1) |
@@ -3001,7 +3056,7 @@ static void idpf_rx_csum(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 	bool ipv4, ipv6;
 
 	/* check if Rx checksum is enabled */
-	if (!libeth_rx_pt_has_checksum(rxq->netdev, decoded))
+	if (!libeth_rx_pt_has_checksum(rxq->xdp_rxq.dev, decoded))
 		return;
 
 	/* check if HW has decoded the packet and checksum */
@@ -3170,7 +3225,7 @@ idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 	/* process RSS/hash */
 	idpf_rx_hash(rxq, skb, rx_desc, decoded);
 
-	skb->protocol = eth_type_trans(skb, rxq->netdev);
+	skb->protocol = eth_type_trans(skb, rxq->xdp_rxq.dev);
 	skb_record_rx_queue(skb, rxq->idx);
 
 	if (le16_get_bits(rx_desc->hdrlen_flags,
@@ -4181,8 +4236,8 @@ static int idpf_vport_splitq_napi_poll(struct napi_struct *napi, int budget)
  */
 static void idpf_vport_intr_map_vector_to_qs(struct idpf_vport *vport)
 {
+	u16 num_txq_grp = vport->num_txq_grp - vport->num_xdp_txq;
 	bool split = idpf_is_queue_model_split(vport->rxq_model);
-	u16 num_txq_grp = vport->num_txq_grp;
 	struct idpf_rxq_group *rx_qgrp;
 	struct idpf_txq_group *tx_qgrp;
 	u32 i, qv_idx, q_index;
diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
index aa45821f38f1..a86eea9ccd18 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
@@ -1602,9 +1602,12 @@ static int idpf_send_config_rx_queues_msg(struct idpf_vport *vport)
 		for (j = 0; j < num_rxq; j++, k++) {
 			const struct idpf_bufq_set *sets;
 			struct idpf_rx_queue *rxq;
+			u32 rxdids;
 
 			if (!idpf_is_queue_model_split(vport->rxq_model)) {
 				rxq = rx_qgrp->singleq.rxqs[j];
+				rxdids = rxq->rxdids;
+
 				goto common_qi_fields;
 			}
 
@@ -1637,6 +1640,8 @@ static int idpf_send_config_rx_queues_msg(struct idpf_vport *vport)
 					cpu_to_le16(rxq->rx_hbuf_size);
 			}
 
+			rxdids = VIRTCHNL2_RXDID_2_FLEX_SPLITQ_M;
+
 common_qi_fields:
 			qi[k].queue_id = cpu_to_le32(rxq->q_id);
 			qi[k].model = cpu_to_le16(vport->rxq_model);
@@ -1647,7 +1652,7 @@ static int idpf_send_config_rx_queues_msg(struct idpf_vport *vport)
 			qi[k].data_buffer_size = cpu_to_le32(rxq->rx_buf_size);
 			qi[k].qflags |=
 				cpu_to_le16(VIRTCHNL2_RX_DESC_SIZE_32BYTE);
-			qi[k].desc_ids = cpu_to_le64(rxq->rxdids);
+			qi[k].desc_ids = cpu_to_le64(rxdids);
 		}
 	}
 
@@ -1881,6 +1886,8 @@ int idpf_send_map_unmap_queue_vector_msg(struct idpf_vport *vport, bool map)
 
 			if (idpf_queue_has(NOIRQ, txq))
 				vec = NULL;
+			else if (idpf_queue_has(XDP, txq))
+				vec = txq->complq->q_vector;
 			else if (idpf_is_queue_model_split(vport->txq_model))
 				vec = txq->txq_grp->complq->q_vector;
 			else
@@ -1899,9 +1906,6 @@ int idpf_send_map_unmap_queue_vector_msg(struct idpf_vport *vport, bool map)
 		}
 	}
 
-	if (vport->num_txq != k)
-		return -EINVAL;
-
 	for (i = 0; i < vport->num_rxq_grp; i++) {
 		struct idpf_rxq_group *rx_qgrp = &vport->rxq_grps[i];
 		u16 num_rxq;
@@ -1937,13 +1941,8 @@ int idpf_send_map_unmap_queue_vector_msg(struct idpf_vport *vport, bool map)
 		}
 	}
 
-	if (idpf_is_queue_model_split(vport->txq_model)) {
-		if (vport->num_rxq != k - vport->num_complq)
-			return -EINVAL;
-	} else {
-		if (vport->num_rxq != k - vport->num_txq)
-			return -EINVAL;
-	}
+	if (k != num_q)
+		return -EINVAL;
 
 	/* Chunk up the vector info into multiple messages */
 	config_sz = sizeof(struct virtchnl2_queue_vector_maps);
@@ -3129,7 +3128,10 @@ int idpf_vport_alloc_vec_indexes(struct idpf_vport *vport)
 	u32 req;
 
 	vec_info.num_curr_vecs = vport->num_q_vectors + IDPF_RESERVED_VECS;
-	req = max(vport->num_txq, vport->num_rxq) + IDPF_RESERVED_VECS;
+
+	/* XDPSQs are all bound to the NOIRQ vector from IDPF_RESERVED_VECS */
+	req = max(vport->num_txq - vport->num_xdp_txq, vport->num_rxq) +
+	      IDPF_RESERVED_VECS;
 	vec_info.num_req_vecs = req;
 
 	vec_info.default_vport = vport->default_vport;
diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
new file mode 100644
index 000000000000..8770249b5abe
--- /dev/null
+++ b/drivers/net/ethernet/intel/idpf/xdp.c
@@ -0,0 +1,189 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2024 Intel Corporation */
+
+#include <net/libeth/xdp.h>
+
+#include "idpf.h"
+#include "xdp.h"
+
+static int idpf_rxq_for_each(const struct idpf_vport *vport,
+			     int (*fn)(struct idpf_rx_queue *rxq, void *arg),
+			     void *arg)
+{
+	bool splitq = idpf_is_queue_model_split(vport->rxq_model);
+
+	if (!vport->rxq_grps)
+		return -ENETDOWN;
+
+	for (u32 i = 0; i < vport->num_rxq_grp; i++) {
+		const struct idpf_rxq_group *rx_qgrp = &vport->rxq_grps[i];
+		u32 num_rxq;
+
+		if (splitq)
+			num_rxq = rx_qgrp->splitq.num_rxq_sets;
+		else
+			num_rxq = rx_qgrp->singleq.num_rxq;
+
+		for (u32 j = 0; j < num_rxq; j++) {
+			struct idpf_rx_queue *q;
+			int err;
+
+			if (splitq)
+				q = &rx_qgrp->splitq.rxq_sets[j]->rxq;
+			else
+				q = rx_qgrp->singleq.rxqs[j];
+
+			err = fn(q, arg);
+			if (err)
+				return err;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * __idpf_xdp_rxq_info_init - Setup XDP RxQ info for a given Rx queue
+ * @rxq: Rx queue for which the resources are setup
+ * @arg: flag indicating if the HW works in split queue mode
+ *
+ * Return: 0 on success, negative on failure.
+ */
+static int __idpf_xdp_rxq_info_init(struct idpf_rx_queue *rxq, void *arg)
+{
+	const struct idpf_vport *vport = rxq->q_vector->vport;
+	bool split = idpf_is_queue_model_split(vport->rxq_model);
+	const struct page_pool *pp;
+	int err;
+
+	err = __xdp_rxq_info_reg(&rxq->xdp_rxq, vport->netdev, rxq->idx,
+				 rxq->q_vector->napi.napi_id,
+				 rxq->rx_buf_size);
+	if (err)
+		return err;
+
+	pp = split ? rxq->bufq_sets[0].bufq.pp : rxq->pp;
+	xdp_rxq_info_attach_page_pool(&rxq->xdp_rxq, pp);
+
+	if (!split)
+		return 0;
+
+	rxq->xdpqs = &vport->txqs[vport->xdp_txq_offset];
+	rxq->num_xdp_txq = vport->num_xdp_txq;
+
+	return 0;
+}
+
+/**
+ * idpf_xdp_rxq_info_init_all - initialize RxQ info for all Rx queues in vport
+ * @vport: vport to setup the info
+ *
+ * Return: 0 on success, negative on failure.
+ */
+int idpf_xdp_rxq_info_init_all(const struct idpf_vport *vport)
+{
+	return idpf_rxq_for_each(vport, __idpf_xdp_rxq_info_init, NULL);
+}
+
+/**
+ * __idpf_xdp_rxq_info_deinit - Deinit XDP RxQ info for a given Rx queue
+ * @rxq: Rx queue for which the resources are destroyed
+ * @arg: flag indicating if the HW works in split queue mode
+ *
+ * Return: always 0.
+ */
+static int __idpf_xdp_rxq_info_deinit(struct idpf_rx_queue *rxq, void *arg)
+{
+	if (idpf_is_queue_model_split((size_t)arg)) {
+		rxq->xdpqs = NULL;
+		rxq->num_xdp_txq = 0;
+	}
+
+	xdp_rxq_info_detach_mem_model(&rxq->xdp_rxq);
+	xdp_rxq_info_unreg(&rxq->xdp_rxq);
+
+	return 0;
+}
+
+/**
+ * idpf_xdp_rxq_info_deinit_all - deinit RxQ info for all Rx queues in vport
+ * @vport: vport to setup the info
+ */
+void idpf_xdp_rxq_info_deinit_all(const struct idpf_vport *vport)
+{
+	idpf_rxq_for_each(vport, __idpf_xdp_rxq_info_deinit,
+			  (void *)(size_t)vport->rxq_model);
+}
+
+int idpf_vport_xdpq_get(const struct idpf_vport *vport)
+{
+	struct libeth_xdpsq_timer **timers __free(kvfree) = NULL;
+	struct net_device *dev;
+	u32 sqs;
+
+	if (!idpf_xdp_is_prog_ena(vport))
+		return 0;
+
+	timers = kvcalloc(vport->num_xdp_txq, sizeof(*timers), GFP_KERNEL);
+	if (!timers)
+		return -ENOMEM;
+
+	for (u32 i = 0; i < vport->num_xdp_txq; i++) {
+		timers[i] = kzalloc_node(sizeof(*timers[i]), GFP_KERNEL,
+					 cpu_to_mem(i));
+		if (!timers[i]) {
+			for (int j = i - 1; j >= 0; j--)
+				kfree(timers[j]);
+
+			return -ENOMEM;
+		}
+	}
+
+	dev = vport->netdev;
+	sqs = vport->xdp_txq_offset;
+
+	for (u32 i = sqs; i < vport->num_txq; i++) {
+		struct idpf_tx_queue *xdpq = vport->txqs[i];
+
+		xdpq->complq = xdpq->txq_grp->complq;
+
+		idpf_queue_clear(FLOW_SCH_EN, xdpq);
+		idpf_queue_clear(FLOW_SCH_EN, xdpq->complq);
+		idpf_queue_set(NOIRQ, xdpq);
+		idpf_queue_set(XDP, xdpq);
+		idpf_queue_set(XDP, xdpq->complq);
+
+		xdpq->timer = timers[i - sqs];
+		libeth_xdpsq_get(&xdpq->xdp_lock, dev, vport->xdpq_share);
+
+		xdpq->pending = 0;
+		xdpq->xdp_tx = 0;
+		xdpq->thresh = libeth_xdp_queue_threshold(xdpq->desc_count);
+	}
+
+	return 0;
+}
+
+void idpf_vport_xdpq_put(const struct idpf_vport *vport)
+{
+	struct net_device *dev;
+	u32 sqs;
+
+	if (!idpf_xdp_is_prog_ena(vport))
+		return;
+
+	dev = vport->netdev;
+	sqs = vport->xdp_txq_offset;
+
+	for (u32 i = sqs; i < vport->num_txq; i++) {
+		struct idpf_tx_queue *xdpq = vport->txqs[i];
+
+		if (!idpf_queue_has_clear(XDP, xdpq))
+			continue;
+
+		libeth_xdpsq_put(&xdpq->xdp_lock, dev);
+
+		kfree(xdpq->timer);
+		idpf_queue_clear(NOIRQ, xdpq);
+	}
+}
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH net-next 12/16] idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (10 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 11/16] idpf: prepare structures to support XDP Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-07 14:16   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 13/16] idpf: use generic functions to build xdp_buff and skb Alexander Lobakin
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Michal Kubiak <michal.kubiak@intel.com>

Implement loading/removing an XDP program using the .ndo_bpf callback
in the split queue mode. Reconfigure and restart the queues if needed
(!!old_prog != !!new_prog); otherwise, just update the pointers.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
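Usage note (illustrative only, not part of the patch): a minimal
libbpf snippet that attaches a program in native/driver mode, which is
the request that ends up in .ndo_bpf -> idpf_xdp() ->
idpf_xdp_setup_prog() below. The object file name "xdp_prog.o" and the
program name "xdp_pass" are placeholders.

#include <bpf/libbpf.h>
#include <linux/if_link.h>
#include <net/if.h>

static int attach_xdp_native(const char *ifname)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int ifindex = if_nametoindex(ifname);

	if (!ifindex)
		return -1;

	obj = bpf_object__open_file("xdp_prog.o", NULL);
	if (!obj)
		return -1;

	if (bpf_object__load(obj))
		goto err;

	prog = bpf_object__find_program_by_name(obj, "xdp_pass");
	if (!prog)
		goto err;

	/* native (driver) mode: the request is handled by idpf_xdp() */
	return bpf_xdp_attach(ifindex, bpf_program__fd(prog),
			      XDP_FLAGS_DRV_MODE, NULL);

err:
	bpf_object__close(obj);
	return -1;
}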
 drivers/net/ethernet/intel/idpf/idpf_txrx.h |   4 +-
 drivers/net/ethernet/intel/idpf/xdp.h       |   7 ++
 drivers/net/ethernet/intel/idpf/idpf_lib.c  |   1 +
 drivers/net/ethernet/intel/idpf/idpf_txrx.c |   4 +
 drivers/net/ethernet/intel/idpf/xdp.c       | 114 ++++++++++++++++++++
 5 files changed, 129 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index 6d9eb6f4ab38..38ef0db08133 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -485,6 +485,7 @@ struct idpf_txq_stash {
  * @desc_ring: virtual descriptor ring address
  * @bufq_sets: Pointer to the array of buffer queues in splitq mode
  * @napi: NAPI instance corresponding to this queue (splitq)
+ * @xdp_prog: attached XDP program
  * @rx_buf: See struct &libeth_fqe
  * @pp: Page pool pointer in singleq mode
  * @tail: Tail offset. Used for both queue models single and split.
@@ -525,13 +526,14 @@ struct idpf_rx_queue {
 		struct {
 			struct idpf_bufq_set *bufq_sets;
 			struct napi_struct *napi;
+			struct bpf_prog __rcu *xdp_prog;
 		};
 		struct {
 			struct libeth_fqe *rx_buf;
 			struct page_pool *pp;
+			void __iomem *tail;
 		};
 	};
-	void __iomem *tail;
 
 	DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
 	u16 idx;
diff --git a/drivers/net/ethernet/intel/idpf/xdp.h b/drivers/net/ethernet/intel/idpf/xdp.h
index 8ace8384f348..a72a7638a6ea 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.h
+++ b/drivers/net/ethernet/intel/idpf/xdp.h
@@ -6,12 +6,19 @@
 
 #include <linux/types.h>
 
+struct bpf_prog;
 struct idpf_vport;
+struct net_device;
+struct netdev_bpf;
 
 int idpf_xdp_rxq_info_init_all(const struct idpf_vport *vport);
 void idpf_xdp_rxq_info_deinit_all(const struct idpf_vport *vport);
+void idpf_copy_xdp_prog_to_qs(const struct idpf_vport *vport,
+			      struct bpf_prog *xdp_prog);
 
 int idpf_vport_xdpq_get(const struct idpf_vport *vport);
 void idpf_vport_xdpq_put(const struct idpf_vport *vport);
 
+int idpf_xdp(struct net_device *dev, struct netdev_bpf *xdp);
+
 #endif /* _IDPF_XDP_H_ */
diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
index 0f4edc9cd1ad..84ca8c08bd56 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
@@ -2368,4 +2368,5 @@ static const struct net_device_ops idpf_netdev_ops = {
 	.ndo_get_stats64 = idpf_get_stats64,
 	.ndo_set_features = idpf_set_features,
 	.ndo_tx_timeout = idpf_tx_timeout,
+	.ndo_bpf = idpf_xdp,
 };
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index 97513822d614..e152fbe4ebe3 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -1102,6 +1102,8 @@ static void idpf_vport_queue_grp_rel_all(struct idpf_vport *vport)
  */
 void idpf_vport_queues_rel(struct idpf_vport *vport)
 {
+	idpf_copy_xdp_prog_to_qs(vport, NULL);
+
 	idpf_tx_desc_rel_all(vport);
 	idpf_rx_desc_rel_all(vport);
 
@@ -1664,6 +1666,8 @@ int idpf_vport_queues_alloc(struct idpf_vport *vport)
 	if (err)
 		goto err_out;
 
+	idpf_copy_xdp_prog_to_qs(vport, vport->xdp_prog);
+
 	return 0;
 
 err_out:
diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
index 8770249b5abe..c0322fa7bfee 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.c
+++ b/drivers/net/ethernet/intel/idpf/xdp.c
@@ -4,6 +4,7 @@
 #include <net/libeth/xdp.h>
 
 #include "idpf.h"
+#include "idpf_virtchnl.h"
 #include "xdp.h"
 
 static int idpf_rxq_for_each(const struct idpf_vport *vport,
@@ -115,6 +116,33 @@ void idpf_xdp_rxq_info_deinit_all(const struct idpf_vport *vport)
 			  (void *)(size_t)vport->rxq_model);
 }
 
+static int idpf_xdp_rxq_assign_prog(struct idpf_rx_queue *rxq, void *arg)
+{
+	struct mutex *lock = &rxq->q_vector->vport->adapter->vport_ctrl_lock;
+	struct bpf_prog *prog = arg;
+	struct bpf_prog *old;
+
+	if (prog)
+		bpf_prog_inc(prog);
+
+	old = rcu_replace_pointer(rxq->xdp_prog, prog, lockdep_is_held(lock));
+	if (old)
+		bpf_prog_put(old);
+
+	return 0;
+}
+
+/**
+ * idpf_copy_xdp_prog_to_qs - set pointers to XDP program for each Rx queue
+ * @vport: vport to setup XDP for
+ * @xdp_prog: XDP program that should be copied to all Rx queues
+ */
+void idpf_copy_xdp_prog_to_qs(const struct idpf_vport *vport,
+			      struct bpf_prog *xdp_prog)
+{
+	idpf_rxq_for_each(vport, idpf_xdp_rxq_assign_prog, xdp_prog);
+}
+
 int idpf_vport_xdpq_get(const struct idpf_vport *vport)
 {
 	struct libeth_xdpsq_timer **timers __free(kvfree) = NULL;
@@ -187,3 +215,89 @@ void idpf_vport_xdpq_put(const struct idpf_vport *vport)
 		idpf_queue_clear(NOIRQ, xdpq);
 	}
 }
+
+/**
+ * idpf_xdp_setup_prog - handle XDP program install/remove requests
+ * @vport: vport to configure
+ * @xdp: request data (program, extack)
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+static int
+idpf_xdp_setup_prog(struct idpf_vport *vport, const struct netdev_bpf *xdp)
+{
+	const struct idpf_netdev_priv *np = netdev_priv(vport->netdev);
+	struct bpf_prog *old, *prog = xdp->prog;
+	struct idpf_vport_config *cfg;
+	int ret;
+
+	cfg = vport->adapter->vport_config[vport->idx];
+	if (!vport->num_xdp_txq && vport->num_txq == cfg->max_q.max_txq) {
+		NL_SET_ERR_MSG_MOD(xdp->extack,
+				   "No Tx queues available for XDP, please decrease the number of regular SQs");
+		return -ENOSPC;
+	}
+
+	if (test_bit(IDPF_REMOVE_IN_PROG, vport->adapter->flags) ||
+	    !!vport->xdp_prog == !!prog) {
+		if (np->state == __IDPF_VPORT_UP)
+			idpf_copy_xdp_prog_to_qs(vport, prog);
+
+		old = xchg(&vport->xdp_prog, prog);
+		if (old)
+			bpf_prog_put(old);
+
+		cfg->user_config.xdp_prog = prog;
+
+		return 0;
+	}
+
+	old = cfg->user_config.xdp_prog;
+	cfg->user_config.xdp_prog = prog;
+
+	ret = idpf_initiate_soft_reset(vport, IDPF_SR_Q_CHANGE);
+	if (ret) {
+		NL_SET_ERR_MSG_MOD(xdp->extack,
+				   "Could not reopen the vport after XDP setup");
+
+		if (prog)
+			bpf_prog_put(prog);
+
+		cfg->user_config.xdp_prog = old;
+	}
+
+	return ret;
+}
+
+/**
+ * idpf_xdp - handle XDP-related requests
+ * @dev: network device to configure
+ * @xdp: request data (program, extack)
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int idpf_xdp(struct net_device *dev, struct netdev_bpf *xdp)
+{
+	struct idpf_vport *vport;
+	int ret;
+
+	idpf_vport_ctrl_lock(dev);
+	vport = idpf_netdev_to_vport(dev);
+
+	if (!idpf_is_queue_model_split(vport->txq_model))
+		goto notsupp;
+
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		ret = idpf_xdp_setup_prog(vport, xdp);
+		break;
+	default:
+notsupp:
+		ret = -EOPNOTSUPP;
+		break;
+	}
+
+	idpf_vport_ctrl_unlock(dev);
+
+	return ret;
+}
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH net-next 13/16] idpf: use generic functions to build xdp_buff and skb
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (11 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 12/16] idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-05 16:21 ` [PATCH net-next 14/16] idpf: add support for XDP on Rx Alexander Lobakin
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

In preparation for XDP support, move from having an skb as the main
frame container during Rx polling to &xdp_buff.
This allows using generic and libeth helpers for building an XDP
buffer and changes the logic: now an skb is allocated only once all
the descriptors related to the frame have been processed.
Store &libeth_xdp_buff_stash instead of the skb pointer on the Rx
queue. It's only 8 bytes wider, but contains everything we may need.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  17 +-
 .../ethernet/intel/idpf/idpf_singleq_txrx.c   | 103 ++++++-------
 drivers/net/ethernet/intel/idpf/idpf_txrx.c   | 145 +++++-------------
 3 files changed, 90 insertions(+), 175 deletions(-)

diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index 38ef0db08133..e36c55baf23f 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -502,7 +502,7 @@ struct idpf_txq_stash {
  * @next_to_use: Next descriptor to use
  * @next_to_clean: Next descriptor to clean
  * @next_to_alloc: RX buffer to allocate at
- * @skb: Pointer to the skb
+ * @xdp: XDP buffer with the current frame
  * @stats_sync: See struct u64_stats_sync
  * @q_stats: See union idpf_rx_queue_stats
  * @q_id: Queue id
@@ -553,11 +553,11 @@ struct idpf_rx_queue {
 	__cacheline_group_end_aligned(read_mostly);
 
 	__cacheline_group_begin_aligned(read_write);
-	u16 next_to_use;
-	u16 next_to_clean;
-	u16 next_to_alloc;
+	u32 next_to_use;
+	u32 next_to_clean;
+	u32 next_to_alloc;
 
-	struct sk_buff *skb;
+	struct libeth_xdp_buff_stash xdp;
 
 	struct u64_stats_sync stats_sync;
 	struct idpf_rx_queue_stats q_stats;
@@ -579,8 +579,8 @@ struct idpf_rx_queue {
 libeth_cacheline_set_assert(struct idpf_rx_queue,
 			    ALIGN(64, __alignof(struct xdp_rxq_info)) +
 			    sizeof(struct xdp_rxq_info),
-			    72 + offsetof(struct idpf_rx_queue, q_stats) -
-			    offsetofend(struct idpf_rx_queue, skb),
+			    88 + offsetof(struct idpf_rx_queue, q_stats) -
+			    offsetofend(struct idpf_rx_queue, xdp),
 			    32);
 
 /**
@@ -1071,9 +1071,6 @@ int idpf_config_rss(struct idpf_vport *vport);
 int idpf_init_rss(struct idpf_vport *vport);
 void idpf_deinit_rss(struct idpf_vport *vport);
 int idpf_rx_bufs_init_all(struct idpf_vport *vport);
-void idpf_rx_add_frag(struct idpf_rx_buf *rx_buf, struct sk_buff *skb,
-		      unsigned int size);
-struct sk_buff *idpf_rx_build_skb(const struct libeth_fqe *buf, u32 size);
 void idpf_tx_buf_hw_update(struct idpf_tx_queue *tx_q, u32 val,
 			   bool xmit_more);
 unsigned int idpf_size_to_txd_count(unsigned int size);
diff --git a/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
index c81065b4fb24..544fe113265b 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
@@ -1,8 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright (C) 2023 Intel Corporation */
 
-#include <net/libeth/rx.h>
-#include <net/libeth/tx.h>
+#include <net/libeth/xdp.h>
 
 #include "idpf.h"
 
@@ -780,7 +779,7 @@ static void idpf_rx_singleq_flex_hash(struct idpf_rx_queue *rx_q,
 }
 
 /**
- * idpf_rx_singleq_process_skb_fields - Populate skb header fields from Rx
+ * __idpf_rx_singleq_process_skb_fields - Populate skb header fields from Rx
  * descriptor
  * @rx_q: Rx ring being processed
  * @skb: pointer to current skb being populated
@@ -792,17 +791,14 @@ static void idpf_rx_singleq_flex_hash(struct idpf_rx_queue *rx_q,
  * other fields within the skb.
  */
 static void
-idpf_rx_singleq_process_skb_fields(struct idpf_rx_queue *rx_q,
-				   struct sk_buff *skb,
-				   const union virtchnl2_rx_desc *rx_desc,
-				   u16 ptype)
+__idpf_rx_singleq_process_skb_fields(struct idpf_rx_queue *rx_q,
+				     struct sk_buff *skb,
+				     const union virtchnl2_rx_desc *rx_desc,
+				     u16 ptype)
 {
 	struct libeth_rx_pt decoded = rx_q->rx_ptype_lkup[ptype];
 	struct libeth_rx_csum csum_bits;
 
-	/* modifies the skb - consumes the enet header */
-	skb->protocol = eth_type_trans(skb, rx_q->xdp_rxq.dev);
-
 	/* Check if we're using base mode descriptor IDs */
 	if (rx_q->rxdids == VIRTCHNL2_RXDID_1_32B_BASE_M) {
 		idpf_rx_singleq_base_hash(rx_q, skb, rx_desc, decoded);
@@ -813,7 +809,6 @@ idpf_rx_singleq_process_skb_fields(struct idpf_rx_queue *rx_q,
 	}
 
 	idpf_rx_singleq_csum(rx_q, skb, csum_bits, decoded);
-	skb_record_rx_queue(skb, rx_q->idx);
 }
 
 /**
@@ -952,6 +947,32 @@ idpf_rx_singleq_extract_fields(const struct idpf_rx_queue *rx_q,
 		idpf_rx_singleq_extract_flex_fields(rx_desc, fields, ptype);
 }
 
+static bool
+idpf_rx_singleq_process_skb_fields(struct sk_buff *skb,
+				   const struct libeth_xdp_buff *xdp,
+				   struct libeth_rq_napi_stats *rs)
+{
+	struct libeth_rqe_info fields;
+	struct idpf_rx_queue *rxq;
+	u32 ptype;
+
+	rxq = libeth_xdp_buff_to_rq(xdp, typeof(*rxq), xdp_rxq);
+
+	idpf_rx_singleq_extract_fields(rxq, xdp->desc, &fields, &ptype);
+	__idpf_rx_singleq_process_skb_fields(rxq, skb, xdp->desc, ptype);
+
+	return true;
+}
+
+static void idpf_xdp_run_pass(struct libeth_xdp_buff *xdp,
+			      struct napi_struct *napi,
+			      struct libeth_rq_napi_stats *rs,
+			      const union virtchnl2_rx_desc *desc)
+{
+	libeth_xdp_run_pass(xdp, NULL, napi, rs, desc, NULL,
+			    idpf_rx_singleq_process_skb_fields);
+}
+
 /**
  * idpf_rx_singleq_clean - Reclaim resources after receive completes
  * @rx_q: rx queue to clean
@@ -961,14 +982,15 @@ idpf_rx_singleq_extract_fields(const struct idpf_rx_queue *rx_q,
  */
 static int idpf_rx_singleq_clean(struct idpf_rx_queue *rx_q, int budget)
 {
-	unsigned int total_rx_bytes = 0, total_rx_pkts = 0;
-	struct sk_buff *skb = rx_q->skb;
+	struct libeth_rq_napi_stats rs = { };
 	u16 ntc = rx_q->next_to_clean;
+	LIBETH_XDP_ONSTACK_BUFF(xdp);
 	u16 cleaned_count = 0;
-	bool failure = false;
+
+	libeth_xdp_init_buff(xdp, &rx_q->xdp, &rx_q->xdp_rxq);
 
 	/* Process Rx packets bounded by budget */
-	while (likely(total_rx_pkts < (unsigned int)budget)) {
+	while (likely(rs.packets < budget)) {
 		struct libeth_rqe_info fields = { };
 		union virtchnl2_rx_desc *rx_desc;
 		struct idpf_rx_buf *rx_buf;
@@ -996,72 +1018,41 @@ static int idpf_rx_singleq_clean(struct idpf_rx_queue *rx_q, int budget)
 		idpf_rx_singleq_extract_fields(rx_q, rx_desc, &fields, &ptype);
 
 		rx_buf = &rx_q->rx_buf[ntc];
-		if (!libeth_rx_sync_for_cpu(rx_buf, fields.len))
-			goto skip_data;
-
-		if (skb)
-			idpf_rx_add_frag(rx_buf, skb, fields.len);
-		else
-			skb = idpf_rx_build_skb(rx_buf, fields.len);
-
-		/* exit if we failed to retrieve a buffer */
-		if (!skb)
-			break;
-
-skip_data:
+		libeth_xdp_process_buff(xdp, rx_buf, fields.len);
 		rx_buf->netmem = 0;
 
 		IDPF_SINGLEQ_BUMP_RING_IDX(rx_q, ntc);
 		cleaned_count++;
 
 		/* skip if it is non EOP desc */
-		if (idpf_rx_singleq_is_non_eop(rx_desc) || unlikely(!skb))
+		if (idpf_rx_singleq_is_non_eop(rx_desc) ||
+		    unlikely(!xdp->data))
 			continue;
 
 #define IDPF_RXD_ERR_S FIELD_PREP(VIRTCHNL2_RX_BASE_DESC_QW1_ERROR_M, \
 				  VIRTCHNL2_RX_BASE_DESC_ERROR_RXE_M)
 		if (unlikely(idpf_rx_singleq_test_staterr(rx_desc,
 							  IDPF_RXD_ERR_S))) {
-			dev_kfree_skb_any(skb);
-			skb = NULL;
-			continue;
-		}
-
-		/* pad skb if needed (to make valid ethernet frame) */
-		if (eth_skb_pad(skb)) {
-			skb = NULL;
+			libeth_xdp_return_buff_slow(xdp);
 			continue;
 		}
 
-		/* probably a little skewed due to removing CRC */
-		total_rx_bytes += skb->len;
-
-		/* protocol */
-		idpf_rx_singleq_process_skb_fields(rx_q, skb, rx_desc, ptype);
-
-		/* send completed skb up the stack */
-		napi_gro_receive(rx_q->pp->p.napi, skb);
-		skb = NULL;
-
-		/* update budget accounting */
-		total_rx_pkts++;
+		idpf_xdp_run_pass(xdp, rx_q->pp->p.napi, &rs, rx_desc);
 	}
 
-	rx_q->skb = skb;
-
 	rx_q->next_to_clean = ntc;
+	libeth_xdp_save_buff(&rx_q->xdp, xdp);
 
 	page_pool_nid_changed(rx_q->pp, numa_mem_id());
 	if (cleaned_count)
-		failure = idpf_rx_singleq_buf_hw_alloc_all(rx_q, cleaned_count);
+		idpf_rx_singleq_buf_hw_alloc_all(rx_q, cleaned_count);
 
 	u64_stats_update_begin(&rx_q->stats_sync);
-	u64_stats_add(&rx_q->q_stats.packets, total_rx_pkts);
-	u64_stats_add(&rx_q->q_stats.bytes, total_rx_bytes);
+	u64_stats_add(&rx_q->q_stats.packets, rs.packets);
+	u64_stats_add(&rx_q->q_stats.bytes, rs.bytes);
 	u64_stats_update_end(&rx_q->stats_sync);
 
-	/* guarantee a trip back through this routine if there was a failure */
-	return failure ? budget : (int)total_rx_pkts;
+	return rs.packets;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index e152fbe4ebe3..f25c50d8947b 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -486,10 +486,7 @@ static void idpf_rx_desc_rel(struct idpf_rx_queue *rxq, struct device *dev,
 	if (!rxq)
 		return;
 
-	if (rxq->skb) {
-		dev_kfree_skb_any(rxq->skb);
-		rxq->skb = NULL;
-	}
+	libeth_xdp_return_stash(&rxq->xdp);
 
 	if (!idpf_is_queue_model_split(model))
 		idpf_rx_buf_rel_all(rxq);
@@ -3205,7 +3202,7 @@ static int idpf_rx_rsc(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 }
 
 /**
- * idpf_rx_process_skb_fields - Populate skb header fields from Rx descriptor
+ * __idpf_rx_process_skb_fields - Populate skb header fields from Rx descriptor
  * @rxq: Rx descriptor ring packet is being transacted on
  * @skb: pointer to current skb being populated
  * @rx_desc: Receive descriptor
@@ -3215,8 +3212,8 @@ static int idpf_rx_rsc(struct idpf_rx_queue *rxq, struct sk_buff *skb,
  * other fields within the skb.
  */
 static int
-idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
-			   const struct virtchnl2_rx_flex_desc_adv_nic_3 *rx_desc)
+__idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
+			     const struct virtchnl2_rx_flex_desc_adv_nic_3 *rx_desc)
 {
 	struct libeth_rx_csum csum_bits;
 	struct libeth_rx_pt decoded;
@@ -3229,9 +3226,6 @@ idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 	/* process RSS/hash */
 	idpf_rx_hash(rxq, skb, rx_desc, decoded);
 
-	skb->protocol = eth_type_trans(skb, rxq->xdp_rxq.dev);
-	skb_record_rx_queue(skb, rxq->idx);
-
 	if (le16_get_bits(rx_desc->hdrlen_flags,
 			  VIRTCHNL2_RX_FLEX_DESC_ADV_RSC_M))
 		return idpf_rx_rsc(rxq, skb, rx_desc, decoded);
@@ -3242,23 +3236,24 @@ idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 	return 0;
 }
 
-/**
- * idpf_rx_add_frag - Add contents of Rx buffer to sk_buff as a frag
- * @rx_buf: buffer containing page to add
- * @skb: sk_buff to place the data into
- * @size: packet length from rx_desc
- *
- * This function will add the data contained in rx_buf->page to the skb.
- * It will just attach the page as a frag to the skb.
- * The function will then update the page offset.
- */
-void idpf_rx_add_frag(struct idpf_rx_buf *rx_buf, struct sk_buff *skb,
-		      unsigned int size)
+static bool idpf_rx_process_skb_fields(struct sk_buff *skb,
+				       const struct libeth_xdp_buff *xdp,
+				       struct libeth_rq_napi_stats *rs)
 {
-	u32 hr = netmem_get_pp(rx_buf->netmem)->p.offset;
+	struct idpf_rx_queue *rxq;
+
+	rxq = libeth_xdp_buff_to_rq(xdp, typeof(*rxq), xdp_rxq);
 
-	skb_add_rx_frag_netmem(skb, skb_shinfo(skb)->nr_frags, rx_buf->netmem,
-			       rx_buf->offset + hr, size, rx_buf->truesize);
+	return !__idpf_rx_process_skb_fields(rxq, skb, xdp->desc);
+}
+
+static void
+idpf_xdp_run_pass(struct libeth_xdp_buff *xdp, struct napi_struct *napi,
+		  struct libeth_rq_napi_stats *ss,
+		  const struct virtchnl2_rx_flex_desc_adv_nic_3 *desc)
+{
+	libeth_xdp_run_pass(xdp, NULL, napi, ss, desc, NULL,
+			    idpf_rx_process_skb_fields);
 }
 
 /**
@@ -3300,36 +3295,6 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
 	return copy;
 }
 
-/**
- * idpf_rx_build_skb - Allocate skb and populate it from header buffer
- * @buf: Rx buffer to pull data from
- * @size: the length of the packet
- *
- * This function allocates an skb. It then populates it with the page data from
- * the current receive descriptor, taking care to set up the skb correctly.
- */
-struct sk_buff *idpf_rx_build_skb(const struct libeth_fqe *buf, u32 size)
-{
-	struct page *buf_page = __netmem_to_page(buf->netmem);
-	u32 hr = buf_page->pp->p.offset;
-	struct sk_buff *skb;
-	void *va;
-
-	va = page_address(buf_page) + buf->offset;
-	prefetch(va + hr);
-
-	skb = napi_build_skb(va, buf->truesize);
-	if (unlikely(!skb))
-		return NULL;
-
-	skb_mark_for_recycle(skb);
-
-	skb_reserve(skb, hr);
-	__skb_put(skb, size);
-
-	return skb;
-}
-
 /**
  * idpf_rx_splitq_test_staterr - tests bits in Rx descriptor
  * status and error fields
@@ -3371,13 +3336,15 @@ static bool idpf_rx_splitq_is_eop(struct virtchnl2_rx_flex_desc_adv_nic_3 *rx_de
  */
 static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
 {
-	int total_rx_bytes = 0, total_rx_pkts = 0;
 	struct idpf_buf_queue *rx_bufq = NULL;
-	struct sk_buff *skb = rxq->skb;
+	struct libeth_rq_napi_stats rs = { };
+	LIBETH_XDP_ONSTACK_BUFF(xdp);
 	u16 ntc = rxq->next_to_clean;
 
+	libeth_xdp_init_buff(xdp, &rxq->xdp, &rxq->xdp_rxq);
+
 	/* Process Rx packets bounded by budget */
-	while (likely(total_rx_pkts < budget)) {
+	while (likely(rs.packets < budget)) {
 		struct virtchnl2_rx_flex_desc_adv_nic_3 *rx_desc;
 		struct libeth_fqe *hdr, *rx_buf = NULL;
 		struct idpf_sw_queue *refillq = NULL;
@@ -3443,7 +3410,7 @@ static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
 
 		hdr = &rx_bufq->hdr_buf[buf_id];
 
-		if (unlikely(!hdr_len && !skb)) {
+		if (unlikely(!hdr_len && !xdp->data)) {
 			hdr_len = idpf_rx_hsplit_wa(hdr, rx_buf, pkt_len);
 			pkt_len -= hdr_len;
 
@@ -3452,75 +3419,35 @@ static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
 			u64_stats_update_end(&rxq->stats_sync);
 		}
 
-		if (libeth_rx_sync_for_cpu(hdr, hdr_len)) {
-			skb = idpf_rx_build_skb(hdr, hdr_len);
-			if (!skb)
-				break;
-
-			u64_stats_update_begin(&rxq->stats_sync);
-			u64_stats_inc(&rxq->q_stats.hsplit_pkts);
-			u64_stats_update_end(&rxq->stats_sync);
-		}
+		if (libeth_xdp_process_buff(xdp, hdr, hdr_len))
+			rs.hsplit++;
 
 		hdr->netmem = 0;
 
 payload:
-		if (!libeth_rx_sync_for_cpu(rx_buf, pkt_len))
-			goto skip_data;
-
-		if (skb)
-			idpf_rx_add_frag(rx_buf, skb, pkt_len);
-		else
-			skb = idpf_rx_build_skb(rx_buf, pkt_len);
-
-		/* exit if we failed to retrieve a buffer */
-		if (!skb)
-			break;
-
-skip_data:
+		libeth_xdp_process_buff(xdp, rx_buf, pkt_len);
 		rx_buf->netmem = 0;
 
 		idpf_rx_post_buf_refill(refillq, buf_id);
 		IDPF_RX_BUMP_NTC(rxq, ntc);
 
 		/* skip if it is non EOP desc */
-		if (!idpf_rx_splitq_is_eop(rx_desc) || unlikely(!skb))
-			continue;
-
-		/* pad skb if needed (to make valid ethernet frame) */
-		if (eth_skb_pad(skb)) {
-			skb = NULL;
-			continue;
-		}
-
-		/* probably a little skewed due to removing CRC */
-		total_rx_bytes += skb->len;
-
-		/* protocol */
-		if (unlikely(idpf_rx_process_skb_fields(rxq, skb, rx_desc))) {
-			dev_kfree_skb_any(skb);
-			skb = NULL;
+		if (!idpf_rx_splitq_is_eop(rx_desc) || unlikely(!xdp->data))
 			continue;
-		}
 
-		/* send completed skb up the stack */
-		napi_gro_receive(rxq->napi, skb);
-		skb = NULL;
-
-		/* update budget accounting */
-		total_rx_pkts++;
+		idpf_xdp_run_pass(xdp, rxq->napi, &rs, rx_desc);
 	}
 
 	rxq->next_to_clean = ntc;
+	libeth_xdp_save_buff(&rxq->xdp, xdp);
 
-	rxq->skb = skb;
 	u64_stats_update_begin(&rxq->stats_sync);
-	u64_stats_add(&rxq->q_stats.packets, total_rx_pkts);
-	u64_stats_add(&rxq->q_stats.bytes, total_rx_bytes);
+	u64_stats_add(&rxq->q_stats.packets, rs.packets);
+	u64_stats_add(&rxq->q_stats.bytes, rs.bytes);
+	u64_stats_add(&rxq->q_stats.hsplit_pkts, rs.hsplit);
 	u64_stats_update_end(&rxq->stats_sync);
 
-	/* guarantee a trip back through this routine if there was a failure */
-	return total_rx_pkts;
+	return rs.packets;
 }
 
 /**
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH net-next 14/16] idpf: add support for XDP on Rx
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (12 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 13/16] idpf: use generic functions to build xdp_buff and skb Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-11 15:50   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 15/16] idpf: add support for .ndo_xdp_xmit() Alexander Lobakin
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

Use the libeth XDP infra to support running an XDP program during Rx
polling. This includes all of the possible verdicts/actions.
XDP Tx queues are cleaned only in "lazy" mode, when fewer than 1/4 of
the descriptors on the ring are free. The libeth helper macros used to
define driver-specific XDP functions make sure the compiler can
uninline them when needed.
Use __LIBETH_WORD_ACCESS to parse descriptors more efficiently when
applicable. It gives noticeable performance boosts and code size
reduction on x86_64.

Co-developed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
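Illustrative note (not part of the series): a tiny BPF program that
exercises the XDP_TX verdict added here -- and, with it, the lazy
XDPSQ completion cleaning described above -- by bouncing frames back
out with swapped MAC addresses.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_tx_bounce(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	unsigned char tmp[ETH_ALEN];

	if (data + sizeof(*eth) > data_end)
		return XDP_DROP;

	/* swap src/dst MAC and send the frame back out the same port */
	__builtin_memcpy(tmp, eth->h_source, ETH_ALEN);
	__builtin_memcpy(eth->h_source, eth->h_dest, ETH_ALEN);
	__builtin_memcpy(eth->h_dest, tmp, ETH_ALEN);

	return XDP_TX;
}

char _license[] SEC("license") = "GPL";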
 drivers/net/ethernet/intel/idpf/idpf_txrx.h |   4 +-
 drivers/net/ethernet/intel/idpf/xdp.h       | 100 ++++++++++++-
 drivers/net/ethernet/intel/idpf/idpf_lib.c  |   2 +
 drivers/net/ethernet/intel/idpf/idpf_txrx.c |  23 +--
 drivers/net/ethernet/intel/idpf/xdp.c       | 155 +++++++++++++++++++-
 5 files changed, 264 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index e36c55baf23f..5d62074c94b1 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -684,8 +684,8 @@ struct idpf_tx_queue {
 	__cacheline_group_end_aligned(read_mostly);
 
 	__cacheline_group_begin_aligned(read_write);
-	u16 next_to_use;
-	u16 next_to_clean;
+	u32 next_to_use;
+	u32 next_to_clean;
 
 	union {
 		struct {
diff --git a/drivers/net/ethernet/intel/idpf/xdp.h b/drivers/net/ethernet/intel/idpf/xdp.h
index a72a7638a6ea..fde85528a315 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.h
+++ b/drivers/net/ethernet/intel/idpf/xdp.h
@@ -4,12 +4,9 @@
 #ifndef _IDPF_XDP_H_
 #define _IDPF_XDP_H_
 
-#include <linux/types.h>
+#include <net/libeth/xdp.h>
 
-struct bpf_prog;
-struct idpf_vport;
-struct net_device;
-struct netdev_bpf;
+#include "idpf_txrx.h"
 
 int idpf_xdp_rxq_info_init_all(const struct idpf_vport *vport);
 void idpf_xdp_rxq_info_deinit_all(const struct idpf_vport *vport);
@@ -19,6 +16,99 @@ void idpf_copy_xdp_prog_to_qs(const struct idpf_vport *vport,
 int idpf_vport_xdpq_get(const struct idpf_vport *vport);
 void idpf_vport_xdpq_put(const struct idpf_vport *vport);
 
+bool idpf_xdp_tx_flush_bulk(struct libeth_xdp_tx_bulk *bq, u32 flags);
+
+/**
+ * idpf_xdp_tx_xmit - produce a single HW Tx descriptor out of XDP desc
+ * @desc: XDP descriptor to pull the DMA address and length from
+ * @i: descriptor index on the queue to fill
+ * @sq: XDP queue to produce the HW Tx descriptor on
+ * @priv: &xsk_tx_metadata_ops on XSk xmit or %NULL
+ */
+static inline void idpf_xdp_tx_xmit(struct libeth_xdp_tx_desc desc, u32 i,
+				    const struct libeth_xdpsq *sq, u64 priv)
+{
+	struct idpf_flex_tx_desc *tx_desc = sq->descs;
+	u32 cmd;
+
+	cmd = FIELD_PREP(IDPF_FLEX_TXD_QW1_DTYPE_M,
+			 IDPF_TX_DESC_DTYPE_FLEX_L2TAG1_L2TAG2);
+	if (desc.flags & LIBETH_XDP_TX_LAST)
+		cmd |= FIELD_PREP(IDPF_FLEX_TXD_QW1_CMD_M,
+				  IDPF_TX_DESC_CMD_EOP);
+	if (priv && (desc.flags & LIBETH_XDP_TX_CSUM))
+		cmd |= FIELD_PREP(IDPF_FLEX_TXD_QW1_CMD_M,
+				  IDPF_TX_FLEX_DESC_CMD_CS_EN);
+
+	tx_desc = &tx_desc[i];
+	tx_desc->buf_addr = cpu_to_le64(desc.addr);
+#ifdef __LIBETH_WORD_ACCESS
+	*(u64 *)&tx_desc->qw1 = ((u64)desc.len << 48) | cmd;
+#else
+	tx_desc->qw1.buf_size = cpu_to_le16(desc.len);
+	tx_desc->qw1.cmd_dtype = cpu_to_le16(cmd);
+#endif
+}
+
+/**
+ * idpf_set_rs_bit - set RS bit on last produced descriptor
+ * @xdpq: XDP queue to produce the HW Tx descriptors on
+ */
+static inline void idpf_set_rs_bit(const struct idpf_tx_queue *xdpq)
+{
+	u32 ntu, cmd;
+
+	ntu = xdpq->next_to_use;
+	if (unlikely(!ntu))
+		ntu = xdpq->desc_count;
+
+	cmd = FIELD_PREP(IDPF_FLEX_TXD_QW1_CMD_M, IDPF_TX_DESC_CMD_RS);
+#ifdef __LIBETH_WORD_ACCESS
+	*(u64 *)&xdpq->flex_tx[ntu - 1].q.qw1 |= cmd;
+#else
+	xdpq->flex_tx[ntu - 1].q.qw1.cmd_dtype |= cpu_to_le16(cmd);
+#endif
+}
+
+/**
+ * idpf_xdpq_update_tail - update the XDP Tx queue tail register
+ * @xdpq: XDP Tx queue
+ */
+static inline void idpf_xdpq_update_tail(const struct idpf_tx_queue *xdpq)
+{
+	dma_wmb();
+	writel_relaxed(xdpq->next_to_use, xdpq->tail);
+}
+
+/**
+ * idpf_xdp_tx_finalize - Update RS bit and bump XDP Tx tail
+ * @_xdpq: XDP Tx queue
+ * @sent: whether any frames were sent
+ * @flush: whether to update RS bit and the tail register
+ *
+ * This function bumps XDP Tx tail and should be called when a batch of packets
+ * has been processed in the napi loop.
+ */
+static inline void idpf_xdp_tx_finalize(void *_xdpq, bool sent, bool flush)
+{
+	struct idpf_tx_queue *xdpq = _xdpq;
+
+	if ((!flush || unlikely(!sent)) &&
+	    likely(xdpq->desc_count != xdpq->pending))
+		return;
+
+	libeth_xdpsq_lock(&xdpq->xdp_lock);
+
+	idpf_set_rs_bit(xdpq);
+	idpf_xdpq_update_tail(xdpq);
+
+	libeth_xdpsq_queue_timer(xdpq->timer);
+
+	libeth_xdpsq_unlock(&xdpq->xdp_lock);
+}
+
+void idpf_xdp_set_features(const struct idpf_vport *vport);
+
 int idpf_xdp(struct net_device *dev, struct netdev_bpf *xdp);
 
 #endif /* _IDPF_XDP_H_ */
diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
index 84ca8c08bd56..2d1efcb854be 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
@@ -814,6 +814,8 @@ static int idpf_cfg_netdev(struct idpf_vport *vport)
 	netdev->features |= dflt_features;
 	netdev->hw_features |= dflt_features | offloads;
 	netdev->hw_enc_features |= dflt_features | offloads;
+	idpf_xdp_set_features(vport);
+
 	idpf_set_ethtool_ops(netdev);
 	netif_set_affinity_auto(netdev);
 	SET_NETDEV_DEV(netdev, &adapter->pdev->dev);
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index f25c50d8947b..cddcc5fc291f 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -1,8 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright (C) 2023 Intel Corporation */
 
-#include <net/libeth/xdp.h>
-
 #include "idpf.h"
 #include "idpf_virtchnl.h"
 #include "xdp.h"
@@ -3247,14 +3245,12 @@ static bool idpf_rx_process_skb_fields(struct sk_buff *skb,
 	return !__idpf_rx_process_skb_fields(rxq, skb, xdp->desc);
 }
 
-static void
-idpf_xdp_run_pass(struct libeth_xdp_buff *xdp, struct napi_struct *napi,
-		  struct libeth_rq_napi_stats *ss,
-		  const struct virtchnl2_rx_flex_desc_adv_nic_3 *desc)
-{
-	libeth_xdp_run_pass(xdp, NULL, napi, ss, desc, NULL,
-			    idpf_rx_process_skb_fields);
-}
+LIBETH_XDP_DEFINE_START();
+LIBETH_XDP_DEFINE_RUN(static idpf_xdp_run_pass, idpf_xdp_run_prog,
+		      idpf_xdp_tx_flush_bulk, idpf_rx_process_skb_fields);
+LIBETH_XDP_DEFINE_FINALIZE(static idpf_xdp_finalize_rx, idpf_xdp_tx_flush_bulk,
+			   idpf_xdp_tx_finalize);
+LIBETH_XDP_DEFINE_END();
 
 /**
  * idpf_rx_hsplit_wa - handle header buffer overflows and split errors
@@ -3338,9 +3334,12 @@ static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
 {
 	struct idpf_buf_queue *rx_bufq = NULL;
 	struct libeth_rq_napi_stats rs = { };
+	struct libeth_xdp_tx_bulk bq;
 	LIBETH_XDP_ONSTACK_BUFF(xdp);
 	u16 ntc = rxq->next_to_clean;
 
+	libeth_xdp_tx_init_bulk(&bq, rxq->xdp_prog, rxq->xdp_rxq.dev,
+				rxq->xdpqs, rxq->num_xdp_txq);
 	libeth_xdp_init_buff(xdp, &rxq->xdp, &rxq->xdp_rxq);
 
 	/* Process Rx packets bounded by budget */
@@ -3435,11 +3434,13 @@ static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
 		if (!idpf_rx_splitq_is_eop(rx_desc) || unlikely(!xdp->data))
 			continue;
 
-		idpf_xdp_run_pass(xdp, rxq->napi, &rs, rx_desc);
+		idpf_xdp_run_pass(xdp, &bq, rxq->napi, &rs, rx_desc);
 	}
 
 	rxq->next_to_clean = ntc;
+
 	libeth_xdp_save_buff(&rxq->xdp, xdp);
+	idpf_xdp_finalize_rx(&bq);
 
 	u64_stats_update_begin(&rxq->stats_sync);
 	u64_stats_add(&rxq->q_stats.packets, rs.packets);
diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
index c0322fa7bfee..abf75e840c0a 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.c
+++ b/drivers/net/ethernet/intel/idpf/xdp.c
@@ -1,8 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright (C) 2024 Intel Corporation */
 
-#include <net/libeth/xdp.h>
-
 #include "idpf.h"
 #include "idpf_virtchnl.h"
 #include "xdp.h"
@@ -143,6 +141,8 @@ void idpf_copy_xdp_prog_to_qs(const struct idpf_vport *vport,
 	idpf_rxq_for_each(vport, idpf_xdp_rxq_assign_prog, xdp_prog);
 }
 
+static void idpf_xdp_tx_timer(struct work_struct *work);
+
 int idpf_vport_xdpq_get(const struct idpf_vport *vport)
 {
 	struct libeth_xdpsq_timer **timers __free(kvfree) = NULL;
@@ -183,6 +183,8 @@ int idpf_vport_xdpq_get(const struct idpf_vport *vport)
 
 		xdpq->timer = timers[i - sqs];
 		libeth_xdpsq_get(&xdpq->xdp_lock, dev, vport->xdpq_share);
+		libeth_xdpsq_init_timer(xdpq->timer, xdpq, &xdpq->xdp_lock,
+					idpf_xdp_tx_timer);
 
 		xdpq->pending = 0;
 		xdpq->xdp_tx = 0;
@@ -209,6 +211,7 @@ void idpf_vport_xdpq_put(const struct idpf_vport *vport)
 		if (!idpf_queue_has_clear(XDP, xdpq))
 			continue;
 
+		libeth_xdpsq_deinit_timer(xdpq->timer);
 		libeth_xdpsq_put(&xdpq->xdp_lock, dev);
 
 		kfree(xdpq->timer);
@@ -216,6 +219,154 @@ void idpf_vport_xdpq_put(const struct idpf_vport *vport)
 	}
 }
 
+static int
+idpf_xdp_parse_compl_desc(const struct idpf_splitq_4b_tx_compl_desc *desc,
+			  bool gen)
+{
+	u32 val;
+
+#ifdef __LIBETH_WORD_ACCESS
+	val = *(const u32 *)desc;
+#else
+	val = ((u32)le16_to_cpu(desc->q_head_compl_tag.q_head) << 16) |
+	      le16_to_cpu(desc->qid_comptype_gen);
+#endif
+	if (!!(val & IDPF_TXD_COMPLQ_GEN_M) != gen)
+		return -ENODATA;
+
+	if (unlikely((val & GENMASK(IDPF_TXD_COMPLQ_GEN_S - 1, 0)) !=
+		     FIELD_PREP(IDPF_TXD_COMPLQ_COMPL_TYPE_M,
+				IDPF_TXD_COMPLT_RS)))
+		return -EINVAL;
+
+	return upper_16_bits(val);
+}
+
+static u32 idpf_xdpsq_poll(struct idpf_tx_queue *xdpsq, u32 budget)
+{
+	struct idpf_compl_queue *cq = xdpsq->complq;
+	u32 tx_ntc = xdpsq->next_to_clean;
+	u32 tx_cnt = xdpsq->desc_count;
+	u32 ntc = cq->next_to_clean;
+	u32 cnt = cq->desc_count;
+	u32 done_frames;
+	bool gen;
+
+	gen = idpf_queue_has(GEN_CHK, cq);
+
+	for (done_frames = 0; done_frames < budget; ) {
+		int ret;
+
+		ret = idpf_xdp_parse_compl_desc(&cq->comp_4b[ntc], gen);
+		if (ret >= 0) {
+			done_frames = ret > tx_ntc ? ret - tx_ntc :
+						     ret + tx_cnt - tx_ntc;
+			goto next;
+		}
+
+		switch (ret) {
+		case -ENODATA:
+			goto out;
+		case -EINVAL:
+			break;
+		}
+
+next:
+		if (unlikely(++ntc == cnt)) {
+			ntc = 0;
+			gen = !gen;
+			idpf_queue_change(GEN_CHK, cq);
+		}
+	}
+
+out:
+	cq->next_to_clean = ntc;
+
+	return done_frames;
+}
+
+/**
+ * idpf_clean_xdp_irq - Reclaim a batch of TX resources from completed XDP_TX
+ * @_xdpq: XDP Tx queue
+ * @budget: maximum number of descriptors to clean
+ *
+ * Returns number of cleaned descriptors.
+ */
+static u32 idpf_clean_xdp_irq(void *_xdpq, u32 budget)
+{
+	struct libeth_xdpsq_napi_stats ss = { };
+	struct idpf_tx_queue *xdpq = _xdpq;
+	u32 tx_ntc = xdpq->next_to_clean;
+	u32 tx_cnt = xdpq->desc_count;
+	struct xdp_frame_bulk bq;
+	struct libeth_cq_pp cp = {
+		.dev	= xdpq->dev,
+		.bq	= &bq,
+		.xss	= &ss,
+		.napi	= true,
+	};
+	u32 done_frames;
+
+	done_frames = idpf_xdpsq_poll(xdpq, budget);
+	if (unlikely(!done_frames))
+		return 0;
+
+	xdp_frame_bulk_init(&bq);
+
+	for (u32 i = 0; likely(i < done_frames); i++) {
+		libeth_xdp_complete_tx(&xdpq->tx_buf[tx_ntc], &cp);
+
+		if (unlikely(++tx_ntc == tx_cnt))
+			tx_ntc = 0;
+	}
+
+	xdp_flush_frame_bulk(&bq);
+
+	xdpq->next_to_clean = tx_ntc;
+	xdpq->pending -= done_frames;
+	xdpq->xdp_tx -= cp.xdp_tx;
+
+	return done_frames;
+}
+
+static u32 idpf_xdp_tx_prep(void *_xdpq, struct libeth_xdpsq *sq)
+{
+	struct idpf_tx_queue *xdpq = _xdpq;
+	u32 free;
+
+	libeth_xdpsq_lock(&xdpq->xdp_lock);
+
+	free = xdpq->desc_count - xdpq->pending;
+	if (free <= xdpq->thresh)
+		free += idpf_clean_xdp_irq(xdpq, xdpq->thresh);
+
+	*sq = (struct libeth_xdpsq){
+		.sqes		= xdpq->tx_buf,
+		.descs		= xdpq->desc_ring,
+		.count		= xdpq->desc_count,
+		.lock		= &xdpq->xdp_lock,
+		.ntu		= &xdpq->next_to_use,
+		.pending	= &xdpq->pending,
+		.xdp_tx		= &xdpq->xdp_tx,
+	};
+
+	return free;
+}
+
+LIBETH_XDP_DEFINE_START();
+LIBETH_XDP_DEFINE_TIMER(static idpf_xdp_tx_timer, idpf_clean_xdp_irq);
+LIBETH_XDP_DEFINE_FLUSH_TX(idpf_xdp_tx_flush_bulk, idpf_xdp_tx_prep,
+			   idpf_xdp_tx_xmit);
+LIBETH_XDP_DEFINE_END();
+
+void idpf_xdp_set_features(const struct idpf_vport *vport)
+{
+	if (!idpf_is_queue_model_split(vport->rxq_model))
+		return;
+
+	libeth_xdp_set_features_noredir(vport->netdev);
+}
+
 /**
  * idpf_xdp_setup_prog - handle XDP program install/remove requests
  * @vport: vport to configure
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH net-next 15/16] idpf: add support for .ndo_xdp_xmit()
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (13 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 14/16] idpf: add support for XDP on Rx Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-11 16:08   ` Maciej Fijalkowski
  2025-03-05 16:21 ` [PATCH net-next 16/16] idpf: add XDP RSS hash hint Alexander Lobakin
  2025-03-11 15:28 ` [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
  16 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

Use the libeth XDP infra to implement .ndo_xdp_xmit() in idpf.
The Tx callbacks are reused from the XDP_TX code. The XDP redirect
target feature is set/cleared depending on XDP prog presence, as for
now XDP Tx queues are still not allocated when there's no program.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
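Illustrative note (not part of the series): one way this path gets
exercised is an XDP program on another (or the same) interface
redirecting through a devmap; the kernel then hands the frames to the
target netdev in batches via .ndo_xdp_xmit(), i.e. idpf_xdp_xmit()
here. Populating the "tx_port" map with the target ifindex is left to
userspace and is an assumption of this sketch.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} tx_port SEC(".maps");

SEC("xdp")
int xdp_redirect_devmap(struct xdp_md *ctx)
{
	/* redirect every frame to the ifindex stored at key 0 */
	return bpf_redirect_map(&tx_port, 0, 0);
}

char _license[] SEC("license") = "GPL";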
 drivers/net/ethernet/intel/idpf/xdp.h      |  2 ++
 drivers/net/ethernet/intel/idpf/idpf_lib.c |  1 +
 drivers/net/ethernet/intel/idpf/xdp.c      | 29 ++++++++++++++++++++++
 3 files changed, 32 insertions(+)

diff --git a/drivers/net/ethernet/intel/idpf/xdp.h b/drivers/net/ethernet/intel/idpf/xdp.h
index fde85528a315..a2ac1b2f334f 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.h
+++ b/drivers/net/ethernet/intel/idpf/xdp.h
@@ -110,5 +110,7 @@ static inline void idpf_xdp_tx_finalize(void *_xdpq, bool sent, bool flush)
 void idpf_xdp_set_features(const struct idpf_vport *vport);
 
 int idpf_xdp(struct net_device *dev, struct netdev_bpf *xdp);
+int idpf_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
+		  u32 flags);
 
 #endif /* _IDPF_XDP_H_ */
diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
index 2d1efcb854be..39b9885293a9 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
@@ -2371,4 +2371,5 @@ static const struct net_device_ops idpf_netdev_ops = {
 	.ndo_set_features = idpf_set_features,
 	.ndo_tx_timeout = idpf_tx_timeout,
 	.ndo_bpf = idpf_xdp,
+	.ndo_xdp_xmit = idpf_xdp_xmit,
 };
diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
index abf75e840c0a..1834f217a07f 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.c
+++ b/drivers/net/ethernet/intel/idpf/xdp.c
@@ -357,8 +357,35 @@ LIBETH_XDP_DEFINE_START();
 LIBETH_XDP_DEFINE_TIMER(static idpf_xdp_tx_timer, idpf_clean_xdp_irq);
 LIBETH_XDP_DEFINE_FLUSH_TX(idpf_xdp_tx_flush_bulk, idpf_xdp_tx_prep,
 			   idpf_xdp_tx_xmit);
+LIBETH_XDP_DEFINE_FLUSH_XMIT(static idpf_xdp_xmit_flush_bulk, idpf_xdp_tx_prep,
+			     idpf_xdp_tx_xmit);
 LIBETH_XDP_DEFINE_END();
 
+/**
+ * idpf_xdp_xmit - send frames queued by ``XDP_REDIRECT`` to this interface
+ * @dev: network device
+ * @n: number of frames to transmit
+ * @frames: frames to transmit
+ * @flags: transmit flags (``XDP_XMIT_FLUSH`` or zero)
+ *
+ * Return: number of frames successfully sent or -errno on error.
+ */
+int idpf_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
+		  u32 flags)
+{
+	const struct idpf_netdev_priv *np = netdev_priv(dev);
+	const struct idpf_vport *vport = np->vport;
+
+	if (unlikely(!netif_carrier_ok(dev) || !vport->link_up))
+		return -ENETDOWN;
+
+	return libeth_xdp_xmit_do_bulk(dev, n, frames, flags,
+				       &vport->txqs[vport->xdp_txq_offset],
+				       vport->num_xdp_txq,
+				       idpf_xdp_xmit_flush_bulk,
+				       idpf_xdp_tx_finalize);
+}
+
 void idpf_xdp_set_features(const struct idpf_vport *vport)
 {
 	if (!idpf_is_queue_model_split(vport->rxq_model))
@@ -417,6 +444,8 @@ idpf_xdp_setup_prog(struct idpf_vport *vport, const struct netdev_bpf *xdp)
 		cfg->user_config.xdp_prog = old;
 	}
 
+	libeth_xdp_set_redirect(vport->netdev, vport->xdp_prog);
+
 	return ret;
 }
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH net-next 16/16] idpf: add XDP RSS hash hint
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (14 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 15/16] idpf: add support for .ndo_xdp_xmit() Alexander Lobakin
@ 2025-03-05 16:21 ` Alexander Lobakin
  2025-03-11 15:28 ` [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
  16 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-05 16:21 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

Add &xdp_metadata_ops with a callback to get the RSS hash hint from
the descriptor. Declare the splitq 32-byte descriptor as 4 u64s to
parse it more efficiently when possible.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
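Illustrative note (not part of the series): a sketch of an XDP program
consuming the new hint via the bpf_xdp_metadata_rx_hash() kfunc, which
ends up in idpf_xdpmo_rx_hash() below. It assumes a vmlinux.h
generated from a kernel that exposes the XDP metadata kfuncs.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;

SEC("xdp")
int xdp_use_hash_hint(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type type = 0;
	__u32 hash = 0;

	/* fails with an errno when the driver can't provide the hint */
	if (bpf_xdp_metadata_rx_hash(ctx, &hash, &type))
		return XDP_PASS;

	/* hash/type could now drive e.g. flow accounting or steering */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";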
 drivers/net/ethernet/intel/idpf/xdp.h | 64 +++++++++++++++++++++++++++
 drivers/net/ethernet/intel/idpf/xdp.c | 28 +++++++++++-
 2 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/idpf/xdp.h b/drivers/net/ethernet/intel/idpf/xdp.h
index a2ac1b2f334f..52783a5c8e0f 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.h
+++ b/drivers/net/ethernet/intel/idpf/xdp.h
@@ -107,6 +107,70 @@ static inline void idpf_xdp_tx_finalize(void *_xdpq, bool sent, bool flush)
 	libeth_xdpsq_unlock(&xdpq->xdp_lock);
 }
 
+struct idpf_xdp_rx_desc {
+	aligned_u64		qw0;
+#define IDPF_XDP_RX_BUFQ	BIT_ULL(47)
+#define IDPF_XDP_RX_GEN		BIT_ULL(46)
+#define IDPF_XDP_RX_LEN		GENMASK_ULL(45, 32)
+#define IDPF_XDP_RX_PT		GENMASK_ULL(25, 16)
+
+	aligned_u64		qw1;
+#define IDPF_XDP_RX_BUF		GENMASK_ULL(47, 32)
+#define IDPF_XDP_RX_EOP		BIT_ULL(1)
+
+	aligned_u64		qw2;
+#define IDPF_XDP_RX_HASH	GENMASK_ULL(31, 0)
+
+	aligned_u64		qw3;
+} __aligned(4 * sizeof(u64));
+static_assert(sizeof(struct idpf_xdp_rx_desc) ==
+	      sizeof(struct virtchnl2_rx_flex_desc_adv_nic_3));
+
+#define idpf_xdp_rx_bufq(desc)	!!((desc)->qw0 & IDPF_XDP_RX_BUFQ)
+#define idpf_xdp_rx_gen(desc)	!!((desc)->qw0 & IDPF_XDP_RX_GEN)
+#define idpf_xdp_rx_len(desc)	FIELD_GET(IDPF_XDP_RX_LEN, (desc)->qw0)
+#define idpf_xdp_rx_pt(desc)	FIELD_GET(IDPF_XDP_RX_PT, (desc)->qw0)
+#define idpf_xdp_rx_buf(desc)	FIELD_GET(IDPF_XDP_RX_BUF, (desc)->qw1)
+#define idpf_xdp_rx_eop(desc)	!!((desc)->qw1 & IDPF_XDP_RX_EOP)
+#define idpf_xdp_rx_hash(desc)	FIELD_GET(IDPF_XDP_RX_HASH, (desc)->qw2)
+
+static inline void
+idpf_xdp_get_qw0(struct idpf_xdp_rx_desc *desc,
+		 const struct virtchnl2_rx_flex_desc_adv_nic_3 *rxd)
+{
+#ifdef __LIBETH_WORD_ACCESS
+	desc->qw0 = ((const typeof(desc))rxd)->qw0;
+#else
+	desc->qw0 = ((u64)le16_to_cpu(rxd->pktlen_gen_bufq_id) << 32) |
+		    ((u64)le16_to_cpu(rxd->ptype_err_fflags0) << 16);
+#endif
+}
+
+static inline void
+idpf_xdp_get_qw1(struct idpf_xdp_rx_desc *desc,
+		 const struct virtchnl2_rx_flex_desc_adv_nic_3 *rxd)
+{
+#ifdef __LIBETH_WORD_ACCESS
+	desc->qw1 = ((const typeof(desc))rxd)->qw1;
+#else
+	desc->qw1 = ((u64)le16_to_cpu(rxd->buf_id) << 32) |
+		    rxd->status_err0_qw1;
+#endif
+}
+
+static inline void
+idpf_xdp_get_qw2(struct idpf_xdp_rx_desc *desc,
+		 const struct virtchnl2_rx_flex_desc_adv_nic_3 *rxd)
+{
+#ifdef __LIBETH_WORD_ACCESS
+	desc->qw2 = ((const typeof(desc))rxd)->qw2;
+#else
+	desc->qw2 = ((u64)rxd->hash3 << 24) |
+		    ((u64)rxd->ff2_mirrid_hash2.hash2 << 16) |
+		    le16_to_cpu(rxd->hash1);
+#endif
+}
+
 void idpf_xdp_set_features(const struct idpf_vport *vport);
 
 int idpf_xdp(struct net_device *dev, struct netdev_bpf *xdp);
diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
index 1834f217a07f..b0b4b785bf8e 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.c
+++ b/drivers/net/ethernet/intel/idpf/xdp.c
@@ -386,12 +386,38 @@ int idpf_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 				       idpf_xdp_tx_finalize);
 }
 
+static int idpf_xdpmo_rx_hash(const struct xdp_md *ctx, u32 *hash,
+			      enum xdp_rss_hash_type *rss_type)
+{
+	const struct libeth_xdp_buff *xdp = (typeof(xdp))ctx;
+	const struct idpf_rx_queue *rxq;
+	struct idpf_xdp_rx_desc desc;
+	struct libeth_rx_pt pt;
+
+	rxq = libeth_xdp_buff_to_rq(xdp, typeof(*rxq), xdp_rxq);
+
+	idpf_xdp_get_qw0(&desc, xdp->desc);
+
+	pt = rxq->rx_ptype_lkup[idpf_xdp_rx_pt(&desc)];
+	if (!libeth_rx_pt_has_hash(rxq->xdp_rxq.dev, pt))
+		return -ENODATA;
+
+	idpf_xdp_get_qw2(&desc, xdp->desc);
+
+	return libeth_xdpmo_rx_hash(hash, rss_type, idpf_xdp_rx_hash(&desc),
+				    pt);
+}
+
+static const struct xdp_metadata_ops idpf_xdpmo = {
+	.xmo_rx_hash		= idpf_xdpmo_rx_hash,
+};
+
 void idpf_xdp_set_features(const struct idpf_vport *vport)
 {
 	if (!idpf_is_queue_model_split(vport->rxq_model))
 		return;
 
-	libeth_xdp_set_features_noredir(vport->netdev);
+	libeth_xdp_set_features_noredir(vport->netdev, &idpf_xdpmo);
 }
 
 /**
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread
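
For reference, a minimal sketch of how an XDP program could consume this hint
through the generic bpf_xdp_metadata_rx_hash() kfunc once the driver exposes
it via &xdp_metadata_ops. The program name, section and the vmlinux.h-based
includes below are illustrative only, not part of this series:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;

SEC("xdp")
int rx_hash_sample(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type type = 0;
	__u32 hash = 0;

	/* The kfunc returns a negative errno (e.g. -ENODATA) when no hash
	 * is available for this frame, matching the driver callback above.
	 */
	if (bpf_xdp_metadata_rx_hash(ctx, &hash, &type))
		return XDP_PASS;

	bpf_printk("rss hash 0x%x type %u", hash, type);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";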

* Re: [PATCH net-next 01/16] libeth: convert to netmem
  2025-03-05 16:21 ` [PATCH net-next 01/16] libeth: convert to netmem Alexander Lobakin
@ 2025-03-06  0:13   ` Mina Almasry
  2025-03-11 17:22     ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Mina Almasry @ 2025-03-06  0:13 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 5, 2025 at 8:23 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> Back when the libeth Rx core was initially written, devmem was a draft
> and netmem_ref didn't exist in the mainline. Now that it's here, make
> libeth MP-agnostic before introducing any new code or any new library
> users.
> When it's known that the created PP/FQ is for header buffers, use faster
> "unsafe" underscored netmem <--> virt accessors as netmem_is_net_iov()
> is always false in that case, but consumes some cycles (bit test +
> true branch).
> Misc: replace explicit EXPORT_SYMBOL_NS_GPL("NS") with
> DEFAULT_SYMBOL_NAMESPACE.
>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  include/net/libeth/rx.h                       | 22 +++++++------
>  drivers/net/ethernet/intel/iavf/iavf_txrx.c   | 14 ++++----
>  .../ethernet/intel/idpf/idpf_singleq_txrx.c   |  2 +-
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c   | 33 +++++++++++--------
>  drivers/net/ethernet/intel/libeth/rx.c        | 20 ++++++-----
>  5 files changed, 51 insertions(+), 40 deletions(-)
>
> diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h
> index ab05024be518..7d5dc58984b1 100644
> --- a/include/net/libeth/rx.h
> +++ b/include/net/libeth/rx.h
> @@ -1,5 +1,5 @@
>  /* SPDX-License-Identifier: GPL-2.0-only */
> -/* Copyright (C) 2024 Intel Corporation */
> +/* Copyright (C) 2024-2025 Intel Corporation */
>
>  #ifndef __LIBETH_RX_H
>  #define __LIBETH_RX_H
> @@ -31,7 +31,7 @@
>
>  /**
>   * struct libeth_fqe - structure representing an Rx buffer (fill queue element)
> - * @page: page holding the buffer
> + * @netmem: network memory reference holding the buffer
>   * @offset: offset from the page start (to the headroom)
>   * @truesize: total space occupied by the buffer (w/ headroom and tailroom)
>   *
> @@ -40,7 +40,7 @@
>   * former, @offset is always 0 and @truesize is always ```PAGE_SIZE```.
>   */
>  struct libeth_fqe {
> -       struct page             *page;
> +       netmem_ref              netmem;
>         u32                     offset;
>         u32                     truesize;
>  } __aligned_largest;
> @@ -102,15 +102,16 @@ static inline dma_addr_t libeth_rx_alloc(const struct libeth_fq_fp *fq, u32 i)
>         struct libeth_fqe *buf = &fq->fqes[i];
>
>         buf->truesize = fq->truesize;
> -       buf->page = page_pool_dev_alloc(fq->pp, &buf->offset, &buf->truesize);
> -       if (unlikely(!buf->page))
> +       buf->netmem = page_pool_dev_alloc_netmem(fq->pp, &buf->offset,
> +                                                &buf->truesize);
> +       if (unlikely(!buf->netmem))
>                 return DMA_MAPPING_ERROR;
>
> -       return page_pool_get_dma_addr(buf->page) + buf->offset +
> +       return page_pool_get_dma_addr_netmem(buf->netmem) + buf->offset +
>                fq->pp->p.offset;
>  }
>
> -void libeth_rx_recycle_slow(struct page *page);
> +void libeth_rx_recycle_slow(netmem_ref netmem);
>
>  /**
>   * libeth_rx_sync_for_cpu - synchronize or recycle buffer post DMA
> @@ -126,18 +127,19 @@ void libeth_rx_recycle_slow(struct page *page);
>  static inline bool libeth_rx_sync_for_cpu(const struct libeth_fqe *fqe,
>                                           u32 len)
>  {
> -       struct page *page = fqe->page;
> +       netmem_ref netmem = fqe->netmem;
>
>         /* Very rare, but possible case. The most common reason:
>          * the last fragment contained FCS only, which was then
>          * stripped by the HW.
>          */
>         if (unlikely(!len)) {
> -               libeth_rx_recycle_slow(page);
> +               libeth_rx_recycle_slow(netmem);

I think before this patch this would have expanded to:

page_pool_put_full_page(pool, page, true);

But now I think it expands to:

page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, false);

Is the switch from true to false intentional? Is this a slow path so
it doesn't matter?

>                 return false;
>         }
>
> -       page_pool_dma_sync_for_cpu(page->pp, page, fqe->offset, len);
> +       page_pool_dma_sync_netmem_for_cpu(netmem_get_pp(netmem), netmem,
> +                                         fqe->offset, len);
>
>         return true;
>  }
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> index 422312b8b54a..35d353d38129 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> @@ -723,7 +723,7 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
>         for (u32 i = rx_ring->next_to_clean; i != rx_ring->next_to_use; ) {
>                 const struct libeth_fqe *rx_fqes = &rx_ring->rx_fqes[i];
>
> -               page_pool_put_full_page(rx_ring->pp, rx_fqes->page, false);
> +               libeth_rx_recycle_slow(rx_fqes->netmem);
>
>                 if (unlikely(++i == rx_ring->count))
>                         i = 0;
> @@ -1197,10 +1197,11 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
>                              const struct libeth_fqe *rx_buffer,
>                              unsigned int size)
>  {
> -       u32 hr = rx_buffer->page->pp->p.offset;
> +       u32 hr = netmem_get_pp(rx_buffer->netmem)->p.offset;
>
> -       skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
> -                       rx_buffer->offset + hr, size, rx_buffer->truesize);
> +       skb_add_rx_frag_netmem(skb, skb_shinfo(skb)->nr_frags,
> +                              rx_buffer->netmem, rx_buffer->offset + hr,
> +                              size, rx_buffer->truesize);
>  }
>
>  /**
> @@ -1214,12 +1215,13 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
>  static struct sk_buff *iavf_build_skb(const struct libeth_fqe *rx_buffer,
>                                       unsigned int size)
>  {
> -       u32 hr = rx_buffer->page->pp->p.offset;
> +       struct page *buf_page = __netmem_to_page(rx_buffer->netmem);
> +       u32 hr = buf_page->pp->p.offset;
>         struct sk_buff *skb;
>         void *va;
>
>         /* prefetch first cache line of first page */
> -       va = page_address(rx_buffer->page) + rx_buffer->offset;
> +       va = page_address(buf_page) + rx_buffer->offset;
>         net_prefetch(va + hr);
>
>         /* build an skb around the page buffer */
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
> index eae1b6f474e6..aeb2ca5f5a0a 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_singleq_txrx.c
> @@ -1009,7 +1009,7 @@ static int idpf_rx_singleq_clean(struct idpf_rx_queue *rx_q, int budget)
>                         break;
>
>  skip_data:
> -               rx_buf->page = NULL;
> +               rx_buf->netmem = 0;
>
>                 IDPF_SINGLEQ_BUMP_RING_IDX(rx_q, ntc);
>                 cleaned_count++;
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> index bdf52cef3891..6254806c2072 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> @@ -382,12 +382,12 @@ static int idpf_tx_desc_alloc_all(struct idpf_vport *vport)
>   */
>  static void idpf_rx_page_rel(struct libeth_fqe *rx_buf)
>  {
> -       if (unlikely(!rx_buf->page))
> +       if (unlikely(!rx_buf->netmem))
>                 return;
>
> -       page_pool_put_full_page(rx_buf->page->pp, rx_buf->page, false);
> +       libeth_rx_recycle_slow(rx_buf->netmem);
>
> -       rx_buf->page = NULL;
> +       rx_buf->netmem = 0;
>         rx_buf->offset = 0;
>  }
>
> @@ -3096,10 +3096,10 @@ idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
>  void idpf_rx_add_frag(struct idpf_rx_buf *rx_buf, struct sk_buff *skb,
>                       unsigned int size)
>  {
> -       u32 hr = rx_buf->page->pp->p.offset;
> +       u32 hr = netmem_get_pp(rx_buf->netmem)->p.offset;
>
> -       skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buf->page,
> -                       rx_buf->offset + hr, size, rx_buf->truesize);
> +       skb_add_rx_frag_netmem(skb, skb_shinfo(skb)->nr_frags, rx_buf->netmem,
> +                              rx_buf->offset + hr, size, rx_buf->truesize);
>  }
>
>  /**
> @@ -3122,16 +3122,20 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
>                              struct libeth_fqe *buf, u32 data_len)
>  {
>         u32 copy = data_len <= L1_CACHE_BYTES ? data_len : ETH_HLEN;
> +       struct page *hdr_page, *buf_page;
>         const void *src;
>         void *dst;
>
> -       if (!libeth_rx_sync_for_cpu(buf, copy))
> +       if (unlikely(netmem_is_net_iov(buf->netmem)) ||
> +           !libeth_rx_sync_for_cpu(buf, copy))
>                 return 0;
>

I could not immediately understand why you need a netmem_is_net_iov
check here. libeth_rx_sync_for_cpu will delegate to
page_pool_dma_sync_netmem_for_cpu which should do the right thing
regardless of whether the netmem is a page or net_iov, right? Is this
to save some cycles?

--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 59+ messages in thread
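
For context on the allow_direct question above, a paraphrased sketch of the
two page_pool helpers being compared (prototypes as in
include/net/page_pool/helpers.h; the recycling summary is a general page_pool
property, not something stated in this series):

void page_pool_put_full_page(struct page_pool *pool, struct page *page,
			     bool allow_direct);
void page_pool_put_full_netmem(struct page_pool *pool, netmem_ref netmem,
			       bool allow_direct);

/* allow_direct == true recycles straight into the pool's lockless per-CPU
 * cache and is only safe from the NAPI/softirq context owning the pool;
 * allow_direct == false takes the slower, synchronized ptr_ring path, which
 * is safe from any context. Switching true -> false is thus safe, only
 * potentially slower.
 */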

* Re: [PATCH net-next 11/16] idpf: prepare structures to support XDP
  2025-03-05 16:21 ` [PATCH net-next 11/16] idpf: prepare structures to support XDP Alexander Lobakin
@ 2025-03-07  1:12   ` Jakub Kicinski
  2025-03-12 14:00     ` [Intel-wired-lan] " Alexander Lobakin
  2025-03-07 13:27   ` Maciej Fijalkowski
  1 sibling, 1 reply; 59+ messages in thread
From: Jakub Kicinski @ 2025-03-07  1:12 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed,  5 Mar 2025 17:21:27 +0100 Alexander Lobakin wrote:
> +/**
> + * idpf_xdp_is_prog_ena - check if there is an XDP program on adapter
> + * @vport: vport to check
> + */
> +static inline bool idpf_xdp_is_prog_ena(const struct idpf_vport *vport)
> +{
> +	return vport->adapter && vport->xdp_prog;
> +}

drivers/net/ethernet/intel/idpf/idpf.h:624: warning: No description found for return value of 'idpf_xdp_is_prog_ena'

The documentation doesn't add much info, just remove it?
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 59+ messages in thread
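
If the helper were kept rather than removed, a one-line Return: description
would also silence kernel-doc; a sketch of what that could look like (wording
illustrative):

/**
 * idpf_xdp_is_prog_ena - check if there is an XDP program on adapter
 * @vport: vport to check
 *
 * Return: true if an XDP program is attached to the vport, false otherwise.
 */
static inline bool idpf_xdp_is_prog_ena(const struct idpf_vport *vport)
{
	return vport->adapter && vport->xdp_prog;
}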

* Re: [PATCH net-next 04/16] libeth: add XSk helpers
  2025-03-05 16:21 ` [PATCH net-next 04/16] libeth: add XSk helpers Alexander Lobakin
@ 2025-03-07 10:15   ` Maciej Fijalkowski
  2025-03-12 17:03     ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-07 10:15 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:20PM +0100, Alexander Lobakin wrote:
> Add the following counterparts of functions from libeth_xdp which need
> special care on XSk path:
> 
> * building &xdp_buff (head and frags);
> * running XDP prog and managing all possible verdicts;
> * xmit (with S/G and metadata support);
> * wakeup via CSD/IPI;
> * FQ init/deinit and refilling.
> 
> Xmit by default unrolls loops by 8 when filling Tx DMA descriptors.
> XDP_REDIRECT verdict is considered default/likely(). Rx frags are
> considered unlikely().
> It is assumed that Tx/completion queues are not mapped to any
> interrupts, thus we clean them only when needed (=> 3/4 of
> descriptors is busy) and keep need_wakeup set.
> IPI for XSk wakeup showed better performance than triggering an SW
> NIC interrupt, though it doesn't respect NIC's interrupt affinity.

Maybe introduce this with the XSk support for idpf (I suppose in a set sent
after this one)?

Otherwise, what is the reason to have this included? I didn't check in-depth
whether any functions from this patch are used on the driver side.

> 
> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # lots of stuff
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/libeth/Kconfig  |   2 +-
>  drivers/net/ethernet/intel/libeth/Makefile |   1 +
>  drivers/net/ethernet/intel/libeth/priv.h   |  11 +
>  include/net/libeth/tx.h                    |  10 +-
>  include/net/libeth/xdp.h                   |  90 ++-
>  include/net/libeth/xsk.h                   | 685 +++++++++++++++++++++
>  drivers/net/ethernet/intel/libeth/tx.c     |   5 +-
>  drivers/net/ethernet/intel/libeth/xdp.c    |  26 +-
>  drivers/net/ethernet/intel/libeth/xsk.c    | 269 ++++++++
>  9 files changed, 1067 insertions(+), 32 deletions(-)
>  create mode 100644 include/net/libeth/xsk.h
>  create mode 100644 drivers/net/ethernet/intel/libeth/xsk.c
> 

(...)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 05/16] idpf: fix Rx descriptor ready check barrier in splitq
  2025-03-05 16:21 ` [PATCH net-next 05/16] idpf: fix Rx descriptor ready check barrier in splitq Alexander Lobakin
@ 2025-03-07 10:17   ` Maciej Fijalkowski
  2025-03-12 17:10     ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-07 10:17 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:21PM +0100, Alexander Lobakin wrote:
> No idea what the current barrier position was meant for. At that point,
> nothing is read from the descriptor, only the pointer to the actual one
> is fetched.
> The correct barrier usage here is after the generation check, so that
> only the first qword is read if the descriptor is not yet ready and we
> need to stop polling. Debatable on coherent DMA as the Rx descriptor
> size is <= cacheline size, but anyway, the current barrier position
> only makes the codegen worse.

Makes sense:
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

But you know the process... :P fixes should go to -net.

> 
> Fixes: 3a8845af66ed ("idpf: add RX splitq napi poll support")
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> index 6254806c2072..c15833928ea1 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> @@ -3232,18 +3232,14 @@ static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
>  		/* get the Rx desc from Rx queue based on 'next_to_clean' */
>  		rx_desc = &rxq->rx[ntc].flex_adv_nic_3_wb;
>  
> -		/* This memory barrier is needed to keep us from reading
> -		 * any other fields out of the rx_desc
> -		 */
> -		dma_rmb();
> -
>  		/* if the descriptor isn't done, no work yet to do */
>  		gen_id = le16_get_bits(rx_desc->pktlen_gen_bufq_id,
>  				       VIRTCHNL2_RX_FLEX_DESC_ADV_GEN_M);
> -
>  		if (idpf_queue_has(GEN_CHK, rxq) != gen_id)
>  			break;
>  
> +		dma_rmb();
> +
>  		rxdid = FIELD_GET(VIRTCHNL2_RX_FLEX_DESC_ADV_RXDID_M,
>  				  rx_desc->rxdid_ucast);
>  		if (rxdid != VIRTCHNL2_RXDID_2_FLEX_SPLITQ) {
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread
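
The ordering rule the commit message relies on, as a generic polling sketch
(struct names and helpers such as desc_is_done() are hypothetical, not idpf
code):

static void poll_ring(struct my_ring *ring)
{
	for (;;) {
		const struct my_desc *desc = next_desc(ring);

		/* 1. Read only the DONE/GEN indication first. */
		if (!desc_is_done(desc))
			break;

		/* 2. dma_rmb() orders every later descriptor read after the
		 *    DONE check, so no stale field can be loaded from a
		 *    descriptor the HW has not finished writing back.
		 */
		dma_rmb();

		/* 3. Only now read the remaining fields. */
		consume(ring, le16_to_cpu(desc->len));
	}
}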

* Re: [PATCH net-next 07/16] idpf: link NAPIs to queues
  2025-03-05 16:21 ` [PATCH net-next 07/16] idpf: link NAPIs to queues Alexander Lobakin
@ 2025-03-07 10:28   ` Eric Dumazet
  2025-03-12 17:16     ` Alexander Lobakin
  2025-03-07 10:51   ` Maciej Fijalkowski
  1 sibling, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2025-03-07 10:28 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 5, 2025 at 5:22 PM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> Add the missing linking of NAPIs to netdev queues when enabling
> interrupt vectors in order to support NAPI configuration and
> interfaces requiring get_rx_queue()->napi to be set (like XSk
> busy polling).
>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c | 30 +++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> index 2f221c0abad8..a3f6e8cff7a0 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> @@ -3560,8 +3560,11 @@ void idpf_vport_intr_rel(struct idpf_vport *vport)
>  static void idpf_vport_intr_rel_irq(struct idpf_vport *vport)
>  {
>         struct idpf_adapter *adapter = vport->adapter;
> +       bool unlock;
>         int vector;
>
> +       unlock = rtnl_trylock();

This is probably not what you want here ?

If another thread is holding RTNL, then rtnl_trylock() will not add
any protection.

^ permalink raw reply	[flat|nested] 59+ messages in thread
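
To spell the concern out: rtnl_trylock() only takes the lock when nobody else
holds it, so under contention the code runs without RTNL anyway. A minimal
sketch of the questioned pattern (function body trimmed to comments,
everything here is illustrative):

static void vport_unlink_napis(struct idpf_vport *vport)
{
	bool unlock = rtnl_trylock();

	/* If another task holds RTNL, unlock == false and the
	 * netif_queue_set_napi(..., NULL) loop runs with no RTNL
	 * protection at all; the trylock only helps the uncontended
	 * case. An unconditional rtnl_lock()/rtnl_unlock() pair, or
	 * requiring the caller to hold RTNL, would close that window.
	 */

	if (unlock)
		rtnl_unlock();
}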

* Re: [PATCH net-next 06/16] idpf: a use saner limit for default number of queues to allocate
  2025-03-05 16:21 ` [PATCH net-next 06/16] idpf: a use saner limit for default number of queues to allocate Alexander Lobakin
@ 2025-03-07 10:32   ` Maciej Fijalkowski
  2025-03-12 17:22     ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-07 10:32 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:22PM +0100, Alexander Lobakin wrote:
> Currently, the maximum number of queues available for one vport is 16.
> This is hardcoded, but then the function calculating the optimal number
> of queues takes min(16, num_online_cpus()).
> On order to be able to allocate more queues, which will be then used for

nit: s/On/In

> XDP, stop hardcoding 16 and rely on what the device gives us. Instead of
> num_online_cpus(), which is considered suboptimal since at least 2013,
> use netif_get_num_default_rss_queues() to still have free queues in the
> pool.

Should we update older drivers as well?

> nr_cpu_ids number of Tx queues are needed only for lockless XDP sending,
> the regular stack doesn't benefit from that anyhow.
> On a 128-thread Xeon, this now gives me 32 regular Tx queues and leaves
> 224 free for XDP (128 of which will handle XDP_TX, .ndo_xdp_xmit(), and
> XSk xmit when enabled).
> 
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c     | 8 +-------
>  drivers/net/ethernet/intel/idpf/idpf_virtchnl.c | 2 +-
>  2 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> index c15833928ea1..2f221c0abad8 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> @@ -1234,13 +1234,7 @@ int idpf_vport_calc_total_qs(struct idpf_adapter *adapter, u16 vport_idx,
>  		num_req_tx_qs = vport_config->user_config.num_req_tx_qs;
>  		num_req_rx_qs = vport_config->user_config.num_req_rx_qs;
>  	} else {
> -		int num_cpus;
> -
> -		/* Restrict num of queues to cpus online as a default
> -		 * configuration to give best performance. User can always
> -		 * override to a max number of queues via ethtool.
> -		 */
> -		num_cpus = num_online_cpus();
> +		u32 num_cpus = netif_get_num_default_rss_queues();
>  
>  		dflt_splitq_txq_grps = min_t(int, max_q->max_txq, num_cpus);
>  		dflt_singleq_txqs = min_t(int, max_q->max_txq, num_cpus);
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
> index 3d2413b8684f..135af3cc243f 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
> @@ -937,7 +937,7 @@ int idpf_vport_alloc_max_qs(struct idpf_adapter *adapter,
>  	max_tx_q = le16_to_cpu(caps->max_tx_q) / default_vports;
>  	if (adapter->num_alloc_vports < default_vports) {
>  		max_q->max_rxq = min_t(u16, max_rx_q, IDPF_MAX_Q);
> -		max_q->max_txq = min_t(u16, max_tx_q, IDPF_MAX_Q);
> +		max_q->max_txq = min_t(u16, max_tx_q, IDPF_LARGE_MAX_Q);
>  	} else {
>  		max_q->max_rxq = IDPF_MIN_Q;
>  		max_q->max_txq = IDPF_MIN_Q;
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread
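
For a rough sense of the numbers in the commit message above:
netif_get_num_default_rss_queues() currently derives its value from the
number of physical cores (about half of them on larger machines), so a
128-thread / 64-core Xeon ends up with 32 default queues, which is where the
"32 regular Tx queues" figure comes from. That is the helper's present
heuristic, not a guarantee of its API.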

* Re: [PATCH net-next 07/16] idpf: link NAPIs to queues
  2025-03-05 16:21 ` [PATCH net-next 07/16] idpf: link NAPIs to queues Alexander Lobakin
  2025-03-07 10:28   ` Eric Dumazet
@ 2025-03-07 10:51   ` Maciej Fijalkowski
  2025-03-12 17:25     ` Alexander Lobakin
  1 sibling, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-07 10:51 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:23PM +0100, Alexander Lobakin wrote:
> Add the missing linking of NAPIs to netdev queues when enabling
> interrupt vectors in order to support NAPI configuration and
> interfaces requiring get_rx_queue()->napi to be set (like XSk
> busy polling).
> 
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c | 30 +++++++++++++++++++++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> index 2f221c0abad8..a3f6e8cff7a0 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> @@ -3560,8 +3560,11 @@ void idpf_vport_intr_rel(struct idpf_vport *vport)
>  static void idpf_vport_intr_rel_irq(struct idpf_vport *vport)
>  {
>  	struct idpf_adapter *adapter = vport->adapter;
> +	bool unlock;
>  	int vector;
>  
> +	unlock = rtnl_trylock();
> +
>  	for (vector = 0; vector < vport->num_q_vectors; vector++) {
>  		struct idpf_q_vector *q_vector = &vport->q_vectors[vector];
>  		int irq_num, vidx;
> @@ -3573,8 +3576,23 @@ static void idpf_vport_intr_rel_irq(struct idpf_vport *vport)
>  		vidx = vport->q_vector_idxs[vector];
>  		irq_num = adapter->msix_entries[vidx].vector;
>  
> +		for (u32 i = 0; i < q_vector->num_rxq; i++)
> +			netif_queue_set_napi(vport->netdev,
> +					     q_vector->rx[i]->idx,
> +					     NETDEV_QUEUE_TYPE_RX,
> +					     NULL);
> +
> +		for (u32 i = 0; i < q_vector->num_txq; i++)
> +			netif_queue_set_napi(vport->netdev,
> +					     q_vector->tx[i]->idx,
> +					     NETDEV_QUEUE_TYPE_TX,
> +					     NULL);
> +

maybe we could have a wrapper for this?

static void idpf_q_set_napi(struct net_device *netdev,
			    struct idpf_q_vector *q_vector,
			    enum netdev_queue_type q_type,
			    struct napi_struct *napi)
{
	bool rx = q_type == NETDEV_QUEUE_TYPE_RX;
	u32 q_cnt = rx ? q_vector->num_rxq : q_vector->num_txq;

	/* Rx and Tx queue arrays are different struct types, so pick the
	 * ->idx inside the loop instead of mixing the pointers.
	 */
	for (u32 i = 0; i < q_cnt; i++)
		netif_queue_set_napi(netdev,
				     rx ? q_vector->rx[i]->idx :
					  q_vector->tx[i]->idx,
				     q_type, napi);
}

idpf_q_set_napi(vport->netdev, q_vector, NETDEV_QUEUE_TYPE_RX, NULL);
idpf_q_set_napi(vport->netdev, q_vector, NETDEV_QUEUE_TYPE_TX, NULL);
...
idpf_q_set_napi(vport->netdev, q_vector, NETDEV_QUEUE_TYPE_RX, &q_vector->napi);
idpf_q_set_napi(vport->netdev, q_vector, NETDEV_QUEUE_TYPE_TX, &q_vector->napi);


up to you if you take it, less lines in the end but i don't have strong
opinion if this should be considered as an improvement or makes code
harder to follow.

>  		kfree(free_irq(irq_num, q_vector));
>  	}
> +
> +	if (unlock)
> +		rtnl_unlock();
>  }
>  
>  /**
> @@ -3760,6 +3778,18 @@ static int idpf_vport_intr_req_irq(struct idpf_vport *vport)
>  				   "Request_irq failed, error: %d\n", err);
>  			goto free_q_irqs;
>  		}
> +
> +		for (u32 i = 0; i < q_vector->num_rxq; i++)
> +			netif_queue_set_napi(vport->netdev,
> +					     q_vector->rx[i]->idx,
> +					     NETDEV_QUEUE_TYPE_RX,
> +					     &q_vector->napi);
> +
> +		for (u32 i = 0; i < q_vector->num_txq; i++)
> +			netif_queue_set_napi(vport->netdev,
> +					     q_vector->tx[i]->idx,
> +					     NETDEV_QUEUE_TYPE_TX,
> +					     &q_vector->napi);
>  	}
>  
>  	return 0;
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 08/16] idpf: make complq cleaning dependent on scheduling mode
  2025-03-05 16:21 ` [PATCH net-next 08/16] idpf: make complq cleaning dependent on scheduling mode Alexander Lobakin
@ 2025-03-07 11:11   ` Maciej Fijalkowski
  2025-03-13 16:16     ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-07 11:11 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:24PM +0100, Alexander Lobakin wrote:
> From: Michal Kubiak <michal.kubiak@intel.com>
> 
> Extend completion queue cleaning function to support queue-based
> scheduling mode needed for XDP queues.
> Add 4-byte descriptor for queue-based scheduling mode and
> perform some refactoring to extract the common code for
> both scheduling modes.
> 
> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  .../net/ethernet/intel/idpf/idpf_lan_txrx.h   |   6 +-
>  drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  11 +-
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c   | 256 +++++++++++-------
>  3 files changed, 177 insertions(+), 96 deletions(-)

some comments inline, i didn't trim though.

> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_lan_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_lan_txrx.h
> index 8c7f8ef8f1a1..7f12c7f2e70e 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_lan_txrx.h
> +++ b/drivers/net/ethernet/intel/idpf/idpf_lan_txrx.h
> @@ -186,13 +186,17 @@ struct idpf_base_tx_desc {
>  	__le64 qw1; /* type_cmd_offset_bsz_l2tag1 */
>  }; /* read used with buffer queues */
>  
> -struct idpf_splitq_tx_compl_desc {
> +struct idpf_splitq_4b_tx_compl_desc {
>  	/* qid=[10:0] comptype=[13:11] rsvd=[14] gen=[15] */
>  	__le16 qid_comptype_gen;
>  	union {
>  		__le16 q_head; /* Queue head */
>  		__le16 compl_tag; /* Completion tag */
>  	} q_head_compl_tag;
> +}; /* writeback used with completion queues */
> +
> +struct idpf_splitq_tx_compl_desc {
> +	struct idpf_splitq_4b_tx_compl_desc common;
>  	u8 ts[3];
>  	u8 rsvd; /* Reserved */
>  }; /* writeback used with completion queues */
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> index b029f566e57c..9f938301b2c5 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> @@ -743,7 +743,9 @@ libeth_cacheline_set_assert(struct idpf_buf_queue, 64, 24, 32);
>  
>  /**
>   * struct idpf_compl_queue - software structure representing a completion queue
> - * @comp: completion descriptor array
> + * @comp: 8-byte completion descriptor array
> + * @comp_4b: 4-byte completion descriptor array
> + * @desc_ring: virtual descriptor ring address
>   * @txq_grp: See struct idpf_txq_group
>   * @flags: See enum idpf_queue_flags_t
>   * @desc_count: Number of descriptors
> @@ -763,7 +765,12 @@ libeth_cacheline_set_assert(struct idpf_buf_queue, 64, 24, 32);
>   */
>  struct idpf_compl_queue {
>  	__cacheline_group_begin_aligned(read_mostly);
> -	struct idpf_splitq_tx_compl_desc *comp;
> +	union {
> +		struct idpf_splitq_tx_compl_desc *comp;
> +		struct idpf_splitq_4b_tx_compl_desc *comp_4b;
> +
> +		void *desc_ring;
> +	};
>  	struct idpf_txq_group *txq_grp;
>  
>  	DECLARE_BITMAP(flags, __IDPF_Q_FLAGS_NBITS);
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> index a3f6e8cff7a0..a240ed115e3e 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> @@ -156,8 +156,8 @@ static void idpf_compl_desc_rel(struct idpf_compl_queue *complq)
>  		return;
>  
>  	dma_free_coherent(complq->netdev->dev.parent, complq->size,
> -			  complq->comp, complq->dma);
> -	complq->comp = NULL;
> +			  complq->desc_ring, complq->dma);
> +	complq->desc_ring = NULL;
>  	complq->next_to_use = 0;
>  	complq->next_to_clean = 0;
>  }
> @@ -284,12 +284,16 @@ static int idpf_tx_desc_alloc(const struct idpf_vport *vport,
>  static int idpf_compl_desc_alloc(const struct idpf_vport *vport,
>  				 struct idpf_compl_queue *complq)
>  {
> -	complq->size = array_size(complq->desc_count, sizeof(*complq->comp));
> +	u32 desc_size;
>  
> -	complq->comp = dma_alloc_coherent(complq->netdev->dev.parent,
> -					  complq->size, &complq->dma,
> -					  GFP_KERNEL);
> -	if (!complq->comp)
> +	desc_size = idpf_queue_has(FLOW_SCH_EN, complq) ?
> +		    sizeof(*complq->comp) : sizeof(*complq->comp_4b);
> +	complq->size = array_size(complq->desc_count, desc_size);
> +
> +	complq->desc_ring = dma_alloc_coherent(complq->netdev->dev.parent,
> +					       complq->size, &complq->dma,
> +					       GFP_KERNEL);
> +	if (!complq->desc_ring)
>  		return -ENOMEM;
>  
>  	complq->next_to_use = 0;
> @@ -1921,8 +1925,46 @@ static bool idpf_tx_clean_buf_ring(struct idpf_tx_queue *txq, u16 compl_tag,
>  }
>  
>  /**
> - * idpf_tx_handle_rs_completion - clean a single packet and all of its buffers
> - * whether on the buffer ring or in the hash table
> + * idpf_parse_compl_desc - Parse the completion descriptor
> + * @desc: completion descriptor to be parsed
> + * @complq: completion queue containing the descriptor
> + * @txq: returns corresponding Tx queue for a given descriptor
> + * @gen_flag: current generation flag in the completion queue
> + *
> + * Return: completion type from descriptor or negative value in case of error:
> + *	   -ENODATA if there is no completion descriptor to be cleaned,
> + *	   -EINVAL if no Tx queue has been found for the completion queue.
> + */
> +static int
> +idpf_parse_compl_desc(const struct idpf_splitq_4b_tx_compl_desc *desc,
> +		      const struct idpf_compl_queue *complq,
> +		      struct idpf_tx_queue **txq, bool gen_flag)
> +{
> +	struct idpf_tx_queue *target;
> +	u32 rel_tx_qid, comptype;
> +
> +	/* if the descriptor isn't done, no work yet to do */
> +	comptype = le16_to_cpu(desc->qid_comptype_gen);
> +	if (!!(comptype & IDPF_TXD_COMPLQ_GEN_M) != gen_flag)
> +		return -ENODATA;
> +
> +	/* Find necessary info of TX queue to clean buffers */
> +	rel_tx_qid = FIELD_GET(IDPF_TXD_COMPLQ_QID_M, comptype);
> +	target = likely(rel_tx_qid < complq->txq_grp->num_txq) ?
> +		 complq->txq_grp->txqs[rel_tx_qid] : NULL;
> +
> +	if (!target)
> +		return -EINVAL;
> +
> +	*txq = target;
> +
> +	/* Determine completion type */
> +	return FIELD_GET(IDPF_TXD_COMPLQ_COMPL_TYPE_M, comptype);
> +}
> +
> +/**
> + * idpf_tx_handle_rs_cmpl_qb - clean a single packet and all of its buffers
> + * whether the Tx queue is working in queue-based scheduling
>   * @txq: Tx ring to clean
>   * @desc: pointer to completion queue descriptor to extract completion
>   * information from
> @@ -1931,21 +1973,33 @@ static bool idpf_tx_clean_buf_ring(struct idpf_tx_queue *txq, u16 compl_tag,
>   *
>   * Returns bytes/packets cleaned
>   */
> -static void idpf_tx_handle_rs_completion(struct idpf_tx_queue *txq,
> -					 struct idpf_splitq_tx_compl_desc *desc,
> -					 struct libeth_sq_napi_stats *cleaned,
> -					 int budget)
> +static void
> +idpf_tx_handle_rs_cmpl_qb(struct idpf_tx_queue *txq,
> +			  const struct idpf_splitq_4b_tx_compl_desc *desc,
> +			  struct libeth_sq_napi_stats *cleaned, int budget)
>  {
> -	u16 compl_tag;
> +	u16 head = le16_to_cpu(desc->q_head_compl_tag.q_head);
>  
> -	if (!idpf_queue_has(FLOW_SCH_EN, txq)) {
> -		u16 head = le16_to_cpu(desc->q_head_compl_tag.q_head);
> -
> -		idpf_tx_splitq_clean(txq, head, budget, cleaned, false);
> -		return;
> -	}
> +	idpf_tx_splitq_clean(txq, head, budget, cleaned, false);
> +}
>  
> -	compl_tag = le16_to_cpu(desc->q_head_compl_tag.compl_tag);
> +/**
> + * idpf_tx_handle_rs_cmpl_fb - clean a single packet and all of its buffers
> + * whether on the buffer ring or in the hash table (flow-based scheduling only)
> + * @txq: Tx ring to clean
> + * @desc: pointer to completion queue descriptor to extract completion
> + * information from
> + * @cleaned: pointer to stats struct to track cleaned packets/bytes
> + * @budget: Used to determine if we are in netpoll
> + *
> + * Returns bytes/packets cleaned
> + */
> +static void
> +idpf_tx_handle_rs_cmpl_fb(struct idpf_tx_queue *txq,
> +			  const struct idpf_splitq_4b_tx_compl_desc *desc,
> +			  struct libeth_sq_napi_stats *cleaned, int budget)
> +{
> +	u16 compl_tag = le16_to_cpu(desc->q_head_compl_tag.compl_tag);
>  
>  	/* If we didn't clean anything on the ring, this packet must be
>  	 * in the hash table. Go clean it there.
> @@ -1954,6 +2008,61 @@ static void idpf_tx_handle_rs_completion(struct idpf_tx_queue *txq,
>  		idpf_tx_clean_stashed_bufs(txq, compl_tag, cleaned, budget);
>  }
>  
> +/**
> + * idpf_tx_finalize_complq - Finalize completion queue cleaning
> + * @complq: completion queue to finalize
> + * @ntc: next to complete index
> + * @gen_flag: current state of generation flag
> + * @cleaned: returns number of packets cleaned
> + */
> +static void idpf_tx_finalize_complq(struct idpf_compl_queue *complq, int ntc,
> +				    bool gen_flag, int *cleaned)
> +{
> +	struct idpf_netdev_priv *np;
> +	bool complq_ok = true;
> +	int i;
> +
> +	/* Store the state of the complq to be used later in deciding if a
> +	 * TXQ can be started again
> +	 */
> +	if (unlikely(IDPF_TX_COMPLQ_PENDING(complq->txq_grp) >
> +		     IDPF_TX_COMPLQ_OVERFLOW_THRESH(complq)))
> +		complq_ok = false;
> +
> +	np = netdev_priv(complq->netdev);
> +	for (i = 0; i < complq->txq_grp->num_txq; ++i) {

All of your new code tends to scope the iterators within the loop; would be
good to stay consistent maybe?

Also, looks like
	struct idpf_txq_group *txq_grp = complq->txq_grp;
would be handy in this function.

> +		struct idpf_tx_queue *tx_q = complq->txq_grp->txqs[i];
> +		struct netdev_queue *nq;
> +		bool dont_wake;
> +
> +		/* We didn't clean anything on this queue, move along */
> +		if (!tx_q->cleaned_bytes)
> +			continue;
> +
> +		*cleaned += tx_q->cleaned_pkts;
> +
> +		/* Update BQL */
> +		nq = netdev_get_tx_queue(tx_q->netdev, tx_q->idx);
> +
> +		dont_wake = !complq_ok || IDPF_TX_BUF_RSV_LOW(tx_q) ||
> +			    np->state != __IDPF_VPORT_UP ||
> +			    !netif_carrier_ok(tx_q->netdev);
> +		/* Check if the TXQ needs to and can be restarted */
> +		__netif_txq_completed_wake(nq, tx_q->cleaned_pkts, tx_q->cleaned_bytes,
> +					   IDPF_DESC_UNUSED(tx_q), IDPF_TX_WAKE_THRESH,
> +					   dont_wake);
> +
> +		/* Reset cleaned stats for the next time this queue is
> +		 * cleaned
> +		 */
> +		tx_q->cleaned_bytes = 0;
> +		tx_q->cleaned_pkts = 0;
> +	}
> +
> +	complq->next_to_clean = ntc + complq->desc_count;

don't you have to handle the >= count case?

> +	idpf_queue_assign(GEN_CHK, complq, gen_flag);
> +}
> +
>  /**
>   * idpf_tx_clean_complq - Reclaim resources on completion queue
>   * @complq: Tx ring to clean
> @@ -1965,60 +2074,56 @@ static void idpf_tx_handle_rs_completion(struct idpf_tx_queue *txq,
>  static bool idpf_tx_clean_complq(struct idpf_compl_queue *complq, int budget,
>  				 int *cleaned)
>  {
> -	struct idpf_splitq_tx_compl_desc *tx_desc;
> +	struct idpf_splitq_4b_tx_compl_desc *tx_desc;
>  	s16 ntc = complq->next_to_clean;
> -	struct idpf_netdev_priv *np;
>  	unsigned int complq_budget;
> -	bool complq_ok = true;
> -	int i;
> +	bool flow, gen_flag;
> +	u32 pos = ntc;
> +
> +	flow = idpf_queue_has(FLOW_SCH_EN, complq);
> +	gen_flag = idpf_queue_has(GEN_CHK, complq);
>  
>  	complq_budget = complq->clean_budget;
> -	tx_desc = &complq->comp[ntc];
> +	tx_desc = flow ? &complq->comp[pos].common : &complq->comp_4b[pos];
>  	ntc -= complq->desc_count;
>  
>  	do {
>  		struct libeth_sq_napi_stats cleaned_stats = { };
>  		struct idpf_tx_queue *tx_q;
> -		int rel_tx_qid;
>  		u16 hw_head;
> -		u8 ctype;	/* completion type */
> -		u16 gen;
> -
> -		/* if the descriptor isn't done, no work yet to do */
> -		gen = le16_get_bits(tx_desc->qid_comptype_gen,
> -				    IDPF_TXD_COMPLQ_GEN_M);
> -		if (idpf_queue_has(GEN_CHK, complq) != gen)
> -			break;
> -
> -		/* Find necessary info of TX queue to clean buffers */
> -		rel_tx_qid = le16_get_bits(tx_desc->qid_comptype_gen,
> -					   IDPF_TXD_COMPLQ_QID_M);
> -		if (rel_tx_qid >= complq->txq_grp->num_txq ||
> -		    !complq->txq_grp->txqs[rel_tx_qid]) {
> -			netdev_err(complq->netdev, "TxQ not found\n");
> -			goto fetch_next_desc;
> -		}
> -		tx_q = complq->txq_grp->txqs[rel_tx_qid];
> +		int ctype;
>  
> -		/* Determine completion type */
> -		ctype = le16_get_bits(tx_desc->qid_comptype_gen,
> -				      IDPF_TXD_COMPLQ_COMPL_TYPE_M);
> +		ctype = idpf_parse_compl_desc(tx_desc, complq, &tx_q,
> +					      gen_flag);
>  		switch (ctype) {
>  		case IDPF_TXD_COMPLT_RE:
> +			if (unlikely(!flow))
> +				goto fetch_next_desc;
> +
>  			hw_head = le16_to_cpu(tx_desc->q_head_compl_tag.q_head);
>  
>  			idpf_tx_splitq_clean(tx_q, hw_head, budget,
>  					     &cleaned_stats, true);
>  			break;
>  		case IDPF_TXD_COMPLT_RS:
> -			idpf_tx_handle_rs_completion(tx_q, tx_desc,
> -						     &cleaned_stats, budget);
> +			if (flow)
> +				idpf_tx_handle_rs_cmpl_fb(tx_q, tx_desc,
> +							  &cleaned_stats,
> +							  budget);
> +			else
> +				idpf_tx_handle_rs_cmpl_qb(tx_q, tx_desc,

I'd rather have 'queue' and 'flow' spelled out in these function names; they
differ by a single char and take the same args on input, so it's an eye
exercise to follow this. However, nothing better comes to my mind now.

> +							  &cleaned_stats,
> +							  budget);
>  			break;
>  		case IDPF_TXD_COMPLT_SW_MARKER:
>  			idpf_tx_handle_sw_marker(tx_q);
>  			break;
> +		case -ENODATA:
> +			goto exit_clean_complq;
> +		case -EINVAL:
> +			goto fetch_next_desc;
>  		default:
> -			netdev_err(tx_q->netdev,
> +			netdev_err(complq->netdev,
>  				   "Unknown TX completion type: %d\n", ctype);
>  			goto fetch_next_desc;
>  		}
> @@ -2032,59 +2137,24 @@ static bool idpf_tx_clean_complq(struct idpf_compl_queue *complq, int budget,
>  		u64_stats_update_end(&tx_q->stats_sync);
>  
>  fetch_next_desc:
> -		tx_desc++;
> +		pos++;
>  		ntc++;
>  		if (unlikely(!ntc)) {
>  			ntc -= complq->desc_count;
> -			tx_desc = &complq->comp[0];
> -			idpf_queue_change(GEN_CHK, complq);
> +			pos = 0;
> +			gen_flag = !gen_flag;
>  		}
>  
> +		tx_desc = flow ? &complq->comp[pos].common :
> +			  &complq->comp_4b[pos];
>  		prefetch(tx_desc);
>  
>  		/* update budget accounting */
>  		complq_budget--;
>  	} while (likely(complq_budget));
>  
> -	/* Store the state of the complq to be used later in deciding if a
> -	 * TXQ can be started again
> -	 */
> -	if (unlikely(IDPF_TX_COMPLQ_PENDING(complq->txq_grp) >
> -		     IDPF_TX_COMPLQ_OVERFLOW_THRESH(complq)))
> -		complq_ok = false;
> -
> -	np = netdev_priv(complq->netdev);
> -	for (i = 0; i < complq->txq_grp->num_txq; ++i) {
> -		struct idpf_tx_queue *tx_q = complq->txq_grp->txqs[i];
> -		struct netdev_queue *nq;
> -		bool dont_wake;
> -
> -		/* We didn't clean anything on this queue, move along */
> -		if (!tx_q->cleaned_bytes)
> -			continue;
> -
> -		*cleaned += tx_q->cleaned_pkts;
> -
> -		/* Update BQL */
> -		nq = netdev_get_tx_queue(tx_q->netdev, tx_q->idx);
> -
> -		dont_wake = !complq_ok || IDPF_TX_BUF_RSV_LOW(tx_q) ||
> -			    np->state != __IDPF_VPORT_UP ||
> -			    !netif_carrier_ok(tx_q->netdev);
> -		/* Check if the TXQ needs to and can be restarted */
> -		__netif_txq_completed_wake(nq, tx_q->cleaned_pkts, tx_q->cleaned_bytes,
> -					   IDPF_DESC_UNUSED(tx_q), IDPF_TX_WAKE_THRESH,
> -					   dont_wake);
> -
> -		/* Reset cleaned stats for the next time this queue is
> -		 * cleaned
> -		 */
> -		tx_q->cleaned_bytes = 0;
> -		tx_q->cleaned_pkts = 0;
> -	}
> -
> -	ntc += complq->desc_count;
> -	complq->next_to_clean = ntc;
> +exit_clean_complq:
> +	idpf_tx_finalize_complq(complq, ntc, gen_flag, cleaned);
>  
>  	return !!complq_budget;
>  }
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread
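
For context when following the ntc arithmetic in this function, a generic
sketch of the offset-index idiom the cleaning loop uses (types, helper name
and the unlikely() usage are illustrative, not idpf code):

/* ntc is kept as "real index - desc_count", i.e. a negative value in
 * [-desc_count, -1]. The wrap test is then a cheap "!ntc", and the real
 * ring index is recovered as ntc + desc_count, which by construction
 * stays within [0, desc_count).
 */
static u16 advance_ntc(s16 *ntc, u16 desc_count, bool *gen_flag)
{
	if (unlikely(!++(*ntc))) {
		*ntc -= desc_count;
		*gen_flag = !*gen_flag;
	}

	return *ntc + desc_count;	/* real next_to_clean */
}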

* Re: [PATCH net-next 09/16] idpf: remove SW marker handling from NAPI
  2025-03-05 16:21 ` [PATCH net-next 09/16] idpf: remove SW marker handling from NAPI Alexander Lobakin
@ 2025-03-07 11:42   ` Maciej Fijalkowski
  2025-03-13 16:50     ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-07 11:42 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:25PM +0100, Alexander Lobakin wrote:
> From: Michal Kubiak <michal.kubiak@intel.com>
> 
> SW marker descriptors on completion queues are used only when a queue
> is about to be destroyed. It's far from hotpath and handling it in the
> hotpath NAPI poll makes no sense.
> Instead, run a simple poller after a virtchnl message for destroying
> the queue is sent and wait for the replies. If replies for all of the
> queues are received, this means the synchronization is done correctly
> and we can go forth with stopping the link.
> 
> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/idpf.h        |   7 +-
>  drivers/net/ethernet/intel/idpf/idpf_txrx.h   |   4 +-
>  drivers/net/ethernet/intel/idpf/idpf_lib.c    |   2 -
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c   | 108 +++++++++++-------
>  .../net/ethernet/intel/idpf/idpf_virtchnl.c   |  34 ++----
>  5 files changed, 80 insertions(+), 75 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf.h b/drivers/net/ethernet/intel/idpf/idpf.h
> index 66544faab710..6b51a5dcc1e0 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf.h
> +++ b/drivers/net/ethernet/intel/idpf/idpf.h
> @@ -36,6 +36,7 @@ struct idpf_vport_max_q;
>  #define IDPF_NUM_CHUNKS_PER_MSG(struct_sz, chunk_sz)	\
>  	((IDPF_CTLQ_MAX_BUF_LEN - (struct_sz)) / (chunk_sz))
>  
> +#define IDPF_WAIT_FOR_MARKER_TIMEO	500
>  #define IDPF_MAX_WAIT			500
>  
>  /* available message levels */
> @@ -224,13 +225,10 @@ enum idpf_vport_reset_cause {
>  /**
>   * enum idpf_vport_flags - Vport flags
>   * @IDPF_VPORT_DEL_QUEUES: To send delete queues message
> - * @IDPF_VPORT_SW_MARKER: Indicate TX pipe drain software marker packets
> - *			  processing is done
>   * @IDPF_VPORT_FLAGS_NBITS: Must be last
>   */
>  enum idpf_vport_flags {
>  	IDPF_VPORT_DEL_QUEUES,
> -	IDPF_VPORT_SW_MARKER,
>  	IDPF_VPORT_FLAGS_NBITS,
>  };
>  
> @@ -289,7 +287,6 @@ struct idpf_port_stats {
>   * @tx_itr_profile: TX profiles for Dynamic Interrupt Moderation
>   * @port_stats: per port csum, header split, and other offload stats
>   * @link_up: True if link is up
> - * @sw_marker_wq: workqueue for marker packets
>   */
>  struct idpf_vport {
>  	u16 num_txq;
> @@ -332,8 +329,6 @@ struct idpf_vport {
>  	struct idpf_port_stats port_stats;
>  
>  	bool link_up;
> -
> -	wait_queue_head_t sw_marker_wq;
>  };
>  
>  /**
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> index 9f938301b2c5..dd6cc3b5cdab 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> @@ -286,7 +286,6 @@ struct idpf_ptype_state {
>   *			  bit and Q_RFL_GEN is the SW bit.
>   * @__IDPF_Q_FLOW_SCH_EN: Enable flow scheduling
>   * @__IDPF_Q_SW_MARKER: Used to indicate TX queue marker completions
> - * @__IDPF_Q_POLL_MODE: Enable poll mode
>   * @__IDPF_Q_CRC_EN: enable CRC offload in singleq mode
>   * @__IDPF_Q_HSPLIT_EN: enable header split on Rx (splitq)
>   * @__IDPF_Q_FLAGS_NBITS: Must be last
> @@ -296,7 +295,6 @@ enum idpf_queue_flags_t {
>  	__IDPF_Q_RFL_GEN_CHK,
>  	__IDPF_Q_FLOW_SCH_EN,
>  	__IDPF_Q_SW_MARKER,
> -	__IDPF_Q_POLL_MODE,
>  	__IDPF_Q_CRC_EN,
>  	__IDPF_Q_HSPLIT_EN,
>  
> @@ -1044,6 +1042,8 @@ bool idpf_rx_singleq_buf_hw_alloc_all(struct idpf_rx_queue *rxq,
>  				      u16 cleaned_count);
>  int idpf_tso(struct sk_buff *skb, struct idpf_tx_offload_params *off);
>  
> +void idpf_wait_for_sw_marker_completion(struct idpf_tx_queue *txq);
> +
>  static inline bool idpf_tx_maybe_stop_common(struct idpf_tx_queue *tx_q,
>  					     u32 needed)
>  {
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
> index f3aea7bcdaa3..e17582d15e27 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
> @@ -1501,8 +1501,6 @@ void idpf_init_task(struct work_struct *work)
>  	index = vport->idx;
>  	vport_config = adapter->vport_config[index];
>  
> -	init_waitqueue_head(&vport->sw_marker_wq);
> -
>  	spin_lock_init(&vport_config->mac_filter_list_lock);
>  
>  	INIT_LIST_HEAD(&vport_config->user_config.mac_filter_list);
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> index a240ed115e3e..4e3de6031422 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> @@ -1626,32 +1626,6 @@ int idpf_vport_queues_alloc(struct idpf_vport *vport)
>  	return err;
>  }
>  
> -/**
> - * idpf_tx_handle_sw_marker - Handle queue marker packet
> - * @tx_q: tx queue to handle software marker
> - */
> -static void idpf_tx_handle_sw_marker(struct idpf_tx_queue *tx_q)
> -{
> -	struct idpf_netdev_priv *priv = netdev_priv(tx_q->netdev);
> -	struct idpf_vport *vport = priv->vport;
> -	int i;
> -
> -	idpf_queue_clear(SW_MARKER, tx_q);
> -	/* Hardware must write marker packets to all queues associated with
> -	 * completion queues. So check if all queues received marker packets
> -	 */
> -	for (i = 0; i < vport->num_txq; i++)
> -		/* If we're still waiting on any other TXQ marker completions,
> -		 * just return now since we cannot wake up the marker_wq yet.
> -		 */
> -		if (idpf_queue_has(SW_MARKER, vport->txqs[i]))
> -			return;
> -
> -	/* Drain complete */
> -	set_bit(IDPF_VPORT_SW_MARKER, vport->flags);
> -	wake_up(&vport->sw_marker_wq);
> -}
> -
>  /**
>   * idpf_tx_clean_stashed_bufs - clean bufs that were stored for
>   * out of order completions
> @@ -2008,6 +1982,19 @@ idpf_tx_handle_rs_cmpl_fb(struct idpf_tx_queue *txq,
>  		idpf_tx_clean_stashed_bufs(txq, compl_tag, cleaned, budget);
>  }
>  
> +/**
> + * idpf_tx_update_complq_indexes - update completion queue indexes
> + * @complq: completion queue being updated
> + * @ntc: current "next to clean" index value
> + * @gen_flag: current "generation" flag value
> + */
> +static void idpf_tx_update_complq_indexes(struct idpf_compl_queue *complq,
> +					  int ntc, bool gen_flag)
> +{
> +	complq->next_to_clean = ntc + complq->desc_count;
> +	idpf_queue_assign(GEN_CHK, complq, gen_flag);
> +}
> +
>  /**
>   * idpf_tx_finalize_complq - Finalize completion queue cleaning
>   * @complq: completion queue to finalize
> @@ -2059,8 +2046,7 @@ static void idpf_tx_finalize_complq(struct idpf_compl_queue *complq, int ntc,
>  		tx_q->cleaned_pkts = 0;
>  	}
>  
> -	complq->next_to_clean = ntc + complq->desc_count;
> -	idpf_queue_assign(GEN_CHK, complq, gen_flag);
> +	idpf_tx_update_complq_indexes(complq, ntc, gen_flag);
>  }
>  
>  /**
> @@ -2115,9 +2101,6 @@ static bool idpf_tx_clean_complq(struct idpf_compl_queue *complq, int budget,
>  							  &cleaned_stats,
>  							  budget);
>  			break;
> -		case IDPF_TXD_COMPLT_SW_MARKER:
> -			idpf_tx_handle_sw_marker(tx_q);
> -			break;
>  		case -ENODATA:
>  			goto exit_clean_complq;
>  		case -EINVAL:
> @@ -2159,6 +2142,59 @@ static bool idpf_tx_clean_complq(struct idpf_compl_queue *complq, int budget,
>  	return !!complq_budget;
>  }
>  
> +/**
> + * idpf_wait_for_sw_marker_completion - wait for SW marker of disabled Tx queue
> + * @txq: disabled Tx queue
> + */
> +void idpf_wait_for_sw_marker_completion(struct idpf_tx_queue *txq)
> +{
> +	struct idpf_compl_queue *complq = txq->txq_grp->complq;
> +	struct idpf_splitq_4b_tx_compl_desc *tx_desc;
> +	s16 ntc = complq->next_to_clean;
> +	unsigned long timeout;
> +	bool flow, gen_flag;
> +	u32 pos = ntc;
> +
> +	if (!idpf_queue_has(SW_MARKER, txq))
> +		return;
> +
> +	flow = idpf_queue_has(FLOW_SCH_EN, complq);
> +	gen_flag = idpf_queue_has(GEN_CHK, complq);
> +
> +	timeout = jiffies + msecs_to_jiffies(IDPF_WAIT_FOR_MARKER_TIMEO);
> +	tx_desc = flow ? &complq->comp[pos].common : &complq->comp_4b[pos];
> +	ntc -= complq->desc_count;

could we drop this logic? It was introduced back in the day because the
comparison against 0 for the wrap case was faster, but here, as you said, it
doesn't have much in common with the hot path.

> +
> +	do {
> +		struct idpf_tx_queue *tx_q;
> +		int ctype;
> +
> +		ctype = idpf_parse_compl_desc(tx_desc, complq, &tx_q,
> +					      gen_flag);
> +		if (ctype == IDPF_TXD_COMPLT_SW_MARKER) {
> +			idpf_queue_clear(SW_MARKER, tx_q);
> +			if (txq == tx_q)
> +				break;
> +		} else if (ctype == -ENODATA) {
> +			usleep_range(500, 1000);
> +			continue;
> +		}
> +
> +		pos++;
> +		ntc++;
> +		if (unlikely(!ntc)) {
> +			ntc -= complq->desc_count;
> +			pos = 0;
> +			gen_flag = !gen_flag;
> +		}
> +
> +		tx_desc = flow ? &complq->comp[pos].common :
> +			  &complq->comp_4b[pos];
> +		prefetch(tx_desc);
> +	} while (time_before(jiffies, timeout));

what if the timeout expires and you didn't find the marker desc? Why do you
need a timer? Couldn't you scan the whole ring instead?

> +
> +	idpf_tx_update_complq_indexes(complq, ntc, gen_flag);
> +}
>  /**
>   * idpf_tx_splitq_build_ctb - populate command tag and size for queue
>   * based scheduling descriptors
> @@ -4130,15 +4166,7 @@ static int idpf_vport_splitq_napi_poll(struct napi_struct *napi, int budget)
>  	else
>  		idpf_vport_intr_set_wb_on_itr(q_vector);
>  
> -	/* Switch to poll mode in the tear-down path after sending disable
> -	 * queues virtchnl message, as the interrupts will be disabled after
> -	 * that
> -	 */
> -	if (unlikely(q_vector->num_txq && idpf_queue_has(POLL_MODE,
> -							 q_vector->tx[0])))
> -		return budget;
> -	else
> -		return work_done;
> +	return work_done;
>  }
>  
>  /**
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
> index 135af3cc243f..24495e4d6c78 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
> @@ -752,21 +752,17 @@ int idpf_recv_mb_msg(struct idpf_adapter *adapter)
>   **/
>  static int idpf_wait_for_marker_event(struct idpf_vport *vport)
>  {
> -	int event;
> -	int i;
> -
> -	for (i = 0; i < vport->num_txq; i++)
> -		idpf_queue_set(SW_MARKER, vport->txqs[i]);
> +	bool markers_rcvd = true;
>  
> -	event = wait_event_timeout(vport->sw_marker_wq,
> -				   test_and_clear_bit(IDPF_VPORT_SW_MARKER,
> -						      vport->flags),
> -				   msecs_to_jiffies(500));
> +	for (u32 i = 0; i < vport->num_txq; i++) {
> +		struct idpf_tx_queue *txq = vport->txqs[i];
>  
> -	for (i = 0; i < vport->num_txq; i++)
> -		idpf_queue_clear(POLL_MODE, vport->txqs[i]);
> +		idpf_queue_set(SW_MARKER, txq);
> +		idpf_wait_for_sw_marker_completion(txq);
> +		markers_rcvd &= !idpf_queue_has(SW_MARKER, txq);
> +	}
>  
> -	if (event)
> +	if (markers_rcvd)
>  		return 0;
>  
>  	dev_warn(&vport->adapter->pdev->dev, "Failed to receive marker packets\n");
> @@ -1993,24 +1989,12 @@ int idpf_send_enable_queues_msg(struct idpf_vport *vport)
>   */
>  int idpf_send_disable_queues_msg(struct idpf_vport *vport)
>  {
> -	int err, i;
> +	int err;
>  
>  	err = idpf_send_ena_dis_queues_msg(vport, false);
>  	if (err)
>  		return err;
>  
> -	/* switch to poll mode as interrupts will be disabled after disable
> -	 * queues virtchnl message is sent
> -	 */
> -	for (i = 0; i < vport->num_txq; i++)
> -		idpf_queue_set(POLL_MODE, vport->txqs[i]);
> -
> -	/* schedule the napi to receive all the marker packets */
> -	local_bh_disable();
> -	for (i = 0; i < vport->num_q_vectors; i++)
> -		napi_schedule(&vport->q_vectors[i].napi);
> -	local_bh_enable();
> -
>  	return idpf_wait_for_marker_event(vport);
>  }
>  
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 10/16] idpf: add support for nointerrupt queues
  2025-03-05 16:21 ` [PATCH net-next 10/16] idpf: add support for nointerrupt queues Alexander Lobakin
@ 2025-03-07 12:10   ` Maciej Fijalkowski
  2025-03-13 16:19     ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-07 12:10 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:26PM +0100, Alexander Lobakin wrote:
> Currently, queues are associated 1:1 with interrupt vectors, as it's
> assumed queues are always interrupt-driven.
> In order to use a queue without an interrupt, idpf still needs a vector
> assigned to it to flush descriptors. This vector can be global, a single
> one for the whole vport handling all of its noirq queues.
> Always request one extra vector and configure it in non-interrupt mode
> right away when creating a vport, so that it can later be used by queues
> when needed.

The description sort of misses the purpose of this commit: you never
mention that your design choice for XDP Tx queues is to have them
irq-less.

> 
> Co-developed-by: Michal Kubiak <michal.kubiak@intel.com>
> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/idpf.h        |  8 +++
>  drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  4 ++
>  drivers/net/ethernet/intel/idpf/idpf_dev.c    | 11 +++-
>  drivers/net/ethernet/intel/idpf/idpf_lib.c    |  2 +-
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c   |  8 +++
>  drivers/net/ethernet/intel/idpf/idpf_vf_dev.c | 11 +++-
>  .../net/ethernet/intel/idpf/idpf_virtchnl.c   | 53 +++++++++++++------
>  7 files changed, 79 insertions(+), 18 deletions(-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 11/16] idpf: prepare structures to support XDP
  2025-03-05 16:21 ` [PATCH net-next 11/16] idpf: prepare structures to support XDP Alexander Lobakin
  2025-03-07  1:12   ` Jakub Kicinski
@ 2025-03-07 13:27   ` Maciej Fijalkowski
  2025-03-17 14:50     ` Alexander Lobakin
  1 sibling, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-07 13:27 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:27PM +0100, Alexander Lobakin wrote:
> From: Michal Kubiak <michal.kubiak@intel.com>
> 
> Extend basic structures of the driver (e.g. 'idpf_vport', 'idpf_*_queue',
> 'idpf_vport_user_config_data') by adding members necessary to support XDP.
> Add extra XDP Tx queues needed to support XDP_TX and XDP_REDIRECT actions
> without interfering with regular Tx traffic.
> Also add functions dedicated to XDP initialization for Rx and Tx queues
> and call those functions from the existing queue configuration paths.
> 
> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
> Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/Kconfig       |   2 +-
>  drivers/net/ethernet/intel/idpf/Makefile      |   2 +
>  drivers/net/ethernet/intel/idpf/idpf.h        |  20 ++
>  drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  86 ++++++--
>  drivers/net/ethernet/intel/idpf/xdp.h         |  17 ++
>  .../net/ethernet/intel/idpf/idpf_ethtool.c    |   6 +-
>  drivers/net/ethernet/intel/idpf/idpf_lib.c    |  21 +-
>  drivers/net/ethernet/intel/idpf/idpf_main.c   |   1 +
>  .../ethernet/intel/idpf/idpf_singleq_txrx.c   |   8 +-
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c   | 109 +++++++---
>  .../net/ethernet/intel/idpf/idpf_virtchnl.c   |  26 +--
>  drivers/net/ethernet/intel/idpf/xdp.c         | 189 ++++++++++++++++++
>  12 files changed, 415 insertions(+), 72 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/idpf/xdp.h
>  create mode 100644 drivers/net/ethernet/intel/idpf/xdp.c
> 
> diff --git a/drivers/net/ethernet/intel/idpf/Kconfig b/drivers/net/ethernet/intel/idpf/Kconfig
> index 1addd663acad..7207ee4dbae8 100644
> --- a/drivers/net/ethernet/intel/idpf/Kconfig
> +++ b/drivers/net/ethernet/intel/idpf/Kconfig
> @@ -5,7 +5,7 @@ config IDPF
>  	tristate "Intel(R) Infrastructure Data Path Function Support"
>  	depends on PCI_MSI
>  	select DIMLIB
> -	select LIBETH
> +	select LIBETH_XDP
>  	help
>  	  This driver supports Intel(R) Infrastructure Data Path Function
>  	  devices.
> diff --git a/drivers/net/ethernet/intel/idpf/Makefile b/drivers/net/ethernet/intel/idpf/Makefile
> index 2ce01a0b5898..c58abe6f8f5d 100644
> --- a/drivers/net/ethernet/intel/idpf/Makefile
> +++ b/drivers/net/ethernet/intel/idpf/Makefile
> @@ -17,3 +17,5 @@ idpf-y := \
>  	idpf_vf_dev.o
>  
>  idpf-$(CONFIG_IDPF_SINGLEQ)	+= idpf_singleq_txrx.o
> +
> +idpf-y				+= xdp.o
> diff --git a/drivers/net/ethernet/intel/idpf/idpf.h b/drivers/net/ethernet/intel/idpf/idpf.h
> index 50dde09c525b..4847760744ff 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf.h
> +++ b/drivers/net/ethernet/intel/idpf/idpf.h
> @@ -257,6 +257,10 @@ struct idpf_port_stats {
>   * @txq_model: Split queue or single queue queuing model
>   * @txqs: Used only in hotpath to get to the right queue very fast
>   * @crc_enable: Enable CRC insertion offload
> + * @xdpq_share: whether XDPSQ sharing is enabled
> + * @num_xdp_txq: number of XDPSQs
> + * @xdp_txq_offset: index of the first XDPSQ (== number of regular SQs)
> + * @xdp_prog: installed XDP program
>   * @num_rxq: Number of allocated RX queues
>   * @num_bufq: Number of allocated buffer queues
>   * @rxq_desc_count: RX queue descriptor count. *MUST* have enough descriptors
> @@ -303,6 +307,11 @@ struct idpf_vport {
>  	struct idpf_tx_queue **txqs;
>  	bool crc_enable;
>  
> +	bool xdpq_share;
> +	u16 num_xdp_txq;
> +	u16 xdp_txq_offset;
> +	struct bpf_prog *xdp_prog;
> +
>  	u16 num_rxq;
>  	u16 num_bufq;
>  	u32 rxq_desc_count;
> @@ -380,6 +389,7 @@ struct idpf_rss_data {
>   *		      ethtool
>   * @num_req_rxq_desc: Number of user requested RX queue descriptors through
>   *		      ethtool
> + * @xdp_prog: requested XDP program to install
>   * @user_flags: User toggled config flags
>   * @mac_filter_list: List of MAC filters
>   *
> @@ -391,6 +401,7 @@ struct idpf_vport_user_config_data {
>  	u16 num_req_rx_qs;
>  	u32 num_req_txq_desc;
>  	u32 num_req_rxq_desc;
> +	struct bpf_prog *xdp_prog;
>  	DECLARE_BITMAP(user_flags, __IDPF_USER_FLAGS_NBITS);
>  	struct list_head mac_filter_list;
>  };
> @@ -604,6 +615,15 @@ static inline int idpf_is_queue_model_split(u16 q_model)
>  	       q_model == VIRTCHNL2_QUEUE_MODEL_SPLIT;
>  }
>  
> +/**
> + * idpf_xdp_is_prog_ena - check if there is an XDP program on adapter
> + * @vport: vport to check
> + */
> +static inline bool idpf_xdp_is_prog_ena(const struct idpf_vport *vport)
> +{
> +	return vport->adapter && vport->xdp_prog;
> +}

(...)

> +
> +#endif /* _IDPF_XDP_H_ */
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
> index 59b1a1a09996..1ca322bfe92f 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
> @@ -186,9 +186,11 @@ static void idpf_get_channels(struct net_device *netdev,
>  {
>  	struct idpf_netdev_priv *np = netdev_priv(netdev);
>  	struct idpf_vport_config *vport_config;
> +	const struct idpf_vport *vport;
>  	u16 num_txq, num_rxq;
>  	u16 combined;
>  
> +	vport = idpf_netdev_to_vport(netdev);
>  	vport_config = np->adapter->vport_config[np->vport_idx];
>  
>  	num_txq = vport_config->user_config.num_req_tx_qs;
> @@ -202,8 +204,8 @@ static void idpf_get_channels(struct net_device *netdev,
>  	ch->max_rx = vport_config->max_q.max_rxq;
>  	ch->max_tx = vport_config->max_q.max_txq;
>  
> -	ch->max_other = IDPF_MAX_MBXQ;
> -	ch->other_count = IDPF_MAX_MBXQ;
> +	ch->max_other = IDPF_MAX_MBXQ + vport->num_xdp_txq;
> +	ch->other_count = IDPF_MAX_MBXQ + vport->num_xdp_txq;

That's new, I think. Do you explain somewhere that `other` will carry the
xdpq count? Otherwise how would I know how to interpret this value?

Also, from what I see, num_txq carries the (txq + xdpq) count. How does
that affect the `combined` from ethtool_channels?

>  
>  	ch->combined_count = combined;
>  	ch->rx_count = num_rxq - combined;
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
> index 2594ca38e8ca..0f4edc9cd1ad 100644

(...)

> +
> +/**
> + * __idpf_xdp_rxq_info_init - Setup XDP RxQ info for a given Rx queue
> + * @rxq: Rx queue for which the resources are setup
> + * @arg: flag indicating if the HW works in split queue mode
> + *
> + * Return: 0 on success, negative on failure.
> + */
> +static int __idpf_xdp_rxq_info_init(struct idpf_rx_queue *rxq, void *arg)
> +{
> +	const struct idpf_vport *vport = rxq->q_vector->vport;
> +	bool split = idpf_is_queue_model_split(vport->rxq_model);
> +	const struct page_pool *pp;
> +	int err;
> +
> +	err = __xdp_rxq_info_reg(&rxq->xdp_rxq, vport->netdev, rxq->idx,
> +				 rxq->q_vector->napi.napi_id,
> +				 rxq->rx_buf_size);
> +	if (err)
> +		return err;
> +
> +	pp = split ? rxq->bufq_sets[0].bufq.pp : rxq->pp;
> +	xdp_rxq_info_attach_page_pool(&rxq->xdp_rxq, pp);
> +
> +	if (!split)
> +		return 0;

why do you care about the splitq model here if in the next patch you
don't allow XDP_SETUP_PROG for anything but splitq?

> +
> +	rxq->xdpqs = &vport->txqs[vport->xdp_txq_offset];
> +	rxq->num_xdp_txq = vport->num_xdp_txq;
> +
> +	return 0;
> +}
> +
> +/**
> + * idpf_xdp_rxq_info_init_all - initialize RxQ info for all Rx queues in vport
> + * @vport: vport to setup the info
> + *
> + * Return: 0 on success, negative on failure.
> + */
> +int idpf_xdp_rxq_info_init_all(const struct idpf_vport *vport)
> +{
> +	return idpf_rxq_for_each(vport, __idpf_xdp_rxq_info_init, NULL);
> +}
> +
> +/**
> + * __idpf_xdp_rxq_info_deinit - Deinit XDP RxQ info for a given Rx queue
> + * @rxq: Rx queue for which the resources are destroyed
> + * @arg: flag indicating if the HW works in split queue mode
> + *
> + * Return: always 0.
> + */
> +static int __idpf_xdp_rxq_info_deinit(struct idpf_rx_queue *rxq, void *arg)
> +{
> +	if (idpf_is_queue_model_split((size_t)arg)) {
> +		rxq->xdpqs = NULL;
> +		rxq->num_xdp_txq = 0;
> +	}
> +
> +	xdp_rxq_info_detach_mem_model(&rxq->xdp_rxq);
> +	xdp_rxq_info_unreg(&rxq->xdp_rxq);
> +
> +	return 0;
> +}
> +
> +/**
> + * idpf_xdp_rxq_info_deinit_all - deinit RxQ info for all Rx queues in vport
> + * @vport: vport to setup the info
> + */
> +void idpf_xdp_rxq_info_deinit_all(const struct idpf_vport *vport)
> +{
> +	idpf_rxq_for_each(vport, __idpf_xdp_rxq_info_deinit,
> +			  (void *)(size_t)vport->rxq_model);
> +}
> +
> +int idpf_vport_xdpq_get(const struct idpf_vport *vport)
> +{
> +	struct libeth_xdpsq_timer **timers __free(kvfree) = NULL;

please bear with me here - so this array will exist for as long as a
single timers[i] is allocated, even though it's a local var?

and this way you avoid the need to store it in the vport?
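
If I read the <linux/cleanup.h> semantics right, it boils down to
something like this (sketch, hypothetical names):

	struct foo_timer **timers __free(kvfree) = NULL;

	timers = kvcalloc(n, sizeof(*timers), GFP_KERNEL);
	if (!timers)
		return -ENOMEM;

	for (u32 i = 0; i < n; i++) {
		timers[i] = kzalloc(sizeof(*timers[i]), GFP_KERNEL);
		/* error handling skipped in this sketch */

		qs[i]->timer = timers[i];	/* long-lived owner */
	}

	return 0;	/* kvfree(timers) runs automatically on every return,
			 * but each timers[i] lives on until its owner
			 * kfree()s it later
			 */

i.e. the array itself only exists for the duration of the function (hence
no need to park it in the vport), while every timers[i] survives via the
per-queue pointer. Is that the intent?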

> +	struct net_device *dev;
> +	u32 sqs;
> +
> +	if (!idpf_xdp_is_prog_ena(vport))
> +		return 0;
> +
> +	timers = kvcalloc(vport->num_xdp_txq, sizeof(*timers), GFP_KERNEL);
> +	if (!timers)
> +		return -ENOMEM;
> +
> +	for (u32 i = 0; i < vport->num_xdp_txq; i++) {
> +		timers[i] = kzalloc_node(sizeof(*timers[i]), GFP_KERNEL,
> +					 cpu_to_mem(i));
> +		if (!timers[i]) {
> +			for (int j = i - 1; j >= 0; j--)
> +				kfree(timers[j]);
> +
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	dev = vport->netdev;
> +	sqs = vport->xdp_txq_offset;
> +
> +	for (u32 i = sqs; i < vport->num_txq; i++) {
> +		struct idpf_tx_queue *xdpq = vport->txqs[i];
> +
> +		xdpq->complq = xdpq->txq_grp->complq;
> +
> +		idpf_queue_clear(FLOW_SCH_EN, xdpq);
> +		idpf_queue_clear(FLOW_SCH_EN, xdpq->complq);
> +		idpf_queue_set(NOIRQ, xdpq);
> +		idpf_queue_set(XDP, xdpq);
> +		idpf_queue_set(XDP, xdpq->complq);
> +
> +		xdpq->timer = timers[i - sqs];
> +		libeth_xdpsq_get(&xdpq->xdp_lock, dev, vport->xdpq_share);
> +
> +		xdpq->pending = 0;
> +		xdpq->xdp_tx = 0;
> +		xdpq->thresh = libeth_xdp_queue_threshold(xdpq->desc_count);
> +	}
> +
> +	return 0;
> +}
> +
> +void idpf_vport_xdpq_put(const struct idpf_vport *vport)
> +{
> +	struct net_device *dev;
> +	u32 sqs;
> +
> +	if (!idpf_xdp_is_prog_ena(vport))
> +		return;
> +
> +	dev = vport->netdev;
> +	sqs = vport->xdp_txq_offset;
> +
> +	for (u32 i = sqs; i < vport->num_txq; i++) {
> +		struct idpf_tx_queue *xdpq = vport->txqs[i];
> +
> +		if (!idpf_queue_has_clear(XDP, xdpq))
> +			continue;
> +
> +		libeth_xdpsq_put(&xdpq->xdp_lock, dev);
> +
> +		kfree(xdpq->timer);
> +		idpf_queue_clear(NOIRQ, xdpq);
> +	}
> +}
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 12/16] idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq
  2025-03-05 16:21 ` [PATCH net-next 12/16] idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq Alexander Lobakin
@ 2025-03-07 14:16   ` Maciej Fijalkowski
  2025-03-17 14:58     ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-07 14:16 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:28PM +0100, Alexander Lobakin wrote:
> From: Michal Kubiak <michal.kubiak@intel.com>
> 
> Implement loading/removing an XDP program using the .ndo_bpf callback
> in split queue mode. Reconfigure and restart the queues if needed
> (!!old_prog != !!new_prog); otherwise, just update the pointers.
> 
> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/idpf_txrx.h |   4 +-
>  drivers/net/ethernet/intel/idpf/xdp.h       |   7 ++
>  drivers/net/ethernet/intel/idpf/idpf_lib.c  |   1 +
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c |   4 +
>  drivers/net/ethernet/intel/idpf/xdp.c       | 114 ++++++++++++++++++++
>  5 files changed, 129 insertions(+), 1 deletion(-)
> 

(...)

> +
> +/**
> + * idpf_xdp_setup_prog - handle XDP program install/remove requests
> + * @vport: vport to configure
> + * @xdp: request data (program, extack)
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +static int
> +idpf_xdp_setup_prog(struct idpf_vport *vport, const struct netdev_bpf *xdp)
> +{
> +	const struct idpf_netdev_priv *np = netdev_priv(vport->netdev);
> +	struct bpf_prog *old, *prog = xdp->prog;
> +	struct idpf_vport_config *cfg;
> +	int ret;
> +
> +	cfg = vport->adapter->vport_config[vport->idx];
> +	if (!vport->num_xdp_txq && vport->num_txq == cfg->max_q.max_txq) {
> +		NL_SET_ERR_MSG_MOD(xdp->extack,
> +				   "No Tx queues available for XDP, please decrease the number of regular SQs");
> +		return -ENOSPC;
> +	}
> +
> +	if (test_bit(IDPF_REMOVE_IN_PROG, vport->adapter->flags) ||

IN_PROG is a bit unfortunate here as it mixes with 'prog' :P

> +	    !!vport->xdp_prog == !!prog) {
> +		if (np->state == __IDPF_VPORT_UP)
> +			idpf_copy_xdp_prog_to_qs(vport, prog);
> +
> +		old = xchg(&vport->xdp_prog, prog);
> +		if (old)
> +			bpf_prog_put(old);
> +
> +		cfg->user_config.xdp_prog = prog;
> +
> +		return 0;
> +	}
> +
> +	old = cfg->user_config.xdp_prog;
> +	cfg->user_config.xdp_prog = prog;
> +
> +	ret = idpf_initiate_soft_reset(vport, IDPF_SR_Q_CHANGE);
> +	if (ret) {
> +		NL_SET_ERR_MSG_MOD(xdp->extack,
> +				   "Could not reopen the vport after XDP setup");
> +
> +		if (prog)
> +			bpf_prog_put(prog);

aren't you missing this for the prog->NULL conversion? you have it for
the hot-swap case (prog->prog).

> +
> +		cfg->user_config.xdp_prog = old;
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * idpf_xdp - handle XDP-related requests
> + * @dev: network device to configure
> + * @xdp: request data (program, extack)
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +int idpf_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> +{
> +	struct idpf_vport *vport;
> +	int ret;
> +
> +	idpf_vport_ctrl_lock(dev);
> +	vport = idpf_netdev_to_vport(dev);
> +
> +	if (!idpf_is_queue_model_split(vport->txq_model))
> +		goto notsupp;
> +
> +	switch (xdp->command) {
> +	case XDP_SETUP_PROG:
> +		ret = idpf_xdp_setup_prog(vport, xdp);
> +		break;
> +	default:
> +notsupp:
> +		ret = -EOPNOTSUPP;
> +		break;
> +	}
> +
> +	idpf_vport_ctrl_unlock(dev);
> +
> +	return ret;
> +}
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp)
  2025-03-05 16:21 ` [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp) Alexander Lobakin
@ 2025-03-11 14:05   ` Maciej Fijalkowski
  2025-03-17 15:26     ` Alexander Lobakin
  2025-04-08 13:22     ` Alexander Lobakin
  0 siblings, 2 replies; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-11 14:05 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:19PM +0100, Alexander Lobakin wrote:
> "Couple" is a bit humbly... Add the following functionality to libeth:
> 
> * XDP shared queues managing
> * XDP_TX bulk sending infra
> * .ndo_xdp_xmit() infra
> * adding buffers to &xdp_buff
> * running XDP prog and managing its verdict
> * completing XDP Tx buffers
> 
> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # lots of stuff
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The patch is really big and TBH I'm not sure how to trim it to keep my
comments bearable. I know this is highly optimized, but it's rather hard
to follow with all of the callbacks, defines/aligns and whatnot. Any
chance to chop this commit up a bit?

The timers and the locking logic could be pulled out into separate
patches, I think. You never say what improvement the __LIBETH_WORD_ACCESS
approach gave you. You've put a lot of thought into this work, and I feel
it is not explained/described thoroughly enough. What would be nice to
see is to have that in a separate commit as well, with a comment like
'this gave me a +X% performance boost on Y workload'. That would probably
be a non-zero effort to restructure, but while jumping back and forth
through this code I had a lot of head-scratching moments.
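
e.g. for __LIBETH_WORD_ACCESS, IIUC the whole trick (judging by the hunk
further down) is:

	/* 64-bit LE: a single 8-byte store */
	frm.opts = len | ((u64)flags << 32);

	/* everything else: two 4-byte stores */
	frm.len = len;
	frm.flags = flags;

so a short note in the commit message stating that plus the measured gain
would already help a lot.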

> ---
>  drivers/net/ethernet/intel/libeth/Kconfig  |   10 +-
>  drivers/net/ethernet/intel/libeth/Makefile |    7 +-
>  include/net/libeth/types.h                 |  106 +-
>  drivers/net/ethernet/intel/libeth/priv.h   |   26 +
>  include/net/libeth/tx.h                    |   30 +-
>  include/net/libeth/xdp.h                   | 1827 ++++++++++++++++++++
>  drivers/net/ethernet/intel/libeth/tx.c     |   38 +
>  drivers/net/ethernet/intel/libeth/xdp.c    |  431 +++++
>  8 files changed, 2467 insertions(+), 8 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/libeth/priv.h
>  create mode 100644 include/net/libeth/xdp.h
>  create mode 100644 drivers/net/ethernet/intel/libeth/tx.c
>  create mode 100644 drivers/net/ethernet/intel/libeth/xdp.c
> 
> diff --git a/drivers/net/ethernet/intel/libeth/Kconfig b/drivers/net/ethernet/intel/libeth/Kconfig
> index 480293b71dbc..d8c4926574fb 100644
> --- a/drivers/net/ethernet/intel/libeth/Kconfig
> +++ b/drivers/net/ethernet/intel/libeth/Kconfig
> @@ -1,9 +1,15 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -# Copyright (C) 2024 Intel Corporation
> +# Copyright (C) 2024-2025 Intel Corporation
>  
>  config LIBETH
> -	tristate
> +	tristate "Common Ethernet library (libeth)" if COMPILE_TEST
>  	select PAGE_POOL
>  	help
>  	  libeth is a common library containing routines shared between several
>  	  drivers, but not yet promoted to the generic kernel API.
> +
> +config LIBETH_XDP
> +	tristate "Common XDP library (libeth_xdp)" if COMPILE_TEST
> +	select LIBETH
> +	help
> +	  XDP helpers based on libeth hotpath management.
> diff --git a/drivers/net/ethernet/intel/libeth/Makefile b/drivers/net/ethernet/intel/libeth/Makefile
> index 52492b081132..51669840ee06 100644
> --- a/drivers/net/ethernet/intel/libeth/Makefile
> +++ b/drivers/net/ethernet/intel/libeth/Makefile
> @@ -1,6 +1,11 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -# Copyright (C) 2024 Intel Corporation
> +# Copyright (C) 2024-2025 Intel Corporation
>  
>  obj-$(CONFIG_LIBETH)		+= libeth.o
>  
>  libeth-y			:= rx.o
> +libeth-y			+= tx.o
> +
> +obj-$(CONFIG_LIBETH_XDP)	+= libeth_xdp.o
> +
> +libeth_xdp-y			+= xdp.o
> diff --git a/include/net/libeth/types.h b/include/net/libeth/types.h
> index 603825e45133..cf1d78a9dc38 100644
> --- a/include/net/libeth/types.h
> +++ b/include/net/libeth/types.h
> @@ -1,10 +1,32 @@
>  /* SPDX-License-Identifier: GPL-2.0-only */
> -/* Copyright (C) 2024 Intel Corporation */
> +/* Copyright (C) 2024-2025 Intel Corporation */
>  
>  #ifndef __LIBETH_TYPES_H
>  #define __LIBETH_TYPES_H
>  
> -#include <linux/types.h>
> +#include <linux/workqueue.h>
> +
> +/* Stats */
> +
> +/**
> + * struct libeth_rq_napi_stats - "hot" counters to update in Rx polling loop
> + * @packets: received frames counter
> + * @bytes: sum of bytes of received frames above
> + * @fragments: sum of fragments of received S/G frames
> + * @hsplit: number of frames the device performed the header split for
> + * @raw: alias to access all the fields as an array
> + */
> +struct libeth_rq_napi_stats {
> +	union {
> +		struct {
> +							u32 packets;
> +							u32 bytes;
> +							u32 fragments;
> +							u32 hsplit;
> +		};
> +		DECLARE_FLEX_ARRAY(u32, raw);

The @raw approach is never used anywhere in the patchset, right?
Could you explain the reason for introducing it and the potential use case?
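
My guess is it's meant for bulk-adding the counters later on, something
along the lines of:

	for (u32 i = 0; i < sizeof(*src) / sizeof(*src->raw); i++)
		dst->raw[i] += src->raw[i];

but if that's the plan, such a user doesn't seem to be part of this set.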

> +	};
> +};
>  
>  /**
>   * struct libeth_sq_napi_stats - "hot" counters to update in Tx completion loop
> @@ -22,4 +44,84 @@ struct libeth_sq_napi_stats {
>  	};
>  };
>  
> +/**
> + * struct libeth_xdpsq_napi_stats - "hot" counters to update in XDP Tx
> + *				    completion loop
> + * @packets: completed frames counter
> + * @bytes: sum of bytes of completed frames above
> + * @fragments: sum of fragments of completed S/G frames
> + * @raw: alias to access all the fields as an array
> + */
> +struct libeth_xdpsq_napi_stats {

what's the delta between this and libeth_sq_napi_stats? couldn't you have
a single struct for the purpose of Tx NAPI stats?

> +	union {
> +		struct {
> +							u32 packets;
> +							u32 bytes;
> +							u32 fragments;
> +		};
> +		DECLARE_FLEX_ARRAY(u32, raw);
> +	};
> +};

(...)

> +/* Rx polling path */
> +
> +/**
> + * struct libeth_xdp_buff_stash - struct for stashing &xdp_buff onto a queue
> + * @data: pointer to the start of the frame, xdp_buff.data
> + * @headroom: frame headroom, xdp_buff.data - xdp_buff.data_hard_start
> + * @len: frame linear space length, xdp_buff.data_end - xdp_buff.data
> + * @frame_sz: truesize occupied by the frame, xdp_buff.frame_sz
> + * @flags: xdp_buff.flags
> + *
> + * &xdp_buff is 56 bytes long on x64, &libeth_xdp_buff is 64 bytes. This
> + * structure carries only necessary fields to save/restore a partially built
> + * frame on the queue structure to finish it during the next NAPI poll.
> + */
> +struct libeth_xdp_buff_stash {
> +	void				*data;
> +	u16				headroom;
> +	u16				len;
> +
> +	u32				frame_sz:24;
> +	u32				flags:8;
> +} __aligned_largest;
> +
>  #endif /* __LIBETH_TYPES_H */
> diff --git a/drivers/net/ethernet/intel/libeth/priv.h b/drivers/net/ethernet/intel/libeth/priv.h
> new file mode 100644
> index 000000000000..1bd6e2d7a3e7
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/libeth/priv.h
> @@ -0,0 +1,26 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright (C) 2025 Intel Corporation */
> +
> +#ifndef __LIBETH_PRIV_H
> +#define __LIBETH_PRIV_H
> +
> +#include <linux/types.h>
> +
> +/* XDP */
> +
> +struct skb_shared_info;
> +struct xdp_frame_bulk;
> +
> +struct libeth_xdp_ops {
> +	void	(*bulk)(const struct skb_shared_info *sinfo,
> +			struct xdp_frame_bulk *bq, bool frags);
> +};
> +
> +void libeth_attach_xdp(const struct libeth_xdp_ops *ops);
> +
> +static inline void libeth_detach_xdp(void)
> +{
> +	libeth_attach_xdp(NULL);
> +}
> +
> +#endif /* __LIBETH_PRIV_H */
> diff --git a/include/net/libeth/tx.h b/include/net/libeth/tx.h
> index 35614f9523f6..c3459917330e 100644
> --- a/include/net/libeth/tx.h
> +++ b/include/net/libeth/tx.h
> @@ -1,5 +1,5 @@
>  /* SPDX-License-Identifier: GPL-2.0-only */
> -/* Copyright (C) 2024 Intel Corporation */
> +/* Copyright (C) 2024-2025 Intel Corporation */
>  
>  #ifndef __LIBETH_TX_H
>  #define __LIBETH_TX_H
> @@ -12,11 +12,15 @@
>  
>  /**
>   * enum libeth_sqe_type - type of &libeth_sqe to act on Tx completion
> - * @LIBETH_SQE_EMPTY: unused/empty, no action required
> + * @LIBETH_SQE_EMPTY: unused/empty OR XDP_TX, no action required
>   * @LIBETH_SQE_CTX: context descriptor with empty SQE, no action required
>   * @LIBETH_SQE_SLAB: kmalloc-allocated buffer, unmap and kfree()
>   * @LIBETH_SQE_FRAG: mapped skb frag, only unmap DMA
>   * @LIBETH_SQE_SKB: &sk_buff, unmap and napi_consume_skb(), update stats
> + * @__LIBETH_SQE_XDP_START: separator between skb and XDP types
> + * @LIBETH_SQE_XDP_TX: &skb_shared_info, libeth_xdp_return_buff_bulk(), stats
> + * @LIBETH_SQE_XDP_XMIT: &xdp_frame, unmap and xdp_return_frame_bulk(), stats
> + * @LIBETH_SQE_XDP_XMIT_FRAG: &xdp_frame frag, only unmap DMA
>   */
>  enum libeth_sqe_type {
>  	LIBETH_SQE_EMPTY		= 0U,
> @@ -24,6 +28,11 @@ enum libeth_sqe_type {
>  	LIBETH_SQE_SLAB,
>  	LIBETH_SQE_FRAG,
>  	LIBETH_SQE_SKB,
> +
> +	__LIBETH_SQE_XDP_START,
> +	LIBETH_SQE_XDP_TX		= __LIBETH_SQE_XDP_START,
> +	LIBETH_SQE_XDP_XMIT,
> +	LIBETH_SQE_XDP_XMIT_FRAG,
>  };
>  
>  /**
> @@ -32,6 +41,8 @@ enum libeth_sqe_type {
>   * @rs_idx: index of the last buffer from the batch this one was sent in
>   * @raw: slab buffer to free via kfree()
>   * @skb: &sk_buff to consume
> + * @sinfo: skb shared info of an XDP_TX frame
> + * @xdpf: XDP frame from ::ndo_xdp_xmit()
>   * @dma: DMA address to unmap
>   * @len: length of the mapped region to unmap
>   * @nr_frags: number of frags in the frame this buffer belongs to
> @@ -46,6 +57,8 @@ struct libeth_sqe {
>  	union {
>  		void				*raw;
>  		struct sk_buff			*skb;
> +		struct skb_shared_info		*sinfo;
> +		struct xdp_frame		*xdpf;
>  	};
>  
>  	DEFINE_DMA_UNMAP_ADDR(dma);
> @@ -71,7 +84,10 @@ struct libeth_sqe {
>  /**
>   * struct libeth_cq_pp - completion queue poll params
>   * @dev: &device to perform DMA unmapping
> + * @bq: XDP frame bulk to combine return operations
>   * @ss: onstack NAPI stats to fill
> + * @xss: onstack XDPSQ NAPI stats to fill
> + * @xdp_tx: number of XDP frames processed
>   * @napi: whether it's called from the NAPI context
>   *
>   * libeth uses this structure to access objects needed for performing full
> @@ -80,7 +96,13 @@ struct libeth_sqe {
>   */
>  struct libeth_cq_pp {
>  	struct device			*dev;
> -	struct libeth_sq_napi_stats	*ss;
> +	struct xdp_frame_bulk		*bq;
> +
> +	union {
> +		struct libeth_sq_napi_stats	*ss;
> +		struct libeth_xdpsq_napi_stats	*xss;
> +	};
> +	u32				xdp_tx;

don't you have this counted in xss::packets already?

>  
>  	bool				napi;
>  };
> @@ -126,4 +148,6 @@ static inline void libeth_tx_complete(struct libeth_sqe *sqe,
>  	sqe->type = LIBETH_SQE_EMPTY;
>  }
>  
> +void libeth_tx_complete_any(struct libeth_sqe *sqe, struct libeth_cq_pp *cp);
> +
>  #endif /* __LIBETH_TX_H */
> diff --git a/include/net/libeth/xdp.h b/include/net/libeth/xdp.h
> new file mode 100644
> index 000000000000..1039cd5d8a56
> --- /dev/null
> +++ b/include/net/libeth/xdp.h
> @@ -0,0 +1,1827 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright (C) 2025 Intel Corporation */
> +
> +#ifndef __LIBETH_XDP_H
> +#define __LIBETH_XDP_H
> +
> +#include <linux/bpf_trace.h>
> +#include <linux/unroll.h>
> +
> +#include <net/libeth/rx.h>
> +#include <net/libeth/tx.h>
> +#include <net/xsk_buff_pool.h>
> +
> +/* Defined as bits to be able to use them as a mask */
> +enum {
> +	LIBETH_XDP_PASS			= 0U,
> +	LIBETH_XDP_DROP			= BIT(0),
> +	LIBETH_XDP_ABORTED		= BIT(1),
> +	LIBETH_XDP_TX			= BIT(2),
> +	LIBETH_XDP_REDIRECT		= BIT(3),
> +};
> +
> +/*
> + * &xdp_buff_xsk is the largest structure &libeth_xdp_buff gets casted to,
> + * pick maximum pointer-compatible alignment.
> + */
> +#define __LIBETH_XDP_BUFF_ALIGN						      \
> +	(IS_ALIGNED(sizeof(struct xdp_buff_xsk), 16) ? 16 :		      \
> +	 IS_ALIGNED(sizeof(struct xdp_buff_xsk), 8) ? 8 :		      \
> +	 sizeof(long))
> +
> +/**
> + * struct libeth_xdp_buff - libeth extension over &xdp_buff
> + * @base: main &xdp_buff
> + * @data: shortcut for @base.data
> + * @desc: RQ descriptor containing metadata for this buffer
> + * @priv: driver-private scratchspace
> + *
> + * The main reason for this is to have a pointer to the descriptor to be able
> + * to quickly get frame metadata from xdpmo and driver buff-to-xdp callbacks
> + * (as well as bigger alignment).
> + * Pointer/layout-compatible with &xdp_buff and &xdp_buff_xsk.
> + */
> +struct libeth_xdp_buff {
> +	union {
> +		struct xdp_buff		base;
> +		void			*data;
> +	};
> +
> +	const void			*desc;
> +	unsigned long			priv[]
> +					__aligned(__LIBETH_XDP_BUFF_ALIGN);
> +} __aligned(__LIBETH_XDP_BUFF_ALIGN);
> +static_assert(offsetof(struct libeth_xdp_buff, data) ==
> +	      offsetof(struct xdp_buff_xsk, xdp.data));
> +static_assert(offsetof(struct libeth_xdp_buff, desc) ==
> +	      offsetof(struct xdp_buff_xsk, cb));
> +static_assert(IS_ALIGNED(sizeof(struct xdp_buff_xsk),
> +			 __alignof(struct libeth_xdp_buff)));
> +
> +/**
> + * __LIBETH_XDP_ONSTACK_BUFF - declare a &libeth_xdp_buff on the stack
> + * @name: name of the variable to declare
> + * @...: sizeof() of the driver-private data
> + */
> +#define __LIBETH_XDP_ONSTACK_BUFF(name, ...)				      \
> +	___LIBETH_XDP_ONSTACK_BUFF(name, ##__VA_ARGS__)
> +/**
> + * LIBETH_XDP_ONSTACK_BUFF - declare a &libeth_xdp_buff on the stack
> + * @name: name of the variable to declare
> + * @...: type or variable name of the driver-private data
> + */
> +#define LIBETH_XDP_ONSTACK_BUFF(name, ...)				      \
> +	__LIBETH_XDP_ONSTACK_BUFF(name, __libeth_xdp_priv_sz(__VA_ARGS__))
> +
> +#define ___LIBETH_XDP_ONSTACK_BUFF(name, ...)				      \
> +	_DEFINE_FLEX(struct libeth_xdp_buff, name, priv,		      \
> +		     LIBETH_XDP_PRIV_SZ(__VA_ARGS__ + 0),		      \
> +		     /* no init */);					      \
> +	LIBETH_XDP_ASSERT_PRIV_SZ(__VA_ARGS__ + 0)
> +
> +#define __libeth_xdp_priv_sz(...)					      \
> +	CONCATENATE(__libeth_xdp_psz, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
> +
> +#define __libeth_xdp_psz0(...)
> +#define __libeth_xdp_psz1(...)		sizeof(__VA_ARGS__)
> +
> +#define LIBETH_XDP_PRIV_SZ(sz)						      \
> +	(ALIGN(sz, __alignof(struct libeth_xdp_buff)) / sizeof(long))
> +
> +/* Performs XSK_CHECK_PRIV_TYPE() */
> +#define LIBETH_XDP_ASSERT_PRIV_SZ(sz)					      \
> +	static_assert(offsetofend(struct xdp_buff_xsk, cb) >=		      \
> +		      struct_size_t(struct libeth_xdp_buff, priv,	      \
> +				    LIBETH_XDP_PRIV_SZ(sz)))
> +

(...)

> +/* Common Tx bits */
> +
> +/**
> + * enum - libeth_xdp internal Tx flags
> + * @LIBETH_XDP_TX_BULK: one bulk size at which it will be flushed to the queue
> + * @LIBETH_XDP_TX_BATCH: batch size for which the queue fill loop is unrolled
> + * @LIBETH_XDP_TX_DROP: indicates the send function must drop frames not sent
> + * @LIBETH_XDP_TX_NDO: whether the send function is called from .ndo_xdp_xmit()
> + */
> +enum {
> +	LIBETH_XDP_TX_BULK		= DEV_MAP_BULK_SIZE,
> +	LIBETH_XDP_TX_BATCH		= 8,
> +
> +	LIBETH_XDP_TX_DROP		= BIT(0),
> +	LIBETH_XDP_TX_NDO		= BIT(1),

what's the reason to group these random values into an enum?

> +};
> +
> +/**
> + * enum - &libeth_xdp_tx_frame and &libeth_xdp_tx_desc flags
> + * @LIBETH_XDP_TX_LEN: only for ``XDP_TX``, [15:0] of ::len_fl is actual length
> + * @LIBETH_XDP_TX_FIRST: indicates the frag is the first one of the frame
> + * @LIBETH_XDP_TX_LAST: whether the frag is the last one of the frame
> + * @LIBETH_XDP_TX_MULTI: whether the frame contains several frags

it would be good to have some extended description of how these flags are
used.
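
Even a couple of lines showing the packing would help, e.g. (derived from
how the code below uses it):

	/* queueing an ``XDP_TX`` head frag: */
	len_fl = (data_end - data) | LIBETH_XDP_TX_FIRST;

	/* unpacking when filling the descriptor: */
	len   = len_fl & LIBETH_XDP_TX_LEN;	/* bits 15:0 */
	flags = len_fl & LIBETH_XDP_TX_FLAGS;	/* bits 31:16 */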

> + * @LIBETH_XDP_TX_FLAGS: only for ``XDP_TX``, [31:16] of ::len_fl is flags
> + */
> +enum {
> +	LIBETH_XDP_TX_LEN		= GENMASK(15, 0),
> +
> +	LIBETH_XDP_TX_FIRST		= BIT(16),
> +	LIBETH_XDP_TX_LAST		= BIT(17),
> +	LIBETH_XDP_TX_MULTI		= BIT(18),
> +
> +	LIBETH_XDP_TX_FLAGS		= GENMASK(31, 16),
> +};
> +
> +/**
> + * struct libeth_xdp_tx_frame - represents one XDP Tx element
> + * @data: frame start pointer for ``XDP_TX``
> + * @len_fl: ``XDP_TX``, combined flags [31:16] and len [15:0] field for speed
> + * @soff: ``XDP_TX``, offset from @data to the start of &skb_shared_info
> + * @frag: one (non-head) frag for ``XDP_TX``
> + * @xdpf: &xdp_frame for the head frag for .ndo_xdp_xmit()
> + * @dma: DMA address of the non-head frag for .ndo_xdp_xmit()
> + * @len: frag length for .ndo_xdp_xmit()
> + * @flags: Tx flags for the above
> + * @opts: combined @len + @flags for the above for speed
> + */
> +struct libeth_xdp_tx_frame {
> +	union {
> +		/* ``XDP_TX`` */
> +		struct {
> +			void				*data;
> +			u32				len_fl;
> +			u32				soff;
> +		};
> +
> +		/* ``XDP_TX`` frag */
> +		skb_frag_t			frag;
> +
> +		/* .ndo_xdp_xmit() */
> +		struct {
> +			union {
> +				struct xdp_frame		*xdpf;
> +				dma_addr_t			dma;
> +			};
> +			union {
> +				struct {
> +					u32				len;
> +					u32				flags;
> +				};
> +				aligned_u64			opts;
> +			};
> +		};
> +	};
> +} __aligned(sizeof(struct xdp_desc));
> +static_assert(offsetof(struct libeth_xdp_tx_frame, frag.len) ==
> +	      offsetof(struct libeth_xdp_tx_frame, len_fl));
> +
> +/**
> + * struct libeth_xdp_tx_bulk - XDP Tx frame bulk for bulk sending
> + * @prog: corresponding active XDP program, %NULL for .ndo_xdp_xmit()
> + * @dev: &net_device which the frames are transmitted on
> + * @xdpsq: shortcut to the corresponding driver-specific XDPSQ structure
> + * @act_mask: Rx only, mask of all the XDP prog verdicts for that NAPI session
> + * @count: current number of frames in @bulk
> + * @bulk: array of queued frames for bulk Tx
> + *
> + * All XDP Tx operations queue each frame to the bulk first and flush it
> + * when @count reaches the array end. Bulk is always placed on the stack
> + * for performance. One bulk element contains all the data necessary
> + * for sending a frame and then freeing it on completion.
> + */
> +struct libeth_xdp_tx_bulk {
> +	const struct bpf_prog		*prog;
> +	struct net_device		*dev;
> +	void				*xdpsq;
> +
> +	u32				act_mask;
> +	u32				count;
> +	struct libeth_xdp_tx_frame	bulk[LIBETH_XDP_TX_BULK];
> +} __aligned(sizeof(struct libeth_xdp_tx_frame));
> +
> +/**
> + * struct libeth_xdpsq - abstraction for an XDPSQ
> + * @sqes: array of Tx buffers from the actual queue struct
> + * @descs: opaque pointer to the HW descriptor array
> + * @ntu: pointer to the next free descriptor index
> + * @count: number of descriptors on that queue
> + * @pending: pointer to the number of sent-not-completed descs on that queue
> + * @xdp_tx: pointer to the above
> + * @lock: corresponding XDPSQ lock
> + *
> + * Abstraction for driver-independent implementation of Tx. Placed on the stack
> + * and filled by the driver before the transmission, so that the generic
> + * functions can access and modify driver-specific resources.
> + */
> +struct libeth_xdpsq {
> +	struct libeth_sqe		*sqes;
> +	void				*descs;
> +
> +	u32				*ntu;
> +	u32				count;
> +
> +	u32				*pending;
> +	u32				*xdp_tx;
> +	struct libeth_xdpsq_lock	*lock;
> +};
> +
> +/**
> + * struct libeth_xdp_tx_desc - abstraction for an XDP Tx descriptor
> + * @addr: DMA address of the frame
> + * @len: length of the frame
> + * @flags: XDP Tx flags
> + * @opts: combined @len + @flags for speed
> + *
> + * Filled by the generic functions and then passed to driver-specific functions
> + * to fill a HW Tx descriptor, always placed on the [function] stack.
> + */
> +struct libeth_xdp_tx_desc {
> +	dma_addr_t			addr;
> +	union {
> +		struct {
> +			u32				len;
> +			u32				flags;
> +		};
> +		aligned_u64			opts;
> +	};
> +} __aligned_largest;
> +
> +/**
> + * libeth_xdp_ptr_to_priv - convert pointer to a libeth_xdp u64 priv
> + * @ptr: pointer to convert
> + *
> + * The main sending function passes private data as the largest scalar, u64.
> + * Use this helper when you want to pass a pointer there.
> + */
> +#define libeth_xdp_ptr_to_priv(ptr) ({					      \
> +	typecheck_pointer(ptr);						      \
> +	((u64)(uintptr_t)(ptr));					      \
> +})
> +/**
> + * libeth_xdp_priv_to_ptr - convert libeth_xdp u64 priv to a pointer
> + * @priv: private data to convert
> + *
> + * The main sending function passes private data as the largest scalar, u64.
> + * Use this helper when your callback takes this u64 and you want to convert
> + * it back to a pointer.
> + */
> +#define libeth_xdp_priv_to_ptr(priv) ({					      \
> +	static_assert(__same_type(priv, u64));				      \
> +	((const void *)(uintptr_t)(priv));				      \
> +})
> +
> +/*
> + * On 64-bit systems, assigning one u64 is faster than two u32s. When ::len
> + * occupies lowest 32 bits (LE), whole ::opts can be assigned directly instead.
> + */
> +#ifdef __LITTLE_ENDIAN
> +#define __LIBETH_WORD_ACCESS		1
> +#endif
> +#ifdef __LIBETH_WORD_ACCESS
> +#define __libeth_xdp_tx_len(flen, ...)					      \
> +	.opts = ((flen) | FIELD_PREP(GENMASK_ULL(63, 32), (__VA_ARGS__ + 0)))
> +#else
> +#define __libeth_xdp_tx_len(flen, ...)					      \
> +	.len = (flen), .flags = (__VA_ARGS__ + 0)
> +#endif
> +
> +/**
> + * libeth_xdp_tx_xmit_bulk - main XDP Tx function
> + * @bulk: array of frames to send
> + * @xdpsq: pointer to the driver-specific XDPSQ struct
> + * @n: number of frames to send
> + * @unroll: whether to unroll the queue filling loop for speed
> + * @priv: driver-specific private data
> + * @prep: callback for cleaning the queue and filling abstract &libeth_xdpsq
> + * @fill: internal callback for filling &libeth_sqe and &libeth_xdp_tx_desc
> + * @xmit: callback for filling a HW descriptor with the frame info
> + *
> + * Internal abstraction for placing @n XDP Tx frames on the HW XDPSQ. Used for
> + * all types of frames: ``XDP_TX`` and .ndo_xdp_xmit().
> + * @prep must lock the queue as this function releases it at the end. @unroll
> + * greatly increases the object code size, but also greatly increases
> + * performance.
> + * The compilers inline all those onstack abstractions to direct data accesses.
> + *
> + * Return: number of frames actually placed on the queue, <= @n. The function
> + * can't fail, but can send fewer frames if there are not enough free descriptors
> + * available. The actual free space is returned by @prep from the driver.
> + */
> +static __always_inline u32
> +libeth_xdp_tx_xmit_bulk(const struct libeth_xdp_tx_frame *bulk, void *xdpsq,
> +			u32 n, bool unroll, u64 priv,
> +			u32 (*prep)(void *xdpsq, struct libeth_xdpsq *sq),
> +			struct libeth_xdp_tx_desc
> +			(*fill)(struct libeth_xdp_tx_frame frm, u32 i,
> +				const struct libeth_xdpsq *sq, u64 priv),
> +			void (*xmit)(struct libeth_xdp_tx_desc desc, u32 i,
> +				     const struct libeth_xdpsq *sq, u64 priv))
> +{
> +	u32 this, batched, off = 0;
> +	struct libeth_xdpsq sq;
> +	u32 ntu, i = 0;
> +
> +	n = min(n, prep(xdpsq, &sq));
> +	if (unlikely(!n))
> +		goto unlock;
> +
> +	ntu = *sq.ntu;
> +
> +	this = sq.count - ntu;
> +	if (likely(this > n))
> +		this = n;
> +
> +again:
> +	if (!unroll)
> +		goto linear;
> +
> +	batched = ALIGN_DOWN(this, LIBETH_XDP_TX_BATCH);
> +
> +	for ( ; i < off + batched; i += LIBETH_XDP_TX_BATCH) {
> +		u32 base = ntu + i - off;
> +
> +		unrolled_count(LIBETH_XDP_TX_BATCH)
> +		for (u32 j = 0; j < LIBETH_XDP_TX_BATCH; j++)
> +			xmit(fill(bulk[i + j], base + j, &sq, priv),
> +			     base + j, &sq, priv);
> +	}
> +
> +	if (batched < this) {
> +linear:
> +		for ( ; i < off + this; i++)
> +			xmit(fill(bulk[i], ntu + i - off, &sq, priv),
> +			     ntu + i - off, &sq, priv);
> +	}
> +
> +	ntu += this;
> +	if (likely(ntu < sq.count))
> +		goto out;
> +
> +	ntu = 0;
> +
> +	if (i < n) {
> +		this = n - i;
> +		off = i;
> +
> +		goto again;
> +	}
> +
> +out:
> +	*sq.ntu = ntu;
> +	*sq.pending += n;
> +	if (sq.xdp_tx)
> +		*sq.xdp_tx += n;
> +
> +unlock:
> +	libeth_xdpsq_unlock(sq.lock);
> +
> +	return n;
> +}
> +
> +/* ``XDP_TX`` bulking */
> +
> +void libeth_xdp_return_buff_slow(struct libeth_xdp_buff *xdp);
> +
> +/**
> + * libeth_xdp_tx_queue_head - internal helper for queueing one ``XDP_TX`` head
> + * @bq: XDP Tx bulk to queue the head frag to
> + * @xdp: XDP buffer with the head to queue
> + *
> + * Return: false if it's the only frag of the frame, true if it's an S/G frame.
> + */
> +static inline bool libeth_xdp_tx_queue_head(struct libeth_xdp_tx_bulk *bq,
> +					    const struct libeth_xdp_buff *xdp)
> +{
> +	const struct xdp_buff *base = &xdp->base;
> +
> +	bq->bulk[bq->count++] = (typeof(*bq->bulk)){
> +		.data	= xdp->data,
> +		.len_fl	= (base->data_end - xdp->data) | LIBETH_XDP_TX_FIRST,
> +		.soff	= xdp_data_hard_end(base) - xdp->data,
> +	};
> +
> +	if (!xdp_buff_has_frags(base))

likely() ?

> +		return false;
> +
> +	bq->bulk[bq->count - 1].len_fl |= LIBETH_XDP_TX_MULTI;
> +
> +	return true;
> +}
> +
> +/**
> + * libeth_xdp_tx_queue_frag - internal helper for queueing one ``XDP_TX`` frag
> + * @bq: XDP Tx bulk to queue the frag to
> + * @frag: frag to queue
> + */
> +static inline void libeth_xdp_tx_queue_frag(struct libeth_xdp_tx_bulk *bq,
> +					    const skb_frag_t *frag)
> +{
> +	bq->bulk[bq->count++].frag = *frag;

IMHO this helper is not providing anything useful

> +}
> +
> +/**
> + * libeth_xdp_tx_queue_bulk - internal helper for queueing one ``XDP_TX`` frame
> + * @bq: XDP Tx bulk to queue the frame to
> + * @xdp: XDP buffer to queue
> + * @flush_bulk: driver callback to flush the bulk to the HW queue
> + *
> + * Return: true on success, false on flush error.
> + */
> +static __always_inline bool
> +libeth_xdp_tx_queue_bulk(struct libeth_xdp_tx_bulk *bq,
> +			 struct libeth_xdp_buff *xdp,
> +			 bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
> +					    u32 flags))
> +{
> +	const struct skb_shared_info *sinfo;
> +	bool ret = true;
> +	u32 nr_frags;
> +
> +	if (unlikely(bq->count == LIBETH_XDP_TX_BULK) &&
> +	    unlikely(!flush_bulk(bq, 0))) {
> +		libeth_xdp_return_buff_slow(xdp);
> +		return false;
> +	}
> +
> +	if (!libeth_xdp_tx_queue_head(bq, xdp))
> +		goto out;
> +
> +	sinfo = xdp_get_shared_info_from_buff(&xdp->base);
> +	nr_frags = sinfo->nr_frags;
> +
> +	for (u32 i = 0; i < nr_frags; i++) {
> +		if (unlikely(bq->count == LIBETH_XDP_TX_BULK) &&
> +		    unlikely(!flush_bulk(bq, 0))) {
> +			ret = false;
> +			break;
> +		}
> +
> +		libeth_xdp_tx_queue_frag(bq, &sinfo->frags[i]);
> +	}
> +
> +out:
> +	bq->bulk[bq->count - 1].len_fl |= LIBETH_XDP_TX_LAST;
> +	xdp->data = NULL;
> +
> +	return ret;
> +}
> +
> +/**
> + * libeth_xdp_tx_fill_stats - fill &libeth_sqe with ``XDP_TX`` frame stats
> + * @sqe: SQ element to fill
> + * @desc: libeth_xdp Tx descriptor
> + * @sinfo: &skb_shared_info for this frame
> + *
> + * Internal helper for filling an SQE with the frame stats, do not use in
> + * drivers. Fills the number of frags and bytes for this frame.
> + */
> +#define libeth_xdp_tx_fill_stats(sqe, desc, sinfo)			      \
> +	__libeth_xdp_tx_fill_stats(sqe, desc, sinfo, __UNIQUE_ID(sqe_),	      \
> +				   __UNIQUE_ID(desc_), __UNIQUE_ID(sinfo_))
> +
> +#define __libeth_xdp_tx_fill_stats(sqe, desc, sinfo, ue, ud, us) do {	      \
> +	const struct libeth_xdp_tx_desc *ud = (desc);			      \
> +	const struct skb_shared_info *us;				      \
> +	struct libeth_sqe *ue = (sqe);					      \
> +									      \
> +	ue->nr_frags = 1;						      \
> +	ue->bytes = ud->len;						      \
> +									      \
> +	if (ud->flags & LIBETH_XDP_TX_MULTI) {				      \
> +		us = (sinfo);						      \

why? what does 'u' stand for? 'ue'/'us' don't tell the reader much at
first glance; 'sinfo' tells me everything.

> +		ue->nr_frags += us->nr_frags;				      \
> +		ue->bytes += us->xdp_frags_size;			      \
> +	}								      \
> +} while (0)
> +
> +/**
> + * libeth_xdp_tx_fill_buf - internal helper to fill one ``XDP_TX`` &libeth_sqe
> + * @frm: XDP Tx frame from the bulk
> + * @i: index on the HW queue
> + * @sq: XDPSQ abstraction for the queue
> + * @priv: private data
> + *
> + * Return: XDP Tx descriptor with the synced DMA and other info to pass to
> + * the driver callback.
> + */
> +static inline struct libeth_xdp_tx_desc
> +libeth_xdp_tx_fill_buf(struct libeth_xdp_tx_frame frm, u32 i,
> +		       const struct libeth_xdpsq *sq, u64 priv)
> +{
> +	struct libeth_xdp_tx_desc desc;
> +	struct skb_shared_info *sinfo;
> +	skb_frag_t *frag = &frm.frag;
> +	struct libeth_sqe *sqe;
> +	netmem_ref netmem;
> +
> +	if (frm.len_fl & LIBETH_XDP_TX_FIRST) {
> +		sinfo = frm.data + frm.soff;
> +		skb_frag_fill_netmem_desc(frag, virt_to_netmem(frm.data),
> +					  offset_in_page(frm.data),
> +					  frm.len_fl);
> +	} else {
> +		sinfo = NULL;
> +	}
> +
> +	netmem = skb_frag_netmem(frag);
> +	desc = (typeof(desc)){
> +		.addr	= page_pool_get_dma_addr_netmem(netmem) +
> +			  skb_frag_off(frag),
> +		.len	= skb_frag_size(frag) & LIBETH_XDP_TX_LEN,
> +		.flags	= skb_frag_size(frag) & LIBETH_XDP_TX_FLAGS,
> +	};
> +
> +	if (sinfo || !netmem_is_net_iov(netmem)) {
> +		const struct page_pool *pp = __netmem_get_pp(netmem);
> +
> +		dma_sync_single_for_device(pp->p.dev, desc.addr, desc.len,
> +					   DMA_BIDIRECTIONAL);
> +	}
> +
> +	if (!sinfo)
> +		return desc;
> +
> +	sqe = &sq->sqes[i];
> +	sqe->type = LIBETH_SQE_XDP_TX;
> +	sqe->sinfo = sinfo;
> +	libeth_xdp_tx_fill_stats(sqe, &desc, sinfo);
> +
> +	return desc;
> +}
> +
> +void libeth_xdp_tx_exception(struct libeth_xdp_tx_bulk *bq, u32 sent,
> +			     u32 flags);
> +
> +/**
> + * __libeth_xdp_tx_flush_bulk - internal helper to flush one XDP Tx bulk
> + * @bq: bulk to flush
> + * @flags: XDP TX flags (.ndo_xdp_xmit(), etc.)
> + * @prep: driver-specific callback to prepare the queue for sending
> + * @fill: libeth_xdp callback to fill &libeth_sqe and &libeth_xdp_tx_desc
> + * @xmit: driver callback to fill a HW descriptor
> + *
> + * Internal abstraction to create bulk flush functions for drivers.
> + *
> + * Return: true if anything was sent, false otherwise.
> + */
> +static __always_inline bool
> +__libeth_xdp_tx_flush_bulk(struct libeth_xdp_tx_bulk *bq, u32 flags,
> +			   u32 (*prep)(void *xdpsq, struct libeth_xdpsq *sq),
> +			   struct libeth_xdp_tx_desc
> +			   (*fill)(struct libeth_xdp_tx_frame frm, u32 i,
> +				   const struct libeth_xdpsq *sq, u64 priv),
> +			   void (*xmit)(struct libeth_xdp_tx_desc desc, u32 i,
> +					const struct libeth_xdpsq *sq,
> +					u64 priv))
> +{
> +	u32 sent, drops;
> +	int err = 0;
> +
> +	sent = libeth_xdp_tx_xmit_bulk(bq->bulk, bq->xdpsq,
> +				       min(bq->count, LIBETH_XDP_TX_BULK),
> +				       false, 0, prep, fill, xmit);
> +	drops = bq->count - sent;
> +
> +	if (unlikely(drops)) {
> +		libeth_xdp_tx_exception(bq, sent, flags);
> +		err = -ENXIO;
> +	} else {
> +		bq->count = 0;
> +	}
> +
> +	trace_xdp_bulk_tx(bq->dev, sent, drops, err);
> +
> +	return likely(sent);
> +}
> +
> +/**
> + * libeth_xdp_tx_flush_bulk - wrapper to define flush of one ``XDP_TX`` bulk
> + * @bq: bulk to flush
> + * @flags: Tx flags, see above
> + * @prep: driver callback to prepare the queue
> + * @xmit: driver callback to fill a HW descriptor
> + *
> + * Use via LIBETH_XDP_DEFINE_FLUSH_TX() to define an ``XDP_TX`` driver
> + * callback.
> + */
> +#define libeth_xdp_tx_flush_bulk(bq, flags, prep, xmit)			      \
> +	__libeth_xdp_tx_flush_bulk(bq, flags, prep, libeth_xdp_tx_fill_buf,   \
> +				   xmit)
> +
> +/* .ndo_xdp_xmit() implementation */
> +
> +/**
> + * libeth_xdp_xmit_init_bulk - internal helper to initialize bulk for XDP xmit
> + * @bq: bulk to initialize
> + * @dev: target &net_device
> + * @xdpsqs: array of driver-specific XDPSQ structs
> + * @num: number of active XDPSQs (the above array length)
> + */
> +#define libeth_xdp_xmit_init_bulk(bq, dev, xdpsqs, num)			      \
> +	__libeth_xdp_xmit_init_bulk(bq, dev, (xdpsqs)[libeth_xdpsq_id(num)])
> +
> +static inline void __libeth_xdp_xmit_init_bulk(struct libeth_xdp_tx_bulk *bq,
> +					       struct net_device *dev,
> +					       void *xdpsq)
> +{
> +	bq->dev = dev;
> +	bq->xdpsq = xdpsq;
> +	bq->count = 0;
> +}
> +
> +/**
> + * libeth_xdp_xmit_frame_dma - internal helper to access DMA of an &xdp_frame
> + * @xf: pointer to the XDP frame
> + *
> + * There's no place in &libeth_xdp_tx_frame to store the DMA address of an
> + * &xdp_frame head. The headroom is used instead: the address is placed right
> + * after the frame struct, naturally aligned.
> + *
> + * Return: pointer to the DMA address to use.
> + */
> +#define libeth_xdp_xmit_frame_dma(xf)					      \
> +	_Generic((xf),							      \
> +		 const struct xdp_frame *:				      \
> +			(const dma_addr_t *)__libeth_xdp_xmit_frame_dma(xf),  \
> +		 struct xdp_frame *:					      \
> +			(dma_addr_t *)__libeth_xdp_xmit_frame_dma(xf)	      \
> +	)
> +
> +static inline void *__libeth_xdp_xmit_frame_dma(const struct xdp_frame *xdpf)
> +{
> +	void *addr = (void *)(xdpf + 1);
> +
> +	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> +	    __alignof(*xdpf) < sizeof(dma_addr_t))
> +		addr = PTR_ALIGN(addr, sizeof(dma_addr_t));
> +
> +	return addr;
> +}
> +
> +/**
> + * libeth_xdp_xmit_queue_head - internal helper for queueing one XDP xmit head
> + * @bq: XDP Tx bulk to queue the head frag to
> + * @xdpf: XDP frame with the head to queue
> + * @dev: device to perform DMA mapping
> + *
> + * Return: ``LIBETH_XDP_DROP`` on DMA mapping error,
> + *	   ``LIBETH_XDP_PASS`` if it's the only frag in the frame,
> + *	   ``LIBETH_XDP_TX`` if it's an S/G frame.
> + */
> +static inline u32 libeth_xdp_xmit_queue_head(struct libeth_xdp_tx_bulk *bq,
> +					     struct xdp_frame *xdpf,
> +					     struct device *dev)
> +{
> +	dma_addr_t dma;
> +
> +	dma = dma_map_single(dev, xdpf->data, xdpf->len, DMA_TO_DEVICE);
> +	if (dma_mapping_error(dev, dma))
> +		return LIBETH_XDP_DROP;
> +
> +	*libeth_xdp_xmit_frame_dma(xdpf) = dma;
> +
> +	bq->bulk[bq->count++] = (typeof(*bq->bulk)){
> +		.xdpf	= xdpf,
> +		__libeth_xdp_tx_len(xdpf->len, LIBETH_XDP_TX_FIRST),
> +	};
> +
> +	if (!xdp_frame_has_frags(xdpf))
> +		return LIBETH_XDP_PASS;
> +
> +	bq->bulk[bq->count - 1].flags |= LIBETH_XDP_TX_MULTI;
> +
> +	return LIBETH_XDP_TX;
> +}
> +
> +/**
> + * libeth_xdp_xmit_queue_frag - internal helper for queueing one XDP xmit frag
> + * @bq: XDP Tx bulk to queue the frag to
> + * @frag: frag to queue
> + * @dev: device to perform DMA mapping
> + *
> + * Return: true on success, false on DMA mapping error.
> + */
> +static inline bool libeth_xdp_xmit_queue_frag(struct libeth_xdp_tx_bulk *bq,
> +					      const skb_frag_t *frag,
> +					      struct device *dev)
> +{
> +	dma_addr_t dma;
> +
> +	dma = skb_frag_dma_map(dev, frag);
> +	if (dma_mapping_error(dev, dma))
> +		return false;
> +
> +	bq->bulk[bq->count++] = (typeof(*bq->bulk)){
> +		.dma	= dma,
> +		__libeth_xdp_tx_len(skb_frag_size(frag)),
> +	};
> +
> +	return true;
> +}
> +
> +/**
> + * libeth_xdp_xmit_queue_bulk - internal helper for queueing one XDP xmit frame
> + * @bq: XDP Tx bulk to queue the frame to
> + * @xdpf: XDP frame to queue
> + * @flush_bulk: driver callback to flush the bulk to the HW queue
> + *
> + * Return: ``LIBETH_XDP_TX`` on success,
> + *	   ``LIBETH_XDP_DROP`` if the frame should be dropped by the stack,
> + *	   ``LIBETH_XDP_ABORTED`` if the frame will be dropped by libeth_xdp.
> + */
> +static __always_inline u32
> +libeth_xdp_xmit_queue_bulk(struct libeth_xdp_tx_bulk *bq,
> +			   struct xdp_frame *xdpf,
> +			   bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
> +					      u32 flags))
> +{
> +	u32 head, nr_frags, i, ret = LIBETH_XDP_TX;
> +	struct device *dev = bq->dev->dev.parent;
> +	const struct skb_shared_info *sinfo;
> +
> +	if (unlikely(bq->count == LIBETH_XDP_TX_BULK) &&
> +	    unlikely(!flush_bulk(bq, LIBETH_XDP_TX_NDO)))
> +		return LIBETH_XDP_DROP;
> +
> +	head = libeth_xdp_xmit_queue_head(bq, xdpf, dev);
> +	if (head == LIBETH_XDP_PASS)
> +		goto out;
> +	else if (head == LIBETH_XDP_DROP)
> +		return LIBETH_XDP_DROP;
> +
> +	sinfo = xdp_get_shared_info_from_frame(xdpf);
> +	nr_frags = sinfo->nr_frags;
> +
> +	for (i = 0; i < nr_frags; i++) {
> +		if (unlikely(bq->count == LIBETH_XDP_TX_BULK) &&
> +		    unlikely(!flush_bulk(bq, LIBETH_XDP_TX_NDO)))
> +			break;
> +
> +		if (!libeth_xdp_xmit_queue_frag(bq, &sinfo->frags[i], dev))
> +			break;
> +	}
> +
> +	if (unlikely(i < nr_frags))
> +		ret = LIBETH_XDP_ABORTED;
> +
> +out:
> +	bq->bulk[bq->count - 1].flags |= LIBETH_XDP_TX_LAST;
> +
> +	return ret;
> +}
> +
> +/**
> + * libeth_xdp_xmit_fill_buf - internal helper to fill one XDP xmit &libeth_sqe
> + * @frm: XDP Tx frame from the bulk
> + * @i: index on the HW queue
> + * @sq: XDPSQ abstraction for the queue
> + * @priv: private data
> + *
> + * Return: XDP Tx descriptor with the mapped DMA and other info to pass to
> + * the driver callback.
> + */
> +static inline struct libeth_xdp_tx_desc
> +libeth_xdp_xmit_fill_buf(struct libeth_xdp_tx_frame frm, u32 i,
> +			 const struct libeth_xdpsq *sq, u64 priv)
> +{
> +	struct libeth_xdp_tx_desc desc;
> +	struct libeth_sqe *sqe;
> +	struct xdp_frame *xdpf;
> +
> +	if (frm.flags & LIBETH_XDP_TX_FIRST) {
> +		xdpf = frm.xdpf;
> +		desc.addr = *libeth_xdp_xmit_frame_dma(xdpf);
> +	} else {
> +		xdpf = NULL;
> +		desc.addr = frm.dma;
> +	}
> +	desc.opts = frm.opts;
> +
> +	sqe = &sq->sqes[i];
> +	dma_unmap_addr_set(sqe, dma, desc.addr);
> +	dma_unmap_len_set(sqe, len, desc.len);
> +
> +	if (!xdpf) {
> +		sqe->type = LIBETH_SQE_XDP_XMIT_FRAG;
> +		return desc;
> +	}
> +
> +	sqe->type = LIBETH_SQE_XDP_XMIT;
> +	sqe->xdpf = xdpf;
> +	libeth_xdp_tx_fill_stats(sqe, &desc,
> +				 xdp_get_shared_info_from_frame(xdpf));
> +
> +	return desc;
> +}
> +
> +/**
> + * libeth_xdp_xmit_flush_bulk - wrapper to define flush of one XDP xmit bulk
> + * @bq: bulk to flush
> + * @flags: Tx flags, see __libeth_xdp_tx_flush_bulk()
> + * @prep: driver callback to prepare the queue
> + * @xmit: driver callback to fill a HW descriptor
> + *
> + * Use via LIBETH_XDP_DEFINE_FLUSH_XMIT() to define an XDP xmit driver
> + * callback.
> + */
> +#define libeth_xdp_xmit_flush_bulk(bq, flags, prep, xmit)		      \
> +	__libeth_xdp_tx_flush_bulk(bq, (flags) | LIBETH_XDP_TX_NDO, prep,     \
> +				   libeth_xdp_xmit_fill_buf, xmit)
> +
> +u32 libeth_xdp_xmit_return_bulk(const struct libeth_xdp_tx_frame *bq,
> +				u32 count, const struct net_device *dev);
> +
> +/**
> + * __libeth_xdp_xmit_do_bulk - internal function to implement .ndo_xdp_xmit()
> + * @bq: XDP Tx bulk to queue frames to
> + * @frames: XDP frames passed by the stack
> + * @n: number of frames
> + * @flags: flags passed by the stack
> + * @flush_bulk: driver callback to flush an XDP xmit bulk
> + * @finalize: driver callback to finalize sending XDP Tx frames on the queue
> + *
> + * Perform common checks, map the frags and queue them to the bulk, then flush
> + * the bulk to the XDPSQ. If requested by the stack, finalize the queue.
> + *
> + * Return: number of frames sent or -errno on error.
> + */
> +static __always_inline int
> +__libeth_xdp_xmit_do_bulk(struct libeth_xdp_tx_bulk *bq,
> +			  struct xdp_frame **frames, u32 n, u32 flags,
> +			  bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
> +					     u32 flags),
> +			  void (*finalize)(void *xdpsq, bool sent, bool flush))
> +{
> +	u32 nxmit = 0;
> +
> +	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
> +		return -EINVAL;
> +
> +	for (u32 i = 0; likely(i < n); i++) {
> +		u32 ret;
> +
> +		ret = libeth_xdp_xmit_queue_bulk(bq, frames[i], flush_bulk);
> +		if (unlikely(ret != LIBETH_XDP_TX)) {
> +			nxmit += ret == LIBETH_XDP_ABORTED;
> +			break;
> +		}
> +
> +		nxmit++;
> +	}
> +
> +	if (bq->count) {
> +		flush_bulk(bq, LIBETH_XDP_TX_NDO);
> +		if (unlikely(bq->count))
> +			nxmit -= libeth_xdp_xmit_return_bulk(bq->bulk,
> +							     bq->count,
> +							     bq->dev);
> +	}
> +
> +	finalize(bq->xdpsq, nxmit, flags & XDP_XMIT_FLUSH);
> +
> +	return nxmit;
> +}
> +
> +/**
> + * libeth_xdp_xmit_do_bulk - implement full .ndo_xdp_xmit() in driver
> + * @dev: target &net_device
> + * @n: number of frames to send
> + * @fr: XDP frames to send
> + * @f: flags passed by the stack
> + * @xqs: array of XDPSQs driver structs
> + * @nqs: number of active XDPSQs, the above array length
> + * @fl: driver callback to flush an XDP xmit bulk
> + * @fin: driver cabback to finalize the queue
> + *
> + * If the driver has active XDPSQs, perform common checks and send the frames.
> + * Finalize the queue, if requested.
> + *
> + * Return: number of frames sent or -errno on error.
> + */
> +#define libeth_xdp_xmit_do_bulk(dev, n, fr, f, xqs, nqs, fl, fin)	      \
> +	_libeth_xdp_xmit_do_bulk(dev, n, fr, f, xqs, nqs, fl, fin,	      \
> +				 __UNIQUE_ID(bq_), __UNIQUE_ID(ret_),	      \
> +				 __UNIQUE_ID(nqs_))

why is __UNIQUE_ID() needed?

> +
> +#define _libeth_xdp_xmit_do_bulk(d, n, fr, f, xqs, nqs, fl, fin, ub, ur, un)  \

why a single underscore? usually we do __ for internal funcs, as you did
somewhere above.

also, why a define and not an inline func?

> +({									      \
> +	u32 un = (nqs);							      \
> +	int ur;								      \
> +									      \
> +	if (likely(un)) {						      \
> +		struct libeth_xdp_tx_bulk ub;				      \
> +									      \
> +		libeth_xdp_xmit_init_bulk(&ub, d, xqs, un);		      \
> +		ur = __libeth_xdp_xmit_do_bulk(&ub, fr, n, f, fl, fin);	      \
> +	} else {							      \
> +		ur = -ENXIO;						      \
> +	}								      \
> +									      \
> +	ur;								      \
> +})
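
Taken together, a driver's .ndo_xdp_xmit() then reduces to a thin wrapper.
A minimal sketch, assuming the driver keeps its XDPSQ array and count in its
netdev_priv (the drv_* names are illustrative; patch 15 later in the thread
shows the actual idpf version):

static int drv_xdp_xmit(struct net_device *dev, int n,
			struct xdp_frame **frames, u32 flags)
{
	const struct drv_priv *priv = netdev_priv(dev);

	if (unlikely(!netif_carrier_ok(dev)))
		return -ENETDOWN;

	return libeth_xdp_xmit_do_bulk(dev, n, frames, flags,
				       priv->xdpsqs, priv->num_xdpsqs,
				       drv_xdp_xmit_flush_bulk,
				       drv_xdp_tx_finalize);
}
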
> +
> +/* Rx polling path */
> +
> +/**
> + * libeth_xdp_tx_init_bulk - initialize an XDP Tx bulk for Rx NAPI poll
> + * @bq: bulk to initialize
> + * @prog: RCU pointer to the XDP program (can be %NULL)
> + * @dev: target &net_device
> + * @xdpsqs: array of driver XDPSQ structs
> + * @num: number of active XDPSQs, the above array length
> + *
> + * Should be called on an onstack XDP Tx bulk before the NAPI polling loop.
> + * Initializes all the needed fields to run libeth_xdp functions. If @num == 0,
> + * assumes XDP is not enabled.
> + */
> +#define libeth_xdp_tx_init_bulk(bq, prog, dev, xdpsqs, num)		      \
> +	__libeth_xdp_tx_init_bulk(bq, prog, dev, xdpsqs, num,		      \
> +				  __UNIQUE_ID(bq_), __UNIQUE_ID(nqs_))
> +
> +#define __libeth_xdp_tx_init_bulk(bq, pr, d, xdpsqs, num, ub, un) do {	      \
> +	typeof(bq) ub = (bq);						      \
> +	u32 un = (num);							      \
> +									      \
> +	rcu_read_lock();						      \
> +									      \
> +	if (un) {							      \
> +		ub->prog = rcu_dereference(pr);				      \
> +		ub->dev = (d);						      \
> +		ub->xdpsq = (xdpsqs)[libeth_xdpsq_id(un)];		      \
> +	} else {							      \
> +		ub->prog = NULL;					      \
> +	}								      \
> +									      \
> +	ub->act_mask = 0;						      \
> +	ub->count = 0;							      \
> +} while (0)
> +
> +void libeth_xdp_load_stash(struct libeth_xdp_buff *dst,
> +			   const struct libeth_xdp_buff_stash *src);
> +void libeth_xdp_save_stash(struct libeth_xdp_buff_stash *dst,
> +			   const struct libeth_xdp_buff *src);
> +void __libeth_xdp_return_stash(struct libeth_xdp_buff_stash *stash);
> +
> +/**
> + * libeth_xdp_init_buff - initialize a &libeth_xdp_buff for Rx NAPI poll
> + * @dst: onstack buffer to initialize
> + * @src: XDP buffer stash placed on the queue
> + * @rxq: registered &xdp_rxq_info corresponding to this queue
> + *
> + * Should be called before the main NAPI polling loop. Loads the content of
> + * the previously saved stash or initializes the buffer from scratch.
> + */
> +static inline void
> +libeth_xdp_init_buff(struct libeth_xdp_buff *dst,
> +		     const struct libeth_xdp_buff_stash *src,
> +		     struct xdp_rxq_info *rxq)

what is the rationale for storing/loading the xdp_buff onto/from the driver's
Rx queue? could we work directly on the xdp_buff from the Rx queue? ice is
doing so currently.

> +{
> +	if (likely(!src->data))
> +		dst->data = NULL;
> +	else
> +		libeth_xdp_load_stash(dst, src);
> +
> +	dst->base.rxq = rxq;
> +}
> +
> +/**
> + * libeth_xdp_save_buff - save a partially built buffer on a queue
> + * @dst: XDP buffer stash placed on the queue
> + * @src: onstack buffer to save
> + *
> + * Should be called after the main NAPI polling loop. If the loop exited before
> + * the buffer was finished, saves its content on the queue, so that it can be
> + * completed during the next poll. Otherwise, clears the stash.
> + */
> +static inline void libeth_xdp_save_buff(struct libeth_xdp_buff_stash *dst,
> +					const struct libeth_xdp_buff *src)
> +{
> +	if (likely(!src->data))
> +		dst->data = NULL;
> +	else
> +		libeth_xdp_save_stash(dst, src);
> +}
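
To show how the stash helpers pair with the bulk init above, here's a rough
Rx poll skeleton. It's a sketch only: the drv_* names, the queue fields and
the descriptor parsing are illustrative assumptions, while the libeth_xdp
calls follow the order the kdocs require (init before the loop, save/finalize
after it):

static u32 drv_rx_poll(struct drv_rx_queue *rxq, u32 budget)
{
	struct libeth_rq_napi_stats rs = { };
	struct libeth_xdp_tx_bulk bq;
	LIBETH_XDP_ONSTACK_BUFF(xdp);

	libeth_xdp_tx_init_bulk(&bq, rxq->xdp_prog, rxq->xdp_rxq.dev,
				rxq->xdpsqs, rxq->num_xdpsqs);
	libeth_xdp_init_buff(xdp, &rxq->xdp_stash, &rxq->xdp_rxq);

	while (rs.packets < budget) {
		const struct libeth_fqe *fqe;
		const void *desc;
		u32 len;

		/* driver-specific: break when the ring is empty, otherwise
		 * fetch the next HW descriptor, its Rx buffer (FQE) and the
		 * data length
		 */

		if (!libeth_xdp_process_buff(xdp, fqe, len))
			continue;

		/* only once the EOP descriptor has been processed */
		drv_xdp_run_pass(xdp, &bq, rxq->napi, &rs, desc);
	}

	libeth_xdp_save_buff(&rxq->xdp_stash, xdp);
	drv_xdp_finalize_rx(&bq);

	return rs.packets;
}
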
> +
> +/**
> + * libeth_xdp_return_stash - free an XDP buffer stash from a queue
> + * @stash: stash to free
> + *
> + * If the queue is about to be destroyed, but it still has an incomplete
> + * buffer stash, this helper should be called to free it.
> + */
> +static inline void libeth_xdp_return_stash(struct libeth_xdp_buff_stash *stash)
> +{
> +	if (stash->data)
> +		__libeth_xdp_return_stash(stash);
> +}
> +
> +static inline void libeth_xdp_return_va(const void *data, bool napi)
> +{
> +	netmem_ref netmem = virt_to_netmem(data);
> +
> +	page_pool_put_full_netmem(__netmem_get_pp(netmem), netmem, napi);
> +}
> +
> +static inline void libeth_xdp_return_frags(const struct skb_shared_info *sinfo,
> +					   bool napi)
> +{
> +	for (u32 i = 0; i < sinfo->nr_frags; i++) {
> +		netmem_ref netmem = skb_frag_netmem(&sinfo->frags[i]);
> +
> +		page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, napi);
> +	}
> +}
> +
> +/**
> + * libeth_xdp_return_buff - free/recycle &libeth_xdp_buff
> + * @xdp: buffer to free
> + *
> + * Hotpath helper to free &libeth_xdp_buff. Compared to xdp_return_buff(),
> + * it's faster as it gets inlined and always assumes order-0 pages and safe
> + * direct recycling. Zeroes @xdp->data to avoid UAFs.
> + */
> +#define libeth_xdp_return_buff(xdp)	__libeth_xdp_return_buff(xdp, true)
> +
> +static inline void __libeth_xdp_return_buff(struct libeth_xdp_buff *xdp,
> +					    bool napi)
> +{
> +	if (!xdp_buff_has_frags(&xdp->base))
> +		goto out;
> +
> +	libeth_xdp_return_frags(xdp_get_shared_info_from_buff(&xdp->base),
> +				napi);
> +
> +out:
> +	libeth_xdp_return_va(xdp->data, napi);
> +	xdp->data = NULL;
> +}
> +
> +bool libeth_xdp_buff_add_frag(struct libeth_xdp_buff *xdp,
> +			      const struct libeth_fqe *fqe,
> +			      u32 len);
> +
> +/**
> + * libeth_xdp_prepare_buff - fill &libeth_xdp_buff with head FQE data
> + * @xdp: XDP buffer to attach the head to
> + * @fqe: FQE containing the head buffer
> + * @len: buffer len passed from HW
> + *
> + * Internal, use libeth_xdp_process_buff() instead. Initializes XDP buffer
> + * head with the Rx buffer data: data pointer, length, headroom, and
> + * truesize/tailroom. Zeroes the flags.
> + * Uses faster single u64 write instead of per-field access.
> + */
> +static inline void libeth_xdp_prepare_buff(struct libeth_xdp_buff *xdp,
> +					   const struct libeth_fqe *fqe,
> +					   u32 len)
> +{
> +	const struct page *page = __netmem_to_page(fqe->netmem);
> +
> +#ifdef __LIBETH_WORD_ACCESS
> +	static_assert(offsetofend(typeof(xdp->base), flags) -
> +		      offsetof(typeof(xdp->base), frame_sz) ==
> +		      sizeof(u64));
> +
> +	*(u64 *)&xdp->base.frame_sz = fqe->truesize;
> +#else
> +	xdp_init_buff(&xdp->base, fqe->truesize, xdp->base.rxq);
> +#endif
> +	xdp_prepare_buff(&xdp->base, page_address(page) + fqe->offset,
> +			 page->pp->p.offset, len, true);
> +}
> +
> +/**
> + * libeth_xdp_process_buff - attach Rx buffer to &libeth_xdp_buff
> + * @xdp: XDP buffer to attach the Rx buffer to
> + * @fqe: Rx buffer to process
> + * @len: received data length from the descriptor
> + *
> + * If the XDP buffer is empty, attaches the Rx buffer as head and initializes
> + * the required fields. Otherwise, attaches the buffer as a frag.
> + * Already performs DMA sync-for-CPU and frame start prefetch
> + * (for head buffers only).
> + *
> + * Return: true on success, false if the descriptor must be skipped (empty or
> + * no space for a new frag).
> + */
> +static inline bool libeth_xdp_process_buff(struct libeth_xdp_buff *xdp,
> +					   const struct libeth_fqe *fqe,
> +					   u32 len)
> +{
> +	if (!libeth_rx_sync_for_cpu(fqe, len))
> +		return false;
> +
> +	if (xdp->data)

unlikely() ?

> +		return libeth_xdp_buff_add_frag(xdp, fqe, len);
> +
> +	libeth_xdp_prepare_buff(xdp, fqe, len);
> +
> +	prefetch(xdp->data);
> +
> +	return true;
> +}
> +
> +/**
> + * libeth_xdp_buff_stats_frags - update onstack RQ stats with XDP frags info
> + * @ss: onstack stats to update
> + * @xdp: buffer to account
> + *
> + * Internal helper used by __libeth_xdp_run_pass(), do not call directly.
> + * Adds buffer's frags count and total len to the onstack stats.
> + */
> +static inline void
> +libeth_xdp_buff_stats_frags(struct libeth_rq_napi_stats *ss,
> +			    const struct libeth_xdp_buff *xdp)
> +{
> +	const struct skb_shared_info *sinfo;
> +
> +	sinfo = xdp_get_shared_info_from_buff(&xdp->base);
> +	ss->bytes += sinfo->xdp_frags_size;
> +	ss->fragments += sinfo->nr_frags + 1;
> +}
> +
> +u32 libeth_xdp_prog_exception(const struct libeth_xdp_tx_bulk *bq,
> +			      struct libeth_xdp_buff *xdp,
> +			      enum xdp_action act, int ret);
> +
> +/**
> + * __libeth_xdp_run_prog - run XDP program on an XDP buffer
> + * @xdp: XDP buffer to run the prog on
> + * @bq: buffer bulk for ``XDP_TX`` queueing
> + *
> + * Internal inline abstraction to run XDP program. Handles ``XDP_DROP``
> + * and ``XDP_REDIRECT`` only, the rest is processed levels up.
> + * Reports an XDP prog exception on errors.
> + *
> + * Return: libeth_xdp prog verdict depending on the prog's verdict.
> + */
> +static __always_inline u32
> +__libeth_xdp_run_prog(struct libeth_xdp_buff *xdp,
> +		      const struct libeth_xdp_tx_bulk *bq)
> +{
> +	enum xdp_action act;
> +
> +	act = bpf_prog_run_xdp(bq->prog, &xdp->base);
> +	if (unlikely(act < XDP_DROP || act > XDP_REDIRECT))
> +		goto out;
> +
> +	switch (act) {
> +	case XDP_PASS:
> +		return LIBETH_XDP_PASS;
> +	case XDP_DROP:
> +		libeth_xdp_return_buff(xdp);
> +
> +		return LIBETH_XDP_DROP;
> +	case XDP_TX:
> +		return LIBETH_XDP_TX;
> +	case XDP_REDIRECT:
> +		if (unlikely(xdp_do_redirect(bq->dev, &xdp->base, bq->prog)))
> +			break;
> +
> +		xdp->data = NULL;
> +
> +		return LIBETH_XDP_REDIRECT;
> +	default:
> +		break;
> +	}
> +
> +out:
> +	return libeth_xdp_prog_exception(bq, xdp, act, 0);
> +}
> +
> +/**
> + * __libeth_xdp_run_flush - run XDP program and handle ``XDP_TX`` verdict
> + * @xdp: XDP buffer to run the prog on
> + * @bq: buffer bulk for ``XDP_TX`` queueing
> + * @run: internal callback for running XDP program
> + * @queue: internal callback for queuing ``XDP_TX`` frame
> + * @flush_bulk: driver callback for flushing a bulk
> + *
> + * Internal inline abstraction to run XDP program and additionally handle
> + * ``XDP_TX`` verdict.
> + * Do not use directly.
> + *
> + * Return: libeth_xdp prog verdict depending on the prog's verdict.
> + */
> +static __always_inline u32
> +__libeth_xdp_run_flush(struct libeth_xdp_buff *xdp,
> +		       struct libeth_xdp_tx_bulk *bq,
> +		       u32 (*run)(struct libeth_xdp_buff *xdp,
> +				  const struct libeth_xdp_tx_bulk *bq),
> +		       bool (*queue)(struct libeth_xdp_tx_bulk *bq,
> +				     struct libeth_xdp_buff *xdp,
> +				     bool (*flush_bulk)
> +					  (struct libeth_xdp_tx_bulk *bq,
> +					   u32 flags)),
> +		       bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
> +					  u32 flags))
> +{
> +	u32 act;
> +
> +	act = run(xdp, bq);
> +	if (act == LIBETH_XDP_TX && unlikely(!queue(bq, xdp, flush_bulk)))
> +		act = LIBETH_XDP_DROP;
> +
> +	bq->act_mask |= act;
> +
> +	return act;
> +}
> +
> +/**
> + * libeth_xdp_run_prog - run XDP program and handle all verdicts
> + * @xdp: XDP buffer to process
> + * @bq: XDP Tx bulk to queue ``XDP_TX`` buffers
> + * @fl: driver ``XDP_TX`` bulk flush callback
> + *
> + * Run the attached XDP program and handle all possible verdicts.
> + * Prefer using it via LIBETH_XDP_DEFINE_RUN{,_PASS,_PROG}().
> + *
> + * Return: true if the buffer should be passed up the stack, false if the poll
> + * should go to the next buffer.
> + */
> +#define libeth_xdp_run_prog(xdp, bq, fl)				      \

is this used in idpf in this patchset?

> +	(__libeth_xdp_run_flush(xdp, bq, __libeth_xdp_run_prog,		      \
> +				libeth_xdp_tx_queue_bulk,		      \
> +				fl) == LIBETH_XDP_PASS)
> +
> +/**
> + * __libeth_xdp_run_pass - helper to run XDP program and handle the result
> + * @xdp: XDP buffer to process
> + * @bq: XDP Tx bulk to queue ``XDP_TX`` frames
> + * @napi: NAPI to build an skb and pass it up the stack
> + * @rs: onstack libeth RQ stats
> + * @md: metadata that should be filled to the XDP buffer
> + * @prep: callback for filling the metadata
> + * @run: driver wrapper to run XDP program

I see it's NULLed in idpf; why have this?

> + * @populate: driver callback to populate an skb with the HW descriptor data
> + *
> + * Inline abstraction that does the following:
> + * 1) adds frame size and frag number (if needed) to the onstack stats;
> + * 2) fills the descriptor metadata into the onstack &libeth_xdp_buff;
> + * 3) runs the XDP program if present;
> + * 4) handles all possible verdicts;
> + * 5) on ``XDP_PASS``, builds an skb from the buffer;
> + * 6) populates it with the descriptor metadata;
> + * 7) passes it up the stack.
> + *
> + * In most cases, step 2 means just writing the HW descriptor pointer to the
> + * XDP buffer. If so, please use LIBETH_XDP_DEFINE_RUN{,_PASS}()
> + * wrappers to build a driver function.
> + */
> +static __always_inline void
> +__libeth_xdp_run_pass(struct libeth_xdp_buff *xdp,
> +		      struct libeth_xdp_tx_bulk *bq, struct napi_struct *napi,
> +		      struct libeth_rq_napi_stats *rs, const void *md,
> +		      void (*prep)(struct libeth_xdp_buff *xdp,
> +				   const void *md),
> +		      bool (*run)(struct libeth_xdp_buff *xdp,
> +				  struct libeth_xdp_tx_bulk *bq),
> +		      bool (*populate)(struct sk_buff *skb,
> +				       const struct libeth_xdp_buff *xdp,
> +				       struct libeth_rq_napi_stats *rs))
> +{
> +	struct sk_buff *skb;
> +
> +	rs->bytes += xdp->base.data_end - xdp->data;
> +	rs->packets++;
> +
> +	if (xdp_buff_has_frags(&xdp->base))
> +		libeth_xdp_buff_stats_frags(rs, xdp);
> +
> +	if (prep && (!__builtin_constant_p(!!md) || md))
> +		prep(xdp, md);
> +
> +	if (!bq || !run || !bq->prog)
> +		goto build;
> +
> +	if (!run(xdp, bq))
> +		return;
> +
> +build:
> +	skb = xdp_build_skb_from_buff(&xdp->base);
> +	if (unlikely(!skb)) {
> +		libeth_xdp_return_buff_slow(xdp);
> +		return;
> +	}
> +
> +	xdp->data = NULL;
> +
> +	if (unlikely(!populate(skb, xdp, rs))) {
> +		napi_consume_skb(skb, true);
> +		return;
> +	}
> +
> +	napi_gro_receive(napi, skb);
> +}
> +
> +static inline void libeth_xdp_prep_desc(struct libeth_xdp_buff *xdp,
> +					const void *desc)
> +{
> +	xdp->desc = desc;
> +}
> +
> +/**
> + * libeth_xdp_run_pass - helper to run XDP program and handle the result
> + * @xdp: XDP buffer to process
> + * @bq: XDP Tx bulk to queue ``XDP_TX`` frames
> + * @napi: NAPI to build an skb and pass it up the stack
> + * @ss: onstack libeth RQ stats
> + * @desc: pointer to the HW descriptor for that frame
> + * @run: driver wrapper to run XDP program
> + * @populate: driver callback to populate an skb with the HW descriptor data
> + *
> + * Wrapper around the underscored version when "fill the descriptor metadata"
> + * means just writing the pointer to the HW descriptor as @xdp->desc.
> + */
> +#define libeth_xdp_run_pass(xdp, bq, napi, ss, desc, run, populate)	      \
> +	__libeth_xdp_run_pass(xdp, bq, napi, ss, desc, libeth_xdp_prep_desc,  \
> +			      run, populate)
> +
> +/**
> + * libeth_xdp_finalize_rx - finalize XDPSQ after a NAPI polling loop
> + * @bq: ``XDP_TX`` frame bulk
> + * @flush: driver callback to flush the bulk
> + * @finalize: driver callback to start sending the frames and run the timer
> + *
> + * Flush the bulk if there are frames left to send, kick the queue and flush
> + * the XDP maps.
> + */
> +#define libeth_xdp_finalize_rx(bq, flush, finalize)			      \
> +	__libeth_xdp_finalize_rx(bq, 0, flush, finalize)
> +
> +static __always_inline void
> +__libeth_xdp_finalize_rx(struct libeth_xdp_tx_bulk *bq, u32 flags,
> +			 bool (*flush_bulk)(struct libeth_xdp_tx_bulk *bq,
> +					    u32 flags),
> +			 void (*finalize)(void *xdpsq, bool sent, bool flush))
> +{
> +	if (bq->act_mask & LIBETH_XDP_TX) {
> +		if (bq->count)
> +			flush_bulk(bq, flags | LIBETH_XDP_TX_DROP);
> +		finalize(bq->xdpsq, true, true);
> +	}
> +	if (bq->act_mask & LIBETH_XDP_REDIRECT)
> +		xdp_do_flush();
> +
> +	rcu_read_unlock();
> +}
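
Per the kdocs above, drivers are expected to generate the run/finalize pair
via the DEFINE macros rather than open-code them; roughly (the drv_* names
are illustrative, the pattern matches the idpf patch later in the thread):

LIBETH_XDP_DEFINE_START();
LIBETH_XDP_DEFINE_RUN(static drv_xdp_run_pass, drv_xdp_run_prog,
		      drv_xdp_tx_flush_bulk, drv_rx_process_skb_fields);
LIBETH_XDP_DEFINE_FINALIZE(static drv_xdp_finalize_rx, drv_xdp_tx_flush_bulk,
			   drv_xdp_tx_finalize);
LIBETH_XDP_DEFINE_END();
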

(...)

> +
> +/* XMO */
> +
> +/**
> + * libeth_xdp_buff_to_rq - get RQ pointer from an XDP buffer pointer
> + * @xdp: &libeth_xdp_buff corresponding to the queue
> + * @type: typeof() of the driver Rx queue structure
> + * @member: name of &xdp_rxq_info inside @type
> + *
> + * Oftentimes, a pointer to the RQ is needed when reading/filling metadata from
> + * HW descriptors. The helper can be used to quickly jump from an XDP buffer
> + * to the queue corresponding to its &xdp_rxq_info without introducing
> + * additional fields (&libeth_xdp_buff is precisely 1 cacheline long on x64).
> + */
> +#define libeth_xdp_buff_to_rq(xdp, type, member)			      \
> +	container_of_const((xdp)->base.rxq, type, member)
> +
> +/**
> + * libeth_xdpmo_rx_hash - convert &libeth_rx_pt to an XDP RSS hash metadata
> + * @hash: pointer to the variable to write the hash to
> + * @rss_type: pointer to the variable to write the hash type to
> + * @val: hash value from the HW descriptor
> + * @pt: libeth parsed packet type
> + *
> + * Handle zeroed/non-available hash and convert libeth parsed packet type to
> + * the corresponding XDP RSS hash type. To be called at the end of
> + * xdp_metadata_ops idpf_xdpmo::xmo_rx_hash() implementation.
> + * Note that if the driver doesn't use a constant packet type lookup table but
> + * generates it at runtime, it must call libeth_rx_pt_gen_hash_type(pt) to
> + * generate XDP RSS hash type for each packet type.
> + *
> + * Return: 0 on success, -ENODATA when the hash is not available.
> + */
> +static inline int libeth_xdpmo_rx_hash(u32 *hash,
> +				       enum xdp_rss_hash_type *rss_type,
> +				       u32 val, struct libeth_rx_pt pt)
> +{
> +	if (unlikely(!val))
> +		return -ENODATA;
> +
> +	*hash = val;
> +	*rss_type = pt.hash_type;
> +
> +	return 0;
> +}
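
A driver's xmo_rx_hash() callback would then end with the helper above. A
hedged sketch, assuming &libeth_xdp_buff wraps the &xdp_buff as its first
member (so the xdp_md context can be cast back), and with the drv_* names,
the xdp_rxq member name and the descriptor parsing all being illustrative:

static int drv_xdpmo_rx_hash(const struct xdp_md *ctx, u32 *hash,
			     enum xdp_rss_hash_type *rss_type)
{
	const struct libeth_xdp_buff *xdp = (typeof(xdp))ctx;
	const struct drv_rx_queue *rxq;
	struct libeth_rx_pt pt;
	u32 val;

	rxq = libeth_xdp_buff_to_rq(xdp, typeof(*rxq), xdp_rxq);

	/* illustrative: read the hash value and the parsed packet type
	 * from the HW descriptor saved as xdp->desc
	 */
	val = drv_desc_rx_hash(xdp->desc);
	pt = rxq->rx_ptype_lkup[drv_desc_rx_ptype(xdp->desc)];

	return libeth_xdpmo_rx_hash(hash, rss_type, val, pt);
}
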
> +
> +/* Tx buffer completion */
> +
> +void libeth_xdp_return_buff_bulk(const struct skb_shared_info *sinfo,
> +				 struct xdp_frame_bulk *bq, bool frags);
> +
> +/**
> + * __libeth_xdp_complete_tx - complete sent XDPSQE
> + * @sqe: SQ element / Tx buffer to complete
> + * @cp: Tx polling/completion params
> + * @bulk: internal callback to bulk-free ``XDP_TX`` buffers
> + *
> + * Use the non-underscored version in drivers instead. This one is shared
> + * internally with libeth_tx_complete_any().
> + * Complete an XDPSQE of any type of XDP frame. This includes DMA unmapping
> + * when needed, buffer freeing, stats update, and SQE invalidating.
> + */
> +static __always_inline void
> +__libeth_xdp_complete_tx(struct libeth_sqe *sqe, struct libeth_cq_pp *cp,
> +			 typeof(libeth_xdp_return_buff_bulk) bulk)
> +{
> +	enum libeth_sqe_type type = sqe->type;
> +
> +	switch (type) {
> +	case LIBETH_SQE_EMPTY:
> +		return;
> +	case LIBETH_SQE_XDP_XMIT:
> +	case LIBETH_SQE_XDP_XMIT_FRAG:
> +		dma_unmap_page(cp->dev, dma_unmap_addr(sqe, dma),
> +			       dma_unmap_len(sqe, len), DMA_TO_DEVICE);
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	switch (type) {
> +	case LIBETH_SQE_XDP_TX:
> +		bulk(sqe->sinfo, cp->bq, sqe->nr_frags != 1);
> +		break;
> +	case LIBETH_SQE_XDP_XMIT:
> +		xdp_return_frame_bulk(sqe->xdpf, cp->bq);
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	switch (type) {
> +	case LIBETH_SQE_XDP_TX:
> +	case LIBETH_SQE_XDP_XMIT:
> +		cp->xdp_tx -= sqe->nr_frags;
> +
> +		cp->xss->packets++;
> +		cp->xss->bytes += sqe->bytes;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	sqe->type = LIBETH_SQE_EMPTY;
> +}
> +
> +static inline void libeth_xdp_complete_tx(struct libeth_sqe *sqe,
> +					  struct libeth_cq_pp *cp)
> +{
> +	__libeth_xdp_complete_tx(sqe, cp, libeth_xdp_return_buff_bulk);
> +}
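
For context, the completion helper is meant to be called per SQE from the
driver's XDPSQ clean routine, with an &xdp_frame_bulk for batched freeing.
A sketch of the inner part (the xdpq fields and done_frames, the count
returned by the driver's completion-descriptor parsing, are assumptions; the
idpf patch later in the thread follows the same shape):

	u32 ntc = xdpq->next_to_clean, cnt = xdpq->desc_count;
	struct libeth_xdpsq_napi_stats ss = { };
	struct xdp_frame_bulk bq;
	struct libeth_cq_pp cp = {
		.dev	= xdpq->dev,
		.bq	= &bq,
		.xss	= &ss,
		.napi	= true,
	};

	xdp_frame_bulk_init(&bq);

	for (u32 i = 0; i < done_frames; i++) {
		libeth_xdp_complete_tx(&xdpq->tx_buf[ntc], &cp);

		if (unlikely(++ntc == cnt))
			ntc = 0;
	}

	xdp_flush_frame_bulk(&bq);
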
> +
> +/* Misc */
> +
> +u32 libeth_xdp_queue_threshold(u32 count);
> +
> +void __libeth_xdp_set_features(struct net_device *dev,
> +			       const struct xdp_metadata_ops *xmo);
> +void libeth_xdp_set_redirect(struct net_device *dev, bool enable);
> +
> +/**
> + * libeth_xdp_set_features - set XDP features for netdev
> + * @dev: &net_device to configure
> + * @...: optional params, see __libeth_xdp_set_features()
> + *
> + * Set all the features libeth_xdp supports, including .ndo_xdp_xmit(). That
> + * said, it should be used only when XDPSQs are always available regardless
> + * of whether an XDP prog is attached to @dev.
> + */
> +#define libeth_xdp_set_features(dev, ...)				      \
> +	CONCATENATE(__libeth_xdp_feat,					      \
> +		    COUNT_ARGS(__VA_ARGS__))(dev, ##__VA_ARGS__)
> +
> +#define __libeth_xdp_feat0(dev)						      \
> +	__libeth_xdp_set_features(dev, NULL)
> +#define __libeth_xdp_feat1(dev, xmo)					      \
> +	__libeth_xdp_set_features(dev, xmo)
> +
> +/**
> + * libeth_xdp_set_features_noredir - enable all libeth_xdp features w/o redir
> + * @dev: target &net_device
> + * @...: optional params, see __libeth_xdp_set_features()
> + *
> + * Enable everything except the .ndo_xdp_xmit() feature, use when XDPSQs are
> + * not available right after netdev registration.
> + */
> +#define libeth_xdp_set_features_noredir(dev, ...)			      \
> +	__libeth_xdp_set_features_noredir(dev, __UNIQUE_ID(dev_),	      \
> +					  ##__VA_ARGS__)
> +
> +#define __libeth_xdp_set_features_noredir(dev, ud, ...) do {		      \
> +	struct net_device *ud = (dev);					      \
> +									      \
> +	libeth_xdp_set_features(ud, ##__VA_ARGS__);			      \
> +	libeth_xdp_set_redirect(ud, false);				      \
> +} while (0)
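
For context, a driver that only allocates XDPSQs while a prog is attached
would call these roughly as follows (the netdev/prog variables are
illustrative; this mirrors what the idpf patches later in the thread do):

	/* at netdev config time: everything except the xmit feature */
	libeth_xdp_set_features_noredir(netdev);

	/* in ndo_bpf, once XDPSQs are (de)allocated along with the prog */
	libeth_xdp_set_redirect(netdev, !!prog);
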
> +
> +#endif /* __LIBETH_XDP_H */
> diff --git a/drivers/net/ethernet/intel/libeth/tx.c b/drivers/net/ethernet/intel/libeth/tx.c
> new file mode 100644
> index 000000000000..227c841ab16a
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/libeth/tx.c
> @@ -0,0 +1,38 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (C) 2025 Intel Corporation */
> +
> +#define DEFAULT_SYMBOL_NAMESPACE	"LIBETH"
> +
> +#include <net/libeth/xdp.h>
> +
> +#include "priv.h"
> +
> +/* Tx buffer completion */
> +
> +DEFINE_STATIC_CALL_NULL(bulk, libeth_xdp_return_buff_bulk);
> +
> +/**
> + * libeth_tx_complete_any - perform Tx completion for one SQE of any type
> + * @sqe: Tx buffer to complete
> + * @cp: polling params
> + *
> + * Can be used to complete both regular and XDP SQEs, for example when
> + * destroying queues.
> + * When libeth_xdp is not loaded, XDPSQEs won't be handled.
> + */
> +void libeth_tx_complete_any(struct libeth_sqe *sqe, struct libeth_cq_pp *cp)
> +{
> +	if (sqe->type >= __LIBETH_SQE_XDP_START)
> +		__libeth_xdp_complete_tx(sqe, cp, static_call(bulk));
> +	else
> +		libeth_tx_complete(sqe, cp);
> +}
> +EXPORT_SYMBOL_GPL(libeth_tx_complete_any);
> +
> +/* Module */
> +
> +void libeth_attach_xdp(const struct libeth_xdp_ops *ops)
> +{
> +	static_call_update(bulk, ops ? ops->bulk : NULL);
> +}
> +EXPORT_SYMBOL_GPL(libeth_attach_xdp);
> diff --git a/drivers/net/ethernet/intel/libeth/xdp.c b/drivers/net/ethernet/intel/libeth/xdp.c
> new file mode 100644
> index 000000000000..dbede9a696a7
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/libeth/xdp.c
> @@ -0,0 +1,431 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (C) 2025 Intel Corporation */
> +
> +#define DEFAULT_SYMBOL_NAMESPACE	"LIBETH_XDP"
> +
> +#include <net/libeth/xdp.h>
> +
> +#include "priv.h"
> +
> +/* XDPSQ sharing */
> +
> +DEFINE_STATIC_KEY_FALSE(libeth_xdpsq_share);
> +EXPORT_SYMBOL_GPL(libeth_xdpsq_share);
> +
> +void __libeth_xdpsq_get(struct libeth_xdpsq_lock *lock,
> +			const struct net_device *dev)
> +{
> +	bool warn;
> +
> +	spin_lock_init(&lock->lock);
> +	lock->share = true;
> +
> +	warn = !static_key_enabled(&libeth_xdpsq_share);
> +	static_branch_inc(&libeth_xdpsq_share);
> +
> +	if (warn && net_ratelimit())
> +		netdev_warn(dev, "XDPSQ sharing enabled, possible XDP Tx slowdown\n");
> +}
> +EXPORT_SYMBOL_GPL(__libeth_xdpsq_get);
> +
> +void __libeth_xdpsq_put(struct libeth_xdpsq_lock *lock,
> +			const struct net_device *dev)
> +{
> +	static_branch_dec(&libeth_xdpsq_share);
> +
> +	if (!static_key_enabled(&libeth_xdpsq_share) && net_ratelimit())
> +		netdev_notice(dev, "XDPSQ sharing disabled\n");
> +
> +	lock->share = false;
> +}
> +EXPORT_SYMBOL_GPL(__libeth_xdpsq_put);
> +
> +void __acquires(&lock->lock)
> +__libeth_xdpsq_lock(struct libeth_xdpsq_lock *lock)
> +{
> +	spin_lock(&lock->lock);
> +}
> +EXPORT_SYMBOL_GPL(__libeth_xdpsq_lock);
> +
> +void __releases(&lock->lock)
> +__libeth_xdpsq_unlock(struct libeth_xdpsq_lock *lock)
> +{
> +	spin_unlock(&lock->lock);
> +}
> +EXPORT_SYMBOL_GPL(__libeth_xdpsq_unlock);
> +
> +/* XDPSQ clean-up timers */
> +
> +/**
> + * libeth_xdpsq_init_timer - initialize an XDPSQ clean-up timer
> + * @timer: timer to initialize
> + * @xdpsq: queue this timer belongs to
> + * @lock: corresponding XDPSQ lock
> + * @poll: queue polling/completion function
> + *
> + * XDPSQ clean-up timers must be set up at queue configuration time, before
> + * the queue is used. Set the required pointers and the cleaning callback.
> + */
> +void libeth_xdpsq_init_timer(struct libeth_xdpsq_timer *timer, void *xdpsq,
> +			     struct libeth_xdpsq_lock *lock,
> +			     void (*poll)(struct work_struct *work))
> +{
> +	timer->xdpsq = xdpsq;
> +	timer->lock = lock;
> +
> +	INIT_DELAYED_WORK(&timer->dwork, poll);
> +}
> +EXPORT_SYMBOL_GPL(libeth_xdpsq_init_timer);
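
The expected timer lifecycle, sketched with illustrative drv_*/xdpq names
(cf. the idpf patch later in the thread): init at queue configuration, arm
from the Tx finalize path, tear down on queue destroy.

	/* queue config, after allocating xdpq->timer */
	libeth_xdpsq_init_timer(xdpq->timer, xdpq, &xdpq->xdp_lock,
				drv_xdp_tx_timer);

	/* Tx finalize path, after ringing the doorbell */
	libeth_xdpsq_queue_timer(xdpq->timer);

	/* queue destroy */
	libeth_xdpsq_deinit_timer(xdpq->timer);
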
> +
> +/* ``XDP_TX`` bulking */
> +
> +static void __cold
> +libeth_xdp_tx_return_one(const struct libeth_xdp_tx_frame *frm)
> +{
> +	if (frm->len_fl & LIBETH_XDP_TX_MULTI)
> +		libeth_xdp_return_frags(frm->data + frm->soff, true);
> +
> +	libeth_xdp_return_va(frm->data, true);
> +}
> +
> +static void __cold
> +libeth_xdp_tx_return_bulk(const struct libeth_xdp_tx_frame *bq, u32 count)
> +{
> +	for (u32 i = 0; i < count; i++) {
> +		const struct libeth_xdp_tx_frame *frm = &bq[i];
> +
> +		if (!(frm->len_fl & LIBETH_XDP_TX_FIRST))
> +			continue;
> +
> +		libeth_xdp_tx_return_one(frm);
> +	}
> +}
> +
> +static void __cold libeth_trace_xdp_exception(const struct net_device *dev,
> +					      const struct bpf_prog *prog,
> +					      u32 act)
> +{
> +	trace_xdp_exception(dev, prog, act);
> +}
> +
> +/**
> + * libeth_xdp_tx_exception - handle Tx exceptions of XDP frames
> + * @bq: XDP Tx frame bulk
> + * @sent: number of frames sent successfully (from this bulk)
> + * @flags: internal libeth_xdp flags (.ndo_xdp_xmit etc.)
> + *
> + * Cold helper used by __libeth_xdp_tx_flush_bulk(), do not call directly.
> + * Reports XDP Tx exceptions, frees the frames that won't be sent, or adjusts
> + * the Tx bulk to try again later.
> + */
> +void __cold libeth_xdp_tx_exception(struct libeth_xdp_tx_bulk *bq, u32 sent,
> +				    u32 flags)
> +{
> +	const struct libeth_xdp_tx_frame *pos = &bq->bulk[sent];
> +	u32 left = bq->count - sent;
> +
> +	if (!(flags & LIBETH_XDP_TX_NDO))
> +		libeth_trace_xdp_exception(bq->dev, bq->prog, XDP_TX);
> +
> +	if (!(flags & LIBETH_XDP_TX_DROP)) {
> +		memmove(bq->bulk, pos, left * sizeof(*bq->bulk));

can this overflow if the queue got stuck for some reason?

> +		bq->count = left;
> +
> +		return;
> +	}
> +
> +	if (!(flags & LIBETH_XDP_TX_NDO))
> +		libeth_xdp_tx_return_bulk(pos, left);
> +	else
> +		libeth_xdp_xmit_return_bulk(pos, left, bq->dev);
> +
> +	bq->count = 0;
> +}
> +EXPORT_SYMBOL_GPL(libeth_xdp_tx_exception);
> +
> +/* .ndo_xdp_xmit() implementation */
> +
> +u32 __cold libeth_xdp_xmit_return_bulk(const struct libeth_xdp_tx_frame *bq,
> +				       u32 count, const struct net_device *dev)
> +{
> +	u32 n = 0;
> +
> +	for (u32 i = 0; i < count; i++) {
> +		const struct libeth_xdp_tx_frame *frm = &bq[i];
> +		dma_addr_t dma;
> +
> +		if (frm->flags & LIBETH_XDP_TX_FIRST)
> +			dma = *libeth_xdp_xmit_frame_dma(frm->xdpf);
> +		else
> +			dma = dma_unmap_addr(frm, dma);
> +
> +		dma_unmap_page(dev->dev.parent, dma, dma_unmap_len(frm, len),
> +			       DMA_TO_DEVICE);
> +
> +		/* Actual xdp_frames are freed by the core */
> +		n += !!(frm->flags & LIBETH_XDP_TX_FIRST);
> +	}
> +
> +	return n;
> +}
> +EXPORT_SYMBOL_GPL(libeth_xdp_xmit_return_bulk);
> +
> +/* Rx polling path */
> +
> +/**
> + * libeth_xdp_load_stash - recreate an &xdp_buff from libeth_xdp buffer stash
> + * @dst: target &libeth_xdp_buff to initialize
> + * @src: source stash
> + *
> + * External helper used by libeth_xdp_init_buff(), do not call directly.
> + * Recreate an onstack &libeth_xdp_buff using the stash saved earlier.
> + * The only field untouched (rxq) is initialized later in the
> + * abovementioned function.
> + */
> +void libeth_xdp_load_stash(struct libeth_xdp_buff *dst,
> +			   const struct libeth_xdp_buff_stash *src)
> +{
> +	dst->data = src->data;
> +	dst->base.data_end = src->data + src->len;
> +	dst->base.data_meta = src->data;
> +	dst->base.data_hard_start = src->data - src->headroom;
> +
> +	dst->base.frame_sz = src->frame_sz;
> +	dst->base.flags = src->flags;
> +}
> +EXPORT_SYMBOL_GPL(libeth_xdp_load_stash);
> +
> +/**
> + * libeth_xdp_save_stash - convert &xdp_buff to a libeth_xdp buffer stash
> + * @dst: target &libeth_xdp_buff_stash to initialize
> + * @src: source XDP buffer
> + *
> + * External helper used by libeth_xdp_save_buff(), do not call directly.
> + * Use the fields from the passed XDP buffer to initialize the stash on the
> + * queue, so that a partially received frame can be finished later during
> + * the next NAPI poll.
> + */
> +void libeth_xdp_save_stash(struct libeth_xdp_buff_stash *dst,
> +			   const struct libeth_xdp_buff *src)
> +{
> +	dst->data = src->data;
> +	dst->headroom = src->data - src->base.data_hard_start;
> +	dst->len = src->base.data_end - src->data;
> +
> +	dst->frame_sz = src->base.frame_sz;
> +	dst->flags = src->base.flags;
> +
> +	WARN_ON_ONCE(dst->flags != src->base.flags);
> +}
> +EXPORT_SYMBOL_GPL(libeth_xdp_save_stash);
> +
> +void __libeth_xdp_return_stash(struct libeth_xdp_buff_stash *stash)
> +{
> +	LIBETH_XDP_ONSTACK_BUFF(xdp);
> +
> +	libeth_xdp_load_stash(xdp, stash);
> +	libeth_xdp_return_buff_slow(xdp);
> +
> +	stash->data = NULL;
> +}
> +EXPORT_SYMBOL_GPL(__libeth_xdp_return_stash);
> +
> +/**
> + * libeth_xdp_return_buff_slow - free &libeth_xdp_buff
> + * @xdp: buffer to free/return
> + *
> + * Slowpath version of libeth_xdp_return_buff() to be called on exceptions,
> + * queue clean-ups etc., without unwanted inlining.
> + */
> +void __cold libeth_xdp_return_buff_slow(struct libeth_xdp_buff *xdp)
> +{
> +	__libeth_xdp_return_buff(xdp, false);
> +}
> +EXPORT_SYMBOL_GPL(libeth_xdp_return_buff_slow);
> +
> +/**
> + * libeth_xdp_buff_add_frag - add frag to XDP buffer
> + * @xdp: head XDP buffer
> + * @fqe: Rx buffer containing the frag
> + * @len: frag length reported by HW
> + *
> + * External helper used by libeth_xdp_process_buff(), do not call directly.
> + * Frees both head and frag buffers on error.
> + *
> + * Return: true on success, false on error (no space for a new frag).
> + */
> +bool libeth_xdp_buff_add_frag(struct libeth_xdp_buff *xdp,
> +			      const struct libeth_fqe *fqe,
> +			      u32 len)
> +{
> +	netmem_ref netmem = fqe->netmem;
> +
> +	if (!xdp_buff_add_frag(&xdp->base, netmem,
> +			       fqe->offset + netmem_get_pp(netmem)->p.offset,
> +			       len, fqe->truesize))
> +		goto recycle;
> +
> +	return true;
> +
> +recycle:
> +	libeth_rx_recycle_slow(netmem);
> +	libeth_xdp_return_buff_slow(xdp);
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(libeth_xdp_buff_add_frag);
> +
> +/**
> + * libeth_xdp_prog_exception - handle XDP prog exceptions
> + * @bq: XDP Tx bulk
> + * @xdp: buffer to process
> + * @act: original XDP prog verdict
> + * @ret: error code if redirect failed
> + *
> + * External helper used by __libeth_xdp_run_prog(), do not call directly.
> + * Reports an invalid @act, emits an XDP exception trace event, and frees the
> + * buffer.
> + *
> + * Return: libeth_xdp XDP prog verdict.
> + */
> +u32 __cold libeth_xdp_prog_exception(const struct libeth_xdp_tx_bulk *bq,
> +				     struct libeth_xdp_buff *xdp,
> +				     enum xdp_action act, int ret)
> +{
> +	if (act > XDP_REDIRECT)
> +		bpf_warn_invalid_xdp_action(bq->dev, bq->prog, act);
> +
> +	libeth_trace_xdp_exception(bq->dev, bq->prog, act);
> +	libeth_xdp_return_buff_slow(xdp);
> +
> +	return LIBETH_XDP_DROP;
> +}
> +EXPORT_SYMBOL_GPL(libeth_xdp_prog_exception);
> +
> +/* Tx buffer completion */
> +
> +static void libeth_xdp_put_netmem_bulk(netmem_ref netmem,
> +				       struct xdp_frame_bulk *bq)
> +{
> +	if (unlikely(bq->count == XDP_BULK_QUEUE_SIZE))
> +		xdp_flush_frame_bulk(bq);
> +
> +	bq->q[bq->count++] = netmem;
> +}
> +
> +/**
> + * libeth_xdp_return_buff_bulk - free &xdp_buff as part of a bulk
> + * @sinfo: shared info corresponding to the buffer
> + * @bq: XDP frame bulk to store the buffer
> + * @frags: whether the buffer has frags
> + *
> + * Same as xdp_return_frame_bulk(), but for &libeth_xdp_buff, speeds up Tx
> + * completion of ``XDP_TX`` buffers and allows freeing them in the same bulks
> + * as &xdp_frame buffers.
> + */
> +void libeth_xdp_return_buff_bulk(const struct skb_shared_info *sinfo,
> +				 struct xdp_frame_bulk *bq, bool frags)
> +{
> +	if (!frags)
> +		goto head;
> +
> +	for (u32 i = 0; i < sinfo->nr_frags; i++)
> +		libeth_xdp_put_netmem_bulk(skb_frag_netmem(&sinfo->frags[i]),
> +					   bq);
> +
> +head:
> +	libeth_xdp_put_netmem_bulk(virt_to_netmem(sinfo), bq);
> +}
> +EXPORT_SYMBOL_GPL(libeth_xdp_return_buff_bulk);

(...)


* Re: [PATCH net-next 00/16] idpf: add XDP support
  2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
                   ` (15 preceding siblings ...)
  2025-03-05 16:21 ` [PATCH net-next 16/16] idpf: add XDP RSS hash hint Alexander Lobakin
@ 2025-03-11 15:28 ` Alexander Lobakin
  16 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-11 15:28 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Michal Kubiak, Maciej Fijalkowski, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Wed, 5 Mar 2025 17:21:16 +0100

> Add XDP support (w/o XSk yet) to the idpf driver using the libeth_xdp
> sublib, which will be then reused in at least iavf and ice.

Ok, today I'm back at work.

First of all, sorry for the confusion: the subject prefix should've been
"PATCH iwl-next", as all the code is under intel/ and it will go
through Tony.
I'll change it when sending v2.

Now I'll be checking the comments...

Thanks,
Olek


* Re: [PATCH net-next 14/16] idpf: add support for XDP on Rx
  2025-03-05 16:21 ` [PATCH net-next 14/16] idpf: add support for XDP on Rx Alexander Lobakin
@ 2025-03-11 15:50   ` Maciej Fijalkowski
  2025-04-08 13:28     ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-11 15:50 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:30PM +0100, Alexander Lobakin wrote:
> Use libeth XDP infra to support running XDP program on Rx polling.
> This includes all of the possible verdicts/actions.
> XDP Tx queues are cleaned only in "lazy" mode, when less than 1/4 of the
> descriptors on the ring are free. libeth helper macros to define
> driver-specific XDP functions make sure the compiler can uninline
> them when needed.
> Use __LIBETH_WORD_ACCESS to parse descriptors more efficiently when
> applicable. It really gives some good boosts and code size reduction
> on x86_64.
> 
> Co-developed-by: Michal Kubiak <michal.kubiak@intel.com>
> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/idpf_txrx.h |   4 +-
>  drivers/net/ethernet/intel/idpf/xdp.h       | 100 ++++++++++++-
>  drivers/net/ethernet/intel/idpf/idpf_lib.c  |   2 +
>  drivers/net/ethernet/intel/idpf/idpf_txrx.c |  23 +--
>  drivers/net/ethernet/intel/idpf/xdp.c       | 155 +++++++++++++++++++-
>  5 files changed, 264 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> index e36c55baf23f..5d62074c94b1 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
> @@ -684,8 +684,8 @@ struct idpf_tx_queue {
>  	__cacheline_group_end_aligned(read_mostly);
>  
>  	__cacheline_group_begin_aligned(read_write);
> -	u16 next_to_use;
> -	u16 next_to_clean;
> +	u32 next_to_use;
> +	u32 next_to_clean;
>  
>  	union {
>  		struct {
> diff --git a/drivers/net/ethernet/intel/idpf/xdp.h b/drivers/net/ethernet/intel/idpf/xdp.h
> index a72a7638a6ea..fde85528a315 100644
> --- a/drivers/net/ethernet/intel/idpf/xdp.h
> +++ b/drivers/net/ethernet/intel/idpf/xdp.h
> @@ -4,12 +4,9 @@
>  #ifndef _IDPF_XDP_H_
>  #define _IDPF_XDP_H_
>  
> -#include <linux/types.h>
> +#include <net/libeth/xdp.h>
>  
> -struct bpf_prog;
> -struct idpf_vport;
> -struct net_device;
> -struct netdev_bpf;
> +#include "idpf_txrx.h"
>  
>  int idpf_xdp_rxq_info_init_all(const struct idpf_vport *vport);
>  void idpf_xdp_rxq_info_deinit_all(const struct idpf_vport *vport);
> @@ -19,6 +16,99 @@ void idpf_copy_xdp_prog_to_qs(const struct idpf_vport *vport,
>  int idpf_vport_xdpq_get(const struct idpf_vport *vport);
>  void idpf_vport_xdpq_put(const struct idpf_vport *vport);
>  
> +bool idpf_xdp_tx_flush_bulk(struct libeth_xdp_tx_bulk *bq, u32 flags);
> +
> +/**
> + * idpf_xdp_tx_xmit - produce a single HW Tx descriptor out of XDP desc
> + * @desc: XDP descriptor to pull the DMA address and length from
> + * @i: descriptor index on the queue to fill
> + * @sq: XDP queue to produce the HW Tx descriptor on
> + * @priv: &xsk_tx_metadata_ops on XSk xmit or %NULL
> + */
> +static inline void idpf_xdp_tx_xmit(struct libeth_xdp_tx_desc desc, u32 i,
> +				    const struct libeth_xdpsq *sq, u64 priv)
> +{
> +	struct idpf_flex_tx_desc *tx_desc = sq->descs;
> +	u32 cmd;
> +
> +	cmd = FIELD_PREP(IDPF_FLEX_TXD_QW1_DTYPE_M,
> +			 IDPF_TX_DESC_DTYPE_FLEX_L2TAG1_L2TAG2);
> +	if (desc.flags & LIBETH_XDP_TX_LAST)
> +		cmd |= FIELD_PREP(IDPF_FLEX_TXD_QW1_CMD_M,
> +				  IDPF_TX_DESC_CMD_EOP);
> +	if (priv && (desc.flags & LIBETH_XDP_TX_CSUM))
> +		cmd |= FIELD_PREP(IDPF_FLEX_TXD_QW1_CMD_M,
> +				  IDPF_TX_FLEX_DESC_CMD_CS_EN);
> +
> +	tx_desc = &tx_desc[i];
> +	tx_desc->buf_addr = cpu_to_le64(desc.addr);
> +#ifdef __LIBETH_WORD_ACCESS
> +	*(u64 *)&tx_desc->qw1 = ((u64)desc.len << 48) | cmd;
> +#else
> +	tx_desc->qw1.buf_size = cpu_to_le16(desc.len);
> +	tx_desc->qw1.cmd_dtype = cpu_to_le16(cmd);
> +#endif
> +}
> +
> +/**
> + * idpf_set_rs_bit - set RS bit on last produced descriptor
> + * @xdpq: XDP queue to produce the HW Tx descriptors on
> + */
> +static inline void idpf_set_rs_bit(const struct idpf_tx_queue *xdpq)
> +{
> +	u32 ntu, cmd;
> +
> +	ntu = xdpq->next_to_use;
> +	if (unlikely(!ntu))
> +		ntu = xdpq->desc_count;
> +
> +	cmd = FIELD_PREP(IDPF_FLEX_TXD_QW1_CMD_M, IDPF_TX_DESC_CMD_RS);
> +#ifdef __LIBETH_WORD_ACCESS
> +	*(u64 *)&xdpq->flex_tx[ntu - 1].q.qw1 |= cmd;
> +#else
> +	xdpq->flex_tx[ntu - 1].q.qw1.cmd_dtype |= cpu_to_le16(cmd);
> +#endif
> +}
> +
> +/**
> + * idpf_xdpq_update_tail - update the XDP Tx queue tail register
> + * @xdpq: XDP Tx queue
> + */
> +static inline void idpf_xdpq_update_tail(const struct idpf_tx_queue *xdpq)
> +{
> +	dma_wmb();
> +	writel_relaxed(xdpq->next_to_use, xdpq->tail);
> +}
> +
> +/**
> + * idpf_xdp_tx_finalize - Update RS bit and bump XDP Tx tail
> + * @_xdpq: XDP Tx queue
> + * @sent: whether any frames were sent
> + * @flush: whether to update RS bit and the tail register
> + *
> + * This function bumps XDP Tx tail and should be called when a batch of packets
> + * has been processed in the napi loop.
> + */
> +static inline void idpf_xdp_tx_finalize(void *_xdpq, bool sent, bool flush)
> +{
> +	struct idpf_tx_queue *xdpq = _xdpq;
> +
> +	if ((!flush || unlikely(!sent)) &&
> +	    likely(xdpq->desc_count != xdpq->pending))
> +		return;
> +
> +	libeth_xdpsq_lock(&xdpq->xdp_lock);
> +
> +	idpf_set_rs_bit(xdpq);
> +	idpf_xdpq_update_tail(xdpq);
> +
> +	libeth_xdpsq_queue_timer(xdpq->timer);
> +
> +	libeth_xdpsq_unlock(&xdpq->xdp_lock);
> +}
> +
> +void idpf_xdp_set_features(const struct idpf_vport *vport);
> +
>  int idpf_xdp(struct net_device *dev, struct netdev_bpf *xdp);
>  
>  #endif /* _IDPF_XDP_H_ */
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
> index 84ca8c08bd56..2d1efcb854be 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
> @@ -814,6 +814,8 @@ static int idpf_cfg_netdev(struct idpf_vport *vport)
>  	netdev->features |= dflt_features;
>  	netdev->hw_features |= dflt_features | offloads;
>  	netdev->hw_enc_features |= dflt_features | offloads;
> +	idpf_xdp_set_features(vport);
> +
>  	idpf_set_ethtool_ops(netdev);
>  	netif_set_affinity_auto(netdev);
>  	SET_NETDEV_DEV(netdev, &adapter->pdev->dev);
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> index f25c50d8947b..cddcc5fc291f 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> @@ -1,8 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /* Copyright (C) 2023 Intel Corporation */
>  
> -#include <net/libeth/xdp.h>
> -
>  #include "idpf.h"
>  #include "idpf_virtchnl.h"
>  #include "xdp.h"
> @@ -3247,14 +3245,12 @@ static bool idpf_rx_process_skb_fields(struct sk_buff *skb,
>  	return !__idpf_rx_process_skb_fields(rxq, skb, xdp->desc);
>  }
>  
> -static void
> -idpf_xdp_run_pass(struct libeth_xdp_buff *xdp, struct napi_struct *napi,
> -		  struct libeth_rq_napi_stats *ss,
> -		  const struct virtchnl2_rx_flex_desc_adv_nic_3 *desc)
> -{
> -	libeth_xdp_run_pass(xdp, NULL, napi, ss, desc, NULL,
> -			    idpf_rx_process_skb_fields);
> -}
> +LIBETH_XDP_DEFINE_START();
> +LIBETH_XDP_DEFINE_RUN(static idpf_xdp_run_pass, idpf_xdp_run_prog,
> +		      idpf_xdp_tx_flush_bulk, idpf_rx_process_skb_fields);
> +LIBETH_XDP_DEFINE_FINALIZE(static idpf_xdp_finalize_rx, idpf_xdp_tx_flush_bulk,
> +			   idpf_xdp_tx_finalize);
> +LIBETH_XDP_DEFINE_END();
>  
>  /**
>   * idpf_rx_hsplit_wa - handle header buffer overflows and split errors
> @@ -3338,9 +3334,12 @@ static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
>  {
>  	struct idpf_buf_queue *rx_bufq = NULL;
>  	struct libeth_rq_napi_stats rs = { };
> +	struct libeth_xdp_tx_bulk bq;
>  	LIBETH_XDP_ONSTACK_BUFF(xdp);
>  	u16 ntc = rxq->next_to_clean;
>  
> +	libeth_xdp_tx_init_bulk(&bq, rxq->xdp_prog, rxq->xdp_rxq.dev,
> +				rxq->xdpqs, rxq->num_xdp_txq);
>  	libeth_xdp_init_buff(xdp, &rxq->xdp, &rxq->xdp_rxq);
>  
>  	/* Process Rx packets bounded by budget */
> @@ -3435,11 +3434,13 @@ static int idpf_rx_splitq_clean(struct idpf_rx_queue *rxq, int budget)
>  		if (!idpf_rx_splitq_is_eop(rx_desc) || unlikely(!xdp->data))
>  			continue;
>  
> -		idpf_xdp_run_pass(xdp, rxq->napi, &rs, rx_desc);
> +		idpf_xdp_run_pass(xdp, &bq, rxq->napi, &rs, rx_desc);
>  	}
>  
>  	rxq->next_to_clean = ntc;
> +
>  	libeth_xdp_save_buff(&rxq->xdp, xdp);
> +	idpf_xdp_finalize_rx(&bq);
>  
>  	u64_stats_update_begin(&rxq->stats_sync);
>  	u64_stats_add(&rxq->q_stats.packets, rs.packets);
> diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
> index c0322fa7bfee..abf75e840c0a 100644
> --- a/drivers/net/ethernet/intel/idpf/xdp.c
> +++ b/drivers/net/ethernet/intel/idpf/xdp.c
> @@ -1,8 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /* Copyright (C) 2024 Intel Corporation */
>  
> -#include <net/libeth/xdp.h>
> -
>  #include "idpf.h"
>  #include "idpf_virtchnl.h"
>  #include "xdp.h"
> @@ -143,6 +141,8 @@ void idpf_copy_xdp_prog_to_qs(const struct idpf_vport *vport,
>  	idpf_rxq_for_each(vport, idpf_xdp_rxq_assign_prog, xdp_prog);
>  }
>  
> +static void idpf_xdp_tx_timer(struct work_struct *work);
> +
>  int idpf_vport_xdpq_get(const struct idpf_vport *vport)
>  {
>  	struct libeth_xdpsq_timer **timers __free(kvfree) = NULL;
> @@ -183,6 +183,8 @@ int idpf_vport_xdpq_get(const struct idpf_vport *vport)
>  
>  		xdpq->timer = timers[i - sqs];
>  		libeth_xdpsq_get(&xdpq->xdp_lock, dev, vport->xdpq_share);
> +		libeth_xdpsq_init_timer(xdpq->timer, xdpq, &xdpq->xdp_lock,
> +					idpf_xdp_tx_timer);
>  
>  		xdpq->pending = 0;
>  		xdpq->xdp_tx = 0;
> @@ -209,6 +211,7 @@ void idpf_vport_xdpq_put(const struct idpf_vport *vport)
>  		if (!idpf_queue_has_clear(XDP, xdpq))
>  			continue;
>  
> +		libeth_xdpsq_deinit_timer(xdpq->timer);
>  		libeth_xdpsq_put(&xdpq->xdp_lock, dev);
>  
>  		kfree(xdpq->timer);
> @@ -216,6 +219,154 @@ void idpf_vport_xdpq_put(const struct idpf_vport *vport)
>  	}
>  }
>  
> +static int
> +idpf_xdp_parse_compl_desc(const struct idpf_splitq_4b_tx_compl_desc *desc,
> +			  bool gen)
> +{
> +	u32 val;
> +
> +#ifdef __LIBETH_WORD_ACCESS
> +	val = *(const u32 *)desc;
> +#else
> +	val = ((u32)le16_to_cpu(desc->q_head_compl_tag.q_head) << 16) |
> +	      le16_to_cpu(desc->qid_comptype_gen);
> +#endif
> +	if (!!(val & IDPF_TXD_COMPLQ_GEN_M) != gen)
> +		return -ENODATA;
> +
> +	if (unlikely((val & GENMASK(IDPF_TXD_COMPLQ_GEN_S - 1, 0)) !=
> +		     FIELD_PREP(IDPF_TXD_COMPLQ_COMPL_TYPE_M,
> +				IDPF_TXD_COMPLT_RS)))
> +		return -EINVAL;
> +
> +	return upper_16_bits(val);
> +}
> +
> +static u32 idpf_xdpsq_poll(struct idpf_tx_queue *xdpsq, u32 budget)
> +{
> +	struct idpf_compl_queue *cq = xdpsq->complq;
> +	u32 tx_ntc = xdpsq->next_to_clean;
> +	u32 tx_cnt = xdpsq->desc_count;
> +	u32 ntc = cq->next_to_clean;
> +	u32 cnt = cq->desc_count;
> +	u32 done_frames;
> +	bool gen;
> +
> +	gen = idpf_queue_has(GEN_CHK, cq);
> +
> +	for (done_frames = 0; done_frames < budget; ) {
> +		int ret;
> +
> +		ret = idpf_xdp_parse_compl_desc(&cq->comp_4b[ntc], gen);
> +		if (ret >= 0) {
> +			done_frames = ret > tx_ntc ? ret - tx_ntc :
> +						     ret + tx_cnt - tx_ntc;
> +			goto next;
> +		}
> +
> +		switch (ret) {
> +		case -ENODATA:
> +			goto out;
> +		case -EINVAL:
> +			break;
> +		}
> +
> +next:
> +		if (unlikely(++ntc == cnt)) {
> +			ntc = 0;
> +			gen = !gen;
> +			idpf_queue_change(GEN_CHK, cq);
> +		}
> +	}
> +
> +out:
> +	cq->next_to_clean = ntc;
> +
> +	return done_frames;
> +}
> +
> +/**
> + * idpf_clean_xdp_irq - Reclaim a batch of TX resources from completed XDP_TX
> + * @_xdpq: XDP Tx queue
> + * @budget: maximum number of descriptors to clean
> + *
> + * Returns number of cleaned descriptors.
> + */
> +static u32 idpf_clean_xdp_irq(void *_xdpq, u32 budget)
> +{
> +	struct libeth_xdpsq_napi_stats ss = { };
> +	struct idpf_tx_queue *xdpq = _xdpq;
> +	u32 tx_ntc = xdpq->next_to_clean;
> +	u32 tx_cnt = xdpq->desc_count;
> +	struct xdp_frame_bulk bq;
> +	struct libeth_cq_pp cp = {
> +		.dev	= xdpq->dev,
> +		.bq	= &bq,
> +		.xss	= &ss,
> +		.napi	= true,
> +	};
> +	u32 done_frames;
> +
> +	done_frames = idpf_xdpsq_poll(xdpq, budget);

nit: maybe pass {tx_ntc, tx_cnt} to the above?

> +	if (unlikely(!done_frames))
> +		return 0;
> +
> +	xdp_frame_bulk_init(&bq);
> +
> +	for (u32 i = 0; likely(i < done_frames); i++) {
> +		libeth_xdp_complete_tx(&xdpq->tx_buf[tx_ntc], &cp);
> +
> +		if (unlikely(++tx_ntc == tx_cnt))
> +			tx_ntc = 0;
> +	}
> +
> +	xdp_flush_frame_bulk(&bq);
> +
> +	xdpq->next_to_clean = tx_ntc;
> +	xdpq->pending -= done_frames;
> +	xdpq->xdp_tx -= cp.xdp_tx;

not following this variable. __libeth_xdp_complete_tx() decreases
libeth_cq_pp::xdp_tx by libeth_sqe::nr_frags. can you shed more light
on what's going on here?

> +
> +	return done_frames;
> +}
> +
> +static u32 idpf_xdp_tx_prep(void *_xdpq, struct libeth_xdpsq *sq)
> +{
> +	struct idpf_tx_queue *xdpq = _xdpq;
> +	u32 free;
> +
> +	libeth_xdpsq_lock(&xdpq->xdp_lock);
> +
> +	free = xdpq->desc_count - xdpq->pending;
> +	if (free <= xdpq->thresh)
> +		free += idpf_clean_xdp_irq(xdpq, xdpq->thresh);
> +
> +	*sq = (struct libeth_xdpsq){

could you have libeth_xdpsq embedded in idpf_tx_queue and avoid that
initialization?

> +		.sqes		= xdpq->tx_buf,
> +		.descs		= xdpq->desc_ring,
> +		.count		= xdpq->desc_count,
> +		.lock		= &xdpq->xdp_lock,
> +		.ntu		= &xdpq->next_to_use,
> +		.pending	= &xdpq->pending,
> +		.xdp_tx		= &xdpq->xdp_tx,
> +	};
> +
> +	return free;
> +}
> +
> +LIBETH_XDP_DEFINE_START();
> +LIBETH_XDP_DEFINE_TIMER(static idpf_xdp_tx_timer, idpf_clean_xdp_irq);
> +LIBETH_XDP_DEFINE_FLUSH_TX(idpf_xdp_tx_flush_bulk, idpf_xdp_tx_prep,
> +			   idpf_xdp_tx_xmit);
> +LIBETH_XDP_DEFINE_END();
> +
> +void idpf_xdp_set_features(const struct idpf_vport *vport)
> +{
> +	if (!idpf_is_queue_model_split(vport->rxq_model))
> +		return;
> +
> +	libeth_xdp_set_features_noredir(vport->netdev);
> +}
> +
>  /**
>   * idpf_xdp_setup_prog - handle XDP program install/remove requests
>   * @vport: vport to configure
> -- 
> 2.48.1
> 


* Re: [PATCH net-next 15/16] idpf: add support for .ndo_xdp_xmit()
  2025-03-05 16:21 ` [PATCH net-next 15/16] idpf: add support for .ndo_xdp_xmit() Alexander Lobakin
@ 2025-03-11 16:08   ` Maciej Fijalkowski
  2025-04-08 13:31     ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-11 16:08 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Wed, Mar 05, 2025 at 05:21:31PM +0100, Alexander Lobakin wrote:
> Use libeth XDP infra to implement .ndo_xdp_xmit() in idpf.
> The Tx callbacks are reused from XDP_TX code. XDP redirect target
> feature is set/cleared depending on the XDP prog presence, as for now
> we still don't allocate XDP Tx queues when there's no program.
> 
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/xdp.h      |  2 ++
>  drivers/net/ethernet/intel/idpf/idpf_lib.c |  1 +
>  drivers/net/ethernet/intel/idpf/xdp.c      | 29 ++++++++++++++++++++++
>  3 files changed, 32 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/xdp.h b/drivers/net/ethernet/intel/idpf/xdp.h
> index fde85528a315..a2ac1b2f334f 100644
> --- a/drivers/net/ethernet/intel/idpf/xdp.h
> +++ b/drivers/net/ethernet/intel/idpf/xdp.h
> @@ -110,5 +110,7 @@ static inline void idpf_xdp_tx_finalize(void *_xdpq, bool sent, bool flush)
>  void idpf_xdp_set_features(const struct idpf_vport *vport);
>  
>  int idpf_xdp(struct net_device *dev, struct netdev_bpf *xdp);
> +int idpf_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
> +		  u32 flags);
>  
>  #endif /* _IDPF_XDP_H_ */
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
> index 2d1efcb854be..39b9885293a9 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
> @@ -2371,4 +2371,5 @@ static const struct net_device_ops idpf_netdev_ops = {
>  	.ndo_set_features = idpf_set_features,
>  	.ndo_tx_timeout = idpf_tx_timeout,
>  	.ndo_bpf = idpf_xdp,
> +	.ndo_xdp_xmit = idpf_xdp_xmit,
>  };
> diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
> index abf75e840c0a..1834f217a07f 100644
> --- a/drivers/net/ethernet/intel/idpf/xdp.c
> +++ b/drivers/net/ethernet/intel/idpf/xdp.c
> @@ -357,8 +357,35 @@ LIBETH_XDP_DEFINE_START();
>  LIBETH_XDP_DEFINE_TIMER(static idpf_xdp_tx_timer, idpf_clean_xdp_irq);
>  LIBETH_XDP_DEFINE_FLUSH_TX(idpf_xdp_tx_flush_bulk, idpf_xdp_tx_prep,
>  			   idpf_xdp_tx_xmit);
> +LIBETH_XDP_DEFINE_FLUSH_XMIT(static idpf_xdp_xmit_flush_bulk, idpf_xdp_tx_prep,
> +			     idpf_xdp_tx_xmit);
>  LIBETH_XDP_DEFINE_END();
>  
> +/**
> + * idpf_xdp_xmit - send frames queued by ``XDP_REDIRECT`` to this interface
> + * @dev: network device
> + * @n: number of frames to transmit
> + * @frames: frames to transmit
> + * @flags: transmit flags (``XDP_XMIT_FLUSH`` or zero)
> + *
> + * Return: number of frames successfully sent or -errno on error.
> + */
> +int idpf_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
> +		  u32 flags)
> +{
> +	const struct idpf_netdev_priv *np = netdev_priv(dev);
> +	const struct idpf_vport *vport = np->vport;
> +
> +	if (unlikely(!netif_carrier_ok(dev) || !vport->link_up))
> +		return -ENETDOWN;
> +
> +	return libeth_xdp_xmit_do_bulk(dev, n, frames, flags,
> +				       &vport->txqs[vport->xdp_txq_offset],
> +				       vport->num_xdp_txq,

Have you considered making libeth stateful at some point, so that you could
provide initialization data such as vport->num_xdp_txq (which is rather
constant) once, instead of passing it on every call?

I got a bit puzzled here, as it took me some digging to see that it is only
used as a bounds check and that libeth_xdpsq_id() uses the CPU id as an index.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

> +				       idpf_xdp_xmit_flush_bulk,
> +				       idpf_xdp_tx_finalize);
> +}
> +
>  void idpf_xdp_set_features(const struct idpf_vport *vport)
>  {
>  	if (!idpf_is_queue_model_split(vport->rxq_model))
> @@ -417,6 +444,8 @@ idpf_xdp_setup_prog(struct idpf_vport *vport, const struct netdev_bpf *xdp)
>  		cfg->user_config.xdp_prog = old;
>  	}
>  
> +	libeth_xdp_set_redirect(vport->netdev, vport->xdp_prog);
> +
>  	return ret;
>  }
>  
> -- 
> 2.48.1
> 


* Re: [PATCH net-next 01/16] libeth: convert to netmem
  2025-03-06  0:13   ` Mina Almasry
@ 2025-03-11 17:22     ` Alexander Lobakin
  2025-03-11 17:43       ` Mina Almasry
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-11 17:22 UTC (permalink / raw)
  To: Mina Almasry
  Cc: intel-wired-lan, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Mina Almasry <almasrymina@google.com>
Date: Wed, 5 Mar 2025 16:13:32 -0800

> On Wed, Mar 5, 2025 at 8:23 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> Back when the libeth Rx core was initially written, devmem was a draft
>> and netmem_ref didn't exist in the mainline. Now that it's here, make
>> libeth MP-agnostic before introducing any new code or any new library
>> users.

[...]

>>         /* Very rare, but possible case. The most common reason:
>>          * the last fragment contained FCS only, which was then
>>          * stripped by the HW.
>>          */
>>         if (unlikely(!len)) {
>> -               libeth_rx_recycle_slow(page);
>> +               libeth_rx_recycle_slow(netmem);
> 
> I think before this patch this would have expanded to:
> 
> page_pool_put_full_page(pool, page, true);
> 
> But now I think it expands to:
> 
> page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, false);
> 
> Is the switch from true to false intentional? Is this a slow path so
> it doesn't matter?

Intentional. unlikely() means it's slowpath already. libeth_rx_recycle()
is inline, while _slow() is not. I don't want slowpath to be inlined.
While I was originally writing the code changed here, I didn't pay much
attention to that, but since then I altered my approach and now try to
put anything slow out of line to not waste object code.

Also, some time ago I changed PP's approach to decide whether it can
recycle buffers directly or not. Previously, if that `allow_direct` was
false, it was always falling back to ptr_ring, but now if `allow_direct`
is false, it still checks whether it can be recycled directly.

[...]

>> @@ -3122,16 +3122,20 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
>>                              struct libeth_fqe *buf, u32 data_len)
>>  {
>>         u32 copy = data_len <= L1_CACHE_BYTES ? data_len : ETH_HLEN;
>> +       struct page *hdr_page, *buf_page;
>>         const void *src;
>>         void *dst;
>>
>> -       if (!libeth_rx_sync_for_cpu(buf, copy))
>> +       if (unlikely(netmem_is_net_iov(buf->netmem)) ||
>> +           !libeth_rx_sync_for_cpu(buf, copy))
>>                 return 0;
>>
> 
> I could not immediately understand why you need a netmem_is_net_iov
> check here. libeth_rx_sync_for_cpu will delegate to
> page_pool_dma_sync_netmem_for_cpu which should do the right thing
> regardless of whether the netmem is a page or net_iov, right? Is this
> to save some cycles?

If the payload buffer is net_iov, the kernel doesn't have access to it.
This means, this W/A can't be performed (see memcpy() below the check).
That's why I exit early explicitly.
libeth_rx_sync_for_cpu() returns false only if the size is zero.

netmem_is_net_iov() is under unlikely() here, because when using devmem,
you explicitly configure flow steering, so that only TCP/UDP/whatever
frames will land on this queue. Such frames are split correctly by
idpf's HW.
I need this WA because let's say unfortunately this HW places the whole
frame to the payload buffer when it's not TCP/UDP/... (see the comment
above this function).
For example, it even does so for ICMP, although HW is fully aware of the
ICMP format. If I was a HW designer of this NIC, I'd instead try putting
the whole frame to the header buffer, not the payload one. And in
general, do header split for all known packet types, not just TCP/UDP/..
But meh... A different story.
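
Roughly, the flow in idpf_rx_hsplit_wa() is (condensed sketch, the pp
offset math and exact copy length are omitted):

	/* A net_iov (devmem) payload buffer is not kernel-accessible,
	 * so the copy below is impossible -- bail out early.
	 */
	if (unlikely(netmem_is_net_iov(buf->netmem)) ||
	    !libeth_rx_sync_for_cpu(buf, copy))
		return 0;

	/* Both buffers are host pages here, so the copy is safe. */
	dst = page_address(netmem_to_page(hdr->netmem)) + hdr->offset;
	src = page_address(netmem_to_page(buf->netmem)) + buf->offset;
	memcpy(dst, src, copy);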

> 
> --
> Thanks,
> Mina

Thanks!
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 01/16] libeth: convert to netmem
  2025-03-11 17:22     ` Alexander Lobakin
@ 2025-03-11 17:43       ` Mina Almasry
  0 siblings, 0 replies; 59+ messages in thread
From: Mina Almasry @ 2025-03-11 17:43 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Tue, Mar 11, 2025 at 10:23 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Mina Almasry <almasrymina@google.com>
> Date: Wed, 5 Mar 2025 16:13:32 -0800
>
> > On Wed, Mar 5, 2025 at 8:23 AM Alexander Lobakin
> > <aleksander.lobakin@intel.com> wrote:
> >>
> >> Back when the libeth Rx core was initially written, devmem was a draft
> >> and netmem_ref didn't exist in the mainline. Now that it's here, make
> >> libeth MP-agnostic before introducing any new code or any new library
> >> users.
>
> [...]
>
> >>         /* Very rare, but possible case. The most common reason:
> >>          * the last fragment contained FCS only, which was then
> >>          * stripped by the HW.
> >>          */
> >>         if (unlikely(!len)) {
> >> -               libeth_rx_recycle_slow(page);
> >> +               libeth_rx_recycle_slow(netmem);
> >
> > I think before this patch this would have expanded to:
> >
> > page_pool_put_full_page(pool, page, true);
> >
> > But now I think it expands to:
> >
> > page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, false);
> >
> > Is the switch from true to false intentional? Is this a slow path so
> > it doesn't matter?
>
> Intentional. unlikely() means it's slowpath already. libeth_rx_recycle()
> is inline, while _slow() is not. I don't want slowpath to be inlined.
> While I was originally writing the code changed here, I didn't pay much
> attention to that, but since then I altered my approach and now try to
> put anything slow out of line to not waste object code.
>
> Also, some time ago I changed PP's approach to decide whether it can
> recycle buffers directly or not. Previously, if that `allow_direct` was
> false, it was always falling back to ptr_ring, but now if `allow_direct`
> is false, it still checks whether it can be recycled directly.
>

Thanks, yes I forgot about that.

> [...]
>
> >> @@ -3122,16 +3122,20 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
> >>                              struct libeth_fqe *buf, u32 data_len)
> >>  {
> >>         u32 copy = data_len <= L1_CACHE_BYTES ? data_len : ETH_HLEN;
> >> +       struct page *hdr_page, *buf_page;
> >>         const void *src;
> >>         void *dst;
> >>
> >> -       if (!libeth_rx_sync_for_cpu(buf, copy))
> >> +       if (unlikely(netmem_is_net_iov(buf->netmem)) ||
> >> +           !libeth_rx_sync_for_cpu(buf, copy))
> >>                 return 0;
> >>
> >
> > I could not immediately understand why you need a netmem_is_net_iov
> > check here. libeth_rx_sync_for_cpu will delegate to
> > page_pool_dma_sync_netmem_for_cpu which should do the right thing
> > regardless of whether the netmem is a page or net_iov, right? Is this
> > to save some cycles?
>
> If the payload buffer is net_iov, the kernel doesn't have access to it.
> This means, this W/A can't be performed (see memcpy() below the check).
> That's why I exit early explicitly.
> libeth_rx_sync_for_cpu() returns false only if the size is zero.
>
> netmem_is_net_iov() is under unlikely() here, because when using devmem,
> you explicitly configure flow steering, so that only TCP/UDP/whatever
> frames will land on this queue. Such frames are split correctly by
> idpf's HW.
> I need this WA because let's say unfortunately this HW places the whole
> frame to the payload buffer when it's not TCP/UDP/... (see the comment
> above this function).
> For example, it even does so for ICMP, although HW is fully aware of the
> ICMP format. If I was a HW designer of this NIC, I'd instead try putting
> the whole frame to the header buffer, not the payload one. And in
> general, do header split for all known packet types, not just TCP/UDP/..
> But meh... A different story.
>

Makes sense. FWIW:

Reviewed-by: Mina Almasry <almasrymina@google.com>

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Intel-wired-lan] [PATCH net-next 11/16] idpf: prepare structures to support XDP
  2025-03-07  1:12   ` Jakub Kicinski
@ 2025-03-12 14:00     ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-12 14:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: intel-wired-lan, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Jakub Kicinski <kuba@kernel.org>
Date: Thu, 6 Mar 2025 17:12:08 -0800

> On Wed,  5 Mar 2025 17:21:27 +0100 Alexander Lobakin wrote:
>> +/**
>> + * idpf_xdp_is_prog_ena - check if there is an XDP program on adapter
>> + * @vport: vport to check
>> + */
>> +static inline bool idpf_xdp_is_prog_ena(const struct idpf_vport *vport)
>> +{
>> +	return vport->adapter && vport->xdp_prog;
>> +}
> 
> drivers/net/ethernet/intel/idpf/idpf.h:624: warning: No description found for return value of 'idpf_xdp_is_prog_ena'

Breh, I swear I ran sparse and kdoc... >_<

> 
> The documentation doesn't add much info, just remove it ?

Ack, agree.

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 04/16] libeth: add XSk helpers
  2025-03-07 10:15   ` Maciej Fijalkowski
@ 2025-03-12 17:03     ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-12 17:03 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 7 Mar 2025 11:15:56 +0100

> On Wed, Mar 05, 2025 at 05:21:20PM +0100, Alexander Lobakin wrote:
>> Add the following counterparts of functions from libeth_xdp which need
>> special care on XSk path:
>>
>> * building &xdp_buff (head and frags);
>> * running XDP prog and managing all possible verdicts;
>> * xmit (with S/G and metadata support);
>> * wakeup via CSD/IPI;
>> * FQ init/deinit and refilling.
>>
>> Xmit by default unrolls loops by 8 when filling Tx DMA descriptors.
>> XDP_REDIRECT verdict is considered default/likely(). Rx frags are
>> considered unlikely().
>> It is assumed that Tx/completion queues are not mapped to any
>> interrupts, thus we clean them only when needed (=> 3/4 of
>> descriptors is busy) and keep need_wakeup set.
>> IPI for XSk wakeup showed better performance than triggering an SW
>> NIC interrupt, though it doesn't respect NIC's interrupt affinity.
> 
> Maybe introduce this with xsk support on idpf (i suppose when set after
> this one) ?
> 
> Otherwise, what is the reason to have this included? I didn't check
> in-depth if there are any functions used from this patch on drivers side.

I did split libeth_xdp into two commits only to ease reviewing a bit.
There's also stuff from Michał in progress which converts ice to
libeth_xdp and adds XDP to iavf... I don't want it to be blocked by
idpf, who knows which one will go first :>

> 
>>
>> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # lots of stuff
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>>  drivers/net/ethernet/intel/libeth/Kconfig  |   2 +-
>>  drivers/net/ethernet/intel/libeth/Makefile |   1 +
>>  drivers/net/ethernet/intel/libeth/priv.h   |  11 +
>>  include/net/libeth/tx.h                    |  10 +-
>>  include/net/libeth/xdp.h                   |  90 ++-
>>  include/net/libeth/xsk.h                   | 685 +++++++++++++++++++++
>>  drivers/net/ethernet/intel/libeth/tx.c     |   5 +-
>>  drivers/net/ethernet/intel/libeth/xdp.c    |  26 +-
>>  drivers/net/ethernet/intel/libeth/xsk.c    | 269 ++++++++
>>  9 files changed, 1067 insertions(+), 32 deletions(-)
>>  create mode 100644 include/net/libeth/xsk.h
>>  create mode 100644 drivers/net/ethernet/intel/libeth/xsk.c

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 05/16] idpf: fix Rx descriptor ready check barrier in splitq
  2025-03-07 10:17   ` Maciej Fijalkowski
@ 2025-03-12 17:10     ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-12 17:10 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 7 Mar 2025 11:17:11 +0100

> On Wed, Mar 05, 2025 at 05:21:21PM +0100, Alexander Lobakin wrote:
>> No idea what the current barrier position was meant for. At that point,
>> nothing is read from the descriptor, only the pointer to the actual one
>> is fetched.
>> The correct barrier usage here is after the generation check, so that
>> only the first qword is read if the descriptor is not yet ready and we
>> need to stop polling. Debatable on coherent DMA as the Rx descriptor
>> size is <= cacheline size, but anyway, the current barrier position
>> only makes the codegen worse.
> 
> Makes sense:
> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
> But you know the process... :P fixes should go to -net.

The thing is that it makes no difference for regular skb Rx, but with
ret != XDP_PASS it starts causing issues. So yes, this is a fix, but
I don't think it should go separately.

> 
>>
>> Fixes: 3a8845af66ed ("idpf: add RX splitq napi poll support")
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>>  drivers/net/ethernet/intel/idpf/idpf_txrx.c | 8 ++------
>>  1 file changed, 2 insertions(+), 6 deletions(-)

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 07/16] idpf: link NAPIs to queues
  2025-03-07 10:28   ` Eric Dumazet
@ 2025-03-12 17:16     ` Alexander Lobakin
  2025-03-18 17:10       ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-12 17:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: intel-wired-lan, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Eric Dumazet <edumazet@google.com>
Date: Fri, 7 Mar 2025 11:28:36 +0100

> On Wed, Mar 5, 2025 at 5:22 PM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> Add the missing linking of NAPIs to netdev queues when enabling
>> interrupt vectors in order to support NAPI configuration and
>> interfaces requiring get_rx_queue()->napi to be set (like XSk
>> busy polling).
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>>  drivers/net/ethernet/intel/idpf/idpf_txrx.c | 30 +++++++++++++++++++++
>>  1 file changed, 30 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
>> index 2f221c0abad8..a3f6e8cff7a0 100644
>> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
>> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
>> @@ -3560,8 +3560,11 @@ void idpf_vport_intr_rel(struct idpf_vport *vport)
>>  static void idpf_vport_intr_rel_irq(struct idpf_vport *vport)
>>  {
>>         struct idpf_adapter *adapter = vport->adapter;
>> +       bool unlock;
>>         int vector;
>>
>> +       unlock = rtnl_trylock();
> 
> This is probably not what you want here ?
> 
> If another thread is holding RTNL, then rtnl_ttrylock() will not add
> any protection.

Yep I know. trylock() is because this function can be called in two
scenarios:

1) .ndo_close(), when RTNL is already locked;
2) "soft reset" aka "stop the traffic, reallocate the queues, start the
   traffic", when RTNL is not taken.

The second one spits a WARN without the RTNL being locked. So this
trylock() will do nothing for the first scenario and will take the lock
for the second one.

If that is not correct, let me know, I'll do it a different way (maybe
it's better to unconditionally take the lock on the callsite for the
second case?).
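
I.e. something like this at the soft-reset callsite (a sketch of that
alternative, not what the patch currently does):

	/* Soft reset is not called under RTNL, so take it around the
	 * teardown instead of rtnl_trylock() inside the helper.
	 */
	rtnl_lock();
	idpf_vport_intr_rel_irq(vport);
	rtnl_unlock();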

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 06/16] idpf: a use saner limit for default number of queues to allocate
  2025-03-07 10:32   ` Maciej Fijalkowski
@ 2025-03-12 17:22     ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-12 17:22 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 7 Mar 2025 11:32:15 +0100

> On Wed, Mar 05, 2025 at 05:21:22PM +0100, Alexander Lobakin wrote:
>> Currently, the maximum number of queues available for one vport is 16.
>> This is hardcoded, but then the function calculating the optimal number
>> of queues takes min(16, num_online_cpus()).
>> On order to be able to allocate more queues, which will be then used for
> 
> nit: s/On/In

Also "use a saner limit", not "a use saner limit" in the subject =\

> 
>> XDP, stop hardcoding 16 and rely on what the device gives us. Instead of
>> num_online_cpus(), which is considered suboptimal since at least 2013,
>> use netif_get_num_default_rss_queues() to still have free queues in the
>> pool.
> 
> Should we update older drivers as well?

That would be good.

For idpf, this is particularly important since the current logic eats
128 Tx queues for skb traffic on my Xeon out of 256 available by default
(per vport). On a 256-thread system, it would eat the whole limit,
leaving nothing for XDP >_<. ice doesn't have a per-port limit IIRC.
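
The change itself boils down to roughly this (simplified sketch;
`dev_max_txq` is just a placeholder for whatever the device reports
per vport):

	/* Before: hardcoded cap of 16, scaled by the online CPU count. */
	num_txq = min_t(u16, 16, num_online_cpus());

	/* After: take what the device gives us, but default to
	 * netif_get_num_default_rss_queues() to keep free queues in the
	 * pool (e.g. for XDP) instead of eating one per CPU thread.
	 */
	num_txq = min_t(u16, dev_max_txq, netif_get_num_default_rss_queues());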

> 
>> nr_cpu_ids number of Tx queues are needed only for lockless XDP sending,
>> the regular stack doesn't benefit from that anyhow.
>> On a 128-thread Xeon, this now gives me 32 regular Tx queues and leaves
>> 224 free for XDP (128 of which will handle XDP_TX, .ndo_xdp_xmit(), and
>> XSk xmit when enabled).

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 07/16] idpf: link NAPIs to queues
  2025-03-07 10:51   ` Maciej Fijalkowski
@ 2025-03-12 17:25     ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-12 17:25 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 7 Mar 2025 11:51:18 +0100

> On Wed, Mar 05, 2025 at 05:21:23PM +0100, Alexander Lobakin wrote:
>> Add the missing linking of NAPIs to netdev queues when enabling
>> interrupt vectors in order to support NAPI configuration and
>> interfaces requiring get_rx_queue()->napi to be set (like XSk
>> busy polling).
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>>  drivers/net/ethernet/intel/idpf/idpf_txrx.c | 30 +++++++++++++++++++++
>>  1 file changed, 30 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
>> index 2f221c0abad8..a3f6e8cff7a0 100644
>> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
>> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
>> @@ -3560,8 +3560,11 @@ void idpf_vport_intr_rel(struct idpf_vport *vport)
>>  static void idpf_vport_intr_rel_irq(struct idpf_vport *vport)
>>  {
>>  	struct idpf_adapter *adapter = vport->adapter;
>> +	bool unlock;
>>  	int vector;
>>  
>> +	unlock = rtnl_trylock();
>> +
>>  	for (vector = 0; vector < vport->num_q_vectors; vector++) {
>>  		struct idpf_q_vector *q_vector = &vport->q_vectors[vector];
>>  		int irq_num, vidx;
>> @@ -3573,8 +3576,23 @@ static void idpf_vport_intr_rel_irq(struct idpf_vport *vport)
>>  		vidx = vport->q_vector_idxs[vector];
>>  		irq_num = adapter->msix_entries[vidx].vector;
>>  
>> +		for (u32 i = 0; i < q_vector->num_rxq; i++)
>> +			netif_queue_set_napi(vport->netdev,
>> +					     q_vector->rx[i]->idx,
>> +					     NETDEV_QUEUE_TYPE_RX,
>> +					     NULL);
>> +
>> +		for (u32 i = 0; i < q_vector->num_txq; i++)
>> +			netif_queue_set_napi(vport->netdev,
>> +					     q_vector->tx[i]->idx,
>> +					     NETDEV_QUEUE_TYPE_TX,
>> +					     NULL);
>> +
> 
> maybe we could have a wrapper for this?
> 
> static void idpf_q_set_napi(struct net_device *netdev,
> 			    struct idpf_q_vector *q_vector,
> 			    enum netdev_queue_type q_type,
> 			    struct napi_struct *napi)
> {
> 	u32 q_cnt = q_type == NETDEV_QUEUE_TYPE_RX ? q_vector->num_rxq :
> 						     q_vector->num_txq;
> 	struct idpf_rx_queue **qs = q_type == NETDEV_QUEUE_TYPE_RX ?
> 					      q_vector->rx : q_vector->tx;
> 
> 	for (u32 i = 0; i < q_cnt; i++)
> 		netif_queue_set_napi(netdev, qs[i]->idx, q_type, napi);
> }
> 
> idpf_q_set_napi(vport->netdev, q_vector, NETDEV_QUEUE_TYPE_RX, NULL);
> idpf_q_set_napi(vport->netdev, q_vector, NETDEV_QUEUE_TYPE_TX, NULL);
> ...
> idpf_q_set_napi(vport->netdev, q_vector, NETDEV_QUEUE_TYPE_RX, &q_vector->napi);
> idpf_q_set_napi(vport->netdev, q_vector, NETDEV_QUEUE_TYPE_TX, &q_vector->napi);
> 
> 
> up to you if you take it, less lines in the end but i don't have strong
> opinion if this should be considered as an improvement or makes code
> harder to follow.

No no, it's actually a good idea. Previously, it looked different, but
then this stuff with the CPU affinity embedded into the NAPI config got
merged and I had to rewrite this at the last minute.

> 
>>  		kfree(free_irq(irq_num, q_vector));
>>  	}
>> +
>> +	if (unlock)
>> +		rtnl_unlock();
>>  }

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 08/16] idpf: make complq cleaning dependent on scheduling mode
  2025-03-07 11:11   ` Maciej Fijalkowski
@ 2025-03-13 16:16     ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-13 16:16 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 7 Mar 2025 12:11:05 +0100

> On Wed, Mar 05, 2025 at 05:21:24PM +0100, Alexander Lobakin wrote:
>> From: Michal Kubiak <michal.kubiak@intel.com>
>>
>> Extend completion queue cleaning function to support queue-based
>> scheduling mode needed for XDP queues.
>> Add 4-byte descriptor for queue-based scheduling mode and
>> perform some refactoring to extract the common code for
>> both scheduling modes.

TBH it's not needed at all, as the cleaning logic for XDP queues is in
xdp.c and doesn't depend on the regular Tx path. Previously, the same
functions were used for both, but then we rewrote that code and I forgot
to toss it out =\

I only need to add 4-byte completion descriptors and allocation
depending on the queue type. Regular skb functions don't use queue-based
mode, XDP path doesn't use flow-based mode.

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 10/16] idpf: add support for nointerrupt queues
  2025-03-07 12:10   ` Maciej Fijalkowski
@ 2025-03-13 16:19     ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-13 16:19 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 7 Mar 2025 13:10:29 +0100

> On Wed, Mar 05, 2025 at 05:21:26PM +0100, Alexander Lobakin wrote:
>> Currently, queues are associated 1:1 with interrupt vectors as it's
>> assumed queues are always interrupt-driven.
>> In order to use a queue without an interrupt, idpf still needs to have
>> a vector assigned to it to flush descriptors. This vector can be global
>> and only one for the whole vport to handle all its noirq queues.
>> Always request one excessive vector and configure it in non-interrupt
>> mode right away when creating vport, so that it can be used later by
>> queues when needed.
> 
> Description sort of miss the purpose of this commit, you don't ever
> mention that your design choice for XDP Tx queues is to have them
> irq-less.

Because this is not directly related to XDP and maybe some time later
more code could make use of noirq queues, who knows :>

But I'll mention why this is needed, ok.

> 
>>
>> Co-developed-by: Michal Kubiak <michal.kubiak@intel.com>
>> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>>  drivers/net/ethernet/intel/idpf/idpf.h        |  8 +++
>>  drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  4 ++
>>  drivers/net/ethernet/intel/idpf/idpf_dev.c    | 11 +++-
>>  drivers/net/ethernet/intel/idpf/idpf_lib.c    |  2 +-
>>  drivers/net/ethernet/intel/idpf/idpf_txrx.c   |  8 +++
>>  drivers/net/ethernet/intel/idpf/idpf_vf_dev.c | 11 +++-
>>  .../net/ethernet/intel/idpf/idpf_virtchnl.c   | 53 +++++++++++++------
>>  7 files changed, 79 insertions(+), 18 deletions(-)

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 09/16] idpf: remove SW marker handling from NAPI
  2025-03-07 11:42   ` Maciej Fijalkowski
@ 2025-03-13 16:50     ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-13 16:50 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 7 Mar 2025 12:42:10 +0100

> On Wed, Mar 05, 2025 at 05:21:25PM +0100, Alexander Lobakin wrote:
>> From: Michal Kubiak <michal.kubiak@intel.com>
>>
>> SW marker descriptors on completion queues are used only when a queue
>> is about to be destroyed. It's far from hotpath and handling it in the
>> hotpath NAPI poll makes no sense.

[...]

>> +/**
>> + * idpf_wait_for_sw_marker_completion - wait for SW marker of disabled Tx queue
>> + * @txq: disabled Tx queue
>> + */
>> +void idpf_wait_for_sw_marker_completion(struct idpf_tx_queue *txq)
>> +{
>> +	struct idpf_compl_queue *complq = txq->txq_grp->complq;
>> +	struct idpf_splitq_4b_tx_compl_desc *tx_desc;
>> +	s16 ntc = complq->next_to_clean;
>> +	unsigned long timeout;
>> +	bool flow, gen_flag;
>> +	u32 pos = ntc;
>> +
>> +	if (!idpf_queue_has(SW_MARKER, txq))
>> +		return;
>> +
>> +	flow = idpf_queue_has(FLOW_SCH_EN, complq);
>> +	gen_flag = idpf_queue_has(GEN_CHK, complq);
>> +
>> +	timeout = jiffies + msecs_to_jiffies(IDPF_WAIT_FOR_MARKER_TIMEO);
>> +	tx_desc = flow ? &complq->comp[pos].common : &complq->comp_4b[pos];
>> +	ntc -= complq->desc_count;
> 
> could we stop this logic? it was introduced back in the days as comparison
> against 0 for wrap case was faster, here as you said it doesn't have much
> in common with hot path.

+1

> 
>> +
>> +	do {
>> +		struct idpf_tx_queue *tx_q;
>> +		int ctype;
>> +
>> +		ctype = idpf_parse_compl_desc(tx_desc, complq, &tx_q,
>> +					      gen_flag);
>> +		if (ctype == IDPF_TXD_COMPLT_SW_MARKER) {
>> +			idpf_queue_clear(SW_MARKER, tx_q);
>> +			if (txq == tx_q)
>> +				break;
>> +		} else if (ctype == -ENODATA) {
>> +			usleep_range(500, 1000);
>> +			continue;
>> +		}
>> +
>> +		pos++;
>> +		ntc++;
>> +		if (unlikely(!ntc)) {
>> +			ntc -= complq->desc_count;
>> +			pos = 0;
>> +			gen_flag = !gen_flag;
>> +		}
>> +
>> +		tx_desc = flow ? &complq->comp[pos].common :
>> +			  &complq->comp_4b[pos];
>> +		prefetch(tx_desc);
>> +	} while (time_before(jiffies, timeout));
> 
> what if timeout expires and you didn't find the marker desc? why do you

Then we'll print "failed to receive marker" and that's it. Usually that
happens only if HW went out for cigarettes and won't come back until
a full power cycle. In that case, timeout prevents the kernel from hanging.

> need timer? couldn't you scan the whole ring instead?

Queue destroy marker is always the last written descriptor, there's no
point in scanning the whole ring.
The marker arrives as the CP receives the virtchnl message, queues the
queue (lol) for destroying and sends the marker. This may take up to
several msecs, but you never know.
So you need a loop anyway, with some sane sleeps (here it's 500-1000 usec
and it usually takes 2-3 iterations).

> 
>> +
>> +	idpf_tx_update_complq_indexes(complq, ntc, gen_flag);
>> +}

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 11/16] idpf: prepare structures to support XDP
  2025-03-07 13:27   ` Maciej Fijalkowski
@ 2025-03-17 14:50     ` Alexander Lobakin
  2025-03-19 16:29       ` Maciej Fijalkowski
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-17 14:50 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 7 Mar 2025 14:27:13 +0100

> On Wed, Mar 05, 2025 at 05:21:27PM +0100, Alexander Lobakin wrote:
>> From: Michal Kubiak <michal.kubiak@intel.com>
>>
>> Extend basic structures of the driver (e.g. 'idpf_vport', 'idpf_*_queue',
>> 'idpf_vport_user_config_data') by adding members necessary to support XDP.
>> Add extra XDP Tx queues needed to support XDP_TX and XDP_REDIRECT actions
>> without interfering with regular Tx traffic.
>> Also add functions dedicated to support XDP initialization for Rx and
>> Tx queues and call those functions from the existing algorithms of
>> queues configuration.

[...]

>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
>> index 59b1a1a09996..1ca322bfe92f 100644
>> --- a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
>> +++ b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
>> @@ -186,9 +186,11 @@ static void idpf_get_channels(struct net_device *netdev,
>>  {
>>  	struct idpf_netdev_priv *np = netdev_priv(netdev);
>>  	struct idpf_vport_config *vport_config;
>> +	const struct idpf_vport *vport;
>>  	u16 num_txq, num_rxq;
>>  	u16 combined;
>>  
>> +	vport = idpf_netdev_to_vport(netdev);
>>  	vport_config = np->adapter->vport_config[np->vport_idx];
>>  
>>  	num_txq = vport_config->user_config.num_req_tx_qs;
>> @@ -202,8 +204,8 @@ static void idpf_get_channels(struct net_device *netdev,
>>  	ch->max_rx = vport_config->max_q.max_rxq;
>>  	ch->max_tx = vport_config->max_q.max_txq;
>>  
>> -	ch->max_other = IDPF_MAX_MBXQ;
>> -	ch->other_count = IDPF_MAX_MBXQ;
>> +	ch->max_other = IDPF_MAX_MBXQ + vport->num_xdp_txq;
>> +	ch->other_count = IDPF_MAX_MBXQ + vport->num_xdp_txq;
> 
> That's new I think. Do you explain somewhere that other `other` will carry
> xdpq count? Otherwise how would I know to interpret this value?

Where? :D

> 
> Also from what I see num_txq carries (txq + xdpq) count. How is that
> affecting the `combined` from ethtool_channels?

No changes in combined/Ethtool, num_txq is not used there. Stuff like
req_txq_num includes skb queues only.

> 
>>  
>>  	ch->combined_count = combined;
>>  	ch->rx_count = num_rxq - combined;
>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
>> index 2594ca38e8ca..0f4edc9cd1ad 100644
> 
> (...)
> 
>> +
>> +/**
>> + * __idpf_xdp_rxq_info_init - Setup XDP RxQ info for a given Rx queue
>> + * @rxq: Rx queue for which the resources are setup
>> + * @arg: flag indicating if the HW works in split queue mode
>> + *
>> + * Return: 0 on success, negative on failure.
>> + */
>> +static int __idpf_xdp_rxq_info_init(struct idpf_rx_queue *rxq, void *arg)
>> +{
>> +	const struct idpf_vport *vport = rxq->q_vector->vport;
>> +	bool split = idpf_is_queue_model_split(vport->rxq_model);
>> +	const struct page_pool *pp;
>> +	int err;
>> +
>> +	err = __xdp_rxq_info_reg(&rxq->xdp_rxq, vport->netdev, rxq->idx,
>> +				 rxq->q_vector->napi.napi_id,
>> +				 rxq->rx_buf_size);
>> +	if (err)
>> +		return err;
>> +
>> +	pp = split ? rxq->bufq_sets[0].bufq.pp : rxq->pp;
>> +	xdp_rxq_info_attach_page_pool(&rxq->xdp_rxq, pp);
>> +
>> +	if (!split)
>> +		return 0;
> 
> why do you care about splitq model if on next patch you don't allow
> XDP_SETUP_PROG for that?

This function is called unconditionally for both queue models. If we
don't account it here, we'd break regular traffic flow.

(singleq will be removed soon, don't take it seriously anyway)

[...]

>> +int idpf_vport_xdpq_get(const struct idpf_vport *vport)
>> +{
>> +	struct libeth_xdpsq_timer **timers __free(kvfree) = NULL;
> 
> please bear with me here - so this array will exist as long as there is a
> single timers[i] allocated? even though it's a local var?

No problem.

No, this array will be freed when the function exits. This array is an
array of pointers to iterate in a loop and assign timers to queues. When
we exit this function, it's no longer needed.
I can't place the whole array on the stack since I don't know the actual
queue count + it can be really big (1024 pointers * 8 = 8 Kb, even 128
or 256 queues is already 1-2 Kb).

The actual timers are allocated separately and NUMA-locally below.

> 
> this way you avoid the need to store it in vport?
> 
>> +	struct net_device *dev;
>> +	u32 sqs;
>> +
>> +	if (!idpf_xdp_is_prog_ena(vport))
>> +		return 0;
>> +
>> +	timers = kvcalloc(vport->num_xdp_txq, sizeof(*timers), GFP_KERNEL);
>> +	if (!timers)
>> +		return -ENOMEM;
>> +
>> +	for (u32 i = 0; i < vport->num_xdp_txq; i++) {
>> +		timers[i] = kzalloc_node(sizeof(*timers[i]), GFP_KERNEL,
>> +					 cpu_to_mem(i));
>> +		if (!timers[i]) {
>> +			for (int j = i - 1; j >= 0; j--)
>> +				kfree(timers[j]);
>> +
>> +			return -ENOMEM;
>> +		}
>> +	}
>> +
>> +	dev = vport->netdev;
>> +	sqs = vport->xdp_txq_offset;
>> +
>> +	for (u32 i = sqs; i < vport->num_txq; i++) {
>> +		struct idpf_tx_queue *xdpq = vport->txqs[i];
>> +
>> +		xdpq->complq = xdpq->txq_grp->complq;
>> +
>> +		idpf_queue_clear(FLOW_SCH_EN, xdpq);
>> +		idpf_queue_clear(FLOW_SCH_EN, xdpq->complq);
>> +		idpf_queue_set(NOIRQ, xdpq);
>> +		idpf_queue_set(XDP, xdpq);
>> +		idpf_queue_set(XDP, xdpq->complq);
>> +
>> +		xdpq->timer = timers[i - sqs];
>> +		libeth_xdpsq_get(&xdpq->xdp_lock, dev, vport->xdpq_share);
>> +
>> +		xdpq->pending = 0;
>> +		xdpq->xdp_tx = 0;
>> +		xdpq->thresh = libeth_xdp_queue_threshold(xdpq->desc_count);
>> +	}
>> +
>> +	return 0;
>> +}

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 12/16] idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq
  2025-03-07 14:16   ` Maciej Fijalkowski
@ 2025-03-17 14:58     ` Alexander Lobakin
  2025-03-19 16:23       ` Maciej Fijalkowski
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-17 14:58 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 7 Mar 2025 15:16:48 +0100

> On Wed, Mar 05, 2025 at 05:21:28PM +0100, Alexander Lobakin wrote:
>> From: Michal Kubiak <michal.kubiak@intel.com>
>>
>> Implement loading/removing XDP program using .ndo_bpf callback
>> in the split queue mode. Reconfigure and restart the queues if needed
>> (!!old_prog != !!new_prog), otherwise, just update the pointers.
>>
>> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>>  drivers/net/ethernet/intel/idpf/idpf_txrx.h |   4 +-
>>  drivers/net/ethernet/intel/idpf/xdp.h       |   7 ++
>>  drivers/net/ethernet/intel/idpf/idpf_lib.c  |   1 +
>>  drivers/net/ethernet/intel/idpf/idpf_txrx.c |   4 +
>>  drivers/net/ethernet/intel/idpf/xdp.c       | 114 ++++++++++++++++++++
>>  5 files changed, 129 insertions(+), 1 deletion(-)
>>
> 
> (...)
> 
>> +
>> +/**
>> + * idpf_xdp_setup_prog - handle XDP program install/remove requests
>> + * @vport: vport to configure
>> + * @xdp: request data (program, extack)
>> + *
>> + * Return: 0 on success, -errno on failure.
>> + */
>> +static int
>> +idpf_xdp_setup_prog(struct idpf_vport *vport, const struct netdev_bpf *xdp)
>> +{
>> +	const struct idpf_netdev_priv *np = netdev_priv(vport->netdev);
>> +	struct bpf_prog *old, *prog = xdp->prog;
>> +	struct idpf_vport_config *cfg;
>> +	int ret;
>> +
>> +	cfg = vport->adapter->vport_config[vport->idx];
>> +	if (!vport->num_xdp_txq && vport->num_txq == cfg->max_q.max_txq) {
>> +		NL_SET_ERR_MSG_MOD(xdp->extack,
>> +				   "No Tx queues available for XDP, please decrease the number of regular SQs");
>> +		return -ENOSPC;
>> +	}
>> +
>> +	if (test_bit(IDPF_REMOVE_IN_PROG, vport->adapter->flags) ||
> 
> IN_PROG is a bit unfortunate here as it mixes with 'prog' :P

Authentic idpf dictionary ¯\_(ツ)_/¯

> 
>> +	    !!vport->xdp_prog == !!prog) {
>> +		if (np->state == __IDPF_VPORT_UP)
>> +			idpf_copy_xdp_prog_to_qs(vport, prog);
>> +
>> +		old = xchg(&vport->xdp_prog, prog);
>> +		if (old)
>> +			bpf_prog_put(old);
>> +
>> +		cfg->user_config.xdp_prog = prog;
>> +
>> +		return 0;
>> +	}
>> +
>> +	old = cfg->user_config.xdp_prog;
>> +	cfg->user_config.xdp_prog = prog;
>> +
>> +	ret = idpf_initiate_soft_reset(vport, IDPF_SR_Q_CHANGE);
>> +	if (ret) {
>> +		NL_SET_ERR_MSG_MOD(xdp->extack,
>> +				   "Could not reopen the vport after XDP setup");
>> +
>> +		if (prog)
>> +			bpf_prog_put(prog);
> 
> aren't you missing this for prog->NULL conversion? you have this for
> hot-swap case (prog->prog).

This path (soft_reset) handles NULL => prog and prog => NULL. This
branch in particular handles errors during the soft reset, when we need
to restore the original prog and put the new one.

What you probably meant is that I don't have bpf_prog_put(old) in case
everything went well below? Breh =\
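
I.e. the tail of idpf_xdp_setup_prog() should roughly become (a sketch
on top of the hunk above):

	ret = idpf_initiate_soft_reset(vport, IDPF_SR_Q_CHANGE);
	if (ret) {
		NL_SET_ERR_MSG_MOD(xdp->extack,
				   "Could not reopen the vport after XDP setup");

		/* Roll back: drop the new prog, restore the old one. */
		if (prog)
			bpf_prog_put(prog);

		cfg->user_config.xdp_prog = old;
	} else if (old) {
		/* Success: the old program is no longer referenced. */
		bpf_prog_put(old);
	}

	return ret;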

> 
>> +
>> +		cfg->user_config.xdp_prog = old;
>> +	}
>> +
>> +	return ret;
>> +}

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp)
  2025-03-11 14:05   ` Maciej Fijalkowski
@ 2025-03-17 15:26     ` Alexander Lobakin
  2025-03-19 16:19       ` Maciej Fijalkowski
  2025-04-08 13:22     ` Alexander Lobakin
  1 sibling, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-17 15:26 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Tue, 11 Mar 2025 15:05:38 +0100

> On Wed, Mar 05, 2025 at 05:21:19PM +0100, Alexander Lobakin wrote:
>> "Couple" is a bit humbly... Add the following functionality to libeth:
>>
>> * XDP shared queues managing
>> * XDP_TX bulk sending infra
>> * .ndo_xdp_xmit() infra
>> * adding buffers to &xdp_buff
>> * running XDP prog and managing its verdict
>> * completing XDP Tx buffers
>>
>> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # lots of stuff
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
> Patch is really big and I'm not sure how to trim this TBH to make my
> comments bearable. I know this is highly optimized but it's rather hard to
> follow with all of the callbacks, defines/aligns and whatnot. Any chance
> to chop this commit a bit?

Sometimes "highly optimized" code means "not really readable". See
PeterZ's code :D I mean, I'm not able to write it to look more readable
without hurting object code or not provoking code duplications. Maybe
it's an art which I don't possess.
I tried my best and left the documentation, even with pseudo-examples.
Sorry if it doesn't help =\

> 
> Timers and locking logic could be pulled out to separate patches I think.
> You don't ever say what improvement gave you the __LIBETH_WORD_ACCESS
> approach. You've put a lot of thought onto this work and I feel like this

I don't record/remember all of the perf changes. Couple percent for
sure. Plus lighter object code.
I can recall ~ -50-60 bytes in libeth_xdp_process_buff(), even though
there's only 1 64-bit write replacing 2 32-bit writes. When there's a
lot, like descriptor filling, it was 100+ bytes off, esp. when unrolling.
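
Generic illustration of what the word access gives (not the actual
libeth_xdp code):

	struct desc_half {
		u32 lo;
		u32 hi;
	};

	/* Two 32-bit stores... */
	static void fill_2x32(struct desc_half *d, u64 val)
	{
		d->lo = lower_32_bits(val);
		d->hi = upper_32_bits(val);
	}

	/* ...vs. one 64-bit store when the layout allows it, which is
	 * what __LIBETH_WORD_ACCESS enables (assumes little-endian and
	 * 8-byte alignment).
	 */
	static void fill_1x64(struct desc_half *d, u64 val)
	{
		*(u64 *)d = val;
	}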

> is not explained/described thoroughly. What would be nice to see is to
> have this in the separate commit as well with a comment like 'this gave me
> +X% performance boost on Y workload'. That would be probably a non-zero
> effort to restructure it but generally while jumping back and forth

Yeah, it would be quite a big effort. I had a bit of a hard time
splitting it into 2 commits (XDP and XSk) from one; that request would
cost a bunch more.

Dunno if it would make sense at all? Defines, alignments etc, won't go
away. Same for "head-scratching moments". Moreover, sometimes splitting
the code raises more questions, as it feels incomplete until the last
patch, and then there'll be a train of replies like "this will be
added/changed in patch number X", which I don't like to do :s
I mean, I would like to not sacrifice time splitting it only for the
sake of split, depends on how critical this is and what it would give.

> through this code I had a lot of head-scratching moments.
> 
>> ---
>>  drivers/net/ethernet/intel/libeth/Kconfig  |   10 +-
>>  drivers/net/ethernet/intel/libeth/Makefile |    7 +-
>>  include/net/libeth/types.h                 |  106 +-
>>  drivers/net/ethernet/intel/libeth/priv.h   |   26 +
>>  include/net/libeth/tx.h                    |   30 +-
>>  include/net/libeth/xdp.h                   | 1827 ++++++++++++++++++++
>>  drivers/net/ethernet/intel/libeth/tx.c     |   38 +
>>  drivers/net/ethernet/intel/libeth/xdp.c    |  431 +++++
>>  8 files changed, 2467 insertions(+), 8 deletions(-)
>>  create mode 100644 drivers/net/ethernet/intel/libeth/priv.h
>>  create mode 100644 include/net/libeth/xdp.h
>>  create mode 100644 drivers/net/ethernet/intel/libeth/tx.c
>>  create mode 100644 drivers/net/ethernet/intel/libeth/xdp.c

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 07/16] idpf: link NAPIs to queues
  2025-03-12 17:16     ` Alexander Lobakin
@ 2025-03-18 17:10       ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-03-18 17:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: intel-wired-lan, Michal Kubiak, Maciej Fijalkowski, Tony Nguyen,
	Przemek Kitszel, Andrew Lunn, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Wed, 12 Mar 2025 18:16:11 +0100

> From: Eric Dumazet <edumazet@google.com>
> Date: Fri, 7 Mar 2025 11:28:36 +0100
> 
>> On Wed, Mar 5, 2025 at 5:22 PM Alexander Lobakin
>> <aleksander.lobakin@intel.com> wrote:
>>>
>>> Add the missing linking of NAPIs to netdev queues when enabling
>>> interrupt vectors in order to support NAPI configuration and
>>> interfaces requiring get_rx_queue()->napi to be set (like XSk
>>> busy polling).
>>>
>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>> ---
>>>  drivers/net/ethernet/intel/idpf/idpf_txrx.c | 30 +++++++++++++++++++++
>>>  1 file changed, 30 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
>>> index 2f221c0abad8..a3f6e8cff7a0 100644
>>> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
>>> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
>>> @@ -3560,8 +3560,11 @@ void idpf_vport_intr_rel(struct idpf_vport *vport)
>>>  static void idpf_vport_intr_rel_irq(struct idpf_vport *vport)
>>>  {
>>>         struct idpf_adapter *adapter = vport->adapter;
>>> +       bool unlock;
>>>         int vector;
>>>
>>> +       unlock = rtnl_trylock();
>>
>> This is probably not what you want here ?
>>
>> If another thread is holding RTNL, then rtnl_ttrylock() will not add
>> any protection.
> 
> Yep I know. trylock() is because this function can be called in two
> scenarios:
> 
> 1) .ndo_close(), when RTNL is already locked;
> 2) "soft reset" aka "stop the traffic, reallocate the queues, start the
>    traffic", when RTNL is not taken.
> 
> The second one spits a WARN without the RTNL being locked. So this
> trylock() will do nothing for the first scenario and will take the lock
> for the second one.
> 
> If that is not correct, let me know, I'll do it a different way (maybe
> it's better to unconditionally take the lock on the callsite for the
> second case?).

Ping. What should I do, lock RTNL on the callsite or proceed with trylock?

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp)
  2025-03-17 15:26     ` Alexander Lobakin
@ 2025-03-19 16:19       ` Maciej Fijalkowski
  2025-04-01 13:11         ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-19 16:19 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Mon, Mar 17, 2025 at 04:26:04PM +0100, Alexander Lobakin wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Date: Tue, 11 Mar 2025 15:05:38 +0100
> 
> > On Wed, Mar 05, 2025 at 05:21:19PM +0100, Alexander Lobakin wrote:
> >> "Couple" is a bit humbly... Add the following functionality to libeth:
> >>
> >> * XDP shared queues managing
> >> * XDP_TX bulk sending infra
> >> * .ndo_xdp_xmit() infra
> >> * adding buffers to &xdp_buff
> >> * running XDP prog and managing its verdict
> >> * completing XDP Tx buffers
> >>
> >> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # lots of stuff
> >> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> > 
> > Patch is really big and I'm not sure how to trim this TBH to make my
> > comments bearable. I know this is highly optimized but it's rather hard to
> > follow with all of the callbacks, defines/aligns and whatnot. Any chance
> > to chop this commit a bit?
> 
> Sometimes "highly optimized" code means "not really readable". See
> PeterZ's code :D I mean, I'm not able to write it to look more readable
> without hurting object code or not provoking code duplications. Maybe
> it's an art which I don't possess.
> I tried my best and left the documentation, even with pseudo-examples.
> Sorry if it doesn't help =\

Do you mean doxygen descriptions or what kind of documentation - I must be
missing something?

You cut out all of the stuff I asked about in this review - are you going
to address any of those or what should I expect?

> 
> > 
> > Timers and locking logic could be pulled out to separate patches I think.
> > You don't ever say what improvement gave you the __LIBETH_WORD_ACCESS
> > approach. You've put a lot of thought onto this work and I feel like this
> 
> I don't record/remember all of the perf changes. Couple percent for
> sure. Plus lighter object code.
> I can recall ~ -50-60 bytes in libeth_xdp_process_buff(), even though
> there's only 1 64-bit write replacing 2 32-bit writes. When there's a
> lot, like descriptor filling, it was 100+ bytes off, esp. when unrolling.

I just wanted to hint that it felt like this feature could be stripped
from this huge patch, and then on top of it you would have it as 'this
is my awesome feature that gave me X improvement, eat it'. As I tried to
say, any small pullouts would make it easier to comprehend, at least from
the reviewer's POV...

> 
> > is not explained/described thoroughly. What would be nice to see is to
> > have this in the separate commit as well with a comment like 'this gave me
> > +X% performance boost on Y workload'. That would be probably a non-zero
> > effort to restructure it but generally while jumping back and forth
> 
> Yeah, it would be quite a big effort. I had a bit of a hard time
> splitting it into 2 commits (XDP and XSk) from one; that request would
> cost a bunch more.
> 
> Dunno if it would make sense at all? Defines, alignments etc, won't go
> away. Same for "head-scratching moments". Moreover, sometimes splitting

maybe ask yourself this - if you add a new ethernet driver, are you adding
it in a single commit or do you send a patch set that is structured to
some degree:) I have a feeling that this patch could be converted to a
patch set where each bullet from the commit message is a separate patch.

> the code raises more questions, as it feels incomplete until the last
> patch, and then there'll be a train of replies like "this will be
> added/changed in patch number X", which I don't like to do :s

I agree, it's a tradeoff here; given that the user of the lib is the
driver, it would be tricky to split properly.

> I mean, I would like to not sacrifice time splitting it only for the
> sake of split, depends on how critical this is and what it would give.

Not sure what to say here. The time you dedicate to making this work easier
to swallow means less time the reviewer has to spend going through it.

I like the end result though and how driver side looks like when using
this lib. Sorry for trying to understand the internals:)

> 
> > through this code I had a lot of head-scratching moments.
> > 
> >> ---
> >>  drivers/net/ethernet/intel/libeth/Kconfig  |   10 +-
> >>  drivers/net/ethernet/intel/libeth/Makefile |    7 +-
> >>  include/net/libeth/types.h                 |  106 +-
> >>  drivers/net/ethernet/intel/libeth/priv.h   |   26 +
> >>  include/net/libeth/tx.h                    |   30 +-
> >>  include/net/libeth/xdp.h                   | 1827 ++++++++++++++++++++
> >>  drivers/net/ethernet/intel/libeth/tx.c     |   38 +
> >>  drivers/net/ethernet/intel/libeth/xdp.c    |  431 +++++
> >>  8 files changed, 2467 insertions(+), 8 deletions(-)
> >>  create mode 100644 drivers/net/ethernet/intel/libeth/priv.h
> >>  create mode 100644 include/net/libeth/xdp.h
> >>  create mode 100644 drivers/net/ethernet/intel/libeth/tx.c
> >>  create mode 100644 drivers/net/ethernet/intel/libeth/xdp.c
> 
> Thanks,
> Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 12/16] idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq
  2025-03-17 14:58     ` Alexander Lobakin
@ 2025-03-19 16:23       ` Maciej Fijalkowski
  0 siblings, 0 replies; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-19 16:23 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Mon, Mar 17, 2025 at 03:58:12PM +0100, Alexander Lobakin wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Date: Fri, 7 Mar 2025 15:16:48 +0100
> 
> > On Wed, Mar 05, 2025 at 05:21:28PM +0100, Alexander Lobakin wrote:
> >> From: Michal Kubiak <michal.kubiak@intel.com>
> >>
> >> Implement loading/removing XDP program using .ndo_bpf callback
> >> in the split queue mode. Reconfigure and restart the queues if needed
> >> (!!old_prog != !!new_prog), otherwise, just update the pointers.
> >>
> >> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
> >> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> >> ---
> >>  drivers/net/ethernet/intel/idpf/idpf_txrx.h |   4 +-
> >>  drivers/net/ethernet/intel/idpf/xdp.h       |   7 ++
> >>  drivers/net/ethernet/intel/idpf/idpf_lib.c  |   1 +
> >>  drivers/net/ethernet/intel/idpf/idpf_txrx.c |   4 +
> >>  drivers/net/ethernet/intel/idpf/xdp.c       | 114 ++++++++++++++++++++
> >>  5 files changed, 129 insertions(+), 1 deletion(-)
> >>
> > 
> > (...)
> > 
> >> +
> >> +/**
> >> + * idpf_xdp_setup_prog - handle XDP program install/remove requests
> >> + * @vport: vport to configure
> >> + * @xdp: request data (program, extack)
> >> + *
> >> + * Return: 0 on success, -errno on failure.
> >> + */
> >> +static int
> >> +idpf_xdp_setup_prog(struct idpf_vport *vport, const struct netdev_bpf *xdp)
> >> +{
> >> +	const struct idpf_netdev_priv *np = netdev_priv(vport->netdev);
> >> +	struct bpf_prog *old, *prog = xdp->prog;
> >> +	struct idpf_vport_config *cfg;
> >> +	int ret;
> >> +
> >> +	cfg = vport->adapter->vport_config[vport->idx];
> >> +	if (!vport->num_xdp_txq && vport->num_txq == cfg->max_q.max_txq) {
> >> +		NL_SET_ERR_MSG_MOD(xdp->extack,
> >> +				   "No Tx queues available for XDP, please decrease the number of regular SQs");
> >> +		return -ENOSPC;
> >> +	}
> >> +
> >> +	if (test_bit(IDPF_REMOVE_IN_PROG, vport->adapter->flags) ||
> > 
> > IN_PROG is a bit unfortunate here as it mixes with 'prog' :P
> 
> Authentic idpf dictionary ¯\_(ツ)_/¯
> 
> > 
> >> +	    !!vport->xdp_prog == !!prog) {
> >> +		if (np->state == __IDPF_VPORT_UP)
> >> +			idpf_copy_xdp_prog_to_qs(vport, prog);
> >> +
> >> +		old = xchg(&vport->xdp_prog, prog);
> >> +		if (old)
> >> +			bpf_prog_put(old);
> >> +
> >> +		cfg->user_config.xdp_prog = prog;
> >> +
> >> +		return 0;
> >> +	}
> >> +
> >> +	old = cfg->user_config.xdp_prog;
> >> +	cfg->user_config.xdp_prog = prog;
> >> +
> >> +	ret = idpf_initiate_soft_reset(vport, IDPF_SR_Q_CHANGE);
> >> +	if (ret) {
> >> +		NL_SET_ERR_MSG_MOD(xdp->extack,
> >> +				   "Could not reopen the vport after XDP setup");
> >> +
> >> +		if (prog)
> >> +			bpf_prog_put(prog);
> > 
> > aren't you missing this for prog->NULL conversion? you have this for
> > hot-swap case (prog->prog).
> 
> This path (soft_reset) handles NULL => prog and prog => NULL. This
> branch in particular handles errors during the soft reset, when we need
> to restore the original prog and put the new one.
> 
> What you probably meant is that I don't have bpf_prog_put(old) in case
> everything went well below? Breh =\

yes, best to check with bpftool if there are dangling bpf progs on the
system after using a few xdp samples, for example.

> 
> > 
> >> +
> >> +		cfg->user_config.xdp_prog = old;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> 
> Thanks,
> Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 11/16] idpf: prepare structures to support XDP
  2025-03-17 14:50     ` Alexander Lobakin
@ 2025-03-19 16:29       ` Maciej Fijalkowski
  2025-04-08 13:42         ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-03-19 16:29 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Mon, Mar 17, 2025 at 03:50:11PM +0100, Alexander Lobakin wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Date: Fri, 7 Mar 2025 14:27:13 +0100
> 
> > On Wed, Mar 05, 2025 at 05:21:27PM +0100, Alexander Lobakin wrote:
> >> From: Michal Kubiak <michal.kubiak@intel.com>
> >>
> >> Extend basic structures of the driver (e.g. 'idpf_vport', 'idpf_*_queue',
> >> 'idpf_vport_user_config_data') by adding members necessary to support XDP.
> >> Add extra XDP Tx queues needed to support XDP_TX and XDP_REDIRECT actions
> >> without interfering with regular Tx traffic.
> >> Also add functions dedicated to support XDP initialization for Rx and
> >> Tx queues and call those functions from the existing algorithms of
> >> queues configuration.
> 
> [...]
> 
> >> diff --git a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
> >> index 59b1a1a09996..1ca322bfe92f 100644
> >> --- a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
> >> +++ b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
> >> @@ -186,9 +186,11 @@ static void idpf_get_channels(struct net_device *netdev,
> >>  {
> >>  	struct idpf_netdev_priv *np = netdev_priv(netdev);
> >>  	struct idpf_vport_config *vport_config;
> >> +	const struct idpf_vport *vport;
> >>  	u16 num_txq, num_rxq;
> >>  	u16 combined;
> >>  
> >> +	vport = idpf_netdev_to_vport(netdev);
> >>  	vport_config = np->adapter->vport_config[np->vport_idx];
> >>  
> >>  	num_txq = vport_config->user_config.num_req_tx_qs;
> >> @@ -202,8 +204,8 @@ static void idpf_get_channels(struct net_device *netdev,
> >>  	ch->max_rx = vport_config->max_q.max_rxq;
> >>  	ch->max_tx = vport_config->max_q.max_txq;
> >>  
> >> -	ch->max_other = IDPF_MAX_MBXQ;
> >> -	ch->other_count = IDPF_MAX_MBXQ;
> >> +	ch->max_other = IDPF_MAX_MBXQ + vport->num_xdp_txq;
> >> +	ch->other_count = IDPF_MAX_MBXQ + vport->num_xdp_txq;
> > 
> > That's new I think. Do you explain somewhere that other `other` will carry
> > xdpq count? Otherwise how would I know to interpret this value?
> 
> Where? :D

I meant: say something in the commit message about how the new output
should be interpreted.

> 
> > 
> > Also from what I see num_txq carries (txq + xdpq) count. How is that
> > affecting the `combined` from ethtool_channels?
> 
> No changes in combined/Ethtool, num_txq is not used there. Stuff like
> req_txq_num includes skb queues only.
> 
> > 
> >>  
> >>  	ch->combined_count = combined;
> >>  	ch->rx_count = num_rxq - combined;
> >> diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
> >> index 2594ca38e8ca..0f4edc9cd1ad 100644
> > 
> > (...)
> > 
> >> +
> >> +/**
> >> + * __idpf_xdp_rxq_info_init - Setup XDP RxQ info for a given Rx queue
> >> + * @rxq: Rx queue for which the resources are setup
> >> + * @arg: flag indicating if the HW works in split queue mode
> >> + *
> >> + * Return: 0 on success, negative on failure.
> >> + */
> >> +static int __idpf_xdp_rxq_info_init(struct idpf_rx_queue *rxq, void *arg)
> >> +{
> >> +	const struct idpf_vport *vport = rxq->q_vector->vport;
> >> +	bool split = idpf_is_queue_model_split(vport->rxq_model);
> >> +	const struct page_pool *pp;
> >> +	int err;
> >> +
> >> +	err = __xdp_rxq_info_reg(&rxq->xdp_rxq, vport->netdev, rxq->idx,
> >> +				 rxq->q_vector->napi.napi_id,
> >> +				 rxq->rx_buf_size);
> >> +	if (err)
> >> +		return err;
> >> +
> >> +	pp = split ? rxq->bufq_sets[0].bufq.pp : rxq->pp;
> >> +	xdp_rxq_info_attach_page_pool(&rxq->xdp_rxq, pp);
> >> +
> >> +	if (!split)
> >> +		return 0;
> > 
> > why do you care about splitq model if on next patch you don't allow
> > XDP_SETUP_PROG for that?
> 
> This function is called unconditionally for both queue models. If we
> don't account it here, we'd break regular traffic flow.
> 
> (singleq will be removed soon, don't take it seriously anyway)

ack, thanks

> 
> [...]
> 
> >> +int idpf_vport_xdpq_get(const struct idpf_vport *vport)
> >> +{
> >> +	struct libeth_xdpsq_timer **timers __free(kvfree) = NULL;
> > 
> > please bear with me here - so this array will exist as long as there is a
> > single timers[i] allocated? even though it's a local var?
> 
> No problem.
> 
> No, this array will be freed when the function exits. This array is an
> array of pointers to iterate in a loop and assign timers to queues. When
> we exit this function, it's no longer needed.
> I can't place the whole array on the stack since I don't know the actual
> queue count + it can be really big (1024 pointers * 8 = 8 Kb, even 128
> or 256 queues is already 1-2 Kb).

so this array is needed to ease the error path handling?

> 
> The actual timers are allocated separately and NUMA-locally below.
> 
> > 
> > this way you avoid the need to store it in vport?
> > 
> >> +	struct net_device *dev;
> >> +	u32 sqs;
> >> +
> >> +	if (!idpf_xdp_is_prog_ena(vport))
> >> +		return 0;
> >> +
> >> +	timers = kvcalloc(vport->num_xdp_txq, sizeof(*timers), GFP_KERNEL);
> >> +	if (!timers)
> >> +		return -ENOMEM;
> >> +
> >> +	for (u32 i = 0; i < vport->num_xdp_txq; i++) {
> >> +		timers[i] = kzalloc_node(sizeof(*timers[i]), GFP_KERNEL,
> >> +					 cpu_to_mem(i));
> >> +		if (!timers[i]) {
> >> +			for (int j = i - 1; j >= 0; j--)
> >> +				kfree(timers[j]);
> >> +
> >> +			return -ENOMEM;
> >> +		}
> >> +	}
> >> +
> >> +	dev = vport->netdev;
> >> +	sqs = vport->xdp_txq_offset;
> >> +
> >> +	for (u32 i = sqs; i < vport->num_txq; i++) {
> >> +		struct idpf_tx_queue *xdpq = vport->txqs[i];
> >> +
> >> +		xdpq->complq = xdpq->txq_grp->complq;
> >> +
> >> +		idpf_queue_clear(FLOW_SCH_EN, xdpq);
> >> +		idpf_queue_clear(FLOW_SCH_EN, xdpq->complq);
> >> +		idpf_queue_set(NOIRQ, xdpq);
> >> +		idpf_queue_set(XDP, xdpq);
> >> +		idpf_queue_set(XDP, xdpq->complq);
> >> +
> >> +		xdpq->timer = timers[i - sqs];
> >> +		libeth_xdpsq_get(&xdpq->xdp_lock, dev, vport->xdpq_share);
> >> +
> >> +		xdpq->pending = 0;
> >> +		xdpq->xdp_tx = 0;
> >> +		xdpq->thresh = libeth_xdp_queue_threshold(xdpq->desc_count);
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> 
> Thanks,
> Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp)
  2025-03-19 16:19       ` Maciej Fijalkowski
@ 2025-04-01 13:11         ` Alexander Lobakin
  2025-04-08 13:38           ` Alexander Lobakin
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-04-01 13:11 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Wed, 19 Mar 2025 17:19:44 +0100

> On Mon, Mar 17, 2025 at 04:26:04PM +0100, Alexander Lobakin wrote:
>> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>> Date: Tue, 11 Mar 2025 15:05:38 +0100
>>
>>> On Wed, Mar 05, 2025 at 05:21:19PM +0100, Alexander Lobakin wrote:
>>>> "Couple" is a bit humbly... Add the following functionality to libeth:
>>>>
>>>> * XDP shared queues managing
>>>> * XDP_TX bulk sending infra
>>>> * .ndo_xdp_xmit() infra
>>>> * adding buffers to &xdp_buff
>>>> * running XDP prog and managing its verdict
>>>> * completing XDP Tx buffers
>>>>
>>>> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # lots of stuff
>>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>>
>>> Patch is really big and I'm not sure how to trim this TBH to make my
>>> comments bearable. I know this is highly optimized but it's rather hard to
>>> follow with all of the callbacks, defines/aligns and whatnot. Any chance
>>> to chop this commit a bit?
>>
>> Sometimes "highly optimized" code means "not really readable". See
>> PeterZ's code :D I mean, I'm not able to write it to look more readable
>> without hurting object code or provoking code duplication. Maybe
>> it's an art which I don't possess.
>> I tried my best and left the documentation, even with pseudo-examples.
>> Sorry if it doesn't help =\
> 
> Do you mean doxygen descriptions or what kind of documentation - I must be
> missing something?

Yes, and not only those; I meant all of the comments. There are even some
pseudo-code example blocks for complicated stuff.

> 
> You cut out all of the stuff I asked about in this review - are you going
> to address any of those or what should I expect?

I haven't read all of them yet, a bit of patience. Of course I didn't
cut them out to avoid addressing them at all :D

> 
>>
>>>
>>> Timers and locking logic could be pulled out to separate patches I think.
>>> You don't ever say what improvement gave you the __LIBETH_WORD_ACCESS
>>> approach. You've put a lot of thought onto this work and I feel like this
>>
>> I don't record/remember all of the perf changes. Couple percent for
>> sure. Plus lighter object code.
>> I can recall ~ -50-60 bytes in libeth_xdp_process_buff(), even though
>> there's only 1 64-bit write replacing 2 32-bit writes. When there's a
>> lot, like descriptor filling, it was 100+ bytes off, esp. when unrolling.
> 
> I just wanted to hint that it felt like this feature could be stripped
> from this huge patch and then on top of it you would have it as 'this
> is my awesome feature that gave me X improvement, eat it'. As I tried to
> say, any small pullouts would make it easier to comprehend, at least from
> reviewer's POV...

Makes sense, but unfortunately this won't cut off a lot. I'll try though,
to the degree where I'd need to provide stubs.

> 
>>
>>> is not explained/described thoroughly. What would be nice to see is to
>>> have this in the separate commit as well with a comment like 'this gave me
>>> +X% performance boost on Y workload'. That would be probably a non-zero
>>> effort to restructure it but generally while jumping back and forth
>>
>> Yeah, it would be quite big. I had a bit of a hard time splitting it into
>> 2 commits (XDP and XSk) from one; that request would cost a bunch more.
>>
>> Dunno if it would make sense at all? Defines, alignments etc, won't go
>> away. Same for "head-scratching moments". Moreover, sometimes splitting
> 
> maybe ask yourself this - if you add a new ethernet driver, are you adding
> this in a single commit or do you send a patch set that is structured in
> some degree:) I have a feeling that this patch could be converted to a
> patch set where each bullet from commit message is a separate patch.
> 
>> the code raises more questions as it feels incomplete until the last
>> patch and then there'll be a train of replies like "this will be
>> added/changed in patch number X", which I don't like to do :s
> 
> I agree here it's a tradeoff which given that user of lib is driver would
> be tricky to split properly.
> 
>> I mean, I would like to not sacrifice time splitting it only for the
>> sake of split, depends on how critical this is and what it would give.
> 
> Not sure what to say here. Your time dedicated to making this work easier
> to swallow means less time the reviewer has to spend going through it.

Also correct.

> 
> I like the end result though and how driver side looks like when using
> this lib. Sorry for trying to understand the internals:)
> 
>>
>>> through this code I had a lot of head-scratching moments.

I'll process the rest of your review really soon.

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp)
  2025-03-11 14:05   ` Maciej Fijalkowski
  2025-03-17 15:26     ` Alexander Lobakin
@ 2025-04-08 13:22     ` Alexander Lobakin
  2025-04-08 13:51       ` Alexander Lobakin
  1 sibling, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-04-08 13:22 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Tue, 11 Mar 2025 15:05:38 +0100

> On Wed, Mar 05, 2025 at 05:21:19PM +0100, Alexander Lobakin wrote:
>> "Couple" is a bit humbly... Add the following functionality to libeth:

[...]

>> +struct libeth_rq_napi_stats {
>> +	union {
>> +		struct {
>> +							u32 packets;
>> +							u32 bytes;
>> +							u32 fragments;
>> +							u32 hsplit;
>> +		};
>> +		DECLARE_FLEX_ARRAY(u32, raw);
> 
> The @raw approach is never used throughout the patchset, right?

Right, but

> Could you explain the reason for introducing this and potential use case?

initially, when my tree also contained libeth generic stats, I used this
field to update queue stats in a loop (unrolled by 4 fields) instead of
field-by-field.
Generic stats are still planned, and since ::raw is already present in
&libeth_sq_napi_stats, I'd like to keep it :z
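
Just to illustrate -- a minimal sketch with a made-up generic-stats struct
(demo_rq_stats / demo_rq_stats_add() are hypothetical, only
&libeth_rq_napi_stats is real), updating the counters via ::raw in one
loop instead of field by field:

	#include <linux/stddef.h>
	#include <linux/types.h>
	/* plus the libeth header providing &libeth_rq_napi_stats */

	struct demo_rq_stats {
		/* hypothetical per-queue generic Rx stats */
		union {
			struct {
				u64	packets;
				u64	bytes;
				u64	fragments;
				u64	hsplit;
			};
			DECLARE_FLEX_ARRAY(u64, raw);
		};
	};

	static void demo_rq_stats_add(struct demo_rq_stats *qs,
				      const struct libeth_rq_napi_stats *rs)
	{
		/* the field order matches, so one loop over ::raw updates
		 * all of the counters at once
		 */
		for (u32 i = 0; i < sizeof(*rs) / sizeof(*rs->raw); i++)
			qs->raw[i] += rs->raw[i];
	}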

> 
>> +	};
>> +};
>>  
>>  /**
>>   * struct libeth_sq_napi_stats - "hot" counters to update in Tx completion loop
>> @@ -22,4 +44,84 @@ struct libeth_sq_napi_stats {
>>  	};
>>  };
>>  
>> +/**
>> + * struct libeth_xdpsq_napi_stats - "hot" counters to update in XDP Tx
>> + *				    completion loop
>> + * @packets: completed frames counter
>> + * @bytes: sum of bytes of completed frames above
>> + * @fragments: sum of fragments of completed S/G frames
>> + * @raw: alias to access all the fields as an array
>> + */
>> +struct libeth_xdpsq_napi_stats {
> 
> what's the delta between this and libeth_sq_napi_stats ? couldn't you have
> a single struct for purpose of tx napi stats?

Same as the previous one: future-proofing. &libeth_sq{,_napi}_stats will
contain stuff XDP queues will never need, and vice versa.

> 
>> +	union {
>> +		struct {
>> +							u32 packets;
>> +							u32 bytes;
>> +							u32 fragments;
>> +		};
>> +		DECLARE_FLEX_ARRAY(u32, raw);
>> +	};
>> +};

[...]

>> @@ -71,7 +84,10 @@ struct libeth_sqe {
>>  /**
>>   * struct libeth_cq_pp - completion queue poll params
>>   * @dev: &device to perform DMA unmapping
>> + * @bq: XDP frame bulk to combine return operations
>>   * @ss: onstack NAPI stats to fill
>> + * @xss: onstack XDPSQ NAPI stats to fill
>> + * @xdp_tx: number of XDP frames processed
>>   * @napi: whether it's called from the NAPI context
>>   *
>>   * libeth uses this structure to access objects needed for performing full
>> @@ -80,7 +96,13 @@ struct libeth_sqe {
>>   */
>>  struct libeth_cq_pp {
>>  	struct device			*dev;
>> -	struct libeth_sq_napi_stats	*ss;
>> +	struct xdp_frame_bulk		*bq;
>> +
>> +	union {
>> +		struct libeth_sq_napi_stats	*ss;
>> +		struct libeth_xdpsq_napi_stats	*xss;
>> +	};
>> +	u32				xdp_tx;
> 
> you have this counted in xss::packets?

Nope, it's the same as in ice: you have separate ::packets AND ::xdp_tx
on the ring to speed up XSk completion when there are no non-XSk XDP buffers.

> 
>>  
>>  	bool				napi;
>>  };

[...]

>> +/* Common Tx bits */
>> +
>> +/**
>> + * enum - libeth_xdp internal Tx flags
>> + * @LIBETH_XDP_TX_BULK: one bulk size at which it will be flushed to the queue
>> + * @LIBETH_XDP_TX_BATCH: batch size for which the queue fill loop is unrolled
>> + * @LIBETH_XDP_TX_DROP: indicates the send function must drop frames not sent
>> + * @LIBETH_XDP_TX_NDO: whether the send function is called from .ndo_xdp_xmit()
>> + */
>> +enum {
>> +	LIBETH_XDP_TX_BULK		= DEV_MAP_BULK_SIZE,
>> +	LIBETH_XDP_TX_BATCH		= 8,
>> +
>> +	LIBETH_XDP_TX_DROP		= BIT(0),
>> +	LIBETH_XDP_TX_NDO		= BIT(1),
> 
> what's the reason to group these random values onto enum?

They will then be visible in BTF (not sure anyone will need them there).

> 
>> +};
>> +
>> +/**
>> + * enum - &libeth_xdp_tx_frame and &libeth_xdp_tx_desc flags
>> + * @LIBETH_XDP_TX_LEN: only for ``XDP_TX``, [15:0] of ::len_fl is actual length
>> + * @LIBETH_XDP_TX_FIRST: indicates the frag is the first one of the frame
>> + * @LIBETH_XDP_TX_LAST: whether the frag is the last one of the frame
>> + * @LIBETH_XDP_TX_MULTI: whether the frame contains several frags
> 
> would be good to have some extended description around usage of these
> flags.

They are internal to libeth functions anyway, hence no detailed description.

> 
>> + * @LIBETH_XDP_TX_FLAGS: only for ``XDP_TX``, [31:16] of ::len_fl is flags
>> + */
>> +enum {
>> +	LIBETH_XDP_TX_LEN		= GENMASK(15, 0),
>> +
>> +	LIBETH_XDP_TX_FIRST		= BIT(16),
>> +	LIBETH_XDP_TX_LAST		= BIT(17),
>> +	LIBETH_XDP_TX_MULTI		= BIT(18),
>> +
>> +	LIBETH_XDP_TX_FLAGS		= GENMASK(31, 16),
>> +};

[...]

>> +/**
>> + * libeth_xdp_tx_queue_head - internal helper for queueing one ``XDP_TX`` head
>> + * @bq: XDP Tx bulk to queue the head frag to
>> + * @xdp: XDP buffer with the head to queue
>> + *
>> + * Return: false if it's the only frag of the frame, true if it's an S/G frame.
>> + */
>> +static inline bool libeth_xdp_tx_queue_head(struct libeth_xdp_tx_bulk *bq,
>> +					    const struct libeth_xdp_buff *xdp)
>> +{
>> +	const struct xdp_buff *base = &xdp->base;
>> +
>> +	bq->bulk[bq->count++] = (typeof(*bq->bulk)){
>> +		.data	= xdp->data,
>> +		.len_fl	= (base->data_end - xdp->data) | LIBETH_XDP_TX_FIRST,
>> +		.soff	= xdp_data_hard_end(base) - xdp->data,
>> +	};
>> +
>> +	if (!xdp_buff_has_frags(base))
> 
> likely() ?

With the header split enabled and getting more and more popular -- not
really. likely() hurts perf here actually.

> 
>> +		return false;
>> +
>> +	bq->bulk[bq->count - 1].len_fl |= LIBETH_XDP_TX_MULTI;
>> +
>> +	return true;
>> +}
>> +
>> +/**
>> + * libeth_xdp_tx_queue_frag - internal helper for queueing one ``XDP_TX`` frag
>> + * @bq: XDP Tx bulk to queue the frag to
>> + * @frag: frag to queue
>> + */
>> +static inline void libeth_xdp_tx_queue_frag(struct libeth_xdp_tx_bulk *bq,
>> +					    const skb_frag_t *frag)
>> +{
>> +	bq->bulk[bq->count++].frag = *frag;
> 
> IMHO this helper is not providing anything useful

That's why it's stated "internal helper" :D

> 
>> +}

[...]

>> +#define libeth_xdp_tx_fill_stats(sqe, desc, sinfo)			      \
>> +	__libeth_xdp_tx_fill_stats(sqe, desc, sinfo, __UNIQUE_ID(sqe_),	      \
>> +				   __UNIQUE_ID(desc_), __UNIQUE_ID(sinfo_))
>> +
>> +#define __libeth_xdp_tx_fill_stats(sqe, desc, sinfo, ue, ud, us) do {	      \
>> +	const struct libeth_xdp_tx_desc *ud = (desc);			      \
>> +	const struct skb_shared_info *us;				      \
>> +	struct libeth_sqe *ue = (sqe);					      \
>> +									      \
>> +	ue->nr_frags = 1;						      \
>> +	ue->bytes = ud->len;						      \
>> +									      \
>> +	if (ud->flags & LIBETH_XDP_TX_MULTI) {				      \
>> +		us = (sinfo);						      \
> 
> why? what 'u' stands for? ue us don't tell the reader much from the first
> glance. sinfo tells me everything.

ue -- "unique element"
ud -- "unique descriptor"
us -- "unique sinfo"

All of them are purely internal to pass __UNIQUE_ID() in the
non-underscored version to avoid variable shadowing.
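
If it helps, here's a toy example of the shadowing problem __UNIQUE_ID()
avoids (TWICE*() / demo_*() are made up, nothing from libeth;
__UNIQUE_ID() comes from the kernel's compiler headers):

	#define TWICE_BAD(x)	({ int tmp = (x); tmp + tmp; })

	#define TWICE(x)	__TWICE(x, __UNIQUE_ID(tmp_))
	#define __TWICE(x, ut)	({ int ut = (x); ut + ut; })

	static int demo_bad(void)
	{
		int tmp = 2;

		/* the 'tmp' declared inside TWICE_BAD() shadows the caller's
		 * 'tmp', so (x) ends up reading an uninitialized variable
		 */
		return TWICE_BAD(tmp + 1);
	}

	static int demo_good(void)
	{
		int tmp = 2;

		return TWICE(tmp + 1);	/* unique internal name, no collision */
	}

Passing the __UNIQUE_ID()s through the non-underscored wrapper is exactly
what the macro above does, just with more arguments.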

> 
>> +		ue->nr_frags += us->nr_frags;				      \
>> +		ue->bytes += us->xdp_frags_size;			      \
>> +	}								      \
>> +} while (0)

[...]

>> +/**
>> + * libeth_xdp_xmit_do_bulk - implement full .ndo_xdp_xmit() in driver
>> + * @dev: target &net_device
>> + * @n: number of frames to send
>> + * @fr: XDP frames to send
>> + * @f: flags passed by the stack
>> + * @xqs: array of XDPSQs driver structs
>> + * @nqs: number of active XDPSQs, the above array length
>> + * @fl: driver callback to flush an XDP xmit bulk
>> + * @fin: driver callback to finalize the queue
>> + *
>> + * If the driver has active XDPSQs, perform common checks and send the frames.
>> + * Finalize the queue, if requested.
>> + *
>> + * Return: number of frames sent or -errno on error.
>> + */
>> +#define libeth_xdp_xmit_do_bulk(dev, n, fr, f, xqs, nqs, fl, fin)	      \
>> +	_libeth_xdp_xmit_do_bulk(dev, n, fr, f, xqs, nqs, fl, fin,	      \
>> +				 __UNIQUE_ID(bq_), __UNIQUE_ID(ret_),	      \
>> +				 __UNIQUE_ID(nqs_))
> 
> why __UNIQUE_ID() is needed?

As above, variable shadowing.

> 
>> +
>> +#define _libeth_xdp_xmit_do_bulk(d, n, fr, f, xqs, nqs, fl, fin, ub, ur, un)  \
> 
> why single underscore? usually we do __ for internal funcs as you did
> somewhere above.

Double-underscored is defined above already :D
So it would be either like this or __ + ___

> 
> also, why define and not inlined func?

I'll double check, but if you look at its usage in idpf/xdp.c, you'll
see that some arguments are non-trivial to obtain, IOW they cost some
cycles. The macro ensures they won't be fetched prior to
`likely(number_of_xdpsqs)`.
I'll convert it to an inline and check whether the compiler handles this itself.
It didn't behave in {,__}libeth_xdp_tx_fill_stats() unfortunately, hence
macro there as well =\
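
For contrast, a hypothetical inline version of the same dance (names made
up): being a real function call, its arguments -- including the
costly-to-compute ones -- have to be evaluated before the call unless the
compiler proves it's safe to sink them, which is exactly what the macro
form sidesteps:

	static inline int demo_xmit_do_bulk(u32 nqs, void *xqs,
					    int (*send)(void *xqs, u32 nqs))
	{
		if (likely(nqs))
			return send(xqs, nqs);

		return -ENXIO;
	}

Plus it would have to stay type-generic in @xqs, which a fixed prototype
can't do.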

> 
>> +({									      \
>> +	u32 un = (nqs);							      \
>> +	int ur;								      \
>> +									      \
>> +	if (likely(un)) {						      \
>> +		struct libeth_xdp_tx_bulk ub;				      \
>> +									      \
>> +		libeth_xdp_xmit_init_bulk(&ub, d, xqs, un);		      \
>> +		ur = __libeth_xdp_xmit_do_bulk(&ub, fr, n, f, fl, fin);	      \
>> +	} else {							      \
>> +		ur = -ENXIO;						      \
>> +	}								      \
>> +									      \
>> +	ur;								      \
>> +})

[...]

>> +static inline void
>> +libeth_xdp_init_buff(struct libeth_xdp_buff *dst,
>> +		     const struct libeth_xdp_buff_stash *src,
>> +		     struct xdp_rxq_info *rxq)
> 
> what is the rationale for storing/loading xdp_buff onto/from driver's Rx
> queue? could we work directly on xdp_buff from Rx queue? ice is doing so
> currently.

Stack vs heap. I was getting lower numbers working on the queue directly.
Also note that &libeth_xdp_buff_stash is 16 bytes, while
&libeth_xdp_buff is 64. I don't think it makes sense to waste +48
bytes in the structure.

Load-store of the stash is rare anyway. It can happen *only* if the HW
for some reason hasn't written the whole multi-buffer frame yet, since
NAPI budget is counted by packets, not fragments.

> 
>> +{
>> +	if (likely(!src->data))
>> +		dst->data = NULL;
>> +	else
>> +		libeth_xdp_load_stash(dst, src);
>> +
>> +	dst->base.rxq = rxq;
>> +}

[...]

>> +static inline bool libeth_xdp_process_buff(struct libeth_xdp_buff *xdp,
>> +					   const struct libeth_fqe *fqe,
>> +					   u32 len)
>> +{
>> +	if (!libeth_rx_sync_for_cpu(fqe, len))
>> +		return false;
>> +
>> +	if (xdp->data)
> 
> unlikely() ?

Same as for libeth_xdp_tx_queue_head(): with the header split, you'll
hit this branch every frame.

> 
>> +		return libeth_xdp_buff_add_frag(xdp, fqe, len);
>> +
>> +	libeth_xdp_prepare_buff(xdp, fqe, len);
>> +
>> +	prefetch(xdp->data);
>> +
>> +	return true;
>> +}

[...]

>> +/**
>> + * libeth_xdp_run_prog - run XDP program and handle all verdicts
>> + * @xdp: XDP buffer to process
>> + * @bq: XDP Tx bulk to queue ``XDP_TX`` buffers
>> + * @fl: driver ``XDP_TX`` bulk flush callback
>> + *
>> + * Run the attached XDP program and handle all possible verdicts.
>> + * Prefer using it via LIBETH_XDP_DEFINE_RUN{,_PASS,_PROG}().
>> + *
>> + * Return: true if the buffer should be passed up the stack, false if the poll
>> + * should go to the next buffer.
>> + */
>> +#define libeth_xdp_run_prog(xdp, bq, fl)				      \
> 
> is this used in idpf in this patchset?

Sure. __LIBETH_XDP_DEFINE_RUN() builds two functions, one of which uses it.
Same for __LIBETH_XDP_DEFINE_RUN_PROG(). I know they are hard to read,
but otherwise I'd need to duplicate them for XDP and XSk separately.

> 
>> +	(__libeth_xdp_run_flush(xdp, bq, __libeth_xdp_run_prog,		      \
>> +				libeth_xdp_tx_queue_bulk,		      \
>> +				fl) == LIBETH_XDP_PASS)
>> +
>> +/**
>> + * __libeth_xdp_run_pass - helper to run XDP program and handle the result
>> + * @xdp: XDP buffer to process
>> + * @bq: XDP Tx bulk to queue ``XDP_TX`` frames
>> + * @napi: NAPI to build an skb and pass it up the stack
>> + * @rs: onstack libeth RQ stats
>> + * @md: metadata that should be filled to the XDP buffer
>> + * @prep: callback for filling the metadata
>> + * @run: driver wrapper to run XDP program
> 
> I see it's NULLed on idpf? why have this?

Only for singleq, which we don't support. splitq uses
LIBETH_XDP_DEFINE_RUN() to build idpf_xdp_run_prog() and
idpf_xdp_run_pass().

> 
>> + * @populate: driver callback to populate an skb with the HW descriptor data
>> + *
>> + * Inline abstraction that does the following:
>> + * 1) adds frame size and frag number (if needed) to the onstack stats;
>> + * 2) fills the descriptor metadata to the onstack &libeth_xdp_buff
>> + * 3) runs XDP program if present;
>> + * 4) handles all possible verdicts;
>> + * 5) on ``XDP_PASS`, builds an skb from the buffer;
>> + * 6) populates it with the descriptor metadata;
>> + * 7) passes it up the stack.

[...]

>> +void __cold libeth_xdp_tx_exception(struct libeth_xdp_tx_bulk *bq, u32 sent,
>> +				    u32 flags)
>> +{
>> +	const struct libeth_xdp_tx_frame *pos = &bq->bulk[sent];
>> +	u32 left = bq->count - sent;
>> +
>> +	if (!(flags & LIBETH_XDP_TX_NDO))
>> +		libeth_trace_xdp_exception(bq->dev, bq->prog, XDP_TX);
>> +
>> +	if (!(flags & LIBETH_XDP_TX_DROP)) {
>> +		memmove(bq->bulk, pos, left * sizeof(*bq->bulk));
> 
> can this overflow? if queue got stuck for some reason.

memmove() is safe to call even when src == dst. As for the XDP Tx logic, if
the queue is stuck, the bulk will never overflow; libeth will just try to
send it again and again. In the end, both XDP Tx and xmit call it with
DROP to make sure no memleaks or other issues can take place.

> 
>> +		bq->count = left;
>> +
>> +		return;
>> +	}
>> +
>> +	if (!(flags & LIBETH_XDP_TX_NDO))
>> +		libeth_xdp_tx_return_bulk(pos, left);
>> +	else
>> +		libeth_xdp_xmit_return_bulk(pos, left, bq->dev);
>> +
>> +	bq->count = 0;
>> +}
>> +EXPORT_SYMBOL_GPL(libeth_xdp_tx_exception);

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 14/16] idpf: add support for XDP on Rx
  2025-03-11 15:50   ` Maciej Fijalkowski
@ 2025-04-08 13:28     ` Alexander Lobakin
  2025-04-08 15:53       ` Maciej Fijalkowski
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Lobakin @ 2025-04-08 13:28 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Tue, 11 Mar 2025 16:50:07 +0100

> On Wed, Mar 05, 2025 at 05:21:30PM +0100, Alexander Lobakin wrote:
>> Use libeth XDP infra to support running XDP program on Rx polling.
>> This includes all of the possible verdicts/actions.
>> XDP Tx queues are cleaned only in "lazy" mode when there are less than
>> 1/4 free descriptors left on the ring. libeth helper macros to define
>> driver-specific XDP functions make sure the compiler could uninline
>> them when needed.

[...]

>> +/**
>> + * idpf_clean_xdp_irq - Reclaim a batch of TX resources from completed XDP_TX
>> + * @_xdpq: XDP Tx queue
>> + * @budget: maximum number of descriptors to clean
>> + *
>> + * Returns number of cleaned descriptors.
>> + */
>> +static u32 idpf_clean_xdp_irq(void *_xdpq, u32 budget)
>> +{
>> +	struct libeth_xdpsq_napi_stats ss = { };
>> +	struct idpf_tx_queue *xdpq = _xdpq;
>> +	u32 tx_ntc = xdpq->next_to_clean;
>> +	u32 tx_cnt = xdpq->desc_count;
>> +	struct xdp_frame_bulk bq;
>> +	struct libeth_cq_pp cp = {
>> +		.dev	= xdpq->dev,
>> +		.bq	= &bq,
>> +		.xss	= &ss,
>> +		.napi	= true,
>> +	};
>> +	u32 done_frames;
>> +
>> +	done_frames = idpf_xdpsq_poll(xdpq, budget);
> 
> nit: maybe pass {tx_ntc, tx_cnt} to the above?

Not following... =\

> 
>> +	if (unlikely(!done_frames))
>> +		return 0;
>> +
>> +	xdp_frame_bulk_init(&bq);
>> +
>> +	for (u32 i = 0; likely(i < done_frames); i++) {
>> +		libeth_xdp_complete_tx(&xdpq->tx_buf[tx_ntc], &cp);
>> +
>> +		if (unlikely(++tx_ntc == tx_cnt))
>> +			tx_ntc = 0;
>> +	}
>> +
>> +	xdp_flush_frame_bulk(&bq);
>> +
>> +	xdpq->next_to_clean = tx_ntc;
>> +	xdpq->pending -= done_frames;
>> +	xdpq->xdp_tx -= cp.xdp_tx;
> 
> not following this variable. __libeth_xdp_complete_tx() decresases
> libeth_cq_pp::xdp_tx by libeth_sqe::nr_frags. can you shed more light
> what's going on here?

libeth_sqe::nr_frags is not the same as skb_shared_info::nr_frags; it
equals 1 when there's only 1 fragment.
Basically, the xdp_tx field is the number of pending non-XSk XDP
descriptors. When it's zero, we don't traverse Tx descriptors at all
on XSk completion (thx to splitq).
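
Roughly like this (demo_* names are made up, only xsk_tx_completed() is
the real API; just a sketch of the fast path that counter enables):

	#include <net/xdp_sock_drv.h>

	struct demo_xdpsq {
		struct xsk_buff_pool	*pool;		/* XSk pool bound to the SQ */
		u32			pending;	/* in-flight descriptors */
		u32			xdp_tx;		/* pending non-XSk XDP descs */
	};

	/* walks the SQEs to complete mixed XDP_TX / xmit / XSk traffic */
	void demo_complete_mixed(struct demo_xdpsq *sq, u32 done);

	static void demo_xsk_complete(struct demo_xdpsq *sq, u32 done)
	{
		if (!sq->xdp_tx) {
			/* only XSk frames in flight: pure accounting, no need
			 * to touch the Tx buffer ring at all
			 */
			sq->pending -= done;
			xsk_tx_completed(sq->pool, done);
			return;
		}

		demo_complete_mixed(sq, done);
	}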

> 
>> +
>> +	return done_frames;
>> +}
>> +
>> +static u32 idpf_xdp_tx_prep(void *_xdpq, struct libeth_xdpsq *sq)
>> +{
>> +	struct idpf_tx_queue *xdpq = _xdpq;
>> +	u32 free;
>> +
>> +	libeth_xdpsq_lock(&xdpq->xdp_lock);
>> +
>> +	free = xdpq->desc_count - xdpq->pending;
>> +	if (free <= xdpq->thresh)
>> +		free += idpf_clean_xdp_irq(xdpq, xdpq->thresh);
>> +
>> +	*sq = (struct libeth_xdpsq){
> 
> could you have libeth_xdpsq embedded in idpf_tx_queue and avoid that
> initialization?

Not really. &libeth_xdpsq, same as &libeth_fq et al., has only a few
fields grouped together, while in the driver's queue structure they can
(and likely will) be scattered across cachelines.
This initialization is cheap anyway: &libeth_xdpsq exists only inside
__always_inline helpers, so it might not even be present in the generated
code.

> 
>> +		.sqes		= xdpq->tx_buf,
>> +		.descs		= xdpq->desc_ring,
>> +		.count		= xdpq->desc_count,
>> +		.lock		= &xdpq->xdp_lock,
>> +		.ntu		= &xdpq->next_to_use,
>> +		.pending	= &xdpq->pending,
>> +		.xdp_tx		= &xdpq->xdp_tx,
>> +	};
>> +
>> +	return free;
>> +}

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 15/16] idpf: add support for .ndo_xdp_xmit()
  2025-03-11 16:08   ` Maciej Fijalkowski
@ 2025-04-08 13:31     ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-04-08 13:31 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Tue, 11 Mar 2025 17:08:22 +0100

> On Wed, Mar 05, 2025 at 05:21:31PM +0100, Alexander Lobakin wrote:
>> Use libeth XDP infra to implement .ndo_xdp_xmit() in idpf.
>> The Tx callbacks are reused from XDP_TX code. XDP redirect target
>> feature is set/cleared depending on the XDP prog presence, as for now
>> we still don't allocate XDP Tx queues when there's no program.
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>>  drivers/net/ethernet/intel/idpf/xdp.h      |  2 ++
>>  drivers/net/ethernet/intel/idpf/idpf_lib.c |  1 +
>>  drivers/net/ethernet/intel/idpf/xdp.c      | 29 ++++++++++++++++++++++
>>  3 files changed, 32 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/intel/idpf/xdp.h b/drivers/net/ethernet/intel/idpf/xdp.h
>> index fde85528a315..a2ac1b2f334f 100644
>> --- a/drivers/net/ethernet/intel/idpf/xdp.h
>> +++ b/drivers/net/ethernet/intel/idpf/xdp.h
>> @@ -110,5 +110,7 @@ static inline void idpf_xdp_tx_finalize(void *_xdpq, bool sent, bool flush)
>>  void idpf_xdp_set_features(const struct idpf_vport *vport);
>>  
>>  int idpf_xdp(struct net_device *dev, struct netdev_bpf *xdp);
>> +int idpf_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>> +		  u32 flags);
>>  
>>  #endif /* _IDPF_XDP_H_ */
>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
>> index 2d1efcb854be..39b9885293a9 100644
>> --- a/drivers/net/ethernet/intel/idpf/idpf_lib.c
>> +++ b/drivers/net/ethernet/intel/idpf/idpf_lib.c
>> @@ -2371,4 +2371,5 @@ static const struct net_device_ops idpf_netdev_ops = {
>>  	.ndo_set_features = idpf_set_features,
>>  	.ndo_tx_timeout = idpf_tx_timeout,
>>  	.ndo_bpf = idpf_xdp,
>> +	.ndo_xdp_xmit = idpf_xdp_xmit,
>>  };
>> diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
>> index abf75e840c0a..1834f217a07f 100644
>> --- a/drivers/net/ethernet/intel/idpf/xdp.c
>> +++ b/drivers/net/ethernet/intel/idpf/xdp.c
>> @@ -357,8 +357,35 @@ LIBETH_XDP_DEFINE_START();
>>  LIBETH_XDP_DEFINE_TIMER(static idpf_xdp_tx_timer, idpf_clean_xdp_irq);
>>  LIBETH_XDP_DEFINE_FLUSH_TX(idpf_xdp_tx_flush_bulk, idpf_xdp_tx_prep,
>>  			   idpf_xdp_tx_xmit);
>> +LIBETH_XDP_DEFINE_FLUSH_XMIT(static idpf_xdp_xmit_flush_bulk, idpf_xdp_tx_prep,
>> +			     idpf_xdp_tx_xmit);
>>  LIBETH_XDP_DEFINE_END();
>>  
>> +/**
>> + * idpf_xdp_xmit - send frames queued by ``XDP_REDIRECT`` to this interface
>> + * @dev: network device
>> + * @n: number of frames to transmit
>> + * @frames: frames to transmit
>> + * @flags: transmit flags (``XDP_XMIT_FLUSH`` or zero)
>> + *
>> + * Return: number of frames successfully sent or -errno on error.
>> + */
>> +int idpf_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>> +		  u32 flags)
>> +{
>> +	const struct idpf_netdev_priv *np = netdev_priv(dev);
>> +	const struct idpf_vport *vport = np->vport;
>> +
>> +	if (unlikely(!netif_carrier_ok(dev) || !vport->link_up))
>> +		return -ENETDOWN;
>> +
>> +	return libeth_xdp_xmit_do_bulk(dev, n, frames, flags,
>> +				       &vport->txqs[vport->xdp_txq_offset],
>> +				       vport->num_xdp_txq,
> 
> Have you considered in some future libeth being stateful where you could
> provide some initialization data such as vport->num_xdp_txq which is
> rather constant so that we wouldn't have to pass this all the time?

Is it? Especially in our driver, where there are no XDP Tx queues when no
XDP prog is loaded?
The "state" of libeth would only be a duplication of already existing
data in the drivers themselves, but with the additional problem of
synchronizing this data. XDP prog removed -- you need to reflect that
in the libeth "state", and so on.

> 
> I got a bit puzzled here as it took me some digging that it is only used a
> bound check and libeth_xdpsq_id() uses cpu id as an index.

It's also used to quickly check whether we can send frames at all.

> 
> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
>> +				       idpf_xdp_xmit_flush_bulk,
>> +				       idpf_xdp_tx_finalize);
>> +}

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp)
  2025-04-01 13:11         ` Alexander Lobakin
@ 2025-04-08 13:38           ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-04-08 13:38 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Tue, 1 Apr 2025 15:11:50 +0200

> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Date: Wed, 19 Mar 2025 17:19:44 +0100

[...]

>> Not sure what to say here. Your time dedicated to making this work easier
>> to swallow means less time the reviewer has to spend going through it.

I think we already chatted about this, but just for the record: I was able
to split 03/16 + 04/16 into 14 patches, so the next series I send will be
libeth_xdp alone as 16 patches ._.

> 
> Also correct.
> 
>>
>> I like the end result though and how driver side looks like when using
>> this lib. Sorry for trying to understand the internals:)
>>
>>>
>>>> through this code I had a lot of head-scratching moments.

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 11/16] idpf: prepare structures to support XDP
  2025-03-19 16:29       ` Maciej Fijalkowski
@ 2025-04-08 13:42         ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-04-08 13:42 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Wed, 19 Mar 2025 17:29:37 +0100

> On Mon, Mar 17, 2025 at 03:50:11PM +0100, Alexander Lobakin wrote:
>> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>> Date: Fri, 7 Mar 2025 14:27:13 +0100
>>
>>> On Wed, Mar 05, 2025 at 05:21:27PM +0100, Alexander Lobakin wrote:
>>>> From: Michal Kubiak <michal.kubiak@intel.com>
>>>>
>>>> Extend basic structures of the driver (e.g. 'idpf_vport', 'idpf_*_queue',
>>>> 'idpf_vport_user_config_data') by adding members necessary to support XDP.
>>>> Add extra XDP Tx queues needed to support XDP_TX and XDP_REDIRECT actions
>>>> without interfering with regular Tx traffic.
>>>> Also add functions dedicated to support XDP initialization for Rx and
>>>> Tx queues and call those functions from the existing algorithms of
>>>> queues configuration.
>>
>> [...]
>>
>>>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
>>>> index 59b1a1a09996..1ca322bfe92f 100644
>>>> --- a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
>>>> +++ b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
>>>> @@ -186,9 +186,11 @@ static void idpf_get_channels(struct net_device *netdev,
>>>>  {
>>>>  	struct idpf_netdev_priv *np = netdev_priv(netdev);
>>>>  	struct idpf_vport_config *vport_config;
>>>> +	const struct idpf_vport *vport;
>>>>  	u16 num_txq, num_rxq;
>>>>  	u16 combined;
>>>>  
>>>> +	vport = idpf_netdev_to_vport(netdev);
>>>>  	vport_config = np->adapter->vport_config[np->vport_idx];
>>>>  
>>>>  	num_txq = vport_config->user_config.num_req_tx_qs;
>>>> @@ -202,8 +204,8 @@ static void idpf_get_channels(struct net_device *netdev,
>>>>  	ch->max_rx = vport_config->max_q.max_rxq;
>>>>  	ch->max_tx = vport_config->max_q.max_txq;
>>>>  
>>>> -	ch->max_other = IDPF_MAX_MBXQ;
>>>> -	ch->other_count = IDPF_MAX_MBXQ;
>>>> +	ch->max_other = IDPF_MAX_MBXQ + vport->num_xdp_txq;
>>>> +	ch->other_count = IDPF_MAX_MBXQ + vport->num_xdp_txq;
>>>
>>> That's new I think. Do you explain somewhere that other `other` will carry
>>> xdpq count? Otherwise how would I know to interpret this value?
>>
>> Where? :D
> 
> I meant to say something in commit message how new output should be
> interpreted?
> 
>>
>>>
>>> Also from what I see num_txq carries (txq + xdpq) count. How is that
>>> affecting the `combined` from ethtool_channels?
>>
>> No changes in combined/Ethtool, num_txq is not used there. Stuff like
>> req_txq_num includes skb queues only.
>>
>>>
>>>>  
>>>>  	ch->combined_count = combined;
>>>>  	ch->rx_count = num_rxq - combined;
>>>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_lib.c b/drivers/net/ethernet/intel/idpf/idpf_lib.c
>>>> index 2594ca38e8ca..0f4edc9cd1ad 100644
>>>
>>> (...)
>>>
>>>> +
>>>> +/**
>>>> + * __idpf_xdp_rxq_info_init - Setup XDP RxQ info for a given Rx queue
>>>> + * @rxq: Rx queue for which the resources are setup
>>>> + * @arg: flag indicating if the HW works in split queue mode
>>>> + *
>>>> + * Return: 0 on success, negative on failure.
>>>> + */
>>>> +static int __idpf_xdp_rxq_info_init(struct idpf_rx_queue *rxq, void *arg)
>>>> +{
>>>> +	const struct idpf_vport *vport = rxq->q_vector->vport;
>>>> +	bool split = idpf_is_queue_model_split(vport->rxq_model);
>>>> +	const struct page_pool *pp;
>>>> +	int err;
>>>> +
>>>> +	err = __xdp_rxq_info_reg(&rxq->xdp_rxq, vport->netdev, rxq->idx,
>>>> +				 rxq->q_vector->napi.napi_id,
>>>> +				 rxq->rx_buf_size);
>>>> +	if (err)
>>>> +		return err;
>>>> +
>>>> +	pp = split ? rxq->bufq_sets[0].bufq.pp : rxq->pp;
>>>> +	xdp_rxq_info_attach_page_pool(&rxq->xdp_rxq, pp);
>>>> +
>>>> +	if (!split)
>>>> +		return 0;
>>>
>>> why do you care about splitq model if on next patch you don't allow
>>> XDP_SETUP_PROG for that?
>>
>> This function is called unconditionally for both queue models. If we
>> don't account it here, we'd break regular traffic flow.
>>
>> (singleq will be removed soon, don't take it seriously anyway)
> 
> ack, thanks
> 
>>
>> [...]
>>
>>>> +int idpf_vport_xdpq_get(const struct idpf_vport *vport)
>>>> +{
>>>> +	struct libeth_xdpsq_timer **timers __free(kvfree) = NULL;
>>>
>>> please bear with me here - so this array will exist as long as there is a
>>> single timers[i] allocated? even though it's a local var?
>>
>> No problem.
>>
>> No, this array will be freed when the function exits. This array is an
>> array of pointers to iterate in a loop and assign timers to queues. When
>> we exit this function, it's no longer needed.
>> I can't place the whole array on the stack since I don't know the actual
>> queue count + it can be really big (1024 pointers * 8 = 8 Kb, even 128
>> or 256 queues is already 1-2 Kb).
> 
> so this array is needed to ease the error path handling?

It's needed to store pointers to the actual timers, which are allocated
one by one, so that they can be assigned to the queues in a loop later.
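
In other words, condensed (demo_* names are made up, the pattern is the
same as in the real function): the pointer array is scope-freed via
__free(kvfree) when we leave the function, while the timers it pointed to
live on in the queues.

	#include <linux/cleanup.h>
	#include <linux/slab.h>
	#include <linux/topology.h>

	struct demo_timer {
		unsigned long		deadline;	/* placeholder contents */
	};

	struct demo_queue {
		struct demo_timer	*timer;
	};

	static int demo_assign_timers(struct demo_queue **queues, u32 nr)
	{
		struct demo_timer **timers __free(kvfree) = NULL;

		timers = kvcalloc(nr, sizeof(*timers), GFP_KERNEL);
		if (!timers)
			return -ENOMEM;

		for (u32 i = 0; i < nr; i++) {
			timers[i] = kzalloc_node(sizeof(*timers[i]), GFP_KERNEL,
						 cpu_to_mem(i));
			if (!timers[i]) {
				while (i--)
					kfree(timers[i]);

				return -ENOMEM;
			}
		}

		/* ownership of each timer moves to its queue; only the
		 * temporary pointer array is kvfree()d on return
		 */
		for (u32 i = 0; i < nr; i++)
			queues[i]->timer = timers[i];

		return 0;
	}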

> 
>>
>> The actual timers are allocated separately and NUMA-locally below.

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp)
  2025-04-08 13:22     ` Alexander Lobakin
@ 2025-04-08 13:51       ` Alexander Lobakin
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Lobakin @ 2025-04-08 13:51 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Tue, 8 Apr 2025 15:22:48 +0200

> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Date: Tue, 11 Mar 2025 15:05:38 +0100
> 
>> On Wed, Mar 05, 2025 at 05:21:19PM +0100, Alexander Lobakin wrote:
>>> "Couple" is a bit humbly... Add the following functionality to libeth:

[...]

>>> +/**
>>> + * libeth_xdp_xmit_do_bulk - implement full .ndo_xdp_xmit() in driver
>>> + * @dev: target &net_device
>>> + * @n: number of frames to send
>>> + * @fr: XDP frames to send
>>> + * @f: flags passed by the stack
>>> + * @xqs: array of XDPSQs driver structs
>>> + * @nqs: number of active XDPSQs, the above array length
>>> + * @fl: driver callback to flush an XDP xmit bulk
>>> + * @fin: driver callback to finalize the queue
>>> + *
>>> + * If the driver has active XDPSQs, perform common checks and send the frames.
>>> + * Finalize the queue, if requested.
>>> + *
>>> + * Return: number of frames sent or -errno on error.
>>> + */
>>> +#define libeth_xdp_xmit_do_bulk(dev, n, fr, f, xqs, nqs, fl, fin)	      \
>>> +	_libeth_xdp_xmit_do_bulk(dev, n, fr, f, xqs, nqs, fl, fin,	      \
>>> +				 __UNIQUE_ID(bq_), __UNIQUE_ID(ret_),	      \
>>> +				 __UNIQUE_ID(nqs_))
>>
>> why __UNIQUE_ID() is needed?
> 
> As above, variable shadowing.
> 
>>
>>> +
>>> +#define _libeth_xdp_xmit_do_bulk(d, n, fr, f, xqs, nqs, fl, fin, ub, ur, un)  \
>>
>> why single underscore? usually we do __ for internal funcs as you did
>> somewhere above.
> 
> Double-underscored is defined above already :D
> So it would be either like this or __ + ___
> 
>>
>> also, why define and not inlined func?
> 
> I'll double check, but if you look at its usage in idpf/xdp.c, you'll
> see that some arguments are non-trivial to obtain, IOW they cost some
> cycles. Macro ensures they won't be fetched prior to
> `likely(number_of_xdpsqs)`.
> I'll convert to an inline and check if the compiler handles this itself.
> It didn't behave in {,__}libeth_xdp_tx_fill_stats() unfortunately, hence
> macro there as well =\

UPD: it can't be an inline func since it's meant to be called like this
from the driver:

	return libeth_xdp_xmit_do_bulk(dev, n, frames, flags,
				       &vport->txqs[vport->xdp_txq_offset],
				       vport->num_xdp_txq,
				       idpf_xdp_xmit_flush_bulk,
				       idpf_xdp_tx_finalize);

The type of `&vport->txqs[vport->xdp_txq_offset]` is undefined from
libeth's perspective. libeth_xdp_xmit_init_bulk(), embedded into it, picks
the appropriate queue right away in the driver, and it's a macro itself.

> 
>>
>>> +({									      \
>>> +	u32 un = (nqs);							      \
>>> +	int ur;								      \
>>> +									      \
>>> +	if (likely(un)) {						      \
>>> +		struct libeth_xdp_tx_bulk ub;				      \
>>> +									      \
>>> +		libeth_xdp_xmit_init_bulk(&ub, d, xqs, un);		      \
>>> +		ur = __libeth_xdp_xmit_do_bulk(&ub, fr, n, f, fl, fin);	      \
>>> +	} else {							      \
>>> +		ur = -ENXIO;						      \
>>> +	}								      \
>>> +									      \
>>> +	ur;								      \
>>> +})

Thanks,
Olek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH net-next 14/16] idpf: add support for XDP on Rx
  2025-04-08 13:28     ` Alexander Lobakin
@ 2025-04-08 15:53       ` Maciej Fijalkowski
  0 siblings, 0 replies; 59+ messages in thread
From: Maciej Fijalkowski @ 2025-04-08 15:53 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: intel-wired-lan, Michal Kubiak, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Simon Horman, bpf, netdev,
	linux-kernel

On Tue, Apr 08, 2025 at 03:28:21PM +0200, Alexander Lobakin wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Date: Tue, 11 Mar 2025 16:50:07 +0100
> 
> > On Wed, Mar 05, 2025 at 05:21:30PM +0100, Alexander Lobakin wrote:
> >> Use libeth XDP infra to support running XDP program on Rx polling.
> >> This includes all of the possible verdicts/actions.
> >> XDP Tx queues are cleaned only in "lazy" mode when there are less than
> >> 1/4 free descriptors left on the ring. libeth helper macros to define
> >> driver-specific XDP functions make sure the compiler could uninline
> >> them when needed.
> 
> [...]
> 
> >> +/**
> >> + * idpf_clean_xdp_irq - Reclaim a batch of TX resources from completed XDP_TX
> >> + * @_xdpq: XDP Tx queue
> >> + * @budget: maximum number of descriptors to clean
> >> + *
> >> + * Returns number of cleaned descriptors.
> >> + */
> >> +static u32 idpf_clean_xdp_irq(void *_xdpq, u32 budget)
> >> +{
> >> +	struct libeth_xdpsq_napi_stats ss = { };
> >> +	struct idpf_tx_queue *xdpq = _xdpq;
> >> +	u32 tx_ntc = xdpq->next_to_clean;
> >> +	u32 tx_cnt = xdpq->desc_count;
> >> +	struct xdp_frame_bulk bq;
> >> +	struct libeth_cq_pp cp = {
> >> +		.dev	= xdpq->dev,
> >> +		.bq	= &bq,
> >> +		.xss	= &ss,
> >> +		.napi	= true,
> >> +	};
> >> +	u32 done_frames;
> >> +
> >> +	done_frames = idpf_xdpsq_poll(xdpq, budget);
> > 
> > nit: maybe pass {tx_ntc, tx_cnt} to the above?
> 
> Not following... =\

you deref ::next_to_clean and ::desc_count again in idpf_xdpsq_poll(), and
you already have them deref'd here in local vars, so I was just suggesting
to maybe pass them as args. Not a big deal though.

> 
> > 
> >> +	if (unlikely(!done_frames))
> >> +		return 0;
> >> +
> >> +	xdp_frame_bulk_init(&bq);
> >> +
> >> +	for (u32 i = 0; likely(i < done_frames); i++) {
> >> +		libeth_xdp_complete_tx(&xdpq->tx_buf[tx_ntc], &cp);
> >> +
> >> +		if (unlikely(++tx_ntc == tx_cnt))
> >> +			tx_ntc = 0;
> >> +	}
> >> +
> >> +	xdp_flush_frame_bulk(&bq);
> >> +
> >> +	xdpq->next_to_clean = tx_ntc;
> >> +	xdpq->pending -= done_frames;
> >> +	xdpq->xdp_tx -= cp.xdp_tx;
> > 
> > not following this variable. __libeth_xdp_complete_tx() decresases
> > libeth_cq_pp::xdp_tx by libeth_sqe::nr_frags. can you shed more light
> > what's going on here?
> 
> libeth_sqe::nr_frags is not the same as skb_shared_info::nr_frags, it
> equals to 1 when there's only 1 fragment.
> Basically, xdp_tx field is the number of pending XDP-non-XSk
> descriptors. When it's zero, we don't traverse Tx descriptors at all
> on XSk completion (thx to splitq).
> 
> > 
> >> +
> >> +	return done_frames;
> >> +}
> >> +

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2025-04-08 15:53 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-05 16:21 [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 01/16] libeth: convert to netmem Alexander Lobakin
2025-03-06  0:13   ` Mina Almasry
2025-03-11 17:22     ` Alexander Lobakin
2025-03-11 17:43       ` Mina Almasry
2025-03-05 16:21 ` [PATCH net-next 02/16] libeth: support native XDP and register memory model Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 03/16] libeth: add a couple of XDP helpers (libeth_xdp) Alexander Lobakin
2025-03-11 14:05   ` Maciej Fijalkowski
2025-03-17 15:26     ` Alexander Lobakin
2025-03-19 16:19       ` Maciej Fijalkowski
2025-04-01 13:11         ` Alexander Lobakin
2025-04-08 13:38           ` Alexander Lobakin
2025-04-08 13:22     ` Alexander Lobakin
2025-04-08 13:51       ` Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 04/16] libeth: add XSk helpers Alexander Lobakin
2025-03-07 10:15   ` Maciej Fijalkowski
2025-03-12 17:03     ` Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 05/16] idpf: fix Rx descriptor ready check barrier in splitq Alexander Lobakin
2025-03-07 10:17   ` Maciej Fijalkowski
2025-03-12 17:10     ` Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 06/16] idpf: a use saner limit for default number of queues to allocate Alexander Lobakin
2025-03-07 10:32   ` Maciej Fijalkowski
2025-03-12 17:22     ` Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 07/16] idpf: link NAPIs to queues Alexander Lobakin
2025-03-07 10:28   ` Eric Dumazet
2025-03-12 17:16     ` Alexander Lobakin
2025-03-18 17:10       ` Alexander Lobakin
2025-03-07 10:51   ` Maciej Fijalkowski
2025-03-12 17:25     ` Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 08/16] idpf: make complq cleaning dependent on scheduling mode Alexander Lobakin
2025-03-07 11:11   ` Maciej Fijalkowski
2025-03-13 16:16     ` Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 09/16] idpf: remove SW marker handling from NAPI Alexander Lobakin
2025-03-07 11:42   ` Maciej Fijalkowski
2025-03-13 16:50     ` Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 10/16] idpf: add support for nointerrupt queues Alexander Lobakin
2025-03-07 12:10   ` Maciej Fijalkowski
2025-03-13 16:19     ` Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 11/16] idpf: prepare structures to support XDP Alexander Lobakin
2025-03-07  1:12   ` Jakub Kicinski
2025-03-12 14:00     ` [Intel-wired-lan] " Alexander Lobakin
2025-03-07 13:27   ` Maciej Fijalkowski
2025-03-17 14:50     ` Alexander Lobakin
2025-03-19 16:29       ` Maciej Fijalkowski
2025-04-08 13:42         ` Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 12/16] idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq Alexander Lobakin
2025-03-07 14:16   ` Maciej Fijalkowski
2025-03-17 14:58     ` Alexander Lobakin
2025-03-19 16:23       ` Maciej Fijalkowski
2025-03-05 16:21 ` [PATCH net-next 13/16] idpf: use generic functions to build xdp_buff and skb Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 14/16] idpf: add support for XDP on Rx Alexander Lobakin
2025-03-11 15:50   ` Maciej Fijalkowski
2025-04-08 13:28     ` Alexander Lobakin
2025-04-08 15:53       ` Maciej Fijalkowski
2025-03-05 16:21 ` [PATCH net-next 15/16] idpf: add support for .ndo_xdp_xmit() Alexander Lobakin
2025-03-11 16:08   ` Maciej Fijalkowski
2025-04-08 13:31     ` Alexander Lobakin
2025-03-05 16:21 ` [PATCH net-next 16/16] idpf: add XDP RSS hash hint Alexander Lobakin
2025-03-11 15:28 ` [PATCH net-next 00/16] idpf: add XDP support Alexander Lobakin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).