* [PATCH net-next 0/4] xsk: the lost bits from Chapter III
@ 2025-02-06 18:26 Alexander Lobakin
  2025-02-06 18:26 ` [PATCH net-next 1/4] unroll: add generic loop unroll helpers Alexander Lobakin
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Alexander Lobakin @ 2025-02-06 18:26 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

Before introducing libeth_xdp, we need to add a couple more generic
helpers. Notably:

* 01: add generic loop unrolling hint helpers;
* 04: add helper to get both xdp_desc's DMA address and metadata
  pointer in one go, saving several cycles and hotpath object
  code size in drivers (especially when unrolling).

Bonus:

* 02, 03: convert two drivers which were using custom macros to
  generic unrolled_count() (trivial, no object code changes).

Alexander Lobakin (4):
  unroll: add generic loop unroll helpers
  i40e: use generic unrolled_count() macro
  ice: use generic unrolled_count() macro
  xsk: add helper to get &xdp_desc's DMA and meta pointer in one go

 drivers/net/ethernet/intel/i40e/i40e_xsk.h | 10 +----
 drivers/net/ethernet/intel/ice/ice_xsk.h   |  8 ----
 include/linux/unroll.h                     | 44 +++++++++++++++++++++
 include/net/xdp_sock_drv.h                 | 43 ++++++++++++++++++--
 include/net/xsk_buff_pool.h                |  8 ++++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c |  4 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c   |  4 +-
 net/xdp/xsk_buff_pool.c                    | 46 ++++++++++++++++++++--
 8 files changed, 141 insertions(+), 26 deletions(-)

---
Note: 04 had reviews already; in this series, I reused the existing
helpers instead of copying them and eliminated the compound
assignment in favor of a field-by-field one, which generates
the same Asm code (requested by Jakub).
-- 
2.48.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH net-next 1/4] unroll: add generic loop unroll helpers
  2025-02-06 18:26 [PATCH net-next 0/4] xsk: the lost bits from Chapter III Alexander Lobakin
@ 2025-02-06 18:26 ` Alexander Lobakin
  2025-02-09 11:07   ` Simon Horman
  2025-02-06 18:26 ` [PATCH net-next 2/4] i40e: use generic unrolled_count() macro Alexander Lobakin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Alexander Lobakin @ 2025-02-06 18:26 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

There are cases when we need to explicitly unroll loops. For example,
cache operations, filling DMA descriptors on very high speeds etc.
Add compiler-specific attribute macros to give the compiler a hint
that we'd like to unroll a loop.
Example usage:

 #define UNROLL_BATCH 8

	unrolled_count(UNROLL_BATCH)
	for (u32 i = 0; i < UNROLL_BATCH; i++)
		op(priv, i);

Note that sometimes compilers won't unroll loops if they think this
would result in worse optimization and performance than without
unrolling, and that the unroll attributes are available only starting
with GCC 8. For older compiler versions, no hints/attributes will be
applied.
For better unrolling/parallelization, avoid any variables that carry
dependencies between iterations other than the iterator itself.

Co-developed-by: Jose E. Marchesi <jose.marchesi@oracle.com> # pragmas
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/linux/unroll.h | 44 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/include/linux/unroll.h b/include/linux/unroll.h
index d42fd6366373..863fb69f6a7e 100644
--- a/include/linux/unroll.h
+++ b/include/linux/unroll.h
@@ -9,6 +9,50 @@
 
 #include <linux/args.h>
 
+#ifdef CONFIG_CC_IS_CLANG
+#define __pick_unrolled(x, y)		_Pragma(#x)
+#elif CONFIG_GCC_VERSION >= 80000
+#define __pick_unrolled(x, y)		_Pragma(#y)
+#else
+#define __pick_unrolled(x, y)		/* not supported */
+#endif
+
+/**
+ * unrolled - loop attributes to ask the compiler to unroll it
+ *
+ * Usage:
+ *
+ * #define BATCH 8
+ *
+ *	unrolled_count(BATCH)
+ *	for (u32 i = 0; i < BATCH; i++)
+ *		// loop body without cross-iteration dependencies
+ *
+ * This is only a hint and the compiler is free to disable unrolling if it
+ * thinks the count is suboptimal and may hurt performance and/or hugely
+ * increase object code size.
+ * Not having any cross-iteration dependencies (i.e. when iter x + 1 depends
+ * on what iter x will do with variables) is not a strict requirement, but
+ * provides best performance and object code size.
+ * Available only on Clang and GCC 8.x onwards.
+ */
+
+/* Ask the compiler to pick an optimal unroll count, Clang only */
+#define unrolled							\
+	__pick_unrolled(clang loop unroll(enable), /* nothing */)
+
+/* Unroll each @n iterations of the loop */
+#define unrolled_count(n)						\
+	__pick_unrolled(clang loop unroll_count(n), GCC unroll n)
+
+/* Unroll the whole loop */
+#define unrolled_full							\
+	__pick_unrolled(clang loop unroll(full), GCC unroll 65534)
+
+/* Never unroll the loop */
+#define unrolled_none							\
+	__pick_unrolled(clang loop unroll(disable), GCC unroll 1)
+
 #define UNROLL(N, MACRO, args...) CONCATENATE(__UNROLL_, N)(MACRO, args)
 
 #define __UNROLL_0(MACRO, args...)
-- 
2.48.1



* [PATCH net-next 2/4] i40e: use generic unrolled_count() macro
  2025-02-06 18:26 [PATCH net-next 0/4] xsk: the lost bits from Chapter III Alexander Lobakin
  2025-02-06 18:26 ` [PATCH net-next 1/4] unroll: add generic loop unroll helpers Alexander Lobakin
@ 2025-02-06 18:26 ` Alexander Lobakin
  2025-02-10 12:30   ` Maciej Fijalkowski
  2025-02-10 22:28   ` David Laight
  2025-02-06 18:26 ` [PATCH net-next 3/4] ice: " Alexander Lobakin
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Alexander Lobakin @ 2025-02-06 18:26 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

i40e, as well as ice, has a custom loop unrolling macro for unrolling
Tx descriptor filling on XSk xmit.
Replace the i40e defs with the generic unrolled_count(), which is also
more convenient as it allows passing defines as its argument, not only
hardcoded values, while the loop declaration will still be a usual
for-loop.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.h | 10 +---------
 drivers/net/ethernet/intel/i40e/i40e_xsk.c |  4 +++-
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
index ef156fad52f2..dd16351a7af8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
@@ -6,7 +6,7 @@
 
 #include <linux/types.h>
 
-/* This value should match the pragma in the loop_unrolled_for
+/* This value should match the pragma in the unrolled_count()
  * macro. Why 4? It is strictly empirical. It seems to be a good
  * compromise between the advantage of having simultaneous outstanding
  * reads to the DMA array that can hide each others latency and the
@@ -14,14 +14,6 @@
  */
 #define PKTS_PER_BATCH 4
 
-#ifdef __clang__
-#define loop_unrolled_for _Pragma("clang loop unroll_count(4)") for
-#elif __GNUC__ >= 8
-#define loop_unrolled_for _Pragma("GCC unroll 4") for
-#else
-#define loop_unrolled_for for
-#endif
-
 struct i40e_ring;
 struct i40e_vsi;
 struct net_device;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index e28f1905a4a0..9f47388eaba5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -2,6 +2,7 @@
 /* Copyright(c) 2018 Intel Corporation. */
 
 #include <linux/bpf_trace.h>
+#include <linux/unroll.h>
 #include <net/xdp_sock_drv.h>
 #include "i40e_txrx_common.h"
 #include "i40e_xsk.h"
@@ -529,7 +530,8 @@ static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *des
 	dma_addr_t dma;
 	u32 i;
 
-	loop_unrolled_for(i = 0; i < PKTS_PER_BATCH; i++) {
+	unrolled_count(PKTS_PER_BATCH)
+	for (i = 0; i < PKTS_PER_BATCH; i++) {
 		u32 cmd = I40E_TX_DESC_CMD_ICRC | xsk_is_eop_desc(&desc[i]);
 
 		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
-- 
2.48.1



* [PATCH net-next 3/4] ice: use generic unrolled_count() macro
  2025-02-06 18:26 [PATCH net-next 0/4] xsk: the lost bits from Chapter III Alexander Lobakin
  2025-02-06 18:26 ` [PATCH net-next 1/4] unroll: add generic loop unroll helpers Alexander Lobakin
  2025-02-06 18:26 ` [PATCH net-next 2/4] i40e: use generic unrolled_count() macro Alexander Lobakin
@ 2025-02-06 18:26 ` Alexander Lobakin
  2025-02-10 12:31   ` Maciej Fijalkowski
  2025-02-06 18:26 ` [PATCH net-next 4/4] xsk: add helper to get &xdp_desc's DMA and meta pointer in one go Alexander Lobakin
  2025-02-11  2:00 ` [PATCH net-next 0/4] xsk: the lost bits from Chapter III patchwork-bot+netdevbpf
  4 siblings, 1 reply; 17+ messages in thread
From: Alexander Lobakin @ 2025-02-06 18:26 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

ice, same as i40e, has a custom loop unrolling macro for unrolling
Tx descriptor filling on XSk xmit.
Replace the ice defs with the generic unrolled_count(), which is also
more convenient as it allows passing defines as its argument, not only
hardcoded values, while the loop declaration will still be a usual
for-loop.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_xsk.h | 8 --------
 drivers/net/ethernet/intel/ice/ice_xsk.c | 4 +++-
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.h b/drivers/net/ethernet/intel/ice/ice_xsk.h
index 45adeb513253..8dc5d55e26c5 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.h
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.h
@@ -7,14 +7,6 @@
 
 #define PKTS_PER_BATCH 8
 
-#ifdef __clang__
-#define loop_unrolled_for _Pragma("clang loop unroll_count(8)") for
-#elif __GNUC__ >= 8
-#define loop_unrolled_for _Pragma("GCC unroll 8") for
-#else
-#define loop_unrolled_for for
-#endif
-
 struct ice_vsi;
 
 #ifdef CONFIG_XDP_SOCKETS
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 8975d2971bc3..a3a4eaa17739 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2019, Intel Corporation. */
 
 #include <linux/bpf_trace.h>
+#include <linux/unroll.h>
 #include <net/xdp_sock_drv.h>
 #include <net/xdp.h>
 #include "ice.h"
@@ -989,7 +990,8 @@ static void ice_xmit_pkt_batch(struct ice_tx_ring *xdp_ring,
 	struct ice_tx_desc *tx_desc;
 	u32 i;
 
-	loop_unrolled_for(i = 0; i < PKTS_PER_BATCH; i++) {
+	unrolled_count(PKTS_PER_BATCH)
+	for (i = 0; i < PKTS_PER_BATCH; i++) {
 		dma_addr_t dma;
 
 		dma = xsk_buff_raw_get_dma(xsk_pool, descs[i].addr);
-- 
2.48.1



* [PATCH net-next 4/4] xsk: add helper to get &xdp_desc's DMA and meta pointer in one go
  2025-02-06 18:26 [PATCH net-next 0/4] xsk: the lost bits from Chapter III Alexander Lobakin
                   ` (2 preceding siblings ...)
  2025-02-06 18:26 ` [PATCH net-next 3/4] ice: " Alexander Lobakin
@ 2025-02-06 18:26 ` Alexander Lobakin
  2025-02-09 11:03   ` Simon Horman
  2025-02-11  2:00 ` [PATCH net-next 0/4] xsk: the lost bits from Chapter III patchwork-bot+netdevbpf
  4 siblings, 1 reply; 17+ messages in thread
From: Alexander Lobakin @ 2025-02-06 18:26 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

Currently, when your driver supports XSk Tx metadata and you want to
send an XSk frame, you need to do the following:

* call external xsk_buff_raw_get_dma();
* call inline xsk_buff_get_metadata(), which calls external
  xsk_buff_raw_get_data() and then does some inline checks.

This effectively means that the following piece:

addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;

is done twice per frame, plus you have 2 external calls per frame, plus
this:

	meta = pool->addrs + addr - pool->tx_metadata_len;
	if (unlikely(!xsk_buff_valid_tx_metadata(meta)))

is always inlined, even if there's no meta or it's invalid.

Add xsk_buff_raw_get_ctx() (xp_raw_get_ctx() to be precise) to do that
in one go. It returns a small structure with 2 fields: DMA address,
filled unconditionally, and metadata pointer, non-NULL only if it's
present and valid. The address correction is performed only once and
you also have only 1 external call per XSk frame, which does all the
calculations and checks outside of your hotpath. You only need to
check `if (ctx.meta)` for the metadata presence.
To avoid copying any existing code, split out the address correction and
the calculation of the virtual and DMA addresses into small helpers.
bloat-o-meter reports no object code changes for the existing
functionality.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/xdp_sock_drv.h  | 43 +++++++++++++++++++++++++++++++---
 include/net/xsk_buff_pool.h |  8 +++++++
 net/xdp/xsk_buff_pool.c     | 46 +++++++++++++++++++++++++++++++++----
 3 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 784cd34f5bba..15086dcf51d8 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -196,6 +196,23 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
 	return xp_raw_get_data(pool, addr);
 }
 
+/**
+ * xsk_buff_raw_get_ctx - get &xdp_desc context
+ * @pool: XSk buff pool desc address belongs to
+ * @addr: desc address (from userspace)
+ *
+ * Wrapper for xp_raw_get_ctx() to be used in drivers, see its kdoc for
+ * details.
+ *
+ * Return: new &xdp_desc_ctx struct containing desc's DMA address and metadata
+ * pointer, if it is present and valid (initialized to %NULL otherwise).
+ */
+static inline struct xdp_desc_ctx
+xsk_buff_raw_get_ctx(const struct xsk_buff_pool *pool, u64 addr)
+{
+	return xp_raw_get_ctx(pool, addr);
+}
+
 #define XDP_TXMD_FLAGS_VALID ( \
 		XDP_TXMD_FLAGS_TIMESTAMP | \
 		XDP_TXMD_FLAGS_CHECKSUM | \
@@ -207,20 +224,27 @@ xsk_buff_valid_tx_metadata(const struct xsk_tx_metadata *meta)
 	return !(meta->flags & ~XDP_TXMD_FLAGS_VALID);
 }
 
-static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+static inline struct xsk_tx_metadata *
+__xsk_buff_get_metadata(const struct xsk_buff_pool *pool, void *data)
 {
 	struct xsk_tx_metadata *meta;
 
 	if (!pool->tx_metadata_len)
 		return NULL;
 
-	meta = xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
+	meta = data - pool->tx_metadata_len;
 	if (unlikely(!xsk_buff_valid_tx_metadata(meta)))
 		return NULL; /* no way to signal the error to the user */
 
 	return meta;
 }
 
+static inline struct xsk_tx_metadata *
+xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+	return __xsk_buff_get_metadata(pool, xp_raw_get_data(pool, addr));
+}
+
 static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp)
 {
 	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
@@ -388,12 +412,25 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
 	return NULL;
 }
 
+static inline struct xdp_desc_ctx
+xsk_buff_raw_get_ctx(const struct xsk_buff_pool *pool, u64 addr)
+{
+	return (struct xdp_desc_ctx){ };
+}
+
 static inline bool xsk_buff_valid_tx_metadata(struct xsk_tx_metadata *meta)
 {
 	return false;
 }
 
-static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+static inline struct xsk_tx_metadata *
+__xsk_buff_get_metadata(const struct xsk_buff_pool *pool, void *data)
+{
+	return NULL;
+}
+
+static inline struct xsk_tx_metadata *
+xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
 {
 	return NULL;
 }
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 50779406bc2d..1dcd4d71468a 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -141,6 +141,14 @@ u32 xp_alloc_batch(struct xsk_buff_pool *pool, struct xdp_buff **xdp, u32 max);
 bool xp_can_alloc(struct xsk_buff_pool *pool, u32 count);
 void *xp_raw_get_data(struct xsk_buff_pool *pool, u64 addr);
 dma_addr_t xp_raw_get_dma(struct xsk_buff_pool *pool, u64 addr);
+
+struct xdp_desc_ctx {
+	dma_addr_t dma;
+	struct xsk_tx_metadata *meta;
+};
+
+struct xdp_desc_ctx xp_raw_get_ctx(const struct xsk_buff_pool *pool, u64 addr);
+
 static inline dma_addr_t xp_get_dma(struct xdp_buff_xsk *xskb)
 {
 	return xskb->dma;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 1f7975b49657..c263fb7a68dc 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -699,18 +699,56 @@ void xp_free(struct xdp_buff_xsk *xskb)
 }
 EXPORT_SYMBOL(xp_free);
 
-void *xp_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
+static u64 __xp_raw_get_addr(const struct xsk_buff_pool *pool, u64 addr)
+{
+	return pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
+}
+
+static void *__xp_raw_get_data(const struct xsk_buff_pool *pool, u64 addr)
 {
-	addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
 	return pool->addrs + addr;
 }
+
+void *xp_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
+{
+	return __xp_raw_get_data(pool, __xp_raw_get_addr(pool, addr));
+}
 EXPORT_SYMBOL(xp_raw_get_data);
 
-dma_addr_t xp_raw_get_dma(struct xsk_buff_pool *pool, u64 addr)
+static dma_addr_t __xp_raw_get_dma(const struct xsk_buff_pool *pool, u64 addr)
 {
-	addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
 	return (pool->dma_pages[addr >> PAGE_SHIFT] &
 		~XSK_NEXT_PG_CONTIG_MASK) +
 		(addr & ~PAGE_MASK);
 }
+
+dma_addr_t xp_raw_get_dma(struct xsk_buff_pool *pool, u64 addr)
+{
+	return __xp_raw_get_dma(pool, __xp_raw_get_addr(pool, addr));
+}
 EXPORT_SYMBOL(xp_raw_get_dma);
+
+/**
+ * xp_raw_get_ctx - get &xdp_desc context
+ * @pool: XSk buff pool desc address belongs to
+ * @addr: desc address (from userspace)
+ *
+ * Helper for getting desc's DMA address and metadata pointer, if present.
+ * Saves one call on hotpath, double calculation of the actual address,
+ * and inline checks for metadata presence and sanity.
+ *
+ * Return: new &xdp_desc_ctx struct containing desc's DMA address and metadata
+ * pointer, if it is present and valid (initialized to %NULL otherwise).
+ */
+struct xdp_desc_ctx xp_raw_get_ctx(const struct xsk_buff_pool *pool, u64 addr)
+{
+	struct xdp_desc_ctx ret;
+
+	addr = __xp_raw_get_addr(pool, addr);
+
+	ret.dma = __xp_raw_get_dma(pool, addr);
+	ret.meta = __xsk_buff_get_metadata(pool, __xp_raw_get_data(pool, addr));
+
+	return ret;
+}
+EXPORT_SYMBOL(xp_raw_get_ctx);
-- 
2.48.1



* Re: [PATCH net-next 4/4] xsk: add helper to get &xdp_desc's DMA and meta pointer in one go
  2025-02-06 18:26 ` [PATCH net-next 4/4] xsk: add helper to get &xdp_desc's DMA and meta pointer in one go Alexander Lobakin
@ 2025-02-09 11:03   ` Simon Horman
  2025-02-10 16:00     ` Alexander Lobakin
  0 siblings, 1 reply; 17+ messages in thread
From: Simon Horman @ 2025-02-09 11:03 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

On Thu, Feb 06, 2025 at 07:26:29PM +0100, Alexander Lobakin wrote:
> Currently, when your driver supports XSk Tx metadata and you want to
> send an XSk frame, you need to do the following:
> 
> * call external xsk_buff_raw_get_dma();
> * call inline xsk_buff_get_metadata(), which calls external
>   xsk_buff_raw_get_data() and then do some inline checks.
> 
> This effectively means that the following piece:
> 
> addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
> 
> is done twice per frame, plus you have 2 external calls per frame, plus
> this:
> 
> 	meta = pool->addrs + addr - pool->tx_metadata_len;
> 	if (unlikely(!xsk_buff_valid_tx_metadata(meta)))
> 
> is always inlined, even if there's no meta or it's invalid.
> 
> Add xsk_buff_raw_get_ctx() (xp_raw_get_ctx() to be precise) to do that
> in one go. It returns a small structure with 2 fields: DMA address,
> filled unconditionally, and metadata pointer, non-NULL only if it's
> present and valid. The address correction is performed only once and
> you also have only 1 external call per XSk frame, which does all the
> calculations and checks outside of your hotpath. You only need to
> check `if (ctx.meta)` for the metadata presence.
> To not copy any existing code, derive address correction and getting
> virtual and DMA address into small helpers. bloat-o-meter reports no
> object code changes for the existing functionality.
> 
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Hi Alexander,

I think that this patch needs to be accompanied by at least one
patch that uses xsk_buff_raw_get_ctx() in a driver.

Also, as this seems to be an optimisation, some performance data would
be nice too.

Which brings me to my last point. I'd always understood that
returning a struct was discouraged due to performance implications.
Perhaps that information is out of date, doesn't apply because
the returned struct is so small in this case, or just plain wrong.
But I'd appreciate it if you could add some colour to this.


* Re: [PATCH net-next 1/4] unroll: add generic loop unroll helpers
  2025-02-06 18:26 ` [PATCH net-next 1/4] unroll: add generic loop unroll helpers Alexander Lobakin
@ 2025-02-09 11:07   ` Simon Horman
  2025-02-10 12:28     ` Maciej Fijalkowski
  2025-02-10 15:49     ` Alexander Lobakin
  0 siblings, 2 replies; 17+ messages in thread
From: Simon Horman @ 2025-02-09 11:07 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

On Thu, Feb 06, 2025 at 07:26:26PM +0100, Alexander Lobakin wrote:
> There are cases when we need to explicitly unroll loops. For example,
> cache operations, filling DMA descriptors on very high speeds etc.
> Add compiler-specific attribute macros to give the compiler a hint
> that we'd like to unroll a loop.
> Example usage:
> 
>  #define UNROLL_BATCH 8
> 
> 	unrolled_count(UNROLL_BATCH)
> 	for (u32 i = 0; i < UNROLL_BATCH; i++)
> 		op(priv, i);
> 
> Note that sometimes the compilers won't unroll loops if they think this
> would have worse optimization and perf than without unrolling, and that
> unroll attributes are available only starting GCC 8. For older compiler
> versions, no hints/attributes will be applied.
> For better unrolling/parallelization, don't have any variables that
> interfere between iterations except for the iterator itself.
> 
> Co-developed-by: Jose E. Marchesi <jose.marchesi@oracle.com> # pragmas
> Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Hi Alexander,

This patch adds four variants of the unrolled helper.  But as far as I can
tell the patch-set only makes use of one of them, unrolled_count().

I think it would be best if this patch only added helpers that are used.

...


* Re: [PATCH net-next 1/4] unroll: add generic loop unroll helpers
  2025-02-09 11:07   ` Simon Horman
@ 2025-02-10 12:28     ` Maciej Fijalkowski
  2025-02-10 15:49     ` Alexander Lobakin
  1 sibling, 0 replies; 17+ messages in thread
From: Maciej Fijalkowski @ 2025-02-10 12:28 UTC (permalink / raw)
  To: Simon Horman
  Cc: Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Przemek Kitszel, Jason Baron, Casey Schaufler, Nathan Chancellor,
	bpf, netdev, linux-kernel

On Sun, Feb 09, 2025 at 11:07:25AM +0000, Simon Horman wrote:
> On Thu, Feb 06, 2025 at 07:26:26PM +0100, Alexander Lobakin wrote:
> > There are cases when we need to explicitly unroll loops. For example,
> > cache operations, filling DMA descriptors on very high speeds etc.
> > Add compiler-specific attribute macros to give the compiler a hint
> > that we'd like to unroll a loop.
> > Example usage:
> > 
> >  #define UNROLL_BATCH 8
> > 
> > 	unrolled_count(UNROLL_BATCH)
> > 	for (u32 i = 0; i < UNROLL_BATCH; i++)
> > 		op(priv, i);
> > 
> > Note that sometimes the compilers won't unroll loops if they think this
> > would have worse optimization and perf than without unrolling, and that
> > unroll attributes are available only starting GCC 8. For older compiler
> > versions, no hints/attributes will be applied.
> > For better unrolling/parallelization, don't have any variables that
> > interfere between iterations except for the iterator itself.
> > 
> > Co-developed-by: Jose E. Marchesi <jose.marchesi@oracle.com> # pragmas
> > Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
> > Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> > Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
> Hi Alexander,
> 
> This patch adds four variants of the unrolled helper.  But as far as I can
> tell the patch-set only makes use of one of them, unrolled_count().
> 
> I think it would be best if this patch only added helpers that are used.

That is debatable but I think I tend to agree here. If we add say 3 unused
macros then nothing stops someone from coming up with a patch that deletes
them because they are unused, right?

> 
> ...


* Re: [PATCH net-next 2/4] i40e: use generic unrolled_count() macro
  2025-02-06 18:26 ` [PATCH net-next 2/4] i40e: use generic unrolled_count() macro Alexander Lobakin
@ 2025-02-10 12:30   ` Maciej Fijalkowski
  2025-02-10 22:28   ` David Laight
  1 sibling, 0 replies; 17+ messages in thread
From: Maciej Fijalkowski @ 2025-02-10 12:30 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Przemek Kitszel, Jason Baron, Casey Schaufler, Nathan Chancellor,
	bpf, netdev, linux-kernel

On Thu, Feb 06, 2025 at 07:26:27PM +0100, Alexander Lobakin wrote:
> i40e, as well as ice, has a custom loop unrolling macro for unrolling
> Tx descriptors filling on XSk xmit.
> Replace i40e defs with generic unrolled_count(), which is also more
> convenient as it allows passing defines as its argument, not hardcoded
> values, while the loop declaration will still be a usual for-loop.
> 
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

> ---
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h | 10 +---------
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c |  4 +++-
>  2 files changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> index ef156fad52f2..dd16351a7af8 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
> @@ -6,7 +6,7 @@
>  
>  #include <linux/types.h>
>  
> -/* This value should match the pragma in the loop_unrolled_for
> +/* This value should match the pragma in the unrolled_count()
>   * macro. Why 4? It is strictly empirical. It seems to be a good
>   * compromise between the advantage of having simultaneous outstanding
>   * reads to the DMA array that can hide each others latency and the
> @@ -14,14 +14,6 @@
>   */
>  #define PKTS_PER_BATCH 4
>  
> -#ifdef __clang__
> -#define loop_unrolled_for _Pragma("clang loop unroll_count(4)") for
> -#elif __GNUC__ >= 8
> -#define loop_unrolled_for _Pragma("GCC unroll 4") for
> -#else
> -#define loop_unrolled_for for
> -#endif
> -
>  struct i40e_ring;
>  struct i40e_vsi;
>  struct net_device;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index e28f1905a4a0..9f47388eaba5 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -2,6 +2,7 @@
>  /* Copyright(c) 2018 Intel Corporation. */
>  
>  #include <linux/bpf_trace.h>
> +#include <linux/unroll.h>
>  #include <net/xdp_sock_drv.h>
>  #include "i40e_txrx_common.h"
>  #include "i40e_xsk.h"
> @@ -529,7 +530,8 @@ static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *des
>  	dma_addr_t dma;
>  	u32 i;
>  
> -	loop_unrolled_for(i = 0; i < PKTS_PER_BATCH; i++) {
> +	unrolled_count(PKTS_PER_BATCH)
> +	for (i = 0; i < PKTS_PER_BATCH; i++) {
>  		u32 cmd = I40E_TX_DESC_CMD_ICRC | xsk_is_eop_desc(&desc[i]);
>  
>  		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
> -- 
> 2.48.1
> 


* Re: [PATCH net-next 3/4] ice: use generic unrolled_count() macro
  2025-02-06 18:26 ` [PATCH net-next 3/4] ice: " Alexander Lobakin
@ 2025-02-10 12:31   ` Maciej Fijalkowski
  0 siblings, 0 replies; 17+ messages in thread
From: Maciej Fijalkowski @ 2025-02-10 12:31 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Przemek Kitszel, Jason Baron, Casey Schaufler, Nathan Chancellor,
	bpf, netdev, linux-kernel

On Thu, Feb 06, 2025 at 07:26:28PM +0100, Alexander Lobakin wrote:
> ice, same as i40e, has a custom loop unrolling macros for unrolling
> Tx descriptors filling on XSk xmit.
> Replace ice defs with generic unrolled_count(), which is also more
> convenient as it allows passing defines as its argument, not hardcoded
> values, while the loop declaration will still be usual for-loop.
> 
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

> ---
>  drivers/net/ethernet/intel/ice/ice_xsk.h | 8 --------
>  drivers/net/ethernet/intel/ice/ice_xsk.c | 4 +++-
>  2 files changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.h b/drivers/net/ethernet/intel/ice/ice_xsk.h
> index 45adeb513253..8dc5d55e26c5 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.h
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.h
> @@ -7,14 +7,6 @@
>  
>  #define PKTS_PER_BATCH 8
>  
> -#ifdef __clang__
> -#define loop_unrolled_for _Pragma("clang loop unroll_count(8)") for
> -#elif __GNUC__ >= 8
> -#define loop_unrolled_for _Pragma("GCC unroll 8") for
> -#else
> -#define loop_unrolled_for for
> -#endif
> -
>  struct ice_vsi;
>  
>  #ifdef CONFIG_XDP_SOCKETS
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> index 8975d2971bc3..a3a4eaa17739 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> @@ -2,6 +2,7 @@
>  /* Copyright (c) 2019, Intel Corporation. */
>  
>  #include <linux/bpf_trace.h>
> +#include <linux/unroll.h>
>  #include <net/xdp_sock_drv.h>
>  #include <net/xdp.h>
>  #include "ice.h"
> @@ -989,7 +990,8 @@ static void ice_xmit_pkt_batch(struct ice_tx_ring *xdp_ring,
>  	struct ice_tx_desc *tx_desc;
>  	u32 i;
>  
> -	loop_unrolled_for(i = 0; i < PKTS_PER_BATCH; i++) {
> +	unrolled_count(PKTS_PER_BATCH)
> +	for (i = 0; i < PKTS_PER_BATCH; i++) {
>  		dma_addr_t dma;
>  
>  		dma = xsk_buff_raw_get_dma(xsk_pool, descs[i].addr);
> -- 
> 2.48.1
> 

* Re: [PATCH net-next 1/4] unroll: add generic loop unroll helpers
  2025-02-09 11:07   ` Simon Horman
  2025-02-10 12:28     ` Maciej Fijalkowski
@ 2025-02-10 15:49     ` Alexander Lobakin
  2025-02-10 21:08       ` Simon Horman
  1 sibling, 1 reply; 17+ messages in thread
From: Alexander Lobakin @ 2025-02-10 15:49 UTC (permalink / raw)
  To: Simon Horman
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

From: Simon Horman <horms@kernel.org>
Date: Sun, 9 Feb 2025 11:07:25 +0000

> On Thu, Feb 06, 2025 at 07:26:26PM +0100, Alexander Lobakin wrote:
>> There are cases when we need to explicitly unroll loops. For example,
>> cache operations, filling DMA descriptors on very high speeds etc.
>> Add compiler-specific attribute macros to give the compiler a hint
>> that we'd like to unroll a loop.
>> Example usage:
>>
>>  #define UNROLL_BATCH 8
>>
>> 	unrolled_count(UNROLL_BATCH)
>> 	for (u32 i = 0; i < UNROLL_BATCH; i++)
>> 		op(priv, i);
>>
>> Note that sometimes the compilers won't unroll loops if they think this
>> would have worse optimization and perf than without unrolling, and that
>> unroll attributes are available only starting GCC 8. For older compiler
>> versions, no hints/attributes will be applied.
>> For better unrolling/parallelization, don't have any variables that
>> interfere between iterations except for the iterator itself.
>>
>> Co-developed-by: Jose E. Marchesi <jose.marchesi@oracle.com> # pragmas
>> Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
>> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
> Hi Alexander,
> 
> This patch adds four variants of the unrolled helper.  But as far as I can
> tell the patch-set only makes use of one of them, unrolled_count().
> 
> I think it would be best if this patch only added helpers that are used.

I thought they might help people in future.
I can remove them if you insist. BTW the original patch from Jose also
added several variants.

Thanks,
Olek

* Re: [PATCH net-next 4/4] xsk: add helper to get &xdp_desc's DMA and meta pointer in one go
  2025-02-09 11:03   ` Simon Horman
@ 2025-02-10 16:00     ` Alexander Lobakin
  2025-02-10 21:06       ` Simon Horman
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Lobakin @ 2025-02-10 16:00 UTC (permalink / raw)
  To: Simon Horman
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

From: Simon Horman <horms@kernel.org>
Date: Sun, 9 Feb 2025 11:03:44 +0000

> On Thu, Feb 06, 2025 at 07:26:29PM +0100, Alexander Lobakin wrote:
>> Currently, when your driver supports XSk Tx metadata and you want to
>> send an XSk frame, you need to do the following:
>>
>> * call external xsk_buff_raw_get_dma();
>> * call inline xsk_buff_get_metadata(), which calls external
>>   xsk_buff_raw_get_data() and then do some inline checks.
>>
>> This effectively means that the following piece:
>>
>> addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
>>
>> is done twice per frame, plus you have 2 external calls per frame, plus
>> this:
>>
>> 	meta = pool->addrs + addr - pool->tx_metadata_len;
>> 	if (unlikely(!xsk_buff_valid_tx_metadata(meta)))
>>
>> is always inlined, even if there's no meta or it's invalid.
>>
>> Add xsk_buff_raw_get_ctx() (xp_raw_get_ctx() to be precise) to do that
>> in one go. It returns a small structure with 2 fields: DMA address,
>> filled unconditionally, and metadata pointer, non-NULL only if it's
>> present and valid. The address correction is performed only once and
>> you also have only 1 external call per XSk frame, which does all the
>> calculations and checks outside of your hotpath. You only need to
>> check `if (ctx.meta)` for the metadata presence.
>> To avoid copying any existing code, split out the address correction and
>> the virtual/DMA address calculation into small helpers. bloat-o-meter
>> reports no object code changes for the existing functionality.
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
> Hi Alexander,
> 
> I think that this patch needs to be accompanied by at least one
> patch that uses xsk_buff_raw_get_ctx() in a driver.

This mini-series is the final part of my Chapter III, which was all
about prereqs in order to add libeth_xdp and then XDP for idpf.
This helper will be used in the next series (Chapter IV) I'll send once
this lands.

> 
> Also, as this seems to be an optimisation, some performance data would
> be nice too.

-1 Kb of object code in a driver with an unrolled-by-8 loop which calls
this function each iteration. I don't remember the exact perf numbers
since it was a year ago and since then I've been using this helper only,
but it was something around a couple percent (which is several hundred
Kpps when it comes to XSk).

> 
> Which brings me to my last point. I'd always understood that
> returning a struct was discouraged due to performance implications.

Rather stack usage, not perf implications. Compound returns are used
heavily throughout the kernel code when sizeof(result) <= 16 bytes.
Here it's also 16 bytes. Just the same as one __u128. Plus this function
doesn't recurse, so the stack won't blow up.

> Perhaps that information is out of date, doesn't apply because
> the returned struct is so small in this case, or just plain wrong.
> But I'd appreciate it if you could add some colour to this.

Moreover, the function is global, not inline, so passing a pointer here
instead of returning a struct may even behave worse in this case.

(and we'll save basically only 8 bytes on the stack, which I don't
 believe is worth it).

Thanks,
Olek

* Re: [PATCH net-next 4/4] xsk: add helper to get &xdp_desc's DMA and meta pointer in one go
  2025-02-10 16:00     ` Alexander Lobakin
@ 2025-02-10 21:06       ` Simon Horman
  0 siblings, 0 replies; 17+ messages in thread
From: Simon Horman @ 2025-02-10 21:06 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

On Mon, Feb 10, 2025 at 05:00:36PM +0100, Alexander Lobakin wrote:
> From: Simon Horman <horms@kernel.org>
> Date: Sun, 9 Feb 2025 11:03:44 +0000
> 
> > On Thu, Feb 06, 2025 at 07:26:29PM +0100, Alexander Lobakin wrote:
> >> Currently, when your driver supports XSk Tx metadata and you want to
> >> send an XSk frame, you need to do the following:
> >>
> >> * call external xsk_buff_raw_get_dma();
> >> * call inline xsk_buff_get_metadata(), which calls external
> >>   xsk_buff_raw_get_data() and then do some inline checks.
> >>
> >> This effectively means that the following piece:
> >>
> >> addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
> >>
> >> is done twice per frame, plus you have 2 external calls per frame, plus
> >> this:
> >>
> >> 	meta = pool->addrs + addr - pool->tx_metadata_len;
> >> 	if (unlikely(!xsk_buff_valid_tx_metadata(meta)))
> >>
> >> is always inlined, even if there's no meta or it's invalid.
> >>
> >> Add xsk_buff_raw_get_ctx() (xp_raw_get_ctx() to be precise) to do that
> >> in one go. It returns a small structure with 2 fields: DMA address,
> >> filled unconditionally, and metadata pointer, non-NULL only if it's
> >> present and valid. The address correction is performed only once and
> >> you also have only 1 external call per XSk frame, which does all the
> >> calculations and checks outside of your hotpath. You only need to
> >> check `if (ctx.meta)` for the metadata presence.
> >> To avoid copying any existing code, split out the address correction and
> >> the virtual/DMA address calculation into small helpers. bloat-o-meter
> >> reports no object code changes for the existing functionality.
> >>
> >> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> > 
> > Hi Alexander,
> > 
> > I think that this patch needs to be accompanied by at least one
> > patch that uses xsk_buff_raw_get_ctx() in a driver.
> 
> This mini-series is the final part of my Chapter III, which was all
> about prereqs in order to add libeth_xdp and then XDP for idpf.
> This helper will be used in the next series (Chapter IV) I'll send once
> this lands.

Understood. If it's going to be used in chapter IV then, given
that we've made it to chapter III, that is fine by me.

> > Also, as this seems to be an optimisation, some performance data would
> > be nice too.
> 
> -1 Kb of object code in a driver with an unrolled-by-8 loop which calls
> this function each iteration. I don't remember the exact perf numbers
> since it was a year ago and since then I've been using this helper only,
> but it was something around a couple percent (which is several hundred
> Kpps when it comes to XSk).

Thanks. It might be worth including some of that information in the commit
message, but I don't feel strongly about it.

> 
> > 
> > Which brings me to my last point. I'd always understood that
> > returning a struct was discouraged due to performance implications.
> 
> Rather stack usage, not perf implications. Compound returns are used
> heavily throughout the kernel code when sizeof(result) <= 16 bytes.
> Here it's also 16 bytes. Just the same as one __u128. Plus this function
> doesn't recurse, so the stack won't blow up.

Also understood. It seems my assumptions were somewhat wrong.
So I have no objections to this approach.

> > Perhaps that information is out of date, doesn't apply because
> > the returned struct is so small in this case, or just plain wrong.
> > But I'd appreciate it if you could add some colour to this.
> 
> Moreover, the function is global, not inline, so passing a pointer here
> instead of returning a struct may even behave worse in this case.
> 
> (and we'll save basically only 8 bytes on the stack, which I don't
>  believe is worth it).
> 
> Thanks,
> Olek
> 

* Re: [PATCH net-next 1/4] unroll: add generic loop unroll helpers
  2025-02-10 15:49     ` Alexander Lobakin
@ 2025-02-10 21:08       ` Simon Horman
  2025-02-11  1:56         ` Jakub Kicinski
  0 siblings, 1 reply; 17+ messages in thread
From: Simon Horman @ 2025-02-10 21:08 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

On Mon, Feb 10, 2025 at 04:49:14PM +0100, Alexander Lobakin wrote:
> From: Simon Horman <horms@kernel.org>
> Date: Sun, 9 Feb 2025 11:07:25 +0000
> 
> > On Thu, Feb 06, 2025 at 07:26:26PM +0100, Alexander Lobakin wrote:
> >> There are cases when we need to explicitly unroll loops. For example,
> >> cache operations, filling DMA descriptors on very high speeds etc.
> >> Add compiler-specific attribute macros to give the compiler a hint
> >> that we'd like to unroll a loop.
> >> Example usage:
> >>
> >>  #define UNROLL_BATCH 8
> >>
> >> 	unrolled_count(UNROLL_BATCH)
> >> 	for (u32 i = 0; i < UNROLL_BATCH; i++)
> >> 		op(priv, i);
> >>
> >> Note that sometimes the compilers won't unroll loops if they think this
> >> would have worse optimization and perf than without unrolling, and that
> >> unroll attributes are available only starting GCC 8. For older compiler
> >> versions, no hints/attributes will be applied.
> >> For better unrolling/parallelization, don't have any variables that
> >> interfere between iterations except for the iterator itself.
> >>
> >> Co-developed-by: Jose E. Marchesi <jose.marchesi@oracle.com> # pragmas
> >> Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
> >> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> >> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> > 
> > Hi Alexander,
> > 
> > This patch adds four variants of the unrolled helper.  But as far as I can
> > tell the patch-set only makes use of one of them, unrolled_count().
> > 
> > I think it would be best if this patch only added helpers that are used.
> 
> I thought they might help people in future.
> I can remove them if you insist. BTW the original patch from Jose also
> added several variants.

I do slightly prefer only adding what is used.

* Re: [PATCH net-next 2/4] i40e: use generic unrolled_count() macro
  2025-02-06 18:26 ` [PATCH net-next 2/4] i40e: use generic unrolled_count() macro Alexander Lobakin
  2025-02-10 12:30   ` Maciej Fijalkowski
@ 2025-02-10 22:28   ` David Laight
  1 sibling, 0 replies; 17+ messages in thread
From: David Laight @ 2025-02-10 22:28 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

On Thu,  6 Feb 2025 19:26:27 +0100
Alexander Lobakin <aleksander.lobakin@intel.com> wrote:

> i40e, as well as ice, has a custom loop unrolling macro for unrolling
> Tx descriptors filling on XSk xmit.
> Replace i40e defs with generic unrolled_count(), which is also more
> convenient as it allows passing defines as its argument, not hardcoded
> values, while the loop declaration will still be a usual for-loop.
..
>  #define PKTS_PER_BATCH 4
>  
> -#ifdef __clang__
> -#define loop_unrolled_for _Pragma("clang loop unroll_count(4)") for
> -#elif __GNUC__ >= 8
> -#define loop_unrolled_for _Pragma("GCC unroll 4") for
> -#else
> -#define loop_unrolled_for for
> -#endif
...
> @@ -529,7 +530,8 @@ static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *des
>  	dma_addr_t dma;
>  	u32 i;
>  
> -	loop_unrolled_for(i = 0; i < PKTS_PER_BATCH; i++) {
> +	unrolled_count(PKTS_PER_BATCH)
> +	for (i = 0; i < PKTS_PER_BATCH; i++) {
>  		u32 cmd = I40E_TX_DESC_CMD_ICRC | xsk_is_eop_desc(&desc[i]);
>  
>  		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);

The rest of that code is:


		tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
		tx_desc->buffer_addr = cpu_to_le64(dma);
		tx_desc->cmd_type_offset_bsz = build_ctob(cmd, 0, desc[i].len, 0);

		*total_bytes += desc[i].len;
	}

	xdp_ring->next_to_use = ntu;
}

static void i40e_fill_tx_hw_ring(struct i40e_ring *xdp_ring, struct xdp_desc *descs, u32 nb_pkts,
				 unsigned int *total_bytes)
{
	u32 batched, leftover, i;

	batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
	leftover = nb_pkts & (PKTS_PER_BATCH - 1);
	for (i = 0; i < batched; i += PKTS_PER_BATCH)
		i40e_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
	for (i = batched; i < batched + leftover; i++)
		i40e_xmit_pkt(xdp_ring, &descs[i], total_bytes);
}

If it isn't a silly question, why all the faffing with unrolling?
It isn't as though the loop body is trivial - it contains real function calls.
Unrolling loops is so 1980s - unless you are trying to get the absolute
max performance from a very short loop and need to unroll once (maybe twice)
to get enough spare instruction execution slots to run the loop control
code in parallel with the body.

In this case it looks like the 'batched' loop contains an inlined copy of
the function called for the remainder.
I can't see anything else.
You'd probably gain more by getting rid of the 'unsigned int *total_bytes'
and using the function return value - that is what it is for.

	David


* Re: [PATCH net-next 1/4] unroll: add generic loop unroll helpers
  2025-02-10 21:08       ` Simon Horman
@ 2025-02-11  1:56         ` Jakub Kicinski
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Kicinski @ 2025-02-11  1:56 UTC (permalink / raw)
  To: Simon Horman
  Cc: Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Jose E. Marchesi,
	Toke Høiland-Jørgensen, Magnus Karlsson,
	Maciej Fijalkowski, Przemek Kitszel, Jason Baron, Casey Schaufler,
	Nathan Chancellor, bpf, netdev, linux-kernel

On Mon, 10 Feb 2025 21:08:19 +0000 Simon Horman wrote:
> > > This patch adds four variants of the unrolled helper.  But as far as I can
> > > tell the patch-set only makes use of one of them, unrolled_count().
> > > 
> > > I think it would be best if this patch only added helpers that are used.  
> > 
> > I thought they might help people in future.
> > I can remove them if you insist. BTW the original patch from Jose also
> > added several variants.  
> 
> I do slightly prefer only adding what is used.

Hm, I'm a bit on the fence. IDK how trivial it is to figure out how 
to get the equivalent behavior from the compilers... 

Let's keep it, if someone feels strongly I guess they could post 
a patch to delete the unused variants.

* Re: [PATCH net-next 0/4] xsk: the lost bits from Chapter III
  2025-02-06 18:26 [PATCH net-next 0/4] xsk: the lost bits from Chapter III Alexander Lobakin
                   ` (3 preceding siblings ...)
  2025-02-06 18:26 ` [PATCH net-next 4/4] xsk: add helper to get &xdp_desc's DMA and meta pointer in one go Alexander Lobakin
@ 2025-02-11  2:00 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 17+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-02-11  2:00 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, ast, daniel,
	john.fastabend, andrii, jose.marchesi, toke, magnus.karlsson,
	maciej.fijalkowski, przemyslaw.kitszel, jbaron, casey, nathan,
	bpf, netdev, linux-kernel

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu,  6 Feb 2025 19:26:25 +0100 you wrote:
> Before introducing libeth_xdp, we need to add a couple more generic
> helpers. Notably:
> 
> * 01: add generic loop unrolling hint helpers;
> * 04: add helper to get both xdp_desc's DMA address and metadata
>   pointer in one go, saving several cycles and hotpath object
>   code size in drivers (especially when unrolling).
> 
> [...]

Here is the summary with links:
  - [net-next,1/4] unroll: add generic loop unroll helpers
    https://git.kernel.org/netdev/net-next/c/c6594d642717
  - [net-next,2/4] i40e: use generic unrolled_count() macro
    (no matching commit)
  - [net-next,3/4] ice: use generic unrolled_count() macro
    (no matching commit)
  - [net-next,4/4] xsk: add helper to get &xdp_desc's DMA and meta pointer in one go
    https://git.kernel.org/netdev/net-next/c/23d9324a27a4

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



end of thread, other threads:[~2025-02-11  2:00 UTC | newest]

Thread overview: 17+ messages
2025-02-06 18:26 [PATCH net-next 0/4] xsk: the lost bits from Chapter III Alexander Lobakin
2025-02-06 18:26 ` [PATCH net-next 1/4] unroll: add generic loop unroll helpers Alexander Lobakin
2025-02-09 11:07   ` Simon Horman
2025-02-10 12:28     ` Maciej Fijalkowski
2025-02-10 15:49     ` Alexander Lobakin
2025-02-10 21:08       ` Simon Horman
2025-02-11  1:56         ` Jakub Kicinski
2025-02-06 18:26 ` [PATCH net-next 2/4] i40e: use generic unrolled_count() macro Alexander Lobakin
2025-02-10 12:30   ` Maciej Fijalkowski
2025-02-10 22:28   ` David Laight
2025-02-06 18:26 ` [PATCH net-next 3/4] ice: " Alexander Lobakin
2025-02-10 12:31   ` Maciej Fijalkowski
2025-02-06 18:26 ` [PATCH net-next 4/4] xsk: add helper to get &xdp_desc's DMA and meta pointer in one go Alexander Lobakin
2025-02-09 11:03   ` Simon Horman
2025-02-10 16:00     ` Alexander Lobakin
2025-02-10 21:06       ` Simon Horman
2025-02-11  2:00 ` [PATCH net-next 0/4] xsk: the lost bits from Chapter III patchwork-bot+netdevbpf
