netdev.vger.kernel.org archive mirror
* [PATCH bpf-next V1 0/7] xdp: Propagate RX HW hints for XDP_REDIRECTed packets via xdp_frame
@ 2025-06-03 17:45 Jesper Dangaard Brouer
  2025-06-03 17:45 ` [PATCH bpf-next V1 1/7] net: xdp: Add xdp_rx_meta structure Jesper Dangaard Brouer
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-03 17:45 UTC (permalink / raw)
  To: bpf, netdev, Jakub Kicinski, lorenzo
  Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
	arthur, jakub

This patch series enables the propagation of NIC hardware RX metadata
offload hints for packets undergoing XDP_REDIRECT. Currently, SKBs
created from `xdp_frame`s after an XDP_REDIRECT (e.g. to cpumap or veth)
lack hardware hints.

While XDP hardware RX metadata can be read by BPF programs bound to the
physical device's ifindex (BPF_F_XDP_DEV_BOUND_ONLY) using kfuncs [1],
there's no mechanism to persist these hints for use after a redirect.
The currently available kfuncs[1] provide rx_hash, rx_vlan_tag and
rx_timestamp.

This series introduces new BPF kfuncs allowing an XDP program to store
existing HW metadata hints (rx_hash, rx_vlan_tag and rx_timestamp) into
the `xdp_frame`. These stored hints are then used in
`__xdp_build_skb_from_frame()` to populate the corresponding fields in
the newly created SKB.
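
To illustrate the intended usage, here is a minimal sketch of an XDP program
that reads a HW hint and persists it before redirecting. The program name and
cpu_map are hypothetical; the store kfuncs are the ones introduced by this
series, and kfunc prototypes are assumed to come from a vmlinux.h generated
with a recent pahole:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_CPUMAP);
          __uint(key_size, sizeof(__u32));
          __uint(value_size, sizeof(struct bpf_cpumap_val));
          __uint(max_entries, 1);
  } cpu_map SEC(".maps");

  SEC("xdp")
  int xdp_store_and_redirect(struct xdp_md *ctx)
  {
          enum xdp_rss_hash_type rss_type;
          __u32 hash;

          /* Reading HW hints requires a device-bound program
           * (BPF_F_XDP_DEV_BOUND_ONLY) on a driver that implements the
           * reader kfuncs.
           */
          if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type))
                  /* Persist the hint into the xdp_frame storage area;
                   * returns -ENOSPC if the driver left too little headroom.
                   */
                  bpf_xdp_store_rx_hash(ctx, hash, rss_type);

          return bpf_redirect_map(&cpu_map, 0, 0);
  }

  char _license[] SEC("license") = "GPL";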

The immediate production motivation is to correctly populate `skb->hash`
(the RX hash). This is important for GRO (Generic Receive Offload)
functionality. For instance, the netstack needs the `skb->hash` to be
set *before* the GRO engine processes the packet (see
`dev_gro_receive()` [0]). Without the correct RX hash, the GRO engine
(e.g., cpumap recently gained GRO support) effectively operates on a
single hash bucket, limiting its ability to aggregate flows.

Populating these fields via a TC ingress hook is not viable as it
executes too late in the packet processing pipeline for uses like GRO.

We considered XDP traits as an alternative to statically adding members
at the end of `struct xdp_frame`. However, given the immediate need
for this functionality and the current development status of traits, we
believe this approach is a pragmatic solution. We are open to revisiting
this and potentially migrating to a traits-based implementation if/when
they become a generally accepted mechanism for such extensions.

Furthermore, this patchset demonstrates a tangible in-kernel requirement
for such metadata propagation and could serve as an early example or
adopter of the XDP traits mechanism.

[0] https://elixir.bootlin.com/linux/v6.14.7/source/net/core/gro.c#L463
[1] https://docs.kernel.org/networking/xdp-rx-metadata.html

---

Jesper Dangaard Brouer (2):
      selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect
      net: xdp: update documentation for xdp-rx-metadata.rst

Lorenzo Bianconi (5):
      net: xdp: Add xdp_rx_meta structure
      net: xdp: Add kfuncs to store hw metadata in xdp_buff
      net: xdp: Set skb hw metadata from xdp_frame
      net: veth: Read xdp metadata from rx_meta struct if available
      bpf: selftests: Add rx_meta store kfuncs selftest


 Documentation/networking/xdp-rx-metadata.rst  |  74 ++++++--
 drivers/net/veth.c                            |  12 ++
 include/net/xdp.h                             | 134 ++++++++++++--
 net/core/xdp.c                                | 107 ++++++++++-
 net/xdp/xsk_buff_pool.c                       |   4 +-
 .../bpf/prog_tests/xdp_do_redirect.c          |   6 +-
 .../selftests/bpf/prog_tests/xdp_rxmeta.c     | 166 ++++++++++++++++++
 .../selftests/bpf/progs/xdp_rxmeta_receiver.c |  44 +++++
 .../selftests/bpf/progs/xdp_rxmeta_redirect.c |  48 +++++
 9 files changed, 560 insertions(+), 35 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c

--




* [PATCH bpf-next V1 1/7] net: xdp: Add xdp_rx_meta structure
  2025-06-03 17:45 [PATCH bpf-next V1 0/7] xdp: Propagate RX HW hints for XDP_REDIRECTed packets via xdp_frame Jesper Dangaard Brouer
@ 2025-06-03 17:45 ` Jesper Dangaard Brouer
  2025-06-03 17:46 ` [PATCH bpf-next V1 2/7] selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect Jesper Dangaard Brouer
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-03 17:45 UTC (permalink / raw)
  To: bpf, netdev, Jakub Kicinski, lorenzo
  Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
	arthur, jakub

From: Lorenzo Bianconi <lorenzo@kernel.org>

Introduce the `xdp_rx_meta` structure to serve as a container for XDP RX
hardware hints within XDP packet buffers. Initially, this structure will
accommodate `rx_hash` and `rx_vlan` metadata. (The `rx_timestamp` hint will
get stored in `skb_shared_info`).

A key design aspect is making this metadata accessible both during BPF
program execution (via `struct xdp_buff`) and later if a `struct
xdp_frame` is materialized (e.g., for XDP_REDIRECT).
To achieve this:
  - The `struct xdp_frame` embeds an `xdp_rx_meta` field directly for
    storage.
  - The `struct xdp_buff` includes an `xdp_rx_meta` pointer. This pointer
    is initialized (in `xdp_prepare_buff`) to point to the memory location
    within the packet buffer's headroom where the `xdp_frame`'s embedded
    `rx_meta` field would reside.

This setup allows BPF kfuncs, operating on `xdp_buff`, to populate the
metadata in the precise location where it will be found if an `xdp_frame`
is subsequently created.
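
A minimal sketch of the aliasing (this mirrors what `xdp_prepare_buff()` does
in the diff below; it is illustrative, not additional code):

  /* The headroom is laid out as if a struct xdp_frame lived at hard_start,
   * so the xdp_buff's rx_meta pointer refers to the exact bytes that the
   * embedded xdp_frame::rx_meta field will occupy after conversion.
   */
  xdp->rx_meta = (void *)(hard_start + offsetof(struct xdp_frame, rx_meta));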

The availability of this metadata storage area within the buffer is
indicated by the `XDP_FLAGS_META_AREA` flag in `xdp_buff->flags` (and
propagated to `xdp_frame->flags`). This flag is only set if sufficient
headroom (at least `XDP_MIN_HEADROOM`, currently 192 bytes) is present.
Specific hints like `XDP_FLAGS_META_RX_HASH` and `XDP_FLAGS_META_RX_VLAN`
will then denote which types of metadata have been populated into the
`xdp_rx_meta` structure.

This patch is a step towards enabling the preservation and use of XDP RX
hints across operations like XDP_REDIRECT.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 include/net/xdp.h       |   57 +++++++++++++++++++++++++++++++++++------------
 net/core/xdp.c          |    1 +
 net/xdp/xsk_buff_pool.c |    4 ++-
 3 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 48efacbaa35d..5dcdf634ae4a 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -71,11 +71,31 @@ struct xdp_txq_info {
 	struct net_device *dev;
 };
 
+struct xdp_rx_meta {
+	struct xdp_rx_meta_hash {
+		u32 val;
+		u32 type; /* enum xdp_rss_hash_type */
+	} hash;
+	struct xdp_rx_meta_vlan {
+		__be16 proto;
+		u16 tci;
+	} vlan;
+};
+
+/* Storage area for HW RX metadata; only available with reasonable headroom.
+ * The threshold is below XDP_PACKET_HEADROOM because Intel drivers provide less.
+ */
+#define XDP_MIN_HEADROOM	192
+
 enum xdp_buff_flags {
 	XDP_FLAGS_HAS_FRAGS		= BIT(0), /* non-linear xdp buff */
 	XDP_FLAGS_FRAGS_PF_MEMALLOC	= BIT(1), /* xdp paged memory is under
 						   * pressure
 						   */
+	XDP_FLAGS_META_AREA		= BIT(2), /* storage area available */
+	XDP_FLAGS_META_RX_HASH		= BIT(3), /* hw rx hash */
+	XDP_FLAGS_META_RX_VLAN		= BIT(4), /* hw rx vlan */
+	XDP_FLAGS_META_RX_TS		= BIT(5), /* hw rx timestamp */
 };
 
 struct xdp_buff {
@@ -87,6 +107,24 @@ struct xdp_buff {
 	struct xdp_txq_info *txq;
 	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
 	u32 flags; /* supported values defined in xdp_buff_flags */
+	struct xdp_rx_meta *rx_meta; /* rx hw metadata pointer in the
+				      * buffer headroom
+				      */
+};
+
+struct xdp_frame {
+	void *data;
+	u32 len;
+	u32 headroom;
+	u32 metasize; /* uses lower 8-bits */
+	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+	 * while mem_type is valid on remote CPU.
+	 */
+	enum xdp_mem_type mem_type:32;
+	struct net_device *dev_rx; /* used by cpumap */
+	u32 frame_sz;
+	u32 flags; /* supported values defined in xdp_buff_flags */
+	struct xdp_rx_meta rx_meta; /* rx hw metadata */
 };
 
 static __always_inline bool xdp_buff_has_frags(const struct xdp_buff *xdp)
@@ -133,6 +171,9 @@ xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start,
 	xdp->data = data;
 	xdp->data_end = data + data_len;
 	xdp->data_meta = meta_valid ? data : data + 1;
+	xdp->flags = (headroom < XDP_MIN_HEADROOM) ? 0 : XDP_FLAGS_META_AREA;
+	xdp->rx_meta = (void *)(hard_start +
+				offsetof(struct xdp_frame, rx_meta));
 }
 
 /* Reserve memory area at end-of data area.
@@ -253,20 +294,6 @@ static inline bool xdp_buff_add_frag(struct xdp_buff *xdp, netmem_ref netmem,
 	return true;
 }
 
-struct xdp_frame {
-	void *data;
-	u32 len;
-	u32 headroom;
-	u32 metasize; /* uses lower 8-bits */
-	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
-	 * while mem_type is valid on remote CPU.
-	 */
-	enum xdp_mem_type mem_type:32;
-	struct net_device *dev_rx; /* used by cpumap */
-	u32 frame_sz;
-	u32 flags; /* supported values defined in xdp_buff_flags */
-};
-
 static __always_inline bool xdp_frame_has_frags(const struct xdp_frame *frame)
 {
 	return !!(frame->flags & XDP_FLAGS_HAS_FRAGS);
@@ -355,6 +382,8 @@ void xdp_convert_frame_to_buff(const struct xdp_frame *frame,
 	xdp->data_meta = frame->data - frame->metasize;
 	xdp->frame_sz = frame->frame_sz;
 	xdp->flags = frame->flags;
+	xdp->rx_meta = xdp->data_hard_start +
+		       offsetof(struct xdp_frame, rx_meta);
 }
 
 static inline
diff --git a/net/core/xdp.c b/net/core/xdp.c
index f86eedad586a..61edbd424494 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -605,6 +605,7 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 	xdpf->metasize = metasize;
 	xdpf->frame_sz = PAGE_SIZE;
 	xdpf->mem_type = MEM_TYPE_PAGE_ORDER0;
+	memcpy(&xdpf->rx_meta, xdp->rx_meta, sizeof(*xdp->rx_meta));
 
 	xsk_buff_free(xdp);
 	return xdpf;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 25a76c5ce0f1..bdeb6a4ef7dc 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -569,7 +569,9 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
 
 	xskb->xdp.data = xskb->xdp.data_hard_start + XDP_PACKET_HEADROOM;
 	xskb->xdp.data_meta = xskb->xdp.data;
-	xskb->xdp.flags = 0;
+	xskb->xdp.flags = XDP_FLAGS_META_AREA;
+	xskb->xdp.rx_meta = (void *)(xskb->xdp.data_hard_start +
+				     offsetof(struct xdp_frame, rx_meta));
 
 	if (pool->dev)
 		xp_dma_sync_for_device(pool, xskb->dma, pool->frame_len);




* [PATCH bpf-next V1 2/7] selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect
  2025-06-03 17:45 [PATCH bpf-next V1 0/7] xdp: Propagate RX HW hints for XDP_REDIRECTed packets via xdp_frame Jesper Dangaard Brouer
  2025-06-03 17:45 ` [PATCH bpf-next V1 1/7] net: xdp: Add xdp_rx_meta structure Jesper Dangaard Brouer
@ 2025-06-03 17:46 ` Jesper Dangaard Brouer
  2025-06-03 17:46 ` [PATCH bpf-next V1 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-03 17:46 UTC (permalink / raw)
  To: bpf, netdev, Jakub Kicinski, lorenzo
  Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
	arthur, jakub

This patchset increases struct xdp_buff by one pointer (8 bytes), and the
bpf/test_run struct xdp_page_head contains two xdp_buff structs. Thus,
adjust the test's maximum packet size by 16 bytes.
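
For reference, the adjustment (a sketch of the arithmetic, using the values
from the existing comment in the test):

  2 * sizeof(void *) = 16 bytes
  64-byte cacheline:  3408 - 16 = 3392
  256-byte cacheline: 3216 - 16 = 3200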

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 .../selftests/bpf/prog_tests/xdp_do_redirect.c     |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c b/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c
index 7dac044664ac..25904348b954 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c
@@ -59,12 +59,12 @@ static int attach_tc_prog(struct bpf_tc_hook *hook, int fd)
 
 /* The maximum permissible size is: PAGE_SIZE - sizeof(struct xdp_page_head) -
  * SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) - XDP_PACKET_HEADROOM =
- * 3408 bytes for 64-byte cacheline and 3216 for 256-byte one.
+ * 3392 bytes for 64-byte cacheline and 3200 for 256-byte one.
  */
 #if defined(__s390x__)
-#define MAX_PKT_SIZE 3216
+#define MAX_PKT_SIZE 3200
 #else
-#define MAX_PKT_SIZE 3408
+#define MAX_PKT_SIZE 3392
 #endif
 static void test_max_pkt_size(int fd)
 {




* [PATCH bpf-next V1 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff
  2025-06-03 17:45 [PATCH bpf-next V1 0/7] xdp: Propagate RX HW hints for XDP_REDIRECTed packets via xdp_frame Jesper Dangaard Brouer
  2025-06-03 17:45 ` [PATCH bpf-next V1 1/7] net: xdp: Add xdp_rx_meta structure Jesper Dangaard Brouer
  2025-06-03 17:46 ` [PATCH bpf-next V1 2/7] selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect Jesper Dangaard Brouer
@ 2025-06-03 17:46 ` Jesper Dangaard Brouer
  2025-06-16 21:55   ` Jakub Kicinski
  2025-06-03 17:46 ` [PATCH bpf-next V1 4/7] net: xdp: Set skb hw metadata from xdp_frame Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-03 17:46 UTC (permalink / raw)
  To: bpf, netdev, Jakub Kicinski, lorenzo
  Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
	arthur, jakub

From: Lorenzo Bianconi <lorenzo@kernel.org>

Introduce the following kfuncs to store hw metadata provided by the NIC
into the xdp_buff struct (a usage sketch follows the list):

- rx-hash: bpf_xdp_store_rx_hash
- rx-vlan: bpf_xdp_store_rx_vlan
- rx-hw-ts: bpf_xdp_store_rx_ts
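
A minimal sketch of how a caller is expected to check the result (hypothetical
snippet; the -ENOSPC case matches the checks added in this patch):

  if (bpf_xdp_store_rx_hash(ctx, hash, rss_type) == -ENOSPC) {
          /* The driver left less than XDP_MIN_HEADROOM of headroom, so no
           * storage area is available in this buffer; the hint is dropped.
           */
  }
  /* bpf_xdp_store_rx_ts() writes into skb_shared_info and currently
   * always returns 0.
   */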

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 include/net/xdp.h |    5 +++++
 net/core/xdp.c    |   45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 5dcdf634ae4a..ef73f0bcc441 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -153,6 +153,11 @@ static __always_inline void xdp_buff_set_frag_pfmemalloc(struct xdp_buff *xdp)
 	xdp->flags |= XDP_FLAGS_FRAGS_PF_MEMALLOC;
 }
 
+static __always_inline bool xdp_buff_has_valid_meta_area(struct xdp_buff *xdp)
+{
+	return !!(xdp->flags & XDP_FLAGS_META_AREA);
+}
+
 static __always_inline void
 xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq)
 {
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 61edbd424494..7d59543d7916 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -957,12 +957,57 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
 	return -EOPNOTSUPP;
 }
 
+__bpf_kfunc int bpf_xdp_store_rx_hash(struct xdp_md *ctx, u32 hash,
+				      enum xdp_rss_hash_type rss_type)
+{
+	struct xdp_buff *xdp = (struct xdp_buff *)ctx;
+
+	if (!xdp_buff_has_valid_meta_area(xdp))
+		return -ENOSPC;
+
+	xdp->rx_meta->hash.val = hash;
+	xdp->rx_meta->hash.type = rss_type;
+	xdp->flags |= XDP_FLAGS_META_RX_HASH;
+
+	return 0;
+}
+
+__bpf_kfunc int bpf_xdp_store_rx_vlan(struct xdp_md *ctx, __be16 vlan_proto,
+				      u16 vlan_tci)
+{
+	struct xdp_buff *xdp = (struct xdp_buff *)ctx;
+
+	if (!xdp_buff_has_valid_meta_area(xdp))
+		return -ENOSPC;
+
+	xdp->rx_meta->vlan.proto = vlan_proto;
+	xdp->rx_meta->vlan.tci = vlan_tci;
+	xdp->flags |= XDP_FLAGS_META_RX_VLAN;
+
+	return 0;
+}
+
+__bpf_kfunc int bpf_xdp_store_rx_ts(struct xdp_md *ctx, u64 ts)
+{
+	struct xdp_buff *xdp = (struct xdp_buff *)ctx;
+	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+	struct skb_shared_hwtstamps *shwt = &sinfo->hwtstamps;
+
+	shwt->hwtstamp = ts;
+	xdp->flags |= XDP_FLAGS_META_RX_TS;
+
+	return 0;
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(xdp_metadata_kfunc_ids)
 #define XDP_METADATA_KFUNC(_, __, name, ___) BTF_ID_FLAGS(func, name, KF_TRUSTED_ARGS)
 XDP_METADATA_KFUNC_xxx
 #undef XDP_METADATA_KFUNC
+BTF_ID_FLAGS(func, bpf_xdp_store_rx_hash)
+BTF_ID_FLAGS(func, bpf_xdp_store_rx_vlan)
+BTF_ID_FLAGS(func, bpf_xdp_store_rx_ts)
 BTF_KFUNCS_END(xdp_metadata_kfunc_ids)
 
 static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {




* [PATCH bpf-next V1 4/7] net: xdp: Set skb hw metadata from xdp_frame
  2025-06-03 17:45 [PATCH bpf-next V1 0/7] xdp: Propagate RX HW hints for XDP_REDIRECTed packets via xdp_frame Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2025-06-03 17:46 ` [PATCH bpf-next V1 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff Jesper Dangaard Brouer
@ 2025-06-03 17:46 ` Jesper Dangaard Brouer
  2025-06-03 17:46 ` [PATCH bpf-next V1 5/7] net: veth: Read xdp metadata from rx_meta struct if available Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-03 17:46 UTC (permalink / raw)
  To: bpf, netdev, Jakub Kicinski, lorenzo
  Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
	arthur, jakub

From: Lorenzo Bianconi <lorenzo@kernel.org>

Set the following hw metadata, provided by the NIC, when building the skb
from an xdp_frame (see the mapping sketch below):
- rx hash
- rx vlan
- rx hw-ts
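
A minimal sketch of the resulting mapping (matching the helpers used in the
diff below):

  xdpf->rx_meta.hash.val/.type  ->  skb_set_hash(skb, val, hash_type)
  xdpf->rx_meta.vlan.proto/.tci ->  __vlan_hwaccel_put_tag(skb, proto, tci)
  sinfo->hwtstamps.hwtstamp     ->  skb_hwtstamps(skb)->hwtstamp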

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/net/xdp.h |   15 +++++++++++++++
 net/core/xdp.c    |   29 ++++++++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index ef73f0bcc441..1ecbfe2053f2 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -310,6 +310,21 @@ xdp_frame_is_frag_pfmemalloc(const struct xdp_frame *frame)
 	return !!(frame->flags & XDP_FLAGS_FRAGS_PF_MEMALLOC);
 }
 
+static __always_inline bool xdp_frame_has_rx_meta_hash(struct xdp_frame *frame)
+{
+	return !!(frame->flags & XDP_FLAGS_META_RX_HASH);
+}
+
+static __always_inline bool xdp_frame_has_rx_meta_vlan(struct xdp_frame *frame)
+{
+	return !!(frame->flags & XDP_FLAGS_META_RX_VLAN);
+}
+
+static __always_inline bool xdp_frame_has_rx_meta_ts(struct xdp_frame *frame)
+{
+	return !!(frame->flags & XDP_FLAGS_META_RX_TS);
+}
+
 #define XDP_BULK_QUEUE_SIZE	16
 struct xdp_frame_bulk {
 	int count;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 7d59543d7916..69077cf4c541 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -786,6 +786,23 @@ struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
 }
 EXPORT_SYMBOL_GPL(xdp_build_skb_from_zc);
 
+static void xdp_set_skb_rx_hash_from_meta(struct xdp_frame *frame,
+					  struct sk_buff *skb)
+{
+	enum pkt_hash_types hash_type = PKT_HASH_TYPE_NONE;
+
+	if (!xdp_frame_has_rx_meta_hash(frame))
+		return;
+
+	if (frame->rx_meta.hash.type & XDP_RSS_TYPE_L4_ANY)
+		hash_type = PKT_HASH_TYPE_L4;
+	else if (frame->rx_meta.hash.type & (XDP_RSS_TYPE_L3_IPV4 |
+					     XDP_RSS_TYPE_L3_IPV6))
+		hash_type = PKT_HASH_TYPE_L3;
+
+	skb_set_hash(skb, frame->rx_meta.hash.val, hash_type);
+}
+
 struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 					   struct sk_buff *skb,
 					   struct net_device *dev)
@@ -794,11 +811,15 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 	unsigned int headroom, frame_size;
 	void *hard_start;
 	u8 nr_frags;
+	u64 ts;
 
 	/* xdp frags frame */
 	if (unlikely(xdp_frame_has_frags(xdpf)))
 		nr_frags = sinfo->nr_frags;
 
+	if (unlikely(xdp_frame_has_rx_meta_ts(xdpf)))
+		ts = sinfo->hwtstamps.hwtstamp;
+
 	/* Part of headroom was reserved to xdpf */
 	headroom = sizeof(*xdpf) + xdpf->headroom;
 
@@ -826,9 +847,15 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 	/* Essential SKB info: protocol and skb->dev */
 	skb->protocol = eth_type_trans(skb, dev);
 
+	xdp_set_skb_rx_hash_from_meta(xdpf, skb);
+	if (xdp_frame_has_rx_meta_vlan(xdpf))
+		__vlan_hwaccel_put_tag(skb, xdpf->rx_meta.vlan.proto,
+				       xdpf->rx_meta.vlan.tci);
+	if (unlikely(xdp_frame_has_rx_meta_ts(xdpf)))
+		skb_hwtstamps(skb)->hwtstamp = ts;
+
 	/* Optional SKB info, currently missing:
 	 * - HW checksum info		(skb->ip_summed)
-	 * - HW RX hash			(skb_set_hash)
 	 * - RX ring dev queue index	(skb_record_rx_queue)
 	 */
 




* [PATCH bpf-next V1 5/7] net: veth: Read xdp metadata from rx_meta struct if available
  2025-06-03 17:45 [PATCH bpf-next V1 0/7] xdp: Propagate RX HW hints for XDP_REDIRECTed packets via xdp_frame Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2025-06-03 17:46 ` [PATCH bpf-next V1 4/7] net: xdp: Set skb hw metadata from xdp_frame Jesper Dangaard Brouer
@ 2025-06-03 17:46 ` Jesper Dangaard Brouer
  2025-06-03 17:46 ` [PATCH bpf-next V1 6/7] bpf: selftests: Add rx_meta store kfuncs selftest Jesper Dangaard Brouer
  2025-06-03 17:46 ` [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Jesper Dangaard Brouer
  6 siblings, 0 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-03 17:46 UTC (permalink / raw)
  To: bpf, netdev, Jakub Kicinski, lorenzo
  Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
	arthur, jakub

From: Lorenzo Bianconi <lorenzo@kernel.org>

Report the xdp_rx_meta info, if available in the xdp_buff struct, from the
xdp_metadata_ops callbacks of the veth driver.
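
A minimal sketch of the lookup order now used by the veth callbacks
(simplified from the diff below):

  /* Prefer hints stored in the xdp_buff headroom by a prior XDP program */
  if (!xdp_load_rx_hash_from_buff(xdp, hash, rss_type))
          return 0;
  /* Otherwise fall back to the existing skb-based path */
  if (!skb)
          return -ENODATA;
  ...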

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/veth.c |   12 +++++++++++
 include/net/xdp.h  |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7bb53961c0ea..94b470b6b680 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1614,6 +1614,10 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 static int veth_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
 {
 	struct veth_xdp_buff *_ctx = (void *)ctx;
+	const struct xdp_buff *xdp = &_ctx->xdp;
+
+	if (!xdp_load_rx_ts_from_buff(xdp, timestamp))
+		return 0;
 
 	if (!_ctx->skb)
 		return -ENODATA;
@@ -1626,8 +1630,12 @@ static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 			    enum xdp_rss_hash_type *rss_type)
 {
 	struct veth_xdp_buff *_ctx = (void *)ctx;
+	const struct xdp_buff *xdp = &_ctx->xdp;
 	struct sk_buff *skb = _ctx->skb;
 
+	if (!xdp_load_rx_hash_from_buff(xdp, hash, rss_type))
+		return 0;
+
 	if (!skb)
 		return -ENODATA;
 
@@ -1641,9 +1649,13 @@ static int veth_xdp_rx_vlan_tag(const struct xdp_md *ctx, __be16 *vlan_proto,
 				u16 *vlan_tci)
 {
 	const struct veth_xdp_buff *_ctx = (void *)ctx;
+	const struct xdp_buff *xdp = &_ctx->xdp;
 	const struct sk_buff *skb = _ctx->skb;
 	int err;
 
+	if (!xdp_load_rx_vlan_tag_from_buff(xdp, vlan_proto, vlan_tci))
+		return 0;
+
 	if (!skb)
 		return -ENODATA;
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 1ecbfe2053f2..6d93f0cf1b53 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -158,6 +158,23 @@ static __always_inline bool xdp_buff_has_valid_meta_area(struct xdp_buff *xdp)
 	return !!(xdp->flags & XDP_FLAGS_META_AREA);
 }
 
+static __always_inline bool
+xdp_buff_has_rx_meta_hash(const struct xdp_buff *xdp)
+{
+	return !!(xdp->flags & XDP_FLAGS_META_RX_HASH);
+}
+
+static __always_inline bool
+xdp_buff_has_rx_meta_vlan(const struct xdp_buff *xdp)
+{
+	return !!(xdp->flags & XDP_FLAGS_META_RX_VLAN);
+}
+
+static __always_inline bool xdp_buff_has_rx_meta_ts(const struct xdp_buff *xdp)
+{
+	return !!(xdp->flags & XDP_FLAGS_META_RX_TS);
+}
+
 static __always_inline void
 xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq)
 {
@@ -708,4 +725,44 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 
 	return act;
 }
+
+static inline int xdp_load_rx_hash_from_buff(const struct xdp_buff *xdp,
+					     u32 *hash,
+					     enum xdp_rss_hash_type *rss_type)
+{
+	if (!xdp_buff_has_rx_meta_hash(xdp))
+		return -ENODATA;
+
+	*hash = xdp->rx_meta->hash.val;
+	*rss_type = xdp->rx_meta->hash.type;
+
+	return 0;
+}
+
+static inline int xdp_load_rx_vlan_tag_from_buff(const struct xdp_buff *xdp,
+						 __be16 *vlan_proto,
+						 u16 *vlan_tci)
+{
+	if (!xdp_buff_has_rx_meta_vlan(xdp))
+		return -ENODATA;
+
+	*vlan_proto = xdp->rx_meta->vlan.proto;
+	*vlan_tci = xdp->rx_meta->vlan.tci;
+
+	return 0;
+}
+
+static inline int xdp_load_rx_ts_from_buff(const struct xdp_buff *xdp, u64 *ts)
+{
+	struct skb_shared_info *sinfo;
+
+	if (!xdp_buff_has_rx_meta_ts(xdp))
+		return -ENODATA;
+
+	sinfo = xdp_get_shared_info_from_buff(xdp);
+	*ts = sinfo->hwtstamps.hwtstamp;
+
+	return 0;
+}
+
 #endif /* __LINUX_NET_XDP_H__ */




* [PATCH bpf-next V1 6/7] bpf: selftests: Add rx_meta store kfuncs selftest
  2025-06-03 17:45 [PATCH bpf-next V1 0/7] xdp: Propagate RX HW hints for XDP_REDIRECTed packets via xdp_frame Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2025-06-03 17:46 ` [PATCH bpf-next V1 5/7] net: veth: Read xdp metadata from rx_meta struct if available Jesper Dangaard Brouer
@ 2025-06-03 17:46 ` Jesper Dangaard Brouer
  2025-06-06 21:57   ` Alexei Starovoitov
  2025-06-03 17:46 ` [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Jesper Dangaard Brouer
  6 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-03 17:46 UTC (permalink / raw)
  To: bpf, netdev, Jakub Kicinski, lorenzo
  Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
	arthur, jakub

From: Lorenzo Bianconi <lorenzo@kernel.org>

Introduce bpf selftests for the XDP rx_meta store kfuncs.
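
The selftest topology, as constructed by the prog_tests file below (a sketch):

  local netns               forward netns                  dest netns
  local (10.0.0.1) <-veth-> fwd0 (10.0.0.2) | fwd1 (20.0.0.1) <-veth-> dest (20.0.0.2)

  - xdp_rxmeta_redirect on fwd0: rewrites MACs, stores RX_HASH and
    RX_TIMESTAMP via the store kfuncs, and redirects to fwd1 via a devmap.
  - xdp_rxmeta_receiver on dest: reads the hints back with the standard
    reader kfuncs and bumps a counter, which the test compares against the
    number of pings (NUM_PACKETS).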

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/xdp_rxmeta.c  |  166 ++++++++++++++++++++
 .../selftests/bpf/progs/xdp_rxmeta_receiver.c      |   44 +++++
 .../selftests/bpf/progs/xdp_rxmeta_redirect.c      |   48 ++++++
 3 files changed, 258 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c b/tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
new file mode 100644
index 000000000000..544279b58e10
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <bpf/btf.h>
+#include <linux/if_link.h>
+
+#include "xdp_rxmeta_redirect.skel.h"
+#include "xdp_rxmeta_receiver.skel.h"
+
+#define LOCAL_NETNS_NAME	"local"
+#define FWD_NETNS_NAME		"forward"
+#define DST_NETNS_NAME		"dest"
+
+#define LOCAL_NAME	"local"
+#define FWD0_NAME	"fwd0"
+#define FWD1_NAME	"fwd1"
+#define DST_NAME	"dest"
+
+#define LOCAL_MAC	"00:00:00:00:00:01"
+#define FWD0_MAC	"00:00:00:00:00:02"
+#define FWD1_MAC	"00:00:00:00:01:01"
+#define DST_MAC		"00:00:00:00:01:02"
+
+#define LOCAL_ADDR	"10.0.0.1"
+#define FWD0_ADDR	"10.0.0.2"
+#define FWD1_ADDR	"20.0.0.1"
+#define DST_ADDR	"20.0.0.2"
+
+#define PREFIX_LEN	"8"
+#define NUM_PACKETS	10
+
+static int run_ping(const char *dst, int num_ping)
+{
+	SYS(fail, "ping -c%d -W1 -i0.5 %s >/dev/null", num_ping, dst);
+	return 0;
+fail:
+	return -1;
+}
+
+void test_xdp_rxmeta(void)
+{
+	struct xdp_rxmeta_redirect *skel_redirect = NULL;
+	struct xdp_rxmeta_receiver *skel_receiver = NULL;
+	struct bpf_devmap_val val = {};
+	struct bpf_program *prog;
+	__u32 key = 0, stats;
+	struct nstoken *tok;
+	int ret, index;
+
+	SYS(out, "ip netns add " LOCAL_NETNS_NAME);
+	SYS(out, "ip netns add " FWD_NETNS_NAME);
+	SYS(out, "ip netns add " DST_NETNS_NAME);
+
+	tok = open_netns(LOCAL_NETNS_NAME);
+	if (!ASSERT_OK_PTR(tok, "setns"))
+		goto out;
+
+	SYS(out, "ip link add " LOCAL_NAME " type veth peer " FWD0_NAME);
+	SYS(out, "ip link set " FWD0_NAME " netns " FWD_NETNS_NAME);
+	SYS(out, "ip link set dev " LOCAL_NAME " address " LOCAL_MAC);
+	SYS(out, "ip addr add " LOCAL_ADDR "/" PREFIX_LEN " dev " LOCAL_NAME);
+	SYS(out, "ip link set dev " LOCAL_NAME " up");
+	SYS(out, "ip route add default via " FWD0_ADDR);
+	close_netns(tok);
+
+	tok = open_netns(DST_NETNS_NAME);
+	if (!ASSERT_OK_PTR(tok, "setns"))
+		goto out;
+
+	SYS(out, "ip link add " DST_NAME " type veth peer " FWD1_NAME);
+	SYS(out, "ip link set " FWD1_NAME " netns " FWD_NETNS_NAME);
+	SYS(out, "ip link set dev " DST_NAME " address " DST_MAC);
+	SYS(out, "ip addr add " DST_ADDR "/" PREFIX_LEN " dev " DST_NAME);
+	SYS(out, "ip link set dev " DST_NAME " up");
+	SYS(out, "ip route add default via " FWD1_ADDR);
+
+	skel_receiver = xdp_rxmeta_receiver__open();
+	if (!ASSERT_OK_PTR(skel_receiver, "open skel_receiver"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(skel_receiver->obj,
+						"xdp_rxmeta_receiver");
+	index = if_nametoindex(DST_NAME);
+	bpf_program__set_ifindex(prog, index);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
+	if (!ASSERT_OK(xdp_rxmeta_receiver__load(skel_receiver),
+		       "load skel_receiver"))
+		goto out;
+
+	ret = bpf_xdp_attach(index,
+			     bpf_program__fd(skel_receiver->progs.xdp_rxmeta_receiver),
+			     XDP_FLAGS_DRV_MODE, NULL);
+	if (!ASSERT_GE(ret, 0, "bpf_xdp_attach rx_meta_redirect"))
+		goto out;
+
+	close_netns(tok);
+	tok = open_netns(FWD_NETNS_NAME);
+	if (!ASSERT_OK_PTR(tok, "setns"))
+		goto out;
+
+	SYS(out, "ip link set dev " FWD0_NAME " address " FWD0_MAC);
+	SYS(out, "ip addr add " FWD0_ADDR "/" PREFIX_LEN " dev " FWD0_NAME);
+	SYS(out, "ip link set dev " FWD0_NAME " up");
+
+	SYS(out, "ip link set dev " FWD1_NAME " address " FWD1_MAC);
+	SYS(out, "ip addr add " FWD1_ADDR "/" PREFIX_LEN " dev " FWD1_NAME);
+	SYS(out, "ip link set dev " FWD1_NAME " up");
+
+	SYS(out, "sysctl -qw net.ipv4.conf.all.forwarding=1");
+
+	skel_redirect = xdp_rxmeta_redirect__open();
+	if (!ASSERT_OK_PTR(skel_redirect, "open skel_redirect"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(skel_redirect->obj,
+						"xdp_rxmeta_redirect");
+	index = if_nametoindex(FWD0_NAME);
+	bpf_program__set_ifindex(prog, index);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
+	if (!ASSERT_OK(xdp_rxmeta_redirect__load(skel_redirect),
+		       "load skel_redirect"))
+		goto out;
+
+	val.ifindex = if_nametoindex(FWD1_NAME);
+	ret = bpf_map_update_elem(bpf_map__fd(skel_redirect->maps.dev_map),
+				  &key, &val, 0);
+	if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
+		goto out;
+
+	ret = bpf_xdp_attach(index,
+			     bpf_program__fd(skel_redirect->progs.xdp_rxmeta_redirect),
+			     XDP_FLAGS_DRV_MODE, NULL);
+	if (!ASSERT_GE(ret, 0, "bpf_xdp_attach rxmeta_redirect"))
+		goto out;
+
+	close_netns(tok);
+	tok = open_netns(LOCAL_NETNS_NAME);
+	if (!ASSERT_OK_PTR(tok, "setns"))
+		goto out;
+
+	if (!ASSERT_OK(run_ping(DST_ADDR, NUM_PACKETS), "ping"))
+		goto out;
+
+	close_netns(tok);
+	tok = open_netns(DST_NETNS_NAME);
+	if (!ASSERT_OK_PTR(tok, "setns"))
+		goto out;
+
+	ret = bpf_map__lookup_elem(skel_receiver->maps.stats,
+				   &key, sizeof(key),
+				   &stats, sizeof(stats), 0);
+	if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
+		goto out;
+
+	ASSERT_EQ(stats, NUM_PACKETS, "rx_meta stats");
+out:
+	xdp_rxmeta_redirect__destroy(skel_redirect);
+	xdp_rxmeta_receiver__destroy(skel_receiver);
+	if (tok)
+		close_netns(tok);
+	SYS_NOFAIL("ip netns del " LOCAL_NETNS_NAME);
+	SYS_NOFAIL("ip netns del " FWD_NETNS_NAME);
+	SYS_NOFAIL("ip netns del " DST_NETNS_NAME);
+}
diff --git a/tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c b/tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
new file mode 100644
index 000000000000..1033fa558970
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#define BPF_NO_KFUNC_PROTOTYPES
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
+				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
+					 __u64 *timestamp) __ksym;
+
+#define RX_TIMESTAMP	0x12345678
+#define RX_HASH		0x1234
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, __u32);
+	__type(value, __u32);
+	__uint(max_entries, 1);
+} stats SEC(".maps");
+
+SEC("xdp")
+int xdp_rxmeta_receiver(struct xdp_md *ctx)
+{
+	enum xdp_rss_hash_type rss_type;
+	__u64 timestamp;
+	__u32 hash;
+
+	if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type) &&
+	    !bpf_xdp_metadata_rx_timestamp(ctx, &timestamp)) {
+		if (hash == RX_HASH && rss_type == XDP_RSS_L4_TCP &&
+		    timestamp == RX_TIMESTAMP) {
+			__u32 *val, key = 0;
+
+			val = bpf_map_lookup_elem(&stats, &key);
+			if (val)
+				 __sync_add_and_fetch(val, 1);
+		}
+	}
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c b/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
new file mode 100644
index 000000000000..1606454a1fbc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+#define BPF_NO_KFUNC_PROTOTYPES
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+extern void bpf_xdp_store_rx_hash(struct xdp_md *ctx, u32 hash,
+				  enum xdp_rss_hash_type rss_type) __ksym;
+extern void bpf_xdp_store_rx_ts(struct xdp_md *ctx, __u64 ts) __ksym;
+
+#define RX_TIMESTAMP	0x12345678
+#define RX_HASH		0x1234
+
+#define ETH_ALEN	6
+#define ETH_P_IP	0x0800
+
+struct {
+	__uint(type, BPF_MAP_TYPE_DEVMAP);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct bpf_devmap_val));
+	__uint(max_entries, 1);
+} dev_map SEC(".maps");
+
+SEC("xdp")
+int xdp_rxmeta_redirect(struct xdp_md *ctx)
+{
+	__u8 src_mac[] = { 0x00, 0x00, 0x00, 0x00, 0x01, 0x01 };
+	__u8 dst_mac[] = { 0x00, 0x00, 0x00, 0x00, 0x01, 0x02 };
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eh = data;
+
+	if (eh + 1 > (struct ethhdr *)data_end)
+		return XDP_DROP;
+
+	if (eh->h_proto != bpf_htons(ETH_P_IP))
+		return XDP_PASS;
+
+	__builtin_memcpy(eh->h_source, src_mac, ETH_ALEN);
+	__builtin_memcpy(eh->h_dest, dst_mac, ETH_ALEN);
+
+	bpf_xdp_store_rx_hash(ctx, RX_HASH, XDP_RSS_L4_TCP);
+	bpf_xdp_store_rx_ts(ctx, RX_TIMESTAMP);
+
+	return bpf_redirect_map(&dev_map, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";




* [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-03 17:45 [PATCH bpf-next V1 0/7] xdp: Propagate RX HW hints for XDP_REDIRECTed packets via xdp_frame Jesper Dangaard Brouer
                   ` (5 preceding siblings ...)
  2025-06-03 17:46 ` [PATCH bpf-next V1 6/7] bpf: selftests: Add rx_meta store kfuncs selftest Jesper Dangaard Brouer
@ 2025-06-03 17:46 ` Jesper Dangaard Brouer
  2025-06-06  2:45   ` Stanislav Fomichev
  6 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-03 17:46 UTC (permalink / raw)
  To: bpf, netdev, Jakub Kicinski, lorenzo
  Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
	arthur, jakub

Update the documentation[1] based on the changes in this patchset.

[1] https://docs.kernel.org/networking/xdp-rx-metadata.html

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 Documentation/networking/xdp-rx-metadata.rst |   74 ++++++++++++++++++++------
 net/core/xdp.c                               |   32 +++++++++++
 2 files changed, 90 insertions(+), 16 deletions(-)

diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
index a6e0ece18be5..2c54208e4f7e 100644
--- a/Documentation/networking/xdp-rx-metadata.rst
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -90,22 +90,64 @@ the ``data_meta`` pointer.
 In the future, we'd like to support a case where an XDP program
 can override some of the metadata used for building ``skbs``.
 
-bpf_redirect_map
-================
-
-``bpf_redirect_map`` can redirect the frame to a different device.
-Some devices (like virtual ethernet links) support running a second XDP
-program after the redirect. However, the final consumer doesn't have
-access to the original hardware descriptor and can't access any of
-the original metadata. The same applies to XDP programs installed
-into devmaps and cpumaps.
-
-This means that for redirected packets only custom metadata is
-currently supported, which has to be prepared by the initial XDP program
-before redirect. If the frame is eventually passed to the kernel, the
-``skb`` created from such a frame won't have any hardware metadata populated
-in its ``skb``. If such a packet is later redirected into an ``XSK``,
-that will also only have access to the custom metadata.
+XDP_REDIRECT
+============
+
+The ``XDP_REDIRECT`` action forwards an XDP frame to another net device or a CPU
+(via cpumap/devmap) for further processing. It is invoked using BPF helpers like
+``bpf_redirect_map()`` or ``bpf_redirect()``.  When an XDP frame is redirected,
+the recipient (e.g., an XDP program on a veth device, or the kernel stack via
+cpumap) loses direct access to the original NIC's hardware descriptor and thus
+its hardware metadata.
+
+By default, this loss of access means that if an ``xdp_frame`` is redirected and
+then converted to an ``skb``, its ``skb`` fields for hardware-derived metadata
+(like ``skb->hash`` or VLAN info) are not populated from the original
+packet. This can impact features like Generic Receive Offload (GRO).  While XDP
+programs can manually save custom data (e.g., using ``bpf_xdp_adjust_meta()``),
+propagating specific *hardware* RX hints to ``skb`` creation requires using the
+kfuncs described below.
+
+To enable propagating selected hardware RX hints, store BPF kfuncs allow an
+XDP program on the initial NIC to read these hints and then explicitly
+*store* them. The kfuncs place this metadata in locations associated with
+the XDP packet buffer, making it available if an ``skb`` is later built or
+the frame is otherwise processed. For instance, RX hash and VLAN tags are
+stored within the XDP frame's addressable headroom, while RX timestamps are
+typically written to an area corresponding to ``skb_shared_info``.
+
+**Crucially, the BPF programmer must call these "store" kfuncs to save the
+desired hardware hints for propagation.** The system does not do this
+automatically. The NIC driver is responsible for ensuring sufficient headroom is
+available; kfuncs may return ``-ENOSPC`` if space is inadequate for storing
+these hints.
+
+When these kfuncs are used to store hints before redirection:
+
+* If the ``xdp_frame`` is converted to an ``skb``, the networking stack can use
+  the stored hints to populate ``skb`` fields (e.g., ``skb->hash``,
+  ``skb->vlan_tci``, timestamps), aiding netstack features like GRO.
+* When running a second XDP program after the redirect (e.g., on a veth
+  device), the previously stored metadata can be accessed through the normal
+  reader kfuncs.
+
+Kfuncs are available for storing RX hash (``bpf_xdp_store_rx_hash()``),
+VLAN information (``bpf_xdp_store_rx_vlan()``), and hardware timestamps
+(``bpf_xdp_store_rx_ts()``). Consult the kfunc API documentation for usage
+details, expected data, return codes, and relevant XDP flags that may
+indicate success or metadata availability.
+
+Kfuncs for **store** operations:
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_store_rx_ts
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_store_rx_hash
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_store_rx_vlan
+
 
 bpf_tail_call
 =============
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 69077cf4c541..1c0f5f980394 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -984,6 +984,18 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
 	return -EOPNOTSUPP;
 }
 
+/**
+ * bpf_xdp_store_rx_hash - Store XDP frame RX hash.
+ * @ctx: XDP context pointer.
+ * @hash: 32-bit hash value.
+ * @rss_type: RSS hash type.
+ *
+ * The RSS hash type (@rss_type) is as described in bpf_xdp_metadata_rx_hash().
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ * * ``-ENOSPC`` : the device driver did not provide enough headroom to store the hint
+ */
 __bpf_kfunc int bpf_xdp_store_rx_hash(struct xdp_md *ctx, u32 hash,
 				      enum xdp_rss_hash_type rss_type)
 {
@@ -999,6 +1011,18 @@ __bpf_kfunc int bpf_xdp_store_rx_hash(struct xdp_md *ctx, u32 hash,
 	return 0;
 }
 
+/**
+ * bpf_xdp_store_rx_vlan - Store XDP packet outermost VLAN tag
+ * @ctx: XDP context pointer.
+ * @vlan_proto: VLAN protocol stored in **network byte order (BE)**
+ * @vlan_tci: VLAN TCI (VID + DEI + PCP) stored in **host byte order**
+ *
+ * See bpf_xdp_metadata_rx_vlan_tag() for byte order reasoning.
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ * * ``-ENOSPC`` : the device driver did not provide enough headroom to store the hint
+ */
 __bpf_kfunc int bpf_xdp_store_rx_vlan(struct xdp_md *ctx, __be16 vlan_proto,
 				      u16 vlan_tci)
 {
@@ -1014,6 +1038,14 @@ __bpf_kfunc int bpf_xdp_store_rx_vlan(struct xdp_md *ctx, __be16 vlan_proto,
 	return 0;
 }
 
+/**
+ * bpf_xdp_store_rx_ts - Store XDP frame RX timestamp.
+ * @ctx: XDP context pointer.
+ * @ts: Timestamp value.
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ */
 __bpf_kfunc int bpf_xdp_store_rx_ts(struct xdp_md *ctx, u64 ts)
 {
 	struct xdp_buff *xdp = (struct xdp_buff *)ctx;




* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-03 17:46 ` [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Jesper Dangaard Brouer
@ 2025-06-06  2:45   ` Stanislav Fomichev
  2025-06-10 13:48     ` Daniel Borkmann
  0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2025-06-06  2:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
	kernel-team, arthur, jakub

On 06/03, Jesper Dangaard Brouer wrote:
> Update the documentation[1] based on the changes in this patchset.
> 
> [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
> 
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
>  Documentation/networking/xdp-rx-metadata.rst |   74 ++++++++++++++++++++------
>  net/core/xdp.c                               |   32 +++++++++++
>  2 files changed, 90 insertions(+), 16 deletions(-)
> 
> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> index a6e0ece18be5..2c54208e4f7e 100644
> --- a/Documentation/networking/xdp-rx-metadata.rst
> +++ b/Documentation/networking/xdp-rx-metadata.rst
> @@ -90,22 +90,64 @@ the ``data_meta`` pointer.
>  In the future, we'd like to support a case where an XDP program
>  can override some of the metadata used for building ``skbs``.
>  
> -bpf_redirect_map
> -================
> -
> -``bpf_redirect_map`` can redirect the frame to a different device.
> -Some devices (like virtual ethernet links) support running a second XDP
> -program after the redirect. However, the final consumer doesn't have
> -access to the original hardware descriptor and can't access any of
> -the original metadata. The same applies to XDP programs installed
> -into devmaps and cpumaps.
> -
> -This means that for redirected packets only custom metadata is
> -currently supported, which has to be prepared by the initial XDP program
> -before redirect. If the frame is eventually passed to the kernel, the
> -``skb`` created from such a frame won't have any hardware metadata populated
> -in its ``skb``. If such a packet is later redirected into an ``XSK``,
> -that will also only have access to the custom metadata.
> +XDP_REDIRECT
> +============
> +
> +The ``XDP_REDIRECT`` action forwards an XDP frame to another net device or a CPU
> +(via cpumap/devmap) for further processing. It is invoked using BPF helpers like
> +``bpf_redirect_map()`` or ``bpf_redirect()``.  When an XDP frame is redirected,
> +the recipient (e.g., an XDP program on a veth device, or the kernel stack via
> +cpumap) loses direct access to the original NIC's hardware descriptor and thus
> +its hardware metadata
> +
> +By default, this loss of access means that if an ``xdp_frame`` is redirected and
> +then converted to an ``skb``, its ``skb`` fields for hardware-derived metadata
> +(like ``skb->hash`` or VLAN info) are not populated from the original
> +packet. This can impact features like Generic Receive Offload (GRO).  While XDP
> +programs can manually save custom data (e.g., using ``bpf_xdp_adjust_meta()``),
> +propagating specific *hardware* RX hints to ``skb`` creation requires using the
> +kfuncs described below.
> +
> +To enable propagating selected hardware RX hints, store BPF kfuncs allow an
> +XDP program on the initial NIC to read these hints and then explicitly
> +*store* them. The kfuncs place this metadata in locations associated with
> +the XDP packet buffer, making it available if an ``skb`` is later built or
> +the frame is otherwise processed. For instance, RX hash and VLAN tags are
> +stored within the XDP frame's addressable headroom, while RX timestamps are
> +typically written to an area corresponding to ``skb_shared_info``.
> +
> +**Crucially, the BPF programmer must call these "store" kfuncs to save the
> +desired hardware hints for propagation.** The system does not do this
> +automatically. The NIC driver is responsible for ensuring sufficient headroom is
> +available; kfuncs may return ``-ENOSPC`` if space is inadequate for storing
> +these hints.

Why not have a new flag for bpf_redirect that transparently stores all
available metadata? If you care only about the redirect -> skb case.
Might give us more wiggle room in the future to make it work with
traits.


* Re: [PATCH bpf-next V1 6/7] bpf: selftests: Add rx_meta store kfuncs selftest
  2025-06-03 17:46 ` [PATCH bpf-next V1 6/7] bpf: selftests: Add rx_meta store kfuncs selftest Jesper Dangaard Brouer
@ 2025-06-06 21:57   ` Alexei Starovoitov
  2025-06-06 22:16     ` Lorenzo Bianconi
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-06-06 21:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, Network Development, Jakub Kicinski, Lorenzo Bianconi,
	Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	David S. Miller, Paolo Abeni, Stanislav Fomichev, kernel-team,
	Arthur Fabre, Jakub Sitnicki

On Tue, Jun 3, 2025 at 10:46 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> diff --git a/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c b/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
> new file mode 100644
> index 000000000000..1606454a1fbc
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
> @@ -0,0 +1,48 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define BPF_NO_KFUNC_PROTOTYPES
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +extern void bpf_xdp_store_rx_hash(struct xdp_md *ctx, u32 hash,
> +                                 enum xdp_rss_hash_type rss_type) __ksym;
> +extern void bpf_xdp_store_rx_ts(struct xdp_md *ctx, __u64 ts) __ksym;

CI rightfully complains that kfunc proto is incorrect.

In general there is no need to manually specify kfunc protos.
They will be in vmlinux.h already with a recent pahole.


* Re: [PATCH bpf-next V1 6/7] bpf: selftests: Add rx_meta store kfuncs selftest
  2025-06-06 21:57   ` Alexei Starovoitov
@ 2025-06-06 22:16     ` Lorenzo Bianconi
  0 siblings, 0 replies; 30+ messages in thread
From: Lorenzo Bianconi @ 2025-06-06 22:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, bpf, Network Development, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	David S. Miller, Paolo Abeni, Stanislav Fomichev, kernel-team,
	Arthur Fabre, Jakub Sitnicki


> On Tue, Jun 3, 2025 at 10:46 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> > diff --git a/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c b/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
> > new file mode 100644
> > index 000000000000..1606454a1fbc
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
> > @@ -0,0 +1,48 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#define BPF_NO_KFUNC_PROTOTYPES
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_helpers.h>
> > +#include <bpf/bpf_endian.h>
> > +
> > +extern void bpf_xdp_store_rx_hash(struct xdp_md *ctx, u32 hash,
> > +                                 enum xdp_rss_hash_type rss_type) __ksym;
> > +extern void bpf_xdp_store_rx_ts(struct xdp_md *ctx, __u64 ts) __ksym;
> 
> CI rightfully complains that kfunc proto is incorrect.
> 
> In general there is no need to manually specify kfunc protos.
> They will be in vmlinux.h already with a recent pahole.

ack, I will fix it in v2.

Regards,
Lorenzo



* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-06  2:45   ` Stanislav Fomichev
@ 2025-06-10 13:48     ` Daniel Borkmann
  2025-06-10 20:12       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Borkmann @ 2025-06-10 13:48 UTC (permalink / raw)
  To: Stanislav Fomichev, Jesper Dangaard Brouer
  Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
	kernel-team, arthur, jakub, Magnus Karlsson, Maciej Fijalkowski

On 6/6/25 4:45 AM, Stanislav Fomichev wrote:
> On 06/03, Jesper Dangaard Brouer wrote:
>> Update the documentation[1] based on the changes in this patchset.
>>
>> [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
>>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>>   Documentation/networking/xdp-rx-metadata.rst |   74 ++++++++++++++++++++------
>>   net/core/xdp.c                               |   32 +++++++++++
>>   2 files changed, 90 insertions(+), 16 deletions(-)
>>
>> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
>> index a6e0ece18be5..2c54208e4f7e 100644
>> --- a/Documentation/networking/xdp-rx-metadata.rst
>> +++ b/Documentation/networking/xdp-rx-metadata.rst
>> @@ -90,22 +90,64 @@ the ``data_meta`` pointer.
>>   In the future, we'd like to support a case where an XDP program
>>   can override some of the metadata used for building ``skbs``.
>>   
>> -bpf_redirect_map
>> -================
>> -
>> -``bpf_redirect_map`` can redirect the frame to a different device.
>> -Some devices (like virtual ethernet links) support running a second XDP
>> -program after the redirect. However, the final consumer doesn't have
>> -access to the original hardware descriptor and can't access any of
>> -the original metadata. The same applies to XDP programs installed
>> -into devmaps and cpumaps.
>> -
>> -This means that for redirected packets only custom metadata is
>> -currently supported, which has to be prepared by the initial XDP program
>> -before redirect. If the frame is eventually passed to the kernel, the
>> -``skb`` created from such a frame won't have any hardware metadata populated
>> -in its ``skb``. If such a packet is later redirected into an ``XSK``,
>> -that will also only have access to the custom metadata.
>> +XDP_REDIRECT
>> +============
>> +
>> +The ``XDP_REDIRECT`` action forwards an XDP frame to another net device or a CPU
>> +(via cpumap/devmap) for further processing. It is invoked using BPF helpers like
>> +``bpf_redirect_map()`` or ``bpf_redirect()``.  When an XDP frame is redirected,
>> +the recipient (e.g., an XDP program on a veth device, or the kernel stack via
>> +cpumap) loses direct access to the original NIC's hardware descriptor and thus
>> +its hardware metadata
>> +
>> +By default, this loss of access means that if an ``xdp_frame`` is redirected and
>> +then converted to an ``skb``, its ``skb`` fields for hardware-derived metadata
>> +(like ``skb->hash`` or VLAN info) are not populated from the original
>> +packet. This can impact features like Generic Receive Offload (GRO).  While XDP
>> +programs can manually save custom data (e.g., using ``bpf_xdp_adjust_meta()``),
>> +propagating specific *hardware* RX hints to ``skb`` creation requires using the
>> +kfuncs described below.
>> +
>> +To enable propagating selected hardware RX hints, store BPF kfuncs allow an
>> +XDP program on the initial NIC to read these hints and then explicitly
>> +*store* them. The kfuncs place this metadata in locations associated with
>> +the XDP packet buffer, making it available if an ``skb`` is later built or
>> +the frame is otherwise processed. For instance, RX hash and VLAN tags are
>> +stored within the XDP frame's addressable headroom, while RX timestamps are
>> +typically written to an area corresponding to ``skb_shared_info``.
>> +
>> +**Crucially, the BPF programmer must call these "store" kfuncs to save the
>> +desired hardware hints for propagation.** The system does not do this
>> +automatically. The NIC driver is responsible for ensuring sufficient headroom is
>> +available; kfuncs may return ``-ENOSPC`` if space is inadequate for storing
>> +these hints.
> 
> Why not have a new flag for bpf_redirect that transparently stores all
> available metadata? If you care only about the redirect -> skb case.
> Might give us more wiggle room in the future to make it work with
> traits.

Also q from my side: If I understand the proposal correctly, in order to fully
populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
to collect the data from the driver descriptors (indirect call), and then yet
again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
meta data aren't you better off switching to tc(x) directly so the driver can
do all this natively? :/

Also, have you thought about taking the opportunity to generalize the existing
struct xsk_tx_metadata? It would be nice to actually use the same/similar struct
for RX and TX, similarly as done in struct virtio_net_hdr. Such that we have
XDP_{RX,TX}_METADATA and XDP_{RX,TX}MD_FLAGS_* to describe what meta data we
have and from a developer PoV this will be a nicely consistent API in XDP. Then
you could store at the right location in the meta data region just with
bpf_xdp_metadata_* kfuncs (and/or plain BPF code) and finally set XDP_RX_METADATA
indicator bit.

Thanks,
Daniel


* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-10 13:48     ` Daniel Borkmann
@ 2025-06-10 20:12       ` Toke Høiland-Jørgensen
  2025-06-10 22:26         ` Lorenzo Bianconi
  2025-06-13 11:18         ` [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Daniel Borkmann
  0 siblings, 2 replies; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-06-10 20:12 UTC (permalink / raw)
  To: Daniel Borkmann, Stanislav Fomichev, Jesper Dangaard Brouer
  Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
	kernel-team, arthur, jakub, Magnus Karlsson, Maciej Fijalkowski

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 6/6/25 4:45 AM, Stanislav Fomichev wrote:
>> On 06/03, Jesper Dangaard Brouer wrote:
>>> Update the documentation[1] based on the changes in this patchset.
>>>
>>> [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
>>>
>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>> ---
>>>   Documentation/networking/xdp-rx-metadata.rst |   74 ++++++++++++++++++++------
>>>   net/core/xdp.c                               |   32 +++++++++++
>>>   2 files changed, 90 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
>>> index a6e0ece18be5..2c54208e4f7e 100644
>>> --- a/Documentation/networking/xdp-rx-metadata.rst
>>> +++ b/Documentation/networking/xdp-rx-metadata.rst
>>> @@ -90,22 +90,64 @@ the ``data_meta`` pointer.
>>>   In the future, we'd like to support a case where an XDP program
>>>   can override some of the metadata used for building ``skbs``.
>>>   
>>> -bpf_redirect_map
>>> -================
>>> -
>>> -``bpf_redirect_map`` can redirect the frame to a different device.
>>> -Some devices (like virtual ethernet links) support running a second XDP
>>> -program after the redirect. However, the final consumer doesn't have
>>> -access to the original hardware descriptor and can't access any of
>>> -the original metadata. The same applies to XDP programs installed
>>> -into devmaps and cpumaps.
>>> -
>>> -This means that for redirected packets only custom metadata is
>>> -currently supported, which has to be prepared by the initial XDP program
>>> -before redirect. If the frame is eventually passed to the kernel, the
>>> -``skb`` created from such a frame won't have any hardware metadata populated
>>> -in its ``skb``. If such a packet is later redirected into an ``XSK``,
>>> -that will also only have access to the custom metadata.
>>> +XDP_REDIRECT
>>> +============
>>> +
>>> +The ``XDP_REDIRECT`` action forwards an XDP frame to another net device or a CPU
>>> +(via cpumap/devmap) for further processing. It is invoked using BPF helpers like
>>> +``bpf_redirect_map()`` or ``bpf_redirect()``.  When an XDP frame is redirected,
>>> +the recipient (e.g., an XDP program on a veth device, or the kernel stack via
>>> +cpumap) loses direct access to the original NIC's hardware descriptor and thus
>>> +its hardware metadata.
>>> +
>>> +By default, this loss of access means that if an ``xdp_frame`` is redirected and
>>> +then converted to an ``skb``, its ``skb`` fields for hardware-derived metadata
>>> +(like ``skb->hash`` or VLAN info) are not populated from the original
>>> +packet. This can impact features like Generic Receive Offload (GRO).  While XDP
>>> +programs can manually save custom data (e.g., using ``bpf_xdp_adjust_meta()``),
>>> +propagating specific *hardware* RX hints to ``skb`` creation requires using the
>>> +kfuncs described below.
>>> +
>>> +To enable propagating selected hardware RX hints, store BPF kfuncs allow an
>>> +XDP program on the initial NIC to read these hints and then explicitly
>>> +*store* them. The kfuncs place this metadata in locations associated with
>>> +the XDP packet buffer, making it available if an ``skb`` is later built or
>>> +the frame is otherwise processed. For instance, RX hash and VLAN tags are
>>> +stored within the XDP frame's addressable headroom, while RX timestamps are
>>> +typically written to an area corresponding to ``skb_shared_info``.
>>> +
>>> +**Crucially, the BPF programmer must call these "store" kfuncs to save the
>>> +desired hardware hints for propagation.** The system does not do this
>>> +automatically. The NIC driver is responsible for ensuring sufficient headroom is
>>> +available; kfuncs may return ``-ENOSPC`` if space is inadequate for storing
>>> +these hints.
>> 
>> Why not have a new flag for bpf_redirect that transparently stores all
>> available metadata? If you care only about the redirect -> skb case.
>> Might give us more wiggle room in the future to make it work with
>> traits.
>
> Also q from my side: If I understand the proposal correctly, in order to fully
> populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
> to collect the data from the driver descriptors (indirect call), and then yet
> again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
> xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
> meta data aren't you better off switching to tc(x) directly so the driver can
> do all this natively? :/

I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
hope was (back when we added the initial HW metadata support) that we
would be able to inline them to avoid the function call overhead.

That being said, even with half a dozen function calls, that's still a
lot less overhead from going all the way to TC(x). The goal of the use
case here is to do as little work as possible on the CPU that initially
receives the packet, instead moving the network stack processing (and
skb allocation) to a different CPU with cpumap.

So even if the *total* amount of work being done is a bit higher because
of the kfunc overhead, that can still be beneficial because it's split
between two (or more) CPUs.

I'm sure Jesper has some concrete benchmarks for this lying around
somewhere, hopefully he can share those :)

> Also, have you thought about taking the opportunity to generalize the existing
> struct xsk_tx_metadata? It would be nice to actually use the same/similar struct
> for RX and TX, similarly as done in struct virtio_net_hdr. Such that we have
> XDP_{RX,TX}_METADATA and XDP_{RX,TX}MD_FLAGS_* to describe what meta data we
> have and from a developer PoV this will be a nicely consistent API in XDP. Then
> you could store at the right location in the meta data region just with
> bpf_xdp_metadata_* kfuncs (and/or plain BPF code) and finally set XDP_RX_METADATA
> indicator bit.

Wouldn't this make the whole thing (effectively) UAPI?

-Toke


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-10 20:12       ` Toke Høiland-Jørgensen
@ 2025-06-10 22:26         ` Lorenzo Bianconi
  2025-06-11  3:40           ` Stanislav Fomichev
  2025-06-13 11:18         ` [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Daniel Borkmann
  1 sibling, 1 reply; 30+ messages in thread
From: Lorenzo Bianconi @ 2025-06-10 22:26 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Daniel Borkmann, Stanislav Fomichev, Jesper Dangaard Brouer, bpf,
	netdev, Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
	arthur, jakub, Magnus Karlsson, Maciej Fijalkowski

[-- Attachment #1: Type: text/plain, Size: 2802 bytes --]

> Daniel Borkmann <daniel@iogearbox.net> writes:
> 
[...]
> >> 
> >> Why not have a new flag for bpf_redirect that transparently stores all
> >> available metadata? If you care only about the redirect -> skb case.
> >> Might give us more wiggle room in the future to make it work with
> >> traits.
> >
> > Also q from my side: If I understand the proposal correctly, in order to fully
> > populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
> > to collect the data from the driver descriptors (indirect call), and then yet
> > again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
> > xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
> > meta data aren't you better off switching to tc(x) directly so the driver can
> > do all this natively? :/
> 
> I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
> hope was (back when we added the initial HW metadata support) that we
> would be able to inline them to avoid the function call overhead.
> 
> That being said, even with half a dozen function calls, that's still a
> lot less overhead from going all the way to TC(x). The goal of the use
> case here is to do as little work as possible on the CPU that initially
> receives the packet, instead moving the network stack processing (and
> skb allocation) to a different CPU with cpumap.
> 
> So even if the *total* amount of work being done is a bit higher because
> of the kfunc overhead, that can still be beneficial because it's split
> between two (or more) CPUs.
> 
> I'm sure Jesper has some concrete benchmarks for this lying around
> somewhere, hopefully he can share those :)

Another possible approach would be to have some utility functions (not kfuncs)
used to 'store' the hw metadata in the xdp_frame, executed in each driver
codebase before performing XDP_REDIRECT. The downside of this approach is that
we need to parse the hw metadata twice if the eBPF program that is bound to
the NIC is consuming this info. What do you think?

Regards,
Lorenzo

> 
> > Also, have you thought about taking the opportunity to generalize the existing
> > struct xsk_tx_metadata? It would be nice to actually use the same/similar struct
> > for RX and TX, similarly as done in struct virtio_net_hdr. Such that we have
> > XDP_{RX,TX}_METADATA and XDP_{RX,TX}MD_FLAGS_* to describe what meta data we
> > have and from a developer PoV this will be a nicely consistent API in XDP. Then
> > you could store at the right location in the meta data region just with
> > bpf_xdp_metadata_* kfuncs (and/or plain BPF code) and finally set XDP_RX_METADATA
> > indicator bit.
> 
> Wouldn't this make the whole thing (effectively) UAPI?
> 
> -Toke
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-10 22:26         ` Lorenzo Bianconi
@ 2025-06-11  3:40           ` Stanislav Fomichev
  2025-06-13 10:59             ` Jesper Dangaard Brouer
  2025-06-16 12:37             ` Lorenzo Bianconi
  0 siblings, 2 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2025-06-11  3:40 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Toke Høiland-Jørgensen, Daniel Borkmann,
	Jesper Dangaard Brouer, bpf, netdev, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
	Magnus Karlsson, Maciej Fijalkowski

On 06/11, Lorenzo Bianconi wrote:
> > Daniel Borkmann <daniel@iogearbox.net> writes:
> > 
> [...]
> > >> 
> > >> Why not have a new flag for bpf_redirect that transparently stores all
> > >> available metadata? If you care only about the redirect -> skb case.
> > >> Might give us more wiggle room in the future to make it work with
> > >> traits.
> > >
> > > Also q from my side: If I understand the proposal correctly, in order to fully
> > > populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
> > > to collect the data from the driver descriptors (indirect call), and then yet
> > > again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
> > > xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
> > > meta data aren't you better off switching to tc(x) directly so the driver can
> > > do all this natively? :/
> > 
> > I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
> > hope was (back when we added the initial HW metadata support) that we
> > would be able to inline them to avoid the function call overhead.
> > 
> > That being said, even with half a dozen function calls, that's still a
> > lot less overhead from going all the way to TC(x). The goal of the use
> > case here is to do as little work as possible on the CPU that initially
> > receives the packet, instead moving the network stack processing (and
> > skb allocation) to a different CPU with cpumap.
> > 
> > So even if the *total* amount of work being done is a bit higher because
> > of the kfunc overhead, that can still be beneficial because it's split
> > between two (or more) CPUs.
> > 
> > I'm sure Jesper has some concrete benchmarks for this lying around
> > somewhere, hopefully he can share those :)
> 
> Another possible approach would be to have some utility functions (not kfuncs)
> used to 'store' the hw metadata in the xdp_frame that are executed in each
> driver codebase before performing XDP_REDIRECT. The downside of this approach
> is we need to parse the hw metadata twice if the eBPF program that is bounded
> to the NIC is consuming these info. What do you think?

That's the option I was asking about. I'm assuming we should be able
to reuse existing xmo metadata callbacks for this. We should be able
to hide it from the drivers also hopefully.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-11  3:40           ` Stanislav Fomichev
@ 2025-06-13 10:59             ` Jesper Dangaard Brouer
  2025-06-16 15:34               ` Stanislav Fomichev
  2025-06-16 12:37             ` Lorenzo Bianconi
  1 sibling, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-13 10:59 UTC (permalink / raw)
  To: Daniel Borkmann, Stanislav Fomichev, Lorenzo Bianconi
  Cc: Toke Høiland-Jørgensen, Daniel Borkmann, bpf, netdev,
	Jakub Kicinski, Alexei Starovoitov, Eric Dumazet, David S. Miller,
	Paolo Abeni, sdf, kernel-team, arthur, jakub, Magnus Karlsson,
	Maciej Fijalkowski, arzeznik, Yan Zhai




On 11/06/2025 05.40, Stanislav Fomichev wrote:
> On 06/11, Lorenzo Bianconi wrote:
>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>
>> [...]
>>>>>
>>>>> Why not have a new flag for bpf_redirect that transparently stores all
>>>>> available metadata? If you care only about the redirect -> skb case.
>>>>> Might give us more wiggle room in the future to make it work with
>>>>> traits.
>>>>
>>>> Also q from my side: If I understand the proposal correctly, in order to fully
>>>> populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
>>>> to collect the data from the driver descriptors (indirect call), and then yet
>>>> again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
>>>> xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
>>>> meta data aren't you better off switching to tc(x) directly so the driver can
>>>> do all this natively? :/
>>>
>>> I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
>>> hope was (back when we added the initial HW metadata support) that we
>>> would be able to inline them to avoid the function call overhead.
>>>
>>> That being said, even with half a dozen function calls, that's still a
>>> lot less overhead from going all the way to TC(x). The goal of the use
>>> case here is to do as little work as possible on the CPU that initially
>>> receives the packet, instead moving the network stack processing (and
>>> skb allocation) to a different CPU with cpumap.
>>>
>>> So even if the *total* amount of work being done is a bit higher because
>>> of the kfunc overhead, that can still be beneficial because it's split
>>> between two (or more) CPUs.
>>>
>>> I'm sure Jesper has some concrete benchmarks for this lying around
>>> somewhere, hopefully he can share those :)
>>
>> Another possible approach would be to have some utility functions (not kfuncs)
>> used to 'store' the hw metadata in the xdp_frame that are executed in each
>> driver codebase before performing XDP_REDIRECT. The downside of this approach
>> is we need to parse the hw metadata twice if the eBPF program that is bounded
>> to the NIC is consuming these info. What do you think?
> 
> That's the option I was asking about. I'm assuming we should be able
> to reuse existing xmo metadata callbacks for this. We should be able
> to hide it from the drivers also hopefully.

I'm not against this idea of transparently storing all available metadata
into the xdp_frame (via some flag/config), but it does not fit our
production use-case.  I also think that this can be added later.

We need the ability to overwrite the RX-hash value before redirecting the
packet to CPUMAP (remember, as the cover letter describes, the RX-hash is
needed *before* the GRO engine processes the packet in CPUMAP. This is
before TC/BPF).

Our use-case for overwriting the RX-hash value is load-balancing IPSEC
encapsulated tunnel traffic at the XDP stage via CPUMAP redirects.  This is
generally applicable to tunneling, in that we want to store the RX-hash
of the tunnel's inner-headers.  Our IPSEC use-case has a variation where
we only decrypt[1] the first 32 bytes to calc a LB hash over the
inner-headers, and then redirect the original packet to CPUMAP.  The
IPSEC packets travel into a veth device, which we discovered will send
everything on a single RX-queue... because the RX-hash (calc by netstack)
will obviously use the outer-headers, meaning this LB doesn't scale.

I hope this makes it clear why we need the BPF-prog ability to explicitly
"store" the RX-hash in the xdp-frame.

--Jesper

[1] https://docs.ebpf.io/linux/kfuncs/bpf_crypto_decrypt/
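A rough sketch of the tunnel variant of this flow (the IPSEC partial-decrypt
step is elided; inner_headers_hash() is a hypothetical stand-in for whatever
LB hash is computed over the inner headers, and the bpf_xdp_store_rx_hash()
signature is assumed from the patch descriptions):

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    /* Assumed signature for the proposed store kfunc. */
    extern int bpf_xdp_store_rx_hash(struct xdp_md *ctx, __u32 hash,
                                     enum xdp_rss_hash_type rss_type) __ksym;

    #define NR_CPUS 8

    struct {
            __uint(type, BPF_MAP_TYPE_CPUMAP);
            __uint(max_entries, NR_CPUS);
            __type(key, __u32);
            __type(value, struct bpf_cpumap_val);
    } cpu_map SEC(".maps");

    /* Hypothetical placeholder: a real program would locate (and, for
     * IPSEC, partially decrypt) the inner headers and hash the 4-tuple. */
    static __always_inline __u32 inner_headers_hash(struct xdp_md *ctx)
    {
            return 0x2a;    /* stand-in value */
    }

    SEC("xdp")
    int xdp_tunnel_lb(struct xdp_md *ctx)
    {
            __u32 hash = inner_headers_hash(ctx);

            /* Overwrite the RX-hash so veth/GRO see the inner flow, not
             * the identical outer tunnel headers. The hash type here is
             * illustrative only. */
            bpf_xdp_store_rx_hash(ctx, hash, XDP_RSS_TYPE_L4_IPV4_TCP);

            /* Spread flows across CPUs based on the same inner hash. */
            return bpf_redirect_map(&cpu_map, hash % NR_CPUS, 0);
    }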

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-10 20:12       ` Toke Høiland-Jørgensen
  2025-06-10 22:26         ` Lorenzo Bianconi
@ 2025-06-13 11:18         ` Daniel Borkmann
  2025-06-16 11:51           ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Borkmann @ 2025-06-13 11:18 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stanislav Fomichev,
	Jesper Dangaard Brouer
  Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
	kernel-team, arthur, jakub, Magnus Karlsson, Maciej Fijalkowski

On 6/10/25 10:12 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
[...]
>> Also, have you thought about taking the opportunity to generalize the existing
>> struct xsk_tx_metadata? It would be nice to actually use the same/similar struct
>> for RX and TX, similarly as done in struct virtio_net_hdr. Such that we have
>> XDP_{RX,TX}_METADATA and XDP_{RX,TX}MD_FLAGS_* to describe what meta data we
>> have and from a developer PoV this will be a nicely consistent API in XDP. Then
>> you could store at the right location in the meta data region just with
>> bpf_xdp_metadata_* kfuncs (and/or plain BPF code) and finally set XDP_RX_METADATA
>> indicator bit.
> 
> Wouldn't this make the whole thing (effectively) UAPI?

I'm not sure I follow; we already have this in place for the meta data region
in AF_XDP. This would extend the scope to RX as well, so there would be a
similar 'look and feel'. In that sense it is already a developer API which is
used.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-13 11:18         ` [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Daniel Borkmann
@ 2025-06-16 11:51           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-06-16 11:51 UTC (permalink / raw)
  To: Daniel Borkmann, Stanislav Fomichev, Jesper Dangaard Brouer
  Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
	kernel-team, arthur, jakub, Magnus Karlsson, Maciej Fijalkowski

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 6/10/25 10:12 PM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
> [...]
>>> Also, have you thought about taking the opportunity to generalize the existing
>>> struct xsk_tx_metadata? It would be nice to actually use the same/similar struct
>>> for RX and TX, similarly as done in struct virtio_net_hdr. Such that we have
>>> XDP_{RX,TX}_METADATA and XDP_{RX,TX}MD_FLAGS_* to describe what meta data we
>>> have and from a developer PoV this will be a nicely consistent API in XDP. Then
>>> you could store at the right location in the meta data region just with
>>> bpf_xdp_metadata_* kfuncs (and/or plain BPF code) and finally set XDP_RX_METADATA
>>> indicator bit.
>> 
>> Wouldn't this make the whole thing (effectively) UAPI?
>
> I'm not sure I follow, we already have this in place for the meta data
> region in AF_XDP, this would extend the scope to RX as well, so there
> would be a similar 'look and feel' in that sense it is already a
> developer API which is used.

Right, but with this series, the format of struct xdp_rx_meta is an
internal kernel API that we can change whenever we want. Whereas
exposing it to AF_XDP would make it a UAPI contract, no? IIRC, the
whole point of using kfuncs to extract the metadata from the drivers was
to avoid defining a UAPI format. This does make things a bit more
cumbersome for AF_XDP, but if we are going to expose a struct format for
this we might as well get rid of the whole kfunc machinery and just have
the drivers populate the struct before executing XDP?

Or am I misunderstanding what you're proposing?

-Toke


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-11  3:40           ` Stanislav Fomichev
  2025-06-13 10:59             ` Jesper Dangaard Brouer
@ 2025-06-16 12:37             ` Lorenzo Bianconi
  2025-06-16 15:45               ` Stanislav Fomichev
  1 sibling, 1 reply; 30+ messages in thread
From: Lorenzo Bianconi @ 2025-06-16 12:37 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, Daniel Borkmann,
	Jesper Dangaard Brouer, bpf, netdev, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
	Magnus Karlsson, Maciej Fijalkowski

[-- Attachment #1: Type: text/plain, Size: 2981 bytes --]

On Jun 10, Stanislav Fomichev wrote:
> On 06/11, Lorenzo Bianconi wrote:
> > > Daniel Borkmann <daniel@iogearbox.net> writes:
> > > 
> > [...]
> > > >> 
> > > >> Why not have a new flag for bpf_redirect that transparently stores all
> > > >> available metadata? If you care only about the redirect -> skb case.
> > > >> Might give us more wiggle room in the future to make it work with
> > > >> traits.
> > > >
> > > > Also q from my side: If I understand the proposal correctly, in order to fully
> > > > populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
> > > > to collect the data from the driver descriptors (indirect call), and then yet
> > > > again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
> > > > xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
> > > > meta data aren't you better off switching to tc(x) directly so the driver can
> > > > do all this natively? :/
> > > 
> > > I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
> > > hope was (back when we added the initial HW metadata support) that we
> > > would be able to inline them to avoid the function call overhead.
> > > 
> > > That being said, even with half a dozen function calls, that's still a
> > > lot less overhead from going all the way to TC(x). The goal of the use
> > > case here is to do as little work as possible on the CPU that initially
> > > receives the packet, instead moving the network stack processing (and
> > > skb allocation) to a different CPU with cpumap.
> > > 
> > > So even if the *total* amount of work being done is a bit higher because
> > > of the kfunc overhead, that can still be beneficial because it's split
> > > between two (or more) CPUs.
> > > 
> > > I'm sure Jesper has some concrete benchmarks for this lying around
> > > somewhere, hopefully he can share those :)
> > 
> > Another possible approach would be to have some utility functions (not kfuncs)
> > used to 'store' the hw metadata in the xdp_frame that are executed in each
> > driver codebase before performing XDP_REDIRECT. The downside of this approach
> > is we need to parse the hw metadata twice if the eBPF program that is bounded
> > to the NIC is consuming these info. What do you think?
> 
> That's the option I was asking about. I'm assuming we should be able
> to reuse existing xmo metadata callbacks for this. We should be able
> to hide it from the drivers also hopefully.

If we move the hw metadata 'store' operations to the driver codebase (running
xmo metadata callbacks before performing XDP_REDIRECT), we will parse the hw
metadata twice if we attach to the NIC an AF_XDP program consuming the hw
metadata, right? One parsing is done by the AF_XDP hw metadata kfunc, and the
second one would be performed by the native driver codebase.  Moreover, this
approach seems less flexible. What do you think?

Regards,
Lorenzo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-13 10:59             ` Jesper Dangaard Brouer
@ 2025-06-16 15:34               ` Stanislav Fomichev
  2025-06-17 16:15                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2025-06-16 15:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Lorenzo Bianconi,
	Toke Høiland-Jørgensen, Daniel Borkmann, bpf, netdev,
	Jakub Kicinski, Alexei Starovoitov, Eric Dumazet, David S. Miller,
	Paolo Abeni, sdf, kernel-team, arthur, jakub, Magnus Karlsson,
	Maciej Fijalkowski, arzeznik, Yan Zhai

On 06/13, Jesper Dangaard Brouer wrote:
> 
> 
> 
> On 11/06/2025 05.40, Stanislav Fomichev wrote:
> > On 06/11, Lorenzo Bianconi wrote:
> > > > Daniel Borkmann <daniel@iogearbox.net> writes:
> > > > 
> > > [...]
> > > > > > 
> > > > > > Why not have a new flag for bpf_redirect that transparently stores all
> > > > > > available metadata? If you care only about the redirect -> skb case.
> > > > > > Might give us more wiggle room in the future to make it work with
> > > > > > traits.
> > > > > 
> > > > > Also q from my side: If I understand the proposal correctly, in order to fully
> > > > > populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
> > > > > to collect the data from the driver descriptors (indirect call), and then yet
> > > > > again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
> > > > > xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
> > > > > meta data aren't you better off switching to tc(x) directly so the driver can
> > > > > do all this natively? :/
> > > > 
> > > > I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
> > > > hope was (back when we added the initial HW metadata support) that we
> > > > would be able to inline them to avoid the function call overhead.
> > > > 
> > > > That being said, even with half a dozen function calls, that's still a
> > > > lot less overhead from going all the way to TC(x). The goal of the use
> > > > case here is to do as little work as possible on the CPU that initially
> > > > receives the packet, instead moving the network stack processing (and
> > > > skb allocation) to a different CPU with cpumap.
> > > > 
> > > > So even if the *total* amount of work being done is a bit higher because
> > > > of the kfunc overhead, that can still be beneficial because it's split
> > > > between two (or more) CPUs.
> > > > 
> > > > I'm sure Jesper has some concrete benchmarks for this lying around
> > > > somewhere, hopefully he can share those :)
> > > 
> > > Another possible approach would be to have some utility functions (not kfuncs)
> > > used to 'store' the hw metadata in the xdp_frame that are executed in each
> > > driver codebase before performing XDP_REDIRECT. The downside of this approach
> > > is we need to parse the hw metadata twice if the eBPF program that is bounded
> > > to the NIC is consuming these info. What do you think?
> > 
> > That's the option I was asking about. I'm assuming we should be able
> > to reuse existing xmo metadata callbacks for this. We should be able
> > to hide it from the drivers also hopefully.
> 
> I'm not against this idea of transparently stores all available metadata
> into the xdp_frame (via some flag/config), but it does not fit our
> production use-case.  I also think that this can be added later.
> 
> We need the ability to overwrite the RX-hash value, before redirecting
> packet to CPUMAP (remember as cover-letter describe RX-hash needed
> *before* the GRO engine processes the packet in CPUMAP. This is before
> TC/BPF).

Makes sense. Can we make GRO not flush a bucket for same_flow=0 instead?
This will also make it work better for other regular tunneled traffic.
Setting the hash in BPF to make GRO go fast seems too implementation specific :-(

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-16 12:37             ` Lorenzo Bianconi
@ 2025-06-16 15:45               ` Stanislav Fomichev
  2025-06-16 16:40                 ` Lorenzo Bianconi
  0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2025-06-16 15:45 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Toke Høiland-Jørgensen, Daniel Borkmann,
	Jesper Dangaard Brouer, bpf, netdev, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
	Magnus Karlsson, Maciej Fijalkowski

On 06/16, Lorenzo Bianconi wrote:
> On Jun 10, Stanislav Fomichev wrote:
> > On 06/11, Lorenzo Bianconi wrote:
> > > > Daniel Borkmann <daniel@iogearbox.net> writes:
> > > > 
> > > [...]
> > > > >> 
> > > > >> Why not have a new flag for bpf_redirect that transparently stores all
> > > > >> available metadata? If you care only about the redirect -> skb case.
> > > > >> Might give us more wiggle room in the future to make it work with
> > > > >> traits.
> > > > >
> > > > > Also q from my side: If I understand the proposal correctly, in order to fully
> > > > > populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
> > > > > to collect the data from the driver descriptors (indirect call), and then yet
> > > > > again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
> > > > > xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
> > > > > meta data aren't you better off switching to tc(x) directly so the driver can
> > > > > do all this natively? :/
> > > > 
> > > > I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
> > > > hope was (back when we added the initial HW metadata support) that we
> > > > would be able to inline them to avoid the function call overhead.
> > > > 
> > > > That being said, even with half a dozen function calls, that's still a
> > > > lot less overhead from going all the way to TC(x). The goal of the use
> > > > case here is to do as little work as possible on the CPU that initially
> > > > receives the packet, instead moving the network stack processing (and
> > > > skb allocation) to a different CPU with cpumap.
> > > > 
> > > > So even if the *total* amount of work being done is a bit higher because
> > > > of the kfunc overhead, that can still be beneficial because it's split
> > > > between two (or more) CPUs.
> > > > 
> > > > I'm sure Jesper has some concrete benchmarks for this lying around
> > > > somewhere, hopefully he can share those :)
> > > 
> > > Another possible approach would be to have some utility functions (not kfuncs)
> > > used to 'store' the hw metadata in the xdp_frame that are executed in each
> > > driver codebase before performing XDP_REDIRECT. The downside of this approach
> > > is we need to parse the hw metadata twice if the eBPF program that is bounded
> > > to the NIC is consuming these info. What do you think?
> > 
> > That's the option I was asking about. I'm assuming we should be able
> > to reuse existing xmo metadata callbacks for this. We should be able
> > to hide it from the drivers also hopefully.
> 
> If we move the hw metadata 'store' operations to the driver codebase (running
> xmo metadata callbacks before performing XDP_REDIRECT), we will parse the hw
> metadata twice if we attach to the NIC an AF_XDP program consuming the hw
> metadata, right? One parsing is done by the AF_XDP hw metadata kfunc, and the
> second one would be performed by the native driver codebase.

The native driver codebase will parse the hw metadata only if the
bpf_redirect set some flag, so unless I'm missing something, there
should not be double parsing. (but it's all user controlled, so doesn't
sound like a problem?)

> Moreover, this approach seems less flexible. What do you think?

Agreed on the flexibility. Just trying to understand whether we really
need that flexibility. My worry is that we might expose too much of
the stack's internals with this and introduce some unexpected
dependencies. The things like Jesper mentioned in another thread:
set skb->hash before redirect to make GRO go fast... We either have
to make the stack more robust (my preference), or document these
cases clearly and have test coverage to avoid breakage in the future.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-16 15:45               ` Stanislav Fomichev
@ 2025-06-16 16:40                 ` Lorenzo Bianconi
  2025-06-17 11:50                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 30+ messages in thread
From: Lorenzo Bianconi @ 2025-06-16 16:40 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, Daniel Borkmann,
	Jesper Dangaard Brouer, bpf, netdev, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
	Magnus Karlsson, Maciej Fijalkowski

[-- Attachment #1: Type: text/plain, Size: 4118 bytes --]

> On 06/16, Lorenzo Bianconi wrote:
> > On Jun 10, Stanislav Fomichev wrote:
> > > On 06/11, Lorenzo Bianconi wrote:
> > > > > Daniel Borkmann <daniel@iogearbox.net> writes:
> > > > > 
> > > > [...]
> > > > > >> 
> > > > > >> Why not have a new flag for bpf_redirect that transparently stores all
> > > > > >> available metadata? If you care only about the redirect -> skb case.
> > > > > >> Might give us more wiggle room in the future to make it work with
> > > > > >> traits.
> > > > > >
> > > > > > Also q from my side: If I understand the proposal correctly, in order to fully
> > > > > > populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
> > > > > > to collect the data from the driver descriptors (indirect call), and then yet
> > > > > > again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
> > > > > > xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
> > > > > > meta data aren't you better off switching to tc(x) directly so the driver can
> > > > > > do all this natively? :/
> > > > > 
> > > > > I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
> > > > > hope was (back when we added the initial HW metadata support) that we
> > > > > would be able to inline them to avoid the function call overhead.
> > > > > 
> > > > > That being said, even with half a dozen function calls, that's still a
> > > > > lot less overhead from going all the way to TC(x). The goal of the use
> > > > > case here is to do as little work as possible on the CPU that initially
> > > > > receives the packet, instead moving the network stack processing (and
> > > > > skb allocation) to a different CPU with cpumap.
> > > > > 
> > > > > So even if the *total* amount of work being done is a bit higher because
> > > > > of the kfunc overhead, that can still be beneficial because it's split
> > > > > between two (or more) CPUs.
> > > > > 
> > > > > I'm sure Jesper has some concrete benchmarks for this lying around
> > > > > somewhere, hopefully he can share those :)
> > > > 
> > > > Another possible approach would be to have some utility functions (not kfuncs)
> > > > used to 'store' the hw metadata in the xdp_frame that are executed in each
> > > > driver codebase before performing XDP_REDIRECT. The downside of this approach
> > > > is we need to parse the hw metadata twice if the eBPF program that is bounded
> > > > to the NIC is consuming these info. What do you think?
> > > 
> > > That's the option I was asking about. I'm assuming we should be able
> > > to reuse existing xmo metadata callbacks for this. We should be able
> > > to hide it from the drivers also hopefully.
> > 
> > If we move the hw metadata 'store' operations to the driver codebase (running
> > xmo metadata callbacks before performing XDP_REDIRECT), we will parse the hw
> > metadata twice if we attach to the NIC an AF_XDP program consuming the hw
> > metadata, right? One parsing is done by the AF_XDP hw metadata kfunc, and the
> > second one would be performed by the native driver codebase.
> 
> The native driver codebase will parse the hw metadata only if the
> bpf_redirect set some flag, so unless I'm missing something, there
> should not be double parsing. (but it's all user controlled, so doesn't
> sound like a problem?)

I do not have a strong opinion about it, I guess it is fine, but I am not
100% sure if it fits in Jesper's use case.
@Jesper: any input on it?

Regards,
Lorenzo

> 
> > Moreover, this approach seems less flexible. What do you think?
> 
> Agreed on the flexibility. Just trying to understand whether we really
> need that flexibility. My worry is that we might expose too much of
> the stack's internals with this and introduce some unexpected
> dependencies. The things like Jesper mentioned in another thread:
> set skb->hash before redirect to make GRO go fast... We either have
> to make the stack more robust (my preference), or document these
> cases clearly and have test coverage to avoid breakage in the future.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff
  2025-06-03 17:46 ` [PATCH bpf-next V1 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff Jesper Dangaard Brouer
@ 2025-06-16 21:55   ` Jakub Kicinski
  0 siblings, 0 replies; 30+ messages in thread
From: Jakub Kicinski @ 2025-06-16 21:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, netdev, lorenzo, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
	arthur, jakub

On Tue, 03 Jun 2025 19:46:08 +0200 Jesper Dangaard Brouer wrote:
> Introduce the following kfuncs to store hw metadata provided by the NIC
> into the xdp_buff struct:
> 
> - rx-hash: bpf_xdp_store_rx_hash
> - rx-vlan: bpf_xdp_store_rx_vlan
> - rx-hw-ts: bpf_xdp_store_rx_ts

My mental model would be that these should operate within the "remote XDP".
We have helpers to "get the metadata"; now we're adding helpers to "set
the metadata". Should the setting not simply be the inverse of the
getting? Store to a driver-specific format?

-> local driver - all the metadata in HW-specific format / descriptors
  -> local XDP - GET and copy data to locally defined md prepend
-> remote drv - CPU map, veth) hooks in its own callbacks and md struct
  -> remote XDP - read the data from md prepend and call SET
-> remote drv - check if XDP set any md in its struct and uses it when
                converting to skb.

Note, this is just the model for the API. We can always introduce
optimizations, like a feature flag that makes the "local" driver output
a well-known MD struct and has the remote driver pick that up, with
neither XDP program calling the helpers.
But from the model perspective, since the GET operations are local to
the driver owning the frame at the time, the SET operations should also
operate on metadata local to the owning driver.

That's my $0.02
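Purely to illustrate the shape of that model, a sketch (the GET kfunc and
bpf_xdp_adjust_meta() exist today; the remote-side SET call is hypothetical,
standing in for whatever setter the remote driver would hook in):

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
                                        enum xdp_rss_hash_type *rss_type) __ksym;
    /* Hypothetical remote-side setter, following the model above. */
    extern int bpf_xdp_store_rx_hash(struct xdp_md *ctx, __u32 hash,
                                     enum xdp_rss_hash_type rss_type) __ksym;

    struct {
            __uint(type, BPF_MAP_TYPE_CPUMAP);
            __uint(max_entries, 64);
            __type(key, __u32);
            __type(value, struct bpf_cpumap_val);
    } cpu_map SEC(".maps");

    struct md_prepend {             /* locally defined md prepend */
            __u32 hash;
            __u32 rss_type;
    };

    /* Local XDP on the physical NIC: GET and copy into the md prepend. */
    SEC("xdp")
    int local_xdp(struct xdp_md *ctx)
    {
            enum xdp_rss_hash_type rss_type = XDP_RSS_TYPE_NONE;
            struct md_prepend *md;
            __u32 hash = 0;

            if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*md)))
                    return XDP_PASS;
            md = (void *)(long)ctx->data_meta;
            if ((void *)(md + 1) > (void *)(long)ctx->data)
                    return XDP_PASS;

            bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type);
            md->hash = hash;
            md->rss_type = rss_type;
            return bpf_redirect_map(&cpu_map, 0, 0);
    }

    /* Remote XDP (cpumap side): read the prepend and SET into metadata
     * owned by the remote driver, before the skb is built. */
    SEC("xdp/cpumap")
    int remote_xdp(struct xdp_md *ctx)
    {
            struct md_prepend *md = (void *)(long)ctx->data_meta;

            if ((void *)(md + 1) > (void *)(long)ctx->data)
                    return XDP_PASS;
            bpf_xdp_store_rx_hash(ctx, md->hash, md->rss_type);
            return XDP_PASS;
    }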

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-16 16:40                 ` Lorenzo Bianconi
@ 2025-06-17 11:50                   ` Toke Høiland-Jørgensen
  2025-06-17 14:47                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-06-17 11:50 UTC (permalink / raw)
  To: Lorenzo Bianconi, Stanislav Fomichev
  Cc: Daniel Borkmann, Jesper Dangaard Brouer, bpf, netdev,
	Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
	Magnus Karlsson, Maciej Fijalkowski

Lorenzo Bianconi <lorenzo@kernel.org> writes:

>> On 06/16, Lorenzo Bianconi wrote:
>> > On Jun 10, Stanislav Fomichev wrote:
>> > > On 06/11, Lorenzo Bianconi wrote:
>> > > > > Daniel Borkmann <daniel@iogearbox.net> writes:
>> > > > > 
>> > > > [...]
>> > > > > >> 
>> > > > > >> Why not have a new flag for bpf_redirect that transparently stores all
>> > > > > >> available metadata? If you care only about the redirect -> skb case.
>> > > > > >> Might give us more wiggle room in the future to make it work with
>> > > > > >> traits.
>> > > > > >
>> > > > > > Also q from my side: If I understand the proposal correctly, in order to fully
>> > > > > > populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
>> > > > > > to collect the data from the driver descriptors (indirect call), and then yet
>> > > > > > again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
>> > > > > > xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
>> > > > > > meta data aren't you better off switching to tc(x) directly so the driver can
>> > > > > > do all this natively? :/
>> > > > > 
>> > > > > I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
>> > > > > hope was (back when we added the initial HW metadata support) that we
>> > > > > would be able to inline them to avoid the function call overhead.
>> > > > > 
>> > > > > That being said, even with half a dozen function calls, that's still a
>> > > > > lot less overhead from going all the way to TC(x). The goal of the use
>> > > > > case here is to do as little work as possible on the CPU that initially
>> > > > > receives the packet, instead moving the network stack processing (and
>> > > > > skb allocation) to a different CPU with cpumap.
>> > > > > 
>> > > > > So even if the *total* amount of work being done is a bit higher because
>> > > > > of the kfunc overhead, that can still be beneficial because it's split
>> > > > > between two (or more) CPUs.
>> > > > > 
>> > > > > I'm sure Jesper has some concrete benchmarks for this lying around
>> > > > > somewhere, hopefully he can share those :)
>> > > > 
>> > > > Another possible approach would be to have some utility functions (not kfuncs)
>> > > > used to 'store' the hw metadata in the xdp_frame that are executed in each
>> > > > driver codebase before performing XDP_REDIRECT. The downside of this approach
>> > > > is we need to parse the hw metadata twice if the eBPF program that is bounded
>> > > > to the NIC is consuming these info. What do you think?
>> > > 
>> > > That's the option I was asking about. I'm assuming we should be able
>> > > to reuse existing xmo metadata callbacks for this. We should be able
>> > > to hide it from the drivers also hopefully.
>> > 
>> > If we move the hw metadata 'store' operations to the driver codebase (running
>> > xmo metadata callbacks before performing XDP_REDIRECT), we will parse the hw
>> > metadata twice if we attach to the NIC an AF_XDP program consuming the hw
>> > metadata, right? One parsing is done by the AF_XDP hw metadata kfunc, and the
>> > second one would be performed by the native driver codebase.
>> 
>> The native driver codebase will parse the hw metadata only if the
>> bpf_redirect set some flag, so unless I'm missing something, there
>> should not be double parsing. (but it's all user controlled, so doesn't
>> sound like a problem?)
>
> I do not have a strong opinion about it, I guess it is fine, but I am not
> 100% sure if it fits in Jesper's use case.
> @Jesper: any input on it?

FWIW, one of the selling points of XDP is (IMO) that it allows you to
basically override any processing the stack does. I think this should
apply to hardware metadata as well (for instance, if the HW metadata
indicates that a packet is TCP, and XDP performs encapsulation before
PASSing it, the metadata should be overridden to reflect this).

So if the driver populates these fields natively, I think this should
either happen before the XDP program is run (so it can be overridden),
or it should check if the XDP program already set the values and leave
them be if so. Both of those obviously incur overhead; not sure which
would be more expensive, though...

-Toke


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-17 11:50                   ` Toke Høiland-Jørgensen
@ 2025-06-17 14:47                     ` Jesper Dangaard Brouer
  2025-06-17 15:10                       ` Performance impact of disabling VLAN offload [was: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst] Toke Høiland-Jørgensen
  0 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-17 14:47 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Lorenzo Bianconi,
	Stanislav Fomichev
  Cc: Daniel Borkmann, bpf, netdev, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
	kernel-team, arthur, jakub, Magnus Karlsson, Maciej Fijalkowski



On 17/06/2025 13.50, Toke Høiland-Jørgensen wrote:
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> 
>>> On 06/16, Lorenzo Bianconi wrote:
>>>> On Jun 10, Stanislav Fomichev wrote:
>>>>> On 06/11, Lorenzo Bianconi wrote:
>>>>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>>>>>
>>>>>> [...]
>>>>>>>>>
>>>>>>>>> Why not have a new flag for bpf_redirect that transparently stores all
>>>>>>>>> available metadata? If you care only about the redirect -> skb case.
>>>>>>>>> Might give us more wiggle room in the future to make it work with
>>>>>>>>> traits.
>>>>>>>>
>>>>>>>> Also q from my side: If I understand the proposal correctly, in order to fully
>>>>>>>> populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
>>>>>>>> to collect the data from the driver descriptors (indirect call), and then yet
>>>>>>>> again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
>>>>>>>> xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
>>>>>>>> meta data aren't you better off switching to tc(x) directly so the driver can
>>>>>>>> do all this natively? :/
>>>>>>>
>>>>>>> I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
>>>>>>> hope was (back when we added the initial HW metadata support) that we
>>>>>>> would be able to inline them to avoid the function call overhead.
>>>>>>>
>>>>>>> That being said, even with half a dozen function calls, that's still a
>>>>>>> lot less overhead from going all the way to TC(x). The goal of the use
>>>>>>> case here is to do as little work as possible on the CPU that initially
>>>>>>> receives the packet, instead moving the network stack processing (and
>>>>>>> skb allocation) to a different CPU with cpumap.
>>>>>>>
>>>>>>> So even if the *total* amount of work being done is a bit higher because
>>>>>>> of the kfunc overhead, that can still be beneficial because it's split
>>>>>>> between two (or more) CPUs.
>>>>>>>
>>>>>>> I'm sure Jesper has some concrete benchmarks for this lying around
>>>>>>> somewhere, hopefully he can share those :)
>>>>>>
>>>>>> Another possible approach would be to have some utility functions (not kfuncs)
>>>>>> used to 'store' the hw metadata in the xdp_frame that are executed in each
>>>>>> driver codebase before performing XDP_REDIRECT. The downside of this approach
>>>>>> is we need to parse the hw metadata twice if the eBPF program that is bounded
>>>>>> to the NIC is consuming these info. What do you think?
>>>>>
>>>>> That's the option I was asking about. I'm assuming we should be able
>>>>> to reuse existing xmo metadata callbacks for this. We should be able
>>>>> to hide it from the drivers also hopefully.
>>>>
>>>> If we move the hw metadata 'store' operations to the driver codebase (running
>>>> xmo metadata callbacks before performing XDP_REDIRECT), we will parse the hw
>>>> metadata twice if we attach to the NIC an AF_XDP program consuming the hw
>>>> metadata, right? One parsing is done by the AF_XDP hw metadata kfunc, and the
>>>> second one would be performed by the native driver codebase.
>>>
>>> The native driver codebase will parse the hw metadata only if the
>>> bpf_redirect set some flag, so unless I'm missing something, there
>>> should not be double parsing. (but it's all user controlled, so doesn't
>>> sound like a problem?)
>>
>> I do not have a strong opinion about it, I guess it is fine, but I am not
>> 100% sure if it fits in Jesper's use case.
>> @Jesper: any input on it?
> 
> FWIW, one of the selling points of XDP is (IMO) that it allows you to
> basically override any processing the stack does. I think this should
> apply to hardware metadata as well (for instance, if the HW metadata
> indicates that a packet is TCP, and XDP performs encapsulation before
> PASSing it, the metadata should be overridden to reflect this).

I fully agree :-)

> So if the driver populates these fields natively, I think this should
> either happen before the XDP program is run (so it can be overridden),
> or it should check if the XDP program already set the values and leave
> them be if so. Both of those obviously incur overhead; not sure which
> would be more expensive, though...
> 

Yes, if the XDP BPF-prog chooses to override a field, then its value
should "win". As I explained in [0], this is our first main production
use-case: to override the RX-hash with the tunnel inner-headers.

  [0] https://lore.kernel.org/all/ca38f2ed-999f-4ce1-8035-8ee9247f27f2@kernel.org/

Later we will look at using the vlan tag. Today we have disabled HW
vlan-offloading, because XDP originally didn't support accessing HW vlan
tags.  The next hurdle for us is our usage of tail-calls, which cannot
access the HW RX-metadata via the "get" kfuncs.  Not part of this
patchset, but we have considered whether, if someone calls the "store"
kfunc, tail-calls invoking the "get" kfunc could return the stored value
(even though it is not device bound).

The last field is the HW timestamp, which we also don't currently use.
Notice that this patchset stores the timestamp in the skb_shared_info
area. Storing in this area will likely cause a cache-miss. Thus, if the
driver populates these fields natively and automatically, then it will
be a performance concern.

For now, I just need to override the RX-hash and have this survive to
CPUMAP (+veth).  Adding some flag that stores all three HW metadata
fields natively in the driver is something that we can add later, IMHO.

--Jesper



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Performance impact of disabling VLAN offload [was: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst]
  2025-06-17 14:47                     ` Jesper Dangaard Brouer
@ 2025-06-17 15:10                       ` Toke Høiland-Jørgensen
  2025-06-19 12:09                         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-06-17 15:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Lorenzo Bianconi, Stanislav Fomichev
  Cc: Daniel Borkmann, bpf, netdev, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
	kernel-team, arthur, jakub, Magnus Karlsson, Maciej Fijalkowski

> Later we will look at using the vlan tag. Today we have disabled HW
> vlan-offloading, because XDP originally didn't support accessing HW vlan
> tags.

Side note (with changed subject to disambiguate): Do you have any data
on the performance impact of disabling VLAN offload that you can share?
I've been sort of wondering whether saving those couple of bytes has any
measurable impact on real workloads (where you end up looking at the
headers anyway, so saving the cache miss doesn't matter so much)?

-Toke


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-16 15:34               ` Stanislav Fomichev
@ 2025-06-17 16:15                 ` Jesper Dangaard Brouer
  2025-06-17 20:47                   ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-17 16:15 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Daniel Borkmann, Lorenzo Bianconi,
	Toke Høiland-Jørgensen, Daniel Borkmann, bpf, netdev,
	Jakub Kicinski, Alexei Starovoitov, Eric Dumazet, David S. Miller,
	Paolo Abeni, sdf, kernel-team, arthur, jakub, Magnus Karlsson,
	Maciej Fijalkowski, arzeznik, Yan Zhai



On 16/06/2025 17.34, Stanislav Fomichev wrote:
> On 06/13, Jesper Dangaard Brouer wrote:
>>
>> On 11/06/2025 05.40, Stanislav Fomichev wrote:
>>> On 06/11, Lorenzo Bianconi wrote:
>>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>>>
>>>> [...]
>>>>>>>
>>>>>>> Why not have a new flag for bpf_redirect that transparently stores all
>>>>>>> available metadata? If you care only about the redirect -> skb case.
>>>>>>> Might give us more wiggle room in the future to make it work with
>>>>>>> traits.
>>>>>>
>>>>>> Also q from my side: If I understand the proposal correctly, in order to fully
>>>>>> populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
>>>>>> to collect the data from the driver descriptors (indirect call), and then yet
>>>>>> again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
>>>>>> xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
>>>>>> meta data aren't you better off switching to tc(x) directly so the driver can
>>>>>> do all this natively? :/
>>>>>
>>>>> I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
>>>>> hope was (back when we added the initial HW metadata support) that we
>>>>> would be able to inline them to avoid the function call overhead.
>>>>>
>>>>> That being said, even with half a dozen function calls, that's still a
>>>>> lot less overhead from going all the way to TC(x). The goal of the use
>>>>> case here is to do as little work as possible on the CPU that initially
>>>>> receives the packet, instead moving the network stack processing (and
>>>>> skb allocation) to a different CPU with cpumap.
>>>>>
>>>>> So even if the *total* amount of work being done is a bit higher because
>>>>> of the kfunc overhead, that can still be beneficial because it's split
>>>>> between two (or more) CPUs.
>>>>>
>>>>> I'm sure Jesper has some concrete benchmarks for this lying around
>>>>> somewhere, hopefully he can share those :)
>>>>
>>>> Another possible approach would be to have some utility functions (not kfuncs)
>>>> used to 'store' the hw metadata in the xdp_frame that are executed in each
>>>> driver codebase before performing XDP_REDIRECT. The downside of this approach
>>>> is we need to parse the hw metadata twice if the eBPF program that is bounded
>>>> to the NIC is consuming these info. What do you think?
>>>
>>> That's the option I was asking about. I'm assuming we should be able
>>> to reuse existing xmo metadata callbacks for this. We should be able
>>> to hide it from the drivers also hopefully.
>>
>> I'm not against this idea of transparently stores all available metadata
>> into the xdp_frame (via some flag/config), but it does not fit our
>> production use-case.  I also think that this can be added later.
>>
>> We need the ability to overwrite the RX-hash value, before redirecting
>> packet to CPUMAP (remember as cover-letter describe RX-hash needed
>> *before* the GRO engine processes the packet in CPUMAP. This is before
>> TC/BPF).
> 
> Make sense. Can we make GRO not flush a bucket for same_flow=0 instead?
> This will also make it work better for other regular tunneled traffic.
> Setting hash in BPF to make GRO go fast seems too implementation specific :-(

I feel misunderstood here.  This was a GRO side-note to remind reviewers
that the netstack expects RX-hash to be non-zero at napi_gro_receive().
This is not about making GRO faster, but about complying with the netstack.

The important BPF optimization is the part that you forgot to quote in
the reply, so let me reproduce what I wrote below.  TL;DR: the RX-hash
needs to be based on the tunnel inner-headers, else the SW hash calc over
the outer-headers will land everything on the same veth RX-queue.

On 13/06/2025 12.59, Jesper Dangaard Brouer wrote:
 >
>> Our use-case for overwriting the RX-hash value is load-balancing
>> IPSEC encapsulated tunnel traffic at the XDP stage via CPUMAP redirects.
>> This is generally applicable to tunneling, in that we want to store
>> the RX-hash of the tunnel's inner-headers.  Our IPSEC use-case has a
>> variation in that we only decrypt[1] the first 32 bytes to calc a LB
>> hash over the inner-headers, and then redirect the original packet to
>> CPUMAP.  The IPSEC packets travel into a veth device, which we
>> discovered will send everything on a single RX-queue... because the
>> RX-hash (calculated by the netstack) will obviously use the
>> outer-headers, meaning this LB doesn't scale.
>>
>> I hope this makes it clear why we need the BPF-prog ability to
>> explicitly "store" the RX-hash in the xdp-frame.
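
For illustration, a minimal BPF-C sketch of the flow described above.
Note that bpf_xdp_store_rx_hash() is an *assumed* name/signature along
the lines of the bpf_xdp_store_rx_* kfuncs in this series, and the
inner-header hash computation is only a placeholder:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Assumed kfunc from this series (name/signature illustrative only). */
extern int bpf_xdp_store_rx_hash(struct xdp_md *ctx, __u32 hash) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

static __always_inline __u32 inner_headers_hash(struct xdp_md *ctx)
{
	/* Placeholder: compute a flow hash over the (partially decrypted)
	 * tunnel inner-headers, e.g. the first 32 bytes of IPSEC payload.
	 */
	return 0x12345678;
}

SEC("xdp")
int lb_cpumap_redirect(struct xdp_md *ctx)
{
	__u32 hash = inner_headers_hash(ctx);

	/* Overwrite the RX-hash carried in the xdp_frame, so the SKB built
	 * on the remote cpumap CPU gets skb->hash from the inner-headers
	 * instead of the outer (IPSEC) headers.
	 */
	bpf_xdp_store_rx_hash(ctx, hash);

	return bpf_redirect_map(&cpu_map, hash % 64, 0);
}

char _license[] SEC("license") = "GPL";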

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
  2025-06-17 16:15                 ` Jesper Dangaard Brouer
@ 2025-06-17 20:47                   ` Stanislav Fomichev
  0 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2025-06-17 20:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Lorenzo Bianconi,
	Toke Høiland-Jørgensen, Daniel Borkmann, bpf, netdev,
	Jakub Kicinski, Alexei Starovoitov, Eric Dumazet, David S. Miller,
	Paolo Abeni, sdf, kernel-team, arthur, jakub, Magnus Karlsson,
	Maciej Fijalkowski, arzeznik, Yan Zhai

On 06/17, Jesper Dangaard Brouer wrote:
> 
> 
> On 16/06/2025 17.34, Stanislav Fomichev wrote:
> > On 06/13, Jesper Dangaard Brouer wrote:
> > > 
> > > On 11/06/2025 05.40, Stanislav Fomichev wrote:
> > > > On 06/11, Lorenzo Bianconi wrote:
> > > > > > Daniel Borkmann <daniel@iogearbox.net> writes:
> > > > > > 
> > > > > [...]
> > > > > > > > 
> > > > > > > > Why not have a new flag for bpf_redirect that transparently stores all
> > > > > > > > available metadata? If you care only about the redirect -> skb case.
> > > > > > > > Might give us more wiggle room in the future to make it work with
> > > > > > > > traits.
> > > > > > > 
> > > > > > > Also q from my side: If I understand the proposal correctly, in order to fully
> > > > > > > populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
> > > > > > > to collect the data from the driver descriptors (indirect call), and then yet
> > > > > > > again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
> > > > > > > xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
> > > > > > > meta data aren't you better off switching to tc(x) directly so the driver can
> > > > > > > do all this natively? :/
> > > > > > 
> > > > > > I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
> > > > > > hope was (back when we added the initial HW metadata support) that we
> > > > > > would be able to inline them to avoid the function call overhead.
> > > > > > 
> > > > > > That being said, even with half a dozen function calls, that's still a
> > > > > > lot less overhead from going all the way to TC(x). The goal of the use
> > > > > > case here is to do as little work as possible on the CPU that initially
> > > > > > receives the packet, instead moving the network stack processing (and
> > > > > > skb allocation) to a different CPU with cpumap.
> > > > > > 
> > > > > > So even if the *total* amount of work being done is a bit higher because
> > > > > > of the kfunc overhead, that can still be beneficial because it's split
> > > > > > between two (or more) CPUs.
> > > > > > 
> > > > > > I'm sure Jesper has some concrete benchmarks for this lying around
> > > > > > somewhere, hopefully he can share those :)
> > > > > 
> > > > > Another possible approach would be to have some utility functions (not kfuncs)
> > > > > used to 'store' the hw metadata in the xdp_frame that are executed in each
> > > > > driver codebase before performing XDP_REDIRECT. The downside of this approach
> > > > > is we need to parse the hw metadata twice if the eBPF program that is bounded
> > > > > to the NIC is consuming these info. What do you think?
> > > > 
> > > > That's the option I was asking about. I'm assuming we should be able
> > > > to reuse existing xmo metadata callbacks for this. We should be able
> > > > to hide it from the drivers also hopefully.
> > > 
> > > I'm not against this idea of transparently stores all available metadata
> > > into the xdp_frame (via some flag/config), but it does not fit our
> > > production use-case.  I also think that this can be added later.
> > > 
> > > We need the ability to overwrite the RX-hash value, before redirecting
> > > packet to CPUMAP (remember as cover-letter describe RX-hash needed
> > > *before* the GRO engine processes the packet in CPUMAP. This is before
> > > TC/BPF).
> > 
> > Make sense. Can we make GRO not flush a bucket for same_flow=0 instead?
> > This will also make it work better for other regular tunneled traffic.
> > Setting hash in BPF to make GRO go fast seems too implementation specific :-(
> 
> I feel misunderstood here.  This was a GRO side-note to remind reviewers
> that the netstack expects RX-hash to be non-zero at napi_gro_receive().
> This is not about making GRO faster, but about complying with the netstack.
> 
> The important BPF optimization is the part that you forgot to quote in
> the reply, so let me reproduce what I wrote below.  TL;DR: the RX-hash
> needs to be based on the tunnel inner-headers, else the SW hash calc over
> the outer-headers will land everything on the same veth RX-queue.

It might be useful to expand more on this (the full path) in v2.
I'm still holding on to the GRO case from the cover letter :-[

Btw, will Jakub's suggestion from [0] work? This seems like a simpler mental
model. Each driver will get a set of metadata setter operations, and you
can use the existing headroom to plumb the metadata in whatever format you
like. In the cpumap XDP hook, you'll call 'set' operations to (somehow, tbd)
export them to the xdp->skb conversion.

0: https://lore.kernel.org/netdev/76a5330e-dc52-41ea-89c2-ddcde4b414bd@kernel.org/T/#ma746c47e42fbc24be5bb1c6c4b96be566821b03d
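
For illustration, a minimal sketch of that idea using only existing
pieces (bpf_xdp_adjust_meta() and the data_meta area, which travels
inside the xdp_frame across the redirect); the struct layout is made up
and the final 'set' step is left as a comment since that part is tbd:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Illustrative, self-defined metadata layout plumbed via headroom. */
struct my_rx_meta {
	__u32 hash;
	__u16 vlan_tci;
	__u16 flags;
};

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 8);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int store_meta_on_rx(struct xdp_md *ctx)
{
	struct my_rx_meta *meta;

	/* Reserve space in the metadata area in front of packet data. */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;

	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > (void *)(long)ctx->data)
		return XDP_PASS;

	meta->hash     = 0x1234abcd;	/* e.g. inner-header flow hash */
	meta->vlan_tci = 0;
	meta->flags    = 0;

	return bpf_redirect_map(&cpu_map, 0, 0);
}

/* Program attached to a cpumap entry via bpf_cpumap_val.bpf_prog.fd. */
SEC("xdp/cpumap")
int read_meta_on_cpumap(struct xdp_md *ctx)
{
	struct my_rx_meta *meta = (void *)(long)ctx->data_meta;

	if ((void *)(meta + 1) > (void *)(long)ctx->data)
		return XDP_PASS;

	/* Here the per-driver 'set' operations would translate meta->hash
	 * etc. into whatever the xdp->skb conversion consumes (tbd).
	 */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";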

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Performance impact of disabling VLAN offload [was: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst]
  2025-06-17 15:10                       ` Performance impact of disabling VLAN offload [was: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst] Toke Høiland-Jørgensen
@ 2025-06-19 12:09                         ` Jesper Dangaard Brouer
  2025-06-19 12:23                           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2025-06-19 12:09 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Lorenzo Bianconi,
	Stanislav Fomichev
  Cc: Daniel Borkmann, bpf, netdev, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
	kernel-team, arthur, jakub, Magnus Karlsson, Maciej Fijalkowski,
	Willem Ferguson, Jesse Brandeburg



On 17/06/2025 17.10, Toke Høiland-Jørgensen wrote:
>> Later we will look at using the vlan tag. Today we have disabled HW
>> vlan-offloading, because XDP originally didn't support accessing HW vlan
>> tags.
> 
> Side note (with changed subject to disambiguate): Do you have any data
> on the performance impact of disabling VLAN offload that you can share?
> I've been sort of wondering whether saving those couple of bytes has any
> measurable impact on real workloads (where you end up looking at the
> headers anyway, so saving the cache miss doesn't matter so much)?
> 

Our production setup has two different VLAN IDs, one for INTERNAL-ID
and one for EXTERNAL-ID (Internet) traffic.  On (many) servers these are
on the same physical net_device.

Our Unimog XDP load-balancer *only* handles EXTERNAL-ID.  Thus, the very
first thing Unimog does is check the VLAN ID.  If it doesn't match
EXTERNAL-ID, Unimog returns XDP_PASS.  This is the first time the packet
data area is read, which (due to our AMD CPUs) will be a cache-miss.

If this were INTERNAL-ID traffic, then we have caused a cache-miss earlier
than needed.  The NIC driver has already started a net_prefetch.  Thus, if
we can return XDP_PASS without touching packet data, we can (latency-)hide
part of the cache-miss (behind the SKB zeroing).  (We could also
CPUMAP-redirect the INTERNAL-ID traffic to a remote CPU for further gains.)
Using the kfunc (bpf_xdp_metadata_rx_vlan_tag [1]) to read the VLAN ID
doesn't touch/read packet data.
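
A minimal sketch of that early check, using the kfunc documented in [1]
(callable from a device-bound XDP program); EXTERNAL_VLAN_ID is just a
placeholder value:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define EXTERNAL_VLAN_ID 100	/* placeholder for the real EXTERNAL-ID */
#define VLAN_VID_MASK	 0x0fff

extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
					__be16 *vlan_proto,
					__u16 *vlan_tci) __ksym;

SEC("xdp")
int vlan_early_check(struct xdp_md *ctx)
{
	__be16 vlan_proto;
	__u16 vlan_tci;

	/* Read the VLAN tag from the RX descriptor: no packet data access. */
	if (bpf_xdp_metadata_rx_vlan_tag(ctx, &vlan_proto, &vlan_tci))
		return XDP_PASS;	/* no HW-offloaded tag available */

	if ((vlan_tci & VLAN_VID_MASK) != EXTERNAL_VLAN_ID)
		return XDP_PASS;	/* INTERNAL-ID: packet data never touched */

	/* EXTERNAL-ID: continue into the load-balancer fast path... */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";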

I hope this makes it clear why reading the HW offloaded VLAN tag from
the RX-descriptor is a performance benefit?

--Jesper

[1] https://docs.kernel.org/networking/xdp-rx-metadata.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Performance impact of disabling VLAN offload [was: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst]
  2025-06-19 12:09                         ` Jesper Dangaard Brouer
@ 2025-06-19 12:23                           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 30+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-06-19 12:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Lorenzo Bianconi, Stanislav Fomichev
  Cc: Daniel Borkmann, bpf, netdev, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
	kernel-team, arthur, jakub, Magnus Karlsson, Maciej Fijalkowski,
	Willem Ferguson, Jesse Brandeburg

Jesper Dangaard Brouer <hawk@kernel.org> writes:

> On 17/06/2025 17.10, Toke Høiland-Jørgensen wrote:
>>> Later we will look at using the vlan tag. Today we have disabled HW
>>> vlan-offloading, because XDP originally didn't support accessing HW vlan
>>> tags.
>> 
>> Side note (with changed subject to disambiguate): Do you have any data
>> on the performance impact of disabling VLAN offload that you can share?
>> I've been sort of wondering whether saving those couple of bytes has any
>> measurable impact on real workloads (where you end up looking at the
>> headers anyway, so saving the cache miss doesn't matter so much)?
>> 
>
> Our production setup has two different VLAN IDs, one for INTERNAL-ID
> and one for EXTERNAL-ID (Internet) traffic.  On (many) servers these are
> on the same physical net_device.
>
> Our Unimog XDP load-balancer *only* handles EXTERNAL-ID.  Thus, the very
> first thing Unimog does is check the VLAN ID.  If it doesn't match
> EXTERNAL-ID, Unimog returns XDP_PASS.  This is the first time the packet
> data area is read, which (due to our AMD CPUs) will be a cache-miss.
>
> If this were INTERNAL-ID traffic, then we have caused a cache-miss earlier
> than needed.  The NIC driver has already started a net_prefetch.  Thus, if
> we can return XDP_PASS without touching packet data, we can (latency-)hide
> part of the cache-miss (behind the SKB zeroing).  (We could also
> CPUMAP-redirect the INTERNAL-ID traffic to a remote CPU for further gains.)
> Using the kfunc (bpf_xdp_metadata_rx_vlan_tag [1]) to read the VLAN ID
> doesn't touch/read packet data.
>
> I hope this makes it clear why reading the HW offloaded VLAN tag from
> the RX-descriptor is a performance benefit?

Right, I can certainly see the argument, but I was hoping you'd have
some data to quantify exactly how much of a difference this makes? :)

Also, I guess this XDP-based early demux is a bit special as far as this
use case is concerned? For regular net-stack usage of the VLAN field,
we'll already have touched the packet data while building the skb; so
the difference will be less, as it shouldn't be a cache miss. Which
doesn't invalidate your use case, of course, it just makes it different...

-Toke


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-06-19 12:23 UTC | newest]

Thread overview: 30+ messages
2025-06-03 17:45 [PATCH bpf-next V1 0/7] xdp: Propagate RX HW hints for XDP_REDIRECTed packets via xdp_frame Jesper Dangaard Brouer
2025-06-03 17:45 ` [PATCH bpf-next V1 1/7] net: xdp: Add xdp_rx_meta structure Jesper Dangaard Brouer
2025-06-03 17:46 ` [PATCH bpf-next V1 2/7] selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect Jesper Dangaard Brouer
2025-06-03 17:46 ` [PATCH bpf-next V1 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff Jesper Dangaard Brouer
2025-06-16 21:55   ` Jakub Kicinski
2025-06-03 17:46 ` [PATCH bpf-next V1 4/7] net: xdp: Set skb hw metadata from xdp_frame Jesper Dangaard Brouer
2025-06-03 17:46 ` [PATCH bpf-next V1 5/7] net: veth: Read xdp metadata from rx_meta struct if available Jesper Dangaard Brouer
2025-06-03 17:46 ` [PATCH bpf-next V1 6/7] bpf: selftests: Add rx_meta store kfuncs selftest Jesper Dangaard Brouer
2025-06-06 21:57   ` Alexei Starovoitov
2025-06-06 22:16     ` Lorenzo Bianconi
2025-06-03 17:46 ` [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Jesper Dangaard Brouer
2025-06-06  2:45   ` Stanislav Fomichev
2025-06-10 13:48     ` Daniel Borkmann
2025-06-10 20:12       ` Toke Høiland-Jørgensen
2025-06-10 22:26         ` Lorenzo Bianconi
2025-06-11  3:40           ` Stanislav Fomichev
2025-06-13 10:59             ` Jesper Dangaard Brouer
2025-06-16 15:34               ` Stanislav Fomichev
2025-06-17 16:15                 ` Jesper Dangaard Brouer
2025-06-17 20:47                   ` Stanislav Fomichev
2025-06-16 12:37             ` Lorenzo Bianconi
2025-06-16 15:45               ` Stanislav Fomichev
2025-06-16 16:40                 ` Lorenzo Bianconi
2025-06-17 11:50                   ` Toke Høiland-Jørgensen
2025-06-17 14:47                     ` Jesper Dangaard Brouer
2025-06-17 15:10                       ` Performance impact of disabling VLAN offload [was: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst] Toke Høiland-Jørgensen
2025-06-19 12:09                         ` Jesper Dangaard Brouer
2025-06-19 12:23                           ` Toke Høiland-Jørgensen
2025-06-13 11:18         ` [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Daniel Borkmann
2025-06-16 11:51           ` Toke Høiland-Jørgensen
