* [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store
@ 2025-04-22 13:23 Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata Arthur Fabre
                   ` (16 more replies)
  0 siblings, 17 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

The only way to attach information to an sk_buff that travels
through the network stack is the mark. This field can be read by
firewall rules, drive routing decisions, and be accessed by BPF
programs.

However, its small size creates competition for bits, restricting
its practical use.

We propose using part of the packet headroom to store metadata.
This would allow:
- Tracing packets through the network stack and across the kernel-user
  space boundary, by assigning them a unique ID.
- Metadata-driven packet redirection, routing, and socket steering with
  early classification in XDP.
- Extracting information from encapsulation headers and sharing it with
  user space or vice versa.
- Exposing XDP RX Metadata, like the timestamp, to the rest of the
  network stack.

We originally proposed extending XDP metadata - a binary blob also
stored in the headroom - to expose it throughout the network stack.
However, based on feedback at LPC 2024 [1]:
- sharing a binary blob amongst different applications is hard.
- exposing a binary blob to userspace is awkward.
we've shifted to a limited KV store in the headroom.

To differentiate this from the overloaded "metadata" term, it's
tentatively called "packet traits".

Traits are currently stored at the start of the headroom:

| xdp_frame | traits | headroom | XDP metadata | data / packet |

This makes adding encap headers to a packet easier: the traits don't
have to be moved out of the way first.

But to let us change this in the future, XDP metadata and traits
aren't allowed to be used together.

A get() / set() / delete() API is exposed to BPF to store and
retrieve traits.
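
For illustration, a rough sketch of the BPF side (the kfunc signature
matches the selftests later in the series; the program, key number and
value here are arbitrary):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    extern int bpf_xdp_trait_set(const struct xdp_md *xdp, __u64 key,
                                 const void *val, __u64 val__sz,
                                 __u64 flags) __ksym;

    SEC("xdp")
    int tag_packet(struct xdp_md *xdp)
    {
            __u32 tenant = 42;

            /* Returns 0 on success, e.g. -ENOSPC when out of room. */
            if (bpf_xdp_trait_set(xdp, 12, &tenant, sizeof(tenant), 0))
                    return XDP_ABORTED;
            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";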

Initial benchmarks in XDP are promising, with get() / set() comparable
to an indirect function call. See patch 7: "trait: Replace memmove calls
with inline move" for full results.

We imagine adding first-class support for this in netfilter (setting
/ checking traits in rules) and routing (selecting routing tables
based on traits) in follow-up work.
We also envisage a first-class userspace API for storing and
retrieving traits in the future.

Like XDP metadata, this relies on there being sufficient headroom
available. Piggybacking on top of that work, traits are currently
only supported:
- On ingress.
- By NIC drivers that support XDP metadata.
- When an XDP program is attached.
This limits the applicability of traits. But future work
guaranteeing sufficient headroom through other means should allow
these restrictions to be lifted.

[1] https://lpc.events/event/18/contributions/1935/

---
Changes in v2:
- Support sizes 0 (for flags), 4, and 8. 16 will be supported in the
  future with a batch API, to set two consecutive 8 byte KVs at once.
- Prevent traits and XDP metadata from being used at the same time.
  This will let us move trait storage where XDP metadata is today if
  we want to.
- Use SKB extensions to store the traits in skbs.
- Drop registration API.
- Link to v1: https://lore.kernel.org/r/20250305-afabre-traits-010-rfc2-v1-0-d0ecfb869797@cloudflare.com

---
Arthur Fabre (16):
      trait: limited KV store for packet metadata
      xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions
      trait: XDP support
      trait: XDP selftest
      trait: XDP benchmark
      trait: Replace memcpy calls with inline copies
      trait: Replace memmove calls with inline move
      skb: Extension header in packet headroom
      trait: Store traits in sk_buff extension
      bnxt: Propagate trait presence to skb
      ice: Propagate trait presence to skb
      veth: Propagate trait presence to skb
      virtio_net: Propagate trait presence to skb
      mlx5: Propagate trait presence to skb
      xdp generic: Propagate trait presence to skb
      trait: Allow socket filters to access traits

Jesper Dangaard Brouer (1):
      mlx5: move xdp_buff scope one level up

 drivers/net/ethernet/broadcom/bnxt/bnxt.c          |   4 +
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c     |   5 -
 drivers/net/ethernet/intel/ice/ice_txrx.c          |   4 +
 drivers/net/ethernet/intel/ice/ice_xsk.c           |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.c    |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.h    |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 114 ++++----
 drivers/net/veth.c                                 |   4 +
 drivers/net/virtio_net.c                           |   8 +-
 include/linux/skbuff.h                             |  42 +++
 include/net/trait.h                                | 302 +++++++++++++++++++++
 include/net/xdp.h                                  |  56 +++-
 net/core/dev.c                                     |   1 +
 net/core/filter.c                                  |  10 +-
 net/core/skbuff.c                                  | 231 ++++++++++++++--
 net/core/xdp.c                                     |  69 ++++-
 net/xdp/xsk.c                                      |  11 +-
 tools/testing/selftests/bpf/Makefile               |   2 +
 tools/testing/selftests/bpf/bench.c                |   8 +
 .../selftests/bpf/benchs/bench_xdp_traits.c        | 160 +++++++++++
 .../testing/selftests/bpf/prog_tests/xdp_traits.c  |  33 +++
 .../testing/selftests/bpf/progs/bench_xdp_traits.c | 128 +++++++++
 .../testing/selftests/bpf/progs/test_xdp_traits.c  | 206 ++++++++++++++
 24 files changed, 1319 insertions(+), 99 deletions(-)
---
base-commit: 5709be4c35ba760b001733939e20069de033a697
change-id: 20250305-afabre-traits-010-rfc2-a8e4de0c490b

Best regards,
-- 
Arthur Fabre <arthur@arthurfabre.com>



* [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-24 16:22   ` Alexei Starovoitov
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 02/17] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions Arthur Fabre
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

A (very limited) KV store to support storing KVs in the packet headroom,
with:
- 64 keys (0-63).
- 0, 4 or 8 byte values.
- O(1) lookup
- O(n) insertion
- A fixed 16 byte header.

Values are stored ordered by key, immediately following the fixed
header.
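
As a worked example of the O(1) lookup (illustrative; the two-bit
size encoding lives in the header comment in the diff below): say
key 2 holds a 4 byte value and key 5 holds an 8 byte value:

    high = 0b100100, low = 0b100000

    offset of key 5 = 16                                   /* header     */
                    + 4 * hweight64(high & 0b11111)        /* key 2: +4  */
                    + 4 * hweight64(high & low & 0b11111)  /* none:  +0  */
                    = 20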

A future batch API will allow setting multiple consecutive keys at once.
This will allow values bigger than 8 bytes to be stored.
16 byte values would be particularly useful to store UUIDs and IPv6
addresses.

Implemented in a header file so functions are always inlinable.
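
A minimal sketch of the intended calling convention (illustrative, not
part of the patch; the stack buffer stands in for packet headroom):

    #include <net/trait.h>

    static void trait_example(void)
    {
            char room[64];          /* stand-in for packet headroom */
            u32 mark = 42, out;

            traits_init(room, room + sizeof(room));
            trait_set(room, room + sizeof(room), 5, &mark, sizeof(mark), 0);
            trait_get(room, 5, &out, sizeof(out)); /* returns 4, out == 42 */
            trait_del(room, 5);
    }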

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 include/net/trait.h | 263 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 263 insertions(+)

diff --git a/include/net/trait.h b/include/net/trait.h
new file mode 100644
index 0000000000000000000000000000000000000000..af42c1ad2416d5c38631f1f0305ef9fefe43bd87
--- /dev/null
+++ b/include/net/trait.h
@@ -0,0 +1,263 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2025 Arthur Fabre, Cloudflare */
+#ifndef __LINUX_NET_TRAIT_H__
+#define __LINUX_NET_TRAIT_H__
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/bitops.h>
+
+/* Traits are a very limited KV store, with:
+ * - 64 keys (0-63).
+ * - 0 (flag), 4, or 8 byte values.
+ * - O(1) lookup
+ * - O(n) insertion
+ * - A fixed 16 byte header.
+ */
+struct __trait_hdr {
+	/* Values are stored ordered by key, immediately after the header.
+	 *
+	 * The size of each value set is stored in the header as two bits:
+	 *  - 00: Not set.
+	 *  - 01: 0 bytes.
+	 *  - 10: 4 bytes.
+	 *  - 11: 8 bytes.
+	 *
+	 * Naively storing the size of each value (e.g. packing 4 of them in a byte)
+	 * doesn't let us efficiently calculate the offset of the value of an arbitrary key:
+	 * we'd have to mask out and sum the sizes of all preceding values.
+	 *
+	 * Instead, store the high and low bits of the size in two separate words:
+	 *  - A high bit in the high word.
+	 *  - A low bit in the low word.
+	 * The bit position of each word, LSb 0, is the key.
+	 *
+	 * To calculate the offset of the value of a key:
+	 *  - Mask out subsequent keys from both words.
+	 *  - Count the number of bits in each word: Hamming weight (a single amd64 instruction).
+	 *  - Map the bits to sizes.
+	 */
+	u64 high;
+	u64 low;
+};
+
+static __always_inline bool __trait_valid_len(u64 len)
+{
+	return len == 0 || len == 4 || len == 8;
+}
+
+static __always_inline bool __trait_valid_key(u64 key)
+{
+	return key < 64;
+}
+
+static __always_inline int __trait_total_length(struct __trait_hdr h)
+{
+	return (hweight64(h.high) << 2)
+		/* For size 8, the first term only counts 4. Add another 4 in. */
+		+ (hweight64(h.high & h.low) << 2);
+}
+
+static __always_inline struct __trait_hdr __trait_and(struct __trait_hdr h, u64 mask)
+{
+	return (struct __trait_hdr){
+		h.high & mask,
+		h.low & mask,
+	};
+}
+
+static __always_inline int __trait_offset(struct __trait_hdr h, u64 key)
+{
+	/* Calculate total length of previous keys by masking out keys after. */
+	return sizeof(struct __trait_hdr) + __trait_total_length(__trait_and(h, ~(~0llu << key)));
+}
+
+/**
+ * traits_init() - Initialize a trait store.
+ * @traits: Start of trait store area.
+ * @hard_end: Hard limit the trait store can currently grow up against.
+ *            Can change dynamically after initialization, as long as it
+ *            does not overwrite any area already used (see traits_size()).
+ *
+ * Return:
+ * * %0       - Success.
+ * * %-ENOMEM - Not enough room to store any traits.
+ */
+static __always_inline int traits_init(void *traits, void *hard_end)
+{
+	if (traits + sizeof(struct __trait_hdr) > hard_end)
+		return -ENOMEM;
+
+	memset(traits, 0, sizeof(struct __trait_hdr));
+	return 0;
+}
+
+/**
+ * traits_size() - Total size currently used by a trait store.
+ * @traits: Start of trait store area.
+ *
+ * Return: Size in bytes.
+ */
+static __always_inline int traits_size(void *traits)
+{
+	return sizeof(struct __trait_hdr) + __trait_total_length(*(struct __trait_hdr *)traits);
+}
+
+/**
+ * trait_set() - Set a trait key.
+ * @traits: Start of trait store area.
+ * @hard_end: Hard limit the trait store can currently grow up against.
+ * @key: The key to set.
+ * @val: The value to set.
+ * @len: The length of the value.
+ * @flags: Unused for now. Should be 0.
+ *
+ * Return:
+ * * %0       - Success.
+ * * %-EINVAL - Key or length invalid.
+ * * %-EBUSY  - Key previously set with different length.
+ * * %-ENOSPC - Not enough room left to store value.
+ */
+static __always_inline
+int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
+{
+	if (!__trait_valid_key(key) || !__trait_valid_len(len))
+		return -EINVAL;
+
+	struct __trait_hdr *h = (struct __trait_hdr *)traits;
+
+	/* Offset of value of this key. */
+	int off = __trait_offset(*h, key);
+
+	if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
+		/* Key is already set, but with a different length */
+		if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
+			return -EBUSY;
+	} else {
+		/* Figure out if we have enough room left: total length of everything now. */
+		if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
+			return -ENOSPC;
+
+		/* Memmove all the kvs after us over. */
+		if (traits_size(traits) > off)
+			memmove(traits + off + len, traits + off, traits_size(traits) - off);
+	}
+
+	/* Set our value. */
+	memcpy(traits + off, val, len);
+
+	/* Store our length in header. */
+	u64 encode_len = 0;
+
+	switch (len) {
+	case 0:
+		encode_len = 1;
+		break;
+	case 4:
+		encode_len = 2;
+		break;
+	case 8:
+		encode_len = 3;
+		break;
+	}
+	h->high |= (encode_len >> 1) << key;
+	h->low |= (encode_len & 1) << key;
+	return 0;
+}
+
+/**
+ * trait_is_set() - Check if a trait key is set.
+ * @traits: Start of trait store area.
+ * @key: The key to check.
+ *
+ * Return:
+ * * %0       - Key is not set.
+ * * %1       - Key is set.
+ * * %-EINVAL - Key or length invalid.
+ */
+static __always_inline
+int trait_is_set(void *traits, u64 key)
+{
+	if (!__trait_valid_key(key))
+		return -EINVAL;
+
+	struct __trait_hdr h = *(struct __trait_hdr *)traits;
+
+	return ((h.high & (1ull << key)) || (h.low & (1ull << key))) != 0;
+}
+
+/**
+ * trait_get() - Get a trait key.
+ * @traits: Start of trait store area.
+ * @key: The key to get.
+ * @val: Where to put stored value.
+ * @val_len: The length of val.
+ *
+ * Return:
+ * * %>0      - Actual size of value.
+ * * %-EINVAL - Key or length invalid.
+ * * %-ENOENT - Key has not been set with trait_set() previously.
+ * * %-ENOSPC - Val is not big enough to hold stored value.
+ */
+static __always_inline
+int trait_get(void *traits, u64 key, void *val, u64 val_len)
+{
+	if (!__trait_valid_key(key))
+		return -EINVAL;
+
+	struct __trait_hdr h = *(struct __trait_hdr *)traits;
+
+	/* Check key is set */
+	if (!((h.high & (1ull << key)) || (h.low & (1ull << key))))
+		return -ENOENT;
+
+	/* Offset of value of this key */
+	int off = __trait_offset(h, key);
+
+	/* Figure out our length */
+	int real_len = __trait_total_length(__trait_and(h, (1ull << key)));
+
+	if (real_len > val_len)
+		return -ENOSPC;
+
+	memcpy(val, traits + off, real_len);
+	return real_len;
+}
+
+/**
+ * trait_del() - Delete a trait key.
+ * @traits: Start of trait store area.
+ * @key: The key to delete.
+ *
+ * Return:
+ * * %0       - Success.
+ * * %-EINVAL - Key or length invalid.
+ * * %-ENOENT - Key has not been set with trait_set() previously.
+ */
+static __always_inline int trait_del(void *traits, u64 key)
+{
+	if (!__trait_valid_key(key))
+		return -EINVAL;
+
+	struct __trait_hdr *h = (struct __trait_hdr *)traits;
+
+	/* Check key is set */
+	if (!((h->high & (1ull << key)) || (h->low & (1ull << key))))
+		return -ENOENT;
+
+	/* Offset and length of value of this key */
+	int off = __trait_offset(*h, key);
+	int len = __trait_total_length(__trait_and(*h, (1ull << key)));
+
+	/* Memmove all the kvs after us over */
+	if (traits_size(traits) > off + len)
+		memmove(traits + off, traits + off + len, traits_size(traits) - off - len);
+
+	/* Clear our length in header */
+	h->high &= ~(1ull << key);
+	h->low &= ~(1ull << key);
+	return 0;
+}
+
+#endif /* __LINUX_NET_TRAIT_H__ */

-- 
2.43.0



* [PATCH RFC bpf-next v2 02/17] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 03/17] trait: XDP support Arthur Fabre
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

xdp_buff signals that metadata is not supported by a NIC by setting
data_meta to be greater than data.

But xdp_frame only stores the metadata size (as metasize), so converting
between xdp_frame and xdp_buff is lossy.

This won't let us add "generic" functions for setting skb fields from
either xdp_frame or xdp_buff in drivers.

Switch to storing whether metadata is supported or not in a new
XDP_FLAG, as flags are copied as-is between xdp_frame and xdp_buff.

This also resolves the slightly awkward checks needed for calculating
the metadata length: data - data_meta can now always be used.
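
For example, xdp_get_metalen() in the diff below goes from:

    return xdp_data_meta_unsupported(xdp) ? 0 :
           xdp->data - xdp->data_meta;

to:

    return xdp->data - xdp->data_meta;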

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c |  5 -----
 include/net/xdp.h                              | 12 +++++++++---
 net/core/filter.c                              |  3 +--
 net/core/xdp.c                                 |  3 +--
 net/xdp/xsk.c                                  | 11 ++---------
 5 files changed, 13 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 4948b4906584e099515b1e1fc46428f6f0a56d1b..b009b88b5b410200500af717ca05ba72e7d2ffb8 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2629,12 +2629,7 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
 
 	switch (xdp_act) {
 	case XDP_PASS:
-#ifdef CONFIG_DPAA_ERRATUM_A050385
-		*xdp_meta_len = xdp_data_meta_unsupported(&xdp) ? 0 :
-				xdp.data - xdp.data_meta;
-#else
 		*xdp_meta_len = xdp.data - xdp.data_meta;
-#endif
 		break;
 	case XDP_TX:
 		/* We can access the full headroom when sending the frame
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 48efacbaa35da6c4ec76d8d8e78b86e8f7c23e4b..16c36813cbf8631ea170d2698f1d3408286129a2 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -76,6 +76,9 @@ enum xdp_buff_flags {
 	XDP_FLAGS_FRAGS_PF_MEMALLOC	= BIT(1), /* xdp paged memory is under
 						   * pressure
 						   */
+	XDP_FLAGS_META_SUPPORTED	= BIT(2), /* metadata in headroom supported
+						   * by driver
+						   */
 };
 
 struct xdp_buff {
@@ -132,7 +135,10 @@ xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start,
 	xdp->data_hard_start = hard_start;
 	xdp->data = data;
 	xdp->data_end = data + data_len;
-	xdp->data_meta = meta_valid ? data : data + 1;
+	xdp->data_meta = data;
+
+	if (meta_valid)
+		xdp->flags |= XDP_FLAGS_META_SUPPORTED;
 }
 
 /* Reserve memory area at end-of data area.
@@ -497,13 +503,13 @@ static inline void xdp_rxq_info_detach_mem_model(struct xdp_rxq_info *xdp_rxq)
 static __always_inline void
 xdp_set_data_meta_invalid(struct xdp_buff *xdp)
 {
-	xdp->data_meta = xdp->data + 1;
+	xdp->flags &= ~XDP_FLAGS_META_SUPPORTED;
 }
 
 static __always_inline bool
 xdp_data_meta_unsupported(const struct xdp_buff *xdp)
 {
-	return unlikely(xdp->data_meta > xdp->data);
+	return unlikely(!(xdp->flags & XDP_FLAGS_META_SUPPORTED));
 }
 
 static inline bool xdp_metalen_invalid(unsigned long metalen)
diff --git a/net/core/filter.c b/net/core/filter.c
index 79cab4d78dc3c6fe8a3a16e76fd0daf51af3282e..f9b3358e274fa4b85e39509b04192c282ba2009c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3937,8 +3937,7 @@ const struct bpf_func_proto bpf_xdp_get_buff_len_trace_proto = {
 
 static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
 {
-	return xdp_data_meta_unsupported(xdp) ? 0 :
-	       xdp->data - xdp->data_meta;
+	return xdp->data - xdp->data_meta;
 }
 
 BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
diff --git a/net/core/xdp.c b/net/core/xdp.c
index f86eedad586a77eb63a96a85aa6d068d3e94f077..4ee9ff9dbd0e810425e00c1e8bcc0d7088ddaad4 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -580,8 +580,7 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 	struct page *page;
 
 	/* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
-	metasize = xdp_data_meta_unsupported(xdp) ? 0 :
-		   xdp->data - xdp->data_meta;
+	metasize = xdp->data - xdp->data_meta;
 	totsize = xdp->data_end - xdp->data + metasize;
 
 	if (sizeof(*xdpf) + totsize > PAGE_SIZE)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5696af45bcf711ce6b9df46a8783db0f4561e79a..eb771c7fcbd461b3899a5f45bf8a3071aeaa6a9a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -188,14 +188,6 @@ static int xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 	return err;
 }
 
-static void *xsk_copy_xdp_start(struct xdp_buff *from)
-{
-	if (unlikely(xdp_data_meta_unsupported(from)))
-		return from->data;
-	else
-		return from->data_meta;
-}
-
 static u32 xsk_copy_xdp(void *to, void **from, u32 to_len,
 			u32 *from_len, skb_frag_t **frag, u32 rem)
 {
@@ -227,12 +219,13 @@ static u32 xsk_copy_xdp(void *to, void **from, u32 to_len,
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 {
 	u32 frame_size = xsk_pool_get_rx_frame_size(xs->pool);
-	void *copy_from = xsk_copy_xdp_start(xdp), *copy_to;
+	void *copy_from, *copy_to;
 	u32 from_len, meta_len, rem, num_desc;
 	struct xdp_buff_xsk *xskb;
 	struct xdp_buff *xsk_xdp;
 	skb_frag_t *frag;
 
+	copy_from = xdp->data_meta;
 	from_len = xdp->data_end - copy_from;
 	meta_len = xdp->data - copy_from;
 	rem = len + meta_len;

-- 
2.43.0



* [PATCH RFC bpf-next v2 03/17] trait: XDP support
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 02/17] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 04/17] trait: XDP selftest Arthur Fabre
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Store the trait KV store in the xdp_buff headroom, right after the
xdp_frame.

Storing it at the front of the headroom makes it easy to add encap
headers / adjust_head(): we don't need to copy the traits out of the way
first.

But, in case we want to change this in the future, disallow traits from
being used at the same time as metadata.

Support for traits is gated on support for metadata: to propagate
traits to the skb layer, drivers will have to explicitly signal
support, as they do today with skb_metadata_set().

This assumes there is enough headroom to store the traits header. That
avoids having to check in the hot path, where all we could do anyway
is BUG_ON().
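
The resulting headroom layout (40 byte xdp_frame and 16 byte trait
header, per the sizes in this series):

    | xdp_frame (40B) | trait hdr (16B) | values | headroom | data / packet |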

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 include/net/xdp.h | 25 ++++++++++++++++++++-
 net/core/filter.c |  7 ++++--
 net/core/xdp.c    | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 95 insertions(+), 3 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 16c36813cbf8631ea170d2698f1d3408286129a2..0da1e87afdebfd4323d1944de65a6d63438209bf 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -10,6 +10,7 @@
 #include <linux/filter.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h> /* skb_shared_info */
+#include <net/trait.h>
 
 #include <net/page_pool/types.h>
 
@@ -79,6 +80,7 @@ enum xdp_buff_flags {
 	XDP_FLAGS_META_SUPPORTED	= BIT(2), /* metadata in headroom supported
 						   * by driver
 						   */
+	XDP_FLAGS_TRAITS_SUPPORTED	= BIT(3), /* traits in headroom supported */
 };
 
 struct xdp_buff {
@@ -118,6 +120,13 @@ static __always_inline void xdp_buff_set_frag_pfmemalloc(struct xdp_buff *xdp)
 	xdp->flags |= XDP_FLAGS_FRAGS_PF_MEMALLOC;
 }
 
+#define _XDP_FRAME_SIZE (40)
+
+static __always_inline void *xdp_buff_traits(const struct xdp_buff *xdp)
+{
+	return xdp->data_hard_start + _XDP_FRAME_SIZE;
+}
+
 static __always_inline void
 xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq)
 {
@@ -137,8 +146,15 @@ xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start,
 	xdp->data_end = data + data_len;
 	xdp->data_meta = data;
 
-	if (meta_valid)
+	if (meta_valid) {
 		xdp->flags |= XDP_FLAGS_META_SUPPORTED;
+
+		xdp->flags |= XDP_FLAGS_TRAITS_SUPPORTED;
+		/* We assume drivers reserve enough headroom to store xdp_frame
+		 * and the traits header.
+		 */
+		traits_init(xdp_buff_traits(xdp), xdp->data_meta);
+	}
 }
 
 /* Reserve memory area at end-of data area.
@@ -273,6 +289,8 @@ struct xdp_frame {
 	u32 flags; /* supported values defined in xdp_buff_flags */
 };
 
+static_assert(sizeof(struct xdp_frame) == _XDP_FRAME_SIZE);
+
 static __always_inline bool xdp_frame_has_frags(const struct xdp_frame *frame)
 {
 	return !!(frame->flags & XDP_FLAGS_HAS_FRAGS);
@@ -522,6 +540,11 @@ static inline bool xdp_metalen_invalid(unsigned long metalen)
 	return !IS_ALIGNED(metalen, sizeof(u32)) || metalen > meta_max;
 }
 
+static __always_inline void *xdp_data_hard_start(const struct xdp_buff *xdp)
+{
+	return xdp_buff_traits(xdp) + traits_size(xdp_buff_traits(xdp));
+}
+
 struct xdp_attachment_info {
 	struct bpf_prog *prog;
 	u32 flags;
diff --git a/net/core/filter.c b/net/core/filter.c
index f9b3358e274fa4b85e39509b04192c282ba2009c..909962b5d89b8f45c8963f9074e02c7cc5f1c393 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,7 @@
 #include <linux/un.h>
 #include <net/xdp_sock_drv.h>
 #include <net/inet_dscp.h>
+#include <net/trait.h>
 
 #include "dev.h"
 
@@ -3942,9 +3943,8 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
 
 BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
 {
-	void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
 	unsigned long metalen = xdp_get_metalen(xdp);
-	void *data_start = xdp_frame_end + metalen;
+	void *data_start = xdp_data_hard_start(xdp) + metalen;
 	void *data = xdp->data + offset;
 
 	if (unlikely(data < data_start ||
@@ -4247,6 +4247,9 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 	if (unlikely(xdp_metalen_invalid(metalen)))
 		return -EACCES;
 
+	/* Traits and meta can't be used together */
+	xdp->flags &= ~XDP_FLAGS_TRAITS_SUPPORTED;
+
 	xdp->data_meta = meta;
 
 	return 0;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4ee9ff9dbd0e810425e00c1e8bcc0d7088ddaad4..9b7124d52f377cf800664b3433c047f27956933e 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -1021,3 +1021,69 @@ void xdp_features_clear_redirect_target(struct net_device *dev)
 	xdp_set_features_flag(dev, val);
 }
 EXPORT_SYMBOL_GPL(xdp_features_clear_redirect_target);
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc int bpf_xdp_trait_set(struct xdp_buff *xdp, u64 key,
+				  const void *val, u64 val__sz, u64 flags)
+{
+	if (!(xdp->flags & XDP_FLAGS_TRAITS_SUPPORTED))
+		return -EOPNOTSUPP;
+	/* Traits and meta can't be used together */
+	xdp->flags &= ~XDP_FLAGS_META_SUPPORTED;
+
+	return trait_set(xdp_buff_traits(xdp), xdp->data, key,
+			 val, val__sz, flags);
+}
+
+__bpf_kfunc int bpf_xdp_trait_is_set(struct xdp_buff *xdp, u64 key)
+{
+	if (!(xdp->flags & XDP_FLAGS_TRAITS_SUPPORTED))
+		return -EOPNOTSUPP;
+	/* Traits and meta can't be used together */
+	xdp->flags &= ~XDP_FLAGS_META_SUPPORTED;
+
+	return trait_is_set(xdp_buff_traits(xdp), key);
+}
+
+__bpf_kfunc int bpf_xdp_trait_get(struct xdp_buff *xdp, u64 key,
+				  void *val, u64 val__sz)
+{
+	if (!(xdp->flags & XDP_FLAGS_TRAITS_SUPPORTED))
+		return -EOPNOTSUPP;
+	/* Traits and meta can't be used together */
+	xdp->flags &= ~XDP_FLAGS_META_SUPPORTED;
+
+	return trait_get(xdp_buff_traits(xdp), key, val, val__sz);
+}
+
+__bpf_kfunc int bpf_xdp_trait_del(struct xdp_buff *xdp, u64 key)
+{
+	if (!(xdp->flags & XDP_FLAGS_TRAITS_SUPPORTED))
+		return -EOPNOTSUPP;
+	/* Traits and meta can't be used together */
+	xdp->flags &= ~XDP_FLAGS_META_SUPPORTED;
+
+	return trait_del(xdp_buff_traits(xdp), key);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(xdp_trait)
+BTF_ID_FLAGS(func, bpf_xdp_trait_set)
+BTF_ID_FLAGS(func, bpf_xdp_trait_is_set)
+BTF_ID_FLAGS(func, bpf_xdp_trait_get)
+BTF_ID_FLAGS(func, bpf_xdp_trait_del)
+BTF_KFUNCS_END(xdp_trait)
+
+static const struct btf_kfunc_id_set xdp_trait_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set = &xdp_trait,
+};
+
+static int xdp_trait_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP,
+					 &xdp_trait_kfunc_set);
+}
+late_initcall(xdp_trait_init);

-- 
2.43.0



* [PATCH RFC bpf-next v2 04/17] trait: XDP selftest
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (2 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 03/17] trait: XDP support Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 05/17] trait: XDP benchmark Arthur Fabre
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Check that traits work for all sizes, and that only one of
xdp_adjust_meta() and xdp_trait_set() can be used at a time.

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 .../testing/selftests/bpf/prog_tests/xdp_traits.c  |  33 ++++
 .../testing/selftests/bpf/progs/test_xdp_traits.c  | 206 +++++++++++++++++++++
 2 files changed, 239 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_traits.c b/tools/testing/selftests/bpf/prog_tests/xdp_traits.c
new file mode 100644
index 0000000000000000000000000000000000000000..3e3e1b2fa73eb2ff94884fba3536b25f1b089fdd
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_traits.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+
+void test_xdp_traits(void)
+{
+	const char *file = "./test_xdp_traits.bpf.o";
+	struct bpf_object *obj;
+	struct bpf_program *prog;
+	int err, prog_fd;
+	struct bpf_test_run_opts topts;
+
+	err = bpf_prog_test_load(file, BPF_PROG_TYPE_XDP, &obj, &prog_fd);
+	if (!ASSERT_OK(err, "test_xdp_traits"))
+		return;
+
+	bpf_object__for_each_program(prog, obj) {
+		if (test__start_subtest(bpf_program__name(prog))) {
+			LIBBPF_OPTS_RESET(topts,
+				.data_in = &pkt_v4,
+				.data_size_in = sizeof(pkt_v4),
+				.repeat = 1,
+			);
+
+			prog_fd = bpf_program__fd(prog);
+			err = bpf_prog_test_run_opts(prog_fd, &topts);
+			ASSERT_OK(err, "prog_run");
+			ASSERT_EQ(topts.retval, XDP_PASS, "retval");
+		}
+	}
+
+	bpf_object__close(obj);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_traits.c b/tools/testing/selftests/bpf/progs/test_xdp_traits.c
new file mode 100644
index 0000000000000000000000000000000000000000..e020acae6b56cb55f455075341159993c20f4c9e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_xdp_traits.c
@@ -0,0 +1,206 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <errno.h>
+
+/* TODO - can we plumb better constants through here?
+ * 40: sizeof(struct xdp_frame)
+ * 16: sizeof(struct __trait_hdr)
+ */
+#define MAX_SPACE (XDP_PACKET_HEADROOM - 40 - 16)
+
+/* For xdp_adjust_meta() */
+#ifndef ENOTSUPP
+#define ENOTSUPP 524
+#endif
+
+extern int bpf_xdp_trait_set(const struct xdp_md *xdp, __u64 key,
+			     const void *val, __u64 val__sz,
+			     __u64 flags) __ksym;
+extern int bpf_xdp_trait_get(const struct xdp_md *xdp, __u64 key, void *val,
+			     __u64 val__sz) __ksym;
+extern int bpf_xdp_trait_del(const struct xdp_md *xdp, __u64 key) __ksym;
+extern int bpf_xdp_trait_is_set(const struct xdp_md *xdp, __u64 key) __ksym;
+
+#define ASSERT_CALL(WANT, CALL) do {							\
+	int _ret = CALL;								\
+	if (_ret != WANT) {								\
+		bpf_printk("%d: %s ret %d want %d", __LINE__, #CALL, _ret, WANT);	\
+		return XDP_DROP;							\
+	}										\
+} while (0)
+
+#define ASSERT_VAL(WANT, GOT, PTRSIZE) do {								\
+	if (__builtin_memcmp(WANT, GOT, PTRSIZE)) {							\
+		switch (PTRSIZE) {									\
+		case 0:											\
+			return XDP_DROP;								\
+		case 4:											\
+			bpf_printk("%d: got %d want %d", __LINE__, *(__u32 *)GOT, *(__u32 *)WANT);	\
+			return XDP_DROP;								\
+		case 8:											\
+			bpf_printk("%d: got %d want %d", __LINE__, *(__u64 *)GOT, *(__u64 *)WANT);	\
+			return XDP_DROP;								\
+		}											\
+	}												\
+} while (0)
+
+#define FILL(PTR, PTRSIZE, VAL) do {			\
+	int _i;						\
+	for (_i = 0; _i < PTRSIZE; _i++)		\
+		*(((__u8 *)(PTR))+_i) = (__u8)VAL;	\
+} while (0)
+
+static __always_inline int test(struct xdp_md *xdp, void *val, void *got, int valsize)
+{
+	int i, numkeys;
+
+	numkeys = 64;
+	if (valsize * numkeys > MAX_SPACE)
+		numkeys = MAX_SPACE / valsize;
+
+	/* No keys to start */
+	for (i = 0; i < numkeys; i++) {
+		ASSERT_CALL(-ENOENT, bpf_xdp_trait_get(xdp, i, val, valsize));
+		ASSERT_CALL(-ENOENT, bpf_xdp_trait_del(xdp, i));
+		ASSERT_CALL(0, bpf_xdp_trait_is_set(xdp, i));
+	}
+
+	/* Set all keys */
+	for (i = 0; i < numkeys; i++) {
+		FILL(val, valsize, i);
+		ASSERT_CALL(0, bpf_xdp_trait_set(xdp, i, val, valsize, 0));
+
+		ASSERT_CALL(valsize, bpf_xdp_trait_get(xdp, i, got, valsize));
+		ASSERT_VAL(val, got, valsize);
+		ASSERT_CALL(1, bpf_xdp_trait_is_set(xdp, i));
+	}
+
+	/* Get all keys back out */
+	for (i = 0; i < numkeys; i++) {
+		FILL(val, valsize, i);
+
+		ASSERT_CALL(valsize, bpf_xdp_trait_get(xdp, i, got, valsize));
+		ASSERT_VAL(val, got, valsize);
+		ASSERT_CALL(1, bpf_xdp_trait_is_set(xdp, i));
+	}
+
+	/* Overwrite all keys */
+	for (i = 0; i < numkeys; i++) {
+		FILL(val, valsize, i+128);
+		ASSERT_CALL(0, bpf_xdp_trait_set(xdp, i, val, valsize, 0));
+	}
+
+	/* Delete all even keys */
+	for (i = 0; i < numkeys; i++) {
+		if (!(i & 1))
+			ASSERT_CALL(0, bpf_xdp_trait_del(xdp, i));
+	}
+
+	/* Check remaining keys */
+	for (i = 0; i < numkeys; i++) {
+		if (!(i & 1)) {
+			ASSERT_CALL(-ENOENT, bpf_xdp_trait_get(xdp, i, val, valsize));
+			ASSERT_CALL(0, bpf_xdp_trait_is_set(xdp, i));
+		} else {
+			FILL(val, valsize, i+128);
+
+			ASSERT_CALL(valsize, bpf_xdp_trait_get(xdp, i, got, valsize));
+			ASSERT_VAL(val, got, valsize);
+		}
+	}
+
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_traits_0(struct xdp_md *xdp)
+{
+	return test(xdp, NULL, NULL, 0);
+}
+
+SEC("xdp")
+int xdp_traits_4(struct xdp_md *xdp)
+{
+	__u32 a, b;
+
+	return test(xdp, &a, &b, sizeof(a));
+}
+
+SEC("xdp")
+int xdp_traits_8(struct xdp_md *xdp)
+{
+	__u64 a, b;
+
+	return test(xdp, &a, &b, sizeof(a));
+}
+
+SEC("xdp")
+int xdp_traits_invalid_key(struct xdp_md *xdp)
+{
+	ASSERT_CALL(-EINVAL, bpf_xdp_trait_get(xdp, 65, NULL, 0));
+	ASSERT_CALL(-EINVAL, bpf_xdp_trait_set(xdp, 65, NULL, 0, 0));
+	ASSERT_CALL(-EINVAL, bpf_xdp_trait_del(xdp, 65));
+	ASSERT_CALL(-EINVAL, bpf_xdp_trait_is_set(xdp, 65));
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_traits_invalid_len(struct xdp_md *xdp)
+{
+	__u8 v;
+
+	ASSERT_CALL(-EINVAL, bpf_xdp_trait_set(xdp, 0, &v, sizeof(v), 0));
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_traits_len_too_small(struct xdp_md *xdp)
+{
+	__u8 v1;
+	__u32 v4;
+
+	ASSERT_CALL(0, bpf_xdp_trait_set(xdp, 0, &v4, sizeof(v4), 0));
+	ASSERT_CALL(-ENOSPC, bpf_xdp_trait_get(xdp, 0, &v1, sizeof(v1)));
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_traits_different_len(struct xdp_md *xdp)
+{
+	__u32 v;
+
+	ASSERT_CALL(0, bpf_xdp_trait_set(xdp, 0, &v, sizeof(v), 0));
+	ASSERT_CALL(-EBUSY, bpf_xdp_trait_set(xdp, 0, NULL, 0, 0));
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_traits_no_space(struct xdp_md *xdp)
+{
+	int i;
+	__u64 v;
+
+	for (i = 0; i < MAX_SPACE / 8; i++)
+		ASSERT_CALL(0, bpf_xdp_trait_set(xdp, i, &v, sizeof(v), 0));
+	ASSERT_CALL(-ENOSPC, bpf_xdp_trait_set(xdp, i+1, &v, sizeof(v), 0));
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_meta_then_traits(struct xdp_md *xdp)
+{
+	ASSERT_CALL(0, bpf_xdp_adjust_meta(xdp, -8));
+	ASSERT_CALL(-EOPNOTSUPP, bpf_xdp_trait_set(xdp, 0, NULL, 0, 0));
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_traits_then_meta(struct xdp_md *xdp)
+{
+	ASSERT_CALL(0, bpf_xdp_trait_set(xdp, 0, NULL, 0, 0));
+	ASSERT_CALL(-ENOTSUPP, bpf_xdp_adjust_meta(xdp, -8));
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";

-- 
2.43.0



* [PATCH RFC bpf-next v2 05/17] trait: XDP benchmark
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (3 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 04/17] trait: XDP selftest Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 06/17] trait: Replace memcpy calls with inline copies Arthur Fabre
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Benchmarks for:

* Getting a trait.
* Setting a trait, with no traits already stored after it.
* Moving traits, when setting or deleting a trait with traits stored
  after it.
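
The benchmarks plug into the existing bench binary; an illustrative
invocation (--trait-len accepts 0, 4, or 8):

    ./bench xdp-trait-set --trait-len=8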

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 tools/testing/selftests/bpf/Makefile               |   2 +
 tools/testing/selftests/bpf/bench.c                |   8 ++
 .../selftests/bpf/benchs/bench_xdp_traits.c        | 160 +++++++++++++++++++++
 .../testing/selftests/bpf/progs/bench_xdp_traits.c | 128 +++++++++++++++++
 4 files changed, 298 insertions(+)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 03663222a0a5503bc4a265301621c3f9b239537a..4f0636c488ca9cbae9a7b2a269a7bc2722d2bba5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -812,6 +812,7 @@ $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
 $(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
 $(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
 $(OUTPUT)/bench_sockmap.o: $(OUTPUT)/bench_sockmap_prog.skel.h
+$(OUTPUT)/bench_xdp_traits.o: $(OUTPUT)/bench_xdp_traits.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -833,6 +834,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_htab_mem.o \
 		 $(OUTPUT)/bench_bpf_crypto.o \
 		 $(OUTPUT)/bench_sockmap.o \
+		 $(OUTPUT)/bench_xdp_traits.o \
 		 #
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index c80df9b7509432ab836cd624bbf943d0bff008bc..d25c6309c4abfebd868809fccfdf8ecdbe5edabf 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -284,6 +284,7 @@ extern struct argp bench_htab_mem_argp;
 extern struct argp bench_trigger_batch_argp;
 extern struct argp bench_crypto_argp;
 extern struct argp bench_sockmap_argp;
+extern struct argp bench_trait_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -299,6 +300,7 @@ static const struct argp_child bench_parsers[] = {
 	{ &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 },
 	{ &bench_crypto_argp, 0, "bpf crypto benchmark", 0 },
 	{ &bench_sockmap_argp, 0, "bpf sockmap benchmark", 0 },
+	{ &bench_trait_argp, 0, "XDP trait benchmark", 0 },
 	{},
 };
 
@@ -552,6 +554,9 @@ extern const struct bench bench_htab_mem;
 extern const struct bench bench_crypto_encrypt;
 extern const struct bench bench_crypto_decrypt;
 extern const struct bench bench_sockmap;
+extern const struct bench bench_xdp_trait_get;
+extern const struct bench bench_xdp_trait_set;
+extern const struct bench bench_xdp_trait_move;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -613,6 +618,9 @@ static const struct bench *benchs[] = {
 	&bench_crypto_encrypt,
 	&bench_crypto_decrypt,
 	&bench_sockmap,
+	&bench_xdp_trait_get,
+	&bench_xdp_trait_set,
+	&bench_xdp_trait_move,
 };
 
 static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/benchs/bench_xdp_traits.c b/tools/testing/selftests/bpf/benchs/bench_xdp_traits.c
new file mode 100644
index 0000000000000000000000000000000000000000..3964002c27d454ab17fde305d0d2bbd94007cbaf
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_xdp_traits.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <argp.h>
+#include "bench.h"
+#include "bench_xdp_traits.skel.h"
+
+static struct trait_ctx {
+	struct bench_xdp_traits *skel;
+	int pfd;
+} ctx;
+
+static struct trait_args {
+	u32 trait_len;
+} args = {
+	.trait_len = 4, /* must be a valid length: 0, 4, or 8 */
+};
+
+enum {
+	ARG_TRAIT_LEN = 6000,
+};
+
+static const struct argp_option opts[] = {
+	{ "trait-len", ARG_TRAIT_LEN, "TRAIT_LEN", 0,
+	  "Set the length of the trait set/get" },
+	{},
+};
+
+static error_t trait_parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_TRAIT_LEN:
+		args.trait_len = strtoul(arg, NULL, 10);
+		if (args.trait_len != 0 && args.trait_len != 4 && args.trait_len != 8) {
+			fprintf(stderr, "Invalid trait length\n");
+			argp_usage(state);
+		}
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_trait_argp = {
+	.options = opts,
+	.parser = trait_parse_arg,
+};
+
+static void trait_validate(void)
+{
+	if (env.consumer_cnt != 0) {
+		fprintf(stderr, "bpf trait benchmark doesn't support consumer!\n");
+		exit(1);
+	}
+}
+
+static void trait_setup(void)
+{
+	int err;
+
+	setup_libbpf();
+
+	ctx.skel = bench_xdp_traits__open();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	ctx.skel->bss->trait_len = args.trait_len;
+
+	err = bench_xdp_traits__load(ctx.skel);
+	if (err) {
+		fprintf(stderr, "failed to load skeleton\n");
+		bench_xdp_traits__destroy(ctx.skel);
+		exit(1);
+	}
+}
+
+static void trait_get_setup(void)
+{
+	trait_setup();
+	ctx.pfd = bpf_program__fd(ctx.skel->progs.trait_get);
+}
+
+static void trait_set_setup(void)
+{
+	trait_setup();
+	ctx.pfd = bpf_program__fd(ctx.skel->progs.trait_set);
+}
+
+static void trait_move_setup(void)
+{
+	trait_setup();
+	ctx.pfd = bpf_program__fd(ctx.skel->progs.trait_move);
+}
+
+static void trait_measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
+}
+
+static void *trait_producer(void *unused)
+{
+	static char in[14];
+	int err;
+
+	LIBBPF_OPTS(bpf_test_run_opts, opts,
+		.data_in = in,
+		.data_size_in = sizeof(in),
+		.repeat = 256, // max
+	);
+
+	while (true) {
+		err = bpf_prog_test_run_opts(ctx.pfd, &opts);
+		if (err != 0) {
+			fprintf(stderr, "failed to prog_run: %d\n", err);
+			return NULL;
+		}
+		if (opts.retval != 0) {
+			fprintf(stderr, "prog didn't return 0: %d\n", opts.retval);
+			return NULL;
+		}
+	}
+
+	return NULL;
+}
+
+const struct bench bench_xdp_trait_get = {
+	.name = "xdp-trait-get",
+	.argp = &bench_trait_argp,
+	.validate = trait_validate,
+	.setup = trait_get_setup,
+	.producer_thread = trait_producer,
+	.measure = trait_measure,
+	.report_progress = ops_report_progress,
+	.report_final = ops_report_final,
+};
+
+const struct bench bench_xdp_trait_set = {
+	.name = "xdp-trait-set",
+	.argp = &bench_trait_argp,
+	.validate = trait_validate,
+	.setup = trait_set_setup,
+	.producer_thread = trait_producer,
+	.measure = trait_measure,
+	.report_progress = ops_report_progress,
+	.report_final = ops_report_final,
+};
+
+const struct bench bench_xdp_trait_move = {
+	.name = "xdp-trait-move",
+	.argp = &bench_trait_argp,
+	.validate = trait_validate,
+	.setup = trait_move_setup,
+	.producer_thread = trait_producer,
+	.measure = trait_measure,
+	.report_progress = ops_report_progress,
+	.report_final = ops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/progs/bench_xdp_traits.c b/tools/testing/selftests/bpf/progs/bench_xdp_traits.c
new file mode 100644
index 0000000000000000000000000000000000000000..70949f8e78ead1e1e8a0172ed9388183344c1e11
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bench_xdp_traits.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <errno.h>
+
+extern int bpf_xdp_trait_set(const struct xdp_md *xdp, __u64 key,
+			     const void *val, __u64 val__sz,
+			     __u64 flags) __ksym;
+extern int bpf_xdp_trait_get(const struct xdp_md *xdp, __u64 key, void *val,
+			     __u64 val__sz) __ksym;
+extern int bpf_xdp_trait_del(const struct xdp_md *xdp, __u64 key) __ksym;
+
+__u32 trait_len;
+long hits = 0;
+
+#define ITERATIONS (10000)
+
+SEC("xdp")
+int trait_get(struct xdp_md *xdp)
+{
+	int ret, i;
+	__u32 val4 = 0xdeadbeef;
+	__u64 val8 = 0xdeadbeefafafcfcf;
+
+#define BENCH_GET(val, size) do { \
+		ret = bpf_xdp_trait_set(xdp, 32, val, size, 0);			\
+		if (ret != 0)							\
+			return ret;						\
+		for (i = 0; i < ITERATIONS; i++)				\
+			ret = bpf_xdp_trait_get(xdp, 32, val, size);		\
+		if (ret != size)						\
+			return ret;						\
+	} while (0)
+
+	switch (trait_len) {
+	case 0:
+		BENCH_GET(NULL, 0);
+		break;
+	case 4:
+		BENCH_GET(&val4, sizeof(val4));
+		break;
+	case 8:
+		BENCH_GET(&val8, sizeof(val8));
+		break;
+	}
+
+	__sync_add_and_fetch(&hits, ITERATIONS);
+	return 0;
+}
+
+SEC("xdp")
+int trait_set(struct xdp_md *xdp)
+{
+	int ret, i;
+	__u32 val4 = 0xdeadbeef;
+	__u64 val8 = 0xdeadbeefafafcfcf;
+
+#define BENCH_SET(val, size) do { \
+		for (i = 0; i < ITERATIONS; i++)			\
+			ret = bpf_xdp_trait_set(xdp, 32, val, size, 0);	\
+	} while (0)
+
+	switch (trait_len) {
+	case 0:
+		BENCH_SET(NULL, 0);
+		break;
+	case 4:
+		BENCH_SET(&val4, sizeof(val4));
+		break;
+	case 8:
+		BENCH_SET(&val8, sizeof(val8));
+		break;
+	}
+
+	if (ret != 0)
+		return ret;
+
+	__sync_add_and_fetch(&hits, ITERATIONS);
+	return 0;
+}
+
+SEC("xdp")
+int trait_move(struct xdp_md *xdp)
+{
+	int ret, ret_del, i;
+	__u32 val4 = 0xdeadbeef;
+	__u64 val8 = 0xdeadbeefafafcfcf;
+
+#define BENCH_MOVE(val, size) do { \
+		for (i = 0; i < 8; i++)	{						\
+			ret = bpf_xdp_trait_set(xdp, 40+i, &val8, sizeof(val8), 0);	\
+			if (ret != 0)							\
+				return ret;						\
+		}									\
+		/* We do two operations per iteration, so do half as many to make it
+		 * vaguely comparable with other benchmarks
+		 */									\
+		for (i = 0; i < ITERATIONS/2; i++) {					\
+			/* Need to delete after, otherwise we'll just overwrite an
+			 * existing trait and there will be nothing to move.
+			 */								\
+			ret = bpf_xdp_trait_set(xdp, 32, val, size, 0);			\
+			ret_del = bpf_xdp_trait_del(xdp, 32);				\
+		}									\
+	} while (0)
+
+	switch (trait_len) {
+	case 0:
+		BENCH_MOVE(NULL, 0);
+		break;
+	case 4:
+		BENCH_MOVE(&val4, sizeof(val4));
+		break;
+	case 8:
+		BENCH_MOVE(&val8, sizeof(val8));
+		break;
+	}
+
+	if (ret != 0)
+		return ret;
+	if (ret_del != 0)
+		return ret_del;
+
+	__sync_add_and_fetch(&hits, ITERATIONS);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";

-- 
2.43.0



* [PATCH RFC bpf-next v2 06/17] trait: Replace memcpy calls with inline copies
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (4 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 05/17] trait: XDP benchmark Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 07/17] trait: Replace memmove calls with inline move Arthur Fabre
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

When copying trait values to or from the caller, the size isn't a
constant so memcpy() ends up being a function call.

Replace it with an inline implementation that only handles the sizes we
support.

We store values "packed", so they won't necessarily be 4 or 8 byte
aligned. For example, a 4 byte value for key 0 sits right after the
16 byte header, so an 8 byte value for key 1 then starts at offset 20
into the trait area, which isn't 8 byte aligned.

Setting and getting traits is roughly ~40% faster.

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 include/net/trait.h | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/net/trait.h b/include/net/trait.h
index af42c1ad2416d5c38631f1f0305ef9fefe43bd87..4013351549731c4e3bede211dbe9fbe651556dc9 100644
--- a/include/net/trait.h
+++ b/include/net/trait.h
@@ -7,6 +7,7 @@
 #include <linux/errno.h>
 #include <linux/string.h>
 #include <linux/bitops.h>
+#include <linux/unaligned.h>
 
 /* Traits are a very limited KV store, with:
  * - 64 keys (0-63).
@@ -144,23 +145,21 @@ int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u
 			memmove(traits + off + len, traits + off, traits_size(traits) - off);
 	}
 
-	/* Set our value. */
-	memcpy(traits + off, val, len);
-
-	/* Store our length in header. */
 	u64 encode_len = 0;
-
 	switch (len) {
 	case 0:
 		encode_len = 1;
 		break;
 	case 4:
+		put_unaligned(*(u32 *)val, (u32 *)(traits + off));
 		encode_len = 2;
 		break;
 	case 8:
+		put_unaligned(*(u64 *)val, (u64 *)(traits + off));
 		encode_len = 3;
 		break;
 	}
+
 	h->high |= (encode_len >> 1) << key;
 	h->low |= (encode_len & 1) << key;
 	return 0;
@@ -195,7 +194,7 @@ int trait_is_set(void *traits, u64 key)
  * @val_len: The length of val.
  *
  * Return:
- * * %>0      - Actual size of value.
+ * * %>=0     - Actual size of value.
  * * %-EINVAL - Key or length invalid.
  * * %-ENOENT - Key has not been set with trait_set() previously.
  * * %-ENOSPC - Val is not big enough to hold stored value.
@@ -221,7 +220,15 @@ int trait_get(void *traits, u64 key, void *val, u64 val_len)
 	if (real_len > val_len)
 		return -ENOSPC;
 
-	memcpy(val, traits + off, real_len);
+	switch (real_len) {
+	case 4:
+		*(u32 *)val = get_unaligned((u32 *)(traits + off));
+		break;
+	case 8:
+		*(u64 *)val = get_unaligned((u64 *)(traits + off));
+		break;
+	}
+
 	return real_len;
 }
 

-- 
2.43.0



* [PATCH RFC bpf-next v2 07/17] trait: Replace memmove calls with inline move
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (5 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 06/17] trait: Replace memcpy calls with inline copies Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 08/17] skb: Extension header in packet headroom Arthur Fabre
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

When inserting or deleting traits, we need to move any subsequent
traits over.

Replace the memmove() calls with an inline implementation to avoid
the function call overhead, which is especially expensive on AMD with
SRSO.

In practice we shouldn't have too much data to move around, and we're
naturally limited to 238 bytes max, so a dumb implementation should
hopefully be fast enough.
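
As an illustration of when a move happens: with 8 byte values already
stored for keys 1 and 2, setting a 4 byte value for key 0 shifts the
16 bytes of existing values up by 4 to keep values ordered by key:

    before: | hdr | key 1 (8B) | key 2 (8B) |
    after:  | hdr | key 0 (4B) | key 1 (8B) | key 2 (8B) |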

Jesper Brouer kindly ran benchmarks on real hardware with three configs:
- Intel: E5-1650 v4
- AMD SRSO: 9684X SRSO
- AMD IBPB: 9684X SRSO=IBPB

		Intel	AMD IBPB	AMD SRSO
xdp-trait-get	5.530	3.901		9.188		(ns/op)
xdp-trait-set	7.538	4.941		10.050		(ns/op)
xdp-trait-move	14.245	8.865		14.834		(ns/op)
function call	1.319	1.359		5.703		(ns/op)
indirect call	8.922	6.251		10.329		(ns/op)

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 include/net/trait.h | 40 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/net/trait.h b/include/net/trait.h
index 4013351549731c4e3bede211dbe9fbe651556dc9..1fc5f773ab9af689ac0f6e29fd3c1e62c04cfff8 100644
--- a/include/net/trait.h
+++ b/include/net/trait.h
@@ -74,6 +74,40 @@ static __always_inline int __trait_offset(struct __trait_hdr h, u64 key)
 	return sizeof(struct __trait_hdr) + __trait_total_length(__trait_and(h, ~(~0llu << key)));
 }
 
+/* Avoid overhead of memmove() function call when possible. */
+static __always_inline void __trait_move(void *src, int off, size_t n)
+{
+	if (n == 0)
+		return;
+
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) || BITS_PER_LONG != 64) {
+		memmove(src + off, src, n);
+		return;
+	}
+
+	/* Need to move in reverse to handle overlap. */
+	if (off > 0)
+		src += n;
+
+#define ___trait_move(op) do { \
+		src -= (off > 0) ? sizeof(u##op) : 0; \
+		*(u##op *)(src + off) = *(u##op *)src; \
+		src += (off < 0) ? sizeof(u##op) : 0; \
+	} while (0)
+
+	for (int w = 0; w < n / 8; w++)
+		___trait_move(64);
+
+	if (n & 4)
+		___trait_move(32);
+
+	if (n & 2)
+		___trait_move(16);
+
+	if (n & 1)
+		___trait_move(8);
+}
+
 /**
  * traits_init() - Initialize a trait store.
  * @traits: Start of trait store area.
@@ -141,8 +175,7 @@ int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u
 			return -ENOSPC;
 
 		/* Memmove all the kvs after us over. */
-		if (traits_size(traits) > off)
-			memmove(traits + off + len, traits + off, traits_size(traits) - off);
+		__trait_move(traits + off, len, traits_size(traits) - off);
 	}
 
 	u64 encode_len = 0;
@@ -258,8 +291,7 @@ static __always_inline int trait_del(void *traits, u64 key)
 	int len = __trait_total_length(__trait_and(*h, (1ull << key)));
 
 	/* Memmove all the kvs after us over */
-	if (traits_size(traits) > off + len)
-		memmove(traits + off, traits + off + len, traits_size(traits) - off - len);
+	__trait_move(traits + off + len, -len, traits_size(traits) - off - len);
 
 	/* Clear our length in header */
 	h->high &= ~(1ull << key);

-- 
2.43.0



* [PATCH RFC bpf-next v2 08/17] skb: Extension header in packet headroom
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (6 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 07/17] trait: Replace memmove calls with inline move Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 09/17] trait: Store traits in sk_buff extension Arthur Fabre
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

We want to store traits in an skb extension header (see next commit).
To avoid having to allocate memory separately, and copy the traits from
the headroom to the allocated memory, allow extensions to be stored
directly in the headroom.

Only one extension may be stored in the headroom at a time; adding
another extension will allocate memory and copy the existing
headroom-backed extension over.

To save space, headroom-backed extensions do not use pre-declared
sizes: the amount of headroom to use is controlled at runtime. This
preserves headroom for other uses, like encap headers.
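
With a headroom-backed extension the buffer layout becomes (sketch;
skb_headroom() is adjusted to report only the space left after the
extension):

    skb->head
    v
    | struct skb_ext + ext data | headroom | packet data | tail |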

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 include/linux/skbuff.h |  22 ++++++++++
 net/core/skbuff.c      | 115 ++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 121 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b974a277975a8a7b6f40c362542e9e8522539009..4f325a1733f8be808345424f737df36e06d4cd98 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2829,6 +2829,10 @@ static inline void *pskb_pull(struct sk_buff *skb, unsigned int len)
 
 void skb_condense(struct sk_buff *skb);
 
+#ifdef CONFIG_SKB_EXTENSIONS
+int skb_ext_headroom_used(const struct sk_buff *skb);
+#endif
+
 /**
  *	skb_headroom - bytes at buffer head
  *	@skb: buffer to check
@@ -2837,7 +2841,11 @@ void skb_condense(struct sk_buff *skb);
  */
 static inline unsigned int skb_headroom(const struct sk_buff *skb)
 {
+#ifdef CONFIG_SKB_EXTENSIONS
+	return skb->data - skb->head - skb_ext_headroom_used(skb);
+#else
 	return skb->data - skb->head;
+#endif
 }
 
 /**
@@ -4830,10 +4838,21 @@ enum skb_ext_id {
 	SKB_EXT_NUM, /* must be last */
 };
 
+enum skb_ext_mode {
+	/* struct skb_ext heap allocated */
+	SKB_EXT_ALLOC,
+	/* struct skb_ext in skb data headroom.
+	 * Only a single extension can be used at a time.
+	 * Size given to extension is dynamic.
+	 */
+	SKB_EXT_HEADROOM,
+};
+
 /**
  *	struct skb_ext - sk_buff extensions
  *	@refcnt: 1 on allocation, deallocated on 0
  *	@offset: offset to add to @data to obtain extension address
+ *	@mode: see enum skb_ext_mode
  *	@chunks: size currently allocated, stored in SKB_EXT_ALIGN_SHIFT units
  *	@data: start of extension data, variable sized
  *
@@ -4844,6 +4863,7 @@ struct skb_ext {
 	refcount_t refcnt;
 	u8 offset[SKB_EXT_NUM]; /* in chunks of 8 bytes */
 	u8 chunks;		/* same */
+	enum skb_ext_mode mode;
 	char data[] __aligned(8);
 };
 
@@ -4851,8 +4871,10 @@ struct skb_ext *__skb_ext_alloc(gfp_t flags);
 void *__skb_ext_set(struct sk_buff *skb, enum skb_ext_id id,
 		    struct skb_ext *ext);
 void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id);
+void *skb_ext_from_headroom(struct sk_buff *skb, enum skb_ext_id id, int head_offset, int size);
 void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id);
 void __skb_ext_put(struct skb_ext *ext);
+int __skb_ext_total_size(const struct skb_ext *ext);
 
 static inline void skb_ext_put(struct sk_buff *skb)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6cbf77bc61fce74c934628fd74b3a2cb7809e464..34dae26dac4cea448f84dacabdc8fb6f4ac3c1a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2232,7 +2232,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	unsigned int osize = skb_end_offset(skb);
 	unsigned int size = osize + nhead + ntail;
 	long off;
-	u8 *data;
+	u8 *data, *head;
 	int i;
 
 	BUG_ON(nhead < 0);
@@ -2249,10 +2249,20 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		goto nodata;
 	size = SKB_WITH_OVERHEAD(size);
 
+	head = skb->head;
+#ifdef CONFIG_SKB_EXTENSIONS
+	/* Keep skb_ext at the start of the new headroom */
+	if (skb_has_extensions(skb) && skb->extensions->mode == SKB_EXT_HEADROOM) {
+		head += __skb_ext_total_size(skb->extensions);
+		memcpy(data, skb->extensions, __skb_ext_total_size(skb->extensions));
+		skb->extensions = (struct skb_ext *)data;
+	}
+#endif
+
 	/* Copy only real data... and, alas, header. This should be
 	 * optimized for the cases when header is void.
 	 */
-	memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
+	memcpy(data + nhead, head, skb_tail_pointer(skb) - head);
 
 	memcpy((struct skb_shared_info *)(data + size),
 	       skb_shinfo(skb),
@@ -5013,7 +5023,9 @@ EXPORT_SYMBOL_GPL(skb_segment);
 
 #ifdef CONFIG_SKB_EXTENSIONS
 #define SKB_EXT_ALIGN_VALUE	8
-#define SKB_EXT_CHUNKSIZEOF(x)	(ALIGN((sizeof(x)), SKB_EXT_ALIGN_VALUE) / SKB_EXT_ALIGN_VALUE)
+#define SKB_EXT_CHUNKS(x)	(ALIGN((x), SKB_EXT_ALIGN_VALUE) / SKB_EXT_ALIGN_VALUE)
+#define SKB_EXT_CHUNKSIZEOF(x)	(SKB_EXT_CHUNKS(sizeof(x)))
+#define SKB_EXT_CHUNKS_BYTES(x) ((x) * SKB_EXT_ALIGN_VALUE)
 
 static const u8 skb_ext_type_len[] = {
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
@@ -5033,7 +5045,7 @@ static const u8 skb_ext_type_len[] = {
 #endif
 };
 
-static __always_inline unsigned int skb_ext_total_length(void)
+static __always_inline unsigned int skb_ext_alloc_length(void)
 {
 	unsigned int l = SKB_EXT_CHUNKSIZEOF(struct skb_ext);
 	int i;
@@ -5048,11 +5060,11 @@ static void skb_extensions_init(void)
 {
 	BUILD_BUG_ON(SKB_EXT_NUM >= 8);
 #if !IS_ENABLED(CONFIG_KCOV_INSTRUMENT_ALL)
-	BUILD_BUG_ON(skb_ext_total_length() > 255);
+	BUILD_BUG_ON(skb_ext_alloc_length() > 255);
 #endif
 
 	skbuff_ext_cache = kmem_cache_create("skbuff_ext_cache",
-					     SKB_EXT_ALIGN_VALUE * skb_ext_total_length(),
+					     SKB_EXT_CHUNKS_BYTES(skb_ext_alloc_length()),
 					     0,
 					     SLAB_HWCACHE_ALIGN|SLAB_PANIC,
 					     NULL);
@@ -6934,7 +6946,7 @@ EXPORT_SYMBOL(skb_condense);
 #ifdef CONFIG_SKB_EXTENSIONS
 static void *skb_ext_get_ptr(struct skb_ext *ext, enum skb_ext_id id)
 {
-	return (void *)ext + (ext->offset[id] * SKB_EXT_ALIGN_VALUE);
+	return (void *)ext + SKB_EXT_CHUNKS_BYTES(ext->offset[id]);
 }
 
 /**
@@ -6963,14 +6975,15 @@ static struct skb_ext *skb_ext_maybe_cow(struct skb_ext *old,
 {
 	struct skb_ext *new;
 
-	if (refcount_read(&old->refcnt) == 1)
+	/* SKB_EXT_HEADROOM only supports a single extension, always copy */
+	if (refcount_read(&old->refcnt) == 1 && old->mode == SKB_EXT_ALLOC)
 		return old;
 
 	new = kmem_cache_alloc(skbuff_ext_cache, GFP_ATOMIC);
 	if (!new)
 		return NULL;
 
-	memcpy(new, old, old->chunks * SKB_EXT_ALIGN_VALUE);
+	memcpy(new, old, SKB_EXT_CHUNKS_BYTES(old->chunks));
 	refcount_set(&new->refcnt, 1);
 
 #ifdef CONFIG_XFRM
@@ -7018,6 +7031,14 @@ void *__skb_ext_set(struct sk_buff *skb, enum skb_ext_id id,
 	return skb_ext_get_ptr(ext, id);
 }
 
+static void *__skb_ext_set_active(struct sk_buff *skb, struct skb_ext *ext, enum skb_ext_id id)
+{
+	skb->slow_gro = 1;
+	skb->extensions = ext;
+	skb->active_extensions |= 1 << id;
+	return skb_ext_get_ptr(ext, id);
+}
+
 /**
  * skb_ext_add - allocate space for given extension, COW if needed
  * @skb: buffer
@@ -7045,7 +7066,7 @@ void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id)
 			return NULL;
 
 		if (__skb_ext_exist(new, id))
-			goto set_active;
+			return __skb_ext_set_active(skb, new, id);
 
 		newoff = new->chunks;
 	} else {
@@ -7059,14 +7080,70 @@ void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id)
 	newlen = newoff + skb_ext_type_len[id];
 	new->chunks = newlen;
 	new->offset[id] = newoff;
-set_active:
-	skb->slow_gro = 1;
-	skb->extensions = new;
-	skb->active_extensions |= 1 << id;
-	return skb_ext_get_ptr(new, id);
+
+	return __skb_ext_set_active(skb, new, id);
 }
 EXPORT_SYMBOL(skb_ext_add);
 
+/**
+ * skb_ext_headroom_used - Number of headroom bytes used by extensions.
+ * @skb: buffer
+ *
+ * Returns the number of headroom bytes currently used by extensions.
+ */
+int skb_ext_headroom_used(const struct sk_buff *skb)
+{
+	if (!skb->active_extensions)
+		return 0;
+	if (skb->extensions->mode != SKB_EXT_HEADROOM)
+		return 0;
+
+	return SKB_EXT_CHUNKS_BYTES(skb->extensions->chunks);
+}
+
+/**
+ * skb_ext_from_headroom - store skb_ext in the packet headroom,
+ * and reuse headroom data for a single extension.
+ * @skb: buffer
+ * @id: extension to use headroom data for
+ * @head_offset: offset in bytes to start of headroom data to reuse
+ *               Must be a multiple of SKB_EXT_ALIGN_VALUE.
+ *               Must be at least sizeof(struct skb_ext).
+ * @size: size bytes following head_offset will be used for the
+ *        extension.
+ *
+ * Reuses the packet headroom to avoid a separate memory allocation:
+ * - struct skb_ext is stored at the start of the headroom.
+ * - bytes head_offset to head_offset + size are given to the extension.
+ *
+ * Returns pointer to the extension or NULL on failure.
+ */
+void *skb_ext_from_headroom(struct sk_buff *skb, enum skb_ext_id id, int head_offset, int size)
+{
+	struct skb_ext *new;
+	unsigned int newoff;
+
+	if (head_offset % SKB_EXT_ALIGN_VALUE != 0)
+		return NULL;
+	if (head_offset < sizeof(struct skb_ext))
+		return NULL;
+	if (skb_headroom(skb) < head_offset + size)
+		return NULL;
+	if (skb->active_extensions)
+		return NULL;
+
+	new = (struct skb_ext *)skb->head;
+	new->mode = SKB_EXT_HEADROOM;
+	memset(new->offset, 0, sizeof(new->offset));
+	refcount_set(&new->refcnt, 1);
+
+	newoff = SKB_EXT_CHUNKS(head_offset);
+	new->chunks = newoff + SKB_EXT_CHUNKS(size);
+	new->offset[id] = newoff;
+
+	return __skb_ext_set_active(skb, new, id);
+}
+
 #ifdef CONFIG_XFRM
 static void skb_ext_put_sp(struct sec_path *sp)
 {
@@ -7125,9 +7202,15 @@ void __skb_ext_put(struct skb_ext *ext)
 		skb_ext_put_mctp(skb_ext_get_ptr(ext, SKB_EXT_MCTP));
 #endif
 
-	kmem_cache_free(skbuff_ext_cache, ext);
+	if (ext->mode == SKB_EXT_ALLOC)
+		kmem_cache_free(skbuff_ext_cache, ext);
 }
 EXPORT_SYMBOL(__skb_ext_put);
+
+int __skb_ext_total_size(const struct skb_ext *ext)
+{
+	return SKB_EXT_CHUNKS_BYTES(ext->chunks);
+}
 #endif /* CONFIG_SKB_EXTENSIONS */
 
 static void kfree_skb_napi_cache(struct sk_buff *skb)

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH RFC bpf-next v2 09/17] trait: Store traits in sk_buff extension
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (7 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 08/17] skb: Extension header in packet headroom Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb Arthur Fabre
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Use a bit in sk_buff to track whether an skb stores traits in
its headroom.

It's tempting to store it in skb_shared_info like the XDP metadata,
but storing it in the skb allows us to more easily handle cases such as
skb clones a few patches later.

(I couldn't find an existing hole to use in sk_buff, so this makes a
4-byte hole... any pointers to a free bit?)

The following patches add support for a handful of drivers; in the final
form we'd like to cover all drivers that currently support XDP metadata.
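
For illustration, a consumer elsewhere in the stack could then test for
traits roughly like this (a sketch, not part of the patch: key 1, the u32
value, and trait_get() returning the number of bytes read are assumptions):

static void example_consume_traits(const struct sk_buff *skb)
{
	void *traits = skb_traits(skb);
	u32 id;

	if (!traits)
		return;	/* no SKB_EXT_TRAITS extension on this skb */

	if (trait_get(traits, 1, &id, sizeof(id)) > 0)
		pr_debug("packet trait 1 = %u\n", id);
}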

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 include/linux/skbuff.h | 11 +++++++++++
 include/net/xdp.h      | 21 +++++++++++++++++++++
 net/core/skbuff.c      |  4 ++++
 3 files changed, 36 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4f325a1733f8be808345424f737df36e06d4cd98..ae569b1b1af83b586e1be6c69439ef74bac38cf3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -39,6 +39,7 @@
 #include <net/net_debug.h>
 #include <net/dropreason-core.h>
 #include <net/netmem.h>
+#include <net/trait.h>
 
 /**
  * DOC: skb checksums
@@ -4835,6 +4836,7 @@ enum skb_ext_id {
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
 	SKB_EXT_MCTP,
 #endif
+	SKB_EXT_TRAITS,
 	SKB_EXT_NUM, /* must be last */
 };
 
@@ -4949,6 +4951,15 @@ static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) {}
 static inline bool skb_has_extensions(struct sk_buff *skb) { return false; }
 #endif /* CONFIG_SKB_EXTENSIONS */
 
+static inline void *skb_traits(const struct sk_buff *skb)
+{
+#ifdef CONFIG_SKB_EXTENSIONS
+	return skb_ext_find(skb, SKB_EXT_TRAITS);
+#else
+	return NULL;
+#endif
+}
+
 static inline void nf_reset_ct(struct sk_buff *skb)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 0da1e87afdebfd4323d1944de65a6d63438209bf..cddd5b147cf81fa8afd820c9aa4b86d78958ac3d 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -127,6 +127,17 @@ static __always_inline void *xdp_buff_traits(const struct xdp_buff *xdp)
 	return xdp->data_hard_start + _XDP_FRAME_SIZE;
 }
 
+static __always_inline void __xdp_flags_update_skb(u32 flags, struct sk_buff *skb, void *traits)
+{
+	if (flags & XDP_FLAGS_TRAITS_SUPPORTED)
+		skb_ext_from_headroom(skb, SKB_EXT_TRAITS, _XDP_FRAME_SIZE, traits_size(traits));
+}
+
+static __always_inline void xdp_buff_update_skb(const struct xdp_buff *xdp, struct sk_buff *skb)
+{
+	__xdp_flags_update_skb(xdp->flags, skb, xdp_buff_traits(xdp));
+}
+
 static __always_inline void
 xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq)
 {
@@ -302,6 +313,16 @@ xdp_frame_is_frag_pfmemalloc(const struct xdp_frame *frame)
 	return !!(frame->flags & XDP_FLAGS_FRAGS_PF_MEMALLOC);
 }
 
+static __always_inline void *xdp_frame_traits(const struct xdp_frame *frame)
+{
+	return frame->data + _XDP_FRAME_SIZE;
+}
+
+static __always_inline void xdp_frame_update_skb(struct xdp_frame *frame, struct sk_buff *skb)
+{
+	__xdp_flags_update_skb(frame->flags, skb, xdp_frame_traits(frame));
+}
+
 #define XDP_BULK_QUEUE_SIZE	16
 struct xdp_frame_bulk {
 	int count;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 34dae26dac4cea448f84dacabdc8fb6f4ac3c1a6..9f27caba82af0be7897b68b5fad087f3e9c62955 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5043,6 +5043,10 @@ static const u8 skb_ext_type_len[] = {
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
 	[SKB_EXT_MCTP] = SKB_EXT_CHUNKSIZEOF(struct mctp_flow),
 #endif
+	/* TODO: skb_ext is slab allocated with room for all extensions,
+	 * this makes it a LOT bigger.
+	 */
+	[SKB_EXT_TRAITS] = SKB_EXT_CHUNKS(XDP_PACKET_HEADROOM),
 };
 
 static __always_inline unsigned int skb_ext_alloc_length(void)

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (8 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 09/17] trait: Store traits in sk_buff extension Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-23 16:36   ` Stanislav Fomichev
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 11/17] ice: " Arthur Fabre
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Call the common xdp_buff_update_skb() helper.

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c8e3468eee612ad622bfbecfd7cc1ae3396061fd..0eba3e307a3edbc5fe1abf2fa45e6256d98574c2 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2297,6 +2297,10 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 			}
 		}
 	}
+
+	if (xdp_active)
+		xdp_buff_update_skb(&xdp, skb);
+
 	bnxt_deliver_skb(bp, bnapi, skb);
 	rc = 1;
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH RFC bpf-next v2 11/17] ice: Propagate trait presence to skb
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (9 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 12/17] veth: " Arthur Fabre
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Call the common xdp_buff_update_skb() helper.

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 4 ++++
 drivers/net/ethernet/intel/ice/ice_xsk.c  | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 1e4f6f6ee449c96ed1afe6fa4c677a7b2caf53e8..ff2d79b4232181be05f413e451ac0b05e0c65bfd 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1011,6 +1011,8 @@ ice_build_skb(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp)
 					   nr_frags * xdp->frame_sz,
 					   xdp_buff_is_frag_pfmemalloc(xdp));
 
+	xdp_buff_update_skb(xdp, skb);
+
 	return skb;
 }
 
@@ -1092,6 +1094,8 @@ ice_construct_skb(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp)
 					   xdp_buff_is_frag_pfmemalloc(xdp));
 	}
 
+	xdp_buff_update_skb(xdp, skb);
+
 	return skb;
 }
 
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index a3a4eaa17739ae084ccbe2cc75f3d10ca1be3fdc..84ff6b585967e04de3cdd6230ff7d1327c08fae0 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -573,6 +573,8 @@ ice_construct_skb_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp)
 		__skb_pull(skb, metasize);
 	}
 
+	xdp_buff_update_skb(xdp, skb);
+
 	if (likely(!xdp_buff_has_frags(xdp)))
 		goto out;
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH RFC bpf-next v2 12/17] veth: Propagate trait presence to skb
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (10 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 11/17] ice: " Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 13/17] virtio_net: " Arthur Fabre
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Call the common xdp_buff_update_skb() / xdp_frame_update_skb() helpers.

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 drivers/net/veth.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7bb53961c0eaf4de642d20e49c9e47770dc367dd..a0f70aa145790700eed37349b337b723edc7612e 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -703,6 +703,8 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
 			stats->rx_drops++;
 			continue;
 		}
+
+		xdp_frame_update_skb(frames[i], skb);
 		napi_gro_receive(&rq->xdp_napi, skb);
 	}
 }
@@ -855,6 +857,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	metalen = xdp->data - xdp->data_meta;
 	if (metalen)
 		skb_metadata_set(skb, metalen);
+
+	xdp_buff_update_skb(xdp, skb);
 out:
 	return skb;
 drop:

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH RFC bpf-next v2 13/17] virtio_net: Propagate trait presence to skb
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (11 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 12/17] veth: " Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 14/17] mlx5: move xdp_buff scope one level up Arthur Fabre
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Call the common xdp_buff_update_skb() helper.

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 drivers/net/virtio_net.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7e4617216a4bd5bd52557109e252e2e00268062b..a709587382e97b8f1518b58301807dbbb00f1a0f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1186,6 +1186,7 @@ static struct sk_buff *virtnet_receive_xsk_small(struct net_device *dev, struct
 						 struct virtnet_rq_stats *stats)
 {
 	struct bpf_prog *prog;
+	struct sk_buff *skb;
 	u32 ret;
 
 	ret = XDP_PASS;
@@ -1197,7 +1198,9 @@ static struct sk_buff *virtnet_receive_xsk_small(struct net_device *dev, struct
 
 	switch (ret) {
 	case XDP_PASS:
-		return xsk_construct_skb(rq, xdp);
+		skb = xsk_construct_skb(rq, xdp);
+		xdp_buff_update_skb(xdp, skb);
+		return skb;
 
 	case XDP_TX:
 	case XDP_REDIRECT:
@@ -1319,6 +1322,7 @@ static struct sk_buff *virtnet_receive_xsk_merge(struct net_device *dev, struct
 		skb = xsk_construct_skb(rq, xdp);
 		if (!skb)
 			goto drop_bufs;
+		xdp_buff_update_skb(xdp, skb);
 
 		if (xsk_append_merge_buffer(vi, rq, skb, num_buf, hdr, stats)) {
 			dev_kfree_skb(skb);
@@ -1952,6 +1956,7 @@ static struct sk_buff *receive_small_xdp(struct net_device *dev,
 	skb = virtnet_build_skb(buf, buflen, xdp.data - buf, len);
 	if (unlikely(!skb))
 		goto err;
+	xdp_buff_update_skb(&xdp, skb);
 
 	if (metasize)
 		skb_metadata_set(skb, metasize);
@@ -2320,6 +2325,7 @@ static struct sk_buff *receive_mergeable_xdp(struct net_device *dev,
 		head_skb = build_skb_from_xdp_buff(dev, vi, &xdp, xdp_frags_truesz);
 		if (unlikely(!head_skb))
 			break;
+		xdp_buff_update_skb(&xdp, head_skb);
 		return head_skb;
 
 	case XDP_TX:

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH RFC bpf-next v2 14/17] mlx5: move xdp_buff scope one level up
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (12 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 13/17] virtio_net: " Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 15/17] mlx5: Propagate trait presence to skb Arthur Fabre
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet

From: Jesper Dangaard Brouer <hawk@kernel.org>

This is in preparation for the next patch, which needs the xdp_buff when
building the skb.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.c    |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.h    |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 103 +++++++++++----------
 4 files changed, 66 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 32ed4963b8adaa8111f1b8a83735773b1a6da19b..95d1e61bcb23205caeedadac1b6f0e07832ed48f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -585,14 +585,16 @@ struct mlx5e_mpw_info {
 #define MLX5E_MAX_RX_FRAGS 4
 
 struct mlx5e_rq;
+struct mlx5e_xdp_buff;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 			       struct mlx5_cqe64 *cqe, u16 cqe_bcnt,
-			       u32 head_offset, u32 page_idx);
+			       u32 head_offset, u32 page_idx,
+			       struct mlx5e_xdp_buff *mxbuf);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			 struct mlx5_cqe64 *cqe, u32 cqe_bcnt);
+			 struct mlx5_cqe64 *cqe, u32 cqe_bcnt, struct mlx5e_xdp_buff *mxbuf);
 typedef bool (*mlx5e_fp_post_rx_wqes)(struct mlx5e_rq *rq);
 typedef void (*mlx5e_fp_dealloc_wqe)(struct mlx5e_rq*, u16);
 typedef void (*mlx5e_fp_shampo_dealloc_hd)(struct mlx5e_rq*, u16, u16, bool);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 2b05536d564ac795c5f1dabd81444db40beeaa6a..8fe7c36de3140c66f198de5cff0d212d548d7eff 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -249,7 +249,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
-						    u32 page_idx)
+						    u32 page_idx,
+						    struct mlx5e_xdp_buff *mxbuf_)
 {
 	struct mlx5e_xdp_buff *mxbuf = xsk_buff_to_mxbuf(wi->alloc_units.xsk_buffs[page_idx]);
 	struct bpf_prog *prog;
@@ -304,7 +305,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
 					      struct mlx5_cqe64 *cqe,
-					      u32 cqe_bcnt)
+					      u32 cqe_bcnt,
+					      struct mlx5e_xdp_buff *mxbuf_)
 {
 	struct mlx5e_xdp_buff *mxbuf = xsk_buff_to_mxbuf(*wi->xskp);
 	struct bpf_prog *prog;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
index cefc0ef6105d24705bdd450cfd7857435a9d0c67..0890c975042c7f58e512a61fc538f21b5e37c6b4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
@@ -16,10 +16,12 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
-						    u32 page_idx);
+						    u32 page_idx,
+						    struct mlx5e_xdp_buff *mxbuf_);
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
 					      struct mlx5_cqe64 *cqe,
-					      u32 cqe_bcnt);
+					      u32 cqe_bcnt,
+					      struct mlx5e_xdp_buff *mxbuf_);
 
 #endif /* __MLX5_EN_XSK_RX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5fd70b4d55beb4ed277a5ea896a6859350b72d21..7b5b81a42bedc2fb8da1ee2360a7d1fb46c24bd2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -63,11 +63,11 @@
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 				struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
-				u32 page_idx);
+				u32 page_idx, struct mlx5e_xdp_buff *mxbuf);
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
-				   u32 page_idx);
+				   u32 page_idx, struct mlx5e_xdp_buff *mxbuf);
 static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
@@ -1662,7 +1662,8 @@ static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			  struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
+			  struct mlx5_cqe64 *cqe, u32 cqe_bcnt,
+			  struct mlx5e_xdp_buff *mxbuf)
 {
 	struct mlx5e_frag_page *frag_page = wi->frag_page;
 	u16 rx_headroom = rq->buff.headroom;
@@ -1684,17 +1685,15 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog) {
-		struct mlx5e_xdp_buff mxbuf;
-
 		net_prefetchw(va); /* xdp_frame data area */
 		mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, rq->buff.frame0_sz,
-				 cqe_bcnt, &mxbuf);
-		if (mlx5e_xdp_handle(rq, prog, &mxbuf))
+				 cqe_bcnt, mxbuf);
+		if (mlx5e_xdp_handle(rq, prog, mxbuf))
 			return NULL; /* page/packet was consumed by XDP */
 
-		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
-		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
-		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
+		rx_headroom = mxbuf->xdp.data - mxbuf->xdp.data_hard_start;
+		metasize = mxbuf->xdp.data - mxbuf->xdp.data_meta;
+		cqe_bcnt = mxbuf->xdp.data_end - mxbuf->xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
@@ -1710,14 +1709,14 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
+			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt,
+			     struct mlx5e_xdp_buff *mxbuf)
 {
 	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
 	struct mlx5e_wqe_frag_info *head_wi = wi;
 	u16 rx_headroom = rq->buff.headroom;
 	struct mlx5e_frag_page *frag_page;
 	struct skb_shared_info *sinfo;
-	struct mlx5e_xdp_buff mxbuf;
 	u32 frag_consumed_bytes;
 	struct bpf_prog *prog;
 	struct sk_buff *skb;
@@ -1737,8 +1736,8 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	net_prefetch(va + rx_headroom);
 
 	mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, rq->buff.frame0_sz,
-			 frag_consumed_bytes, &mxbuf);
-	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
+			 frag_consumed_bytes, mxbuf);
+	sinfo = xdp_get_shared_info_from_buff(&mxbuf->xdp);
 	truesize = 0;
 
 	cqe_bcnt -= frag_consumed_bytes;
@@ -1750,7 +1749,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 
 		frag_consumed_bytes = min_t(u32, frag_info->frag_size, cqe_bcnt);
 
-		mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf.xdp, frag_page,
+		mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf->xdp, frag_page,
 					       wi->offset, frag_consumed_bytes);
 		truesize += frag_info->frag_stride;
 
@@ -1760,7 +1759,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	}
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (prog && mlx5e_xdp_handle(rq, prog, &mxbuf)) {
+	if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 			struct mlx5e_wqe_frag_info *pwi;
 
@@ -1770,21 +1769,21 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		return NULL; /* page/packet was consumed by XDP */
 	}
 
-	skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start, rq->buff.frame0_sz,
-				     mxbuf.xdp.data - mxbuf.xdp.data_hard_start,
-				     mxbuf.xdp.data_end - mxbuf.xdp.data,
-				     mxbuf.xdp.data - mxbuf.xdp.data_meta);
+	skb = mlx5e_build_linear_skb(rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
+				     mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
+				     mxbuf->xdp.data_end - mxbuf->xdp.data,
+				     mxbuf->xdp.data - mxbuf->xdp.data_meta);
 	if (unlikely(!skb))
 		return NULL;
 
 	skb_mark_for_recycle(skb);
 	head_wi->frag_page->frags++;
 
-	if (xdp_buff_has_frags(&mxbuf.xdp)) {
+	if (xdp_buff_has_frags(&mxbuf->xdp)) {
 		/* sinfo->nr_frags is reset by build_skb, calculate again. */
 		xdp_update_skb_shared_info(skb, wi - head_wi - 1,
 					   sinfo->xdp_frags_size, truesize,
-					   xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
+					   xdp_buff_is_frag_pfmemalloc(&mxbuf->xdp));
 
 		for (struct mlx5e_wqe_frag_info *pwi = head_wi + 1; pwi < wi; pwi++)
 			pwi->frag_page->frags++;
@@ -1815,6 +1814,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
 	struct mlx5_wq_cyc *wq = &rq->wqe.wq;
 	struct mlx5e_wqe_frag_info *wi;
+	struct mlx5e_xdp_buff mxbuf;
 	struct sk_buff *skb;
 	u32 cqe_bcnt;
 	u16 ci;
@@ -1832,7 +1832,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
 			      mlx5e_xsk_skb_from_cqe_linear,
-			      rq, wi, cqe, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt, &mxbuf);
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
@@ -1863,6 +1863,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	struct mlx5_eswitch_rep *rep = rpriv->rep;
 	struct mlx5_wq_cyc *wq = &rq->wqe.wq;
 	struct mlx5e_wqe_frag_info *wi;
+	struct mlx5e_xdp_buff mxbuf;
 	struct sk_buff *skb;
 	u32 cqe_bcnt;
 	u16 ci;
@@ -1879,7 +1880,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
-			      rq, wi, cqe, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt, &mxbuf);
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
@@ -1907,6 +1908,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
 	u32 wqe_offset     = stride_ix << rq->mpwqe.log_stride_sz;
 	u32 head_offset    = wqe_offset & ((1 << rq->mpwqe.page_shift) - 1);
 	u32 page_idx       = wqe_offset >> rq->mpwqe.page_shift;
+	struct mlx5e_xdp_buff mxbuf;
 	struct mlx5e_rx_wqe_ll *wqe;
 	struct mlx5_wq_ll *wq;
 	struct sk_buff *skb;
@@ -1932,7 +1934,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
 	skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
-			      rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset, page_idx, &mxbuf);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
@@ -1979,7 +1981,7 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
-				   u32 page_idx)
+				   u32 page_idx, struct mlx5e_xdp_buff *mxbuf)
 {
 	struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
 	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
@@ -1987,7 +1989,6 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	u32 frag_offset    = head_offset;
 	u32 byte_cnt       = cqe_bcnt;
 	struct skb_shared_info *sinfo;
-	struct mlx5e_xdp_buff mxbuf;
 	unsigned int truesize = 0;
 	struct bpf_prog *prog;
 	struct sk_buff *skb;
@@ -2033,9 +2034,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		}
 	}
 
-	mlx5e_fill_mxbuf(rq, cqe, va, linear_hr, linear_frame_sz, linear_data_len, &mxbuf);
+	mlx5e_fill_mxbuf(rq, cqe, va, linear_hr, linear_frame_sz, linear_data_len, mxbuf);
 
-	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
+	sinfo = xdp_get_shared_info_from_buff(&mxbuf->xdp);
 
 	while (byte_cnt) {
 		/* Non-linear mode, hence non-XSK, which always uses PAGE_SIZE. */
@@ -2046,7 +2047,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		else
 			truesize += ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz));
 
-		mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf.xdp, frag_page, frag_offset,
+		mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf->xdp, frag_page, frag_offset,
 					       pg_consumed_bytes);
 		byte_cnt -= pg_consumed_bytes;
 		frag_offset = 0;
@@ -2054,7 +2055,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	}
 
 	if (prog) {
-		if (mlx5e_xdp_handle(rq, prog, &mxbuf)) {
+		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 				struct mlx5e_frag_page *pfp;
 
@@ -2067,10 +2068,10 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
-		skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start,
+		skb = mlx5e_build_linear_skb(rq, mxbuf->xdp.data_hard_start,
 					     linear_frame_sz,
-					     mxbuf.xdp.data - mxbuf.xdp.data_hard_start, 0,
-					     mxbuf.xdp.data - mxbuf.xdp.data_meta);
+					     mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
+					     mxbuf->xdp.data - mxbuf->xdp.data_meta);
 		if (unlikely(!skb)) {
 			mlx5e_page_release_fragmented(rq, &wi->linear_page);
 			return NULL;
@@ -2080,13 +2081,13 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		wi->linear_page.frags++;
 		mlx5e_page_release_fragmented(rq, &wi->linear_page);
 
-		if (xdp_buff_has_frags(&mxbuf.xdp)) {
+		if (xdp_buff_has_frags(&mxbuf->xdp)) {
 			struct mlx5e_frag_page *pagep;
 
 			/* sinfo->nr_frags is reset by build_skb, calculate again. */
 			xdp_update_skb_shared_info(skb, frag_page - head_page,
 						   sinfo->xdp_frags_size, truesize,
-						   xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
+						   xdp_buff_is_frag_pfmemalloc(&mxbuf->xdp));
 
 			pagep = head_page;
 			do
@@ -2097,12 +2098,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	} else {
 		dma_addr_t addr;
 
-		if (xdp_buff_has_frags(&mxbuf.xdp)) {
+		if (xdp_buff_has_frags(&mxbuf->xdp)) {
 			struct mlx5e_frag_page *pagep;
 
 			xdp_update_skb_shared_info(skb, sinfo->nr_frags,
 						   sinfo->xdp_frags_size, truesize,
-						   xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
+						   xdp_buff_is_frag_pfmemalloc(&mxbuf->xdp));
 
 			pagep = frag_page - sinfo->nr_frags;
 			do
@@ -2124,7 +2125,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 				struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
-				u32 page_idx)
+				u32 page_idx, struct mlx5e_xdp_buff *mxbuf)
 {
 	struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
 	u16 rx_headroom = rq->buff.headroom;
@@ -2152,20 +2153,19 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog) {
-		struct mlx5e_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
 		mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, rq->buff.frame0_sz,
-				 cqe_bcnt, &mxbuf);
-		if (mlx5e_xdp_handle(rq, prog, &mxbuf)) {
+				 cqe_bcnt, mxbuf);
+		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
 				frag_page->frags++;
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
-		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
-		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
-		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
+		rx_headroom = mxbuf->xdp.data - mxbuf->xdp.data_hard_start;
+		metasize = mxbuf->xdp.data - mxbuf->xdp.data_meta;
+		cqe_bcnt = mxbuf->xdp.data_end - mxbuf->xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
@@ -2288,12 +2288,14 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
 	bool flush		= cqe->shampo.flush;
 	bool match		= cqe->shampo.match;
 	struct mlx5e_rq_stats *stats = rq->stats;
+	struct mlx5e_xdp_buff mxbuf;
 	struct mlx5e_rx_wqe_ll *wqe;
 	struct mlx5e_mpw_info *wi;
 	struct mlx5_wq_ll *wq;
 
 	wi = mlx5e_get_mpw_info(rq, wqe_id);
 	wi->consumed_strides += cstrides;
+	mxbuf.xdp.flags = 0;
 
 	if (unlikely(MLX5E_RX_ERR_CQE(cqe))) {
 		mlx5e_handle_rx_err_cqe(rq, cqe);
@@ -2316,7 +2318,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
 			*skb = mlx5e_skb_from_cqe_shampo(rq, wi, cqe, header_index);
 		else
 			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe, cqe_bcnt,
-								  data_offset, page_idx);
+								  data_offset, page_idx, &mxbuf);
 		if (unlikely(!*skb))
 			goto free_hd_entry;
 
@@ -2377,6 +2379,7 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
 	u32 wqe_offset     = stride_ix << rq->mpwqe.log_stride_sz;
 	u32 head_offset    = wqe_offset & ((1 << rq->mpwqe.page_shift) - 1);
 	u32 page_idx       = wqe_offset >> rq->mpwqe.page_shift;
+	struct mlx5e_xdp_buff mxbuf;
 	struct mlx5e_rx_wqe_ll *wqe;
 	struct mlx5_wq_ll *wq;
 	struct sk_buff *skb;
@@ -2404,7 +2407,7 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
 			      mlx5e_xsk_skb_from_cqe_mpwrq_linear,
 			      rq, wi, cqe, cqe_bcnt, head_offset,
-			      page_idx);
+			      page_idx, &mxbuf);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
@@ -2632,6 +2635,7 @@ static void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
 	struct mlx5_wq_cyc *wq = &rq->wqe.wq;
 	struct mlx5e_wqe_frag_info *wi;
+	struct mlx5e_xdp_buff mxbuf;
 	struct sk_buff *skb;
 	u32 cqe_bcnt;
 	u16 ci;
@@ -2648,7 +2652,7 @@ static void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
-			      rq, wi, cqe, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt, &mxbuf);
 	if (!skb)
 		goto wq_cyc_pop;
 
@@ -2722,6 +2726,7 @@ static void mlx5e_trap_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe
 {
 	struct mlx5_wq_cyc *wq = &rq->wqe.wq;
 	struct mlx5e_wqe_frag_info *wi;
+	struct mlx5e_xdp_buff mxbuf;
 	struct sk_buff *skb;
 	u32 cqe_bcnt;
 	u16 trap_id;
@@ -2737,7 +2742,7 @@ static void mlx5e_trap_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe
 		goto wq_cyc_pop;
 	}
 
-	skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe, cqe_bcnt);
+	skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe, cqe_bcnt, &mxbuf);
 	if (!skb)
 		goto wq_cyc_pop;
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH RFC bpf-next v2 15/17] mlx5: Propagate trait presence to skb
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (13 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 14/17] mlx5: move xdp_buff scope one level up Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 16/17] xdp generic: " Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 17/17] trait: Allow socket filters to access traits Arthur Fabre
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Call the common xdp_buff_update_skb() helper.

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 7b5b81a42bedc2fb8da1ee2360a7d1fb46c24bd2..9ca82c9e831ebf4fc52dac547e97c9f3d499e8a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1632,7 +1632,8 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
 static inline
 struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
 				       u32 frag_size, u16 headroom,
-				       u32 cqe_bcnt, u32 metasize)
+				       u32 cqe_bcnt, u32 metasize,
+				       struct mlx5e_xdp_buff *mxbuf)
 {
 	struct sk_buff *skb = napi_build_skb(va, frag_size);
 
@@ -1646,6 +1647,8 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
 
 	if (metasize)
 		skb_metadata_set(skb, metasize);
+	if (mxbuf)
+		xdp_buff_update_skb(&mxbuf->xdp, skb);
 
 	return skb;
 }
@@ -1696,7 +1699,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 		cqe_bcnt = mxbuf->xdp.data_end - mxbuf->xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
-	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
+	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize, mxbuf);
 	if (unlikely(!skb))
 		return NULL;
 
@@ -1772,7 +1775,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	skb = mlx5e_build_linear_skb(rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
 				     mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
 				     mxbuf->xdp.data_end - mxbuf->xdp.data,
-				     mxbuf->xdp.data - mxbuf->xdp.data_meta);
+				     mxbuf->xdp.data - mxbuf->xdp.data_meta, mxbuf);
 	if (unlikely(!skb))
 		return NULL;
 
@@ -2071,7 +2074,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		skb = mlx5e_build_linear_skb(rq, mxbuf->xdp.data_hard_start,
 					     linear_frame_sz,
 					     mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
-					     mxbuf->xdp.data - mxbuf->xdp.data_meta);
+					     mxbuf->xdp.data - mxbuf->xdp.data_meta, mxbuf);
 		if (unlikely(!skb)) {
 			mlx5e_page_release_fragmented(rq, &wi->linear_page);
 			return NULL;
@@ -2168,7 +2171,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 		cqe_bcnt = mxbuf->xdp.data_end - mxbuf->xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
-	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
+	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize, mxbuf);
 	if (unlikely(!skb))
 		return NULL;
 
@@ -2202,7 +2205,7 @@ mlx5e_skb_from_cqe_shampo(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 		dma_sync_single_range_for_cpu(rq->pdev, dma_addr, 0, frag_size, rq->buff.map_dir);
 		net_prefetchw(hdr);
 		net_prefetch(data);
-		skb = mlx5e_build_linear_skb(rq, hdr, frag_size, rx_headroom, head_size, 0);
+		skb = mlx5e_build_linear_skb(rq, hdr, frag_size, rx_headroom, head_size, 0, NULL);
 		if (unlikely(!skb))
 			return NULL;
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH RFC bpf-next v2 16/17] xdp generic: Propagate trait presence to skb
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (14 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 15/17] mlx5: Propagate trait presence to skb Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 17/17] trait: Allow socket filters to access traits Arthur Fabre
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Call the common xdp_buff_update_skb() helper.

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 net/core/dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1be7cb73a6024fda6797b6dfc895e4ce25f43251..4d1ac5442d2ecf3c22186ab264ab54eb7d302708 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5226,6 +5226,7 @@ u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
 		metalen = xdp->data - xdp->data_meta;
 		if (metalen)
 			skb_metadata_set(skb, metalen);
+		xdp_buff_update_skb(xdp, skb);
 		break;
 	}
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH RFC bpf-next v2 17/17] trait: Allow socket filters to access traits
  2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
                   ` (15 preceding siblings ...)
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 16/17] xdp generic: " Arthur Fabre
@ 2025-04-22 13:23 ` Arthur Fabre
  16 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-04-22 13:23 UTC (permalink / raw)
  To: netdev, bpf
  Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, ast, kuba,
	edumazet, Arthur Fabre

Add kfuncs to allow socket filter programs access to traits in an skb.

To ensure every skb can modify its traits independently, copy-on-write if
multiple skbs are using the same traits.

To allow new traits to be set (which requires more memory), try to use
more of the skb's headroom if we run out of space while setting a trait.
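
For illustration, a socket filter could use these kfuncs roughly like this
(a sketch only: key 3, the return-value checks, and the extern declarations
are assumptions modelled on existing skb kfuncs):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern int skb_trait_get(struct __sk_buff *skb, u64 key, void *val,
			 u64 val__sz) __ksym;
extern int skb_trait_set(struct __sk_buff *skb, u64 key, const void *val,
			 u64 val__sz, u64 flags) __ksym;

SEC("socket")
int trait_example(struct __sk_buff *skb)
{
	u32 id = 0;

	/* Tag every packet that doesn't already carry trait 3. */
	if (skb_trait_get(skb, 3, &id, sizeof(id)) < 0) {
		id = bpf_get_prandom_u32();
		skb_trait_set(skb, 3, &id, sizeof(id), 0);
	}

	return skb->len;	/* accept the whole packet */
}

char _license[] SEC("license") = "GPL";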

Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
---
 include/linux/skbuff.h |   9 ++++
 net/core/skbuff.c      | 112 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ae569b1b1af83b586e1be6c69439ef74bac38cf3..e573889cf1256e7aff84b488af875a13f558cb01 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4874,9 +4874,11 @@ void *__skb_ext_set(struct sk_buff *skb, enum skb_ext_id id,
 		    struct skb_ext *ext);
 void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id);
 void *skb_ext_from_headroom(struct sk_buff *skb, enum skb_ext_id id, int head_offset, int size);
+bool skb_ext_grow_headroom(const struct sk_buff *skb, int add);
 void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id);
 void __skb_ext_put(struct skb_ext *ext);
 int __skb_ext_total_size(const struct skb_ext *ext);
+int skb_ext_size(const struct sk_buff *skb, enum skb_ext_id id);
 
 static inline void skb_ext_put(struct sk_buff *skb)
 {
@@ -4960,6 +4962,13 @@ static inline void *skb_traits(const struct sk_buff *skb)
 #endif
 }
 
+int skb_trait_set(struct sk_buff *skb, u64 key,
+			const void *val, u64 val__sz, u64 flags);
+int skb_trait_is_set(const struct sk_buff *skb, u64 key);
+int skb_trait_get(const struct sk_buff *skb, u64 key,
+			void *val, u64 val__sz);
+int skb_trait_del(const struct sk_buff *skb, u64 key);
+
 static inline void nf_reset_ct(struct sk_buff *skb)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9f27caba82af0be7897b68b5fad087f3e9c62955..27e163b83b12ffac973e1e098f035b807e1f4232 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7148,6 +7148,19 @@ void *skb_ext_from_headroom(struct sk_buff *skb, enum skb_ext_id id, int head_of
 	return __skb_ext_set_active(skb, new, id);
 }
 
+bool skb_ext_grow_headroom(const struct sk_buff *skb, int add)
+{
+	if (!skb->active_extensions)
+		return false;
+	if (skb->extensions->mode != SKB_EXT_HEADROOM)
+		return false;
+	if (skb_headroom(skb) < add)
+		return false;
+
+	skb->extensions->chunks += SKB_EXT_CHUNKS(add);
+	return true;
+}
+
 #ifdef CONFIG_XFRM
 static void skb_ext_put_sp(struct sec_path *sp)
 {
@@ -7215,8 +7228,107 @@ int __skb_ext_total_size(const struct skb_ext *ext)
 {
 	return SKB_EXT_CHUNKS_BYTES(ext->chunks);
 }
+
+int skb_ext_size(const struct sk_buff *skb, enum skb_ext_id id)
+{
+	if (!skb_ext_exist(skb, id))
+		return 0;
+
+	switch (skb->extensions->mode) {
+	case SKB_EXT_ALLOC:
+		return skb_ext_type_len[id];
+	case SKB_EXT_HEADROOM:
+		return SKB_EXT_CHUNKS_BYTES(skb->extensions->chunks - skb->extensions->offset[id]);
+	}
+}
 #endif /* CONFIG_SKB_EXTENSIONS */
 
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc int skb_trait_set(struct sk_buff *skb, u64 key,
+			      const void *val, u64 val__sz, u64 flags)
+{
+#ifndef CONFIG_SKB_EXTENSIONS
+	return -EOPNOTSUPP;
+#else
+	int err;
+	void *traits = skb_traits(skb);
+
+	if (!traits)
+		return -EOPNOTSUPP;
+
+	/* Traits are shared, get our own copy before modifying */
+	if (refcount_read(&skb->extensions->refcnt) > 1) {
+		traits = skb_ext_add(skb, SKB_EXT_TRAITS);
+		if (!traits)
+			return -ENOMEM;
+	}
+
+	err = trait_set(traits, traits + skb_ext_size(skb, SKB_EXT_TRAITS),
+			key, val, val__sz, flags);
+	if (err == -ENOSPC && skb->extensions->mode == SKB_EXT_HEADROOM) {
+		/* Take more headroom if available */
+		if (!skb_ext_grow_headroom(skb, val__sz))
+			return err;
+
+		err = trait_set(traits, traits + skb_ext_size(skb, SKB_EXT_TRAITS),
+				key, val, val__sz, flags);
+	}
+	return err;
+#endif /* CONFIG_SKB_EXTENSIONS */
+}
+
+__bpf_kfunc int skb_trait_is_set(const struct sk_buff *skb, u64 key)
+{
+	void *traits = skb_traits(skb);
+
+	if (!traits)
+		return -EOPNOTSUPP;
+
+	return trait_is_set(traits, key);
+}
+
+__bpf_kfunc int skb_trait_get(const struct sk_buff *skb, u64 key,
+			      void *val, u64 val__sz)
+{
+	void *traits = skb_traits(skb);
+
+	if (!traits)
+		return -EOPNOTSUPP;
+
+	return trait_get(traits, key, val, val__sz);
+}
+
+__bpf_kfunc int skb_trait_del(const struct sk_buff *skb, u64 key)
+{
+	void *traits = skb_traits(skb);
+
+	if (!traits)
+		return -EOPNOTSUPP;
+
+	return trait_del(traits, key);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_skb_traits)
+BTF_ID_FLAGS(func, skb_trait_set)
+BTF_ID_FLAGS(func, skb_trait_is_set)
+BTF_ID_FLAGS(func, skb_trait_get)
+BTF_ID_FLAGS(func, skb_trait_del)
+BTF_KFUNCS_END(bpf_skb_traits)
+
+static const struct btf_kfunc_id_set bpf_traits_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &bpf_skb_traits,
+};
+
+static int init_subsystem(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCKET_FILTER, &bpf_traits_kfunc_set);
+}
+late_initcall(init_subsystem);
+
 static void kfree_skb_napi_cache(struct sk_buff *skb)
 {
 	/* if SKB is a clone, don't handle this case */

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb Arthur Fabre
@ 2025-04-23 16:36   ` Stanislav Fomichev
  2025-04-23 20:54     ` Arthur Fabre
  0 siblings, 1 reply; 36+ messages in thread
From: Stanislav Fomichev @ 2025-04-23 16:36 UTC (permalink / raw)
  To: Arthur Fabre
  Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
	ast, kuba, edumazet

On 04/22, Arthur Fabre wrote:
> Call the common xdp_buff_update_skb() helper.
> 
> Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index c8e3468eee612ad622bfbecfd7cc1ae3396061fd..0eba3e307a3edbc5fe1abf2fa45e6256d98574c2 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -2297,6 +2297,10 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
>  			}
>  		}
>  	}
> +
> +	if (xdp_active)
> +		xdp_buff_update_skb(&xdp, skb);

For me, the preference for reusing the existing metadata area was
because of patches 10-16: we now need to care about two types of
metadata explicitly.

If you insist on placing it into the headroom, can we at least have some
common helper to finish xdp->skb conversion? It can call skb_ext_from_headroom
and/or skb_metadata_set:

xdp_buff_done(*xdp, *skb) {
	if (have traits) {
		skb_ext_from_headroom
		return
	}

	metasize = xdp->data - xdp->data_meta;
	if (metasize)
		skb_metadata_set
}

And then we'll have some common rules for the drivers: call xdp_buff_done
when you're done with the xdp_buff to take care of metadata/traits. And
it might be easier to review: you're gonna (mostly) change existing
calls to skb_metadata_set to your new helper. Maybe we can even
eventually fold all xdp_update_skb_shared_info stuff into that as
well...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb
  2025-04-23 16:36   ` Stanislav Fomichev
@ 2025-04-23 20:54     ` Arthur Fabre
  2025-04-23 23:45       ` Stanislav Fomichev
  0 siblings, 1 reply; 36+ messages in thread
From: Arthur Fabre @ 2025-04-23 20:54 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
	ast, kuba, edumazet

On Wed Apr 23, 2025 at 6:36 PM CEST, Stanislav Fomichev wrote:
> On 04/22, Arthur Fabre wrote:
> > Call the common xdp_buff_update_skb() helper.
> > 
> > Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
> > ---
> >  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index c8e3468eee612ad622bfbecfd7cc1ae3396061fd..0eba3e307a3edbc5fe1abf2fa45e6256d98574c2 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -2297,6 +2297,10 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
> >  			}
> >  		}
> >  	}
> > +
> > +	if (xdp_active)
> > +		xdp_buff_update_skb(&xdp, skb);
>
> For me, the preference for reusing existing metadata area was
> because of the patches 10-16: we now need to care about two types of
> metadata explicitly.

Having to update all the drivers is definitely not ideal. Motivation is:

1. Avoid trait_set() and xdp_adjust_meta() corrupting each other's
   data.
   But that's not a problem if we disallow using trait_set() and
   xdp_adjust_meta() at the same time, so maybe not a good
   reason anymore (except for maybe 3.)

2. Not have the traits at the "end" of the headroom (i.e. right next to
   actual packet data).
   If it's at the "end", we need to move all of it to make room for
   every xdp_adjust_head() call (see the sketch after this list).
   It seems more intrusive to the current SKB API: several funcs assume
   that there is headroom directly before the packet.

3. I'm not sure how this should be exposed with AF_XDP yet. Either:
   * Expose raw trait storage, and having it at the "end" of the
     headroom is nice. But userspace would need to know how to parse the
     header.
   * Require the XDP program to copy the traits it wants into the XDP
     metadata area, which is already exposed to userspace. That would
     need traits and XDP metadata to coexist.
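
To make the memmove cost in point 2 concrete, a minimal userspace sketch
(buffer sizes and layout purely illustrative):

#include <stdio.h>
#include <string.h>

/* Toy layout: [ free headroom ... | traits | packet ].
 * Growing the packet head by `push` bytes means the traits block has to
 * be moved back by `push` first, on every single adjustment. */
int main(void)
{
	unsigned char buf[64];
	int data = 32, traits_len = 8, push = 4;

	memset(buf, '.', sizeof(buf));
	memset(buf + data - traits_len, 'T', traits_len);
	memset(buf + data, 'P', sizeof(buf) - data);

	/* xdp_adjust_head(-push): make room for an encap header */
	memmove(buf + data - traits_len - push, buf + data - traits_len,
		traits_len);			/* traits move first */
	data -= push;
	memset(buf + data, 'H', push);		/* then the new header */

	printf("%.*s\n", (int)sizeof(buf), buf);
	return 0;
}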

>
> If you insist on placing it into the headroom, can we at least have some
> common helper to finish xdp->skb conversion? It can call skb_ext_from_headroom
> and/or skb_metadata_set:
>
> static void xdp_buff_done(struct xdp_buff *xdp, struct sk_buff *skb)
> {
> 	int metasize;
>
> 	if (xdp_buff_has_traits(xdp)) {	/* "have traits" check, name TBD */
> 		skb_ext_from_headroom(xdp, skb);
> 		return;
> 	}
>
> 	metasize = xdp->data - xdp->data_meta;
> 	if (metasize)
> 		skb_metadata_set(skb, metasize);
> }
>
> And then we'll have some common rules for the drivers: call xdp_buff_done
> when you're done with the xdp_buff to take care of metadata/traits. And
> it might be easier to review: you're gonna (mostly) change existing
> calls to skb_metadata_set to your new helper. Maybe we can even
> eventually fold all xdp_update_skb_shared_info stuff into that as
> well...

Yes! This is what I was going for with xdp_buff_update_skb() - it would be
nice for it to handle all the SKB updating, including skb_metadata_set().

Should I do that first, and submit it as a separate series?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb
  2025-04-23 20:54     ` Arthur Fabre
@ 2025-04-23 23:45       ` Stanislav Fomichev
  2025-04-24  9:49         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 36+ messages in thread
From: Stanislav Fomichev @ 2025-04-23 23:45 UTC (permalink / raw)
  To: Arthur Fabre
  Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
	ast, kuba, edumazet

On 04/23, Arthur Fabre wrote:
> On Wed Apr 23, 2025 at 6:36 PM CEST, Stanislav Fomichev wrote:
> > On 04/22, Arthur Fabre wrote:
> > > Call the common xdp_buff_update_skb() helper.
> > > 
> > > Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
> > > ---
> > >  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > index c8e3468eee612ad622bfbecfd7cc1ae3396061fd..0eba3e307a3edbc5fe1abf2fa45e6256d98574c2 100644
> > > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > @@ -2297,6 +2297,10 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
> > >  			}
> > >  		}
> > >  	}
> > > +
> > > +	if (xdp_active)
> > > +		xdp_buff_update_skb(&xdp, skb);
> >
> > For me, the preference for reusing the existing metadata area was
> > because of patches 10-16: we now need to care about two types of
> > metadata explicitly.
> 
> Having to update all the drivers is definitely not ideal. Motivation is:
> 
> 1. Prevent trait_set() and xdp_adjust_meta() from corrupting each other's
>    data.
>    But that's not a problem if we disallow trait_set() and
>    xdp_adjust_meta() from being used at the same time, so maybe not a
>    good reason anymore (except for maybe 3.)
>
> 2. Not having the traits at the "end" of the headroom (ie right next to
>    actual packet data).
>    If they're at the "end", we need to move them all to make room for
>    every xdp_adjust_head() call.
>    It seems more intrusive to the current SKB API: several funcs assume
>    that there is headroom directly before the packet.

[..]

> 3. I'm not sure how this should be exposed with AF_XDP yet. Either:
>    * Expose raw trait storage, and having it at the "end" of the
>      headroom is nice. But userspace would need to know how to parse
>      the header.
>    * Require the XDP program to copy the traits it wants into the XDP
>      metadata area, which is already exposed to userspace. That would
>      need traits and XDP metadata to coexist.
 
By keeping the traits at the tail we can just expose raw trait storage
and let userspace deal with it. But anyway, my main point is to avoid
having drivers deal with two separate cases. As long as we can hide
everything behind a common call, we can change the placement later.

> > If you insist on placing it into the headroom, can we at least have some
> > common helper to finish xdp->skb conversion? It can call skb_ext_from_headroom
> > and/or skb_metadata_set:
> >
> > static void xdp_buff_done(struct xdp_buff *xdp, struct sk_buff *skb)
> > {
> > 	int metasize;
> >
> > 	if (xdp_buff_has_traits(xdp)) {	/* "have traits" check, name TBD */
> > 		skb_ext_from_headroom(xdp, skb);
> > 		return;
> > 	}
> >
> > 	metasize = xdp->data - xdp->data_meta;
> > 	if (metasize)
> > 		skb_metadata_set(skb, metasize);
> > }
> >
> > And then we'll have some common rules for the drivers: call xdp_buff_done
> > when you're done with the xdp_buff to take care of metadata/traits. And
> > it might be easier to review: you're gonna (mostly) change existing
> > calls to skb_metadata_set to your new helper. Maybe we can even
> > eventually fold all xdp_update_skb_shared_info stuff into that as
> > well...
> 
> Yes! This is what I was going for with xdp_buff_update_skb() - it would be
> nice for it to handle all the SKB updating, including skb_metadata_set().
> 
> Should I do that first, and submit it as a separate series?

Up to you, will be happy to review regardless :-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb
  2025-04-23 23:45       ` Stanislav Fomichev
@ 2025-04-24  9:49         ` Toke Høiland-Jørgensen
  2025-04-24 15:39           ` Stanislav Fomichev
  0 siblings, 1 reply; 36+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-04-24  9:49 UTC (permalink / raw)
  To: Stanislav Fomichev, Arthur Fabre
  Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, lbiancon, ast, kuba,
	edumazet

Stanislav Fomichev <stfomichev@gmail.com> writes:

> On 04/23, Arthur Fabre wrote:
>> On Wed Apr 23, 2025 at 6:36 PM CEST, Stanislav Fomichev wrote:
>> > On 04/22, Arthur Fabre wrote:
>> > > Call the common xdp_buff_update_skb() helper.
>> > > 
>> > > Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
>> > > ---
>> > >  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++++
>> > >  1 file changed, 4 insertions(+)
>> > > 
>> > > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> > > index c8e3468eee612ad622bfbecfd7cc1ae3396061fd..0eba3e307a3edbc5fe1abf2fa45e6256d98574c2 100644
>> > > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> > > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> > > @@ -2297,6 +2297,10 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
>> > >  			}
>> > >  		}
>> > >  	}
>> > > +
>> > > +	if (xdp_active)
>> > > +		xdp_buff_update_skb(&xdp, skb);
>> >
>> > For me, the preference for reusing the existing metadata area was
>> > because of patches 10-16: we now need to care about two types of
>> > metadata explicitly.
>> 
>> Having to update all the drivers is definitely not ideal. Motivation is:
>> 
>> 1. Prevent trait_set() and xdp_adjust_meta() from corrupting each other's
>>    data.
>>    But that's not a problem if we disallow trait_set() and
>>    xdp_adjust_meta() from being used at the same time, so maybe not a
>>    good reason anymore (except for maybe 3.)
>>
>> 2. Not having the traits at the "end" of the headroom (ie right next to
>>    actual packet data).
>>    If they're at the "end", we need to move them all to make room for
>>    every xdp_adjust_head() call.
>>    It seems more intrusive to the current SKB API: several funcs assume
>>    that there is headroom directly before the packet.
>
> [..]
>
>> 3. I'm not sure how this should be exposed with AF_XDP yet. Either:
>>    * Expose raw trait storage, and having it at the "end" of the
>>      headroom is nice. But userspace would need to know how to parse
>>      the header.
>>    * Require the XDP program to copy the traits it wants into the XDP
>>      metadata area, which is already exposed to userspace. That would
>>      need traits and XDP metadata to coexist.
>  
> By keeping the traits at the tail we can just expose raw trait storage
> and let userspace deal with it. But anyway, my main point is to avoid
> having drivers deal with two separate cases. As long as we can hide
> everything behind a common call, we can change the placement later.

Being able to change the placement (and format) of the data store is the
reason why we should absolutely *not* expose the internal trait storage
to AF_XDP. Once we do that, we effectively make the whole thing UAPI.
The trait get/set API very deliberately does not expose any details
about the underlying storage for exactly this reason :)

-Toke


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb
  2025-04-24  9:49         ` Toke Høiland-Jørgensen
@ 2025-04-24 15:39           ` Stanislav Fomichev
  2025-04-24 18:59             ` Jakub Sitnicki
  0 siblings, 1 reply; 36+ messages in thread
From: Stanislav Fomichev @ 2025-04-24 15:39 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Arthur Fabre, netdev, bpf, jakub, hawk, yan, jbrandeburg,
	lbiancon, ast, kuba, edumazet

On 04/24, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <stfomichev@gmail.com> writes:
> 
> > On 04/23, Arthur Fabre wrote:
> >> On Wed Apr 23, 2025 at 6:36 PM CEST, Stanislav Fomichev wrote:
> >> > On 04/22, Arthur Fabre wrote:
> >> > > Call the common xdp_buff_update_skb() helper.
> >> > > 
> >> > > Signed-off-by: Arthur Fabre <arthur@arthurfabre.com>
> >> > > ---
> >> > >  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++++
> >> > >  1 file changed, 4 insertions(+)
> >> > > 
> >> > > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> >> > > index c8e3468eee612ad622bfbecfd7cc1ae3396061fd..0eba3e307a3edbc5fe1abf2fa45e6256d98574c2 100644
> >> > > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> >> > > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> >> > > @@ -2297,6 +2297,10 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
> >> > >  			}
> >> > >  		}
> >> > >  	}
> >> > > +
> >> > > +	if (xdp_active)
> >> > > +		xdp_buff_update_skb(&xdp, skb);
> >> >
> >> > For me, the preference for reusing the existing metadata area was
> >> > because of patches 10-16: we now need to care about two types of
> >> > metadata explicitly.
> >> 
> >> Having to update all the drivers is definitely not ideal. Motivation is:
> >> 
> >> 1. Prevent trait_set() and xdp_adjust_meta() from corrupting each other's
> >>    data.
> >>    But that's not a problem if we disallow trait_set() and
> >>    xdp_adjust_meta() from being used at the same time, so maybe not a
> >>    good reason anymore (except for maybe 3.)
> >>
> >> 2. Not having the traits at the "end" of the headroom (ie right next to
> >>    actual packet data).
> >>    If they're at the "end", we need to move them all to make room for
> >>    every xdp_adjust_head() call.
> >>    It seems more intrusive to the current SKB API: several funcs assume
> >>    that there is headroom directly before the packet.
> >
> > [..]
> >
> >> 3. I'm not sure how this should be exposed with AF_XDP yet. Either:
> >>    * Expose raw trait storage, and having it at the "end" of the
> >>      headroom is nice. But userspace would need to know how to parse
> >>      the header.
> >>    * Require the XDP program to copy the traits it wants into the XDP
> >>      metadata area, which is already exposed to userspace. That would
> >>      need traits and XDP metadata to coexist.
> >  
> > By keeping the traits at the tail we can just expose raw trait storage
> > and let userspace deal with it. But anyway, my main point is to avoid
> > having drivers deal with two separate cases. As long as we can hide
> > everything behind a common call, we can change the placement later.
> 
> Being able to change the placement (and format) of the data store is the
> reason why we should absolutely *not* expose the internal trait storage
> to AF_XDP. Once we do that, we effectively make the whole thing UAPI.
> The trait get/set API very deliberately does not expose any details
> about the underlying storage for exactly this reason :)

I was under the impression that we want to eventually expose trait
blobs to userspace via getsockopt (or some other similar means);
is that not the case? How is userspace supposed to consume it?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-04-22 13:23 ` [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata Arthur Fabre
@ 2025-04-24 16:22   ` Alexei Starovoitov
  2025-04-25 19:26     ` Arthur Fabre
  0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2025-04-24 16:22 UTC (permalink / raw)
  To: Arthur Fabre
  Cc: Network Development, bpf, Jakub Sitnicki, Jesper Dangaard Brouer,
	Yan Zhai, jbrandeburg, Toke Høiland-Jørgensen, lbiancon,
	Alexei Starovoitov, Jakub Kicinski, Eric Dumazet

On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
>
> +/**
> + * trait_set() - Set a trait key.
> + * @traits: Start of trait store area.
> + * @hard_end: Hard limit the trait store can currently grow up against.
> + * @key: The key to set.
> + * @val: The value to set.
> + * @len: The length of the value.
> + * @flags: Unused for now. Should be 0.
> + *
> + * Return:
> + * * %0       - Success.
> + * * %-EINVAL - Key or length invalid.
> + * * %-EBUSY  - Key previously set with different length.
> + * * %-ENOSPC - Not enough room left to store value.
> + */
> +static __always_inline
> +int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
> +{
> +       if (!__trait_valid_key(key) || !__trait_valid_len(len))
> +               return -EINVAL;
> +
> +       struct __trait_hdr *h = (struct __trait_hdr *)traits;
> +
> +       /* Offset of value of this key. */
> +       int off = __trait_offset(*h, key);
> +
> +       if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
> +               /* Key is already set, but with a different length */
> +               if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
> +                       return -EBUSY;
> +       } else {
> +               /* Figure out if we have enough room left: total length of everything now. */
> +               if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
> +                       return -ENOSPC;

I'm still not excited about having two metadata-s
in front of the packet.
Why can't traits use the same metadata space?

For trait_set() you already pass hard_end and have to check it
at run-time.
If you add the same hard_end to trait_get/del the kfuncs will deal
with possible corruption of metadata by the program.
Transition from xdp to skb will be automatic. The code won't care that traits
are there. It will just copy all metadata from xdp to skb. Corrupted or not.
bpf progs in xdp and skb might even use the same kfuncs
(or two different sets if the verifier is not smart enough right now).

Ideally we add hweight64 as a new bpf instruction, then maybe
we won't need any kernel changes at all.
bpf side will do:
bpf_xdp_adjust_meta(xdp, -max_offset_for_largest_key);
and then
trait_set(xdp->data_meta /* pointer to trait header */, xdp->data /*
hard end */, ...);
can be implemented as bpf prog.

Same thing for skb progs.
netfilter/iptable can use another bpf prog to make decisions.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb
  2025-04-24 15:39           ` Stanislav Fomichev
@ 2025-04-24 18:59             ` Jakub Sitnicki
  2025-04-25  8:06               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 36+ messages in thread
From: Jakub Sitnicki @ 2025-04-24 18:59 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, Arthur Fabre, netdev, bpf, hawk,
	yan, jbrandeburg, lbiancon, ast, kuba, edumazet, kernel-team

On Thu, Apr 24, 2025 at 08:39 AM -07, Stanislav Fomichev wrote:
> On 04/24, Toke Høiland-Jørgensen wrote:

[...]

>> Being able to change the placement (and format) of the data store is the
>> reason why we should absolutely *not* expose the internal trait storage
>> to AF_XDP. Once we do that, we effectively make the whole thing UAPI.
>> The trait get/set API very deliberately does not expose any details
>> about the underlying storage for exactly this reason :)
>
> I was under the impression that we want to eventually expose trait
> blobs to userspace via getsockopt (or some other similar means);
> is that not the case? How is userspace supposed to consume it?

Yes, we definitely want to consume and produce traits from userspace.

Before last Netdev [1], our plan was for the socket glue layer to
convert between the in-kernel format and a TLV-based stable format for
uAPI.

But then Alexei suggested something that had not occurred to us. The traits
format can be something that BPF and userspace agree on. The kernel just
passes it back and forth without needing to understand the content. This
naturally simplifies changes in the socket glue layer.

As Eric pointed out, this is similar to exposing raw TCP SYN headers via
getsockopt(TCP_SAVED_SYN). BPF can set a custom TCP option, and
userspace can consume it.
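
For reference, that flow looks like this from userspace:

#include <netinet/tcp.h>
#include <sys/socket.h>

static void saved_syn_example(int listen_fd, int conn_fd)
{
	int one = 1;
	char syn[512];
	socklen_t len = sizeof(syn);

	/* On the listening socket, ask the kernel to keep SYN headers. */
	setsockopt(listen_fd, IPPROTO_TCP, TCP_SAVE_SYN, &one, sizeof(one));

	/* On an accepted socket, read the raw SYN headers back. */
	getsockopt(conn_fd, IPPROTO_TCP, TCP_SAVED_SYN, syn, &len);
}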

The trade-off is that the traits can then only be used in parts of the
network stack where a BPF hook exists.

Is that an acceptable limitation? That's what we're hoping to hear your
thoughts on.

One concern that comes to mind: since the network stack is unaware of
trait contents, we will be limited to simple merge strategies (like
"drop all" or "keep first") in the GRO layer.

[1] https://www.netdevconf.info/0x19/sessions/talk/traits-rich-packet-metadata.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb
  2025-04-24 18:59             ` Jakub Sitnicki
@ 2025-04-25  8:06               ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 36+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-04-25  8:06 UTC (permalink / raw)
  To: Jakub Sitnicki, Stanislav Fomichev
  Cc: Arthur Fabre, netdev, bpf, hawk, yan, jbrandeburg, lbiancon, ast,
	kuba, edumazet, kernel-team

Jakub Sitnicki <jakub@cloudflare.com> writes:

> On Thu, Apr 24, 2025 at 08:39 AM -07, Stanislav Fomichev wrote:
>> On 04/24, Toke Høiland-Jørgensen wrote:
>
> [...]
>
>>> Being able to change the placement (and format) of the data store is the
>>> reason why we should absolutely *not* expose the internal trait storage
>>> to AF_XDP. Once we do that, we effectively make the whole thing UAPI.
>>> The trait get/set API very deliberately does not expose any details
>>> about the underlying storage for exactly this reason :)
>>
>> I was under the impression that we want to eventually expose trait
>> blobs to userspace via getsockopt (or some other similar means);
>> is that not the case? How is userspace supposed to consume it?
>
> Yes, we definitely want to consume and produce traits from userspace.
>
> Before last Netdev [1], our plan was for the socket glue layer to
> convert between the in-kernel format and a TLV-based stable format for
> uAPI.
>
> But then Alexei suggested something that had not occurred to us. The traits
> format can be something that BPF and userspace agree on. The kernel just
> passes it back and forth without needing to understand the content. This
> naturally simplifies changes in the socket glue layer.
>
> As Eric pointed out, this is similar to exposing raw TCP SYN headers via
> getsockopt(TCP_SAVED_SYN). BPF can set a custom TCP option, and
> userspace can consume it.
>
> The trade-off is that the traits can then only be used in parts of the
> network stack where a BPF hook exists.

It also means that we can't change the format later because it becomes
an API consumed by userspace. Regardless of whether we deem it "UAPI"
or not, it is bound to ossify. TCP headers are actually an excellent
example of this; they are ostensibly modifiable, but it's all but
impossible to do so in practice, because implementations make
assumptions on the expected format and break if it changes.

Also, making the format "agreed upon between BPF and userspace" means
that the kernel won't be able to use the data stored in traits itself
(since it does not know the format). We do want fields stored in traits
to be consumable by the kernel as well, so for this reason it is not a
good idea to leave it up to BPF to define the format either.

> Is that an acceptable limitation? That's what we're hoping to hear your
> thoughts on.

I absolutely don't think this is acceptable; see above.

> One concern that comes to mind: since the network stack is unaware of
> trait contents, we will be limited to simple merge strategies (like
> "drop all" or "keep first") in the GRO layer.

Yeah, this is another limitation of the kernel not understanding the
format, but not the only one.

-Toke


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-04-24 16:22   ` Alexei Starovoitov
@ 2025-04-25 19:26     ` Arthur Fabre
  2025-04-29 23:36       ` Alexei Starovoitov
  0 siblings, 1 reply; 36+ messages in thread
From: Arthur Fabre @ 2025-04-25 19:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Network Development, bpf, Jakub Sitnicki, Jesper Dangaard Brouer,
	Yan Zhai, jbrandeburg, Toke Høiland-Jørgensen, lbiancon,
	Alexei Starovoitov, Jakub Kicinski, Eric Dumazet

On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
> On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
> >
> > +/**
> > + * trait_set() - Set a trait key.
> > + * @traits: Start of trait store area.
> > + * @hard_end: Hard limit the trait store can currently grow up against.
> > + * @key: The key to set.
> > + * @val: The value to set.
> > + * @len: The length of the value.
> > + * @flags: Unused for now. Should be 0.
> > + *
> > + * Return:
> > + * * %0       - Success.
> > + * * %-EINVAL - Key or length invalid.
> > + * * %-EBUSY  - Key previously set with different length.
> > + * * %-ENOSPC - Not enough room left to store value.
> > + */
> > +static __always_inline
> > +int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
> > +{
> > +       if (!__trait_valid_key(key) || !__trait_valid_len(len))
> > +               return -EINVAL;
> > +
> > +       struct __trait_hdr *h = (struct __trait_hdr *)traits;
> > +
> > +       /* Offset of value of this key. */
> > +       int off = __trait_offset(*h, key);
> > +
> > +       if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
> > +               /* Key is already set, but with a different length */
> > +               if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
> > +                       return -EBUSY;
> > +       } else {
> > +               /* Figure out if we have enough room left: total length of everything now. */
> > +               if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
> > +                       return -ENOSPC;
>
> I'm still not excited about having two metadata-s
> in front of the packet.
> Why can't traits use the same metadata space?
>
> For trait_set() you already pass hard_end and have to check it
> at run-time.
> If you add the same hard_end to trait_get/del the kfuncs will deal
> with possible corruption of metadata by the program.
> Transition from xdp to skb will be automatic. The code won't care that traits
> are there. It will just copy all metadata from xdp to skb. Corrupted or not.
> bpf progs in xdp and skb might even use the same kfuncs
> (or two different sets if the verifier is not smart enough right now).

Good idea, that would solve the corruption problem.

But I think storing metadata at the "end" of the headroom (ie where 
XDP metadata is today) makes it harder to persist in the SKB layer.
Functions like __skb_push() assume that skb_headroom() bytes are
available just before skb->data.

They can be updated to move XDP metadata out of the way, but this
assumption seems pretty pervasive.

By using the "front" of the headroom, we can hide that from the rest of
the SKB code. We could even update skb->head to completely hide the
space used at the front of the headroom.
It also avoids the cost of moving the metadata around (but maybe that
would be insignificant).

XDP metadata also doesn't work well with GRO (see below).

> Ideally we add hweight64 as a new bpf instruction, then maybe
> we won't need any kernel changes at all.
> bpf side will do:
> bpf_xdp_adjust_meta(xdp, -max_offset_for_largest_key);
> and then
> trait_set(xdp->data_meta /* pointer to trait header */, xdp->data /*
> hard end */, ...);
> can be implemented as bpf prog.
>
> Same thing for skb progs.
> netfilter/iptable can use another bpf prog to make decisions.

There are (at least) two reasons for wanting the kernel to understand the
format:

* GRO: With an opaque binary blob, the kernel can either forbid GRO when
  the metadata differs (like XDP metadata today), or keep the entire blob
  of one of the packets.
  But maybe some users will want to keep a KV of the first packet, or
  the last packet, eg for receive timestamps.
  With a KV store we can have a sane default option for merging the
  different KV stores, and even add a per KV policy in the future if
  needed.

* Hardware metadata: metadata exposed from NICs (like the receive
  timestamp, 4 tuple hash...) is currently only exposed to XDP programs
  (via kfuncs).
  But that doesn't expose them to the rest of the stack.
  Storing them in traits would allow XDP, other BPF programs, and the
  kernel to access and modify them (for example to take into account
  decapsulating a packet).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-04-25 19:26     ` Arthur Fabre
@ 2025-04-29 23:36       ` Alexei Starovoitov
  2025-04-30  9:19         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2025-04-29 23:36 UTC (permalink / raw)
  To: Arthur Fabre
  Cc: Network Development, bpf, Jakub Sitnicki, Jesper Dangaard Brouer,
	Yan Zhai, jbrandeburg, Toke Høiland-Jørgensen, lbiancon,
	Alexei Starovoitov, Jakub Kicinski, Eric Dumazet

On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
>
> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
> > >
> > > +/**
> > > + * trait_set() - Set a trait key.
> > > + * @traits: Start of trait store area.
> > > + * @hard_end: Hard limit the trait store can currently grow up against.
> > > + * @key: The key to set.
> > > + * @val: The value to set.
> > > + * @len: The length of the value.
> > > + * @flags: Unused for now. Should be 0.
> > > + *
> > > + * Return:
> > > + * * %0       - Success.
> > > + * * %-EINVAL - Key or length invalid.
> > > + * * %-EBUSY  - Key previously set with different length.
> > > + * * %-ENOSPC - Not enough room left to store value.
> > > + */
> > > +static __always_inline
> > > +int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
> > > +{
> > > +       if (!__trait_valid_key(key) || !__trait_valid_len(len))
> > > +               return -EINVAL;
> > > +
> > > +       struct __trait_hdr *h = (struct __trait_hdr *)traits;
> > > +
> > > +       /* Offset of value of this key. */
> > > +       int off = __trait_offset(*h, key);
> > > +
> > > +       if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
> > > +               /* Key is already set, but with a different length */
> > > +               if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
> > > +                       return -EBUSY;
> > > +       } else {
> > > +               /* Figure out if we have enough room left: total length of everything now. */
> > > +               if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
> > > +                       return -ENOSPC;
> >
> > I'm still not excited about having two metadata-s
> > in front of the packet.
> > Why can't traits use the same metadata space?
> >
> > For trait_set() you already pass hard_end and have to check it
> > at run-time.
> > If you add the same hard_end to trait_get/del the kfuncs will deal
> > with possible corruption of metadata by the program.
> > Transition from xdp to skb will be automatic. The code won't care that traits
> > are there. It will just copy all metadata from xdp to skb. Corrupted or not.
> > bpf progs in xdp and skb might even use the same kfuncs
> > (or two different sets if the verifier is not smart enough right now).
>
> Good idea, that would solve the corruption problem.
>
> But I think storing metadata at the "end" of the headroom (ie where
> XDP metadata is today) makes it harder to persist in the SKB layer.
> Functions like __skb_push() assume that skb_headroom() bytes are
> available just before skb->data.
>
> They can be updated to move XDP metadata out of the way, but this
> assumption seems pretty pervasive.

The same argument can be flipped.
Why does the skb layer need to push?
If it needs to encapsulate, it will forward to a tunnel device
to go on the wire. At this point any kind of metadata is going
to be lost on the wire. The bpf prog would need to convert
metadata into the actual on-the-wire format, stash it,
or send it to user space.
I don't see a use case where the skb layer would move metadata by N
bytes, populate these N bytes with "???" and pass to the next skb layer.
skb layers strip (pop) the header when it goes from ip to tcp to user space.
No need to move metadata.

> By using the "front" of the headroom, we can hide that from the rest of
> the SKB code. We could even update skb->head to completely hide the
> space used at the front of the headroom.
> It also avoids the cost of moving the metadata around (but maybe that
> would be insignificant).

That's a theory. Do you actually have skb layers pushing things
while metadata is there?

> XDP metadata also doesn't work well with GRO (see below).
>
> > Ideally we add hweight64 as a new bpf instruction, then maybe
> > we won't need any kernel changes at all.
> > bpf side will do:
> > bpf_xdp_adjust_meta(xdp, -max_offset_for_largest_key);
> > and then
> > trait_set(xdp->data_meta /* pointer to trait header */, xdp->data /*
> > hard end */, ...);
> > can be implemented as bpf prog.
> >
> > Same thing for skb progs.
> > netfilter/iptable can use another bpf prog to make decisions.
>
> There are (at least) two reasons for wanting the kernel to understand the
> format:
>
> * GRO: With an opaque binary blob, the kernel can either forbid GRO when
>   the metadata differs (like XDP metadata today), or keep the entire blob
>   of one of the packets.
>   But maybe some users will want to keep a KV of the first packet, or
>   the last packet, eg for receive timestamps.
>   With a KV store we can have a sane default option for merging the
>   different KV stores, and even add a per KV policy in the future if
>   needed.

We can have this default for metadata too.
If all bytes in the metadata blob are the same -> let it GRO.
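
That's roughly what GRO already does for the metadata blob today, in
gro_list_prepare():

	/* Any metadata difference between the held packet and the new
	 * one blocks coalescing.
	 */
	diffs |= skb_metadata_differs(p, skb);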

I don't think it's a good idea to extend GRO with bpf to make it "smart".

> * Hardware metadata: metadata exposed from NICs (like the receive
>   timestamp, 4 tuple hash...) is currently only exposed to XDP programs
>   (via kfuncs).
>   But that doesn't expose them to the rest of the stack.
>   Storing them in traits would allow XDP, other BPF programs, and the
>   kernel to access and modify them (for example to take into account
>   decapsulating a packet).

Sure. If traits == existing metadata, a bpf prog in xdp can communicate
with a bpf prog in the skb layer via that "trait" format.
xdp can take tuple hash and store it as key==0 in the trait.
The kernel doesn't need to know how to parse that format.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-04-29 23:36       ` Alexei Starovoitov
@ 2025-04-30  9:19         ` Toke Høiland-Jørgensen
  2025-04-30 16:29           ` Alexei Starovoitov
  2025-04-30 19:19           ` Jakub Sitnicki
  0 siblings, 2 replies; 36+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-04-30  9:19 UTC (permalink / raw)
  To: Alexei Starovoitov, Arthur Fabre
  Cc: Network Development, bpf, Jakub Sitnicki, Jesper Dangaard Brouer,
	Yan Zhai, jbrandeburg, lbiancon, Alexei Starovoitov,
	Jakub Kicinski, Eric Dumazet

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>
>> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
>> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
>> > >
>> > > +/**
>> > > + * trait_set() - Set a trait key.
>> > > + * @traits: Start of trait store area.
>> > > + * @hard_end: Hard limit the trait store can currently grow up against.
>> > > + * @key: The key to set.
>> > > + * @val: The value to set.
>> > > + * @len: The length of the value.
>> > > + * @flags: Unused for now. Should be 0.
>> > > + *
>> > > + * Return:
>> > > + * * %0       - Success.
>> > > + * * %-EINVAL - Key or length invalid.
>> > > + * * %-EBUSY  - Key previously set with different length.
>> > > + * * %-ENOSPC - Not enough room left to store value.
>> > > + */
>> > > +static __always_inline
>> > > +int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
>> > > +{
>> > > +       if (!__trait_valid_key(key) || !__trait_valid_len(len))
>> > > +               return -EINVAL;
>> > > +
>> > > +       struct __trait_hdr *h = (struct __trait_hdr *)traits;
>> > > +
>> > > +       /* Offset of value of this key. */
>> > > +       int off = __trait_offset(*h, key);
>> > > +
>> > > +       if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
>> > > +               /* Key is already set, but with a different length */
>> > > +               if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
>> > > +                       return -EBUSY;
>> > > +       } else {
>> > > +               /* Figure out if we have enough room left: total length of everything now. */
>> > > +               if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
>> > > +                       return -ENOSPC;
>> >
>> > I'm still not excited about having two metadata-s
>> > in front of the packet.
>> > Why can't traits use the same metadata space?
>> >
>> > For trait_set() you already pass hard_end and have to check it
>> > at run-time.
>> > If you add the same hard_end to trait_get/del the kfuncs will deal
>> > with possible corruption of metadata by the program.
>> > Transition from xdp to skb will be automatic. The code won't care that traits
>> > are there. It will just copy all metadata from xdp to skb. Corrupted or not.
>> > bpf progs in xdp and skb might even use the same kfuncs
>> > (or two different sets if the verifier is not smart enough right now).
>>
>> Good idea, that would solve the corruption problem.
>>
>> But I think storing metadata at the "end" of the headroom (ie where
>> XDP metadata is today) makes it harder to persist in the SKB layer.
>> Functions like __skb_push() assume that skb_headroom() bytes are
>> available just before skb->data.
>>
>> They can be updated to move XDP metadata out of the way, but this
>> assumption seems pretty pervasive.
>
> The same argument can be flipped.
> Why does the skb layer need to push?
> If it needs to encapsulate, it will forward to a tunnel device
> to go on the wire. At this point any kind of metadata is going
> to be lost on the wire. The bpf prog would need to convert
> metadata into the actual on-the-wire format, stash it,
> or send it to user space.
> I don't see a use case where the skb layer would move metadata by N
> bytes, populate these N bytes with "???" and pass to the next skb layer.
> skb layers strip (pop) the header when it goes from ip to tcp to user space.
> No need to move metadata.
>
>> By using the "front" of the headroom, we can hide that from the rest of
>> the SKB code. We could even update skb->head to completely hide the
>> space used at the front of the headroom.
>> It also avoids the cost of moving the metadata around (but maybe that
>> would be insignificant).
>
> That's a theory. Do you actually have skb layers pushing things
> while metadata is there?

Erm, any encapsulation? UDP tunnels, wireguard, WiFi, etc. There are
almost 1000 calls to skb_push() all over the kernel. One of the primary
use cases for traits is to tag a packet with an ID to follow it
throughout its lifetime inside the kernel. This absolutely includes
encapsulation; in fact, having the tracing ID survive these kinds of
transformations is one of the primary motivators for this work.

>> XDP metadata also doesn't work well with GRO (see below).
>>
>> > Ideally we add hweight64 as a new bpf instruction, then maybe
>> > we won't need any kernel changes at all.
>> > bpf side will do:
>> > bpf_xdp_adjust_meta(xdp, -max_offset_for_largest_key);
>> > and then
>> > trait_set(xdp->data_meta /* pointer to trait header */, xdp->data /*
>> > hard end */, ...);
>> > can be implemented as bpf prog.
>> >
>> > Same thing for skb progs.
>> > netfilter/iptable can use another bpf prog to make decisions.
>>
>> There are (at least) two reasons for wanting the kernel to understand the
>> format:
>>
>> * GRO: With an opaque binary blob, the kernel can either forbid GRO when
>>   the metadata differs (like XDP metadata today), or keep the entire blob
>>   of one of the packets.
>>   But maybe some users will want to keep a KV of the first packet, or
>>   the last packet, eg for receive timestamps.
>>   With a KV store we can have a sane default option for merging the
>>   different KV stores, and even add a per KV policy in the future if
>>   needed.
>
> We can have this default for metadata too.
> If all bytes in the metadata blob are the same -> let it GRO.

But that is exactly what we do with metadata today, and it is limiting
use cases. For instance, timestamps are going to differ, but we don't
want that to block GRO; we just want to keep one of the timestamps. So
being able to set a "GRO policy" *per key in the KV-store* is useful.

> I don't think it's a good idea to extend GRO with bpf to make it "smart".

That would be a separate project; for this, the proposal is to make the
kernel understand a couple of different policies (block GRO, keep first,
keep last, discard) per key.
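
Something like (names hypothetical):

/* Hypothetical per-key GRO merge policies. */
enum trait_gro_policy {
	TRAIT_GRO_BLOCK,	/* differing values prevent GRO entirely */
	TRAIT_GRO_KEEP_FIRST,	/* keep the value from the first packet */
	TRAIT_GRO_KEEP_LAST,	/* keep the value from the last packet */
	TRAIT_GRO_DISCARD,	/* drop the key when values differ */
};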

>> * Hardware metadata: metadata exposed from NICs (like the receive
>>   timestamp, 4 tuple hash...) is currently only exposed to XDP programs
>>   (via kfuncs).
>>   But that doesn't expose them to the rest of the stack.
>>   Storing them in traits would allow XDP, other BPF programs, and the
>>   kernel to access and modify them (for example to take into account
>>   decapsulating a packet).
>
>> Sure. If traits == existing metadata, a bpf prog in xdp can communicate
>> with a bpf prog in the skb layer via that "trait" format.
> xdp can take tuple hash and store it as key==0 in the trait.
> The kernel doesn't need to know how to parse that format.

Yes it does, to propagate it to the skb later. I.e.,

XDP prog on NIC: get HW hash, store in traits, redirect to CPUMAP
CPUMAP: build skb, read hash from traits, populate skb hash

Same thing for (at least) timestamps and checksums.
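
Sketched as BPF (bpf_xdp_metadata_rx_hash() is an existing kfunc, the
trait kfunc is hypothetical):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Existing RX metadata kfunc. */
extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;
/* Hypothetical trait kfunc. */
extern int bpf_xdp_trait_set(struct xdp_md *ctx, __u64 key,
			     const void *val, __u64 len, __u64 flags) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int nic_prog(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type type;
	__u32 hash;

	/* Stash the HW hash as trait key 0 before redirecting. */
	if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &type))
		bpf_xdp_trait_set(ctx, 0, &hash, sizeof(hash), 0);

	return bpf_redirect_map(&cpu_map, 0, 0);
}

The CPUMAP side would then read key 0 while building the skb and feed
it into skb_set_hash().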

Longer term, with traits available we could move more skb fields into
traits to make struct sk_buff smaller (by moving optional fields to
traits that don't take up any space if they're not set).

-Toke


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-04-30  9:19         ` Toke Høiland-Jørgensen
@ 2025-04-30 16:29           ` Alexei Starovoitov
  2025-05-01  7:30             ` Arthur Fabre
  2025-04-30 19:19           ` Jakub Sitnicki
  1 sibling, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2025-04-30 16:29 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Arthur Fabre, Network Development, bpf, Jakub Sitnicki,
	Jesper Dangaard Brouer, Yan Zhai, jbrandeburg, lbiancon,
	Alexei Starovoitov, Jakub Kicinski, Eric Dumazet

On Wed, Apr 30, 2025 at 2:19 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
> >>
> >> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
> >> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
> >> > >
> >> > > +/**
> >> > > + * trait_set() - Set a trait key.
> >> > > + * @traits: Start of trait store area.
> >> > > + * @hard_end: Hard limit the trait store can currently grow up against.
> >> > > + * @key: The key to set.
> >> > > + * @val: The value to set.
> >> > > + * @len: The length of the value.
> >> > > + * @flags: Unused for now. Should be 0.
> >> > > + *
> >> > > + * Return:
> >> > > + * * %0       - Success.
> >> > > + * * %-EINVAL - Key or length invalid.
> >> > > + * * %-EBUSY  - Key previously set with different length.
> >> > > + * * %-ENOSPC - Not enough room left to store value.
> >> > > + */
> >> > > +static __always_inline
> >> > > +int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
> >> > > +{
> >> > > +       if (!__trait_valid_key(key) || !__trait_valid_len(len))
> >> > > +               return -EINVAL;
> >> > > +
> >> > > +       struct __trait_hdr *h = (struct __trait_hdr *)traits;
> >> > > +
> >> > > +       /* Offset of value of this key. */
> >> > > +       int off = __trait_offset(*h, key);
> >> > > +
> >> > > +       if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
> >> > > +               /* Key is already set, but with a different length */
> >> > > +               if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
> >> > > +                       return -EBUSY;
> >> > > +       } else {
> >> > > +               /* Figure out if we have enough room left: total length of everything now. */
> >> > > +               if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
> >> > > +                       return -ENOSPC;
> >> >
> >> > I'm still not excited about having two metadata-s
> >> > in front of the packet.
> >> > Why can't traits use the same metadata space?
> >> >
> >> > For trait_set() you already pass hard_end and have to check it
> >> > at run-time.
> >> > If you add the same hard_end to trait_get/del the kfuncs will deal
> >> > with possible corruption of metadata by the program.
> >> > Transition from xdp to skb will be automatic. The code won't care that traits
> >> > are there. It will just copy all metadata from xdp to skb. Corrupted or not.
> >> > bpf progs in xdp and skb might even use the same kfuncs
> >> > (or two different sets if the verifier is not smart enough right now).
> >>
> >> Good idea, that would solve the corruption problem.
> >>
> >> But I think storing metadata at the "end" of the headroom (ie where
> >> XDP metadata is today) makes it harder to persist in the SKB layer.
> >> Functions like __skb_push() assume that skb_headroom() bytes are
> >> available just before skb->data.
> >>
> >> They can be updated to move XDP metadata out of the way, but this
> >> assumption seems pretty pervasive.
> >
> > The same argument can be flipped.
> > Why does the skb layer need to push?
> > If it needs to encapsulate, it will forward to a tunnel device
> > to go on the wire. At this point any kind of metadata is going
> > to be lost on the wire. The bpf prog would need to convert
> > metadata into the actual on-the-wire format, stash it,
> > or send it to user space.
> > I don't see a use case where the skb layer would move metadata by N
> > bytes, populate these N bytes with "???" and pass to the next skb layer.
> > skb layers strip (pop) the header when it goes from ip to tcp to user space.
> > No need to move metadata.
> >
> >> By using the "front" of the headroom, we can hide that from the rest of
> >> the SKB code. We could even update skb->head to completely hide the
> >> space used at the front of the headroom.
> >> It also avoids the cost of moving the metadata around (but maybe that
> >> would be insignificant).
> >
> > That's a theory. Do you actually have skb layers pushing things
> > while metadata is there?
>
> Erm, any encapsulation? UDP tunnels, wireguard, WiFi, etc. There are
> almost 1000 calls to skb_push() all over the kernel. One of the primary
> use cases for traits is to tag a packet with an ID to follow it
> throughout its lifetime inside the kernel. This absolutely includes
> encapsulation; in fact, having the tracing ID survive these kinds of
> transformations is one of the primary motivators for this work.

and the assumption here is that placing traits in the front will
be enough for all possible encaps that the stack can do?
Hopefully not. Moving traits due to skb_push() is mandatory anyway.

The other point you're arguing is that the current metadata approach
is broken, since it doesn't work for encap?
pskb_expand_head() does skb_metadata_clear().
Though I don't remember anyone complaining of that behavior,
let's assume it's a problem indeed.
With traits in front, pskb_expand_head() should either fail or move
traits when it runs out of room.
Let's be consistent and do the same for regular metadata.

My main point is we should not introduce a second kind of metadata.
If the current metadata doesn't quite work, fix it instead of adding
a new one.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-04-30  9:19         ` Toke Høiland-Jørgensen
  2025-04-30 16:29           ` Alexei Starovoitov
@ 2025-04-30 19:19           ` Jakub Sitnicki
  2025-05-01 10:43             ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 36+ messages in thread
From: Jakub Sitnicki @ 2025-04-30 19:19 UTC (permalink / raw)
  To: Alexei Starovoitov, Toke Høiland-Jørgensen,
	Arthur Fabre
  Cc: Network Development, bpf, Jesper Dangaard Brouer, Yan Zhai,
	jbrandeburg, lbiancon, Alexei Starovoitov, Jakub Kicinski,
	Eric Dumazet, kernel-team

On Wed, Apr 30, 2025 at 11:19 AM +02, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
>> On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>>
>>> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
>>> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:

[...]

>>> * Hardware metadata: metadata exposed from NICs (like the receive
>>>   timestamp, 4 tuple hash...) is currently only exposed to XDP programs
>>>   (via kfuncs).
>>>   But that doesn't expose them to the rest of the stack.
>>>   Storing them in traits would allow XDP, other BPF programs, and the
>>>   kernel to access and modify them (for example to take into account
>>>   decapsulating a packet).
>>
>> Sure. If traits == existing metadata, a bpf prog in xdp can communicate
>> with a bpf prog in the skb layer via that "trait" format.
>> xdp can take tuple hash and store it as key==0 in the trait.
>> The kernel doesn't need to know how to parse that format.
>
> Yes it does, to propagate it to the skb later. I.e.,
>
> XDP prog on NIC: get HW hash, store in traits, redirect to CPUMAP
> CPUMAP: build skb, read hash from traits, populate skb hash
>
> Same thing for (at least) timestamps and checksums.
>
> Longer term, with traits available we could move more skb fields into
> traits to make struct sk_buff smaller (by moving optional fields to
> traits that don't take up any space if they're not set).

Perhaps we can have our cake and eat it too.

We could leave the traits encoding/decoding out of the kernel and, at
the same time, *expose it* to the network stack through BPF struct_ops
programs. At a high level, for example ->get_rx_hash(), not the
individual K/V access. The traits_ops vtable could grow as needed to
support new use cases.

If you think about it, it's not so different from BPF-powered congestion
algorithms and scheduler extensions. They also expose some state, kept in
maps, that only the loaded BPF code knows how to operate on.
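
Something like (entirely hypothetical):

/* Hypothetical struct_ops vtable: the kernel asks the loaded BPF
 * program for the values it needs, without knowing how traits are
 * encoded in the headroom.
 */
struct traits_ops {
	int (*get_rx_hash)(struct sk_buff *skb, u32 *hash);
	int (*get_rx_timestamp)(struct sk_buff *skb, u64 *tstamp);
	/* grows as new use cases come up */
};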

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-04-30 16:29           ` Alexei Starovoitov
@ 2025-05-01  7:30             ` Arthur Fabre
  0 siblings, 0 replies; 36+ messages in thread
From: Arthur Fabre @ 2025-05-01  7:30 UTC (permalink / raw)
  To: Alexei Starovoitov, Toke Høiland-Jørgensen
  Cc: Network Development, bpf, Jakub Sitnicki, Jesper Dangaard Brouer,
	Yan Zhai, jbrandeburg, lbiancon, Alexei Starovoitov,
	Jakub Kicinski, Eric Dumazet

On Wed Apr 30, 2025 at 6:29 PM CEST, Alexei Starovoitov wrote:
> On Wed, Apr 30, 2025 at 2:19 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >
> > > On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
> > >>
> > >> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
> > >> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
> > >> > >
> > >> > > +/**
> > >> > > + * trait_set() - Set a trait key.
> > >> > > + * @traits: Start of trait store area.
> > >> > > + * @hard_end: Hard limit the trait store can currently grow up against.
> > >> > > + * @key: The key to set.
> > >> > > + * @val: The value to set.
> > >> > > + * @len: The length of the value.
> > >> > > + * @flags: Unused for now. Should be 0.
> > >> > > + *
> > >> > > + * Return:
> > >> > > + * * %0       - Success.
> > >> > > + * * %-EINVAL - Key or length invalid.
> > >> > > + * * %-EBUSY  - Key previously set with different length.
> > >> > > + * * %-ENOSPC - Not enough room left to store value.
> > >> > > + */
> > >> > > +static __always_inline
> > >> > > +int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
> > >> > > +{
> > >> > > +       if (!__trait_valid_key(key) || !__trait_valid_len(len))
> > >> > > +               return -EINVAL;
> > >> > > +
> > >> > > +       struct __trait_hdr *h = (struct __trait_hdr *)traits;
> > >> > > +
> > >> > > +       /* Offset of value of this key. */
> > >> > > +       int off = __trait_offset(*h, key);
> > >> > > +
> > >> > > +       if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
> > >> > > +               /* Key is already set, but with a different length */
> > >> > > +               if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
> > >> > > +                       return -EBUSY;
> > >> > > +       } else {
> > >> > > +               /* Figure out if we have enough room left: total length of everything now. */
> > >> > > +               if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
> > >> > > +                       return -ENOSPC;
> > >> >
> > >> > I'm still not excited about having two metadata-s
> > >> > in front of the packet.
> > >> > Why can't traits use the same metadata space?
> > >> >
> > >> > For trait_set() you already pass hard_end and have to check it
> > >> > at run-time.
> > >> > If you add the same hard_end to trait_get/del the kfuncs will deal
> > >> > with possible corruption of metadata by the program.
> > >> > Transition from xdp to skb will be automatic. The code won't care that traits
> > >> > are there. It will just copy all metadata from xdp to skb. Corrupted or not.
> > >> > bpf progs in xdp and skb might even use the same kfuncs
> > >> > (or two different sets if the verifier is not smart enough right now).
> > >>
> > >> Good idea, that would solve the corruption problem.
> > >>
> > >> But I think storing metadata at the "end" of the headroom (ie where
> > >> XDP metadata is today) makes it harder to persist in the SKB layer.
> > >> Functions like __skb_push() assume that skb_headroom() bytes are
> > >> available just before skb->data.
> > >>
> > >> They can be updated to move XDP metadata out of the way, but this
> > >> assumption seems pretty pervasive.
> > >
> > > The same argument can be flipped.
> > > Why does the skb layer need to push?
> > > If it needs to encapsulate, it will forward to a tunnel device
> > > to go on the wire. At this point any kind of metadata is going
> > > to be lost on the wire. The bpf prog would need to convert
> > > metadata into the actual on-the-wire format, stash it,
> > > or send it to user space.
> > > I don't see a use case where the skb layer would move metadata by N
> > > bytes, populate these N bytes with "???" and pass to the next skb layer.
> > > skb layers strip (pop) the header when it goes from ip to tcp to user space.
> > > No need to move metadata.
> > >
> > >> By using the "front" of the headroom, we can hide that from the rest of
> > >> the SKB code. We could even update skb->head to completely hide the
> > >> space used at the front of the headroom.
> > >> It also avoids the cost of moving the metadata around (but maybe that
> > >> would be insignificant).
> > >
> > > That's a theory. Do you actually have skb layers pushing things
> > > while metadata is there?
> >
> > Erm, any encapsulation? UDP tunnels, wireguard, WiFi, etc. There are
> > almost 1000 calls to skb_push() all over the kernel. One of the primary
> > use cases for traits is to tag a packet with an ID to follow it
> > throughout its lifetime inside the kernel. This absolutely includes
> > encapsulation; in fact, having the tracing ID survive these kinds of
> > transformations is one of the primary motivators for this work.
>
> and the assumption here is that placing traits in the front will
> be enough for all possible encaps that the stack can do?
> Hopefully not. Moving traits due to skb_push() is mandatory anyway.

The network stack seems to assume there are at least 32 bytes of headroom
available; that's guaranteed by NET_SKB_PAD.

Callers that need more seem to check if there's enough headroom
available first, and call pskb_expand_head() if not.
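
I.e. the usual call-site pattern (hdr_len and the drop label stand in
for the caller's own):

	/* Guarantee hdr_len bytes of headroom, expanding the head via
	 * pskb_expand_head() if needed, then push the new header.
	 */
	if (skb_cow_head(skb, hdr_len))
		goto drop;
	hdr = __skb_push(skb, hdr_len);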

Placing traits at the front lets most of this work naturally if we leave
NET_SKB_PAD headroom.

But I haven't looked at every single __skb_push() call; maybe there are
cases I don't know about.

>
> The other point you're arguing is that the current metadata approach
> is broken, since it doesn't work for encap?
> pskb_expand_head() does skb_metadata_clear().
> Though I don't remember anyone complaining of that behavior,

No one can complain yet, because metadata isn't accessible after TC
today. It's limited to just XDP and TC, which avoids most of the SKB
processing and these trickier cases.

> let's assume it's a problem indeed.
> With traits in front, pskb_expand_head() should either fail or move
> traits when it runs out of room.
> Let's be consistent and do the same for regular metadata.
>
> My main point is we should not introduce a second kind of metadata.
> If the current metadata doesn't quite work, fix it instead of adding
> a new one.

I'll think about this more. It's tempting to just move the current
metadata area to the front of the headroom, but that breaks things like
AF_XDP that expect the metadata to be right before the packet data.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-04-30 19:19           ` Jakub Sitnicki
@ 2025-05-01 10:43             ` Toke Høiland-Jørgensen
  2025-05-01 14:03               ` Jesper Dangaard Brouer
  2025-05-05 10:18               ` Jakub Sitnicki
  0 siblings, 2 replies; 36+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-05-01 10:43 UTC (permalink / raw)
  To: Jakub Sitnicki, Alexei Starovoitov, Arthur Fabre
  Cc: Network Development, bpf, Jesper Dangaard Brouer, Yan Zhai,
	jbrandeburg, lbiancon, Alexei Starovoitov, Jakub Kicinski,
	Eric Dumazet, kernel-team

Jakub Sitnicki <jakub@cloudflare.com> writes:

> On Wed, Apr 30, 2025 at 11:19 AM +02, Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>
>>> On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>>>
>>>> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
>>>> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
>
> [...]
>
>>>> * Hardware metadata: metadata exposed from NICs (like the receive
>>>>   timestamp, 4 tuple hash...) is currently only exposed to XDP programs
>>>>   (via kfuncs).
>>>>   But that doesn't expose them to the rest of the stack.
>>>>   Storing them in traits would allow XDP, other BPF programs, and the
>>>>   kernel to access and modify them (for example to take into account
>>>>   decapsulating a packet).
>>>
>>> Sure. If traits == existing metadata, a bpf prog in xdp can communicate
>>> with a bpf prog in the skb layer via that "trait" format.
>>> xdp can take tuple hash and store it as key==0 in the trait.
>>> The kernel doesn't need to know how to parse that format.
>>
>> Yes it does, to propagate it to the skb later. I.e.,
>>
>> XDP prog on NIC: get HW hash, store in traits, redirect to CPUMAP
>> CPUMAP: build skb, read hash from traits, populate skb hash
>>
>> Same thing for (at least) timestamps and checksums.
>>
>> Longer term, with traits available we could move more skb fields into
>> traits to make struct sk_buff smaller (by moving optional fields to
>> traits that don't take up any space if they're not set).
>
> Perhaps we can have the cake and eat it too.
>
> We could leave the traits encoding/decoding out of the kernel and, at
> the same time, *expose it* to the network stack through BPF struct_ops
> programs. At a high level, for example ->get_rx_hash(), not the
> individual K/V access. The traits_ops vtable could grow as needed to
> support new use cases.
>
> If you think about it, it's not so different from BPF-powered congestion
> algorithms and scheduler extensions. They also expose some state, kept in
> maps, that only the loaded BPF code knows how to operate on.

Right, the difference being that the kernel works perfectly well without
an eBPF congestion control algorithm loaded because it has its own
internal implementation that is used by default.

Having a hard dependency on BPF for in-kernel functionality is a
different matter, and limits the cases it can be used for.

Besides, I don't really see the point of leaving the encoding out of the
kernel? We keep the encoding kernel-internal anyway, and just expose a
get/set API, so there's no constraint on changing it later (that's kinda
the whole point of doing that). And with bulk get/set there's not an
efficiency argument either. So what's the point, other than doing things
in BPF for its own sake?

-Toke



* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-05-01 10:43             ` Toke Høiland-Jørgensen
@ 2025-05-01 14:03               ` Jesper Dangaard Brouer
  2025-05-05 10:18               ` Jakub Sitnicki
  1 sibling, 0 replies; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2025-05-01 14:03 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jakub Sitnicki,
	Alexei Starovoitov, Arthur Fabre
  Cc: Network Development, bpf, Yan Zhai, jbrandeburg, lbiancon,
	Alexei Starovoitov, Jakub Kicinski, Eric Dumazet, kernel-team



On 01/05/2025 12.43, Toke Høiland-Jørgensen wrote:
> Jakub Sitnicki <jakub@cloudflare.com> writes:
> 
>> On Wed, Apr 30, 2025 at 11:19 AM +02, Toke Høiland-Jørgensen wrote:
>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>
>>>> On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>>>>
>>>>> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
>>>>>> On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>
>> [...]
>>
>>>>> * Hardware metadata: metadata exposed from NICs (like the receive
>>>>>    timestamp, 4 tuple hash...) is currently only exposed to XDP programs
>>>>>    (via kfuncs).
>>>>>    But that doesn't expose them to the rest of the stack.
>>>>>    Storing them in traits would allow XDP, other BPF programs, and the
>>>>>    kernel to access and modify them (for example to take into account
>>>>>    decapsulating a packet).
>>>>
>>>> Sure. If traits == existing metadata bpf prog in xdp can communicate
>>>> with bpf prog in skb layer via that "trait" format.
>>>> xdp can take tuple hash and store it as key==0 in the trait.
>>>> The kernel doesn't need to know how to parse that format.
>>>
>>> Yes it does, to propagate it to the skb later. I.e.,
>>>
>>> XDP prog on NIC: get HW hash, store in traits, redirect to CPUMAP
>>> CPUMAP: build skb, read hash from traits, populate skb hash
>>>
>>> Same thing for (at least) timestamps and checksums.
>>>
>>> Longer term, with traits available we could move more skb fields into
>>> traits to make struct sk_buff smaller (by moving optional fields to
>>> traits that don't take up any space if they're not set).

The above paragraph is very significant IMHO.  The netstack has many
fields in the SKB that are only used in corner cases.  There is a huge
opportunity
for making these fields optional, without taking a performance hit (via
SKB extensions). To me the traits area is simply a new dynamic struct
type available for the *kernel*.

INCEPTION: Giving NIC drivers a writable memory area before SKB
allocation opens up the possibility of avoiding SKB allocation in the
driver entirely. Hardware offload metadata can be written directly into
the traits area. Later, when the core netstack allocates the SKB, it can
extract this data from traits.

The performance implications here are significant: this effectively
brings XDP-style pre-SKB processing into the netstack core. The largest
benefits are likely to appear in packet forwarding workloads, where
avoiding early SKB allocation can yield substantial gains.
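
To make the quoted XDP -> CPUMAP flow concrete, here is a sketch of the
XDP side. The trait setter kfunc name and signature are assumed from
this series' get()/set() API and may differ; key 2 is an arbitrary
choice:

	/* store_rx_hash.bpf.c - sketch, not the series' actual API */
	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>

	struct {
		__uint(type, BPF_MAP_TYPE_CPUMAP);
		__uint(max_entries, 8);
		__type(key, __u32);
		__type(value, struct bpf_cpumap_val);
	} cpu_map SEC(".maps");

	/* Existing HW metadata hints kfunc: */
	extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
					    __u32 *hash,
					    enum xdp_rss_hash_type *rss_type) __ksym;
	/* Assumed name for this series' set() kfunc: */
	extern int bpf_xdp_trait_set(struct xdp_md *ctx, __u64 key,
				     const void *val, __u64 len,
				     __u64 flags) __ksym;

	SEC("xdp")
	int store_rx_hash(struct xdp_md *ctx)
	{
		enum xdp_rss_hash_type rss_type;
		__u32 hash;

		/* Read the HW hash and stash it in a trait so it
		 * survives past the XDP hook. */
		if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type))
			bpf_xdp_trait_set(ctx, 2, &hash, sizeof(hash), 0);

		return bpf_redirect_map(&cpu_map, 0, 0);
	}

	char _license[] SEC("license") = "GPL";

The CPUMAP side would then look the trait up while building the skb and
feed it to skb_set_hash(), instead of recomputing the hash in software.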


>>
>> Perhaps we can have the cake and eat it too.
>>
>> We could leave the traits encoding/decoding out of the kernel and, at
>> the same time, *expose it* to the network stack through BPF struct_ops
>> programs. At a high level, for example ->get_rx_hash(), not the
>> individual K/V access. The traits_ops vtable could grow as needed to
>> support new use cases.
>>
>> If you think about it, it's not so different from BPF-powered congestion
>> algorithms and scheduler extensions. They also expose some state, kept in
>> maps, that only the loaded BPF code knows how to operate on.
> 
> Right, the difference being that the kernel works perfectly well without
> an eBPF congestion control algorithm loaded because it has its own
> internal implementation that is used by default.

Good point.

> Having a hard dependency on BPF for in-kernel functionality is a
> different matter, and limits the cases it can be used for.

I agree.

> Besides, I don't really see the point of leaving the encoding out of the
> kernel? We keep the encoding kernel-internal anyway, and just expose a
> get/set API, so there's no constraint on changing it later (that's kinda
> the whole point of doing that). And with bulk get/set there's not an
> efficiency argument either. So what's the point, other than doing things
> in BPF for its own sake?

I agree - we should keep the traits encoding kernel-internal. The traits
area is best understood as a dynamic struct type available to the kernel.

We shouldn't expose the traits format as UAPI to BPF on day one. It's
likely we'll need to adjust key/value sizing or representation as new
use cases arise. BPF programs can still access traits via a get/set API,
like other consumers. Later on, we could consider translating the BPF
kfunc calls into inline BPF instructions (for example, once BPF gains
instructions like popcnt).
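
For the record, the accessor shape this implies is roughly the
following (illustrative prototypes; the series' actual functions may
differ):

	int trait_set(void *traits, u64 key, const void *val,
		      u64 len, u64 flags);
	int trait_get(void *traits, u64 key, void *val, u64 len);
	int trait_del(void *traits, u64 key);

As long as only entry points like these are exposed, the encoding
behind them stays free to change.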

--Jesper


* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-05-01 10:43             ` Toke Høiland-Jørgensen
  2025-05-01 14:03               ` Jesper Dangaard Brouer
@ 2025-05-05 10:18               ` Jakub Sitnicki
  2025-05-05 12:35                 ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 36+ messages in thread
From: Jakub Sitnicki @ 2025-05-05 10:18 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Arthur Fabre, Network Development, bpf,
	Jesper Dangaard Brouer, Yan Zhai, jbrandeburg, lbiancon,
	Alexei Starovoitov, Jakub Kicinski, Eric Dumazet, kernel-team

On Thu, May 01, 2025 at 12:43 PM +02, Toke Høiland-Jørgensen wrote:
> Jakub Sitnicki <jakub@cloudflare.com> writes:
>
>> On Wed, Apr 30, 2025 at 11:19 AM +02, Toke Høiland-Jørgensen wrote:
>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>
>>>> On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>>>>
>>>>> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
>>>>> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>
>> [...]
>>
>>>>> * Hardware metadata: metadata exposed from NICs (like the receive
>>>>>   timestamp, 4 tuple hash...) is currently only exposed to XDP programs
>>>>>   (via kfuncs).
>>>>>   But that doesn't expose them to the rest of the stack.
>>>>>   Storing them in traits would allow XDP, other BPF programs, and the
>>>>>   kernel to access and modify them (for example to take into account
>>>>>   decapsulating a packet).
>>>>
>>>> Sure. If traits == existing metadata bpf prog in xdp can communicate
>>>> with bpf prog in skb layer via that "trait" format.
>>>> xdp can take tuple hash and store it as key==0 in the trait.
>>>> The kernel doesn't need to know how to parse that format.
>>>
>>> Yes it does, to propagate it to the skb later. I.e.,
>>>
>>> XDP prog on NIC: get HW hash, store in traits, redirect to CPUMAP
>>> CPUMAP: build skb, read hash from traits, populate skb hash
>>>
>>> Same thing for (at least) timestamps and checksums.
>>>
>>> Longer term, with traits available we could move more skb fields into
>>> traits to make struct sk_buff smaller (by moving optional fields to
>>> traits that don't take up any space if they're not set).
>>
>> Perhaps we can have the cake and eat it too.
>>
>> We could leave the traits encoding/decoding out of the kernel and, at
>> the same time, *expose it* to the network stack through BPF struct_ops
>> programs. At a high level, for example ->get_rx_hash(), not the
>> individual K/V access. The traits_ops vtable could grow as needed to
>> support new use cases.
>>
>> If you think about it, it's not so different from BPF-powered congestion
>> algorithms and scheduler extensions. They also expose some state, kept in
>> maps, that only the loaded BPF code knows how to operate on.
>
> Right, the difference being that the kernel works perfectly well without
> an eBPF congestion control algorithm loaded because it has its own
> internal implementation that is used by default.

It seems to me that any code path in the network stack still needs to
work *even if* the traits K/V is not available. There has to be a
fallback - if the RX hash is not present in the traits K/V, we must
recompute it. There is no guarantee that there will be space available
in the traits K/V store for whatever value the network stack would like
to cache there.
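
Something like this for the RX hash, where trait_get(), skb_traits()
and TRAIT_RX_HASH are illustrative names, not this series' API:

	__u32 hash;

	if (trait_get(skb_traits(skb), TRAIT_RX_HASH, &hash,
		      sizeof(hash)) == 0)
		__skb_set_hash(skb, hash, false, false);
	else
		hash = skb_get_hash(skb);	/* miss: compute as today */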

So if we can agree that the traits K/V is a cache, with limited
capacity, and any code path accessing it must be prepared to deal with
a cache miss, then I think that with the struct_ops approach you could
have a built-in default implementation for exclusive use by the network
stack.

This default implementation of the storage access just wouldn't be
exposed to the BPF or user-space. If you want access from BPF/userland,
then you'd need to provide a BPF-backed struct_ops for accessing traits
K/V.
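
Roughly this shape, to make it concrete (hypothetical, following the
->get_rx_hash() example quoted above):

	struct traits_ops {
		int (*get_rx_hash)(const struct sk_buff *skb, __u32 *hash);
		int (*set_rx_hash)(struct sk_buff *skb, __u32 hash);
		/* grows as new use cases come up */
	};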

> Having a hard dependency on BPF for in-kernel functionality is a
> different matter, and limits the cases it can be used for.

Notice that we already rely on an XDP program being attached; otherwise
the storage for the traits K/V is not available.

> Besides, I don't really see the point of leaving the encoding out of the
> kernel? We keep the encoding kernel-internal anyway, and just expose a
> get/set API, so there's no constraint on changing it later (that's kinda
> the whole point of doing that). And with bulk get/set there's not an
> efficiency argument either. So what's the point, other than doing things
> in BPF for its own sake?

There's the additional complexity in the socket glue layer, but I've
already mentioned that.

What I think makes it even more appealing is that with the high-level
struct_ops approach, we abstract away the individual K/V pair access and
leave the problem of "key registration" (e.g., RX hash is key 42) to the
user-provided implementation.

You, as the user, decide for your particular system how you want to lay
out the values and for which values you actually want to reserve
space. IOW, we leave any trade off decisions to the user in the spirit
of providing a mechanism, not policy.


* Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
  2025-05-05 10:18               ` Jakub Sitnicki
@ 2025-05-05 12:35                 ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 36+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-05-05 12:35 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Alexei Starovoitov, Arthur Fabre, Network Development, bpf,
	Jesper Dangaard Brouer, Yan Zhai, jbrandeburg, lbiancon,
	Alexei Starovoitov, Jakub Kicinski, Eric Dumazet, kernel-team

Jakub Sitnicki <jakub@cloudflare.com> writes:

> On Thu, May 01, 2025 at 12:43 PM +02, Toke Høiland-Jørgensen wrote:
>> Jakub Sitnicki <jakub@cloudflare.com> writes:
>>
>>> On Wed, Apr 30, 2025 at 11:19 AM +02, Toke Høiland-Jørgensen wrote:
>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>>
>>>>> On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>>>>>
>>>>>> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
>>>>>> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>>
>>> [...]
>>>
>>>>>> * Hardware metadata: metadata exposed from NICs (like the receive
>>>>>>   timestamp, 4 tuple hash...) is currently only exposed to XDP programs
>>>>>>   (via kfuncs).
>>>>>>   But that doesn't expose them to the rest of the stack.
>>>>>>   Storing them in traits would allow XDP, other BPF programs, and the
>>>>>>   kernel to access and modify them (for example to take into account
>>>>>>   decapsulating a packet).
>>>>>
>>>>> Sure. If traits == existing metadata bpf prog in xdp can communicate
>>>>> with bpf prog in skb layer via that "trait" format.
>>>>> xdp can take tuple hash and store it as key==0 in the trait.
>>>>> The kernel doesn't need to know how to parse that format.
>>>>
>>>> Yes it does, to propagate it to the skb later. I.e.,
>>>>
>>>> XDP prog on NIC: get HW hash, store in traits, redirect to CPUMAP
>>>> CPUMAP: build skb, read hash from traits, populate skb hash
>>>>
>>>> Same thing for (at least) timestamps and checksums.
>>>>
>>>> Longer term, with traits available we could move more skb fields into
>>>> traits to make struct sk_buff smaller (by moving optional fields to
>>>> traits that don't take up any space if they're not set).
>>>
>>> Perhaps we can have the cake and eat it too.
>>>
>>> We could leave the traits encoding/decoding out of the kernel and, at
>>> the same time, *expose it* to the network stack through BPF struct_ops
>>> programs. At a high level, for example ->get_rx_hash(), not the
>>> individual K/V access. The traits_ops vtable could grow as needed to
>>> support new use cases.
>>>
>>> If you think about it, it's not so different from BPF-powered congestion
>>> algorithms and scheduler extensions. They also expose some state, kept in
>>> maps, that only the loaded BPF code knows how to operate on.
>>
>> Right, the difference being that the kernel works perfectly well without
>> an eBPF congestion control algorithm loaded because it has its own
>> internal implementation that is used by default.
>
> It seems to me that any code path in the network stack still needs to
> work *even if* the traits K/V is not available. There has to be a
> fallback - if the RX hash is not present in the traits K/V, we must
> recompute it. There is no guarantee that there will be space available
> in the traits K/V store for whatever value the network stack would like
> to cache there.

The stack is in control of both the memory allocation and the placement
of the kv-store, so it could totally guarantee that if needed. Which is
the whole point of making this kernel-internal, IMO.

> So if we can agree that the traits K/V is a cache, with limited
> capacity, and any code path accessing it must be prepared to deal with
> a cache miss, then I think that with the struct_ops approach you could
> have a built-in default implementation for exclusive use by the network
> stack.

If we have such a default implementation (which would presumably be the
one in this series), why would anyone override it? The need for that
seems a bit speculative, so why not just start with the one
implementation, and add the override if a really compelling use case
does eventually turn up?

> This default implementation of the storage access just wouldn't be
> exposed to the BPF or user-space. If you want access from BPF/userland,
> then you'd need to provide a BPF-backed struct_ops for accessing traits
> K/V.
>
>> Having a hard dependency on BPF for in-kernel functionality is a
>> different matter, and limits the cases it can be used for.
>
> Notice that we already rely on an XDP program being attached; otherwise
> the storage for the traits K/V is not available.
>
>> Besides, I don't really see the point of leaving the encoding out of the
>> kernel? We keep the encoding kernel-internal anyway, and just expose a
>> get/set API, so there's no constraint on changing it later (that's kinda
>> the whole point of doing that). And with bulk get/set there's not an
>> efficiency argument either. So what's the point, other than doing things
>> in BPF for its own sake?
>
> There's the additional complexity in the socket glue layer, but I've
> already mentioned that.
>
> What I think makes it even more appealing is that with the high-level
> struct_ops approach, we abstract away the individual K/V pair access and
> leave the problem of "key registration" (e.g., RX hash is key 42) to the
> user-provided implementation.
>
> You, as the user, decide for your particular system how you want to lay
> out the values and for which values you actually want to reserve
> space. IOW, we leave any trade-off decisions to the user in the spirit
> of providing a mechanism, not policy.

But we already have such a possibility: that's basically the metadata
space that XDP/TC already has access to. And the whole reason this
patch set makes sense is that we need the kernel to provide something
more structured (K/V) that facilitates sharing across applications that
don't have any central coordination across the system. Punting that
(back) off to BPF just gets us back to square one...

-Toke



end of thread

Thread overview: 36+ messages:
2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata Arthur Fabre
2025-04-24 16:22   ` Alexei Starovoitov
2025-04-25 19:26     ` Arthur Fabre
2025-04-29 23:36       ` Alexei Starovoitov
2025-04-30  9:19         ` Toke Høiland-Jørgensen
2025-04-30 16:29           ` Alexei Starovoitov
2025-05-01  7:30             ` Arthur Fabre
2025-04-30 19:19           ` Jakub Sitnicki
2025-05-01 10:43             ` Toke Høiland-Jørgensen
2025-05-01 14:03               ` Jesper Dangaard Brouer
2025-05-05 10:18               ` Jakub Sitnicki
2025-05-05 12:35                 ` Toke Høiland-Jørgensen
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 02/17] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 03/17] trait: XDP support Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 04/17] trait: XDP selftest Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 05/17] trait: XDP benchmark Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 06/17] trait: Replace memcpy calls with inline copies Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 07/17] trait: Replace memmove calls with inline move Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 08/17] skb: Extension header in packet headroom Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 09/17] trait: Store traits in sk_buff extension Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb Arthur Fabre
2025-04-23 16:36   ` Stanislav Fomichev
2025-04-23 20:54     ` Arthur Fabre
2025-04-23 23:45       ` Stanislav Fomichev
2025-04-24  9:49         ` Toke Høiland-Jørgensen
2025-04-24 15:39           ` Stanislav Fomichev
2025-04-24 18:59             ` Jakub Sitnicki
2025-04-25  8:06               ` Toke Høiland-Jørgensen
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 11/17] ice: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 12/17] veth: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 13/17] virtio_net: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 14/17] mlx5: move xdp_buff scope one level up Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 15/17] mlx5: Propagate trait presence to skb Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 16/17] xdp generic: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 17/17] trait: Allow socket filters to access traits Arthur Fabre
