* [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store
@ 2025-03-05 14:31 arthur
2025-03-05 14:31 ` [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata arthur
` (19 more replies)
0 siblings, 20 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:31 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
Currently, the only way to attach information to an sk_buff that travels
through the network stack is by using the mark field. This 32-bit field
is highly versatile - it can be read in firewall rules, drive routing
decisions, and be accessed by BPF programs.
However, its limited capacity creates competition for bits, restricting
its practical use.
To remedy this, we propose using part of the packet headroom to store
metadata. This would allow:
- Tracing packets through the network stack and across the kernel-user
space boundary, by assigning them a unique ID.
- Metadata-driven packet redirection, routing, and socket steering with
early classification in XDP.
- Extracting information from encapsulation headers and sharing it with
user space or vice versa.
- Exposing XDP RX Metadata, like the timestamp, to the rest of the
network stack.
We originally proposed extending XDP metadata - binary blob
storage also in the headroom - to expose it throughout the network
stack. However, based on feedback at LPC 2024 [1]:
- sharing a binary blob amongst different applications is hard.
- exposing a binary blob to userspace is awkward.
we've shifted to a limited KV store in the headroom.
To differentiate this from the overloaded "metadata" term, it's
tentatively called "packet traits".
A get() / set() / delete() API is exposed to BPF to store and
retrieve traits.
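For illustration, a minimal sketch of an XDP program using these kfuncs
(the kfunc declarations match the selftests later in the series; the key
number, value, and program name are arbitrary):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  extern int bpf_xdp_trait_set(const struct xdp_md *xdp, __u64 key,
                               const void *val, __u64 val__sz,
                               __u64 flags) __ksym;
  extern int bpf_xdp_trait_get(const struct xdp_md *xdp, __u64 key, void *val,
                               __u64 val__sz) __ksym;

  SEC("xdp")
  int tag_packet(struct xdp_md *xdp)
  {
          /* Tag every packet with a 4 byte trace ID under key 1, for
           * later stages to read back with bpf_xdp_trait_get().
           */
          __u32 trace_id = 42;

          if (bpf_xdp_trait_set(xdp, 1, &trace_id, sizeof(trace_id), 0) < 0)
                  return XDP_ABORTED;
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";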
Initial benchmarks in XDP are promising, with get() / set() comparable
to an indirect function call. See patch 6: "trait: Replace memmove calls
with inline move" for full results.
We imagine adding first class support for this in netfilter (setting
/ checking traits in rules) and routing (selecting routing tables
based on traits) in follow up work.
We also envisage a first class userspace API for storing and
retrieving traits in the future.
To co-exist with the existing XDP metadata area, traits are stored at
the start of the headroom:
| xdp_frame | traits | headroom | XDP metadata | data / packet |
Traits and XDP metadata are not allowed to overlap.
Like XDP metadata, this relies on there being sufficient headroom
available. Piggybacking on top of that work, traits are currently
only supported:
- On ingress.
- By NIC drivers that support XDP metadata.
- When an XDP program is attached.
This limits the applicability of traits. But future work
guaranteeing sufficient headroom through other means should allow
these restrictions to be lifted.
There are still a number of open questions:
- What sizes of values should be allowed? See patch 1 "trait: limited KV
store for packet metadata".
- How should we handle skb clones? See patch 16 "trait: Support sk_buffs".
- How should trait keys be allocated? See patch 18 "trait: registration
API".
- How should traits work with GRO? Could an API let us specify policies
for how traits should be merged? See patch 18 "trait: registration
API".
[1] https://lpc.events/event/18/contributions/1935/
Cc: jakub@cloudflare.com
Cc: hawk@kernel.org
Cc: yan@cloudflare.com
Cc: jbrandeburg@cloudflare.com
Cc: thoiland@redhat.com
Cc: lbiancon@redhat.com
To: netdev@vger.kernel.org
To: bpf@vger.kernel.org
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
Arthur Fabre (19):
trait: limited KV store for packet metadata
trait: XDP support
trait: basic XDP selftest
trait: basic XDP benchmark
trait: Replace memcpy calls with inline copies
trait: Replace memmove calls with inline move
xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions
trait: Propagate presence of traits to sk_buff
bnxt: Propagate trait presence to skb
ice: Propagate trait presence to skb
veth: Propagate trait presence to skb
virtio_net: Propagate trait presence to skb
mlx5: Propagate trait presence to skb
xdp generic: Propagate trait presence to skb
trait: Support sk_buffs
trait: Allow socket filters to access traits
trait: registration API
trait: Sync linux/bpf.h to tools/ for trait registration
trait: register traits in benchmarks and tests
Jesper Dangaard Brouer (1):
mlx5: move xdp_buff scope one level up
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 +
drivers/net/ethernet/intel/ice/ice_txrx.c | 4 +
drivers/net/ethernet/intel/ice/ice_xsk.c | 2 +
drivers/net/ethernet/mellanox/mlx5/core/en.h | 6 +-
.../net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 6 +-
.../net/ethernet/mellanox/mlx5/core/en/xsk/rx.h | 6 +-
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 114 ++++----
drivers/net/veth.c | 4 +
drivers/net/virtio_net.c | 8 +-
include/linux/bpf-netns.h | 12 +
include/linux/skbuff.h | 33 ++-
include/net/net_namespace.h | 6 +
include/net/netns/trait.h | 22 ++
include/net/trait.h | 288 +++++++++++++++++++++
include/net/xdp.h | 42 ++-
include/uapi/linux/bpf.h | 26 ++
kernel/bpf/net_namespace.c | 54 ++++
kernel/bpf/syscall.c | 26 ++
kernel/bpf/verifier.c | 39 ++-
net/core/dev.c | 1 +
net/core/filter.c | 43 ++-
net/core/skbuff.c | 25 +-
net/core/xdp.c | 50 ++++
tools/include/uapi/linux/bpf.h | 26 ++
tools/testing/selftests/bpf/Makefile | 2 +
tools/testing/selftests/bpf/bench.c | 11 +
tools/testing/selftests/bpf/bench.h | 1 +
.../selftests/bpf/benchs/bench_xdp_traits.c | 191 ++++++++++++++
.../testing/selftests/bpf/prog_tests/xdp_traits.c | 51 ++++
.../testing/selftests/bpf/progs/bench_xdp_traits.c | 131 ++++++++++
.../testing/selftests/bpf/progs/test_xdp_traits.c | 94 +++++++
31 files changed, 1259 insertions(+), 69 deletions(-)
---
base-commit: 42ba8a49d085e0c2ad50fb9a8ec954c9762b6e01
change-id: 20250305-afabre-traits-010-rfc2-a8e4de0c490b
Best regards,
--
Arthur Fabre <afabre@cloudflare.com>
* [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
@ 2025-03-05 14:31 ` arthur
2025-03-07 6:36 ` Alexei Starovoitov
2025-03-07 19:24 ` Jakub Sitnicki
2025-03-05 14:31 ` [PATCH RFC bpf-next 02/20] trait: XDP support arthur
` (18 subsequent siblings)
19 siblings, 2 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:31 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
A (very limited) KV store to support storing KVs in the packet headroom,
with:
- 64 keys (0-63).
- 2, 4 or 8 byte values.
- O(1) lookup
- O(n) insertion
- A fixed 16 byte header.
Values are stored ordered by key, immediately following the fixed
header.
This could be extended in the future; for now it implements the smallest
possible API. The API intentionally uses u64 keys to avoid imposing
restrictions on future implementations.
I picked 2, 4, and 8 bytes arbitrarily. We could also support zero-sized
values for use as flags.
A 16 byte value could be useful to store UUIDs and IPv6 addresses.
If we want more than 3 sizes, we can add a word to the header (for a
total of 24 bytes) to support 7 sizes.
We could also allow users to set several consecutive keys in one
trait_set() call to support storing larger values.
Implemented in the header file so functions are always inlinable.
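To make the layout concrete: as documented in the header comments below,
each value's size is encoded as two bits, split across two 64-bit words
(high and low) in the header, with the bit position being the key. A
worked example, with keys and sizes chosen arbitrarily:

  Key 3 holds a 4 byte value (size bits 10), key 10 an 8 byte value (11):

    high = (1 << 3) | (1 << 10)
    low  =            (1 << 10)

  Offset of key 10's value = 16 byte header
                           + total size of keys 0-9 (4 bytes, from key 3)
                           = 20 bytes from the start of the store.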
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
include/net/trait.h | 243 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 243 insertions(+)
diff --git a/include/net/trait.h b/include/net/trait.h
new file mode 100644
index 0000000000000000000000000000000000000000..536b8a17dbbc091b4d1a4d7b4b21c1e36adea86a
--- /dev/null
+++ b/include/net/trait.h
@@ -0,0 +1,243 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2025 Arthur Fabre, Cloudflare */
+#ifndef __LINUX_NET_TRAIT_H__
+#define __LINUX_NET_TRAIT_H__
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/bitops.h>
+
+/* Traits are a very limited KV store, with:
+ * - 64 keys (0-63).
+ * - 2, 4 or 8 byte values.
+ * - O(1) lookup
+ * - O(n) insertion
+ * - A fixed 16 byte header.
+ */
+struct __trait_hdr {
+ /* Values are stored ordered by key, immediately after the header.
+ *
+ * The size of each value set is stored in the header as two bits:
+ * - 00: Not set.
+ * - 01: 2 bytes.
+ * - 10: 4 bytes.
+ * - 11: 8 bytes.
+ *
+ * Naively storing the size of each value (eg packing 4 of them in a byte)
+ * doesn't let us efficiently calculate the offset of the value of an arbitrary key:
+ * we'd have to mask out and sum the sizes of all preceding values.
+ *
+ * Instead, store the high and low bits of the size in two separate words:
+ * - A high bit in the high word.
+ * - A low bit in the low word.
+ * The bit position of each word, LSb 0, is the key.
+ *
+ * To calculate the offset of the value of a key:
+ * - Mask out subsequent keys from both words.
+ * - Calculate the hamming weight / population count of each word.
+ * Single instruction on modern amd64 CPUs.
+ * - hweight(low) + hweight(high)<<1 is offset.
+ */
+ u64 high;
+ u64 low;
+};
+
+static __always_inline bool __trait_valid_len(u64 len)
+{
+ return len == 2 || len == 4 || len == 8;
+}
+
+static __always_inline bool __trait_valid_key(u64 key)
+{
+ return key < 64;
+}
+
+static __always_inline int __trait_total_length(struct __trait_hdr h)
+{
+ return (hweight64(h.low) << 1) + (hweight64(h.high) << 2)
+ // For size 8, we only get 4+2=6. Add another 2 in.
+ + (hweight64(h.high & h.low) << 1);
+}
+
+static __always_inline struct __trait_hdr __trait_and(struct __trait_hdr h, u64 mask)
+{
+ return (struct __trait_hdr){
+ h.high & mask,
+ h.low & mask,
+ };
+}
+
+static __always_inline int __trait_offset(struct __trait_hdr h, u64 key)
+{
+ /* Calculate total length of previous keys by masking out keys after. */
+ return sizeof(struct __trait_hdr) + __trait_total_length(__trait_and(h, ~(~0llu << key)));
+}
+
+/**
+ * traits_init() - Initialize a trait store.
+ * @traits: Start of trait store area.
+ * @hard_end: Hard limit the trait store can currently grow up against.
+ * Can change dynamically after initialization, as long as it
+ * does not overwrite any area already used (see traits_size()).
+ *
+ * Return:
+ * * %0 - Success.
+ * * %-ENOMEM - Not enough room to store any traits.
+ */
+static __always_inline int traits_init(void *traits, void *hard_end)
+{
+ if (traits + sizeof(struct __trait_hdr) > hard_end)
+ return -ENOMEM;
+
+ memset(traits, 0, sizeof(struct __trait_hdr));
+ return 0;
+}
+
+/**
+ * traits_size() - Total size currently used by a trait store.
+ * @traits: Start of trait store area.
+ *
+ * Return: Size in bytes.
+ */
+static __always_inline int traits_size(void *traits)
+{
+ return sizeof(struct __trait_hdr) + __trait_total_length(*(struct __trait_hdr *)traits);
+}
+
+/**
+ * trait_set() - Set a trait key.
+ * @traits: Start of trait store area.
+ * @hard_end: Hard limit the trait store can currently grow up against.
+ * @key: The key to set.
+ * @val: The value to set.
+ * @len: The length of the value.
+ * @flags: Unused for now. Should be 0.
+ *
+ * Return:
+ * * %0 - Success.
+ * * %-EINVAL - Key or length invalid.
+ * * %-EBUSY - Key previously set with different length.
+ * * %-ENOSPC - Not enough room left to store value.
+ */
+static __always_inline
+int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
+{
+ if (!__trait_valid_key(key) || !__trait_valid_len(len))
+ return -EINVAL;
+
+ struct __trait_hdr *h = (struct __trait_hdr *)traits;
+
+ /* Offset of value of this key. */
+ int off = __trait_offset(*h, key);
+
+ if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
+ /* Key is already set, but with a different length */
+ if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
+ return -EBUSY;
+ } else {
+ /* Figure out if we have enough room left: total length of everything now. */
+ if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
+ return -ENOSPC;
+
+ /* Memmove all the kvs after us over. */
+ if (traits_size(traits) > off)
+ memmove(traits + off + len, traits + off, traits_size(traits) - off);
+ }
+
+ /* Set our value. */
+ memcpy(traits + off, val, len);
+
+ /* Store our length in header. */
+ u64 encode_len = 0;
+
+ switch (len) {
+ case 2:
+ encode_len = 1;
+ break;
+ case 4:
+ encode_len = 2;
+ break;
+ case 8:
+ encode_len = 3;
+ break;
+ }
+ h->high |= (encode_len >> 1) << key;
+ h->low |= (encode_len & 1) << key;
+ return 0;
+}
+
+/**
+ * trait_get() - Get a trait key.
+ * @traits: Start of trait store area.
+ * @key: The key to get.
+ * @val: Where to put stored value.
+ * @val_len: The length of val.
+ *
+ * Return:
+ * * %>0 - Actual size of value.
+ * %-EINVAL - Key invalid.
+ * * %-ENOENT - Key has not been set with trait_set() previously.
+ * * %-ENOSPC - Val is not big enough to hold stored value.
+ */
+static __always_inline
+int trait_get(void *traits, u64 key, void *val, u64 val_len)
+{
+ if (!__trait_valid_key(key))
+ return -EINVAL;
+
+ struct __trait_hdr h = *(struct __trait_hdr *)traits;
+
+ /* Check key is set */
+ if (!((h.high & (1ull << key)) || (h.low & (1ull << key))))
+ return -ENOENT;
+
+ /* Offset of value of this key */
+ int off = __trait_offset(h, key);
+
+ /* Figure out our length */
+ int real_len = __trait_total_length(__trait_and(h, (1ull << key)));
+
+ if (real_len > val_len)
+ return -ENOSPC;
+
+ memcpy(val, traits + off, real_len);
+ return real_len;
+}
+
+/**
+ * trait_del() - Delete a trait key.
+ * @traits: Start of trait store area.
+ * @key: The key to delete.
+ *
+ * Return:
+ * * %0 - Success.
+ * %-EINVAL - Key invalid.
+ * * %-ENOENT - Key has not been set with trait_set() previously.
+ */
+static __always_inline int trait_del(void *traits, u64 key)
+{
+ if (!__trait_valid_key(key))
+ return -EINVAL;
+
+ struct __trait_hdr *h = (struct __trait_hdr *)traits;
+
+ /* Check key is set */
+ if (!((h->high & (1ull << key)) || (h->low & (1ull << key))))
+ return -ENOENT;
+
+ /* Offset and length of value of this key */
+ int off = __trait_offset(*h, key);
+ int len = __trait_total_length(__trait_and(*h, (1ull << key)));
+
+ /* Memmove all the kvs after us over */
+ if (traits_size(traits) > off + len)
+ memmove(traits + off, traits + off + len, traits_size(traits) - off - len);
+
+ /* Clear our length in header */
+ h->high &= ~(1ull << key);
+ h->low &= ~(1ull << key);
+ return 0;
+}
+
+#endif /* __LINUX_NET_TRAIT_H__ */
--
2.43.0
* [PATCH RFC bpf-next 02/20] trait: XDP support
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
2025-03-05 14:31 ` [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata arthur
@ 2025-03-05 14:31 ` arthur
2025-03-07 19:13 ` Lorenzo Bianconi
2025-03-05 14:32 ` [PATCH RFC bpf-next 03/20] trait: basic XDP selftest arthur
` (17 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: arthur @ 2025-03-05 14:31 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Store the trait KV store in the xdp_buff headroom, right after the
xdp_frame.
This ensures it can be used at the same time as XDP metadata; runtime
checks prevent the two from overlapping.
It also avoids having to move the trait KV store on every
xdp_adjust_head() call.
Support for traits is gated on support for metadata: to propagate the
traits to the skb layer, drivers will have to explicitly signal support,
as they already do for metadata with skb_metadata_set().
This also makes bounds checks simpler: we can rely on xdp->data_meta
always being a valid hard_end for traits.
This assumes there is enough headroom to store the traits header. That
avoids having to check in the hot-path, where all we could do anyway is
BUG_ON().
Defining _XDP_FRAME_SIZE in skbuff.h seems awkward, but we'll need it
to access the traits in the skb layer. Might as well use it now, to
avoid having to move the definition of struct xdp_frame.
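For reference, the resulting headroom layout is sketched below (the 40
byte xdp_frame size matches the static_assert added here):

  data_hard_start
  |
  v
  | xdp_frame (40B) | traits | free headroom | XDP metadata | packet |
                    ^                        ^              ^
              xdp_buff_traits()         data_meta         data

The trait store grows towards xdp->data_meta, which acts as its
hard_end.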
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
include/linux/skbuff.h | 3 +++
include/net/xdp.h | 20 ++++++++++++++++++++
net/core/filter.c | 7 +++----
net/core/xdp.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 76 insertions(+), 4 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb2b751d274acff931281a72e8b4b0c699b4e8af..03553c2200ab1c3750b799edb94fa3b94e11a5f1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -274,6 +274,9 @@
SKB_DATA_ALIGN(sizeof(struct sk_buff)) + \
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+/* From xdp.h, to avoid indirectly including skbuff.h */
+#define _XDP_FRAME_SIZE (40)
+
struct ahash_request;
struct net_device;
struct scatterlist;
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 4dafc5e021f13688f0bf69a21bff58d394d1ac28..58019fa299b56dbd45c104fdfa807f73af6e4fa4 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -10,6 +10,7 @@
#include <linux/filter.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* skb_shared_info */
+#include <net/trait.h>
#include <net/page_pool/types.h>
@@ -115,6 +116,11 @@ static __always_inline void xdp_buff_set_frag_pfmemalloc(struct xdp_buff *xdp)
xdp->flags |= XDP_FLAGS_FRAGS_PF_MEMALLOC;
}
+static __always_inline void *xdp_buff_traits(const struct xdp_buff *xdp)
+{
+ return xdp->data_hard_start + _XDP_FRAME_SIZE;
+}
+
static __always_inline void
xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq)
{
@@ -133,6 +139,13 @@ xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start,
xdp->data = data;
xdp->data_end = data + data_len;
xdp->data_meta = meta_valid ? data : data + 1;
+
+ if (meta_valid) {
+ /* We assume drivers reserve enough headroom to store xdp_frame
+ * and the traits header.
+ */
+ traits_init(xdp_buff_traits(xdp), xdp->data_meta);
+ }
}
/* Reserve memory area at end-of data area.
@@ -267,6 +280,8 @@ struct xdp_frame {
u32 flags; /* supported values defined in xdp_buff_flags */
};
+static_assert(sizeof(struct xdp_frame) == _XDP_FRAME_SIZE);
+
static __always_inline bool xdp_frame_has_frags(const struct xdp_frame *frame)
{
return !!(frame->flags & XDP_FLAGS_HAS_FRAGS);
@@ -517,6 +532,11 @@ static inline bool xdp_metalen_invalid(unsigned long metalen)
return !IS_ALIGNED(metalen, sizeof(u32)) || metalen > meta_max;
}
+static __always_inline void *xdp_meta_hard_start(const struct xdp_buff *xdp)
+{
+ return xdp_buff_traits(xdp) + traits_size(xdp_buff_traits(xdp));
+}
+
struct xdp_attachment_info {
struct bpf_prog *prog;
u32 flags;
diff --git a/net/core/filter.c b/net/core/filter.c
index dcc53ac5c5458f67a422453134665d43d466a02e..79b78e7cd57fd78c6cc8443da54ae96408c496b0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,7 @@
#include <linux/un.h>
#include <net/xdp_sock_drv.h>
#include <net/inet_dscp.h>
+#include <net/trait.h>
#include "dev.h"
@@ -3935,9 +3936,8 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
{
- void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
unsigned long metalen = xdp_get_metalen(xdp);
- void *data_start = xdp_frame_end + metalen;
+ void *data_start = xdp_meta_hard_start(xdp) + metalen;
void *data = xdp->data + offset;
if (unlikely(data < data_start ||
@@ -4228,13 +4228,12 @@ static const struct bpf_func_proto bpf_xdp_adjust_tail_proto = {
BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
{
- void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
void *meta = xdp->data_meta + offset;
unsigned long metalen = xdp->data - meta;
if (xdp_data_meta_unsupported(xdp))
return -ENOTSUPP;
- if (unlikely(meta < xdp_frame_end ||
+ if (unlikely(meta < xdp_meta_hard_start(xdp) ||
meta > xdp->data))
return -EINVAL;
if (unlikely(xdp_metalen_invalid(metalen)))
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 2c6ab6fb452f7b90d85125ae17fef96cfc9a8576..2e87f82aa5f835f60295d859a524e40bd47c42ee 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -1032,3 +1032,53 @@ void xdp_features_clear_redirect_target(struct net_device *dev)
xdp_set_features_flag(dev, val);
}
EXPORT_SYMBOL_GPL(xdp_features_clear_redirect_target);
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc int bpf_xdp_trait_set(const struct xdp_buff *xdp, u64 key,
+ const void *val, u64 val__sz, u64 flags)
+{
+ if (xdp_data_meta_unsupported(xdp))
+ return -EOPNOTSUPP;
+
+ return trait_set(xdp_buff_traits(xdp), xdp->data_meta, key,
+ val, val__sz, flags);
+}
+
+__bpf_kfunc int bpf_xdp_trait_get(const struct xdp_buff *xdp, u64 key,
+ void *val, u64 val__sz)
+{
+ if (xdp_data_meta_unsupported(xdp))
+ return -EOPNOTSUPP;
+
+ return trait_get(xdp_buff_traits(xdp), key, val, val__sz);
+}
+
+__bpf_kfunc int bpf_xdp_trait_del(const struct xdp_buff *xdp, u64 key)
+{
+ if (xdp_data_meta_unsupported(xdp))
+ return -EOPNOTSUPP;
+
+ return trait_del(xdp_buff_traits(xdp), key);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(xdp_trait)
+// TODO - should we use KF_TRUSTED_ARGS? https://www.kernel.org/doc/html/next/bpf/kfuncs.html#kf-trusted-args-flag
+BTF_ID_FLAGS(func, bpf_xdp_trait_set)
+BTF_ID_FLAGS(func, bpf_xdp_trait_get)
+BTF_ID_FLAGS(func, bpf_xdp_trait_del)
+BTF_KFUNCS_END(xdp_trait)
+
+static const struct btf_kfunc_id_set xdp_trait_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &xdp_trait,
+};
+
+static int xdp_trait_init(void)
+{
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP,
+ &xdp_trait_kfunc_set);
+}
+late_initcall(xdp_trait_init);
--
2.43.0
* [PATCH RFC bpf-next 03/20] trait: basic XDP selftest
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
2025-03-05 14:31 ` [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata arthur
2025-03-05 14:31 ` [PATCH RFC bpf-next 02/20] trait: XDP support arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 04/20] trait: basic XDP benchmark arthur
` (16 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Doesn't test interop with the "data_meta" area yet, or error cases.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
.../testing/selftests/bpf/prog_tests/xdp_traits.c | 35 ++++++++
.../testing/selftests/bpf/progs/test_xdp_traits.c | 94 ++++++++++++++++++++++
2 files changed, 129 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_traits.c b/tools/testing/selftests/bpf/prog_tests/xdp_traits.c
new file mode 100644
index 0000000000000000000000000000000000000000..4175b28d45e91e82435e646e5edd783980d5fe70
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_traits.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+
+static void _test_xdp_traits(void)
+{
+ const char *file = "./test_xdp_traits.bpf.o";
+ struct bpf_object *obj;
+ int err, prog_fd;
+ char buf[128];
+
+ LIBBPF_OPTS(bpf_test_run_opts, topts,
+ .data_in = &pkt_v4,
+ .data_size_in = sizeof(pkt_v4),
+ .data_out = buf,
+ .data_size_out = sizeof(buf),
+ .repeat = 1,
+ );
+
+ err = bpf_prog_test_load(file, BPF_PROG_TYPE_XDP, &obj, &prog_fd);
+ if (!ASSERT_OK(err, "test_xdp_traits"))
+ return;
+
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+ ASSERT_OK(err, "prog_run");
+ ASSERT_EQ(topts.retval, XDP_PASS, "retval");
+
+ bpf_object__close(obj);
+}
+
+void test_xdp_traits(void)
+{
+ if (test__start_subtest("xdp_traits"))
+ _test_xdp_traits();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_traits.c b/tools/testing/selftests/bpf/progs/test_xdp_traits.c
new file mode 100644
index 0000000000000000000000000000000000000000..79e0b0dedb5c05792579861975e8caf290f5f42b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_xdp_traits.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <errno.h>
+
+extern int bpf_xdp_trait_set(const struct xdp_md *xdp, __u64 key,
+ const void *val, __u64 val__sz,
+ __u64 flags) __ksym;
+extern int bpf_xdp_trait_get(const struct xdp_md *xdp, __u64 key, void *val,
+ __u64 val__sz) __ksym;
+extern int bpf_xdp_trait_del(const struct xdp_md *xdp, __u64 key) __ksym;
+
+SEC("xdp")
+int _xdp_traits(struct xdp_md *xdp)
+{
+ int ret;
+ __u16 val, got, want;
+
+ // No keys to start.
+ for (int i = 0; i < 64; i++) {
+ ret = bpf_xdp_trait_get(xdp, i, &val, sizeof(val));
+ if (ret != -ENOENT) {
+ bpf_printk("get(%d) ret %d", i, ret);
+ return XDP_DROP;
+ }
+ }
+
+ // Set 64 2 byte KVs.
+ for (int i = 0; i < 64; i++) {
+ val = i << 8 | i;
+ ret = bpf_xdp_trait_set(xdp, i, &val, sizeof(val), 0);
+ if (ret < 0) {
+ bpf_printk("set(%d) ret %d\n", i, ret);
+ return XDP_DROP;
+ }
+ bpf_printk("set(%d, 0x%04x)\n", i, val);
+ }
+
+ // Check we can get the 64 2 byte KVs back out.
+ for (int i = 0; i < 64; i++) {
+ ret = bpf_xdp_trait_get(xdp, i, &got, sizeof(got));
+ if (ret != 2) {
+ bpf_printk("get(%d) ret %d", i, ret);
+ return XDP_DROP;
+ }
+ want = (i << 8) | i;
+ if (got != want) {
+ bpf_printk("get(%d) got 0x%04x want 0x%04x\n", i, got,
+ want);
+ return XDP_DROP;
+ }
+ bpf_printk("get(%d) 0x%04x\n", i, got);
+ }
+
+ // Overwrite all 64 2 byte KVs.
+ for (int i = 0; i < 64; i++) {
+ val = i << 9 | i << 1;
+ ret = bpf_xdp_trait_set(xdp, i, &val, sizeof(val), 0);
+ if (ret < 0) {
+ bpf_printk("set(%d) ret %d\n", i, ret);
+ return XDP_DROP;
+ }
+ bpf_printk("set(%d, 0x%04x)\n", i, val);
+ }
+
+ // Delete all the even KVs.
+ for (int i = 0; i < 64; i += 2) {
+ ret = bpf_xdp_trait_del(xdp, i);
+ if (ret < 0) {
+ bpf_printk("del(%d) ret %d\n", i, ret);
+ return XDP_DROP;
+ }
+ }
+
+ // Read out all the odd KVs again.
+ for (int i = 1; i < 64; i += 2) {
+ ret = bpf_xdp_trait_get(xdp, i, &got, sizeof(got));
+ if (ret != 2) {
+ bpf_printk("get(%d) ret %d", i, ret);
+ return XDP_DROP;
+ }
+ want = (i << 9) | i << 1;
+ if (got != want) {
+ bpf_printk("get(%d) got 0x%04x want 0x%04x\n", i, got,
+ want);
+ return XDP_DROP;
+ }
+ bpf_printk("get(%d) 0x%04x\n", i, got);
+ }
+
+ return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.43.0
* [PATCH RFC bpf-next 04/20] trait: basic XDP benchmark
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (2 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 03/20] trait: basic XDP selftest arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 05/20] trait: Replace memcpy calls with inline copies arthur
` (15 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Basic benchmarks for:
* Getting a trait.
* Setting a trait, with no traits already stored after it.
* Moving traits, when setting or deleting a trait with traits stored
after it.
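Assuming the usual selftests bench invocation, they can be run with
something like (the --trait-len flag is added by this patch):

  ./bench xdp-trait-get --trait-len 8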
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
tools/testing/selftests/bpf/Makefile | 2 +
tools/testing/selftests/bpf/bench.c | 8 ++
.../selftests/bpf/benchs/bench_xdp_traits.c | 160 +++++++++++++++++++++
.../testing/selftests/bpf/progs/bench_xdp_traits.c | 131 +++++++++++++++++
4 files changed, 301 insertions(+)
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e6a02d5b87d123cef7e6b41bfbc96d34083f38e1..6b7a7351664cf611ce72fb76b308b7392e3811ba 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -815,6 +815,7 @@ $(OUTPUT)/bench_local_storage_create.o: $(OUTPUT)/bench_local_storage_create.ske
$(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
$(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
$(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
+$(OUTPUT)/bench_xdp_traits.o: $(OUTPUT)/bench_xdp_traits.skel.h
$(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
$(OUTPUT)/bench: LDLIBS += -lm
$(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -835,6 +836,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
$(OUTPUT)/bench_local_storage_create.o \
$(OUTPUT)/bench_htab_mem.o \
$(OUTPUT)/bench_bpf_crypto.o \
+ $(OUTPUT)/bench_xdp_traits.o \
#
$(call msg,BINARY,,$@)
$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 1bd403a5ef7b3401f501d790453c7be9ed289cb5..4678b928fc6ad2f0a870a25d9b10c75a1f6d77ba 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -283,6 +283,7 @@ extern struct argp bench_local_storage_create_argp;
extern struct argp bench_htab_mem_argp;
extern struct argp bench_trigger_batch_argp;
extern struct argp bench_crypto_argp;
+extern struct argp bench_trait_argp;
static const struct argp_child bench_parsers[] = {
{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -297,6 +298,7 @@ static const struct argp_child bench_parsers[] = {
{ &bench_htab_mem_argp, 0, "hash map memory benchmark", 0 },
{ &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 },
{ &bench_crypto_argp, 0, "bpf crypto benchmark", 0 },
+ { &bench_trait_argp, 0, "XDP trait benchmark", 0 },
{},
};
@@ -549,6 +551,9 @@ extern const struct bench bench_local_storage_create;
extern const struct bench bench_htab_mem;
extern const struct bench bench_crypto_encrypt;
extern const struct bench bench_crypto_decrypt;
+extern const struct bench bench_xdp_trait_get;
+extern const struct bench bench_xdp_trait_set;
+extern const struct bench bench_xdp_trait_move;
static const struct bench *benchs[] = {
&bench_count_global,
@@ -609,6 +614,9 @@ static const struct bench *benchs[] = {
&bench_htab_mem,
&bench_crypto_encrypt,
&bench_crypto_decrypt,
+ &bench_xdp_trait_get,
+ &bench_xdp_trait_set,
+ &bench_xdp_trait_move,
};
static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/benchs/bench_xdp_traits.c b/tools/testing/selftests/bpf/benchs/bench_xdp_traits.c
new file mode 100644
index 0000000000000000000000000000000000000000..0fbcd49edd825f53e6957319d3f05efc218dfb02
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_xdp_traits.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <argp.h>
+#include "bench.h"
+#include "bench_xdp_traits.skel.h"
+
+static struct trait_ctx {
+ struct bench_xdp_traits *skel;
+ int pfd;
+} ctx;
+
+static struct trait_args {
+ u32 trait_len;
+} args = {
+ .trait_len = 2,
+};
+
+enum {
+ ARG_TRAIT_LEN = 6000,
+};
+
+static const struct argp_option opts[] = {
+ { "trait-len", ARG_TRAIT_LEN, "TRAIT_LEN", 0,
+ "Set the length of the trait set/get" },
+ {},
+};
+
+static error_t trait_parse_arg(int key, char *arg, struct argp_state *state)
+{
+ switch (key) {
+ case ARG_TRAIT_LEN:
+ args.trait_len = strtoul(arg, NULL, 10);
+ if (args.trait_len != 2 && args.trait_len != 4 && args.trait_len != 8) {
+ fprintf(stderr, "Invalid trait length\n");
+ argp_usage(state);
+ }
+ break;
+ default:
+ return ARGP_ERR_UNKNOWN;
+ }
+
+ return 0;
+}
+
+const struct argp bench_trait_argp = {
+ .options = opts,
+ .parser = trait_parse_arg,
+};
+
+static void trait_validate(void)
+{
+ if (env.consumer_cnt != 0) {
+ fprintf(stderr, "bpf trait benchmark doesn't support consumer!\n");
+ exit(1);
+ }
+}
+
+static void trait_setup(void)
+{
+ int err;
+
+ setup_libbpf();
+
+ ctx.skel = bench_xdp_traits__open();
+ if (!ctx.skel) {
+ fprintf(stderr, "failed to open skeleton\n");
+ exit(1);
+ }
+
+ ctx.skel->bss->trait_len = args.trait_len;
+
+ err = bench_xdp_traits__load(ctx.skel);
+ if (err) {
+ fprintf(stderr, "failed to load skeleton\n");
+ bench_xdp_traits__destroy(ctx.skel);
+ exit(1);
+ }
+}
+
+static void trait_get_setup(void)
+{
+ trait_setup();
+ ctx.pfd = bpf_program__fd(ctx.skel->progs.trait_get);
+}
+
+static void trait_set_setup(void)
+{
+ trait_setup();
+ ctx.pfd = bpf_program__fd(ctx.skel->progs.trait_set);
+}
+
+static void trait_move_setup(void)
+{
+ trait_setup();
+ ctx.pfd = bpf_program__fd(ctx.skel->progs.trait_move);
+}
+
+static void trait_measure(struct bench_res *res)
+{
+ res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
+}
+
+static void *trait_producer(void *unused)
+{
+ static char in[14];
+ int err;
+
+ LIBBPF_OPTS(bpf_test_run_opts, opts,
+ .data_in = in,
+ .data_size_in = sizeof(in),
+ .repeat = 256, // max
+ );
+
+ while (true) {
+ err = bpf_prog_test_run_opts(ctx.pfd, &opts);
+ if (err != 0) {
+ fprintf(stderr, "failed to prog_run: %d\n", err);
+ return NULL;
+ }
+ if (opts.retval != 0) {
+ fprintf(stderr, "prog didn't return 0: %d\n", opts.retval);
+ return NULL;
+ }
+ }
+
+ return NULL;
+}
+
+const struct bench bench_xdp_trait_get = {
+ .name = "xdp-trait-get",
+ .argp = &bench_trait_argp,
+ .validate = trait_validate,
+ .setup = trait_get_setup,
+ .producer_thread = trait_producer,
+ .measure = trait_measure,
+ .report_progress = ops_report_progress,
+ .report_final = ops_report_final,
+};
+
+const struct bench bench_xdp_trait_set = {
+ .name = "xdp-trait-set",
+ .argp = &bench_trait_argp,
+ .validate = trait_validate,
+ .setup = trait_set_setup,
+ .producer_thread = trait_producer,
+ .measure = trait_measure,
+ .report_progress = ops_report_progress,
+ .report_final = ops_report_final,
+};
+
+const struct bench bench_xdp_trait_move = {
+ .name = "xdp-trait-move",
+ .argp = &bench_trait_argp,
+ .validate = trait_validate,
+ .setup = trait_move_setup,
+ .producer_thread = trait_producer,
+ .measure = trait_measure,
+ .report_progress = ops_report_progress,
+ .report_final = ops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/progs/bench_xdp_traits.c b/tools/testing/selftests/bpf/progs/bench_xdp_traits.c
new file mode 100644
index 0000000000000000000000000000000000000000..558c1ab22e09990d3eb0f78f916fd02de3def749
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bench_xdp_traits.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <errno.h>
+
+extern int bpf_xdp_trait_set(const struct xdp_md *xdp, __u64 key,
+ const void *val, __u64 val__sz,
+ __u64 flags) __ksym;
+extern int bpf_xdp_trait_get(const struct xdp_md *xdp, __u64 key, void *val,
+ __u64 val__sz) __ksym;
+extern int bpf_xdp_trait_del(const struct xdp_md *xdp, __u64 key) __ksym;
+
+__u32 trait_len;
+long hits = 0;
+
+#define ITERATIONS (10000)
+
+SEC("xdp")
+int trait_get(struct xdp_md *xdp)
+{
+ int ret, i;
+ __u16 val2 = 0xdead;
+ __u32 val4 = 0xdeadbeef;
+ __u64 val8 = 0xdeadbeefafafcfcf;
+
+#define BENCH_GET(val, size) do { \
+ ret = bpf_xdp_trait_set(xdp, 32, &val, sizeof(val), 0); \
+ if (ret != 0) \
+ return ret; \
+ for (i = 0; i < ITERATIONS; i++) \
+ ret = bpf_xdp_trait_get(xdp, 32, &val, sizeof(val)); \
+ if (ret != size) \
+ return ret; \
+ } while (0)
+
+ switch (trait_len) {
+ case 2:
+ BENCH_GET(val2, 2);
+ break;
+ case 4:
+ BENCH_GET(val4, 4);
+ break;
+ case 8:
+ BENCH_GET(val8, 8);
+ break;
+ }
+
+ __sync_add_and_fetch(&hits, ITERATIONS);
+ return 0;
+}
+
+SEC("xdp")
+int trait_set(struct xdp_md *xdp)
+{
+ int ret, i;
+ __u16 val2 = 0xdead;
+ __u32 val4 = 0xdeadbeef;
+ __u64 val8 = 0xdeadbeefafafcfcf;
+
+#define BENCH_SET(val) do { \
+ for (i = 0; i < ITERATIONS; i++) \
+ ret = bpf_xdp_trait_set(xdp, 32, &val, sizeof(val), 0); \
+ } while (0)
+
+ switch (trait_len) {
+ case 2:
+ BENCH_SET(val2);
+ break;
+ case 4:
+ BENCH_SET(val4);
+ break;
+ case 8:
+ BENCH_SET(val8);
+ break;
+ }
+
+ if (ret != 0)
+ return ret;
+
+ __sync_add_and_fetch(&hits, ITERATIONS);
+ return 0;
+}
+
+SEC("xdp")
+int trait_move(struct xdp_md *xdp)
+{
+ int ret, ret_del, i;
+ __u16 val2 = 0xdead;
+ __u32 val4 = 0xdeadbeef;
+ __u64 val8 = 0xdeadbeefafafcfcf;
+
+#define BENCH_MOVE(val) do { \
+ for (i = 0; i < 8; i++) { \
+ ret = bpf_xdp_trait_set(xdp, 40+i, &val8, sizeof(val8), 0); \
+ if (ret != 0) \
+ return ret; \
+ } \
+ /* We do two operations per iteration, so do half as many to make it
+ * vaguely comparable with other benchmarks
+ */ \
+ for (i = 0; i < ITERATIONS/2; i++) { \
+ /* Need to delete after, otherwise we'll just overwrite an
+ * existing trait and there will be nothing to move.
+ */ \
+ ret = bpf_xdp_trait_set(xdp, 32, &val, sizeof(val), 0); \
+ ret_del = bpf_xdp_trait_del(xdp, 32); \
+ } \
+ } while (0)
+
+ switch (trait_len) {
+ case 2:
+ BENCH_MOVE(val2);
+ break;
+ case 4:
+ BENCH_MOVE(val4);
+ break;
+ case 8:
+ BENCH_MOVE(val8);
+ break;
+ }
+
+ if (ret != 0)
+ return ret;
+ if (ret_del != 0)
+ return ret_del;
+
+ __sync_add_and_fetch(&hits, ITERATIONS);
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.43.0
* [PATCH RFC bpf-next 05/20] trait: Replace memcpy calls with inline copies
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (3 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 04/20] trait: basic XDP benchmark arthur
@ 2025-03-05 14:32 ` arthur
2025-03-10 10:50 ` Lorenzo Bianconi
2025-03-10 22:15 ` David Laight
2025-03-05 14:32 ` [PATCH RFC bpf-next 06/20] trait: Replace memmove calls with inline move arthur
` (14 subsequent siblings)
19 siblings, 2 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
When copying trait values to or from the caller, the size isn't a
constant so memcpy() ends up being a function call.
Replace it with an inline implementation that only handles the sizes we
support.
We store values "packed", so they won't necessarily be 4 or 8 byte
aligned.
Setting and getting traits is roughly 40% faster.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
include/net/trait.h | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/include/net/trait.h b/include/net/trait.h
index 536b8a17dbbc091b4d1a4d7b4b21c1e36adea86a..d4581a877bd57a32e2ad032147c906764d6d37f8 100644
--- a/include/net/trait.h
+++ b/include/net/trait.h
@@ -7,6 +7,7 @@
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/bitops.h>
+#include <linux/unaligned.h>
/* Traits are a very limited KV store, with:
* - 64 keys (0-63).
@@ -145,23 +146,23 @@ int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u
memmove(traits + off + len, traits + off, traits_size(traits) - off);
}
- /* Set our value. */
- memcpy(traits + off, val, len);
-
- /* Store our length in header. */
u64 encode_len = 0;
-
switch (len) {
case 2:
+ /* Values are at least two bytes, so they'll be two-byte aligned */
+ *(u16 *)(traits + off) = *(u16 *)val;
encode_len = 1;
break;
case 4:
+ put_unaligned(*(u32 *)val, (u32 *)(traits + off));
encode_len = 2;
break;
case 8:
+ put_unaligned(*(u64 *)val, (u64 *)(traits + off));
encode_len = 3;
break;
}
+
h->high |= (encode_len >> 1) << key;
h->low |= (encode_len & 1) << key;
return 0;
@@ -201,7 +202,19 @@ int trait_get(void *traits, u64 key, void *val, u64 val_len)
if (real_len > val_len)
return -ENOSPC;
- memcpy(val, traits + off, real_len);
+ switch (real_len) {
+ case 2:
+ /* Values are at least two bytes, so they'll be two-byte aligned */
+ *(u16 *)val = *(u16 *)(traits + off);
+ break;
+ case 4:
+ *(u32 *)val = get_unaligned((u32 *)(traits + off));
+ break;
+ case 8:
+ *(u64 *)val = get_unaligned((u64 *)(traits + off));
+ break;
+ }
+
return real_len;
}
--
2.43.0
* [PATCH RFC bpf-next 06/20] trait: Replace memmove calls with inline move
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (4 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 05/20] trait: Replace memcpy calls with inline copies arthur
@ 2025-03-05 14:32 ` arthur
2025-03-06 10:14 ` Jesper Dangaard Brouer
2025-03-05 14:32 ` [PATCH RFC bpf-next 07/20] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions arthur
` (13 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
When inserting or deleting traits, we need to move any subsequent
traits over.
Replace the memmove() calls with an inline implementation to avoid the
function call overhead, which is especially expensive on AMD with SRSO.
In practice we shouldn't have too much data to move around, and we're
naturally limited to 238 bytes max, so a dumb implementation should
hopefully be fast enough.
Jesper Brouer kindly ran benchmarks on real hardware with three configs:
- Intel: E5-1650 v4
- AMD SRSO: 9684X SRSO
- AMD IBPB: 9684X SRSO=IBPB
                 Intel    AMD IBPB    AMD SRSO
xdp-trait-get     5.530      3.901       9.188   (ns/op)
xdp-trait-set     7.538      4.941      10.050   (ns/op)
xdp-trait-move   14.245      8.865      14.834   (ns/op)
function call     1.319      1.359       5.703   (ns/op)
indirect call     8.922      6.251      10.329   (ns/op)
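As a sketch of the move order (sizes chosen for illustration): setting
an 8 byte value with n = 10 bytes of existing values stored after it
moves those bytes right by off = 8, copying chunks starting from the end
so overlapping bytes are read before they're overwritten:

  step 1: u64 copy of bytes 2..9 -> bytes 10..17
  step 2: u16 copy of bytes 0..1 -> bytes  8..9

Deletes (off < 0) copy forward from the start instead.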
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
include/net/trait.h | 40 ++++++++++++++++++++++++++++++++++++----
1 file changed, 36 insertions(+), 4 deletions(-)
diff --git a/include/net/trait.h b/include/net/trait.h
index d4581a877bd57a32e2ad032147c906764d6d37f8..111ce766345cef45fd95e562eda6a7d01c6b76da 100644
--- a/include/net/trait.h
+++ b/include/net/trait.h
@@ -75,6 +75,40 @@ static __always_inline int __trait_offset(struct __trait_hdr h, u64 key)
return sizeof(struct __trait_hdr) + __trait_total_length(__trait_and(h, ~(~0llu << key)));
}
+/* Avoid overhead of memmove() function call when possible. */
+static __always_inline void __trait_move(void *src, int off, size_t n)
+{
+ if (n == 0)
+ return;
+
+ if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) || BITS_PER_LONG != 64) {
+ memmove(src + off, src, n);
+ return;
+ }
+
+ /* Need to move in reverse to handle overlap. */
+ if (off > 0)
+ src += n;
+
+#define ___trait_move(op) do { \
+ src -= (off > 0) ? sizeof(u##op) : 0; \
+ *(u##op *)(src + off) = *(u##op *)src; \
+ src += (off < 0) ? sizeof(u##op) : 0; \
+ } while (0)
+
+ for (int w = 0; w < n / 8; w++)
+ ___trait_move(64);
+
+ if (n & 4)
+ ___trait_move(32);
+
+ if (n & 2)
+ ___trait_move(16);
+
+ if (n & 1)
+ ___trait_move(8);
+}
+
/**
* traits_init() - Initialize a trait store.
* @traits: Start of trait store area.
@@ -142,8 +176,7 @@ int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u
return -ENOSPC;
/* Memmove all the kvs after us over. */
- if (traits_size(traits) > off)
- memmove(traits + off + len, traits + off, traits_size(traits) - off);
+ __trait_move(traits + off, len, traits_size(traits) - off);
}
u64 encode_len = 0;
@@ -244,8 +277,7 @@ static __always_inline int trait_del(void *traits, u64 key)
int len = __trait_total_length(__trait_and(*h, (1ull << key)));
/* Memmove all the kvs after us over */
- if (traits_size(traits) > off + len)
- memmove(traits + off, traits + off + len, traits_size(traits) - off - len);
+ __trait_move(traits + off + len, -len, traits_size(traits) - off - len);
/* Clear our length in header */
h->high &= ~(1ull << key);
--
2.43.0
* [PATCH RFC bpf-next 07/20] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (5 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 06/20] trait: Replace memmove calls with inline move arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 15:24 ` Alexander Lobakin
2025-03-05 14:32 ` [PATCH RFC bpf-next 08/20] trait: Propagate presence of traits to sk_buff arthur
` (12 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
xdp_buff tracks whether a NIC supports metadata: when it doesn't,
data_meta is set to be greater than data.
But xdp_frame only stores the metadata size (as metasize), so converting
between xdp_frame and xdp_buff is lossy.
Steal an unused bit in xdp_frame to track whether metadata is supported
or not.
This will let us have "generic" functions for setting skb fields from
either xdp_frame or xdp_buff from drivers.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
include/net/xdp.h | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 58019fa299b56dbd45c104fdfa807f73af6e4fa4..84afe07d09efdb2ab0cb78b904f02cb74f9a56b6 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -116,6 +116,9 @@ static __always_inline void xdp_buff_set_frag_pfmemalloc(struct xdp_buff *xdp)
xdp->flags |= XDP_FLAGS_FRAGS_PF_MEMALLOC;
}
+static bool xdp_data_meta_unsupported(const struct xdp_buff *xdp);
+static void xdp_set_data_meta_invalid(struct xdp_buff *xdp);
+
static __always_inline void *xdp_buff_traits(const struct xdp_buff *xdp)
{
return xdp->data_hard_start + _XDP_FRAME_SIZE;
@@ -270,7 +273,9 @@ struct xdp_frame {
void *data;
u32 len;
u32 headroom;
- u32 metasize; /* uses lower 8-bits */
+ u32 :23, /* unused */
+ meta_unsupported:1,
+ metasize:8;
/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
* while mem_type is valid on remote CPU.
*/
@@ -369,6 +374,8 @@ void xdp_convert_frame_to_buff(const struct xdp_frame *frame,
xdp->data = frame->data;
xdp->data_end = frame->data + frame->len;
xdp->data_meta = frame->data - frame->metasize;
+ if (frame->meta_unsupported)
+ xdp_set_data_meta_invalid(xdp);
xdp->frame_sz = frame->frame_sz;
xdp->flags = frame->flags;
}
@@ -396,6 +403,7 @@ int xdp_update_frame_from_buff(const struct xdp_buff *xdp,
xdp_frame->len = xdp->data_end - xdp->data;
xdp_frame->headroom = headroom - sizeof(*xdp_frame);
xdp_frame->metasize = metasize;
+ xdp_frame->meta_unsupported = xdp_data_meta_unsupported(xdp);
xdp_frame->frame_sz = xdp->frame_sz;
xdp_frame->flags = xdp->flags;
--
2.43.0
* [PATCH RFC bpf-next 08/20] trait: Propagate presence of traits to sk_buff
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (6 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 07/20] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 09/20] bnxt: Propagate trait presence to skb arthur
` (11 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Use a bit in sk_buff to track whether or not an skb stores traits in
its headroom.
It's tempting to store it in skb_shared_info like the XDP metadata,
but storing it in the skb allows us to more easily handle cases such as
skb clones in a few patches.
(I couldn't find an existing hole to use in sk_buff, so this makes a 4
byte hole... any pointers to a free bit?)
The following patches add support for a handful of drivers; in the final
form we'd like to cover all drivers that currently support XDP metadata.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
include/linux/skbuff.h | 7 +++++++
include/net/xdp.h | 12 ++++++++++++
2 files changed, 19 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 03553c2200ab1c3750b799edb94fa3b94e11a5f1..d7dfee152ebd26ce87a230222e94076aca793adc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -725,6 +725,12 @@ enum skb_tstamp_type {
__SKB_CLOCK_MAX = SKB_CLOCK_TAI,
};
+enum skb_traits_type {
+ SKB_TRAITS_NONE,
+ /* Trait store in headroom, offset by sizeof(struct xdp_frame) */
+ SKB_TRAITS_AFTER_XDP,
+};
+
/**
* DOC: Basic sk_buff geometry
*
@@ -1023,6 +1029,7 @@ struct sk_buff {
__u8 csum_not_inet:1;
#endif
__u8 unreadable:1;
+ __u8 traits_type:1; /* See enum skb_traits_type */
#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
__u16 tc_index; /* traffic control index */
#endif
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 84afe07d09efdb2ab0cb78b904f02cb74f9a56b6..e2e7c270efa3cbd11bddd234010b91daee5009a2 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -119,6 +119,12 @@ static __always_inline void xdp_buff_set_frag_pfmemalloc(struct xdp_buff *xdp)
static bool xdp_data_meta_unsupported(const struct xdp_buff *xdp);
static void xdp_set_data_meta_invalid(struct xdp_buff *xdp);
+static __always_inline void xdp_buff_update_skb(struct xdp_buff *xdp, struct sk_buff *skb)
+{
+ if (!xdp_data_meta_unsupported(xdp))
+ skb->traits_type = SKB_TRAITS_AFTER_XDP;
+}
+
static __always_inline void *xdp_buff_traits(const struct xdp_buff *xdp)
{
return xdp->data_hard_start + _XDP_FRAME_SIZE;
@@ -298,6 +304,12 @@ xdp_frame_is_frag_pfmemalloc(const struct xdp_frame *frame)
return !!(frame->flags & XDP_FLAGS_FRAGS_PF_MEMALLOC);
}
+static __always_inline void xdp_frame_update_skb(struct xdp_frame *frame, struct sk_buff *skb)
+{
+ if (!frame->meta_unsupported)
+ skb->traits_type = SKB_TRAITS_AFTER_XDP;
+}
+
#define XDP_BULK_QUEUE_SIZE 16
struct xdp_frame_bulk {
int count;
--
2.43.0
* [PATCH RFC bpf-next 09/20] bnxt: Propagate trait presence to skb
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (7 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 08/20] trait: Propagate presence of traits to sk_buff arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 10/20] ice: " arthur
` (10 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Call the common xdp_buff_update_skb() helper.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 7b8b5b39c7bbe8885543a7c612567f7ff55a4fca..7e22804ba09b5a12384a0f8125db42f79b187d42 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2273,6 +2273,10 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
}
}
}
+
+ if (xdp_active)
+ xdp_buff_update_skb(&xdp, skb);
+
bnxt_deliver_skb(bp, bnapi, skb);
rc = 1;
--
2.43.0
* [PATCH RFC bpf-next 10/20] ice: Propagate trait presence to skb
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (8 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 09/20] bnxt: Propagate trait presence to skb arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 11/20] veth: " arthur
` (9 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Call the common xdp_buff_update_skb() helper.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
drivers/net/ethernet/intel/ice/ice_txrx.c | 4 ++++
drivers/net/ethernet/intel/ice/ice_xsk.c | 2 ++
2 files changed, 6 insertions(+)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 9c9ea4c1b93b7fb5ac9bd5599ac91cf326fb2527..70101fc365b17a13f06ff7c46610d11fbf81313a 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1011,6 +1011,8 @@ ice_build_skb(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp)
nr_frags * xdp->frame_sz,
xdp_buff_is_frag_pfmemalloc(xdp));
+ xdp_buff_update_skb(xdp, skb);
+
return skb;
}
@@ -1092,6 +1094,8 @@ ice_construct_skb(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp)
xdp_buff_is_frag_pfmemalloc(xdp));
}
+ xdp_buff_update_skb(xdp, skb);
+
return skb;
}
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 8975d2971bc3778aad8898e334ce1d759618fd32..a34c7a69c58677958974b84b7fb6e5643c5e29b2 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -572,6 +572,8 @@ ice_construct_skb_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp)
__skb_pull(skb, metasize);
}
+ xdp_buff_update_skb(xdp, skb);
+
if (likely(!xdp_buff_has_frags(xdp)))
goto out;
--
2.43.0
* [PATCH RFC bpf-next 11/20] veth: Propagate trait presence to skb
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (9 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 10/20] ice: " arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 12/20] virtio_net: " arthur
` (8 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Call the common xdp_buff_update_skb() / xdp_frame_update_skb() helpers.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
drivers/net/veth.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 01251868a9c27592f0dfbcdc32c0afbd7e9baafc..186ea9e09efbcb6abb01effcec5f7d03a186cefe 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -703,6 +703,8 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
stats->rx_drops++;
continue;
}
+
+ xdp_frame_update_skb(frames[i], skb);
napi_gro_receive(&rq->xdp_napi, skb);
}
}
@@ -855,6 +857,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
metalen = xdp->data - xdp->data_meta;
if (metalen)
skb_metadata_set(skb, metalen);
+
+ xdp_buff_update_skb(xdp, skb);
out:
return skb;
drop:
--
2.43.0
* [PATCH RFC bpf-next 12/20] virtio_net: Propagate trait presence to skb
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (10 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 11/20] veth: " arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 13/20] mlx5: move xdp_buff scope one level up arthur
` (7 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Call the common xdp_buff_update_skb() / xdp_frame_update_skb() helpers.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
drivers/net/virtio_net.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7646ddd9bef70cf2a0833c7db7c0491884fb7ab5..85798c2bc3446c47deaef66dbb96270e51c9076c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1190,6 +1190,7 @@ static struct sk_buff *virtnet_receive_xsk_small(struct net_device *dev, struct
struct virtnet_rq_stats *stats)
{
struct bpf_prog *prog;
+ struct sk_buff *skb;
u32 ret;
ret = XDP_PASS;
@@ -1201,7 +1202,9 @@ static struct sk_buff *virtnet_receive_xsk_small(struct net_device *dev, struct
switch (ret) {
case XDP_PASS:
- return xsk_construct_skb(rq, xdp);
+ skb = xsk_construct_skb(rq, xdp);
+ xdp_buff_update_skb(xdp, skb);
+ return skb;
case XDP_TX:
case XDP_REDIRECT:
@@ -1323,6 +1326,7 @@ static struct sk_buff *virtnet_receive_xsk_merge(struct net_device *dev, struct
skb = xsk_construct_skb(rq, xdp);
if (!skb)
goto drop_bufs;
+ xdp_buff_update_skb(xdp, skb);
if (xsk_append_merge_buffer(vi, rq, skb, num_buf, hdr, stats)) {
dev_kfree_skb(skb);
@@ -1956,6 +1960,7 @@ static struct sk_buff *receive_small_xdp(struct net_device *dev,
skb = virtnet_build_skb(buf, buflen, xdp.data - buf, len);
if (unlikely(!skb))
goto err;
+ xdp_buff_update_skb(&xdp, skb);
if (metasize)
skb_metadata_set(skb, metasize);
@@ -2324,6 +2329,7 @@ static struct sk_buff *receive_mergeable_xdp(struct net_device *dev,
head_skb = build_skb_from_xdp_buff(dev, vi, &xdp, xdp_frags_truesz);
if (unlikely(!head_skb))
break;
+ xdp_buff_update_skb(&xdp, head_skb);
return head_skb;
case XDP_TX:
--
2.43.0
* [PATCH RFC bpf-next 13/20] mlx5: move xdp_buff scope one level up
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (11 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 12/20] virtio_net: " arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 14/20] mlx5: Propagate trait presence to skb arthur
` (6 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC (permalink / raw)
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Jesper Dangaard Brouer <hawk@kernel.org>
This is in preparation for the next patch: moving the xdp_buff out of the
per-CQE helpers keeps it accessible where the skb is built, so trait
presence can be propagated to the skb.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 6 +-
.../net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 6 +-
.../net/ethernet/mellanox/mlx5/core/en/xsk/rx.h | 6 +-
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 103 +++++++++++----------
4 files changed, 66 insertions(+), 55 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 979fc56205e1fe7b473ad0849cf84f189d09fd4f..9bed146806a8d8b2c0afb14b2417a4a95cc09dcb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -581,14 +581,16 @@ struct mlx5e_mpw_info {
#define MLX5E_MAX_RX_FRAGS 4
struct mlx5e_rq;
+struct mlx5e_xdp_buff;
typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
typedef struct sk_buff *
(*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
struct mlx5_cqe64 *cqe, u16 cqe_bcnt,
- u32 head_offset, u32 page_idx);
+ u32 head_offset, u32 page_idx,
+ struct mlx5e_xdp_buff *mxbuf);
typedef struct sk_buff *
(*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
- struct mlx5_cqe64 *cqe, u32 cqe_bcnt);
+ struct mlx5_cqe64 *cqe, u32 cqe_bcnt, struct mlx5e_xdp_buff *mxbuf);
typedef bool (*mlx5e_fp_post_rx_wqes)(struct mlx5e_rq *rq);
typedef void (*mlx5e_fp_dealloc_wqe)(struct mlx5e_rq*, u16);
typedef void (*mlx5e_fp_shampo_dealloc_hd)(struct mlx5e_rq*, u16, u16, bool);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 1b7132fa70de2805a81b878fe3fa308ca9d4de6f..4dacaa61e1061960e09dccd6f97bc2f2d02ffbb8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -249,7 +249,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
struct mlx5_cqe64 *cqe,
u16 cqe_bcnt,
u32 head_offset,
- u32 page_idx)
+ u32 page_idx,
+ struct mlx5e_xdp_buff *mxbuf_)
{
struct mlx5e_xdp_buff *mxbuf = xsk_buff_to_mxbuf(wi->alloc_units.xsk_buffs[page_idx]);
struct bpf_prog *prog;
@@ -304,7 +305,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
struct mlx5e_wqe_frag_info *wi,
struct mlx5_cqe64 *cqe,
- u32 cqe_bcnt)
+ u32 cqe_bcnt,
+ struct mlx5e_xdp_buff *mxbuf_)
{
struct mlx5e_xdp_buff *mxbuf = xsk_buff_to_mxbuf(*wi->xskp);
struct bpf_prog *prog;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
index cefc0ef6105d24705bdd450cfd7857435a9d0c67..0890c975042c7f58e512a61fc538f21b5e37c6b4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
@@ -16,10 +16,12 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
struct mlx5_cqe64 *cqe,
u16 cqe_bcnt,
u32 head_offset,
- u32 page_idx);
+ u32 page_idx,
+ struct mlx5e_xdp_buff *mxbuf_);
struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
struct mlx5e_wqe_frag_info *wi,
struct mlx5_cqe64 *cqe,
- u32 cqe_bcnt);
+ u32 cqe_bcnt,
+ struct mlx5e_xdp_buff *mxbuf_);
#endif /* __MLX5_EN_XSK_RX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 1963bc5adb1887af5a2cadb3febf24bef0ae3338..77bace3b212ae18c420a11312a5e3043b5e3f4ae 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -63,11 +63,11 @@
static struct sk_buff *
mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
- u32 page_idx);
+ u32 page_idx, struct mlx5e_xdp_buff *mxbuf);
static struct sk_buff *
mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
- u32 page_idx);
+ u32 page_idx, struct mlx5e_xdp_buff *mxbuf);
static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
@@ -1662,7 +1662,8 @@ static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
static struct sk_buff *
mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
- struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
+ struct mlx5_cqe64 *cqe, u32 cqe_bcnt,
+ struct mlx5e_xdp_buff *mxbuf)
{
struct mlx5e_frag_page *frag_page = wi->frag_page;
u16 rx_headroom = rq->buff.headroom;
@@ -1684,17 +1685,15 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
prog = rcu_dereference(rq->xdp_prog);
if (prog) {
- struct mlx5e_xdp_buff mxbuf;
-
net_prefetchw(va); /* xdp_frame data area */
mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, rq->buff.frame0_sz,
- cqe_bcnt, &mxbuf);
- if (mlx5e_xdp_handle(rq, prog, &mxbuf))
+ cqe_bcnt, mxbuf);
+ if (mlx5e_xdp_handle(rq, prog, mxbuf))
return NULL; /* page/packet was consumed by XDP */
- rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
- metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
- cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
+ rx_headroom = mxbuf->xdp.data - mxbuf->xdp.data_hard_start;
+ metasize = mxbuf->xdp.data - mxbuf->xdp.data_meta;
+ cqe_bcnt = mxbuf->xdp.data_end - mxbuf->xdp.data;
}
frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
@@ -1710,14 +1709,14 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
static struct sk_buff *
mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
- struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
+ struct mlx5_cqe64 *cqe, u32 cqe_bcnt,
+ struct mlx5e_xdp_buff *mxbuf)
{
struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
struct mlx5e_wqe_frag_info *head_wi = wi;
u16 rx_headroom = rq->buff.headroom;
struct mlx5e_frag_page *frag_page;
struct skb_shared_info *sinfo;
- struct mlx5e_xdp_buff mxbuf;
u32 frag_consumed_bytes;
struct bpf_prog *prog;
struct sk_buff *skb;
@@ -1737,8 +1736,8 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
net_prefetch(va + rx_headroom);
mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, rq->buff.frame0_sz,
- frag_consumed_bytes, &mxbuf);
- sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
+ frag_consumed_bytes, mxbuf);
+ sinfo = xdp_get_shared_info_from_buff(&mxbuf->xdp);
truesize = 0;
cqe_bcnt -= frag_consumed_bytes;
@@ -1750,7 +1749,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
frag_consumed_bytes = min_t(u32, frag_info->frag_size, cqe_bcnt);
- mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf.xdp, frag_page,
+ mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf->xdp, frag_page,
wi->offset, frag_consumed_bytes);
truesize += frag_info->frag_stride;
@@ -1760,7 +1759,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
}
prog = rcu_dereference(rq->xdp_prog);
- if (prog && mlx5e_xdp_handle(rq, prog, &mxbuf)) {
+ if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
struct mlx5e_wqe_frag_info *pwi;
@@ -1770,21 +1769,21 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
return NULL; /* page/packet was consumed by XDP */
}
- skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start, rq->buff.frame0_sz,
- mxbuf.xdp.data - mxbuf.xdp.data_hard_start,
- mxbuf.xdp.data_end - mxbuf.xdp.data,
- mxbuf.xdp.data - mxbuf.xdp.data_meta);
+ skb = mlx5e_build_linear_skb(rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
+ mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
+ mxbuf->xdp.data_end - mxbuf->xdp.data,
+ mxbuf->xdp.data - mxbuf->xdp.data_meta);
if (unlikely(!skb))
return NULL;
skb_mark_for_recycle(skb);
head_wi->frag_page->frags++;
- if (xdp_buff_has_frags(&mxbuf.xdp)) {
+ if (xdp_buff_has_frags(&mxbuf->xdp)) {
/* sinfo->nr_frags is reset by build_skb, calculate again. */
xdp_update_skb_shared_info(skb, wi - head_wi - 1,
sinfo->xdp_frags_size, truesize,
- xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
+ xdp_buff_is_frag_pfmemalloc(&mxbuf->xdp));
for (struct mlx5e_wqe_frag_info *pwi = head_wi + 1; pwi < wi; pwi++)
pwi->frag_page->frags++;
@@ -1815,6 +1814,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
{
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
struct mlx5e_wqe_frag_info *wi;
+ struct mlx5e_xdp_buff mxbuf;
struct sk_buff *skb;
u32 cqe_bcnt;
u16 ci;
@@ -1832,7 +1832,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
mlx5e_skb_from_cqe_linear,
mlx5e_skb_from_cqe_nonlinear,
mlx5e_xsk_skb_from_cqe_linear,
- rq, wi, cqe, cqe_bcnt);
+ rq, wi, cqe, cqe_bcnt, &mxbuf);
if (!skb) {
/* probably for XDP */
if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
@@ -1863,6 +1863,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
struct mlx5_eswitch_rep *rep = rpriv->rep;
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
struct mlx5e_wqe_frag_info *wi;
+ struct mlx5e_xdp_buff mxbuf;
struct sk_buff *skb;
u32 cqe_bcnt;
u16 ci;
@@ -1879,7 +1880,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
mlx5e_skb_from_cqe_linear,
mlx5e_skb_from_cqe_nonlinear,
- rq, wi, cqe, cqe_bcnt);
+ rq, wi, cqe, cqe_bcnt, &mxbuf);
if (!skb) {
/* probably for XDP */
if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
@@ -1907,6 +1908,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
u32 wqe_offset = stride_ix << rq->mpwqe.log_stride_sz;
u32 head_offset = wqe_offset & ((1 << rq->mpwqe.page_shift) - 1);
u32 page_idx = wqe_offset >> rq->mpwqe.page_shift;
+ struct mlx5e_xdp_buff mxbuf;
struct mlx5e_rx_wqe_ll *wqe;
struct mlx5_wq_ll *wq;
struct sk_buff *skb;
@@ -1932,7 +1934,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
mlx5e_skb_from_cqe_mpwrq_linear,
mlx5e_skb_from_cqe_mpwrq_nonlinear,
- rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
+ rq, wi, cqe, cqe_bcnt, head_offset, page_idx, &mxbuf);
if (!skb)
goto mpwrq_cqe_out;
@@ -1979,7 +1981,7 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
static struct sk_buff *
mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
- u32 page_idx)
+ u32 page_idx, struct mlx5e_xdp_buff *mxbuf)
{
struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
@@ -1987,7 +1989,6 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
u32 frag_offset = head_offset;
u32 byte_cnt = cqe_bcnt;
struct skb_shared_info *sinfo;
- struct mlx5e_xdp_buff mxbuf;
unsigned int truesize = 0;
struct bpf_prog *prog;
struct sk_buff *skb;
@@ -2033,9 +2034,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
}
}
- mlx5e_fill_mxbuf(rq, cqe, va, linear_hr, linear_frame_sz, linear_data_len, &mxbuf);
+ mlx5e_fill_mxbuf(rq, cqe, va, linear_hr, linear_frame_sz, linear_data_len, mxbuf);
- sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
+ sinfo = xdp_get_shared_info_from_buff(&mxbuf->xdp);
while (byte_cnt) {
/* Non-linear mode, hence non-XSK, which always uses PAGE_SIZE. */
@@ -2046,7 +2047,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
else
truesize += ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz));
- mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf.xdp, frag_page, frag_offset,
+ mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf->xdp, frag_page, frag_offset,
pg_consumed_bytes);
byte_cnt -= pg_consumed_bytes;
frag_offset = 0;
@@ -2054,7 +2055,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
}
if (prog) {
- if (mlx5e_xdp_handle(rq, prog, &mxbuf)) {
+ if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
struct mlx5e_frag_page *pfp;
@@ -2067,10 +2068,10 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
return NULL; /* page/packet was consumed by XDP */
}
- skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start,
+ skb = mlx5e_build_linear_skb(rq, mxbuf->xdp.data_hard_start,
linear_frame_sz,
- mxbuf.xdp.data - mxbuf.xdp.data_hard_start, 0,
- mxbuf.xdp.data - mxbuf.xdp.data_meta);
+ mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
+ mxbuf->xdp.data - mxbuf->xdp.data_meta);
if (unlikely(!skb)) {
mlx5e_page_release_fragmented(rq, &wi->linear_page);
return NULL;
@@ -2080,13 +2081,13 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
wi->linear_page.frags++;
mlx5e_page_release_fragmented(rq, &wi->linear_page);
- if (xdp_buff_has_frags(&mxbuf.xdp)) {
+ if (xdp_buff_has_frags(&mxbuf->xdp)) {
struct mlx5e_frag_page *pagep;
/* sinfo->nr_frags is reset by build_skb, calculate again. */
xdp_update_skb_shared_info(skb, frag_page - head_page,
sinfo->xdp_frags_size, truesize,
- xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
+ xdp_buff_is_frag_pfmemalloc(&mxbuf->xdp));
pagep = head_page;
do
@@ -2097,12 +2098,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
} else {
dma_addr_t addr;
- if (xdp_buff_has_frags(&mxbuf.xdp)) {
+ if (xdp_buff_has_frags(&mxbuf->xdp)) {
struct mlx5e_frag_page *pagep;
xdp_update_skb_shared_info(skb, sinfo->nr_frags,
sinfo->xdp_frags_size, truesize,
- xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
+ xdp_buff_is_frag_pfmemalloc(&mxbuf->xdp));
pagep = frag_page - sinfo->nr_frags;
do
@@ -2124,7 +2125,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
static struct sk_buff *
mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
- u32 page_idx)
+ u32 page_idx, struct mlx5e_xdp_buff *mxbuf)
{
struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
u16 rx_headroom = rq->buff.headroom;
@@ -2152,20 +2153,19 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
prog = rcu_dereference(rq->xdp_prog);
if (prog) {
- struct mlx5e_xdp_buff mxbuf;
net_prefetchw(va); /* xdp_frame data area */
mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, rq->buff.frame0_sz,
- cqe_bcnt, &mxbuf);
- if (mlx5e_xdp_handle(rq, prog, &mxbuf)) {
+ cqe_bcnt, mxbuf);
+ if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
frag_page->frags++;
return NULL; /* page/packet was consumed by XDP */
}
- rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
- metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
- cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
+ rx_headroom = mxbuf->xdp.data - mxbuf->xdp.data_hard_start;
+ metasize = mxbuf->xdp.data - mxbuf->xdp.data_meta;
+ cqe_bcnt = mxbuf->xdp.data_end - mxbuf->xdp.data;
}
frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
@@ -2288,12 +2288,14 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
bool flush = cqe->shampo.flush;
bool match = cqe->shampo.match;
struct mlx5e_rq_stats *stats = rq->stats;
+ struct mlx5e_xdp_buff mxbuf;
struct mlx5e_rx_wqe_ll *wqe;
struct mlx5e_mpw_info *wi;
struct mlx5_wq_ll *wq;
wi = mlx5e_get_mpw_info(rq, wqe_id);
wi->consumed_strides += cstrides;
+ mxbuf.xdp.flags = 0;
if (unlikely(MLX5E_RX_ERR_CQE(cqe))) {
mlx5e_handle_rx_err_cqe(rq, cqe);
@@ -2316,7 +2318,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
*skb = mlx5e_skb_from_cqe_shampo(rq, wi, cqe, header_index);
else
*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe, cqe_bcnt,
- data_offset, page_idx);
+ data_offset, page_idx, &mxbuf);
if (unlikely(!*skb))
goto free_hd_entry;
@@ -2377,6 +2379,7 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
u32 wqe_offset = stride_ix << rq->mpwqe.log_stride_sz;
u32 head_offset = wqe_offset & ((1 << rq->mpwqe.page_shift) - 1);
u32 page_idx = wqe_offset >> rq->mpwqe.page_shift;
+ struct mlx5e_xdp_buff mxbuf;
struct mlx5e_rx_wqe_ll *wqe;
struct mlx5_wq_ll *wq;
struct sk_buff *skb;
@@ -2404,7 +2407,7 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
mlx5e_skb_from_cqe_mpwrq_nonlinear,
mlx5e_xsk_skb_from_cqe_mpwrq_linear,
rq, wi, cqe, cqe_bcnt, head_offset,
- page_idx);
+ page_idx, &mxbuf);
if (!skb)
goto mpwrq_cqe_out;
@@ -2632,6 +2635,7 @@ static void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
{
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
struct mlx5e_wqe_frag_info *wi;
+ struct mlx5e_xdp_buff mxbuf;
struct sk_buff *skb;
u32 cqe_bcnt;
u16 ci;
@@ -2648,7 +2652,7 @@ static void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
mlx5e_skb_from_cqe_linear,
mlx5e_skb_from_cqe_nonlinear,
- rq, wi, cqe, cqe_bcnt);
+ rq, wi, cqe, cqe_bcnt, &mxbuf);
if (!skb)
goto wq_cyc_pop;
@@ -2722,6 +2726,7 @@ static void mlx5e_trap_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe
{
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
struct mlx5e_wqe_frag_info *wi;
+ struct mlx5e_xdp_buff mxbuf;
struct sk_buff *skb;
u32 cqe_bcnt;
u16 trap_id;
@@ -2737,7 +2742,7 @@ static void mlx5e_trap_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe
goto wq_cyc_pop;
}
- skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe, cqe_bcnt);
+ skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe, cqe_bcnt, &mxbuf);
if (!skb)
goto wq_cyc_pop;
--
2.43.0
* [PATCH RFC bpf-next 14/20] mlx5: Propagate trait presence to skb
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (12 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 13/20] mlx5: move xdp_buff scope one level up arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 15/20] xdp generic: " arthur
` (5 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC (permalink / raw)
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Call the common xdp_buff_update_skb() helper.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 77bace3b212ae18c420a11312a5e3043b5e3f4ae..4ced9109a8f2a047992ab96fa533ad2a7283bb91 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1632,7 +1632,8 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
static inline
struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
u32 frag_size, u16 headroom,
- u32 cqe_bcnt, u32 metasize)
+ u32 cqe_bcnt, u32 metasize,
+ struct mlx5e_xdp_buff *mxbuf)
{
struct sk_buff *skb = napi_build_skb(va, frag_size);
@@ -1646,6 +1647,8 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
if (metasize)
skb_metadata_set(skb, metasize);
+ if (mxbuf)
+ xdp_buff_update_skb(&mxbuf->xdp, skb);
return skb;
}
@@ -1696,7 +1699,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
cqe_bcnt = mxbuf->xdp.data_end - mxbuf->xdp.data;
}
frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
- skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
+ skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize, mxbuf);
if (unlikely(!skb))
return NULL;
@@ -1772,7 +1775,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
skb = mlx5e_build_linear_skb(rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
mxbuf->xdp.data_end - mxbuf->xdp.data,
- mxbuf->xdp.data - mxbuf->xdp.data_meta);
+ mxbuf->xdp.data - mxbuf->xdp.data_meta, mxbuf);
if (unlikely(!skb))
return NULL;
@@ -2071,7 +2074,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
skb = mlx5e_build_linear_skb(rq, mxbuf->xdp.data_hard_start,
linear_frame_sz,
mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
- mxbuf->xdp.data - mxbuf->xdp.data_meta);
+ mxbuf->xdp.data - mxbuf->xdp.data_meta, mxbuf);
if (unlikely(!skb)) {
mlx5e_page_release_fragmented(rq, &wi->linear_page);
return NULL;
@@ -2168,7 +2171,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
cqe_bcnt = mxbuf->xdp.data_end - mxbuf->xdp.data;
}
frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
- skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
+ skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize, mxbuf);
if (unlikely(!skb))
return NULL;
@@ -2202,7 +2205,7 @@ mlx5e_skb_from_cqe_shampo(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
dma_sync_single_range_for_cpu(rq->pdev, dma_addr, 0, frag_size, rq->buff.map_dir);
net_prefetchw(hdr);
net_prefetch(data);
- skb = mlx5e_build_linear_skb(rq, hdr, frag_size, rx_headroom, head_size, 0);
+ skb = mlx5e_build_linear_skb(rq, hdr, frag_size, rx_headroom, head_size, 0, NULL);
if (unlikely(!skb))
return NULL;
--
2.43.0
* [PATCH RFC bpf-next 15/20] xdp generic: Propagate trait presence to skb
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (13 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 14/20] mlx5: Propagate trait presence to skb arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 16/20] trait: Support sk_buffs arthur
` (4 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC (permalink / raw)
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Call the common xdp_buff_update_skb() / xdp_frame_update_skb() helpers.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
net/core/dev.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/core/dev.c b/net/core/dev.c
index 1b252e9459fdbde42f6fb71dc146692c7f7ec17a..ecf994374b5a8721c1aacc6e4267a91775be3a25 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5247,6 +5247,7 @@ u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
metalen = xdp->data - xdp->data_meta;
if (metalen)
skb_metadata_set(skb, metalen);
+ xdp_buff_update_skb(xdp, skb);
break;
}
--
2.43.0
* [PATCH RFC bpf-next 16/20] trait: Support sk_buffs
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (14 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 15/20] xdp generic: " arthur
@ 2025-03-05 14:32 ` arthur
2025-03-10 11:45 ` Lorenzo Bianconi
2025-03-05 14:32 ` [PATCH RFC bpf-next 17/20] trait: Allow socket filters to access traits arthur
` (3 subsequent siblings)
19 siblings, 1 reply; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC (permalink / raw)
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Hide the space used by traits from skb_headroom(): that space isn't
actually usable.
Preserve the trait store in pskb_expand_head() by copying it ahead of
the new headroom. The struct xdp_frame at the start of the headroom
isn't needed anymore, so we can overwrite it with traits, and introduce
a new flag to indicate traits are stored at the start of the headroom.
Cloned skbs share the same packet data and headroom as the original skb,
so changes to traits in one would be reflected in the other.
Is that ok?
Are there cases where we would want a clone to have different traits?
For now, prevent clones from using traits.
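As an illustration (a hedged sketch, not part of this patch): existing
headroom checks keep protecting the trait store automatically, because
skb_headroom() no longer counts it and any reallocation goes through
pskb_expand_head():

    /* Hypothetical caller pushing an extra header onto an skb that
     * carries traits. skb_cow_head() consults skb_headroom(), which now
     * excludes the trait store, and if it has to reallocate via
     * pskb_expand_head(), the store is copied ahead of the new headroom.
     */
    if (skb_cow_head(skb, VLAN_HLEN))
            return -ENOMEM;
    __skb_push(skb, VLAN_HLEN);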
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
include/linux/skbuff.h | 25 +++++++++++++++++++++++--
net/core/skbuff.c | 25 +++++++++++++++++++++++--
2 files changed, 46 insertions(+), 4 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d7dfee152ebd26ce87a230222e94076aca793adc..886537508789202339c925b5613574de67b7e43c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -39,6 +39,7 @@
#include <net/net_debug.h>
#include <net/dropreason-core.h>
#include <net/netmem.h>
+#include <net/trait.h>
/**
* DOC: skb checksums
@@ -729,6 +730,8 @@ enum skb_traits_type {
SKB_TRAITS_NONE,
/* Trait store in headroom, offset by sizeof(struct xdp_frame) */
SKB_TRAITS_AFTER_XDP,
+ /* Trait store at start of headroom */
+ SKB_TRAITS_AT_HEAD,
};
/**
@@ -1029,7 +1032,7 @@ struct sk_buff {
__u8 csum_not_inet:1;
#endif
__u8 unreadable:1;
- __u8 traits_type:1; /* See enum skb_traits_type */
+ __u8 traits_type:2; /* See enum skb_traits_type */
#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
__u16 tc_index; /* traffic control index */
#endif
@@ -2836,6 +2839,18 @@ static inline void *pskb_pull(struct sk_buff *skb, unsigned int len)
void skb_condense(struct sk_buff *skb);
+static inline void *skb_traits(const struct sk_buff *skb)
+{
+ switch (skb->traits_type) {
+ case SKB_TRAITS_AFTER_XDP:
+ return skb->head + _XDP_FRAME_SIZE;
+ case SKB_TRAITS_AT_HEAD:
+ return skb->head;
+ default:
+ return NULL;
+ }
+}
+
/**
* skb_headroom - bytes at buffer head
* @skb: buffer to check
@@ -2844,7 +2859,13 @@ void skb_condense(struct sk_buff *skb);
*/
static inline unsigned int skb_headroom(const struct sk_buff *skb)
{
- return skb->data - skb->head;
+ int trait_size = 0;
+ void *traits = skb_traits(skb);
+
+ if (traits)
+ trait_size = traits_size(traits);
+
+ return skb->data - skb->head - trait_size;
}
/**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7b03b64fdcb276f68ce881d1d8da8e4c6b897efc..83f58517738e8ff12990c28b09336ed44f4be32a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1515,6 +1515,19 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
atomic_inc(&(skb_shinfo(skb)->dataref));
skb->cloned = 1;
+ /* traits would end up shared with the clone,
+ * and edits would be reflected there.
+ *
+ * Is that ok? What if the original skb and the clone take different paths?
+ * Does that even happen?
+ *
+ * If that's not ok, we could copy the traits and store them in an extension header
+ * for clones.
+ *
+ * For now, pretend the clone doesn't have any traits.
+ */
+ skb->traits_type = SKB_TRAITS_NONE;
+
return n;
#undef C
}
@@ -2170,7 +2183,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
unsigned int osize = skb_end_offset(skb);
unsigned int size = osize + nhead + ntail;
long off;
- u8 *data;
+ u8 *data, *head;
int i;
BUG_ON(nhead < 0);
@@ -2187,10 +2200,18 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
goto nodata;
size = SKB_WITH_OVERHEAD(size);
+ head = skb->head;
+ if (skb->traits_type != SKB_TRAITS_NONE) {
+ head = skb_traits(skb) + traits_size(skb_traits(skb));
+ /* struct xdp_frame isn't needed in the headroom, drop it */
+ memcpy(data, skb_traits(skb), traits_size(skb_traits(skb)));
+ skb->traits_type = SKB_TRAITS_AT_HEAD;
+ }
+
/* Copy only real data... and, alas, header. This should be
* optimized for the cases when header is void.
*/
- memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
+ memcpy(data + nhead, head, skb_tail_pointer(skb) - head);
memcpy((struct skb_shared_info *)(data + size),
skb_shinfo(skb),
--
2.43.0
* [PATCH RFC bpf-next 17/20] trait: Allow socket filters to access traits
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (15 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 16/20] trait: Support sk_buffs arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 18/20] trait: registration API arthur
` (2 subsequent siblings)
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC (permalink / raw)
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Add kfuncs to allow socket filter programs access to traits in an skb.
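For example, a socket filter reading a trait could look like this (a
minimal sketch: the extern declaration, key value, and return value
convention are illustrative, the latter being defined in patch 1):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    extern int bpf_skb_trait_get(struct __sk_buff *skb, __u64 key,
                                 void *val, __u64 val__sz) __ksym;

    SEC("socket")
    int trait_filter(struct __sk_buff *skb)
    {
            __u32 trace_id = 0;

            /* Key 5 is arbitrary: it would normally come from the
             * registration API introduced later in this series.
             */
            if (bpf_skb_trait_get(skb, 5, &trace_id, sizeof(trace_id)) < 0)
                    return 0;       /* not set or error: drop */

            return skb->len;        /* trait present: accept */
    }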
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
net/core/filter.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index 79b78e7cd57fd78c6cc8443da54ae96408c496b0..47f18fb0e23c2c19d26f9b408653c6756a5110c7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -12063,6 +12063,39 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
#endif
}
+__bpf_kfunc int bpf_skb_trait_set(const struct sk_buff *skb, u64 key,
+ const void *val, u64 val__sz, u64 flags)
+{
+ void *traits = skb_traits(skb);
+
+ if (!traits)
+ return -EOPNOTSUPP;
+
+ return trait_set(traits, skb_metadata_end(skb) - skb_metadata_len(skb),
+ key, val, val__sz, flags);
+}
+
+__bpf_kfunc int bpf_skb_trait_get(const struct sk_buff *skb, u64 key,
+ void *val, u64 val__sz)
+{
+ void *traits = skb_traits(skb);
+
+ if (!traits)
+ return -EOPNOTSUPP;
+
+ return trait_get(traits, key, val, val__sz);
+}
+
+__bpf_kfunc int bpf_skb_trait_del(const struct sk_buff *skb, u64 key)
+{
+ void *traits = skb_traits(skb);
+
+ if (!traits)
+ return -EOPNOTSUPP;
+
+ return trait_del(traits, key);
+}
+
__bpf_kfunc_end_defs();
int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12082,6 +12115,9 @@ int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
BTF_KFUNCS_START(bpf_kfunc_check_set_skb)
BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_skb_trait_set)
+BTF_ID_FLAGS(func, bpf_skb_trait_get)
+BTF_ID_FLAGS(func, bpf_skb_trait_del)
BTF_KFUNCS_END(bpf_kfunc_check_set_skb)
BTF_KFUNCS_START(bpf_kfunc_check_set_xdp)
--
2.43.0
* [PATCH RFC bpf-next 18/20] trait: registration API
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (16 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 17/20] trait: Allow socket filters to access traits arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 19/20] trait: Sync linux/bpf.h to tools/ for trait registration arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 20/20] trait: register traits in benchmarks and tests arthur
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC (permalink / raw)
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
The "raw" key values (0-63) of traits could be exposed to userspace, and
used directly in BPF programs.
Similarly to allocating TCP or UDP ports to services, services that use
traits could make the IDs they use configurable in their config file,
and the system administrator could decide which services use which IDs.
However, unlike TCP or UDP ports, if two services hardcode and use the
same trait ID, we wouldn't be able to detect this and fail fast with
EADDRINUSE. The services would instead overwrite each other's traits.
This is intended more to kick off a discussion than to be a fully
fleshed out patch.
One solution is to have the kernel provide an API for services to
register traits they want to use. If the trait is already registered, we
can return an appropriate error.
This could take many forms (in order of increasing complexity):
- A syscall to "reserve" a specific trait ID / key for use.
- A syscall to dynamically request a free trait ID / key. The keys could
be passed to BPF programs as constants.
(This is what this patch implements.)
- Hide numerical trait IDs / keys from userspace entirely, and have them
use strings. The verifier could allocate IDs for each unique string,
and rewrite programs to use them (or the loader could).
In all cases, it isn't obvious how to deal with the lifetime of
registered traits. Services need to be able to unregister traits so they
can restart cleanly. But that's complicated:
- Do we need to track which BPF programs use which traits, to prevent
traits in use from being unregistered?
How will this work if we extend netfilter and routing to support
traits?
- If a service unregisters a key, what happens to in flight packets that
have a trait with that ID set?
If the same ID is re-used, it could be mis-interpreted.
On the other hand, it would also let us specify and store per trait / key
flags.
This could let us handle cases like GRO by specifying policies for how
to "merge" traits of GRO'd packets, eg:
- TRAIT_KEEP_FIRST: keep the value of the first packet.
- TRAIT_KEEP_LAST: keep the value of the last packet.
- TRAIT_SUM: sum the values of all packets.
This would avoid disabling GRO as soon as packets have different traits,
which unfortunately happens today with XDP metadata.
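To make that concrete, a hedged sketch of applying such a policy when
GRO merges a packet (src) into the head packet (dst). trait_merge_one()
and the TRAIT_* policies are hypothetical; trait_get() / trait_set() are
the helpers from patch 1:

    static void trait_merge_one(void *dst, void *dst_hard_end, void *src,
                                u64 key, u32 policy)
    {
            u64 a = 0, b = 0;

            trait_get(dst, key, &a, sizeof(a));
            trait_get(src, key, &b, sizeof(b));

            switch (policy) {
            case TRAIT_KEEP_FIRST:
                    return;         /* keep the head packet's value */
            case TRAIT_KEEP_LAST:
                    a = b;
                    break;
            case TRAIT_SUM:
                    a += b;
                    break;
            }

            trait_set(dst, dst_hard_end, key, &a, sizeof(a), 0);
    }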
Many other aspects of traits can be decided later on: how to guarantee
enough headroom without XDP, how to deal with GRO, skb clones, IP
fragmentation... because we can implement the most restrictive option
for now.
That's not the case for a registration API: we can't add one once we
allow users to use raw trait IDs.
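For reference, registering a key from userspace could look like this (a
sketch against the uapi below, mirroring the selftests in patch 20; the
"trace_id" name is made up):

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int traits_init(void)
    {
            union bpf_attr attr = {};

            /* Reserve a key, with a name for observability. */
            strncpy(attr.register_trait.name, "trace_id",
                    sizeof(attr.register_trait.name) - 1);

            /* Returns the allocated key (0-63), to hand to BPF programs
             * as a constant. Fails once all 64 keys in the netns are
             * registered.
             */
            return syscall(__NR_bpf, BPF_REGISTER_TRAIT, &attr, sizeof(attr));
    }

    static int traits_exit(int key)
    {
            union bpf_attr attr = {};

            /* Release the key on clean shutdown so a restart can
             * re-register it.
             */
            attr.unregister_trait.trait = key;
            return syscall(__NR_bpf, BPF_UNREGISTER_TRAIT, &attr, sizeof(attr));
    }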
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
include/linux/bpf-netns.h | 12 ++++++++++
include/net/net_namespace.h | 6 +++++
include/net/netns/trait.h | 22 ++++++++++++++++++
include/uapi/linux/bpf.h | 26 ++++++++++++++++++++++
kernel/bpf/net_namespace.c | 54 +++++++++++++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 26 ++++++++++++++++++++++
kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++++++++-
7 files changed, 184 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf-netns.h b/include/linux/bpf-netns.h
index 413cfa5e4b0710806f5ed56824c7488b2586e19b..775f9de5e3e2931103ca66b3f168637437872480 100644
--- a/include/linux/bpf-netns.h
+++ b/include/linux/bpf-netns.h
@@ -33,6 +33,8 @@ int netns_bpf_prog_attach(const union bpf_attr *attr,
int netns_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype);
int netns_bpf_link_create(const union bpf_attr *attr,
struct bpf_prog *prog);
+int netns_bpf_trait_register(const union bpf_attr *attr);
+int netns_bpf_trait_unregister(const union bpf_attr *attr);
#else
static inline int netns_bpf_prog_query(const union bpf_attr *attr,
union bpf_attr __user *uattr)
@@ -57,6 +59,16 @@ static inline int netns_bpf_link_create(const union bpf_attr *attr,
{
return -EOPNOTSUPP;
}
+
+static inline int netns_bpf_trait_register(const union bpf_attr *attr)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int netns_bpf_trait_unregister(const union bpf_attr *attr)
+{
+ return -EOPNOTSUPP;
+}
#endif
#endif /* _BPF_NETNS_H */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index f467a66abc6b16b690a99037a3dea2e355910661..6f35593c91627c2793f7ced3e6c69c6fb4409665 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -37,6 +37,7 @@
#include <net/netns/smc.h>
#include <net/netns/bpf.h>
#include <net/netns/mctp.h>
+#include <net/netns/trait.h>
#include <net/net_trackers.h>
#include <linux/ns_common.h>
#include <linux/idr.h>
@@ -193,6 +194,11 @@ struct net {
/* Move to a better place when the config guard is removed. */
struct mutex rtnl_mutex;
#endif
+ /* Traits probably shouldn't be per namespace - otherwise we'd have to explicitly clear
+ * them. And packet tracing should work across namespaces.
+ * I just didn't know where to put this.
+ */
+ struct netns_traits traits;
} __randomize_layout;
#include <linux/seq_file_net.h>
diff --git a/include/net/netns/trait.h b/include/net/netns/trait.h
new file mode 100644
index 0000000000000000000000000000000000000000..5b1d2ad4572c447dea80c285feb4da69d15693d6
--- /dev/null
+++ b/include/net/netns/trait.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Traits registered to a network namespace.
+ */
+
+#ifndef __NETNS_TRAIT_H__
+#define __NETNS_TRAIT_H__
+
+#include <linux/types.h>
+#include <linux/bpf.h>
+
+struct registered_trait {
+ bool used;
+ char name[BPF_OBJ_NAME_LEN];
+ u32 flags;
+};
+
+struct netns_traits {
+ struct registered_trait traits[64];
+};
+
+#endif /* __NETNS_TRAIT_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bb37897c039398dd3568cd007586d9b088ddeb32..748ab5a1cbe0d29d890b874aacfc4ee66b082058 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -906,6 +906,21 @@ union bpf_iter_link_info {
* A new file descriptor (a nonnegative integer), or -1 if an
* error occurred (in which case, *errno* is set appropriately).
*
+ * BPF_REGISTER_TRAIT
+ * Description
+ * Register a trait. Docs to make bpf_doc.py not error out.
+ * Return
+ * Registered trait key.
+ *
+ * BPF_UNREGISTER_TRAIT
+ * Description
+ * Unregister a trait. Needed so services registering traits
+ * can restart.
+ * But what happens if a trait is currently being used?
+ * And to in flight packets?
+ * Return
+ * -1 if an error occurred.
+ *
* NOTES
* eBPF objects (maps and programs) can be shared between processes.
*
@@ -961,6 +976,8 @@ enum bpf_cmd {
BPF_LINK_DETACH,
BPF_PROG_BIND_MAP,
BPF_TOKEN_CREATE,
+ BPF_REGISTER_TRAIT,
+ BPF_UNREGISTER_TRAIT,
__MAX_BPF_CMD,
};
@@ -1841,6 +1858,15 @@ union bpf_attr {
__u32 bpffs_fd;
} token_create;
+ struct { /* struct used by BPF_REGISTER_TRAIT command */
+ char name[BPF_OBJ_NAME_LEN];
+ __u32 flags;
+ } register_trait;
+
+ struct { /* struct used by BPF_UNREGISTER_TRAIT command */
+ __u64 trait;
+ } unregister_trait;
+
} __attribute__((aligned(8)));
/* The description below is an attempt at providing documentation to eBPF
diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
index 868cc2c43899713b5dbd0ac92f06d297be4b4270..0828310b45f12d1feb8c26aad2cf9cbb0fb5cdf9 100644
--- a/kernel/bpf/net_namespace.c
+++ b/kernel/bpf/net_namespace.c
@@ -526,6 +526,60 @@ int netns_bpf_link_create(const union bpf_attr *attr, struct bpf_prog *prog)
return err;
}
+int netns_bpf_trait_register(const union bpf_attr *attr)
+{
+ struct net *net;
+ int i;
+ int ret;
+
+ net = current->nsproxy->net_ns;
+ mutex_lock(&netns_bpf_mutex);
+
+ for (i = 0; i < 64; i++) {
+ if (net->traits.traits[i].used)
+ continue;
+
+ net->traits.traits[i].used = true;
+ net->traits.traits[i].flags = attr->register_trait.flags;
+ memcpy(net->traits.traits[i].name, attr->register_trait.name,
+ sizeof(attr->register_trait.name));
+
+ ret = i;
+ goto out_unlock;
+ }
+ ret = -ENOSPC;
+
+out_unlock:
+ mutex_unlock(&netns_bpf_mutex);
+
+ return ret;
+}
+
+int netns_bpf_trait_unregister(const union bpf_attr *attr)
+{
+ struct net *net;
+ int ret;
+
+ if (attr->unregister_trait.trait > 63)
+ return -EINVAL;
+
+ net = current->nsproxy->net_ns;
+ mutex_lock(&netns_bpf_mutex);
+
+ if (!net->traits.traits[attr->unregister_trait.trait].used) {
+ ret = -ENOENT;
+ goto out_unlock;
+ }
+
+ net->traits.traits[attr->unregister_trait.trait].used = false;
+ ret = 0;
+
+out_unlock:
+ mutex_unlock(&netns_bpf_mutex);
+
+ return ret;
+}
+
static int __net_init netns_bpf_pernet_init(struct net *net)
{
int type;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 57a438706215936b75c957f95f5324371aa8f436..4ae401dbb5e332343f729dee0fa7bdbe6d216f16 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5769,6 +5769,26 @@ static int token_create(union bpf_attr *attr)
return bpf_token_create(attr);
}
+#define BPF_REGISTER_TRAIT_LAST_FIELD register_trait.flags
+
+static int register_trait(union bpf_attr *attr)
+{
+ if (CHECK_ATTR(BPF_REGISTER_TRAIT))
+ return -EINVAL;
+
+ return netns_bpf_trait_register(attr);
+}
+
+#define BPF_UNREGISTER_TRAIT_LAST_FIELD unregister_trait.trait
+
+static int unregister_trait(union bpf_attr *attr)
+{
+ if (CHECK_ATTR(BPF_UNREGISTER_TRAIT))
+ return -EINVAL;
+
+ return netns_bpf_trait_unregister(attr);
+}
+
static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size)
{
union bpf_attr attr;
@@ -5905,6 +5925,12 @@ static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size)
case BPF_TOKEN_CREATE:
err = token_create(&attr);
break;
+ case BPF_REGISTER_TRAIT:
+ err = register_trait(&attr);
+ break;
+ case BPF_UNREGISTER_TRAIT:
+ err = unregister_trait(&attr);
+ break;
default:
err = -EINVAL;
break;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4edb2db0f889c5b236416ce0016d74fda2e09a18..da41a54731fb1eccafb295ea1a0bc1d27d92c726 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -30,6 +30,7 @@
#include <net/xdp.h>
#include <linux/trace_events.h>
#include <linux/kallsyms.h>
+#include <linux/bpf-netns.h>
#include "disasm.h"
@@ -12007,6 +12008,8 @@ enum special_kfunc_type {
KF_bpf_iter_num_destroy,
KF_bpf_set_dentry_xattr,
KF_bpf_remove_dentry_xattr,
+ KF_bpf_xdp_trait_set,
+ KF_bpf_skb_trait_set,
};
BTF_SET_START(special_kfunc_set)
@@ -12040,6 +12043,8 @@ BTF_ID(func, bpf_iter_css_task_new)
BTF_ID(func, bpf_set_dentry_xattr)
BTF_ID(func, bpf_remove_dentry_xattr)
#endif
+BTF_ID(func, bpf_xdp_trait_set)
+BTF_ID(func, bpf_skb_trait_set)
BTF_SET_END(special_kfunc_set)
BTF_ID_LIST(special_kfunc_list)
@@ -12096,6 +12101,8 @@ BTF_ID(func, bpf_remove_dentry_xattr)
BTF_ID_UNUSED
BTF_ID_UNUSED
#endif
+BTF_ID(func, bpf_xdp_trait_set)
+BTF_ID(func, bpf_skb_trait_set)
static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
{
@@ -13288,7 +13295,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
{
bool sleepable, rcu_lock, rcu_unlock, preempt_disable, preempt_enable;
u32 i, nargs, ptr_type_id, release_ref_obj_id;
- struct bpf_reg_state *regs = cur_regs(env);
+ struct bpf_reg_state *regs = cur_regs(env), *reg;
const char *func_name, *ptr_type_name;
const struct btf_type *t, *ptr_type;
struct bpf_kfunc_call_arg_meta meta;
@@ -13297,6 +13304,8 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
const struct btf_param *args;
const struct btf_type *ret_t;
struct btf *desc_btf;
+ struct net *net;
+ bool trait_used;
/* skip for now, but return error when we find this in fixup_kfunc_call */
if (!insn->imm)
@@ -13463,6 +13472,34 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
}
}
+ if (meta.func_id == special_kfunc_list[KF_bpf_xdp_trait_set] ||
+ meta.func_id == special_kfunc_list[KF_bpf_skb_trait_set]) {
+ reg = &cur_regs(env)[BPF_REG_2];
+ if (reg->type != SCALAR_VALUE || !tnum_is_const(reg->var_off)) {
+ verbose(env, "trait_set() key is not a known constant\n");
+ return -EINVAL;
+ }
+
+ if (reg->var_off.value > 63) {
+ verbose(env, "trait_set() key %llu invalid\n", reg->var_off.value);
+ return -EINVAL;
+ }
+
+ net = current->nsproxy->net_ns;
+ mutex_lock(&netns_bpf_mutex);
+ trait_used = net->traits.traits[reg->var_off.value].used;
+ mutex_unlock(&netns_bpf_mutex);
+
+ /* Checking in the verifier is good for runtime performance, but what happens if
+ * a trait is unregistered?
+ * Should we track which traits are used by BPF programs and prevent it?
+ */
+ if (!trait_used) {
+ verbose(env, "trait_set() key %llu is not registered\n", reg->var_off.value);
+ return -EINVAL;
+ }
+ }
+
for (i = 0; i < CALLER_SAVED_REGS; i++)
mark_reg_not_init(env, regs, caller_saved[i]);
--
2.43.0
* [PATCH RFC bpf-next 19/20] trait: Sync linux/bpf.h to tools/ for trait registration
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (17 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 18/20] trait: registration API arthur
@ 2025-03-05 14:32 ` arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 20/20] trait: register traits in benchmarks and tests arthur
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC (permalink / raw)
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Now that traits are required to be registered, the benchmarks and
selftests will need to register them.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
tools/include/uapi/linux/bpf.h | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index bb37897c039398dd3568cd007586d9b088ddeb32..748ab5a1cbe0d29d890b874aacfc4ee66b082058 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -906,6 +906,21 @@ union bpf_iter_link_info {
* A new file descriptor (a nonnegative integer), or -1 if an
* error occurred (in which case, *errno* is set appropriately).
*
+ * BPF_REGISTER_TRAIT
+ * Description
+ * Register a trait. Docs to make bpf_doc.py not error out.
+ * Return
+ * Registered trait key.
+ *
+ * BPF_UNREGISTER_TRAIT
+ * Description
+ * Unregister a trait. Needed so services registering traits
+ * can restart.
+ * But what happens if a trait is currently being used?
+ * And to in flight packets?
+ * Return
+ * -1 if an error occurred.
+ *
* NOTES
* eBPF objects (maps and programs) can be shared between processes.
*
@@ -961,6 +976,8 @@ enum bpf_cmd {
BPF_LINK_DETACH,
BPF_PROG_BIND_MAP,
BPF_TOKEN_CREATE,
+ BPF_REGISTER_TRAIT,
+ BPF_UNREGISTER_TRAIT,
__MAX_BPF_CMD,
};
@@ -1841,6 +1858,15 @@ union bpf_attr {
__u32 bpffs_fd;
} token_create;
+ struct { /* struct used by BPF_REGISTER_TRAIT command */
+ char name[BPF_OBJ_NAME_LEN];
+ __u32 flags;
+ } register_trait;
+
+ struct { /* struct used by BPF_UNREGISTER_TRAIT command */
+ __u64 trait;
+ } unregister_trait;
+
} __attribute__((aligned(8)));
/* The description below is an attempt at providing documentation to eBPF
--
2.43.0
* [PATCH RFC bpf-next 20/20] trait: register traits in benchmarks and tests
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
` (18 preceding siblings ...)
2025-03-05 14:32 ` [PATCH RFC bpf-next 19/20] trait: Sync linux/bpf.h to tools/ for trait registration arthur
@ 2025-03-05 14:32 ` arthur
19 siblings, 0 replies; 37+ messages in thread
From: arthur @ 2025-03-05 14:32 UTC (permalink / raw)
To: netdev, bpf
Cc: jakub, hawk, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
From: Arthur Fabre <afabre@cloudflare.com>
Otherwise the verifier now rejects the programs, as their trait keys
aren't registered.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
---
tools/testing/selftests/bpf/bench.c | 3 ++
tools/testing/selftests/bpf/bench.h | 1 +
.../selftests/bpf/benchs/bench_xdp_traits.c | 33 +++++++++++++++++++++-
.../testing/selftests/bpf/prog_tests/xdp_traits.c | 18 +++++++++++-
4 files changed, 53 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 4678b928fc6ad2f0a870a25d9b10c75a1f6d77ba..0961cb71ddf1d682ca61e512e9f4b2df3606747c 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -752,5 +752,8 @@ int main(int argc, char **argv)
bench->report_final(state.results + env.warmup_sec,
state.res_cnt - env.warmup_sec);
+ if (bench->cleanup)
+ bench->cleanup();
+
return 0;
}
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index 005c401b3e2275030d1d489cd77423cb1fb652ad..6a94af31df17e7cc35bb1432c4b205eb96b631f2 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -58,6 +58,7 @@ struct bench {
void (*measure)(struct bench_res* res);
void (*report_progress)(int iter, struct bench_res* res, long delta_ns);
void (*report_final)(struct bench_res res[], int res_cnt);
+ void (*cleanup)(void);
};
struct counter {
diff --git a/tools/testing/selftests/bpf/benchs/bench_xdp_traits.c b/tools/testing/selftests/bpf/benchs/bench_xdp_traits.c
index 0fbcd49edd825f53e6957319d3f05efc218dfb02..b6fa2d8f2504dd9c35f8fb9dc1f1099b55a55ac6 100644
--- a/tools/testing/selftests/bpf/benchs/bench_xdp_traits.c
+++ b/tools/testing/selftests/bpf/benchs/bench_xdp_traits.c
@@ -57,7 +57,18 @@ static void trait_validate(void)
static void trait_setup(void)
{
- int err;
+ int err, i, key;
+ union bpf_attr attr;
+
+ /* Register all keys so we can use them all. */
+ bzero(&attr, sizeof(attr));
+ for (i = 0; i < 64; i++) {
+ key = syscall(__NR_bpf, BPF_REGISTER_TRAIT, &attr, sizeof(attr));
+ if (key < 0) {
+ fprintf(stderr, "couldn't register trait: %d\n", key);
+ exit(1);
+ }
+ }
setup_libbpf();
@@ -77,6 +88,23 @@ static void trait_setup(void)
}
}
+static void trait_cleanup(void)
+{
+ int err, i;
+ union bpf_attr attr;
+
+ /* Unregister all keys so we can run again. */
+ bzero(&attr, sizeof(attr));
+ for (i = 0; i < 64; i++) {
+ attr.unregister_trait.trait = i;
+ err = syscall(__NR_bpf, BPF_UNREGISTER_TRAIT, &attr, sizeof(attr));
+ if (err < 0) {
+ fprintf(stderr, "couldn't unregister trait %d: %d\n", i, err);
+ exit(1);
+ }
+ }
+}
+
static void trait_get_setup(void)
{
trait_setup();
@@ -135,6 +163,7 @@ const struct bench bench_xdp_trait_get = {
.measure = trait_measure,
.report_progress = ops_report_progress,
.report_final = ops_report_final,
+ .cleanup = trait_cleanup,
};
const struct bench bench_xdp_trait_set = {
@@ -146,6 +175,7 @@ const struct bench bench_xdp_trait_set = {
.measure = trait_measure,
.report_progress = ops_report_progress,
.report_final = ops_report_final,
+ .cleanup = trait_cleanup,
};
const struct bench bench_xdp_trait_move = {
@@ -157,4 +187,5 @@ const struct bench bench_xdp_trait_move = {
.measure = trait_measure,
.report_progress = ops_report_progress,
.report_final = ops_report_final,
+ .cleanup = trait_cleanup,
};
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_traits.c b/tools/testing/selftests/bpf/prog_tests/xdp_traits.c
index 4175b28d45e91e82435e646e5edd783980d5fe70..1c1eff235a6159d377a5e8b9e0a4d956c4540e8e 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_traits.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_traits.c
@@ -6,8 +6,16 @@ static void _test_xdp_traits(void)
{
const char *file = "./test_xdp_traits.bpf.o";
struct bpf_object *obj;
- int err, prog_fd;
+ int err, prog_fd, i, key;
char buf[128];
+ union bpf_attr attr;
+
+ /* Register all keys so we can use them all. */
+ bzero(&attr, sizeof(attr));
+ for (i = 0; i < 64; i++) {
+ key = syscall(__NR_bpf, BPF_REGISTER_TRAIT, &attr, sizeof(attr));
+ ASSERT_OK_FD(key, "test_xdp_traits");
+ }
LIBBPF_OPTS(bpf_test_run_opts, topts,
.data_in = &pkt_v4,
@@ -26,6 +34,14 @@ static void _test_xdp_traits(void)
ASSERT_EQ(topts.retval, XDP_PASS, "retval");
bpf_object__close(obj);
+
+ /* Unregister all keys so we can run again. */
+ bzero(&attr, sizeof(attr));
+ for (i = 0; i < 64; i++) {
+ attr.unregister_trait.trait = i;
+ err = syscall(__NR_bpf, BPF_UNREGISTER_TRAIT, &attr, sizeof(attr));
+ ASSERT_OK(err, "test_xdp_traits");
+ }
}
void test_xdp_traits(void)
--
2.43.0
* Re: [PATCH RFC bpf-next 07/20] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions
2025-03-05 14:32 ` [PATCH RFC bpf-next 07/20] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions arthur
@ 2025-03-05 15:24 ` Alexander Lobakin
2025-03-05 17:02 ` Arthur Fabre
0 siblings, 1 reply; 37+ messages in thread
From: Alexander Lobakin @ 2025-03-05 15:24 UTC (permalink / raw)
To: arthur
Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
Arthur Fabre
From: Arthur <arthur@arthurfabre.com>
Date: Wed, 05 Mar 2025 15:32:04 +0100
> From: Arthur Fabre <afabre@cloudflare.com>
>
> xdp_buff stores whether metadata is supported by a NIC by setting
> data_meta to be greater than data.
>
> But xdp_frame only stores the metadata size (as metasize), so converting
> between xdp_frame and xdp_buff is lossy.
>
> Steal an unused bit in xdp_frame to track whether metadata is supported
> or not.
>
> This will let us have "generic" functions for setting skb fields from
> either xdp_frame or xdp_buff from drivers.
>
> Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
> ---
> include/net/xdp.h | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 58019fa299b56dbd45c104fdfa807f73af6e4fa4..84afe07d09efdb2ab0cb78b904f02cb74f9a56b6 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -116,6 +116,9 @@ static __always_inline void xdp_buff_set_frag_pfmemalloc(struct xdp_buff *xdp)
> xdp->flags |= XDP_FLAGS_FRAGS_PF_MEMALLOC;
> }
>
> +static bool xdp_data_meta_unsupported(const struct xdp_buff *xdp);
> +static void xdp_set_data_meta_invalid(struct xdp_buff *xdp);
> +
> static __always_inline void *xdp_buff_traits(const struct xdp_buff *xdp)
> {
> return xdp->data_hard_start + _XDP_FRAME_SIZE;
> @@ -270,7 +273,9 @@ struct xdp_frame {
> void *data;
> u32 len;
> u32 headroom;
> - u32 metasize; /* uses lower 8-bits */
> + u32 :23, /* unused */
> + meta_unsupported:1,
> + metasize:8;
See the history of this structure for how and why we got rid of
bitfields here.
...because of performance.
Even though metasize uses only 8 bits, 1-byte access is slower than
32-bit access.
I was going to write "you can use the fact that metasize is always a
multiple of 4 or that it's never > 252, for example, you could reuse LSB
as a flag indicating that meta is not supported", but first of all
Do we still have drivers which don't support metadata?
Why don't they do that? It's not HW-specific or even driver-specific.
They don't reserve headroom? Then they're invalid, at least XDP_REDIRECT
won't work.
So maybe we need to fix those drivers first, if there are any.
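For concreteness, that LSB trick could look like this (untested sketch):

    /* metasize is always a multiple of 4, so bit 0 is free to mean
     * "metadata unsupported" without introducing a bitfield.
     */
    #define XDP_FRAME_META_UNSUPPORTED	BIT(0)

    /* xdp_update_frame_from_buff() side: */
    xdp_frame->metasize = metasize;
    if (xdp_data_meta_unsupported(xdp))
            xdp_frame->metasize |= XDP_FRAME_META_UNSUPPORTED;

    /* xdp_convert_frame_to_buff() side (any other reader of metasize
     * would need to mask the flag bit out too):
     */
    if (frame->metasize & XDP_FRAME_META_UNSUPPORTED)
            xdp_set_data_meta_invalid(xdp);
    else
            xdp->data_meta = frame->data - frame->metasize;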
> /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> * while mem_type is valid on remote CPU.
> */
> @@ -369,6 +374,8 @@ void xdp_convert_frame_to_buff(const struct xdp_frame *frame,
> xdp->data = frame->data;
> xdp->data_end = frame->data + frame->len;
> xdp->data_meta = frame->data - frame->metasize;
> + if (frame->meta_unsupported)
> + xdp_set_data_meta_invalid(xdp);
> xdp->frame_sz = frame->frame_sz;
> xdp->flags = frame->flags;
> }
> @@ -396,6 +403,7 @@ int xdp_update_frame_from_buff(const struct xdp_buff *xdp,
> xdp_frame->len = xdp->data_end - xdp->data;
> xdp_frame->headroom = headroom - sizeof(*xdp_frame);
> xdp_frame->metasize = metasize;
> + xdp_frame->meta_unsupported = xdp_data_meta_unsupported(xdp);
> xdp_frame->frame_sz = xdp->frame_sz;
> xdp_frame->flags = xdp->flags;
Thanks,
Olek
* Re: [PATCH RFC bpf-next 07/20] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions
2025-03-05 15:24 ` Alexander Lobakin
@ 2025-03-05 17:02 ` Arthur Fabre
2025-03-06 11:12 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 37+ messages in thread
From: Arthur Fabre @ 2025-03-05 17:02 UTC (permalink / raw)
To: Alexander Lobakin
Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
Arthur Fabre
On Wed Mar 5, 2025 at 4:24 PM CET, Alexander Lobakin wrote:
> From: Arthur <arthur@arthurfabre.com>
> Date: Wed, 05 Mar 2025 15:32:04 +0100
>
> > From: Arthur Fabre <afabre@cloudflare.com>
> >
> > xdp_buff stores whether metadata is supported by a NIC by setting
> > data_meta to be greater than data.
> >
> > But xdp_frame only stores the metadata size (as metasize), so converting
> > between xdp_frame and xdp_buff is lossy.
> >
> > Steal an unused bit in xdp_frame to track whether metadata is supported
> > or not.
> >
> > This will let us have "generic" functions for setting skb fields from
> > either xdp_frame or xdp_buff from drivers.
> >
> > Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
> > ---
> > include/net/xdp.h | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 58019fa299b56dbd45c104fdfa807f73af6e4fa4..84afe07d09efdb2ab0cb78b904f02cb74f9a56b6 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -116,6 +116,9 @@ static __always_inline void xdp_buff_set_frag_pfmemalloc(struct xdp_buff *xdp)
> > xdp->flags |= XDP_FLAGS_FRAGS_PF_MEMALLOC;
> > }
> >
> > +static bool xdp_data_meta_unsupported(const struct xdp_buff *xdp);
> > +static void xdp_set_data_meta_invalid(struct xdp_buff *xdp);
> > +
> > static __always_inline void *xdp_buff_traits(const struct xdp_buff *xdp)
> > {
> > return xdp->data_hard_start + _XDP_FRAME_SIZE;
> > @@ -270,7 +273,9 @@ struct xdp_frame {
> > void *data;
> > u32 len;
> > u32 headroom;
> > - u32 metasize; /* uses lower 8-bits */
> > + u32 :23, /* unused */
> > + meta_unsupported:1,
> > + metasize:8;
>
> See the history of this structure for how we got rid of bitfields here,
> and why.
>
> ...because of performance.
>
> Even though metasize uses only 8 bits, a 1-byte access is slower than a
> 32-bit access.
Interesting, thanks!
> I was going to write "you can use the fact that metasize is always a
> multiple of 4 or that it's never > 252, for example, you could reuse LSB
> as a flag indicating that meta is not supported", but first of all
>
> Do we still have drivers which don't support metadata?
> Why don't they do that? It's not HW-specific or even driver-specific.
> They don't reserve headroom? Then they're invalid, at least XDP_REDIRECT
> won't work.
>
> So maybe we need to fix those drivers first, if there are any.
Most drivers don't support metadata unfortunately:
> rg -U "xdp_prepare_buff\([^)]*false\);" drivers/net/
drivers/net/tun.c
1712: xdp_prepare_buff(&xdp, buf, pad, len, false);
drivers/net/ethernet/microsoft/mana/mana_bpf.c
94: xdp_prepare_buff(xdp, buf_va, XDP_PACKET_HEADROOM, pkt_len, false);
drivers/net/ethernet/marvell/mvneta.c
2344: xdp_prepare_buff(xdp, data, pp->rx_offset_correction + MVNETA_MH_SIZE,
2345: data_len, false);
drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
1436: xdp_prepare_buff(&xdp, hard_start, OTX2_HEAD_ROOM,
1437: cqe->sg.seg_size, false);
drivers/net/ethernet/socionext/netsec.c
1021: xdp_prepare_buff(&xdp, desc->addr, NETSEC_RXBUF_HEADROOM,
1022: pkt_len, false);
drivers/net/ethernet/google/gve/gve_rx.c
740: xdp_prepare_buff(&new, frame, headroom, len, false);
859: xdp_prepare_buff(&xdp, page_info->page_address +
860: page_info->page_offset, GVE_RX_PAD,
861: len, false);
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
3984: xdp_prepare_buff(&xdp, data,
3985: MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM,
3986: rx_bytes, false);
drivers/net/ethernet/aquantia/atlantic/aq_ring.c
794: xdp_prepare_buff(&xdp, hard_start, rx_ring->page_offset,
795: buff->len, false);
drivers/net/ethernet/cavium/thunder/nicvf_main.c
554: xdp_prepare_buff(&xdp, hard_start, data - hard_start, len, false);
drivers/net/ethernet/ti/cpsw_new.c
348: xdp_prepare_buff(&xdp, pa, headroom, size, false);
drivers/net/ethernet/freescale/enetc/enetc.c
1710: xdp_prepare_buff(xdp_buff, hard_start - rx_ring->buffer_offset,
1711: rx_ring->buffer_offset, size, false);
drivers/net/ethernet/ti/am65-cpsw-nuss.c
1335: xdp_prepare_buff(&xdp, page_addr, AM65_CPSW_HEADROOM,
1336: pkt_len, false);
drivers/net/ethernet/ti/cpsw.c
403: xdp_prepare_buff(&xdp, pa, headroom, size, false);
drivers/net/ethernet/sfc/rx.c
289: xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
290: rx_buf->len, false);
drivers/net/ethernet/mediatek/mtk_eth_soc.c
2097: xdp_prepare_buff(&xdp, data, MTK_PP_HEADROOM, pktlen,
2098: false);
drivers/net/ethernet/sfc/siena/rx.c
291: xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
292: rx_buf->len, false)
I don't know if it's just because no one has added calls to
skb_metadata_set() yet, or if there's a more fundamental reason.
I think they all reserve some amount of headroom, but not always the
full XDP_PACKET_HEADROOM. Eg sfc:
drivers/net/ethernet/sfc/net_driver.h:
/* Non-standard XDP_PACKET_HEADROOM and tailroom to satisfy XDP_REDIRECT and
* still fit two standard MTU size packets into a single 4K page.
*/
#define EFX_XDP_HEADROOM 128
If it's just because skb_metadata_set() is missing, I can take the
patches from this series that adds a "generic" XDP -> skb hook ("trait:
Propagate presence of traits to sk_buff"), have it call
skb_metadata_set(), and try to add it to all the drivers in a separate
series.
> > /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> > * while mem_type is valid on remote CPU.
> > */
> > @@ -369,6 +374,8 @@ void xdp_convert_frame_to_buff(const struct xdp_frame *frame,
> > xdp->data = frame->data;
> > xdp->data_end = frame->data + frame->len;
> > xdp->data_meta = frame->data - frame->metasize;
> > + if (frame->meta_unsupported)
> > + xdp_set_data_meta_invalid(xdp);
> > xdp->frame_sz = frame->frame_sz;
> > xdp->flags = frame->flags;
> > }
> > @@ -396,6 +403,7 @@ int xdp_update_frame_from_buff(const struct xdp_buff *xdp,
> > xdp_frame->len = xdp->data_end - xdp->data;
> > xdp_frame->headroom = headroom - sizeof(*xdp_frame);
> > xdp_frame->metasize = metasize;
> > + xdp_frame->meta_unsupported = xdp_data_meta_unsupported(xdp);
> > xdp_frame->frame_sz = xdp->frame_sz;
> > xdp_frame->flags = xdp->flags;
>
> Thanks,
> Olek
* Re: [PATCH RFC bpf-next 06/20] trait: Replace memmove calls with inline move
2025-03-05 14:32 ` [PATCH RFC bpf-next 06/20] trait: Replace memmove calls with inline move arthur
@ 2025-03-06 10:14 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 37+ messages in thread
From: Jesper Dangaard Brouer @ 2025-03-06 10:14 UTC (permalink / raw)
To: arthur, netdev, bpf
Cc: jakub, yan, jbrandeburg, thoiland, lbiancon, Arthur Fabre
On 05/03/2025 15.32, arthur@arthurfabre.com wrote:
> From: Arthur Fabre <afabre@cloudflare.com>
>
> When inserting or deleting traits, we need to move any subsequent
> traits over.
>
> Replace it with an inline implementation to avoid the function call
> overhead. This is especially expensive on AMD with SRSO.
>
> In practice we shouldn't have too much data to move around, and we're
> naturally limited to 238 bytes max, so a dumb implementation should
> hopefully be fast enough.
>
> Jesper Brouer kindly ran benchmarks on real hardware with three configs:
> - Intel: E5-1650 v4
> - AMD SRSO: 9684X SRSO
> - AMD IBPB: 9684X SRSO=IBPB
>
> Intel AMD IBPB AMD SRSO
> xdp-trait-get 5.530 3.901 9.188 (ns/op)
> xdp-trait-set 7.538 4.941 10.050 (ns/op)
> xdp-trait-move 14.245 8.865 14.834 (ns/op)
> function call 1.319 1.359 5.703 (ns/op)
> indirect call 8.922 6.251 10.329 (ns/op)
>
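For anyone reading along without the patch body, the kind of bounded inline
move described above could look roughly like this (a sketch only, assuming a
simple byte loop; the helper name and exact shape are not from the patch):

	/* Move len bytes at traits + off by shift bytes, relying on the
	 * small, bounded sizes (<= 238 bytes) mentioned above. The copy
	 * direction is chosen so overlapping ranges stay correct.
	 */
	static __always_inline void __trait_move(void *traits, int off,
						 int shift, int len)
	{
		u8 *src = traits + off, *dst = src + shift;
		int i;

		if (shift > 0)		/* moving up: copy from the end */
			for (i = len - 1; i >= 0; i--)
				dst[i] = src[i];
		else			/* moving down: copy from the start */
			for (i = 0; i < len; i++)
				dst[i] = src[i];
	}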
I've done extensive *micro* benchmarking, documented here:
- https://github.com/xdp-project/xdp-project/tree/main/areas/hints
- In the traits0X_* files
The latest results corresponding to this patchset are in this file:
- https://github.com/xdp-project/xdp-project/blob/main/areas/hints/traits07_bench-009.org
I've not done XDP_REDIRECT testing, which would likely show the cost of the
bitfield change in xdp_frame that Olek pointed out.
--Jesper
* Re: [PATCH RFC bpf-next 07/20] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions
2025-03-05 17:02 ` Arthur Fabre
@ 2025-03-06 11:12 ` Jesper Dangaard Brouer
2025-03-10 11:10 ` Lorenzo Bianconi
0 siblings, 1 reply; 37+ messages in thread
From: Jesper Dangaard Brouer @ 2025-03-06 11:12 UTC (permalink / raw)
To: Arthur Fabre, Alexander Lobakin
Cc: netdev, bpf, jakub, yan, jbrandeburg, thoiland, lbiancon,
Arthur Fabre
On 05/03/2025 18.02, Arthur Fabre wrote:
> On Wed Mar 5, 2025 at 4:24 PM CET, Alexander Lobakin wrote:
>> From: Arthur <arthur@arthurfabre.com>
>> Date: Wed, 05 Mar 2025 15:32:04 +0100
>>
>>> From: Arthur Fabre <afabre@cloudflare.com>
>>>
>>> xdp_buff stores whether metadata is supported by a NIC by setting
>>> data_meta to be greater than data.
>>>
>>> But xdp_frame only stores the metadata size (as metasize), so converting
>>> between xdp_frame and xdp_buff is lossy.
>>>
>>> Steal an unused bit in xdp_frame to track whether metadata is supported
>>> or not.
>>>
>>> This will lets us have "generic" functions for setting skb fields from
>>> either xdp_frame or xdp_buff from drivers.
>>>
>>> Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
>>> ---
>>> include/net/xdp.h | 10 +++++++++-
>>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>> index 58019fa299b56dbd45c104fdfa807f73af6e4fa4..84afe07d09efdb2ab0cb78b904f02cb74f9a56b6 100644
>>> --- a/include/net/xdp.h
>>> +++ b/include/net/xdp.h
>>> @@ -116,6 +116,9 @@ static __always_inline void xdp_buff_set_frag_pfmemalloc(struct xdp_buff *xdp)
>>> xdp->flags |= XDP_FLAGS_FRAGS_PF_MEMALLOC;
>>> }
>>>
>>> +static bool xdp_data_meta_unsupported(const struct xdp_buff *xdp);
>>> +static void xdp_set_data_meta_invalid(struct xdp_buff *xdp);
>>> +
>>> static __always_inline void *xdp_buff_traits(const struct xdp_buff *xdp)
>>> {
>>> return xdp->data_hard_start + _XDP_FRAME_SIZE;
>>> @@ -270,7 +273,9 @@ struct xdp_frame {
>>> void *data;
>>> u32 len;
>>> u32 headroom;
>>> - u32 metasize; /* uses lower 8-bits */
>>> + u32 :23, /* unused */
>>> + meta_unsupported:1,
>>> + metasize:8;
>>
>> See the history of this structure for how we got rid of bitfields here,
>> and why.
>>
>> ...because of performance.
>>
>> Even though metasize uses only 8 bits, a 1-byte access is slower than a
>> 32-bit access.
>
> Interesting, thanks!
>
I agree with Olek, we should not use bitfields. Thanks for catching this.
(The xdp_frame has a flags member...)
Why don't we use the flags member for storing this information?
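Presumably something like this, with a new bit in the existing flags word
(the flag name and bit position here are made up; check enum xdp_buff_flags
for a free bit). Since the conversion helpers already copy flags wholesale
(xdp->flags = frame->flags and back), the bit would survive
xdp_frame <> xdp_buff round trips without the two extra assignments in this
patch:

	#define XDP_FLAGS_META_UNSUPPORTED	BIT(2)	/* hypothetical */

	static __always_inline bool
	xdp_frame_meta_unsupported(const struct xdp_frame *frame)
	{
		return frame->flags & XDP_FLAGS_META_UNSUPPORTED;
	}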
>> I was going to write "you can use the fact that metasize is always a
>> multiple of 4 or that it's never > 252, for example, you could reuse LSB
>> as a flag indicating that meta is not supported", but first of all
>>
>> Do we still have drivers which don't support metadata?
>> Why don't they do that? It's not HW-specific or even driver-specific.
>> They don't reserve headroom? Then they're invalid, at least XDP_REDIRECT
>> won't work.
>>
I'm fairly sure that all drivers support XDP_REDIRECT.
Except didn't Lorenzo add a feature bit for this?
(so, some drivers might explicitly not support this)
>> So maybe we need to fix those drivers first, if there are any.
>
> Most drivers don't support metadata unfortunately:
>
>> rg -U "xdp_prepare_buff\([^)]*false\);" drivers/net/
> drivers/net/tun.c
> 1712: xdp_prepare_buff(&xdp, buf, pad, len, false);
>
> drivers/net/ethernet/microsoft/mana/mana_bpf.c
> 94: xdp_prepare_buff(xdp, buf_va, XDP_PACKET_HEADROOM, pkt_len, false);
>
> drivers/net/ethernet/marvell/mvneta.c
> 2344: xdp_prepare_buff(xdp, data, pp->rx_offset_correction + MVNETA_MH_SIZE,
> 2345: data_len, false);
>
> drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
> 1436: xdp_prepare_buff(&xdp, hard_start, OTX2_HEAD_ROOM,
> 1437: cqe->sg.seg_size, false);
>
> drivers/net/ethernet/socionext/netsec.c
> 1021: xdp_prepare_buff(&xdp, desc->addr, NETSEC_RXBUF_HEADROOM,
> 1022: pkt_len, false);
>
> drivers/net/ethernet/google/gve/gve_rx.c
> 740: xdp_prepare_buff(&new, frame, headroom, len, false);
> 859: xdp_prepare_buff(&xdp, page_info->page_address +
> 860: page_info->page_offset, GVE_RX_PAD,
> 861: len, false);
>
> drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> 3984: xdp_prepare_buff(&xdp, data,
> 3985: MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM,
> 3986: rx_bytes, false);
>
> drivers/net/ethernet/aquantia/atlantic/aq_ring.c
> 794: xdp_prepare_buff(&xdp, hard_start, rx_ring->page_offset,
> 795: buff->len, false);
>
> drivers/net/ethernet/cavium/thunder/nicvf_main.c
> 554: xdp_prepare_buff(&xdp, hard_start, data - hard_start, len, false);
>
> drivers/net/ethernet/ti/cpsw_new.c
> 348: xdp_prepare_buff(&xdp, pa, headroom, size, false);
>
> drivers/net/ethernet/freescale/enetc/enetc.c
> 1710: xdp_prepare_buff(xdp_buff, hard_start - rx_ring->buffer_offset,
> 1711: rx_ring->buffer_offset, size, false);
>
> drivers/net/ethernet/ti/am65-cpsw-nuss.c
> 1335: xdp_prepare_buff(&xdp, page_addr, AM65_CPSW_HEADROOM,
> 1336: pkt_len, false);
>
> drivers/net/ethernet/ti/cpsw.c
> 403: xdp_prepare_buff(&xdp, pa, headroom, size, false);
>
> drivers/net/ethernet/sfc/rx.c
> 289: xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
> 290: rx_buf->len, false);
>
> drivers/net/ethernet/mediatek/mtk_eth_soc.c
> 2097: xdp_prepare_buff(&xdp, data, MTK_PP_HEADROOM, pktlen,
> 2098: false);
>
> drivers/net/ethernet/sfc/siena/rx.c
> 291: xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
> 292: rx_buf->len, false)
>
> I don't know if it's just because no one has added calls to
> skb_metadata_set() yet, or if there's a more fundamental reason.
>
I simply think driver developers have been lazy.
If someone wants some easy kernel commits, these drivers should be fairly
easy to fix...
> I think they all reserve some amount of headroom, but not always the
> full XDP_PACKET_HEADROOM. Eg sfc:
>
The Intel drivers use 192 bytes (AFAIK that is still true). The API ended
up supporting non-standard XDP_PACKET_HEADROOM because of the Intel
drivers, when XDP support was added to them (which is a long time ago now).
> drivers/net/ethernet/sfc/net_driver.h:
> /* Non-standard XDP_PACKET_HEADROOM and tailroom to satisfy XDP_REDIRECT and
> * still fit two standard MTU size packets into a single 4K page.
> */
> #define EFX_XDP_HEADROOM 128
>
This is smaller than most drivers, but it still has enough headroom for
the xdp_frame + traits.
> If it's just because skb_metadata_set() is missing, I can take the
> patches from this series that adds a "generic" XDP -> skb hook ("trait:
> Propagate presence of traits to sk_buff"), have it call
> skb_metadata_set(), and try to add it to all the drivers in a separate
> series.
>
I think someone should clean up those drivers and add support.
--Jesper
>>> /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
>>> * while mem_type is valid on remote CPU.
>>> */
>>> @@ -369,6 +374,8 @@ void xdp_convert_frame_to_buff(const struct xdp_frame *frame,
>>> xdp->data = frame->data;
>>> xdp->data_end = frame->data + frame->len;
>>> xdp->data_meta = frame->data - frame->metasize;
>>> + if (frame->meta_unsupported)
>>> + xdp_set_data_meta_invalid(xdp);
>>> xdp->frame_sz = frame->frame_sz;
>>> xdp->flags = frame->flags;
>>> }
>>> @@ -396,6 +403,7 @@ int xdp_update_frame_from_buff(const struct xdp_buff *xdp,
>>> xdp_frame->len = xdp->data_end - xdp->data;
>>> xdp_frame->headroom = headroom - sizeof(*xdp_frame);
>>> xdp_frame->metasize = metasize;
>>> + xdp_frame->meta_unsupported = xdp_data_meta_unsupported(xdp);
>>> xdp_frame->frame_sz = xdp->frame_sz;
>>> xdp_frame->flags = xdp->flags;
>>
>> Thanks,
>> Olek
* Re: [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata
2025-03-05 14:31 ` [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata arthur
@ 2025-03-07 6:36 ` Alexei Starovoitov
2025-03-07 11:14 ` Arthur Fabre
2025-03-07 19:24 ` Jakub Sitnicki
1 sibling, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-03-07 6:36 UTC (permalink / raw)
To: arthur
Cc: Network Development, bpf, Jakub Sitnicki, Jesper Dangaard Brouer,
Yan Zhai, jbrandeburg, Toke Høiland-Jørgensen, lbiancon,
Arthur Fabre
On Wed, Mar 5, 2025 at 6:33 AM <arthur@arthurfabre.com> wrote:
>
> +struct __trait_hdr {
> + /* Values are stored ordered by key, immediately after the header.
> + *
> + * The size of each value set is stored in the header as two bits:
> + * - 00: Not set.
> + * - 01: 2 bytes.
> + * - 10: 4 bytes.
> + * - 11: 8 bytes.
...
> + * - hweight(low) + hweight(high)<<1 is offset.
the comment doesn't match the code
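(For reference, an offset computation consistent with __trait_total_length()
below would be along these lines; the actual __trait_offset() isn't quoted in
this thread, so this is a sketch: the offset of a key's value is the header
size plus the total length of all values stored under lower keys.)

	static __always_inline int __trait_offset(struct __trait_hdr h, u64 key)
	{
		/* Bits of all keys strictly below this one; key is 0..63. */
		u64 mask = (1ull << key) - 1;

		return sizeof(struct __trait_hdr) +
		       __trait_total_length(__trait_and(h, mask));
	}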
> + */
> + u64 high;
> + u64 low;
...
> +static __always_inline int __trait_total_length(struct __trait_hdr h)
> +{
> + return (hweight64(h.low) << 1) + (hweight64(h.high) << 2)
> + // For size 8, we only get 4+2=6. Add another 2 in.
> + + (hweight64(h.high & h.low) << 1);
> +}
This is a really cool idea, but a 2-byte size doesn't feel that useful.
How about:
- 00: Not set.
- 01: 4 bytes.
- 10: 8 bytes.
- 11: 16 bytes.
4 bytes may be useful for IPv4, 16 for IPv6, and 8 is just a good number.
And compute it the same way with 3 popcounts, with an extra +1 on the shifts.
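A sketch of that variant for concreteness; relative to the quoted code, each
shift just grows by one:

	static __always_inline int __trait_total_length(struct __trait_hdr h)
	{
		return (hweight64(h.low) << 2) + (hweight64(h.high) << 3)
		       // For size 16, we only get 8+4=12. Add another 4 in.
		       + (hweight64(h.high & h.low) << 2);
	}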
* Re: [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata
2025-03-07 6:36 ` Alexei Starovoitov
@ 2025-03-07 11:14 ` Arthur Fabre
2025-03-07 17:29 ` Alexei Starovoitov
0 siblings, 1 reply; 37+ messages in thread
From: Arthur Fabre @ 2025-03-07 11:14 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Network Development, bpf, Jakub Sitnicki, Jesper Dangaard Brouer,
Yan Zhai, jbrandeburg, Toke Høiland-Jørgensen, lbiancon,
Arthur Fabre
On Fri Mar 7, 2025 at 7:36 AM CET, Alexei Starovoitov wrote:
> On Wed, Mar 5, 2025 at 6:33 AM <arthur@arthurfabre.com> wrote:
> >
> > +struct __trait_hdr {
> > + /* Values are stored ordered by key, immediately after the header.
> > + *
> > + * The size of each value set is stored in the header as two bits:
> > + * - 00: Not set.
> > + * - 01: 2 bytes.
> > + * - 10: 4 bytes.
> > + * - 11: 8 bytes.
>
> ...
>
> > + * - hweight(low) + hweight(high)<<1 is offset.
>
> the comment doesn't match the code
>
> > + */
> > + u64 high;
> > + u64 low;
>
> ...
>
> > +static __always_inline int __trait_total_length(struct __trait_hdr h)
> > +{
> > + return (hweight64(h.low) << 1) + (hweight64(h.high) << 2)
> > + // For size 8, we only get 4+2=6. Add another 2 in.
> > + + (hweight64(h.high & h.low) << 1);
> > +}
>
> This is a really cool idea, but a 2-byte size doesn't feel that useful.
> How about:
> - 00: Not set.
> - 01: 4 bytes.
> - 10: 8 bytes.
> - 11: 16 bytes.
>
> 4 bytes may be useful for IPv4, 16 for IPv6, and 8 is just a good number.
> And compute it the same way with 3 popcounts, with an extra +1 on the shifts.
I chose the sizes arbitrarily, happy to change them.
16 is also useful for UUIDs, for tracing.
Size 0 could store bools / flags. Keys could be set without a value,
and users could check if the key is set or not.
That replaces single bits of the mark today, for example a
"route locally" key.
That only leaves one other size, maybe 4 for smaller values?
If we want more sizes, we could also:
- Add another u64 word to the header, so we have 3 bits per key. It
uses more room, and we need more popcnts, but most modern x86 CPUs can
do 3 popcnts in parallel so it could be ok.
- Let users set consecutive keys to one big value. Instead of supporting
size 16, we let them set two 8 byte KVs in one trait_set() call and
provide a 16 byte value. Eg:
trait_set_batch(u64 key_from, u64 key_to, size, ...);
It's easy to implement, but it makes the API more complicated.
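For illustration, a rough sketch of that batch call, assuming the value
buffer is split evenly across the keys (the name and exact semantics are not
settled):

	static __always_inline
	int trait_set_batch(void *traits, void *hard_end, u64 key_from,
			    u64 key_to, const void *val, u64 val_len, u64 flags)
	{
		u64 nkeys, each, key;
		int err;

		if (key_to < key_from || key_to > 63)
			return -EINVAL;

		nkeys = key_to - key_from + 1;
		if (!val_len || val_len % nkeys)
			return -EINVAL;
		each = val_len / nkeys;	/* e.g. two keys of 8 bytes for a 16 byte value */

		for (key = key_from; key <= key_to; key++, val += each) {
			err = trait_set(traits, hard_end, key, val, each, flags);
			if (err)
				return err;
		}
		return 0;
	}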
* Re: [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata
2025-03-07 11:14 ` Arthur Fabre
@ 2025-03-07 17:29 ` Alexei Starovoitov
2025-03-10 14:45 ` Arthur Fabre
0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2025-03-07 17:29 UTC (permalink / raw)
To: Arthur Fabre
Cc: Network Development, bpf, Jakub Sitnicki, Jesper Dangaard Brouer,
Yan Zhai, jbrandeburg, Toke Høiland-Jørgensen, lbiancon,
Arthur Fabre
On Fri, Mar 7, 2025 at 3:14 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
>
> On Fri Mar 7, 2025 at 7:36 AM CET, Alexei Starovoitov wrote:
> > On Wed, Mar 5, 2025 at 6:33 AM <arthur@arthurfabre.com> wrote:
> > >
> > > +struct __trait_hdr {
> > > + /* Values are stored ordered by key, immediately after the header.
> > > + *
> > > + * The size of each value set is stored in the header as two bits:
> > > + * - 00: Not set.
> > > + * - 01: 2 bytes.
> > > + * - 10: 4 bytes.
> > > + * - 11: 8 bytes.
> >
> > ...
> >
> > > + * - hweight(low) + hweight(high)<<1 is offset.
> >
> > the comment doesn't match the code
> >
> > > + */
> > > + u64 high;
> > > + u64 low;
> >
> > ...
> >
> > > +static __always_inline int __trait_total_length(struct __trait_hdr h)
> > > +{
> > > + return (hweight64(h.low) << 1) + (hweight64(h.high) << 2)
> > > + // For size 8, we only get 4+2=6. Add another 2 in.
> > > + + (hweight64(h.high & h.low) << 1);
> > > +}
> >
> > This is a really cool idea, but a 2-byte size doesn't feel that useful.
> > How about:
> > - 00: Not set.
> > - 01: 4 bytes.
> > - 10: 8 bytes.
> > - 11: 16 bytes.
> >
> > 4 bytes may be useful for IPv4, 16 for IPv6, and 8 is just a good number.
> > And compute it the same way with 3 popcounts, with an extra +1 on the shifts.
>
> I chose the sizes arbitrarily, happy to change them.
>
> 16 is also useful for UUIDs, for tracing.
>
> Size 0 could store bools / flags. Keys could be set without a value,
> and users could check if the key is set or not.
> That replaces single bits of the mark today, for example a
> "route locally" key.
I don't understand how that would work.
If I'm reading the code correctly, 00 means that the key is not present.
How would one differentiate key present/not present with a zero-size value?
>
> That only leaves one other size, maybe 4 for smaller values?
>
> If we want more sizes, we could also:
>
> - Add another u64 word to the header, so we have 3 bits per key. It
> uses more room, and we need more popcnts, but most modern x86 CPUs can
> do 3 popcnts in parallel so it could be ok.
Two u64s already need 3 pop counts, so it's a good compromise as-is.
> - Let users set consecutive keys to one big value. Instead of supporting
> size 16, we let them set two 8 byte KVs in one trait_set() call and
> provide a 16 byte value. Eg:
>
> > trait_set_batch(u64 key_from, u64 key_to, size, ...);
>
> It's easy to implement, but it makes the API more complicated.
I don't think it complicates the API.
With max size 16 the user can put two consecutive keys of, say, 16 and 8
to make 24 bytes of info,
or use 4 keys of 16 bytes each to form 64 bytes of info.
The bit manipulations are too tricky for compilers to optimize.
So even with full inlining the two trait_set() of consecutive keys
will still be largely separate blobs of code.
So trait_[gs]et_batch() makes sense to me.
Also let's not over focus on networking use cases.
This mini KV will be useful in all bpf maps including local storage.
For example the user process can add information about itself into
task local storage while sched-ext can use that to make scheduling decisions.
* Re: [PATCH RFC bpf-next 02/20] trait: XDP support
2025-03-05 14:31 ` [PATCH RFC bpf-next 02/20] trait: XDP support arthur
@ 2025-03-07 19:13 ` Lorenzo Bianconi
2025-03-10 15:50 ` Arthur Fabre
0 siblings, 1 reply; 37+ messages in thread
From: Lorenzo Bianconi @ 2025-03-07 19:13 UTC (permalink / raw)
To: arthur
Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
Arthur Fabre
On Mar 05, arthur@arthurfabre.com wrote:
> From: Arthur Fabre <afabre@cloudflare.com>
>
[...]
> +static __always_inline void *xdp_buff_traits(const struct xdp_buff *xdp)
> +{
> + return xdp->data_hard_start + _XDP_FRAME_SIZE;
> +}
> +
> static __always_inline void
> xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq)
> {
> @@ -133,6 +139,13 @@ xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start,
> xdp->data = data;
> xdp->data_end = data + data_len;
> xdp->data_meta = meta_valid ? data : data + 1;
> +
> + if (meta_valid) {
can we relax this constraint and use xdp->data as the end boundary here?
> + /* We assume drivers reserve enough headroom to store xdp_frame
> + * and the traits header.
> + */
> + traits_init(xdp_buff_traits(xdp), xdp->data_meta);
> + }
> }
>
> /* Reserve memory area at end-of data area.
> @@ -267,6 +280,8 @@ struct xdp_frame {
> u32 flags; /* supported values defined in xdp_buff_flags */
> };
>
> +static_assert(sizeof(struct xdp_frame) == _XDP_FRAME_SIZE);
> +
> static __always_inline bool xdp_frame_has_frags(const struct xdp_frame *frame)
> {
> return !!(frame->flags & XDP_FLAGS_HAS_FRAGS);
> @@ -517,6 +532,11 @@ static inline bool xdp_metalen_invalid(unsigned long metalen)
> return !IS_ALIGNED(metalen, sizeof(u32)) || metalen > meta_max;
> }
>
> +static __always_inline void *xdp_meta_hard_start(const struct xdp_buff *xdp)
> +{
> + return xdp_buff_traits(xdp) + traits_size(xdp_buff_traits(xdp));
here we are always consuming sizeof(struct __trait_hdr), right? Can we do
something smarter and check if traits are really used (e.g. by adding a flag
in xdp_buff)?
> +}
> +
> struct xdp_attachment_info {
> struct bpf_prog *prog;
> u32 flags;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index dcc53ac5c5458f67a422453134665d43d466a02e..79b78e7cd57fd78c6cc8443da54ae96408c496b0 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -85,6 +85,7 @@
> #include <linux/un.h>
> #include <net/xdp_sock_drv.h>
> #include <net/inet_dscp.h>
> +#include <net/trait.h>
>
> #include "dev.h"
>
> @@ -3935,9 +3936,8 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
>
> BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
> {
> - void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
> unsigned long metalen = xdp_get_metalen(xdp);
> - void *data_start = xdp_frame_end + metalen;
> + void *data_start = xdp_meta_hard_start(xdp) + metalen;
We could waste 16 bytes here, right?
Regards,
Lorenzo
> void *data = xdp->data + offset;
>
> if (unlikely(data < data_start ||
> @@ -4228,13 +4228,12 @@ static const struct bpf_func_proto bpf_xdp_adjust_tail_proto = {
>
> BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
> {
> - void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
> void *meta = xdp->data_meta + offset;
> unsigned long metalen = xdp->data - meta;
>
> if (xdp_data_meta_unsupported(xdp))
> return -ENOTSUPP;
> - if (unlikely(meta < xdp_frame_end ||
> + if (unlikely(meta < xdp_meta_hard_start(xdp) ||
> meta > xdp->data))
> return -EINVAL;
> if (unlikely(xdp_metalen_invalid(metalen)))
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 2c6ab6fb452f7b90d85125ae17fef96cfc9a8576..2e87f82aa5f835f60295d859a524e40bd47c42ee 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -1032,3 +1032,53 @@ void xdp_features_clear_redirect_target(struct net_device *dev)
> xdp_set_features_flag(dev, val);
> }
> EXPORT_SYMBOL_GPL(xdp_features_clear_redirect_target);
> +
> +__bpf_kfunc_start_defs();
> +
> +__bpf_kfunc int bpf_xdp_trait_set(const struct xdp_buff *xdp, u64 key,
> + const void *val, u64 val__sz, u64 flags)
> +{
> + if (xdp_data_meta_unsupported(xdp))
> + return -EOPNOTSUPP;
> +
> + return trait_set(xdp_buff_traits(xdp), xdp->data_meta, key,
> + val, val__sz, flags);
> +}
> +
> +__bpf_kfunc int bpf_xdp_trait_get(const struct xdp_buff *xdp, u64 key,
> + void *val, u64 val__sz)
> +{
> + if (xdp_data_meta_unsupported(xdp))
> + return -EOPNOTSUPP;
> +
> + return trait_get(xdp_buff_traits(xdp), key, val, val__sz);
> +}
> +
> +__bpf_kfunc int bpf_xdp_trait_del(const struct xdp_buff *xdp, u64 key)
> +{
> + if (xdp_data_meta_unsupported(xdp))
> + return -EOPNOTSUPP;
> +
> + return trait_del(xdp_buff_traits(xdp), key);
> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(xdp_trait)
> +// TODO - should we use KF_TRUSTED_ARGS? https://www.kernel.org/doc/html/next/bpf/kfuncs.html#kf-trusted-args-flag
> +BTF_ID_FLAGS(func, bpf_xdp_trait_set)
> +BTF_ID_FLAGS(func, bpf_xdp_trait_get)
> +BTF_ID_FLAGS(func, bpf_xdp_trait_del)
> +BTF_KFUNCS_END(xdp_trait)
> +
> +static const struct btf_kfunc_id_set xdp_trait_kfunc_set = {
> + .owner = THIS_MODULE,
> + .set = &xdp_trait,
> +};
> +
> +static int xdp_trait_init(void)
> +{
> + return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP,
> + &xdp_trait_kfunc_set);
> +}
> +late_initcall(xdp_trait_init);
>
> --
> 2.43.0
>
>
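For what it's worth, calling these kfuncs from an XDP program would
presumably look something like the below (ctx-taking kfuncs are declared
against struct xdp_md on the BPF side; the key number and program are
arbitrary examples, not from the series):

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>

	extern int bpf_xdp_trait_set(const struct xdp_md *xdp, __u64 key,
				     const void *val, __u64 val__sz,
				     __u64 flags) __ksym;
	extern int bpf_xdp_trait_get(const struct xdp_md *xdp, __u64 key,
				     void *val, __u64 val__sz) __ksym;

	SEC("xdp")
	int tag_packet(struct xdp_md *ctx)
	{
		__u32 tenant = 42;

		/* Store a value under key 7; later stages (TC, netfilter,
		 * ...) can read it back with the matching get().
		 */
		bpf_xdp_trait_set(ctx, 7, &tenant, sizeof(tenant), 0);
		return XDP_PASS;
	}

	char _license[] SEC("license") = "GPL";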
* Re: [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata
2025-03-05 14:31 ` [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata arthur
2025-03-07 6:36 ` Alexei Starovoitov
@ 2025-03-07 19:24 ` Jakub Sitnicki
1 sibling, 0 replies; 37+ messages in thread
From: Jakub Sitnicki @ 2025-03-07 19:24 UTC (permalink / raw)
To: arthur
Cc: netdev, bpf, hawk, yan, jbrandeburg, thoiland, lbiancon,
Arthur Fabre
On Wed, Mar 05, 2025 at 03:31 PM +01, arthur@arthurfabre.com wrote:
> From: Arthur Fabre <afabre@cloudflare.com>
>
> A (very limited) KV store to support storing KVs in the packet headroom,
> with:
> - 64 keys (0-63).
> - 2, 4 or 8 byte values.
> - O(1) lookup
> - O(n) insertion
> - A fixed 16 byte header.
>
> Values are stored ordered by key, immediately following the fixed
> header.
>
> This could be extended in the future, for now it implements the smallest
> possible API. The API intentionally uses u64 keys to not impose
> restrictions on the implementation in the future.
>
> I picked 2, 4, and 8 bytes arbitrarily. We could also support 0 sized
> values for use as flags.
> A 16 byte value could be useful to store UUIDs and IPv6 addresses.
> If we want more than 3 sizes, we can add a word to the header (for a
> total of 24 bytes) to support 7 sizes.
>
> We could also allow users to set several consecutive keys in one
> trait_set() call to support storing larger values.
>
> Implemented in the header file so functions are always inlinable.
>
> Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
> ---
[...]
> +/**
> + * trait_get() - Get a trait key.
> + * @traits: Start of trait store area.
> + * @key: The key to get.
> + * @val: Where to put stored value.
> + * @val_len: The length of val.
> + *
> + * Return:
> + * * %>0 - Actual size of value.
> + * * %-EINVAL - Key or length invalid.
> + * * %-ENOENT - Key has not been set with trait_set() previously.
> + * * %-ENOSPC - Val is not big enough to hold stored value.
> + */
> +static __always_inline
> +int trait_get(void *traits, u64 key, void *val, u64 val_len)
Hmm. I think that passing void *val will bite us on big endian.
It seems I can't pass a u64 * if I don't know the value size up front.
For values under 8 bytes I would end up with the bytes shifted to the left.
No?
Take a u64 *val instead and copy it in at the right offset?
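Concretely, assuming a 4-byte value was stored under the key and the caller
passes a u64 buffer:

	u64 v = 0;
	int len = trait_get(traits, key, &v, sizeof(v));
	/* len == 4: the four value bytes are copied to the start of v.
	 * Little-endian: they form the low half, so v holds the value.
	 * Big-endian: they form the high half, so v holds value << 32.
	 */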
> +{
> + if (!__trait_valid_key(key))
> + return -EINVAL;
> +
> + struct __trait_hdr h = *(struct __trait_hdr *)traits;
> +
> + /* Check key is set */
> + if (!((h.high & (1ull << key)) || (h.low & (1ull << key))))
> + return -ENOENT;
> +
> + /* Offset of value of this key */
> + int off = __trait_offset(h, key);
> +
> + /* Figure out our length */
> + int real_len = __trait_total_length(__trait_and(h, (1ull << key)));
> +
> + if (real_len > val_len)
> + return -ENOSPC;
> +
> + memcpy(val, traits + off, real_len);
> + return real_len;
> +}
* Re: [PATCH RFC bpf-next 05/20] trait: Replace memcpy calls with inline copies
2025-03-05 14:32 ` [PATCH RFC bpf-next 05/20] trait: Replace memcpy calls with inline copies arthur
@ 2025-03-10 10:50 ` Lorenzo Bianconi
2025-03-10 15:52 ` Arthur Fabre
2025-03-10 22:15 ` David Laight
1 sibling, 1 reply; 37+ messages in thread
From: Lorenzo Bianconi @ 2025-03-10 10:50 UTC (permalink / raw)
To: arthur
Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
Arthur Fabre
> From: Arthur Fabre <afabre@cloudflare.com>
>
> When copying trait values to or from the caller, the size isn't a
> constant so memcpy() ends up being a function call.
>
> Replace it with an inline implementation that only handles the sizes we
> support.
>
> We store values "packed", so they won't necessarily be 4 or 8 byte
> aligned.
>
> Setting and getting traits is roughly 40% faster.
Nice! I guess in a formal series this patch can be squashed with patch 1/20
(adding some comments).
Regards,
Lorenzo
>
> Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
> ---
> include/net/trait.h | 25 +++++++++++++++++++------
> 1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/trait.h b/include/net/trait.h
> index 536b8a17dbbc091b4d1a4d7b4b21c1e36adea86a..d4581a877bd57a32e2ad032147c906764d6d37f8 100644
> --- a/include/net/trait.h
> +++ b/include/net/trait.h
> @@ -7,6 +7,7 @@
> #include <linux/errno.h>
> #include <linux/string.h>
> #include <linux/bitops.h>
> +#include <linux/unaligned.h>
>
> /* Traits are a very limited KV store, with:
> * - 64 keys (0-63).
> @@ -145,23 +146,23 @@ int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u
> memmove(traits + off + len, traits + off, traits_size(traits) - off);
> }
>
> - /* Set our value. */
> - memcpy(traits + off, val, len);
> -
> - /* Store our length in header. */
> u64 encode_len = 0;
> -
> switch (len) {
> case 2:
> + /* Values are at least two bytes, so they'll be two-byte aligned */
> + *(u16 *)(traits + off) = *(u16 *)val;
> encode_len = 1;
> break;
> case 4:
> + put_unaligned(*(u32 *)val, (u32 *)(traits + off));
> encode_len = 2;
> break;
> case 8:
> + put_unaligned(*(u64 *)val, (u64 *)(traits + off));
> encode_len = 3;
> break;
> }
> +
> h->high |= (encode_len >> 1) << key;
> h->low |= (encode_len & 1) << key;
> return 0;
> @@ -201,7 +202,19 @@ int trait_get(void *traits, u64 key, void *val, u64 val_len)
> if (real_len > val_len)
> return -ENOSPC;
>
> - memcpy(val, traits + off, real_len);
> + switch (real_len) {
> + case 2:
> + /* Values are at least two bytes, so they'll be two-byte aligned */
> + *(u16 *)val = *(u16 *)(traits + off);
> + break;
> + case 4:
> + *(u32 *)val = get_unaligned((u32 *)(traits + off));
> + break;
> + case 8:
> + *(u64 *)val = get_unaligned((u64 *)(traits + off));
> + break;
> + }
> +
> return real_len;
> }
>
>
> --
> 2.43.0
>
>
* Re: [PATCH RFC bpf-next 07/20] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions
2025-03-06 11:12 ` Jesper Dangaard Brouer
@ 2025-03-10 11:10 ` Lorenzo Bianconi
0 siblings, 0 replies; 37+ messages in thread
From: Lorenzo Bianconi @ 2025-03-10 11:10 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Arthur Fabre, Alexander Lobakin, netdev, bpf, jakub, yan,
jbrandeburg, thoiland, lbiancon, Arthur Fabre
[...]
> > >
>
> I'm fairly sure that all drivers support XDP_REDIRECT.
> Except didn't Lorenzo add a feature bit for this?
> (so, some drivers might explicitly not support this)
I think most of the drivers support XDP_REDIRECT. IIRC just some VF
implementations (e.g. ixgbevf or thunder/nicvf) do not support XDP_REDIRECT.
Maybe nfp is a special case.
Regards,
Lorenzo
>
> > > So maybe we need to fix those drivers first, if there are any.
> >
> > Most drivers don't support metadata unfortunately:
> >
> > > rg -U "xdp_prepare_buff\([^)]*false\);" drivers/net/
> > drivers/net/tun.c
> > 1712: xdp_prepare_buff(&xdp, buf, pad, len, false);
> >
> > drivers/net/ethernet/microsoft/mana/mana_bpf.c
> > 94: xdp_prepare_buff(xdp, buf_va, XDP_PACKET_HEADROOM, pkt_len, false);
> >
> > drivers/net/ethernet/marvell/mvneta.c
> > 2344: xdp_prepare_buff(xdp, data, pp->rx_offset_correction + MVNETA_MH_SIZE,
> > 2345: data_len, false);
> >
> > drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
> > 1436: xdp_prepare_buff(&xdp, hard_start, OTX2_HEAD_ROOM,
> > 1437: cqe->sg.seg_size, false);
> >
> > drivers/net/ethernet/socionext/netsec.c
> > 1021: xdp_prepare_buff(&xdp, desc->addr, NETSEC_RXBUF_HEADROOM,
> > 1022: pkt_len, false);
> >
> > drivers/net/ethernet/google/gve/gve_rx.c
> > 740: xdp_prepare_buff(&new, frame, headroom, len, false);
> > 859: xdp_prepare_buff(&xdp, page_info->page_address +
> > 860: page_info->page_offset, GVE_RX_PAD,
> > 861: len, false);
> >
> > drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> > 3984: xdp_prepare_buff(&xdp, data,
> > 3985: MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM,
> > 3986: rx_bytes, false);
> >
> > drivers/net/ethernet/aquantia/atlantic/aq_ring.c
> > 794: xdp_prepare_buff(&xdp, hard_start, rx_ring->page_offset,
> > 795: buff->len, false);
> >
> > drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > 554: xdp_prepare_buff(&xdp, hard_start, data - hard_start, len, false);
> >
> > drivers/net/ethernet/ti/cpsw_new.c
> > 348: xdp_prepare_buff(&xdp, pa, headroom, size, false);
> >
> > drivers/net/ethernet/freescale/enetc/enetc.c
> > 1710: xdp_prepare_buff(xdp_buff, hard_start - rx_ring->buffer_offset,
> > 1711: rx_ring->buffer_offset, size, false);
> >
> > drivers/net/ethernet/ti/am65-cpsw-nuss.c
> > 1335: xdp_prepare_buff(&xdp, page_addr, AM65_CPSW_HEADROOM,
> > 1336: pkt_len, false);
> >
> > drivers/net/ethernet/ti/cpsw.c
> > 403: xdp_prepare_buff(&xdp, pa, headroom, size, false);
> >
> > drivers/net/ethernet/sfc/rx.c
> > 289: xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
> > 290: rx_buf->len, false);
> >
> > drivers/net/ethernet/mediatek/mtk_eth_soc.c
> > 2097: xdp_prepare_buff(&xdp, data, MTK_PP_HEADROOM, pktlen,
> > 2098: false);
> >
> > drivers/net/ethernet/sfc/siena/rx.c
> > 291: xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
> > 292: rx_buf->len, false)
> >
> > I don't know if it's just because no one has added calls to
> > skb_metadata_set() yet, or if there's a more fundamental reason.
> >
>
> I simply think driver developers have been lazy.
>
> If someone wants some easy kernel commits, these drivers should be fairly
> easy to fix...
>
> > I think they all reserve some amount of headroom, but not always the
> > full XDP_PACKET_HEADROOM. Eg sfc:
> >
>
> The Intel drivers use 192 bytes (AFAIK that is still true). The API ended
> up supporting non-standard XDP_PACKET_HEADROOM because of the Intel
> drivers, when XDP support was added to them (which is a long time ago now).
>
> > drivers/net/ethernet/sfc/net_driver.h:
> > /* Non-standard XDP_PACKET_HEADROOM and tailroom to satisfy XDP_REDIRECT and
> > * still fit two standard MTU size packets into a single 4K page.
> > */
> > #define EFX_XDP_HEADROOM 128
> >
>
> This is smaller than most drivers, but it still has enough headroom for
> the xdp_frame + traits.
>
> > If it's just because skb_metadata_set() is missing, I can take the
> > patches from this series that adds a "generic" XDP -> skb hook ("trait:
> > Propagate presence of traits to sk_buff"), have it call
> > skb_metadata_set(), and try to add it to all the drivers in a separate
> > series.
> >
>
> I think someone should clean up those drivers and add support.
>
> --Jesper
>
> > > > /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> > > > * while mem_type is valid on remote CPU.
> > > > */
> > > > @@ -369,6 +374,8 @@ void xdp_convert_frame_to_buff(const struct xdp_frame *frame,
> > > > xdp->data = frame->data;
> > > > xdp->data_end = frame->data + frame->len;
> > > > xdp->data_meta = frame->data - frame->metasize;
> > > > + if (frame->meta_unsupported)
> > > > + xdp_set_data_meta_invalid(xdp);
> > > > xdp->frame_sz = frame->frame_sz;
> > > > xdp->flags = frame->flags;
> > > > }
> > > > @@ -396,6 +403,7 @@ int xdp_update_frame_from_buff(const struct xdp_buff *xdp,
> > > > xdp_frame->len = xdp->data_end - xdp->data;
> > > > xdp_frame->headroom = headroom - sizeof(*xdp_frame);
> > > > xdp_frame->metasize = metasize;
> > > > + xdp_frame->meta_unsupported = xdp_data_meta_unsupported(xdp);
> > > > xdp_frame->frame_sz = xdp->frame_sz;
> > > > xdp_frame->flags = xdp->flags;
> > >
> > > Thanks,
> > > Olek
>
* Re: [PATCH RFC bpf-next 16/20] trait: Support sk_buffs
2025-03-05 14:32 ` [PATCH RFC bpf-next 16/20] trait: Support sk_buffs arthur
@ 2025-03-10 11:45 ` Lorenzo Bianconi
0 siblings, 0 replies; 37+ messages in thread
From: Lorenzo Bianconi @ 2025-03-10 11:45 UTC (permalink / raw)
To: arthur
Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
Arthur Fabre
> From: Arthur Fabre <afabre@cloudflare.com>
>
> Hide the space used by traits from skb_headroom(): that space isn't
> actually usable.
>
> Preserve the trait store in pskb_expand_head() by copying it ahead of
> the new headroom. The struct xdp_frame at the start of the headroom
> isn't needed anymore, so we can overwrite it with traits, and introduce
> a new flag to indicate traits are stored at the start of the headroom.
>
> Cloned skbs share the same packet data and headroom as the original skb,
> so changes to traits in one would be reflected in the other.
> Is that ok?
> Are there cases where we would want a clone to have different traits?
> For now, prevent clones from using traits.
>
> Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
> ---
> include/linux/skbuff.h | 25 +++++++++++++++++++++++--
> net/core/skbuff.c | 25 +++++++++++++++++++++++--
> 2 files changed, 46 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index d7dfee152ebd26ce87a230222e94076aca793adc..886537508789202339c925b5613574de67b7e43c 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -39,6 +39,7 @@
> #include <net/net_debug.h>
> #include <net/dropreason-core.h>
> #include <net/netmem.h>
> +#include <net/trait.h>
>
> /**
> * DOC: skb checksums
> @@ -729,6 +730,8 @@ enum skb_traits_type {
> SKB_TRAITS_NONE,
> /* Trait store in headroom, offset by sizeof(struct xdp_frame) */
> SKB_TRAITS_AFTER_XDP,
> + /* Trait store at start of headroom */
> + SKB_TRAITS_AT_HEAD,
> };
>
> /**
> @@ -1029,7 +1032,7 @@ struct sk_buff {
> __u8 csum_not_inet:1;
> #endif
> __u8 unreadable:1;
> - __u8 traits_type:1; /* See enum skb_traits_type */
> + __u8 traits_type:2; /* See enum skb_traits_type */
> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> __u16 tc_index; /* traffic control index */
> #endif
> @@ -2836,6 +2839,18 @@ static inline void *pskb_pull(struct sk_buff *skb, unsigned int len)
>
> void skb_condense(struct sk_buff *skb);
>
> +static inline void *skb_traits(const struct sk_buff *skb)
> +{
> + switch (skb->traits_type) {
> + case SKB_TRAITS_AFTER_XDP:
> + return skb->head + _XDP_FRAME_SIZE;
> + case SKB_TRAITS_AT_HEAD:
> + return skb->head;
> + default:
> + return NULL;
> + }
> +}
> +
> /**
> * skb_headroom - bytes at buffer head
> * @skb: buffer to check
> @@ -2844,7 +2859,13 @@ void skb_condense(struct sk_buff *skb);
> */
> static inline unsigned int skb_headroom(const struct sk_buff *skb)
> {
> - return skb->data - skb->head;
> + int trait_size = 0;
> + void *traits = skb_traits(skb);
> +
> + if (traits)
> + trait_size = traits_size(traits);
> +
> + return skb->data - skb->head - trait_size;
I am not fully aware of all possible use-cases, but do we really need to
store hw metadata traits (e.g. hw rx checksum or hw rx hash) in the skb
headroom when we convert the xdp_frame/xdp_buff into the skb? All of these
already have dedicated fields in the skb struct. Moreover, we need to set
them in order to get real performance improvements when we execute
XDP_PASS. Something like:
https://lore.kernel.org/bpf/01ce17910fdd7b693c23132663fa884d5ec7f440.1726935917.git.lorenzo@kernel.org/
Regards,
Lorenzo
> }
>
> /**
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7b03b64fdcb276f68ce881d1d8da8e4c6b897efc..83f58517738e8ff12990c28b09336ed44f4be32a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1515,6 +1515,19 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
> atomic_inc(&(skb_shinfo(skb)->dataref));
> skb->cloned = 1;
>
> + /* traits would end up shared with the clone,
> + * and edits would be reflected there.
> + *
> + * Is that ok? What if the original skb and the clone take different paths?
> + * Does that even happen?
> + *
> + * If that's not ok, we could copy the traits and store them in an extension header
> + * for clones.
> + *
> + * For now, pretend the clone doesn't have any traits.
> + */
> + skb->traits_type = SKB_TRAITS_NONE;
> +
> return n;
> #undef C
> }
> @@ -2170,7 +2183,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> unsigned int osize = skb_end_offset(skb);
> unsigned int size = osize + nhead + ntail;
> long off;
> - u8 *data;
> + u8 *data, *head;
> int i;
>
> BUG_ON(nhead < 0);
> @@ -2187,10 +2200,18 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> goto nodata;
> size = SKB_WITH_OVERHEAD(size);
>
> + head = skb->head;
> + if (skb->traits_type != SKB_TRAITS_NONE) {
> + head = skb_traits(skb) + traits_size(skb_traits(skb));
> + /* struct xdp_frame isn't needed in the headroom, drop it */
> + memcpy(data, skb_traits(skb), traits_size(skb_traits(skb)));
> + skb->traits_type = SKB_TRAITS_AT_HEAD;
> + }
> +
> /* Copy only real data... and, alas, header. This should be
> * optimized for the cases when header is void.
> */
> - memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
> + memcpy(data + nhead, head, skb_tail_pointer(skb) - head);
>
> memcpy((struct skb_shared_info *)(data + size),
> skb_shinfo(skb),
>
> --
> 2.43.0
>
>
* Re: [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata
2025-03-07 17:29 ` Alexei Starovoitov
@ 2025-03-10 14:45 ` Arthur Fabre
0 siblings, 0 replies; 37+ messages in thread
From: Arthur Fabre @ 2025-03-10 14:45 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Network Development, bpf, Jakub Sitnicki, Jesper Dangaard Brouer,
Yan Zhai, jbrandeburg, Toke Høiland-Jørgensen, lbiancon,
Arthur Fabre
On Fri Mar 7, 2025 at 6:29 PM CET, Alexei Starovoitov wrote:
> On Fri, Mar 7, 2025 at 3:14 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
> >
> > On Fri Mar 7, 2025 at 7:36 AM CET, Alexei Starovoitov wrote:
> > > On Wed, Mar 5, 2025 at 6:33 AM <arthur@arthurfabre.com> wrote:
> > > >
> > > > +struct __trait_hdr {
> > > > + /* Values are stored ordered by key, immediately after the header.
> > > > + *
> > > > + * The size of each value set is stored in the header as two bits:
> > > > + * - 00: Not set.
> > > > + * - 01: 2 bytes.
> > > > + * - 10: 4 bytes.
> > > > + * - 11: 8 bytes.
> > >
> > > ...
> > >
> > > > + * - hweight(low) + hweight(high)<<1 is offset.
> > >
> > > the comment doesn't match the code
> > >
> > > > + */
> > > > + u64 high;
> > > > + u64 low;
> > >
> > > ...
> > >
> > > > +static __always_inline int __trait_total_length(struct __trait_hdr h)
> > > > +{
> > > > + return (hweight64(h.low) << 1) + (hweight64(h.high) << 2)
> > > > + // For size 8, we only get 4+2=6. Add another 2 in.
> > > > + + (hweight64(h.high & h.low) << 1);
> > > > +}
> > >
> > > This is a really cool idea, but a 2-byte size doesn't feel that useful.
> > > How about:
> > > - 00: Not set.
> > > - 01: 4 bytes.
> > > - 10: 8 bytes.
> > > - 11: 16 bytes.
> > >
> > > 4 bytes may be useful for IPv4, 16 for IPv6, and 8 is just a good number.
> > > And compute it the same way with 3 popcounts, with an extra +1 on the shifts.
> >
> > I chose the sizes arbitrarily, happy to change them.
> >
> > 16 is also useful for UUIDs, for tracing.
> >
> > Size 0 could store bools / flags. Keys could be set without a value,
> > and users could check if the key is set or not.
> > That replaces single bits of the mark today, for example a
> > "route locally" key.
>
> I don't understand how that would work.
> If I'm reading the code correctly, 00 means that the key is not present.
> How would one differentiate key present/not present with a zero-size value?
It isn't implemented right now, 0 could just be one of the three sizes.
Eg:
- 00: key not set
- 01: key set, no value
- 10: key set, 4 bytes
- 11: key set, 16 bytes
We wouldn't popcnt the low bit, just the high bit, and high & low.
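Concretely, the total length for that encoding would only need two popcounts,
something like (sketch):

	static __always_inline int __trait_total_length(struct __trait_hdr h)
	{
		/* 01 carries no bytes, so h.low alone contributes nothing. */
		return (hweight64(h.high) << 2)		 /* 10 and 11: at least 4 */
		       + hweight64(h.high & h.low) * 12; /* 11: 16 = 4 + 12 */
	}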
>
> >
> > That only leaves one other size, maybe 4 for smaller values?
> >
> > If we want more sizes, we could also:
> >
> > - Add another u64 word to the header, so we have 3 bits per key. It
> > uses more room, and we need more popcnts, but most modern x86 CPUs can
> > do 3 popcnts in parallel so it could be ok.
>
> Two u64s already need 3 pop counts, so it's a good compromise as-is.
>
> > - Let users set consecutive keys to one big value. Instead of supporting
> > size 16, we let them set two 8 byte KVs in one trait_set() call and
> > provide a 16 byte value. Eg:
> >
> > trait_set_batch(u64 key_from, u64 key_to, size, ...);
> >
> > It's easy to implement, but it makes the API more complicated.
>
> I don't think it complicates the API.
> With max size 16 the user can put two consecutive keys of, say, 16 and 8
> to make 24 bytes of info,
> or use 4 keys of 16 bytes each to form 64 bytes of info.
> The bit manipulations are too tricky for compilers to optimize.
> So even with full inlining the two trait_set() of consecutive keys
> will still be largely separate blobs of code.
> So trait_[gs]et_batch() makes sense to me.
Ok! I'll respin this with a batch API too.
>
> Also let's not over focus on networking use cases.
> This mini KV will be useful in all bpf maps including local storage.
> For example the user process can add information about itself into
> task local storage while sched-ext can use that to make scheduling decisions.
Good point, we've been focusing on networking only. Are you thinking
of the value sizes, or is there something else that would help for other
use cases?
* Re: [PATCH RFC bpf-next 02/20] trait: XDP support
2025-03-07 19:13 ` Lorenzo Bianconi
@ 2025-03-10 15:50 ` Arthur Fabre
0 siblings, 0 replies; 37+ messages in thread
From: Arthur Fabre @ 2025-03-10 15:50 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
Arthur Fabre
On Fri Mar 7, 2025 at 8:14 PM CET, Lorenzo Bianconi wrote:
> On Mar 05, arthur@arthurfabre.com wrote:
> > From: Arthur Fabre <afabre@cloudflare.com>
> >
>
> [...]
>
> > +static __always_inline void *xdp_buff_traits(const struct xdp_buff *xdp)
> > +{
> > + return xdp->data_hard_start + _XDP_FRAME_SIZE;
> > +}
> > +
> > static __always_inline void
> > xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq)
> > {
> > @@ -133,6 +139,13 @@ xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start,
> > xdp->data = data;
> > xdp->data_end = data + data_len;
> > xdp->data_meta = meta_valid ? data : data + 1;
> > +
> > + if (meta_valid) {
>
> can we relax this constraint and use xdp->data as the end boundary here?
The problem isn't having a boundary, it's patching all the drivers to
propagate that traits are present to the skb layer. See patch 8 "trait:
Propagate presence of traits to sk_buff", and patches 9-15 for driver
changes.
There's some discussion around updating all the remaining drivers to
support XDP metadata. If, instead of making them call skb_metadata_set(),
we use a more "generic" hook like xdp_buff_update_skb() from this
series, we can use it for traits later.
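Roughly (the hook name is from that discussion; the body here is just a
sketch of the idea):

	static inline void xdp_buff_update_skb(struct xdp_buff *xdp,
					       struct sk_buff *skb)
	{
		/* data_meta > data encodes "metadata unsupported" */
		if (xdp->data_meta <= xdp->data)
			skb_metadata_set(skb, xdp->data - xdp->data_meta);

		/* Trait propagation could hang off this same call site
		 * later, instead of patching every driver twice.
		 */
	}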
>
> > + /* We assume drivers reserve enough headroom to store xdp_frame
> > + * and the traits header.
> > + */
> > + traits_init(xdp_buff_traits(xdp), xdp->data_meta);
> > + }
> > }
> >
> > /* Reserve memory area at end-of data area.
> > @@ -267,6 +280,8 @@ struct xdp_frame {
> > u32 flags; /* supported values defined in xdp_buff_flags */
> > };
> >
> > +static_assert(sizeof(struct xdp_frame) == _XDP_FRAME_SIZE);
> > +
> > static __always_inline bool xdp_frame_has_frags(const struct xdp_frame *frame)
> > {
> > return !!(frame->flags & XDP_FLAGS_HAS_FRAGS);
> > @@ -517,6 +532,11 @@ static inline bool xdp_metalen_invalid(unsigned long metalen)
> > return !IS_ALIGNED(metalen, sizeof(u32)) || metalen > meta_max;
> > }
> >
> > +static __always_inline void *xdp_meta_hard_start(const struct xdp_buff *xdp)
> > +{
> > + return xdp_buff_traits(xdp) + traits_size(xdp_buff_traits(xdp));
>
> here we are always consuming sizeof(struct __trait_hdr), right? Can we do
> something smarter and check if traits are really used (e.g. by adding a flag
> in xdp_buff)?
Yes, we're always taking space from the headroom for struct __trait_hdr.
I think it's impossible to tell if traits are used or not early enough:
users could be setting a trait for the first time in iptables or TC. But
we don't know that in XDP.
>
> > +}
> > +
> > struct xdp_attachment_info {
> > struct bpf_prog *prog;
> > u32 flags;
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index dcc53ac5c5458f67a422453134665d43d466a02e..79b78e7cd57fd78c6cc8443da54ae96408c496b0 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -85,6 +85,7 @@
> > #include <linux/un.h>
> > #include <net/xdp_sock_drv.h>
> > #include <net/inet_dscp.h>
> > +#include <net/trait.h>
> >
> > #include "dev.h"
> >
> > @@ -3935,9 +3936,8 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
> >
> > BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
> > {
> > - void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
> > unsigned long metalen = xdp_get_metalen(xdp);
> > - void *data_start = xdp_frame_end + metalen;
> > + void *data_start = xdp_meta_hard_start(xdp) + metalen;
>
> We could waste 16 bytes here, right?
If traits aren't being used?
[...]
* Re: [PATCH RFC bpf-next 05/20] trait: Replace memcpy calls with inline copies
2025-03-10 10:50 ` Lorenzo Bianconi
@ 2025-03-10 15:52 ` Arthur Fabre
0 siblings, 0 replies; 37+ messages in thread
From: Arthur Fabre @ 2025-03-10 15:52 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
Arthur Fabre
On Mon Mar 10, 2025 at 11:50 AM CET, Lorenzo Bianconi wrote:
> > From: Arthur Fabre <afabre@cloudflare.com>
> >
> > When copying trait values to or from the caller, the size isn't a
> > constant so memcpy() ends up being a function call.
> >
> > Replace it with an inline implementation that only handles the sizes we
> > support.
> >
> > We store values "packed", so they won't necessarily be 4 or 8 byte
> > aligned.
> >
> > Setting and getting traits is roughly 40% faster.
>
> Nice! I guess in a formal series this patch can be squashed with patch 1/20
> (adding some comments).
Happy to squash and add comments instead if that's better :)
>
> Regards,
> Lorenzo
>
> >
> > Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
> > ---
> > include/net/trait.h | 25 +++++++++++++++++++------
> > 1 file changed, 19 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/net/trait.h b/include/net/trait.h
> > index 536b8a17dbbc091b4d1a4d7b4b21c1e36adea86a..d4581a877bd57a32e2ad032147c906764d6d37f8 100644
> > --- a/include/net/trait.h
> > +++ b/include/net/trait.h
> > @@ -7,6 +7,7 @@
> > #include <linux/errno.h>
> > #include <linux/string.h>
> > #include <linux/bitops.h>
> > +#include <linux/unaligned.h>
> >
> > /* Traits are a very limited KV store, with:
> > * - 64 keys (0-63).
> > @@ -145,23 +146,23 @@ int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u
> > memmove(traits + off + len, traits + off, traits_size(traits) - off);
> > }
> >
> > - /* Set our value. */
> > - memcpy(traits + off, val, len);
> > -
> > - /* Store our length in header. */
> > u64 encode_len = 0;
> > -
> > switch (len) {
> > case 2:
> > + /* Values are least two bytes, so they'll be two byte aligned */
> > + *(u16 *)(traits + off) = *(u16 *)val;
> > encode_len = 1;
> > break;
> > case 4:
> > + put_unaligned(*(u32 *)val, (u32 *)(traits + off));
> > encode_len = 2;
> > break;
> > case 8:
> > + put_unaligned(*(u64 *)val, (u64 *)(traits + off));
> > encode_len = 3;
> > break;
> > }
> > +
> > h->high |= (encode_len >> 1) << key;
> > h->low |= (encode_len & 1) << key;
> > return 0;
> > @@ -201,7 +202,19 @@ int trait_get(void *traits, u64 key, void *val, u64 val_len)
> > if (real_len > val_len)
> > return -ENOSPC;
> >
> > - memcpy(val, traits + off, real_len);
> > + switch (real_len) {
> > + case 2:
> > + /* Values are at least two bytes, so they'll be two-byte aligned */
> > + *(u16 *)val = *(u16 *)(traits + off);
> > + break;
> > + case 4:
> > + *(u32 *)val = get_unaligned((u32 *)(traits + off));
> > + break;
> > + case 8:
> > + *(u64 *)val = get_unaligned((u64 *)(traits + off));
> > + break;
> > + }
> > +
> > return real_len;
> > }
> >
> >
> > --
> > 2.43.0
> >
> >
^ permalink raw reply [flat|nested] 37+ messages in thread
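For reference, the speedup here comes from replacing one variable-length memcpy() call with a switch whose arms each copy a compile-time-constant size, which the compiler lowers to single loads and stores. A stand-alone sketch of the same technique (names illustrative, not the kernel code):

    #include <stdint.h>
    #include <string.h>

    /* Copy a packed trait value of a known-supported size.
     * A constant-size memcpy() is inlined by the compiler into one
     * (possibly unaligned) load/store, so no function call is
     * emitted -- the same effect get_unaligned()/put_unaligned()
     * achieve in the patch above.
     */
    static inline void copy_trait_val(void *dst, const void *src, uint64_t len)
    {
    	switch (len) {
    	case 2:
    		memcpy(dst, src, 2);	/* one 16-bit access */
    		break;
    	case 4:
    		memcpy(dst, src, 4);	/* one 32-bit access */
    		break;
    	case 8:
    		memcpy(dst, src, 8);	/* one 64-bit access */
    		break;
    	}
    }

Unlike a plain cast-and-dereference, a constant-size memcpy() is well defined for unaligned pointers; the 2-byte arm in the patch can cast directly only because packed 2/4/8-byte values always land on even offsets.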
* Re: [PATCH RFC bpf-next 05/20] trait: Replace memcpy calls with inline copies
2025-03-05 14:32 ` [PATCH RFC bpf-next 05/20] trait: Replace memcpy calls with inline copies arthur
2025-03-10 10:50 ` Lorenzo Bianconi
@ 2025-03-10 22:15 ` David Laight
1 sibling, 0 replies; 37+ messages in thread
From: David Laight @ 2025-03-10 22:15 UTC (permalink / raw)
To: arthur
Cc: netdev, bpf, jakub, hawk, yan, jbrandeburg, thoiland, lbiancon,
Arthur Fabre
On Wed, 05 Mar 2025 15:32:02 +0100
arthur@arthurfabre.com wrote:
> From: Arthur Fabre <afabre@cloudflare.com>
>
> When copying trait values to or from the caller, the size isn't a
> constant, so memcpy() ends up being a function call.
>
> Replace it with an inline implementation that only handles the sizes we
> support.
>
> We store values "packed", so they won't necessarily be 4 or 8 byte
> aligned.
>
> Setting and getting traits is roughly 40% faster.
>
> Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
> ---
> include/net/trait.h | 25 +++++++++++++++++++------
> 1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/trait.h b/include/net/trait.h
> index 536b8a17dbbc091b4d1a4d7b4b21c1e36adea86a..d4581a877bd57a32e2ad032147c906764d6d37f8 100644
> --- a/include/net/trait.h
> +++ b/include/net/trait.h
> @@ -7,6 +7,7 @@
> #include <linux/errno.h>
> #include <linux/string.h>
> #include <linux/bitops.h>
> +#include <linux/unaligned.h>
>
> /* Traits are a very limited KV store, with:
> * - 64 keys (0-63).
> @@ -145,23 +146,23 @@ int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u
> memmove(traits + off + len, traits + off, traits_size(traits) - off);
> }
>
> - /* Set our value. */
> - memcpy(traits + off, val, len);
> -
> - /* Store our length in header. */
> u64 encode_len = 0;
> -
> switch (len) {
> case 2:
> + /* Values are at least two bytes, so they'll be two-byte aligned */
> + *(u16 *)(traits + off) = *(u16 *)val;
> encode_len = 1;
> break;
> case 4:
> + put_unaligned(*(u32 *)val, (u32 *)(traits + off));
> encode_len = 2;
> break;
> case 8:
> + put_unaligned(*(u64 *)val, (u64 *)(traits + off));
> encode_len = 3;
> break;
> }
> +
> h->high |= (encode_len >> 1) << key;
> h->low |= (encode_len & 1) << key;
> return 0;
> @@ -201,7 +202,19 @@ int trait_get(void *traits, u64 key, void *val, u64 val_len)
> if (real_len > val_len)
> return -ENOSPC;
>
> - memcpy(val, traits + off, real_len);
> + switch (real_len) {
> + case 2:
> + /* Values are at least two bytes, so they'll be two-byte aligned */
> + *(u16 *)val = *(u16 *)(traits + off);
> + break;
> + case 4:
> + *(u32 *)val = get_unaligned((u32 *)(traits + off));
> + break;
> + case 8:
> + *(u64 *)val = get_unaligned((u64 *)(traits + off));
> + break;
Should there be a 'default' in here?
Possibly just 'return 0'?
> + }
> +
> return real_len;
> }
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread
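A sketch of what the suggested default could look like in trait_get() (untested; whether 0 or an error code is the right return is exactly the open question raised above):

    	switch (real_len) {
    	case 2:
    		/* Values are at least two bytes, so two-byte aligned. */
    		*(u16 *)val = *(u16 *)(traits + off);
    		break;
    	case 4:
    		*(u32 *)val = get_unaligned((u32 *)(traits + off));
    		break;
    	case 8:
    		*(u64 *)val = get_unaligned((u64 *)(traits + off));
    		break;
    	default:
    		/* The two-bit length encoding only yields 2-, 4- or
    		 * 8-byte values for a set key, so this should be
    		 * unreachable; return 0 ("nothing copied") defensively.
    		 */
    		return 0;
    	}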
end of thread
Thread overview: 37+ messages
2025-03-05 14:31 [PATCH RFC bpf-next 00/20] traits: Per packet metadata KV store arthur
2025-03-05 14:31 ` [PATCH RFC bpf-next 01/20] trait: limited KV store for packet metadata arthur
2025-03-07 6:36 ` Alexei Starovoitov
2025-03-07 11:14 ` Arthur Fabre
2025-03-07 17:29 ` Alexei Starovoitov
2025-03-10 14:45 ` Arthur Fabre
2025-03-07 19:24 ` Jakub Sitnicki
2025-03-05 14:31 ` [PATCH RFC bpf-next 02/20] trait: XDP support arthur
2025-03-07 19:13 ` Lorenzo Bianconi
2025-03-10 15:50 ` Arthur Fabre
2025-03-05 14:32 ` [PATCH RFC bpf-next 03/20] trait: basic XDP selftest arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 04/20] trait: basic XDP benchmark arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 05/20] trait: Replace memcpy calls with inline copies arthur
2025-03-10 10:50 ` Lorenzo Bianconi
2025-03-10 15:52 ` Arthur Fabre
2025-03-10 22:15 ` David Laight
2025-03-05 14:32 ` [PATCH RFC bpf-next 06/20] trait: Replace memmove calls with inline move arthur
2025-03-06 10:14 ` Jesper Dangaard Brouer
2025-03-05 14:32 ` [PATCH RFC bpf-next 07/20] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions arthur
2025-03-05 15:24 ` Alexander Lobakin
2025-03-05 17:02 ` Arthur Fabre
2025-03-06 11:12 ` Jesper Dangaard Brouer
2025-03-10 11:10 ` Lorenzo Bianconi
2025-03-05 14:32 ` [PATCH RFC bpf-next 08/20] trait: Propagate presence of traits to sk_buff arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 09/20] bnxt: Propagate trait presence to skb arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 10/20] ice: " arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 11/20] veth: " arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 12/20] virtio_net: " arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 13/20] mlx5: move xdp_buff scope one level up arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 14/20] mlx5: Propagate trait presence to skb arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 15/20] xdp generic: " arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 16/20] trait: Support sk_buffs arthur
2025-03-10 11:45 ` Lorenzo Bianconi
2025-03-05 14:32 ` [PATCH RFC bpf-next 17/20] trait: Allow socket filters to access traits arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 18/20] trait: registration API arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 19/20] trait: Sync linux/bpf.h to tools/ for trait registration arthur
2025-03-05 14:32 ` [PATCH RFC bpf-next 20/20] trait: register traits in benchmarks and tests arthur