* [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
@ 2025-07-02 14:58 Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 1/7] net: xdp: Add xdp_rx_meta structure Jesper Dangaard Brouer
` (7 more replies)
0 siblings, 8 replies; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-02 14:58 UTC (permalink / raw)
To: bpf, netdev, Jakub Kicinski, lorenzo
Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
arthur, jakub
This patch series introduces a mechanism for an XDP program to store RX
metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
into the xdp_frame. These stored hints are then used to populate the
corresponding fields in the SKB that is created from the xdp_frame
following an XDP_REDIRECT.
The chosen RX metadata hints intentionally map to the existing NIC
hardware metadata that can be read via kfuncs [1]. While this design
allows a BPF program to read and propagate existing hardware hints, our
primary motivation is to enable setting custom values. This is important
for use cases where the hardware-provided information is insufficient or
needs to be calculated based on packet contents unavailable to the
hardware.
The primary motivation for this feature is to enable scalable load
balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
the networking stack later calculates a software hash based on the outer
headers. For a single tunnel, these outer headers are often identical,
causing all packets to be assigned the same hash. This collapses all
traffic onto a single RX queue, creating a performance bottleneck and
defeating receive-side scaling (RSS).
Our immediate use case involves load balancing IPsec traffic. For such
tunnelled traffic, any hardware-provided RX hash is calculated on the
outer headers and is therefore incorrect for distributing inner flows.
There is no reason to read the existing value, as it must be recalculated.
In our XDP program, we perform a partial decryption to access the inner
headers and calculate a new load-balancing hash, which provides better
flow distribution. However, without this patch set, there is no way to
persist this new hash for the network stack to use post-redirect.
This series solves the problem by introducing new BPF kfuncs that allow an
XDP program to write e.g. the hash value into the xdp_frame. The
__xdp_build_skb_from_frame() function is modified to use this stored value
to set skb->hash on the newly created SKB. As a result, the veth driver's
queue selection logic uses the BPF-supplied hash, achieving proper
traffic distribution across multiple CPU cores. This also ensures that
consumers, like the GRO engine, can operate effectively.
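For illustration, here is a minimal sketch of the intended usage (not part of
the series; it mirrors the selftest in patch 6, and the toy L3 hash plus the
program/map names are placeholders for a real inner-header hash computed after
partial decryption):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define ETH_P_IP 0x0800

extern int bpf_xdp_store_rx_hash(struct xdp_md *ctx, __u32 hash,
				 enum xdp_rss_hash_type rss_type) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(struct bpf_devmap_val));
	__uint(max_entries, 1);
} dev_map SEC(".maps");

SEC("xdp")
int lb_store_hash_redirect(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph = (void *)(eth + 1);
	__u32 hash;

	if ((void *)(iph + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	/* Toy hash over the IP addresses; a real program would derive this
	 * from the tunnel's inner headers instead. */
	hash = iph->saddr ^ iph->daddr;

	/* Persist the hash so the SKB built after the redirect gets skb->hash */
	bpf_xdp_store_rx_hash(ctx, hash, XDP_RSS_TYPE_L3_IPV4);

	return bpf_redirect_map(&dev_map, 0, XDP_PASS);
}

char _license[] SEC("license") = "GPL";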
We considered XDP traits as an alternative to adding static members to
struct xdp_frame. Given the immediate need for this functionality and the
current development status of traits, we believe this approach is a
pragmatic solution. We are open to migrating to a traits-based
implementation if and when they become a generally accepted mechanism for
such extensions.
[1] https://docs.kernel.org/networking/xdp-rx-metadata.html
---
V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/
Jesper Dangaard Brouer (2):
selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect
net: xdp: update documentation for xdp-rx-metadata.rst
Lorenzo Bianconi (5):
net: xdp: Add xdp_rx_meta structure
net: xdp: Add kfuncs to store hw metadata in xdp_buff
net: xdp: Set skb hw metadata from xdp_frame
net: veth: Read xdp metadata from rx_meta struct if available
bpf: selftests: Add rx_meta store kfuncs selftest
Documentation/networking/xdp-rx-metadata.rst | 77 ++++++--
drivers/net/veth.c | 12 ++
include/net/xdp.h | 134 ++++++++++++--
net/core/xdp.c | 107 ++++++++++-
net/xdp/xsk_buff_pool.c | 4 +-
.../bpf/prog_tests/xdp_do_redirect.c | 6 +-
.../selftests/bpf/prog_tests/xdp_rxmeta.c | 166 ++++++++++++++++++
.../selftests/bpf/progs/xdp_rxmeta_receiver.c | 44 +++++
.../selftests/bpf/progs/xdp_rxmeta_redirect.c | 43 +++++
9 files changed, 558 insertions(+), 35 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
create mode 100644 tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
create mode 100644 tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
--
* [PATCH bpf-next V2 1/7] net: xdp: Add xdp_rx_meta structure
2025-07-02 14:58 [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Jesper Dangaard Brouer
@ 2025-07-02 14:58 ` Jesper Dangaard Brouer
2025-07-17 9:19 ` Jakub Sitnicki
2025-07-02 14:58 ` [PATCH bpf-next V2 2/7] selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect Jesper Dangaard Brouer
` (6 subsequent siblings)
7 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-02 14:58 UTC (permalink / raw)
To: bpf, netdev, Jakub Kicinski, lorenzo
Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
arthur, jakub
From: Lorenzo Bianconi <lorenzo@kernel.org>
Introduce the `xdp_rx_meta` structure to serve as a container for XDP RX
hardware hints within XDP packet buffers. Initially, this structure will
accommodate `rx_hash` and `rx_vlan` metadata. (The `rx_timestamp` hint will
get stored in `skb_shared_info`).
A key design aspect is making this metadata accessible both during BPF
program execution (via `struct xdp_buff`) and later if a `struct
xdp_frame` is materialized (e.g., for XDP_REDIRECT).
To achieve this:
- The `struct xdp_frame` embeds an `xdp_rx_meta` field directly for
storage.
- The `struct xdp_buff` includes an `xdp_rx_meta` pointer. This pointer
is initialized (in `xdp_prepare_buff`) to point to the memory location
within the packet buffer's headroom where the `xdp_frame`'s embedded
`rx_meta` field would reside.
This setup allows BPF kfuncs, operating on `xdp_buff`, to populate the
metadata in the precise location where it will be found if an `xdp_frame`
is subsequently created.
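In other words (a simplified sketch, not verbatim kernel code):

	/* xdp_convert_buff_to_frame() places the xdp_frame at the start of
	 * the headroom, i.e. at data_hard_start: */
	struct xdp_frame *xdpf = (struct xdp_frame *)xdp->data_hard_start;

	/* xdp_prepare_buff() set up:
	 *   xdp->rx_meta = data_hard_start + offsetof(struct xdp_frame, rx_meta)
	 * so after conversion xdp->rx_meta == &xdpf->rx_meta, and kfunc writes
	 * made while the BPF program ran already sit in the frame's embedded
	 * field (only the zero-copy/XSK path needs an explicit copy).
	 */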
The availability of this metadata storage area within the buffer is
indicated by the `XDP_FLAGS_META_AREA` flag in `xdp_buff->flags` (and
propagated to `xdp_frame->flags`). This flag is only set if sufficient
headroom (at least `XDP_MIN_HEADROOM`, currently 192 bytes) is present.
Specific hints like `XDP_FLAGS_META_RX_HASH` and `XDP_FLAGS_META_RX_VLAN`
will then denote which types of metadata have been populated into the
`xdp_rx_meta` structure.
This patch is a step toward enabling the preservation and use of XDP RX
hints across operations like XDP_REDIRECT.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
include/net/xdp.h | 57 +++++++++++++++++++++++++++++++++++------------
net/core/xdp.c | 1 +
net/xdp/xsk_buff_pool.c | 4 ++-
3 files changed, 47 insertions(+), 15 deletions(-)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index b40f1f96cb11..f52742a25212 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -71,11 +71,31 @@ struct xdp_txq_info {
struct net_device *dev;
};
+struct xdp_rx_meta {
+ struct xdp_rx_meta_hash {
+ u32 val;
+ u32 type; /* enum xdp_rss_hash_type */
+ } hash;
+ struct xdp_rx_meta_vlan {
+ __be16 proto;
+ u16 tci;
+ } vlan;
+};
+
+/* Storage area for HW RX metadata only available with reasonable headroom
+ * available. Less than XDP_PACKET_HEADROOM due to Intel drivers.
+ */
+#define XDP_MIN_HEADROOM 192
+
enum xdp_buff_flags {
XDP_FLAGS_HAS_FRAGS = BIT(0), /* non-linear xdp buff */
XDP_FLAGS_FRAGS_PF_MEMALLOC = BIT(1), /* xdp paged memory is under
* pressure
*/
+ XDP_FLAGS_META_AREA = BIT(2), /* storage area available */
+ XDP_FLAGS_META_RX_HASH = BIT(3), /* hw rx hash */
+ XDP_FLAGS_META_RX_VLAN = BIT(4), /* hw rx vlan */
+ XDP_FLAGS_META_RX_TS = BIT(5), /* hw rx timestamp */
};
struct xdp_buff {
@@ -87,6 +107,24 @@ struct xdp_buff {
struct xdp_txq_info *txq;
u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
u32 flags; /* supported values defined in xdp_buff_flags */
+ struct xdp_rx_meta *rx_meta; /* rx hw metadata pointer in the
+ * buffer headroom
+ */
+};
+
+struct xdp_frame {
+ void *data;
+ u32 len;
+ u32 headroom;
+ u32 metasize; /* uses lower 8-bits */
+ /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+ * while mem_type is valid on remote CPU.
+ */
+ enum xdp_mem_type mem_type:32;
+ struct net_device *dev_rx; /* used by cpumap */
+ u32 frame_sz;
+ u32 flags; /* supported values defined in xdp_buff_flags */
+ struct xdp_rx_meta rx_meta; /* rx hw metadata */
};
static __always_inline bool xdp_buff_has_frags(const struct xdp_buff *xdp)
@@ -133,6 +171,9 @@ xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start,
xdp->data = data;
xdp->data_end = data + data_len;
xdp->data_meta = meta_valid ? data : data + 1;
+ xdp->flags = (headroom < XDP_MIN_HEADROOM) ? 0 : XDP_FLAGS_META_AREA;
+ xdp->rx_meta = (void *)(hard_start +
+ offsetof(struct xdp_frame, rx_meta));
}
/* Reserve memory area at end-of data area.
@@ -253,20 +294,6 @@ static inline bool xdp_buff_add_frag(struct xdp_buff *xdp, netmem_ref netmem,
return true;
}
-struct xdp_frame {
- void *data;
- u32 len;
- u32 headroom;
- u32 metasize; /* uses lower 8-bits */
- /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
- * while mem_type is valid on remote CPU.
- */
- enum xdp_mem_type mem_type:32;
- struct net_device *dev_rx; /* used by cpumap */
- u32 frame_sz;
- u32 flags; /* supported values defined in xdp_buff_flags */
-};
-
static __always_inline bool xdp_frame_has_frags(const struct xdp_frame *frame)
{
return !!(frame->flags & XDP_FLAGS_HAS_FRAGS);
@@ -355,6 +382,8 @@ void xdp_convert_frame_to_buff(const struct xdp_frame *frame,
xdp->data_meta = frame->data - frame->metasize;
xdp->frame_sz = frame->frame_sz;
xdp->flags = frame->flags;
+ xdp->rx_meta = xdp->data_hard_start +
+ offsetof(struct xdp_frame, rx_meta);
}
static inline
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 491334b9b8be..bd3110fc7ef8 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -606,6 +606,7 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
xdpf->metasize = metasize;
xdpf->frame_sz = PAGE_SIZE;
xdpf->mem_type = MEM_TYPE_PAGE_ORDER0;
+ memcpy(&xdpf->rx_meta, xdp->rx_meta, sizeof(*xdp->rx_meta));
xsk_buff_free(xdp);
return xdpf;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index aa9788f20d0d..de42dacdcb25 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -574,7 +574,9 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
xskb->xdp.data = xskb->xdp.data_hard_start + XDP_PACKET_HEADROOM;
xskb->xdp.data_meta = xskb->xdp.data;
- xskb->xdp.flags = 0;
+ xskb->xdp.flags = XDP_FLAGS_META_AREA;
+ xskb->xdp.rx_meta = (void *)(xskb->xdp.data_hard_start +
+ offsetof(struct xdp_frame, rx_meta));
if (pool->dev)
xp_dma_sync_for_device(pool, xskb->dma, pool->frame_len);
* [PATCH bpf-next V2 2/7] selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect
2025-07-02 14:58 [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 1/7] net: xdp: Add xdp_rx_meta structure Jesper Dangaard Brouer
@ 2025-07-02 14:58 ` Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff Jesper Dangaard Brouer
` (5 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-02 14:58 UTC (permalink / raw)
To: bpf, netdev, Jakub Kicinski, lorenzo
Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
arthur, jakub
This patchset increased the size of xdp_buff by 8 bytes (a pointer), and the
bpf/test_run struct xdp_page_head contains two xdp_buff's. Thus, adjust the
test by 16 bytes.
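That is, the delta is 2 * sizeof(void *) = 16 bytes on 64-bit: 3408 - 16 = 3392
for the 64-byte cacheline case and 3216 - 16 = 3200 for the 256-byte (s390x)
case, matching the values in the hunk below.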
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
.../selftests/bpf/prog_tests/xdp_do_redirect.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c b/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c
index dd34b0cc4b4e..35c65518f55a 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c
@@ -59,12 +59,12 @@ static int attach_tc_prog(struct bpf_tc_hook *hook, int fd)
/* The maximum permissible size is: PAGE_SIZE - sizeof(struct xdp_page_head) -
* SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) - XDP_PACKET_HEADROOM =
- * 3408 bytes for 64-byte cacheline and 3216 for 256-byte one.
+ * 3392 bytes for 64-byte cacheline and 3200 for 256-byte one.
*/
#if defined(__s390x__)
-#define MAX_PKT_SIZE 3216
+#define MAX_PKT_SIZE 3200
#else
-#define MAX_PKT_SIZE 3408
+#define MAX_PKT_SIZE 3392
#endif
#define PAGE_SIZE_4K 4096
* [PATCH bpf-next V2 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff
2025-07-02 14:58 [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 1/7] net: xdp: Add xdp_rx_meta structure Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 2/7] selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect Jesper Dangaard Brouer
@ 2025-07-02 14:58 ` Jesper Dangaard Brouer
2025-07-03 11:41 ` Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 4/7] net: xdp: Set skb hw metadata from xdp_frame Jesper Dangaard Brouer
` (4 subsequent siblings)
7 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-02 14:58 UTC (permalink / raw)
To: bpf, netdev, Jakub Kicinski, lorenzo
Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
arthur, jakub
From: Lorenzo Bianconi <lorenzo@kernel.org>
Introduce the following kfuncs to store hw metadata provided by the NIC
into the xdp_buff struct:
- rx-hash: bpf_xdp_store_rx_hash
- rx-vlan: bpf_xdp_store_rx_vlan
- rx-hw-ts: bpf_xdp_store_rx_ts
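From a BPF program, these can be declared the same way the selftests declare
the existing reader kfuncs (a sketch, assuming vmlinux.h provides the types):

extern int bpf_xdp_store_rx_hash(struct xdp_md *ctx, __u32 hash,
				 enum xdp_rss_hash_type rss_type) __ksym;
extern int bpf_xdp_store_rx_vlan(struct xdp_md *ctx, __be16 vlan_proto,
				 __u16 vlan_tci) __ksym;
extern int bpf_xdp_store_rx_ts(struct xdp_md *ctx, __u64 ts) __ksym;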
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
include/net/xdp.h | 5 +++++
net/core/xdp.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index f52742a25212..8c7d47e3609b 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -153,6 +153,11 @@ static __always_inline void xdp_buff_set_frag_pfmemalloc(struct xdp_buff *xdp)
xdp->flags |= XDP_FLAGS_FRAGS_PF_MEMALLOC;
}
+static __always_inline bool xdp_buff_has_valid_meta_area(struct xdp_buff *xdp)
+{
+ return !!(xdp->flags & XDP_FLAGS_META_AREA);
+}
+
static __always_inline void
xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq)
{
diff --git a/net/core/xdp.c b/net/core/xdp.c
index bd3110fc7ef8..1ffba57714ea 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -963,12 +963,57 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
return -EOPNOTSUPP;
}
+__bpf_kfunc int bpf_xdp_store_rx_hash(struct xdp_md *ctx, u32 hash,
+ enum xdp_rss_hash_type rss_type)
+{
+ struct xdp_buff *xdp = (struct xdp_buff *)ctx;
+
+ if (!xdp_buff_has_valid_meta_area(xdp))
+ return -ENOSPC;
+
+ xdp->rx_meta->hash.val = hash;
+ xdp->rx_meta->hash.type = rss_type;
+ xdp->flags |= XDP_FLAGS_META_RX_HASH;
+
+ return 0;
+}
+
+__bpf_kfunc int bpf_xdp_store_rx_vlan(struct xdp_md *ctx, __be16 vlan_proto,
+ u16 vlan_tci)
+{
+ struct xdp_buff *xdp = (struct xdp_buff *)ctx;
+
+ if (!xdp_buff_has_valid_meta_area(xdp))
+ return -ENOSPC;
+
+ xdp->rx_meta->vlan.proto = vlan_proto;
+ xdp->rx_meta->vlan.tci = vlan_tci;
+ xdp->flags |= XDP_FLAGS_META_RX_VLAN;
+
+ return 0;
+}
+
+__bpf_kfunc int bpf_xdp_store_rx_ts(struct xdp_md *ctx, u64 ts)
+{
+ struct xdp_buff *xdp = (struct xdp_buff *)ctx;
+ struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+ struct skb_shared_hwtstamps *shwt = &sinfo->hwtstamps;
+
+ shwt->hwtstamp = ts;
+ xdp->flags |= XDP_FLAGS_META_RX_TS;
+
+ return 0;
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(xdp_metadata_kfunc_ids)
#define XDP_METADATA_KFUNC(_, __, name, ___) BTF_ID_FLAGS(func, name, KF_TRUSTED_ARGS)
XDP_METADATA_KFUNC_xxx
#undef XDP_METADATA_KFUNC
+BTF_ID_FLAGS(func, bpf_xdp_store_rx_hash)
+BTF_ID_FLAGS(func, bpf_xdp_store_rx_vlan)
+BTF_ID_FLAGS(func, bpf_xdp_store_rx_ts)
BTF_KFUNCS_END(xdp_metadata_kfunc_ids)
static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
* [PATCH bpf-next V2 4/7] net: xdp: Set skb hw metadata from xdp_frame
2025-07-02 14:58 [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Jesper Dangaard Brouer
` (2 preceding siblings ...)
2025-07-02 14:58 ` [PATCH bpf-next V2 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff Jesper Dangaard Brouer
@ 2025-07-02 14:58 ` Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 5/7] net: veth: Read xdp metadata from rx_meta struct if available Jesper Dangaard Brouer
` (3 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-02 14:58 UTC (permalink / raw)
To: bpf, netdev, Jakub Kicinski, lorenzo
Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
arthur, jakub
From: Lorenzo Bianconi <lorenzo@kernel.org>
Set the following hw metadata, provided by the NIC, when building the skb
from an xdp_frame:
- rx hash
- rx vlan
- rx hw-ts
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
include/net/xdp.h | 15 +++++++++++++++
net/core/xdp.c | 29 ++++++++++++++++++++++++++++-
2 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 8c7d47e3609b..3d1a9711fe82 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -310,6 +310,21 @@ xdp_frame_is_frag_pfmemalloc(const struct xdp_frame *frame)
return !!(frame->flags & XDP_FLAGS_FRAGS_PF_MEMALLOC);
}
+static __always_inline bool xdp_frame_has_rx_meta_hash(struct xdp_frame *frame)
+{
+ return !!(frame->flags & XDP_FLAGS_META_RX_HASH);
+}
+
+static __always_inline bool xdp_frame_has_rx_meta_vlan(struct xdp_frame *frame)
+{
+ return !!(frame->flags & XDP_FLAGS_META_RX_VLAN);
+}
+
+static __always_inline bool xdp_frame_has_rx_meta_ts(struct xdp_frame *frame)
+{
+ return !!(frame->flags & XDP_FLAGS_META_RX_TS);
+}
+
#define XDP_BULK_QUEUE_SIZE 16
struct xdp_frame_bulk {
int count;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 1ffba57714ea..f1b2a3b4ba95 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -792,6 +792,23 @@ struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
}
EXPORT_SYMBOL_GPL(xdp_build_skb_from_zc);
+static void xdp_set_skb_rx_hash_from_meta(struct xdp_frame *frame,
+ struct sk_buff *skb)
+{
+ enum pkt_hash_types hash_type = PKT_HASH_TYPE_NONE;
+
+ if (!xdp_frame_has_rx_meta_hash(frame))
+ return;
+
+ if (frame->rx_meta.hash.type & XDP_RSS_TYPE_L4_ANY)
+ hash_type = PKT_HASH_TYPE_L4;
+ else if (frame->rx_meta.hash.type & (XDP_RSS_TYPE_L3_IPV4 |
+ XDP_RSS_TYPE_L3_IPV6))
+ hash_type = PKT_HASH_TYPE_L3;
+
+ skb_set_hash(skb, frame->rx_meta.hash.val, hash_type);
+}
+
struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
struct sk_buff *skb,
struct net_device *dev)
@@ -800,11 +817,15 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
unsigned int headroom, frame_size;
void *hard_start;
u8 nr_frags;
+ u64 ts;
/* xdp frags frame */
if (unlikely(xdp_frame_has_frags(xdpf)))
nr_frags = sinfo->nr_frags;
+ if (unlikely(xdp_frame_has_rx_meta_ts(xdpf)))
+ ts = sinfo->hwtstamps.hwtstamp;
+
/* Part of headroom was reserved to xdpf */
headroom = sizeof(*xdpf) + xdpf->headroom;
@@ -832,9 +853,15 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
/* Essential SKB info: protocol and skb->dev */
skb->protocol = eth_type_trans(skb, dev);
+ xdp_set_skb_rx_hash_from_meta(xdpf, skb);
+ if (xdp_frame_has_rx_meta_vlan(xdpf))
+ __vlan_hwaccel_put_tag(skb, xdpf->rx_meta.vlan.proto,
+ xdpf->rx_meta.vlan.tci);
+ if (unlikely(xdp_frame_has_rx_meta_ts(xdpf)))
+ skb_hwtstamps(skb)->hwtstamp = ts;
+
/* Optional SKB info, currently missing:
* - HW checksum info (skb->ip_summed)
- * - HW RX hash (skb_set_hash)
* - RX ring dev queue index (skb_record_rx_queue)
*/
* [PATCH bpf-next V2 5/7] net: veth: Read xdp metadata from rx_meta struct if available
2025-07-02 14:58 [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Jesper Dangaard Brouer
` (3 preceding siblings ...)
2025-07-02 14:58 ` [PATCH bpf-next V2 4/7] net: xdp: Set skb hw metadata from xdp_frame Jesper Dangaard Brouer
@ 2025-07-02 14:58 ` Jesper Dangaard Brouer
2025-07-17 12:11 ` Jakub Sitnicki
2025-07-02 14:58 ` [PATCH bpf-next V2 6/7] bpf: selftests: Add rx_meta store kfuncs selftest Jesper Dangaard Brouer
` (2 subsequent siblings)
7 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-02 14:58 UTC (permalink / raw)
To: bpf, netdev, Jakub Kicinski, lorenzo
Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
arthur, jakub
From: Lorenzo Bianconi <lorenzo@kernel.org>
Report xdp_rx_meta info, if available in the xdp_buff struct, from the
xdp_metadata_ops callbacks of the veth driver.
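With this in place, a second XDP program attached to the receiving veth peer
can read back the values stored before the redirect through the standard
reader kfuncs. A condensed sketch of the patch 6 receiver (assuming the usual
vmlinux.h/bpf_helpers.h includes; the program name is illustrative):

extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;

SEC("xdp")
int rxmeta_reader(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type rss_type;
	__u32 hash;

	/* Served from xdp->rx_meta when XDP_FLAGS_META_RX_HASH is set,
	 * otherwise veth falls back to the skb path (or -ENODATA). */
	if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type))
		bpf_printk("rx_hash 0x%x type %d", hash, rss_type);

	return XDP_PASS;
}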
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
drivers/net/veth.c | 12 +++++++++++
include/net/xdp.h | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a3046142cb8e..c3a08b7d8192 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1651,6 +1651,10 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
static int veth_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
{
struct veth_xdp_buff *_ctx = (void *)ctx;
+ const struct xdp_buff *xdp = &_ctx->xdp;
+
+ if (!xdp_load_rx_ts_from_buff(xdp, timestamp))
+ return 0;
if (!_ctx->skb)
return -ENODATA;
@@ -1663,8 +1667,12 @@ static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
enum xdp_rss_hash_type *rss_type)
{
struct veth_xdp_buff *_ctx = (void *)ctx;
+ const struct xdp_buff *xdp = &_ctx->xdp;
struct sk_buff *skb = _ctx->skb;
+ if (!xdp_load_rx_hash_from_buff(xdp, hash, rss_type))
+ return 0;
+
if (!skb)
return -ENODATA;
@@ -1678,9 +1686,13 @@ static int veth_xdp_rx_vlan_tag(const struct xdp_md *ctx, __be16 *vlan_proto,
u16 *vlan_tci)
{
const struct veth_xdp_buff *_ctx = (void *)ctx;
+ const struct xdp_buff *xdp = &_ctx->xdp;
const struct sk_buff *skb = _ctx->skb;
int err;
+ if (!xdp_load_rx_vlan_tag_from_buff(xdp, vlan_proto, vlan_tci))
+ return 0;
+
if (!skb)
return -ENODATA;
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 3d1a9711fe82..2b495feedfb0 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -158,6 +158,23 @@ static __always_inline bool xdp_buff_has_valid_meta_area(struct xdp_buff *xdp)
return !!(xdp->flags & XDP_FLAGS_META_AREA);
}
+static __always_inline bool
+xdp_buff_has_rx_meta_hash(const struct xdp_buff *xdp)
+{
+ return !!(xdp->flags & XDP_FLAGS_META_RX_HASH);
+}
+
+static __always_inline bool
+xdp_buff_has_rx_meta_vlan(const struct xdp_buff *xdp)
+{
+ return !!(xdp->flags & XDP_FLAGS_META_RX_VLAN);
+}
+
+static __always_inline bool xdp_buff_has_rx_meta_ts(const struct xdp_buff *xdp)
+{
+ return !!(xdp->flags & XDP_FLAGS_META_RX_TS);
+}
+
static __always_inline void
xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq)
{
@@ -712,4 +729,44 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
return act;
}
+
+static inline int xdp_load_rx_hash_from_buff(const struct xdp_buff *xdp,
+ u32 *hash,
+ enum xdp_rss_hash_type *rss_type)
+{
+ if (!xdp_buff_has_rx_meta_hash(xdp))
+ return -ENODATA;
+
+ *hash = xdp->rx_meta->hash.val;
+ *rss_type = xdp->rx_meta->hash.type;
+
+ return 0;
+}
+
+static inline int xdp_load_rx_vlan_tag_from_buff(const struct xdp_buff *xdp,
+ __be16 *vlan_proto,
+ u16 *vlan_tci)
+{
+ if (!xdp_buff_has_rx_meta_vlan(xdp))
+ return -ENODATA;
+
+ *vlan_proto = xdp->rx_meta->vlan.proto;
+ *vlan_tci = xdp->rx_meta->vlan.tci;
+
+ return 0;
+}
+
+static inline int xdp_load_rx_ts_from_buff(const struct xdp_buff *xdp, u64 *ts)
+{
+ struct skb_shared_info *sinfo;
+
+ if (!xdp_buff_has_rx_meta_ts(xdp))
+ return -ENODATA;
+
+ sinfo = xdp_get_shared_info_from_buff(xdp);
+ *ts = sinfo->hwtstamps.hwtstamp;
+
+ return 0;
+}
+
#endif /* __LINUX_NET_XDP_H__ */
* [PATCH bpf-next V2 6/7] bpf: selftests: Add rx_meta store kfuncs selftest
2025-07-02 14:58 [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Jesper Dangaard Brouer
` (4 preceding siblings ...)
2025-07-02 14:58 ` [PATCH bpf-next V2 5/7] net: veth: Read xdp metadata from rx_meta struct if available Jesper Dangaard Brouer
@ 2025-07-02 14:58 ` Jesper Dangaard Brouer
2025-07-23 9:24 ` Bouska, Zdenek
2025-07-02 14:58 ` [PATCH bpf-next V2 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Jesper Dangaard Brouer
2025-07-02 16:05 ` [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Stanislav Fomichev
7 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-02 14:58 UTC (permalink / raw)
To: bpf, netdev, Jakub Kicinski, lorenzo
Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
arthur, jakub
From: Lorenzo Bianconi <lorenzo@kernel.org>
Introduce bpf selftests for the XDP rx_meta store kfuncs.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
.../testing/selftests/bpf/prog_tests/xdp_rxmeta.c | 166 ++++++++++++++++++++
.../selftests/bpf/progs/xdp_rxmeta_receiver.c | 44 +++++
.../selftests/bpf/progs/xdp_rxmeta_redirect.c | 43 +++++
3 files changed, 253 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
create mode 100644 tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
create mode 100644 tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c b/tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
new file mode 100644
index 000000000000..d5c181684ff8
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <bpf/btf.h>
+#include <linux/if_link.h>
+
+#include "xdp_rxmeta_redirect.skel.h"
+#include "xdp_rxmeta_receiver.skel.h"
+
+#define LOCAL_NETNS_NAME "local"
+#define FWD_NETNS_NAME "forward"
+#define DST_NETNS_NAME "dest"
+
+#define LOCAL_NAME "local"
+#define FWD0_NAME "fwd0"
+#define FWD1_NAME "fwd1"
+#define DST_NAME "dest"
+
+#define LOCAL_MAC "00:00:00:00:00:01"
+#define FWD0_MAC "00:00:00:00:00:02"
+#define FWD1_MAC "00:00:00:00:01:01"
+#define DST_MAC "00:00:00:00:01:02"
+
+#define LOCAL_ADDR "10.0.0.1"
+#define FWD0_ADDR "10.0.0.2"
+#define FWD1_ADDR "20.0.0.1"
+#define DST_ADDR "20.0.0.2"
+
+#define PREFIX_LEN "8"
+#define NUM_PACKETS 10
+
+static int run_ping(const char *dst, int num_ping)
+{
+ SYS(fail, "ping -c%d -W1 -i0.5 %s >/dev/null", num_ping, dst);
+ return 0;
+fail:
+ return -1;
+}
+
+void test_xdp_rxmeta(void)
+{
+ struct xdp_rxmeta_redirect *skel_redirect = NULL;
+ struct xdp_rxmeta_receiver *skel_receiver = NULL;
+ struct bpf_devmap_val val = {};
+ struct nstoken *tok = NULL;
+ struct bpf_program *prog;
+ __u32 key = 0, stats;
+ int ret, index;
+
+ SYS(out, "ip netns add " LOCAL_NETNS_NAME);
+ SYS(out, "ip netns add " FWD_NETNS_NAME);
+ SYS(out, "ip netns add " DST_NETNS_NAME);
+
+ tok = open_netns(LOCAL_NETNS_NAME);
+ if (!ASSERT_OK_PTR(tok, "setns"))
+ goto out;
+
+ SYS(out, "ip link add " LOCAL_NAME " type veth peer " FWD0_NAME);
+ SYS(out, "ip link set " FWD0_NAME " netns " FWD_NETNS_NAME);
+ SYS(out, "ip link set dev " LOCAL_NAME " address " LOCAL_MAC);
+ SYS(out, "ip addr add " LOCAL_ADDR "/" PREFIX_LEN " dev " LOCAL_NAME);
+ SYS(out, "ip link set dev " LOCAL_NAME " up");
+ SYS(out, "ip route add default via " FWD0_ADDR);
+ close_netns(tok);
+
+ tok = open_netns(DST_NETNS_NAME);
+ if (!ASSERT_OK_PTR(tok, "setns"))
+ goto out;
+
+ SYS(out, "ip link add " DST_NAME " type veth peer " FWD1_NAME);
+ SYS(out, "ip link set " FWD1_NAME " netns " FWD_NETNS_NAME);
+ SYS(out, "ip link set dev " DST_NAME " address " DST_MAC);
+ SYS(out, "ip addr add " DST_ADDR "/" PREFIX_LEN " dev " DST_NAME);
+ SYS(out, "ip link set dev " DST_NAME " up");
+ SYS(out, "ip route add default via " FWD1_ADDR);
+
+ skel_receiver = xdp_rxmeta_receiver__open();
+ if (!ASSERT_OK_PTR(skel_receiver, "open skel_receiver"))
+ goto out;
+
+ prog = bpf_object__find_program_by_name(skel_receiver->obj,
+ "xdp_rxmeta_receiver");
+ index = if_nametoindex(DST_NAME);
+ bpf_program__set_ifindex(prog, index);
+ bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
+ if (!ASSERT_OK(xdp_rxmeta_receiver__load(skel_receiver),
+ "load skel_receiver"))
+ goto out;
+
+ ret = bpf_xdp_attach(index,
+ bpf_program__fd(skel_receiver->progs.xdp_rxmeta_receiver),
+ XDP_FLAGS_DRV_MODE, NULL);
+ if (!ASSERT_GE(ret, 0, "bpf_xdp_attach rx_meta_redirect"))
+ goto out;
+
+ close_netns(tok);
+ tok = open_netns(FWD_NETNS_NAME);
+ if (!ASSERT_OK_PTR(tok, "setns"))
+ goto out;
+
+ SYS(out, "ip link set dev " FWD0_NAME " address " FWD0_MAC);
+ SYS(out, "ip addr add " FWD0_ADDR "/" PREFIX_LEN " dev " FWD0_NAME);
+ SYS(out, "ip link set dev " FWD0_NAME " up");
+
+ SYS(out, "ip link set dev " FWD1_NAME " address " FWD1_MAC);
+ SYS(out, "ip addr add " FWD1_ADDR "/" PREFIX_LEN " dev " FWD1_NAME);
+ SYS(out, "ip link set dev " FWD1_NAME " up");
+
+ SYS(out, "sysctl -qw net.ipv4.conf.all.forwarding=1");
+
+ skel_redirect = xdp_rxmeta_redirect__open();
+ if (!ASSERT_OK_PTR(skel_redirect, "open skel_redirect"))
+ goto out;
+
+ prog = bpf_object__find_program_by_name(skel_redirect->obj,
+ "xdp_rxmeta_redirect");
+ index = if_nametoindex(FWD0_NAME);
+ bpf_program__set_ifindex(prog, index);
+ bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
+ if (!ASSERT_OK(xdp_rxmeta_redirect__load(skel_redirect),
+ "load skel_redirect"))
+ goto out;
+
+ val.ifindex = if_nametoindex(FWD1_NAME);
+ ret = bpf_map_update_elem(bpf_map__fd(skel_redirect->maps.dev_map),
+ &key, &val, 0);
+ if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
+ goto out;
+
+ ret = bpf_xdp_attach(index,
+ bpf_program__fd(skel_redirect->progs.xdp_rxmeta_redirect),
+ XDP_FLAGS_DRV_MODE, NULL);
+ if (!ASSERT_GE(ret, 0, "bpf_xdp_attach rxmeta_redirect"))
+ goto out;
+
+ close_netns(tok);
+ tok = open_netns(LOCAL_NETNS_NAME);
+ if (!ASSERT_OK_PTR(tok, "setns"))
+ goto out;
+
+ if (!ASSERT_OK(run_ping(DST_ADDR, NUM_PACKETS), "ping"))
+ goto out;
+
+ close_netns(tok);
+ tok = open_netns(DST_NETNS_NAME);
+ if (!ASSERT_OK_PTR(tok, "setns"))
+ goto out;
+
+ ret = bpf_map__lookup_elem(skel_receiver->maps.stats,
+ &key, sizeof(key),
+ &stats, sizeof(stats), 0);
+ if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
+ goto out;
+
+ ASSERT_EQ(stats, NUM_PACKETS, "rx_meta stats");
+out:
+ xdp_rxmeta_redirect__destroy(skel_redirect);
+ xdp_rxmeta_receiver__destroy(skel_receiver);
+ if (tok)
+ close_netns(tok);
+ SYS_NOFAIL("ip netns del " LOCAL_NETNS_NAME);
+ SYS_NOFAIL("ip netns del " FWD_NETNS_NAME);
+ SYS_NOFAIL("ip netns del " DST_NETNS_NAME);
+}
diff --git a/tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c b/tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
new file mode 100644
index 000000000000..1033fa558970
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#define BPF_NO_KFUNC_PROTOTYPES
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
+ enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
+ __u64 *timestamp) __ksym;
+
+#define RX_TIMESTAMP 0x12345678
+#define RX_HASH 0x1234
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, __u32);
+ __type(value, __u32);
+ __uint(max_entries, 1);
+} stats SEC(".maps");
+
+SEC("xdp")
+int xdp_rxmeta_receiver(struct xdp_md *ctx)
+{
+ enum xdp_rss_hash_type rss_type;
+ __u64 timestamp;
+ __u32 hash;
+
+ if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type) &&
+ !bpf_xdp_metadata_rx_timestamp(ctx, &timestamp)) {
+ if (hash == RX_HASH && rss_type == XDP_RSS_L4_TCP &&
+ timestamp == RX_TIMESTAMP) {
+ __u32 *val, key = 0;
+
+ val = bpf_map_lookup_elem(&stats, &key);
+ if (val)
+ __sync_add_and_fetch(val, 1);
+ }
+ }
+
+ return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c b/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
new file mode 100644
index 000000000000..635cbae64f53
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define RX_TIMESTAMP 0x12345678
+#define RX_HASH 0x1234
+
+#define ETH_ALEN 6
+#define ETH_P_IP 0x0800
+
+struct {
+ __uint(type, BPF_MAP_TYPE_DEVMAP);
+ __uint(key_size, sizeof(__u32));
+ __uint(value_size, sizeof(struct bpf_devmap_val));
+ __uint(max_entries, 1);
+} dev_map SEC(".maps");
+
+SEC("xdp")
+int xdp_rxmeta_redirect(struct xdp_md *ctx)
+{
+ __u8 src_mac[] = { 0x00, 0x00, 0x00, 0x00, 0x01, 0x01 };
+ __u8 dst_mac[] = { 0x00, 0x00, 0x00, 0x00, 0x01, 0x02 };
+ void *data_end = (void *)(long)ctx->data_end;
+ void *data = (void *)(long)ctx->data;
+ struct ethhdr *eh = data;
+
+ if (eh + 1 > (struct ethhdr *)data_end)
+ return XDP_DROP;
+
+ if (eh->h_proto != bpf_htons(ETH_P_IP))
+ return XDP_PASS;
+
+ __builtin_memcpy(eh->h_source, src_mac, ETH_ALEN);
+ __builtin_memcpy(eh->h_dest, dst_mac, ETH_ALEN);
+
+ bpf_xdp_store_rx_hash(ctx, RX_HASH, XDP_RSS_L4_TCP);
+ bpf_xdp_store_rx_ts(ctx, RX_TIMESTAMP);
+
+ return bpf_redirect_map(&dev_map, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";
* [PATCH bpf-next V2 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
2025-07-02 14:58 [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Jesper Dangaard Brouer
` (5 preceding siblings ...)
2025-07-02 14:58 ` [PATCH bpf-next V2 6/7] bpf: selftests: Add rx_meta store kfuncs selftest Jesper Dangaard Brouer
@ 2025-07-02 14:58 ` Jesper Dangaard Brouer
2025-07-02 16:05 ` [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Stanislav Fomichev
7 siblings, 0 replies; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-02 14:58 UTC (permalink / raw)
To: bpf, netdev, Jakub Kicinski, lorenzo
Cc: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
arthur, jakub
Update the documentation[1] based on the changes in this patchset.
[1] https://docs.kernel.org/networking/xdp-rx-metadata.html
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
Documentation/networking/xdp-rx-metadata.rst | 77 +++++++++++++++++++++-----
net/core/xdp.c | 32 +++++++++++
2 files changed, 93 insertions(+), 16 deletions(-)
diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
index a6e0ece18be5..e2b89c066a82 100644
--- a/Documentation/networking/xdp-rx-metadata.rst
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -90,22 +90,67 @@ the ``data_meta`` pointer.
In the future, we'd like to support a case where an XDP program
can override some of the metadata used for building ``skbs``.
-bpf_redirect_map
-================
-
-``bpf_redirect_map`` can redirect the frame to a different device.
-Some devices (like virtual ethernet links) support running a second XDP
-program after the redirect. However, the final consumer doesn't have
-access to the original hardware descriptor and can't access any of
-the original metadata. The same applies to XDP programs installed
-into devmaps and cpumaps.
-
-This means that for redirected packets only custom metadata is
-currently supported, which has to be prepared by the initial XDP program
-before redirect. If the frame is eventually passed to the kernel, the
-``skb`` created from such a frame won't have any hardware metadata populated
-in its ``skb``. If such a packet is later redirected into an ``XSK``,
-that will also only have access to the custom metadata.
+XDP_REDIRECT
+============
+
+The ``XDP_REDIRECT`` action forwards an XDP frame (``xdp_frame``) to another net
+device or a CPU (via cpumap/devmap) for further processing. It is invoked using
+BPF helpers like ``bpf_redirect_map()`` or ``bpf_redirect()``. When an XDP
+frame is redirected, the recipient (e.g., an XDP program on a veth device, or
+the kernel stack via cpumap) naturally loses direct access to the original NIC's
+hardware descriptor and thus its hardware metadata hints.
+
+By default, if an ``xdp_frame`` is redirected and then converted to an ``skb``,
+its fields for hardware-derived metadata like ``skb->hash`` are not
+populated. When this occurs, the network stack recalculates the hash in
+software. This is particularly problematic for encapsulated tunnel traffic
+(e.g., IPsec, GRE), as the software hash is based on the outer headers. For a
+single tunnel, this can cause all flows to receive the same hash, leading to
+poor load balancing when redirected to a veth device or processed by cpumap.
+
+To solve this, a BPF program can calculate a more appropriate hint from the
+packet data (e.g., from the inner headers of a tunnel) and store it for later
+use. While it is also possible for the BPF program to propagate existing
+hardware hints, this is not useful for the tunnel use case; it is unnecessary to
+read the existing hardware metadata hint, as it is based on the outer headers
+and must be recalculated to correctly reflect the inner flow.
+
+For example, a BPF program can perform partial decryption on an IPsec packet,
+calculate a hash from the inner headers, and use ``bpf_xdp_store_rx_hash()`` to
+save it. This ensures that when the packet is redirected to a veth device, it is
+placed on the correct RX queue, achieving proper load balancing.
+
+When these kfuncs are used to store hints before redirection:
+
+* If the ``xdp_frame`` is converted to an ``skb``, the networking stack will use
+ the stored hints to populate the corresponding ``skb`` fields (e.g.,
+ ``skb->hash``, ``skb->vlan_tci``, timestamps).
+
+* When running a second XDP program after the redirect (e.g., on a veth
+  device), the previously stored metadata can be accessed through the normal
+  reader kfuncs.
+
+The BPF programmer must explicitly call these "store" kfuncs to save the desired
+hints. The NIC driver is responsible for ensuring sufficient headroom is
+available; kfuncs may return ``-ENOSPC`` if space is inadequate.
+
+Kfuncs are available for storing RX hash (``bpf_xdp_store_rx_hash()``),
+VLAN information (``bpf_xdp_store_rx_vlan()``), and hardware timestamps
+(``bpf_xdp_store_rx_ts()``). Consult the kfunc API documentation for usage
+details, expected data, return codes, and relevant XDP flags that may
+indicate success or metadata availability.
+
+Kfuncs for **store** operations:
+
+.. kernel-doc:: net/core/xdp.c
+ :identifiers: bpf_xdp_store_rx_ts
+
+.. kernel-doc:: net/core/xdp.c
+ :identifiers: bpf_xdp_store_rx_hash
+
+.. kernel-doc:: net/core/xdp.c
+ :identifiers: bpf_xdp_store_rx_vlan
+
bpf_tail_call
=============
diff --git a/net/core/xdp.c b/net/core/xdp.c
index f1b2a3b4ba95..e8faf9f6fc7e 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -990,6 +990,18 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
return -EOPNOTSUPP;
}
+/**
+ * bpf_xdp_store_rx_hash - Store XDP frame RX hash.
+ * @ctx: XDP context pointer.
+ * @hash: 32-bit hash value.
+ * @rss_type: RSS hash type.
+ *
+ * The RSS hash type (@rss_type) is as described in bpf_xdp_metadata_rx_hash.
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ * * ``-ENOSPC`` : means the device driver doesn't provide enough headroom for storing
+ */
__bpf_kfunc int bpf_xdp_store_rx_hash(struct xdp_md *ctx, u32 hash,
enum xdp_rss_hash_type rss_type)
{
@@ -1005,6 +1017,18 @@ __bpf_kfunc int bpf_xdp_store_rx_hash(struct xdp_md *ctx, u32 hash,
return 0;
}
+/**
+ * bpf_xdp_store_rx_vlan - Store XDP packet outermost VLAN tag
+ * @ctx: XDP context pointer.
+ * @vlan_proto: VLAN protocol stored in **network byte order (BE)**
+ * @vlan_tci: VLAN TCI (VID + DEI + PCP) stored in **host byte order**
+ *
+ * See bpf_xdp_metadata_rx_vlan_tag() for byte order reasoning.
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ * * ``-ENOSPC`` : means the device driver doesn't provide enough headroom for storing
+ */
__bpf_kfunc int bpf_xdp_store_rx_vlan(struct xdp_md *ctx, __be16 vlan_proto,
u16 vlan_tci)
{
@@ -1020,6 +1044,14 @@ __bpf_kfunc int bpf_xdp_store_rx_vlan(struct xdp_md *ctx, __be16 vlan_proto,
return 0;
}
+/**
+ * bpf_xdp_store_rx_ts - Store XDP frame RX timestamp.
+ * @ctx: XDP context pointer.
+ * @ts: Timestamp value.
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ */
__bpf_kfunc int bpf_xdp_store_rx_ts(struct xdp_md *ctx, u64 ts)
{
struct xdp_buff *xdp = (struct xdp_buff *)ctx;
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-02 14:58 [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Jesper Dangaard Brouer
` (6 preceding siblings ...)
2025-07-02 14:58 ` [PATCH bpf-next V2 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Jesper Dangaard Brouer
@ 2025-07-02 16:05 ` Stanislav Fomichev
2025-07-03 11:17 ` Jesper Dangaard Brouer
7 siblings, 1 reply; 43+ messages in thread
From: Stanislav Fomichev @ 2025-07-02 16:05 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur, jakub
On 07/02, Jesper Dangaard Brouer wrote:
> This patch series introduces a mechanism for an XDP program to store RX
> metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
> into the xdp_frame. These stored hints are then used to populate the
> corresponding fields in the SKB that is created from the xdp_frame
> following an XDP_REDIRECT.
>
> The chosen RX metadata hints intentionally map to the existing NIC
> hardware metadata that can be read via kfuncs [1]. While this design
> allows a BPF program to read and propagate existing hardware hints, our
> primary motivation is to enable setting custom values. This is important
> for use cases where the hardware-provided information is insufficient or
> needs to be calculated based on packet contents unavailable to the
> hardware.
>
> The primary motivation for this feature is to enable scalable load
> balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
> packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
> the networking stack later calculates a software hash based on the outer
> headers. For a single tunnel, these outer headers are often identical,
> causing all packets to be assigned the same hash. This collapses all
> traffic onto a single RX queue, creating a performance bottleneck and
> defeating receive-side scaling (RSS).
>
> Our immediate use case involves load balancing IPsec traffic. For such
> tunnelled traffic, any hardware-provided RX hash is calculated on the
> outer headers and is therefore incorrect for distributing inner flows.
> There is no reason to read the existing value, as it must be recalculated.
> In our XDP program, we perform a partial decryption to access the inner
> headers and calculate a new load-balancing hash, which provides better
> flow distribution. However, without this patch set, there is no way to
> persist this new hash for the network stack to use post-redirect.
>
> This series solves the problem by introducing new BPF kfuncs that allow an
> XDP program to write e.g. the hash value into the xdp_frame. The
> __xdp_build_skb_from_frame() function is modified to use this stored value
> to set skb->hash on the newly created SKB. As a result, the veth driver's
> queue selection logic uses the BPF-supplied hash, achieving proper
> traffic distribution across multiple CPU cores. This also ensures that
> consumers, like the GRO engine, can operate effectively.
>
> We considered XDP traits as an alternative to adding static members to
> struct xdp_frame. Given the immediate need for this functionality and the
> current development status of traits, we believe this approach is a
> pragmatic solution. We are open to migrating to a traits-based
> implementation if and when they become a generally accepted mechanism for
> such extensions.
>
> [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
> ---
> V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/
No change log?
Btw, any feedback on the following from v1?
- https://lore.kernel.org/netdev/aFHUd98juIU4Rr9J@mini-arch/
- https://lore.kernel.org/netdev/20250616145523.63bd2577@kernel.org/
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-02 16:05 ` [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Stanislav Fomichev
@ 2025-07-03 11:17 ` Jesper Dangaard Brouer
2025-07-07 14:40 ` Stanislav Fomichev
0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-03 11:17 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur, jakub
On 02/07/2025 18.05, Stanislav Fomichev wrote:
> On 07/02, Jesper Dangaard Brouer wrote:
>> This patch series introduces a mechanism for an XDP program to store RX
>> metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
>> into the xdp_frame. These stored hints are then used to populate the
>> corresponding fields in the SKB that is created from the xdp_frame
>> following an XDP_REDIRECT.
>>
>> The chosen RX metadata hints intentionally map to the existing NIC
>> hardware metadata that can be read via kfuncs [1]. While this design
>> allows a BPF program to read and propagate existing hardware hints, our
>> primary motivation is to enable setting custom values. This is important
>> for use cases where the hardware-provided information is insufficient or
>> needs to be calculated based on packet contents unavailable to the
>> hardware.
>>
>> The primary motivation for this feature is to enable scalable load
>> balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
>> packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
>> the networking stack later calculates a software hash based on the outer
>> headers. For a single tunnel, these outer headers are often identical,
>> causing all packets to be assigned the same hash. This collapses all
>> traffic onto a single RX queue, creating a performance bottleneck and
>> defeating receive-side scaling (RSS).
>>
>> Our immediate use case involves load balancing IPsec traffic. For such
>> tunnelled traffic, any hardware-provided RX hash is calculated on the
>> outer headers and is therefore incorrect for distributing inner flows.
>> There is no reason to read the existing value, as it must be recalculated.
>> In our XDP program, we perform a partial decryption to access the inner
>> headers and calculate a new load-balancing hash, which provides better
>> flow distribution. However, without this patch set, there is no way to
>> persist this new hash for the network stack to use post-redirect.
>>
>> This series solves the problem by introducing new BPF kfuncs that allow an
>> XDP program to write e.g. the hash value into the xdp_frame. The
>> __xdp_build_skb_from_frame() function is modified to use this stored value
>> to set skb->hash on the newly created SKB. As a result, the veth driver's
>> queue selection logic uses the BPF-supplied hash, achieving proper
>> traffic distribution across multiple CPU cores. This also ensures that
>> consumers, like the GRO engine, can operate effectively.
>>
>> We considered XDP traits as an alternative to adding static members to
>> struct xdp_frame. Given the immediate need for this functionality and the
>> current development status of traits, we believe this approach is a
>> pragmatic solution. We are open to migrating to a traits-based
>> implementation if and when they become a generally accepted mechanism for
>> such extensions.
>>
>> [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
>> ---
>> V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/
>
> No change log?
We have fixed the selftest as requested by Alexei.
And we have updated the cover letter and docs as you, Stanislav, requested.
>
> Btw, any feedback on the following from v1?
> - https://lore.kernel.org/netdev/aFHUd98juIU4Rr9J@mini-arch/
Addressed via the updated cover letter and documentation. I hope this helps
reviewers understand the use-case, as the discussion turned into "how do
we transfer all HW metadata", which is NOT what we want (and a waste of
precious cycles).
For our use-case, it doesn't make sense to "transfer all HW metadata".
In fact we don't even want to read the hardware RX-hash, because we
already know it is wrong (for tunnels); we just want to override the
RX-hash used at SKB creation. We do want to give BPF programmers the
flexibility to call these kfuncs individually (when relevant).
> - https://lore.kernel.org/netdev/20250616145523.63bd2577@kernel.org/
I feel pressured into critiquing Jakub's suggestion; I hope this is not
too harsh. First of all, it is not relevant to this patchset's
use-case, as it focuses on all HW metadata.
Second, I disagree with the idea/mental model of storing in a
"driver-specific format". The current implementation of driver-specific
kfunc helpers that "get the metadata" is already doing a conversion to a
common format, because the BPF programmer naturally needs this to be the
same across drivers. Thus, it doesn't make sense to store it back in a
"driver-specific format", as that just complicates things. My mental
model is thus that, after the driver-specific "get" operation, the result
is in a common format, which is simply defined by the struct type of the
kfunc and is known by both the kernel and the BPF-prog.
--Jesper
* Re: [PATCH bpf-next V2 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff
2025-07-02 14:58 ` [PATCH bpf-next V2 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff Jesper Dangaard Brouer
@ 2025-07-03 11:41 ` Jesper Dangaard Brouer
2025-07-03 12:26 ` Lorenzo Bianconi
0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-03 11:41 UTC (permalink / raw)
To: bpf, netdev, Jakub Kicinski, lorenzo
Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub
On 02/07/2025 16.58, Jesper Dangaard Brouer wrote:
> From: Lorenzo Bianconi<lorenzo@kernel.org>
>
> Introduce the following kfuncs to store hw metadata provided by the NIC
> into the xdp_buff struct:
>
> - rx-hash: bpf_xdp_store_rx_hash
> - rx-vlan: bpf_xdp_store_rx_vlan
> - rx-hw-ts: bpf_xdp_store_rx_ts
>
> Signed-off-by: Lorenzo Bianconi<lorenzo@kernel.org>
> Signed-off-by: Jesper Dangaard Brouer<hawk@kernel.org>
> ---
> include/net/xdp.h | 5 +++++
> net/core/xdp.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 50 insertions(+)
>
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index bd3110fc7ef8..1ffba57714ea 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -963,12 +963,57 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
[...]
> +__bpf_kfunc int bpf_xdp_store_rx_ts(struct xdp_md *ctx, u64 ts)
> +{
> + struct xdp_buff *xdp = (struct xdp_buff *)ctx;
> + struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> + struct skb_shared_hwtstamps *shwt = &sinfo->hwtstamps;
> +
> + shwt->hwtstamp = ts;
Here we are storing into the SKB shared_info struct. This is located at
the SKB data tail. Thus, this will very likely cause a cache-miss.
What about storing it into xdp->rx_meta and then starting a prefetch for
shared_info? (and updating patch 4, which moves it into the SKB)
(Reviewers should be aware that writing into the xdp_frame headroom
(xdp->rx_meta) likely isn't a cache-miss, because all drivers do a
prefetchw for this memory prior to running the BPF-prog).
> + xdp->flags |= XDP_FLAGS_META_RX_TS;
> +
> + return 0;
> +}
* Re: [PATCH bpf-next V2 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff
2025-07-03 11:41 ` Jesper Dangaard Brouer
@ 2025-07-03 12:26 ` Lorenzo Bianconi
0 siblings, 0 replies; 43+ messages in thread
From: Lorenzo Bianconi @ 2025-07-03 12:26 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, netdev, Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
Eric Dumazet, David S. Miller, Paolo Abeni, sdf, kernel-team,
arthur, jakub
>
> On 02/07/2025 16.58, Jesper Dangaard Brouer wrote:
> > From: Lorenzo Bianconi<lorenzo@kernel.org>
> >
> > Introduce the following kfuncs to store hw metadata provided by the NIC
> > into the xdp_buff struct:
> >
> > - rx-hash: bpf_xdp_store_rx_hash
> > - rx-vlan: bpf_xdp_store_rx_vlan
> > - rx-hw-ts: bpf_xdp_store_rx_ts
> >
> > Signed-off-by: Lorenzo Bianconi<lorenzo@kernel.org>
> > Signed-off-by: Jesper Dangaard Brouer<hawk@kernel.org>
> > ---
> > include/net/xdp.h | 5 +++++
> > net/core/xdp.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 50 insertions(+)
> >
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index bd3110fc7ef8..1ffba57714ea 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -963,12 +963,57 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
> [...]
> > +__bpf_kfunc int bpf_xdp_store_rx_ts(struct xdp_md *ctx, u64 ts)
> > +{
> > + struct xdp_buff *xdp = (struct xdp_buff *)ctx;
> > + struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> > + struct skb_shared_hwtstamps *shwt = &sinfo->hwtstamps;
> > +
> > + shwt->hwtstamp = ts;
>
> Here we are storing into the SKB shared_info struct. This is located at
> the SKB data tail. Thus, this will very likely cause a cache-miss.
>
> What about storing it into xdp->rx_meta and then starting a prefetch for
> shared_info? (and updating patch-4 that moved it into SKB)
ack, I am fine with it. I can address it in v3.
Regards,
Lorenzo
>
> (Reviewers should be aware that writing into the xdp_frame headroom
> (xdp->rx_meta) likely isn't a cache-miss, because all drivers does a
> prefetchw for this memory prior to running BPF-prog).
>
>
> > + xdp->flags |= XDP_FLAGS_META_RX_TS;
> > +
> > + return 0;
> > +}
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-03 11:17 ` Jesper Dangaard Brouer
@ 2025-07-07 14:40 ` Stanislav Fomichev
2025-07-09 9:31 ` Lorenzo Bianconi
0 siblings, 1 reply; 43+ messages in thread
From: Stanislav Fomichev @ 2025-07-07 14:40 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur, jakub
On 07/03, Jesper Dangaard Brouer wrote:
>
>
> On 02/07/2025 18.05, Stanislav Fomichev wrote:
> > On 07/02, Jesper Dangaard Brouer wrote:
> > > This patch series introduces a mechanism for an XDP program to store RX
> > > metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
> > > into the xdp_frame. These stored hints are then used to populate the
> > > corresponding fields in the SKB that is created from the xdp_frame
> > > following an XDP_REDIRECT.
> > >
> > > The chosen RX metadata hints intentionally map to the existing NIC
> > > hardware metadata that can be read via kfuncs [1]. While this design
> > > allows a BPF program to read and propagate existing hardware hints, our
> > > primary motivation is to enable setting custom values. This is important
> > > for use cases where the hardware-provided information is insufficient or
> > > needs to be calculated based on packet contents unavailable to the
> > > hardware.
> > >
> > > The primary motivation for this feature is to enable scalable load
> > > balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
> > > packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
> > > the networking stack later calculates a software hash based on the outer
> > > headers. For a single tunnel, these outer headers are often identical,
> > > causing all packets to be assigned the same hash. This collapses all
> > > traffic onto a single RX queue, creating a performance bottleneck and
> > > defeating receive-side scaling (RSS).
> > >
> > > Our immediate use case involves load balancing IPsec traffic. For such
> > > tunnelled traffic, any hardware-provided RX hash is calculated on the
> > > outer headers and is therefore incorrect for distributing inner flows.
> > > There is no reason to read the existing value, as it must be recalculated.
> > > In our XDP program, we perform a partial decryption to access the inner
> > > headers and calculate a new load-balancing hash, which provides better
> > > flow distribution. However, without this patch set, there is no way to
> > > persist this new hash for the network stack to use post-redirect.
> > >
> > > This series solves the problem by introducing new BPF kfuncs that allow an
> > > XDP program to write e.g. the hash value into the xdp_frame. The
> > > __xdp_build_skb_from_frame() function is modified to use this stored value
> > > to set skb->hash on the newly created SKB. As a result, the veth driver's
> > > queue selection logic uses the BPF-supplied hash, achieving proper
> > > traffic distribution across multiple CPU cores. This also ensures that
> > > consumers, like the GRO engine, can operate effectively.
> > >
> > > We considered XDP traits as an alternative to adding static members to
> > > struct xdp_frame. Given the immediate need for this functionality and the
> > > current development status of traits, we believe this approach is a
> > > pragmatic solution. We are open to migrating to a traits-based
> > > implementation if and when they become a generally accepted mechanism for
> > > such extensions.
> > >
> > > [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
> > > ---
> > > V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/
> >
> > No change log?
>
> We have fixed selftest as requested by Alexie.
> And we have updated cover-letter and doc as you Stanislav requested.
>
> >
> > Btw, any feedback on the following from v1?
> > - https://lore.kernel.org/netdev/aFHUd98juIU4Rr9J@mini-arch/
>
> Addressed as updated cover-letter and documentation. I hope this helps
> reviewers understand the use-case, as the discussion turn into "how do we
> transfer all HW metadata", which is NOT what we want (and a waste of
> precious cycles).
>
> For our use-case, it doesn't make sense to "transfer all HW metadata".
> In fact we don't even want to read the hardware RH-hash, because we already
> know it is wrong (for tunnels), we just want to override the RX-hash used at
> SKB creation. We do want the BPF programmers flexibility to call these
> kfuncs individually (when relevant).
>
> > - https://lore.kernel.org/netdev/20250616145523.63bd2577@kernel.org/
>
> I feel pressured into critiquing Jakub's suggestion, hope this is not too
> harsh. First of all it is not relevant to our this patchset use-case, as it
> focus on all HW metadata.
[..]
> Second, I disagree with the idea/mental model of storing in a
> "driver-specific format". The current implementation of driver-specific
> kfunc helpers that "get the metadata" is already doing a conversion to a
> common format, because the BPF-programmer naturally needs this to be the
> same across drivers. Thus, it doesn't make sense to store it back in a
> "driver-specific format", as that just complicate things. My mental model
> is thus, that after the driver-specific "get" operation to result is in a
> common format, that is simply defined by the struct type of the kfunc, which
> is both known by the kernel and BPF-prog.
Having a get/set model seems a bit more generic, no? It could potentially give
us the ability to "correct" HW metadata for the non-redirected cases as well.
Plus we don't hard-code the (internal) layout. Solving only xdp_redirect
seems a bit too narrow, idk..
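For concreteness, a signature-level sketch of the paired get/set model
being discussed. The reader already exists in the kernel; the *_set name
and its semantics are assumptions for illustration only:

/* existing reader kfunc */
int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
                             enum xdp_rss_hash_type *rss_type);

/* hypothetical writer kfunc, dispatched to the driver like the reader,
 * so both the normal skb-build path and the redirect path could see
 * the corrected value
 */
int bpf_xdp_metadata_rx_hash_set(struct xdp_md *ctx, u32 hash,
                                 enum xdp_rss_hash_type rss_type);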
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-07 14:40 ` Stanislav Fomichev
@ 2025-07-09 9:31 ` Lorenzo Bianconi
2025-07-11 16:04 ` Stanislav Fomichev
0 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Bianconi @ 2025-07-09 9:31 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Jesper Dangaard Brouer, bpf, netdev, Jakub Kicinski,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub
On Jul 07, Stanislav Fomichev wrote:
> On 07/03, Jesper Dangaard Brouer wrote:
> >
> >
> > On 02/07/2025 18.05, Stanislav Fomichev wrote:
> > > On 07/02, Jesper Dangaard Brouer wrote:
> > > > This patch series introduces a mechanism for an XDP program to store RX
> > > > metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
> > > > into the xdp_frame. These stored hints are then used to populate the
> > > > corresponding fields in the SKB that is created from the xdp_frame
> > > > following an XDP_REDIRECT.
> > > >
> > > > The chosen RX metadata hints intentionally map to the existing NIC
> > > > hardware metadata that can be read via kfuncs [1]. While this design
> > > > allows a BPF program to read and propagate existing hardware hints, our
> > > > primary motivation is to enable setting custom values. This is important
> > > > for use cases where the hardware-provided information is insufficient or
> > > > needs to be calculated based on packet contents unavailable to the
> > > > hardware.
> > > >
> > > > The primary motivation for this feature is to enable scalable load
> > > > balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
> > > > packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
> > > > the networking stack later calculates a software hash based on the outer
> > > > headers. For a single tunnel, these outer headers are often identical,
> > > > causing all packets to be assigned the same hash. This collapses all
> > > > traffic onto a single RX queue, creating a performance bottleneck and
> > > > defeating receive-side scaling (RSS).
> > > >
> > > > Our immediate use case involves load balancing IPsec traffic. For such
> > > > tunnelled traffic, any hardware-provided RX hash is calculated on the
> > > > outer headers and is therefore incorrect for distributing inner flows.
> > > > There is no reason to read the existing value, as it must be recalculated.
> > > > In our XDP program, we perform a partial decryption to access the inner
> > > > headers and calculate a new load-balancing hash, which provides better
> > > > flow distribution. However, without this patch set, there is no way to
> > > > persist this new hash for the network stack to use post-redirect.
> > > >
> > > > This series solves the problem by introducing new BPF kfuncs that allow an
> > > > XDP program to write e.g. the hash value into the xdp_frame. The
> > > > __xdp_build_skb_from_frame() function is modified to use this stored value
> > > > to set skb->hash on the newly created SKB. As a result, the veth driver's
> > > > queue selection logic uses the BPF-supplied hash, achieving proper
> > > > traffic distribution across multiple CPU cores. This also ensures that
> > > > consumers, like the GRO engine, can operate effectively.
> > > >
> > > > We considered XDP traits as an alternative to adding static members to
> > > > struct xdp_frame. Given the immediate need for this functionality and the
> > > > current development status of traits, we believe this approach is a
> > > > pragmatic solution. We are open to migrating to a traits-based
> > > > implementation if and when they become a generally accepted mechanism for
> > > > such extensions.
> > > >
> > > > [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
> > > > ---
> > > > V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/
> > >
> > > No change log?
> >
> > We have fixed selftest as requested by Alexie.
> > And we have updated cover-letter and doc as you Stanislav requested.
> >
> > >
> > > Btw, any feedback on the following from v1?
> > > - https://lore.kernel.org/netdev/aFHUd98juIU4Rr9J@mini-arch/
> >
> > Addressed as updated cover-letter and documentation. I hope this helps
> > reviewers understand the use-case, as the discussion turn into "how do we
> > transfer all HW metadata", which is NOT what we want (and a waste of
> > precious cycles).
> >
> > For our use-case, it doesn't make sense to "transfer all HW metadata".
> > In fact we don't even want to read the hardware RH-hash, because we already
> > know it is wrong (for tunnels), we just want to override the RX-hash used at
> > SKB creation. We do want the BPF programmers flexibility to call these
> > kfuncs individually (when relevant).
> >
> > > - https://lore.kernel.org/netdev/20250616145523.63bd2577@kernel.org/
> >
> > I feel pressured into critiquing Jakub's suggestion, hope this is not too
> > harsh. First of all it is not relevant to our this patchset use-case, as it
> > focus on all HW metadata.
>
> [..]
>
> > Second, I disagree with the idea/mental model of storing in a
> > "driver-specific format". The current implementation of driver-specific
> > kfunc helpers that "get the metadata" is already doing a conversion to a
> > common format, because the BPF-programmer naturally needs this to be the
> > same across drivers. Thus, it doesn't make sense to store it back in a
> > "driver-specific format", as that just complicate things. My mental model
> > is thus, that after the driver-specific "get" operation to result is in a
> > common format, that is simply defined by the struct type of the kfunc, which
> > is both known by the kernel and BPF-prog.
>
> Having get/set model seems a bit more generic, no? Potentially giving us the
> ability to "correct" HW metadata for the non-redirected cases as well.
> Plus we don't hard-code the (internal) layout. Solving only xdp_redirect
> seems a bit too narrow, idk..
I can't see what the non-redirected use-case could be. Can you please provide
more details?
Moreover, can it be solved without storing the rx_hash (or the other
hw-metadata) in a non-driver-specific format?
Storing the hw-metadata in some hw-specific format in the xdp_frame will not
allow us to consume it directly when building the skb; we would need to decode
it again. What is the upside/use-case of this approach? (not considering the
orthogonality with the get method)
Regards,
Lorenzo
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-09 9:31 ` Lorenzo Bianconi
@ 2025-07-11 16:04 ` Stanislav Fomichev
2025-07-16 11:17 ` Lorenzo Bianconi
0 siblings, 1 reply; 43+ messages in thread
From: Stanislav Fomichev @ 2025-07-11 16:04 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Jesper Dangaard Brouer, bpf, netdev, Jakub Kicinski,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub
On 07/09, Lorenzo Bianconi wrote:
> On Jul 07, Stanislav Fomichev wrote:
> > On 07/03, Jesper Dangaard Brouer wrote:
> > >
> > >
> > > On 02/07/2025 18.05, Stanislav Fomichev wrote:
> > > > On 07/02, Jesper Dangaard Brouer wrote:
> > > > > This patch series introduces a mechanism for an XDP program to store RX
> > > > > metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
> > > > > into the xdp_frame. These stored hints are then used to populate the
> > > > > corresponding fields in the SKB that is created from the xdp_frame
> > > > > following an XDP_REDIRECT.
> > > > >
> > > > > The chosen RX metadata hints intentionally map to the existing NIC
> > > > > hardware metadata that can be read via kfuncs [1]. While this design
> > > > > allows a BPF program to read and propagate existing hardware hints, our
> > > > > primary motivation is to enable setting custom values. This is important
> > > > > for use cases where the hardware-provided information is insufficient or
> > > > > needs to be calculated based on packet contents unavailable to the
> > > > > hardware.
> > > > >
> > > > > The primary motivation for this feature is to enable scalable load
> > > > > balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
> > > > > packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
> > > > > the networking stack later calculates a software hash based on the outer
> > > > > headers. For a single tunnel, these outer headers are often identical,
> > > > > causing all packets to be assigned the same hash. This collapses all
> > > > > traffic onto a single RX queue, creating a performance bottleneck and
> > > > > defeating receive-side scaling (RSS).
> > > > >
> > > > > Our immediate use case involves load balancing IPsec traffic. For such
> > > > > tunnelled traffic, any hardware-provided RX hash is calculated on the
> > > > > outer headers and is therefore incorrect for distributing inner flows.
> > > > > There is no reason to read the existing value, as it must be recalculated.
> > > > > In our XDP program, we perform a partial decryption to access the inner
> > > > > headers and calculate a new load-balancing hash, which provides better
> > > > > flow distribution. However, without this patch set, there is no way to
> > > > > persist this new hash for the network stack to use post-redirect.
> > > > >
> > > > > This series solves the problem by introducing new BPF kfuncs that allow an
> > > > > XDP program to write e.g. the hash value into the xdp_frame. The
> > > > > __xdp_build_skb_from_frame() function is modified to use this stored value
> > > > > to set skb->hash on the newly created SKB. As a result, the veth driver's
> > > > > queue selection logic uses the BPF-supplied hash, achieving proper
> > > > > traffic distribution across multiple CPU cores. This also ensures that
> > > > > consumers, like the GRO engine, can operate effectively.
> > > > >
> > > > > We considered XDP traits as an alternative to adding static members to
> > > > > struct xdp_frame. Given the immediate need for this functionality and the
> > > > > current development status of traits, we believe this approach is a
> > > > > pragmatic solution. We are open to migrating to a traits-based
> > > > > implementation if and when they become a generally accepted mechanism for
> > > > > such extensions.
> > > > >
> > > > > [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
> > > > > ---
> > > > > V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/
> > > >
> > > > No change log?
> > >
> > > We have fixed selftest as requested by Alexie.
> > > And we have updated cover-letter and doc as you Stanislav requested.
> > >
> > > >
> > > > Btw, any feedback on the following from v1?
> > > > - https://lore.kernel.org/netdev/aFHUd98juIU4Rr9J@mini-arch/
> > >
> > > Addressed as updated cover-letter and documentation. I hope this helps
> > > reviewers understand the use-case, as the discussion turn into "how do we
> > > transfer all HW metadata", which is NOT what we want (and a waste of
> > > precious cycles).
> > >
> > > For our use-case, it doesn't make sense to "transfer all HW metadata".
> > > In fact we don't even want to read the hardware RH-hash, because we already
> > > know it is wrong (for tunnels), we just want to override the RX-hash used at
> > > SKB creation. We do want the BPF programmers flexibility to call these
> > > kfuncs individually (when relevant).
> > >
> > > > - https://lore.kernel.org/netdev/20250616145523.63bd2577@kernel.org/
> > >
> > > I feel pressured into critiquing Jakub's suggestion, hope this is not too
> > > harsh. First of all it is not relevant to our this patchset use-case, as it
> > > focus on all HW metadata.
> >
> > [..]
> >
> > > Second, I disagree with the idea/mental model of storing in a
> > > "driver-specific format". The current implementation of driver-specific
> > > kfunc helpers that "get the metadata" is already doing a conversion to a
> > > common format, because the BPF-programmer naturally needs this to be the
> > > same across drivers. Thus, it doesn't make sense to store it back in a
> > > "driver-specific format", as that just complicate things. My mental model
> > > is thus, that after the driver-specific "get" operation to result is in a
> > > common format, that is simply defined by the struct type of the kfunc, which
> > > is both known by the kernel and BPF-prog.
> >
> > Having get/set model seems a bit more generic, no? Potentially giving us the
> > ability to "correct" HW metadata for the non-redirected cases as well.
> > Plus we don't hard-code the (internal) layout. Solving only xdp_redirect
> > seems a bit too narrow, idk..
>
> I can't see what the non-redirected use-case could be. Can you please provide
> more details?
> Moreover, can it be solved without storing the rx_hash (or the other
> hw-metadata) in a non-driver specific format?
Having setters feels more generic than narrowly solving only the redirect,
but I don't have a good use-case in mind.
> Storing the hw-metadata in some of hw-specific format in xdp_frame will not
> allow to consume them directly building the skb and we will require to decode
> them again. What is the upside/use-case of this approach? (not considering the
> orthogonality with the get method).
If we add the store kfuncs to regular drivers, the metadata won't be stored
in the xdp_frame; it will go into the rx descriptors, so the regular path that
builds skbs will use it.
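A hypothetical driver-side shape of that idea (the op name, struct and
fields below are invented for illustration and do not exist today):
instead of touching the DMA descriptor, the driver keeps the corrected
value in its own per-buffer state, which its normal skb-build path reads.

struct mydrv_xdp_buff {
        struct xdp_buff xdp;    /* must be first so the cast below works */
        u32 rx_hash;            /* seeded from the HW descriptor at RX time */
        bool rx_hash_valid;
};

static int mydrv_xmo_rx_hash_store(struct xdp_md *ctx, u32 hash,
                                   enum xdp_rss_hash_type type)
{
        struct mydrv_xdp_buff *buf = (struct mydrv_xdp_buff *)ctx;

        buf->rx_hash = hash;            /* override the HW-provided hash */
        buf->rx_hash_valid = true;
        return 0;
}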
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-11 16:04 ` Stanislav Fomichev
@ 2025-07-16 11:17 ` Lorenzo Bianconi
2025-07-16 21:20 ` Jakub Kicinski
0 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Bianconi @ 2025-07-16 11:17 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Jesper Dangaard Brouer, bpf, netdev, Jakub Kicinski,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub
On Jul 11, Stanislav Fomichev wrote:
> On 07/09, Lorenzo Bianconi wrote:
> > On Jul 07, Stanislav Fomichev wrote:
> > > On 07/03, Jesper Dangaard Brouer wrote:
> > > >
> > > >
> > > > On 02/07/2025 18.05, Stanislav Fomichev wrote:
> > > > > On 07/02, Jesper Dangaard Brouer wrote:
> > > > > > This patch series introduces a mechanism for an XDP program to store RX
> > > > > > metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
> > > > > > into the xdp_frame. These stored hints are then used to populate the
> > > > > > corresponding fields in the SKB that is created from the xdp_frame
> > > > > > following an XDP_REDIRECT.
> > > > > >
> > > > > > The chosen RX metadata hints intentionally map to the existing NIC
> > > > > > hardware metadata that can be read via kfuncs [1]. While this design
> > > > > > allows a BPF program to read and propagate existing hardware hints, our
> > > > > > primary motivation is to enable setting custom values. This is important
> > > > > > for use cases where the hardware-provided information is insufficient or
> > > > > > needs to be calculated based on packet contents unavailable to the
> > > > > > hardware.
> > > > > >
> > > > > > The primary motivation for this feature is to enable scalable load
> > > > > > balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
> > > > > > packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
> > > > > > the networking stack later calculates a software hash based on the outer
> > > > > > headers. For a single tunnel, these outer headers are often identical,
> > > > > > causing all packets to be assigned the same hash. This collapses all
> > > > > > traffic onto a single RX queue, creating a performance bottleneck and
> > > > > > defeating receive-side scaling (RSS).
> > > > > >
> > > > > > Our immediate use case involves load balancing IPsec traffic. For such
> > > > > > tunnelled traffic, any hardware-provided RX hash is calculated on the
> > > > > > outer headers and is therefore incorrect for distributing inner flows.
> > > > > > There is no reason to read the existing value, as it must be recalculated.
> > > > > > In our XDP program, we perform a partial decryption to access the inner
> > > > > > headers and calculate a new load-balancing hash, which provides better
> > > > > > flow distribution. However, without this patch set, there is no way to
> > > > > > persist this new hash for the network stack to use post-redirect.
> > > > > >
> > > > > > This series solves the problem by introducing new BPF kfuncs that allow an
> > > > > > XDP program to write e.g. the hash value into the xdp_frame. The
> > > > > > __xdp_build_skb_from_frame() function is modified to use this stored value
> > > > > > to set skb->hash on the newly created SKB. As a result, the veth driver's
> > > > > > queue selection logic uses the BPF-supplied hash, achieving proper
> > > > > > traffic distribution across multiple CPU cores. This also ensures that
> > > > > > consumers, like the GRO engine, can operate effectively.
> > > > > >
> > > > > > We considered XDP traits as an alternative to adding static members to
> > > > > > struct xdp_frame. Given the immediate need for this functionality and the
> > > > > > current development status of traits, we believe this approach is a
> > > > > > pragmatic solution. We are open to migrating to a traits-based
> > > > > > implementation if and when they become a generally accepted mechanism for
> > > > > > such extensions.
> > > > > >
> > > > > > [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
> > > > > > ---
> > > > > > V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/
> > > > >
> > > > > No change log?
> > > >
> > > > We have fixed selftest as requested by Alexie.
> > > > And we have updated cover-letter and doc as you Stanislav requested.
> > > >
> > > > >
> > > > > Btw, any feedback on the following from v1?
> > > > > - https://lore.kernel.org/netdev/aFHUd98juIU4Rr9J@mini-arch/
> > > >
> > > > Addressed as updated cover-letter and documentation. I hope this helps
> > > > reviewers understand the use-case, as the discussion turn into "how do we
> > > > transfer all HW metadata", which is NOT what we want (and a waste of
> > > > precious cycles).
> > > >
> > > > For our use-case, it doesn't make sense to "transfer all HW metadata".
> > > > In fact we don't even want to read the hardware RH-hash, because we already
> > > > know it is wrong (for tunnels), we just want to override the RX-hash used at
> > > > SKB creation. We do want the BPF programmers flexibility to call these
> > > > kfuncs individually (when relevant).
> > > >
> > > > > - https://lore.kernel.org/netdev/20250616145523.63bd2577@kernel.org/
> > > >
> > > > I feel pressured into critiquing Jakub's suggestion, hope this is not too
> > > > harsh. First of all it is not relevant to our this patchset use-case, as it
> > > > focus on all HW metadata.
> > >
> > > [..]
> > >
> > > > Second, I disagree with the idea/mental model of storing in a
> > > > "driver-specific format". The current implementation of driver-specific
> > > > kfunc helpers that "get the metadata" is already doing a conversion to a
> > > > common format, because the BPF-programmer naturally needs this to be the
> > > > same across drivers. Thus, it doesn't make sense to store it back in a
> > > > "driver-specific format", as that just complicate things. My mental model
> > > > is thus, that after the driver-specific "get" operation to result is in a
> > > > common format, that is simply defined by the struct type of the kfunc, which
> > > > is both known by the kernel and BPF-prog.
> > >
> > > Having get/set model seems a bit more generic, no? Potentially giving us the
> > > ability to "correct" HW metadata for the non-redirected cases as well.
> > > Plus we don't hard-code the (internal) layout. Solving only xdp_redirect
> > > seems a bit too narrow, idk..
> >
> > I can't see what the non-redirected use-case could be. Can you please provide
> > more details?
> > Moreover, can it be solved without storing the rx_hash (or the other
> > hw-metadata) in a non-driver specific format?
>
> Having setters feels more generic than narrowly solving only the redirect,
> but I don't have a good use-case in mind.
>
> > Storing the hw-metadata in some of hw-specific format in xdp_frame will not
> > allow to consume them directly building the skb and we will require to decode
> > them again. What is the upside/use-case of this approach? (not considering the
> > orthogonality with the get method).
>
> If we add the store kfuncs to regular drivers, the metadata won't be stored
> in the xdp_frame; it will go into the rx descriptors so regular path that
> builds skbs will use it.
IIUC, the described use-case would be to modify the hw metadata via a
'setter' kfunc executed by an eBPF program bound to the NIC, and to store
the new metadata in the DMA descriptor so it can be consumed by the driver
code that builds the skb, right?
If so:
- we can get the same result by just storing (via a kfunc) the modified hw
  metadata in the xdp_buff struct using a well-known/generic layout and
  consuming it in the driver code (e.g. if the bound eBPF program returns
  XDP_PASS) via a generic xdp utility routine (see the sketch below). This
  part is not in the current series.
- Using this approach we are still not preserving the hw metadata if we pass
  the xdp_frame to a remote CPU returning XDP_REDIRECT (we need to add more
  code).
- I am not completely sure we can always modify the DMA descriptor directly
  since it is DMA mapped.
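A rough sketch of such a generic xdp utility routine (not part of this
series; it reuses the xdp_rx_meta layout and the flag helpers from patches
1 and 5), which a driver could call on XDP_PASS right after building the skb:

static inline void xdp_rx_meta_apply_to_skb(const struct xdp_buff *xdp,
                                            struct sk_buff *skb)
{
        if (xdp_buff_has_rx_meta_hash(xdp))
                skb_set_hash(skb, xdp->rx_meta->hash.val,
                             xdp->rx_meta->hash.type & XDP_RSS_L4 ?
                             PKT_HASH_TYPE_L4 : PKT_HASH_TYPE_L3);

        if (xdp_buff_has_rx_meta_vlan(xdp))
                __vlan_hwaccel_put_tag(skb, xdp->rx_meta->vlan.proto,
                                       xdp->rx_meta->vlan.tci);
}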
What do you think?
Regards,
Lorenzo
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-16 11:17 ` Lorenzo Bianconi
@ 2025-07-16 21:20 ` Jakub Kicinski
2025-07-17 13:08 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-07-16 21:20 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Stanislav Fomichev, Jesper Dangaard Brouer, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub
On Wed, 16 Jul 2025 13:17:53 +0200 Lorenzo Bianconi wrote:
> > > I can't see what the non-redirected use-case could be. Can you please provide
> > > more details?
> > > Moreover, can it be solved without storing the rx_hash (or the other
> > > hw-metadata) in a non-driver specific format?
> >
> > Having setters feels more generic than narrowly solving only the redirect,
> > but I don't have a good use-case in mind.
> >
> > > Storing the hw-metadata in some of hw-specific format in xdp_frame will not
> > > allow to consume them directly building the skb and we will require to decode
> > > them again. What is the upside/use-case of this approach? (not considering the
> > > orthogonality with the get method).
> >
> > If we add the store kfuncs to regular drivers, the metadata won't be stored
> > in the xdp_frame; it will go into the rx descriptors so regular path that
> > builds skbs will use it.
>
> IIUC, the described use-case would be to modify the hw metadata via a
> 'setter' kfunc executed by an eBPF program bounded to the NIC and to store
> the new metadata in the DMA descriptor in order to be consumed by the driver
> codebase building the skb, right?
> If so:
> - we can get the same result just storing (running a kfunc) the modified hw
> metadata in the xdp_buff struct using a well-known/generic layout and
> consume it in the driver codebase (e.g. if the bounded eBPF program
> returns XDP_PASS) using a generic xdp utility routine. This part is not in
> the current series.
> - Using this approach we are still not preserving the hw metadata if we pass
> the xdp_frame to a remote CPU returning XDP_REDIRCT (we need to add more
> code)
> - I am not completely sure if can always modify the DMA descriptor directly
> since it is DMA mapped.
>
> What do you think?
FWIW I commented on an earlier revision to similar effect as Stanislav.
To me the main concern is that we're adding another ad-hoc scheme and
making xdp_frame grow into a para-skb. We added XDP to make raw packet
access fast; now we're making drivers convert metadata twice :/
* Re: [PATCH bpf-next V2 1/7] net: xdp: Add xdp_rx_meta structure
2025-07-02 14:58 ` [PATCH bpf-next V2 1/7] net: xdp: Add xdp_rx_meta structure Jesper Dangaard Brouer
@ 2025-07-17 9:19 ` Jakub Sitnicki
2025-07-17 14:40 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Sitnicki @ 2025-07-17 9:19 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur
On Wed, Jul 02, 2025 at 04:58 PM +02, Jesper Dangaard Brouer wrote:
> From: Lorenzo Bianconi <lorenzo@kernel.org>
>
> Introduce the `xdp_rx_meta` structure to serve as a container for XDP RX
> hardware hints within XDP packet buffers. Initially, this structure will
> accommodate `rx_hash` and `rx_vlan` metadata. (The `rx_timestamp` hint will
> get stored in `skb_shared_info`).
>
> A key design aspect is making this metadata accessible both during BPF
> program execution (via `struct xdp_buff`) and later if an `struct
> xdp_frame` is materialized (e.g., for XDP_REDIRECT).
> To achieve this:
> - The `struct xdp_frame` embeds an `xdp_rx_meta` field directly for
> storage.
> - The `struct xdp_buff` includes an `xdp_rx_meta` pointer. This pointer
> is initialized (in `xdp_prepare_buff`) to point to the memory location
> within the packet buffer's headroom where the `xdp_frame`'s embedded
> `rx_meta` field would reside.
>
> This setup allows BPF kfuncs, operating on `xdp_buff`, to populate the
> metadata in the precise location where it will be found if an `xdp_frame`
> is subsequently created.
>
> The availability of this metadata storage area within the buffer is
> indicated by the `XDP_FLAGS_META_AREA` flag in `xdp_buff->flags` (and
> propagated to `xdp_frame->flags`). This flag is only set if sufficient
> headroom (at least `XDP_MIN_HEADROOM`, currently 192 bytes) is present.
> Specific hints like `XDP_FLAGS_META_RX_HASH` and `XDP_FLAGS_META_RX_VLAN`
> will then denote which types of metadata have been populated into the
> `xdp_rx_meta` structure.
>
> This patch is a step for enabling the preservation and use of XDP RX
> hints across operations like XDP_REDIRECT.
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> include/net/xdp.h | 57 +++++++++++++++++++++++++++++++++++------------
> net/core/xdp.c | 1 +
> net/xdp/xsk_buff_pool.c | 4 ++-
> 3 files changed, 47 insertions(+), 15 deletions(-)
>
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index b40f1f96cb11..f52742a25212 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -71,11 +71,31 @@ struct xdp_txq_info {
> struct net_device *dev;
> };
>
> +struct xdp_rx_meta {
> + struct xdp_rx_meta_hash {
> + u32 val;
> + u32 type; /* enum xdp_rss_hash_type */
> + } hash;
> + struct xdp_rx_meta_vlan {
> + __be16 proto;
> + u16 tci;
> + } vlan;
> +};
> +
> +/* Storage area for HW RX metadata only available with reasonable headroom
> + * available. Less than XDP_PACKET_HEADROOM due to Intel drivers.
> + */
> +#define XDP_MIN_HEADROOM 192
> +
> enum xdp_buff_flags {
> XDP_FLAGS_HAS_FRAGS = BIT(0), /* non-linear xdp buff */
> XDP_FLAGS_FRAGS_PF_MEMALLOC = BIT(1), /* xdp paged memory is under
> * pressure
> */
> + XDP_FLAGS_META_AREA = BIT(2), /* storage area available */
Idea: Perhaps this could be called *HW*_META_AREA to differentiate from
the existing custom metadata area:
https://docs.kernel.org/networking/xdp-rx-metadata.html#af-xdp
> + XDP_FLAGS_META_RX_HASH = BIT(3), /* hw rx hash */
> + XDP_FLAGS_META_RX_VLAN = BIT(4), /* hw rx vlan */
> + XDP_FLAGS_META_RX_TS = BIT(5), /* hw rx timestamp */
> };
>
[...]
* Re: [PATCH bpf-next V2 5/7] net: veth: Read xdp metadata from rx_meta struct if available
2025-07-02 14:58 ` [PATCH bpf-next V2 5/7] net: veth: Read xdp metadata from rx_meta struct if available Jesper Dangaard Brouer
@ 2025-07-17 12:11 ` Jakub Sitnicki
0 siblings, 0 replies; 43+ messages in thread
From: Jakub Sitnicki @ 2025-07-17 12:11 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur
On Wed, Jul 02, 2025 at 04:58 PM +02, Jesper Dangaard Brouer wrote:
> From: Lorenzo Bianconi <lorenzo@kernel.org>
>
> Report xdp_rx_meta info if available in xdp_buff struct in
> xdp_metadata_ops callbacks for veth driver
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
> drivers/net/veth.c | 12 +++++++++++
> include/net/xdp.h | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 69 insertions(+)
[...]
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 3d1a9711fe82..2b495feedfb0 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -158,6 +158,23 @@ static __always_inline bool xdp_buff_has_valid_meta_area(struct xdp_buff *xdp)
> return !!(xdp->flags & XDP_FLAGS_META_AREA);
> }
>
> +static __always_inline bool
> +xdp_buff_has_rx_meta_hash(const struct xdp_buff *xdp)
> +{
> + return !!(xdp->flags & XDP_FLAGS_META_RX_HASH);
> +}
> +
> +static __always_inline bool
> +xdp_buff_has_rx_meta_vlan(const struct xdp_buff *xdp)
> +{
> + return !!(xdp->flags & XDP_FLAGS_META_RX_VLAN);
> +}
> +
> +static __always_inline bool xdp_buff_has_rx_meta_ts(const struct xdp_buff *xdp)
> +{
> + return !!(xdp->flags & XDP_FLAGS_META_RX_TS);
> +}
> +
> static __always_inline void
> xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq)
> {
Nit: Why not have one set of generic helpers (macros) for checking if
the flags are set? If you want strict type checking, you can
additionally use _Generic type dispatch.
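Something along these lines, for illustration (a sketch, not from the
series; const-qualified pointer variants omitted for brevity). It works for
both xdp_buff and xdp_frame, since both carry the same flags word, while
_Generic rejects any other pointer type at compile time:

#define xdp_rx_meta_test(ptr, flag)                                     \
        (_Generic((ptr),                                                \
                  struct xdp_buff *:  !!((ptr)->flags & (flag)),        \
                  struct xdp_frame *: !!((ptr)->flags & (flag))))

/* e.g.: if (xdp_rx_meta_test(xdp, XDP_FLAGS_META_RX_HASH)) ... */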
[...]
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-16 21:20 ` Jakub Kicinski
@ 2025-07-17 13:08 ` Jesper Dangaard Brouer
2025-07-18 1:25 ` Jakub Kicinski
2025-07-18 9:55 ` Lorenzo Bianconi
0 siblings, 2 replies; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-17 13:08 UTC (permalink / raw)
To: Jakub Kicinski, Lorenzo Bianconi
Cc: Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur, jakub, Jesse Brandeburg
On 16/07/2025 23.20, Jakub Kicinski wrote:
> On Wed, 16 Jul 2025 13:17:53 +0200 Lorenzo Bianconi wrote:
>>>> I can't see what the non-redirected use-case could be. Can you please provide
>>>> more details?
>>>> Moreover, can it be solved without storing the rx_hash (or the other
>>>> hw-metadata) in a non-driver specific format?
>>>
>>> Having setters feels more generic than narrowly solving only the redirect,
>>> but I don't have a good use-case in mind.
>>>
>>>> Storing the hw-metadata in some of hw-specific format in xdp_frame will not
>>>> allow to consume them directly building the skb and we will require to decode
>>>> them again. What is the upside/use-case of this approach? (not considering the
>>>> orthogonality with the get method).
>>>
>>> If we add the store kfuncs to regular drivers, the metadata won't be stored
>>> in the xdp_frame; it will go into the rx descriptors so regular path that
>>> builds skbs will use it.
>>
>> IIUC, the described use-case would be to modify the hw metadata via a
>> 'setter' kfunc executed by an eBPF program bounded to the NIC and to store
>> the new metadata in the DMA descriptor in order to be consumed by the driver
>> codebase building the skb, right?
>> If so:
>> - we can get the same result just storing (running a kfunc) the modified hw
>> metadata in the xdp_buff struct using a well-known/generic layout and
>> consume it in the driver codebase (e.g. if the bounded eBPF program
>> returns XDP_PASS) using a generic xdp utility routine. This part is not in
>> the current series.
>> - Using this approach we are still not preserving the hw metadata if we pass
>> the xdp_frame to a remote CPU returning XDP_REDIRCT (we need to add more
>> code)
>> - I am not completely sure if can always modify the DMA descriptor directly
>> since it is DMA mapped.
Let me explain why writing into the RX descriptors is a bad idea.
The DMA descriptors are allocated as coherent DMA (dma_alloc_coherent).
This is memory shared with the NIC hardware device, which implies
cache-line coherence. NIC performance is tightly coupled to limiting
cache misses for descriptors. One common trick is to pack several
descriptors into a single cache-line. Thus, if we start to write into
the current RX-descriptor, we invalidate that cache-line as seen from
the device, and the next RX-descriptor (from the same cache-line) will
be in an unfortunate coherence state. Behind the scenes this might lead
to extra PCIe transactions.
By writing to the xdp_frame, we don't have to modify the DMA descriptors
directly and risk invalidating cache lines for the NIC.
>>
>> What do you think?
>
> FWIW I commented on an earlier revision to similar effect as Stanislav.
> To me the main concern is that we're adding another adhoc scheme, and
> are making xdp_frame grow into a para-skb. We added XDP to make raw
> packet access fast, now we're making drivers convert metadata twice :/
Thanks for the feedback. I can see why you'd be concerned about adding
another adhoc scheme or making xdp_frame grow into a "para-skb".
However, I'd like to frame this as part of a long-term plan we've been
calling the "mini-SKB" concept. This isn't a new idea, but a
continuation of architectural discussions from as far back as [2016].
The long-term goal, described in these presentations from [2018] and
[2019], has always been to evolve the xdp_frame to handle more hardware
offloads, with the ultimate vision of moving SKB allocation out of NIC
drivers entirely. In the future, the netstack could perform L3
forwarding (and L2 bridging) directly on these enhanced xdp_frames
[2019-slide20]. The main blocker for this vision has been the lack of
hardware metadata in the xdp_frame.
This patchset is a small but necessary first step towards that goal. It
focuses on the concrete XDP_REDIRECT use-case, where our production
use-case benefits immediately. Storing this metadata in the
xdp_frame is fundamental to the plan. It's no coincidence the fields are
compatible with the SKB; they need to be.
I'm certainly open to debating the bigger picture, but I hope we can
agree that it shouldn't hold up this first step, which solves an
immediate need. Perhaps we can evaluate the merits of this specific
change first, and discuss the overall architecture in parallel?
--Jesper
Links:
------
[2019] XDP closer integration with network stack
- https://people.netfilter.org/hawk/presentations/KernelRecipes2019/xdp-netstack-concert.pdf
- https://github.com/xdp-project/xdp-project/blob/main/conference/KernelRecipes2019/xdp-netstack-concert.org#slide-move-skb-allocations-out-of-nic-drivers
- [2019-slide20] https://github.com/xdp-project/xdp-project/blob/main/conference/KernelRecipes2019/xdp-netstack-concert.org#slide-fun-with-xdp_frame-before-skb-alloc
[2018] LPC Networking Track: XDP - challenges and future work
- https://people.netfilter.org/hawk/presentations/LinuxPlumbers2018/
- https://github.com/xdp-project/xdp-project/blob/main/conference/LinuxPlumbers2018/presentation-lpc2018-xdp-future.org#topic-moving-skb-allocation-out-of-driver
[2016] Network Performance Workshop
- https://people.netfilter.org/hawk/presentations/NetDev1.2_2016/net_performance_workshop_netdev1.2.pdf
* Re: [PATCH bpf-next V2 1/7] net: xdp: Add xdp_rx_meta structure
2025-07-17 9:19 ` Jakub Sitnicki
@ 2025-07-17 14:40 ` Jesper Dangaard Brouer
2025-07-18 10:33 ` Jakub Sitnicki
0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-17 14:40 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur
On 17/07/2025 11.19, Jakub Sitnicki wrote:
> On Wed, Jul 02, 2025 at 04:58 PM +02, Jesper Dangaard Brouer wrote:
>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>
>> Introduce the `xdp_rx_meta` structure to serve as a container for XDP RX
>> hardware hints within XDP packet buffers. Initially, this structure will
>> accommodate `rx_hash` and `rx_vlan` metadata. (The `rx_timestamp` hint will
>> get stored in `skb_shared_info`).
>>
>> A key design aspect is making this metadata accessible both during BPF
>> program execution (via `struct xdp_buff`) and later if an `struct
>> xdp_frame` is materialized (e.g., for XDP_REDIRECT).
>> To achieve this:
>> - The `struct xdp_frame` embeds an `xdp_rx_meta` field directly for
>> storage.
>> - The `struct xdp_buff` includes an `xdp_rx_meta` pointer. This pointer
>> is initialized (in `xdp_prepare_buff`) to point to the memory location
>> within the packet buffer's headroom where the `xdp_frame`'s embedded
>> `rx_meta` field would reside.
>>
>> This setup allows BPF kfuncs, operating on `xdp_buff`, to populate the
>> metadata in the precise location where it will be found if an `xdp_frame`
>> is subsequently created.
>>
>> The availability of this metadata storage area within the buffer is
>> indicated by the `XDP_FLAGS_META_AREA` flag in `xdp_buff->flags` (and
>> propagated to `xdp_frame->flags`). This flag is only set if sufficient
>> headroom (at least `XDP_MIN_HEADROOM`, currently 192 bytes) is present.
>> Specific hints like `XDP_FLAGS_META_RX_HASH` and `XDP_FLAGS_META_RX_VLAN`
>> will then denote which types of metadata have been populated into the
>> `xdp_rx_meta` structure.
>>
>> This patch is a step for enabling the preservation and use of XDP RX
>> hints across operations like XDP_REDIRECT.
>>
>> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>> include/net/xdp.h | 57 +++++++++++++++++++++++++++++++++++------------
>> net/core/xdp.c | 1 +
>> net/xdp/xsk_buff_pool.c | 4 ++-
>> 3 files changed, 47 insertions(+), 15 deletions(-)
>>
>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>> index b40f1f96cb11..f52742a25212 100644
>> --- a/include/net/xdp.h
>> +++ b/include/net/xdp.h
>> @@ -71,11 +71,31 @@ struct xdp_txq_info {
>> struct net_device *dev;
>> };
>>
>> +struct xdp_rx_meta {
>> + struct xdp_rx_meta_hash {
>> + u32 val;
>> + u32 type; /* enum xdp_rss_hash_type */
>> + } hash;
>> + struct xdp_rx_meta_vlan {
>> + __be16 proto;
>> + u16 tci;
>> + } vlan;
>> +};
>> +
>> +/* Storage area for HW RX metadata only available with reasonable headroom
>> + * available. Less than XDP_PACKET_HEADROOM due to Intel drivers.
>> + */
>> +#define XDP_MIN_HEADROOM 192
>> +
>> enum xdp_buff_flags {
>> XDP_FLAGS_HAS_FRAGS = BIT(0), /* non-linear xdp buff */
>> XDP_FLAGS_FRAGS_PF_MEMALLOC = BIT(1), /* xdp paged memory is under
>> * pressure
>> */
>> + XDP_FLAGS_META_AREA = BIT(2), /* storage area available */
>
> Idea: Perhaps this could be called *HW*_META_AREA to differentiate from
> the existing custom metadata area:
>
I agree that calling it META_AREA can easily be misunderstood and
confused with metadata or data_meta.
What do you think about renaming this to "hints"?
E.g. XDP_FLAGS_HINTS_AREA
or XDP_FLAGS_HINTS_AVAIL
And also renaming XDP_FLAGS_META_RX_* accordingly,
e.g. XDP_FLAGS_META_RX_HASH -> XDP_FLAGS_HINT_RX_HASH
or XDP_FLAGS_HW_HINT_RX_HASH
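Purely illustrative, what those names might look like in the enum from
patch 1 if the "hints" spelling were picked (bit positions unchanged):

enum xdp_buff_flags {
        XDP_FLAGS_HAS_FRAGS             = BIT(0),
        XDP_FLAGS_FRAGS_PF_MEMALLOC     = BIT(1),
        XDP_FLAGS_HINTS_AREA            = BIT(2), /* was XDP_FLAGS_META_AREA */
        XDP_FLAGS_HINT_RX_HASH          = BIT(3), /* was XDP_FLAGS_META_RX_HASH */
        XDP_FLAGS_HINT_RX_VLAN          = BIT(4), /* was XDP_FLAGS_META_RX_VLAN */
        XDP_FLAGS_HINT_RX_TS            = BIT(5), /* was XDP_FLAGS_META_RX_TS */
};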
> https://docs.kernel.org/networking/xdp-rx-metadata.html#af-xdp
>
>> + XDP_FLAGS_META_RX_HASH = BIT(3), /* hw rx hash */
>> + XDP_FLAGS_META_RX_VLAN = BIT(4), /* hw rx vlan */
>> + XDP_FLAGS_META_RX_TS = BIT(5), /* hw rx timestamp */
>> };
>>
>
> [...]
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-17 13:08 ` Jesper Dangaard Brouer
@ 2025-07-18 1:25 ` Jakub Kicinski
2025-07-18 10:56 ` Jesper Dangaard Brouer
2025-07-18 9:55 ` Lorenzo Bianconi
1 sibling, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-07-18 1:25 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Lorenzo Bianconi, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg
On Thu, 17 Jul 2025 15:08:49 +0200 Jesper Dangaard Brouer wrote:
> Let me explain why it is a bad idea of writing into the RX descriptors.
> The DMA descriptors are allocated as coherent DMA (dma_alloc_coherent).
> This is memory that is shared with the NIC hardware device, which
> implies cache-line coherence. NIC performance is tightly coupled to
> limiting cache misses for descriptors. One common trick is to pack more
> descriptors into a single cache-line. Thus, if we start to write into
> the current RX-descriptor, then we invalidate that cache-line seen from
> the device, and next RX-descriptor (from this cache-line) will be in an
> unfortunate coherent state. Behind the scene this might lead to some
> extra PCIe transactions.
>
> By writing to the xdp_frame, we don't have to modify the DMA descriptors
> directly and risk invalidating cache lines for the NIC.
I thought your main use case is redirected packets. In that case it's
the _remote_ end that's writing its metadata, and if it's veth it's
obviously not going to be doing it into DMA-coherent memory.
The metadata travels between the source and destination in a
program-defined format.
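For reference, the program-defined flow referred to above already works
along these lines today (a sketch; TARGET_IFINDEX, the struct layout and
the hash value are placeholders): the redirecting program prepends its own
struct via bpf_xdp_adjust_meta(), and whatever consumes the frame on the
other side of the redirect parses the same layout from its data_meta.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define TARGET_IFINDEX 5        /* placeholder: the veth peer's ifindex */

struct my_rx_hints {            /* layout is private to the two programs */
        __u32 hash;
        __u16 vlan_proto;
        __u16 vlan_tci;
};

SEC("xdp")
int xdp_store_hints(struct xdp_md *ctx)
{
        struct my_rx_hints *hints;

        /* reserve space in front of the packet for the custom metadata;
         * the length must be a multiple of 4 bytes
         */
        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*hints)))
                return XDP_PASS;

        hints = (void *)(long)ctx->data_meta;
        if ((void *)(hints + 1) > (void *)(long)ctx->data)
                return XDP_PASS;

        hints->hash = 0x12345678;       /* e.g. a hash over inner headers */
        hints->vlan_proto = 0;
        hints->vlan_tci = 0;

        return bpf_redirect(TARGET_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";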
> Thanks for the feedback. I can see why you'd be concerned about adding
> another adhoc scheme or making xdp_frame grow into a "para-skb".
>
> However, I'd like to frame this as part of a long-term plan we've been
> calling the "mini-SKB" concept. This isn't a new idea, but a
> continuation of architectural discussions from as far back as [2016].
My understanding is that while this was floated as a plan by some,
nobody came up with a clean way of implementing it.
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-17 13:08 ` Jesper Dangaard Brouer
2025-07-18 1:25 ` Jakub Kicinski
@ 2025-07-18 9:55 ` Lorenzo Bianconi
1 sibling, 0 replies; 43+ messages in thread
From: Lorenzo Bianconi @ 2025-07-18 9:55 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Jakub Kicinski, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg
>
>
> On 16/07/2025 23.20, Jakub Kicinski wrote:
> > On Wed, 16 Jul 2025 13:17:53 +0200 Lorenzo Bianconi wrote:
> > > > > I can't see what the non-redirected use-case could be. Can you please provide
> > > > > more details?
> > > > > Moreover, can it be solved without storing the rx_hash (or the other
> > > > > hw-metadata) in a non-driver specific format?
> > > >
> > > > Having setters feels more generic than narrowly solving only the redirect,
> > > > but I don't have a good use-case in mind.
> > > > > Storing the hw-metadata in some of hw-specific format in xdp_frame will not
> > > > > allow to consume them directly building the skb and we will require to decode
> > > > > them again. What is the upside/use-case of this approach? (not considering the
> > > > > orthogonality with the get method).
> > > >
> > > > If we add the store kfuncs to regular drivers, the metadata won't be stored
> > > > in the xdp_frame; it will go into the rx descriptors so regular path that
> > > > builds skbs will use it.
> > >
> > > IIUC, the described use-case would be to modify the hw metadata via a
> > > 'setter' kfunc executed by an eBPF program bounded to the NIC and to store
> > > the new metadata in the DMA descriptor in order to be consumed by the driver
> > > codebase building the skb, right?
> > > If so:
> > > - we can get the same result just storing (running a kfunc) the modified hw
> > > metadata in the xdp_buff struct using a well-known/generic layout and
> > > consume it in the driver codebase (e.g. if the bounded eBPF program
> > > returns XDP_PASS) using a generic xdp utility routine. This part is not in
> > > the current series.
> > > - Using this approach we are still not preserving the hw metadata if we pass
> > > the xdp_frame to a remote CPU returning XDP_REDIRCT (we need to add more
> > > code)
> > > - I am not completely sure if can always modify the DMA descriptor directly
> > > since it is DMA mapped.
>
> Let me explain why it is a bad idea of writing into the RX descriptors.
> The DMA descriptors are allocated as coherent DMA (dma_alloc_coherent).
> This is memory that is shared with the NIC hardware device, which
> implies cache-line coherence. NIC performance is tightly coupled to
> limiting cache misses for descriptors. One common trick is to pack more
> descriptors into a single cache-line. Thus, if we start to write into
> the current RX-descriptor, then we invalidate that cache-line seen from
> the device, and next RX-descriptor (from this cache-line) will be in an
> unfortunate coherent state. Behind the scene this might lead to some
> extra PCIe transactions.
>
> By writing to the xdp_frame, we don't have to modify the DMA descriptors
> directly and risk invalidating cache lines for the NIC.
>
> > >
> > > What do you think?
> >
> > FWIW I commented on an earlier revision to similar effect as Stanislav.
> > To me the main concern is that we're adding another adhoc scheme, and
> > are making xdp_frame grow into a para-skb. We added XDP to make raw
> > packet access fast, now we're making drivers convert metadata twice :/
>
> Thanks for the feedback. I can see why you'd be concerned about adding
> another adhoc scheme or making xdp_frame grow into a "para-skb".
>
> However, I'd like to frame this as part of a long-term plan we've been
> calling the "mini-SKB" concept. This isn't a new idea, but a
> continuation of architectural discussions from as far back as [2016].
>
> The long-term goal, described in these presentations from [2018] and
> [2019], has always been to evolve the xdp_frame to handle more hardware
> offloads, with the ultimate vision of moving SKB allocation out of NIC
> drivers entirely. In the future, the netstack could perform L3
> forwarding (and L2 bridging) directly on these enhanced xdp_frames
> [2019-slide20]. The main blocker for this vision has been the lack of
> hardware metadata in the xdp_frame.
>
> This patchset is a small but necessary first step towards that goal. It
> focuses on the concrete XDP_REDIRECT use-case where we can immediately
> benefit for our production use-case. Storing this metadata in the
> xdp_frame is fundamental to the plan. It's no coincidence the fields are
> compatible with the SKB; they need to be.
>
> I'm certainly open to debating the bigger picture, but I hope we can
> agree that it shouldn't hold up this first step, which solves an
> immediate need. Perhaps we can evaluate the merits of this specific
> change first, and discuss the overall architecture in parallel?
Considering the XDP_REDIRECT use-case, this series will (in the future) allow
us to avoid recomputing the packet checksum when redirecting the frame into
a veth and then into a container, which is a significant performance
improvement.
Regards,
Lorenzo
>
> --Jesper
>
>
> Links:
> ------
> [2019] XDP closer integration with network stack
> - https://people.netfilter.org/hawk/presentations/KernelRecipes2019/xdp-netstack-concert.pdf
> - https://github.com/xdp-project/xdp-project/blob/main/conference/KernelRecipes2019/xdp-netstack-concert.org#slide-move-skb-allocations-out-of-nic-drivers
> - [2019-slide20] https://github.com/xdp-project/xdp-project/blob/main/conference/KernelRecipes2019/xdp-netstack-concert.org#slide-fun-with-xdp_frame-before-skb-alloc
>
> [2018] LPC Networking Track: XDP - challenges and future work
> - https://people.netfilter.org/hawk/presentations/LinuxPlumbers2018/
> - https://github.com/xdp-project/xdp-project/blob/main/conference/LinuxPlumbers2018/presentation-lpc2018-xdp-future.org#topic-moving-skb-allocation-out-of-driver
>
> [2016] Network Performance Workshop
> - https://people.netfilter.org/hawk/presentations/NetDev1.2_2016/net_performance_workshop_netdev1.2.pdf
* Re: [PATCH bpf-next V2 1/7] net: xdp: Add xdp_rx_meta structure
2025-07-17 14:40 ` Jesper Dangaard Brouer
@ 2025-07-18 10:33 ` Jakub Sitnicki
0 siblings, 0 replies; 43+ messages in thread
From: Jakub Sitnicki @ 2025-07-18 10:33 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, netdev, Jakub Kicinski, lorenzo, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur
On Thu, Jul 17, 2025 at 04:40 PM +02, Jesper Dangaard Brouer wrote:
> On 17/07/2025 11.19, Jakub Sitnicki wrote:
>> On Wed, Jul 02, 2025 at 04:58 PM +02, Jesper Dangaard Brouer wrote:
>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>>
>>> Introduce the `xdp_rx_meta` structure to serve as a container for XDP RX
>>> hardware hints within XDP packet buffers. Initially, this structure will
>>> accommodate `rx_hash` and `rx_vlan` metadata. (The `rx_timestamp` hint will
>>> get stored in `skb_shared_info`).
>>>
>>> A key design aspect is making this metadata accessible both during BPF
>>> program execution (via `struct xdp_buff`) and later if an `struct
>>> xdp_frame` is materialized (e.g., for XDP_REDIRECT).
>>> To achieve this:
>>> - The `struct xdp_frame` embeds an `xdp_rx_meta` field directly for
>>> storage.
>>> - The `struct xdp_buff` includes an `xdp_rx_meta` pointer. This pointer
>>> is initialized (in `xdp_prepare_buff`) to point to the memory location
>>> within the packet buffer's headroom where the `xdp_frame`'s embedded
>>> `rx_meta` field would reside.
>>>
>>> This setup allows BPF kfuncs, operating on `xdp_buff`, to populate the
>>> metadata in the precise location where it will be found if an `xdp_frame`
>>> is subsequently created.
>>>
>>> The availability of this metadata storage area within the buffer is
>>> indicated by the `XDP_FLAGS_META_AREA` flag in `xdp_buff->flags` (and
>>> propagated to `xdp_frame->flags`). This flag is only set if sufficient
>>> headroom (at least `XDP_MIN_HEADROOM`, currently 192 bytes) is present.
>>> Specific hints like `XDP_FLAGS_META_RX_HASH` and `XDP_FLAGS_META_RX_VLAN`
>>> will then denote which types of metadata have been populated into the
>>> `xdp_rx_meta` structure.
>>>
>>> This patch is a step for enabling the preservation and use of XDP RX
>>> hints across operations like XDP_REDIRECT.
>>>
>>> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
>>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>> ---
>>> include/net/xdp.h | 57 +++++++++++++++++++++++++++++++++++------------
>>> net/core/xdp.c | 1 +
>>> net/xdp/xsk_buff_pool.c | 4 ++-
>>> 3 files changed, 47 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>> index b40f1f96cb11..f52742a25212 100644
>>> --- a/include/net/xdp.h
>>> +++ b/include/net/xdp.h
>>> @@ -71,11 +71,31 @@ struct xdp_txq_info {
>>> struct net_device *dev;
>>> };
>>> +struct xdp_rx_meta {
>>> + struct xdp_rx_meta_hash {
>>> + u32 val;
>>> + u32 type; /* enum xdp_rss_hash_type */
>>> + } hash;
>>> + struct xdp_rx_meta_vlan {
>>> + __be16 proto;
>>> + u16 tci;
>>> + } vlan;
>>> +};
>>> +
>>> +/* Storage area for HW RX metadata only available with reasonable headroom
>>> + * available. Less than XDP_PACKET_HEADROOM due to Intel drivers.
>>> + */
>>> +#define XDP_MIN_HEADROOM 192
>>> +
>>> enum xdp_buff_flags {
>>> XDP_FLAGS_HAS_FRAGS = BIT(0), /* non-linear xdp buff */
>>> XDP_FLAGS_FRAGS_PF_MEMALLOC = BIT(1), /* xdp paged memory is under
>>> * pressure
>>> */
>>> + XDP_FLAGS_META_AREA = BIT(2), /* storage area available */
>> Idea: Perhaps this could be called *HW*_META_AREA to differentiate from
>> the existing custom metadata area:
>>
>
> I agree that calling it META_AREA can easily be misunderstood and confused with
> metadata or data_meta.
>
> What do you think about renaming this to "hints" ?
> E.g. XDP_FLAGS_HINTS_AREA
> or XDP_FLAGS_HINTS_AVAIL
>
> And also renaming XDP_FLAGS_META_RX_* to
> e.g XDP_FLAGS_META_RX_HASH -> XDP_FLAGS_HINT_RX_HASH
> or XDP_FLAGS_HW_HINT_RX_HASH
Any name that doesn't lean on the already overloaded "metadata" term is
a better alternative, in my mind :-)
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-18 1:25 ` Jakub Kicinski
@ 2025-07-18 10:56 ` Jesper Dangaard Brouer
2025-07-22 1:13 ` Jakub Kicinski
0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-18 10:56 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Lorenzo Bianconi, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg
On 18/07/2025 03.25, Jakub Kicinski wrote:
> On Thu, 17 Jul 2025 15:08:49 +0200 Jesper Dangaard Brouer wrote:
>> Let me explain why it is a bad idea of writing into the RX descriptors.
>> The DMA descriptors are allocated as coherent DMA (dma_alloc_coherent).
>> This is memory that is shared with the NIC hardware device, which
>> implies cache-line coherence. NIC performance is tightly coupled to
>> limiting cache misses for descriptors. One common trick is to pack more
>> descriptors into a single cache-line. Thus, if we start to write into
>> the current RX-descriptor, then we invalidate that cache-line seen from
>> the device, and next RX-descriptor (from this cache-line) will be in an
>> unfortunate coherent state. Behind the scene this might lead to some
>> extra PCIe transactions.
>>
>> By writing to the xdp_frame, we don't have to modify the DMA descriptors
>> directly and risk invalidating cache lines for the NIC.
>
> I thought you main use case is redirected packets. In which case it's
> the _remote_ end that's writing its metadata, if it's veth it's
> obviously not going to be doing it into DMA coherent memory.
My apologies for the confusion. That entire explanation about the
dangers of writing to RX descriptors was a direct response to
Stanislav's earlier proposal (for the XDP_PASS case, I assume).
You are right that this isn't relevant for redirected xdp_frames,
as there is no access to the original RX-descriptor on a remote CPU or
target device like veth.
>> Thanks for the feedback. I can see why you'd be concerned about adding
>> another adhoc scheme or making xdp_frame grow into a "para-skb".
>>
>> However, I'd like to frame this as part of a long-term plan we've been
>> calling the "mini-SKB" concept. This isn't a new idea, but a
>> continuation of architectural discussions from as far back as [2016].
>
> My understanding is that while this was floated as a plan by some,
> nobody came up with a clean way of implementing it.
I can see why you might think that, but from my perspective, the
xdp_frame *is* the implementation of the mini-SKB concept. We've been
building it incrementally for years. It started as the most minimal
structure possible and has gradually gained more context (e.g. dev_rx,
mem_info/rxq_info, flags, and also uses skb_shared_info with same layout
as SKB).
This patch is simply the next logical step in that existing evolution:
adding hardware metadata to make it more capable, starting with enabling
XDP_REDIRECT offloads. The xdp_frame is our mini-SKB, and this patchset
continues its evolution.
--Jesper
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-18 10:56 ` Jesper Dangaard Brouer
@ 2025-07-22 1:13 ` Jakub Kicinski
2025-07-28 10:53 ` Lorenzo Bianconi
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-07-22 1:13 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Lorenzo Bianconi, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg
On Fri, 18 Jul 2025 12:56:46 +0200 Jesper Dangaard Brouer wrote:
> >> Thanks for the feedback. I can see why you'd be concerned about adding
> >> another adhoc scheme or making xdp_frame grow into a "para-skb".
> >>
> >> However, I'd like to frame this as part of a long-term plan we've been
> >> calling the "mini-SKB" concept. This isn't a new idea, but a
> >> continuation of architectural discussions from as far back as [2016].
> >
> > My understanding is that while this was floated as a plan by some,
> > nobody came up with a clean way of implementing it.
>
> I can see why you might think that, but from my perspective, the
> xdp_frame *is* the implementation of the mini-SKB concept. We've been
> building it incrementally for years. It started as the most minimal
> structure possible and has gradually gained more context (e.g. dev_rx,
> mem_info/rxq_info, flags, and also uses skb_shared_info with same layout
> as SKB).
My understanding was that just adding all the fields to xdp_frame was
considered too wasteful. Otherwise we would have done something along
those lines ~10 years ago :S
> This patch is simply the next logical step in that existing evolution:
> adding hardware metadata to make it more capable, starting with enabling
> XDP_REDIRECT offloads. The xdp_frame is our mini-SKB, and this patchset
> continues its evolution.
I thought one of the goals for mini-skb was to move the skb allocation
out of the drivers. The patches as posted seem to make it the
responsibility of the XDP program to save the metadata. If you're
planning to make drivers populate this metadata by default - why add
the helpers.
Again, I just don't understand how these logically fit into place
vis-a-vis the existing metadata "get" callbacks.
* RE: [PATCH bpf-next V2 6/7] bpf: selftests: Add rx_meta store kfuncs selftest
2025-07-02 14:58 ` [PATCH bpf-next V2 6/7] bpf: selftests: Add rx_meta store kfuncs selftest Jesper Dangaard Brouer
@ 2025-07-23 9:24 ` Bouska, Zdenek
0 siblings, 0 replies; 43+ messages in thread
From: Bouska, Zdenek @ 2025-07-23 9:24 UTC (permalink / raw)
To: Jesper Dangaard Brouer, bpf@vger.kernel.org,
netdev@vger.kernel.org, Jakub Kicinski, lorenzo@kernel.org
Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf@fomichev.me,
kernel-team@cloudflare.com, arthur@arthurfabre.com,
jakub@cloudflare.com
On 02/07/2025 16.59, Jesper Dangaard Brouer wrote:
> From: Lorenzo Bianconi <lorenzo@kernel.org>
>
> Introduce bpf selftests for the XDP rx_meta store kfuncs.
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
> .../testing/selftests/bpf/prog_tests/xdp_rxmeta.c | 166 ++++++++++++++++++++
> .../selftests/bpf/progs/xdp_rxmeta_receiver.c | 44 +++++
> .../selftests/bpf/progs/xdp_rxmeta_redirect.c | 43 +++++
> 3 files changed, 253 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
> create mode 100644 tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
> create mode 100644 tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c b/tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
> new file mode 100644
> index 000000000000..d5c181684ff8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/xdp_rxmeta.c
> @@ -0,0 +1,166 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <bpf/btf.h>
> +#include <linux/if_link.h>
> +
> +#include "xdp_rxmeta_redirect.skel.h"
> +#include "xdp_rxmeta_receiver.skel.h"
> +
> +#define LOCAL_NETNS_NAME "local"
> +#define FWD_NETNS_NAME "forward"
> +#define DST_NETNS_NAME "dest"
> +
> +#define LOCAL_NAME "local"
> +#define FWD0_NAME "fwd0"
> +#define FWD1_NAME "fwd1"
> +#define DST_NAME "dest"
> +
> +#define LOCAL_MAC "00:00:00:00:00:01"
> +#define FWD0_MAC "00:00:00:00:00:02"
> +#define FWD1_MAC "00:00:00:00:01:01"
> +#define DST_MAC "00:00:00:00:01:02"
> +
> +#define LOCAL_ADDR "10.0.0.1"
> +#define FWD0_ADDR "10.0.0.2"
> +#define FWD1_ADDR "20.0.0.1"
> +#define DST_ADDR "20.0.0.2"
> +
> +#define PREFIX_LEN "8"
> +#define NUM_PACKETS 10
> +
> +static int run_ping(const char *dst, int num_ping)
> +{
> + SYS(fail, "ping -c%d -W1 -i0.5 %s >/dev/null", num_ping, dst);
> + return 0;
> +fail:
> + return -1;
> +}
> +
> +void test_xdp_rxmeta(void)
> +{
> + struct xdp_rxmeta_redirect *skel_redirect = NULL;
> + struct xdp_rxmeta_receiver *skel_receiver = NULL;
> + struct bpf_devmap_val val = {};
> + struct nstoken *tok = NULL;
> + struct bpf_program *prog;
> + __u32 key = 0, stats;
> + int ret, index;
> +
> + SYS(out, "ip netns add " LOCAL_NETNS_NAME);
> + SYS(out, "ip netns add " FWD_NETNS_NAME);
> + SYS(out, "ip netns add " DST_NETNS_NAME);
> +
> + tok = open_netns(LOCAL_NETNS_NAME);
> + if (!ASSERT_OK_PTR(tok, "setns"))
> + goto out;
> +
> + SYS(out, "ip link add " LOCAL_NAME " type veth peer " FWD0_NAME);
> + SYS(out, "ip link set " FWD0_NAME " netns " FWD_NETNS_NAME);
> + SYS(out, "ip link set dev " LOCAL_NAME " address " LOCAL_MAC);
> + SYS(out, "ip addr add " LOCAL_ADDR "/" PREFIX_LEN " dev "
> LOCAL_NAME);
> + SYS(out, "ip link set dev " LOCAL_NAME " up");
> + SYS(out, "ip route add default via " FWD0_ADDR);
> + close_netns(tok);
> +
> + tok = open_netns(DST_NETNS_NAME);
> + if (!ASSERT_OK_PTR(tok, "setns"))
> + goto out;
> +
> + SYS(out, "ip link add " DST_NAME " type veth peer " FWD1_NAME);
> + SYS(out, "ip link set " FWD1_NAME " netns " FWD_NETNS_NAME);
> + SYS(out, "ip link set dev " DST_NAME " address " DST_MAC);
> + SYS(out, "ip addr add " DST_ADDR "/" PREFIX_LEN " dev " DST_NAME);
> + SYS(out, "ip link set dev " DST_NAME " up");
> + SYS(out, "ip route add default via " FWD1_ADDR);
> +
> + skel_receiver = xdp_rxmeta_receiver__open();
> + if (!ASSERT_OK_PTR(skel_receiver, "open skel_receiver"))
> + goto out;
> +
> + prog = bpf_object__find_program_by_name(skel_receiver->obj,
> + "xdp_rxmeta_receiver");
> + index = if_nametoindex(DST_NAME);
> + bpf_program__set_ifindex(prog, index);
> + bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
> +
> + if (!ASSERT_OK(xdp_rxmeta_receiver__load(skel_receiver),
> + "load skel_receiver"))
> + goto out;
> +
> + ret = bpf_xdp_attach(index,
> + bpf_program__fd(skel_receiver->progs.xdp_rxmeta_receiver),
> + XDP_FLAGS_DRV_MODE, NULL);
> + if (!ASSERT_GE(ret, 0, "bpf_xdp_attach rx_meta_redirect"))
> + goto out;
> +
> + close_netns(tok);
> + tok = open_netns(FWD_NETNS_NAME);
> + if (!ASSERT_OK_PTR(tok, "setns"))
> + goto out;
> +
> + SYS(out, "ip link set dev " FWD0_NAME " address " FWD0_MAC);
> + SYS(out, "ip addr add " FWD0_ADDR "/" PREFIX_LEN " dev "
> FWD0_NAME);
> + SYS(out, "ip link set dev " FWD0_NAME " up");
> +
> + SYS(out, "ip link set dev " FWD1_NAME " address " FWD1_MAC);
> + SYS(out, "ip addr add " FWD1_ADDR "/" PREFIX_LEN " dev "
> FWD1_NAME);
> + SYS(out, "ip link set dev " FWD1_NAME " up");
> +
> + SYS(out, "sysctl -qw net.ipv4.conf.all.forwarding=1");
> +
> + skel_redirect = xdp_rxmeta_redirect__open();
> + if (!ASSERT_OK_PTR(skel_redirect, "open skel_redirect"))
> + goto out;
> +
> + prog = bpf_object__find_program_by_name(skel_redirect->obj,
> + "xdp_rxmeta_redirect");
> + index = if_nametoindex(FWD0_NAME);
> + bpf_program__set_ifindex(prog, index);
> + bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
> +
> + if (!ASSERT_OK(xdp_rxmeta_redirect__load(skel_redirect),
> + "load skel_redirect"))
> + goto out;
> +
> + val.ifindex = if_nametoindex(FWD1_NAME);
> + ret = bpf_map_update_elem(bpf_map__fd(skel_redirect->maps.dev_map),
> + &key, &val, 0);
> + if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
> + goto out;
> +
> + ret = bpf_xdp_attach(index,
> + bpf_program__fd(skel_redirect->progs.xdp_rxmeta_redirect),
> + XDP_FLAGS_DRV_MODE, NULL);
> + if (!ASSERT_GE(ret, 0, "bpf_xdp_attach rxmeta_redirect"))
> + goto out;
> +
> + close_netns(tok);
> + tok = open_netns(LOCAL_NETNS_NAME);
> + if (!ASSERT_OK_PTR(tok, "setns"))
> + goto out;
> +
> + if (!ASSERT_OK(run_ping(DST_ADDR, NUM_PACKETS), "ping"))
> + goto out;
> +
> + close_netns(tok);
> + tok = open_netns(DST_NETNS_NAME);
> + if (!ASSERT_OK_PTR(tok, "setns"))
> + goto out;
> +
> + ret = bpf_map__lookup_elem(skel_receiver->maps.stats,
> + &key, sizeof(key),
> + &stats, sizeof(stats), 0);
> + if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
This should have "bpf_map__lookup_elem" instead of "bpf_map_update_elem".
> + goto out;
> +
> + ASSERT_EQ(stats, NUM_PACKETS, "rx_meta stats");
> +out:
> + xdp_rxmeta_redirect__destroy(skel_redirect);
> + xdp_rxmeta_receiver__destroy(skel_receiver);
> + if (tok)
> + close_netns(tok);
> + SYS_NOFAIL("ip netns del " LOCAL_NETNS_NAME);
> + SYS_NOFAIL("ip netns del " FWD_NETNS_NAME);
> + SYS_NOFAIL("ip netns del " DST_NETNS_NAME);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c b/tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
> new file mode 100644
> index 000000000000..1033fa558970
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/xdp_rxmeta_receiver.c
> @@ -0,0 +1,44 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define BPF_NO_KFUNC_PROTOTYPES
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
> + enum xdp_rss_hash_type *rss_type) __ksym;
> +extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
> + __u64 *timestamp) __ksym;
> +
> +#define RX_TIMESTAMP 0x12345678
> +#define RX_HASH 0x1234
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_ARRAY);
> + __type(key, __u32);
> + __type(value, __u32);
> + __uint(max_entries, 1);
> +} stats SEC(".maps");
> +
> +SEC("xdp")
> +int xdp_rxmeta_receiver(struct xdp_md *ctx)
> +{
> + enum xdp_rss_hash_type rss_type;
> + __u64 timestamp;
> + __u32 hash;
> +
> + if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type) &&
> + !bpf_xdp_metadata_rx_timestamp(ctx, &timestamp)) {
> + if (hash == RX_HASH && rss_type == XDP_RSS_L4_TCP &&
> + timestamp == RX_TIMESTAMP) {
> + __u32 *val, key = 0;
> +
> + val = bpf_map_lookup_elem(&stats, &key);
> + if (val)
> + __sync_add_and_fetch(val, 1);
> + }
> + }
> +
> + return XDP_PASS;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c b/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
> new file mode 100644
> index 000000000000..635cbae64f53
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/xdp_rxmeta_redirect.c
> @@ -0,0 +1,43 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +#define RX_TIMESTAMP 0x12345678
> +#define RX_HASH 0x1234
> +
> +#define ETH_ALEN 6
> +#define ETH_P_IP 0x0800
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_DEVMAP);
> + __uint(key_size, sizeof(__u32));
> + __uint(value_size, sizeof(struct bpf_devmap_val));
> + __uint(max_entries, 1);
> +} dev_map SEC(".maps");
> +
> +SEC("xdp")
> +int xdp_rxmeta_redirect(struct xdp_md *ctx)
> +{
> + __u8 src_mac[] = { 0x00, 0x00, 0x00, 0x00, 0x01, 0x01 };
> + __u8 dst_mac[] = { 0x00, 0x00, 0x00, 0x00, 0x01, 0x02 };
> + void *data_end = (void *)(long)ctx->data_end;
> + void *data = (void *)(long)ctx->data;
> + struct ethhdr *eh = data;
> +
> + if (eh + 1 > (struct ethhdr *)data_end)
> + return XDP_DROP;
> +
> + if (eh->h_proto != bpf_htons(ETH_P_IP))
> + return XDP_PASS;
> +
> + __builtin_memcpy(eh->h_source, src_mac, ETH_ALEN);
> + __builtin_memcpy(eh->h_dest, dst_mac, ETH_ALEN);
> +
> + bpf_xdp_store_rx_hash(ctx, RX_HASH, XDP_RSS_L4_TCP);
> + bpf_xdp_store_rx_ts(ctx, RX_TIMESTAMP);
> +
> + return bpf_redirect_map(&dev_map, ctx->rx_queue_index, XDP_PASS);
> +}
> +
> +char _license[] SEC("license") = "GPL";
>
>
Best regards,
Zdenek Bouska
--
Siemens, s.r.o
Foundational Technologies
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-22 1:13 ` Jakub Kicinski
@ 2025-07-28 10:53 ` Lorenzo Bianconi
2025-07-28 16:29 ` Jakub Kicinski
0 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Bianconi @ 2025-07-28 10:53 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Jesper Dangaard Brouer, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg
> On Fri, 18 Jul 2025 12:56:46 +0200 Jesper Dangaard Brouer wrote:
> > >> Thanks for the feedback. I can see why you'd be concerned about adding
> > >> another adhoc scheme or making xdp_frame grow into a "para-skb".
> > >>
> > >> However, I'd like to frame this as part of a long-term plan we've been
> > >> calling the "mini-SKB" concept. This isn't a new idea, but a
> > >> continuation of architectural discussions from as far back as [2016].
> > >
> > > My understanding is that while this was floated as a plan by some,
> > > nobody came up with a clean way of implementing it.
> >
> > I can see why you might think that, but from my perspective, the
> > xdp_frame *is* the implementation of the mini-SKB concept. We've been
> > building it incrementally for years. It started as the most minimal
> > structure possible and has gradually gained more context (e.g. dev_rx,
> > mem_info/rxq_info, flags, and also uses skb_shared_info with same layout
> > as SKB).
>
> My understanding was that just adding all the fields to xdp_frame was
> considered too wasteful. Otherwise we would have done something along
> those lines ~10 years ago :S
Hi Jakub,
sorry for the late reply.
I am completely fine with redesigning the solution to overcome the problem, but I
guess this feature will allow us to improve XDP performance in a common/real
use-case. Let's consider we want to redirect a packet into a veth and then into
a container. Preserving the hw metadata when performing XDP_REDIRECT will allow us
to avoid recalculating the checksum when creating the skb. This will result in a
very nice performance improvement.
So I guess we should really come up with some idea to add this missing feature.
Regards,
Lorenzo
>
> > This patch is simply the next logical step in that existing evolution:
> > adding hardware metadata to make it more capable, starting with enabling
> > XDP_REDIRECT offloads. The xdp_frame is our mini-SKB, and this patchset
> > continues its evolution.
>
> I thought one of the goals for mini-skb was to move the skb allocation
> out of the drivers. The patches as posted seem to make it the
> responsibility of the XDP program to save the metadata. If you're
> planning to make drivers populate this metadata by default - why add
> the helpers.
>
> Again, I just don't understand how these logically fit into place
> vis-a-vis the existing metadata "get" callbacks.
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-28 10:53 ` Lorenzo Bianconi
@ 2025-07-28 16:29 ` Jakub Kicinski
2025-07-29 11:15 ` Jesper Dangaard Brouer
2025-07-31 21:18 ` Lorenzo Bianconi
0 siblings, 2 replies; 43+ messages in thread
From: Jakub Kicinski @ 2025-07-28 16:29 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Jesper Dangaard Brouer, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg
On Mon, 28 Jul 2025 12:53:01 +0200 Lorenzo Bianconi wrote:
> > > I can see why you might think that, but from my perspective, the
> > > xdp_frame *is* the implementation of the mini-SKB concept. We've been
> > > building it incrementally for years. It started as the most minimal
> > > structure possible and has gradually gained more context (e.g. dev_rx,
> > > mem_info/rxq_info, flags, and also uses skb_shared_info with same layout
> > > as SKB).
> >
> > My understanding was that just adding all the fields to xdp_frame was
> > considered too wasteful. Otherwise we would have done something along
> > those lines ~10 years ago :S
>
> Hi Jakub,
>
> sorry for the late reply.
> I am completely fine to redesign the solution to overcome the problem but I
> guess this feature will allow us to improve XDP performance in a common/real
> use-case. Let's consider we want to redirect a packet into a veth and then into
> a container. Preserving the hw metadata performing XDP_REDIRECT will allow us
> to avoid recalculating the checksum creating the skb. This will result in a
> very nice performance improvement.
> So I guess we should really come up with some idea to add this missing feature.
I don't think the counter-proposal prevents that. As long as veth
supports "set" callbacks the program can transfer the metadata over
to the veth and the second program at veth can communicate them to
the driver.
Martin mentioned to me that he had proposed in the past that we allow
allocating the skb at the XDP level, if the program needs "skb-level
metadata". That actually seems pretty clean to me.. Was it ever
explored?
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-28 16:29 ` Jakub Kicinski
@ 2025-07-29 11:15 ` Jesper Dangaard Brouer
2025-07-29 19:47 ` Martin KaFai Lau
2025-07-31 21:18 ` Lorenzo Bianconi
1 sibling, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-29 11:15 UTC (permalink / raw)
To: Jakub Kicinski, Lorenzo Bianconi
Cc: Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur, jakub, Jesse Brandeburg, Andrew Rzeznik
On 28/07/2025 18.29, Jakub Kicinski wrote:
> On Mon, 28 Jul 2025 12:53:01 +0200 Lorenzo Bianconi wrote:
>>>> I can see why you might think that, but from my perspective, the
>>>> xdp_frame *is* the implementation of the mini-SKB concept. We've been
>>>> building it incrementally for years. It started as the most minimal
>>>> structure possible and has gradually gained more context (e.g. dev_rx,
>>>> mem_info/rxq_info, flags, and also uses skb_shared_info with same layout
>>>> as SKB).
>>>
>>> My understanding was that just adding all the fields to xdp_frame was
>>> considered too wasteful. Otherwise we would have done something along
>>> those lines ~10 years ago :S
>>
>> Hi Jakub,
>>
>> sorry for the late reply.
Same, back from vacation.
>> I am completely fine to redesign the solution to overcome the problem but I
>> guess this feature will allow us to improve XDP performance in a common/real
>> use-case. Let's consider we want to redirect a packet into a veth and then into
>> a container. Preserving the hw metadata performing XDP_REDIRECT will allow us
>> to avoid recalculating the checksum creating the skb. This will result in a
>> very nice performance improvement.
>> So I guess we should really come up with some idea to add this missing feature.
>
>
> Martin mentioned to me that he had proposed in the past that we allow
> allocating the skb at the XDP level, if the program needs "skb-level
> metadata". That actually seems pretty clean to me.. Was it ever
> explored?
That idea has been considered before, but it unfortunately doesn't work
from a performance angle. The performance model of XDP_REDIRECT into
CPUMAP relies on moving the expensive SKB allocation+init to a remote
CPU. This keeps the ingress CPU free to process packets at near line
rate (our DDoS use-case). If we allocate the SKB on the ingress-CPU
before the redirect, we destroy this load-balancing model and create the
exact bottleneck we designed CPUMAP to avoid.
To bring the focus back to the specific problem this series solves,
let's review the concrete use case. Our IPsec scenario is a key example:
on the ingress CPU, an XDP program calculates a hash from inner packet
headers to load-balance traffic via CPUMAP. When the packet arrives on
the remote CPU, this hash is lost, so the new SKB is created with a hash
of zero. This, in turn, causes poor load-balancing when the packet is
forwarded to a multi-queue device like veth, as traffic often collapses
to a single queue. The purpose of this patchset is simply to provide a
standard way to carry that hash to the remote CPU within the xdp_frame.
(Same goes for a standard way to carry VLAN tags)
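For concreteness, a minimal sketch of the ingress-side XDP program could look
like the code below. This is an illustration only: it assumes the
bpf_xdp_store_rx_hash() kfunc added by this series, and inner_flow_hash() is a
hypothetical placeholder for the partial-decrypt plus inner-header hash
calculation described above.

// SPDX-License-Identifier: GPL-2.0
/* Minimal sketch, not a complete program. */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define NR_CPUS 64

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(struct bpf_cpumap_val));
	__uint(max_entries, NR_CPUS);
} cpu_map SEC(".maps");	/* entries (queue size) are populated by the loader */

static __always_inline __u32 inner_flow_hash(struct xdp_md *ctx)
{
	/* Hypothetical placeholder: a real program parses the decrypted
	 * inner headers here and hashes the inner 5-tuple. */
	return ctx->rx_queue_index + 1;
}

SEC("xdp")
int xdp_ipsec_lb(struct xdp_md *ctx)
{
	__u32 hash = inner_flow_hash(ctx);

	/* Persist the hash in the frame, so the SKB built on the remote
	 * CPU gets skb->hash set from it (kfunc from this series). */
	bpf_xdp_store_rx_hash(ctx, hash, XDP_RSS_L4);

	return bpf_redirect_map(&cpu_map, hash % NR_CPUS, XDP_PASS);
}

char _license[] SEC("license") = "GPL";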
Given this specific problem, is there a better approach to solving it
than what this patchset proposes?
--Jesper
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-29 11:15 ` Jesper Dangaard Brouer
@ 2025-07-29 19:47 ` Martin KaFai Lau
2025-07-31 16:27 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-07-29 19:47 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Jakub Kicinski, Lorenzo Bianconi
Cc: Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur, jakub, Jesse Brandeburg, Andrew Rzeznik
On 7/29/25 4:15 AM, Jesper Dangaard Brouer wrote:
> That idea has been considered before, but it unfortunately doesn't work
> from a performance angle. The performance model of XDP_REDIRECT into
> CPUMAP relies on moving the expensive SKB allocation+init to a remote
> CPU. This keeps the ingress CPU free to process packets at near line
> rate (our DDoS use-case). If we allocate the SKB on the ingress-CPU
> before the redirect, we destroy this load-balancing model and create the
> exact bottleneck we designed CPUMAP to avoid.
iirc, a xdp prog can be attached to a cpumap. The skb can be created by that xdp
prog running on the remote cpu. It should be like a xdp prog returning a
XDP_PASS + an optional skb. The xdp prog can set some fields in the skb. Other
than setting fields in the skb, something else may be also possible in the
future, e.g. look up sk, earlier demux ...etc.
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-29 19:47 ` Martin KaFai Lau
@ 2025-07-31 16:27 ` Jesper Dangaard Brouer
2025-08-01 20:38 ` Jakub Kicinski
0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-07-31 16:27 UTC (permalink / raw)
To: Martin KaFai Lau, Jakub Kicinski, Lorenzo Bianconi
Cc: Stanislav Fomichev, bpf, netdev, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Paolo Abeni, sdf,
kernel-team, arthur, jakub, Jesse Brandeburg, Andrew Rzeznik
On 29/07/2025 21.47, Martin KaFai Lau wrote:
> On 7/29/25 4:15 AM, Jesper Dangaard Brouer wrote:
>> That idea has been considered before, but it unfortunately doesn't work
>> from a performance angle. The performance model of XDP_REDIRECT into
>> CPUMAP relies on moving the expensive SKB allocation+init to a remote
>> CPU. This keeps the ingress CPU free to process packets at near line
>> rate (our DDoS use-case). If we allocate the SKB on the ingress-CPU
>> before the redirect, we destroy this load-balancing model and create the
>> exact bottleneck we designed CPUMAP to avoid.
>
> iirc, a xdp prog can be attached to a cpumap. The skb can be created by
> that xdp prog running on the remote cpu. It should be like a xdp prog
> returning a XDP_PASS + an optional skb. The xdp prog can set some fields
> in the skb. Other than setting fields in the skb, something else may be
> also possible in the future, e.g. look up sk, earlier demux ...etc.
>
I have strong reservations about having the BPF program itself trigger
the SKB allocation. I believe this would fundamentally break the
performance model that makes cpumap redirect so effective.
The key to XDP's high performance lies in processing a bulk of
xdp_frames in a tight loop to amortize costs. The existing cpumap code
on the remote CPU is already highly optimized for this: it performs bulk
allocation of SKBs and uses careful prefetching to hide the memory
latency. Allowing a BPF program to sometimes trigger a heavyweight SKB
alloc+init (4 cache-line misses) would bypass all these existing
optimizations. It would introduce significant jitter into the pipeline
and disrupt the entire bulk-processing model we rely on for performance.
This performance is not just theoretical; we rely on it for DDoS
protection. For example, our plan is to use the XDP program on the
cpumap hook to run secondary DDoS mitigation rules that currently use
iptables (funny, many rules are actually BPF program snippets today).
Architecturally, there is a clean separation today: the BPF program
makes a decision, and the highly-optimized cpumap or core kernel code
acts on it (build_skb, napi_gro_receive, etc). Your proposal blurs this
line significantly. Our patch, in contrast, preserves this model. It
simply provides the necessary data (the hash, vlan and timestamp) to the
existing cpumap/veth skb path via the xdp_frame.
While more advanced capabilities are an interesting topic for the
future, my goal here is to solve the immediate, concrete problem of
transferring metadata cleanly, without disrupting the performance
architecture we rely on for use cases like DDoS mitigation.
--Jesper
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-28 16:29 ` Jakub Kicinski
2025-07-29 11:15 ` Jesper Dangaard Brouer
@ 2025-07-31 21:18 ` Lorenzo Bianconi
2025-08-01 20:40 ` Jakub Kicinski
1 sibling, 1 reply; 43+ messages in thread
From: Lorenzo Bianconi @ 2025-07-31 21:18 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Jesper Dangaard Brouer, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg
On Jul 28, Jakub Kicinski wrote:
> On Mon, 28 Jul 2025 12:53:01 +0200 Lorenzo Bianconi wrote:
> > > > I can see why you might think that, but from my perspective, the
> > > > xdp_frame *is* the implementation of the mini-SKB concept. We've been
> > > > building it incrementally for years. It started as the most minimal
> > > > structure possible and has gradually gained more context (e.g. dev_rx,
> > > > mem_info/rxq_info, flags, and also uses skb_shared_info with same layout
> > > > as SKB).
> > >
> > > My understanding was that just adding all the fields to xdp_frame was
> > > considered too wasteful. Otherwise we would have done something along
> > > those lines ~10 years ago :S
> >
> > Hi Jakub,
> >
> > sorry for the late reply.
> > I am completely fine to redesign the solution to overcome the problem but I
> > guess this feature will allow us to improve XDP performance in a common/real
> > use-case. Let's consider we want to redirect a packet into a veth and then into
> > a container. Preserving the hw metadata performing XDP_REDIRECT will allow us
> > to avoid recalculating the checksum creating the skb. This will result in a
> > very nice performance improvement.
> > So I guess we should really come up with some idea to add this missing feature.
>
> I don't think the counter-proposal prevents that. As long as veth
> supports "set" callbacks the program can transfer the metadata over
> to the veth and the second program at veth can communicate them to
> the driver.
IIUC the 'set' proposal (please correct me if I am wrong), the eBPF program
running on the NIC that is receiving the packet from the wire is supposed
to set (or update) the hw metadata info (e.g. RX HASH or RX checksum) in
the RX DMA descriptor associated to the packet to be successively consumed.
Am I right?
I think this approach works fine if the SKB is created locally in the NAPI
loop of the receiving driver (e.g if the eBPF program bounded on the NIC is
returning XDP_PASS) but I guess it does not work if the packet is redirected
into a remote CPU or a remote device (e.g. veth). Considering the veth
use-case, veth_ndo_xdp_xmit() enqueues the packet into a ptr_ring and
schedule a NAPI. When the NAPI runs I guess the DMA descriptor originally
associated to the packet has been already queued back to the hw ring to be
consumed for a following packet. In order to be able to easily consume
these hw metadata I guess we should store these info in the same packet
buffer. Am I missing something?
Regards,
Lorenzo
>
> Martin mentioned to me that he had proposed in the past that we allow
> allocating the skb at the XDP level, if the program needs "skb-level
> metadata". That actually seems pretty clean to me.. Was it ever
> explored?
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-31 16:27 ` Jesper Dangaard Brouer
@ 2025-08-01 20:38 ` Jakub Kicinski
2025-08-04 13:18 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-08-01 20:38 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Martin KaFai Lau, Lorenzo Bianconi, Stanislav Fomichev, bpf,
netdev, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg, Andrew Rzeznik
On Thu, 31 Jul 2025 18:27:07 +0200 Jesper Dangaard Brouer wrote:
> > iirc, a xdp prog can be attached to a cpumap. The skb can be created by
> > that xdp prog running on the remote cpu. It should be like a xdp prog
> > returning a XDP_PASS + an optional skb. The xdp prog can set some fields
> > in the skb. Other than setting fields in the skb, something else may be
> > also possible in the future, e.g. look up sk, earlier demux ...etc.
>
> I have strong reservations about having the BPF program itself trigger
> the SKB allocation. I believe this would fundamentally break the
> performance model that makes cpumap redirect so effective.
See, I have similar concerns about growing struct xdp_frame.
That's why the guiding principle for me would be to make sure that
the features we add, beyond "classic XDP" as needed by DDoS, are
entirely optional. And if we include the goal of moving skb allocation
out of the driver to the xdp_frame growth, the drivers will sooner or
later unconditionally populate the xdp_frame. Decreasing performance
of "classic XDP"?
> The key to XDP's high performance lies in processing a bulk of
> xdp_frames in a tight loop to amortize costs. The existing cpumap code
> on the remote CPU is already highly optimized for this: it performs bulk
> allocation of SKBs and uses careful prefetching to hide the memory
> latency. Allowing a BPF program to sometimes trigger a heavyweight SKB
> alloc+init (4 cache-line misses) would bypass all these existing
> optimizations. It would introduce significant jitter into the pipeline
> and disrupt the entire bulk-processing model we rely on for performance.
>
> This performance is not just theoretical;
Somewhat off-topic for the architecture, I think, but do you happen
to have any real life data for that? IIRC the "listification" was a
moderate success for the skb path.. Or am I misreading and you have
other benefits of a tight processing loop in mind?
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-07-31 21:18 ` Lorenzo Bianconi
@ 2025-08-01 20:40 ` Jakub Kicinski
2025-08-05 13:18 ` Lorenzo Bianconi
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-08-01 20:40 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Jesper Dangaard Brouer, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg
On Thu, 31 Jul 2025 23:18:12 +0200 Lorenzo Bianconi wrote:
> IIUC the 'set' proposal (please correct me if I am wrong), the eBPF program
> running on the NIC that is receiving the packet from the wire is supposed
> to set (or update) the hw metadata info (e.g. RX HASH or RX checksum) in
> the RX DMA descriptor associated to the packet to be successively consumed.
> Am I right?
I was thinking of doing the SET on the veth side. Basically the
metadata has to be understood by the stack only at the xdp->skb
transition point. So we can delay the SET until that moment, carrying
the information in program-specific format.
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-08-01 20:38 ` Jakub Kicinski
@ 2025-08-04 13:18 ` Jesper Dangaard Brouer
2025-08-06 0:28 ` Jakub Kicinski
2025-08-06 1:24 ` Martin KaFai Lau
0 siblings, 2 replies; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-04 13:18 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Martin KaFai Lau, Lorenzo Bianconi, Stanislav Fomichev, bpf,
netdev, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg, Andrew Rzeznik
On 01/08/2025 22.38, Jakub Kicinski wrote:
> On Thu, 31 Jul 2025 18:27:07 +0200 Jesper Dangaard Brouer wrote:
>>> iirc, a xdp prog can be attached to a cpumap. The skb can be created by
>>> that xdp prog running on the remote cpu. It should be like a xdp prog
>>> returning a XDP_PASS + an optional skb. The xdp prog can set some fields
>>> in the skb. Other than setting fields in the skb, something else may be
>>> also possible in the future, e.g. look up sk, earlier demux ...etc.
>>
>> I have strong reservations about having the BPF program itself trigger
>> the SKB allocation. I believe this would fundamentally break the
>> performance model that makes cpumap redirect so effective.
>
> See, I have similar concerns about growing struct xdp_frame.
>
IMHO there is a huge difference in doing memory allocs+init vs. growing
struct xdp_frame.
It is very important to notice that this patchset is not actually growing
xdp_frame in the traditional sense; instead we are adding an optional
area to xdp_frame (plus some flags to tell if the area is in use). Remember
the xdp_frame area is not allocated or mem-zeroed (except flags). If
not used, the members in struct xdp_rx_meta are never touched. Thus,
there is actually no performance impact in growing struct xdp_frame in
this way. Do you still have concerns?
> That's why the guiding principle for me would be to make sure that
> the features we add, beyond "classic XDP" as needed by DDoS, are
> entirely optional.
Exactly, we agree. What we do in this patchset is entirely optional.
These changes do not slow down "classic XDP" or our DDoS use-case.
> And if we include the goal of moving skb allocation
> out of the driver to the xdp_frame growth, the drivers will sooner or
> later unconditionally populate the xdp_frame. Decreasing performance
> of "classic XDP"?
>
No, that is the beauty of this solution, it will not decrease the
performance of "classic XDP".
Do keep in mind that "moving skb allocation out of the driver" is not
part of this patchset; it is a moonshot goal that will take a long time
(but we have already been "simulating" this via XDP-redirect for years now).
Drivers should obviously not unconditionally populate the xdp_frame's
rx_meta area. The right time to populate rx_meta is once the driver reaches
the XDP_PASS case (normal netstack delivery). Today all drivers will at this
stage populate the SKB metadata (e.g. rx-hash + vlan) from the RX-
descriptor anyway. Thus, I don't see how replacing those writes will
decrease performance.
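To make this concrete, below is a rough sketch of applying the stored hints at
the xdp_frame-to-SKB conversion point. The flag and field names follow patch 1
of this series; the mapping to pkt_hash_types is only an illustration, not code
lifted from the patchset.

#include <linux/skbuff.h>
#include <linux/if_vlan.h>
#include <net/xdp.h>

/* Sketch: apply stored RX hints when building the SKB from an xdp_frame,
 * e.g. from __xdp_build_skb_from_frame(). Illustration only.
 */
static void xdp_frame_apply_rx_meta(const struct xdp_frame *xdpf,
				    struct sk_buff *skb)
{
	if (xdpf->flags & XDP_FLAGS_META_RX_HASH) {
		/* Illustrative mapping from xdp_rss_hash_type bits */
		enum pkt_hash_types htype =
			(xdpf->rx_meta.hash.type & XDP_RSS_L4) ?
			PKT_HASH_TYPE_L4 : PKT_HASH_TYPE_L3;

		skb_set_hash(skb, xdpf->rx_meta.hash.val, htype);
	}

	if (xdpf->flags & XDP_FLAGS_META_RX_VLAN)
		__vlan_hwaccel_put_tag(skb, xdpf->rx_meta.vlan.proto,
				       xdpf->rx_meta.vlan.tci);
}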
>> The key to XDP's high performance lies in processing a bulk of
>> xdp_frames in a tight loop to amortize costs. The existing cpumap code
>> on the remote CPU is already highly optimized for this: it performs bulk
>> allocation of SKBs and uses careful prefetching to hide the memory
>> latency. Allowing a BPF program to sometimes trigger a heavyweight SKB
>> alloc+init (4 cache-line misses) would bypass all these existing
>> optimizations. It would introduce significant jitter into the pipeline
>> and disrupt the entire bulk-processing model we rely on for performance.
>>
>> This performance is not just theoretical;
>
> Somewhat off-topic for the architecture, I think, but do you happen
> to have any real life data for that? IIRC the "listification" was a
> moderate success for the skb path.. Or am I misreading and you have
> other benefits of a tight processing loop in mind?
Our "tight processing loop" for NAPI (net_rx_action/napi_pool) is not
performing as well as we want. One major reason is that the CPU is being
stalled each time in the loop when the NIC driver needs to clear the 4
cache-lines for the SKB. XDP have shown us that avoiding these steps is
a huge performance boost. The "moving skb allocation out of the driver"
is one step towards improving the NAPI loop. As you hint we also need
some bulking or "listification". I'm not a huge fan of SKB
"listification". XDP-redirect devmap/cpumap uses an array for creating
an RX bulk "stage". The SKB listification work was never fully
completed IMHO. Back then, I was working on getting PoC for SKB
forwarding working, but as soon as we reached any of the netfilter hooks
points the SKB list would get split into individual SKBs. IIRC SKB
listification only works for the first part of netstack SKB input code
path. And "late" part of qdisc TX layer, but the netstack code in-
between will always cause the SKB list would get split into individual
SKBs. IIRC only back-pressure during qdisc TX will cause listification
to be used. It would be great if someone have cycles to work on
completing more of the SKB listification.
--Jesper
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-08-01 20:40 ` Jakub Kicinski
@ 2025-08-05 13:18 ` Lorenzo Bianconi
2025-08-05 23:54 ` Jakub Kicinski
0 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Bianconi @ 2025-08-05 13:18 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Jesper Dangaard Brouer, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg
On Aug 01, Jakub Kicinski wrote:
> On Thu, 31 Jul 2025 23:18:12 +0200 Lorenzo Bianconi wrote:
> > IIUC the 'set' proposal (please correct me if I am wrong), the eBPF program
> > running on the NIC that is receiving the packet from the wire is supposed
> > to set (or update) the hw metadata info (e.g. RX HASH or RX checksum) in
> > the RX DMA descriptor associated to the packet to be successively consumed.
> > Am I right?
>
> I was thinking of doing the SET on the veth side. Basically the
> metadata has to be understood by the stack only at the xdp->skb
> transition point. So we can delay the SET until that moment, carrying
> the information in program-specific format.
Ack, I am fine with delaying the translation of the HW metadata from a
HW-specific format (the one contained in the DMA descriptor) to the network one
until it is consumed to create the SKB (by the veth driver in this case). But I
guess we need to copy the info contained in the DMA descriptor into a buffer
that is still valid when the veth driver consumes it, since the DMA descriptor
can be no longer available at that time. Do you agree, or am I missing
something?
Regards,
Lorenzo
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-08-05 13:18 ` Lorenzo Bianconi
@ 2025-08-05 23:54 ` Jakub Kicinski
0 siblings, 0 replies; 43+ messages in thread
From: Jakub Kicinski @ 2025-08-05 23:54 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Jesper Dangaard Brouer, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg
On Tue, 5 Aug 2025 15:18:52 +0200 Lorenzo Bianconi wrote:
> > I was thinking of doing the SET on the veth side. Basically the
> > metadata has to be understood by the stack only at the xdp->skb
> > transition point. So we can delay the SET until that moment, carrying
> > the information in program-specific format.
>
> ack, I am fine to delay the translation of the HW metadata from a HW
> specific format (the one contained in the DMA descriptor) to the network one
> when they are consumed to create the SKB (the veth driver in this case) but I
> guess we need to copy the info contained in the DMA descriptor into a buffer
> that is still valid when veth driver consumes them since the DMA descriptor
> can be no longer available at that time. Do you agree or am I missing
> something?
That's right, we need to carry the metadata we need with the packet
(in an XDP program-specific md prepend, presumably).
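As an illustration only (the struct layout and all names below are
program-specific placeholders, not an API from this series or the kernel), the
ingress half of that approach could look roughly like:

// SPDX-License-Identifier: GPL-2.0
/* Sketch: the ingress XDP program stashes its own metadata in the data_meta
 * area before redirecting towards the veth; a second XDP program on the veth
 * peer would read this struct back and apply it at the xdp->skb transition.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define VETH_IFINDEX 5	/* illustrative; resolved by the loader in practice */

struct lb_meta {	/* program-specific layout, agreed on by both programs */
	__u32 rx_hash;
	__u16 vlan_tci;
	__u16 reserved;
};

SEC("xdp")
int ingress_carry_meta(struct xdp_md *ctx)
{
	struct lb_meta *meta;

	/* Grow the data_meta area in front of the packet */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;

	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > (void *)(long)ctx->data)
		return XDP_PASS;

	meta->rx_hash = 0x1234;	/* e.g. hash computed from inner headers */
	meta->vlan_tci = 0;
	meta->reserved = 0;

	return bpf_redirect(VETH_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";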
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-08-04 13:18 ` Jesper Dangaard Brouer
@ 2025-08-06 0:28 ` Jakub Kicinski
2025-08-07 18:26 ` Jesper Dangaard Brouer
2025-08-06 1:24 ` Martin KaFai Lau
1 sibling, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-08-06 0:28 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Martin KaFai Lau, Lorenzo Bianconi, Stanislav Fomichev, bpf,
netdev, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg, Andrew Rzeznik
On Mon, 4 Aug 2025 15:18:35 +0200 Jesper Dangaard Brouer wrote:
> On 01/08/2025 22.38, Jakub Kicinski wrote:
> > On Thu, 31 Jul 2025 18:27:07 +0200 Jesper Dangaard Brouer wrote:
> >> I have strong reservations about having the BPF program itself trigger
> >> the SKB allocation. I believe this would fundamentally break the
> >> performance model that makes cpumap redirect so effective.
> >
> > See, I have similar concerns about growing struct xdp_frame.
> >
>
> IMHO there is a huge difference in doing memory allocs+init vs. growing
> struct xdp_frame.
>
> It very is important to notice that patchset is actually not growing
> xdp_frame, in the traditional sense, instead we are adding an optional
> area to xdp_frame (plus some flags to tell if area is in-use). Remember
> the xdp_frame area is not allocated or mem-zeroed (except flags). If
> not used, the members in struct xdp_rx_meta are never touched.
Yes, I get all that.
> Thus, there is actually no performance impact in growing struct
> xdp_frame in this way. Do you still have concerns?
You're adding code in a number of paths, I don't think it's fair to
claim that there is *no* performance impact. Maybe no impact of
XDP_DROP from the patches themselves, assuming driver doesn't
pre-populate.
Do you have any idea how well this approach will scale to all the fields
people will want to add to xdp_frame in the future? The nice thing about the
SET ops is that the driver can define whatever ops it supports,
including things not supported by skb (or supported thru skb_ext),
at zero cost to the common stack. If we define the fields in the core
we're back to the inflexibility of the skb world..
> > That's why the guiding principle for me would be to make sure that
> > the features we add, beyond "classic XDP" as needed by DDoS, are
> > entirely optional.
>
> Exactly, we agree. What we do in this patchset is entirely optional.
> These changes does not slowdown "classic XDP" and our DDoS use-case.
>
> > And if we include the goal of moving skb allocation
> > out of the driver to the xdp_frame growth, the drivers will sooner or
> > later unconditionally populate the xdp_frame. Decreasing performance
> > of "classic XDP"?
>
> No, that is the beauty of this solution, it will not decrease the
> performance of "classic XDP".
>
> Do keep-in-mind that "moving skb allocation out of the driver" is not
> part of this patchset and a moonshot goal that will take a long time
> (but we are already "simulation" this via XDP-redirect for years now).
> Drivers should obviously not unconditionally populate the xdp_frame's
> rx_meta area. It is first time to populate rx_meta, once driver reach
> XDP_PASS case (normal netstack delivery). Today all drivers will at this
> stage populate the SKB metadata (e.g. rx-hash + vlan) from the RX-
> descriptor anyway. Thus, I don't see how replacing those writes will
> decrease performance.
I don't think it's at all obvious that the driver should not
unconditionally populate the xdp_frame. It seems like the logical
direction to me, TBH. Driver pre-populates, then the conversion
and GET callbacks become trivial and generic.
Perhaps we should try to convert a real driver in this series.
> >> The key to XDP's high performance lies in processing a bulk of
> >> xdp_frames in a tight loop to amortize costs. The existing cpumap code
> >> on the remote CPU is already highly optimized for this: it performs bulk
> >> allocation of SKBs and uses careful prefetching to hide the memory
> >> latency. Allowing a BPF program to sometimes trigger a heavyweight SKB
> >> alloc+init (4 cache-line misses) would bypass all these existing
> >> optimizations. It would introduce significant jitter into the pipeline
> >> and disrupt the entire bulk-processing model we rely on for performance.
> >>
> >> This performance is not just theoretical;
> >
> > Somewhat off-topic for the architecture, I think, but do you happen
> > to have any real life data for that? IIRC the "listification" was a
> > moderate success for the skb path.. Or am I misreading and you have
> > other benefits of a tight processing loop in mind?
>
> Our "tight processing loop" for NAPI (net_rx_action/napi_pool) is not
> performing as well as we want. One major reason is that the CPU is being
> stalled each time in the loop when the NIC driver needs to clear the 4
> cache-lines for the SKB. XDP have shown us that avoiding these steps is
> a huge performance boost.
Do you know what uarch resource it's stalling on?
It's been on my mind whether, in the attempts to zero out as
little as possible, we didn't defeat CPU optimizations for clearing
full cache lines.
> The "moving skb allocation out of the driver"
> is one step towards improving the NAPI loop. As you hint we also need
> some bulking or "listification". I'm not a huge fan of SKB
> "listification". XDP-redirect devmap/cpumap uses an array for creating
> an RX bulk "stage". The SKB listification work was never fully
> completed IMHO. Back then, I was working on getting PoC for SKB
> forwarding working, but as soon as we reached any of the netfilter hooks
> points the SKB list would get split into individual SKBs. IIRC SKB
> listification only works for the first part of netstack SKB input code
> path. And "late" part of qdisc TX layer, but the netstack code in-
> between will always cause the SKB list would get split into individual
> SKBs. IIRC only back-pressure during qdisc TX will cause listification
> to be used. It would be great if someone have cycles to work on
> completing more of the SKB listification.
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-08-04 13:18 ` Jesper Dangaard Brouer
2025-08-06 0:28 ` Jakub Kicinski
@ 2025-08-06 1:24 ` Martin KaFai Lau
2025-08-07 19:07 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-08-06 1:24 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Jakub Kicinski, Lorenzo Bianconi, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg, Andrew Rzeznik
On 8/4/25 6:18 AM, Jesper Dangaard Brouer wrote:
> Do keep-in-mind that "moving skb allocation out of the driver" is not
> part of this patchset and a moonshot goal that will take a long time
> (but we are already "simulation" this via XDP-redirect for years now).
The XDP_PASS was first done in the very early days of BPF in 2016. The
XDP-redirect then followed a similar setup. A lot has improved since then. A
moonshot in 2016 does not necessarily mean it is still hard to do now, e.g. loops
are now feasible. Directly reading/writing the skb is also easier.
Let’s first quantify what the performance loss would be if the skb is allocated
and field-set by the xdp prog (for the general XDP_PASS case and the
redirect+cpumap case). If it’s really worth it, let’s see what it would take for
the XDP program to achieve similar optimizations.
> Drivers should obviously not unconditionally populate the xdp_frame's
> rx_meta area. It is first time to populate rx_meta, once driver reach
afaict, the rx_meta is reserved regardless though. The xdp prog cannot use that
space for data_meta. The rx_meta will grow in time.
My preference is to allow the xdp prog to decide what it needs to write in data_meta
and what it needs to set in the skb directly. This is the general case it
should support first and then optimize.
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-08-06 0:28 ` Jakub Kicinski
@ 2025-08-07 18:26 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-07 18:26 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Martin KaFai Lau, Lorenzo Bianconi, Stanislav Fomichev, bpf,
netdev, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg, Andrew Rzeznik
On 06/08/2025 02.28, Jakub Kicinski wrote:
> On Mon, 4 Aug 2025 15:18:35 +0200 Jesper Dangaard Brouer wrote:
>> On 01/08/2025 22.38, Jakub Kicinski wrote:
>>> On Thu, 31 Jul 2025 18:27:07 +0200 Jesper Dangaard Brouer wrote:
>>>> I have strong reservations about having the BPF program itself trigger
>>>> the SKB allocation. I believe this would fundamentally break the
>>>> performance model that makes cpumap redirect so effective.
>>>
>>> See, I have similar concerns about growing struct xdp_frame.
>>>
>>
>> IMHO there is a huge difference in doing memory allocs+init vs. growing
>> struct xdp_frame.
>>
>> It very is important to notice that patchset is actually not growing
>> xdp_frame, in the traditional sense, instead we are adding an optional
>> area to xdp_frame (plus some flags to tell if area is in-use). Remember
>> the xdp_frame area is not allocated or mem-zeroed (except flags). If
>> not used, the members in struct xdp_rx_meta are never touched.
>
> Yes, I get all that.
>
>> Thus, there is actually no performance impact in growing struct
>> xdp_frame in this way. Do you still have concerns?
>
> You're adding code in a number of paths, I don't think it's fair to
> claim that there is *no* performance impact. Maybe no impact of
> XDP_DROP from the patches themselves, assuming driver doesn't
> pre-populate.
>
I feel a need to state this. The purpose of this patchset is to
increase performance by providing access to offload hints. The common
theme is that these offload hints help us avoid data cache-misses
and/or skip some steps in software. Setting the VLAN
(__vlan_hwaccel_put_tag) avoids an extra trip through the RX-handler.
Setting the RX-hash avoids calling into the flow_dissector to calculate
it in software. The RX-hash is an extended type that already encodes
the packet type (IPv4/IPv6 and UDP/TCP), which simplifies the BPF code
needed. We are still missing checksum offload, as Lorenzo mentions, but
that is the plan, as it has the most gain (for TCP, csum_partial shows
up in perf top for cpumap).
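
To make the intended usage concrete, here is a minimal sketch of an XDP
program that stores a custom RX-hash before redirecting into a cpumap.
Note the store-kfunc name and signature below are illustrative
assumptions (the real definitions live in patch 3/7), and the hash value
is a placeholder for one computed over the inner headers:

/* Illustrative sketch only: the store-kfunc name and signature are
 * assumed here; see patch 3/7 for the actual kfunc definitions. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern int bpf_xdp_store_rx_hash(struct xdp_md *ctx, __u32 hash,
				 enum xdp_rss_hash_type type) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 8);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int lb_inner_hash(struct xdp_md *ctx)
{
	/* Placeholder: in our use case this hash is computed over the
	 * inner headers after a partial decryption/decap. */
	__u32 hash = 0x5ca1ab1e;

	/* Persist the hash in the xdp_frame so skb->hash can be set
	 * when the SKB is built after the redirect. */
	bpf_xdp_store_rx_hash(ctx, hash, XDP_RSS_TYPE_L4_IPV4_TCP);

	return bpf_redirect_map(&cpu_map, 0, 0);
}

char _license[] SEC("license") = "GPL";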
> Do you have any idea how well this approach will scale to all the fields
> people will need in the future in xdp_frame? The nice thing about the
> SET ops is that the driver can define whatever ops it supports,
> including things not supported by skb (or supported thru skb_ext),
> at zero cost to the common stack. If we define the fields in the core
> we're back to the inflexibility of the skb world..
>
This is where traits come in. For now the struct has static members, but
we want to convert this to a dynamic struct based on traits if demand
for more members arises. The patchset API allows us to change to this
approach later.
The SET ops API requires two XDP programs, one at the physical NIC and
one at veth, plus agreement on a side-band layout for transferring e.g.
the RX-hash and timestamp (a sketch of such a layout is shown below).
The veth XDP-prog needs to run on the peer device; for containers this
is the veth device inside the container. If the veth XDP-prog doesn't
clear the data_meta area, then GRO aggregation breaks (e.g. for
timestamp usage).
Good luck getting this to work for containers.
Our solution works out-of-the-box for containers. We only need one
XDP-prog at the physical NIC, which supplies the missing hardware
offloads for XDP-redirect into the veth device.
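
For reference, a minimal sketch (not taken from any patchset) of the
kind of side-band data_meta layout the two programs would have to agree
on; the NIC-side prog writes it, and the veth-side prog must parse and
then clear it, or GRO aggregation breaks:

/* Hypothetical side-band layout, shared between the NIC-side and the
 * veth-side XDP programs via the data_meta area. Not from the patchset;
 * it only illustrates the coordination burden of the two-program model. */
struct rx_hints_meta {
	__u32 magic;		/* so the veth prog can detect a valid area */
	__u32 rx_hash;		/* load-balancing hash over inner headers */
	__u64 rx_timestamp;	/* HW RX timestamp, if available */
} __attribute__((packed));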
>>> That's why the guiding principle for me would be to make sure that
>>> the features we add, beyond "classic XDP" as needed by DDoS, are
>>> entirely optional.
>>
>> Exactly, we agree. What we do in this patchset is entirely optional.
>> These changes do not slow down "classic XDP" or our DDoS use-case.
>>
>>> And if we include the goal of moving skb allocation
>>> out of the driver to the xdp_frame growth, the drivers will sooner or
>>> later unconditionally populate the xdp_frame. Decreasing performance
>>> of "classic XDP"?
>>
>> No, that is the beauty of this solution, it will not decrease the
>> performance of "classic XDP".
>>
>> Do keep in mind that "moving skb allocation out of the driver" is not
>> part of this patchset; it is a moonshot goal that will take a long time
>> (but we have already been "simulating" this via XDP-redirect for years now).
>> Drivers should obviously not unconditionally populate the xdp_frame's
>> rx_meta area. The right time to populate rx_meta is when the driver
>> reaches the XDP_PASS case (normal netstack delivery). Today all drivers
>> will at this stage populate the SKB metadata (e.g. rx-hash + vlan) from
>> the RX-descriptor anyway. Thus, I don't see how replacing those writes
>> will decrease performance.
>
> I don't think it's at all obvious that the driver should not
> unconditionally populate the xdp_frame. It seems like the logical
> direction to me, TBH. Driver pre-populates, then the conversion
> and GET callbacks become trivial and generic..
>
This is related to when cache-lines are ready. All XDP drivers
prefetchw the xdp_frame area before starting the XDP-prog. Thus, the
driver wants to delay writing until this cache-line is ready.
> Perhaps we should try to convert a real driver in this series.
>
What do you mean?
This series is about the XDP_REDIRECT case, so we don't need to modify
any physical NIC driver.
Do you want this series to include the ability to XDP-override the
hardware offloads for RX-hash and VLAN for the XDP_PASS case for a real
physical NIC driver?
>>>> The key to XDP's high performance lies in processing a bulk of
>>>> xdp_frames in a tight loop to amortize costs. The existing cpumap code
>>>> on the remote CPU is already highly optimized for this: it performs bulk
>>>> allocation of SKBs and uses careful prefetching to hide the memory
>>>> latency. Allowing a BPF program to sometimes trigger a heavyweight SKB
>>>> alloc+init (4 cache-line misses) would bypass all these existing
>>>> optimizations. It would introduce significant jitter into the pipeline
>>>> and disrupt the entire bulk-processing model we rely on for performance.
>>>>
>>>> This performance is not just theoretical;
>>>
>>> Somewhat off-topic for the architecture, I think, but do you happen
>>> to have any real life data for that? IIRC the "listification" was a
>>> moderate success for the skb path.. Or am I misreading and you have
>>> other benefits of a tight processing loop in mind?
>>
>> Our "tight processing loop" for NAPI (net_rx_action/napi_pool) is not
>> performing as well as we want. One major reason is that the CPU is being
>> stalled each time in the loop when the NIC driver needs to clear the 4
>> cache-lines for the SKB. XDP have shown us that avoiding these steps is
>> a huge performance boost.
>
> Do you know what uarch resource it's stalling on?
> It's been on my mind whether, in the attempts to zero out as
> little as possible, we didn't defeat the CPU optimization for clearing
> full cache lines.
>
The main performance stall problem with zeroing is the 'rep stos'
(repeated string store) operation. See the comments in this memset
micro-benchmark [1].
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c
A memset of 32 bytes (or less) results in MOVQ instructions, which are
really fast. For larger sizes the compiler usually emits 'rep stos'
instructions, which have a high "startup" cost.
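
To illustrate the distinction (a rough sketch; the exact code generated
depends on compiler version, optimization flags and -march, so treat the
size threshold as approximate):

/* Rough sketch of the codegen difference (assumption: x86-64, gcc -O2;
 * actual output varies by compiler and -march). */
#include <string.h>

struct small { char buf[32]; };   /* <= 32 B: typically inlined as a few
				   * plain 8/16-byte MOV stores, no startup cost */
struct large { char buf[256]; };  /* larger: often 'rep stosq' or a call to
				   * memset(), which has a higher startup cost */

void clear_small(struct small *s) { memset(s, 0, sizeof(*s)); }
void clear_large(struct large *l) { memset(l, 0, sizeof(*l)); }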
--Jesper
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-08-06 1:24 ` Martin KaFai Lau
@ 2025-08-07 19:07 ` Jesper Dangaard Brouer
2025-08-13 2:59 ` Martin KaFai Lau
0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-07 19:07 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Jakub Kicinski, Lorenzo Bianconi, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg, Andrew Rzeznik
On 06/08/2025 03.24, Martin KaFai Lau wrote:
> On 8/4/25 6:18 AM, Jesper Dangaard Brouer wrote:
>> Do keep in mind that "moving skb allocation out of the driver" is not
>> part of this patchset; it is a moonshot goal that will take a long time
>> (but we have already been "simulating" this via XDP-redirect for years now).
>
> XDP_PASS was first done in the very early days of BPF, in 2016.
> XDP-redirect then followed a similar setup. A lot has improved since
> then; a moonshot in 2016 does not necessarily mean it is still hard to
> do now. For example, loops are now feasible, and directly
> reading/writing the skb is also easier.
>
Please enlighten me. How can we easily give XDP-progs access to the SKB
data structure? What changes do we need?
You mentioned XDP returning an SKB pointer, but that would need changes
to every XDP driver, right?
> Let’s first quantify what the performance loss would be if the skb is
> allocated and field-set by the xdp prog (for the general XDP_PASS case
> and the redirect+cpumap case). If it’s really worth it, let’s see what
> it would take for the XDP program to achieve similar optimizations.
>
>> Drivers should obviously not unconditionally populate the xdp_frame's
>> rx_meta area. The right time to populate rx_meta is when the driver reaches
>
> afaict, the rx_meta space is reserved regardless, though. The xdp prog
> cannot use that space for data_meta, and rx_meta will grow over time.
>
My view is that we have a memory area of at least 192 bytes available as
headroom that we are currently not using, which seems a waste. The
data_meta area was limited to 32 bytes for a long time without
complaints, so I don't think that is a concern. If rx_meta grows, we
propose changing to the traits implementation, which gives us a dynamic,
compressed struct.
> My preference is to allow the xdp prog to decide what to write into
> data_meta and what to set in the skb directly. This is the general case
> that should be supported first and then optimized.
>
Yan and I have previously [1] (Oct 2024) explored adding a common
callback to XDP drivers, which has access to both the xdp_buff and the
SKB in the function call. (Ignore the GRO-disable bit; focus on the
callback.)
We named the functions xdp_buff_fixup_skb_offloading() and
xdp_frame_fixup_skb_offloading(), and we implemented the driver changes
for [bnxt], [mlx5], [ice] and [veth].
What do you think of the idea of adding a BPF-hook at this callback,
which would have access to both the XDP and SKB pointers?
That would allow us to implement your idea, right?
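
Roughly, the shape of such a hook could look like the sketch below. The
fixup function names come from [1]; the context struct and runner are
purely hypothetical and only illustrate "one hook that sees both
pointers":

/* Hypothetical sketch: not an existing kernel API. The fixup functions
 * are from [1]; the BPF-hook plumbing below is invented for illustration. */
#include <linux/bpf.h>
#include <linux/skbuff.h>
#include <net/xdp.h>

struct bpf_skb_fixup_ctx {
	struct xdp_buff *xdp;	/* metadata the XDP prog left behind */
	struct sk_buff  *skb;	/* freshly built skb, fields still settable */
};

/* Called from xdp_buff_fixup_skb_offloading() (and the xdp_frame
 * variant) right after the skb has been built from the XDP buffer. */
static void xdp_run_skb_fixup_prog(struct bpf_prog *prog,
				   struct xdp_buff *xdp,
				   struct sk_buff *skb)
{
	struct bpf_skb_fixup_ctx ctx = { .xdp = xdp, .skb = skb };

	if (prog)
		bpf_prog_run(prog, &ctx);
}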
--Jesper
[1] https://lore.kernel.org/all/cover.1718919473.git.yan@cloudflare.com/#r
[bnxt]
https://lore.kernel.org/all/f804c22ca168ec3aedb0ee754bfbee71764eb894.1718919473.git.yan@cloudflare.com/
[mlx5]
https://lore.kernel.org/all/17595a278ee72964b83c0bd0b502152aa025f600.1718919473.git.yan@cloudflare.com/
[ice]
https://lore.kernel.org/all/a9eba425bfd3bfac7e7be38fe86ad5dbff3ae01f.1718919473.git.yan@cloudflare.com/
[veth]
https://lore.kernel.org/all/b7c75daecca9c4e36ef79af683d288653a9b5b82.1718919473.git.yan@cloudflare.com/
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
2025-08-07 19:07 ` Jesper Dangaard Brouer
@ 2025-08-13 2:59 ` Martin KaFai Lau
0 siblings, 0 replies; 43+ messages in thread
From: Martin KaFai Lau @ 2025-08-13 2:59 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Jakub Kicinski, Lorenzo Bianconi, Stanislav Fomichev, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
David S. Miller, Paolo Abeni, sdf, kernel-team, arthur, jakub,
Jesse Brandeburg, Andrew Rzeznik
On 8/7/25 12:07 PM, Jesper Dangaard Brouer wrote:
> Yan and I have previously [1] (Oct 2024) explored adding a common
> callback to XDP drivers, which has access to both the xdp_buff and the
> SKB in the function call. (Ignore the GRO-disable bit; focus on the
> callback.)
>
> We named the functions xdp_buff_fixup_skb_offloading() and
> xdp_frame_fixup_skb_offloading(), and we implemented the driver changes
> for [bnxt], [mlx5], [ice] and [veth].
>
> What do you think of the idea of adding a BPF-hook at this callback,
> which would have access to both the XDP and SKB pointers?
> That would allow us to implement your idea, right?
>
> [1] https://lore.kernel.org/all/cover.1718919473.git.yan@cloudflare.com/#r
>
> [bnxt] https://lore.kernel.org/all/f804c22ca168ec3aedb0ee754bfbee71764eb894.1718919473.git.yan@cloudflare.com/
>
> [mlx5] https://lore.kernel.org/all/17595a278ee72964b83c0bd0b502152aa025f600.1718919473.git.yan@cloudflare.com/
>
> [ice] https://lore.kernel.org/all/a9eba425bfd3bfac7e7be38fe86ad5dbff3ae01f.1718919473.git.yan@cloudflare.com/
>
> [veth] https://lore.kernel.org/all/b7c75daecca9c4e36ef79af683d288653a9b5b82.1718919473.git.yan@cloudflare.com/
It should not need a new BPF-hook to consume info produced by an earlier
xdp prog. Instead, the same, existing xdp prog could call a kfunc to
directly create the skb and update the skb fields. The kfunc could be
driver-specific, like the current .xmo_rx_xxx ops.
^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2025-08-13 3:00 UTC | newest]
Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-02 14:58 [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 1/7] net: xdp: Add xdp_rx_meta structure Jesper Dangaard Brouer
2025-07-17 9:19 ` Jakub Sitnicki
2025-07-17 14:40 ` Jesper Dangaard Brouer
2025-07-18 10:33 ` Jakub Sitnicki
2025-07-02 14:58 ` [PATCH bpf-next V2 2/7] selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff Jesper Dangaard Brouer
2025-07-03 11:41 ` Jesper Dangaard Brouer
2025-07-03 12:26 ` Lorenzo Bianconi
2025-07-02 14:58 ` [PATCH bpf-next V2 4/7] net: xdp: Set skb hw metadata from xdp_frame Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 5/7] net: veth: Read xdp metadata from rx_meta struct if available Jesper Dangaard Brouer
2025-07-17 12:11 ` Jakub Sitnicki
2025-07-02 14:58 ` [PATCH bpf-next V2 6/7] bpf: selftests: Add rx_meta store kfuncs selftest Jesper Dangaard Brouer
2025-07-23 9:24 ` Bouska, Zdenek
2025-07-02 14:58 ` [PATCH bpf-next V2 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Jesper Dangaard Brouer
2025-07-02 16:05 ` [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Stanislav Fomichev
2025-07-03 11:17 ` Jesper Dangaard Brouer
2025-07-07 14:40 ` Stanislav Fomichev
2025-07-09 9:31 ` Lorenzo Bianconi
2025-07-11 16:04 ` Stanislav Fomichev
2025-07-16 11:17 ` Lorenzo Bianconi
2025-07-16 21:20 ` Jakub Kicinski
2025-07-17 13:08 ` Jesper Dangaard Brouer
2025-07-18 1:25 ` Jakub Kicinski
2025-07-18 10:56 ` Jesper Dangaard Brouer
2025-07-22 1:13 ` Jakub Kicinski
2025-07-28 10:53 ` Lorenzo Bianconi
2025-07-28 16:29 ` Jakub Kicinski
2025-07-29 11:15 ` Jesper Dangaard Brouer
2025-07-29 19:47 ` Martin KaFai Lau
2025-07-31 16:27 ` Jesper Dangaard Brouer
2025-08-01 20:38 ` Jakub Kicinski
2025-08-04 13:18 ` Jesper Dangaard Brouer
2025-08-06 0:28 ` Jakub Kicinski
2025-08-07 18:26 ` Jesper Dangaard Brouer
2025-08-06 1:24 ` Martin KaFai Lau
2025-08-07 19:07 ` Jesper Dangaard Brouer
2025-08-13 2:59 ` Martin KaFai Lau
2025-07-31 21:18 ` Lorenzo Bianconi
2025-08-01 20:40 ` Jakub Kicinski
2025-08-05 13:18 ` Lorenzo Bianconi
2025-08-05 23:54 ` Jakub Kicinski
2025-07-18 9:55 ` Lorenzo Bianconi
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).