public inbox for netdev@vger.kernel.org
* [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset
@ 2026-01-07 14:28 Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 01/17] bnxt_en: Call skb_metadata_set when skb->data points at metadata end Jakub Sitnicki
                   ` (17 more replies)
  0 siblings, 18 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

This series continues the effort to provide reliable access to xdp/skb
metadata from BPF context on the receive path. We have recently talked
about it at Plumbers [1].

Currently skb metadata location is tied to the MAC header offset:

  [headroom][metadata][MAC hdr][L3 pkt]
                      ^
                      skb_metadata_end = head + mac_header

This design breaks on L2 decapsulation (VLAN, GRE, etc.) when the MAC
offset is reset. The naive fix is to memmove metadata on every decap path,
but we can avoid this cost by tracking metadata position independently.

Introduce a dedicated meta_end field in skb_shared_info that records where
metadata ends relative to skb->head:

  [headroom][metadata][gap][MAC hdr][L3 pkt]
                     ^
                     skb_metadata_end = head + meta_end
                     
This allows BPF dynptr access (bpf_dynptr_from_skb_meta()) to work
without a memmove. For skb->data_meta pointer access, which expects the
metadata to sit immediately before skb->data, have the verifier inject
realignment code into the TC BPF prologue.

Patches 1-9 enforce the calling convention: skb_metadata_set() must be
called once skb->data points past the metadata area, ensuring meta_end
captures the correct position. Patch 10 implements the core change.
Patches 11-15 extend the verifier to track data_meta usage and support
calls from the prologue, patch 16 adds the realignment logic, and patch
17 adds selftests covering L2 decap scenarios.

Note: This series does not address moving metadata on L2 encapsulation
when forwarding packets. VLAN and QinQ were already handled when fixing
the TC BPF helpers [2], but other tagging/tunnel code still requires
changes.

Note to maintainers: This is not a typical series, in the sense that it
touches both the networking drivers and the BPF verifier. The driver
changes (patches 1-9) can be split out, if it makes patch wrangling easier.

Thanks,
-jkbs

[1] https://lpc.events/event/19/contributions/2269/
[2] https://lore.kernel.org/all/20251105-skb-meta-rx-path-v4-0-5ceb08a9b37b@cloudflare.com/

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Changes in v3:
- Use BPF_EMIT_CALL in gen_prologue (patch 15, 16) (Alexei)
  (Will convert bpf_qdisc/testmod as a follow up.)
- Link to v2: https://lore.kernel.org/r/20260105-skb-meta-safeproof-netdevs-rx-only-v2-0-a21e679b5afa@cloudflare.com

Changes in v2:
- Add veth driver fix (patch 7)
- Add selftests for L2 decap paths (patch 16)
- Link to RFC: https://lore.kernel.org/r/20251124-skb-meta-safeproof-netdevs-rx-only-v1-0-8978f5054417@cloudflare.com

---
Jakub Sitnicki (17):
      bnxt_en: Call skb_metadata_set when skb->data points at metadata end
      i40e: Call skb_metadata_set when skb->data points at metadata end
      igb: Call skb_metadata_set when skb->data points at metadata end
      igc: Call skb_metadata_set when skb->data points at metadata end
      ixgbe: Call skb_metadata_set when skb->data points at metadata end
      net/mlx5e: Call skb_metadata_set when skb->data points at metadata end
      veth: Call skb_metadata_set when skb->data points at metadata end
      xsk: Call skb_metadata_set when skb->data points at metadata end
      xdp: Call skb_metadata_set when skb->data points at metadata end
      net: Track skb metadata end separately from MAC offset
      bpf, verifier: Remove side effects from may_access_direct_pkt_data
      bpf, verifier: Turn seen_direct_write flag into a bitmap
      bpf, verifier: Propagate packet access flags to gen_prologue
      bpf, verifier: Track when data_meta pointer is loaded
      bpf, verifier: Support direct kernel calls in gen_prologue
      bpf: Realign skb metadata for TC progs using data_meta
      selftests/bpf: Test skb metadata access after L2 decapsulation

 drivers/net/ethernet/broadcom/bnxt/bnxt.c          |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c         |   2 +-
 drivers/net/ethernet/intel/igb/igb_xsk.c           |   2 +-
 drivers/net/ethernet/intel/igc/igc_main.c          |   4 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c       |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.c    |   2 +-
 drivers/net/veth.c                                 |   4 +-
 include/linux/bpf.h                                |   2 +-
 include/linux/bpf_verifier.h                       |   8 +-
 include/linux/skbuff.h                             |  37 ++-
 kernel/bpf/cgroup.c                                |   2 +-
 kernel/bpf/verifier.c                              |  54 ++--
 net/core/dev.c                                     |   5 +-
 net/core/filter.c                                  |  62 ++++-
 net/core/skbuff.c                                  |  10 +-
 net/core/xdp.c                                     |   2 +-
 net/sched/bpf_qdisc.c                              |   3 +-
 tools/testing/selftests/bpf/config                 |   6 +-
 .../bpf/prog_tests/xdp_context_test_run.c          | 292 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_xdp_meta.c  |  48 ++--
 .../testing/selftests/bpf/test_kmods/bpf_testmod.c |   6 +-
 21 files changed, 466 insertions(+), 89 deletions(-)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH bpf-next v3 01/17] bnxt_en: Call skb_metadata_set when skb->data points at metadata end
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 02/17] i40e: " Jakub Sitnicki
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Prepare to track skb metadata location independently of MAC header offset.

Following changes will make skb_metadata_set() record where metadata ends
relative to skb->head. Hence the helper must be called when skb->data
already points past the metadata area.

Adjust the driver to pull from skb->data before calling skb_metadata_set().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index d17d0ea89c36..cb1aa2895172 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1440,8 +1440,8 @@ static struct sk_buff *bnxt_copy_xdp(struct bnxt_napi *bnapi,
 		return skb;
 
 	if (metasize) {
-		skb_metadata_set(skb, metasize);
 		__skb_pull(skb, metasize);
+		skb_metadata_set(skb, metasize);
 	}
 
 	return skb;

-- 
2.43.0



* [PATCH bpf-next v3 02/17] i40e: Call skb_metadata_set when skb->data points at metadata end
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 01/17] bnxt_en: Call skb_metadata_set when skb->data points at metadata end Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 03/17] igb: " Jakub Sitnicki
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Prepare to track skb metadata location independently of MAC header offset.

Following changes will make skb_metadata_set() record where metadata ends
relative to skb->head. Hence the helper must be called when skb->data
already points past the metadata area.

Adjust the driver to pull from skb->data before calling skb_metadata_set().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 9f47388eaba5..11eff5bd840b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -310,8 +310,8 @@ static struct sk_buff *i40e_construct_skb_zc(struct i40e_ring *rx_ring,
 	       ALIGN(totalsize, sizeof(long)));
 
 	if (metasize) {
-		skb_metadata_set(skb, metasize);
 		__skb_pull(skb, metasize);
+		skb_metadata_set(skb, metasize);
 	}
 
 	if (likely(!xdp_buff_has_frags(xdp)))

-- 
2.43.0



* [PATCH bpf-next v3 03/17] igb: Call skb_metadata_set when skb->data points at metadata end
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 01/17] bnxt_en: Call skb_metadata_set when skb->data points at metadata end Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 02/17] i40e: " Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 04/17] igc: " Jakub Sitnicki
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Prepare to track skb metadata location independently of MAC header offset.

Following changes will make skb_metadata_set() record where metadata ends
relative to skb->head. Hence the helper must be called when skb->data
already points past the metadata area.

Adjust the driver to pull from skb->data before calling skb_metadata_set().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 drivers/net/ethernet/intel/igb/igb_xsk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_xsk.c b/drivers/net/ethernet/intel/igb/igb_xsk.c
index 30ce5fbb5b77..9202da66e32c 100644
--- a/drivers/net/ethernet/intel/igb/igb_xsk.c
+++ b/drivers/net/ethernet/intel/igb/igb_xsk.c
@@ -284,8 +284,8 @@ static struct sk_buff *igb_construct_skb_zc(struct igb_ring *rx_ring,
 	       ALIGN(totalsize, sizeof(long)));
 
 	if (metasize) {
-		skb_metadata_set(skb, metasize);
 		__skb_pull(skb, metasize);
+		skb_metadata_set(skb, metasize);
 	}
 
 	return skb;

-- 
2.43.0



* [PATCH bpf-next v3 04/17] igc: Call skb_metadata_set when skb->data points at metadata end
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (2 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 03/17] igb: " Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 05/17] ixgbe: " Jakub Sitnicki
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Prepare to track skb metadata location independently of MAC header offset.

Following changes will make skb_metadata_set() record where metadata ends
relative to skb->head. Hence the helper must be called when skb->data
already points past the metadata area.

Adjust the driver to pull from skb->data before calling skb_metadata_set().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 drivers/net/ethernet/intel/igc/igc_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 7aafa60ba0c8..ba758399615b 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -2024,8 +2024,8 @@ static struct sk_buff *igc_construct_skb(struct igc_ring *rx_ring,
 	       ALIGN(headlen + metasize, sizeof(long)));
 
 	if (metasize) {
-		skb_metadata_set(skb, metasize);
 		__skb_pull(skb, metasize);
+		skb_metadata_set(skb, metasize);
 	}
 
 	/* update all of the pointers */
@@ -2752,8 +2752,8 @@ static struct sk_buff *igc_construct_skb_zc(struct igc_ring *ring,
 	       ALIGN(totalsize, sizeof(long)));
 
 	if (metasize) {
-		skb_metadata_set(skb, metasize);
 		__skb_pull(skb, metasize);
+		skb_metadata_set(skb, metasize);
 	}
 
 	if (ctx->rx_ts) {

-- 
2.43.0



* [PATCH bpf-next v3 05/17] ixgbe: Call skb_metadata_set when skb->data points at metadata end
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (3 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 04/17] igc: " Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 06/17] net/mlx5e: " Jakub Sitnicki
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Prepare to track skb metadata location independently of MAC header offset.

Following changes will make skb_metadata_set() record where metadata ends
relative to skb->head. Hence the helper must be called when skb->data
already points past the metadata area.

Adjust the driver to pull from skb->data before calling skb_metadata_set().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 7b941505a9d0..69104f432f8d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -228,8 +228,8 @@ static struct sk_buff *ixgbe_construct_skb_zc(struct ixgbe_ring *rx_ring,
 	       ALIGN(totalsize, sizeof(long)));
 
 	if (metasize) {
-		skb_metadata_set(skb, metasize);
 		__skb_pull(skb, metasize);
+		skb_metadata_set(skb, metasize);
 	}
 
 	return skb;

-- 
2.43.0



* [PATCH bpf-next v3 06/17] net/mlx5e: Call skb_metadata_set when skb->data points at metadata end
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (4 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 05/17] ixgbe: " Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 07/17] veth: " Jakub Sitnicki
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Prepare to track skb metadata location independently of MAC header offset.

Following changes will make skb_metadata_set() record where metadata ends
relative to skb->head. Hence the helper must be called when skb->data
already points past the metadata area.

Adjust the driver to pull from skb->data before calling skb_metadata_set().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 2b05536d564a..20c983c3ce62 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -237,8 +237,8 @@ static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, struct xdp_b
 	skb_put_data(skb, xdp->data_meta, totallen);
 
 	if (metalen) {
-		skb_metadata_set(skb, metalen);
 		__skb_pull(skb, metalen);
+		skb_metadata_set(skb, metalen);
 	}
 
 	return skb;

-- 
2.43.0



* [PATCH bpf-next v3 07/17] veth: Call skb_metadata_set when skb->data points at metadata end
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (5 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 06/17] net/mlx5e: " Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 08/17] xsk: " Jakub Sitnicki
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Prepare to track skb metadata location independently of MAC header offset.

Following changes will make skb_metadata_set() record where metadata ends
relative to skb->head. Hence the helper must be called when skb->data
points right past the metadata area.

Unlike other drivers, veth calls skb_metadata_set() after eth_type_trans(),
which pulls the Ethernet header and moves skb->data. This violates the
future calling convention.

Adjust the driver to pull the MAC header after calling skb_metadata_set().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 drivers/net/veth.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 14e6f2a2fb77..1d1dbfa2e5ef 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -874,11 +874,11 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	else
 		skb->data_len = 0;
 
-	skb->protocol = eth_type_trans(skb, rq->dev);
-
 	metalen = xdp->data - xdp->data_meta;
 	if (metalen)
 		skb_metadata_set(skb, metalen);
+
+	skb->protocol = eth_type_trans(skb, rq->dev);
 out:
 	return skb;
 drop:

-- 
2.43.0



* [PATCH bpf-next v3 08/17] xsk: Call skb_metadata_set when skb->data points at metadata end
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (6 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 07/17] veth: " Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 09/17] xdp: " Jakub Sitnicki
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Prepare to track skb metadata location independently of MAC header offset.

Following changes will make skb_metadata_set() record where metadata ends
relative to skb->head. Hence the helper must be called when skb->data
already points past the metadata area.

Adjust AF_XDP to pull from skb->data before calling skb_metadata_set().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/xdp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/xdp.c b/net/core/xdp.c
index fee6d080ee85..f35661305660 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -768,8 +768,8 @@ struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
 
 	metalen = xdp->data - xdp->data_meta;
 	if (metalen > 0) {
-		skb_metadata_set(skb, metalen);
 		__skb_pull(skb, metalen);
+		skb_metadata_set(skb, metalen);
 	}
 
 	skb_record_rx_queue(skb, rxq->queue_index);

-- 
2.43.0



* [PATCH bpf-next v3 09/17] xdp: Call skb_metadata_set when skb->data points at metadata end
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (7 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 08/17] xsk: " Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 10/17] net: Track skb metadata end separately from MAC offset Jakub Sitnicki
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Prepare to track skb metadata location independently of MAC header offset.

Following changes will make skb_metadata_set() record where metadata ends
relative to skb->head. Hence the helper must be called when skb->data
points just past the metadata area.

Tweak XDP generic mode accordingly.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/dev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 9094c0fb8c68..7f984d86dfe9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5467,8 +5467,11 @@ u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
 		break;
 	case XDP_PASS:
 		metalen = xdp->data - xdp->data_meta;
-		if (metalen)
+		if (metalen) {
+			__skb_push(skb, mac_len);
 			skb_metadata_set(skb, metalen);
+			__skb_pull(skb, mac_len);
+		}
 		break;
 	}
 

-- 
2.43.0



* [PATCH bpf-next v3 10/17] net: Track skb metadata end separately from MAC offset
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (8 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 09/17] xdp: " Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 11/17] bpf, verifier: Remove side effects from may_access_direct_pkt_data Jakub Sitnicki
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Currently skb metadata location is derived from the MAC header offset.
This breaks when L2 tunnel/tagging devices (VLAN, GRE, etc.) reset the MAC
offset after pulling the encapsulation header, making the metadata
inaccessible.

A naive fix would be to move metadata on every skb_pull() path. However, we
can avoid a memmove on L2 decapsulation if we can locate metadata
independently of the MAC offset.

Introduce a meta_end field in skb_shared_info to track where metadata
ends, decoupling it from mac_header. The new field occupies 2 bytes of
an existing 4-byte hole, so the structure size is unchanged once the
gso_type field is reordered.

Update skb_metadata_set() to record meta_end at the time of the call, and
adjust skb_data_move() and pskb_expand_head() to keep meta_end in sync with
head buffer layout.

Remove the now-unneeded metadata adjustment in skb_reorder_vlan_header().

Note that this breaks BPF skb metadata access through skb->data_meta when
there is a gap between meta_end and skb->data. Following BPF verifier
changes address this.

Also, we still need to relocate the metadata on encapsulation on the
forwarding path. VLAN and QinQ were already handled when fixing the TC
BPF helpers [1], but other tagging/tunnel code still requires similar
changes. This will be done as a follow-up.

[1] https://lore.kernel.org/all/20251105-skb-meta-rx-path-v4-0-5ceb08a9b37b@cloudflare.com/

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/skbuff.h | 14 ++++++++++++--
 net/core/skbuff.c      | 10 ++--------
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 86737076101d..6dd09f55a975 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -595,15 +595,16 @@ struct skb_shared_info {
 	__u8		meta_len;
 	__u8		nr_frags;
 	__u8		tx_flags;
+	u16		meta_end;
 	unsigned short	gso_size;
 	/* Warning: this field is not always filled in (UFO)! */
 	unsigned short	gso_segs;
+	unsigned int	gso_type;
 	struct sk_buff	*frag_list;
 	union {
 		struct skb_shared_hwtstamps hwtstamps;
 		struct xsk_tx_metadata_compl xsk_meta;
 	};
-	unsigned int	gso_type;
 	u32		tskey;
 
 	/*
@@ -4499,7 +4500,7 @@ static inline u8 skb_metadata_len(const struct sk_buff *skb)
 
 static inline void *skb_metadata_end(const struct sk_buff *skb)
 {
-	return skb_mac_header(skb);
+	return skb->head + skb_shinfo(skb)->meta_end;
 }
 
 static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
@@ -4554,8 +4555,16 @@ static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
 	       true : __skb_metadata_differs(skb_a, skb_b, len_a);
 }
 
+/**
+ * skb_metadata_set - Record packet metadata length and location.
+ * @skb: packet carrying the metadata
+ * @meta_len: number of bytes of metadata preceding skb->data
+ *
+ * Must be called when skb->data already points past the metadata area.
+ */
 static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len)
 {
+	skb_shinfo(skb)->meta_end = skb_headroom(skb);
 	skb_shinfo(skb)->meta_len = meta_len;
 }
 
@@ -4601,6 +4610,7 @@ static inline void skb_data_move(struct sk_buff *skb, const int len,
 	}
 
 	memmove(meta + len, meta, meta_len + n);
+	skb_shinfo(skb)->meta_end += len;
 	return;
 
 no_metadata:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a00808f7be6a..afff023e70b1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2325,6 +2325,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 #endif
 	skb->tail	      += off;
 	skb_headers_offset_update(skb, nhead);
+	skb_shinfo(skb)->meta_end += nhead;
 	skb->cloned   = 0;
 	skb->hdr_len  = 0;
 	skb->nohdr    = 0;
@@ -6238,8 +6239,7 @@ EXPORT_SYMBOL_GPL(skb_scrub_packet);
 
 static struct sk_buff *skb_reorder_vlan_header(struct sk_buff *skb)
 {
-	int mac_len, meta_len;
-	void *meta;
+	int mac_len;
 
 	if (skb_cow(skb, skb_headroom(skb)) < 0) {
 		kfree_skb(skb);
@@ -6252,12 +6252,6 @@ static struct sk_buff *skb_reorder_vlan_header(struct sk_buff *skb)
 			mac_len - VLAN_HLEN - ETH_TLEN);
 	}
 
-	meta_len = skb_metadata_len(skb);
-	if (meta_len) {
-		meta = skb_metadata_end(skb) - meta_len;
-		memmove(meta + VLAN_HLEN, meta, meta_len);
-	}
-
 	skb->mac_header += VLAN_HLEN;
 	return skb;
 }

-- 
2.43.0



* [PATCH bpf-next v3 11/17] bpf, verifier: Remove side effects from may_access_direct_pkt_data
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (9 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 10/17] net: Track skb metadata end separately from MAC offset Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 12/17] bpf, verifier: Turn seen_direct_write flag into a bitmap Jakub Sitnicki
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

The may_access_direct_pkt_data() helper sets env->seen_direct_write as a
side effect, which creates awkward calling patterns:

- check_special_kfunc() has a comment warning readers about the side effect
- specialize_kfunc() must save and restore the flag around the call

Make the helper a pure function by moving the seen_direct_write flag
setting to call sites that need it.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 kernel/bpf/verifier.c | 33 ++++++++++++---------------------
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 53635ea2e41b..1158c7764a34 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6151,13 +6151,9 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 		if (meta)
 			return meta->pkt_access;
 
-		env->seen_direct_write = true;
 		return true;
 
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
-		if (t == BPF_WRITE)
-			env->seen_direct_write = true;
-
 		return true;
 
 	default:
@@ -7708,15 +7704,17 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 			err = check_stack_write(env, regno, off, size,
 						value_regno, insn_idx);
 	} else if (reg_is_pkt_pointer(reg)) {
-		if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) {
-			verbose(env, "cannot write into packet\n");
-			return -EACCES;
-		}
-		if (t == BPF_WRITE && value_regno >= 0 &&
-		    is_pointer_value(env, value_regno)) {
-			verbose(env, "R%d leaks addr into packet\n",
-				value_regno);
-			return -EACCES;
+		if (t == BPF_WRITE) {
+			if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE)) {
+				verbose(env, "cannot write into packet\n");
+				return -EACCES;
+			}
+			if (value_regno >= 0 && is_pointer_value(env, value_regno)) {
+				verbose(env, "R%d leaks addr into packet\n",
+					value_regno);
+				return -EACCES;
+			}
+			env->seen_direct_write = true;
 		}
 		err = check_packet_access(env, regno, off, size, false);
 		if (!err && t == BPF_READ && value_regno >= 0)
@@ -13893,11 +13891,11 @@ static int check_special_kfunc(struct bpf_verifier_env *env, struct bpf_kfunc_ca
 		if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_slice]) {
 			regs[BPF_REG_0].type |= MEM_RDONLY;
 		} else {
-			/* this will set env->seen_direct_write to true */
 			if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE)) {
 				verbose(env, "the prog does not allow writes to packet data\n");
 				return -EINVAL;
 			}
+			env->seen_direct_write = true;
 		}
 
 		if (!meta->initialized_dynptr.id) {
@@ -22398,7 +22396,6 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 static int specialize_kfunc(struct bpf_verifier_env *env, struct bpf_kfunc_desc *desc, int insn_idx)
 {
 	struct bpf_prog *prog = env->prog;
-	bool seen_direct_write;
 	void *xdp_kfunc;
 	bool is_rdonly;
 	u32 func_id = desc->func_id;
@@ -22414,16 +22411,10 @@ static int specialize_kfunc(struct bpf_verifier_env *env, struct bpf_kfunc_desc
 			addr = (unsigned long)xdp_kfunc;
 		/* fallback to default kfunc when not supported by netdev */
 	} else if (func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
-		seen_direct_write = env->seen_direct_write;
 		is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
 
 		if (is_rdonly)
 			addr = (unsigned long)bpf_dynptr_from_skb_rdonly;
-
-		/* restore env->seen_direct_write to its original value, since
-		 * may_access_direct_pkt_data mutates it
-		 */
-		env->seen_direct_write = seen_direct_write;
 	} else if (func_id == special_kfunc_list[KF_bpf_set_dentry_xattr]) {
 		if (bpf_lsm_has_d_inode_locked(prog))
 			addr = (unsigned long)bpf_set_dentry_xattr_locked;

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH bpf-next v3 12/17] bpf, verifier: Turn seen_direct_write flag into a bitmap
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (10 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 11/17] bpf, verifier: Remove side effects from may_access_direct_pkt_data Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 13/17] bpf, verifier: Propagate packet access flags to gen_prologue Jakub Sitnicki
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Convert seen_direct_write from a boolean to a bitmap (seen_packet_access)
in preparation for tracking additional packet access patterns.

No functional change.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/bpf_verifier.h |  6 +++++-
 kernel/bpf/verifier.c        | 11 ++++++-----
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 130bcbd66f60..c8397ae51880 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -647,6 +647,10 @@ enum priv_stack_mode {
 	PRIV_STACK_ADAPTIVE,
 };
 
+enum packet_access_flags {
+	PA_F_DIRECT_WRITE = BIT(0),
+};
+
 struct bpf_subprog_info {
 	/* 'start' has to be the first field otherwise find_subprog() won't work */
 	u32 start; /* insn idx of function entry point */
@@ -773,7 +777,7 @@ struct bpf_verifier_env {
 	bool bpf_capable;
 	bool bypass_spec_v1;
 	bool bypass_spec_v4;
-	bool seen_direct_write;
+	u8 seen_packet_access;	/* combination of enum packet_access_flags */
 	bool seen_exception;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	const struct bpf_line_info *prev_linfo;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1158c7764a34..95818a7eedff 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7714,7 +7714,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 					value_regno);
 				return -EACCES;
 			}
-			env->seen_direct_write = true;
+			env->seen_packet_access |= PA_F_DIRECT_WRITE;
 		}
 		err = check_packet_access(env, regno, off, size, false);
 		if (!err && t == BPF_READ && value_regno >= 0)
@@ -13895,7 +13895,7 @@ static int check_special_kfunc(struct bpf_verifier_env *env, struct bpf_kfunc_ca
 				verbose(env, "the prog does not allow writes to packet data\n");
 				return -EINVAL;
 			}
-			env->seen_direct_write = true;
+			env->seen_packet_access |= PA_F_DIRECT_WRITE;
 		}
 
 		if (!meta->initialized_dynptr.id) {
@@ -21768,6 +21768,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 	struct bpf_prog *new_prog;
 	enum bpf_access_type type;
 	bool is_narrower_load;
+	bool seen_direct_write;
 	int epilogue_idx = 0;
 
 	if (ops->gen_epilogue) {
@@ -21795,13 +21796,13 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		}
 	}
 
-	if (ops->gen_prologue || env->seen_direct_write) {
+	seen_direct_write = env->seen_packet_access & PA_F_DIRECT_WRITE;
+	if (ops->gen_prologue || seen_direct_write) {
 		if (!ops->gen_prologue) {
 			verifier_bug(env, "gen_prologue is null");
 			return -EFAULT;
 		}
-		cnt = ops->gen_prologue(insn_buf, env->seen_direct_write,
-					env->prog);
+		cnt = ops->gen_prologue(insn_buf, seen_direct_write, env->prog);
 		if (cnt >= INSN_BUF_SIZE) {
 			verifier_bug(env, "prologue is too long");
 			return -EFAULT;

-- 
2.43.0



* [PATCH bpf-next v3 13/17] bpf, verifier: Propagate packet access flags to gen_prologue
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (11 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 12/17] bpf, verifier: Turn seen_direct_write flag into a bitmap Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 14/17] bpf, verifier: Track when data_meta pointer is loaded Jakub Sitnicki
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Change gen_prologue() to accept the packet access flags bitmap. This allows
gen_prologue() to inspect multiple access patterns when needed.

No functional change.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/bpf.h                                  |  2 +-
 kernel/bpf/cgroup.c                                  |  2 +-
 kernel/bpf/verifier.c                                |  6 ++----
 net/core/filter.c                                    | 15 ++++++++-------
 net/sched/bpf_qdisc.c                                |  3 ++-
 tools/testing/selftests/bpf/test_kmods/bpf_testmod.c |  6 +++---
 6 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5936f8e2996f..1dba2caee09c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1106,7 +1106,7 @@ struct bpf_verifier_ops {
 	bool (*is_valid_access)(int off, int size, enum bpf_access_type type,
 				const struct bpf_prog *prog,
 				struct bpf_insn_access_aux *info);
-	int (*gen_prologue)(struct bpf_insn *insn, bool direct_write,
+	int (*gen_prologue)(struct bpf_insn *insn, u32 pkt_access_flags,
 			    const struct bpf_prog *prog);
 	int (*gen_epilogue)(struct bpf_insn *insn, const struct bpf_prog *prog,
 			    s16 ctx_stack_off);
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 69988af44b37..d96465cd7d43 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -2694,7 +2694,7 @@ static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
 }
 
 static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
-				   bool direct_write,
+				   u32 pkt_access_flags,
 				   const struct bpf_prog *prog)
 {
 	/* Nothing to do for sockopt argument. The data is kzalloc'ated.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 95818a7eedff..daa90c81d802 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -21768,7 +21768,6 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 	struct bpf_prog *new_prog;
 	enum bpf_access_type type;
 	bool is_narrower_load;
-	bool seen_direct_write;
 	int epilogue_idx = 0;
 
 	if (ops->gen_epilogue) {
@@ -21796,13 +21795,12 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		}
 	}
 
-	seen_direct_write = env->seen_packet_access & PA_F_DIRECT_WRITE;
-	if (ops->gen_prologue || seen_direct_write) {
+	if (ops->gen_prologue || (env->seen_packet_access & PA_F_DIRECT_WRITE)) {
 		if (!ops->gen_prologue) {
 			verifier_bug(env, "gen_prologue is null");
 			return -EFAULT;
 		}
-		cnt = ops->gen_prologue(insn_buf, seen_direct_write, env->prog);
+		cnt = ops->gen_prologue(insn_buf, env->seen_packet_access, env->prog);
 		if (cnt >= INSN_BUF_SIZE) {
 			verifier_bug(env, "prologue is too long");
 			return -EFAULT;
diff --git a/net/core/filter.c b/net/core/filter.c
index d43df98e1ded..07af2a94cc9a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9052,7 +9052,7 @@ static bool sock_filter_is_valid_access(int off, int size,
 					       prog->expected_attach_type);
 }
 
-static int bpf_noop_prologue(struct bpf_insn *insn_buf, bool direct_write,
+static int bpf_noop_prologue(struct bpf_insn *insn_buf, u32 pkt_access_flags,
 			     const struct bpf_prog *prog)
 {
 	/* Neither direct read nor direct write requires any preliminary
@@ -9061,12 +9061,12 @@ static int bpf_noop_prologue(struct bpf_insn *insn_buf, bool direct_write,
 	return 0;
 }
 
-static int bpf_unclone_prologue(struct bpf_insn *insn_buf, bool direct_write,
+static int bpf_unclone_prologue(struct bpf_insn *insn_buf, u32 pkt_access_flags,
 				const struct bpf_prog *prog, int drop_verdict)
 {
 	struct bpf_insn *insn = insn_buf;
 
-	if (!direct_write)
+	if (!(pkt_access_flags & PA_F_DIRECT_WRITE))
 		return 0;
 
 	/* if (!skb->cloned)
@@ -9135,10 +9135,11 @@ static int bpf_gen_ld_abs(const struct bpf_insn *orig,
 	return insn - insn_buf;
 }
 
-static int tc_cls_act_prologue(struct bpf_insn *insn_buf, bool direct_write,
+static int tc_cls_act_prologue(struct bpf_insn *insn_buf, u32 pkt_access_flags,
 			       const struct bpf_prog *prog)
 {
-	return bpf_unclone_prologue(insn_buf, direct_write, prog, TC_ACT_SHOT);
+	return bpf_unclone_prologue(insn_buf, pkt_access_flags, prog,
+				    TC_ACT_SHOT);
 }
 
 static bool tc_cls_act_is_valid_access(int off, int size,
@@ -9476,10 +9477,10 @@ static bool sock_ops_is_valid_access(int off, int size,
 	return true;
 }
 
-static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
+static int sk_skb_prologue(struct bpf_insn *insn_buf, u32 pkt_access_flags,
 			   const struct bpf_prog *prog)
 {
-	return bpf_unclone_prologue(insn_buf, direct_write, prog, SK_DROP);
+	return bpf_unclone_prologue(insn_buf, pkt_access_flags, prog, SK_DROP);
 }
 
 static bool sk_skb_is_valid_access(int off, int size,
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index b9771788b9b3..dc7b9db7e785 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -132,7 +132,8 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
 
 BTF_ID_LIST_SINGLE(bpf_qdisc_init_prologue_ids, func, bpf_qdisc_init_prologue)
 
-static int bpf_qdisc_gen_prologue(struct bpf_insn *insn_buf, bool direct_write,
+static int bpf_qdisc_gen_prologue(struct bpf_insn *insn_buf,
+				  u32 pkt_access_flags,
 				  const struct bpf_prog *prog)
 {
 	struct bpf_insn *insn = insn_buf;
diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
index 1c41d03bd5a1..d698e45783e3 100644
--- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
@@ -1397,7 +1397,7 @@ static int bpf_test_mod_st_ops__test_pro_epilogue(struct st_ops_args *args)
 static int bpf_cgroup_from_id_id;
 static int bpf_cgroup_release_id;
 
-static int st_ops_gen_prologue_with_kfunc(struct bpf_insn *insn_buf, bool direct_write,
+static int st_ops_gen_prologue_with_kfunc(struct bpf_insn *insn_buf,
 					  const struct bpf_prog *prog)
 {
 	struct bpf_insn *insn = insn_buf;
@@ -1473,7 +1473,7 @@ static int st_ops_gen_epilogue_with_kfunc(struct bpf_insn *insn_buf, const struc
 }
 
 #define KFUNC_PRO_EPI_PREFIX "test_kfunc_"
-static int st_ops_gen_prologue(struct bpf_insn *insn_buf, bool direct_write,
+static int st_ops_gen_prologue(struct bpf_insn *insn_buf, u32 pkt_access_flags,
 			       const struct bpf_prog *prog)
 {
 	struct bpf_insn *insn = insn_buf;
@@ -1483,7 +1483,7 @@ static int st_ops_gen_prologue(struct bpf_insn *insn_buf, bool direct_write,
 		return 0;
 
 	if (!strncmp(prog->aux->name, KFUNC_PRO_EPI_PREFIX, strlen(KFUNC_PRO_EPI_PREFIX)))
-		return st_ops_gen_prologue_with_kfunc(insn_buf, direct_write, prog);
+		return st_ops_gen_prologue_with_kfunc(insn_buf, prog);
 
 	/* r6 = r1[0]; // r6 will be "struct st_ops *args". r1 is "u64 *ctx".
 	 * r7 = r6->a;

-- 
2.43.0



* [PATCH bpf-next v3 14/17] bpf, verifier: Track when data_meta pointer is loaded
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (12 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 13/17] bpf, verifier: Propagate packet access flags to gen_prologue Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 15/17] bpf, verifier: Support direct kernel calls in gen_prologue Jakub Sitnicki
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Introduce PA_F_DATA_META_LOAD flag to track when a BPF program loads the
skb->data_meta pointer.

This information will be used by gen_prologue() to handle cases where there
is a gap between metadata end and skb->data, requiring metadata to be
realigned.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/bpf_verifier.h | 1 +
 kernel/bpf/verifier.c        | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index c8397ae51880..b32ddf0f0ab3 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -649,6 +649,7 @@ enum priv_stack_mode {
 
 enum packet_access_flags {
 	PA_F_DIRECT_WRITE = BIT(0),
+	PA_F_DATA_META_LOAD = BIT(1),
 };
 
 struct bpf_subprog_info {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index daa90c81d802..76f2befc8159 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6226,6 +6226,10 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
 		} else {
 			env->insn_aux_data[insn_idx].ctx_field_size = info->ctx_field_size;
 		}
+
+		if (base_type(info->reg_type) == PTR_TO_PACKET_META)
+			env->seen_packet_access |= PA_F_DATA_META_LOAD;
+
 		/* remember the offset of last byte accessed in ctx */
 		if (env->prog->aux->max_ctx_offset < off + size)
 			env->prog->aux->max_ctx_offset = off + size;

-- 
2.43.0



* [PATCH bpf-next v3 15/17] bpf, verifier: Support direct kernel calls in gen_prologue
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (13 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 14/17] bpf, verifier: Track when data_meta pointer is loaded Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 14:28 ` [PATCH bpf-next v3 16/17] bpf: Realign skb metadata for TC progs using data_meta Jakub Sitnicki
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Prepare the ground for the next patch, which will use BPF_EMIT_CALL to
emit a call to a regular kernel function, not a kfunc or a BPF helper,
from the prologue generator.

These calls use offsets relative to __bpf_call_base and must bypass the
verifier's patch_call_imm fixup, which expects BPF helper IDs rather than
pre-resolved offsets.

Add a finalized_call flag to bpf_insn_aux_data to mark call instructions
with finalized offsets so the verifier can skip patch_call_imm fixup for
these calls.

As a follow-up, existing gen_prologue and gen_epilogue callbacks using
kfuncs can be converted to BPF_EMIT_CALL, removing the need for kfunc
resolution during prologue/epilogue generation.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/bpf_verifier.h |  1 +
 kernel/bpf/verifier.c        | 12 ++++++++++++
 net/core/filter.c            |  5 +++--
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index b32ddf0f0ab3..9ccd56c04a45 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -561,6 +561,7 @@ struct bpf_insn_aux_data {
 	bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
 	bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
 	bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
+	bool finalized_call; /* call holds function offset relative to __bpf_call_base */
 	u8 alu_state; /* used in combination with alu_limit */
 	/* true if STX or LDX instruction is a part of a spill/fill
 	 * pattern for a bpf_fastcall call.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 76f2befc8159..219e233cc4c6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -21816,6 +21816,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			env->prog = new_prog;
 			delta += cnt - 1;
 
+			/* gen_prologue emits function calls with target address
+			 * relative to __bpf_call_base. Skip patch_call_imm fixup.
+			 */
+			for (i = 0; i < cnt - 1; i++) {
+				if (bpf_helper_call(&env->prog->insnsi[i]))
+					env->insn_aux_data[i].finalized_call = true;
+			}
+
 			ret = add_kfunc_in_insns(env, insn_buf, cnt - 1);
 			if (ret < 0)
 				return ret;
@@ -23422,6 +23430,9 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto next_insn;
 		}
 patch_call_imm:
+		if (env->insn_aux_data[i + delta].finalized_call)
+			goto next_insn;
+
 		fn = env->ops->get_func_proto(insn->imm, env->prog);
 		/* all functions that have prototype and verifier allowed
 		 * programs to call them, must be real in-kernel functions
@@ -23433,6 +23444,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			return -EFAULT;
 		}
 		insn->imm = fn->func - __bpf_call_base;
+		env->insn_aux_data[i + delta].finalized_call = true;
 next_insn:
 		if (subprogs[cur_subprog + 1].start == i + delta + 1) {
 			subprogs[cur_subprog].stack_depth += stack_depth_extra;
diff --git a/net/core/filter.c b/net/core/filter.c
index 07af2a94cc9a..e91d5a39e0a7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9080,10 +9080,11 @@ static int bpf_unclone_prologue(struct bpf_insn *insn_buf, u32 pkt_access_flags,
 	*insn++ = BPF_JMP_IMM(BPF_JEQ, BPF_REG_6, 0, 7);
 
 	/* ret = bpf_skb_pull_data(skb, 0); */
+	BUILD_BUG_ON(!__same_type(btf_bpf_skb_pull_data,
+				  (u64 (*)(struct sk_buff *, u32))NULL));
 	*insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
 	*insn++ = BPF_ALU64_REG(BPF_XOR, BPF_REG_2, BPF_REG_2);
-	*insn++ = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
-			       BPF_FUNC_skb_pull_data);
+	*insn++ = BPF_EMIT_CALL(bpf_skb_pull_data);
 	/* if (!ret)
 	 *      goto restore;
 	 * return TC_ACT_SHOT;

-- 
2.43.0



* [PATCH bpf-next v3 16/17] bpf: Realign skb metadata for TC progs using data_meta
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (14 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 15/17] bpf, verifier: Support direct kernel calls in gen_prologue Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-07 22:01   ` Alexei Starovoitov
  2026-01-07 14:28 ` [PATCH bpf-next v3 17/17] selftests/bpf: Test skb metadata access after L2 decapsulation Jakub Sitnicki
  2026-01-08 15:47 ` [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Kicinski
  17 siblings, 1 reply; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

After decoupling metadata location from the MAC header offset, a gap can
appear between metadata and skb->data on L2 decapsulation (e.g., VLAN,
GRE). This breaks the BPF data_meta pointer, which assumes metadata sits
directly before skb->data.

Introduce bpf_skb_meta_realign(), a regular kernel function called
directly from the generated prologue, to close the gap by moving
metadata so that it immediately precedes the MAC header. Inject the call
in tc_cls_act_prologue() when the verifier has detected data_meta access
(the PA_F_DATA_META_LOAD flag).

Update skb_data_move() to handle the gap case: on skb_push(), move metadata
to the top of the head buffer; on skb_pull() where metadata is already
detached, leave it in place.

This restores data_meta functionality for TC programs while keeping the
performance benefit of avoiding memmove on L2 decapsulation for programs
that don't use data_meta.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/skbuff.h | 25 +++++++++++++++++--------
 net/core/filter.c      | 46 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6dd09f55a975..0fc4df42826e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4600,19 +4600,28 @@ static inline void skb_data_move(struct sk_buff *skb, const int len,
 	if (!meta_len)
 		goto no_metadata;
 
-	meta_end = skb_metadata_end(skb);
-	meta = meta_end - meta_len;
-
-	if (WARN_ON_ONCE(meta_end + len != skb->data ||
-			 meta_len > skb_headroom(skb))) {
+	/* Not enough headroom left for metadata. Drop it. */
+	if (WARN_ONCE(meta_len > skb_headroom(skb),
+		      "skb headroom smaller than metadata")) {
 		skb_metadata_clear(skb);
 		goto no_metadata;
 	}
 
-	memmove(meta + len, meta, meta_len + n);
-	skb_shinfo(skb)->meta_end += len;
-	return;
+	meta_end = skb_metadata_end(skb);
+	meta = meta_end - meta_len;
 
+	/* Metadata in front of data before push/pull. Keep it that way. */
+	if (meta_end == skb->data - len) {
+		memmove(meta + len, meta, meta_len + n);
+		skb_shinfo(skb)->meta_end += len;
+		return;
+	}
+
+	if (len < 0) {
+		/* Data pushed. Move metadata to the top. */
+		memmove(skb->head, meta, meta_len);
+		skb_shinfo(skb)->meta_end = meta_len;
+	}
 no_metadata:
 	memmove(skb->data, skb->data - len, n);
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index e91d5a39e0a7..df4c97fe79ee 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9136,11 +9136,53 @@ static int bpf_gen_ld_abs(const struct bpf_insn *orig,
 	return insn - insn_buf;
 }
 
+static void bpf_skb_meta_realign(struct sk_buff *skb)
+{
+	u8 *meta_end = skb_metadata_end(skb);
+	u8 meta_len = skb_metadata_len(skb);
+	u8 *meta;
+	int gap;
+
+	gap = skb_mac_header(skb) - meta_end;
+	if (!meta_len || !gap)
+		return;
+
+	if (WARN_ONCE(gap < 0, "skb metadata end past mac header")) {
+		skb_metadata_clear(skb);
+		return;
+	}
+
+	meta = meta_end - meta_len;
+	memmove(meta + gap, meta, meta_len);
+	skb_shinfo(skb)->meta_end += gap;
+
+	bpf_compute_data_pointers(skb);
+}
+
 static int tc_cls_act_prologue(struct bpf_insn *insn_buf, u32 pkt_access_flags,
 			       const struct bpf_prog *prog)
 {
-	return bpf_unclone_prologue(insn_buf, pkt_access_flags, prog,
-				    TC_ACT_SHOT);
+	struct bpf_insn *insn = insn_buf;
+	int cnt;
+
+	if (pkt_access_flags & PA_F_DATA_META_LOAD) {
+		/* Realign skb metadata for access through data_meta pointer.
+		 *
+		 * r6 = r1; // r6 will be "u64 *ctx"
+		 * r0 = bpf_skb_meta_realign(r1); // r0 is undefined
+		 * r1 = r6;
+		 */
+		BUILD_BUG_ON(!__same_type(&bpf_skb_meta_realign,
+					  (void (*)(struct sk_buff *))NULL));
+		*insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
+		*insn++ = BPF_EMIT_CALL(bpf_skb_meta_realign);
+		*insn++ = BPF_MOV64_REG(BPF_REG_1, BPF_REG_6);
+	}
+	cnt = bpf_unclone_prologue(insn, pkt_access_flags, prog, TC_ACT_SHOT);
+	if (!cnt && insn > insn_buf)
+		*insn++ = prog->insnsi[0];
+
+	return cnt + insn - insn_buf;
 }
 
 static bool tc_cls_act_is_valid_access(int off, int size,

-- 
2.43.0



* [PATCH bpf-next v3 17/17] selftests/bpf: Test skb metadata access after L2 decapsulation
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (15 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 16/17] bpf: Realign skb metadata for TC progs using data_meta Jakub Sitnicki
@ 2026-01-07 14:28 ` Jakub Sitnicki
  2026-01-08 15:47 ` [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Kicinski
  17 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-07 14:28 UTC (permalink / raw)
  To: bpf
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

Add tests to verify that XDP/skb metadata remains accessible after L2
decapsulation now that metadata tracking has been decoupled from the MAC
header offset.

Use a 2-netns setup to test each L2 decapsulation path that resets the MAC
header offset: GRE (IPv4/IPv6), VXLAN, GENEVE, L2TPv3, VLAN, QinQ, and
MPLS. SRv6 and NSH are not tested due to setup complexity (SRv6 requires 4
namespaces according to selftests/net/srv6_hl2encap_red_l2vpn_test.sh, NSH
requires Open vSwitch).

For each encapsulation type, test three access scenarios after L2 decap:
- direct skb->data_meta pointer access
- dynptr-based metadata access (bpf_dynptr_from_skb_meta)
- metadata access after move or skb head realloc (bpf_skb_adjust_room)

Change test_xdp_meta.c to identify test packets by payload content instead
of source MAC address, since L2 encapsulation pushes its own MAC header.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/testing/selftests/bpf/config                 |   6 +-
 .../bpf/prog_tests/xdp_context_test_run.c          | 292 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_xdp_meta.c  |  48 ++--
 3 files changed, 325 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 558839e3c185..f867b7eff085 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -130,4 +130,8 @@ CONFIG_INFINIBAND=y
 CONFIG_SMC=y
 CONFIG_SMC_HS_CTRL_BPF=y
 CONFIG_DIBS=y
-CONFIG_DIBS_LO=y
\ No newline at end of file
+CONFIG_DIBS_LO=y
+CONFIG_L2TP=y
+CONFIG_L2TP_V3=y
+CONFIG_L2TP_IP=y
+CONFIG_L2TP_ETH=y
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c b/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
index ee94c281888a..dc7f936216db 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
@@ -8,9 +8,14 @@
 #define TX_NAME "veth1"
 #define TX_NETNS "xdp_context_tx"
 #define RX_NETNS "xdp_context_rx"
+#define RX_MAC "02:00:00:00:00:01"
+#define TX_MAC "02:00:00:00:00:02"
 #define TAP_NAME "tap0"
 #define DUMMY_NAME "dum0"
 #define TAP_NETNS "xdp_context_tuntap"
+#define ENCAP_DEV "encap"
+#define DECAP_RX_NETNS "xdp_context_decap_rx"
+#define DECAP_TX_NETNS "xdp_context_decap_tx"
 
 #define TEST_PAYLOAD_LEN 32
 static const __u8 test_payload[TEST_PAYLOAD_LEN] = {
@@ -127,6 +132,7 @@ static int send_test_packet(int ifindex)
 	/* We use the Ethernet header only to identify the test packet */
 	struct ethhdr eth = {
 		.h_source = { 0x12, 0x34, 0xDE, 0xAD, 0xBE, 0xEF },
+		.h_proto = htons(ETH_P_IP),
 	};
 
 	memcpy(packet, &eth, sizeof(eth));
@@ -155,6 +161,34 @@ static int send_test_packet(int ifindex)
 	return -1;
 }
 
+static int send_routed_packet(int af, const char *ip)
+{
+	struct sockaddr_storage addr;
+	socklen_t alen;
+	int r, sock = -1;
+
+	r = make_sockaddr(af, ip, 42, &addr, &alen);
+	if (!ASSERT_OK(r, "make_sockaddr"))
+		goto err;
+
+	sock = socket(af, SOCK_DGRAM, 0);
+	if (!ASSERT_OK_FD(sock, "socket"))
+		goto err;
+
+	r = sendto(sock, test_payload, sizeof(test_payload), 0,
+		   (struct sockaddr *)&addr, alen);
+	if (!ASSERT_EQ(r, sizeof(test_payload), "sendto"))
+		goto err;
+
+	close(sock);
+	return 0;
+
+err:
+	if (sock >= 0)
+		close(sock);
+	return -1;
+}
+
 static int write_test_packet(int tap_fd)
 {
 	__u8 packet[sizeof(struct ethhdr) + TEST_PAYLOAD_LEN];
@@ -510,3 +544,261 @@ void test_xdp_context_tuntap(void)
 
 	test_xdp_meta__destroy(skel);
 }
+
+enum l2_encap_type {
+	GRE4_ENCAP,
+	GRE6_ENCAP,
+	VXLAN_ENCAP,
+	GENEVE_ENCAP,
+	L2TPV3_ENCAP,
+	VLAN_ENCAP,
+	QINQ_ENCAP,
+	MPLS_ENCAP,
+};
+
+static bool l2_encap_uses_ipv6(enum l2_encap_type encap_type)
+{
+	return encap_type == GRE6_ENCAP;
+}
+
+static bool l2_encap_uses_routing(enum l2_encap_type encap_type)
+{
+	return encap_type == MPLS_ENCAP;
+}
+
+static bool setup_l2_encap_dev(enum l2_encap_type encap_type,
+			       const char *encap_dev, const char *lower_dev,
+			       const char *net_prefix, const char *local_addr,
+			       const char *remote_addr)
+{
+	switch (encap_type) {
+	case GRE4_ENCAP:
+		SYS(fail, "ip link add %s type gretap local %s remote %s",
+		    encap_dev, local_addr, remote_addr);
+		return true;
+
+	case GRE6_ENCAP:
+		SYS(fail, "ip link add %s type ip6gretap local %s remote %s",
+		    encap_dev, local_addr, remote_addr);
+		return true;
+
+	case VXLAN_ENCAP:
+		SYS(fail,
+		    "ip link add %s type vxlan id 42 local %s remote %s dstport 4789",
+		    encap_dev, local_addr, remote_addr);
+		return true;
+
+	case GENEVE_ENCAP:
+		SYS(fail,
+		    "ip link add %s type geneve id 42 remote %s",
+		    encap_dev, remote_addr);
+		return true;
+
+	case L2TPV3_ENCAP:
+		SYS(fail,
+		    "ip l2tp add tunnel tunnel_id 42 peer_tunnel_id 42 encap ip local %s remote %s",
+		    local_addr, remote_addr);
+		SYS(fail,
+		    "ip l2tp add session name %s tunnel_id 42 session_id 42 peer_session_id 42",
+		    encap_dev);
+		return true;
+
+	case VLAN_ENCAP:
+		SYS(fail, "ip link set dev %s down", lower_dev);
+		SYS(fail, "ethtool -K %s rx-vlan-hw-parse off", lower_dev);
+		SYS(fail, "ethtool -K %s tx-vlan-hw-insert off", lower_dev);
+		SYS(fail, "ip link set dev %s up", lower_dev);
+		SYS(fail, "ip link add %s link %s type vlan id 42", encap_dev,
+		    lower_dev);
+		return true;
+
+	case QINQ_ENCAP:
+		SYS(fail, "ip link set dev %s down", lower_dev);
+		SYS(fail, "ethtool -K %s rx-vlan-hw-parse off", lower_dev);
+		SYS(fail, "ethtool -K %s tx-vlan-hw-insert off", lower_dev);
+		SYS(fail, "ethtool -K %s rx-vlan-stag-hw-parse off", lower_dev);
+		SYS(fail, "ethtool -K %s tx-vlan-stag-hw-insert off", lower_dev);
+		SYS(fail, "ip link set dev %s up", lower_dev);
+		SYS(fail, "ip link add vlan.100 link %s type vlan proto 802.1ad id 100", lower_dev);
+		SYS(fail, "ip link set dev vlan.100 up");
+		SYS(fail, "ip link add %s link vlan.100 type vlan id 42", encap_dev);
+		return true;
+
+	case MPLS_ENCAP:
+		SYS(fail, "sysctl -wq net.mpls.platform_labels=65535");
+		SYS(fail, "sysctl -wq net.mpls.conf.%s.input=1", lower_dev);
+		SYS(fail, "ip route change %s encap mpls 42 via %s", net_prefix, remote_addr);
+		SYS(fail, "ip -f mpls route add 42 dev lo");
+		SYS(fail, "ip link set dev lo name %s", encap_dev);
+
+		return true;
+	}
+fail:
+	return false;
+}
+
+static void test_l2_decap(enum l2_encap_type encap_type,
+			  struct bpf_program *xdp_prog,
+			  struct bpf_program *tc_prog, bool *test_pass)
+{
+	LIBBPF_OPTS(bpf_tc_hook, tc_hook, .attach_point = BPF_TC_INGRESS);
+	LIBBPF_OPTS(bpf_tc_opts, tc_opts, .handle = 1, .priority = 1);
+	const char *net, *rx_ip, *tx_ip, *addr_opts;
+	int af, plen;
+	struct netns_obj *rx_ns = NULL, *tx_ns = NULL;
+	struct nstoken *nstoken = NULL;
+	int lower_ifindex, upper_ifindex;
+	int ret;
+
+	if (l2_encap_uses_ipv6(encap_type)) {
+		af = AF_INET6;
+		net = "fd00::/64";
+		rx_ip = "fd00::1";
+		tx_ip = "fd00::2";
+		plen = 64;
+		addr_opts = "nodad";
+	} else {
+		af = AF_INET;
+		net = "192.0.2.0/24";
+		rx_ip = "192.0.2.1";
+		tx_ip = "192.0.2.2";
+		plen = 24;
+		addr_opts = "";
+	}
+
+	*test_pass = false;
+
+	rx_ns = netns_new(DECAP_RX_NETNS, false);
+	if (!ASSERT_OK_PTR(rx_ns, "create rx_ns"))
+		return;
+
+	tx_ns = netns_new(DECAP_TX_NETNS, false);
+	if (!ASSERT_OK_PTR(tx_ns, "create tx_ns"))
+		goto close;
+
+	SYS(close, "ip link add " RX_NAME " address " RX_MAC " netns " DECAP_RX_NETNS
+		   " type veth peer name " TX_NAME " address " TX_MAC " netns " DECAP_TX_NETNS);
+
+	nstoken = open_netns(DECAP_RX_NETNS);
+	if (!ASSERT_OK_PTR(nstoken, "setns rx_ns"))
+		goto close;
+
+	SYS(close, "ip addr add %s/%u dev %s %s", rx_ip, plen, RX_NAME, addr_opts);
+	SYS(close, "ip link set dev %s up", RX_NAME);
+
+	if (!setup_l2_encap_dev(encap_type, ENCAP_DEV, RX_NAME, net, rx_ip, tx_ip))
+		goto close;
+	SYS(close, "ip link set dev %s up", ENCAP_DEV);
+
+	lower_ifindex = if_nametoindex(RX_NAME);
+	if (!ASSERT_GT(lower_ifindex, 0, "if_nametoindex lower"))
+		goto close;
+
+	upper_ifindex = if_nametoindex(ENCAP_DEV);
+	if (!ASSERT_GT(upper_ifindex, 0, "if_nametoindex upper"))
+		goto close;
+
+	ret = bpf_xdp_attach(lower_ifindex, bpf_program__fd(xdp_prog), 0, NULL);
+	if (!ASSERT_GE(ret, 0, "bpf_xdp_attach"))
+		goto close;
+
+	tc_hook.ifindex = upper_ifindex;
+	ret = bpf_tc_hook_create(&tc_hook);
+	if (!ASSERT_OK(ret, "bpf_tc_hook_create"))
+		goto close;
+
+	tc_opts.prog_fd = bpf_program__fd(tc_prog);
+	ret = bpf_tc_attach(&tc_hook, &tc_opts);
+	if (!ASSERT_OK(ret, "bpf_tc_attach"))
+		goto close;
+
+	close_netns(nstoken);
+
+	nstoken = open_netns(DECAP_TX_NETNS);
+	if (!ASSERT_OK_PTR(nstoken, "setns tx_ns"))
+		goto close;
+
+	SYS(close, "ip addr add %s/%u dev %s %s", tx_ip, plen, TX_NAME, addr_opts);
+	SYS(close, "ip neigh add %s lladdr %s nud permanent dev %s", rx_ip, RX_MAC, TX_NAME);
+	SYS(close, "ip link set dev %s up", TX_NAME);
+
+	if (!setup_l2_encap_dev(encap_type, ENCAP_DEV, TX_NAME, net, tx_ip, rx_ip))
+		goto close;
+	SYS(close, "ip link set dev %s up", ENCAP_DEV);
+
+	upper_ifindex = if_nametoindex(ENCAP_DEV);
+	if (!ASSERT_GT(upper_ifindex, 0, "if_nametoindex upper"))
+		goto close;
+
+	if (l2_encap_uses_routing(encap_type))
+		ret = send_routed_packet(af, rx_ip);
+	else
+		ret = send_test_packet(upper_ifindex);
+	if (!ASSERT_OK(ret, "send packet"))
+		goto close;
+
+	if (!ASSERT_TRUE(*test_pass, "test_pass"))
+		dump_err_stream(tc_prog);
+
+close:
+	close_netns(nstoken);
+	netns_free(rx_ns);
+	netns_free(tx_ns);
+}
+
+__printf(1, 2) static bool start_subtest(const char *fmt, ...)
+{
+	char *subtest_name;
+	va_list ap;
+	int r;
+
+	va_start(ap, fmt);
+	r = vasprintf(&subtest_name, fmt, ap);
+	va_end(ap);
+	if (!ASSERT_GE(r, 0, "format string"))
+		return false;
+
+	r = test__start_subtest(subtest_name);
+	free(subtest_name);
+	return r;
+}
+
+void test_xdp_context_l2_decap(void)
+{
+	const struct test {
+		enum l2_encap_type encap_type;
+		const char *encap_name;
+	} tests[] = {
+		{ GRE4_ENCAP, "gre4" },
+		{ GRE6_ENCAP, "gre6" },
+		{ VXLAN_ENCAP, "vxlan" },
+		{ GENEVE_ENCAP, "geneve" },
+		{ L2TPV3_ENCAP, "l2tpv3" },
+		{ VLAN_ENCAP, "vlan" },
+		{ QINQ_ENCAP, "qinq" },
+		{ MPLS_ENCAP, "mpls" },
+	};
+	struct test_xdp_meta *skel;
+	const struct test *t;
+
+	skel = test_xdp_meta__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open and load skeleton"))
+		return;
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		if (start_subtest("%s_direct_access", t->encap_name))
+			test_l2_decap(t->encap_type, skel->progs.ing_xdp,
+				      skel->progs.ing_cls,
+				      &skel->bss->test_pass);
+		if (start_subtest("%s_dynptr_read", t->encap_name))
+			test_l2_decap(t->encap_type, skel->progs.ing_xdp,
+				      skel->progs.ing_cls_dynptr_read,
+				      &skel->bss->test_pass);
+		if (start_subtest("%s_helper_adjust_room", t->encap_name))
+			test_l2_decap(t->encap_type, skel->progs.ing_xdp,
+				      skel->progs.helper_skb_adjust_room,
+				      &skel->bss->test_pass);
+	}
+
+	test_xdp_meta__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_meta.c b/tools/testing/selftests/bpf/progs/test_xdp_meta.c
index 0a0f371a2dec..69130e250e84 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp_meta.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp_meta.c
@@ -280,18 +280,35 @@ int ing_cls_dynptr_offset_oob(struct __sk_buff *ctx)
 	return TC_ACT_SHOT;
 }
 
+/* Test packets carry test metadata pattern as payload. */
+static bool is_test_packet(struct xdp_md *ctx)
+{
+	__u8 meta_have[META_SIZE];
+	__u32 len;
+	int ret;
+
+	len = bpf_xdp_get_buff_len(ctx);
+	if (len < META_SIZE)
+		return false;
+	ret = bpf_xdp_load_bytes(ctx, len - META_SIZE, meta_have, META_SIZE);
+	if (ret)
+		return false;
+	ret = __builtin_memcmp(meta_have, meta_want, META_SIZE);
+	if (ret)
+		return false;
+
+	return true;
+}
+
 /* Reserve and clear space for metadata but don't populate it */
 SEC("xdp")
 int ing_xdp_zalloc_meta(struct xdp_md *ctx)
 {
-	struct ethhdr *eth = ctx_ptr(ctx, data);
 	__u8 *meta;
 	int ret;
 
 	/* Drop any non-test packets */
-	if (eth + 1 > ctx_ptr(ctx, data_end))
-		return XDP_DROP;
-	if (!check_smac(eth))
+	if (!is_test_packet(ctx))
 		return XDP_DROP;
 
 	ret = bpf_xdp_adjust_meta(ctx, -META_SIZE);
@@ -310,33 +327,24 @@ int ing_xdp_zalloc_meta(struct xdp_md *ctx)
 SEC("xdp")
 int ing_xdp(struct xdp_md *ctx)
 {
-	__u8 *data, *data_meta, *data_end, *payload;
-	struct ethhdr *eth;
+	__u8 *data, *data_meta;
 	int ret;
 
+	/* Drop any non-test packets */
+	if (!is_test_packet(ctx))
+		return XDP_DROP;
+
 	ret = bpf_xdp_adjust_meta(ctx, -META_SIZE);
 	if (ret < 0)
 		return XDP_DROP;
 
 	data_meta = ctx_ptr(ctx, data_meta);
-	data_end  = ctx_ptr(ctx, data_end);
 	data      = ctx_ptr(ctx, data);
 
-	eth = (struct ethhdr *)data;
-	payload = data + sizeof(struct ethhdr);
-
-	if (payload + META_SIZE > data_end ||
-	    data_meta + META_SIZE > data)
-		return XDP_DROP;
-
-	/* The Linux networking stack may send other packets on the test
-	 * interface that interfere with the test. Just drop them.
-	 * The test packets can be recognized by their source MAC address.
-	 */
-	if (!check_smac(eth))
+	if (data_meta + META_SIZE > data)
 		return XDP_DROP;
 
-	__builtin_memcpy(data_meta, payload, META_SIZE);
+	__builtin_memcpy(data_meta, meta_want, META_SIZE);
 	return XDP_PASS;
 }
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next v3 16/17] bpf: Realign skb metadata for TC progs using data_meta
  2026-01-07 14:28 ` [PATCH bpf-next v3 16/17] bpf: Realign skb metadata for TC progs using data_meta Jakub Sitnicki
@ 2026-01-07 22:01   ` Alexei Starovoitov
  2026-01-08 19:54     ` Jakub Sitnicki
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2026-01-07 22:01 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Network Development, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

On Wed, Jan 7, 2026 at 6:28 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
>
> +static void bpf_skb_meta_realign(struct sk_buff *skb)
> +{
> +       u8 *meta_end = skb_metadata_end(skb);
> +       u8 meta_len = skb_metadata_len(skb);
> +       u8 *meta;
> +       int gap;
> +
> +       gap = skb_mac_header(skb) - meta_end;
> +       if (!meta_len || !gap)
> +               return;
> +
> +       if (WARN_ONCE(gap < 0, "skb metadata end past mac header")) {
> +               skb_metadata_clear(skb);
> +               return;
> +       }
> +
> +       meta = meta_end - meta_len;
> +       memmove(meta + gap, meta, meta_len);
> +       skb_shinfo(skb)->meta_end += gap;
> +
> +       bpf_compute_data_pointers(skb);
> +}
> +
>  static int tc_cls_act_prologue(struct bpf_insn *insn_buf, u32 pkt_access_flags,
>                                const struct bpf_prog *prog)
>  {
> -       return bpf_unclone_prologue(insn_buf, pkt_access_flags, prog,
> -                                   TC_ACT_SHOT);
> +       struct bpf_insn *insn = insn_buf;
> +       int cnt;
> +
> +       if (pkt_access_flags & PA_F_DATA_META_LOAD) {
> +               /* Realign skb metadata for access through data_meta pointer.
> +                *
> +                * r6 = r1; // r6 will be "u64 *ctx"
> +                * r0 = bpf_skb_meta_realign(r1); // r0 is undefined
> +                * r1 = r6;
> +                */
> +               BUILD_BUG_ON(!__same_type(&bpf_skb_meta_realign,
> +                                         (void (*)(struct sk_buff *))NULL));
> +               *insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
> +               *insn++ = BPF_EMIT_CALL(bpf_skb_meta_realign);

Not quite. drop this BUILD_BUG_ON(), since it's pointless and misleading.
bpf_skb_meta_realign() has to be the one done with BPF_CALL_1(...).
Otherwise above will work only on x86.
In this case on arm64 too, but it's by accident.
BPF_CALL* has to do ABI conversion from BPF to native.

For kfuncs that's what JITs do via btf_func_model machinery.

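For readers not familiar with the BPF helper calling convention, here is a simplified user-space illustration (NOT the kernel's actual BPF_CALL_1 macro, and the names are made up for the demo) of the ABI conversion being asked for: BPF-side calls always go through a fixed five-u64-argument signature, and the macro-generated shim casts back to the helper's native parameter types.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for what a BPF_CALL_1-style wrapper generates: the
 * helper body is written with native types, while the entry point
 * exposed to BPF code takes five u64 "registers" and converts r1.
 */
#define DEMO_BPF_CALL_1(name, type1, arg1)				\
	static uint64_t ____##name(type1 arg1);				\
	static uint64_t name(uint64_t r1, uint64_t r2, uint64_t r3,	\
			     uint64_t r4, uint64_t r5)			\
	{								\
		(void)r2; (void)r3; (void)r4; (void)r5;			\
		return ____##name((type1)(uintptr_t)r1);		\
	}								\
	static uint64_t ____##name(type1 arg1)

struct demo_skb { int meta_len; };	/* stand-in for struct sk_buff */

/* Helper body uses the native pointer type; the shim above adapts
 * the fixed u64 signature that BPF-side calls use. */
DEMO_BPF_CALL_1(demo_get_meta_len, struct demo_skb *, skb)
{
	return skb->meta_len;
}
```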
pw-bot: cr

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset
  2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
                   ` (16 preceding siblings ...)
  2026-01-07 14:28 ` [PATCH bpf-next v3 17/17] selftests/bpf: Test skb metadata access after L2 decapsulation Jakub Sitnicki
@ 2026-01-08 15:47 ` Jakub Kicinski
  2026-01-08 19:25   ` Jakub Sitnicki
  17 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2026-01-08 15:47 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Simon Horman, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	KP Singh, Hao Luo, Jiri Olsa, kernel-team

On Wed, 07 Jan 2026 15:28:00 +0100 Jakub Sitnicki wrote:
> This series continues the effort to provide reliable access to xdp/skb
> metadata from BPF context on the receive path. We have recently talked
> about it at Plumbers [1].
> 
> Currently skb metadata location is tied to the MAC header offset:
> 
>   [headroom][metadata][MAC hdr][L3 pkt]
>                       ^
>                       skb_metadata_end = head + mac_header
> 
> This design breaks on L2 decapsulation (VLAN, GRE, etc.) when the MAC
> offset is reset. The naive fix is to memmove metadata on every decap path,
> but we can avoid this cost by tracking metadata position independently.
> 
> Introduce a dedicated meta_end field in skb_shared_info that records where
> metadata ends relative to skb->head:
> 
>   [headroom][metadata][gap][MAC hdr][L3 pkt]
>                      ^
>                      skb_metadata_end = head + meta_end
>                      
> This allows BPF dynptr access (bpf_dynptr_from_skb_meta()) to work without
> memmove. For skb->data_meta pointer access, which expects metadata
> immediately before skb->data, make the verifier inject realignment code in
> TC BPF prologue.

I don't understand what semantics for the buffer layout you're trying
to establish, we now have "headroom" and "gap"?

	[headroom][metadata][gap][packet]

You're not solving the encap side either, skb_push() will still happily
encroach on the metadata. Feels like duct tape; we can't fundamentally
update the layout of the skb without updating all the helpers.
metadata works perfectly fine for its intended use case - passing info
about the frame from XDP offload to XDP and then to TC.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset
  2026-01-08 15:47 ` [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Kicinski
@ 2026-01-08 19:25   ` Jakub Sitnicki
  2026-01-09  1:49     ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-08 19:25 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Simon Horman, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	KP Singh, Hao Luo, Jiri Olsa, kernel-team

On Thu, Jan 08, 2026 at 07:47 AM -08, Jakub Kicinski wrote:
> On Wed, 07 Jan 2026 15:28:00 +0100 Jakub Sitnicki wrote:
>> This series continues the effort to provide reliable access to xdp/skb
>> metadata from BPF context on the receive path. We have recently talked
>> about it at Plumbers [1].
>> 
>> Currently skb metadata location is tied to the MAC header offset:
>> 
>>   [headroom][metadata][MAC hdr][L3 pkt]
>>                       ^
>>                       skb_metadata_end = head + mac_header
>> 
>> This design breaks on L2 decapsulation (VLAN, GRE, etc.) when the MAC
>> offset is reset. The naive fix is to memmove metadata on every decap path,
>> but we can avoid this cost by tracking metadata position independently.
>> 
>> Introduce a dedicated meta_end field in skb_shared_info that records where
>> metadata ends relative to skb->head:
>> 
>>   [headroom][metadata][gap][MAC hdr][L3 pkt]
>>                      ^
>>                      skb_metadata_end = head + meta_end
>>                      
>> This allows BPF dynptr access (bpf_dynptr_from_skb_meta()) to work without
>> memmove. For skb->data_meta pointer access, which expects metadata
>> immediately before skb->data, make the verifier inject realignment code in
>> TC BPF prologue.
>
> I don't understand what semantics for the buffer layout you're trying
> to establish, we now have "headroom" and "gap"?
>
> 	[headroom][metadata][gap][packet]
>
> You're not solving the encap side either, skb_push() will still happily
> encroach on the metadata. Feels like duct tape; we can't fundamentally
> update the layout of the skb without updating all the helpers.
> metadata works perfectly fine for its intended use case - passing info
> about the frame from XDP offload to XDP and then to TC.

<joke>
Man, that makes me think I'm a terrible speaker.
Or you just missed my talk :-)
</joke>

You're right that metadata passing between XDP and TC works well in the
simple case when both BPF progs are attached to the same device.

However, the metadata doesn't get cleared when skb undergoes L2
decapsulation. So if you attach the TC prog to the tunnel device, things
break. (VLAN being the exception since we handle it specially in
skb_reorder_vlan_header).

Here's a simple example using veth + GRE tunnel:

# ip netns add tx
# ip link add rx type veth peer name tx netns tx
# ip link set dev rx up
# ip -n tx link set dev tx up
# ip addr add 10.0.0.1/24 dev rx
# ip -n tx addr add 10.0.0.2/24 dev tx
# ip tun add decap mode gre local 10.0.0.1 remote 10.0.0.2
# ip -n tx tun add encap mode gre local 10.0.0.2 remote 10.0.0.1
# ip link set dev decap up
# ip -n tx link set dev encap up
# ip addr add 192.0.2.1/24 dev decap
# ip -n tx addr add 192.0.2.2/24 dev encap
# cat progs.bpf.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int set_tag(struct xdp_md *ctx)
{
        __u8 *data;
        __u32 *tag;

        if (bpf_xdp_adjust_meta(ctx, -sizeof(*tag)))
                return XDP_DROP;

        data = (typeof(data))(unsigned long)ctx->data;
        tag = (typeof(tag))(unsigned long)ctx->data_meta;

        if ((void *)(tag + 1) > (void *)data)
                return XDP_DROP;

        *tag = 42;
        return XDP_PASS;
}

SEC("tcx/ingress")
int get_tag(struct __sk_buff *ctx)
{
        __u8 *data = (typeof(data))(unsigned long)ctx->data;
        __u32 *tag = (typeof(tag))(unsigned long)ctx->data_meta;

        if ((void *)(tag + 1) > (void *)data)
                return TCX_DROP;

        bpf_printk("%u\n", *tag);
        return TCX_PASS;
}

const char _license[] SEC("license") = "GPL";
# bpftool prog loadall progs.bpf.o /sys/fs/bpf
# bpftool net attach xdp pinned /sys/fs/bpf/set_tag dev rx
# bpftool net attach tcx_ingress pinned /sys/fs/bpf/get_tag dev rx
# ip netns exec tx ping -c1 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.074 ms

--- 192.0.2.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.074/0.074/0.074/0.000 ms
# bpftool prog tracelog
     ksoftirqd/3-35      [003] ..s1.   592.051827: bpf_trace_printk: 42

            ping-251     [003] ..s2.   614.979169: bpf_trace_printk: 42

          <idle>-0       [003] ..s2.   620.212369: bpf_trace_printk: 42

^C# bpftool net attach tcx_ingress pinned /sys/fs/bpf/get_tag dev decap
# ip netns exec tx ping -c1 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.043 ms

--- 192.0.2.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.043/0.043/0.043/0.000 ms
# bpftool prog tracelog
            ping-263     [003] ..s2.   681.823981: bpf_trace_printk: 524288

^C#

When reading metadata from TC attached to the GRE device, we get garbage
(524288 = 0x0008_0000, which is actually the GRE header).

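The garbage value is easy to reproduce outside the kernel: the 4-byte GRE base header for an IPv4 payload is flags 0x0000 followed by protocol 0x0800 (ETH_P_IP), both in network byte order, and reading those bytes as a host-endian u32 on a little-endian machine gives exactly 524288. A quick sketch (assumes a little-endian host):

```c
#include <stdint.h>
#include <string.h>

/* What the TC prog's "*tag" load sees after GRE decap: the first
 * 4 bytes at the stale metadata offset are the GRE base header,
 * flags 0x0000 followed by protocol ETH_P_IP (0x0800).
 */
static uint32_t read_stale_tag(void)
{
	const unsigned char gre_hdr[4] = { 0x00, 0x00, 0x08, 0x00 };
	uint32_t tag;

	memcpy(&tag, gre_hdr, sizeof(tag));
	return tag;	/* 0x00080000 == 524288 on little-endian */
}
```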
This happens because after GRE decapsulation, skb_metadata_end points to
the inner MAC header location:
 
  [headroom][metadata][outer MAC][outer IP][GRE][inner MAC][...]
                                                ^
                                                skb_metadata_end =
                                                head + mac_header
 
So we effectively already have a [headroom][metadata][gap][packet]
layout today, with the access being broken.

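The realignment this series injects for data_meta users boils down to a memmove that closes that gap before the program runs. A user-space sketch of the same arithmetic (hypothetical buffer and offsets, mirroring the offset-from-head bookkeeping of skb_shared_info::meta_end):

```c
#include <assert.h>
#include <string.h>

/* Sketch of the realignment: metadata of meta_len bytes ends at
 * head + *meta_end; slide it right by the gap so it ends exactly
 * where the packet (pkt_off) begins, then update the end offset.
 */
static void realign_meta(unsigned char *head, unsigned int *meta_end,
			 unsigned int meta_len, unsigned int pkt_off)
{
	unsigned int gap = pkt_off - *meta_end;

	if (!meta_len || !gap)
		return;
	memmove(head + *meta_end - meta_len + gap,
		head + *meta_end - meta_len, meta_len);
	*meta_end += gap;
}
```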

Regarding skb_push() and encapsulation - you're absolutely right that
this needs to be addressed. I mention it briefly in the cover letter,
but I should have been clearer about the plan:

This series focuses on fixing the decap/skb_pull path first. The
encap/skb_push side is on the roadmap [1], and patch 10 updates
skb_postpush_data_move() helper to be plugged after skb_push() in
preparation for handling metadata movement on push.

I'm just tackling stuff one by one, though I'm happy to prepare an RFC
that covers both sides, if you'd prefer to see the complete picture
upfront before making the call. Let me know what works best for you.

If it is the overall approach that feels wrong, I'm definitely open to
discussing alternatives.

Thanks,
-jkbs

[1] Tentative roadmap presented at LPC:

* [x] Add bpf_dynptr_from_skb_meta()
* [x] Make TC bpf_skb_*() helpers preserve metadata (bug fix)
* [-] Make L2 decap preserve metadata (bug fix)  [work in progress]
* [ ] Allow-list bpf_dynptr_from_skb_meta() for other BPF prog types
* [ ] uAPI read access to metadata 
-=-=- Milestone: Metadata on RX path -=-=-
* [ ] [FWD path] Make L2 encap preserve metadata (bug fix)
* [ ] [TX path] uAPI write access to metadata
* [ ] [TX path] Alloc metadata from other BPF prog types

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next v3 16/17] bpf: Realign skb metadata for TC progs using data_meta
  2026-01-07 22:01   ` Alexei Starovoitov
@ 2026-01-08 19:54     ` Jakub Sitnicki
  0 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-08 19:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Network Development, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Simon Horman, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, KP Singh, Hao Luo, Jiri Olsa,
	kernel-team

On Wed, Jan 07, 2026 at 02:01 PM -08, Alexei Starovoitov wrote:
> On Wed, Jan 7, 2026 at 6:28 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>>
>> +static void bpf_skb_meta_realign(struct sk_buff *skb)
>> +{
>> +       u8 *meta_end = skb_metadata_end(skb);
>> +       u8 meta_len = skb_metadata_len(skb);
>> +       u8 *meta;
>> +       int gap;
>> +
>> +       gap = skb_mac_header(skb) - meta_end;
>> +       if (!meta_len || !gap)
>> +               return;
>> +
>> +       if (WARN_ONCE(gap < 0, "skb metadata end past mac header")) {
>> +               skb_metadata_clear(skb);
>> +               return;
>> +       }
>> +
>> +       meta = meta_end - meta_len;
>> +       memmove(meta + gap, meta, meta_len);
>> +       skb_shinfo(skb)->meta_end += gap;
>> +
>> +       bpf_compute_data_pointers(skb);
>> +}
>> +
>>  static int tc_cls_act_prologue(struct bpf_insn *insn_buf, u32 pkt_access_flags,
>>                                const struct bpf_prog *prog)
>>  {
>> -       return bpf_unclone_prologue(insn_buf, pkt_access_flags, prog,
>> -                                   TC_ACT_SHOT);
>> +       struct bpf_insn *insn = insn_buf;
>> +       int cnt;
>> +
>> +       if (pkt_access_flags & PA_F_DATA_META_LOAD) {
>> +               /* Realign skb metadata for access through data_meta pointer.
>> +                *
>> +                * r6 = r1; // r6 will be "u64 *ctx"
>> +                * r0 = bpf_skb_meta_realign(r1); // r0 is undefined
>> +                * r1 = r6;
>> +                */
>> +               BUILD_BUG_ON(!__same_type(&bpf_skb_meta_realign,
>> +                                         (void (*)(struct sk_buff *))NULL));
>> +               *insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
>> +               *insn++ = BPF_EMIT_CALL(bpf_skb_meta_realign);
>
> Not quite. drop this BUILD_BUG_ON(), since it's pointless and misleading.

Will do.

> bpf_skb_meta_realign() has to be the one done with BPF_CALL_1(...).
> Otherwise above will work only on x86.
> In this case on arm64 too, but it's by accident.
> BPF_CALL* has to do ABI conversion from BPF to native.
>
> For kfuncs that's what JITs do via btf_func_model machinery.

Oh, 32-bit ARM has different register mapping. TIL. Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset
  2026-01-08 19:25   ` Jakub Sitnicki
@ 2026-01-09  1:49     ` Jakub Kicinski
  2026-01-09 10:50       ` Jakub Sitnicki
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2026-01-09  1:49 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Simon Horman, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	KP Singh, Hao Luo, Jiri Olsa, kernel-team

On Thu, 08 Jan 2026 20:25:30 +0100 Jakub Sitnicki wrote:
> Man, that makes me think I'm a terrible speaker.
> Or you just missed my talk :-)

You're a great speaker, but not necessarily listener ;)
IIRC there was no time for discussion

> </joke>

> If it is the overall approach that feels wrong, I'm definitely open to
> discussing alternatives.

To reduce the one-off feeling of the mechanism it'd be great to shove
this state into an skb extension for example. Then if we optimize it
and possibly make it live inline in the frame all the other skb
extensions will benefit too.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset
  2026-01-09  1:49     ` Jakub Kicinski
@ 2026-01-09 10:50       ` Jakub Sitnicki
  0 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2026-01-09 10:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Simon Horman, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	KP Singh, Hao Luo, Jiri Olsa, kernel-team

On Thu, Jan 08, 2026 at 05:49 PM -08, Jakub Kicinski wrote:
> To reduce the one-off feeling of the mechanism it'd be great to shove
> this state into an skb extension for example. Then if we optimize it
> and possibly make it live inline in the frame all the other skb
> extensions will benefit too.

Back to the drawing board then.

Here's how I think we can marry it with skb extension:

1. Move metadata from headroom to skb_ext chunk on skb_metadata_set().

2. If TC BPF prog uses data_meta pseudo-pointer, copy metadata contents
   in and out of headroom in BPF prologue and epilogue.

3. If TC BPF prog uses bpf_dynptr_from_skb_meta(), access the skb_ext
   chunk directly.

If that sounds sane, then I'll get cracking on an RFC.

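For (2), a crude user-space sketch of the copy-in/copy-out idea (hypothetical types and names; the real thing would run in the TC prologue/epilogue with the skb extension chunk as backing store):

```c
#include <assert.h>
#include <string.h>

struct meta_ext {		/* stand-in for an skb extension chunk */
	unsigned char buf[16];
	unsigned int len;
};

/* Prologue: materialize metadata in the headroom right before data,
 * where the data_meta pseudo-pointer expects to find it. */
static void meta_copy_in(unsigned char *data, const struct meta_ext *ext)
{
	memcpy(data - ext->len, ext->buf, ext->len);
}

/* Epilogue: write any in-place edits back to the out-of-line chunk. */
static void meta_copy_out(unsigned char *data, struct meta_ext *ext)
{
	memcpy(ext->buf, data - ext->len, ext->len);
}
```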
We will need the driver tweaks from this series for (1) to work, so I'm
thinking to split that out and resubmit.

I would also split out the BPF verifier prologue/epilogue processing
tweaks for (2) to work without kfuncs.

Let me know what you think.

Thanks,
-jkbs

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2026-01-09 10:50 UTC | newest]

Thread overview: 24+ messages
2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 01/17] bnxt_en: Call skb_metadata_set when skb->data points at metadata end Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 02/17] i40e: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 03/17] igb: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 04/17] igc: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 05/17] ixgbe: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 06/17] net/mlx5e: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 07/17] veth: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 08/17] xsk: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 09/17] xdp: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 10/17] net: Track skb metadata end separately from MAC offset Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 11/17] bpf, verifier: Remove side effects from may_access_direct_pkt_data Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 12/17] bpf, verifier: Turn seen_direct_write flag into a bitmap Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 13/17] bpf, verifier: Propagate packet access flags to gen_prologue Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 14/17] bpf, verifier: Track when data_meta pointer is loaded Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 15/17] bpf, verifier: Support direct kernel calls in gen_prologue Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 16/17] bpf: Realign skb metadata for TC progs using data_meta Jakub Sitnicki
2026-01-07 22:01   ` Alexei Starovoitov
2026-01-08 19:54     ` Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 17/17] selftests/bpf: Test skb metadata access after L2 decapsulation Jakub Sitnicki
2026-01-08 15:47 ` [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Kicinski
2026-01-08 19:25   ` Jakub Sitnicki
2026-01-09  1:49     ` Jakub Kicinski
2026-01-09 10:50       ` Jakub Sitnicki
