public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: Jakub Kicinski <kuba@kernel.org>
Cc: bpf@vger.kernel.org,  netdev@vger.kernel.org,
	 "David S. Miller" <davem@davemloft.net>,
	 Eric Dumazet <edumazet@google.com>,
	 Paolo Abeni <pabeni@redhat.com>,
	 Alexei Starovoitov <ast@kernel.org>,
	 Daniel Borkmann <daniel@iogearbox.net>,
	 Jesper Dangaard Brouer <hawk@kernel.org>,
	 John Fastabend <john.fastabend@gmail.com>,
	 Stanislav Fomichev <sdf@fomichev.me>,
	 Simon Horman <horms@kernel.org>,
	 Andrii Nakryiko <andrii@kernel.org>,
	 Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard Zingerman <eddyz87@gmail.com>,  Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	 KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
	 Jiri Olsa <jolsa@kernel.org>,
	kernel-team@cloudflare.com
Subject: Re: [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset
Date: Thu, 08 Jan 2026 20:25:30 +0100	[thread overview]
Message-ID: <87ecnzj49h.fsf@cloudflare.com> (raw)
In-Reply-To: <20260108074741.00bd532f@kernel.org> (Jakub Kicinski's message of "Thu, 8 Jan 2026 07:47:41 -0800")

On Thu, Jan 08, 2026 at 07:47 AM -08, Jakub Kicinski wrote:
> On Wed, 07 Jan 2026 15:28:00 +0100 Jakub Sitnicki wrote:
>> This series continues the effort to provide reliable access to xdp/skb
>> metadata from BPF context on the receive path. We have recently talked
>> about it at Plumbers [1].
>> 
>> Currently skb metadata location is tied to the MAC header offset:
>> 
>>   [headroom][metadata][MAC hdr][L3 pkt]
>>                       ^
>>                       skb_metadata_end = head + mac_header
>> 
>> This design breaks on L2 decapsulation (VLAN, GRE, etc.) when the MAC
>> offset is reset. The naive fix is to memmove metadata on every decap path,
>> but we can avoid this cost by tracking metadata position independently.
>> 
>> Introduce a dedicated meta_end field in skb_shared_info that records where
>> metadata ends relative to skb->head:
>> 
>>   [headroom][metadata][gap][MAC hdr][L3 pkt]
>>                      ^
>>                      skb_metadata_end = head + meta_end
>>                      
>> This allows BPF dynptr access (bpf_dynptr_from_skb_meta()) to work without
>> memmove. For skb->data_meta pointer access, which expects metadata
>> immediately before skb->data, make the verifier inject realignment code in
>> TC BPF prologue.
>
> I don't understand what semantics for the buffer layout you're trying
> to establish, we now have "headroom" and "gap"?
>
> 	[headroom][metadata][gap][packet]
>
> You're not solving the encap side either, skb_push() will still happily
> encroach on the metadata. Feel like duct tape, we can't fundamentally
> update the layout of the skb without updating all the helpers.
> metadata works perfectly fine for its intended use case - passing info
> about the frame from XDP offload to XDP and then to TC.

<joke>
Man, that makes me think I'm a terrible speaker.
Or you just missed my talk :-)
</joke>

You're right that metadata passing between XDP and TC works well in the
simple case when both BPF progs are attached to the same device.

However, the metadata doesn't get cleared when skb undergoes L2
decapsulation. So if you attach the TC prog to the tunnel device, things
break. (VLAN being the exception since we handle it specially in
skb_reorder_vlan_header).

Here's a simple example using veth + GRE tunnel:

# ip netns add tx
# ip link add rx type veth peer name tx netns tx
# ip link set dev rx up
# ip -n tx link set dev tx up
# ip addr add 10.0.0.1/24 dev rx
# ip -n tx addr add 10.0.0.2/24 dev tx
# ip tun add decap mode gre local 10.0.0.1 remote 10.0.0.2
# ip -n tx tun add encap mode gre local 10.0.0.2 remote 10.0.0.1
# ip link set dev decap up
# ip -n tx link set dev encap up
# ip addr add 192.0.2.1/24 dev decap
# ip -n tx addr add 192.0.2.2/24 dev encap
# cat progs.bpf.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int set_tag(struct xdp_md *ctx)
{
        __u8 *data;
        __u32 *tag;

        if (bpf_xdp_adjust_meta(ctx, -sizeof(*tag)))
                return XDP_DROP;

        data = (typeof(data))(unsigned long)ctx->data;
        tag = (typeof(tag))(unsigned long)ctx->data_meta;

        if (tag + 1 > data)
                return XDP_DROP;

        *tag = 42;
        return XDP_PASS;
}

SEC("tcx/ingress")
int get_tag(struct __sk_buff *ctx)
{
        __u8 *data = (typeof(data))(unsigned long)ctx->data;
        __u32 *tag = (typeof(tag))(unsigned long)ctx->data_meta;

        if (tag + 1 > data)
                return TCX_DROP;

        bpf_printk("%u\n", *tag);
        return TCX_PASS;
}

const char _license[] SEC("license") = "GPL";
# bpftool prog loadall progs.bpf.o /sys/fs/bpf
# bpftool net attach xdp pinned /sys/fs/bpf/set_tag dev rx
# bpftool net attach tcx_ingress pinned /sys/fs/bpf/get_tag dev rx
# ip netns exec tx ping -c1 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.074 ms

--- 192.0.2.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.074/0.074/0.074/0.000 ms
# bpftool prog tracelog
     ksoftirqd/3-35      [003] ..s1.   592.051827: bpf_trace_printk: 42

            ping-251     [003] ..s2.   614.979169: bpf_trace_printk: 42

          <idle>-0       [003] ..s2.   620.212369: bpf_trace_printk: 42

^C# bpftool net attach tcx_ingress pinned /sys/fs/bpf/get_tag dev decap
# ip netns exec tx ping -c1 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.043 ms

--- 192.0.2.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.043/0.043/0.043/0.000 ms
# bpftool prog tracelog
            ping-263     [003] ..s2.   681.823981: bpf_trace_printk: 524288

^C#

When reading metadata from TC attached to the GRE device, we get garbage
(524288 = 0x0008_0000, which is actually the GRE header).

This happens because after GRE decapsulation, skb_metadata_end points to
the inner MAC header location:
 
  [headroom][metadata][outer MAC][outer IP][GRE][inner MAC][...]
                                                ^
                                                skb_metadata_end =
                                                head + mac_header
 
So we effectively already have a [headroom][metadata][gap][packet]
layout today, with the access being broken.


Regarding skb_push() and encapsulation - you're absolutely right that
this needs to be addressed. I mention it briefly in the cover letter,
but I should have been clearer about the plan:

This series focuses on fixing the decap/skb_pull path first. The
encap/skb_push side is on the roadmap [1], and patch 10 updates
skb_postpush_data_move() helper to be plugged after skb_push() in
preparation for handling metadata movement on push.

I'm just tackling stuff one by one, though I'm happy to prepare an RFC
that covers both sides, if you'd prefer to see the complete picture
upfront before making the call. Let me know what works best for you.

If it is the overall approach that feels wrong, I'm definitely open to
discussing alternatives.

Thanks,
-jkbs

[1] Tenative roadmap presented at LPC:

* [x] Add bpf_dynptr_from_skb_meta()
* [x] Make TC bpf_skb_*() helpers preserve metadata (bug fix)
* [-] Make L2 decap preserve metadata (bug fix)  [work in progress]
* [ ] Allow-list bpf_dynptr_from_skb_meta() for other BPF prog types
* [ ] uAPI read access to metadata 
-=-=- Milestone: Metadata on RX path -=-=-
* [ ] [FWD path] Make L2 encap preserve metadata (bug fix)
* [ ] [TX path] uAPI write access to metadata
* [ ] [TX path] Alloc metadata from other BPF prog types

  reply	other threads:[~2026-01-08 19:25 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-07 14:28 [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 01/17] bnxt_en: Call skb_metadata_set when skb->data points at metadata end Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 02/17] i40e: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 03/17] igb: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 04/17] igc: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 05/17] ixgbe: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 06/17] net/mlx5e: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 07/17] veth: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 08/17] xsk: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 09/17] xdp: " Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 10/17] net: Track skb metadata end separately from MAC offset Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 11/17] bpf, verifier: Remove side effects from may_access_direct_pkt_data Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 12/17] bpf, verifier: Turn seen_direct_write flag into a bitmap Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 13/17] bpf, verifier: Propagate packet access flags to gen_prologue Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 14/17] bpf, verifier: Track when data_meta pointer is loaded Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 15/17] bpf, verifier: Support direct kernel calls in gen_prologue Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 16/17] bpf: Realign skb metadata for TC progs using data_meta Jakub Sitnicki
2026-01-07 22:01   ` Alexei Starovoitov
2026-01-08 19:54     ` Jakub Sitnicki
2026-01-07 14:28 ` [PATCH bpf-next v3 17/17] selftests/bpf: Test skb metadata access after L2 decapsulation Jakub Sitnicki
2026-01-08 15:47 ` [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset Jakub Kicinski
2026-01-08 19:25   ` Jakub Sitnicki [this message]
2026-01-09  1:49     ` Jakub Kicinski
2026-01-09 10:50       ` Jakub Sitnicki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ecnzj49h.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=eddyz87@gmail.com \
    --cc=edumazet@google.com \
    --cc=haoluo@google.com \
    --cc=hawk@kernel.org \
    --cc=horms@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kernel-team@cloudflare.com \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    --cc=song@kernel.org \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox