From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Avinash Duduskar <avinash.duduskar@gmail.com>,
ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: eddyz87@gmail.com, memxor@gmail.com, martin.lau@linux.dev,
song@kernel.org, yonghong.song@linux.dev, jolsa@kernel.org,
emil@etsalapatis.com, john.fastabend@gmail.com, sdf@fomichev.me,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, horms@kernel.org, shuah@kernel.org,
hawk@kernel.org, yatsenko@meta.com, leon.hwang@linux.dev,
kpsingh@kernel.org, a.s.protopopov@gmail.com,
ameryhung@gmail.com, rongtao@cestc.cn, eyal.birger@gmail.com,
bpf@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
dsahern@kernel.org
Subject: Re: [PATCH bpf-next v4 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
Date: Tue, 23 Jun 2026 13:58:05 +0200
Message-ID: <877bnpeaeq.fsf@toke.dk>
In-Reply-To: <20260623025147.1001664-2-avinash.duduskar@gmail.com>

Avinash Duduskar <avinash.duduskar@gmail.com> writes:

> bpf_fib_lookup() returns the FIB-resolved egress ifindex straight
> from the fib result. When the egress is a VLAN device, the returned
> ifindex is the VLAN netdev's, which has no XDP xmit handler; XDP
> programs that want to forward the frame (e.g. xdp-forward) must
> instead target the underlying physical device and push the VLAN tag
> themselves. Today the program has no way to learn either the
> underlying ifindex or the VLAN tag without maintaining its own
> VLAN-to-ifindex map in userspace and refreshing it on netlink
> events.
>
> Add BPF_FIB_LOOKUP_VLAN. When the caller sets this flag and the fib
> result is a VLAN device whose immediate parent is a real (non-VLAN)
> device in the same network namespace, populate the existing output
> fields params->h_vlan_proto and params->h_vlan_TCI from the VLAN
> device and replace params->ifindex with the parent's ifindex.
> params->h_vlan_TCI carries the VID only, with PCP and DEI bits zero; a
> consumer wanting to set egress priority writes PCP itself.
> params->smac is the VLAN device's own address, which can differ from
> the parent's.
>
> Only the immediate parent is resolved, via vlan_dev_priv(dev)->real_dev
> and not vlan_dev_real_dev(), which walks to the bottom of a stack. When
> the immediate parent is not a real device in the same namespace, the
> lookup returns BPF_FIB_LKUP_RET_VLAN_FAILURE and leaves params->ifindex
> at the input. This covers a stacked VLAN (QinQ), where the immediate
> parent is itself a VLAN device and one h_vlan_proto/h_vlan_TCI pair
> cannot describe two tags, and a parent in another network namespace (a
> VLAN device can be moved while its parent stays), whose ifindex would
> be meaningless in the caller's namespace. A program that wants the VLAN
> device's own ifindex re-issues the lookup without BPF_FIB_LOOKUP_VLAN,
> so the unreducible case stays distinct from a physical egress. That
> distinction matters for XDP: a program cannot xmit on a VLAN device, so
> a success carrying the VLAN ifindex would make it redirect to a device
> with no ndo_xdp_xmit and drop the frame at xdp_do_flush(). The swap and
> the vlan fields are written only on the reduce path; other output
> fields keep their existing behaviour, so a frag-needed result still
> reports the route mtu in params->mtu_result.
>
> On the skb path without tot_len the deferred mtu check is done against
> the resolved egress device. To keep that the VLAN device rather than
> the parent after the swap, bpf_ipv4_fib_lookup()/bpf_ipv6_fib_lookup()
> hand the FIB-result device back to the caller; the XDP path always
> runs the route-mtu check and passes NULL. When the flag is not set,
> behaviour is unchanged: h_vlan_proto and h_vlan_TCI are zeroed and
> ifindex is left at the FIB result.
>
> The new block is compiled only under CONFIG_VLAN_8021Q since
> vlan_dev_priv() is not defined otherwise; without that config
> is_vlan_dev() is constant false and the flag is accepted but never
> acts. That is safe because no VLAN device can exist there, so every
> egress is already physical.
>
> This lets an XDP redirect target the physical device and learn the
> tag to push in a single lookup, which xdp-forward's optional VLAN
> mode (xdp-project/xdp-tools#504) wants from the kernel side.
>
> The helper's input semantics are unchanged; the reverse direction
> (supplying a tag as lookup input) is added in the following patch.
>
> Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
> ---
> include/uapi/linux/bpf.h | 28 +++++++++++++-
> net/core/filter.c | 69 ++++++++++++++++++++++++----------
> tools/include/uapi/linux/bpf.h | 28 +++++++++++++-
> 3 files changed, 104 insertions(+), 21 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 89b36de5fdbb..8d0058d88eb2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3532,6 +3532,26 @@ union bpf_attr {
> * Use the mark present in *params*->mark for the fib lookup.
> * This option should not be used with BPF_FIB_LOOKUP_DIRECT,
> * as it only has meaning for full lookups.
> + * **BPF_FIB_LOOKUP_VLAN**
> + * If the fib lookup resolves to a VLAN device whose
> + * parent is a real (non-VLAN) device, set
> + * *params*->h_vlan_proto and *params*->h_vlan_TCI from
> + * the VLAN device and replace *params*->ifindex with the
> + * parent's ifindex. *params*->h_vlan_TCI carries the VID
> + * only, with PCP and DEI bits zero; a consumer wanting to
> + * set egress priority writes PCP itself. *params*->smac is
> + * the VLAN device's own address, which can differ from the
> + * parent's. Only the immediate parent is resolved; if it
> + * is itself a VLAN device (QinQ) or in another namespace,
> + * the egress cannot be reduced to a physical device plus
> + * one tag and the lookup returns
> + * **BPF_FIB_LKUP_RET_VLAN_FAILURE** with *params*->ifindex
> + * left at the input. Re-issue without
> + * **BPF_FIB_LOOKUP_VLAN** to obtain the VLAN device's own
> + * ifindex. The swap and the vlan fields
> + * are written only on success; other output fields keep
> + * the helper's existing behaviour, so a frag-needed result
> + * still reports the route mtu in *params*->mtu_result.
> *
> * *ctx* is either **struct xdp_md** for XDP programs or
> * **struct sk_buff** tc cls_act programs.
> @@ -7327,6 +7347,7 @@ enum {
> BPF_FIB_LOOKUP_TBID = (1U << 3),
> BPF_FIB_LOOKUP_SRC = (1U << 4),
> BPF_FIB_LOOKUP_MARK = (1U << 5),
> + BPF_FIB_LOOKUP_VLAN = (1U << 6),
> };
>
> enum {
> @@ -7340,6 +7361,7 @@ enum {
> BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */
> BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */
> BPF_FIB_LKUP_RET_NO_SRC_ADDR, /* failed to derive IP src addr */
> + BPF_FIB_LKUP_RET_VLAN_FAILURE, /* VLAN egress, parent unresolvable */
> };
>
> struct bpf_fib_lookup {
> @@ -7393,7 +7415,11 @@ struct bpf_fib_lookup {
>
> union {
> struct {
> - /* output */
> + /*
> + * output with BPF_FIB_LOOKUP_VLAN: set from the
> + * resolved egress VLAN device (see the flag); zeroed
> + * on other successful lookups.
> + */
> __be16 h_vlan_proto;
> __be16 h_vlan_TCI;
> };
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 2e96b4b847ce..8345295d84de 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6201,10 +6201,28 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
> #endif
>
> #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
> -static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
> +static int bpf_fib_set_fwd_params(struct net_device *dev,
> + struct bpf_fib_lookup *params,
> + u32 flags, u32 mtu)
> {
> params->h_vlan_TCI = 0;
> params->h_vlan_proto = 0;
> +
> +#if IS_ENABLED(CONFIG_VLAN_8021Q)
> + if ((flags & BPF_FIB_LOOKUP_VLAN) && is_vlan_dev(dev)) {

If you move the ifdef into the if statement, the if statement can have
an else-branch that assigns params->ifindex, so you don't need the
restore dance (see below).

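As a rough userspace sketch of that shape (one possible reading of the
suggestion, not kernel code): every mock_* and MOCK_* name below is a
hypothetical stand-in for the kernel types, flag, return codes and
CONFIG_VLAN_8021Q, and byte-order conversion is left out. The point is
only that the helper becomes the sole writer of params->ifindex, and
that it returns on the failure path before writing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for IS_ENABLED(CONFIG_VLAN_8021Q). */
#define MOCK_VLAN_8021Q 1

/* Hypothetical stand-ins for the uapi flag and return codes. */
#define MOCK_FIB_LOOKUP_VLAN (1U << 6)
enum { MOCK_RET_SUCCESS = 0, MOCK_RET_VLAN_FAILURE };

/* Mock net_device: real_dev is non-NULL only for a VLAN device. */
struct mock_dev {
	uint32_t ifindex;
	int netns;
	const struct mock_dev *real_dev;
	uint16_t vlan_proto;
	uint16_t vlan_id;
};

/* Mock of the bpf_fib_lookup fields touched here. */
struct mock_params {
	uint32_t ifindex;
	uint16_t h_vlan_proto;
	uint16_t h_vlan_TCI;
	uint32_t mtu_result;
};

static bool mock_is_vlan_dev(const struct mock_dev *dev)
{
#if MOCK_VLAN_8021Q
	return dev->real_dev != NULL;
#else
	(void)dev;
	return false;	/* constant false without VLAN support */
#endif
}

static int mock_set_fwd_params(const struct mock_dev *dev,
			       struct mock_params *params,
			       uint32_t flags, uint32_t mtu)
{
	assert(dev && params);

	params->h_vlan_TCI = 0;
	params->h_vlan_proto = 0;

	if ((flags & MOCK_FIB_LOOKUP_VLAN) && mock_is_vlan_dev(dev)) {
#if MOCK_VLAN_8021Q
		const struct mock_dev *real_dev = dev->real_dev;

		/* QinQ or cross-namespace parent: bail out before
		 * params->ifindex is touched, so it stays at the input.
		 */
		if (mock_is_vlan_dev(real_dev) ||
		    real_dev->netns != dev->netns)
			return MOCK_RET_VLAN_FAILURE;

		/* Byte-order conversion (htons) is left out of the mock. */
		params->h_vlan_proto = dev->vlan_proto;
		params->h_vlan_TCI = dev->vlan_id;
		params->ifindex = real_dev->ifindex;
#endif
	} else {
		/* No reduction requested or needed: plain FIB result. */
		params->ifindex = dev->ifindex;
	}

	if (mtu)
		params->mtu_result = mtu;

	return MOCK_RET_SUCCESS;
}
```

With MOCK_VLAN_8021Q set to 0 the first branch is never taken, which
mirrors is_vlan_dev() being constant false in the real code.
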
> + struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
> +
> + if (!is_vlan_dev(real_dev) &&
> + net_eq(dev_net(real_dev), dev_net(dev))) {
> + params->h_vlan_proto = vlan_dev_vlan_proto(dev);
> + params->h_vlan_TCI = htons(vlan_dev_vlan_id(dev));
> + params->ifindex = real_dev->ifindex;
> + } else {
> + return BPF_FIB_LKUP_RET_VLAN_FAILURE;
> + }
> + }
> +#endif
> +
> if (mtu)
> params->mtu_result = mtu; /* union with tot_len */
>
> @@ -6214,8 +6232,10 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
>
> #if IS_ENABLED(CONFIG_INET)
> static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> - u32 flags, bool check_mtu)
> + u32 flags, bool check_mtu,
> + struct net_device **fwd_dev)
> {
> + u32 in_ifindex = params->ifindex;
> struct neighbour *neigh = NULL;
> struct fib_nh_common *nhc;
> struct in_device *in_dev;
> @@ -6347,16 +6367,23 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> memcpy(params->smac, dev->dev_addr, ETH_ALEN);
>
> set_fwd_params:
> - return bpf_fib_set_fwd_params(params, mtu);
> + if (fwd_dev)
> + *fwd_dev = dev;
> + err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
> + if (err == BPF_FIB_LKUP_RET_VLAN_FAILURE)
> + params->ifindex = in_ifindex;
> + return err;

I think it's better to just move the assignment of params->ifindex
entirely into bpf_fib_set_fwd_params(), instead of this restore dance.
That way this can be simplified to:

	err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
	if (!err && fwd_dev)
		*fwd_dev = dev;
	return err;

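To illustrate why no saved in_ifindex is needed with that split, here
is a small userspace mock of the caller side. All mock_* and MOCK_*
names are hypothetical, and the "reducible" field is an assumption that
collapses the real parent checks into a single bit; it is a sketch of
the control flow only, not of the kernel implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the BPF_FIB_LKUP_RET_* codes. */
enum { MOCK_RET_SUCCESS = 0, MOCK_RET_VLAN_FAILURE };

/* Mock device: "reducible" stands in for the real parent checks. */
struct mock_dev {
	uint32_t ifindex;
	int reducible;
};

struct mock_params {
	uint32_t ifindex;	/* input ifindex on entry */
};

/* Stand-in helper: the only place that writes params->ifindex,
 * and only once the lookup is known to succeed.
 */
static int mock_set_fwd_params(const struct mock_dev *dev,
			       struct mock_params *params)
{
	if (!dev->reducible)
		return MOCK_RET_VLAN_FAILURE;

	params->ifindex = dev->ifindex;
	return MOCK_RET_SUCCESS;
}

/* Caller side: nothing to save or restore, and fwd_dev is only
 * handed back when the helper succeeded.
 */
static int mock_fib_lookup(const struct mock_dev *dev,
			   struct mock_params *params,
			   const struct mock_dev **fwd_dev)
{
	int err;

	err = mock_set_fwd_params(dev, params);
	if (!err && fwd_dev)
		*fwd_dev = dev;
	return err;
}
```

On the failure path both params->ifindex and *fwd_dev keep whatever the
caller passed in, because neither is written before the helper returns.
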
> }
> #endif
>
> #if IS_ENABLED(CONFIG_IPV6)
> static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> - u32 flags, bool check_mtu)
> + u32 flags, bool check_mtu,
> + struct net_device **fwd_dev)
> {
> struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
> struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
> + u32 in_ifindex = params->ifindex;
> struct fib6_result res = {};
> struct neighbour *neigh;
> struct net_device *dev;
> @@ -6486,13 +6513,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> memcpy(params->smac, dev->dev_addr, ETH_ALEN);
>
> set_fwd_params:
> - return bpf_fib_set_fwd_params(params, mtu);
> + if (fwd_dev)
> + *fwd_dev = dev;
> + err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
> + if (err == BPF_FIB_LKUP_RET_VLAN_FAILURE)
> + params->ifindex = in_ifindex;
> + return err;

Same as above.

-Toke