From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Avinash Duduskar <avinash.duduskar@gmail.com>,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: eddyz87@gmail.com, memxor@gmail.com, martin.lau@linux.dev,
	song@kernel.org, yonghong.song@linux.dev, jolsa@kernel.org,
	emil@etsalapatis.com, john.fastabend@gmail.com, sdf@fomichev.me,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, horms@kernel.org, shuah@kernel.org,
	hawk@kernel.org, yatsenko@meta.com, leon.hwang@linux.dev,
	kpsingh@kernel.org, a.s.protopopov@gmail.com,
	ameryhung@gmail.com, rongtao@cestc.cn, eyal.birger@gmail.com,
	bpf@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	dsahern@kernel.org
Subject: Re: [PATCH bpf-next v4 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
Date: Tue, 23 Jun 2026 13:58:05 +0200
Message-ID: <877bnpeaeq.fsf@toke.dk>
In-Reply-To: <20260623025147.1001664-2-avinash.duduskar@gmail.com>

Avinash Duduskar <avinash.duduskar@gmail.com> writes:

> bpf_fib_lookup() returns the FIB-resolved egress ifindex straight
> from the fib result. When the egress is a VLAN device, the returned
> ifindex is the VLAN netdev's, which has no XDP xmit handler; XDP
> programs that want to forward the frame (e.g. xdp-forward) must
> instead target the underlying physical device and push the VLAN tag
> themselves. Today the program has no way to learn either the
> underlying ifindex or the VLAN tag without maintaining its own
> VLAN-to-ifindex map in userspace and refreshing it on netlink
> events.
>
> Add BPF_FIB_LOOKUP_VLAN. When the caller sets this flag and the fib
> result is a VLAN device whose immediate parent is a real (non-VLAN)
> device in the same network namespace, populate the existing output
> fields params->h_vlan_proto and params->h_vlan_TCI from the VLAN
> device and replace params->ifindex with the parent's ifindex.
> params->h_vlan_TCI carries the VID only, with PCP and DEI bits zero; a
> consumer wanting to set egress priority writes PCP itself.
> params->smac is the VLAN device's own address, which can differ from
> the parent's.
>
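
Since h_vlan_TCI comes back with only the VID set, a consumer that wants
an egress priority has to merge the PCP bits in itself. A minimal
userspace sketch of that step follows. The function name tci_with_pcp()
is hypothetical, and the two constants are defined locally with the
values of the kernel's VLAN_PRIO_SHIFT and VLAN_VID_MASK; a BPF program
would use bpf_htons()/bpf_ntohs() rather than the libc versions.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* 802.1Q TCI layout: PCP (3 bits) | DEI (1 bit) | VID (12 bits). */
#define VLAN_PRIO_SHIFT 13
#define VLAN_VID_MASK   0x0fff

/*
 * Hypothetical helper: take a VID-only TCI in network byte order (as
 * the commit message describes for params->h_vlan_TCI) and return a
 * TCI, also in network byte order, with the given PCP merged in.
 * DEI is left at zero.
 */
static inline uint16_t tci_with_pcp(uint16_t tci_be, uint8_t pcp)
{
	uint16_t vid = ntohs(tci_be) & VLAN_VID_MASK;

	assert(pcp <= 7); /* PCP is a 3-bit field */
	return htons((uint16_t)(((pcp & 0x7) << VLAN_PRIO_SHIFT) | vid));
}
```

A PCP of zero returns the VID-only value unchanged, so the helper is
safe to call unconditionally.
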
> Only the immediate parent is resolved, via vlan_dev_priv(dev)->real_dev
> and not vlan_dev_real_dev(), which walks to the bottom of a stack. When
> the immediate parent is not a real device in the same namespace, the
> lookup returns BPF_FIB_LKUP_RET_VLAN_FAILURE and leaves params->ifindex
> at the input. This covers a stacked VLAN (QinQ), where the immediate
> parent is itself a VLAN device and one h_vlan_proto/h_vlan_TCI pair
> cannot describe two tags, and a parent in another network namespace (a
> VLAN device can be moved while its parent stays), whose ifindex would
> be meaningless in the caller's namespace. A program that wants the VLAN
> device's own ifindex re-issues the lookup without BPF_FIB_LOOKUP_VLAN,
> so the unreducible case stays distinct from a physical egress. That
> distinction matters for XDP: a program cannot xmit on a VLAN device, so
> a success carrying the VLAN ifindex would make it redirect to a device
> with no ndo_xdp_xmit and drop the frame at xdp_do_flush(). The swap and
> the vlan fields are written only on the reduce path; other output
> fields keep their existing behaviour, so a frag-needed result still
> reports the route mtu in params->mtu_result.
>
> On the skb path without tot_len the deferred mtu check is done against
> the resolved egress device. To keep that the VLAN device rather than
> the parent after the swap, bpf_ipv4_fib_lookup()/bpf_ipv6_fib_lookup()
> hand the FIB-result device back to the caller; the XDP path always
> runs the route-mtu check and passes NULL. When the flag is not set,
> behaviour is unchanged: h_vlan_proto and h_vlan_TCI are zeroed and
> ifindex is left at the FIB result.
>
> The new block is compiled only under CONFIG_VLAN_8021Q since
> vlan_dev_priv() is not defined otherwise; without that config
> is_vlan_dev() is constant false and the flag is accepted but never
> acts. That is safe because no VLAN device can exist there, so every
> egress is already physical.
>
> This lets an XDP redirect target the physical device and learn the
> tag to push in a single lookup, which xdp-forward's optional VLAN
> mode (xdp-project/xdp-tools#504) wants from the kernel side.
>
> The helper's input semantics are unchanged; the reverse direction
> (supplying a tag as lookup input) is added in the following patch.
>
> Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
> ---
>  include/uapi/linux/bpf.h       | 28 +++++++++++++-
>  net/core/filter.c              | 69 ++++++++++++++++++++++++----------
>  tools/include/uapi/linux/bpf.h | 28 +++++++++++++-
>  3 files changed, 104 insertions(+), 21 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 89b36de5fdbb..8d0058d88eb2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3532,6 +3532,26 @@ union bpf_attr {
>   *			Use the mark present in *params*->mark for the fib lookup.
>   *			This option should not be used with BPF_FIB_LOOKUP_DIRECT,
>   *			as it only has meaning for full lookups.
> + *		**BPF_FIB_LOOKUP_VLAN**
> + *			If the fib lookup resolves to a VLAN device whose
> + *			parent is a real (non-VLAN) device, set
> + *			*params*->h_vlan_proto and *params*->h_vlan_TCI from
> + *			the VLAN device and replace *params*->ifindex with the
> + *			parent's ifindex. *params*->h_vlan_TCI carries the VID
> + *			only, with PCP and DEI bits zero; a consumer wanting to
> + *			set egress priority writes PCP itself. *params*->smac is
> + *			the VLAN device's own address, which can differ from the
> + *			parent's. Only the immediate parent is resolved; if it
> + *			is itself a VLAN device (QinQ) or in another namespace,
> + *			the egress cannot be reduced to a physical device plus
> + *			one tag and the lookup returns
> + *			**BPF_FIB_LKUP_RET_VLAN_FAILURE** with *params*->ifindex
> + *			left at the input. Re-issue without
> + *			**BPF_FIB_LOOKUP_VLAN** to obtain the VLAN device's own
> + *			ifindex. The swap and the vlan fields
> + *			are written only on success; other output fields keep
> + *			the helper's existing behaviour, so a frag-needed result
> + *			still reports the route mtu in *params*->mtu_result.
>   *
>   *		*ctx* is either **struct xdp_md** for XDP programs or
>   *		**struct sk_buff** tc cls_act programs.
> @@ -7327,6 +7347,7 @@ enum {
>  	BPF_FIB_LOOKUP_TBID    = (1U << 3),
>  	BPF_FIB_LOOKUP_SRC     = (1U << 4),
>  	BPF_FIB_LOOKUP_MARK    = (1U << 5),
> +	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
>  };
>  
>  enum {
> @@ -7340,6 +7361,7 @@ enum {
>  	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neighbor entry for nh */
>  	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
>  	BPF_FIB_LKUP_RET_NO_SRC_ADDR,  /* failed to derive IP src addr */
> +	BPF_FIB_LKUP_RET_VLAN_FAILURE, /* VLAN egress, parent unresolvable */
>  };
>  
>  struct bpf_fib_lookup {
> @@ -7393,7 +7415,11 @@ struct bpf_fib_lookup {
>  
>  	union {
>  		struct {
> -			/* output */
> +			/*
> +			 * output with BPF_FIB_LOOKUP_VLAN: set from the
> +			 * resolved egress VLAN device (see the flag); zeroed
> +			 * on other successful lookups.
> +			 */
>  			__be16	h_vlan_proto;
>  			__be16	h_vlan_TCI;
>  		};
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 2e96b4b847ce..8345295d84de 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6201,10 +6201,28 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
>  #endif
>  
>  #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
> -static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
> +static int bpf_fib_set_fwd_params(struct net_device *dev,
> +				  struct bpf_fib_lookup *params,
> +				  u32 flags, u32 mtu)
>  {
>  	params->h_vlan_TCI = 0;
>  	params->h_vlan_proto = 0;
> +
> +#if IS_ENABLED(CONFIG_VLAN_8021Q)
> +	if ((flags & BPF_FIB_LOOKUP_VLAN) && is_vlan_dev(dev)) {

If you move the ifdef into the if statement, the if statement can have
an else-branch that assigns params->ifindex, so you don't need the
restore dance (see below).
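
To illustrate the shape I mean, here is a rough userspace model, not
kernel code. All model_* names, MODEL_VLAN_8021Q and the RET_* values
are stand-ins invented for the sketch, and it assumes the lookup
functions no longer write params->ifindex themselves before calling
the helper:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_VLAN_8021Q 1            /* stands in for CONFIG_VLAN_8021Q */
#define FIB_LOOKUP_VLAN  (1U << 6)    /* same bit as BPF_FIB_LOOKUP_VLAN */
enum { RET_SUCCESS = 0, RET_VLAN_FAILURE = 1 };   /* model values only */

struct model_dev {                    /* stand-in for struct net_device */
	uint32_t ifindex;
	int netns;
	bool is_vlan;
	uint16_t vlan_proto;          /* already in network byte order */
	uint16_t vlan_id;
	const struct model_dev *real_dev;
};

struct model_params {                 /* stand-in for struct bpf_fib_lookup */
	uint32_t ifindex;
	uint16_t h_vlan_proto;
	uint16_t h_vlan_TCI;
	uint32_t mtu_result;
};

static int model_set_fwd_params(const struct model_dev *dev,
				struct model_params *params,
				uint32_t flags, uint32_t mtu)
{
	assert(dev && params);
	params->h_vlan_TCI = 0;
	params->h_vlan_proto = 0;

	if ((flags & FIB_LOOKUP_VLAN) && dev->is_vlan) {
		/* In the kernel is_vlan_dev() is constant false without
		 * VLAN support, so this branch is dead code there.
		 */
#if MODEL_VLAN_8021Q
		const struct model_dev *real_dev = dev->real_dev;

		/* ifindex has not been written yet, so failing here
		 * leaves it at the caller's input with no restore.
		 */
		if (real_dev->is_vlan || real_dev->netns != dev->netns)
			return RET_VLAN_FAILURE;

		params->h_vlan_proto = dev->vlan_proto;
		params->h_vlan_TCI = htons(dev->vlan_id);
		params->ifindex = real_dev->ifindex;
#endif
	} else {
		params->ifindex = dev->ifindex;
	}

	if (mtu)
		params->mtu_result = mtu;
	return RET_SUCCESS;
}
```

The point is that params->ifindex has exactly one writer, and every
write sits on a success path.
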

> +		struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
> +
> +		if (!is_vlan_dev(real_dev) &&
> +		    net_eq(dev_net(real_dev), dev_net(dev))) {
> +			params->h_vlan_proto = vlan_dev_vlan_proto(dev);
> +			params->h_vlan_TCI = htons(vlan_dev_vlan_id(dev));
> +			params->ifindex = real_dev->ifindex;
> +		} else {
> +			return BPF_FIB_LKUP_RET_VLAN_FAILURE;
> +		}
> +	}
> +#endif
> +
>  	if (mtu)
>  		params->mtu_result = mtu; /* union with tot_len */
>  
> @@ -6214,8 +6232,10 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
>  
>  #if IS_ENABLED(CONFIG_INET)
>  static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> -			       u32 flags, bool check_mtu)
> +			       u32 flags, bool check_mtu,
> +			       struct net_device **fwd_dev)
>  {
> +	u32 in_ifindex = params->ifindex;
>  	struct neighbour *neigh = NULL;
>  	struct fib_nh_common *nhc;
>  	struct in_device *in_dev;
> @@ -6347,16 +6367,23 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
>  
>  set_fwd_params:
> -	return bpf_fib_set_fwd_params(params, mtu);
> +	if (fwd_dev)
> +		*fwd_dev = dev;
> +	err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
> +	if (err == BPF_FIB_LKUP_RET_VLAN_FAILURE)
> +		params->ifindex = in_ifindex;
> +	return err;

I think it's better to just move the assignment of params->ifindex
entirely into bpf_fib_set_fwd_params(), instead of this restore dance.
That way this can be simplified to:

	err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
	if (!err && fwd_dev)
		*fwd_dev = dev;
	return err;
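
As a sanity check of that tail, a small userspace model (all model_*
names and the reducible field are made up for the sketch; the helper
is reduced to a stub that owns the write to params->ifindex, and the
check relies on the success return code being 0, as
BPF_FIB_LKUP_RET_SUCCESS is):

```c
#include <assert.h>
#include <stddef.h>

enum { RET_SUCCESS = 0, RET_VLAN_FAILURE = 1 };   /* model values only */

struct model_dev { int ifindex; int reducible; };  /* stand-in device */
struct model_params { int ifindex; };              /* stand-in params */

/* Stub helper: the only place that writes params->ifindex. */
static int model_set_fwd_params(struct model_dev *dev,
				struct model_params *params)
{
	assert(dev && params);
	if (!dev->reducible)
		return RET_VLAN_FAILURE;   /* ifindex left untouched */
	params->ifindex = dev->ifindex;
	return RET_SUCCESS;
}

/* Simplified tail of the lookup function: no saved input ifindex and
 * no restore; the device is handed back only on success.
 */
static int model_lookup_tail(struct model_dev *dev,
			     struct model_params *params,
			     struct model_dev **fwd_dev)
{
	int err = model_set_fwd_params(dev, params);

	if (!err && fwd_dev)
		*fwd_dev = dev;
	return err;
}
```

On the failure path neither params->ifindex nor *fwd_dev is touched,
so the in_ifindex local can go away as well.
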

>  }
>  #endif
>  
>  #if IS_ENABLED(CONFIG_IPV6)
>  static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> -			       u32 flags, bool check_mtu)
> +			       u32 flags, bool check_mtu,
> +			       struct net_device **fwd_dev)
>  {
>  	struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
>  	struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
> +	u32 in_ifindex = params->ifindex;
>  	struct fib6_result res = {};
>  	struct neighbour *neigh;
>  	struct net_device *dev;
> @@ -6486,13 +6513,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
>  
>  set_fwd_params:
> -	return bpf_fib_set_fwd_params(params, mtu);
> +	if (fwd_dev)
> +		*fwd_dev = dev;
> +	err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
> +	if (err == BPF_FIB_LKUP_RET_VLAN_FAILURE)
> +		params->ifindex = in_ifindex;
> +	return err;

Same as above.

-Toke

