From: Avinash Duduskar <avinash.duduskar@gmail.com>
To: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: eddyz87@gmail.com, memxor@gmail.com, martin.lau@linux.dev,
song@kernel.org, yonghong.song@linux.dev, jolsa@kernel.org,
emil@etsalapatis.com, john.fastabend@gmail.com, sdf@fomichev.me,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, horms@kernel.org, shuah@kernel.org,
hawk@kernel.org, yatsenko@meta.com, leon.hwang@linux.dev,
kpsingh@kernel.org, a.s.protopopov@gmail.com,
ameryhung@gmail.com, rongtao@cestc.cn, eyal.birger@gmail.com,
bpf@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
toke@redhat.com, dsahern@kernel.org
Subject: [PATCH bpf-next v4 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
Date: Tue, 23 Jun 2026 08:21:45 +0530
Message-ID: <20260623025147.1001664-2-avinash.duduskar@gmail.com>
In-Reply-To: <20260623025147.1001664-1-avinash.duduskar@gmail.com>

bpf_fib_lookup() returns the egress ifindex straight from the FIB
result. When the egress is a VLAN device, the returned ifindex is the
VLAN netdev's, which has no XDP xmit handler (ndo_xdp_xmit); XDP
programs that want to forward the frame (e.g. xdp-forward) must
instead target the underlying physical device and push the VLAN tag
themselves. Today the program has no way to learn either the
underlying ifindex or the VLAN tag without maintaining its own
VLAN-to-ifindex map in userspace and refreshing it on netlink
events.

Add BPF_FIB_LOOKUP_VLAN. When the caller sets this flag and the fib
result is a VLAN device whose immediate parent is a real (non-VLAN)
device in the same network namespace, populate the existing output
fields params->h_vlan_proto and params->h_vlan_TCI from the VLAN
device and replace params->ifindex with the parent's ifindex.
params->h_vlan_TCI carries the VID only, with PCP and DEI bits zero; a
consumer wanting to set egress priority writes PCP itself.
params->smac is the VLAN device's own address, which can differ from
the parent's.
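
Since h_vlan_TCI carries only the VID, a consumer that wants an egress
priority has to fold the PCP bits in itself. A minimal user-space sketch
of that step follows. tci_with_pcp() is a hypothetical helper and not
part of this patch; the two macros are redefined locally and are assumed
to mirror the values in linux/if_vlan.h.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Local copies of the usual 802.1Q layout constants (assumed values). */
#define VLAN_PRIO_SHIFT 13
#define VLAN_VID_MASK   0x0fff

/*
 * Hypothetical helper: take the VID-only TCI reported by the lookup
 * (network byte order) and return a TCI with the given PCP set,
 * also in network byte order. DEI stays zero.
 */
static uint16_t tci_with_pcp(uint16_t tci_be, uint8_t pcp)
{
	uint16_t vid;

	assert(pcp <= 7);                     /* PCP is a 3-bit field */
	vid = ntohs(tci_be) & VLAN_VID_MASK;  /* keep only the 12-bit VID */
	return htons((uint16_t)((pcp << VLAN_PRIO_SHIFT) | vid));
}
```

The conversion goes through host byte order so the shift and mask act
on the logical bit positions regardless of endianness.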
Only the immediate parent is resolved, via vlan_dev_priv(dev)->real_dev
and not vlan_dev_real_dev(), which walks to the bottom of a stack. When
the immediate parent is not a real device in the same namespace, the
lookup returns BPF_FIB_LKUP_RET_VLAN_FAILURE and leaves params->ifindex
at the input. This covers a stacked VLAN (QinQ), where the immediate
parent is itself a VLAN device and one h_vlan_proto/h_vlan_TCI pair
cannot describe two tags, and a parent in another network namespace (a
VLAN device can be moved while its parent stays), whose ifindex would
be meaningless in the caller's namespace. A program that wants the VLAN
device's own ifindex re-issues the lookup without BPF_FIB_LOOKUP_VLAN,
so the irreducible case stays distinct from a physical egress. That
distinction matters for XDP: a program cannot xmit on a VLAN device, so
a success carrying the VLAN ifindex would make it redirect to a device
with no ndo_xdp_xmit and drop the frame at xdp_do_flush(). The swap and
the vlan fields are written only on the reduce path; other output
fields keep their existing behaviour, so a frag-needed result still
reports the route mtu in params->mtu_result.
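
The reduce-or-fail decision described above can be modelled in user
space as follows. This is only a sketch under stated assumptions: the
model_* types and names are hypothetical stand-ins for struct
net_device, struct bpf_fib_lookup and the helper's return codes, the
VLAN fields are kept in host byte order, and the restore of the input
ifindex (done by the callers in the patch) is folded into one function.

```c
#include <stddef.h>
#include <stdint.h>

/* Model-only return codes; the real ones are in uapi/linux/bpf.h. */
enum model_ret { MODEL_RET_SUCCESS, MODEL_RET_VLAN_FAILURE };

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct model_dev {
	int ifindex;
	int netns;                        /* stand-in for dev_net(dev) */
	const struct model_dev *real_dev; /* NULL unless a VLAN device */
	uint16_t vlan_proto;              /* host byte order in this model */
	uint16_t vlan_id;
};

/* Hypothetical stand-in for the relevant bpf_fib_lookup fields. */
struct model_params {
	int ifindex;
	uint16_t h_vlan_proto;
	uint16_t h_vlan_TCI;
};

static enum model_ret model_set_fwd(const struct model_dev *dev,
				    struct model_params *p, int want_vlan)
{
	const struct model_dev *parent = dev->real_dev;
	int in_ifindex = p->ifindex;

	/* Default output: the FIB-result device, VLAN fields zeroed. */
	p->ifindex = dev->ifindex;
	p->h_vlan_proto = 0;
	p->h_vlan_TCI = 0;

	if (!want_vlan || !parent)
		return MODEL_RET_SUCCESS;

	/* QinQ parent or parent in another namespace: cannot reduce. */
	if (parent->real_dev || parent->netns != dev->netns) {
		p->ifindex = in_ifindex;
		return MODEL_RET_VLAN_FAILURE;
	}

	/* Reduce path: report one tag plus the physical parent. */
	p->h_vlan_proto = dev->vlan_proto;
	p->h_vlan_TCI = dev->vlan_id;
	p->ifindex = parent->ifindex;
	return MODEL_RET_SUCCESS;
}
```

In this model a VLAN device directly on a physical device reduces to the
parent's ifindex plus one tag, while a stacked VLAN or a cross-namespace
parent fails and leaves the caller's input ifindex in place.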
On the skb path without tot_len, the deferred MTU check is done against
the resolved egress device. For that check to keep using the VLAN device
rather than the parent after the swap,
bpf_ipv4_fib_lookup()/bpf_ipv6_fib_lookup() hand the FIB-result device
back to the caller; the XDP path always runs the route MTU check and
passes NULL. When the flag is not set, behaviour is unchanged:
h_vlan_proto and h_vlan_TCI are zeroed and ifindex is left at the FIB
result.

The new block is compiled only under CONFIG_VLAN_8021Q since
vlan_dev_priv() is not defined otherwise; without that config
is_vlan_dev() is constant false and the flag is accepted but never
acts. That is safe because no VLAN device can exist there, so every
egress is already physical.

This lets an XDP redirect target the physical device and learn the
tag to push in a single lookup, which xdp-forward's optional VLAN
mode (xdp-project/xdp-tools#504) wants from the kernel side.
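
For illustration, the "push the tag" step a consumer performs with the
returned h_vlan_proto/h_vlan_TCI can be sketched on a plain byte buffer.
push_vlan_tag() is a hypothetical user-space helper, not code from this
series or from xdp-forward; it assumes the caller already has 4 bytes of
writable headroom in front of the Ethernet header (in an XDP program
that headroom would typically come from bpf_xdp_adjust_head()).

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN  6  /* bytes per MAC address */
#define VLAN_HLEN 4  /* TPID (2 bytes) + TCI (2 bytes) */

/*
 * Hypothetical helper: insert an 802.1Q tag between the source MAC and
 * the original EtherType. 'eth' points at the current Ethernet header
 * and must have VLAN_HLEN bytes of writable headroom before it.
 * 'proto_be' and 'tci_be' are in network byte order, as reported in
 * h_vlan_proto and h_vlan_TCI. Returns the new start of the frame.
 */
static uint8_t *push_vlan_tag(uint8_t *eth, uint16_t proto_be,
			      uint16_t tci_be)
{
	uint8_t *new_eth = eth - VLAN_HLEN;

	/* Slide both MAC addresses forward into the headroom. */
	memmove(new_eth, eth, 2 * ETH_ALEN);
	/* The tag fills the gap; the old EtherType now follows it. */
	memcpy(new_eth + 2 * ETH_ALEN, &proto_be, sizeof(proto_be));
	memcpy(new_eth + 2 * ETH_ALEN + 2, &tci_be, sizeof(tci_be));
	return new_eth;
}
```

The original EtherType and payload are not moved; only the two MAC
addresses shift by 4 bytes, which keeps the copy small.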
The helper's input semantics are unchanged; the reverse direction
(supplying a tag as lookup input) is added in the following patch.

Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
include/uapi/linux/bpf.h | 28 +++++++++++++-
net/core/filter.c | 69 ++++++++++++++++++++++++----------
tools/include/uapi/linux/bpf.h | 28 +++++++++++++-
3 files changed, 104 insertions(+), 21 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 89b36de5fdbb..8d0058d88eb2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3532,6 +3532,26 @@ union bpf_attr {
* Use the mark present in *params*->mark for the fib lookup.
* This option should not be used with BPF_FIB_LOOKUP_DIRECT,
* as it only has meaning for full lookups.
+ * **BPF_FIB_LOOKUP_VLAN**
+ * If the fib lookup resolves to a VLAN device whose
+ * parent is a real (non-VLAN) device, set
+ * *params*->h_vlan_proto and *params*->h_vlan_TCI from
+ * the VLAN device and replace *params*->ifindex with the
+ * parent's ifindex. *params*->h_vlan_TCI carries the VID
+ * only, with PCP and DEI bits zero; a consumer wanting to
+ * set egress priority writes PCP itself. *params*->smac is
+ * the VLAN device's own address, which can differ from the
+ * parent's. Only the immediate parent is resolved; if it
+ * is itself a VLAN device (QinQ) or in another namespace,
+ * the egress cannot be reduced to a physical device plus
+ * one tag and the lookup returns
+ * **BPF_FIB_LKUP_RET_VLAN_FAILURE** with *params*->ifindex
+ * left at the input. Re-issue without
+ * **BPF_FIB_LOOKUP_VLAN** to obtain the VLAN device's own
+ * ifindex. The swap and the vlan fields
+ * are written only on success; other output fields keep
+ * the helper's existing behaviour, so a frag-needed result
+ * still reports the route mtu in *params*->mtu_result.
*
* *ctx* is either **struct xdp_md** for XDP programs or
* **struct sk_buff** tc cls_act programs.
@@ -7327,6 +7347,7 @@ enum {
BPF_FIB_LOOKUP_TBID = (1U << 3),
BPF_FIB_LOOKUP_SRC = (1U << 4),
BPF_FIB_LOOKUP_MARK = (1U << 5),
+ BPF_FIB_LOOKUP_VLAN = (1U << 6),
};
enum {
@@ -7340,6 +7361,7 @@ enum {
BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */
BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */
BPF_FIB_LKUP_RET_NO_SRC_ADDR, /* failed to derive IP src addr */
+ BPF_FIB_LKUP_RET_VLAN_FAILURE, /* VLAN egress, parent unresolvable */
};
struct bpf_fib_lookup {
@@ -7393,7 +7415,11 @@ struct bpf_fib_lookup {
union {
struct {
- /* output */
+ /*
+ * output with BPF_FIB_LOOKUP_VLAN: set from the
+ * resolved egress VLAN device (see the flag); zeroed
+ * on other successful lookups.
+ */
__be16 h_vlan_proto;
__be16 h_vlan_TCI;
};
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e96b4b847ce..8345295d84de 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6201,10 +6201,28 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
#endif
#if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
-static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
+static int bpf_fib_set_fwd_params(struct net_device *dev,
+ struct bpf_fib_lookup *params,
+ u32 flags, u32 mtu)
{
params->h_vlan_TCI = 0;
params->h_vlan_proto = 0;
+
+#if IS_ENABLED(CONFIG_VLAN_8021Q)
+ if ((flags & BPF_FIB_LOOKUP_VLAN) && is_vlan_dev(dev)) {
+ struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
+
+ if (!is_vlan_dev(real_dev) &&
+ net_eq(dev_net(real_dev), dev_net(dev))) {
+ params->h_vlan_proto = vlan_dev_vlan_proto(dev);
+ params->h_vlan_TCI = htons(vlan_dev_vlan_id(dev));
+ params->ifindex = real_dev->ifindex;
+ } else {
+ return BPF_FIB_LKUP_RET_VLAN_FAILURE;
+ }
+ }
+#endif
+
if (mtu)
params->mtu_result = mtu; /* union with tot_len */
@@ -6214,8 +6232,10 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
#if IS_ENABLED(CONFIG_INET)
static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
- u32 flags, bool check_mtu)
+ u32 flags, bool check_mtu,
+ struct net_device **fwd_dev)
{
+ u32 in_ifindex = params->ifindex;
struct neighbour *neigh = NULL;
struct fib_nh_common *nhc;
struct in_device *in_dev;
@@ -6347,16 +6367,23 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
memcpy(params->smac, dev->dev_addr, ETH_ALEN);
set_fwd_params:
- return bpf_fib_set_fwd_params(params, mtu);
+ if (fwd_dev)
+ *fwd_dev = dev;
+ err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
+ if (err == BPF_FIB_LKUP_RET_VLAN_FAILURE)
+ params->ifindex = in_ifindex;
+ return err;
}
#endif
#if IS_ENABLED(CONFIG_IPV6)
static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
- u32 flags, bool check_mtu)
+ u32 flags, bool check_mtu,
+ struct net_device **fwd_dev)
{
struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
+ u32 in_ifindex = params->ifindex;
struct fib6_result res = {};
struct neighbour *neigh;
struct net_device *dev;
@@ -6486,13 +6513,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
memcpy(params->smac, dev->dev_addr, ETH_ALEN);
set_fwd_params:
- return bpf_fib_set_fwd_params(params, mtu);
+ if (fwd_dev)
+ *fwd_dev = dev;
+ err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
+ if (err == BPF_FIB_LKUP_RET_VLAN_FAILURE)
+ params->ifindex = in_ifindex;
+ return err;
}
#endif
#define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
- BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK)
+ BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK | \
+ BPF_FIB_LOOKUP_VLAN)
BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
struct bpf_fib_lookup *, params, int, plen, u32, flags)
@@ -6507,12 +6540,12 @@ BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
#if IS_ENABLED(CONFIG_INET)
case AF_INET:
return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params,
- flags, true);
+ flags, true, NULL);
#endif
#if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params,
- flags, true);
+ flags, true, NULL);
#endif
}
return -EAFNOSUPPORT;
@@ -6532,6 +6565,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
struct bpf_fib_lookup *, params, int, plen, u32, flags)
{
struct net *net = dev_net(skb->dev);
+ struct net_device *fwd_dev = NULL;
int rc = -EAFNOSUPPORT;
bool check_mtu = false;
@@ -6547,29 +6581,26 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
switch (params->family) {
#if IS_ENABLED(CONFIG_INET)
case AF_INET:
- rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu);
+ rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu,
+ &fwd_dev);
break;
#endif
#if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
- rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu);
+ rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu,
+ &fwd_dev);
break;
#endif
}
if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) {
- struct net_device *dev;
-
- /* When tot_len isn't provided by user, check skb
- * against MTU of FIB lookup resulting net_device
+ /* without tot_len, check the skb against the FIB-result
+ * device's MTU
*/
- dev = dev_get_by_index_rcu(net, params->ifindex);
- if (unlikely(!dev))
- return -ENODEV;
- if (!is_skb_forwardable(dev, skb))
+ if (!is_skb_forwardable(fwd_dev, skb))
rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
- params->mtu_result = dev->mtu; /* union with tot_len */
+ params->mtu_result = fwd_dev->mtu; /* union with tot_len */
}
return rc;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 89b36de5fdbb..8d0058d88eb2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3532,6 +3532,26 @@ union bpf_attr {
* Use the mark present in *params*->mark for the fib lookup.
* This option should not be used with BPF_FIB_LOOKUP_DIRECT,
* as it only has meaning for full lookups.
+ * **BPF_FIB_LOOKUP_VLAN**
+ * If the fib lookup resolves to a VLAN device whose
+ * parent is a real (non-VLAN) device, set
+ * *params*->h_vlan_proto and *params*->h_vlan_TCI from
+ * the VLAN device and replace *params*->ifindex with the
+ * parent's ifindex. *params*->h_vlan_TCI carries the VID
+ * only, with PCP and DEI bits zero; a consumer wanting to
+ * set egress priority writes PCP itself. *params*->smac is
+ * the VLAN device's own address, which can differ from the
+ * parent's. Only the immediate parent is resolved; if it
+ * is itself a VLAN device (QinQ) or in another namespace,
+ * the egress cannot be reduced to a physical device plus
+ * one tag and the lookup returns
+ * **BPF_FIB_LKUP_RET_VLAN_FAILURE** with *params*->ifindex
+ * left at the input. Re-issue without
+ * **BPF_FIB_LOOKUP_VLAN** to obtain the VLAN device's own
+ * ifindex. The swap and the vlan fields
+ * are written only on success; other output fields keep
+ * the helper's existing behaviour, so a frag-needed result
+ * still reports the route mtu in *params*->mtu_result.
*
* *ctx* is either **struct xdp_md** for XDP programs or
* **struct sk_buff** tc cls_act programs.
@@ -7327,6 +7347,7 @@ enum {
BPF_FIB_LOOKUP_TBID = (1U << 3),
BPF_FIB_LOOKUP_SRC = (1U << 4),
BPF_FIB_LOOKUP_MARK = (1U << 5),
+ BPF_FIB_LOOKUP_VLAN = (1U << 6),
};
enum {
@@ -7340,6 +7361,7 @@ enum {
BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */
BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */
BPF_FIB_LKUP_RET_NO_SRC_ADDR, /* failed to derive IP src addr */
+ BPF_FIB_LKUP_RET_VLAN_FAILURE, /* VLAN egress, parent unresolvable */
};
struct bpf_fib_lookup {
@@ -7393,7 +7415,11 @@ struct bpf_fib_lookup {
union {
struct {
- /* output */
+ /*
+ * output with BPF_FIB_LOOKUP_VLAN: set from the
+ * resolved egress VLAN device (see the flag); zeroed
+ * on other successful lookups.
+ */
__be16 h_vlan_proto;
__be16 h_vlan_TCI;
};
--
2.54.0