Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH bpf-next v5 0/3] bpf: bidirectional VLAN support for bpf_fib_lookup()
From: Avinash Duduskar @ 2026-06-24  3:05 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	toke, dsahern

This series adds VLAN awareness to bpf_fib_lookup() in both directions.
BPF_FIB_LOOKUP_VLAN resolves a VLAN egress to its underlying real device
plus the VLAN tag (XDP programs need this because VLAN devices have no XDP
xmit), and BPF_FIB_LOOKUP_VLAN_INPUT runs the lookup as if a tagged frame
had arrived on the matching VLAN subinterface, for iif policy routing and
VRF table selection.

The independent l3mdev/VRF flow-init fix, patch 1 in v1 and v2, was split
out and merged to bpf separately.

An unreducible VLAN egress (a QinQ egress, or a parent in another
namespace) returns BPF_FIB_LKUP_RET_VLAN_FAILURE rather than a best-effort
SUCCESS, so an XDP program cannot mistake it for a physical egress and
silently blackhole the frame at xdp_do_flush(). The code is appended after
BPF_FIB_LKUP_RET_NO_SRC_ADDR (nothing renumbered, tools/ mirror updated)
and is returned only when BPF_FIB_LOOKUP_VLAN is set, so no existing caller
can observe it. On that failure params->ifindex is left at the input; a
program that wants the VLAN device's own ifindex re-issues without the flag.

Changes v4 -> v5 (Toke's review,
https://lore.kernel.org/bpf/87y0g5ca7x.fsf@toke.dk/):

- Patch 1: BPF_FIB_LOOKUP_VLAN only makes sense for XDP, which cannot
  redirect to a VLAN device; a tc program can redirect to the VLAN device
  directly. So bpf_skb_fib_lookup() now rejects the flag with -EINVAL, and
  the fwd_dev out-parameter added in v4 is dropped: with the flag gone from
  the skb path there is no swap to preserve, so the deferred mtu check
  returns to the original dev_get_by_index_rcu(net, params->ifindex). The
  VLAN_FAILURE rewind moves into bpf_fib_set_fwd_params() via an input
  ifindex parameter, so each lookup ends in a plain
  "return bpf_fib_set_fwd_params(...)". The early params->ifindex =
  dev->ifindex that NO_NEIGH and NO_SRC_ADDR report stays where
  d1c362e1dd68a ("bpf: Always return target ifindex in bpf_fib_lookup") put
  it. Dropping fwd_dev also removes the i386 W=1 unused-variable warning the
  kernel test robot reported, since net is used again.

- Patch 2: no code change; add Toke's Reviewed-by.

- Patch 3: the BPF_FIB_LOOKUP_VLAN cases assert the tc helper returns
  -EINVAL and check the egress result on the XDP path, including dmac and
  (for tot_len cases) the route mtu_result; the cross-netns egress case
  runs through bpf_xdp_fib_lookup(); the obsolete skb-mtu-after-swap arm is
  dropped.

Changes v3 -> v4:

- Patch 1: return BPF_FIB_LKUP_RET_VLAN_FAILURE for an unreducible VLAN
  egress, leaving params->ifindex at the input, per Toke's v3 review.

- Patch 3: QinQ-egress and cross-namespace-egress arms expect VLAN_FAILURE;
  an escape-hatch arm re-issues without the flag; and a live-frames arm
  asserts a reducible egress is delivered and a QinQ egress is passed to
  the stack.

Taking the tag as lookup input follows the approach David Ahern suggested
in the 2021 fwmark discussion:
https://lore.kernel.org/bpf/6248c547-ad64-04d6-fcec-374893cc1ef2@gmail.com/

v4: https://lore.kernel.org/all/20260623025147.1001664-1-avinash.duduskar@gmail.com/
v3: https://lore.kernel.org/all/20260617224729.1428662-1-avinash.duduskar@gmail.com/
v2: https://lore.kernel.org/all/20260616223426.3568080-1-avinash.duduskar@gmail.com/
v1: https://lore.kernel.org/all/20260609172052.81613-1-avinash.duduskar@gmail.com/

Avinash Duduskar (3):
  bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
  bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
  selftests/bpf: Add bpf_fib_lookup() VLAN flag tests

 include/uapi/linux/bpf.h                      |  50 +-
 net/core/filter.c                             |  97 ++-
 tools/include/uapi/linux/bpf.h                |  50 +-
 .../selftests/bpf/prog_tests/fib_lookup.c     | 717 +++++++++++++++++-
 .../testing/selftests/bpf/progs/fib_lookup.c  |  36 +
 5 files changed, 936 insertions(+), 14 deletions(-)


base-commit: a975094bf98ca97be9146f9d3b5681a6f9cf5ce3
-- 
2.54.0


^ permalink raw reply

* [PATCH bpf-next v5 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Avinash Duduskar @ 2026-06-24  3:05 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	toke, dsahern
In-Reply-To: <20260624030530.3342884-1-avinash.duduskar@gmail.com>

bpf_fib_lookup() returns the FIB-resolved egress ifindex straight
from the fib result. When the egress is a VLAN device, the returned
ifindex is the VLAN netdev's, which has no XDP xmit handler; XDP
programs that want to forward the frame (e.g. xdp-forward) must
instead target the underlying physical device and push the VLAN tag
themselves. Today the program has no way to learn either the
underlying ifindex or the VLAN tag without maintaining its own
VLAN-to-ifindex map in userspace and refreshing it on netlink
events.

Add BPF_FIB_LOOKUP_VLAN. When the caller sets this flag and the fib
result is a VLAN device whose immediate parent is a real (non-VLAN)
device in the same network namespace, populate the existing output
fields params->h_vlan_proto and params->h_vlan_TCI from the VLAN
device and replace params->ifindex with the parent's ifindex.
params->h_vlan_TCI carries the VID only, with PCP and DEI bits zero; a
consumer wanting to set egress priority writes PCP itself.
params->smac is the VLAN device's own address, which can differ from
the parent's.

Only the immediate parent is resolved, via vlan_dev_priv(dev)->real_dev
and not vlan_dev_real_dev(), which walks to the bottom of a stack. When
the immediate parent is not a real device in the same namespace, the
lookup returns BPF_FIB_LKUP_RET_VLAN_FAILURE and leaves params->ifindex
at the input. This covers a stacked VLAN (QinQ), where the immediate
parent is itself a VLAN device and one h_vlan_proto/h_vlan_TCI pair
cannot describe two tags, and a parent in another network namespace (a
VLAN device can be moved while its parent stays), whose ifindex would
be meaningless in the caller's namespace. A program that wants the VLAN
device's own ifindex re-issues the lookup without BPF_FIB_LOOKUP_VLAN,
so the unreducible case stays distinct from a physical egress. That
distinction matters for XDP: a program cannot xmit on a VLAN device, so
a success carrying the VLAN ifindex would make it redirect to a device
with no ndo_xdp_xmit and drop the frame at xdp_do_flush(). The swap and
the vlan fields are written only on the reduce path; other output
fields keep their existing behaviour, so a frag-needed result still
reports the route mtu in params->mtu_result.

BPF_FIB_LOOKUP_VLAN is only useful to XDP, which cannot redirect to a
VLAN device. A tc program can redirect to the VLAN device directly, so
bpf_skb_fib_lookup() rejects the flag with -EINVAL; bpf_xdp_fib_lookup()
accepts it. When the flag is not set, behaviour is unchanged:
h_vlan_proto and h_vlan_TCI are zeroed and ifindex is left at the FIB
result.

The new block is compiled only under CONFIG_VLAN_8021Q since
vlan_dev_priv() is not defined otherwise; without that config
is_vlan_dev() is constant false and the flag is accepted but never
acts. That is safe because no VLAN device can exist there, so every
egress is already physical.

This lets an XDP redirect target the physical device and learn the
tag to push in a single lookup, which xdp-forward's optional VLAN
mode (xdp-project/xdp-tools#504) wants from the kernel side.

The helper's input semantics are unchanged; the reverse direction
(supplying a tag as lookup input) is added in the following patch.

Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 include/uapi/linux/bpf.h       | 31 ++++++++++++++++++++++++++++++-
 net/core/filter.c              | 33 +++++++++++++++++++++++++++++----
 tools/include/uapi/linux/bpf.h | 31 ++++++++++++++++++++++++++++++-
 3 files changed, 89 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 89b36de5fdbb..e00f0392e728 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3532,6 +3532,29 @@ union bpf_attr {
  *			Use the mark present in *params*->mark for the fib lookup.
  *			This option should not be used with BPF_FIB_LOOKUP_DIRECT,
  *			as it only has meaning for full lookups.
+ *		**BPF_FIB_LOOKUP_VLAN**
+ *			If the fib lookup resolves to a VLAN device whose
+ *			parent is a real (non-VLAN) device, set
+ *			*params*->h_vlan_proto and *params*->h_vlan_TCI from
+ *			the VLAN device and replace *params*->ifindex with the
+ *			parent's ifindex. *params*->h_vlan_TCI carries the VID
+ *			only, with PCP and DEI bits zero; a consumer wanting to
+ *			set egress priority writes PCP itself. *params*->smac is
+ *			the VLAN device's own address, which can differ from the
+ *			parent's. Only the immediate parent is resolved; if it
+ *			is itself a VLAN device (QinQ) or in another namespace,
+ *			the egress cannot be reduced to a physical device plus
+ *			one tag and the lookup returns
+ *			**BPF_FIB_LKUP_RET_VLAN_FAILURE** with *params*->ifindex
+ *			left at the input. Re-issue without
+ *			**BPF_FIB_LOOKUP_VLAN** to obtain the VLAN device's own
+ *			ifindex. The swap and the vlan fields
+ *			are written only on success; other output fields keep
+ *			the helper's existing behaviour, so a frag-needed result
+ *			still reports the route mtu in *params*->mtu_result.
+ *			This flag is only valid for XDP programs; tc programs
+ *			receive -EINVAL since they can redirect to the VLAN
+ *			device directly.
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7327,6 +7350,7 @@ enum {
 	BPF_FIB_LOOKUP_TBID    = (1U << 3),
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
+	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
 };
 
 enum {
@@ -7340,6 +7364,7 @@ enum {
 	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neighbor entry for nh */
 	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
 	BPF_FIB_LKUP_RET_NO_SRC_ADDR,  /* failed to derive IP src addr */
+	BPF_FIB_LKUP_RET_VLAN_FAILURE, /* VLAN egress, parent unresolvable */
 };
 
 struct bpf_fib_lookup {
@@ -7393,7 +7418,11 @@ struct bpf_fib_lookup {
 
 	union {
 		struct {
-			/* output */
+			/*
+			 * output with BPF_FIB_LOOKUP_VLAN: set from the
+			 * resolved egress VLAN device (see the flag); zeroed
+			 * on other successful lookups.
+			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
 		};
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e96b4b847ce..b5a45485a54b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6201,10 +6201,29 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #endif
 
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
-static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
+static int bpf_fib_set_fwd_params(struct net_device *dev,
+				  struct bpf_fib_lookup *params,
+				  u32 flags, u32 mtu, u32 in_ifindex)
 {
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+
+#if IS_ENABLED(CONFIG_VLAN_8021Q)
+	if ((flags & BPF_FIB_LOOKUP_VLAN) && is_vlan_dev(dev)) {
+		struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
+
+		if (!is_vlan_dev(real_dev) &&
+		    net_eq(dev_net(real_dev), dev_net(dev))) {
+			params->h_vlan_proto = vlan_dev_vlan_proto(dev);
+			params->h_vlan_TCI = htons(vlan_dev_vlan_id(dev));
+			params->ifindex = real_dev->ifindex;
+		} else {
+			params->ifindex = in_ifindex;
+			return BPF_FIB_LKUP_RET_VLAN_FAILURE;
+		}
+	}
+#endif
+
 	if (mtu)
 		params->mtu_result = mtu; /* union with tot_len */
 
@@ -6216,6 +6235,7 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
 static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 			       u32 flags, bool check_mtu)
 {
+	u32 in_ifindex = params->ifindex;
 	struct neighbour *neigh = NULL;
 	struct fib_nh_common *nhc;
 	struct in_device *in_dev;
@@ -6347,7 +6367,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 
 set_fwd_params:
-	return bpf_fib_set_fwd_params(params, mtu);
+	return bpf_fib_set_fwd_params(dev, params, flags, mtu, in_ifindex);
 }
 #endif
 
@@ -6357,6 +6377,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 {
 	struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
 	struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
+	u32 in_ifindex = params->ifindex;
 	struct fib6_result res = {};
 	struct neighbour *neigh;
 	struct net_device *dev;
@@ -6486,13 +6507,14 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 
 set_fwd_params:
-	return bpf_fib_set_fwd_params(params, mtu);
+	return bpf_fib_set_fwd_params(dev, params, flags, mtu, in_ifindex);
 }
 #endif
 
 #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
 			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
-			     BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK)
+			     BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK | \
+			     BPF_FIB_LOOKUP_VLAN)
 
 BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
@@ -6541,6 +6563,9 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	if (flags & ~BPF_FIB_LOOKUP_MASK)
 		return -EINVAL;
 
+	if (flags & BPF_FIB_LOOKUP_VLAN)
+		return -EINVAL;
+
 	if (params->tot_len)
 		check_mtu = true;
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 89b36de5fdbb..e00f0392e728 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3532,6 +3532,29 @@ union bpf_attr {
  *			Use the mark present in *params*->mark for the fib lookup.
  *			This option should not be used with BPF_FIB_LOOKUP_DIRECT,
  *			as it only has meaning for full lookups.
+ *		**BPF_FIB_LOOKUP_VLAN**
+ *			If the fib lookup resolves to a VLAN device whose
+ *			parent is a real (non-VLAN) device, set
+ *			*params*->h_vlan_proto and *params*->h_vlan_TCI from
+ *			the VLAN device and replace *params*->ifindex with the
+ *			parent's ifindex. *params*->h_vlan_TCI carries the VID
+ *			only, with PCP and DEI bits zero; a consumer wanting to
+ *			set egress priority writes PCP itself. *params*->smac is
+ *			the VLAN device's own address, which can differ from the
+ *			parent's. Only the immediate parent is resolved; if it
+ *			is itself a VLAN device (QinQ) or in another namespace,
+ *			the egress cannot be reduced to a physical device plus
+ *			one tag and the lookup returns
+ *			**BPF_FIB_LKUP_RET_VLAN_FAILURE** with *params*->ifindex
+ *			left at the input. Re-issue without
+ *			**BPF_FIB_LOOKUP_VLAN** to obtain the VLAN device's own
+ *			ifindex. The swap and the vlan fields
+ *			are written only on success; other output fields keep
+ *			the helper's existing behaviour, so a frag-needed result
+ *			still reports the route mtu in *params*->mtu_result.
+ *			This flag is only valid for XDP programs; tc programs
+ *			receive -EINVAL since they can redirect to the VLAN
+ *			device directly.
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7327,6 +7350,7 @@ enum {
 	BPF_FIB_LOOKUP_TBID    = (1U << 3),
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
+	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
 };
 
 enum {
@@ -7340,6 +7364,7 @@ enum {
 	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neighbor entry for nh */
 	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
 	BPF_FIB_LKUP_RET_NO_SRC_ADDR,  /* failed to derive IP src addr */
+	BPF_FIB_LKUP_RET_VLAN_FAILURE, /* VLAN egress, parent unresolvable */
 };
 
 struct bpf_fib_lookup {
@@ -7393,7 +7418,11 @@ struct bpf_fib_lookup {
 
 	union {
 		struct {
-			/* output */
+			/*
+			 * output with BPF_FIB_LOOKUP_VLAN: set from the
+			 * resolved egress VLAN device (see the flag); zeroed
+			 * on other successful lookups.
+			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
 		};
-- 
2.54.0


^ permalink raw reply related

* [PATCH bpf-next v5 2/3] bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
From: Avinash Duduskar @ 2026-06-24  3:05 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	toke, dsahern
In-Reply-To: <20260624030530.3342884-1-avinash.duduskar@gmail.com>

BPF_FIB_LOOKUP_VLAN resolves a VLAN egress. The reverse is also
useful: an XDP program receiving a VLAN-tagged frame on a physical
device wants the lookup to behave as if the packet had arrived on the
corresponding VLAN subinterface, so iif-based policy routing and VRF
table selection use the right ingress.

Add BPF_FIB_LOOKUP_VLAN_INPUT. When set, params->h_vlan_proto and
params->h_vlan_TCI are read as an input VLAN tag and the matching VLAN
device of params->ifindex is resolved with __vlan_find_dev_deep_rcu().
The device must be up and in the same network namespace as
params->ifindex (a VLAN device can be moved to another netns while
registered on its parent; receive would deliver into that other
namespace, which a lookup here cannot represent). If params->ifindex
is itself a VLAN device, its inner (QinQ) subinterface is matched.
For a bond or team, a tag on a port matches no device and returns
NOT_FWDED; pass the master's ifindex.
The lookup then runs with the resolved device as the ingress;
params->ifindex itself is not modified on the input side. When the
resolved device is enslaved to a VRF, both the full lookup (via the
l3mdev rule) and BPF_FIB_LOOKUP_DIRECT (via l3mdev_fib_table_rcu())
select the VRF's table from the resolved ingress. That follows from
feeding the resolved device to the flow as the ingress
(fl4.flowi4_iif = dev->ifindex), which is what makes l3mdev resolve
the VRF master from the subinterface rather than from
params->ifindex.

The two failure classes get different treatment on purpose. A
h_vlan_proto other than 802.1Q/802.1ad is API misuse and returns
-EINVAL, since it would otherwise reach the WARN in vlan_proto_idx()
with a program-controlled value. An unmatched VID, a device that is
down, or one in another namespace is a data outcome and returns
BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
fib_get_table() finds no table and mirroring real ingress, where the
receive path drops such frames. A VID of 0 (a priority tag) is looked
up literally and normally fails the same way; receive instead
processes such frames untagged, so callers should not set the flag for
priority tags. Proceeding on the physical device for any of these
would be fail-open for the policy-routing cases above.

The h_vlan fields share a union with tbid, so the flag cannot be
combined with BPF_FIB_LOOKUP_TBID. It describes ingress, so it also
cannot be combined with BPF_FIB_LOOKUP_OUTPUT. Both combinations
return -EINVAL; restricting now keeps a later relaxation backward
compatible. Combining with BPF_FIB_LOOKUP_VLAN is allowed: the tag is
consumed on the ingress side and the egress tag is written on
success.

Under !CONFIG_VLAN_8021Q the __vlan_find_dev_deep_rcu() stub returns
NULL, so every lookup with the flag returns NOT_FWDED, which is
correct since no VLAN device can exist.

Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 include/uapi/linux/bpf.h       | 21 ++++++++++-
 net/core/filter.c              | 66 +++++++++++++++++++++++++++++++---
 tools/include/uapi/linux/bpf.h | 21 ++++++++++-
 3 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e00f0392e728..d4218954c50f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3555,6 +3555,22 @@ union bpf_attr {
  *			This flag is only valid for XDP programs; tc programs
  *			receive -EINVAL since they can redirect to the VLAN
  *			device directly.
+ *		**BPF_FIB_LOOKUP_VLAN_INPUT**
+ *			Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ *			as an input VLAN tag and run the lookup as if ingress
+ *			had happened on the VLAN subinterface carrying that tag
+ *			on *params*->ifindex. The VID is the low 12 bits of
+ *			*params*->h_vlan_TCI; *params*->h_vlan_proto must be
+ *			ETH_P_8021Q or ETH_P_8021AD in network byte order, else
+ *			**-EINVAL**. If *params*->ifindex is itself a VLAN
+ *			device, its inner (QinQ) subinterface is matched; for a
+ *			bond or team, pass the master's ifindex. An unmatched
+ *			tag, a down device, or one in another namespace returns
+ *			**BPF_FIB_LKUP_RET_NOT_FWDED**, mirroring real ingress.
+ *			A VID of 0 is looked up literally, so do not set this
+ *			flag for priority-tagged frames. Cannot be combined with
+ *			**BPF_FIB_LOOKUP_TBID** or **BPF_FIB_LOOKUP_OUTPUT**
+ *			(returns **-EINVAL**).
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7351,6 +7367,7 @@ enum {
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
 	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
+	BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
 };
 
 enum {
@@ -7421,7 +7438,9 @@ struct bpf_fib_lookup {
 			/*
 			 * output with BPF_FIB_LOOKUP_VLAN: set from the
 			 * resolved egress VLAN device (see the flag); zeroed
-			 * on other successful lookups.
+			 * on other successful lookups. input with
+			 * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+			 * the lookup by.
 			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
diff --git a/net/core/filter.c b/net/core/filter.c
index b5a45485a54b..0ea362fa4287 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6229,6 +6229,25 @@ static int bpf_fib_set_fwd_params(struct net_device *dev,
 
 	return 0;
 }
+
+static struct net_device *bpf_fib_vlan_input_dev(struct net_device *dev,
+						 const struct bpf_fib_lookup *params)
+{
+	__be16 proto = params->h_vlan_proto;
+	struct net_device *vlan_dev;
+	u16 vid;
+
+	if (proto != htons(ETH_P_8021Q) && proto != htons(ETH_P_8021AD))
+		return ERR_PTR(-EINVAL);
+
+	vid = ntohs(params->h_vlan_TCI) & VLAN_VID_MASK;
+	vlan_dev = __vlan_find_dev_deep_rcu(dev, proto, vid);
+	if (!vlan_dev || !(vlan_dev->flags & IFF_UP) ||
+	    !net_eq(dev_net(vlan_dev), dev_net(dev)))
+		return NULL;
+
+	return vlan_dev;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_INET)
@@ -6249,6 +6268,14 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (unlikely(!dev))
 		return -ENODEV;
 
+	if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		dev = bpf_fib_vlan_input_dev(dev, params);
+		if (IS_ERR(dev))
+			return PTR_ERR(dev);
+		if (!dev)
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
+	}
+
 	/* verify forwarding is enabled on this interface */
 	in_dev = __in_dev_get_rcu(dev);
 	if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
@@ -6258,7 +6285,11 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 		fl4.flowi4_iif = 1;
 		fl4.flowi4_oif = params->ifindex;
 	} else {
-		fl4.flowi4_iif = params->ifindex;
+		/*
+		 * dev->ifindex, not params->ifindex: VLAN_INPUT may have
+		 * resolved dev to a subinterface above.
+		 */
+		fl4.flowi4_iif = dev->ifindex;
 		fl4.flowi4_oif = 0;
 	}
 	fl4.flowi4_dscp = inet_dsfield_to_dscp(params->tos);
@@ -6395,6 +6426,14 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (unlikely(!dev))
 		return -ENODEV;
 
+	if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		dev = bpf_fib_vlan_input_dev(dev, params);
+		if (IS_ERR(dev))
+			return PTR_ERR(dev);
+		if (!dev)
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
+	}
+
 	idev = __in6_dev_get_safely(dev);
 	if (unlikely(!idev || !READ_ONCE(idev->cnf.forwarding)))
 		return BPF_FIB_LKUP_RET_FWD_DISABLED;
@@ -6403,7 +6442,12 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 		fl6.flowi6_iif = 1;
 		oif = fl6.flowi6_oif = params->ifindex;
 	} else {
-		oif = fl6.flowi6_iif = params->ifindex;
+		/*
+		 * dev->ifindex, not params->ifindex: VLAN_INPUT may have
+		 * resolved dev to a subinterface above.
+		 */
+		oif = dev->ifindex;
+		fl6.flowi6_iif = oif;
 		fl6.flowi6_oif = 0;
 		strict = RT6_LOOKUP_F_HAS_SADDR;
 	}
@@ -6514,7 +6558,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
 			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
 			     BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK | \
-			     BPF_FIB_LOOKUP_VLAN)
+			     BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_VLAN_INPUT)
+
+static bool bpf_fib_lookup_flags_ok(u32 flags)
+{
+	if (flags & ~BPF_FIB_LOOKUP_MASK)
+		return false;
+
+	if ((flags & BPF_FIB_LOOKUP_VLAN_INPUT) &&
+	    (flags & (BPF_FIB_LOOKUP_TBID | BPF_FIB_LOOKUP_OUTPUT)))
+		return false;
+
+	return true;
+}
 
 BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
@@ -6522,7 +6578,7 @@ BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	if (plen < sizeof(*params))
 		return -EINVAL;
 
-	if (flags & ~BPF_FIB_LOOKUP_MASK)
+	if (!bpf_fib_lookup_flags_ok(flags))
 		return -EINVAL;
 
 	switch (params->family) {
@@ -6560,7 +6616,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	if (plen < sizeof(*params))
 		return -EINVAL;
 
-	if (flags & ~BPF_FIB_LOOKUP_MASK)
+	if (!bpf_fib_lookup_flags_ok(flags))
 		return -EINVAL;
 
 	if (flags & BPF_FIB_LOOKUP_VLAN)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e00f0392e728..d4218954c50f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3555,6 +3555,22 @@ union bpf_attr {
  *			This flag is only valid for XDP programs; tc programs
  *			receive -EINVAL since they can redirect to the VLAN
  *			device directly.
+ *		**BPF_FIB_LOOKUP_VLAN_INPUT**
+ *			Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ *			as an input VLAN tag and run the lookup as if ingress
+ *			had happened on the VLAN subinterface carrying that tag
+ *			on *params*->ifindex. The VID is the low 12 bits of
+ *			*params*->h_vlan_TCI; *params*->h_vlan_proto must be
+ *			ETH_P_8021Q or ETH_P_8021AD in network byte order, else
+ *			**-EINVAL**. If *params*->ifindex is itself a VLAN
+ *			device, its inner (QinQ) subinterface is matched; for a
+ *			bond or team, pass the master's ifindex. An unmatched
+ *			tag, a down device, or one in another namespace returns
+ *			**BPF_FIB_LKUP_RET_NOT_FWDED**, mirroring real ingress.
+ *			A VID of 0 is looked up literally, so do not set this
+ *			flag for priority-tagged frames. Cannot be combined with
+ *			**BPF_FIB_LOOKUP_TBID** or **BPF_FIB_LOOKUP_OUTPUT**
+ *			(returns **-EINVAL**).
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7351,6 +7367,7 @@ enum {
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
 	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
+	BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
 };
 
 enum {
@@ -7421,7 +7438,9 @@ struct bpf_fib_lookup {
 			/*
 			 * output with BPF_FIB_LOOKUP_VLAN: set from the
 			 * resolved egress VLAN device (see the flag); zeroed
-			 * on other successful lookups.
+			 * on other successful lookups. input with
+			 * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+			 * the lookup by.
 			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
-- 
2.54.0


^ permalink raw reply related

* [PATCH bpf-next v5 3/3] selftests/bpf: Add bpf_fib_lookup() VLAN flag tests
From: Avinash Duduskar @ 2026-06-24  3:05 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	toke, dsahern
In-Reply-To: <20260624030530.3342884-1-avinash.duduskar@gmail.com>

Cover both new VLAN flags in the fib_lookup test. BPF_FIB_LOOKUP_VLAN
reduces a VLAN egress to its physical parent plus the tag, and
BPF_FIB_LOOKUP_VLAN_INPUT scopes the lookup to a VLAN subinterface.

BPF_FIB_LOOKUP_VLAN is XDP-only, since VLAN devices have no XDP xmit; the
tc helper rejects it with -EINVAL, which the table runner asserts for
every flag arm, and the egress result is checked through
bpf_xdp_fib_lookup(). Non-VLAN cases run through both helpers and assert
the path-independent results match; the XDP loop also checks dmac and,
for the tot_len cases, the route mtu_result, so the VLAN-egress dmac and
frag-needed coverage stays even though the tc path no longer reaches it.

The egress arms pin the reduction (parent ifindex plus tag, including
via a neighbour on the VLAN device, in OUTPUT mode, over a bond, and
through a DIRECT|TBID table) and the failure contract: a stacked-VLAN
(QinQ) egress returns BPF_FIB_LKUP_RET_VLAN_FAILURE with params->ifindex
left at the input. That is distinct from a no-neighbour return, which
reports the egress ifindex; only VLAN_FAILURE rewinds params->ifindex,
and a guard arm whose input and egress devices differ pins the
distinction. The VLAN_FAILURE arms are IPv4; the IPv6 path reaches it
through the same shared code, so an IPv6 arm would only re-test that.

The input arms use an iif rule that routes one destination to two
gateways, so the asserted gateway reveals which device the lookup used
as ingress, including VRF table selection through the l3mdev rule and
l3mdev_fib_table_rcu(). A cross-netns subtest moves a VLAN device into a
second netns while it stays registered on its parent and checks both
directions fail closed at the boundary.

A live-frames subtest (test_fib_lookup_vlan_redirect, with
BPF_F_TEST_XDP_LIVE_FRAMES) drives real frames through the native
xdp_do_redirect() / xdp_do_flush() path: a reducible egress is
redirected to the parent and delivered to its peer, while a QinQ egress
is passed to the stack, since redirecting to the VLAN device would drop
the frame at flush (no ndo_xdp_xmit).

The remaining per-case assertions -- resolution semantics, the -EINVAL
and NOT_FWDED error arms, and the SRC/SKIP_NEIGH combinations -- are in
the test table.

Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 .../selftests/bpf/prog_tests/fib_lookup.c     | 717 +++++++++++++++++-
 .../testing/selftests/bpf/progs/fib_lookup.c  |  36 +
 2 files changed, 749 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/fib_lookup.c b/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
index bd7658958004..8caed9d43b98 100644
--- a/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
+++ b/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
 
 #include <linux/rtnetlink.h>
+#include <linux/if_ether.h>
 #include <sys/types.h>
 #include <net/if.h>
 
@@ -23,6 +24,7 @@
 #define IPV4_TBID_ADDR		"172.0.0.254"
 #define IPV4_TBID_NET		"172.0.0.0"
 #define IPV4_TBID_DST		"172.0.0.2"
+#define IPV4_TBID_NONEIGH_DST	"172.0.0.5"
 #define IPV6_TBID_ADDR		"fd00::FFFF"
 #define IPV6_TBID_NET		"fd00::"
 #define IPV6_TBID_DST		"fd00::2"
@@ -37,6 +39,41 @@
 #define IPV6_LOCAL		"fd01::3"
 #define IPV6_GW1		"fd01::1"
 #define IPV6_GW2		"fd01::2"
+#define VLAN_ID			100
+#define VLAN_IFACE		"veth1.100"
+#define VLAN_ID_DOWN		102
+#define VLAN_IFACE_DOWN		"veth1.102"
+#define QINQ_OUTER_IFACE	"veth1.200"
+#define QINQ_INNER_IFACE	"veth1.200.300"
+#define VLAN_TABLE		"300"
+#define IPV4_VLAN_IFACE_ADDR	"10.5.0.254"
+#define IPV4_VLAN_EGRESS_DST	"10.5.0.2"
+#define IPV4_QINQ_DST		"10.7.0.2"
+#define IPV4_VLAN_DST		"10.6.0.2"
+#define IPV4_VLAN_GW		"10.5.0.1"
+#define IPV6_VLAN_IFACE_ADDR	"fd02::254"
+#define IPV6_VLAN_EGRESS_DST	"fd02::2"
+#define IPV6_VLAN_DST		"fd03::2"
+#define IPV6_VLAN_GW		"fd02::1"
+#define VLAN_VID_UNUSED		999
+#define VRF_IFACE		"vrf-blue"
+#define VRF_TABLE		"1000"
+#define VRF_VLAN_ID		101
+#define VRF_VLAN_IFACE		"veth1.101"
+#define IPV4_VRF_IFACE_ADDR	"10.8.0.254"
+#define IPV4_VRF_GW		"10.8.0.1"
+#define IPV4_VRF_DST		"10.9.0.2"
+#define TBID_VLAN_ID		50
+#define TBID_VLAN_IFACE		"veth2.50"
+#define IPV4_TBID_VLAN_DST	"172.2.0.2"
+#define IPV4_BOND_VLAN_DST	"10.11.0.2"
+#define IPV4_VLAN_MTU_DST	"10.5.9.2"
+#define QINQ_AD_VLAN_ID		200
+#define QINQ_INNER_VLAN_ID	300
+#define BOND_IFACE		"bond99"
+#define BOND_PORT		"veth3"
+#define BOND_PORT_PEER		"veth4"
+#define BOND_VLAN_ID		500
 #define DMAC			"11:11:11:11:11:11"
 #define DMAC_INIT { 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, }
 #define DMAC2			"01:01:01:01:01:01"
@@ -52,6 +89,17 @@ struct fib_lookup_test {
 	__u32 tbid;
 	__u8 dmac[6];
 	__u32 mark;
+	/*
+	 * input tag with BPF_FIB_LOOKUP_VLAN_INPUT; expected output tag
+	 * with BPF_FIB_LOOKUP_VLAN (checked when check_vlan is set)
+	 */
+	__u16 vlan_proto;
+	__u16 vlan_id;
+	bool check_vlan;
+	const char *expected_dev; /* expected params->ifindex after lookup */
+	const char *iif;	  /* override the default veth1 input device */
+	__u16 tot_len;		  /* triggers the in-lookup mtu check when set */
+	__u16 expected_mtu;	  /* expected mtu_result (union with tot_len) */
 };
 
 static const struct fib_lookup_test tests[] = {
@@ -79,6 +127,17 @@ static const struct fib_lookup_test tests[] = {
 	  .daddr = IPV4_TBID_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
 	  .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID, .tbid = 100,
 	  .dmac = DMAC_INIT2, },
+	/*
+	 * An error that returns after the egress device is resolved must
+	 * report the egress ifindex, not the input. This routes from input
+	 * veth1 via veth2 (table 100) to a dst with no neighbour, so
+	 * input != egress, pinning NO_NEIGH to the egress device.
+	 */
+	{ .desc = "IPv4 NO_NEIGH reports the egress ifindex, not the input",
+	  .daddr = IPV4_TBID_NONEIGH_DST,
+	  .expected_ret = BPF_FIB_LKUP_RET_NO_NEIGH,
+	  .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID, .tbid = 100,
+	  .expected_dev = "veth2", },
 	{ .desc = "IPv6 TBID lookup failure",
 	  .daddr = IPV6_TBID_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
 	  .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID,
@@ -142,6 +201,218 @@ static const struct fib_lookup_test tests[] = {
 	  .expected_dst = IPV6_GW1,
 	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
 	  .mark = MARK, },
+	/* vlan egress resolution */
+	/*
+	 * Invariant the VLAN-egress arms jointly enforce: a
+	 * BPF_FIB_LOOKUP_VLAN SUCCESS always carries a physical,
+	 * xmit-capable ifindex -- no SUCCESS ever returns a VLAN-device
+	 * ifindex. Reducible arms pin ifindex == the physical parent; the
+	 * QinQ and foreign-netns arms pin VLAN_FAILURE with params->ifindex
+	 * left at the input, so a regression to best-effort (SUCCESS + the
+	 * VLAN ifindex) fails one.
+	 */
+	{ .desc = "IPv4 VLAN egress, no flag",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = VLAN_IFACE, .check_vlan = true, },
+	{ .desc = "IPv4 VLAN egress, single VLAN",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	/*
+	 * skb path without tot_len: mtu_result is the VLAN device's mtu
+	 * (1400), not the parent's (1500)
+	 */
+	{ .desc = "IPv4 VLAN egress, skb-path mtu is the VLAN device's without the flag",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = VLAN_IFACE, .check_vlan = true, .expected_mtu = 1400, },
+	{ .desc = "IPv4 VLAN egress, flag set but egress is not a VLAN",
+	  .daddr = IPV4_NUD_FAILED_ADDR, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true, },
+	{ .desc = "IPv4 VLAN egress, QinQ not reducible (VLAN_FAILURE)",
+	  .daddr = IPV4_QINQ_DST,
+	  .expected_ret = BPF_FIB_LKUP_RET_VLAN_FAILURE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true, },
+	{ .desc = "IPv4 QinQ egress without the flag (escape hatch)",
+	  .daddr = IPV4_QINQ_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = QINQ_INNER_IFACE, },
+	{ .desc = "IPv6 VLAN egress, single VLAN",
+	  .daddr = IPV6_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, neighbour on the VLAN device",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .dmac = DMAC_INIT, },
+	{ .desc = "IPv4 VLAN egress in OUTPUT mode",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .iif = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_OUTPUT | BPF_FIB_LOOKUP_VLAN |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress over a bond",
+	  .daddr = IPV4_BOND_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = BOND_IFACE, .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress via TBID table",
+	  .daddr = IPV4_TBID_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID |
+			  BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .tbid = 100,
+	  .expected_dev = "veth2", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = TBID_VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, success writes mtu_result with the swap",
+	  .daddr = IPV4_VLAN_MTU_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .tot_len = 500, .expected_mtu = 1000,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, FRAG_NEEDED reports mtu, swap unwritten",
+	  .daddr = IPV4_VLAN_MTU_DST, .expected_ret = BPF_FIB_LKUP_RET_FRAG_NEEDED,
+	  .tot_len = 1400, .expected_mtu = 1000,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true, },
+	/* vlan tag as lookup input */
+	{ .desc = "IPv4 VLAN input, no flag",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_GW1,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input, tag selects subinterface route",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv6 VLAN input, tag selects subinterface route",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV6_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input and egress combined",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = "veth1",
+	  .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_VLAN |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, neighbour resolved on the route",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .dmac = DMAC_INIT2, },
+	{ .desc = "IPv4 VLAN input, source address from the subinterface",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_src = IPV4_VLAN_IFACE_ADDR,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SRC |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	/*
+	 * VRF: the resolved subinterface is enslaved, so the l3mdev rule
+	 * (full lookup) and l3mdev_fib_table_rcu() (DIRECT) must select
+	 * the VRF table from the resolved ingress
+	 */
+	{ .desc = "IPv4 VLAN input, VRF subinterface, no flag",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_GW1,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input, tag selects VRF table",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VRF_GW, .expected_dev = VRF_VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VRF_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, DIRECT uses VRF table from resolved ingress",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VRF_GW, .expected_dev = VRF_VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_DIRECT |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VRF_VLAN_ID, },
+	/*
+	 * failure arms also assert params is left untouched: ifindex still
+	 * names the physical device and the input tag bytes survive
+	 */
+	{ .desc = "IPv4 VLAN input, invalid proto",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = 0x1234, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, unmatched VID",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_VID_UNUSED, },
+	{ .desc = "IPv4 VLAN input, subinterface down",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID_DOWN, },
+	/*
+	 * the resolver runs before the forwarding check, so on devices
+	 * with forwarding off FWD_DISABLED (not NOT_FWDED) proves the tag
+	 * resolved to that device and the lookup used it as ingress
+	 */
+	{ .desc = "IPv4 VLAN input, 802.1ad tag",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021AD, .vlan_id = QINQ_AD_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, PCP and DEI bits ignored in TCI",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = 0xe000 | VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, inner QinQ device from VLAN ifindex",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .iif = QINQ_OUTER_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = QINQ_INNER_VLAN_ID, },
+	/*
+	 * bonding: the VLANs live on the master, as on receive, where the
+	 * frame is steered to the master before VLAN processing; a port
+	 * ifindex does not match (ports carry vid state but no VLAN devs)
+	 */
+	{ .desc = "IPv4 VLAN input, tag on bond master resolves",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .iif = BOND_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, tag on bond port does not match",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .iif = BOND_PORT, .expected_dev = BOND_PORT, .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv6 VLAN input, invalid proto",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = -EINVAL,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = 0x1234, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, VID 0 priority tag fails closed",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = 0, },
+	{ .desc = "IPv6 VLAN input, unmatched VID",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_VID_UNUSED, },
+	{ .desc = "unknown flag bit rejected",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = (1 << 14) | BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input rejected with TBID",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_TBID,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input rejected with OUTPUT",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_OUTPUT,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
 };
 
 static int setup_netns(void)
@@ -204,6 +475,110 @@ static int setup_netns(void)
 	SYS(fail, "ip rule add prio 2 fwmark %d lookup %s", MARK, MARK_TABLE);
 	SYS(fail, "ip -6 rule add prio 2 fwmark %d lookup %s", MARK, MARK_TABLE);
 
+	/*
+	 * Setup for vlan tests: a subinterface for egress resolution and
+	 * tag-as-input, a QinQ stack, and an iif rule so the input tests
+	 * observe which device the lookup used as ingress.
+	 */
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VLAN_IFACE, VLAN_ID);
+	SYS(fail, "ip link set dev %s up", VLAN_IFACE);
+	/*
+	 * lower than the veth1 parent (1500): the skb-path mtu check uses the
+	 * FIB result (VLAN) device, so mtu_result is this value with or
+	 * without the egress swap, which two arms below pin
+	 */
+	SYS(fail, "ip link set dev %s mtu 1400", VLAN_IFACE);
+	SYS(fail, "ip addr add %s/24 dev %s", IPV4_VLAN_IFACE_ADDR, VLAN_IFACE);
+	SYS(fail, "ip addr add %s/64 dev %s nodad", IPV6_VLAN_IFACE_ADDR, VLAN_IFACE);
+
+	/*
+	 * stays down: the input flag must treat its tag the way real
+	 * ingress treats a frame arriving on a down VLAN device (drop)
+	 */
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VLAN_IFACE_DOWN, VLAN_ID_DOWN);
+
+	err = write_sysctl("/proc/sys/net/ipv4/conf/" VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.conf." VLAN_IFACE ".forwarding)"))
+		goto fail;
+
+	err = write_sysctl("/proc/sys/net/ipv6/conf/" VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv6.conf." VLAN_IFACE ".forwarding)"))
+		goto fail;
+
+	SYS(fail, "ip link add link veth1 name %s type vlan proto 802.1ad id 200",
+	    QINQ_OUTER_IFACE);
+	SYS(fail, "ip link add link %s name %s type vlan id 300",
+	    QINQ_OUTER_IFACE, QINQ_INNER_IFACE);
+	SYS(fail, "ip link set dev %s up", QINQ_OUTER_IFACE);
+	SYS(fail, "ip link set dev %s up", QINQ_INNER_IFACE);
+	SYS(fail, "ip route add %s/32 dev %s", IPV4_QINQ_DST, QINQ_INNER_IFACE);
+
+	SYS(fail, "ip route add %s/32 via %s", IPV4_VLAN_DST, IPV4_GW1);
+	SYS(fail, "ip route add table %s %s/32 via %s",
+	    VLAN_TABLE, IPV4_VLAN_DST, IPV4_VLAN_GW);
+	SYS(fail, "ip rule add prio 3 iif %s lookup %s", VLAN_IFACE, VLAN_TABLE);
+	SYS(fail, "ip -6 route add %s/128 via %s", IPV6_VLAN_DST, IPV6_GW1);
+	SYS(fail, "ip -6 route add table %s %s/128 via %s",
+	    VLAN_TABLE, IPV6_VLAN_DST, IPV6_VLAN_GW);
+	SYS(fail, "ip -6 rule add prio 3 iif %s lookup %s", VLAN_IFACE, VLAN_TABLE);
+
+	/*
+	 * a bond with one port and a VLAN on the bond: VLANs on a bond
+	 * live on the master, so resolution succeeds for the master's
+	 * ifindex and fails closed for a port's, matching receive, which
+	 * steers the frame to the master before VLAN processing
+	 */
+	SYS(fail, "ip link add %s type bond", BOND_IFACE);
+	SYS(fail, "ip link add %s type veth peer name %s", BOND_PORT, BOND_PORT_PEER);
+	SYS(fail, "ip link set %s master %s", BOND_PORT, BOND_IFACE);
+	SYS(fail, "ip link set dev %s up", BOND_IFACE);
+	SYS(fail, "ip link set dev %s up", BOND_PORT);
+	SYS(fail, "ip link add link %s name %s.%d type vlan id %d",
+	    BOND_IFACE, BOND_IFACE, BOND_VLAN_ID, BOND_VLAN_ID);
+	SYS(fail, "ip link set dev %s.%d up", BOND_IFACE, BOND_VLAN_ID);
+	SYS(fail, "ip route add %s/32 dev %s.%d",
+	    IPV4_BOND_VLAN_DST, BOND_IFACE, BOND_VLAN_ID);
+
+	/*
+	 * a VRF with its own dedicated subinterface (the iif rules above
+	 * must not see it), for the table-selection-by-ingress cases
+	 */
+	SYS(fail, "ip link add %s type vrf table %s", VRF_IFACE, VRF_TABLE);
+	SYS(fail, "ip link set dev %s up", VRF_IFACE);
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VRF_VLAN_IFACE, VRF_VLAN_ID);
+	SYS(fail, "ip link set %s master %s", VRF_VLAN_IFACE, VRF_IFACE);
+	SYS(fail, "ip link set dev %s up", VRF_VLAN_IFACE);
+	SYS(fail, "ip addr add %s/24 dev %s", IPV4_VRF_IFACE_ADDR, VRF_VLAN_IFACE);
+	err = write_sysctl("/proc/sys/net/ipv4/conf/" VRF_VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.conf." VRF_VLAN_IFACE ".forwarding)"))
+		goto fail;
+	SYS(fail, "ip route add %s/32 via %s", IPV4_VRF_DST, IPV4_GW1);
+	SYS(fail, "ip route add table %s %s/32 via %s",
+	    VRF_TABLE, IPV4_VRF_DST, IPV4_VRF_GW);
+
+	/* neighbours on the VLAN subinterface for the non-SKIP_NEIGH cases */
+	err = write_sysctl("/proc/sys/net/ipv4/neigh/" VLAN_IFACE "/gc_stale_time", "900");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.neigh." VLAN_IFACE ".gc_stale_time)"))
+		goto fail;
+	SYS(fail, "ip neigh add %s dev %s lladdr %s nud stale",
+	    IPV4_VLAN_EGRESS_DST, VLAN_IFACE, DMAC);
+	SYS(fail, "ip neigh add %s dev %s lladdr %s nud stale",
+	    IPV4_VLAN_GW, VLAN_IFACE, DMAC2);
+
+	/* a VLAN on veth2 with a route in the tbid test table */
+	SYS(fail, "ip link add link veth2 name %s type vlan id %d",
+	    TBID_VLAN_IFACE, TBID_VLAN_ID);
+	SYS(fail, "ip link set dev %s up", TBID_VLAN_IFACE);
+	SYS(fail, "ip route add table 100 %s/32 dev %s",
+	    IPV4_TBID_VLAN_DST, TBID_VLAN_IFACE);
+
+	/* a locked-mtu route via the subinterface for the FRAG_NEEDED case */
+	SYS(fail, "ip route add %s/32 dev %s mtu lock 1000",
+	    IPV4_VLAN_MTU_DST, VLAN_IFACE);
+
 	return 0;
 fail:
 	return -1;
@@ -218,9 +593,16 @@ static int set_lookup_params(struct bpf_fib_lookup *params,
 	memset(params, 0, sizeof(*params));
 
 	params->l4_protocol = IPPROTO_TCP;
-	params->ifindex = ifindex;
+	params->ifindex = test->iif ? if_nametoindex(test->iif) : ifindex;
 	params->tbid = test->tbid;
 	params->mark = test->mark;
+	params->tot_len = test->tot_len;
+
+	/* h_vlan_proto/h_vlan_TCI union with tbid */
+	if (test->lookup_flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		params->h_vlan_proto = htons(test->vlan_proto);
+		params->h_vlan_TCI = htons(test->vlan_id);
+	}
 
 	if (inet_pton(AF_INET6, test->daddr, params->ipv6_dst) == 1) {
 		params->family = AF_INET6;
@@ -298,7 +680,7 @@ void test_fib_lookup(void)
 	struct nstoken *nstoken = NULL;
 	struct __sk_buff skb = { };
 	struct fib_lookup *skel;
-	int prog_fd, err, ret, i;
+	int prog_fd, xdp_fd, err, ret, i;
 
 	/* The test does not use the skb->data, so
 	 * use pkt_v6 for both v6 and v4 test.
@@ -309,11 +691,16 @@ void test_fib_lookup(void)
 		    .ctx_in = &skb,
 		    .ctx_size_in = sizeof(skb),
 	);
+	LIBBPF_OPTS(bpf_test_run_opts, xdp_opts,
+		    .data_in = &pkt_v6,
+		    .data_size_in = sizeof(pkt_v6),
+	);
 
 	skel = fib_lookup__open_and_load();
 	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
 		return;
 	prog_fd = bpf_program__fd(skel->progs.fib_lookup);
+	xdp_fd = bpf_program__fd(skel->progs.fib_lookup_xdp);
 
 	SYS(fail, "ip netns add %s", NS_TEST);
 
@@ -343,6 +730,15 @@ void test_fib_lookup(void)
 		if (!ASSERT_OK(err, "bpf_prog_test_run_opts"))
 			continue;
 
+		/* BPF_FIB_LOOKUP_VLAN is XDP-only; the tc helper rejects it.
+		 * These cases are exercised on the XDP path below.
+		 */
+		if (tests[i].lookup_flags & BPF_FIB_LOOKUP_VLAN) {
+			ASSERT_EQ(skel->bss->fib_lookup_ret, -EINVAL,
+				  "tc rejects BPF_FIB_LOOKUP_VLAN");
+			continue;
+		}
+
 		ASSERT_EQ(skel->bss->fib_lookup_ret, tests[i].expected_ret,
 			  "fib_lookup_ret");
 
@@ -352,6 +748,21 @@ void test_fib_lookup(void)
 		if (tests[i].expected_dst)
 			assert_dst_ip(fib_params, tests[i].expected_dst);
 
+		if (tests[i].expected_dev)
+			ASSERT_EQ(fib_params->ifindex,
+				  if_nametoindex(tests[i].expected_dev), "ifindex");
+
+		if (tests[i].expected_mtu)
+			ASSERT_EQ(fib_params->mtu_result, tests[i].expected_mtu,
+				  "mtu_result");
+
+		if (tests[i].check_vlan) {
+			ASSERT_EQ(fib_params->h_vlan_proto,
+				  htons(tests[i].vlan_proto), "h_vlan_proto");
+			ASSERT_EQ(fib_params->h_vlan_TCI,
+				  htons(tests[i].vlan_id), "h_vlan_TCI");
+		}
+
 		ret = memcmp(tests[i].dmac, fib_params->dmac, sizeof(tests[i].dmac));
 		if (!ASSERT_EQ(ret, 0, "dmac not match")) {
 			char expected[18], actual[18];
@@ -361,15 +772,313 @@ void test_fib_lookup(void)
 			printf("dmac expected %s actual %s ", expected, actual);
 		}
 
-		// ensure tbid is zero'd out after fib lookup.
-		if (tests[i].lookup_flags & BPF_FIB_LOOKUP_DIRECT) {
+		/*
+		 * ensure tbid is zero'd out after fib lookup. With
+		 * BPF_FIB_LOOKUP_VLAN the union holds the packed vlan
+		 * fields instead, so skip the check for those.
+		 */
+		if ((tests[i].lookup_flags & BPF_FIB_LOOKUP_DIRECT) &&
+		    !(tests[i].lookup_flags & BPF_FIB_LOOKUP_VLAN)) {
 			if (!ASSERT_EQ(skel->bss->fib_params.tbid, 0,
 					"expected fib_params.tbid to be zero"))
 				goto fail;
 		}
 	}
 
+	/*
+	 * Re-run the cases through bpf_xdp_fib_lookup(). test_run uses the
+	 * current netns' loopback for ctx->rxq->dev, so dev_net() is NS_TEST
+	 * and the lookup runs against its FIB. The path-independent results
+	 * (return code, swapped ifindex, vlan tag, gateway) must match the skb
+	 * path; the no-tot_len mtu_result is skb-specific and not rechecked.
+	 */
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		if (set_lookup_params(fib_params, &tests[i], skb.ifindex))
+			continue;
+
+		skel->bss->fib_lookup_ret = -1;
+		skel->bss->lookup_flags = tests[i].lookup_flags;
+
+		err = bpf_prog_test_run_opts(xdp_fd, &xdp_opts);
+		if (!ASSERT_OK(err, "xdp test_run"))
+			continue;
+
+		if (!ASSERT_EQ(skel->bss->fib_lookup_ret, tests[i].expected_ret,
+			       "xdp fib_lookup_ret"))
+			printf("(xdp) %s\n", tests[i].desc);
+
+		if (tests[i].expected_dev)
+			ASSERT_EQ(fib_params->ifindex,
+				  if_nametoindex(tests[i].expected_dev),
+				  "xdp ifindex");
+
+		if (tests[i].expected_dst)
+			assert_dst_ip(fib_params, tests[i].expected_dst);
+
+		if (tests[i].check_vlan) {
+			ASSERT_EQ(fib_params->h_vlan_proto,
+				  htons(tests[i].vlan_proto), "xdp h_vlan_proto");
+			ASSERT_EQ(fib_params->h_vlan_TCI,
+				  htons(tests[i].vlan_id), "xdp h_vlan_TCI");
+		}
+
+		ret = memcmp(tests[i].dmac, fib_params->dmac, sizeof(tests[i].dmac));
+		ASSERT_EQ(ret, 0, "xdp dmac");
+
+		/*
+		 * mtu_result from a tot_len lookup is the route mtu and is
+		 * path-independent; the no-tot_len arm reads dev->mtu and is
+		 * skb-only, so gate on tot_len
+		 */
+		if (tests[i].expected_mtu && tests[i].tot_len)
+			ASSERT_EQ(fib_params->mtu_result, tests[i].expected_mtu,
+				  "xdp mtu_result");
+	}
+
+fail:
+	if (nstoken)
+		close_netns(nstoken);
+	SYS_NOFAIL("ip netns del " NS_TEST);
+	fib_lookup__destroy(skel);
+}
+
+#define NS_VLAN_A	"fib_lookup_vlan_ns_a"
+#define NS_VLAN_B	"fib_lookup_vlan_ns_b"
+
+/*
+ * A VLAN device can be moved to another netns while staying registered
+ * on its parent. Neither direction may then cross the boundary: the
+ * egress flag must not publish the foreign parent's ifindex, and the
+ * input flag must fail closed rather than use a foreign ingress.
+ */
+void test_fib_lookup_vlan_netns(void)
+{
+	struct bpf_fib_lookup *fib_params;
+	struct nstoken *nstoken = NULL;
+	struct __sk_buff skb = { };
+	struct fib_lookup *skel = NULL;
+	int prog_fd, xdp_fd, err, parent_idx, vlan_idx;
+
+	LIBBPF_OPTS(bpf_test_run_opts, run_opts,
+		    .data_in = &pkt_v6,
+		    .data_size_in = sizeof(pkt_v6),
+		    .ctx_in = &skb,
+		    .ctx_size_in = sizeof(skb),
+	);
+	LIBBPF_OPTS(bpf_test_run_opts, xdp_opts,
+		    .data_in = &pkt_v6,
+		    .data_size_in = sizeof(pkt_v6),
+	);
+
+	skel = fib_lookup__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
+		return;
+	prog_fd = bpf_program__fd(skel->progs.fib_lookup);
+	xdp_fd = bpf_program__fd(skel->progs.fib_lookup_xdp);
+	fib_params = &skel->bss->fib_params;
+
+	SYS(fail, "ip netns add %s", NS_VLAN_A);
+	SYS(fail, "ip netns add %s", NS_VLAN_B);
+
+	nstoken = open_netns(NS_VLAN_A);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns(a)"))
+		goto fail;
+
+	SYS(fail, "ip link add veth7 type veth peer name veth8");
+	SYS(fail, "ip link set dev veth7 up");
+	SYS(fail, "ip link add link veth7 name veth7.66 type vlan id 66");
+	SYS(fail, "ip link set veth7.66 netns %s", NS_VLAN_B);
+
+	parent_idx = if_nametoindex("veth7");
+	if (!ASSERT_NEQ(parent_idx, 0, "if_nametoindex(veth7)"))
+		goto fail;
+
+	/*
+	 * input: the moved device is still in veth7's VLAN group, but it
+	 * lives in another netns, so the lookup must fail closed
+	 */
+	skb.ifindex = parent_idx;
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = parent_idx;
+	fib_params->h_vlan_proto = htons(ETH_P_8021Q);
+	fib_params->h_vlan_TCI = htons(66);
+	if (!ASSERT_EQ(inet_pton(AF_INET, "10.66.0.2", &fib_params->ipv4_dst),
+		       1, "inet_pton(dst)"))
+		goto fail;
+
+	skel->bss->fib_lookup_ret = -1;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT |
+				  BPF_FIB_LOOKUP_SKIP_NEIGH;
+	err = bpf_prog_test_run_opts(prog_fd, &run_opts);
+	if (!ASSERT_OK(err, "test_run(input)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->fib_lookup_ret, BPF_FIB_LKUP_RET_NOT_FWDED,
+		  "input across netns fails closed");
+	ASSERT_EQ(fib_params->ifindex, parent_idx, "ifindex untouched");
+	ASSERT_EQ(fib_params->h_vlan_TCI, htons(66), "tag untouched");
+
+	close_netns(nstoken);
+	nstoken = open_netns(NS_VLAN_B);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns(b)"))
+		goto fail;
+
+	/*
+	 * egress: the fib result is the VLAN device here, but its parent
+	 * is in the other netns, so the swap must not happen
+	 */
+	SYS(fail, "ip link set dev veth7.66 up");
+	SYS(fail, "ip addr add 10.66.0.1/24 dev veth7.66");
+	err = write_sysctl("/proc/sys/net/ipv4/conf/veth7.66/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(forwarding)"))
+		goto fail;
+
+	vlan_idx = if_nametoindex("veth7.66");
+	if (!ASSERT_NEQ(vlan_idx, 0, "if_nametoindex(veth7.66)"))
+		goto fail;
+
+	skb.ifindex = vlan_idx;
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = vlan_idx;
+	if (!ASSERT_EQ(inet_pton(AF_INET, "10.66.0.2", &fib_params->ipv4_dst),
+		       1, "inet_pton(dst)") ||
+	    !ASSERT_EQ(inet_pton(AF_INET, "10.66.0.1", &fib_params->ipv4_src),
+		       1, "inet_pton(src)"))
+		goto fail;
+
+	skel->bss->fib_lookup_ret = -1;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN |
+				  BPF_FIB_LOOKUP_SKIP_NEIGH;
+	err = bpf_prog_test_run_opts(xdp_fd, &xdp_opts);
+	if (!ASSERT_OK(err, "test_run(egress)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->fib_lookup_ret, BPF_FIB_LKUP_RET_VLAN_FAILURE,
+		  "egress returns VLAN_FAILURE");
+	ASSERT_EQ(fib_params->ifindex, vlan_idx,
+		  "foreign parent not published");
+	ASSERT_EQ(fib_params->h_vlan_TCI, 0, "vlan fields zero");
+
+fail:
+	if (nstoken)
+		close_netns(nstoken);
+	SYS_NOFAIL("ip netns del " NS_VLAN_A);
+	SYS_NOFAIL("ip netns del " NS_VLAN_B);
+	fib_lookup__destroy(skel);
+}
+
+#define REDIRECT_NPKTS 1000
+
+/*
+ * The egress flag exists so an XDP program can redirect to the physical
+ * parent. A redirect that lands on a VLAN device is dropped at
+ * xdp_do_flush(), because a VLAN device has no ndo_xdp_xmit. Drive real
+ * frames with BPF_F_TEST_XDP_LIVE_FRAMES, which runs the native
+ * xdp_do_redirect() + xdp_do_flush() path: a reducible VLAN egress
+ * resolves to veth1 and is delivered to its peer veth2, while a QinQ
+ * egress returns VLAN_FAILURE and is passed to the stack instead of
+ * redirected to a device that would silently drop it.
+ */
+void test_fib_lookup_vlan_redirect(void)
+{
+	int redirect_fd, err, veth1_idx, veth2_idx = -1;
+	struct bpf_fib_lookup *fib_params;
+	struct nstoken *nstoken = NULL;
+	struct fib_lookup *skel = NULL;
+	bool xdp_attached = false;
+
+	LIBBPF_OPTS(bpf_test_run_opts, lf_opts,
+		    .data_in = &pkt_v4,
+		    .data_size_in = sizeof(pkt_v4),
+		    .flags = BPF_F_TEST_XDP_LIVE_FRAMES,
+		    .repeat = REDIRECT_NPKTS,
+	);
+
+	skel = fib_lookup__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
+		return;
+	redirect_fd = bpf_program__fd(skel->progs.fib_lookup_redirect);
+	fib_params = &skel->bss->fib_params;
+
+	SYS(fail, "ip netns add %s", NS_TEST);
+	nstoken = open_netns(NS_TEST);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+		goto fail;
+	if (setup_netns())
+		goto fail;
+
+	veth1_idx = if_nametoindex("veth1");
+	veth2_idx = if_nametoindex("veth2");
+	if (!ASSERT_NEQ(veth1_idx, 0, "if_nametoindex(veth1)") ||
+	    !ASSERT_NEQ(veth2_idx, 0, "if_nametoindex(veth2)"))
+		goto fail;
+
+	/*
+	 * A redirect to veth1 is delivered to its peer veth2. veth_xdp_xmit()
+	 * only accepts the frame if veth2's NAPI is up, which on veth means
+	 * veth2 carries an XDP program; xdp_count tallies what arrives.
+	 */
+	err = bpf_xdp_attach(veth2_idx, bpf_program__fd(skel->progs.xdp_count),
+			     XDP_FLAGS_DRV_MODE, NULL);
+	if (!ASSERT_OK(err, "attach xdp_count on veth2"))
+		goto fail;
+	xdp_attached = true;
+
+	/* reducible VLAN egress: resolves to the physical parent veth1 */
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = veth1_idx;
+	if (!ASSERT_EQ(inet_pton(AF_INET, IPV4_IFACE_ADDR, &fib_params->ipv4_src),
+		       1, "inet_pton(src)") ||
+	    !ASSERT_EQ(inet_pton(AF_INET, IPV4_VLAN_EGRESS_DST, &fib_params->ipv4_dst),
+		       1, "inet_pton(reducible dst)"))
+		goto fail;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH;
+	skel->bss->redirected = 0;
+	skel->bss->passed = 0;
+	skel->bss->delivered = 0;
+
+	err = bpf_prog_test_run_opts(redirect_fd, &lf_opts);
+	if (!ASSERT_OK(err, "test_run(reducible egress)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->redirected, REDIRECT_NPKTS, "reducible egress redirected");
+	ASSERT_EQ(skel->bss->passed, 0, "reducible egress not passed");
+	ASSERT_GT(skel->bss->delivered, 0, "reducible egress delivered to veth2");
+
+	/*
+	 * QinQ egress: not reducible, so the lookup returns VLAN_FAILURE and
+	 * the program passes the frame instead of redirecting to the inner
+	 * VLAN device. redirected == 0 is the assertion that matters: the
+	 * program did not redirect to a device that would drop the frame at
+	 * xdp_do_flush(). veth2's delivered count is not checked here, since
+	 * a passed frame can still reach veth2 through the stack's forwarding
+	 * path, which is unrelated to the redirect under test.
+	 */
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = veth1_idx;
+	if (!ASSERT_EQ(inet_pton(AF_INET, IPV4_IFACE_ADDR, &fib_params->ipv4_src),
+		       1, "inet_pton(src)") ||
+	    !ASSERT_EQ(inet_pton(AF_INET, IPV4_QINQ_DST, &fib_params->ipv4_dst),
+		       1, "inet_pton(qinq dst)"))
+		goto fail;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH;
+	skel->bss->redirected = 0;
+	skel->bss->passed = 0;
+
+	err = bpf_prog_test_run_opts(redirect_fd, &lf_opts);
+	if (!ASSERT_OK(err, "test_run(qinq egress)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->passed, REDIRECT_NPKTS, "qinq egress passed");
+	ASSERT_EQ(skel->bss->redirected, 0, "qinq egress not redirected");
+
 fail:
+	if (xdp_attached)
+		bpf_xdp_detach(veth2_idx, XDP_FLAGS_DRV_MODE, NULL);
 	if (nstoken)
 		close_netns(nstoken);
 	SYS_NOFAIL("ip netns del " NS_TEST);
diff --git a/tools/testing/selftests/bpf/progs/fib_lookup.c b/tools/testing/selftests/bpf/progs/fib_lookup.c
index 7b5dd2214ff4..862a1e9457b4 100644
--- a/tools/testing/selftests/bpf/progs/fib_lookup.c
+++ b/tools/testing/selftests/bpf/progs/fib_lookup.c
@@ -19,4 +19,40 @@ int fib_lookup(struct __sk_buff *skb)
 	return TC_ACT_SHOT;
 }
 
+SEC("xdp")
+int fib_lookup_xdp(struct xdp_md *ctx)
+{
+	fib_lookup_ret = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params),
+					lookup_flags);
+
+	return XDP_DROP;
+}
+
+int redirected = 0;
+int passed = 0;
+int delivered = 0;
+
+SEC("xdp")
+int fib_lookup_redirect(struct xdp_md *ctx)
+{
+	struct bpf_fib_lookup params = fib_params;
+	long ret;
+
+	ret = bpf_fib_lookup(ctx, &params, sizeof(params), lookup_flags);
+	if (ret == BPF_FIB_LKUP_RET_SUCCESS) {
+		redirected++;
+		return bpf_redirect(params.ifindex, 0);
+	}
+
+	passed++;
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_count(struct xdp_md *ctx)
+{
+	delivered++;
+	return XDP_DROP;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.54.0


^ permalink raw reply related

* [PATCH v2] net: meth: Fix skb allocation failure handling in RX init
From: Haoxiang Li @ 2026-06-24  3:19 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, pavan.chebbi
  Cc: netdev, linux-kernel, Haoxiang Li

meth_init_rx_ring() does not check the return value of alloc_skb().
If the allocation fails, the NULL skb is passed to skb_reserve() and
then dereferenced through skb->head.

Add check for alloc_skb() to prevent potential null pointer dereference.
And unwind the RX entries that were already allocated and DMA-mapped
before returning.

Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
Changes in v2:
 - Add error handling to free the resources that were already allocated. Thanks, Pavan.
 - Drop the fixes tag. Thanks, Andrew.
---
 drivers/net/ethernet/sgi/meth.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/sgi/meth.c b/drivers/net/ethernet/sgi/meth.c
index f7c3a5a766b7..bf5f47692023 100644
--- a/drivers/net/ethernet/sgi/meth.c
+++ b/drivers/net/ethernet/sgi/meth.c
@@ -228,6 +228,9 @@ static int meth_init_rx_ring(struct meth_private *priv)
 
 	for (i = 0; i < RX_RING_ENTRIES; i++) {
 		priv->rx_skbs[i] = alloc_skb(METH_RX_BUFF_SIZE, 0);
+		if (!priv->rx_skbs[i])
+			goto err_free_skbs;
+
 		/* 8byte status vector + 3quad padding + 2byte padding,
 		 * to put data on 64bit aligned boundary */
 		skb_reserve(priv->rx_skbs[i],METH_RX_HEAD);
@@ -240,6 +243,17 @@ static int meth_init_rx_ring(struct meth_private *priv)
 	}
         priv->rx_write = 0;
 	return 0;
+
+err_free_skbs:
+	while (i--) {
+		dma_unmap_single(&priv->pdev->dev, priv->rx_ring_dmas[i],
+				 METH_RX_BUFF_SIZE, DMA_FROM_DEVICE);
+		priv->rx_ring[i] = 0;
+		priv->rx_ring_dmas[i] = 0;
+		kfree_skb(priv->rx_skbs[i]);
+		priv->rx_skbs[i] = NULL;
+	}
+	return -ENOMEM;
 }
 static void meth_free_tx_ring(struct meth_private *priv)
 {
-- 
2.25.1


^ permalink raw reply related

* [PATCH net] net: gianfar: use of_irq_get()
From: Rosen Penev @ 2026-06-24  3:21 UTC (permalink / raw)
  To: netdev
  Cc: Claudiu Manoil, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Andy Fleming, open list

of_irq_get() differs from irq_of_parse_and_map() in that the latter
requires calling irq_dispose_mapping() when done, which is missing in the
driver. Meaning it leaks memory.

No need to map it anyway. Just need the value stored in the irq field.

Changed irq to an int as required by the of_irq_get API as it supports
-EPROBE_DEFER.

Fixes: b31a1d8b4151 ("gianfar: Convert gianfar to an of_platform_driver")
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 drivers/net/ethernet/freescale/gianfar.c | 12 ++++++------
 drivers/net/ethernet/freescale/gianfar.h |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 3271de5844f8..17a0d0787ed2 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -514,15 +514,15 @@ static int gfar_parse_group(struct device_node *np,
 	if (!grp->regs)
 		return -ENOMEM;
 
-	gfar_irq(grp, TX)->irq = irq_of_parse_and_map(np, 0);
+	gfar_irq(grp, TX)->irq = of_irq_get(np, 0);
 
 	/* If we aren't the FEC we have multiple interrupts */
 	if (model && strcasecmp(model, "FEC")) {
-		gfar_irq(grp, RX)->irq = irq_of_parse_and_map(np, 1);
-		gfar_irq(grp, ER)->irq = irq_of_parse_and_map(np, 2);
-		if (!gfar_irq(grp, TX)->irq ||
-		    !gfar_irq(grp, RX)->irq ||
-		    !gfar_irq(grp, ER)->irq)
+		gfar_irq(grp, RX)->irq = of_irq_get(np, 1);
+		gfar_irq(grp, ER)->irq = of_irq_get(np, 2);
+		if (gfar_irq(grp, TX)->irq < 0 ||
+		    gfar_irq(grp, RX)->irq < 0 ||
+		    gfar_irq(grp, ER)->irq < 0)
 			return -EINVAL;
 	}
 
diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
index 68b59d3202e3..c6f1c0b6229e 100644
--- a/drivers/net/ethernet/freescale/gianfar.h
+++ b/drivers/net/ethernet/freescale/gianfar.h
@@ -1074,7 +1074,7 @@ enum gfar_irqinfo_id {
 };
 
 struct gfar_irqinfo {
-	unsigned int irq;
+	int irq;
 	char name[GFAR_INT_NAME_MAX];
 };
 
-- 
2.54.0


^ permalink raw reply related

* RE: [PATCH net-next] selftests/xsk: preserve UMEM view in bidi test
From: Vyavahare, Tushar @ 2026-06-24  3:47 UTC (permalink / raw)
  To: Fijalkowski, Maciej, netdev@vger.kernel.org
  Cc: bpf@vger.kernel.org, Karlsson, Magnus, stfomichev@gmail.com,
	kuba@kernel.org, pabeni@redhat.com, horms@kernel.org,
	kerneljasonxing@gmail.com
In-Reply-To: <20260623091008.1046547-1-maciej.fijalkowski@intel.com>



> -----Original Message-----
> From: Fijalkowski, Maciej <maciej.fijalkowski@intel.com>
> Sent: Tuesday, June 23, 2026 2:40 PM
> To: netdev@vger.kernel.org
> Cc: bpf@vger.kernel.org; Karlsson, Magnus <magnus.karlsson@intel.com>;
> stfomichev@gmail.com; kuba@kernel.org; pabeni@redhat.com;
> horms@kernel.org; Vyavahare, Tushar <tushar.vyavahare@intel.com>;
> kerneljasonxing@gmail.com; Fijalkowski, Maciej
> <maciej.fijalkowski@intel.com>
> Subject: [PATCH net-next] selftests/xsk: preserve UMEM view in bidi test
> 
> The UMEM state refactor made __send_pkts() use xsk->umem for Tx address
> generation. At the same time, the shared-UMEM Tx setup copies the Rx
> UMEM state into a Tx-local state object and resets base_addr and next_buffer
> before configuring the Tx socket.
> 
> Passing that Tx-local object to xsk_configure() makes xsk->umem point to the
> zero-based Tx allocator state. This breaks the BIDIRECTIONAL test once the
> roles are switched: the same socket is then used for Rx validation, but received
> descriptors from the other logical UMEM half are checked against base_addr
> == 0. With the new UMEM bounds check, a valid address such as base_addr +
> XDP_PACKET_HEADROOM is rejected as being outside the UMEM window.
> 
> Keep xsk->umem as the shared/Rx UMEM view used for socket configuration
> and Rx validation. Use the ifobject-local UMEM copy only for Tx descriptor
> address generation, preserving the BIDIRECTIONAL test's intent of using the
> proper logical UMEM half after the direction switch.
> 
> Fixes: b17631032769 ("selftests/xsk: Move UMEM state from ifobject to
> xsk_socket_info")
> Signed-off-by: Maciej Fijalkowski maciej.fijalkowski@intel.com

Reviewed-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Tested-by: Tushar Vyavahare <tushar.vyavahare@intel.com>

> ---
>  tools/testing/selftests/bpf/prog_tests/test_xsk.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> index d8a1c0d40e5a..50a8dbacb63d 100644
> --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> @@ -1169,8 +1169,8 @@ static int receive_pkts(struct test_spec *test)  static
> int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, bool
> timeout)  {
>  	u32 i, idx = 0, valid_pkts = 0, valid_frags = 0, buffer_len;
> +	struct xsk_umem_info *umem = ifobject->xsk_arr[0].umem_real;
>  	struct pkt_stream *pkt_stream = xsk->pkt_stream;
> -	struct xsk_umem_info *umem = xsk->umem;
>  	bool use_poll = ifobject->use_poll;
>  	struct pollfd fds = { };
>  	int ret;
> @@ -1524,7 +1524,7 @@ static int thread_common_ops_tx(struct test_spec
> *test, struct ifobject *ifobjec
>  	umem_tx->base_addr = 0;
>  	umem_tx->next_buffer = 0;
> 
> -	ret = xsk_configure(test, ifobject, umem_tx, true);
> +	ret = xsk_configure(test, ifobject, umem_rx, true);
>  	if (ret)
>  		return ret;
>  	ifobject->xsk = &ifobject->xsk_arr[0];
> --
> 2.43.0


^ permalink raw reply

* [PATCH net v2] fsl/fman: Free init resources on KeyGen failure in fman_init()
From: Haoxiang Li @ 2026-06-24  5:51 UTC (permalink / raw)
  To: madalin.bucur, sean.anderson, andrew+netdev, davem, edumazet,
	kuba, pabeni, florinel.iordache
  Cc: netdev, linux-kernel, Haoxiang Li, stable, Pavan Chebbi

fman_muram_alloc() allocates initialization resources before
initializing the KeyGen block. If keygen_init() fails, the
function returns -EINVAL directly and leaves those resources
allocated. Free the initialization resources before returning
from the KeyGen failure path.

While at it, drop the unused error check around enable(), which
always returns 0.

Fixes: 7472f4f281d0 ("fsl/fman: enable FMan Keygen")
Cc: stable@kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
---
Changes in v2:
 - Add "net" to patch title. Thanks, Pavan!
---
 drivers/net/ethernet/freescale/fman/fman.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.c b/drivers/net/ethernet/freescale/fman/fman.c
index 013273a2de32..3a2a57207e55 100644
--- a/drivers/net/ethernet/freescale/fman/fman.c
+++ b/drivers/net/ethernet/freescale/fman/fman.c
@@ -1995,12 +1995,12 @@ static int fman_init(struct fman *fman)
 
 	/* Init KeyGen */
 	fman->keygen = keygen_init(fman->kg_regs);
-	if (!fman->keygen)
+	if (!fman->keygen) {
+		free_init_resources(fman);
 		return -EINVAL;
+	}
 
-	err = enable(fman, cfg);
-	if (err != 0)
-		return err;
+	enable(fman, cfg);
 
 	enable_time_stamp(fman);
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH net-next v2] Documentation: net/smc: correct old value of smcr_max_recv_wr
From: Mahanta Jambigi @ 2026-06-24  6:14 UTC (permalink / raw)
  To: Breno Leitao
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, alibuda, dust.li,
	sidraya, wenjia, wintera, pasic, horms, tonylu, guwen, netdev,
	linux-s390
In-Reply-To: <ajqh_3YDwz9q5Aiz@gmail.com>



On 23/06/26 8:42 pm, Breno Leitao wrote:
> On Fri, Apr 24, 2026 at 07:23:36AM +0200, Mahanta Jambigi wrote:
>> The smc-sysctl.rst documentation incorrectly stated that the previous
>> hardcoded maximum number of WR buffers on the receive path (smcr_max_recv_wr)
>> was 16. The correct historical value used before the introduction of the sysctl
>> control was 48. Update the documentation to reflect the accurate historical
>> value. Also fix a couple of minor typos.
>>
>> Fixes: aef3cdb47bbb net/smc: make wr buffer count configurable
> 
> This Fixes tag is broken. You probably want:
> 
> 	Fixes: aef3cdb47bbb ("net/smc: make wr buffer count configurable")

I believe you are talking about the missing parenthesis & double quotes
around the commit subject line? I verified the 12-character commit hash
points to the correct commit.

> 
> Other than that, it looks good, the corrected value checks out.


^ permalink raw reply

* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Sebastian Andrzej Siewior @ 2026-06-24  6:26 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
	Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
	Juri Lelli, Paolo Abeni, Peter Zijlstra, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <038a11a7-4ced-49ae-b605-2058733e841a@amd.com>

On 2026-06-23 20:24:02 [+0530], K Prateek Nayak wrote:
> Hello Sebastian,
Hi Prateek,

> nit.
> 
> Instead of replicating these bits, can we replace that return with a
> "goto out" ...

sure

…
> ... and replace this return with a:
> 
>     return (warning) ? BUG_TRAP_TYPE_WARN : BUG_TRAP_TYPE_BUG;
> 
> Looks a tab bit cleaner to my eyes. Thoughts?

It sure does.
I wait for PeterZ' executive order to either do this and sprinkle sched/
_or_ make legacy consoles deferred as it is done on RT.

Petr, was there a big push back doing it unconditionally?

> >  }
> >  

Sebastian

^ permalink raw reply

* Re: [PATCH net v2] net: ti: icssg-prueth: fix XDP_TX from the AF_XDP zero-copy RX path
From: Meghana Malladi @ 2026-06-24  6:30 UTC (permalink / raw)
  To: David Carlier, danishanwar, rogerq, andrew+netdev, netdev
  Cc: davem, edumazet, kuba, pabeni, horms, hawk, john.fastabend, sdf,
	ast, daniel, bpf, linux-arm-kernel, linux-kernel, stable
In-Reply-To: <20260623112225.303930-1-devnexen@gmail.com>

Few nitpicks,

On 6/23/26 16:52, David Carlier wrote:
> On XDP_TX from the zero-copy RX path, emac_run_xdp() converts the xsk
> buffer via xdp_convert_zc_to_xdp_frame(), which clones the data into a
> fresh MEM_TYPE_PAGE_ORDER0 page that is not DMA mapped. Transmitting it
> as PRUETH_TX_BUFF_TYPE_XDP_TX derives the DMA address with
> page_pool_get_dma_addr(), reading an uninitialized page->dma_addr, so
> the device DMAs from a bogus address (corrupt TX, or an IOMMU fault).
> 
> Pick the TX buffer type from the frame's memory type: keep
> PRUETH_TX_BUFF_TYPE_XDP_TX for page_pool frames and use
> PRUETH_TX_BUFF_TYPE_XDP_NDO for the cloned zero-copy frame, which is then
> DMA mapped through the NDO path and unmapped on completion.
> 
> While at it, fix the page_pool XDP_TX completion path. A
> PRUETH_TX_BUFF_TYPE_XDP_TX frame carries a page_pool-owned DMA mapping
> (established against rx_chn->dma_dev), yet prueth_xmit_free()
> unconditionally calls dma_unmap_single() on it with tx_chn->dma_dev,
> tearing down a mapping the driver does not own; xdp_return_frame()
> already recycles the page back to the pool. Tag such frames with a
> dedicated PRUETH_SWDATA_XDPF_TX type so the completion path skips the
> unmap, the same way PRUETH_SWDATA_XSK buffers are handled.
> 
> Fixes: 7a64bb388df3 ("net: ti: icssg-prueth: Add AF_XDP zero copy for RX")
> Fixes: 62aa3246f462 ("net: ti: icssg-prueth: Add XDP support")
> Cc: stable@vger.kernel.org
> Signed-off-by: David Carlier <devnexen@gmail.com>
> ---
> v2:
>   - fold in the page_pool XDP_TX completion-path unmap fix raised by
>     Meghana Malladi: tag page_pool TX frames with PRUETH_SWDATA_XDPF_TX
>     so prueth_xmit_free() skips dma_unmap_single() on a pool-owned
>     mapping; xdp_return_frame() already recycles the page.
>   - add Fixes: 62aa3246f462 for that path.
>   - no change to the original zero-copy fix.
> v1: https://lore.kernel.org/netdev/20260620213756.87499-1-devnexen@gmail.com
>   drivers/net/ethernet/ti/icssg/icssg_common.c | 20 +++++++++++++++++---
>   drivers/net/ethernet/ti/icssg/icssg_prueth.h |  1 +
>   2 files changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ti/icssg/icssg_common.c b/drivers/net/ethernet/ti/icssg/icssg_common.c
> index 82ddef9c17d5..96c8bf5ef671 100644
> --- a/drivers/net/ethernet/ti/icssg/icssg_common.c
> +++ b/drivers/net/ethernet/ti/icssg/icssg_common.c
> @@ -185,7 +185,7 @@ void prueth_xmit_free(struct prueth_tx_chn *tx_chn,
>   	first_desc = desc;
>   	next_desc = first_desc;
>   	swdata = cppi5_hdesc_get_swdata(first_desc);
> -	if (swdata->type == PRUETH_SWDATA_XSK)
> +	if (swdata->type == PRUETH_SWDATA_XSK || swdata->type == PRUETH_SWDATA_XDPF_TX)

line length crosses 80 characters

>   		goto free_pool;
>   
>   	cppi5_hdesc_get_obuf(first_desc, &buf_dma, &buf_dma_len);
> @@ -259,6 +259,7 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
>   			napi_consume_skb(skb, budget);
>   			break;
>   		case PRUETH_SWDATA_XDPF:
> +		case PRUETH_SWDATA_XDPF_TX:
>   			xdpf = swdata->data.xdpf;
>   			dev_sw_netstats_tx_add(ndev, 1, xdpf->len);
>   			total_bytes += xdpf->len;
> @@ -769,7 +770,8 @@ u32 emac_xmit_xdp_frame(struct prueth_emac *emac,
>   	k3_udma_glue_tx_dma_to_cppi5_addr(tx_chn->tx_chn, &buf_dma);
>   	cppi5_hdesc_attach_buf(first_desc, buf_dma, xdpf->len, buf_dma, xdpf->len);
>   	swdata = cppi5_hdesc_get_swdata(first_desc);
> -	swdata->type = PRUETH_SWDATA_XDPF;
> +	swdata->type = buff_type == PRUETH_TX_BUFF_TYPE_XDP_TX ?
> +		PRUETH_SWDATA_XDPF_TX : PRUETH_SWDATA_XDPF;

Use braces for the condition please

>   	swdata->data.xdpf = xdpf;
>   
>   	/* Report BQL before sending the packet */
> @@ -804,6 +806,7 @@ EXPORT_SYMBOL_GPL(emac_xmit_xdp_frame);
>    */
>   static u32 emac_run_xdp(struct prueth_emac *emac, struct xdp_buff *xdp, u32 *len)
>   {
> +	enum prueth_tx_buff_type tx_buff_type;
>   	struct net_device *ndev = emac->ndev;
>   	struct netdev_queue *netif_txq;
>   	int cpu = smp_processor_id();
> @@ -826,11 +829,21 @@ static u32 emac_run_xdp(struct prueth_emac *emac, struct xdp_buff *xdp, u32 *len
>   			goto drop;
>   		}
>   
> +		/* In AF_XDP zero-copy mode xdp_convert_buff_to_frame()
> +		 * clones the xsk buffer into a fresh MEM_TYPE_PAGE_ORDER0
> +		 * page that is not DMA mapped. Such a frame must be mapped
> +		 * via the NDO path; only a page pool-backed frame already
> +		 * carries a usable page_pool DMA address.
> +		 */
> +		tx_buff_type = xdpf->mem_type == MEM_TYPE_PAGE_POOL ?
> +				PRUETH_TX_BUFF_TYPE_XDP_TX :
> +				PRUETH_TX_BUFF_TYPE_XDP_NDO;
> +
>   		q_idx = cpu % emac->tx_ch_num;
>   		netif_txq = netdev_get_tx_queue(ndev, q_idx);
>   		__netif_tx_lock(netif_txq, cpu);
>   		result = emac_xmit_xdp_frame(emac, xdpf, q_idx,
> -					     PRUETH_TX_BUFF_TYPE_XDP_TX);
> +					     tx_buff_type);
>   		__netif_tx_unlock(netif_txq);
>   		if (result == ICSSG_XDP_CONSUMED) {
>   			ndev->stats.tx_dropped++;
> @@ -1395,6 +1408,7 @@ void prueth_tx_cleanup(void *data, dma_addr_t desc_dma)
>   		dev_kfree_skb_any(skb);
>   		break;
>   	case PRUETH_SWDATA_XDPF:
> +	case PRUETH_SWDATA_XDPF_TX:
>   		xdpf = swdata->data.xdpf;
>   		xdp_return_frame(xdpf);
>   		break;
> diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
> index df93d15c5b78..00bb760d68a9 100644
> --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h
> +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
> @@ -153,6 +153,7 @@ enum prueth_swdata_type {
>   	PRUETH_SWDATA_CMD,
>   	PRUETH_SWDATA_XDPF,
>   	PRUETH_SWDATA_XSK,
> +	PRUETH_SWDATA_XDPF_TX,
>   };
>   
>   enum prueth_tx_buff_type {

Reviewed-by: Meghana Malladi <m-malladi@ti.com>

^ permalink raw reply

* [PATCH net v2] net: liquidio: fix BAR resource leak on PF number failure
From: Haoxiang Li @ 2026-06-24  6:40 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, ricardo.farrington,
	felix.manlunas, horms
  Cc: netdev, linux-kernel, Haoxiang Li, stable

If cn23xx_get_pf_num() fails, the function returns without
unmapping either BAR. Unmap both BARs before returning from
the error path.

Found by manual code review.

Fixes: 0c45d7fe12c7 ("liquidio: fix use of pf in pass-through mode in a virtual machine")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
Changes in v2:
 - Modify the commit message.
 - Introduce goto unwind path to do the cleanup. Thanks, Simon!
---
 .../cavium/liquidio/cn23xx_pf_device.c         | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index 75f22f74774c..06b4424e778e 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -1163,18 +1163,14 @@ int setup_cn23xx_octeon_pf_device(struct octeon_device *oct)
 	if (octeon_map_pci_barx(oct, 1, MAX_BAR1_IOREMAP_SIZE)) {
 		dev_err(&oct->pci_dev->dev, "%s CN23XX BAR1 map failed\n",
 			__func__);
-		octeon_unmap_pci_barx(oct, 0);
-		return 1;
+		goto err_unmap_bar0;
 	}
 
 	if (cn23xx_get_pf_num(oct) != 0)
-		return 1;
+		goto err_unmap_bar1;
 
-	if (cn23xx_sriov_config(oct)) {
-		octeon_unmap_pci_barx(oct, 0);
-		octeon_unmap_pci_barx(oct, 1);
-		return 1;
-	}
+	if (cn23xx_sriov_config(oct))
+		goto err_unmap_bar1;
 
 	octeon_write_csr64(oct, CN23XX_SLI_MAC_CREDIT_CNT, 0x3F802080802080ULL);
 
@@ -1205,6 +1201,12 @@ int setup_cn23xx_octeon_pf_device(struct octeon_device *oct)
 	oct->coproc_clock_rate = 1000000ULL * cn23xx_coprocessor_clock(oct);
 
 	return 0;
+
+err_unmap_bar1:
+	octeon_unmap_pci_barx(oct, 1);
+err_unmap_bar0:
+	octeon_unmap_pci_barx(oct, 0);
+	return 1;
 }
 EXPORT_SYMBOL_GPL(setup_cn23xx_octeon_pf_device);
 
-- 
2.25.1


^ permalink raw reply related

* Please backport bridge multicast exponential field encoding fix series to stable kernels
From: Ujjal Roy @ 2026-06-24  6:59 UTC (permalink / raw)
  To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Nikolay Aleksandrov, Ido Schimmel, David Ahern,
	Shuah Khan, Andy Roulin, Yong Wang, Petr Machata, stable, Greg KH,
	Greg Kroah-Hartman
  Cc: Ujjal Roy, bridge, Kernel, Kernel, linux-kselftest

Hi Greg,

Please consider backporting the following bridge multicast fix series
to all applicable stable kernels:

726fa7da2d8c ("ipv4: igmp: get rid of IGMPV3_{QQIC,MRC} and simplify
calculation")
12cfb4ecc471 ("ipv6: mld: rename mldv2_mrc() and add mldv2_qqi()")
95bfd196f0dc ("ipv4: igmp: encode multicast exponential fields")
e51560f4220a ("ipv6: mld: encode multicast exponential fields")
529dbe762de0 ("selftests: net: bridge: add MRC and QQIC field encoding tests")

This series was merged via: db314398f618 ("net: bridge: mcast: support
exponential field encoding")

History: The multicast stack currently supports decoding of IGMPv3 and
MLDv2 exponential timer field encodings, but lacks the corresponding
encoding logic when generating multicast query packets. As a result,
query intervals and response codes exceeding the linear encoding range
can be transmitted incorrectly. This can cause multicast queriers and
listeners to interpret different timing values, resulting in protocol
interoperability issues, membership timeouts, and premature multicast
group expiration.

Testing: The series adds the missing encoding support for both IGMPv3
and MLDv2 and includes selftests that validate the behavior.
I backported the series to v6.6.123.2 and verified the accompanying
selftests. The selftests fail on the unpatched kernel and pass after
applying the series, demonstrating both the bug and the effectiveness
of the fix.

Given that this is a protocol correctness issue affecting multicast
query generation, please consider backporting the complete series to
all applicable stable kernels.

Thanks,
Ujjal

^ permalink raw reply

* [PATCH net v2] net: ipa: fix SMEM state handle leaks in SMP2P init
From: Haoxiang Li @ 2026-06-24  6:59 UTC (permalink / raw)
  To: elder, andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: netdev, linux-kernel, Haoxiang Li, stable

ipa_smp2p_init() acquires two Qualcomm SMEM state handles with
qcom_smem_state_get(). However, neither the init error paths
nor ipa_smp2p_exit() release them.

Release both handles with qcom_smem_state_put() in the init
error paths and in ipa_smp2p_exit().

Fixes: 530f9216a953 ("soc: qcom: ipa: AP/modem communications")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
Changes in v2:
 - Use explicit qcom_smem_state_put() calls instead of devm helpers.
   Thanks, Alex! Thanks, Jakub!
---
 drivers/net/ipa/ipa_smp2p.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ipa/ipa_smp2p.c b/drivers/net/ipa/ipa_smp2p.c
index 2f0ccdd937cc..331c00ad02c0 100644
--- a/drivers/net/ipa/ipa_smp2p.c
+++ b/drivers/net/ipa/ipa_smp2p.c
@@ -232,19 +232,27 @@ ipa_smp2p_init(struct ipa *ipa, struct platform_device *pdev, bool modem_init)
 					  &valid_bit);
 	if (IS_ERR(valid_state))
 		return PTR_ERR(valid_state);
-	if (valid_bit >= 32)		/* BITS_PER_U32 */
-		return -EINVAL;
+	if (valid_bit >= 32) {		/* BITS_PER_U32 */
+		ret = -EINVAL;
+		goto err_valid_state_put;
+	}
 
 	enabled_state = qcom_smem_state_get(dev, "ipa-clock-enabled",
 					    &enabled_bit);
-	if (IS_ERR(enabled_state))
-		return PTR_ERR(enabled_state);
-	if (enabled_bit >= 32)		/* BITS_PER_U32 */
-		return -EINVAL;
+	if (IS_ERR(enabled_state)) {
+		ret = PTR_ERR(enabled_state);
+		goto err_valid_state_put;
+	}
+	if (enabled_bit >= 32) {		/* BITS_PER_U32 */
+		ret = -EINVAL;
+		goto err_enabled_state_put;
+	}
 
 	smp2p = kzalloc_obj(*smp2p);
-	if (!smp2p)
-		return -ENOMEM;
+	if (!smp2p) {
+		ret = -ENOMEM;
+		goto err_enabled_state_put;
+	}
 
 	smp2p->ipa = ipa;
 
@@ -289,6 +297,10 @@ ipa_smp2p_init(struct ipa *ipa, struct platform_device *pdev, bool modem_init)
 	ipa->smp2p = NULL;
 	mutex_destroy(&smp2p->mutex);
 	kfree(smp2p);
+err_enabled_state_put:
+	qcom_smem_state_put(enabled_state);
+err_valid_state_put:
+	qcom_smem_state_put(valid_state);
 
 	return ret;
 }
@@ -305,6 +317,8 @@ void ipa_smp2p_exit(struct ipa *ipa)
 	ipa_smp2p_power_release(ipa);
 	ipa->smp2p = NULL;
 	mutex_destroy(&smp2p->mutex);
+	qcom_smem_state_put(smp2p->enabled_state);
+	qcom_smem_state_put(smp2p->valid_state);
 	kfree(smp2p);
 }
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH v3] virtio_net: disable cb when NAPI is busy-polled
From: Longjun Tang @ 2026-06-24  7:02 UTC (permalink / raw)
  To: mst, xuanzhuo
  Cc: jasowang, edumazet, virtualization, netdev, tanglongjun,
	lange_tang

From: Longjun Tang <tanglongjun@kylinos.cn>

When busy-poll is active, napi_schedule_prep() returns false in
virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
The device may keep firing irqs until reaches virtqueue_napi_complete().
Under load (received == budget), it will lead to a large number
of spurious interrupts.

Fix it by disabling the callback at the virtnet_poll() entry. This keeps
the callback off while we poll and re-enable by virtqueue_napi_complete()
when going idle.

Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>

---
V1 -> V2: Remain agnostic to busy polling
V2 -> V3: Add fixes tag
---
 drivers/net/virtio_net.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..0a11f2b32500 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 	unsigned int xdp_xmit = 0;
 	bool napi_complete;

+	/* Keep callbacks suppressed for the duration of this poll,
+	 * busy-poll need.
+	 */
+	virtqueue_disable_cb(rq->vq);
+
 	virtnet_poll_cleantx(rq, budget);

 	received = virtnet_receive(rq, budget, &xdp_xmit);
-- 
2.43.0

^ permalink raw reply related

* Re: [PATCH] atm: fore200e: disable PCI device on DMA mask failure
From: Myeonghun Pak @ 2026-06-24  7:07 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Chas Williams, netdev, linux-atm-general, linux-kernel, Ijae Kim
In-Reply-To: <16ca6db2-6cf1-4f49-a77a-62bde8341b50@lunn.ch>

Hi Andrew,

Sorry for the noise.

This was against my local torvalds/linux v7.0 tree (028ef9c96e96), not the
current netdev/net-next tree. I missed 6deb53595092 ("net: remove unused ATM
protocols and legacy ATM device drivers"), which removes fore200e, so this
patch is obsolete.

Please drop/ignore it. I will subscribe to netdev and check the current
networking trees before sending future networking patches.

Thanks,
Myeonghun


2026년 6월 23일 (화) 오후 5:56, Andrew Lunn <andrew@lunn.ch>님이 작성:
>
> On Tue, Jun 23, 2026 at 04:53:56PM +0900, Myeonghun Pak wrote:
> > fore200e_pca_detect() enables the PCI device before setting the DMA
> > mask. If dma_set_mask_and_coherent() fails, the current error path
> > returns without disabling the device.
> >
> > Reuse the existing out_disable unwind label for this failure path so
> > pci_disable_device() is called after a successful pci_enable_device().
>
> What tree is this against?
>
> ommit 6deb53595092b1426885f6503d93eedc1e3ece77
> Author: Jakub Kicinski <kuba@kernel.org>
> Date:   Mon Apr 20 13:42:28 2026 -0700
>
>     net: remove unused ATM protocols and legacy ATM device drivers
>
>     Remove the ATM protocol modules and PCI/SBUS ATM device drivers
>     that are no longer in active use.
>
>     The ATM core protocol stack, PPPoATM, BR2684, and USB DSL modem
>     drivers (drivers/usb/atm/) are retained in-tree to maintain PPP
>     over ATM (PPPoA) and PPPoE-over-BR2684 support for DSL connections.
>     The Solos ADSL2+ PCI driver is also retained.
>
>     Removed ATM protocol modules:
>      - net/atm/clip.c - Classical IP over ATM (RFC 2225)
>      - net/atm/lec.c - LAN Emulation Client (LANE)
>      - net/atm/mpc.c, mpoa_caches.c, mpoa_proc.c - Multi-Protocol Over ATM
>
>     Removed PCI/SBUS ATM device drivers (drivers/atm/):
>      - adummy, atmtcp - software/testing ATM devices
>      - eni - Efficient Networks ENI155P (OC-3, ~1995)
>      - fore200e - FORE Systems 200E PCI/SBUS (OC-3, ~1999)
>
>
> Please subscribe to the netdev Mailing list, so you know what is going
> on.
>
>
>     Andrew
>
> ---
> pw-bot: reject
>

^ permalink raw reply

* Re: [PATCH v3] virtio_net: disable cb when NAPI is busy-polled
From: Michael S. Tsirkin @ 2026-06-24  7:08 UTC (permalink / raw)
  To: Longjun Tang
  Cc: xuanzhuo, jasowang, edumazet, virtualization, netdev, tanglongjun
In-Reply-To: <20260624070206.85467-1-lange_tang@163.com>

On Wed, Jun 24, 2026 at 03:02:06PM +0800, Longjun Tang wrote:
> From: Longjun Tang <tanglongjun@kylinos.cn>
> 
> When busy-poll is active, napi_schedule_prep() returns false in
> virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
> The device may keep firing irqs until reaches virtqueue_napi_complete().
> Under load (received == budget), it will lead to a large number
> of spurious interrupts.
> 
> Fix it by disabling the callback at the virtnet_poll() entry. This keeps
> the callback off while we poll and re-enable

and it is re-enabled

> by virtqueue_napi_complete()
> when going idle.
> 
> Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
> 
> ---
> V1 -> V2: Remain agnostic to busy polling
> V2 -> V3: Add fixes tag
> ---
>  drivers/net/virtio_net.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..0a11f2b32500 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
>  	unsigned int xdp_xmit = 0;
>  	bool napi_complete;
>  
> +	/* Keep callbacks suppressed for the duration of this poll,
> +	 * busy-poll need.

I don't know what "busy-poll need" means. Just drop this part?
In fact, the whole comment can go, we know virtqueue_disable_cb
disables callbacks.

> +	 */
> +	virtqueue_disable_cb(rq->vq);
> +
>  	virtnet_poll_cleantx(rq, budget);
>  
>  	received = virtnet_receive(rq, budget, &xdp_xmit);
> -- 
> 2.43.0


^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-24  7:12 UTC (permalink / raw)
  To: avagin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, safinaskar, torvalds, val, viro, willy
In-Reply-To: <CANaxB-zK5q=Xw6UZTmeFtXsDZjUsPkFk=p485m-wtNTBnf4hgg@mail.gmail.com>

Andrei Vagin <avagin@gmail.com>:
> The CRIU fifo test fails with this change. The problem is that vmsplice
> with SPLICE_F_NONBLOCK to a fifo file descriptor fails with -EOPNOTSUPP.
> 
> It seems we need a fix like this one:
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 429b0714ec57..6fc49e933727 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1253,6 +1253,7 @@ static int fifo_open(struct inode *inode, struct
> file *filp)
> 
>         /* We can only do regular read/write on fifos */
>         stream_open(inode, filp);
> +       filp->f_mode |= FMODE_NOWAIT;
> 
>         switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
>         case FMODE_READ:

Does CRIU actually rely on ability to do SPLICE_F_NONBLOCK vmsplice into
named fifos? Or this is merely a test?

If this is just a test, I think we need not to preserve this behavior.

I did debian code search with regex "vmsplice.*SPLICE_F_NONBLOCK" and I
found very few packages. And it seems all them use pipes, not named fifos.

(On speed: I still think that my vmsplice patches are good thing,
despite performance regressions in CRIU.)

-- 
Askar Safin

^ permalink raw reply

* [syzbot ci] Re: nfc: nci: fix uninit-value in nci_core_init_rsp_packet()
From: syzbot ci @ 2026-06-24  7:13 UTC (permalink / raw)
  To: davem, david, edumazet, horms, kuba, linux-kernel, netdev,
	oe-linux-nfc, pabeni, sam, stable
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260623222402.175798-1-sam@bynar.io>

syzbot ci has tested the following series

[v1] nfc: nci: fix uninit-value in nci_core_init_rsp_packet()
https://lore.kernel.org/all/20260623222402.175798-1-sam@bynar.io
* [PATCH net] nfc: nci: fix uninit-value in nci_core_init_rsp_packet()

and found the following issue:
UBSAN: array-index-out-of-bounds in nci_init_complete_req

Full report is available here:
https://ci.syzbot.org/series/2a9a8657-37a3-4dce-8cb5-2035027791dd

***

UBSAN: array-index-out-of-bounds in nci_init_complete_req

tree:      linux-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
base:      a986fde914d88af47eb78fd29c5d1af7952c3500
arch:      amd64
compiler:  Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
config:    https://ci.syzbot.org/builds/80f835c3-e998-47ff-aaa5-24c578af3b4e/config
syz repro: https://ci.syzbot.org/findings/65008893-2498-4786-b913-f2c474a7b34a/syz_repro

------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in net/nfc/nci/core.c:192:7
index 4 is out of range for type '__u8[4]' (aka 'unsigned char[4]')
CPU: 0 UID: 0 PID: 5905 Comm: syz.1.33 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
 __ubsan_handle_out_of_bounds+0xe8/0xf0 lib/ubsan.c:455
 nci_init_complete_req+0x255/0x460 net/nfc/nci/core.c:192
 __nci_request+0x7d/0x300 net/nfc/nci/core.c:108
 nci_open_device net/nfc/nci/core.c:529 [inline]
 nci_dev_up+0x8c3/0xdc0 net/nfc/nci/core.c:643
 nfc_dev_up+0x165/0x350 net/nfc/core.c:118
 nfc_genl_dev_up+0x89/0xe0 net/nfc/netlink.c:775
 genl_family_rcv_msg_doit+0x233/0x340 net/netlink/genetlink.c:1114
 genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
 genl_rcv_msg+0x614/0x7a0 net/netlink/genetlink.c:1209
 netlink_rcv_skb+0x226/0x4a0 net/netlink/af_netlink.c:2556
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
 netlink_unicast+0x7bb/0x940 net/netlink/af_netlink.c:1345
 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1900
 sock_sendmsg_nosec net/socket.c:775 [inline]
 __sock_sendmsg net/socket.c:790 [inline]
 ____sys_sendmsg+0x9b9/0xa20 net/socket.c:2684
 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2738
 __sys_sendmsg net/socket.c:2770 [inline]
 __do_sys_sendmsg net/socket.c:2775 [inline]
 __se_sys_sendmsg net/socket.c:2773 [inline]
 __x64_sys_sendmsg+0x1b1/0x290 net/socket.c:2773
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f55ead9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f55ebcb9028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f55eb015fa0 RCX: 00007f55ead9ce59
RDX: 0000000004008054 RSI: 0000200000000200 RDI: 0000000000000005
RBP: 00007f55eae32e6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f55eb016038 R14: 00007f55eb015fa0 R15: 00007ffcba11c798
 </TASK>
---[ end trace ]---


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.

^ permalink raw reply

* Re: [PATCH net 1/1] net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF
From: kernel test robot @ 2026-06-24  7:15 UTC (permalink / raw)
  To: Jamal Hadi Salim, netdev
  Cc: oe-kbuild-all, davem, edumazet, kuba, pabeni, horms, victor,
	andrew+netdev, zdi-disclosures, stable, Jamal Hadi Salim
In-Reply-To: <20260623184247.508956-1-jhs@mojatatu.com>

Hi Jamal,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Jamal-Hadi-Salim/net-sched-sch_teql-Introduce-slaves_lock-to-avoid-race-condition-and-UAF/20260624-024432
base:   net/main
patch link:    https://lore.kernel.org/r/20260623184247.508956-1-jhs%40mojatatu.com
patch subject: [PATCH net 1/1] net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF
config: sparc-randconfig-r133-20260624 (https://download.01.org/0day-ci/archive/20260624/202606241501.XQBMu4b8-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 15.2.0
sparse: v0.6.5-rc1
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260624/202606241501.XQBMu4b8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606241501.XQBMu4b8-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> net/sched/sch_teql.c:106:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:106:25: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:106:25: sparse:    struct Qdisc *
   net/sched/sch_teql.c:217:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:217:17: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:217:17: sparse:    struct Qdisc *
   net/sched/sch_teql.c:220:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:220:17: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:220:17: sparse:    struct Qdisc *
   net/sched/sch_teql.c:300:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:300:17: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:300:17: sparse:    struct Qdisc *
   net/sched/sch_teql.c:359:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:359:23: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:359:23: sparse:    struct Qdisc *
   net/sched/sch_teql.c:333:41: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc *
   net/sched/sch_teql.c:333:41: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc *
   net/sched/sch_teql.c:333:41: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc *
   net/sched/sch_teql.c:349:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc *
   net/sched/sch_teql.c:349:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc *
   net/sched/sch_teql.c:349:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc *

vim +106 net/sched/sch_teql.c

    89	
    90	static struct sk_buff *
    91	teql_dequeue(struct Qdisc *sch)
    92	{
    93		struct teql_sched_data *dat = qdisc_priv(sch);
    94		struct netdev_queue *dat_queue;
    95		struct sk_buff *skb;
    96		struct Qdisc *q;
    97	
    98		skb = __skb_dequeue(&dat->q);
    99		dat_queue = netdev_get_tx_queue(dat->m->dev, 0);
   100		q = rcu_dereference_bh(dat_queue->qdisc);
   101	
   102		if (skb == NULL) {
   103			struct net_device *m = qdisc_dev(q);
   104			if (m) {
   105				spin_lock_bh(&dat->m->slaves_lock);
 > 106				rcu_assign_pointer(dat->m->slaves, sch);
   107				spin_unlock_bh(&dat->m->slaves_lock);
   108				netif_wake_queue(m);
   109			}
   110		} else {
   111			qdisc_bstats_update(sch, skb);
   112		}
   113		WRITE_ONCE(sch->q.qlen, dat->q.qlen + READ_ONCE(q->q.qlen));
   114		return skb;
   115	}
   116	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* [syzbot] [net?] WARNING in qdisc_pkt_len_segs_init
From: syzbot @ 2026-06-24  7:20 UTC (permalink / raw)
  To: davem, edumazet, horms, kuba, linux-kernel, netdev, pabeni,
	syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    9c87e61e3c57 Merge tag 'bpf-next-7.2' of git://git.kernel...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10d2901c580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=9a9f723a32776544
dashboard link: https://syzkaller.appspot.com/bug?extid=d5d0d598a4cfdfafdc3b
compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/5489ff1c0660/disk-9c87e61e.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/bcdabd5a8fea/vmlinux-9c87e61e.xz
kernel image: https://storage.googleapis.com/syzbot-assets/fa77e3b769c6/bzImage-9c87e61e.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+d5d0d598a4cfdfafdc3b@syzkaller.appspotmail.com

------------[ cut here ]------------
len > ((int)(~0U >> 1))
WARNING: ./include/linux/skbuff.h:2866 at pskb_may_pull_reason include/linux/skbuff.h:2866 [inline], CPU#0: syz.0.5520/24128
WARNING: ./include/linux/skbuff.h:2866 at pskb_may_pull include/linux/skbuff.h:2884 [inline], CPU#0: syz.0.5520/24128
WARNING: ./include/linux/skbuff.h:2866 at qdisc_pkt_len_segs_init+0x4b4/0xa30 net/core/dev.c:4138, CPU#0: syz.0.5520/24128
Modules linked in:
CPU: 0 UID: 0 PID: 24128 Comm: syz.0.5520 Tainted: G             L      syzkaller #0 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:pskb_may_pull_reason include/linux/skbuff.h:2866 [inline]
RIP: 0010:pskb_may_pull include/linux/skbuff.h:2884 [inline]
RIP: 0010:qdisc_pkt_len_segs_init+0x4b4/0xa30 net/core/dev.c:4138
Code: 00 00 02 00 31 ff e8 1b 3d 49 f8 81 e3 00 00 02 00 0f 85 fd 00 00 00 e8 ca 38 49 f8 45 89 e7 e9 2c ff ff ff e8 bd 38 49 f8 90 <0f> 0b 90 e9 ee fd ff ff 44 89 e7 89 de e8 6a 3a 49 f8 41 39 dc 0f
RSP: 0018:ffffc90000007660 EFLAGS: 00010246
RAX: ffffffff897ccc53 RBX: 0000000000000003 RCX: ffff8880761b8000
RDX: 0000000000000100 RSI: 00000000fffffffa RDI: 0000000000000000
RBP: ffff888034a70b40 R08: ffff88807bd48067 R09: 1ffff1100f7a900c
R10: dffffc0000000000 R11: ffffed100f7a900d R12: 00000000fffffffa
R13: dffffc0000000000 R14: 1ffff1100694e183 R15: ffff888068a29b98
FS:  00007f26d46806c0(0000) GS:ffff888125272000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000010000 CR3: 00000000894be000 CR4: 00000000003526f0
DR0: 0000000000000006 DR1: 0000000000000000 DR2: 000000007fffdff5
DR3: 0000800000000005 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 sch_handle_ingress net/core/dev.c:4479 [inline]
 __netif_receive_skb_core+0x13aa/0x30b0 net/core/dev.c:6057
 __netif_receive_skb_list_core+0x24d/0x830 net/core/dev.c:6281
 __netif_receive_skb_list net/core/dev.c:6348 [inline]
 netif_receive_skb_list_internal+0x995/0xcf0 net/core/dev.c:6439
 gro_normal_list include/net/gro.h:523 [inline]
 gro_flush_normal include/net/gro.h:531 [inline]
 napi_complete_done+0x299/0x730 net/core/dev.c:6807
 gro_cell_poll+0x5ab/0x5d0 net/core/gro_cells.c:74
 __napi_poll+0xaa/0x330 net/core/dev.c:7729
 napi_poll net/core/dev.c:7792 [inline]
 net_rx_action+0x61d/0xf50 net/core/dev.c:7949
 handle_softirqs+0x225/0x840 kernel/softirq.c:622
 do_softirq+0x76/0xd0 kernel/softirq.c:523
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0xf8/0x130 kernel/softirq.c:450
 local_bh_enable include/linux/bottom_half.h:33 [inline]
 tun_rx_batched+0x616/0x790 drivers/net/tun.c:-1
 tun_get_user+0x2b04/0x4350 drivers/net/tun.c:1986
 tun_chr_write_iter+0x113/0x200 drivers/net/tun.c:2032
 new_sync_write fs/read_write.c:595 [inline]
 vfs_write+0x612/0xba0 fs/read_write.c:687
 ksys_write+0x150/0x270 fs/read_write.c:739
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f26d379ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f26d4680028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f26d3a15fa0 RCX: 00007f26d379ce59
RDX: 000000000000fdef RSI: 00002000000002c0 RDI: 000000000000000a
RBP: 00007f26d3832d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f26d3a16038 R14: 00007f26d3a15fa0 R15: 00007f26d3b3fa48
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply

* [PATCH v5 net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-24  7:21 UTC (permalink / raw)
  To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
  Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
	Shradha Gupta, Saurabh Singh Sengar, stable, Yury Norov

Before the commit 755391121038 ("net: mana: Allocate MSI-X vectors
dynamically"), all the MANA IRQs were assigned statically and together
during early driver load.

After this commit, the IRQ allocation for MANA was done in two phases.
HWC IRQ allocated earlier and then, queue IRQs dynamically added at a
later point. By this time, the IRQ weights on vCPUs can become imbalanced
and if IRQ count is greater than the vCPU count the topology aware IRQ
distribution logic in MANA can cause multiple MANA IRQs to land on the
same vCPUs, while other sibling vCPUs have none (case 1).

On SMP enabled, low-vCPU systems, this becomes a bigger problem as the
softIRQ handling overhead of two IRQs on the same vCPUs becomes much more
than their overheads if they were spread across sibling vCPUs.

In such cases when many parallel TCP connections are tested, the
throughput drops significantly.

Fix the affinity assignment logic, in cases where the IRQ count is greater
than the vCPU count and when IRQs are added dynamically, by utilizing all
the vCPUs irrespective of their NUMA/core bindings (case 2).

The results of setting the affinity and hint to NULL were also studied,
and we observed that, with this logic if there are pre-existing IRQs
allocated on the VM (apart from MANA), during MANA IRQs allocation, it
leads to clustering of the MANA queue IRQs again (case 3).


=======================================================
Case 1: without this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC		0
IRQ1:	mana_q1		0
IRQ2:	mana_q2		2
IRQ3:	mana_q3		0
IRQ4:	mana_q4		3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU		0	1	2	3
=======================================================
pass 1:		38.85	0.03	24.89	24.65
pass 2:		39.15	0.03	24.57	25.28
pass 3:		40.36	0.03	23.20	23.17

=======================================================
Case 2: with this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         1
IRQ3:   mana_q3         2
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         15.42	15.85	14.99	14.51
pass 2:         15.53	15.94	15.81	15.93
pass 3:         16.41	16.35	16.40	16.36

=======================================================
Case 3: with affinity set to NULL
=======================================================
4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC			0
IRQ1:	mana_q1			2
IRQ2:	mana_q2			3
IRQ3:	mana_q3			2
IRQ4:	mana_q4			3

=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn	with patch	w/o patch	aff NULL
20480		15.65		7.73		5.25
10240		15.63		8.93		5.77
8192		15.64		9.69		7.16
6144		15.64		13.16		9.33
4096		15.69		15.75		13.50
2048		15.69		15.83		13.61
1024		15.71		15.28		13.60

Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Yury Norov <ynorov@nvidia.com>
---
Changes in v5
 * modify commit message to align with fix patch format
---
Changes in v4
 * Add mana prefix on irq_affinity_*() in mana driver
 * Corrected grammar, comment for mana_irq_setup_linear()
 * added new line as per guidelines
 * added case 3 in commit message for when affinity is NULL
---
Changes in v3
 * Optimize the comments in mana_gd_setup_dyn_irqs()
 * add more details in the dev_dbg for extra IRQs
---
Changes in v2
 * Removed the unused skip_first_cpu variable
 * fixed exit condition in irq_setup_linear() with len == 0
 * changed return type of irq_setup_linear() as it will always be 0
 * removed the unnecessary rcu_read_lock() in irq_setup_linear()
 * added appropriate comments to indicate expected behaviour when
   IRQs are more than or equal to num_online_cpus()
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 78 +++++++++++++++----
 1 file changed, 64 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index a0fdd052d7f1..e8b7ffb47eb9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -210,6 +210,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	} else {
 		/* If dynamic allocation is enabled we have already allocated
 		 * hwc msi
+		 * Also, we make sure in this case the following is always true
+		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
 		 */
 		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
 	}
@@ -1909,8 +1911,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
  * do the same thing.
  */
 
-static int irq_setup(unsigned int *irqs, unsigned int len, int node,
-		     bool skip_first_cpu)
+static int mana_irq_setup_numa_aware(unsigned int *irqs, unsigned int len,
+				     int node, bool skip_first_cpu)
 {
 	const struct cpumask *next, *prev = cpu_none_mask;
 	cpumask_var_t cpus __free(free_cpumask_var);
@@ -1946,11 +1948,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
 	return 0;
 }
 
+/* must be called with cpus_read_lock() held */
+static void mana_irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		if (len == 0)
+			break;
+
+		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+		len--;
+	}
+}
+
 static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_irq_context *gic;
-	bool skip_first_cpu = false;
 	int *irqs, err, i, msi;
 
 	irqs = kmalloc_objs(int, nvec);
@@ -1958,10 +1973,12 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 		return -ENOMEM;
 
 	/*
+	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
+	 * nvec is only Queue IRQ (HWC already setup).
 	 * While processing the next pci irq vector, we start with index 1,
 	 * as IRQ vector at index 0 is already processed for HWC.
 	 * However, the population of irqs array starts with index 0, to be
-	 * further used in irq_setup()
+	 * further used in mana_irq_setup_numa_aware()
 	 */
 	for (i = 1; i <= nvec; i++) {
 		msi = i;
@@ -1975,18 +1992,51 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	}
 
 	/*
-	 * When calling irq_setup() for dynamically added IRQs, if number of
-	 * CPUs is more than or equal to allocated MSI-X, we need to skip the
-	 * first CPU sibling group since they are already affinitized to HWC IRQ
+	 * When calling mana_irq_setup_numa_aware() for dynamically added IRQs,
+	 * if number of CPUs is more than or equal to allocated MSI-X, we need to
+	 * skip the first CPU sibling group since they are already affinitized to
+	 * HWC IRQ
 	 */
 	cpus_read_lock();
-	if (gc->num_msix_usable <= num_online_cpus())
-		skip_first_cpu = true;
+	if (gc->num_msix_usable <= num_online_cpus()) {
+		err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node,
+						true);
+		if (err) {
+			cpus_read_unlock();
+			goto free_irq;
+		}
+	} else {
+		/*
+		 * When num_msix_usable are more than num_online_cpus, our
+		 * queue IRQs should be equal to num of online vCPUs.
+		 * We try to make sure queue IRQs spread across all vCPUs.
+		 * In such a case NUMA or CPU core affinity does not matter.
+		 * Note: in this case the total mana IRQ should always be
+		 * num_online_cpus + 1. The first HWC IRQ is already handled
+		 * in HWC setup calls
+		 * However, if CPUs went offline since num_msix_usable was
+		 * computed, queue IRQs will be more than num_online_cpus().
+		 * In such cases remaining extra IRQs will retain their default
+		 * affinity.
+		 */
+		int first_unassigned = num_online_cpus();
 
-	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
-	if (err) {
-		cpus_read_unlock();
-		goto free_irq;
+		if (nvec > first_unassigned) {
+			char buf[32];
+
+			if (first_unassigned == nvec - 1)
+				snprintf(buf, sizeof(buf), "%d",
+					 first_unassigned);
+			else
+				snprintf(buf, sizeof(buf), "%d-%d",
+					 first_unassigned, nvec - 1);
+
+			dev_dbg(&pdev->dev,
+				"MANA IRQ indices #%s will retain the default CPU affinity\n",
+				buf);
+		}
+
+		mana_irq_setup_linear(irqs, nvec);
 	}
 
 	cpus_read_unlock();
@@ -2041,7 +2091,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
 		nvec -= 1;
 	}
 
-	err = irq_setup(irqs, nvec, gc->numa_node, false);
+	err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node, false);
 	if (err) {
 		cpus_read_unlock();
 		goto free_irq;

base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
-- 
2.34.1


^ permalink raw reply related

* [PATCH net] net: enetc: fix potential divide-by-zero when num_vsi is zero
From: wei.fang @ 2026-06-24  7:27 UTC (permalink / raw)
  To: claudiu.manoil, vladimir.oltean, xiaoning.wang, andrew+netdev,
	davem, edumazet, kuba, pabeni
  Cc: Frank.Li, wei.fang, imx, netdev, linux-kernel

From: Wei Fang <wei.fang@nxp.com>

For i.MX94 series, all the standalone ENETCs do not support SR-IOV, so
pf->caps.num_vsi is zero. This leads to a divide-by-zero in
enetc4_default_rings_allocation() when distributing rings among PF and
VFs.

Division by zero is undefined behavior in C. On ARM64, the UDIV/SDIV
instructions silently return zero rather than raising an exception, so
the issue does not cause a visible crash. However, relying on this
behavior is incorrect and poses a cross-platform compatibility risk.

Add an explicit check for num_vsi == 0 and return early after the PF's
rings have been configured.

Fixes: 2d673b0e2f8d ("net: enetc: add standalone ENETC support for i.MX94")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
---
 drivers/net/ethernet/freescale/enetc/enetc4_pf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/freescale/enetc/enetc4_pf.c b/drivers/net/ethernet/freescale/enetc/enetc4_pf.c
index 4e771f852358..437a15bbb47b 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc4_pf.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc4_pf.c
@@ -322,6 +322,9 @@ static void enetc4_default_rings_allocation(struct enetc_pf *pf)
 	val = enetc4_psicfgr0_val_construct(false, num_tx_bdr, num_rx_bdr);
 	enetc_port_wr(hw, ENETC4_PSICFGR0(0), val);

+	if (!pf->caps.num_vsi)
+		return;
+
 	num_rx_bdr = pf->caps.num_rx_bdr - num_rx_bdr;
 	rx_rem = num_rx_bdr % pf->caps.num_vsi;
 	num_rx_bdr = num_rx_bdr / pf->caps.num_vsi;
-- 
2.34.1

^ permalink raw reply related

* Re: [PATCH bpf-next v8 4/7] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests
From: Emil Tsalapatis @ 2026-06-24  7:26 UTC (permalink / raw)
  To: Mahe Tardy, bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song
In-Reply-To: <20260622120515.137082-5-mahe.tardy@gmail.com>

On Mon Jun 22, 2026 at 8:05 AM EDT, Mahe Tardy wrote:
> This test opens a server and client, enters a new cgroup, attach a
> cgroup_skb program on egress and calls the bpf_icmp_send function from
> the client egress so that an ICMP unreach control message is sent back
> to the client. It then fetches the message from the error queue to
> confirm the correct ICMP unreach code has been sent.
>
> Note that, for the client, we have to connect in non-blocking mode to
> let the test execute faster. Otherwise, we need to wait for the TCP
> three-way handshake to timeout in the kernel before reading the errno.
>
> Also note that we don't set IP_RECVERR on the socket in
> connect_to_fd_nonblock since the error will be transferred anyway in our
> test because the connection is rejected at the beginning of the TCP
> handshake. See in net/ipv4/tcp_ipv4.c:tcp_v4_err for more details.
>
> Reviewed-by: Jordan Rife <jordan@jrife.io>
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

> ---
>  .../bpf/prog_tests/icmp_send_kfunc.c          | 151 ++++++++++++++++++
>  tools/testing/selftests/bpf/progs/icmp_send.c |  38 +++++
>  2 files changed, 189 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
>  create mode 100644 tools/testing/selftests/bpf/progs/icmp_send.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> new file mode 100644
> index 000000000000..f4e5b883d4c8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> @@ -0,0 +1,151 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <linux/errqueue.h>
> +#include <poll.h>
> +#include "icmp_send.skel.h"
> +
> +#define TIMEOUT_MS 1000
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +#define ICMP_FRAG_NEEDED 4
> +#define NR_ICMP_UNREACH 15
> +
> +static int connect_to_fd_nonblock(int server_fd)
> +{
> +	struct sockaddr_storage addr;
> +	socklen_t len = sizeof(addr);
> +	int fd, err;
> +
> +	if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
> +		return -1;
> +
> +	fd = socket(addr.ss_family, SOCK_STREAM | SOCK_NONBLOCK, 0);
> +	if (fd < 0)
> +		return -1;
> +
> +	err = connect(fd, (struct sockaddr *)&addr, len);
> +	if (err < 0 && errno != EINPROGRESS) {
> +		close(fd);
> +		return -1;
> +	}
> +
> +	return fd;
> +}
> +
> +static void read_icmp_errqueue(int sockfd, int expected_code)
> +{
> +	struct sock_extended_err *sock_err;
> +	char ctrl_buf[512];
> +	struct msghdr msg = {
> +		.msg_control = ctrl_buf,
> +		.msg_controllen = sizeof(ctrl_buf),
> +	};
> +	struct pollfd pfd = {
> +		.fd = sockfd,
> +		.events = POLLERR,
> +	};
> +	struct cmsghdr *cm;
> +	ssize_t n;
> +
> +	if (!ASSERT_GE(poll(&pfd, 1, TIMEOUT_MS), 1, "poll_errqueue"))
> +		return;
> +
> +	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
> +	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
> +		return;
> +
> +	cm = CMSG_FIRSTHDR(&msg);
> +	if (!ASSERT_NEQ(cm, NULL, "cm_firsthdr_null"))
> +		return;
> +
> +	for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
> +		if (cm->cmsg_level != IPPROTO_IP || cm->cmsg_type != IP_RECVERR)
> +			continue;
> +
> +		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
> +
> +		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
> +			       "sock_err_origin_icmp"))
> +			return;
> +		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
> +			       "sock_err_type_dest_unreach"))
> +			return;
> +		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
> +		return;
> +	}
> +
> +	ASSERT_FAIL("no IP_RECVERR/IPV6_RECVERR control message found");
> +}
> +
> +static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code)
> +{
> +	int srv_fd = -1, client_fd = -1;
> +	struct sockaddr_in addr;
> +	socklen_t len = sizeof(addr);
> +
> +	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0, TIMEOUT_MS);
> +	if (!ASSERT_OK_FD(srv_fd, "start_server"))
> +		return;
> +
> +	if (getsockname(srv_fd, (struct sockaddr *)&addr, &len)) {
> +		close(srv_fd);
> +		return;
> +	}
> +	skel->bss->server_port = ntohs(addr.sin_port);
> +	skel->bss->unreach_code = code;
> +
> +	client_fd = connect_to_fd_nonblock(srv_fd);
> +	if (!ASSERT_OK_FD(client_fd, "client_connect_nonblock")) {
> +		close(srv_fd);
> +		return;
> +	}
> +
> +	/* Skip reading ICMP error queue if code is invalid */
> +	if (code >= 0 && code <= NR_ICMP_UNREACH)
> +		read_icmp_errqueue(client_fd, code);
> +
> +	close(client_fd);
> +	close(srv_fd);
> +}
> +
> +void test_icmp_send_unreach_cgroup(void)
> +{
> +	struct icmp_send *skel;
> +	int cgroup_fd = -1;
> +
> +	skel = icmp_send__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		goto cleanup;
> +
> +	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
> +	if (!ASSERT_OK_FD(cgroup_fd, "join_cgroup"))
> +		goto cleanup;
> +
> +	skel->links.egress =
> +		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
> +		goto cleanup;
> +
> +	for (int code = 0; code <= NR_ICMP_UNREACH; code++) {
> +		/*
> +		 * The TCP stack reacts differently when asking for
> +		 * fragmentation, let's ignore it for now.
> +		 */
> +		if (code == ICMP_FRAG_NEEDED)
> +			continue;
> +
> +		trigger_prog_read_icmp_errqueue(skel, code);
> +		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
> +	}
> +
> +	/* Test an invalid code */
> +	trigger_prog_read_icmp_errqueue(skel, -1);
> +	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
> +
> +cleanup:
> +	icmp_send__destroy(skel);
> +	if (cgroup_fd >= 0)
> +		close(cgroup_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
> new file mode 100644
> index 000000000000..6d0be0a9afe1
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/icmp_send.c
> @@ -0,0 +1,38 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +/* 127.0.0.1 in host byte order */
> +#define SERVER_IP 0x7F000001
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +__u16 server_port = 0;
> +int unreach_code = 0;
> +int kfunc_ret = -1;
> +
> +SEC("cgroup_skb/egress")
> +int egress(struct __sk_buff *skb)
> +{
> +	void *data = (void *)(long)skb->data;
> +	void *data_end = (void *)(long)skb->data_end;
> +	struct iphdr *iph;
> +	struct tcphdr *tcph;
> +
> +	iph = data;
> +	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
> +	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
> +		return SK_PASS;
> +
> +	tcph = (void *)iph + iph->ihl * 4;
> +	if ((void *)(tcph + 1) > data_end ||
> +	    tcph->dest != bpf_htons(server_port))
> +		return SK_PASS;
> +
> +	kfunc_ret = bpf_icmp_send(skb, ICMP_DEST_UNREACH, unreach_code);
> +
> +	return SK_DROP;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> --
> 2.34.1


^ permalink raw reply

* Re: [PATCH bpf-next v8 7/7] selftests/bpf: add bpf_icmp_send recursion test
From: Emil Tsalapatis @ 2026-06-24  7:31 UTC (permalink / raw)
  To: Mahe Tardy, bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song
In-Reply-To: <20260622120515.137082-8-mahe.tardy@gmail.com>

On Mon Jun 22, 2026 at 8:05 AM EDT, Mahe Tardy wrote:
> This test is similar to test_icmp_send_unreach_cgroup but checks that,
> in case of recursion, meaning that the BPF program calling the kfunc was
> re-triggered by the icmp_send done by the kfunc, the kfunc will stop
> early and return -EBUSY.
>
> The test attaches to the root cgroup to ensure the ICMP packet generated
> by the kfunc re-triggers the BPF program. Since it's attached only for
> this recursion test, it should not disrupt the whole network.
>
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

> ---
>  .../bpf/prog_tests/icmp_send_kfunc.c          | 45 +++++++++++++++
>  tools/testing/selftests/bpf/progs/icmp_send.c | 56 +++++++++++++++++++
>  2 files changed, 101 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> index 66447681f72d..fd4b8fa78a01 100644
> --- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> @@ -1,8 +1,10 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <test_progs.h>
>  #include <network_helpers.h>
> +#include <cgroup_helpers.h>
>  #include <linux/errqueue.h>
>  #include <poll.h>
> +#include <unistd.h>
>  #include "icmp_send.skel.h"
>
>  #define TIMEOUT_MS 1000
> @@ -10,6 +12,7 @@
>  #define ICMP_DEST_UNREACH 3
>  #define ICMPV6_DEST_UNREACH 1
>
> +#define ICMP_HOST_UNREACH 1
>  #define ICMP_FRAG_NEEDED 4
>  #define NR_ICMP_UNREACH 15
>  #define ICMPV6_REJECT_ROUTE 6
> @@ -203,3 +206,45 @@ void test_icmp_send_unreach_tc(void)
>  	bpf_link__destroy(link);
>  	icmp_send__destroy(skel);
>  }
> +
> +void test_icmp_send_unreach_recursion(void)
> +{
> +	struct icmp_send *skel;
> +	int cgroup_fd = -1;
> +
> +	skel = icmp_send__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		goto cleanup;
> +
> +	if (setup_cgroup_environment()) {
> +		fprintf(stderr, "Failed to setup cgroup environment\n");
> +		goto cleanup;
> +	}
> +
> +	cgroup_fd = get_root_cgroup();
> +	if (!ASSERT_OK_FD(cgroup_fd, "get_root_cgroup"))
> +		goto cleanup;
> +
> +	skel->data->target_pid = getpid();
> +	skel->links.recursion =
> +		bpf_program__attach_cgroup(skel->progs.recursion, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links.recursion, "prog_attach_cgroup"))
> +		goto cleanup;
> +
> +	trigger_prog_read_icmp_errqueue(skel, ICMP_HOST_UNREACH, AF_INET,
> +					"127.0.0.1");
> +
> +	/*
> +	 * Because there's recursion involved, the first call will return at
> +	 * index 1 since it will return the second, and the second call will
> +	 * return at index 0 since it will return the first.
> +	 */
> +	ASSERT_EQ(skel->data->rec_kfunc_rets[0], -EBUSY, "kfunc_rets[0]");
> +	ASSERT_EQ(skel->data->rec_kfunc_rets[1], 0, "kfunc_rets[1]");
> +
> +cleanup:
> +	cleanup_cgroup_environment();
> +	icmp_send__destroy(skel);
> +	if (cgroup_fd >= 0)
> +		close(cgroup_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
> index 5fa5467bdb70..fd9c7684797b 100644
> --- a/tools/testing/selftests/bpf/progs/icmp_send.c
> +++ b/tools/testing/selftests/bpf/progs/icmp_send.c
> @@ -13,6 +13,10 @@ __u16 server_port = 0;
>  int unreach_type = 0;
>  int unreach_code = 0;
>  int kfunc_ret = -1;
> +int target_pid = -1;
> +
> +unsigned int rec_count = 0;
> +int rec_kfunc_rets[] = { -1, -1 };
>
>  SEC("cgroup_skb/egress")
>  int egress(struct __sk_buff *skb)
> @@ -125,4 +129,56 @@ int tc_egress(struct __sk_buff *skb)
>  	return TCX_DROP;
>  }
>
> +SEC("cgroup_skb/egress")
> +int recursion(struct __sk_buff *skb)
> +{
> +	void *data = (void *)(long)skb->data;
> +	void *data_end = (void *)(long)skb->data_end;
> +	struct icmphdr *icmph;
> +	struct tcphdr *tcph;
> +	struct iphdr *iph;
> +	int ret;
> +
> +	if ((bpf_get_current_pid_tgid() >> 32) != target_pid)
> +		return SK_PASS;
> +
> +	iph = data;
> +	if ((void *)(iph + 1) > data_end || iph->version != 4)
> +		return SK_PASS;
> +
> +	if (iph->daddr != bpf_htonl(SERVER_IP))
> +		return SK_PASS;
> +
> +	if (iph->protocol == IPPROTO_TCP) {
> +		tcph = (void *)iph + iph->ihl * 4;
> +		if ((void *)(tcph + 1) > data_end ||
> +		    tcph->dest != bpf_htons(server_port))
> +			return SK_PASS;
> +	} else if (iph->protocol == IPPROTO_ICMP) {
> +		icmph = (void *)iph + iph->ihl * 4;
> +		if ((void *)(icmph + 1) > data_end ||
> +		    icmph->type != unreach_type ||
> +		    icmph->code != unreach_code)
> +			return SK_PASS;
> +	} else {
> +		return SK_PASS;
> +	}
> +
> +	/*
> +	 * This call will provoke a recursion: the ICMP packet generated by the
> +	 * kfunc will re-trigger this program since we are in the root cgroup in
> +	 * which the kernel ICMP socket belongs. However when re-entering the
> +	 * kfunc, it should return EBUSY.
> +	 */
> +	ret = bpf_icmp_send(skb, unreach_type, unreach_code);
> +	rec_kfunc_rets[rec_count & 1] = ret;
> +	__sync_fetch_and_add(&rec_count, 1);
> +
> +	/* Let the first ICMP error message pass */
> +	if (iph->protocol == IPPROTO_ICMP)
> +		return SK_PASS;
> +
> +	return SK_DROP;
> +}
> +
>  char LICENSE[] SEC("license") = "Dual BSD/GPL";
> --
> 2.34.1


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox