* [PATCH bpf-next v3 2/3] bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
From: Avinash Duduskar @ 2026-06-17 22:47 UTC (permalink / raw)
To: ast, daniel, andrii
Cc: ameryhung, a.s.protopopov, bpf, davem, dsahern, eddyz87, edumazet,
emil, eyal.birger, hawk, horms, john.fastabend, jolsa, kpsingh,
kuba, leon.hwang, linux-kernel, linux-kselftest, martin.lau,
memxor, netdev, pabeni, rongtao, sdf, shuah, song, toke, yatsenko,
yonghong.song
In-Reply-To: <20260617224729.1428662-1-avinash.duduskar@gmail.com>
BPF_FIB_LOOKUP_VLAN resolves a VLAN egress. The reverse is also
useful: an XDP program receiving a VLAN-tagged frame on a physical
device wants the lookup to behave as if the packet had arrived on the
corresponding VLAN subinterface, so iif-based policy routing and VRF
table selection use the right ingress.
Add BPF_FIB_LOOKUP_VLAN_INPUT. When set, params->h_vlan_proto and
params->h_vlan_TCI are read as an input VLAN tag and the matching VLAN
device of params->ifindex is resolved with __vlan_find_dev_deep_rcu().
The device must be up and in the same network namespace as
params->ifindex (a VLAN device can be moved to another netns while
registered on its parent; receive would deliver into that other
namespace, which a lookup here cannot represent). If params->ifindex
is itself a VLAN device, its inner (QinQ) subinterface is matched.
For a bond or team, a tag on a port matches no device and returns
NOT_FWDED; pass the master's ifindex.
The lookup then runs with the resolved device as the ingress;
params->ifindex itself is not modified on the input side. When the
resolved device is enslaved to a VRF, both the full lookup (via the
l3mdev rule) and BPF_FIB_LOOKUP_DIRECT (via l3mdev_fib_table_rcu())
select the VRF's table from the resolved ingress. That follows from
feeding the resolved device to the flow as the ingress
(fl4.flowi4_iif = dev->ifindex), which is what makes l3mdev resolve
the VRF master from the subinterface rather than from
params->ifindex.
The two failure classes get different treatment on purpose. A
h_vlan_proto other than 802.1Q/802.1ad is API misuse and returns
-EINVAL, since it would otherwise reach the WARN in vlan_proto_idx()
with a program-controlled value. An unmatched VID, a device that is
down, or one in another namespace is a data outcome and returns
BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
fib_get_table() finds no table and mirroring real ingress, where the
receive path drops such frames. A VID of 0 (a priority tag) is looked
up literally and normally fails the same way; receive instead
processes such frames untagged, so callers should not set the flag for
priority tags. Proceeding on the physical device for any of these
would be fail-open for the policy-routing cases above.
The h_vlan fields share a union with tbid, so the flag cannot be
combined with BPF_FIB_LOOKUP_TBID. It describes ingress, so it also
cannot be combined with BPF_FIB_LOOKUP_OUTPUT. Both combinations
return -EINVAL; restricting now keeps a later relaxation backward
compatible. Combining with BPF_FIB_LOOKUP_VLAN is allowed: the tag is
consumed on the ingress side and the egress tag is written on
success.
Under !CONFIG_VLAN_8021Q the __vlan_find_dev_deep_rcu() stub returns
NULL, so every lookup with the flag returns NOT_FWDED, which is
correct since no VLAN device can exist.
Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
include/uapi/linux/bpf.h | 21 ++++++++++-
net/core/filter.c | 66 +++++++++++++++++++++++++++++++---
tools/include/uapi/linux/bpf.h | 21 ++++++++++-
3 files changed, 101 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f1ac9266a2ab..23bc3109619d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3547,6 +3547,22 @@ union bpf_attr {
* are written only on success; other output fields keep
* the helper's existing behaviour, so a frag-needed result
* still reports the route mtu in *params*->mtu_result.
+ * **BPF_FIB_LOOKUP_VLAN_INPUT**
+ * Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ * as an input VLAN tag and run the lookup as if ingress
+ * had happened on the VLAN subinterface carrying that tag
+ * on *params*->ifindex. The VID is the low 12 bits of
+ * *params*->h_vlan_TCI; *params*->h_vlan_proto must be
+ * ETH_P_8021Q or ETH_P_8021AD in network byte order, else
+ * **-EINVAL**. If *params*->ifindex is itself a VLAN
+ * device, its inner (QinQ) subinterface is matched; for a
+ * bond or team, pass the master's ifindex. An unmatched
+ * tag, a down device, or one in another namespace returns
+ * **BPF_FIB_LKUP_RET_NOT_FWDED**, mirroring real ingress.
+ * A VID of 0 is looked up literally, so do not set this
+ * flag for priority-tagged frames. Cannot be combined with
+ * **BPF_FIB_LOOKUP_TBID** or **BPF_FIB_LOOKUP_OUTPUT**
+ * (returns **-EINVAL**).
*
* *ctx* is either **struct xdp_md** for XDP programs or
* **struct sk_buff** tc cls_act programs.
@@ -7343,6 +7359,7 @@ enum {
BPF_FIB_LOOKUP_SRC = (1U << 4),
BPF_FIB_LOOKUP_MARK = (1U << 5),
BPF_FIB_LOOKUP_VLAN = (1U << 6),
+ BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
};
enum {
@@ -7412,7 +7429,9 @@ struct bpf_fib_lookup {
/*
* output with BPF_FIB_LOOKUP_VLAN: set from the
* resolved egress VLAN device (see the flag); zeroed
- * on other successful lookups.
+ * on other successful lookups. input with
+ * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+ * the lookup by.
*/
__be16 h_vlan_proto;
__be16 h_vlan_TCI;
diff --git a/net/core/filter.c b/net/core/filter.c
index 27e4792f11e9..399adf2a824a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6226,6 +6226,25 @@ static int bpf_fib_set_fwd_params(struct net_device *dev,
return 0;
}
+
+static struct net_device *bpf_fib_vlan_input_dev(struct net_device *dev,
+ const struct bpf_fib_lookup *params)
+{
+ __be16 proto = params->h_vlan_proto;
+ struct net_device *vlan_dev;
+ u16 vid;
+
+ if (proto != htons(ETH_P_8021Q) && proto != htons(ETH_P_8021AD))
+ return ERR_PTR(-EINVAL);
+
+ vid = ntohs(params->h_vlan_TCI) & VLAN_VID_MASK;
+ vlan_dev = __vlan_find_dev_deep_rcu(dev, proto, vid);
+ if (!vlan_dev || !(vlan_dev->flags & IFF_UP) ||
+ !net_eq(dev_net(vlan_dev), dev_net(dev)))
+ return NULL;
+
+ return vlan_dev;
+}
#endif
#if IS_ENABLED(CONFIG_INET)
@@ -6246,6 +6265,14 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
if (unlikely(!dev))
return -ENODEV;
+ if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+ dev = bpf_fib_vlan_input_dev(dev, params);
+ if (IS_ERR(dev))
+ return PTR_ERR(dev);
+ if (!dev)
+ return BPF_FIB_LKUP_RET_NOT_FWDED;
+ }
+
/* verify forwarding is enabled on this interface */
in_dev = __in_dev_get_rcu(dev);
if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
@@ -6255,7 +6282,11 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
fl4.flowi4_iif = 1;
fl4.flowi4_oif = params->ifindex;
} else {
- fl4.flowi4_iif = params->ifindex;
+ /*
+ * dev->ifindex, not params->ifindex: VLAN_INPUT may have
+ * resolved dev to a subinterface above.
+ */
+ fl4.flowi4_iif = dev->ifindex;
fl4.flowi4_oif = 0;
}
fl4.flowi4_dscp = inet_dsfield_to_dscp(params->tos);
@@ -6394,6 +6425,14 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
if (unlikely(!dev))
return -ENODEV;
+ if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+ dev = bpf_fib_vlan_input_dev(dev, params);
+ if (IS_ERR(dev))
+ return PTR_ERR(dev);
+ if (!dev)
+ return BPF_FIB_LKUP_RET_NOT_FWDED;
+ }
+
idev = __in6_dev_get_safely(dev);
if (unlikely(!idev || !READ_ONCE(idev->cnf.forwarding)))
return BPF_FIB_LKUP_RET_FWD_DISABLED;
@@ -6402,7 +6441,12 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
fl6.flowi6_iif = 1;
oif = fl6.flowi6_oif = params->ifindex;
} else {
- oif = fl6.flowi6_iif = params->ifindex;
+ /*
+ * dev->ifindex, not params->ifindex: VLAN_INPUT may have
+ * resolved dev to a subinterface above.
+ */
+ oif = dev->ifindex;
+ fl6.flowi6_iif = oif;
fl6.flowi6_oif = 0;
strict = RT6_LOOKUP_F_HAS_SADDR;
}
@@ -6515,7 +6559,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
#define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK | \
- BPF_FIB_LOOKUP_VLAN)
+ BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_VLAN_INPUT)
+
+static bool bpf_fib_lookup_flags_ok(u32 flags)
+{
+ if (flags & ~BPF_FIB_LOOKUP_MASK)
+ return false;
+
+ if ((flags & BPF_FIB_LOOKUP_VLAN_INPUT) &&
+ (flags & (BPF_FIB_LOOKUP_TBID | BPF_FIB_LOOKUP_OUTPUT)))
+ return false;
+
+ return true;
+}
BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
struct bpf_fib_lookup *, params, int, plen, u32, flags)
@@ -6523,7 +6579,7 @@ BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
if (plen < sizeof(*params))
return -EINVAL;
- if (flags & ~BPF_FIB_LOOKUP_MASK)
+ if (!bpf_fib_lookup_flags_ok(flags))
return -EINVAL;
switch (params->family) {
@@ -6562,7 +6618,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
if (plen < sizeof(*params))
return -EINVAL;
- if (flags & ~BPF_FIB_LOOKUP_MASK)
+ if (!bpf_fib_lookup_flags_ok(flags))
return -EINVAL;
if (params->tot_len)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f1ac9266a2ab..23bc3109619d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3547,6 +3547,22 @@ union bpf_attr {
* are written only on success; other output fields keep
* the helper's existing behaviour, so a frag-needed result
* still reports the route mtu in *params*->mtu_result.
+ * **BPF_FIB_LOOKUP_VLAN_INPUT**
+ * Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ * as an input VLAN tag and run the lookup as if ingress
+ * had happened on the VLAN subinterface carrying that tag
+ * on *params*->ifindex. The VID is the low 12 bits of
+ * *params*->h_vlan_TCI; *params*->h_vlan_proto must be
+ * ETH_P_8021Q or ETH_P_8021AD in network byte order, else
+ * **-EINVAL**. If *params*->ifindex is itself a VLAN
+ * device, its inner (QinQ) subinterface is matched; for a
+ * bond or team, pass the master's ifindex. An unmatched
+ * tag, a down device, or one in another namespace returns
+ * **BPF_FIB_LKUP_RET_NOT_FWDED**, mirroring real ingress.
+ * A VID of 0 is looked up literally, so do not set this
+ * flag for priority-tagged frames. Cannot be combined with
+ * **BPF_FIB_LOOKUP_TBID** or **BPF_FIB_LOOKUP_OUTPUT**
+ * (returns **-EINVAL**).
*
* *ctx* is either **struct xdp_md** for XDP programs or
* **struct sk_buff** tc cls_act programs.
@@ -7343,6 +7359,7 @@ enum {
BPF_FIB_LOOKUP_SRC = (1U << 4),
BPF_FIB_LOOKUP_MARK = (1U << 5),
BPF_FIB_LOOKUP_VLAN = (1U << 6),
+ BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
};
enum {
@@ -7412,7 +7429,9 @@ struct bpf_fib_lookup {
/*
* output with BPF_FIB_LOOKUP_VLAN: set from the
* resolved egress VLAN device (see the flag); zeroed
- * on other successful lookups.
+ * on other successful lookups. input with
+ * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+ * the lookup by.
*/
__be16 h_vlan_proto;
__be16 h_vlan_TCI;
--
2.54.0
^ permalink raw reply related
* [PATCH bpf-next v3 3/3] selftests/bpf: Add bpf_fib_lookup() VLAN flag tests
From: Avinash Duduskar @ 2026-06-17 22:47 UTC (permalink / raw)
To: ast, daniel, andrii
Cc: ameryhung, a.s.protopopov, bpf, davem, dsahern, eddyz87, edumazet,
emil, eyal.birger, hawk, horms, john.fastabend, jolsa, kpsingh,
kuba, leon.hwang, linux-kernel, linux-kselftest, martin.lau,
memxor, netdev, pabeni, rongtao, sdf, shuah, song, toke, yatsenko,
yonghong.song
In-Reply-To: <20260617224729.1428662-1-avinash.duduskar@gmail.com>
Cover both directions of the new VLAN flags in the fib_lookup test,
36 table cases plus a dedicated cross-netns subtest.
For BPF_FIB_LOOKUP_VLAN the egress cases assert: without the flag the
lookup returns the VLAN netdev's ifindex and zeroed vlan fields, with
the flag it returns the parent's ifindex plus the tag (including via
a neighbour resolved on the VLAN device, in OUTPUT mode, over a bond,
and through a DIRECT|TBID table), with the flag on a non-VLAN egress
it changes nothing, for a stacked VLAN it leaves ifindex untouched
with the vlan fields zero, and a frag-needed return reports the route
mtu in mtu_result while leaving the swap unwritten.
For BPF_FIB_LOOKUP_VLAN_INPUT, an iif rule on the subinterface routes
the same destination to a different gateway, so the asserted gateway
shows which device the lookup used as ingress: without the flag the
main table answers, with a matching tag the subinterface's table
does, with or without SKIP_NEIGH, and BPF_FIB_LOOKUP_SRC selects the
subinterface's address. A VRF-enslaved subinterface selects the VRF
table through the l3mdev rule and, with DIRECT, through
l3mdev_fib_table_rcu(). One case sets BPF_FIB_LOOKUP_VLAN as well and
asserts both directions work in a single lookup. Resolution semantics
are pinned: an 802.1ad tag resolves its device, PCP and DEI bits in
h_vlan_TCI are ignored, a VLAN ifindex resolves the inner QinQ
device, a tag on a bond master resolves while the same tag on the
bond port does not.
The error cases assert -EINVAL for an invalid h_vlan_proto on both
address families, for the TBID and OUTPUT flag combinations and for
an unknown flag bit, and BPF_FIB_LKUP_RET_NOT_FWDED for a VID with no
configured device on both families, for a VID-0 priority tag and for
a device that exists but is down. The failure cases also assert that
params is left untouched.
A separate subtest moves a VLAN device into a second netns while it
stays registered on its parent, and checks both directions refuse to
cross the boundary: the input flag fails closed with the tag and
ifindex untouched, and the egress flag does not publish the foreign
parent's ifindex.
The tbid read-back check is skipped for DIRECT cases that set
BPF_FIB_LOOKUP_VLAN, since a successful swap packs the vlan fields
into the union the check reads.
Re-run the cases through bpf_xdp_fib_lookup() as well: the egress flag
exists because VLAN devices have no XDP xmit, so XDP is the primary
consumer. bpf_prog_test_run uses the netns' loopback for the xdp context's
device, so the lookup runs against the test netns' FIB, and the
path-independent results (return code, swapped ifindex, vlan tag, gateway)
are asserted to match the skb path.
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
.../selftests/bpf/prog_tests/fib_lookup.c | 554 +++++++++++++++++-
.../testing/selftests/bpf/progs/fib_lookup.c | 9 +
2 files changed, 559 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/fib_lookup.c b/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
index bd7658958004..987a691fe078 100644
--- a/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
+++ b/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
@@ -2,6 +2,7 @@
/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
#include <linux/rtnetlink.h>
+#include <linux/if_ether.h>
#include <sys/types.h>
#include <net/if.h>
@@ -37,6 +38,41 @@
#define IPV6_LOCAL "fd01::3"
#define IPV6_GW1 "fd01::1"
#define IPV6_GW2 "fd01::2"
+#define VLAN_ID 100
+#define VLAN_IFACE "veth1.100"
+#define VLAN_ID_DOWN 102
+#define VLAN_IFACE_DOWN "veth1.102"
+#define QINQ_OUTER_IFACE "veth1.200"
+#define QINQ_INNER_IFACE "veth1.200.300"
+#define VLAN_TABLE "300"
+#define IPV4_VLAN_IFACE_ADDR "10.5.0.254"
+#define IPV4_VLAN_EGRESS_DST "10.5.0.2"
+#define IPV4_QINQ_DST "10.7.0.2"
+#define IPV4_VLAN_DST "10.6.0.2"
+#define IPV4_VLAN_GW "10.5.0.1"
+#define IPV6_VLAN_IFACE_ADDR "fd02::254"
+#define IPV6_VLAN_EGRESS_DST "fd02::2"
+#define IPV6_VLAN_DST "fd03::2"
+#define IPV6_VLAN_GW "fd02::1"
+#define VLAN_VID_UNUSED 999
+#define VRF_IFACE "vrf-blue"
+#define VRF_TABLE "1000"
+#define VRF_VLAN_ID 101
+#define VRF_VLAN_IFACE "veth1.101"
+#define IPV4_VRF_IFACE_ADDR "10.8.0.254"
+#define IPV4_VRF_GW "10.8.0.1"
+#define IPV4_VRF_DST "10.9.0.2"
+#define TBID_VLAN_ID 50
+#define TBID_VLAN_IFACE "veth2.50"
+#define IPV4_TBID_VLAN_DST "172.2.0.2"
+#define IPV4_BOND_VLAN_DST "10.11.0.2"
+#define IPV4_VLAN_MTU_DST "10.5.9.2"
+#define QINQ_AD_VLAN_ID 200
+#define QINQ_INNER_VLAN_ID 300
+#define BOND_IFACE "bond99"
+#define BOND_PORT "veth3"
+#define BOND_PORT_PEER "veth4"
+#define BOND_VLAN_ID 500
#define DMAC "11:11:11:11:11:11"
#define DMAC_INIT { 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, }
#define DMAC2 "01:01:01:01:01:01"
@@ -52,6 +88,17 @@ struct fib_lookup_test {
__u32 tbid;
__u8 dmac[6];
__u32 mark;
+ /*
+ * input tag with BPF_FIB_LOOKUP_VLAN_INPUT; expected output tag
+ * with BPF_FIB_LOOKUP_VLAN (checked when check_vlan is set)
+ */
+ __u16 vlan_proto;
+ __u16 vlan_id;
+ bool check_vlan;
+ const char *expected_dev; /* expected params->ifindex after lookup */
+ const char *iif; /* override the default veth1 input device */
+ __u16 tot_len; /* triggers the in-lookup mtu check when set */
+ __u16 expected_mtu; /* expected mtu_result (union with tot_len) */
};
static const struct fib_lookup_test tests[] = {
@@ -142,6 +189,209 @@ static const struct fib_lookup_test tests[] = {
.expected_dst = IPV6_GW1,
.lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
.mark = MARK, },
+ /* vlan egress resolution */
+ { .desc = "IPv4 VLAN egress, no flag",
+ .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = VLAN_IFACE, .check_vlan = true, },
+ { .desc = "IPv4 VLAN egress, single VLAN",
+ .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = "veth1", .check_vlan = true,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+ /*
+ * skb path without tot_len: mtu_result is the FIB result (VLAN)
+ * device's mtu (1400) with or without the swap, not the parent's (1500)
+ */
+ { .desc = "IPv4 VLAN egress, skb-path mtu is the VLAN device's without the flag",
+ .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = VLAN_IFACE, .check_vlan = true, .expected_mtu = 1400, },
+ { .desc = "IPv4 VLAN egress, skb-path mtu stays the VLAN device's after the swap",
+ .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = "veth1", .check_vlan = true,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .expected_mtu = 1400, },
+ { .desc = "IPv4 VLAN egress, flag set but egress is not a VLAN",
+ .daddr = IPV4_NUD_FAILED_ADDR, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = "veth1", .check_vlan = true, },
+ { .desc = "IPv4 VLAN egress, stacked VLAN untouched",
+ .daddr = IPV4_QINQ_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = QINQ_INNER_IFACE, .check_vlan = true, },
+ { .desc = "IPv6 VLAN egress, single VLAN",
+ .daddr = IPV6_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = "veth1", .check_vlan = true,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+ { .desc = "IPv4 VLAN egress, neighbour on the VLAN device",
+ .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN,
+ .expected_dev = "veth1", .check_vlan = true,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .dmac = DMAC_INIT, },
+ { .desc = "IPv4 VLAN egress in OUTPUT mode",
+ .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .iif = VLAN_IFACE,
+ .lookup_flags = BPF_FIB_LOOKUP_OUTPUT | BPF_FIB_LOOKUP_VLAN |
+ BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = "veth1", .check_vlan = true,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+ { .desc = "IPv4 VLAN egress over a bond",
+ .daddr = IPV4_BOND_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = BOND_IFACE, .check_vlan = true,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+ { .desc = "IPv4 VLAN egress via TBID table",
+ .daddr = IPV4_TBID_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID |
+ BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .tbid = 100,
+ .expected_dev = "veth2", .check_vlan = true,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = TBID_VLAN_ID, },
+ { .desc = "IPv4 VLAN egress, success writes mtu_result with the swap",
+ .daddr = IPV4_VLAN_MTU_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .tot_len = 500, .expected_mtu = 1000,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = "veth1", .check_vlan = true,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+ { .desc = "IPv4 VLAN egress, FRAG_NEEDED reports mtu, swap unwritten",
+ .daddr = IPV4_VLAN_MTU_DST, .expected_ret = BPF_FIB_LKUP_RET_FRAG_NEEDED,
+ .tot_len = 1400, .expected_mtu = 1000,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .expected_dev = "veth1", .check_vlan = true, },
+ /* vlan tag as lookup input */
+ { .desc = "IPv4 VLAN input, no flag",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .expected_dst = IPV4_GW1,
+ .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH, },
+ { .desc = "IPv4 VLAN input, tag selects subinterface route",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .expected_dst = IPV4_VLAN_GW, .expected_dev = VLAN_IFACE,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+ { .desc = "IPv6 VLAN input, tag selects subinterface route",
+ .daddr = IPV6_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .expected_dst = IPV6_VLAN_GW, .expected_dev = VLAN_IFACE,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+ { .desc = "IPv4 VLAN input and egress combined",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .expected_dst = IPV4_VLAN_GW, .expected_dev = "veth1",
+ .check_vlan = true,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_VLAN |
+ BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+ { .desc = "IPv4 VLAN input, neighbour resolved on the route",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .expected_dst = IPV4_VLAN_GW, .expected_dev = VLAN_IFACE,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .dmac = DMAC_INIT2, },
+ { .desc = "IPv4 VLAN input, source address from the subinterface",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .expected_src = IPV4_VLAN_IFACE_ADDR,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SRC |
+ BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+ /*
+ * VRF: the resolved subinterface is enslaved, so the l3mdev rule
+ * (full lookup) and l3mdev_fib_table_rcu() (DIRECT) must select
+ * the VRF table from the resolved ingress
+ */
+ { .desc = "IPv4 VLAN input, VRF subinterface, no flag",
+ .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .expected_dst = IPV4_GW1,
+ .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH, },
+ { .desc = "IPv4 VLAN input, tag selects VRF table",
+ .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .expected_dst = IPV4_VRF_GW, .expected_dev = VRF_VLAN_IFACE,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VRF_VLAN_ID, },
+ { .desc = "IPv4 VLAN input, DIRECT uses VRF table from resolved ingress",
+ .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .expected_dst = IPV4_VRF_GW, .expected_dev = VRF_VLAN_IFACE,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_DIRECT |
+ BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VRF_VLAN_ID, },
+ /*
+ * failure arms also assert params is left untouched: ifindex still
+ * names the physical device and the input tag bytes survive
+ */
+ { .desc = "IPv4 VLAN input, invalid proto",
+ .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+ .expected_dev = "veth1", .check_vlan = true,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = 0x1234, .vlan_id = VLAN_ID, },
+ { .desc = "IPv4 VLAN input, unmatched VID",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+ .expected_dev = "veth1", .check_vlan = true,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_VID_UNUSED, },
+ { .desc = "IPv4 VLAN input, subinterface down",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+ .expected_dev = "veth1", .check_vlan = true,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID_DOWN, },
+ /*
+ * the resolver runs before the forwarding check, so on devices
+ * with forwarding off FWD_DISABLED (not NOT_FWDED) proves the tag
+ * resolved to that device and the lookup used it as ingress
+ */
+ { .desc = "IPv4 VLAN input, 802.1ad tag",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021AD, .vlan_id = QINQ_AD_VLAN_ID, },
+ { .desc = "IPv4 VLAN input, PCP and DEI bits ignored in TCI",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+ .expected_dst = IPV4_VLAN_GW,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = 0xe000 | VLAN_ID, },
+ { .desc = "IPv4 VLAN input, inner QinQ device from VLAN ifindex",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+ .iif = QINQ_OUTER_IFACE,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = QINQ_INNER_VLAN_ID, },
+ /*
+ * bonding: the VLANs live on the master, as on receive, where the
+ * frame is steered to the master before VLAN processing; a port
+ * ifindex does not match (ports carry vid state but no VLAN devs)
+ */
+ { .desc = "IPv4 VLAN input, tag on bond master resolves",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+ .iif = BOND_IFACE,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+ { .desc = "IPv4 VLAN input, tag on bond port does not match",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+ .iif = BOND_PORT, .expected_dev = BOND_PORT, .check_vlan = true,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+ { .desc = "IPv6 VLAN input, invalid proto",
+ .daddr = IPV6_VLAN_DST, .expected_ret = -EINVAL,
+ .expected_dev = "veth1", .check_vlan = true,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = 0x1234, .vlan_id = VLAN_ID, },
+ { .desc = "IPv4 VLAN input, VID 0 priority tag fails closed",
+ .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+ .expected_dev = "veth1", .check_vlan = true,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = 0, },
+ { .desc = "IPv6 VLAN input, unmatched VID",
+ .daddr = IPV6_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+ .expected_dev = "veth1", .check_vlan = true,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_VID_UNUSED, },
+ { .desc = "unknown flag bit rejected",
+ .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+ .lookup_flags = (1 << 14) | BPF_FIB_LOOKUP_SKIP_NEIGH, },
+ { .desc = "IPv4 VLAN input rejected with TBID",
+ .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_TBID,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+ { .desc = "IPv4 VLAN input rejected with OUTPUT",
+ .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+ .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_OUTPUT,
+ .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
};
static int setup_netns(void)
@@ -204,6 +454,110 @@ static int setup_netns(void)
SYS(fail, "ip rule add prio 2 fwmark %d lookup %s", MARK, MARK_TABLE);
SYS(fail, "ip -6 rule add prio 2 fwmark %d lookup %s", MARK, MARK_TABLE);
+ /*
+ * Setup for vlan tests: a subinterface for egress resolution and
+ * tag-as-input, a QinQ stack, and an iif rule so the input tests
+ * observe which device the lookup used as ingress.
+ */
+ SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+ VLAN_IFACE, VLAN_ID);
+ SYS(fail, "ip link set dev %s up", VLAN_IFACE);
+ /*
+ * lower than the veth1 parent (1500): the skb-path mtu check uses the
+ * FIB result (VLAN) device, so mtu_result is this value with or
+ * without the egress swap, which two arms below pin
+ */
+ SYS(fail, "ip link set dev %s mtu 1400", VLAN_IFACE);
+ SYS(fail, "ip addr add %s/24 dev %s", IPV4_VLAN_IFACE_ADDR, VLAN_IFACE);
+ SYS(fail, "ip addr add %s/64 dev %s nodad", IPV6_VLAN_IFACE_ADDR, VLAN_IFACE);
+
+ /*
+ * stays down: the input flag must treat its tag the way real
+ * ingress treats a frame arriving on a down VLAN device (drop)
+ */
+ SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+ VLAN_IFACE_DOWN, VLAN_ID_DOWN);
+
+ err = write_sysctl("/proc/sys/net/ipv4/conf/" VLAN_IFACE "/forwarding", "1");
+ if (!ASSERT_OK(err, "write_sysctl(net.ipv4.conf." VLAN_IFACE ".forwarding)"))
+ goto fail;
+
+ err = write_sysctl("/proc/sys/net/ipv6/conf/" VLAN_IFACE "/forwarding", "1");
+ if (!ASSERT_OK(err, "write_sysctl(net.ipv6.conf." VLAN_IFACE ".forwarding)"))
+ goto fail;
+
+ SYS(fail, "ip link add link veth1 name %s type vlan proto 802.1ad id 200",
+ QINQ_OUTER_IFACE);
+ SYS(fail, "ip link add link %s name %s type vlan id 300",
+ QINQ_OUTER_IFACE, QINQ_INNER_IFACE);
+ SYS(fail, "ip link set dev %s up", QINQ_OUTER_IFACE);
+ SYS(fail, "ip link set dev %s up", QINQ_INNER_IFACE);
+ SYS(fail, "ip route add %s/32 dev %s", IPV4_QINQ_DST, QINQ_INNER_IFACE);
+
+ SYS(fail, "ip route add %s/32 via %s", IPV4_VLAN_DST, IPV4_GW1);
+ SYS(fail, "ip route add table %s %s/32 via %s",
+ VLAN_TABLE, IPV4_VLAN_DST, IPV4_VLAN_GW);
+ SYS(fail, "ip rule add prio 3 iif %s lookup %s", VLAN_IFACE, VLAN_TABLE);
+ SYS(fail, "ip -6 route add %s/128 via %s", IPV6_VLAN_DST, IPV6_GW1);
+ SYS(fail, "ip -6 route add table %s %s/128 via %s",
+ VLAN_TABLE, IPV6_VLAN_DST, IPV6_VLAN_GW);
+ SYS(fail, "ip -6 rule add prio 3 iif %s lookup %s", VLAN_IFACE, VLAN_TABLE);
+
+ /*
+ * a bond with one port and a VLAN on the bond: VLANs on a bond
+ * live on the master, so resolution succeeds for the master's
+ * ifindex and fails closed for a port's, matching receive, which
+ * steers the frame to the master before VLAN processing
+ */
+ SYS(fail, "ip link add %s type bond", BOND_IFACE);
+ SYS(fail, "ip link add %s type veth peer name %s", BOND_PORT, BOND_PORT_PEER);
+ SYS(fail, "ip link set %s master %s", BOND_PORT, BOND_IFACE);
+ SYS(fail, "ip link set dev %s up", BOND_IFACE);
+ SYS(fail, "ip link set dev %s up", BOND_PORT);
+ SYS(fail, "ip link add link %s name %s.%d type vlan id %d",
+ BOND_IFACE, BOND_IFACE, BOND_VLAN_ID, BOND_VLAN_ID);
+ SYS(fail, "ip link set dev %s.%d up", BOND_IFACE, BOND_VLAN_ID);
+ SYS(fail, "ip route add %s/32 dev %s.%d",
+ IPV4_BOND_VLAN_DST, BOND_IFACE, BOND_VLAN_ID);
+
+ /*
+ * a VRF with its own dedicated subinterface (the iif rules above
+ * must not see it), for the table-selection-by-ingress cases
+ */
+ SYS(fail, "ip link add %s type vrf table %s", VRF_IFACE, VRF_TABLE);
+ SYS(fail, "ip link set dev %s up", VRF_IFACE);
+ SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+ VRF_VLAN_IFACE, VRF_VLAN_ID);
+ SYS(fail, "ip link set %s master %s", VRF_VLAN_IFACE, VRF_IFACE);
+ SYS(fail, "ip link set dev %s up", VRF_VLAN_IFACE);
+ SYS(fail, "ip addr add %s/24 dev %s", IPV4_VRF_IFACE_ADDR, VRF_VLAN_IFACE);
+ err = write_sysctl("/proc/sys/net/ipv4/conf/" VRF_VLAN_IFACE "/forwarding", "1");
+ if (!ASSERT_OK(err, "write_sysctl(net.ipv4.conf." VRF_VLAN_IFACE ".forwarding)"))
+ goto fail;
+ SYS(fail, "ip route add %s/32 via %s", IPV4_VRF_DST, IPV4_GW1);
+ SYS(fail, "ip route add table %s %s/32 via %s",
+ VRF_TABLE, IPV4_VRF_DST, IPV4_VRF_GW);
+
+ /* neighbours on the VLAN subinterface for the non-SKIP_NEIGH cases */
+ err = write_sysctl("/proc/sys/net/ipv4/neigh/" VLAN_IFACE "/gc_stale_time", "900");
+ if (!ASSERT_OK(err, "write_sysctl(net.ipv4.neigh." VLAN_IFACE ".gc_stale_time)"))
+ goto fail;
+ SYS(fail, "ip neigh add %s dev %s lladdr %s nud stale",
+ IPV4_VLAN_EGRESS_DST, VLAN_IFACE, DMAC);
+ SYS(fail, "ip neigh add %s dev %s lladdr %s nud stale",
+ IPV4_VLAN_GW, VLAN_IFACE, DMAC2);
+
+ /* a VLAN on veth2 with a route in the tbid test table */
+ SYS(fail, "ip link add link veth2 name %s type vlan id %d",
+ TBID_VLAN_IFACE, TBID_VLAN_ID);
+ SYS(fail, "ip link set dev %s up", TBID_VLAN_IFACE);
+ SYS(fail, "ip route add table 100 %s/32 dev %s",
+ IPV4_TBID_VLAN_DST, TBID_VLAN_IFACE);
+
+ /* a locked-mtu route via the subinterface for the FRAG_NEEDED case */
+ SYS(fail, "ip route add %s/32 dev %s mtu lock 1000",
+ IPV4_VLAN_MTU_DST, VLAN_IFACE);
+
return 0;
fail:
return -1;
@@ -218,9 +572,16 @@ static int set_lookup_params(struct bpf_fib_lookup *params,
memset(params, 0, sizeof(*params));
params->l4_protocol = IPPROTO_TCP;
- params->ifindex = ifindex;
+ params->ifindex = test->iif ? if_nametoindex(test->iif) : ifindex;
params->tbid = test->tbid;
params->mark = test->mark;
+ params->tot_len = test->tot_len;
+
+ /* h_vlan_proto/h_vlan_TCI union with tbid */
+ if (test->lookup_flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+ params->h_vlan_proto = htons(test->vlan_proto);
+ params->h_vlan_TCI = htons(test->vlan_id);
+ }
if (inet_pton(AF_INET6, test->daddr, params->ipv6_dst) == 1) {
params->family = AF_INET6;
@@ -298,7 +659,7 @@ void test_fib_lookup(void)
struct nstoken *nstoken = NULL;
struct __sk_buff skb = { };
struct fib_lookup *skel;
- int prog_fd, err, ret, i;
+ int prog_fd, xdp_fd, err, ret, i;
/* The test does not use the skb->data, so
* use pkt_v6 for both v6 and v4 test.
@@ -309,11 +670,16 @@ void test_fib_lookup(void)
.ctx_in = &skb,
.ctx_size_in = sizeof(skb),
);
+ LIBBPF_OPTS(bpf_test_run_opts, xdp_opts,
+ .data_in = &pkt_v6,
+ .data_size_in = sizeof(pkt_v6),
+ );
skel = fib_lookup__open_and_load();
if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
return;
prog_fd = bpf_program__fd(skel->progs.fib_lookup);
+ xdp_fd = bpf_program__fd(skel->progs.fib_lookup_xdp);
SYS(fail, "ip netns add %s", NS_TEST);
@@ -352,6 +718,21 @@ void test_fib_lookup(void)
if (tests[i].expected_dst)
assert_dst_ip(fib_params, tests[i].expected_dst);
+ if (tests[i].expected_dev)
+ ASSERT_EQ(fib_params->ifindex,
+ if_nametoindex(tests[i].expected_dev), "ifindex");
+
+ if (tests[i].expected_mtu)
+ ASSERT_EQ(fib_params->mtu_result, tests[i].expected_mtu,
+ "mtu_result");
+
+ if (tests[i].check_vlan) {
+ ASSERT_EQ(fib_params->h_vlan_proto,
+ htons(tests[i].vlan_proto), "h_vlan_proto");
+ ASSERT_EQ(fib_params->h_vlan_TCI,
+ htons(tests[i].vlan_id), "h_vlan_TCI");
+ }
+
ret = memcmp(tests[i].dmac, fib_params->dmac, sizeof(tests[i].dmac));
if (!ASSERT_EQ(ret, 0, "dmac not match")) {
char expected[18], actual[18];
@@ -361,17 +742,182 @@ void test_fib_lookup(void)
printf("dmac expected %s actual %s ", expected, actual);
}
- // ensure tbid is zero'd out after fib lookup.
- if (tests[i].lookup_flags & BPF_FIB_LOOKUP_DIRECT) {
+ /*
+ * ensure tbid is zero'd out after fib lookup. With
+ * BPF_FIB_LOOKUP_VLAN the union holds the packed vlan
+ * fields instead, so skip the check for those.
+ */
+ if ((tests[i].lookup_flags & BPF_FIB_LOOKUP_DIRECT) &&
+ !(tests[i].lookup_flags & BPF_FIB_LOOKUP_VLAN)) {
if (!ASSERT_EQ(skel->bss->fib_params.tbid, 0,
"expected fib_params.tbid to be zero"))
goto fail;
}
}
+ /*
+ * Re-run the cases through bpf_xdp_fib_lookup(). test_run uses the
+ * current netns' loopback for ctx->rxq->dev, so dev_net() is NS_TEST
+ * and the lookup runs against its FIB. The path-independent results
+ * (return code, swapped ifindex, vlan tag, gateway) must match the skb
+ * path; the no-tot_len mtu_result is skb-specific and not rechecked.
+ */
+ for (i = 0; i < ARRAY_SIZE(tests); i++) {
+ if (set_lookup_params(fib_params, &tests[i], skb.ifindex))
+ continue;
+
+ skel->bss->fib_lookup_ret = -1;
+ skel->bss->lookup_flags = tests[i].lookup_flags;
+
+ err = bpf_prog_test_run_opts(xdp_fd, &xdp_opts);
+ if (!ASSERT_OK(err, "xdp test_run"))
+ continue;
+
+ if (!ASSERT_EQ(skel->bss->fib_lookup_ret, tests[i].expected_ret,
+ "xdp fib_lookup_ret"))
+ printf("(xdp) %s\n", tests[i].desc);
+
+ if (tests[i].expected_dev)
+ ASSERT_EQ(fib_params->ifindex,
+ if_nametoindex(tests[i].expected_dev),
+ "xdp ifindex");
+
+ if (tests[i].expected_dst)
+ assert_dst_ip(fib_params, tests[i].expected_dst);
+
+ if (tests[i].check_vlan) {
+ ASSERT_EQ(fib_params->h_vlan_proto,
+ htons(tests[i].vlan_proto), "xdp h_vlan_proto");
+ ASSERT_EQ(fib_params->h_vlan_TCI,
+ htons(tests[i].vlan_id), "xdp h_vlan_TCI");
+ }
+ }
+
fail:
if (nstoken)
close_netns(nstoken);
SYS_NOFAIL("ip netns del " NS_TEST);
fib_lookup__destroy(skel);
}
+
+#define NS_VLAN_A "fib_lookup_vlan_ns_a"
+#define NS_VLAN_B "fib_lookup_vlan_ns_b"
+
+/*
+ * A VLAN device can be moved to another netns while staying registered
+ * on its parent. Neither direction may then cross the boundary: the
+ * egress flag must not publish the foreign parent's ifindex, and the
+ * input flag must fail closed rather than use a foreign ingress.
+ */
+void test_fib_lookup_vlan_netns(void)
+{
+ struct bpf_fib_lookup *fib_params;
+ struct nstoken *nstoken = NULL;
+ struct __sk_buff skb = { };
+ struct fib_lookup *skel = NULL;
+ int prog_fd, err, parent_idx, vlan_idx;
+
+ LIBBPF_OPTS(bpf_test_run_opts, run_opts,
+ .data_in = &pkt_v6,
+ .data_size_in = sizeof(pkt_v6),
+ .ctx_in = &skb,
+ .ctx_size_in = sizeof(skb),
+ );
+
+ skel = fib_lookup__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
+ return;
+ prog_fd = bpf_program__fd(skel->progs.fib_lookup);
+ fib_params = &skel->bss->fib_params;
+
+ SYS(fail, "ip netns add %s", NS_VLAN_A);
+ SYS(fail, "ip netns add %s", NS_VLAN_B);
+
+ nstoken = open_netns(NS_VLAN_A);
+ if (!ASSERT_OK_PTR(nstoken, "open_netns(a)"))
+ goto fail;
+
+ SYS(fail, "ip link add veth7 type veth peer name veth8");
+ SYS(fail, "ip link set dev veth7 up");
+ SYS(fail, "ip link add link veth7 name veth7.66 type vlan id 66");
+ SYS(fail, "ip link set veth7.66 netns %s", NS_VLAN_B);
+
+ parent_idx = if_nametoindex("veth7");
+ if (!ASSERT_NEQ(parent_idx, 0, "if_nametoindex(veth7)"))
+ goto fail;
+
+ /*
+ * input: the moved device is still in veth7's VLAN group, but it
+ * lives in another netns, so the lookup must fail closed
+ */
+ skb.ifindex = parent_idx;
+ memset(fib_params, 0, sizeof(*fib_params));
+ fib_params->family = AF_INET;
+ fib_params->l4_protocol = IPPROTO_TCP;
+ fib_params->ifindex = parent_idx;
+ fib_params->h_vlan_proto = htons(ETH_P_8021Q);
+ fib_params->h_vlan_TCI = htons(66);
+ if (!ASSERT_EQ(inet_pton(AF_INET, "10.66.0.2", &fib_params->ipv4_dst),
+ 1, "inet_pton(dst)"))
+ goto fail;
+
+ skel->bss->fib_lookup_ret = -1;
+ skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT |
+ BPF_FIB_LOOKUP_SKIP_NEIGH;
+ err = bpf_prog_test_run_opts(prog_fd, &run_opts);
+ if (!ASSERT_OK(err, "test_run(input)"))
+ goto fail;
+ ASSERT_EQ(skel->bss->fib_lookup_ret, BPF_FIB_LKUP_RET_NOT_FWDED,
+ "input across netns fails closed");
+ ASSERT_EQ(fib_params->ifindex, parent_idx, "ifindex untouched");
+ ASSERT_EQ(fib_params->h_vlan_TCI, htons(66), "tag untouched");
+
+ close_netns(nstoken);
+ nstoken = open_netns(NS_VLAN_B);
+ if (!ASSERT_OK_PTR(nstoken, "open_netns(b)"))
+ goto fail;
+
+ /*
+ * egress: the fib result is the VLAN device here, but its parent
+ * is in the other netns, so the swap must not happen
+ */
+ SYS(fail, "ip link set dev veth7.66 up");
+ SYS(fail, "ip addr add 10.66.0.1/24 dev veth7.66");
+ err = write_sysctl("/proc/sys/net/ipv4/conf/veth7.66/forwarding", "1");
+ if (!ASSERT_OK(err, "write_sysctl(forwarding)"))
+ goto fail;
+
+ vlan_idx = if_nametoindex("veth7.66");
+ if (!ASSERT_NEQ(vlan_idx, 0, "if_nametoindex(veth7.66)"))
+ goto fail;
+
+ skb.ifindex = vlan_idx;
+ memset(fib_params, 0, sizeof(*fib_params));
+ fib_params->family = AF_INET;
+ fib_params->l4_protocol = IPPROTO_TCP;
+ fib_params->ifindex = vlan_idx;
+ if (!ASSERT_EQ(inet_pton(AF_INET, "10.66.0.2", &fib_params->ipv4_dst),
+ 1, "inet_pton(dst)") ||
+ !ASSERT_EQ(inet_pton(AF_INET, "10.66.0.1", &fib_params->ipv4_src),
+ 1, "inet_pton(src)"))
+ goto fail;
+
+ skel->bss->fib_lookup_ret = -1;
+ skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN |
+ BPF_FIB_LOOKUP_SKIP_NEIGH;
+ err = bpf_prog_test_run_opts(prog_fd, &run_opts);
+ if (!ASSERT_OK(err, "test_run(egress)"))
+ goto fail;
+ ASSERT_EQ(skel->bss->fib_lookup_ret, BPF_FIB_LKUP_RET_SUCCESS,
+ "egress lookup succeeds");
+ ASSERT_EQ(fib_params->ifindex, vlan_idx,
+ "foreign parent not published");
+ ASSERT_EQ(fib_params->h_vlan_TCI, 0, "vlan fields zero");
+
+fail:
+ if (nstoken)
+ close_netns(nstoken);
+ SYS_NOFAIL("ip netns del " NS_VLAN_A);
+ SYS_NOFAIL("ip netns del " NS_VLAN_B);
+ fib_lookup__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/fib_lookup.c b/tools/testing/selftests/bpf/progs/fib_lookup.c
index 7b5dd2214ff4..f43e22d33814 100644
--- a/tools/testing/selftests/bpf/progs/fib_lookup.c
+++ b/tools/testing/selftests/bpf/progs/fib_lookup.c
@@ -19,4 +19,13 @@ int fib_lookup(struct __sk_buff *skb)
return TC_ACT_SHOT;
}
+SEC("xdp")
+int fib_lookup_xdp(struct xdp_md *ctx)
+{
+ fib_lookup_ret = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params),
+ lookup_flags);
+
+ return XDP_DROP;
+}
+
char _license[] SEC("license") = "GPL";
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v14 6/9] tls: device: add TX KeyUpdate support
From: Jakub Kicinski @ 2026-06-17 22:56 UTC (permalink / raw)
To: Rishikesh Jethwani
Cc: netdev, saeedm, tariqt, mbloch, borisp, john.fastabend, sd, davem,
pabeni, edumazet, leon
In-Reply-To: <CAKaoeS3mTGUo-dvtdmQfz3JAnn69-e36xrj9YmjbWTnsiw9uqg@mail.gmail.com>
On Wed, 17 Jun 2026 15:32:58 -0700 Rishikesh Jethwani wrote:
> From: Rishikesh Jethwani <rjethwani@everpuredata.com>
Please keep in mind that net-next is currently closed.
You need to wait until the merge window is over (2 weeks)
before reposting.
^ permalink raw reply
* Re: [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Gustavo A. R. Silva @ 2026-06-17 22:59 UTC (permalink / raw)
To: Ilya Maximets, netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kees Cook, Gustavo A. R. Silva, Nathan Chancellor,
Nick Desaulniers, Bill Wendling, Justin Stitt, linux-kernel,
linux-hardening, llvm, Johan Thomsen
In-Reply-To: <b16e8bac-e149-4052-b1cb-8fd3e1137f9c@ovn.org>
On 6/17/26 16:01, Ilya Maximets wrote:
> On 6/17/26 10:08 PM, Gustavo A. R. Silva wrote:
>> Hi,
>>
>> On 6/16/26 04:03, Ilya Maximets wrote:
>>> kmalloc_flex() in metadata_dst_alloc() sets __counted_by for the
>>> structure to the options_len, which is then initialized to zero.
>>> Later, we're initializing the structure by copying the tunnel info
>>> together with the options, and this triggers a warning for a potential
>>> memcpy overflow, since the compiler estimates that the options can't
>>> fit into the structure, even though the memory for them is actually
>>> allocated.
>>>
>>> memcpy: detected buffer overflow: 104 byte write of buffer size 96
>>> WARNING: CPU: X PID: Y at lib/string_helpers.c:1036 __fortify_report
>>> skb_tunnel_info_unclone+0x179/0x190
>>> geneve_xmit+0x7fe/0xe00
>>
>> This warning has nothing to do with counted_by. See below for more
>> comments.
>>
>>>
>>> The issue is triggered when built with clang and source fortification.
>>>
>>> Fix that by doing the copy in two stages: first - the main data with
>>> the options_len, then the options. This way the correct length should
>>> be known at the time of the copy.
>>>
>>> It would be better if the options_len never changed after allocation,
>>> but the allocation code is a little separate from the initialization
>>> and it would be awkward and potentially dangerous to return a struct
>>> with options_len set to a non-zero value from the metadata_dst_alloc().
>>>
>>> Another option would be to use ip_tunnel_info_opts_set(), but it is
>>> doing too many unnecessary operations for the use case here.
>>>
>>> Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
>>> Reported-by: Johan Thomsen <write@ownrisk.dk>
>>> Closes: https://lore.kernel.org/netdev/CAKv6aAM8_EWgXScnKmKYm_4SwGDVBK++dzfP+Y6msUXbp99QUw@mail.gmail.com/
>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>> ---
>>>
>>> Johan, if you can test this one in your setup as well, that would
>>> be great. Thanks.
>>>
>>> include/net/dst_metadata.h | 7 +++++--
>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>>> index 1fc2fb03ce3f..f45d1e3163f0 100644
>>> --- a/include/net/dst_metadata.h
>>> +++ b/include/net/dst_metadata.h
>>> @@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>>> if (!new_md)
>>> return ERR_PTR(-ENOMEM);
>>>
>>> - memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>>> - sizeof(struct ip_tunnel_info) + md_size);
>>
>> What's going on here is that, internally, fortified memcpy() retrieves
>> the destination size via __builtin_dynamic_object_size() in mode 1.
>>
>> That is:
>>
>> __builtin_dynamic_object_size(&new_md->u.tun_info, 1)
>>
>> For the above case, Clang returns sizeof(new_md->u.tun_info) == 96.
>>
>> So the warning is reporting that 104 bytes don't fit in an object of
>> size 96 bytes, regardless of any counted_by annotation or allocation.
>
> Hmm. Does __builtin_dynamic_object_size(&new_md->u.tun_info, 1) return
> 104 when the options_len is 8? If so, isn't that because it is counted
> by that field? Asking because the fortification doesn't complain if we
> keep the full 104-byte copy as-is, but set the options_len beforehand,
> as tested by Johan.
I see. If that is the case, then, internally, fortified memcpy() ends up
using mode 0 instead of mode 1. Something like this:
__builtin_dynamic_object_size(&new_md->u.tun_info, 0)
The above will effectively consider the allocation and counted_by because
it will interpret new_md->u.tun_info as an open-ended object due to the
flexible-array member (in struct ip_tunnel_info) whose size is determined
by counted_by.
I'm not entirely convinced we really want this.
-Gustavo
>
>>
>> Of course, in this case, the write of 104 bytes into new_md->u.tun_info
>> is intentional and controlled, but what if it weren't?
>>
>> On the other hand, for this same case, GCC currently returns SIZE_MAX,
>> which translates to -1 inside fortified memcpy(). Thus, bounds-checking
>> is bypassed, which is why this warning doesn't show up with GCC.
>>
>> However, this is a bug in GCC. We're already looking into that.
>>
>> I think we've had just a handful of cases like this across the whole
>> kernel tree. We can deal with them as you did here (by directly copying
>> the composite structure first, and then using memcpy() to copy into the
>> flexible-array member). If these cases ever become more common, we
>> could create some kind of helper to do both things at once. :)
>>
>>> + /* Copy in two stages to keep the __counted_by happy. */
>>
>> So based on my comments above, this code comment is not correct.
>
> I feel like some comment is still needed, do you have some suggestions
> for what would be a better wording?
>
>>
>>> + new_md->u.tun_info = md_dst->u.tun_info;
>>
>> This is fine.
>>
>>> + memcpy(ip_tunnel_info_opts(&new_md->u.tun_info),
>>> + ip_tunnel_info_opts(&md_dst->u.tun_info), md_size);
>>
>> Is ip_tunnel_info_opts() really needed here?
>>
>> Probably this works just fine:
>>
>> memcpy(new_md->u.tun_info.options, md_dst->u.tun_info.options, md_size);
>
> The logic here is: we have the access function, therefore we should use it.
> It gives a bad example if we don't.
>
> Best regards, Ilya Maximets.
^ permalink raw reply
* Re: [ANN] netdev development stats for 7.2
From: Jacob Keller @ 2026-06-17 23:19 UTC (permalink / raw)
To: Jakub Kicinski, netdev
In-Reply-To: <20260617115319.43a5942d@kernel.org>
On 6/17/2026 11:53 AM, Jakub Kicinski wrote:
> Top scores (positive): Top scores (negative):
> 1 ( ) [768] Jakub Kicinski 1 ( +1) [91] Tariq Toukan
> 2 ( ) [376] Simon Horman 2 ( +8) [86] Wei Fang
> 3 ( ) [346] Andrew Lunn 3 ( +4) [67] Ratheesh Kannoth
> 4 ( ) [265] Paolo Abeni 4 (***) [54] javen
> 5 ( +4) [ 91] Ido Schimmel 5 ( +6) [49] Lorenzo Bianconi
> 6 (+14) [ 74] David Laight 6 (***) [48] Luiz Angelo Daros de Luca
> 7 ( ) [ 62] Krzysztof Kozlowski 7 (***) [43] Simon Wunderlich
> 8 ( +2) [ 57] Aleksandr Loktionov 8 (***) [38] Chuck Lever
> 9 (+12) [ 50] Nikolay Aleksandrov 9 (+18) [38] Grzegorz Nitka
> 10 ( -4) [ 49] Willem de Bruijn 10 (***) [35] Pablo Neira Ayuso
> 11 ( +3) [ 49] Sabrina Dubroca 11 (***) [35] Markus Stockhausen
> 12 (+41) [ 47] Alexander Lobakin 12 (***) [34] Selvamani Rajagopal
> 13 (+24) [ 47] Maxime Chevallier 13 (***) [34] Jason Xing
> 14 ( -6) [ 46] David Ahern 14 ( -8) [33] Illusion Wang
> 15 (***) [ 43] Jiayuan Chen 15 (***) [30] Minxi Hou
>
> One process note on the reviewer score. Tariq tops the negative list.
> I've been returning to the question of whether it's fair since
> he has to handle submissions of most of nVidia's patches.
> Still, I don't understand why reading thru the list and reviewing
> one patchset from another company a day is too much to ask.
>
This is a difficult question. When I've covered for Tony in a similar
position, I've felt like it is hard enough to keep an eye on our own
list let alone also finding time to review other places.
A positive note here is that nVidia is now green overall, so at least
there is some participation from the company as a whole. On the other
hand, Tony isn't in the top negatives despite performing a somewhat
similar role.
I know I was lacking myself in the last cycle due to a bunch of
unrelated work and issues. I've been working to get review back into my
daily flow.
^ permalink raw reply
* Re: [PATCH net v3] net: airoha: Fix skb->priority underflow in airoha_dev_select_queue()
From: Jakub Kicinski @ 2026-06-17 23:19 UTC (permalink / raw)
To: lorenzo
Cc: Wayen Yan, netdev, horms, pabeni, edumazet, andrew+netdev,
angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
linux-mediatek
In-Reply-To: <178161373805.2167512.2544164327472822616@gmail.com>
On Sun, 14 Jun 2026 07:30:54 +0800 Wayen Yan wrote:
> In airoha_dev_select_queue(), the expression:
>
> queue = (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES;
>
> implicitly converts to unsigned arithmetic: when skb->priority is 0
> (the default for unclassified traffic), (0u - 1u) wraps to UINT_MAX,
> and UINT_MAX % 8 = 7, routing default best-effort packets to the
> highest-priority QoS queue. This causes QoS inversion where the
> majority of traffic on a PON gateway starves actual high-priority
> flows (VoIP, gaming, etc.).
>
> Fix by guarding the subtraction: when priority is 0, map to queue 0
> (lowest priority), otherwise apply the original (priority - 1) % 8
> mapping.
>
> Fixes: 2b288b81560b ("net: airoha: Introduce ndo_select_queue callback")
> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
> Reviewed-by: Joe Damato <joe@dama.to>
> Signed-off-by: Wayen Yan <win847@gmail.com>
> ---
> drivers/net/ethernet/airoha/airoha_eth.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 31cdb11cd7..d476ef83c3 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -1933,7 +1933,7 @@ static u16 airoha_dev_select_queue(struct net_device *dev, struct sk_buff *skb,
> */
> channel = netdev_uses_dsa(dev) ? skb_get_queue_mapping(skb) : port->id;
> channel = channel % AIROHA_NUM_QOS_CHANNELS;
> - queue = (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES; /* QoS queue */
> + queue = skb->priority ? (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES : 0;
Hi Lorenzo, is there a reason we're subtracting 1 here in the first
place? Could be just me, but may be worth adding a comment here.
Intuitively if we are "narrowing" 16 prios to 8 queues it'd make most
sense to group the adjacent ones -- divide by two.
Please respin with some sort of an explanation..
> queue = channel * AIROHA_NUM_QOS_QUEUES + queue;
>
> return queue < dev->num_tx_queues ? queue : 0;
--
pw-bot: cr
^ permalink raw reply
* Re: [PATCH v3 0/3] net/smc: bound wire-controlled CDC cursors against the local buffers
From: Jakub Kicinski @ 2026-06-17 23:24 UTC (permalink / raw)
To: Bryam Vargas via B4 Relay
Cc: hexlabsecurity, Wenjia Zhang, Dust Li, D. Wythe, Sidraya Jayagond,
Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
Paolo Abeni, linux-kernel, linux-rdma, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-0-551fa514257e@proton.me>
On Sun, 14 Jun 2026 03:23:29 -0500 Bryam Vargas via B4 Relay wrote:
> A peer's CDC producer/consumer cursors are copied from the wire and used,
> without an upper bound against the local buffers, as (a) a raw index into the
> RMB on the urgent path, (b) the receive length in smc_rx_recvmsg(), and (c) the
> send length in smc_tx_sendmsg() on the SMC-D DMB-merge path. A malicious or
> buggy peer can forge a cursor so each of these runs past the relevant buffer:
> an out-of-bounds read of adjacent kernel memory (disclosed to the peer) on the
> receive/urgent side, and an out-of-bounds write of attacker-influenced length
> and content on the send side.
Once again, SMC maintainers -- please review.
--
mping: SHARED MEMORY COMMUNICATIONS (SMC) SOCKETS
^ permalink raw reply
* Re: [PATCH] net: tn40xx: fix netdev and NAPI leak in probe error paths
From: Jakub Kicinski @ 2026-06-17 23:33 UTC (permalink / raw)
To: ZhaoJinming
Cc: FUJITA Tomonori, Andrew Lunn, David S . Miller, Eric Dumazet,
Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20260615064256.1068059-1-zhaojinming@uniontech.com>
On Mon, 15 Jun 2026 14:42:56 +0800 ZhaoJinming wrote:
> In tn40_probe(), after tn40_netdev_alloc() and netif_napi_add() succeed,
> none of the subsequent error paths call netif_napi_del() or free_netdev()
> to undo these operations. On any probe failure after netif_napi_add() the
> NAPI structure (embedded in the netdev private data) remains on the
> per-netdev napi_list while the backing memory is never freed, causing:
it's devm_ allocated:
ndev = devm_alloc_etherdev(&pdev->dev, sizeof(struct tn40_priv));
you're introducing a bug instead of fixing one..
--
pw-bot: reject
pv-bot: slop
^ permalink raw reply
* Re: [PATCH] rocker: Fix memory leak in ofdpa_port_fdb()
From: Jakub Kicinski @ 2026-06-17 23:44 UTC (permalink / raw)
To: Andrew Lunn, Jiri Pirko
Cc: Jacob Keller, Ziran Zhang, Andrew Lunn, David S . Miller,
Eric Dumazet, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <61892bd4-7368-4cd8-b360-0267e5c47156@lunn.ch>
On Wed, 17 Jun 2026 11:26:46 +0200 Andrew Lunn wrote:
> On Tue, Jun 16, 2026 at 04:29:59PM -0700, Jacob Keller wrote:
> > On 6/15/2026 6:32 PM, Ziran Zhang wrote:
> > > In ofdpa_port_fdb(), the hash_del() only unlinks the node from
> > > hash table, but does not free it.
> > >
> > > Fix this by adding kfree(found) after the !found == removing check,
> > > where the pointer value is no longer needed.
> > >
> > > Found by Coccinelle kfree script.
>
> Is rocker actually used any more? I'm not too sure of the history, but
> was it not added as a way to develop the early switchdev code? There
> was a qemu implementation of the 'hardware'?
>
> Is it still useful? Should we actually just remove the driver?
I think it came up before but I don't remember the conclusion :S
We should either add rocker to NIPA or delete it. Jiri, WDYT?
^ permalink raw reply
* Re: [PATCH] net: airoha: Stop TX queues on error path in airoha_dev_open
From: Jakub Kicinski @ 2026-06-17 23:44 UTC (permalink / raw)
To: Wayen Yan
Cc: netdev, lorenzo, horms, pabeni, edumazet, andrew+netdev,
angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
linux-mediatek
In-Reply-To: <178160729880.2156257.7978513589649053826@gmail.com>
On Tue, 16 Jun 2026 18:50:39 +0800 Wayen Yan wrote:
> In airoha_dev_open(), if airoha_set_vip_for_gdm_port() fails after
> netif_tx_start_all_queues() has been called, the TX queues remain
> started while the device configuration is incomplete. This leaves
> the device in an inconsistent state where packets could be
> transmitted before the VIP/IFC port configuration is complete.
Not sure if this was superseded by another posting but FWIW
this posting did not apply.
^ permalink raw reply
* Re: [PATCH net-next] ionic: Change list definition method
From: Jakub Kicinski @ 2026-06-17 23:47 UTC (permalink / raw)
To: Lei Zhu; +Cc: brett.creeley, andrew+netdev, davem, edumazet, netdev
In-Reply-To: <20260617023243.61595-1-zhulei_szu@163.com>
On Wed, 17 Jun 2026 10:32:43 +0800 Lei Zhu wrote:
> The LIST_HEAD macro can both define a linked list and initialize
> it in one step. To simplify code, we replace the separate operations
> of linked list definition and manual initialization with the LIST_HEAD
> macro.
## Form letter - net-next-closed
We have already submitted our pull request with net-next material for v7.2,
and therefore net-next is closed for new drivers, features, code refactoring
and optimizations. We are currently accepting bug fixes only.
Please repost when net-next reopens after June 29th.
RFC patches sent for review only are obviously welcome at any time.
See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
--
pw-bot: defer
pv-bot: closed
^ permalink raw reply
* Re: [PATCH net v6 0/7] net: require CAP_NET_ADMIN in the device netns for tunnel changelink
From: patchwork-bot+netdevbpf @ 2026-06-18 0:20 UTC (permalink / raw)
To: Maoyi Xie
Cc: davem, edumazet, kuba, pabeni, dsahern, steffen.klassert, herbert,
horms, kuniyu, shaw.leon, netdev, linux-kernel, stable
In-Reply-To: <20260612085941.3158249-1-maoyixie.tju@gmail.com>
Hello:
This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 12 Jun 2026 16:59:34 +0800 you wrote:
> A tunnel changelink() operates on at most two netns, dev_net(dev) and
> the tunnel link netns t->net. They differ once the device is created in
> or moved to a netns other than the one the request runs in. The rtnl
> changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a
> caller privileged there but not in the link netns can rewrite a tunnel
> that lives in the link netns. Commit 8b484efd5cb4 ("ip6: vti: Use
> ip6_tnl.net in vti6_siocdevprivate().") added the same check on the
> ioctl path. This series adds it on the RTM_NEWLINK path.
>
> [...]
Here is the summary with links:
- [net,v6,1/7] net: ip_gre: require CAP_NET_ADMIN in the device netns for changelink
https://git.kernel.org/netdev/net/c/8165f7ff57d9
- [net,v6,2/7] net: ipip: require CAP_NET_ADMIN in the device netns for changelink
https://git.kernel.org/netdev/net/c/8211a2632466
- [net,v6,3/7] net: ip_vti: require CAP_NET_ADMIN in the device netns for changelink
https://git.kernel.org/netdev/net/c/95cceadbfd52
- [net,v6,4/7] net: ip6_tunnel: require CAP_NET_ADMIN in the device netns for changelink
https://git.kernel.org/netdev/net/c/2496fa0b7d18
- [net,v6,5/7] net: ip6_gre: require CAP_NET_ADMIN in the device netns for changelink
https://git.kernel.org/netdev/net/c/f00a50876d28
- [net,v6,6/7] net: ip6_vti: require CAP_NET_ADMIN in the device netns for changelink
https://git.kernel.org/netdev/net/c/e2ac3b242c37
- [net,v6,7/7] xfrm: xfrm_interface: require CAP_NET_ADMIN in the device netns for changelink
https://git.kernel.org/netdev/net/c/095515d89b19
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] octeontx2-af: npc: Log successful MCAM drop-on-non-hit install at debug level
From: patchwork-bot+netdevbpf @ 2026-06-18 0:20 UTC (permalink / raw)
To: Ratheesh Kannoth
Cc: kuba, linux-kernel, netdev, andrew+netdev, davem, edumazet,
pabeni, sgoutham
In-Reply-To: <20260615033157.535237-1-rkannoth@marvell.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 15 Jun 2026 09:01:57 +0530 you wrote:
> npc_install_mcam_drop_rule() used dev_err() after a successful
> rvu_mbox_handler_npc_mcam_write_entry() call, so normal installs appeared
> as errors in dmesg. Use dev_dbg() for the success path and keep dev_err()
> for real failures.
>
> Fixes: 3571fe07a090 ("octeontx2-af: Drop rules for NPC MCAM")
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
>
> [...]
Here is the summary with links:
- [net] octeontx2-af: npc: Log successful MCAM drop-on-non-hit install at debug level
https://git.kernel.org/netdev/net/c/4f6ac65e8162
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] net: ethernet: mtk_eth_soc: fix supported_interface set after phylink_create
From: patchwork-bot+netdevbpf @ 2026-06-18 0:20 UTC (permalink / raw)
To: Christian Marangi
Cc: nbd, lorenzo, andrew+netdev, davem, edumazet, kuba, pabeni,
matthias.bgg, angelogioacchino.delregno, linux, daniel, netdev,
linux-kernel, linux-arm-kernel, linux-mediatek
In-Reply-To: <20260615151106.15438-1-ansuelsmth@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 15 Jun 2026 17:11:00 +0200 you wrote:
> Everything configured in phylink_config it's assumed to be set before
> calling phylink_create() to permit correct parsing of all the different
> modes and capabilities.
>
> Commit 51cf06ddafc9 ("net: ethernet: mtk_eth_soc: add support for MT7988
> internal 2.5G PHY") while introducing support for 2.5G phy for MT7988,
> probably due to an auto-rebase, placed the configuration of the INTERNAL
> interface mode for the supported_interfaces for phylink_config right after
> phylink_create() introducing a possible problem with supported interfaces
> parsing.
>
> [...]
Here is the summary with links:
- [net] net: ethernet: mtk_eth_soc: fix supported_interface set after phylink_create
https://git.kernel.org/netdev/net/c/e4b4d8410c7c
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH][net-next] net/mlx5: Remove broken and unused mlx5_query_mtppse()
From: patchwork-bot+netdevbpf @ 2026-06-18 0:20 UTC (permalink / raw)
To: lirongqing
Cc: saeedm, leon, tariqt, mbloch, andrew+netdev, davem, edumazet,
kuba, pabeni, netdev, gal, linux-rdma, linux-kernel
In-Reply-To: <20260615140406.1828-1-lirongqing@baidu.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 15 Jun 2026 22:04:06 +0800 you wrote:
> From: Li RongQing <lirongqing@baidu.com>
>
> mlx5_query_mtppse() reads the Event Trigger Pin (MTPPSE) register but
> reads the returned arm and mode values from the input buffer 'in'
> instead of the output buffer 'out', so it always returns the values
> that were written rather than the actual hardware state, making the
> query useless.
>
> [...]
Here is the summary with links:
- [net-next] net/mlx5: Remove broken and unused mlx5_query_mtppse()
https://git.kernel.org/netdev/net/c/b50fa1e07cf8
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] netdev-genl: report NAPI thread PID in the caller's pid namespace
From: patchwork-bot+netdevbpf @ 2026-06-18 0:20 UTC (permalink / raw)
To: Maoyi Xie
Cc: davem, edumazet, kuba, pabeni, horms, daniel, razor, dw, sdf,
dtatulea, skhawaja, netdev, linux-kernel, stable
In-Reply-To: <20260615171736.1709318-1-maoyixie.tju@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 16 Jun 2026 01:17:36 +0800 you wrote:
> netdev_nl_napi_fill_one() reports the NAPI kthread PID in NETDEV_A_NAPI_PID
> using task_pid_nr(), which returns the PID in the initial pid namespace.
>
> NETDEV_CMD_NAPI_GET does not have GENL_ADMIN_PERM and the netdev genl family
> is netnsok, so a caller in a child pid namespace can issue it. That caller
> then sees the kthread's global PID, even though the kthread is not visible
> in its pid namespace, where the value should be 0.
>
> [...]
Here is the summary with links:
- [net] netdev-genl: report NAPI thread PID in the caller's pid namespace
https://git.kernel.org/netdev/net/c/1f24c0d01db2
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] net: psample: fix info leak in PSAMPLE_ATTR_DATA
From: patchwork-bot+netdevbpf @ 2026-06-18 0:20 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, bestswngs,
yotam.gi, jhs, jiri
In-Reply-To: <20260616003046.1099490-1-kuba@kernel.org>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 15 Jun 2026 17:30:46 -0700 you wrote:
> psample open codes nla_put() presumably to avoid wiping
> the data with 0s just to override it with packet data.
> This open coding is missing clearing the pad, however,
> each netlink attr is padded to 4B and data_len may
> not be divisible by 4B.
>
> Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>
> [...]
Here is the summary with links:
- [net] net: psample: fix info leak in PSAMPLE_ATTR_DATA
https://git.kernel.org/netdev/net/c/aedd02af1f8b
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net-next] net: pse-pd: set user byte command SUB2 field
From: patchwork-bot+netdevbpf @ 2026-06-18 0:20 UTC (permalink / raw)
To: Robert Marko
Cc: o.rempel, kory.maincent, andrew+netdev, davem, edumazet, kuba,
pabeni, netdev, linux-kernel, luka.perkov
In-Reply-To: <20260611102517.445549-1-robert.marko@sartura.hr>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 11 Jun 2026 12:24:49 +0200 you wrote:
> The Set User Byte to Save command has three subject bytes.
> The PD692x0 protocol guides defines SUB2 with value 0x4e, while SUB1
> carries the NVM user byte.
>
> Template only initialized SUB and SUB1.
> Fill SUB2 explicitly so the command matches the documented layout.
>
> [...]
Here is the summary with links:
- [net-next] net: pse-pd: set user byte command SUB2 field
https://git.kernel.org/netdev/net/c/e586644d0a89
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH] net: ehea: unwind probe_port sysfs file on failure
From: patchwork-bot+netdevbpf @ 2026-06-18 0:20 UTC (permalink / raw)
To: Pengpeng Hou
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, kees, netdev,
linux-kernel
In-Reply-To: <20260615070033.43461-1-pengpeng@iscas.ac.cn>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 15 Jun 2026 15:00:31 +0800 you wrote:
> ehea_create_device_sysfs() creates probe_port and then remove_port. If
> the second device_create_file() fails, the helper returns the error but
> leaves probe_port installed even though probe treats the sysfs setup as
> failed.
>
> Remove probe_port on the remove_port creation failure path so the helper
> leaves no partial sysfs state behind.
>
> [...]
Here is the summary with links:
- net: ehea: unwind probe_port sysfs file on failure
https://git.kernel.org/netdev/net/c/1c4b39746c4b
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net v2] sctp: hold socket lock when dumping endpoints in sctp_diag
From: patchwork-bot+netdevbpf @ 2026-06-18 0:20 UTC (permalink / raw)
To: Xin Long
Cc: netdev, linux-sctp, davem, kuba, edumazet, pabeni, horms,
marcelo.leitner, w, zdi-disclosures
In-Reply-To: <4c1b49ab87e0f7d552ebd8172b364b1994e913c9.1781552190.git.lucien.xin@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 15 Jun 2026 15:36:30 -0400 you wrote:
> SCTP_DIAG endpoint dumping was traversing endpoint address lists without
> holding lock_sock(), while those lists could change concurrently via
> socket operations (e.g., bindx changes). This creates a race where
> nla_reserve() counts addresses under RCU protection, but the subsequent
> copy may see fewer entries, potentially leaking uninitialized memory to
> userspace.
>
> [...]
Here is the summary with links:
- [net,v2] sctp: hold socket lock when dumping endpoints in sctp_diag
https://git.kernel.org/netdev/net/c/7d8297e26b4e
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] octeontx2-pf: Fix leak of SQ timestamp buffer on teardown
From: patchwork-bot+netdevbpf @ 2026-06-18 0:20 UTC (permalink / raw)
To: Ratheesh Kannoth
Cc: amakarov, davem, jesse.brandeburg, kuba, linux-kernel, netdev,
richardcochran, andrew+netdev, edumazet, pabeni, sgoutham
In-Reply-To: <20260615030704.504536-1-rkannoth@marvell.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 15 Jun 2026 08:37:04 +0530 you wrote:
> The send-queue timestamp ring is allocated with qmem_alloc() when
> timestamping is used, but otx2_free_sq_res() never freed sq->timestamps,
> leaking that memory across ifdown and device removal. Add the missing
> qmem_free() alongside the other SQ companion buffers.
>
> Fixes: c9c12d339d93 ("octeontx2-pf: Add support for PTP clock")
> Cc: Aleksey Makarov <amakarov@marvell.com>
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
>
> [...]
Here is the summary with links:
- [net] octeontx2-pf: Fix leak of SQ timestamp buffer on teardown
https://git.kernel.org/netdev/net/c/a056db30de92
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net v2 1/1] net: ipv4: bound TCP reordering sysctl writes and MTU probe sizes
From: patchwork-bot+netdevbpf @ 2026-06-18 0:21 UTC (permalink / raw)
To: Ren Wei
Cc: netdev, edumazet, kuniyu, david.laight.linux, ncardwell, pabeni,
chia-yu.chang, ij, yuuchihsu, idosch, fmancera, herbert,
yuantan098, zcliangcn, bird, bronzed_45_vested
In-Reply-To: <1a5b7e1ef4d70fbad8c8ee0b82d8405f3c964a3d.1781395200.git.bronzed_45_vested@icloud.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 15 Jun 2026 18:31:18 +0800 you wrote:
> From: Wyatt Feng <bronzed_45_vested@icloud.com>
>
> Reject invalid `net.ipv4.tcp_reordering` values before they reach TCP
> socket state. The sysctl is stored as an `int` but copied into the
> `u32` `tp->reordering` field for new sockets, so negative writes wrap
> to large values.
>
> [...]
Here is the summary with links:
- [net,v2,1/1] net: ipv4: bound TCP reordering sysctl writes and MTU probe sizes
https://git.kernel.org/netdev/net/c/efb8763d7bbb
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH bpf v2] bpf, sockmap: fix use-after-free when the stream parser resizes the skb
From: Kuniyuki Iwashima @ 2026-06-18 0:25 UTC (permalink / raw)
To: rhkrqnwk98
Cc: bobbyeshleman, bpf, davem, edumazet, horms, jakub, john.fastabend,
kuba, linux-kernel, netdev, pabeni
In-Reply-To: <20260612123553.2724240-1-rhkrqnwk98@gmail.com>
From: Sechang Lim <rhkrqnwk98@gmail.com>
Date: Fri, 12 Jun 2026 12:35:51 +0000
> sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
> to find the length of the next message. strparser assembles a message out
> of several received skbs by chaining them onto the head's frag_list and
> recording where to append the next one in strp->skb_nextp:
>
> *strp->skb_nextp = skb;
> strp->skb_nextp = &skb->next;
>
> and then calls the parser on the head:
>
> len = (*strp->cb.parse_msg)(strp, head);
>
> The parser is only meant to inspect the skb, but the program may call
> bpf_skb_change_tail() -- or the sibling bpf_skb_pull_data(),
> bpf_skb_change_head(), bpf_skb_adjust_room(), all allowed for SK_SKB.
It's bpf prog's responsibility not to abuse them.
Even setting aside that, why not simply block such BPF prog ?
It cannot be done at load time, but doable at attach time.
---8<---
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 630d530782fe..4d60b77da8ef 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4556,6 +4556,12 @@ static int bpf_prog_attach(const union bpf_attr *attr)
switch (ptype) {
case BPF_PROG_TYPE_SK_SKB:
+ if (attr->attach_type == BPF_SK_SKB_STREAM_PARSER &&
+ prog->aux->changes_pkt_data) {
+ ret = -EINVAL;
+ goto out;
+ }
+ fallthrough;
case BPF_PROG_TYPE_SK_MSG:
ret = sock_map_get_from_fd(attr, prog);
break;
---8<---
> Once the head carries a frag_list these go
>
> ... -> skb_ensure_writable -> pskb_may_pull -> __pskb_pull_tail
>
> and __pskb_pull_tail() frees the frag_list skbs that strparser still
> tracks through skb_nextp:
>
> while ((list = skb_shinfo(skb)->frag_list) != insp) {
> skb_shinfo(skb)->frag_list = list->next;
> consume_skb(list);
> }
>
> strp->skb_nextp now points into a freed sk_buff. The next segment of
> the same message arrives in __strp_recv(), which links it with
> *strp->skb_nextp = skb, an 8-byte write into the freed skb. The free
> and the write happen in different __strp_recv() calls, so the message
> has to span at least three segments before it triggers.
>
> BUG: KASAN: slab-use-after-free in __strp_recv+0x447/0xda0
> Write of size 8 at addr ffff88810db86140 by task repro/349
>
> Call Trace:
> <IRQ>
> __strp_recv+0x447/0xda0
> __tcp_read_sock+0x13d/0x590
> tcp_bpf_strp_read_sock+0x195/0x320
> strp_data_ready+0x267/0x340
> sk_psock_strp_data_ready+0x1ce/0x350
> tcp_data_queue+0x1364/0x2fd0
> tcp_rcv_established+0xe07/0x1640
> [...]
>
> Allocated by task 349:
> skb_clone+0x17b/0x210
> __strp_recv+0x2c3/0xda0
> __tcp_read_sock+0x13d/0x590
> [...]
>
> Freed by task 349:
> kmem_cache_free+0x150/0x570
> __pskb_pull_tail+0x57b/0xc20
> skb_ensure_writable+0x236/0x260
> __bpf_skb_change_tail+0x1d4/0x590
> sk_skb_change_tail+0x2a/0x40
> bpf_prog_1b285dcd6c41373e+0x27/0x30
> bpf_prog_run_pin_on_cpu+0xf3/0x260
> sk_psock_strp_parse+0x118/0x1e0
> __strp_recv+0x4f6/0xda0
> [...]
>
> The same resize also leaves the head's length inconsistent with its
> frags, so a later __pskb_pull_tail() can instead hit the
> BUG_ON(skb_copy_bits(...)) in net/core/skbuff.c.
>
> Run the parser on a private clone of the head when the message spans more
> than one skb and the program can modify the packet
> (prog->aux->changes_pkt_data), so a resizing helper can only touch the
> clone and strparser's head and skb_nextp stay valid. Single-skb messages
> have no frag_list and read-only parsers cannot resize, so both are still
> parsed in place. If the clone cannot be allocated, return 0 so the caller
> retries on the next read rather than failing the parser.
>
> Fixes: 8a31db561566 ("bpf: add access to sock fields and pkt data from sk_skb programs")
> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
> ---
> v2:
> - clone only when prog->aux->changes_pkt_data (Bobby Eshleman)
> - return 0 on clone failure instead of -ENOMEM (Bobby Eshleman)
> - free the clone with consume_skb() instead of kfree_skb()
> - drop the unrelated guard(rcu)() change (Bobby Eshleman)
>
> v1:
> - https://lore.kernel.org/all/20260609112316.3685738-1-rhkrqnwk98@gmail.com/
>
> net/core/skmsg.c | 26 +++++++++++++++++++++++---
> 1 file changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index e1850caf1a71..97e5bc5f38c3 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -1149,9 +1149,29 @@ static int sk_psock_strp_parse(struct strparser *strp, struct sk_buff *skb)
> rcu_read_lock();
> prog = READ_ONCE(psock->progs.stream_parser);
> if (likely(prog)) {
> - skb->sk = psock->sk;
> - ret = bpf_prog_run_pin_on_cpu(prog, skb);
> - skb->sk = NULL;
> + struct sk_buff *parse_skb = skb;
> +
> + /*
> + * strparser chains the message skbs through skb->frag_list and
> + * keeps a pointer into that list in strp->skb_nextp. The parser
> + * program may call bpf_skb_change_tail() and friends, which go
> + * through __pskb_pull_tail() and free the frag_list skbs that
> + * strparser still tracks. Run the program on a clone when the head
> + * has a frag_list and the program can modify the packet, so it
> + * cannot drop frags strparser owns.
> + */
> + if (skb_has_frag_list(skb) && prog->aux->changes_pkt_data) {
> + parse_skb = skb_clone(skb, GFP_ATOMIC);
> + if (!parse_skb) {
> + rcu_read_unlock();
> + return 0;
> + }
> + }
> + parse_skb->sk = psock->sk;
> + ret = bpf_prog_run_pin_on_cpu(prog, parse_skb);
> + parse_skb->sk = NULL;
> + if (parse_skb != skb)
> + consume_skb(parse_skb);
> }
> rcu_read_unlock();
> return ret;
> --
> 2.43.0
>
^ permalink raw reply related
* Re: [PATCH net-next 0/2] appletalk: move the protocol out of tree
From: Finn Thain @ 2026-06-18 0:55 UTC (permalink / raw)
To: Carsten Strotmann
Cc: Jakub Kicinski, Carsten Strotmann, John Paul Adrian Glaubitz,
davem, netdev, edumazet, pabeni, andrew+netdev, horms, geert,
chleroy, npiggin, mpe, maddy, linux-mips, linux-m68k,
linuxppc-dev
In-Reply-To: <1781694488854.956546368.818588236@strotmann.de>
On Wed, 17 Jun 2026, Carsten Strotmann wrote:
> > _Someone_ has to handle the reports and patches. And since nobody is
> > doing that the code is going to GitHub, where it can continue to "just
> > be left" or whatever, without racking up CVEs for the Linux kernel and
> > leading to maintainer burn out :/
> >
>
> That's a good point. The large influx of reports is a problem, and burn
> out of maintainers is a too high cost.
>
Carsten, if, as a maintainer, you want to avoid burnout then
1) don't promise what you can't deliver (that is, decline sponsorship)
2) delegate (that is, leverage AI as an ally not as a lame excuse)
So the question remains: what is it which _can_ be delivered by and for
the "community" (by which I mean, that group of people which includes
actual end users -- not merely paying customers and sponsored developers).
This question has precious little to do with burnout, but it's the
question we need to address.
^ permalink raw reply
* Re: [RFC PATCH 1/2] landlock: fix TCP Fast Open connection bypass
From: Bryam Vargas @ 2026-06-18 1:25 UTC (permalink / raw)
To: Matthieu Buffet
Cc: Mickaël Salaün, Günther Noack, Mikhail Ivanov,
Paul Moore, Eric Dumazet, Neal Cardwell, linux-security-module,
netdev, linux-kernel
In-Reply-To: <20260617180526.15627-2-matthieu@buffet.re>
Thanks Matthieu, your #41, so no competing patch from me. I built your v0
(Landlock + MPTCP) and ran an A/B: without it, a confined task with CONNECT_TCP
denied still reaches the port via sendto(MSG_FASTOPEN); with it, that path is now
denied too, on IPv4 and IPv6.
Tested-by: Bryam Vargas <hexlabsecurity@proton.me>
One scope note, since you mention MPTCP: an MPTCP socket isn't covered.
sk_is_tcp() is false for the mptcp parent (sk_protocol is IPPROTO_MPTCP), so
neither the new sendmsg hook nor the existing socket_connect one mediates it. On
the patched kernel my MPTCP arm still reaches the blocked port via both connect()
and MSG_FASTOPEN. If MPTCP is meant to be in scope for CONNECT_TCP, the guard
wants `|| sk->sk_protocol == IPPROTO_MPTCP` (not sk_is_mptcp(), which is the
subflow flag).
Bryam
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox