Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH 1/2] landlock: fix TCP Fast Open connection bypass
From: Bryam Vargas @ 2026-06-18  1:25 UTC (permalink / raw)
  To: Matthieu Buffet
  Cc: Mickaël Salaün, Günther Noack, Mikhail Ivanov,
	Paul Moore, Eric Dumazet, Neal Cardwell, linux-security-module,
	netdev, linux-kernel
In-Reply-To: <20260617180526.15627-2-matthieu@buffet.re>

Thanks Matthieu, your #41, so no competing patch from me. I built your v0
(Landlock + MPTCP) and ran an A/B: without it, a confined task with CONNECT_TCP
denied still reaches the port via sendto(MSG_FASTOPEN); with it, that path is now
denied too, on IPv4 and IPv6.

Tested-by: Bryam Vargas <hexlabsecurity@proton.me>

One scope note, since you mention MPTCP: an MPTCP socket isn't covered.
sk_is_tcp() is false for the mptcp parent (sk_protocol is IPPROTO_MPTCP), so
neither the new sendmsg hook nor the existing socket_connect one mediates it. On
the patched kernel my MPTCP arm still reaches the blocked port via both connect()
and MSG_FASTOPEN. If MPTCP is meant to be in scope for CONNECT_TCP, the guard
wants `|| sk->sk_protocol == IPPROTO_MPTCP` (not sk_is_mptcp(), which is the
subflow flag).

Bryam

^ permalink raw reply

* Re: [PATCH net-next 0/2] appletalk: move the protocol out of tree
From: Finn Thain @ 2026-06-18  0:55 UTC (permalink / raw)
  To: Carsten Strotmann
  Cc: Jakub Kicinski, Carsten Strotmann, John Paul Adrian Glaubitz,
	davem, netdev, edumazet, pabeni, andrew+netdev, horms, geert,
	chleroy, npiggin, mpe, maddy, linux-mips, linux-m68k,
	linuxppc-dev
In-Reply-To: <1781694488854.956546368.818588236@strotmann.de>


On Wed, 17 Jun 2026, Carsten Strotmann wrote:

> > _Someone_ has to handle the reports and patches. And since nobody is 
> > doing that the code is going to GitHub, where it can continue to "just 
> > be left" or whatever, without racking up CVEs for the Linux kernel and 
> > leading to maintainer burn out :/
> > 
> 
> That's a good point. The large influx of reports is a problem, and burn 
> out of maintainers is a too high cost.
> 

Carsten, if, as a maintainer, you want to avoid burnout then

1) don't promise what you can't deliver (that is, decline sponsorship)

2) delegate (that is, leverage AI as an ally not as a lame excuse)

So the question remains: what is it which _can_ be delivered by and for 
the "community" (by which I mean, that group of people which includes 
actual end users -- not merely paying customers and sponsored developers).

This question has precious little to do with burnout, but it's the 
question we need to address.

^ permalink raw reply

* Re: [PATCH bpf v2] bpf, sockmap: fix use-after-free when the stream parser resizes the skb
From: Kuniyuki Iwashima @ 2026-06-18  0:25 UTC (permalink / raw)
  To: rhkrqnwk98
  Cc: bobbyeshleman, bpf, davem, edumazet, horms, jakub, john.fastabend,
	kuba, linux-kernel, netdev, pabeni
In-Reply-To: <20260612123553.2724240-1-rhkrqnwk98@gmail.com>

From: Sechang Lim <rhkrqnwk98@gmail.com>
Date: Fri, 12 Jun 2026 12:35:51 +0000
> sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
> to find the length of the next message. strparser assembles a message out
> of several received skbs by chaining them onto the head's frag_list and
> recording where to append the next one in strp->skb_nextp:
> 
> 	*strp->skb_nextp = skb;
> 	strp->skb_nextp = &skb->next;
> 
> and then calls the parser on the head:
> 
> 	len = (*strp->cb.parse_msg)(strp, head);
> 
> The parser is only meant to inspect the skb, but the program may call
> bpf_skb_change_tail() -- or the sibling bpf_skb_pull_data(),
> bpf_skb_change_head(), bpf_skb_adjust_room(), all allowed for SK_SKB.

It's bpf prog's responsibility not to abuse them.

Even setting aside that, why not simply block such BPF prog ?

It cannot be done at load time, but doable at attach time.

---8<---
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 630d530782fe..4d60b77da8ef 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4556,6 +4556,12 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 
 	switch (ptype) {
 	case BPF_PROG_TYPE_SK_SKB:
+		if (attr->attach_type == BPF_SK_SKB_STREAM_PARSER &&
+		    prog->aux->changes_pkt_data) {
+			ret = -EINVAL;
+			goto out;
+		}
+		fallthrough;
 	case BPF_PROG_TYPE_SK_MSG:
 		ret = sock_map_get_from_fd(attr, prog);
 		break;
---8<---


> Once the head carries a frag_list these go
> 
> 	... -> skb_ensure_writable -> pskb_may_pull -> __pskb_pull_tail
> 
> and __pskb_pull_tail() frees the frag_list skbs that strparser still
> tracks through skb_nextp:
> 
> 	while ((list = skb_shinfo(skb)->frag_list) != insp) {
> 		skb_shinfo(skb)->frag_list = list->next;
> 		consume_skb(list);
> 	}
> 
> strp->skb_nextp now points into a freed sk_buff. The next segment of
> the same message arrives in __strp_recv(), which links it with
> *strp->skb_nextp = skb, an 8-byte write into the freed skb. The free
> and the write happen in different __strp_recv() calls, so the message
> has to span at least three segments before it triggers.
> 
>   BUG: KASAN: slab-use-after-free in __strp_recv+0x447/0xda0
>   Write of size 8 at addr ffff88810db86140 by task repro/349
> 
>   Call Trace:
>    <IRQ>
>    __strp_recv+0x447/0xda0
>    __tcp_read_sock+0x13d/0x590
>    tcp_bpf_strp_read_sock+0x195/0x320
>    strp_data_ready+0x267/0x340
>    sk_psock_strp_data_ready+0x1ce/0x350
>    tcp_data_queue+0x1364/0x2fd0
>    tcp_rcv_established+0xe07/0x1640
>    [...]
> 
>   Allocated by task 349:
>    skb_clone+0x17b/0x210
>    __strp_recv+0x2c3/0xda0
>    __tcp_read_sock+0x13d/0x590
>    [...]
> 
>   Freed by task 349:
>    kmem_cache_free+0x150/0x570
>    __pskb_pull_tail+0x57b/0xc20
>    skb_ensure_writable+0x236/0x260
>    __bpf_skb_change_tail+0x1d4/0x590
>    sk_skb_change_tail+0x2a/0x40
>    bpf_prog_1b285dcd6c41373e+0x27/0x30
>    bpf_prog_run_pin_on_cpu+0xf3/0x260
>    sk_psock_strp_parse+0x118/0x1e0
>    __strp_recv+0x4f6/0xda0
>    [...]
> 
> The same resize also leaves the head's length inconsistent with its
> frags, so a later __pskb_pull_tail() can instead hit the
> BUG_ON(skb_copy_bits(...)) in net/core/skbuff.c.
> 
> Run the parser on a private clone of the head when the message spans more
> than one skb and the program can modify the packet
> (prog->aux->changes_pkt_data), so a resizing helper can only touch the
> clone and strparser's head and skb_nextp stay valid. Single-skb messages
> have no frag_list and read-only parsers cannot resize, so both are still
> parsed in place. If the clone cannot be allocated, return 0 so the caller
> retries on the next read rather than failing the parser.
> 
> Fixes: 8a31db561566 ("bpf: add access to sock fields and pkt data from sk_skb programs")
> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
> ---
> v2:
>  - clone only when prog->aux->changes_pkt_data (Bobby Eshleman)
>  - return 0 on clone failure instead of -ENOMEM (Bobby Eshleman)
>  - free the clone with consume_skb() instead of kfree_skb()
>  - drop the unrelated guard(rcu)() change (Bobby Eshleman)
> 
> v1:
>  - https://lore.kernel.org/all/20260609112316.3685738-1-rhkrqnwk98@gmail.com/
> 
>  net/core/skmsg.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index e1850caf1a71..97e5bc5f38c3 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -1149,9 +1149,29 @@ static int sk_psock_strp_parse(struct strparser *strp, struct sk_buff *skb)
>  	rcu_read_lock();
>  	prog = READ_ONCE(psock->progs.stream_parser);
>  	if (likely(prog)) {
> -		skb->sk = psock->sk;
> -		ret = bpf_prog_run_pin_on_cpu(prog, skb);
> -		skb->sk = NULL;
> +		struct sk_buff *parse_skb = skb;
> +
> +		/*
> +		 * strparser chains the message skbs through skb->frag_list and
> +		 * keeps a pointer into that list in strp->skb_nextp.  The parser
> +		 * program may call bpf_skb_change_tail() and friends, which go
> +		 * through __pskb_pull_tail() and free the frag_list skbs that
> +		 * strparser still tracks.  Run the program on a clone when the head
> +		 * has a frag_list and the program can modify the packet, so it
> +		 * cannot drop frags strparser owns.
> +		 */
> +		if (skb_has_frag_list(skb) && prog->aux->changes_pkt_data) {
> +			parse_skb = skb_clone(skb, GFP_ATOMIC);
> +			if (!parse_skb) {
> +				rcu_read_unlock();
> +				return 0;
> +			}
> +		}
> +		parse_skb->sk = psock->sk;
> +		ret = bpf_prog_run_pin_on_cpu(prog, parse_skb);
> +		parse_skb->sk = NULL;
> +		if (parse_skb != skb)
> +			consume_skb(parse_skb);
>  	}
>  	rcu_read_unlock();
>  	return ret;
> -- 
> 2.43.0
> 

^ permalink raw reply related

* Re: [PATCH net v2 1/1] net: ipv4: bound TCP reordering sysctl writes and MTU probe sizes
From: patchwork-bot+netdevbpf @ 2026-06-18  0:21 UTC (permalink / raw)
  To: Ren Wei
  Cc: netdev, edumazet, kuniyu, david.laight.linux, ncardwell, pabeni,
	chia-yu.chang, ij, yuuchihsu, idosch, fmancera, herbert,
	yuantan098, zcliangcn, bird, bronzed_45_vested
In-Reply-To: <1a5b7e1ef4d70fbad8c8ee0b82d8405f3c964a3d.1781395200.git.bronzed_45_vested@icloud.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Jun 2026 18:31:18 +0800 you wrote:
> From: Wyatt Feng <bronzed_45_vested@icloud.com>
> 
> Reject invalid `net.ipv4.tcp_reordering` values before they reach TCP
> socket state. The sysctl is stored as an `int` but copied into the
> `u32` `tp->reordering` field for new sockets, so negative writes wrap
> to large values.
> 
> [...]

Here is the summary with links:
  - [net,v2,1/1] net: ipv4: bound TCP reordering sysctl writes and MTU probe sizes
    https://git.kernel.org/netdev/net/c/efb8763d7bbb

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net] octeontx2-pf: Fix leak of SQ timestamp buffer on teardown
From: patchwork-bot+netdevbpf @ 2026-06-18  0:20 UTC (permalink / raw)
  To: Ratheesh Kannoth
  Cc: amakarov, davem, jesse.brandeburg, kuba, linux-kernel, netdev,
	richardcochran, andrew+netdev, edumazet, pabeni, sgoutham
In-Reply-To: <20260615030704.504536-1-rkannoth@marvell.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Jun 2026 08:37:04 +0530 you wrote:
> The send-queue timestamp ring is allocated with qmem_alloc() when
> timestamping is used, but otx2_free_sq_res() never freed sq->timestamps,
> leaking that memory across ifdown and device removal.  Add the missing
> qmem_free() alongside the other SQ companion buffers.
> 
> Fixes: c9c12d339d93 ("octeontx2-pf: Add support for PTP clock")
> Cc: Aleksey Makarov <amakarov@marvell.com>
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
> 
> [...]

Here is the summary with links:
  - [net] octeontx2-pf: Fix leak of SQ timestamp buffer on teardown
    https://git.kernel.org/netdev/net/c/a056db30de92

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net v2] sctp: hold socket lock when dumping endpoints in sctp_diag
From: patchwork-bot+netdevbpf @ 2026-06-18  0:20 UTC (permalink / raw)
  To: Xin Long
  Cc: netdev, linux-sctp, davem, kuba, edumazet, pabeni, horms,
	marcelo.leitner, w, zdi-disclosures
In-Reply-To: <4c1b49ab87e0f7d552ebd8172b364b1994e913c9.1781552190.git.lucien.xin@gmail.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Jun 2026 15:36:30 -0400 you wrote:
> SCTP_DIAG endpoint dumping was traversing endpoint address lists without
> holding lock_sock(), while those lists could change concurrently via
> socket operations (e.g., bindx changes). This creates a race where
> nla_reserve() counts addresses under RCU protection, but the subsequent
> copy may see fewer entries, potentially leaking uninitialized memory to
> userspace.
> 
> [...]

Here is the summary with links:
  - [net,v2] sctp: hold socket lock when dumping endpoints in sctp_diag
    https://git.kernel.org/netdev/net/c/7d8297e26b4e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH] net: ehea: unwind probe_port sysfs file on failure
From: patchwork-bot+netdevbpf @ 2026-06-18  0:20 UTC (permalink / raw)
  To: Pengpeng Hou
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, kees, netdev,
	linux-kernel
In-Reply-To: <20260615070033.43461-1-pengpeng@iscas.ac.cn>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Jun 2026 15:00:31 +0800 you wrote:
> ehea_create_device_sysfs() creates probe_port and then remove_port. If
> the second device_create_file() fails, the helper returns the error but
> leaves probe_port installed even though probe treats the sysfs setup as
> failed.
> 
> Remove probe_port on the remove_port creation failure path so the helper
> leaves no partial sysfs state behind.
> 
> [...]

Here is the summary with links:
  - net: ehea: unwind probe_port sysfs file on failure
    https://git.kernel.org/netdev/net/c/1c4b39746c4b

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net-next] net: pse-pd: set user byte command SUB2 field
From: patchwork-bot+netdevbpf @ 2026-06-18  0:20 UTC (permalink / raw)
  To: Robert Marko
  Cc: o.rempel, kory.maincent, andrew+netdev, davem, edumazet, kuba,
	pabeni, netdev, linux-kernel, luka.perkov
In-Reply-To: <20260611102517.445549-1-robert.marko@sartura.hr>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 11 Jun 2026 12:24:49 +0200 you wrote:
> The Set User Byte to Save command has three subject bytes.
> The PD692x0 protocol guides defines SUB2 with value 0x4e, while SUB1
> carries the NVM user byte.
> 
> Template only initialized SUB and SUB1.
> Fill SUB2 explicitly so the command matches the documented layout.
> 
> [...]

Here is the summary with links:
  - [net-next] net: pse-pd: set user byte command SUB2 field
    https://git.kernel.org/netdev/net/c/e586644d0a89

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net] net: psample: fix info leak in PSAMPLE_ATTR_DATA
From: patchwork-bot+netdevbpf @ 2026-06-18  0:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, bestswngs,
	yotam.gi, jhs, jiri
In-Reply-To: <20260616003046.1099490-1-kuba@kernel.org>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Jun 2026 17:30:46 -0700 you wrote:
> psample open codes nla_put() presumably to avoid wiping
> the data with 0s just to override it with packet data.
> This open coding is missing clearing the pad, however,
> each netlink attr is padded to 4B and data_len may
> not be divisible by 4B.
> 
> Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> 
> [...]

Here is the summary with links:
  - [net] net: psample: fix info leak in PSAMPLE_ATTR_DATA
    https://git.kernel.org/netdev/net/c/aedd02af1f8b

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net] netdev-genl: report NAPI thread PID in the caller's pid namespace
From: patchwork-bot+netdevbpf @ 2026-06-18  0:20 UTC (permalink / raw)
  To: Maoyi Xie
  Cc: davem, edumazet, kuba, pabeni, horms, daniel, razor, dw, sdf,
	dtatulea, skhawaja, netdev, linux-kernel, stable
In-Reply-To: <20260615171736.1709318-1-maoyixie.tju@gmail.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 16 Jun 2026 01:17:36 +0800 you wrote:
> netdev_nl_napi_fill_one() reports the NAPI kthread PID in NETDEV_A_NAPI_PID
> using task_pid_nr(), which returns the PID in the initial pid namespace.
> 
> NETDEV_CMD_NAPI_GET does not have GENL_ADMIN_PERM and the netdev genl family
> is netnsok, so a caller in a child pid namespace can issue it. That caller
> then sees the kthread's global PID, even though the kthread is not visible
> in its pid namespace, where the value should be 0.
> 
> [...]

Here is the summary with links:
  - [net] netdev-genl: report NAPI thread PID in the caller's pid namespace
    https://git.kernel.org/netdev/net/c/1f24c0d01db2

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH][net-next] net/mlx5: Remove broken and unused mlx5_query_mtppse()
From: patchwork-bot+netdevbpf @ 2026-06-18  0:20 UTC (permalink / raw)
  To: lirongqing
  Cc: saeedm, leon, tariqt, mbloch, andrew+netdev, davem, edumazet,
	kuba, pabeni, netdev, gal, linux-rdma, linux-kernel
In-Reply-To: <20260615140406.1828-1-lirongqing@baidu.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Jun 2026 22:04:06 +0800 you wrote:
> From: Li RongQing <lirongqing@baidu.com>
> 
> mlx5_query_mtppse() reads the Event Trigger Pin (MTPPSE) register but
> reads the returned arm and mode values from the input buffer 'in'
> instead of the output buffer 'out', so it always returns the values
> that were written rather than the actual hardware state, making the
> query useless.
> 
> [...]

Here is the summary with links:
  - [net-next] net/mlx5: Remove broken and unused mlx5_query_mtppse()
    https://git.kernel.org/netdev/net/c/b50fa1e07cf8

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net] net: ethernet: mtk_eth_soc: fix supported_interface set after phylink_create
From: patchwork-bot+netdevbpf @ 2026-06-18  0:20 UTC (permalink / raw)
  To: Christian Marangi
  Cc: nbd, lorenzo, andrew+netdev, davem, edumazet, kuba, pabeni,
	matthias.bgg, angelogioacchino.delregno, linux, daniel, netdev,
	linux-kernel, linux-arm-kernel, linux-mediatek
In-Reply-To: <20260615151106.15438-1-ansuelsmth@gmail.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Jun 2026 17:11:00 +0200 you wrote:
> Everything configured in phylink_config it's assumed to be set before
> calling phylink_create() to permit correct parsing of all the different
> modes and capabilities.
> 
> Commit 51cf06ddafc9 ("net: ethernet: mtk_eth_soc: add support for MT7988
> internal 2.5G PHY") while introducing support for 2.5G phy for MT7988,
> probably due to an auto-rebase, placed the configuration of the INTERNAL
> interface mode for the supported_interfaces for phylink_config right after
> phylink_create() introducing a possible problem with supported interfaces
> parsing.
> 
> [...]

Here is the summary with links:
  - [net] net: ethernet: mtk_eth_soc: fix supported_interface set after phylink_create
    https://git.kernel.org/netdev/net/c/e4b4d8410c7c

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net] octeontx2-af: npc: Log successful MCAM drop-on-non-hit install at debug level
From: patchwork-bot+netdevbpf @ 2026-06-18  0:20 UTC (permalink / raw)
  To: Ratheesh Kannoth
  Cc: kuba, linux-kernel, netdev, andrew+netdev, davem, edumazet,
	pabeni, sgoutham
In-Reply-To: <20260615033157.535237-1-rkannoth@marvell.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Jun 2026 09:01:57 +0530 you wrote:
> npc_install_mcam_drop_rule() used dev_err() after a successful
> rvu_mbox_handler_npc_mcam_write_entry() call, so normal installs appeared
> as errors in dmesg.  Use dev_dbg() for the success path and keep dev_err()
> for real failures.
> 
> Fixes: 3571fe07a090 ("octeontx2-af: Drop rules for NPC MCAM")
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
> 
> [...]

Here is the summary with links:
  - [net] octeontx2-af: npc: Log successful MCAM drop-on-non-hit install at debug level
    https://git.kernel.org/netdev/net/c/4f6ac65e8162

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net v6 0/7] net: require CAP_NET_ADMIN in the device netns for tunnel changelink
From: patchwork-bot+netdevbpf @ 2026-06-18  0:20 UTC (permalink / raw)
  To: Maoyi Xie
  Cc: davem, edumazet, kuba, pabeni, dsahern, steffen.klassert, herbert,
	horms, kuniyu, shaw.leon, netdev, linux-kernel, stable
In-Reply-To: <20260612085941.3158249-1-maoyixie.tju@gmail.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 12 Jun 2026 16:59:34 +0800 you wrote:
> A tunnel changelink() operates on at most two netns, dev_net(dev) and
> the tunnel link netns t->net. They differ once the device is created in
> or moved to a netns other than the one the request runs in. The rtnl
> changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a
> caller privileged there but not in the link netns can rewrite a tunnel
> that lives in the link netns. Commit 8b484efd5cb4 ("ip6: vti: Use
> ip6_tnl.net in vti6_siocdevprivate().") added the same check on the
> ioctl path. This series adds it on the RTM_NEWLINK path.
> 
> [...]

Here is the summary with links:
  - [net,v6,1/7] net: ip_gre: require CAP_NET_ADMIN in the device netns for changelink
    https://git.kernel.org/netdev/net/c/8165f7ff57d9
  - [net,v6,2/7] net: ipip: require CAP_NET_ADMIN in the device netns for changelink
    https://git.kernel.org/netdev/net/c/8211a2632466
  - [net,v6,3/7] net: ip_vti: require CAP_NET_ADMIN in the device netns for changelink
    https://git.kernel.org/netdev/net/c/95cceadbfd52
  - [net,v6,4/7] net: ip6_tunnel: require CAP_NET_ADMIN in the device netns for changelink
    https://git.kernel.org/netdev/net/c/2496fa0b7d18
  - [net,v6,5/7] net: ip6_gre: require CAP_NET_ADMIN in the device netns for changelink
    https://git.kernel.org/netdev/net/c/f00a50876d28
  - [net,v6,6/7] net: ip6_vti: require CAP_NET_ADMIN in the device netns for changelink
    https://git.kernel.org/netdev/net/c/e2ac3b242c37
  - [net,v6,7/7] xfrm: xfrm_interface: require CAP_NET_ADMIN in the device netns for changelink
    https://git.kernel.org/netdev/net/c/095515d89b19

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net-next] ionic: Change list definition method
From: Jakub Kicinski @ 2026-06-17 23:47 UTC (permalink / raw)
  To: Lei Zhu; +Cc: brett.creeley, andrew+netdev, davem, edumazet, netdev
In-Reply-To: <20260617023243.61595-1-zhulei_szu@163.com>

On Wed, 17 Jun 2026 10:32:43 +0800 Lei Zhu wrote:
> The LIST_HEAD macro can both define a linked list and initialize
> it in one step. To simplify code, we replace the separate operations
> of linked list definition and manual initialization with the LIST_HEAD
> macro.

## Form letter - net-next-closed

We have already submitted our pull request with net-next material for v7.2,
and therefore net-next is closed for new drivers, features, code refactoring
and optimizations. We are currently accepting bug fixes only.

Please repost when net-next reopens after June 29th.

RFC patches sent for review only are obviously welcome at any time.

See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
-- 
pw-bot: defer
pv-bot: closed

^ permalink raw reply

* Re: [PATCH] net: airoha: Stop TX queues on error path in airoha_dev_open
From: Jakub Kicinski @ 2026-06-17 23:44 UTC (permalink / raw)
  To: Wayen Yan
  Cc: netdev, lorenzo, horms, pabeni, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek
In-Reply-To: <178160729880.2156257.7978513589649053826@gmail.com>

On Tue, 16 Jun 2026 18:50:39 +0800 Wayen Yan wrote:
> In airoha_dev_open(), if airoha_set_vip_for_gdm_port() fails after
> netif_tx_start_all_queues() has been called, the TX queues remain
> started while the device configuration is incomplete. This leaves
> the device in an inconsistent state where packets could be
> transmitted before the VIP/IFC port configuration is complete.

Not sure if this was superseded by another posting but FWIW
this posting did not apply.

^ permalink raw reply

* Re: [PATCH] rocker: Fix memory leak in ofdpa_port_fdb()
From: Jakub Kicinski @ 2026-06-17 23:44 UTC (permalink / raw)
  To: Andrew Lunn, Jiri Pirko
  Cc: Jacob Keller, Ziran Zhang, Andrew Lunn, David S . Miller,
	Eric Dumazet, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <61892bd4-7368-4cd8-b360-0267e5c47156@lunn.ch>

On Wed, 17 Jun 2026 11:26:46 +0200 Andrew Lunn wrote:
> On Tue, Jun 16, 2026 at 04:29:59PM -0700, Jacob Keller wrote:
> > On 6/15/2026 6:32 PM, Ziran Zhang wrote:  
> > > In ofdpa_port_fdb(), the hash_del() only unlinks the node from
> > > hash table, but does not free it.
> > > 
> > > Fix this by adding kfree(found) after the !found == removing check,
> > > where the pointer value is no longer needed.
> > > 
> > > Found by Coccinelle kfree script.
> 
> Is rocker actually used any more? I'm not too sure of the history, but
> was it not added as a way to develop the early switchdev code? There
> was a qemu implementation of the 'hardware'?
> 
> Is it still useful? Should we actually just remove the driver?

I think it came up before but I don't remember the conclusion :S
We should either add rocker to NIPA or delete it. Jiri, WDYT?

^ permalink raw reply

* Re: [PATCH] net: tn40xx: fix netdev and NAPI leak in probe error paths
From: Jakub Kicinski @ 2026-06-17 23:33 UTC (permalink / raw)
  To: ZhaoJinming
  Cc: FUJITA Tomonori, Andrew Lunn, David S . Miller, Eric Dumazet,
	Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20260615064256.1068059-1-zhaojinming@uniontech.com>

On Mon, 15 Jun 2026 14:42:56 +0800 ZhaoJinming wrote:
> In tn40_probe(), after tn40_netdev_alloc() and netif_napi_add() succeed,
> none of the subsequent error paths call netif_napi_del() or free_netdev()
> to undo these operations.  On any probe failure after netif_napi_add() the
> NAPI structure (embedded in the netdev private data) remains on the
> per-netdev napi_list while the backing memory is never freed, causing:

it's devm_ allocated:

	ndev = devm_alloc_etherdev(&pdev->dev, sizeof(struct tn40_priv));

you're introducing a bug instead of fixing one..
-- 
pw-bot: reject
pv-bot: slop

^ permalink raw reply

* Re: [PATCH v3 0/3] net/smc: bound wire-controlled CDC cursors against the local buffers
From: Jakub Kicinski @ 2026-06-17 23:24 UTC (permalink / raw)
  To: Bryam Vargas via B4 Relay
  Cc: hexlabsecurity, Wenjia Zhang, Dust Li, D. Wythe, Sidraya Jayagond,
	Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
	Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
	Paolo Abeni, linux-kernel, linux-rdma, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-0-551fa514257e@proton.me>

On Sun, 14 Jun 2026 03:23:29 -0500 Bryam Vargas via B4 Relay wrote:
> A peer's CDC producer/consumer cursors are copied from the wire and used,
> without an upper bound against the local buffers, as (a) a raw index into the
> RMB on the urgent path, (b) the receive length in smc_rx_recvmsg(), and (c) the
> send length in smc_tx_sendmsg() on the SMC-D DMB-merge path.  A malicious or
> buggy peer can forge a cursor so each of these runs past the relevant buffer:
> an out-of-bounds read of adjacent kernel memory (disclosed to the peer) on the
> receive/urgent side, and an out-of-bounds write of attacker-influenced length
> and content on the send side.

Once again, SMC maintainers -- please review.
-- 
mping: SHARED MEMORY COMMUNICATIONS (SMC) SOCKETS

^ permalink raw reply

* Re: [PATCH net v3] net: airoha: Fix skb->priority underflow in airoha_dev_select_queue()
From: Jakub Kicinski @ 2026-06-17 23:19 UTC (permalink / raw)
  To: lorenzo
  Cc: Wayen Yan, netdev, horms, pabeni, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek
In-Reply-To: <178161373805.2167512.2544164327472822616@gmail.com>

On Sun, 14 Jun 2026 07:30:54 +0800 Wayen Yan wrote:
> In airoha_dev_select_queue(), the expression:
> 
>   queue = (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES;
> 
> implicitly converts to unsigned arithmetic: when skb->priority is 0
> (the default for unclassified traffic), (0u - 1u) wraps to UINT_MAX,
> and UINT_MAX % 8 = 7, routing default best-effort packets to the
> highest-priority QoS queue. This causes QoS inversion where the
> majority of traffic on a PON gateway starves actual high-priority
> flows (VoIP, gaming, etc.).
> 
> Fix by guarding the subtraction: when priority is 0, map to queue 0
> (lowest priority), otherwise apply the original (priority - 1) % 8
> mapping.
> 
> Fixes: 2b288b81560b ("net: airoha: Introduce ndo_select_queue callback")
> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
> Reviewed-by: Joe Damato <joe@dama.to>
> Signed-off-by: Wayen Yan <win847@gmail.com>
> ---
>  drivers/net/ethernet/airoha/airoha_eth.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 31cdb11cd7..d476ef83c3 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -1933,7 +1933,7 @@ static u16 airoha_dev_select_queue(struct net_device *dev, struct sk_buff *skb,
>  	 */
>  	channel = netdev_uses_dsa(dev) ? skb_get_queue_mapping(skb) : port->id;
>  	channel = channel % AIROHA_NUM_QOS_CHANNELS;
> -	queue = (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES; /* QoS queue */
> +	queue = skb->priority ? (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES : 0;

Hi Lorenzo, is there a reason we're subtracting 1 here in the first
place? Could be just me, but may be worth adding a comment here.

Intuitively if we are "narrowing" 16 prios to 8 queues it'd make most
sense to group the adjacent ones -- divide by two.

Please respin with some sort of an explanation..

>  	queue = channel * AIROHA_NUM_QOS_QUEUES + queue;
>  
>  	return queue < dev->num_tx_queues ? queue : 0;
-- 
pw-bot: cr

^ permalink raw reply

* Re: [ANN] netdev development stats for 7.2
From: Jacob Keller @ 2026-06-17 23:19 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
In-Reply-To: <20260617115319.43a5942d@kernel.org>

On 6/17/2026 11:53 AM, Jakub Kicinski wrote:
> Top scores (positive):               Top scores (negative):              
>    1 (   ) [768] Jakub Kicinski         1 ( +1) [91] Tariq Toukan        
>    2 (   ) [376] Simon Horman           2 ( +8) [86] Wei Fang            
>    3 (   ) [346] Andrew Lunn            3 ( +4) [67] Ratheesh Kannoth    
>    4 (   ) [265] Paolo Abeni            4 (***) [54] javen               
>    5 ( +4) [ 91] Ido Schimmel           5 ( +6) [49] Lorenzo Bianconi    
>    6 (+14) [ 74] David Laight           6 (***) [48] Luiz Angelo Daros de Luca
>    7 (   ) [ 62] Krzysztof Kozlowski    7 (***) [43] Simon Wunderlich    
>    8 ( +2) [ 57] Aleksandr Loktionov    8 (***) [38] Chuck Lever         
>    9 (+12) [ 50] Nikolay Aleksandrov    9 (+18) [38] Grzegorz Nitka      
>   10 ( -4) [ 49] Willem de Bruijn      10 (***) [35] Pablo Neira Ayuso   
>   11 ( +3) [ 49] Sabrina Dubroca       11 (***) [35] Markus Stockhausen  
>   12 (+41) [ 47] Alexander Lobakin     12 (***) [34] Selvamani Rajagopal 
>   13 (+24) [ 47] Maxime Chevallier     13 (***) [34] Jason Xing          
>   14 ( -6) [ 46] David Ahern           14 ( -8) [33] Illusion Wang       
>   15 (***) [ 43] Jiayuan Chen          15 (***) [30] Minxi Hou       
>  
> One process note on the reviewer score. Tariq tops the negative list. 
> I've been returning to the question of whether it's fair since 
> he has to handle submissions of most of nVidia's patches.
> Still, I don't understand why reading thru the list and reviewing
> one patchset from another company a day is too much to ask.
> 

This is a difficult question. When I've covered for Tony in a similar
position, I've felt like it is hard enough to keep an eye on our own
list let alone also finding time to review other places.

A positive note here is that nVidia is now green overall, so at least
there is some participation from the company as a whole. On the other
hand, Tony isn't in the top negatives despite performing a somewhat
similar role.

I know I was lacking myself in the last cycle due to a bunch of
unrelated work and issues. I've been working to get review back into my
daily flow.

^ permalink raw reply

* Re: [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Gustavo A. R. Silva @ 2026-06-17 22:59 UTC (permalink / raw)
  To: Ilya Maximets, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kees Cook, Gustavo A. R. Silva, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt, linux-kernel,
	linux-hardening, llvm, Johan Thomsen
In-Reply-To: <b16e8bac-e149-4052-b1cb-8fd3e1137f9c@ovn.org>



On 6/17/26 16:01, Ilya Maximets wrote:
> On 6/17/26 10:08 PM, Gustavo A. R. Silva wrote:
>> Hi,
>>
>> On 6/16/26 04:03, Ilya Maximets wrote:
>>> kmalloc_flex() in metadata_dst_alloc() sets __counted_by for the
>>> structure to the options_len, which is then initialized to zero.
>>> Later, we're initializing the structure by copying the tunnel info
>>> together with the options, and this triggers a warning for a potential
>>> memcpy overflow, since the compiler estimates that the options can't
>>> fit into the structure, even though the memory for them is actually
>>> allocated.
>>>
>>>    memcpy: detected buffer overflow: 104 byte write of buffer size 96
>>>    WARNING: CPU: X PID: Y at lib/string_helpers.c:1036 __fortify_report
>>>     skb_tunnel_info_unclone+0x179/0x190
>>>     geneve_xmit+0x7fe/0xe00
>>
>> This warning has nothing to do with counted_by. See below for more
>> comments.
>>
>>>
>>> The issue is triggered when built with clang and source fortification.
>>>
>>> Fix that by doing the copy in two stages: first - the main data with
>>> the options_len, then the options.  This way the correct length should
>>> be known at the time of the copy.
>>>
>>> It would be better if the options_len never changed after allocation,
>>> but the allocation code is a little separate from the initialization
>>> and it would be awkward and potentially dangerous to return a struct
>>> with options_len set to a non-zero value from the metadata_dst_alloc().
>>>
>>> Another option would be to use ip_tunnel_info_opts_set(), but it is
>>> doing too many unnecessary operations for the use case here.
>>>
>>> Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
>>> Reported-by: Johan Thomsen <write@ownrisk.dk>
>>> Closes: https://lore.kernel.org/netdev/CAKv6aAM8_EWgXScnKmKYm_4SwGDVBK++dzfP+Y6msUXbp99QUw@mail.gmail.com/
>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>> ---
>>>
>>> Johan, if you can test this one in your setup as well, that would
>>> be great.  Thanks.
>>>
>>>    include/net/dst_metadata.h | 7 +++++--
>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>>> index 1fc2fb03ce3f..f45d1e3163f0 100644
>>> --- a/include/net/dst_metadata.h
>>> +++ b/include/net/dst_metadata.h
>>> @@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>>>    	if (!new_md)
>>>    		return ERR_PTR(-ENOMEM);
>>>    
>>> -	memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>>> -	       sizeof(struct ip_tunnel_info) + md_size);
>>
>> What's going on here is that, internally, fortified memcpy() retrieves
>> the destination size via __builtin_dynamic_object_size() in mode 1.
>>
>> That is:
>>
>> __builtin_dynamic_object_size(&new_md->u.tun_info, 1)
>>
>> For the above case, Clang returns sizeof(new_md->u.tun_info) == 96.
>>
>> So the warning is reporting that 104 bytes don't fit in an object of
>> size 96 bytes, regardless of any counted_by annotation or allocation.
> 
> Hmm.  Does __builtin_dynamic_object_size(&new_md->u.tun_info, 1) return
> 104 when the options_len is 8?  If so, isn't that because it is counted
> by that field?  Asking because the fortification doesn't complain if we
> keep the full 104-byte copy as-is, but set the options_len beforehand,
> as tested by Johan.

I see. If that is the case, then, internally, fortified memcpy() ends up
using mode 0 instead of mode 1. Something like this:

__builtin_dynamic_object_size(&new_md->u.tun_info, 0)

The above will effectively consider the allocation and counted_by because
it will interpret new_md->u.tun_info as an open-ended object due to the
flexible-array member (in struct ip_tunnel_info) whose size is determined
by counted_by.

I'm not entirely convinced we really want this.

-Gustavo

> 
>>
>> Of course, in this case, the write of 104 bytes into new_md->u.tun_info
>> is intentional and controlled, but what if it weren't?
>>
>> On the other hand, for this same case, GCC currently returns SIZE_MAX,
>> which translates to -1 inside fortified memcpy(). Thus, bounds-checking
>> is bypassed, which is why this warning doesn't show up with GCC.
>>
>> However, this is a bug in GCC. We're already looking into that.
>>
>> I think we've had just a handful of cases like this across the whole
>> kernel tree. We can deal with them as you did here (by directly copying
>> the composite structure first, and then using memcpy() to copy into the
>> flexible-array member). If these cases ever become more common, we
>> could create some kind of helper to do both things at once. :)
>>
>>> +	/* Copy in two stages to keep the __counted_by happy. */
>>
>> So based on my comments above, this code comment is not correct.
> 
> I feel like some comment is still needed, do you have some suggestions
> for what would be a better wording?
> 
>>
>>> +	new_md->u.tun_info = md_dst->u.tun_info;
>>
>> This is fine.
>>
>>> +	memcpy(ip_tunnel_info_opts(&new_md->u.tun_info),
>>> +	       ip_tunnel_info_opts(&md_dst->u.tun_info), md_size);
>>
>> Is ip_tunnel_info_opts() really needed here?
>>
>> Probably this works just fine:
>>
>> memcpy(new_md->u.tun_info.options, md_dst->u.tun_info.options, md_size);
> 
> The logic here is: we have the access function, therefore we should use it.
> It gives a bad example if we don't.
> 
> Best regards, Ilya Maximets.


^ permalink raw reply

* Re: [PATCH v14 6/9] tls: device: add TX KeyUpdate support
From: Jakub Kicinski @ 2026-06-17 22:56 UTC (permalink / raw)
  To: Rishikesh Jethwani
  Cc: netdev, saeedm, tariqt, mbloch, borisp, john.fastabend, sd, davem,
	pabeni, edumazet, leon
In-Reply-To: <CAKaoeS3mTGUo-dvtdmQfz3JAnn69-e36xrj9YmjbWTnsiw9uqg@mail.gmail.com>

On Wed, 17 Jun 2026 15:32:58 -0700 Rishikesh Jethwani wrote:
> From: Rishikesh Jethwani <rjethwani@everpuredata.com>

Please keep in mind that net-next is currently closed.
You need to wait until the merge window is over (2 weeks)
before reposting.

^ permalink raw reply

* [PATCH bpf-next v3 3/3] selftests/bpf: Add bpf_fib_lookup() VLAN flag tests
From: Avinash Duduskar @ 2026-06-17 22:47 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: ameryhung, a.s.protopopov, bpf, davem, dsahern, eddyz87, edumazet,
	emil, eyal.birger, hawk, horms, john.fastabend, jolsa, kpsingh,
	kuba, leon.hwang, linux-kernel, linux-kselftest, martin.lau,
	memxor, netdev, pabeni, rongtao, sdf, shuah, song, toke, yatsenko,
	yonghong.song
In-Reply-To: <20260617224729.1428662-1-avinash.duduskar@gmail.com>

Cover both directions of the new VLAN flags in the fib_lookup test,
36 table cases plus a dedicated cross-netns subtest.

For BPF_FIB_LOOKUP_VLAN the egress cases assert: without the flag the
lookup returns the VLAN netdev's ifindex and zeroed vlan fields, with
the flag it returns the parent's ifindex plus the tag (including via
a neighbour resolved on the VLAN device, in OUTPUT mode, over a bond,
and through a DIRECT|TBID table), with the flag on a non-VLAN egress
it changes nothing, for a stacked VLAN it leaves ifindex untouched
with the vlan fields zero, and a frag-needed return reports the route
mtu in mtu_result while leaving the swap unwritten.

For BPF_FIB_LOOKUP_VLAN_INPUT, an iif rule on the subinterface routes
the same destination to a different gateway, so the asserted gateway
shows which device the lookup used as ingress: without the flag the
main table answers, with a matching tag the subinterface's table
does, with or without SKIP_NEIGH, and BPF_FIB_LOOKUP_SRC selects the
subinterface's address. A VRF-enslaved subinterface selects the VRF
table through the l3mdev rule and, with DIRECT, through
l3mdev_fib_table_rcu(). One case sets BPF_FIB_LOOKUP_VLAN as well and
asserts both directions work in a single lookup. Resolution semantics
are pinned: an 802.1ad tag resolves its device, PCP and DEI bits in
h_vlan_TCI are ignored, a VLAN ifindex resolves the inner QinQ
device, a tag on a bond master resolves while the same tag on the
bond port does not.

The error cases assert -EINVAL for an invalid h_vlan_proto on both
address families, for the TBID and OUTPUT flag combinations and for
an unknown flag bit, and BPF_FIB_LKUP_RET_NOT_FWDED for a VID with no
configured device on both families, for a VID-0 priority tag and for
a device that exists but is down. The failure cases also assert that
params is left untouched.

A separate subtest moves a VLAN device into a second netns while it
stays registered on its parent, and checks both directions refuse to
cross the boundary: the input flag fails closed with the tag and
ifindex untouched, and the egress flag does not publish the foreign
parent's ifindex.

The tbid read-back check is skipped for DIRECT cases that set
BPF_FIB_LOOKUP_VLAN, since a successful swap packs the vlan fields
into the union the check reads.

Re-run the cases through bpf_xdp_fib_lookup() as well: the egress flag
exists because VLAN devices have no XDP xmit, so XDP is the primary
consumer. bpf_prog_test_run uses the netns' loopback for the xdp context's
device, so the lookup runs against the test netns' FIB, and the
path-independent results (return code, swapped ifindex, vlan tag, gateway)
are asserted to match the skb path.

Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 .../selftests/bpf/prog_tests/fib_lookup.c     | 554 +++++++++++++++++-
 .../testing/selftests/bpf/progs/fib_lookup.c  |   9 +
 2 files changed, 559 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/fib_lookup.c b/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
index bd7658958004..987a691fe078 100644
--- a/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
+++ b/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
 
 #include <linux/rtnetlink.h>
+#include <linux/if_ether.h>
 #include <sys/types.h>
 #include <net/if.h>
 
@@ -37,6 +38,41 @@
 #define IPV6_LOCAL		"fd01::3"
 #define IPV6_GW1		"fd01::1"
 #define IPV6_GW2		"fd01::2"
+#define VLAN_ID			100
+#define VLAN_IFACE		"veth1.100"
+#define VLAN_ID_DOWN		102
+#define VLAN_IFACE_DOWN		"veth1.102"
+#define QINQ_OUTER_IFACE	"veth1.200"
+#define QINQ_INNER_IFACE	"veth1.200.300"
+#define VLAN_TABLE		"300"
+#define IPV4_VLAN_IFACE_ADDR	"10.5.0.254"
+#define IPV4_VLAN_EGRESS_DST	"10.5.0.2"
+#define IPV4_QINQ_DST		"10.7.0.2"
+#define IPV4_VLAN_DST		"10.6.0.2"
+#define IPV4_VLAN_GW		"10.5.0.1"
+#define IPV6_VLAN_IFACE_ADDR	"fd02::254"
+#define IPV6_VLAN_EGRESS_DST	"fd02::2"
+#define IPV6_VLAN_DST		"fd03::2"
+#define IPV6_VLAN_GW		"fd02::1"
+#define VLAN_VID_UNUSED		999
+#define VRF_IFACE		"vrf-blue"
+#define VRF_TABLE		"1000"
+#define VRF_VLAN_ID		101
+#define VRF_VLAN_IFACE		"veth1.101"
+#define IPV4_VRF_IFACE_ADDR	"10.8.0.254"
+#define IPV4_VRF_GW		"10.8.0.1"
+#define IPV4_VRF_DST		"10.9.0.2"
+#define TBID_VLAN_ID		50
+#define TBID_VLAN_IFACE		"veth2.50"
+#define IPV4_TBID_VLAN_DST	"172.2.0.2"
+#define IPV4_BOND_VLAN_DST	"10.11.0.2"
+#define IPV4_VLAN_MTU_DST	"10.5.9.2"
+#define QINQ_AD_VLAN_ID		200
+#define QINQ_INNER_VLAN_ID	300
+#define BOND_IFACE		"bond99"
+#define BOND_PORT		"veth3"
+#define BOND_PORT_PEER		"veth4"
+#define BOND_VLAN_ID		500
 #define DMAC			"11:11:11:11:11:11"
 #define DMAC_INIT { 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, }
 #define DMAC2			"01:01:01:01:01:01"
@@ -52,6 +88,17 @@ struct fib_lookup_test {
 	__u32 tbid;
 	__u8 dmac[6];
 	__u32 mark;
+	/*
+	 * input tag with BPF_FIB_LOOKUP_VLAN_INPUT; expected output tag
+	 * with BPF_FIB_LOOKUP_VLAN (checked when check_vlan is set)
+	 */
+	__u16 vlan_proto;
+	__u16 vlan_id;
+	bool check_vlan;
+	const char *expected_dev; /* expected params->ifindex after lookup */
+	const char *iif;	  /* override the default veth1 input device */
+	__u16 tot_len;		  /* triggers the in-lookup mtu check when set */
+	__u16 expected_mtu;	  /* expected mtu_result (union with tot_len) */
 };
 
 static const struct fib_lookup_test tests[] = {
@@ -142,6 +189,209 @@ static const struct fib_lookup_test tests[] = {
 	  .expected_dst = IPV6_GW1,
 	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
 	  .mark = MARK, },
+	/* vlan egress resolution */
+	{ .desc = "IPv4 VLAN egress, no flag",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = VLAN_IFACE, .check_vlan = true, },
+	{ .desc = "IPv4 VLAN egress, single VLAN",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	/*
+	 * skb path without tot_len: mtu_result is the FIB result (VLAN)
+	 * device's mtu (1400) with or without the swap, not the parent's (1500)
+	 */
+	{ .desc = "IPv4 VLAN egress, skb-path mtu is the VLAN device's without the flag",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = VLAN_IFACE, .check_vlan = true, .expected_mtu = 1400, },
+	{ .desc = "IPv4 VLAN egress, skb-path mtu stays the VLAN device's after the swap",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .expected_mtu = 1400, },
+	{ .desc = "IPv4 VLAN egress, flag set but egress is not a VLAN",
+	  .daddr = IPV4_NUD_FAILED_ADDR, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true, },
+	{ .desc = "IPv4 VLAN egress, stacked VLAN untouched",
+	  .daddr = IPV4_QINQ_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = QINQ_INNER_IFACE, .check_vlan = true, },
+	{ .desc = "IPv6 VLAN egress, single VLAN",
+	  .daddr = IPV6_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, neighbour on the VLAN device",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .dmac = DMAC_INIT, },
+	{ .desc = "IPv4 VLAN egress in OUTPUT mode",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .iif = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_OUTPUT | BPF_FIB_LOOKUP_VLAN |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress over a bond",
+	  .daddr = IPV4_BOND_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = BOND_IFACE, .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress via TBID table",
+	  .daddr = IPV4_TBID_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID |
+			  BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .tbid = 100,
+	  .expected_dev = "veth2", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = TBID_VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, success writes mtu_result with the swap",
+	  .daddr = IPV4_VLAN_MTU_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .tot_len = 500, .expected_mtu = 1000,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, FRAG_NEEDED reports mtu, swap unwritten",
+	  .daddr = IPV4_VLAN_MTU_DST, .expected_ret = BPF_FIB_LKUP_RET_FRAG_NEEDED,
+	  .tot_len = 1400, .expected_mtu = 1000,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true, },
+	/* vlan tag as lookup input */
+	{ .desc = "IPv4 VLAN input, no flag",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_GW1,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input, tag selects subinterface route",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv6 VLAN input, tag selects subinterface route",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV6_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input and egress combined",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = "veth1",
+	  .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_VLAN |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, neighbour resolved on the route",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .dmac = DMAC_INIT2, },
+	{ .desc = "IPv4 VLAN input, source address from the subinterface",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_src = IPV4_VLAN_IFACE_ADDR,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SRC |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	/*
+	 * VRF: the resolved subinterface is enslaved, so the l3mdev rule
+	 * (full lookup) and l3mdev_fib_table_rcu() (DIRECT) must select
+	 * the VRF table from the resolved ingress
+	 */
+	{ .desc = "IPv4 VLAN input, VRF subinterface, no flag",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_GW1,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input, tag selects VRF table",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VRF_GW, .expected_dev = VRF_VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VRF_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, DIRECT uses VRF table from resolved ingress",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VRF_GW, .expected_dev = VRF_VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_DIRECT |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VRF_VLAN_ID, },
+	/*
+	 * failure arms also assert params is left untouched: ifindex still
+	 * names the physical device and the input tag bytes survive
+	 */
+	{ .desc = "IPv4 VLAN input, invalid proto",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = 0x1234, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, unmatched VID",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_VID_UNUSED, },
+	{ .desc = "IPv4 VLAN input, subinterface down",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID_DOWN, },
+	/*
+	 * the resolver runs before the forwarding check, so on devices
+	 * with forwarding off FWD_DISABLED (not NOT_FWDED) proves the tag
+	 * resolved to that device and the lookup used it as ingress
+	 */
+	{ .desc = "IPv4 VLAN input, 802.1ad tag",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021AD, .vlan_id = QINQ_AD_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, PCP and DEI bits ignored in TCI",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = 0xe000 | VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, inner QinQ device from VLAN ifindex",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .iif = QINQ_OUTER_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = QINQ_INNER_VLAN_ID, },
+	/*
+	 * bonding: the VLANs live on the master, as on receive, where the
+	 * frame is steered to the master before VLAN processing; a port
+	 * ifindex does not match (ports carry vid state but no VLAN devs)
+	 */
+	{ .desc = "IPv4 VLAN input, tag on bond master resolves",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .iif = BOND_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, tag on bond port does not match",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .iif = BOND_PORT, .expected_dev = BOND_PORT, .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv6 VLAN input, invalid proto",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = -EINVAL,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = 0x1234, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, VID 0 priority tag fails closed",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = 0, },
+	{ .desc = "IPv6 VLAN input, unmatched VID",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_VID_UNUSED, },
+	{ .desc = "unknown flag bit rejected",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = (1 << 14) | BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input rejected with TBID",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_TBID,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input rejected with OUTPUT",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_OUTPUT,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
 };
 
 static int setup_netns(void)
@@ -204,6 +454,110 @@ static int setup_netns(void)
 	SYS(fail, "ip rule add prio 2 fwmark %d lookup %s", MARK, MARK_TABLE);
 	SYS(fail, "ip -6 rule add prio 2 fwmark %d lookup %s", MARK, MARK_TABLE);
 
+	/*
+	 * Setup for vlan tests: a subinterface for egress resolution and
+	 * tag-as-input, a QinQ stack, and an iif rule so the input tests
+	 * observe which device the lookup used as ingress.
+	 */
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VLAN_IFACE, VLAN_ID);
+	SYS(fail, "ip link set dev %s up", VLAN_IFACE);
+	/*
+	 * lower than the veth1 parent (1500): the skb-path mtu check uses the
+	 * FIB result (VLAN) device, so mtu_result is this value with or
+	 * without the egress swap, which two arms below pin
+	 */
+	SYS(fail, "ip link set dev %s mtu 1400", VLAN_IFACE);
+	SYS(fail, "ip addr add %s/24 dev %s", IPV4_VLAN_IFACE_ADDR, VLAN_IFACE);
+	SYS(fail, "ip addr add %s/64 dev %s nodad", IPV6_VLAN_IFACE_ADDR, VLAN_IFACE);
+
+	/*
+	 * stays down: the input flag must treat its tag the way real
+	 * ingress treats a frame arriving on a down VLAN device (drop)
+	 */
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VLAN_IFACE_DOWN, VLAN_ID_DOWN);
+
+	err = write_sysctl("/proc/sys/net/ipv4/conf/" VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.conf." VLAN_IFACE ".forwarding)"))
+		goto fail;
+
+	err = write_sysctl("/proc/sys/net/ipv6/conf/" VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv6.conf." VLAN_IFACE ".forwarding)"))
+		goto fail;
+
+	SYS(fail, "ip link add link veth1 name %s type vlan proto 802.1ad id 200",
+	    QINQ_OUTER_IFACE);
+	SYS(fail, "ip link add link %s name %s type vlan id 300",
+	    QINQ_OUTER_IFACE, QINQ_INNER_IFACE);
+	SYS(fail, "ip link set dev %s up", QINQ_OUTER_IFACE);
+	SYS(fail, "ip link set dev %s up", QINQ_INNER_IFACE);
+	SYS(fail, "ip route add %s/32 dev %s", IPV4_QINQ_DST, QINQ_INNER_IFACE);
+
+	SYS(fail, "ip route add %s/32 via %s", IPV4_VLAN_DST, IPV4_GW1);
+	SYS(fail, "ip route add table %s %s/32 via %s",
+	    VLAN_TABLE, IPV4_VLAN_DST, IPV4_VLAN_GW);
+	SYS(fail, "ip rule add prio 3 iif %s lookup %s", VLAN_IFACE, VLAN_TABLE);
+	SYS(fail, "ip -6 route add %s/128 via %s", IPV6_VLAN_DST, IPV6_GW1);
+	SYS(fail, "ip -6 route add table %s %s/128 via %s",
+	    VLAN_TABLE, IPV6_VLAN_DST, IPV6_VLAN_GW);
+	SYS(fail, "ip -6 rule add prio 3 iif %s lookup %s", VLAN_IFACE, VLAN_TABLE);
+
+	/*
+	 * a bond with one port and a VLAN on the bond: VLANs on a bond
+	 * live on the master, so resolution succeeds for the master's
+	 * ifindex and fails closed for a port's, matching receive, which
+	 * steers the frame to the master before VLAN processing
+	 */
+	SYS(fail, "ip link add %s type bond", BOND_IFACE);
+	SYS(fail, "ip link add %s type veth peer name %s", BOND_PORT, BOND_PORT_PEER);
+	SYS(fail, "ip link set %s master %s", BOND_PORT, BOND_IFACE);
+	SYS(fail, "ip link set dev %s up", BOND_IFACE);
+	SYS(fail, "ip link set dev %s up", BOND_PORT);
+	SYS(fail, "ip link add link %s name %s.%d type vlan id %d",
+	    BOND_IFACE, BOND_IFACE, BOND_VLAN_ID, BOND_VLAN_ID);
+	SYS(fail, "ip link set dev %s.%d up", BOND_IFACE, BOND_VLAN_ID);
+	SYS(fail, "ip route add %s/32 dev %s.%d",
+	    IPV4_BOND_VLAN_DST, BOND_IFACE, BOND_VLAN_ID);
+
+	/*
+	 * a VRF with its own dedicated subinterface (the iif rules above
+	 * must not see it), for the table-selection-by-ingress cases
+	 */
+	SYS(fail, "ip link add %s type vrf table %s", VRF_IFACE, VRF_TABLE);
+	SYS(fail, "ip link set dev %s up", VRF_IFACE);
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VRF_VLAN_IFACE, VRF_VLAN_ID);
+	SYS(fail, "ip link set %s master %s", VRF_VLAN_IFACE, VRF_IFACE);
+	SYS(fail, "ip link set dev %s up", VRF_VLAN_IFACE);
+	SYS(fail, "ip addr add %s/24 dev %s", IPV4_VRF_IFACE_ADDR, VRF_VLAN_IFACE);
+	err = write_sysctl("/proc/sys/net/ipv4/conf/" VRF_VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.conf." VRF_VLAN_IFACE ".forwarding)"))
+		goto fail;
+	SYS(fail, "ip route add %s/32 via %s", IPV4_VRF_DST, IPV4_GW1);
+	SYS(fail, "ip route add table %s %s/32 via %s",
+	    VRF_TABLE, IPV4_VRF_DST, IPV4_VRF_GW);
+
+	/* neighbours on the VLAN subinterface for the non-SKIP_NEIGH cases */
+	err = write_sysctl("/proc/sys/net/ipv4/neigh/" VLAN_IFACE "/gc_stale_time", "900");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.neigh." VLAN_IFACE ".gc_stale_time)"))
+		goto fail;
+	SYS(fail, "ip neigh add %s dev %s lladdr %s nud stale",
+	    IPV4_VLAN_EGRESS_DST, VLAN_IFACE, DMAC);
+	SYS(fail, "ip neigh add %s dev %s lladdr %s nud stale",
+	    IPV4_VLAN_GW, VLAN_IFACE, DMAC2);
+
+	/* a VLAN on veth2 with a route in the tbid test table */
+	SYS(fail, "ip link add link veth2 name %s type vlan id %d",
+	    TBID_VLAN_IFACE, TBID_VLAN_ID);
+	SYS(fail, "ip link set dev %s up", TBID_VLAN_IFACE);
+	SYS(fail, "ip route add table 100 %s/32 dev %s",
+	    IPV4_TBID_VLAN_DST, TBID_VLAN_IFACE);
+
+	/* a locked-mtu route via the subinterface for the FRAG_NEEDED case */
+	SYS(fail, "ip route add %s/32 dev %s mtu lock 1000",
+	    IPV4_VLAN_MTU_DST, VLAN_IFACE);
+
 	return 0;
 fail:
 	return -1;
@@ -218,9 +572,16 @@ static int set_lookup_params(struct bpf_fib_lookup *params,
 	memset(params, 0, sizeof(*params));
 
 	params->l4_protocol = IPPROTO_TCP;
-	params->ifindex = ifindex;
+	params->ifindex = test->iif ? if_nametoindex(test->iif) : ifindex;
 	params->tbid = test->tbid;
 	params->mark = test->mark;
+	params->tot_len = test->tot_len;
+
+	/* h_vlan_proto/h_vlan_TCI union with tbid */
+	if (test->lookup_flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		params->h_vlan_proto = htons(test->vlan_proto);
+		params->h_vlan_TCI = htons(test->vlan_id);
+	}
 
 	if (inet_pton(AF_INET6, test->daddr, params->ipv6_dst) == 1) {
 		params->family = AF_INET6;
@@ -298,7 +659,7 @@ void test_fib_lookup(void)
 	struct nstoken *nstoken = NULL;
 	struct __sk_buff skb = { };
 	struct fib_lookup *skel;
-	int prog_fd, err, ret, i;
+	int prog_fd, xdp_fd, err, ret, i;
 
 	/* The test does not use the skb->data, so
 	 * use pkt_v6 for both v6 and v4 test.
@@ -309,11 +670,16 @@ void test_fib_lookup(void)
 		    .ctx_in = &skb,
 		    .ctx_size_in = sizeof(skb),
 	);
+	LIBBPF_OPTS(bpf_test_run_opts, xdp_opts,
+		    .data_in = &pkt_v6,
+		    .data_size_in = sizeof(pkt_v6),
+	);
 
 	skel = fib_lookup__open_and_load();
 	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
 		return;
 	prog_fd = bpf_program__fd(skel->progs.fib_lookup);
+	xdp_fd = bpf_program__fd(skel->progs.fib_lookup_xdp);
 
 	SYS(fail, "ip netns add %s", NS_TEST);
 
@@ -352,6 +718,21 @@ void test_fib_lookup(void)
 		if (tests[i].expected_dst)
 			assert_dst_ip(fib_params, tests[i].expected_dst);
 
+		if (tests[i].expected_dev)
+			ASSERT_EQ(fib_params->ifindex,
+				  if_nametoindex(tests[i].expected_dev), "ifindex");
+
+		if (tests[i].expected_mtu)
+			ASSERT_EQ(fib_params->mtu_result, tests[i].expected_mtu,
+				  "mtu_result");
+
+		if (tests[i].check_vlan) {
+			ASSERT_EQ(fib_params->h_vlan_proto,
+				  htons(tests[i].vlan_proto), "h_vlan_proto");
+			ASSERT_EQ(fib_params->h_vlan_TCI,
+				  htons(tests[i].vlan_id), "h_vlan_TCI");
+		}
+
 		ret = memcmp(tests[i].dmac, fib_params->dmac, sizeof(tests[i].dmac));
 		if (!ASSERT_EQ(ret, 0, "dmac not match")) {
 			char expected[18], actual[18];
@@ -361,17 +742,182 @@ void test_fib_lookup(void)
 			printf("dmac expected %s actual %s ", expected, actual);
 		}
 
-		// ensure tbid is zero'd out after fib lookup.
-		if (tests[i].lookup_flags & BPF_FIB_LOOKUP_DIRECT) {
+		/*
+		 * ensure tbid is zero'd out after fib lookup. With
+		 * BPF_FIB_LOOKUP_VLAN the union holds the packed vlan
+		 * fields instead, so skip the check for those.
+		 */
+		if ((tests[i].lookup_flags & BPF_FIB_LOOKUP_DIRECT) &&
+		    !(tests[i].lookup_flags & BPF_FIB_LOOKUP_VLAN)) {
 			if (!ASSERT_EQ(skel->bss->fib_params.tbid, 0,
 					"expected fib_params.tbid to be zero"))
 				goto fail;
 		}
 	}
 
+	/*
+	 * Re-run the cases through bpf_xdp_fib_lookup(). test_run uses the
+	 * current netns' loopback for ctx->rxq->dev, so dev_net() is NS_TEST
+	 * and the lookup runs against its FIB. The path-independent results
+	 * (return code, swapped ifindex, vlan tag, gateway) must match the skb
+	 * path; the no-tot_len mtu_result is skb-specific and not rechecked.
+	 */
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		if (set_lookup_params(fib_params, &tests[i], skb.ifindex))
+			continue;
+
+		skel->bss->fib_lookup_ret = -1;
+		skel->bss->lookup_flags = tests[i].lookup_flags;
+
+		err = bpf_prog_test_run_opts(xdp_fd, &xdp_opts);
+		if (!ASSERT_OK(err, "xdp test_run"))
+			continue;
+
+		if (!ASSERT_EQ(skel->bss->fib_lookup_ret, tests[i].expected_ret,
+			       "xdp fib_lookup_ret"))
+			printf("(xdp) %s\n", tests[i].desc);
+
+		if (tests[i].expected_dev)
+			ASSERT_EQ(fib_params->ifindex,
+				  if_nametoindex(tests[i].expected_dev),
+				  "xdp ifindex");
+
+		if (tests[i].expected_dst)
+			assert_dst_ip(fib_params, tests[i].expected_dst);
+
+		if (tests[i].check_vlan) {
+			ASSERT_EQ(fib_params->h_vlan_proto,
+				  htons(tests[i].vlan_proto), "xdp h_vlan_proto");
+			ASSERT_EQ(fib_params->h_vlan_TCI,
+				  htons(tests[i].vlan_id), "xdp h_vlan_TCI");
+		}
+	}
+
 fail:
 	if (nstoken)
 		close_netns(nstoken);
 	SYS_NOFAIL("ip netns del " NS_TEST);
 	fib_lookup__destroy(skel);
 }
+
+#define NS_VLAN_A	"fib_lookup_vlan_ns_a"
+#define NS_VLAN_B	"fib_lookup_vlan_ns_b"
+
+/*
+ * A VLAN device can be moved to another netns while staying registered
+ * on its parent. Neither direction may then cross the boundary: the
+ * egress flag must not publish the foreign parent's ifindex, and the
+ * input flag must fail closed rather than use a foreign ingress.
+ */
+void test_fib_lookup_vlan_netns(void)
+{
+	struct bpf_fib_lookup *fib_params;
+	struct nstoken *nstoken = NULL;
+	struct __sk_buff skb = { };
+	struct fib_lookup *skel = NULL;
+	int prog_fd, err, parent_idx, vlan_idx;
+
+	LIBBPF_OPTS(bpf_test_run_opts, run_opts,
+		    .data_in = &pkt_v6,
+		    .data_size_in = sizeof(pkt_v6),
+		    .ctx_in = &skb,
+		    .ctx_size_in = sizeof(skb),
+	);
+
+	skel = fib_lookup__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
+		return;
+	prog_fd = bpf_program__fd(skel->progs.fib_lookup);
+	fib_params = &skel->bss->fib_params;
+
+	SYS(fail, "ip netns add %s", NS_VLAN_A);
+	SYS(fail, "ip netns add %s", NS_VLAN_B);
+
+	nstoken = open_netns(NS_VLAN_A);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns(a)"))
+		goto fail;
+
+	SYS(fail, "ip link add veth7 type veth peer name veth8");
+	SYS(fail, "ip link set dev veth7 up");
+	SYS(fail, "ip link add link veth7 name veth7.66 type vlan id 66");
+	SYS(fail, "ip link set veth7.66 netns %s", NS_VLAN_B);
+
+	parent_idx = if_nametoindex("veth7");
+	if (!ASSERT_NEQ(parent_idx, 0, "if_nametoindex(veth7)"))
+		goto fail;
+
+	/*
+	 * input: the moved device is still in veth7's VLAN group, but it
+	 * lives in another netns, so the lookup must fail closed
+	 */
+	skb.ifindex = parent_idx;
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = parent_idx;
+	fib_params->h_vlan_proto = htons(ETH_P_8021Q);
+	fib_params->h_vlan_TCI = htons(66);
+	if (!ASSERT_EQ(inet_pton(AF_INET, "10.66.0.2", &fib_params->ipv4_dst),
+		       1, "inet_pton(dst)"))
+		goto fail;
+
+	skel->bss->fib_lookup_ret = -1;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT |
+				  BPF_FIB_LOOKUP_SKIP_NEIGH;
+	err = bpf_prog_test_run_opts(prog_fd, &run_opts);
+	if (!ASSERT_OK(err, "test_run(input)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->fib_lookup_ret, BPF_FIB_LKUP_RET_NOT_FWDED,
+		  "input across netns fails closed");
+	ASSERT_EQ(fib_params->ifindex, parent_idx, "ifindex untouched");
+	ASSERT_EQ(fib_params->h_vlan_TCI, htons(66), "tag untouched");
+
+	close_netns(nstoken);
+	nstoken = open_netns(NS_VLAN_B);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns(b)"))
+		goto fail;
+
+	/*
+	 * egress: the fib result is the VLAN device here, but its parent
+	 * is in the other netns, so the swap must not happen
+	 */
+	SYS(fail, "ip link set dev veth7.66 up");
+	SYS(fail, "ip addr add 10.66.0.1/24 dev veth7.66");
+	err = write_sysctl("/proc/sys/net/ipv4/conf/veth7.66/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(forwarding)"))
+		goto fail;
+
+	vlan_idx = if_nametoindex("veth7.66");
+	if (!ASSERT_NEQ(vlan_idx, 0, "if_nametoindex(veth7.66)"))
+		goto fail;
+
+	skb.ifindex = vlan_idx;
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = vlan_idx;
+	if (!ASSERT_EQ(inet_pton(AF_INET, "10.66.0.2", &fib_params->ipv4_dst),
+		       1, "inet_pton(dst)") ||
+	    !ASSERT_EQ(inet_pton(AF_INET, "10.66.0.1", &fib_params->ipv4_src),
+		       1, "inet_pton(src)"))
+		goto fail;
+
+	skel->bss->fib_lookup_ret = -1;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN |
+				  BPF_FIB_LOOKUP_SKIP_NEIGH;
+	err = bpf_prog_test_run_opts(prog_fd, &run_opts);
+	if (!ASSERT_OK(err, "test_run(egress)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->fib_lookup_ret, BPF_FIB_LKUP_RET_SUCCESS,
+		  "egress lookup succeeds");
+	ASSERT_EQ(fib_params->ifindex, vlan_idx,
+		  "foreign parent not published");
+	ASSERT_EQ(fib_params->h_vlan_TCI, 0, "vlan fields zero");
+
+fail:
+	if (nstoken)
+		close_netns(nstoken);
+	SYS_NOFAIL("ip netns del " NS_VLAN_A);
+	SYS_NOFAIL("ip netns del " NS_VLAN_B);
+	fib_lookup__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/fib_lookup.c b/tools/testing/selftests/bpf/progs/fib_lookup.c
index 7b5dd2214ff4..f43e22d33814 100644
--- a/tools/testing/selftests/bpf/progs/fib_lookup.c
+++ b/tools/testing/selftests/bpf/progs/fib_lookup.c
@@ -19,4 +19,13 @@ int fib_lookup(struct __sk_buff *skb)
 	return TC_ACT_SHOT;
 }
 
+SEC("xdp")
+int fib_lookup_xdp(struct xdp_md *ctx)
+{
+	fib_lookup_ret = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params),
+					lookup_flags);
+
+	return XDP_DROP;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.54.0


^ permalink raw reply related

* [PATCH bpf-next v3 2/3] bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
From: Avinash Duduskar @ 2026-06-17 22:47 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: ameryhung, a.s.protopopov, bpf, davem, dsahern, eddyz87, edumazet,
	emil, eyal.birger, hawk, horms, john.fastabend, jolsa, kpsingh,
	kuba, leon.hwang, linux-kernel, linux-kselftest, martin.lau,
	memxor, netdev, pabeni, rongtao, sdf, shuah, song, toke, yatsenko,
	yonghong.song
In-Reply-To: <20260617224729.1428662-1-avinash.duduskar@gmail.com>

BPF_FIB_LOOKUP_VLAN resolves a VLAN egress. The reverse is also
useful: an XDP program receiving a VLAN-tagged frame on a physical
device wants the lookup to behave as if the packet had arrived on the
corresponding VLAN subinterface, so iif-based policy routing and VRF
table selection use the right ingress.

Add BPF_FIB_LOOKUP_VLAN_INPUT. When set, params->h_vlan_proto and
params->h_vlan_TCI are read as an input VLAN tag and the matching VLAN
device of params->ifindex is resolved with __vlan_find_dev_deep_rcu().
The device must be up and in the same network namespace as
params->ifindex (a VLAN device can be moved to another netns while
registered on its parent; receive would deliver into that other
namespace, which a lookup here cannot represent). If params->ifindex
is itself a VLAN device, its inner (QinQ) subinterface is matched.
For a bond or team, a tag on a port matches no device and returns
NOT_FWDED; pass the master's ifindex.
The lookup then runs with the resolved device as the ingress;
params->ifindex itself is not modified on the input side. When the
resolved device is enslaved to a VRF, both the full lookup (via the
l3mdev rule) and BPF_FIB_LOOKUP_DIRECT (via l3mdev_fib_table_rcu())
select the VRF's table from the resolved ingress. That follows from
feeding the resolved device to the flow as the ingress
(fl4.flowi4_iif = dev->ifindex), which is what makes l3mdev resolve
the VRF master from the subinterface rather than from
params->ifindex.

The two failure classes get different treatment on purpose. A
h_vlan_proto other than 802.1Q/802.1ad is API misuse and returns
-EINVAL, since it would otherwise reach the WARN in vlan_proto_idx()
with a program-controlled value. An unmatched VID, a device that is
down, or one in another namespace is a data outcome and returns
BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
fib_get_table() finds no table and mirroring real ingress, where the
receive path drops such frames. A VID of 0 (a priority tag) is looked
up literally and normally fails the same way; receive instead
processes such frames untagged, so callers should not set the flag for
priority tags. Proceeding on the physical device for any of these
would be fail-open for the policy-routing cases above.

The h_vlan fields share a union with tbid, so the flag cannot be
combined with BPF_FIB_LOOKUP_TBID. It describes ingress, so it also
cannot be combined with BPF_FIB_LOOKUP_OUTPUT. Both combinations
return -EINVAL; restricting now keeps a later relaxation backward
compatible. Combining with BPF_FIB_LOOKUP_VLAN is allowed: the tag is
consumed on the ingress side and the egress tag is written on
success.

Under !CONFIG_VLAN_8021Q the __vlan_find_dev_deep_rcu() stub returns
NULL, so every lookup with the flag returns NOT_FWDED, which is
correct since no VLAN device can exist.

Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 include/uapi/linux/bpf.h       | 21 ++++++++++-
 net/core/filter.c              | 66 +++++++++++++++++++++++++++++++---
 tools/include/uapi/linux/bpf.h | 21 ++++++++++-
 3 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f1ac9266a2ab..23bc3109619d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3547,6 +3547,22 @@ union bpf_attr {
  *			are written only on success; other output fields keep
  *			the helper's existing behaviour, so a frag-needed result
  *			still reports the route mtu in *params*->mtu_result.
+ *		**BPF_FIB_LOOKUP_VLAN_INPUT**
+ *			Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ *			as an input VLAN tag and run the lookup as if ingress
+ *			had happened on the VLAN subinterface carrying that tag
+ *			on *params*->ifindex. The VID is the low 12 bits of
+ *			*params*->h_vlan_TCI; *params*->h_vlan_proto must be
+ *			ETH_P_8021Q or ETH_P_8021AD in network byte order, else
+ *			**-EINVAL**. If *params*->ifindex is itself a VLAN
+ *			device, its inner (QinQ) subinterface is matched; for a
+ *			bond or team, pass the master's ifindex. An unmatched
+ *			tag, a down device, or one in another namespace returns
+ *			**BPF_FIB_LKUP_RET_NOT_FWDED**, mirroring real ingress.
+ *			A VID of 0 is looked up literally, so do not set this
+ *			flag for priority-tagged frames. Cannot be combined with
+ *			**BPF_FIB_LOOKUP_TBID** or **BPF_FIB_LOOKUP_OUTPUT**
+ *			(returns **-EINVAL**).
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7343,6 +7359,7 @@ enum {
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
 	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
+	BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
 };
 
 enum {
@@ -7412,7 +7429,9 @@ struct bpf_fib_lookup {
 			/*
 			 * output with BPF_FIB_LOOKUP_VLAN: set from the
 			 * resolved egress VLAN device (see the flag); zeroed
-			 * on other successful lookups.
+			 * on other successful lookups. input with
+			 * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+			 * the lookup by.
 			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
diff --git a/net/core/filter.c b/net/core/filter.c
index 27e4792f11e9..399adf2a824a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6226,6 +6226,25 @@ static int bpf_fib_set_fwd_params(struct net_device *dev,
 
 	return 0;
 }
+
+static struct net_device *bpf_fib_vlan_input_dev(struct net_device *dev,
+						 const struct bpf_fib_lookup *params)
+{
+	__be16 proto = params->h_vlan_proto;
+	struct net_device *vlan_dev;
+	u16 vid;
+
+	if (proto != htons(ETH_P_8021Q) && proto != htons(ETH_P_8021AD))
+		return ERR_PTR(-EINVAL);
+
+	vid = ntohs(params->h_vlan_TCI) & VLAN_VID_MASK;
+	vlan_dev = __vlan_find_dev_deep_rcu(dev, proto, vid);
+	if (!vlan_dev || !(vlan_dev->flags & IFF_UP) ||
+	    !net_eq(dev_net(vlan_dev), dev_net(dev)))
+		return NULL;
+
+	return vlan_dev;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_INET)
@@ -6246,6 +6265,14 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (unlikely(!dev))
 		return -ENODEV;
 
+	if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		dev = bpf_fib_vlan_input_dev(dev, params);
+		if (IS_ERR(dev))
+			return PTR_ERR(dev);
+		if (!dev)
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
+	}
+
 	/* verify forwarding is enabled on this interface */
 	in_dev = __in_dev_get_rcu(dev);
 	if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
@@ -6255,7 +6282,11 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 		fl4.flowi4_iif = 1;
 		fl4.flowi4_oif = params->ifindex;
 	} else {
-		fl4.flowi4_iif = params->ifindex;
+		/*
+		 * dev->ifindex, not params->ifindex: VLAN_INPUT may have
+		 * resolved dev to a subinterface above.
+		 */
+		fl4.flowi4_iif = dev->ifindex;
 		fl4.flowi4_oif = 0;
 	}
 	fl4.flowi4_dscp = inet_dsfield_to_dscp(params->tos);
@@ -6394,6 +6425,14 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (unlikely(!dev))
 		return -ENODEV;
 
+	if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		dev = bpf_fib_vlan_input_dev(dev, params);
+		if (IS_ERR(dev))
+			return PTR_ERR(dev);
+		if (!dev)
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
+	}
+
 	idev = __in6_dev_get_safely(dev);
 	if (unlikely(!idev || !READ_ONCE(idev->cnf.forwarding)))
 		return BPF_FIB_LKUP_RET_FWD_DISABLED;
@@ -6402,7 +6441,12 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 		fl6.flowi6_iif = 1;
 		oif = fl6.flowi6_oif = params->ifindex;
 	} else {
-		oif = fl6.flowi6_iif = params->ifindex;
+		/*
+		 * dev->ifindex, not params->ifindex: VLAN_INPUT may have
+		 * resolved dev to a subinterface above.
+		 */
+		oif = dev->ifindex;
+		fl6.flowi6_iif = oif;
 		fl6.flowi6_oif = 0;
 		strict = RT6_LOOKUP_F_HAS_SADDR;
 	}
@@ -6515,7 +6559,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
 			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
 			     BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK | \
-			     BPF_FIB_LOOKUP_VLAN)
+			     BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_VLAN_INPUT)
+
+static bool bpf_fib_lookup_flags_ok(u32 flags)
+{
+	if (flags & ~BPF_FIB_LOOKUP_MASK)
+		return false;
+
+	if ((flags & BPF_FIB_LOOKUP_VLAN_INPUT) &&
+	    (flags & (BPF_FIB_LOOKUP_TBID | BPF_FIB_LOOKUP_OUTPUT)))
+		return false;
+
+	return true;
+}
 
 BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
@@ -6523,7 +6579,7 @@ BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	if (plen < sizeof(*params))
 		return -EINVAL;
 
-	if (flags & ~BPF_FIB_LOOKUP_MASK)
+	if (!bpf_fib_lookup_flags_ok(flags))
 		return -EINVAL;
 
 	switch (params->family) {
@@ -6562,7 +6618,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	if (plen < sizeof(*params))
 		return -EINVAL;
 
-	if (flags & ~BPF_FIB_LOOKUP_MASK)
+	if (!bpf_fib_lookup_flags_ok(flags))
 		return -EINVAL;
 
 	if (params->tot_len)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f1ac9266a2ab..23bc3109619d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3547,6 +3547,22 @@ union bpf_attr {
  *			are written only on success; other output fields keep
  *			the helper's existing behaviour, so a frag-needed result
  *			still reports the route mtu in *params*->mtu_result.
+ *		**BPF_FIB_LOOKUP_VLAN_INPUT**
+ *			Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ *			as an input VLAN tag and run the lookup as if ingress
+ *			had happened on the VLAN subinterface carrying that tag
+ *			on *params*->ifindex. The VID is the low 12 bits of
+ *			*params*->h_vlan_TCI; *params*->h_vlan_proto must be
+ *			ETH_P_8021Q or ETH_P_8021AD in network byte order, else
+ *			**-EINVAL**. If *params*->ifindex is itself a VLAN
+ *			device, its inner (QinQ) subinterface is matched; for a
+ *			bond or team, pass the master's ifindex. An unmatched
+ *			tag, a down device, or one in another namespace returns
+ *			**BPF_FIB_LKUP_RET_NOT_FWDED**, mirroring real ingress.
+ *			A VID of 0 is looked up literally, so do not set this
+ *			flag for priority-tagged frames. Cannot be combined with
+ *			**BPF_FIB_LOOKUP_TBID** or **BPF_FIB_LOOKUP_OUTPUT**
+ *			(returns **-EINVAL**).
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7343,6 +7359,7 @@ enum {
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
 	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
+	BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
 };
 
 enum {
@@ -7412,7 +7429,9 @@ struct bpf_fib_lookup {
 			/*
 			 * output with BPF_FIB_LOOKUP_VLAN: set from the
 			 * resolved egress VLAN device (see the flag); zeroed
-			 * on other successful lookups.
+			 * on other successful lookups. input with
+			 * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+			 * the lookup by.
 			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
-- 
2.54.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox