Netdev List
* Re: [net-next v44] mctp pcc: Implement MCTP over PCC Transport
From: Jeremy Kerr @ 2026-06-23  5:54 UTC (permalink / raw)
  To: Adam Young, Paolo Abeni, admiyo
  Cc: matt, andrew+netdev, davem, edumazet, kuba, netdev, linux-kernel,
	sudeep.holla, Jonathan.Cameron, lihuisong
In-Reply-To: <edacdad5-7936-4fbf-ba66-973768ebdf73@amperemail.onmicrosoft.com>

Hi Adam,

> > > The code itself will not, as written, work on a 32-bit system, as there
> > > is 64-bit-specific code.
> > Yeah, that was more my question - what is 64-bit specific about the
> > code?
> 
> I'd really have to dig, as this decision was mixed in with the earlier
> endianness conversions.  I suspect that it was originally triggered by
> the ACPICA insisting on machine architecture for these values, where
> they are supposed to be explicitly 32, 64 bit etc.

Looking at the history, it appears that this was in response to the
AI review comments about the stats update with interrupts enabled. The
report was that it may result in tearing of the stats values on 32-bit
machines, but if you have the stats updates right (which I think
you do now?) then this is not an issue.

> It might be perfectly safe, but I have no way to test. Treat it as a 
> general trend toward not supporting newer technology on older architectures.

This is more about not imposing a configuration restriction with no
purpose. There is no need for you to test/support those configurations
though.

> Could I safely acquire a spinlock in the rx_ callback during interrupt
> context?

Yes, absolutely. That's a main use-case for spinlocks, to allow
serialisation from within an atomic context.

> I thought that had a significant impact on the system.

You don't want to be doing large amounts of work within the critical
section (and with interrupts disabled), but your scenario should be
fine.
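
As a rough userspace analogy of the point above (not the driver's code): a
short critical section that updates two 64-bit counters under a lock, so
that a reader taking the same lock never sees a half-updated value. The
struct and function names are made up for illustration, and a C11
atomic_flag stands in for the kernel spinlock; in a driver this would
typically be spin_lock_irqsave()/spin_unlock_irqrestore().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stats block; the flag plays the role of a spinlock. */
struct rx_stats_sketch {
	atomic_flag lock;
	uint64_t rx_packets;
	uint64_t rx_bytes;
};

static void stats_init(struct rx_stats_sketch *s)
{
	atomic_flag_clear(&s->lock);
	s->rx_packets = 0;
	s->rx_bytes = 0;
}

static void stats_lock(struct rx_stats_sketch *s)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		;
}

static void stats_unlock(struct rx_stats_sketch *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Keep the critical section tiny: two additions and nothing else. */
static void rx_account(struct rx_stats_sketch *s, uint32_t len)
{
	assert(s != NULL);
	stats_lock(s);
	s->rx_packets++;
	s->rx_bytes += len;
	stats_unlock(s);
}
```

The only work done while the lock is held is the counter update itself,
which is the "small critical section" case described above.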

Cheers,


Jeremy

^ permalink raw reply

* Re: [PATCH] net/tcp-ao: fix use-after-free of key in del_async path
From: Eric Dumazet @ 2026-06-23  5:50 UTC (permalink / raw)
  To: HanQuan; +Cc: netdev, ncardwell
In-Reply-To: <20260623015208.1191687-1-eilaimemedsnaimel@gmail.com>

On Mon, Jun 22, 2026 at 6:52 PM HanQuan <eilaimemedsnaimel@gmail.com> wrote:
>
> In tcp_ao_delete_key(), the del_async path skips the current_key
> and rnext_key validity checks present in the synchronous path,
> assuming these pointers are always NULL on LISTEN sockets.  However,
> if a key was added with set_current=1/set_rnext=1 while the socket
> was in CLOSE state, current_key and rnext_key will be non-NULL
> after listen() transitions the socket to LISTEN.
>
> When such a key is deleted with del_async=1, hlist_del_rcu() and
> call_rcu() free the key without clearing the dangling pointers.
> After the RCU grace period, getsockopt(TCP_AO_INFO) dereferences
> current_key->sndid and rnext_key->rcvid from freed slab memory.
>
> Clear current_key and rnext_key in the del_async path when they
> reference the key being deleted.
>
> Fixes: d6732b95b6fb ("net/tcp: Allow asynchronous delete for TCP-AO keys (MKTs)")
> Signed-off-by: HanQuan <eilaimemedsnaimel@gmail.com>

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* RE: [PATCH net v5 1/4] net: ethernet: oa_tc6: Interrupt is active low, level triggered.
From: Selvamani Rajagopal @ 2026-06-23  5:48 UTC (permalink / raw)
  To: Parthiban.Veerasooran@microchip.com, andrew+netdev@lunn.ch,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, robh@kernel.org, krzk+dt@kernel.org,
	conor+dt@kernel.org, Piergiorgio Beruto
  Cc: andrew@lunn.ch, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Conor.Dooley@microchip.com,
	devicetree@vger.kernel.org
In-Reply-To: <64f4f30e-a987-4289-b36a-1acc977a6764@microchip.com>


> -----Original Message-----
> From: Parthiban.Veerasooran@microchip.com <Parthiban.Veerasooran@microchip.com>
> Subject: Re: [PATCH net v5 1/4] net: ethernet: oa_tc6: Interrupt is active low, level
> triggered.
> 
> 
> I will find some time this week to test and share my feedback. In the
> meantime, would it be possible for you to test using two instances (Test
> Case 2)? I did not encounter many issues when testing with a single
> instance.
> 
> I believe that testing with two instances increases the likelihood of
> reproducing the issue in your setup as well.

Parthiban,

Thanks.

Our EVB design allows only one board to be connected to one Raspberry Pi,
so I don't think I can have a setup like yours. We did test with three
Raspberry Pi boards on a multi-drop connection, but couldn't reproduce your
"NULL pointer" crash. I will keep trying, though.

However, I could see an assert in skb_put very quickly.

> 
> Best regards,
> Parthiban V
> >
> >>


^ permalink raw reply

* Re: [PATCH net] tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
From: Eric Dumazet @ 2026-06-23  5:37 UTC (permalink / raw)
  To: Xin Long
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, syzbot+e14bc5d4942756023b77, Jon Maloy
In-Reply-To: <CADvbK_e_7184a_jm5ASjKafXrJqaOKFUKJF6V1wp7AnFxK596g@mail.gmail.com>

On Mon, Jun 22, 2026 at 6:48 PM Xin Long <lucien.xin@gmail.com> wrote:
>

> Could this corrupt the list for concurrent RCU readers?
> When list_del_rcu() is called, it intentionally leaves the next pointer
> intact so concurrent readers can continue their traversal. However, the
> immediate call to list_add() overwrites both the next and prev pointers
> to link the entry into private_list.
> If a concurrent reader is currently positioned at rcast, won't it follow
> the newly clobbered next pointer and jump from the original RCU list
> directly into private_list?
> Because private_list is allocated on the local stack, the reader might
> interpret stack memory as a struct udp_replicast. Furthermore, the reader
> would miss its loop termination condition because it expects to reach the
> original list head, potentially resulting in an infinite loop or a crash.
> [ ... ]

I think you are right.

Considering there is already one rcu_head in udp_replicast I will use it in V2.
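
To make the problem concrete, here is a minimal userspace sketch of the
pointer clobbering described in the quoted review. It does not use the
kernel list API: list_node, list_add_head() and list_del_keep_next() are
simplified stand-ins for struct list_head, list_add() and list_del_rcu(),
written only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head, in the style of list_add(). */
static void list_add_head(struct list_node *entry, struct list_node *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink entry but leave entry->next intact, in the style of list_del_rcu(). */
static void list_del_keep_next(struct list_node *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->prev = NULL;	/* the kernel poisons prev; NULL is enough here */
}

/* Returns 1 if re-adding a removed entry redirects a reader parked on it. */
static int reader_next_is_clobbered(void)
{
	struct list_node orig_list, private_list, a, b;

	list_init(&orig_list);
	list_init(&private_list);
	list_add_head(&a, &orig_list);
	list_add_head(&b, &orig_list);	/* orig_list -> b -> a -> orig_list */

	list_del_keep_next(&b);
	assert(b.next == &a);	/* a reader on b can still continue to a */

	list_add_head(&b, &private_list);
	/* b.next now points at the on-stack private list head. */
	return b.next == &private_list;
}
```

After the second list operation, a reader still positioned on b would walk
into private_list instead of the original list, which is the situation the
review describes.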

I will squash/test the following:

diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index befaf7137caf642462b7203a2429a60386e64db8..1b1977bd09a3f24028a30c1b98d5edc4b1882ba2 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -803,17 +803,24 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b,
        return err;
 }

+static void rcast_free_rcu(struct rcu_head *rcu)
+{
+       struct udp_replicast *rcast = container_of(rcu, struct udp_replicast, rcu);
+
+       dst_cache_destroy(&rcast->dst_cache);
+       kfree(rcast);
+}
+
 /* cleanup_bearer - break the socket/bearer association */
 static void cleanup_bearer(struct work_struct *work)
 {
        struct udp_bearer *ub = container_of(work, struct udp_bearer, work);
        struct udp_replicast *rcast, *tmp;
-       LIST_HEAD(private_list);
        struct tipc_net *tn;

        list_for_each_entry_safe(rcast, tmp, &ub->rcast.list, list) {
                list_del_rcu(&rcast->list);
-               list_add(&rcast->list, &private_list);
+               call_rcu_hurry(&rcast->rcu, rcast_free_rcu);
        }

        tn = tipc_net(sock_net(ub->sk));
@@ -822,11 +829,6 @@ static void cleanup_bearer(struct work_struct *work)

        synchronize_net();

-       list_for_each_entry_safe(rcast, tmp, &private_list, list) {
-               dst_cache_destroy(&rcast->dst_cache);
-               kfree(rcast);
-       }
-
        dst_cache_destroy(&ub->rcast.dst_cache);
        atomic_dec(&tn->wq_count);
        kfree(ub);


Thanks.

^ permalink raw reply

* Re: [PATCH net 0/2] tcp: make TCP-AO lookups more predictable
From: Eric Dumazet @ 2026-06-23  5:24 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Neal Cardwell, Kuniyuki Iwashima, netdev, eric.dumazet
In-Reply-To: <CAJwJo6Zc7Raz9HkKEARouSnz+FgTi9C5kAA+sa=-5Cg-CLe=MQ@mail.gmail.com>

On Mon, Jun 22, 2026 at 6:13 PM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
>
> Hi Eric,
>
> On Mon, 22 Jun 2026 at 19:52, Eric Dumazet <edumazet@google.com> wrote:
> >
> > This series fixes a TCP-AO key lookup precedence bug.
> >
> > TCP-AO stores MKTs in an unsorted list and returns the first match. This
> > allows newer, less-specific keys (wildcard VRF or shorter prefixes) to
> > shadow older, more-specific keys if inserted later.
>
> Yeah, at this moment, TCP-AO doesn't allow any intersection of the keys:
> If you have matching VRFs, matching keyids for matching peer/masks –
> then when the userspace tries to add the second key, setsockopt() is
> going to return -EKEYREJECTED/-EEXIST. This is quite different from
> TCP-MD5, where the most matching key is the one that's going to be
> used by the kernel.
>
> This simplification (not allowing any key intersects) is mostly from a
> very permissive RFC5925, where MKT matches can be: ip-addr/mask; ip
> address ranges; wildcards of addresses; tcp ports. So, this part was
> intentionally simplified until there is a user who requires one of
> these things. And based on their requirements, a better data structure
> than a simple list could be used. Basically, the longest prefix match
> is like adding power-of-two ip ranges. Also, that's another reason why
> I wanted an extendable setsockopt(), where one can add new
> flags/fields to uAPI without breaking the existing users.
>
> Anyways, if you have the requirement to have intersecting keys with
> bigger mask matching (imitating TCP-MD5 behaviour), we can do that,
> but I think that needs a new TCP_AO_KEYF_PREFIX_MATCH (or something of
> a kind). Then the keys with everything matching, but a prefix could be
> added to the socket, and the longest prefix match will be used.
>
> I think one API decision should be documented straight away (besides
> the key flag) – how this flag works with multiple keys.
> Say there are 4 keys on a socket, all match the peer being connected:
> keyA: ip 10.0.0.0 /8 (keyid = 100)
> keyB: ip 10.0.0.0 /16 (keyid = 100)
> keyC: ip 10.0.0.0 /8 (keyid = 101)
> keyD: ip 10.0.0.0 /16 (keyid = 102)
>
> So, keyA and keyB obviously will have to use this new
> TCP_AO_KEYF_PREFIX_MATCH. Should keyC or keyD be copied to the
> established connection socket or not?
> I'd think the presence of TCP_AO_KEYF_PREFIX_MATCH flag on keyC&keyD
> should also affect whether they are copied or not. If the flag is not
> on keyC&keyD –  they should be copied to the established socket
> (together with keyB, preserving the previous behaviour).
>
> Otherwise, if they have the flag, what should happen?
> 1. keyB + keyC + keyD
> 2. keyB + keyD
> If we go with (2), then if a user wants keyC on a socket, they could
> either remove TCP_AO_KEYF_PREFIX_MATCH from keyC or add keyC1 with
> mask /16 and the same password as keyC – slightly inconvenient, but
> quite flexible.
>
> What do you think?

If intersecting keys are not yet allowed, I think we must return an
error code at the insertion stage,
instead of hoping the user will do "the right thing".
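
A minimal sketch of what rejecting at insertion could look like, assuming
(as described in the quoted text) that two keys conflict when they share a
keyid and their peer prefixes overlap. This is userspace illustration only:
mkt_sketch, mkt_table and mkt_insert() are hypothetical names, IPv4
addresses are in host byte order, and VRFs and ports are ignored.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct mkt_sketch {
	uint32_t addr;		/* IPv4 address, host byte order */
	uint8_t prefixlen;	/* 0..32 */
	uint8_t keyid;
};

#define MKT_MAX 8

struct mkt_table {
	struct mkt_sketch keys[MKT_MAX];
	int count;
};

static uint32_t prefix_mask(uint8_t prefixlen)
{
	return prefixlen ? ~(uint32_t)0 << (32 - prefixlen) : 0;
}

/* Two prefixes overlap when they agree on the shorter prefix length. */
static int prefixes_overlap(const struct mkt_sketch *a,
			    const struct mkt_sketch *b)
{
	uint8_t shorter = a->prefixlen < b->prefixlen ?
			  a->prefixlen : b->prefixlen;
	uint32_t mask = prefix_mask(shorter);

	return (a->addr & mask) == (b->addr & mask);
}

/* Returns 0 on success, -EEXIST for an intersecting key, -ENOSPC if full. */
static int mkt_insert(struct mkt_table *t, struct mkt_sketch key)
{
	int i;

	assert(key.prefixlen <= 32);
	if (t->count == MKT_MAX)
		return -ENOSPC;
	for (i = 0; i < t->count; i++) {
		if (t->keys[i].keyid == key.keyid &&
		    prefixes_overlap(&t->keys[i], &key))
			return -EEXIST;
	}
	t->keys[t->count++] = key;
	return 0;
}
```

With the four example keys quoted above, keyB (10.0.0.0/16, keyid 100) would
be refused after keyA (10.0.0.0/8, keyid 100), while keyC and keyD would
still be accepted because their keyids differ.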

Thanks.

^ permalink raw reply

* Re: [PATCH net v5 1/4] net: ethernet: oa_tc6: Interrupt is active low, level triggered.
From: Parthiban.Veerasooran @ 2026-06-23  5:19 UTC (permalink / raw)
  To: Selvamani.Rajagopal, andrew+netdev, davem, edumazet, kuba, pabeni,
	robh, krzk+dt, conor+dt, Pier.Beruto
  Cc: andrew, netdev, linux-kernel, Conor.Dooley, devicetree
In-Reply-To: <CYYPR02MB9828A1434E6339A6CFCCA74283EF2@CYYPR02MB9828.namprd02.prod.outlook.com>

Hi Selvamani,

On 22/06/26 10:44 am, Selvamani Rajagopal wrote:
>>
>> AI review bot Sashiko suggested one potential issue where skb pointers aren't protected.
>> But those
>> concerns are in transmit path. This crash seems to be in receive path. If you think that
>> might help,
>> I can generate a patch for that.
> 
> 
> Parthiban,
> 
> I just submitted a patch for "net" tree. I was able to see one crash though. Crash signature
> was different from yours. As I remember, yours is NULL pointer access. Mine was due to
> trying to place the data beyond the "end" point.
> 
> Anyway, if you have time to spare and want to try and see if it fixes your crash, I would appreciate
> the feedback..
> 
> https://patchwork.kernel.org/project/netdevbpf/list/?series=1114495
Thank you for the update, and I appreciate your efforts.

I will find some time this week to test and share my feedback. In the 
meantime, would it be possible for you to test using two instances (Test 
Case 2)? I did not encounter many issues when testing with a single 
instance.

I believe that testing with two instances increases the likelihood of 
reproducing the issue in your setup as well.

Best regards,
Parthiban V
> 
>>
>> What do you suggest? Since you are able to see the crash, would you have time to
>> investigate?
>>
>> Sincerely
>> Selva


^ permalink raw reply

* [PATCH net v2 2/2] octeontx2-af: suppress kpu profile loading warning
From: nshettyj @ 2026-06-23  4:06 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: sgoutham, rkannoth, lcherian, gakula, hkelam, sbhatta,
	andrew+netdev, davem, edumazet, kuba, pabeni, Sunil.Goutham,
	naveenm, hkalra, Nitin Shetty J
In-Reply-To: <20260623040609.3090846-1-nshettyj@marvell.com>

From: Harman Kalra <hkalra@marvell.com>

There are three ways in which a KPU profile can be loaded
(in high to low priority order):
1. profile image integrated in kernel image
2. firmware database method
3. default profile

In most cases the profile is loaded using the 2nd method, which
causes a spurious warning from the Linux firmware subsystem (method 1)
due to the absence of firmware in the kernel image.

Replace request_firmware_direct() with firmware_request_nowarn() to
suppress such warnings when no image is integrated into the kernel image.

Fixes: c0c9ac88156a ("octeontx2-af: npc: Support for custom KPU profile from filesystem")
Signed-off-by: Harman Kalra <hkalra@marvell.com>
Signed-off-by: Nitin Shetty J <nshettyj@marvell.com>
---
 drivers/net/ethernet/marvell/octeontx2/af/rvu_npc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc.c
index c7bc0b3a29b9..007d3f22b0c9 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc.c
@@ -2246,7 +2246,7 @@ static int npc_load_kpu_profile_from_fs(struct rvu *rvu)
 
 	strcat(path, kpu_profile);
 
-	if (request_firmware_direct(&fw, path, rvu->dev))
+	if (firmware_request_nowarn(&fw, path, rvu->dev))
 		return -ENOENT;
 
 	dev_info(rvu->dev, "Loading KPU profile from filesystem: %s\n",
-- 
2.48.1


^ permalink raw reply related

* [PATCH net v2 1/2] octeontx2-af: fix VF bringup affecting PF promiscuous state
From: nshettyj @ 2026-06-23  4:06 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: sgoutham, rkannoth, lcherian, gakula, hkelam, sbhatta,
	andrew+netdev, davem, edumazet, kuba, pabeni, Sunil.Goutham,
	naveenm, hkalra, Nitin Shetty J
In-Reply-To: <20260623040609.3090846-1-nshettyj@marvell.com>

From: Harman Kalra <hkalra@marvell.com>

Mbox handling of nix_set_rx_mode for a VF with promiscuous and
all_multi flags set to false causes deletion of the PF's promiscuous
and allmulti MCAM rules. This occurs because the APIs that
enable/disable these rules operate only on the PF, even when the
mbox request is made via a VF interface.

Guard both rvu_npc_enable_allmulti_entry() and
rvu_npc_enable_promisc_entry() disable paths with an is_vf() check so
that a VF bringing up or tearing down its interface cannot inadvertently
clear the PF's MCAM rules.

Fixes: 967db3529eca ("octeontx2-af: add support for multicast/promisc packet replication feature")
Signed-off-by: Harman Kalra <hkalra@marvell.com>
Signed-off-by: Nitin Shetty J <nshettyj@marvell.com>
---
 drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
index d8989395e875..a7e0e0e05ad2 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -4575,7 +4575,7 @@ int rvu_mbox_handler_nix_set_rx_mode(struct rvu *rvu, struct nix_rx_mode *req,
 		rvu_npc_install_allmulti_entry(rvu, pcifunc, nixlf,
 					       pfvf->rx_chan_base);
 	} else {
-		if (!nix_rx_multicast)
+		if (!nix_rx_multicast && !is_vf(pcifunc))
 			rvu_npc_enable_allmulti_entry(rvu, pcifunc, nixlf, false);
 	}
 
@@ -4585,7 +4585,7 @@ int rvu_mbox_handler_nix_set_rx_mode(struct rvu *rvu, struct nix_rx_mode *req,
 					      pfvf->rx_chan_base,
 					      pfvf->rx_chan_cnt);
 	else
-		if (!nix_rx_multicast)
+		if (!nix_rx_multicast && !is_vf(pcifunc))
 			rvu_npc_enable_promisc_entry(rvu, pcifunc, nixlf, false);
 
 	return 0;
-- 
2.48.1
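
For readers unfamiliar with the guard, a small userspace sketch of its
effect follows. It is not the driver code: the names are hypothetical, the
assumption that the low 10 bits of pcifunc carry the function number (with 0
meaning the PF itself) is mine, and the nix_rx_multicast condition from the
real handler is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: a non-zero function number in the low bits means a VF. */
#define SKETCH_FUNC_MASK 0x3FFu

struct pf_rules_sketch {
	bool promisc_enabled;
	bool allmulti_enabled;
};

static bool sketch_is_vf(uint16_t pcifunc)
{
	return (pcifunc & SKETCH_FUNC_MASK) != 0;
}

/*
 * Models the disable paths of an rx-mode request with both flags false:
 * only a request that comes from the PF may clear the PF's rules.
 */
static void sketch_disable_rx_rules(struct pf_rules_sketch *pf,
				    uint16_t pcifunc)
{
	assert(pf != NULL);
	if (sketch_is_vf(pcifunc))
		return;
	pf->allmulti_enabled = false;
	pf->promisc_enabled = false;
}
```

A request carrying a VF's pcifunc leaves both PF rules untouched, while the
same request from the PF clears them.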


^ permalink raw reply related

* [PATCH net v2 0/2] octeontx2-af: Bug fixes for KPU profile and VF RX mode
From: nshettyj @ 2026-06-23  4:06 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: sgoutham, rkannoth, lcherian, gakula, hkelam, sbhatta,
	andrew+netdev, davem, edumazet, kuba, pabeni, Sunil.Goutham,
	naveenm, hkalra

From: Harman Kalra <hkalra@marvell.com>

Hello,

This is version 2 of the patch series targeting the net branch. The second 
patch has been rebased against the current HEAD of the net branch to resolve 
a merge conflict. No logical changes were made to either patch.

The first patch resolves an issue where a VF changing its interface
state could inadvertently delete the RX promiscuous and all-multicast
MCAM rules belonging to the host PF.

The second patch addresses a spurious firmware loading warning by
switching to the non-warning variant of the firmware request API when
falling back to alternative loading methods.

Changes in v2:
- Rebased patch 2/2 to resolve a merge conflict on the net branch.
- Patch 1/2 remains unchanged.

Harman Kalra (2):
  octeontx2-af: fix VF bringup affecting PF promiscuous state
  octeontx2-af: suppress kpu profile loading warning

 drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c | 4 ++--
 drivers/net/ethernet/marvell/octeontx2/af/rvu_npc.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

-- 
2.48.1


^ permalink raw reply

* Re: [PATCH net] netpoll: fix a use-after-free on shutdown path
From: Pavan Chebbi @ 2026-06-23  4:05 UTC (permalink / raw)
  To: Breno Leitao
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Amerigo Wang, netdev, linux-kernel, vlad.wing,
	asantostc, kernel-team, stable
In-Reply-To: <20260622-netpoll_rcu_fix-v1-1-15c3285e92e6@debian.org>

[-- Attachment #1: Type: text/plain, Size: 2334 bytes --]

On Mon, Jun 22, 2026 at 8:31 PM Breno Leitao <leitao@debian.org> wrote:
>
> There is a use-after-free error on netpoll, which is clearly detected by
> KASAN.
>
>       BUG: KASAN: slab-use-after-free in _raw_spin_lock_irqsave+0x3b/0x80
>       Read of size 1 at addr ... by task kworker/9:1
>       Workqueue: events queue_process
>       Call Trace:
>        skb_dequeue+0x1e/0xb0
>        queue_process+0x2c/0x600
>        process_scheduled_works+0x4b6/0x850
>        worker_thread+0x414/0x5a0
>       Allocated by task 242:
>        __netpoll_setup+0x201/0x4a0
>        netpoll_setup+0x249/0x550
>        enabled_store+0x32f/0x380
>       Freed by task 0:
>        kfree+0x1b7/0x540
>        rcu_core+0x3f8/0x7a0
>
> The problem happens when there is a pending TX worker running in
> parallel with the cleanup path.
>
> This is what happens on netpoll shutdown path:
>
> 1) __netpoll_cleanup() is called
> 2) set dev->npinfo to NULL
> 3) call_rcu() with rcu_cleanup_netpoll_info()
>   3.1) rcu_cleanup_netpoll_info() tries to cancel all workers with
>        cancel_delayed_work(), but doesn't wait for the worker to finish
> 4) and kfree(npinfo);
>
> Because 3.1) doesn't really cancel the work, as the comment says "we
> can't call cancel_delayed_work_sync here, as we are in softirq", the TX
> worker can run after 4).
>
> TL;DR: queue_process() is not an RCU reader, it reaches npinfo through
> the work item via container_of().
>
> In reality, we can improve this cleanup path by a lot, but, given that
> this is targeting net, just do the sane path:
>
> 1) set dev->npinfo to NULL
> 2) synchronize net / RCU
> 3) cancel_delayed_work_sync() any new worker (that potentially showed up
>    after the grace period -- and should exit soon given they will see
>    dev->npinfo = NULL)
> 4) then rcu_cleanup_netpoll_info() -> kfree() npinfo
>
> In the future, we can do the cleanup inline here, and don't need
> npinfo->rcu rcu_head, but that is net-next material.
>
> Cc: stable@vger.kernel.org
> Fixes: 38e6bc185d95 ("netpoll: make __netpoll_cleanup non-block")
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  net/core/netpoll.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
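
The ordering argument in the quoted changelog can be illustrated with a toy,
single-threaded model. Nothing here is netpoll code: npinfo_sketch and the
helpers are hypothetical, "freeing" is modelled by clearing a flag so the
example stays well-defined, and a synchronous cancel is modelled as simply
draining the pending work item.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct npinfo_sketch {
	bool alive;		/* false once the object has been "freed" */
	bool work_pending;	/* a TX work item is still queued */
	int stale_accesses;	/* times the worker touched a freed object */
};

/* The worker: runs later, whenever the workqueue gets to it. */
static void run_pending_work(struct npinfo_sketch *np)
{
	assert(np != NULL);
	if (!np->work_pending)
		return;
	np->work_pending = false;
	if (!np->alive)
		np->stale_accesses++;	/* stands in for the KASAN report */
}

/* Old order: free without waiting for the queued work item. */
static void cleanup_without_wait(struct npinfo_sketch *np)
{
	np->alive = false;
}

/* New order: make sure no work item remains, then free. */
static void cleanup_with_sync_cancel(struct npinfo_sketch *np)
{
	np->work_pending = false;	/* models a synchronous cancel */
	np->alive = false;
}
```

In the first variant a later run of the worker touches the freed object; in
the second there is nothing left to run by the time the object goes away.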

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

* Re: [PATCH net] eth: fbnic: fix ordering of heartbeat vs ownership
From: Pavan Chebbi @ 2026-06-23  4:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
	Alexander Duyck
In-Reply-To: <20260622154753.827506-1-kuba@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1472 bytes --]

On Mon, Jun 22, 2026 at 9:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> When requesting ownership of the NIC (MAC/PHY control), we set up
> the heartbeat to look stale:
>
>   /* Initialize heartbeat, set last response to 1 second in the past
>    * so that we will trigger a timeout if the firmware doesn't respond
>    */
>   fbd->last_heartbeat_response = req_time - HZ;
>   fbd->last_heartbeat_request = req_time;
>
> The response handler then sets:
>
>   fbd->last_heartbeat_response = jiffies;
>
> for which we wait via:
>
>   fbnic_fw_init_heartbeat() -> fbnic_fw_heartbeat_current()
>
> The scheme is a bit odd, but it should work in principle.
>
> Fix the ordering of operations. We have to set up the stale heartbeat
> before we send the message. Otherwise if the response is very fast
> we will override it. This triggers on QEMU if we run on the core
> that handles the IRQ, and results in ndo_open failing with ETIMEDOUT.
>
> The change in ordering doesn't impact releasing the ownership.
> Both ndo_stop and heartbeat check are under rtnl_lock.
>
> Fixes: 20d2e88cc746 ("eth: fbnic: Add initial messaging to notify FW of our presence")
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  drivers/net/ethernet/meta/fbnic/fbnic_fw.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
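
The race in the quoted changelog can be reproduced in a few lines of
userspace C. This is a sketch under assumptions, not the fbnic code:
hb_sketch and the helpers are hypothetical, time is a plain long rather than
jiffies, the firmware reply is modelled as arriving instantly, and the
"current" check is assumed to be "response is not older than request".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_HZ 1000L

struct hb_sketch {
	long last_heartbeat_request;
	long last_heartbeat_response;
};

/* Response handler: records the time the firmware answered. */
static void fw_respond(struct hb_sketch *hb, long now)
{
	hb->last_heartbeat_response = now;
}

/* Make the heartbeat look stale until a response arrives. */
static void mark_stale(struct hb_sketch *hb, long req_time)
{
	hb->last_heartbeat_response = req_time - SKETCH_HZ;
	hb->last_heartbeat_request = req_time;
}

static bool heartbeat_current(const struct hb_sketch *hb)
{
	return hb->last_heartbeat_response >= hb->last_heartbeat_request;
}

/* Old order: send first, so a fast reply is overwritten by mark_stale(). */
static bool request_ownership_send_first(struct hb_sketch *hb, long now)
{
	assert(hb != NULL);
	fw_respond(hb, now);	/* reply arrives before we continue */
	mark_stale(hb, now);
	return heartbeat_current(hb);
}

/* New order: mark stale first, so the reply is the last writer. */
static bool request_ownership_stale_first(struct hb_sketch *hb, long now)
{
	assert(hb != NULL);
	mark_stale(hb, now);
	fw_respond(hb, now);
	return heartbeat_current(hb);
}
```

With an instant reply, the send-first order leaves the heartbeat looking
stale, which matches the timeout described above; the stale-first order
sees the response.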

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

* Re: [PATCH bpf-next v4 3/3] selftests/bpf: Add bpf_fib_lookup() VLAN flag tests
From: bot+bpf-ci @ 2026-06-23  3:39 UTC (permalink / raw)
  To: avinash.duduskar, ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	toke, dsahern, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260623025147.1001664-4-avinash.duduskar@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5118 bytes --]

>     selftests/bpf: Add bpf_fib_lookup() VLAN flag tests
>
>     Cover both directions of the new VLAN flags in the fib_lookup test,
>     38 table cases plus dedicated cross-netns and XDP-redirect subtests.
>
>     For BPF_FIB_LOOKUP_VLAN the egress cases assert: without the flag the
>     lookup returns the VLAN netdev's ifindex and zeroed vlan fields, with
>     the flag it returns the parent's ifindex plus the tag (including via
>     a neighbour resolved on the VLAN device, in OUTPUT mode, over a bond,
>     and through a DIRECT|TBID table), with the flag on a non-VLAN egress
>     it changes nothing, for a stacked VLAN (QinQ) it returns
>     BPF_FIB_LKUP_RET_VLAN_FAILURE with params->ifindex left at the input, a
>     lookup without the flag returns the inner VLAN device's ifindex, and
>     a frag-needed return reports the route mtu in mtu_result while leaving
>     the swap unwritten.
>
>     The VLAN_FAILURE arms are IPv4. bpf_ipv6_fib_lookup() restores
>     params->ifindex with the same save/restore the IPv4 arms exercise, so an
>     IPv6 VLAN_FAILURE arm would only re-test shared code.
>
>     For BPF_FIB_LOOKUP_VLAN_INPUT, an iif rule on the subinterface routes
>     the same destination to a different gateway, so the asserted gateway
>     shows which device the lookup used as ingress: without the flag the
>     main table answers, with a matching tag the subinterface's table
>     does, with or without SKIP_NEIGH, and BPF_FIB_LOOKUP_SRC selects the
>     subinterface's address. A VRF-enslaved subinterface selects the VRF
>     table through the l3mdev rule and, with DIRECT, through
>     l3mdev_fib_table_rcu(). One case sets BPF_FIB_LOOKUP_VLAN as well and
>     asserts both directions work in a single lookup. Resolution semantics
>     are pinned: an 802.1ad tag resolves its device, PCP and DEI bits in
>     h_vlan_TCI are ignored, a VLAN ifindex resolves the inner QinQ
>     device, a tag on a bond master resolves while the same tag on the
>     bond port does not.
>
>     The error cases assert -EINVAL for an invalid h_vlan_proto on both
>     address families, for the TBID and OUTPUT flag combinations and for
>     an unknown flag bit, and BPF_FIB_LKUP_RET_NOT_FWDED for a VID with no
>     configured device on both families, for a VID-0 priority tag and for
>     a device that exists but is down. The failure cases also assert that
>     params is left untouched. By contrast, a no-neighbour case whose
>     input and egress devices differ asserts NO_NEIGH reports the egress
>     ifindex, not the input: only VLAN_FAILURE rewinds params->ifindex to
>     the input.
>
>     A separate subtest moves a VLAN device into a second netns while it
>     stays registered on its parent, and checks both directions refuse to
>     cross the boundary: the input flag fails closed with the tag and
>     ifindex untouched, and the egress flag returns
>     BPF_FIB_LKUP_RET_VLAN_FAILURE without publishing the foreign parent's
>     ifindex.
>
>     The tbid read-back check is skipped for DIRECT cases that set
>     BPF_FIB_LOOKUP_VLAN, since a successful swap packs the vlan fields
>     into the union the check reads.
>
>     Re-run the cases through bpf_xdp_fib_lookup() as well: the egress flag
>     exists because VLAN devices have no XDP xmit, so XDP is the primary
>     consumer. bpf_prog_test_run uses the netns' loopback for the xdp context's
>     device, so the lookup runs against the test netns' FIB, and the
>     path-independent results (return code, swapped ifindex, vlan tag, gateway)
>     are asserted to match the skb path.
>
>     A live-frames subtest (test_fib_lookup_vlan_redirect) drives real
>     frames through the XDP redirect path with BPF_F_TEST_XDP_LIVE_FRAMES, the
>     native xdp_do_redirect() plus xdp_do_flush() path. A reducible VLAN
>     egress is redirected to the physical parent and delivered to its peer;
>     a QinQ egress returns VLAN_FAILURE and is passed to the stack, since
>     redirecting to the VLAN device would drop the frame at xdp_do_flush()
>     (no ndo_xdp_xmit). The redirect program distinguishes SUCCESS from not;
>     the table and netns arms pin the exact VLAN_FAILURE value.

This isn't a bug, but could the changelog be tightened?

The description runs roughly seven paragraphs that walk through what each
group of test arms asserts: the egress arms with and without the flag,
the input arms and VRF table selection, the error arms and their failure
modes, the netns subtest boundary checks, and the XDP redirect subtest
behaviour.

Much of it carries rationale, so this is a soft observation, but someone
wanting to understand the per-case behaviour can read it more quickly from
the test table itself. Could the summary focus on the why (the two new
flags and the invariants worth pinning) and lean on the test table for the
per-arm specifics?


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27999579457

^ permalink raw reply

* [PATCH] net: ipa: fix SMEM state handle leaks in SMP2P init
From: Haoxiang Li @ 2026-06-23  3:18 UTC (permalink / raw)
  To: elder, andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: netdev, linux-kernel, Haoxiang Li, stable

ipa_smp2p_init() acquires two Qualcomm SMEM state handles with
qcom_smem_state_get(). However, neither the init error paths
nor ipa_smp2p_exit() release them.

Use devm_qcom_smem_state_get() for both state handles so the
references are released automatically when the platform device
is removed.

Fixes: 530f9216a953 ("soc: qcom: ipa: AP/modem communications")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
 drivers/net/ipa/ipa_smp2p.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ipa/ipa_smp2p.c b/drivers/net/ipa/ipa_smp2p.c
index 2f0ccdd937cc..d8fd56949082 100644
--- a/drivers/net/ipa/ipa_smp2p.c
+++ b/drivers/net/ipa/ipa_smp2p.c
@@ -228,15 +228,15 @@ ipa_smp2p_init(struct ipa *ipa, struct platform_device *pdev, bool modem_init)
 	u32 valid_bit;
 	int ret;
 
-	valid_state = qcom_smem_state_get(dev, "ipa-clock-enabled-valid",
-					  &valid_bit);
+	valid_state = devm_qcom_smem_state_get(dev, "ipa-clock-enabled-valid",
+					       &valid_bit);
 	if (IS_ERR(valid_state))
 		return PTR_ERR(valid_state);
 	if (valid_bit >= 32)		/* BITS_PER_U32 */
 		return -EINVAL;
 
-	enabled_state = qcom_smem_state_get(dev, "ipa-clock-enabled",
-					    &enabled_bit);
+	enabled_state = devm_qcom_smem_state_get(dev, "ipa-clock-enabled",
+						 &enabled_bit);
 	if (IS_ERR(enabled_state))
 		return PTR_ERR(enabled_state);
 	if (enabled_bit >= 32)		/* BITS_PER_U32 */
-- 
2.25.1
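
The leak pattern and the managed-release idea behind this fix can be
illustrated outside the kernel. The following is a minimal userspace
sketch, not the driver code and not the devm implementation: every
sketch_* name is hypothetical, and the "device" is just an array of
release callbacks that are run when the device goes away.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ACTIONS 8

/* Hypothetical stand-in for a device that owns a list of release actions. */
struct sketch_device {
	void (*release[SKETCH_MAX_ACTIONS])(void *res);
	void *res[SKETCH_MAX_ACTIONS];
	size_t count;
};

/* Hypothetical resource: only counts outstanding references. */
struct sketch_state {
	int refs;
};

static void sketch_state_put(void *res)
{
	struct sketch_state *state = res;

	state->refs--;
}

/* Take a reference and register its release with the device. */
static struct sketch_state *sketch_managed_get(struct sketch_device *dev,
					       struct sketch_state *state)
{
	assert(dev && state);
	if (dev->count >= SKETCH_MAX_ACTIONS)
		return NULL;

	state->refs++;
	dev->release[dev->count] = sketch_state_put;
	dev->res[dev->count] = state;
	dev->count++;
	return state;
}

/* Run on init failure or device removal: release in reverse order. */
static void sketch_release_all(struct sketch_device *dev)
{
	while (dev->count > 0) {
		dev->count--;
		dev->release[dev->count](dev->res[dev->count]);
	}
}

/* Init-like function: early returns need no manual put. */
static int sketch_init(struct sketch_device *dev, struct sketch_state *valid,
		       struct sketch_state *enabled, unsigned int enabled_bit)
{
	if (!sketch_managed_get(dev, valid))
		return -1;
	if (!sketch_managed_get(dev, enabled))
		return -1;
	if (enabled_bit >= 32)
		return -1;	/* both references are still tracked by dev */
	return 0;
}
```

Calling sketch_init() with an out-of-range bit returns an error while
both references are still held; a later sketch_release_all() drops them,
which is the property the patch relies on for the early -EINVAL returns.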


^ permalink raw reply related

* Re: [PATCH net 2/2] selftests/net: Add TCP-AO key shadowing test
From: Dmitry Safonov @ 2026-06-23  3:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Neal Cardwell, Kuniyuki Iwashima, netdev, eric.dumazet
In-Reply-To: <20260622185248.1717846-3-edumazet@google.com>

Hi Eric,

Thanks for adding a selftest.

On Mon, 22 Jun 2026 at 19:52, Eric Dumazet <edumazet@google.com> wrote:
>
> Add a new selftest shadowing.c to tools/testing/selftests/net/tcp_ao
> to verify that more specific keys are correctly preferred over less
> specific ones (shadowing prevention), regardless of their insertion order.
>
> The test configures a server with a specific host key, and a client with
> both a specific host key and a wildcard subnet key, inserted in the
> "wrong" order (wildcard last, which would shadow the specific one under
> the bug). It then verifies that the client can still successfully
> connect to the server, which only succeeds if the client correctly
> selects the more specific key for the outbound connection.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Assisted-by: Gemini:gemini-3.1-pro
> ---
[..]

> +static void *client_fn(void *arg)
> +{
> +       int sk = socket(test_family, SOCK_STREAM, IPPROTO_TCP);
> +       union tcp_addr wildcard_addr = {};
> +
> +       if (sk < 0)
> +               test_error("socket()");
> +
> +       /* Client adds keys in the "wrong" order (wildcard last) to trigger shadowing.
> +        * 1. Specific key (Key B, ID 100)
> +        * 2. Wildcard key (Key A, ID 101)
> +        *
> +        * Without the fix, the wildcard key will be at the head of the list
> +        * and will shadow the specific key during outbound lookup, causing
> +        * the client to send a SYN with KeyID 101 (which the server doesn't have).
> +        */
> +
> +       /* 1. Add specific key */
> +       if (test_add_key(sk, "pass_specific", this_ip_dest, -1, 100, 100))
> +               test_error("setsockopt(TCP_AO_ADD_KEY) specific");
> +
> +       /* 2. Add wildcard key (any address, prefix 0) */
> +       if (test_add_key(sk, "pass_wildcard", wildcard_addr, 0, 101, 101))
> +               test_error("setsockopt(TCP_AO_ADD_KEY) wildcard");

Two notes here:
1. The two keys do not match, so I likely misunderstood your cover
letter. I thought you wanted to override more generic keys (longer
prefix; non-VRF-specific), but your goal is to choose between two
distinct keys, likely with different password-seeds.
2. This only changes behaviour on the initial connect() from a client
that has not indicated which key to use on the SYN segment (via
.set_current=1).

I think the patch is fine, but its usefulness seems limited, unless I
have missed something obvious.

Thanks,
             Dmitry
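
For readers following the thread, the "more specific key wins regardless
of insertion order" behaviour that the selftest checks can be sketched in
userspace as below. This is an illustration only: the sketch_* names are
hypothetical, addresses are IPv4 in host byte order for simplicity, and
nothing here is taken from the kernel's TCP-AO lookup code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key: an address prefix plus the KeyID to put on the SYN. */
struct sketch_key {
	uint32_t addr;		/* host byte order, for simplicity */
	uint8_t prefixlen;	/* 0 means wildcard */
	uint8_t keyid;
};

static int sketch_key_matches(const struct sketch_key *key, uint32_t peer)
{
	uint32_t mask;

	assert(key->prefixlen <= 32);
	/* A shift by 32 is undefined in C, so handle prefix 0 explicitly. */
	mask = key->prefixlen ? ~(uint32_t)0 << (32 - key->prefixlen) : 0;
	return (key->addr & mask) == (peer & mask);
}

/* Pick the matching key with the longest prefix, whatever the list order. */
static const struct sketch_key *
sketch_select_key(const struct sketch_key *keys, size_t n, uint32_t peer)
{
	const struct sketch_key *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!sketch_key_matches(&keys[i], peer))
			continue;
		if (!best || keys[i].prefixlen > best->prefixlen)
			best = &keys[i];
	}
	return best;
}
```

With a /32 key (ID 100) and a /0 key (ID 101) in either order, a lookup
for the /32 key's address selects ID 100, and any other peer falls back
to ID 101.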

^ permalink raw reply

* [PATCH net v2] net: sungem: fix probe error cleanup
From: Ruoyu Wang @ 2026-06-23  2:57 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, netdev, linux-kernel, Ruoyu Wang

gem_init_one() calls gem_remove_one() when register_netdev() fails.
gem_remove_one() unregisters and frees resources owned by the net_device,
including the DMA block, MMIO mapping, PCI regions, and the net_device
itself. gem_init_one() then falls through to its own cleanup labels and
frees the same resources again.

Keep the register_netdev() error path in gem_init_one(): clear drvdata so
PM/remove paths do not see a half-registered device, remove the NAPI
instance added during probe, and let the existing cleanup labels release
the resources once.

The issue was found by a local static-analysis checker for probe error
paths. The reported path was manually inspected before sending this fix.

Compile-tested with CONFIG_SUNGEM=y. Runtime testing was not performed
because no sungem hardware is available.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
---
v2:
- Add a Fixes tag.
- Describe how the issue was found.
- Add testing information.

v1: https://lore.kernel.org/netdev/20260620155326.80582-1-ruoyuw560@gmail.com/

 drivers/net/ethernet/sun/sungem.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
index 8e69d917d827..26974ee71352 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -2986,10 +2986,10 @@ static int gem_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	dev->max_mtu = GEM_MAX_MTU;
 
 	/* Register with kernel */
-	if (register_netdev(dev)) {
+	err = register_netdev(dev);
+	if (err) {
 		pr_err("Cannot register net device, aborting\n");
-		err = -ENOMEM;
-		goto err_out_free_consistent;
+		goto err_out_clear_drvdata;
 	}
 
 	/* Undo the get_cell with appropriate locking (we could use
@@ -3003,8 +3003,13 @@ static int gem_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 		    dev->dev_addr);
 	return 0;
 
+err_out_clear_drvdata:
+	pci_set_drvdata(pdev, NULL);
+	netif_napi_del(&gp->napi);
+
 err_out_free_consistent:
-	gem_remove_one(pdev);
+	dma_free_coherent(&pdev->dev, sizeof(struct gem_init_block),
+			  gp->init_block, gp->gblock_dvma);
 err_out_iounmap:
 	gem_put_cell(gp);
 	iounmap(gp->regs);
-- 
2.51.0
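
The rule the fix restores, that every resource on a probe error path is
released exactly once by falling through the cleanup labels, can be
sketched in userspace as follows. This is an illustration under
assumptions: all sketch_* names are hypothetical, counters stand in for
the real resources, and the assert() in sketch_release() is what would
trip on the double release described in the commit message.

```c
#include <assert.h>

/* Counters standing in for resources acquired during probe. */
struct sketch_probe_state {
	int regions;	/* stands in for the PCI regions */
	int mapping;	/* stands in for the MMIO mapping */
	int block;	/* stands in for the DMA init block */
	int drvdata;	/* stands in for the drvdata pointer */
};

enum sketch_fail_point {
	SKETCH_OK,
	SKETCH_FAIL_MAP,
	SKETCH_FAIL_BLOCK,
	SKETCH_FAIL_REGISTER,
};

static void sketch_release(int *count)
{
	assert(*count == 1);	/* a second release would trip this */
	(*count)--;
}

static int sketch_probe(struct sketch_probe_state *st,
			enum sketch_fail_point fail_at)
{
	int err = -1;

	st->regions++;

	if (fail_at == SKETCH_FAIL_MAP)
		goto err_release_regions;
	st->mapping++;

	if (fail_at == SKETCH_FAIL_BLOCK)
		goto err_unmap;
	st->block++;
	st->drvdata = 1;

	if (fail_at == SKETCH_FAIL_REGISTER)
		goto err_clear_drvdata;

	return 0;

	/* Labels undo the steps in reverse order, each exactly once. */
err_clear_drvdata:
	st->drvdata = 0;
	sketch_release(&st->block);
err_unmap:
	sketch_release(&st->mapping);
err_release_regions:
	sketch_release(&st->regions);
	return err;
}
```

Calling a full teardown helper from the first label and then falling
through the remaining labels would release block, mapping and regions
twice, which is the shape of the bug fixed above.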

^ permalink raw reply related

* [PATCH bpf-next v4 3/3] selftests/bpf: Add bpf_fib_lookup() VLAN flag tests
From: Avinash Duduskar @ 2026-06-23  2:51 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	toke, dsahern
In-Reply-To: <20260623025147.1001664-1-avinash.duduskar@gmail.com>

Cover both directions of the new VLAN flags in the fib_lookup test,
38 table cases plus dedicated cross-netns and XDP-redirect subtests.

For BPF_FIB_LOOKUP_VLAN the egress cases assert:
- without the flag, the lookup returns the VLAN netdev's ifindex and
  zeroed vlan fields;
- with the flag, it returns the parent's ifindex plus the tag (including
  via a neighbour resolved on the VLAN device, in OUTPUT mode, over a
  bond, and through a DIRECT|TBID table);
- with the flag on a non-VLAN egress, it changes nothing;
- for a stacked VLAN (QinQ), it returns BPF_FIB_LKUP_RET_VLAN_FAILURE
  with params->ifindex left at the input, while a lookup without the
  flag returns the inner VLAN device's ifindex;
- a frag-needed return reports the route mtu in mtu_result while leaving
  the swap unwritten.

The VLAN_FAILURE arms are IPv4. bpf_ipv6_fib_lookup() restores
params->ifindex with the same save/restore the IPv4 arms exercise, so an
IPv6 VLAN_FAILURE arm would only re-test shared code.

For BPF_FIB_LOOKUP_VLAN_INPUT, an iif rule on the subinterface routes
the same destination to a different gateway, so the asserted gateway
shows which device the lookup used as ingress: without the flag the
main table answers, with a matching tag the subinterface's table
does, with or without SKIP_NEIGH, and BPF_FIB_LOOKUP_SRC selects the
subinterface's address. A VRF-enslaved subinterface selects the VRF
table through the l3mdev rule and, with DIRECT, through
l3mdev_fib_table_rcu(). One case sets BPF_FIB_LOOKUP_VLAN as well and
asserts both directions work in a single lookup. Resolution semantics
are pinned: an 802.1ad tag resolves its device, PCP and DEI bits in
h_vlan_TCI are ignored, a VLAN ifindex resolves the inner QinQ
device, a tag on a bond master resolves while the same tag on the
bond port does not.
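
As an illustration of "PCP and DEI bits in h_vlan_TCI are ignored", here
is a minimal userspace sketch of extracting the VLAN ID from a
network-order TCI. The sketch_* names are hypothetical and this is not
the helper's implementation; it only shows the standard 802.1Q field
layout (3 PCP bits, 1 DEI bit, 12 VID bits).

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

#define SKETCH_VID_MASK	0x0fff	/* low 12 bits: VLAN ID */
#define SKETCH_DEI_MASK	0x1000	/* drop eligible indicator */
#define SKETCH_PCP_MASK	0xe000	/* priority code point */

/* Return the VLAN ID carried in a big-endian TCI, ignoring PCP and DEI. */
static uint16_t sketch_tci_to_vid(uint16_t tci_be)
{
	uint16_t vid = ntohs(tci_be) & SKETCH_VID_MASK;

	assert(!(vid & (SKETCH_DEI_MASK | SKETCH_PCP_MASK)));
	return vid;
}
```

Under this layout, htons(0xe000 | 100) and htons(100) name the same
VLAN, which is the equivalence the "PCP and DEI bits ignored" case pins.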

The error cases assert -EINVAL for an invalid h_vlan_proto on both
address families, for the TBID and OUTPUT flag combinations and for
an unknown flag bit, and BPF_FIB_LKUP_RET_NOT_FWDED for a VID with no
configured device on both families, for a VID-0 priority tag and for
a device that exists but is down. The failure cases also assert that
params is left untouched. By contrast, a no-neighbour case whose
input and egress devices differ asserts NO_NEIGH reports the egress
ifindex, not the input: only VLAN_FAILURE rewinds params->ifindex to
the input.

A separate subtest moves a VLAN device into a second netns while it
stays registered on its parent, and checks both directions refuse to
cross the boundary: the input flag fails closed with the tag and
ifindex untouched, and the egress flag returns
BPF_FIB_LKUP_RET_VLAN_FAILURE without publishing the foreign parent's
ifindex.

The tbid read-back check is skipped for DIRECT cases that set
BPF_FIB_LOOKUP_VLAN, since a successful swap packs the vlan fields
into the union the check reads.

Re-run the cases through bpf_xdp_fib_lookup() as well: the egress flag
exists because VLAN devices have no XDP xmit, so XDP is the primary
consumer. bpf_prog_test_run uses the netns' loopback for the xdp context's
device, so the lookup runs against the test netns' FIB, and the
path-independent results (return code, swapped ifindex, vlan tag, gateway)
are asserted to match the skb path.

A live-frames subtest (test_fib_lookup_vlan_redirect) drives real
frames through the XDP redirect path with BPF_F_TEST_XDP_LIVE_FRAMES, the
native xdp_do_redirect() plus xdp_do_flush() path. A reducible VLAN
egress is redirected to the physical parent and delivered to its peer;
a QinQ egress returns VLAN_FAILURE and is passed to the stack, since
redirecting to the VLAN device would drop the frame at xdp_do_flush()
(no ndo_xdp_xmit). The redirect program only distinguishes SUCCESS from
any other return code; the table and netns arms pin the exact
VLAN_FAILURE value.

Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 .../selftests/bpf/prog_tests/fib_lookup.c     | 696 +++++++++++++++++-
 .../testing/selftests/bpf/progs/fib_lookup.c  |  36 +
 2 files changed, 728 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/fib_lookup.c b/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
index bd7658958004..d51bc3332e56 100644
--- a/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
+++ b/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
 
 #include <linux/rtnetlink.h>
+#include <linux/if_ether.h>
 #include <sys/types.h>
 #include <net/if.h>
 
@@ -23,6 +24,7 @@
 #define IPV4_TBID_ADDR		"172.0.0.254"
 #define IPV4_TBID_NET		"172.0.0.0"
 #define IPV4_TBID_DST		"172.0.0.2"
+#define IPV4_TBID_NONEIGH_DST	"172.0.0.5"
 #define IPV6_TBID_ADDR		"fd00::FFFF"
 #define IPV6_TBID_NET		"fd00::"
 #define IPV6_TBID_DST		"fd00::2"
@@ -37,6 +39,41 @@
 #define IPV6_LOCAL		"fd01::3"
 #define IPV6_GW1		"fd01::1"
 #define IPV6_GW2		"fd01::2"
+#define VLAN_ID			100
+#define VLAN_IFACE		"veth1.100"
+#define VLAN_ID_DOWN		102
+#define VLAN_IFACE_DOWN		"veth1.102"
+#define QINQ_OUTER_IFACE	"veth1.200"
+#define QINQ_INNER_IFACE	"veth1.200.300"
+#define VLAN_TABLE		"300"
+#define IPV4_VLAN_IFACE_ADDR	"10.5.0.254"
+#define IPV4_VLAN_EGRESS_DST	"10.5.0.2"
+#define IPV4_QINQ_DST		"10.7.0.2"
+#define IPV4_VLAN_DST		"10.6.0.2"
+#define IPV4_VLAN_GW		"10.5.0.1"
+#define IPV6_VLAN_IFACE_ADDR	"fd02::254"
+#define IPV6_VLAN_EGRESS_DST	"fd02::2"
+#define IPV6_VLAN_DST		"fd03::2"
+#define IPV6_VLAN_GW		"fd02::1"
+#define VLAN_VID_UNUSED		999
+#define VRF_IFACE		"vrf-blue"
+#define VRF_TABLE		"1000"
+#define VRF_VLAN_ID		101
+#define VRF_VLAN_IFACE		"veth1.101"
+#define IPV4_VRF_IFACE_ADDR	"10.8.0.254"
+#define IPV4_VRF_GW		"10.8.0.1"
+#define IPV4_VRF_DST		"10.9.0.2"
+#define TBID_VLAN_ID		50
+#define TBID_VLAN_IFACE		"veth2.50"
+#define IPV4_TBID_VLAN_DST	"172.2.0.2"
+#define IPV4_BOND_VLAN_DST	"10.11.0.2"
+#define IPV4_VLAN_MTU_DST	"10.5.9.2"
+#define QINQ_AD_VLAN_ID		200
+#define QINQ_INNER_VLAN_ID	300
+#define BOND_IFACE		"bond99"
+#define BOND_PORT		"veth3"
+#define BOND_PORT_PEER		"veth4"
+#define BOND_VLAN_ID		500
 #define DMAC			"11:11:11:11:11:11"
 #define DMAC_INIT { 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, }
 #define DMAC2			"01:01:01:01:01:01"
@@ -52,6 +89,17 @@ struct fib_lookup_test {
 	__u32 tbid;
 	__u8 dmac[6];
 	__u32 mark;
+	/*
+	 * input tag with BPF_FIB_LOOKUP_VLAN_INPUT; expected output tag
+	 * with BPF_FIB_LOOKUP_VLAN (checked when check_vlan is set)
+	 */
+	__u16 vlan_proto;
+	__u16 vlan_id;
+	bool check_vlan;
+	const char *expected_dev; /* expected params->ifindex after lookup */
+	const char *iif;	  /* override the default veth1 input device */
+	__u16 tot_len;		  /* triggers the in-lookup mtu check when set */
+	__u16 expected_mtu;	  /* expected mtu_result (union with tot_len) */
 };
 
 static const struct fib_lookup_test tests[] = {
@@ -79,6 +127,17 @@ static const struct fib_lookup_test tests[] = {
 	  .daddr = IPV4_TBID_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
 	  .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID, .tbid = 100,
 	  .dmac = DMAC_INIT2, },
+	/*
+	 * An error that returns after the egress device is resolved must
+	 * report the egress ifindex, not the input. This routes from input
+	 * veth1 via veth2 (table 100) to a dst with no neighbour, so
+	 * input != egress, pinning NO_NEIGH to the egress device.
+	 */
+	{ .desc = "IPv4 NO_NEIGH reports the egress ifindex, not the input",
+	  .daddr = IPV4_TBID_NONEIGH_DST,
+	  .expected_ret = BPF_FIB_LKUP_RET_NO_NEIGH,
+	  .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID, .tbid = 100,
+	  .expected_dev = "veth2", },
 	{ .desc = "IPv6 TBID lookup failure",
 	  .daddr = IPV6_TBID_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
 	  .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID,
@@ -142,6 +201,223 @@ static const struct fib_lookup_test tests[] = {
 	  .expected_dst = IPV6_GW1,
 	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
 	  .mark = MARK, },
+	/* vlan egress resolution */
+	/*
+	 * Invariant the VLAN-egress arms jointly enforce: a
+	 * BPF_FIB_LOOKUP_VLAN SUCCESS always carries a physical,
+	 * xmit-capable ifindex -- no SUCCESS ever returns a VLAN-device
+	 * ifindex. Reducible arms pin ifindex == the physical parent; the
+	 * QinQ and foreign-netns arms pin VLAN_FAILURE with params->ifindex
+	 * left at the input, so a regression to best-effort (SUCCESS + the
+	 * VLAN ifindex) fails one.
+	 */
+	{ .desc = "IPv4 VLAN egress, no flag",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = VLAN_IFACE, .check_vlan = true, },
+	{ .desc = "IPv4 VLAN egress, single VLAN",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	/*
+	 * skb path without tot_len: mtu_result is the FIB result (VLAN)
+	 * device's mtu (1400) with or without the swap, not the parent's (1500)
+	 */
+	{ .desc = "IPv4 VLAN egress, skb-path mtu is the VLAN device's without the flag",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = VLAN_IFACE, .check_vlan = true, .expected_mtu = 1400, },
+	{ .desc = "IPv4 VLAN egress, skb-path mtu stays the VLAN device's after the swap",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .expected_mtu = 1400, },
+	{ .desc = "IPv4 VLAN egress, flag set but egress is not a VLAN",
+	  .daddr = IPV4_NUD_FAILED_ADDR, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true, },
+	{ .desc = "IPv4 VLAN egress, QinQ not reducible (VLAN_FAILURE)",
+	  .daddr = IPV4_QINQ_DST,
+	  .expected_ret = BPF_FIB_LKUP_RET_VLAN_FAILURE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true, },
+	{ .desc = "IPv4 QinQ egress without the flag (escape hatch)",
+	  .daddr = IPV4_QINQ_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = QINQ_INNER_IFACE, },
+	{ .desc = "IPv6 VLAN egress, single VLAN",
+	  .daddr = IPV6_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, neighbour on the VLAN device",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .dmac = DMAC_INIT, },
+	{ .desc = "IPv4 VLAN egress in OUTPUT mode",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .iif = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_OUTPUT | BPF_FIB_LOOKUP_VLAN |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress over a bond",
+	  .daddr = IPV4_BOND_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = BOND_IFACE, .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress via TBID table",
+	  .daddr = IPV4_TBID_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID |
+			  BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .tbid = 100,
+	  .expected_dev = "veth2", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = TBID_VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, success writes mtu_result with the swap",
+	  .daddr = IPV4_VLAN_MTU_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .tot_len = 500, .expected_mtu = 1000,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, FRAG_NEEDED reports mtu, swap unwritten",
+	  .daddr = IPV4_VLAN_MTU_DST, .expected_ret = BPF_FIB_LKUP_RET_FRAG_NEEDED,
+	  .tot_len = 1400, .expected_mtu = 1000,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true, },
+	/* vlan tag as lookup input */
+	{ .desc = "IPv4 VLAN input, no flag",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_GW1,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input, tag selects subinterface route",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv6 VLAN input, tag selects subinterface route",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV6_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input and egress combined",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = "veth1",
+	  .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_VLAN |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, neighbour resolved on the route",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .dmac = DMAC_INIT2, },
+	{ .desc = "IPv4 VLAN input, source address from the subinterface",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_src = IPV4_VLAN_IFACE_ADDR,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SRC |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	/*
+	 * VRF: the resolved subinterface is enslaved, so the l3mdev rule
+	 * (full lookup) and l3mdev_fib_table_rcu() (DIRECT) must select
+	 * the VRF table from the resolved ingress
+	 */
+	{ .desc = "IPv4 VLAN input, VRF subinterface, no flag",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_GW1,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input, tag selects VRF table",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VRF_GW, .expected_dev = VRF_VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VRF_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, DIRECT uses VRF table from resolved ingress",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VRF_GW, .expected_dev = VRF_VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_DIRECT |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VRF_VLAN_ID, },
+	/*
+	 * failure arms also assert params is left untouched: ifindex still
+	 * names the physical device and the input tag bytes survive
+	 */
+	{ .desc = "IPv4 VLAN input, invalid proto",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = 0x1234, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, unmatched VID",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_VID_UNUSED, },
+	{ .desc = "IPv4 VLAN input, subinterface down",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID_DOWN, },
+	/*
+	 * the resolver runs before the forwarding check, so on devices
+	 * with forwarding off FWD_DISABLED (not NOT_FWDED) proves the tag
+	 * resolved to that device and the lookup used it as ingress
+	 */
+	{ .desc = "IPv4 VLAN input, 802.1ad tag",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021AD, .vlan_id = QINQ_AD_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, PCP and DEI bits ignored in TCI",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = 0xe000 | VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, inner QinQ device from VLAN ifindex",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .iif = QINQ_OUTER_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = QINQ_INNER_VLAN_ID, },
+	/*
+	 * bonding: the VLANs live on the master, as on receive, where the
+	 * frame is steered to the master before VLAN processing; a port
+	 * ifindex does not match (ports carry vid state but no VLAN devs)
+	 */
+	{ .desc = "IPv4 VLAN input, tag on bond master resolves",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .iif = BOND_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, tag on bond port does not match",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .iif = BOND_PORT, .expected_dev = BOND_PORT, .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv6 VLAN input, invalid proto",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = -EINVAL,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = 0x1234, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, VID 0 priority tag fails closed",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = 0, },
+	{ .desc = "IPv6 VLAN input, unmatched VID",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_VID_UNUSED, },
+	{ .desc = "unknown flag bit rejected",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = (1 << 14) | BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input rejected with TBID",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_TBID,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input rejected with OUTPUT",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_OUTPUT,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
 };
 
 static int setup_netns(void)
@@ -204,6 +480,110 @@ static int setup_netns(void)
 	SYS(fail, "ip rule add prio 2 fwmark %d lookup %s", MARK, MARK_TABLE);
 	SYS(fail, "ip -6 rule add prio 2 fwmark %d lookup %s", MARK, MARK_TABLE);
 
+	/*
+	 * Setup for vlan tests: a subinterface for egress resolution and
+	 * tag-as-input, a QinQ stack, and an iif rule so the input tests
+	 * observe which device the lookup used as ingress.
+	 */
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VLAN_IFACE, VLAN_ID);
+	SYS(fail, "ip link set dev %s up", VLAN_IFACE);
+	/*
+	 * lower than the veth1 parent (1500): the skb-path mtu check uses the
+	 * FIB result (VLAN) device, so mtu_result is this value with or
+	 * without the egress swap, which two arms below pin
+	 */
+	SYS(fail, "ip link set dev %s mtu 1400", VLAN_IFACE);
+	SYS(fail, "ip addr add %s/24 dev %s", IPV4_VLAN_IFACE_ADDR, VLAN_IFACE);
+	SYS(fail, "ip addr add %s/64 dev %s nodad", IPV6_VLAN_IFACE_ADDR, VLAN_IFACE);
+
+	/*
+	 * stays down: the input flag must treat its tag the way real
+	 * ingress treats a frame arriving on a down VLAN device (drop)
+	 */
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VLAN_IFACE_DOWN, VLAN_ID_DOWN);
+
+	err = write_sysctl("/proc/sys/net/ipv4/conf/" VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.conf." VLAN_IFACE ".forwarding)"))
+		goto fail;
+
+	err = write_sysctl("/proc/sys/net/ipv6/conf/" VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv6.conf." VLAN_IFACE ".forwarding)"))
+		goto fail;
+
+	SYS(fail, "ip link add link veth1 name %s type vlan proto 802.1ad id 200",
+	    QINQ_OUTER_IFACE);
+	SYS(fail, "ip link add link %s name %s type vlan id 300",
+	    QINQ_OUTER_IFACE, QINQ_INNER_IFACE);
+	SYS(fail, "ip link set dev %s up", QINQ_OUTER_IFACE);
+	SYS(fail, "ip link set dev %s up", QINQ_INNER_IFACE);
+	SYS(fail, "ip route add %s/32 dev %s", IPV4_QINQ_DST, QINQ_INNER_IFACE);
+
+	SYS(fail, "ip route add %s/32 via %s", IPV4_VLAN_DST, IPV4_GW1);
+	SYS(fail, "ip route add table %s %s/32 via %s",
+	    VLAN_TABLE, IPV4_VLAN_DST, IPV4_VLAN_GW);
+	SYS(fail, "ip rule add prio 3 iif %s lookup %s", VLAN_IFACE, VLAN_TABLE);
+	SYS(fail, "ip -6 route add %s/128 via %s", IPV6_VLAN_DST, IPV6_GW1);
+	SYS(fail, "ip -6 route add table %s %s/128 via %s",
+	    VLAN_TABLE, IPV6_VLAN_DST, IPV6_VLAN_GW);
+	SYS(fail, "ip -6 rule add prio 3 iif %s lookup %s", VLAN_IFACE, VLAN_TABLE);
+
+	/*
+	 * a bond with one port and a VLAN on the bond: VLANs on a bond
+	 * live on the master, so resolution succeeds for the master's
+	 * ifindex and fails closed for a port's, matching receive, which
+	 * steers the frame to the master before VLAN processing
+	 */
+	SYS(fail, "ip link add %s type bond", BOND_IFACE);
+	SYS(fail, "ip link add %s type veth peer name %s", BOND_PORT, BOND_PORT_PEER);
+	SYS(fail, "ip link set %s master %s", BOND_PORT, BOND_IFACE);
+	SYS(fail, "ip link set dev %s up", BOND_IFACE);
+	SYS(fail, "ip link set dev %s up", BOND_PORT);
+	SYS(fail, "ip link add link %s name %s.%d type vlan id %d",
+	    BOND_IFACE, BOND_IFACE, BOND_VLAN_ID, BOND_VLAN_ID);
+	SYS(fail, "ip link set dev %s.%d up", BOND_IFACE, BOND_VLAN_ID);
+	SYS(fail, "ip route add %s/32 dev %s.%d",
+	    IPV4_BOND_VLAN_DST, BOND_IFACE, BOND_VLAN_ID);
+
+	/*
+	 * a VRF with its own dedicated subinterface (the iif rules above
+	 * must not see it), for the table-selection-by-ingress cases
+	 */
+	SYS(fail, "ip link add %s type vrf table %s", VRF_IFACE, VRF_TABLE);
+	SYS(fail, "ip link set dev %s up", VRF_IFACE);
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VRF_VLAN_IFACE, VRF_VLAN_ID);
+	SYS(fail, "ip link set %s master %s", VRF_VLAN_IFACE, VRF_IFACE);
+	SYS(fail, "ip link set dev %s up", VRF_VLAN_IFACE);
+	SYS(fail, "ip addr add %s/24 dev %s", IPV4_VRF_IFACE_ADDR, VRF_VLAN_IFACE);
+	err = write_sysctl("/proc/sys/net/ipv4/conf/" VRF_VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.conf." VRF_VLAN_IFACE ".forwarding)"))
+		goto fail;
+	SYS(fail, "ip route add %s/32 via %s", IPV4_VRF_DST, IPV4_GW1);
+	SYS(fail, "ip route add table %s %s/32 via %s",
+	    VRF_TABLE, IPV4_VRF_DST, IPV4_VRF_GW);
+
+	/* neighbours on the VLAN subinterface for the non-SKIP_NEIGH cases */
+	err = write_sysctl("/proc/sys/net/ipv4/neigh/" VLAN_IFACE "/gc_stale_time", "900");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.neigh." VLAN_IFACE ".gc_stale_time)"))
+		goto fail;
+	SYS(fail, "ip neigh add %s dev %s lladdr %s nud stale",
+	    IPV4_VLAN_EGRESS_DST, VLAN_IFACE, DMAC);
+	SYS(fail, "ip neigh add %s dev %s lladdr %s nud stale",
+	    IPV4_VLAN_GW, VLAN_IFACE, DMAC2);
+
+	/* a VLAN on veth2 with a route in the tbid test table */
+	SYS(fail, "ip link add link veth2 name %s type vlan id %d",
+	    TBID_VLAN_IFACE, TBID_VLAN_ID);
+	SYS(fail, "ip link set dev %s up", TBID_VLAN_IFACE);
+	SYS(fail, "ip route add table 100 %s/32 dev %s",
+	    IPV4_TBID_VLAN_DST, TBID_VLAN_IFACE);
+
+	/* a locked-mtu route via the subinterface for the FRAG_NEEDED case */
+	SYS(fail, "ip route add %s/32 dev %s mtu lock 1000",
+	    IPV4_VLAN_MTU_DST, VLAN_IFACE);
+
 	return 0;
 fail:
 	return -1;
@@ -218,9 +598,16 @@ static int set_lookup_params(struct bpf_fib_lookup *params,
 	memset(params, 0, sizeof(*params));
 
 	params->l4_protocol = IPPROTO_TCP;
-	params->ifindex = ifindex;
+	params->ifindex = test->iif ? if_nametoindex(test->iif) : ifindex;
 	params->tbid = test->tbid;
 	params->mark = test->mark;
+	params->tot_len = test->tot_len;
+
+	/* h_vlan_proto/h_vlan_TCI union with tbid */
+	if (test->lookup_flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		params->h_vlan_proto = htons(test->vlan_proto);
+		params->h_vlan_TCI = htons(test->vlan_id);
+	}
 
 	if (inet_pton(AF_INET6, test->daddr, params->ipv6_dst) == 1) {
 		params->family = AF_INET6;
@@ -298,7 +685,7 @@ void test_fib_lookup(void)
 	struct nstoken *nstoken = NULL;
 	struct __sk_buff skb = { };
 	struct fib_lookup *skel;
-	int prog_fd, err, ret, i;
+	int prog_fd, xdp_fd, err, ret, i;
 
 	/* The test does not use the skb->data, so
 	 * use pkt_v6 for both v6 and v4 test.
@@ -309,11 +696,16 @@ void test_fib_lookup(void)
 		    .ctx_in = &skb,
 		    .ctx_size_in = sizeof(skb),
 	);
+	LIBBPF_OPTS(bpf_test_run_opts, xdp_opts,
+		    .data_in = &pkt_v6,
+		    .data_size_in = sizeof(pkt_v6),
+	);
 
 	skel = fib_lookup__open_and_load();
 	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
 		return;
 	prog_fd = bpf_program__fd(skel->progs.fib_lookup);
+	xdp_fd = bpf_program__fd(skel->progs.fib_lookup_xdp);
 
 	SYS(fail, "ip netns add %s", NS_TEST);
 
@@ -352,6 +744,21 @@ void test_fib_lookup(void)
 		if (tests[i].expected_dst)
 			assert_dst_ip(fib_params, tests[i].expected_dst);
 
+		if (tests[i].expected_dev)
+			ASSERT_EQ(fib_params->ifindex,
+				  if_nametoindex(tests[i].expected_dev), "ifindex");
+
+		if (tests[i].expected_mtu)
+			ASSERT_EQ(fib_params->mtu_result, tests[i].expected_mtu,
+				  "mtu_result");
+
+		if (tests[i].check_vlan) {
+			ASSERT_EQ(fib_params->h_vlan_proto,
+				  htons(tests[i].vlan_proto), "h_vlan_proto");
+			ASSERT_EQ(fib_params->h_vlan_TCI,
+				  htons(tests[i].vlan_id), "h_vlan_TCI");
+		}
+
 		ret = memcmp(tests[i].dmac, fib_params->dmac, sizeof(tests[i].dmac));
 		if (!ASSERT_EQ(ret, 0, "dmac not match")) {
 			char expected[18], actual[18];
@@ -361,15 +768,296 @@ void test_fib_lookup(void)
 			printf("dmac expected %s actual %s ", expected, actual);
 		}
 
-		// ensure tbid is zero'd out after fib lookup.
-		if (tests[i].lookup_flags & BPF_FIB_LOOKUP_DIRECT) {
+		/*
+		 * ensure tbid is zero'd out after fib lookup. With
+		 * BPF_FIB_LOOKUP_VLAN the union holds the packed vlan
+		 * fields instead, so skip the check for those.
+		 */
+		if ((tests[i].lookup_flags & BPF_FIB_LOOKUP_DIRECT) &&
+		    !(tests[i].lookup_flags & BPF_FIB_LOOKUP_VLAN)) {
 			if (!ASSERT_EQ(skel->bss->fib_params.tbid, 0,
 					"expected fib_params.tbid to be zero"))
 				goto fail;
 		}
 	}
 
+	/*
+	 * Re-run the cases through bpf_xdp_fib_lookup(). test_run uses the
+	 * current netns' loopback for ctx->rxq->dev, so dev_net() is NS_TEST
+	 * and the lookup runs against its FIB. The path-independent results
+	 * (return code, swapped ifindex, vlan tag, gateway) must match the skb
+	 * path; the no-tot_len mtu_result is skb-specific and not rechecked.
+	 */
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		if (set_lookup_params(fib_params, &tests[i], skb.ifindex))
+			continue;
+
+		skel->bss->fib_lookup_ret = -1;
+		skel->bss->lookup_flags = tests[i].lookup_flags;
+
+		err = bpf_prog_test_run_opts(xdp_fd, &xdp_opts);
+		if (!ASSERT_OK(err, "xdp test_run"))
+			continue;
+
+		if (!ASSERT_EQ(skel->bss->fib_lookup_ret, tests[i].expected_ret,
+			       "xdp fib_lookup_ret"))
+			printf("(xdp) %s\n", tests[i].desc);
+
+		if (tests[i].expected_dev)
+			ASSERT_EQ(fib_params->ifindex,
+				  if_nametoindex(tests[i].expected_dev),
+				  "xdp ifindex");
+
+		if (tests[i].expected_dst)
+			assert_dst_ip(fib_params, tests[i].expected_dst);
+
+		if (tests[i].check_vlan) {
+			ASSERT_EQ(fib_params->h_vlan_proto,
+				  htons(tests[i].vlan_proto), "xdp h_vlan_proto");
+			ASSERT_EQ(fib_params->h_vlan_TCI,
+				  htons(tests[i].vlan_id), "xdp h_vlan_TCI");
+		}
+	}
+
+fail:
+	if (nstoken)
+		close_netns(nstoken);
+	SYS_NOFAIL("ip netns del " NS_TEST);
+	fib_lookup__destroy(skel);
+}
+
+#define NS_VLAN_A	"fib_lookup_vlan_ns_a"
+#define NS_VLAN_B	"fib_lookup_vlan_ns_b"
+
+/*
+ * A VLAN device can be moved to another netns while staying registered
+ * on its parent. Neither direction may then cross the boundary: the
+ * egress flag must not publish the foreign parent's ifindex, and the
+ * input flag must fail closed rather than use a foreign ingress.
+ */
+void test_fib_lookup_vlan_netns(void)
+{
+	struct bpf_fib_lookup *fib_params;
+	struct nstoken *nstoken = NULL;
+	struct __sk_buff skb = { };
+	struct fib_lookup *skel = NULL;
+	int prog_fd, err, parent_idx, vlan_idx;
+
+	LIBBPF_OPTS(bpf_test_run_opts, run_opts,
+		    .data_in = &pkt_v6,
+		    .data_size_in = sizeof(pkt_v6),
+		    .ctx_in = &skb,
+		    .ctx_size_in = sizeof(skb),
+	);
+
+	skel = fib_lookup__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
+		return;
+	prog_fd = bpf_program__fd(skel->progs.fib_lookup);
+	fib_params = &skel->bss->fib_params;
+
+	SYS(fail, "ip netns add %s", NS_VLAN_A);
+	SYS(fail, "ip netns add %s", NS_VLAN_B);
+
+	nstoken = open_netns(NS_VLAN_A);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns(a)"))
+		goto fail;
+
+	SYS(fail, "ip link add veth7 type veth peer name veth8");
+	SYS(fail, "ip link set dev veth7 up");
+	SYS(fail, "ip link add link veth7 name veth7.66 type vlan id 66");
+	SYS(fail, "ip link set veth7.66 netns %s", NS_VLAN_B);
+
+	parent_idx = if_nametoindex("veth7");
+	if (!ASSERT_NEQ(parent_idx, 0, "if_nametoindex(veth7)"))
+		goto fail;
+
+	/*
+	 * input: the moved device is still in veth7's VLAN group, but it
+	 * lives in another netns, so the lookup must fail closed
+	 */
+	skb.ifindex = parent_idx;
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = parent_idx;
+	fib_params->h_vlan_proto = htons(ETH_P_8021Q);
+	fib_params->h_vlan_TCI = htons(66);
+	if (!ASSERT_EQ(inet_pton(AF_INET, "10.66.0.2", &fib_params->ipv4_dst),
+		       1, "inet_pton(dst)"))
+		goto fail;
+
+	skel->bss->fib_lookup_ret = -1;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT |
+				  BPF_FIB_LOOKUP_SKIP_NEIGH;
+	err = bpf_prog_test_run_opts(prog_fd, &run_opts);
+	if (!ASSERT_OK(err, "test_run(input)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->fib_lookup_ret, BPF_FIB_LKUP_RET_NOT_FWDED,
+		  "input across netns fails closed");
+	ASSERT_EQ(fib_params->ifindex, parent_idx, "ifindex untouched");
+	ASSERT_EQ(fib_params->h_vlan_TCI, htons(66), "tag untouched");
+
+	close_netns(nstoken);
+	nstoken = open_netns(NS_VLAN_B);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns(b)"))
+		goto fail;
+
+	/*
+	 * egress: the fib result is the VLAN device here, but its parent
+	 * is in the other netns, so the swap must not happen
+	 */
+	SYS(fail, "ip link set dev veth7.66 up");
+	SYS(fail, "ip addr add 10.66.0.1/24 dev veth7.66");
+	err = write_sysctl("/proc/sys/net/ipv4/conf/veth7.66/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(forwarding)"))
+		goto fail;
+
+	vlan_idx = if_nametoindex("veth7.66");
+	if (!ASSERT_NEQ(vlan_idx, 0, "if_nametoindex(veth7.66)"))
+		goto fail;
+
+	skb.ifindex = vlan_idx;
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = vlan_idx;
+	if (!ASSERT_EQ(inet_pton(AF_INET, "10.66.0.2", &fib_params->ipv4_dst),
+		       1, "inet_pton(dst)") ||
+	    !ASSERT_EQ(inet_pton(AF_INET, "10.66.0.1", &fib_params->ipv4_src),
+		       1, "inet_pton(src)"))
+		goto fail;
+
+	skel->bss->fib_lookup_ret = -1;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN |
+				  BPF_FIB_LOOKUP_SKIP_NEIGH;
+	err = bpf_prog_test_run_opts(prog_fd, &run_opts);
+	if (!ASSERT_OK(err, "test_run(egress)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->fib_lookup_ret, BPF_FIB_LKUP_RET_VLAN_FAILURE,
+		  "egress returns VLAN_FAILURE");
+	ASSERT_EQ(fib_params->ifindex, vlan_idx,
+		  "foreign parent not published");
+	ASSERT_EQ(fib_params->h_vlan_TCI, 0, "vlan fields zero");
+
+fail:
+	if (nstoken)
+		close_netns(nstoken);
+	SYS_NOFAIL("ip netns del " NS_VLAN_A);
+	SYS_NOFAIL("ip netns del " NS_VLAN_B);
+	fib_lookup__destroy(skel);
+}
+
+#define REDIRECT_NPKTS 1000
+
+/*
+ * The egress flag exists so an XDP program can redirect to the physical
+ * parent. A redirect that lands on a VLAN device is dropped at
+ * xdp_do_flush(), because a VLAN device has no ndo_xdp_xmit. Drive real
+ * frames with BPF_F_TEST_XDP_LIVE_FRAMES, which runs the native
+ * xdp_do_redirect() + xdp_do_flush() path: a reducible VLAN egress
+ * resolves to veth1 and is delivered to its peer veth2, while a QinQ
+ * egress returns VLAN_FAILURE and is passed to the stack instead of
+ * redirected to a device that would silently drop it.
+ */
+void test_fib_lookup_vlan_redirect(void)
+{
+	int redirect_fd, err, veth1_idx, veth2_idx = -1;
+	struct bpf_fib_lookup *fib_params;
+	struct nstoken *nstoken = NULL;
+	struct fib_lookup *skel = NULL;
+	bool xdp_attached = false;
+
+	LIBBPF_OPTS(bpf_test_run_opts, lf_opts,
+		    .data_in = &pkt_v4,
+		    .data_size_in = sizeof(pkt_v4),
+		    .flags = BPF_F_TEST_XDP_LIVE_FRAMES,
+		    .repeat = REDIRECT_NPKTS,
+	);
+
+	skel = fib_lookup__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
+		return;
+	redirect_fd = bpf_program__fd(skel->progs.fib_lookup_redirect);
+	fib_params = &skel->bss->fib_params;
+
+	SYS(fail, "ip netns add %s", NS_TEST);
+	nstoken = open_netns(NS_TEST);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+		goto fail;
+	if (setup_netns())
+		goto fail;
+
+	veth1_idx = if_nametoindex("veth1");
+	veth2_idx = if_nametoindex("veth2");
+	if (!ASSERT_NEQ(veth1_idx, 0, "if_nametoindex(veth1)") ||
+	    !ASSERT_NEQ(veth2_idx, 0, "if_nametoindex(veth2)"))
+		goto fail;
+
+	/*
+	 * A redirect to veth1 is delivered to its peer veth2. veth_xdp_xmit()
+	 * only accepts the frame if veth2's NAPI is up, which on veth means
+	 * veth2 carries an XDP program; xdp_count tallies what arrives.
+	 */
+	err = bpf_xdp_attach(veth2_idx, bpf_program__fd(skel->progs.xdp_count),
+			     XDP_FLAGS_DRV_MODE, NULL);
+	if (!ASSERT_OK(err, "attach xdp_count on veth2"))
+		goto fail;
+	xdp_attached = true;
+
+	/* reducible VLAN egress: resolves to the physical parent veth1 */
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = veth1_idx;
+	if (!ASSERT_EQ(inet_pton(AF_INET, IPV4_IFACE_ADDR, &fib_params->ipv4_src),
+		       1, "inet_pton(src)") ||
+	    !ASSERT_EQ(inet_pton(AF_INET, IPV4_VLAN_EGRESS_DST, &fib_params->ipv4_dst),
+		       1, "inet_pton(reducible dst)"))
+		goto fail;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH;
+	skel->bss->redirected = 0;
+	skel->bss->passed = 0;
+	skel->bss->delivered = 0;
+
+	err = bpf_prog_test_run_opts(redirect_fd, &lf_opts);
+	if (!ASSERT_OK(err, "test_run(reducible egress)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->redirected, REDIRECT_NPKTS, "reducible egress redirected");
+	ASSERT_EQ(skel->bss->passed, 0, "reducible egress not passed");
+	ASSERT_GT(skel->bss->delivered, 0, "reducible egress delivered to veth2");
+
+	/*
+	 * QinQ egress: not reducible, so the lookup returns VLAN_FAILURE and
+	 * the program passes the frame instead of redirecting to the inner
+	 * VLAN device. redirected == 0 is the assertion that matters: the
+	 * program did not redirect to a device that would drop the frame at
+	 * xdp_do_flush(). veth2's delivered count is not checked here, since
+	 * a passed frame can still reach veth2 through the stack's forwarding
+	 * path, which is unrelated to the redirect under test.
+	 */
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = veth1_idx;
+	if (!ASSERT_EQ(inet_pton(AF_INET, IPV4_IFACE_ADDR, &fib_params->ipv4_src),
+		       1, "inet_pton(src)") ||
+	    !ASSERT_EQ(inet_pton(AF_INET, IPV4_QINQ_DST, &fib_params->ipv4_dst),
+		       1, "inet_pton(qinq dst)"))
+		goto fail;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH;
+	skel->bss->redirected = 0;
+	skel->bss->passed = 0;
+
+	err = bpf_prog_test_run_opts(redirect_fd, &lf_opts);
+	if (!ASSERT_OK(err, "test_run(qinq egress)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->passed, REDIRECT_NPKTS, "qinq egress passed");
+	ASSERT_EQ(skel->bss->redirected, 0, "qinq egress not redirected");
+
 fail:
+	if (xdp_attached)
+		bpf_xdp_detach(veth2_idx, XDP_FLAGS_DRV_MODE, NULL);
 	if (nstoken)
 		close_netns(nstoken);
 	SYS_NOFAIL("ip netns del " NS_TEST);
diff --git a/tools/testing/selftests/bpf/progs/fib_lookup.c b/tools/testing/selftests/bpf/progs/fib_lookup.c
index 7b5dd2214ff4..862a1e9457b4 100644
--- a/tools/testing/selftests/bpf/progs/fib_lookup.c
+++ b/tools/testing/selftests/bpf/progs/fib_lookup.c
@@ -19,4 +19,40 @@ int fib_lookup(struct __sk_buff *skb)
 	return TC_ACT_SHOT;
 }
 
+SEC("xdp")
+int fib_lookup_xdp(struct xdp_md *ctx)
+{
+	fib_lookup_ret = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params),
+					lookup_flags);
+
+	return XDP_DROP;
+}
+
+int redirected = 0;
+int passed = 0;
+int delivered = 0;
+
+SEC("xdp")
+int fib_lookup_redirect(struct xdp_md *ctx)
+{
+	struct bpf_fib_lookup params = fib_params;
+	long ret;
+
+	ret = bpf_fib_lookup(ctx, &params, sizeof(params), lookup_flags);
+	if (ret == BPF_FIB_LKUP_RET_SUCCESS) {
+		redirected++;
+		return bpf_redirect(params.ifindex, 0);
+	}
+
+	passed++;
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_count(struct xdp_md *ctx)
+{
+	delivered++;
+	return XDP_DROP;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.54.0


* [PATCH bpf-next v4 2/3] bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
From: Avinash Duduskar @ 2026-06-23  2:51 UTC
  To: ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	toke, dsahern
In-Reply-To: <20260623025147.1001664-1-avinash.duduskar@gmail.com>

BPF_FIB_LOOKUP_VLAN resolves a VLAN egress. The reverse is also
useful: an XDP program receiving a VLAN-tagged frame on a physical
device wants the lookup to behave as if the packet had arrived on the
corresponding VLAN subinterface, so iif-based policy routing and VRF
table selection use the right ingress.

Add BPF_FIB_LOOKUP_VLAN_INPUT. When set, params->h_vlan_proto and
params->h_vlan_TCI are read as an input VLAN tag and the matching VLAN
device of params->ifindex is resolved with __vlan_find_dev_deep_rcu().
The device must be up and in the same network namespace as
params->ifindex (a VLAN device can be moved to another netns while
registered on its parent; receive would deliver into that other
namespace, which a lookup here cannot represent). If params->ifindex
is itself a VLAN device, its inner (QinQ) subinterface is matched.
For a bond or team, a tag on a port matches no device and returns
NOT_FWDED; pass the master's ifindex.
The lookup then runs with the resolved device as the ingress;
params->ifindex itself is not modified on the input side. When the
resolved device is enslaved to a VRF, both the full lookup (via the
l3mdev rule) and BPF_FIB_LOOKUP_DIRECT (via l3mdev_fib_table_rcu())
select the VRF's table from the resolved ingress. That follows from
feeding the resolved device to the flow as the ingress
(fl4.flowi4_iif = dev->ifindex), which is what makes l3mdev resolve
the VRF master from the subinterface rather than from
params->ifindex.

The two failure classes get different treatment on purpose. An
h_vlan_proto other than 802.1Q/802.1ad is API misuse and returns
-EINVAL, since it would otherwise reach the WARN in vlan_proto_idx()
with a program-controlled value. An unmatched VID, a device that is
down, or one in another namespace is a data outcome and returns
BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
fib_get_table() finds no table and mirroring real ingress, where the
receive path drops such frames. A VID of 0 (a priority tag) is looked
up literally and normally fails the same way; receive instead
processes such frames untagged, so callers should not set the flag for
priority tags. Proceeding on the physical device for any of these
would be fail-open for the policy-routing cases above.
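
As an illustration of the tag handling described above, here is a
minimal userspace sketch of the protocol check and VID extraction. It is
a model, not the kernel code: vlan_input_vid() is a hypothetical name,
and the constants are redefined locally so the snippet builds without
kernel headers.

```c
#include <arpa/inet.h>	/* htons(), ntohs() */
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define ETH_P_8021Q	0x8100	/* 802.1Q tag protocol identifier */
#define ETH_P_8021AD	0x88A8	/* 802.1ad (service tag) identifier */
#define VLAN_VID_MASK	0x0fff	/* VID is the low 12 bits of the TCI */

/*
 * Hypothetical model of the input-tag check. Both arguments are in
 * network byte order, as they are in struct bpf_fib_lookup.
 * Returns -EINVAL for an unsupported protocol, otherwise the VID
 * (0..4095) in host byte order.
 */
int vlan_input_vid(uint16_t proto_be, uint16_t tci_be)
{
	if (proto_be != htons(ETH_P_8021Q) && proto_be != htons(ETH_P_8021AD))
		return -EINVAL;

	/* the PCP and DEI bits do not take part in the device match */
	return ntohs(tci_be) & VLAN_VID_MASK;
}
```

A return of 0 from this model is the priority-tag case: the VID is
well-formed, but the device lookup that follows would normally find no
VLAN device for it.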

The h_vlan fields share a union with tbid, so the flag cannot be
combined with BPF_FIB_LOOKUP_TBID. It describes ingress, so it also
cannot be combined with BPF_FIB_LOOKUP_OUTPUT. Both combinations
return -EINVAL; restricting now keeps a later relaxation backward
compatible. Combining with BPF_FIB_LOOKUP_VLAN is allowed: the tag is
consumed on the ingress side and the egress tag is written on
success.
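
The combination rules above can be sketched as a small userspace
pre-check. This is an illustrative model under stated assumptions:
fib_flags_accepted() and FIB_LOOKUP_ALL_FLAGS are hypothetical names,
and the flag bits are local copies of the uapi values as they stand
after this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the flag bits, matching the uapi enum after this series. */
enum {
	BPF_FIB_LOOKUP_DIRECT     = 1U << 0,
	BPF_FIB_LOOKUP_OUTPUT     = 1U << 1,
	BPF_FIB_LOOKUP_SKIP_NEIGH = 1U << 2,
	BPF_FIB_LOOKUP_TBID       = 1U << 3,
	BPF_FIB_LOOKUP_SRC        = 1U << 4,
	BPF_FIB_LOOKUP_MARK       = 1U << 5,
	BPF_FIB_LOOKUP_VLAN       = 1U << 6,
	BPF_FIB_LOOKUP_VLAN_INPUT = 1U << 7,
};

#define FIB_LOOKUP_ALL_FLAGS	0xffU	/* bits 0..7, i.e. every flag above */

/* Hypothetical pre-check: true if the flag word passes the combination rules. */
bool fib_flags_accepted(uint32_t flags)
{
	if (flags & ~FIB_LOOKUP_ALL_FLAGS)
		return false;	/* unknown bit set */

	/* the tag shares a union with tbid, and describes ingress only */
	if ((flags & BPF_FIB_LOOKUP_VLAN_INPUT) &&
	    (flags & (BPF_FIB_LOOKUP_TBID | BPF_FIB_LOOKUP_OUTPUT)))
		return false;

	return true;
}
```

With this model, VLAN_INPUT alone, VLAN_INPUT | VLAN and
VLAN_INPUT | DIRECT are accepted, while VLAN_INPUT | TBID and
VLAN_INPUT | OUTPUT are rejected.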

Under !CONFIG_VLAN_8021Q the __vlan_find_dev_deep_rcu() stub returns
NULL, so every lookup with the flag returns NOT_FWDED, which is
correct since no VLAN device can exist.

Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 include/uapi/linux/bpf.h       | 21 ++++++++++-
 net/core/filter.c              | 66 +++++++++++++++++++++++++++++++---
 tools/include/uapi/linux/bpf.h | 21 ++++++++++-
 3 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8d0058d88eb2..46a1443534bd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3552,6 +3552,22 @@ union bpf_attr {
  *			are written only on success; other output fields keep
  *			the helper's existing behaviour, so a frag-needed result
  *			still reports the route mtu in *params*->mtu_result.
+ *		**BPF_FIB_LOOKUP_VLAN_INPUT**
+ *			Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ *			as an input VLAN tag and run the lookup as if ingress
+ *			had happened on the VLAN subinterface carrying that tag
+ *			on *params*->ifindex. The VID is the low 12 bits of
+ *			*params*->h_vlan_TCI; *params*->h_vlan_proto must be
+ *			ETH_P_8021Q or ETH_P_8021AD in network byte order, else
+ *			**-EINVAL**. If *params*->ifindex is itself a VLAN
+ *			device, its inner (QinQ) subinterface is matched; for a
+ *			bond or team, pass the master's ifindex. An unmatched
+ *			tag, a down device, or one in another namespace returns
+ *			**BPF_FIB_LKUP_RET_NOT_FWDED**, mirroring real ingress.
+ *			A VID of 0 is looked up literally, so do not set this
+ *			flag for priority-tagged frames. Cannot be combined with
+ *			**BPF_FIB_LOOKUP_TBID** or **BPF_FIB_LOOKUP_OUTPUT**
+ *			(returns **-EINVAL**).
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7348,6 +7364,7 @@ enum {
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
 	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
+	BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
 };
 
 enum {
@@ -7418,7 +7435,9 @@ struct bpf_fib_lookup {
 			/*
 			 * output with BPF_FIB_LOOKUP_VLAN: set from the
 			 * resolved egress VLAN device (see the flag); zeroed
-			 * on other successful lookups.
+			 * on other successful lookups. input with
+			 * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+			 * the lookup by.
 			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
diff --git a/net/core/filter.c b/net/core/filter.c
index 8345295d84de..fc603cc36ce9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6228,6 +6228,25 @@ static int bpf_fib_set_fwd_params(struct net_device *dev,
 
 	return 0;
 }
+
+static struct net_device *bpf_fib_vlan_input_dev(struct net_device *dev,
+						 const struct bpf_fib_lookup *params)
+{
+	__be16 proto = params->h_vlan_proto;
+	struct net_device *vlan_dev;
+	u16 vid;
+
+	if (proto != htons(ETH_P_8021Q) && proto != htons(ETH_P_8021AD))
+		return ERR_PTR(-EINVAL);
+
+	vid = ntohs(params->h_vlan_TCI) & VLAN_VID_MASK;
+	vlan_dev = __vlan_find_dev_deep_rcu(dev, proto, vid);
+	if (!vlan_dev || !(vlan_dev->flags & IFF_UP) ||
+	    !net_eq(dev_net(vlan_dev), dev_net(dev)))
+		return NULL;
+
+	return vlan_dev;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_INET)
@@ -6249,6 +6268,14 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (unlikely(!dev))
 		return -ENODEV;
 
+	if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		dev = bpf_fib_vlan_input_dev(dev, params);
+		if (IS_ERR(dev))
+			return PTR_ERR(dev);
+		if (!dev)
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
+	}
+
 	/* verify forwarding is enabled on this interface */
 	in_dev = __in_dev_get_rcu(dev);
 	if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
@@ -6258,7 +6285,11 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 		fl4.flowi4_iif = 1;
 		fl4.flowi4_oif = params->ifindex;
 	} else {
-		fl4.flowi4_iif = params->ifindex;
+		/*
+		 * dev->ifindex, not params->ifindex: VLAN_INPUT may have
+		 * resolved dev to a subinterface above.
+		 */
+		fl4.flowi4_iif = dev->ifindex;
 		fl4.flowi4_oif = 0;
 	}
 	fl4.flowi4_dscp = inet_dsfield_to_dscp(params->tos);
@@ -6401,6 +6432,14 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (unlikely(!dev))
 		return -ENODEV;
 
+	if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		dev = bpf_fib_vlan_input_dev(dev, params);
+		if (IS_ERR(dev))
+			return PTR_ERR(dev);
+		if (!dev)
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
+	}
+
 	idev = __in6_dev_get_safely(dev);
 	if (unlikely(!idev || !READ_ONCE(idev->cnf.forwarding)))
 		return BPF_FIB_LKUP_RET_FWD_DISABLED;
@@ -6409,7 +6448,12 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 		fl6.flowi6_iif = 1;
 		oif = fl6.flowi6_oif = params->ifindex;
 	} else {
-		oif = fl6.flowi6_iif = params->ifindex;
+		/*
+		 * dev->ifindex, not params->ifindex: VLAN_INPUT may have
+		 * resolved dev to a subinterface above.
+		 */
+		oif = dev->ifindex;
+		fl6.flowi6_iif = oif;
 		fl6.flowi6_oif = 0;
 		strict = RT6_LOOKUP_F_HAS_SADDR;
 	}
@@ -6525,7 +6569,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
 			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
 			     BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK | \
-			     BPF_FIB_LOOKUP_VLAN)
+			     BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_VLAN_INPUT)
+
+static bool bpf_fib_lookup_flags_ok(u32 flags)
+{
+	if (flags & ~BPF_FIB_LOOKUP_MASK)
+		return false;
+
+	if ((flags & BPF_FIB_LOOKUP_VLAN_INPUT) &&
+	    (flags & (BPF_FIB_LOOKUP_TBID | BPF_FIB_LOOKUP_OUTPUT)))
+		return false;
+
+	return true;
+}
 
 BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
@@ -6533,7 +6589,7 @@ BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	if (plen < sizeof(*params))
 		return -EINVAL;
 
-	if (flags & ~BPF_FIB_LOOKUP_MASK)
+	if (!bpf_fib_lookup_flags_ok(flags))
 		return -EINVAL;
 
 	switch (params->family) {
@@ -6572,7 +6628,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	if (plen < sizeof(*params))
 		return -EINVAL;
 
-	if (flags & ~BPF_FIB_LOOKUP_MASK)
+	if (!bpf_fib_lookup_flags_ok(flags))
 		return -EINVAL;
 
 	if (params->tot_len)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 8d0058d88eb2..46a1443534bd 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3552,6 +3552,22 @@ union bpf_attr {
  *			are written only on success; other output fields keep
  *			the helper's existing behaviour, so a frag-needed result
  *			still reports the route mtu in *params*->mtu_result.
+ *		**BPF_FIB_LOOKUP_VLAN_INPUT**
+ *			Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ *			as an input VLAN tag and run the lookup as if ingress
+ *			had happened on the VLAN subinterface carrying that tag
+ *			on *params*->ifindex. The VID is the low 12 bits of
+ *			*params*->h_vlan_TCI; *params*->h_vlan_proto must be
+ *			ETH_P_8021Q or ETH_P_8021AD in network byte order, else
+ *			**-EINVAL**. If *params*->ifindex is itself a VLAN
+ *			device, its inner (QinQ) subinterface is matched; for a
+ *			bond or team, pass the master's ifindex. An unmatched
+ *			tag, a down device, or one in another namespace returns
+ *			**BPF_FIB_LKUP_RET_NOT_FWDED**, mirroring real ingress.
+ *			A VID of 0 is looked up literally, so do not set this
+ *			flag for priority-tagged frames. Cannot be combined with
+ *			**BPF_FIB_LOOKUP_TBID** or **BPF_FIB_LOOKUP_OUTPUT**
+ *			(returns **-EINVAL**).
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7348,6 +7364,7 @@ enum {
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
 	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
+	BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
 };
 
 enum {
@@ -7418,7 +7435,9 @@ struct bpf_fib_lookup {
 			/*
 			 * output with BPF_FIB_LOOKUP_VLAN: set from the
 			 * resolved egress VLAN device (see the flag); zeroed
-			 * on other successful lookups.
+			 * on other successful lookups. input with
+			 * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+			 * the lookup by.
 			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
-- 
2.54.0


* [PATCH bpf-next v4 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Avinash Duduskar @ 2026-06-23  2:51 UTC
  To: ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	toke, dsahern
In-Reply-To: <20260623025147.1001664-1-avinash.duduskar@gmail.com>

bpf_fib_lookup() returns the FIB-resolved egress ifindex straight
from the fib result. When the egress is a VLAN device, the returned
ifindex is the VLAN netdev's, which has no XDP xmit handler; XDP
programs that want to forward the frame (e.g. xdp-forward) must
instead target the underlying physical device and push the VLAN tag
themselves. Today the program has no way to learn either the
underlying ifindex or the VLAN tag without maintaining its own
VLAN-to-ifindex map in userspace and refreshing it on netlink
events.

Add BPF_FIB_LOOKUP_VLAN. When the caller sets this flag and the fib
result is a VLAN device whose immediate parent is a real (non-VLAN)
device in the same network namespace, populate the existing output
fields params->h_vlan_proto and params->h_vlan_TCI from the VLAN
device and replace params->ifindex with the parent's ifindex.
params->h_vlan_TCI carries the VID only, with PCP and DEI bits zero; a
consumer wanting to set egress priority writes PCP itself.
params->smac is the VLAN device's own address, which can differ from
the parent's.
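
Since the returned TCI carries the VID only, a consumer that wants an
egress priority has to merge the PCP bits in itself. A minimal sketch of
that step follows; vlan_tci_with_pcp() is a hypothetical helper, not
part of this series, and the constants are redefined locally so the
snippet builds in userspace.

```c
#include <arpa/inet.h>	/* htons(), ntohs() */
#include <assert.h>
#include <stdint.h>

#define VLAN_VID_MASK	0x0fff	/* low 12 bits of the TCI */
#define VLAN_PRIO_SHIFT	13	/* PCP occupies the top 3 bits */

/*
 * Hypothetical helper: take the VID-only TCI (network byte order) and
 * return a TCI, also in network byte order, with the 3-bit PCP set.
 * The DEI bit is left clear.
 */
uint16_t vlan_tci_with_pcp(uint16_t tci_be, unsigned int pcp)
{
	uint16_t vid = ntohs(tci_be) & VLAN_VID_MASK;

	return htons((uint16_t)(((pcp & 0x7) << VLAN_PRIO_SHIFT) | vid));
}
```

For VID 66 and PCP 5 this yields a host-order TCI of 0xA042; with PCP 0
the input is returned unchanged.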

Only the immediate parent is resolved, via vlan_dev_priv(dev)->real_dev
and not vlan_dev_real_dev(), which walks to the bottom of a stack. When
the immediate parent is not a real device in the same namespace, the
lookup returns BPF_FIB_LKUP_RET_VLAN_FAILURE and leaves params->ifindex
at the input. This covers a stacked VLAN (QinQ), where the immediate
parent is itself a VLAN device and one h_vlan_proto/h_vlan_TCI pair
cannot describe two tags, and a parent in another network namespace (a
VLAN device can be moved while its parent stays), whose ifindex would
be meaningless in the caller's namespace. A program that wants the VLAN
device's own ifindex re-issues the lookup without BPF_FIB_LOOKUP_VLAN,
so the unreducible case stays distinct from a physical egress. That
distinction matters for XDP: a program cannot xmit on a VLAN device, so
a success carrying the VLAN ifindex would make it redirect to a device
with no ndo_xdp_xmit and drop the frame at xdp_do_flush(). The swap and
the vlan fields are written only on the reduce path; other output
fields keep their existing behaviour, so a frag-needed result still
reports the route mtu in params->mtu_result.
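
The reduce decision described above can be modelled in a few lines of
userspace C. This is a toy sketch with the flag assumed set: struct
toy_dev and toy_reduce_egress() are hypothetical stand-ins for
struct net_device and the helper's internals, and the -1 return stands
in for BPF_FIB_LKUP_RET_VLAN_FAILURE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a net_device; only what the decision needs. */
struct toy_dev {
	int ifindex;
	int netns;			/* stand-in for dev_net() */
	uint16_t vlan_id;		/* meaningful only for VLAN devices */
	const struct toy_dev *parent;	/* immediate parent, NULL if physical */
};

/*
 * Returns the ifindex to redirect to and stores the VID to push in *vid
 * (0 for an untagged, physical egress). Returns -1 where the egress
 * cannot be reduced to one physical device plus one tag.
 */
int toy_reduce_egress(const struct toy_dev *dev, uint16_t *vid)
{
	*vid = 0;

	if (!dev->parent)			/* already physical */
		return dev->ifindex;

	if (dev->parent->parent ||		/* QinQ: parent is a VLAN device */
	    dev->parent->netns != dev->netns)	/* parent in another netns */
		return -1;

	*vid = dev->vlan_id;
	return dev->parent->ifindex;
}
```

Only the immediate parent is inspected, which is why a QinQ device
fails here instead of resolving to the bottom of the stack.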

On the skb path without tot_len the deferred mtu check is done against
the resolved egress device. To keep that check on the VLAN device and
not the parent after the swap, bpf_ipv4_fib_lookup()/bpf_ipv6_fib_lookup()
hand the FIB-result device back to the caller; the XDP path always
runs the route-mtu check and passes NULL. When the flag is not set,
behaviour is unchanged: h_vlan_proto and h_vlan_TCI are zeroed and
ifindex is left at the FIB result.

The new block is compiled only under CONFIG_VLAN_8021Q since
vlan_dev_priv() is not defined otherwise; without that config
is_vlan_dev() is constant false and the flag is accepted but never
acts. That is safe because no VLAN device can exist there, so every
egress is already physical.

This lets an XDP redirect target the physical device and learn the
tag to push in a single lookup, which xdp-forward's optional VLAN
mode (xdp-project/xdp-tools#504) wants from the kernel side.

The helper's input semantics are unchanged; the reverse direction
(supplying a tag as lookup input) is added in the following patch.

Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 include/uapi/linux/bpf.h       | 28 +++++++++++++-
 net/core/filter.c              | 69 ++++++++++++++++++++++++----------
 tools/include/uapi/linux/bpf.h | 28 +++++++++++++-
 3 files changed, 104 insertions(+), 21 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 89b36de5fdbb..8d0058d88eb2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3532,6 +3532,26 @@ union bpf_attr {
  *			Use the mark present in *params*->mark for the fib lookup.
  *			This option should not be used with BPF_FIB_LOOKUP_DIRECT,
  *			as it only has meaning for full lookups.
+ *		**BPF_FIB_LOOKUP_VLAN**
+ *			If the fib lookup resolves to a VLAN device whose
+ *			parent is a real (non-VLAN) device, set
+ *			*params*->h_vlan_proto and *params*->h_vlan_TCI from
+ *			the VLAN device and replace *params*->ifindex with the
+ *			parent's ifindex. *params*->h_vlan_TCI carries the VID
+ *			only, with PCP and DEI bits zero; a consumer wanting to
+ *			set egress priority writes PCP itself. *params*->smac is
+ *			the VLAN device's own address, which can differ from the
+ *			parent's. Only the immediate parent is resolved; if it
+ *			is itself a VLAN device (QinQ) or in another namespace,
+ *			the egress cannot be reduced to a physical device plus
+ *			one tag and the lookup returns
+ *			**BPF_FIB_LKUP_RET_VLAN_FAILURE** with *params*->ifindex
+ *			left at the input. Re-issue without
+ *			**BPF_FIB_LOOKUP_VLAN** to obtain the VLAN device's own
+ *			ifindex. The swap and the vlan fields
+ *			are written only on success; other output fields keep
+ *			the helper's existing behaviour, so a frag-needed result
+ *			still reports the route mtu in *params*->mtu_result.
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7327,6 +7347,7 @@ enum {
 	BPF_FIB_LOOKUP_TBID    = (1U << 3),
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
+	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
 };
 
 enum {
@@ -7340,6 +7361,7 @@ enum {
 	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neighbor entry for nh */
 	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
 	BPF_FIB_LKUP_RET_NO_SRC_ADDR,  /* failed to derive IP src addr */
+	BPF_FIB_LKUP_RET_VLAN_FAILURE, /* VLAN egress, parent unresolvable */
 };
 
 struct bpf_fib_lookup {
@@ -7393,7 +7415,11 @@ struct bpf_fib_lookup {
 
 	union {
 		struct {
-			/* output */
+			/*
+			 * output with BPF_FIB_LOOKUP_VLAN: set from the
+			 * resolved egress VLAN device (see the flag); zeroed
+			 * on other successful lookups.
+			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
 		};
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e96b4b847ce..8345295d84de 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6201,10 +6201,28 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #endif
 
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
-static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
+static int bpf_fib_set_fwd_params(struct net_device *dev,
+				  struct bpf_fib_lookup *params,
+				  u32 flags, u32 mtu)
 {
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+
+#if IS_ENABLED(CONFIG_VLAN_8021Q)
+	if ((flags & BPF_FIB_LOOKUP_VLAN) && is_vlan_dev(dev)) {
+		struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
+
+		if (!is_vlan_dev(real_dev) &&
+		    net_eq(dev_net(real_dev), dev_net(dev))) {
+			params->h_vlan_proto = vlan_dev_vlan_proto(dev);
+			params->h_vlan_TCI = htons(vlan_dev_vlan_id(dev));
+			params->ifindex = real_dev->ifindex;
+		} else {
+			return BPF_FIB_LKUP_RET_VLAN_FAILURE;
+		}
+	}
+#endif
+
 	if (mtu)
 		params->mtu_result = mtu; /* union with tot_len */
 
@@ -6214,8 +6232,10 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
 
 #if IS_ENABLED(CONFIG_INET)
 static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
-			       u32 flags, bool check_mtu)
+			       u32 flags, bool check_mtu,
+			       struct net_device **fwd_dev)
 {
+	u32 in_ifindex = params->ifindex;
 	struct neighbour *neigh = NULL;
 	struct fib_nh_common *nhc;
 	struct in_device *in_dev;
@@ -6347,16 +6367,23 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 
 set_fwd_params:
-	return bpf_fib_set_fwd_params(params, mtu);
+	if (fwd_dev)
+		*fwd_dev = dev;
+	err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
+	if (err == BPF_FIB_LKUP_RET_VLAN_FAILURE)
+		params->ifindex = in_ifindex;
+	return err;
 }
 #endif
 
 #if IS_ENABLED(CONFIG_IPV6)
 static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
-			       u32 flags, bool check_mtu)
+			       u32 flags, bool check_mtu,
+			       struct net_device **fwd_dev)
 {
 	struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
 	struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
+	u32 in_ifindex = params->ifindex;
 	struct fib6_result res = {};
 	struct neighbour *neigh;
 	struct net_device *dev;
@@ -6486,13 +6513,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 
 set_fwd_params:
-	return bpf_fib_set_fwd_params(params, mtu);
+	if (fwd_dev)
+		*fwd_dev = dev;
+	err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
+	if (err == BPF_FIB_LKUP_RET_VLAN_FAILURE)
+		params->ifindex = in_ifindex;
+	return err;
 }
 #endif
 
 #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
 			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
-			     BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK)
+			     BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK | \
+			     BPF_FIB_LOOKUP_VLAN)
 
 BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
@@ -6507,12 +6540,12 @@ BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
 		return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params,
-					   flags, true);
+					   flags, true, NULL);
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
 		return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params,
-					   flags, true);
+					   flags, true, NULL);
 #endif
 	}
 	return -EAFNOSUPPORT;
@@ -6532,6 +6565,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
 {
 	struct net *net = dev_net(skb->dev);
+	struct net_device *fwd_dev = NULL;
 	int rc = -EAFNOSUPPORT;
 	bool check_mtu = false;
 
@@ -6547,29 +6581,26 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	switch (params->family) {
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
-		rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu);
+		rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu,
+					 &fwd_dev);
 		break;
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu);
+		rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu,
+					 &fwd_dev);
 		break;
 #endif
 	}
 
 	if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) {
-		struct net_device *dev;
-
-		/* When tot_len isn't provided by user, check skb
-		 * against MTU of FIB lookup resulting net_device
+		/* without tot_len, check the skb against the FIB-result
+		 * device's MTU
 		 */
-		dev = dev_get_by_index_rcu(net, params->ifindex);
-		if (unlikely(!dev))
-			return -ENODEV;
-		if (!is_skb_forwardable(dev, skb))
+		if (!is_skb_forwardable(fwd_dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
 
-		params->mtu_result = dev->mtu; /* union with tot_len */
+		params->mtu_result = fwd_dev->mtu; /* union with tot_len */
 	}
 
 	return rc;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 89b36de5fdbb..8d0058d88eb2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3532,6 +3532,26 @@ union bpf_attr {
  *			Use the mark present in *params*->mark for the fib lookup.
  *			This option should not be used with BPF_FIB_LOOKUP_DIRECT,
  *			as it only has meaning for full lookups.
+ *		**BPF_FIB_LOOKUP_VLAN**
+ *			If the fib lookup resolves to a VLAN device whose
+ *			parent is a real (non-VLAN) device, set
+ *			*params*->h_vlan_proto and *params*->h_vlan_TCI from
+ *			the VLAN device and replace *params*->ifindex with the
+ *			parent's ifindex. *params*->h_vlan_TCI carries the VID
+ *			only, with PCP and DEI bits zero; a consumer wanting to
+ *			set egress priority writes PCP itself. *params*->smac is
+ *			the VLAN device's own address, which can differ from the
+ *			parent's. Only the immediate parent is resolved; if it
+ *			is itself a VLAN device (QinQ) or in another namespace,
+ *			the egress cannot be reduced to a physical device plus
+ *			one tag and the lookup returns
+ *			**BPF_FIB_LKUP_RET_VLAN_FAILURE** with *params*->ifindex
+ *			left at the input. Re-issue without
+ *			**BPF_FIB_LOOKUP_VLAN** to obtain the VLAN device's own
+ *			ifindex. The swap and the vlan fields
+ *			are written only on success; other output fields keep
+ *			the helper's existing behaviour, so a frag-needed result
+ *			still reports the route mtu in *params*->mtu_result.
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7327,6 +7347,7 @@ enum {
 	BPF_FIB_LOOKUP_TBID    = (1U << 3),
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
+	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
 };
 
 enum {
@@ -7340,6 +7361,7 @@ enum {
 	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neighbor entry for nh */
 	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
 	BPF_FIB_LKUP_RET_NO_SRC_ADDR,  /* failed to derive IP src addr */
+	BPF_FIB_LKUP_RET_VLAN_FAILURE, /* VLAN egress, parent unresolvable */
 };
 
 struct bpf_fib_lookup {
@@ -7393,7 +7415,11 @@ struct bpf_fib_lookup {
 
 	union {
 		struct {
-			/* output */
+			/*
+			 * output with BPF_FIB_LOOKUP_VLAN: set from the
+			 * resolved egress VLAN device (see the flag); zeroed
+			 * on other successful lookups.
+			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
 		};
-- 
2.54.0


^ permalink raw reply related

* [PATCH bpf-next v4 0/3] bpf: bidirectional VLAN support for bpf_fib_lookup()
From: Avinash Duduskar @ 2026-06-23  2:51 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	toke, dsahern

This series adds VLAN awareness to bpf_fib_lookup() in both directions.
BPF_FIB_LOOKUP_VLAN resolves a VLAN egress to its underlying real device
plus the VLAN tag (XDP programs need this because VLAN devices have no XDP
xmit), and BPF_FIB_LOOKUP_VLAN_INPUT runs the lookup as if a tagged frame
had arrived on the matching VLAN subinterface, for iif policy routing and
VRF table selection.

The independent l3mdev/VRF flow-init fix, patch 1 in v1 and v2, was split
out and merged to bpf separately.

v4 changes what bpf_fib_lookup() returns in one case. v3 left a VLAN
egress that cannot be reduced to a physical device plus one tag (a QinQ
egress, or a parent in another namespace) as best-effort SUCCESS with
the VLAN device's ifindex. In his v3 review Toke asked for a distinct
error code there instead, so an XDP program cannot mistake an unresolved
VLAN egress for a physical one, and suggested the name
BPF_FIB_LKUP_RET_VLAN_FAILURE; v4 implements that. The
code is appended after BPF_FIB_LKUP_RET_NO_SRC_ADDR (nothing renumbered,
tools/ mirror updated) and is returned only when BPF_FIB_LOOKUP_VLAN is
set, so no existing caller can observe it. On that failure params->ifindex
is left at the input, like the input-side failures; a tc or XDP program
that wants the VLAN device's own ifindex re-issues the lookup without the
flag, the recovery path he described.

The reason for a distinct code rather than best-effort SUCCESS: SUCCESS
on an unreducible egress silently blackholed XDP. A redirect to the VLAN
device drops at xdp_do_flush() with no in-band signal to the program;
VLAN_FAILURE makes the unreducible case explicit, and a live-frames
selftest exercises both the redirected and the passed paths. Only the
immediate parent is resolved, so QinQ and foreign-netns are the
unreducible cases; bond, team, TBID and VRF egress resolve, as the
selftest table pins.
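
As a rough userspace illustration of the rule above (not the kernel code:
toy_dev, toy_params and toy_resolve_vlan() are made-up names, and the tag
fields are kept in host byte order for simplicity), the reducible and
unreducible cases look roughly like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for kernel objects; every name here is hypothetical. */
struct toy_dev {
	int ifindex;
	int netns;                      /* namespace id */
	bool is_vlan;
	uint16_t vlan_proto;            /* host order, e.g. 0x8100 */
	uint16_t vlan_id;
	const struct toy_dev *real_dev; /* immediate parent when is_vlan */
};

enum toy_ret { TOY_SUCCESS = 0, TOY_VLAN_FAILURE = 1 };

struct toy_params {
	int ifindex;                    /* in: ingress, out: egress */
	uint16_t h_vlan_proto;
	uint16_t h_vlan_TCI;
};

/* dev is the device the (imaginary) FIB lookup resolved to. */
static enum toy_ret toy_resolve_vlan(const struct toy_dev *dev, bool want_vlan,
				     struct toy_params *p)
{
	int in_ifindex = p->ifindex;

	p->h_vlan_proto = 0;
	p->h_vlan_TCI = 0;
	p->ifindex = dev->ifindex;

	if (want_vlan && dev->is_vlan) {
		const struct toy_dev *real = dev->real_dev;

		assert(real != NULL);
		/* Only the immediate parent is considered. */
		if (real->is_vlan || real->netns != dev->netns) {
			p->ifindex = in_ifindex; /* left at the input */
			return TOY_VLAN_FAILURE;
		}
		p->h_vlan_proto = dev->vlan_proto;
		p->h_vlan_TCI = dev->vlan_id;    /* VID only, PCP/DEI zero */
		p->ifindex = real->ifindex;
	}
	return TOY_SUCCESS;
}
```

Calling it a second time with want_vlan == false corresponds to re-issuing
the lookup without the flag to get the VLAN device's own ifindex.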

Changes v3 -> v4:

- Patch 1: return BPF_FIB_LKUP_RET_VLAN_FAILURE for an unreducible VLAN
  egress, leaving params->ifindex at the input.

- Patch 3: the QinQ-egress and cross-namespace-egress arms expect
  VLAN_FAILURE; an escape-hatch arm re-issues without the flag for the
  inner VLAN device's ifindex; and test_fib_lookup_vlan_redirect drives
  live frames (BPF_F_TEST_XDP_LIVE_FRAMES) through the native redirect
  path, asserting a reducible egress is delivered and a QinQ egress is
  passed to the stack. The selftest's VLAN_FAILURE arms are IPv4 only,
  since bpf_ipv6_fib_lookup() restores params->ifindex with the same code
  as the IPv4 path that the arms exercise.

The three other v3 questions Toke marked fine as-is, and v4 keeps them: an
unmatched, down, or cross-namespace tag on input returns NOT_FWDED;
OUTPUT | VLAN_INPUT is rejected with -EINVAL; the BPF_FIB_LOOKUP_VLAN_INPUT
name is kept.

Taking the tag as lookup input follows the approach David Ahern suggested
in the 2021 fwmark discussion:
https://lore.kernel.org/bpf/6248c547-ad64-04d6-fcec-374893cc1ef2@gmail.com/

v3: https://lore.kernel.org/all/20260617224729.1428662-1-avinash.duduskar@gmail.com/
v2: https://lore.kernel.org/all/20260616223426.3568080-1-avinash.duduskar@gmail.com/
v1: https://lore.kernel.org/all/20260609172052.81613-1-avinash.duduskar@gmail.com/

Avinash Duduskar (3):
  bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
  bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
  selftests/bpf: Add bpf_fib_lookup() VLAN flag tests

 include/uapi/linux/bpf.h                      |  47 +-
 net/core/filter.c                             | 133 +++-
 tools/include/uapi/linux/bpf.h                |  47 +-
 .../selftests/bpf/prog_tests/fib_lookup.c     | 696 +++++++++++++++++-
 .../testing/selftests/bpf/progs/fib_lookup.c  |  36 +
 5 files changed, 930 insertions(+), 29 deletions(-)


base-commit: a975094bf98ca97be9146f9d3b5681a6f9cf5ce3
-- 
2.54.0


^ permalink raw reply

* [PATCH] net/tcp-ao: fix use-after-free of key in del_async path
From: HanQuan @ 2026-06-23  1:52 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, ncardwell, HanQuan

In tcp_ao_delete_key(), the del_async path skips the current_key
and rnext_key validity checks present in the synchronous path,
assuming these pointers are always NULL on LISTEN sockets.  However,
if a key was added with set_current=1/set_rnext=1 while the socket
was in CLOSE state, current_key and rnext_key will be non-NULL
after listen() transitions the socket to LISTEN.

When such a key is deleted with del_async=1, hlist_del_rcu() and
call_rcu() free the key without clearing the dangling pointers.
After the RCU grace period, getsockopt(TCP_AO_INFO) dereferences
current_key->sndid and rnext_key->rcvid from freed slab memory.

Clear current_key and rnext_key in the del_async path when they
reference the key being deleted.
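
The pattern can be sketched in plain userspace C. This is only an
illustration under simplifying assumptions: toy_key, toy_ao_info and the
two functions are hypothetical names, free() stands in for the deferred
RCU free, and no locking or WRITE_ONCE() is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct toy_key {
	int sndid;
	int rcvid;
};

struct toy_ao_info {
	struct toy_key *current_key;
	struct toy_key *rnext_key;
};

/* Drop any cached reference to the key before it is freed. */
static void toy_delete_key(struct toy_ao_info *info, struct toy_key *key)
{
	assert(key != NULL);
	if (info->current_key == key)
		info->current_key = NULL;
	if (info->rnext_key == key)
		info->rnext_key = NULL;
	free(key); /* stands in for call_rcu() + free callback */
}

/* A reader that tolerates a missing key; returns -1 when none is set. */
static int toy_current_sndid(const struct toy_ao_info *info)
{
	return info->current_key ? info->current_key->sndid : -1;
}
```

Without the two NULL assignments, toy_current_sndid() would read freed
memory after toy_delete_key(), which is the shape of the bug described
above.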

Fixes: d6732b95b6fb ("net/tcp: Allow asynchronous delete for TCP-AO keys (MKTs)")
Signed-off-by: HanQuan <eilaimemedsnaimel@gmail.com>
---
 net/ipv4/tcp_ao.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/tcp_ao.c b/net/ipv4/tcp_ao.c
index 2f69bcecae78..a56bb79e15e0 100644
--- a/net/ipv4/tcp_ao.c
+++ b/net/ipv4/tcp_ao.c
@@ -1747,6 +1747,10 @@ static int tcp_ao_delete_key(struct sock *sk, struct tcp_ao_info *ao_info,
 	 * them and we can just free all resources in RCU fashion.
 	 */
 	if (del_async) {
+		if (ao_info->current_key == key)
+			WRITE_ONCE(ao_info->current_key, NULL);
+		if (ao_info->rnext_key == key)
+			WRITE_ONCE(ao_info->rnext_key, NULL);
 		atomic_sub(tcp_ao_sizeof_key(key), &sk->sk_omem_alloc);
 		call_rcu(&key->rcu, tcp_ao_key_free_rcu);
 		return 0;
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v1 net] ipv4: fib: Don't ignore error route in local/main tables.
From: patchwork-bot+netdevbpf @ 2026-06-23  1:50 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: dsahern, idosch, davem, edumazet, kuba, pabeni, horms, kuni1840,
	netdev
In-Reply-To: <20260619212753.3367244-1-kuniyu@google.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 19 Jun 2026 21:27:20 +0000 you wrote:
> When CONFIG_IP_MULTIPLE_TABLES is enabled but no rule is added,
> fib_lookup() performs route lookup directly on two tables.
> 
> Since the first lookup does not properly bail out, the result
> of an error route in the merged local/main table could be
> overwritten by another route in the default table:
> 
> [...]

Here is the summary with links:
  - [v1,net] ipv4: fib: Don't ignore error route in local/main tables.
    https://git.kernel.org/netdev/net/c/b72f0db64205

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net] bnx2x: fix potential memory leak in bnx2x_alloc_mem_bp()
From: patchwork-bot+netdevbpf @ 2026-06-23  1:50 UTC (permalink / raw)
  To: Abdun Nihaal
  Cc: skalluru, manishc, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-kernel, barak, stable
In-Reply-To: <20260620062402.89549-1-nihaal@cse.iitm.ac.in>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Sat, 20 Jun 2026 11:53:50 +0530 you wrote:
> If the allocation of fp[i].tpa_info fails, the error path will not free
> the struct bnx2x_fastpath allocated earlier, as it is not linked to the
> bp structure yet. Fix that by linking it immediately after allocation.
> 
> Cc: stable@vger.kernel.org
> Fixes: 15192a8cf8a8 ("bnx2x: Split the FP structure")
> Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in>
> 
> [...]

Here is the summary with links:
  - [net] bnx2x: fix potential memory leak in bnx2x_alloc_mem_bp()
    https://git.kernel.org/netdev/net/c/a986fde914d8

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Menglong Dong @ 2026-06-23  1:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Menglong Dong, xuanzhuo, eperezma, jasowang, andrew+netdev, davem,
	edumazet, kuba, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <20260622085825-mutt-send-email-mst@kernel.org>

On 2026/6/22 21:24 Michael S. Tsirkin <mst@redhat.com> write:
> On Mon, Jun 22, 2026 at 08:27:12PM +0800, Menglong Dong wrote:
> > On 2026/6/22 06:31 Michael S. Tsirkin <mst@redhat.com> write:
> > > On Tue, Jun 16, 2026 at 07:59:12PM +0800, Menglong Dong wrote:
> > [...]
[...]
> > 
> > And the logic is like this:
> > 
> > Kernel: tx NAPI is waked up from skb_xmit_done() ->
> > Kernel: sq->vq and xsk->tx_ring are both empty ->
> > Kernel: call virtnet_xsk_xmit_batch()
> > 
> >     User: submit an entry to the xsk->tx_ring
> >     User: check the wakeup flag
> >     User: wakeup flag is not set, skip send()
> > 
> > Kernel: call xsk_set_tx_need_wakeup(), because sq->vq is empty
> > 
> > If we don't send more data, the data in the xsk->tx_ring will
> > not be sent forever.
> 
> I'm not 100% sure I understand, but when someone fixes cross-CPU races
> with no synchronization or CPU memory barriers just with extra checks,
> this always gives me pause.
> 
> AI helped write this for me, for example:
>   1. Kernel: xsk_set_tx_need_wakeup stores NEED_WAKEUP (sits in store buffer)
>   2. Kernel: xsk_tx_peek_release_desc_batch - load, sees empty (reordered before the store is globally visible)
>   3. Kernel: peek finds nothing, returns 0
>   4. Userspace: stores entry + producer
>   5. Userspace: loads flags - doesn't see NEED_WAKEUP yet (still in kernel's store buffer)
>   6. Userspace: skips send()
>   7. Kernel: NEED_WAKEUP store finally becomes visible - too late
> 
> Seems legit?

Ah, that seems right. The race is more complex than I thought, and it
looks like a common problem with XSK wakeup that should exist in all
the drivers.

So I think we can remove the check here, and I'll look into whether
this problem can be solved completely in a follow-up. WDYT?
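
For reference, the generic way to close this kind of store-buffering
window is a full fence between "publish my state" and "check the other
side's state", on both sides. Below is a minimal userspace sketch with
C11 atomics. All names are hypothetical, and it is not a claim about how
the XSK rings or virtio-net are, or should be, implemented:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state; static storage is zero-initialized. */
static atomic_uint tx_producer;  /* entries published by the user side */
static unsigned int tx_consumer; /* touched by the kernel side only */
static atomic_bool need_wakeup;

/* Kernel side: returns true if it may go idle and rely on a wakeup. */
static bool kernel_try_idle(void)
{
	atomic_store_explicit(&need_wakeup, true, memory_order_relaxed);
	/* Order the flag store before the ring re-check below. */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&tx_producer, memory_order_relaxed) !=
	    tx_consumer) {
		/* Work showed up in the meantime: stay active. */
		atomic_store_explicit(&need_wakeup, false,
				      memory_order_relaxed);
		return false;
	}
	return true;
}

/* User side: returns true if the caller must issue the wakeup syscall. */
static bool user_submit(void)
{
	atomic_fetch_add_explicit(&tx_producer, 1, memory_order_relaxed);
	/* Order the producer update before the flag load below. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&need_wakeup, memory_order_relaxed);
}
```

With both fences in place, at least one side observes the other's store:
either the kernel side sees the new entry and stays active, or the user
side sees the flag and issues the wakeup.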

> 
> 
> 
> > > 
> > > >  	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
> > > >  
> > > > +	if (need_wakeup) {
> > > > +		if (vring_size == sq->vq->num_free)
> > > > +			/* we can't wake up by ourself, and it should be done
> > > > +			 * by the user.
> > > > +			 */
> > > > +			xsk_set_tx_need_wakeup(pool);
> > > > +		else
> > > > +			/* we can wake up from skb_xmit_done() */
> > > > +			xsk_clear_tx_need_wakeup(pool);
> > > 
> > > But what if we don't have get tx napi so no wakeup in skb_xmit_done?
> > 
> > Sorry that I'm not sure what "get tx napi" means here ;(
> > 
> > There are entries in sq->vq, so skb_xmit_done() will be called after
> > the entries in the ring are consumed by the host, right?
> > Then, the corresponding sq->napi will be scheduled, as we ensure
> > that tx napi is always enabled, which means napi->weight is not
> > zero, in this commit:
> > 1df5116a41a8 ("virtio_net: xsk: prevent disable tx napi")
> 
> Oh I forgot we did that. But can xsk bind when tx napi has already
> been disabled previously?

From what I have observed, it can. I think that is a separate issue, and
I was planning to fix it later in a separate patch.

It is a problem, right?

There are two ways to fix it:
1. don't allow the binding if tx napi is not enabled, or
2. set tx_napi->weight to 1 when binding, and restore it to 0 when
   unbinding.

Should I fix it in this series?

Thanks!
Menglong Dong

> 
> 
> > Right?
> > 
> > Thanks!
> > Menglong Dong
> > 
> > > 
> > > 
> > > > +	}
> > > > +
> > > >  	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
> > > >  		check_sq_full_and_disable(vi, vi->dev, sq);
> > > >  
> > > > @@ -1470,9 +1488,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> > > >  	u64_stats_add(&sq->stats.xdp_tx,  sent);
> > > >  	u64_stats_update_end(&sq->stats.syncp);
> > > >  
> > > > -	if (xsk_uses_need_wakeup(pool))
> > > > -		xsk_set_tx_need_wakeup(pool);
> > > > -
> > > >  	return sent;
> > > >  }
> > > >  
> > > > -- 
> > > > 2.54.0
> > > 
> > > 
> > > 
> > 
> > 
> > 
> 
> 





^ permalink raw reply

* Re: [PATCH net] tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
From: Xin Long @ 2026-06-23  1:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, syzbot+e14bc5d4942756023b77, Jon Maloy
In-Reply-To: <20260622171048.1626022-1-edumazet@google.com>

On Mon, Jun 22, 2026 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote:
>
> TIPC UDP media bearer teardown calls dst_cache_destroy() on its
> replicast caches before calling synchronize_net() to wait for
> concurrent RCU readers (transmitters) to finish:
>
> static void cleanup_bearer(struct work_struct *work)
> {
> ...
>         list_for_each_entry_safe(rcast, tmp, &ub->rcast.list, list) {
>                 dst_cache_destroy(&rcast->dst_cache);
>                 list_del_rcu(&rcast->list);
>                 kfree_rcu(rcast, rcu);
>         }
> ...
>         dst_cache_destroy(&ub->rcast.dst_cache);
>         udp_tunnel_sock_release(ub->sk);
>         synchronize_net();
> ...
> }
>
> This is highly buggy because dst_cache_destroy() immediately frees the
> per-CPU cache memory (free_percpu()) and releases the cached dst
> entries without any synchronization.
>
> If a concurrent transmitter (e.g., tipc_udp_xmit()) is running on another
> CPU under RCU protection, it can call dst_cache_get() concurrently,
> leading to:
> 1. Use-After-Free on the per-CPU cache pointer itself (crash).
> 2. "rcuref - imbalanced put()" warning if it attempts to release a
>    dst that was concurrently released by dst_cache_destroy().
>
> Furthermore, calling kfree(ub) immediately after synchronize_net() without
> closing the socket first (or waiting after closing it) leaves a window
> where a concurrent receiver (tipc_udp_recv()) could start after
> synchronize_net(), access ub, and suffer a UAF when kfree(ub) runs.
>
> To fix this, we must defer dst_cache_destroy() and kfree(ub) until after
> we have ensured that no more readers can see the bearer/socket and all
> existing readers have finished:
>
> 1. Move the rcast entries from the public list to a private list
>    and delete them using list_del_rcu() (stops new transmit readers).
> 2. Release the bearer socket using udp_tunnel_sock_release() (stops
>    new receive readers).
> 3. Call synchronize_net() to wait for all outstanding RCU readers
>    (both transmit and receive) to finish.
> 4. Now that it is safe, call dst_cache_destroy() on all isolated
>    rcast entries and the main bearer cache, and free the memory.
>
> Fixes: e9c1a793210f ("tipc: add dst_cache support for udp media")
> Reported-by: syzbot+e14bc5d4942756023b77@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/6a396a66.52ae72c2.136ac7.0003.GAE@google.com/T/#u
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Xin Long <lucien.xin@gmail.com>
> Cc: Jon Maloy <jon.maloy@ericsson.com>
> ---
>  net/tipc/udp_media.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
> index 988b8a7f953ad6da860e6190f1f244650f121dce..befaf7137caf642462b7203a2429a60386e64db8 100644
> --- a/net/tipc/udp_media.c
> +++ b/net/tipc/udp_media.c
> @@ -808,21 +808,26 @@ static void cleanup_bearer(struct work_struct *work)
>  {
>         struct udp_bearer *ub = container_of(work, struct udp_bearer, work);
>         struct udp_replicast *rcast, *tmp;
> +       LIST_HEAD(private_list);
>         struct tipc_net *tn;
>
>         list_for_each_entry_safe(rcast, tmp, &ub->rcast.list, list) {
> -               dst_cache_destroy(&rcast->dst_cache);
>                 list_del_rcu(&rcast->list);
> -               kfree_rcu(rcast, rcu);
> +               list_add(&rcast->list, &private_list);
>         }

Could this corrupt the list for concurrent RCU readers?
When list_del_rcu() is called, it intentionally leaves the next pointer
intact so concurrent readers can continue their traversal. However, the
immediate call to list_add() overwrites both the next and prev pointers
to link the entry into private_list.
If a concurrent reader is currently positioned at rcast, won't it follow
the newly clobbered next pointer and jump from the original RCU list
directly into private_list?
Because private_list is allocated on the local stack, the reader might
interpret stack memory as a struct udp_replicast. Furthermore, the reader
would miss its loop termination condition because it expects to reach the
original list head, potentially resulting in an infinite loop or a crash.
[ ... ]
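
The pointer clobbering described above can be reproduced in a few lines of
userspace C. This is a simplified sketch: struct list_head and list_add()
mimic the kernel's layout, list_del_rcu_like() is a hypothetical stand-in
for list_del_rcu() (it leaves ->next intact), and there is no real
concurrency, only a "reader" pointer parked on the entry being moved.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert item right after head, overwriting item's next and prev. */
static void list_add(struct list_head *item, struct list_head *head)
{
	struct list_head *next = head->next;

	next->prev = item;
	item->next = next;
	item->prev = head;
	head->next = item;
}

/* Unlink entry but keep entry->next so a reader can keep walking. */
static void list_del_rcu_like(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->prev = NULL; /* the kernel stores a poison value here */
}

/* Returns 1 if a reader parked on the moved entry leaves the public list. */
static int reader_escapes_after_move(void)
{
	struct list_head pub, priv, a, b;
	struct list_head *reader;
	int safe_after_del, escaped;

	init_list_head(&pub);
	init_list_head(&priv);
	list_add(&b, &pub);
	list_add(&a, &pub); /* pub -> a -> b -> pub */

	reader = &a; /* a reader is currently positioned at a */
	list_del_rcu_like(&a);
	safe_after_del = (reader->next == &b); /* traversal still works */

	list_add(&a, &priv); /* clobbers a.next */
	escaped = (reader->next == &priv); /* now points at the stack head */

	assert(safe_after_del);
	return safe_after_del && escaped;
}
```

After the list_add(), the reader's next step lands on the on-stack
private list head instead of the next public entry, and it never reaches
&pub to terminate its loop.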

This looks legit.

Thanks.

^ permalink raw reply

