Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: stmmac: loongson1: Use dev_err_probe()
From: Jakub Kicinski @ 2026-06-17 22:07 UTC (permalink / raw)
  To: Jacob Keller
  Cc: keguang.zhang, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Maxime Coquelin, Alexandre Torgue, linux-mips,
	netdev, linux-stm32, linux-arm-kernel, linux-kernel
In-Reply-To: <6b8db599-5bb2-47f9-ab53-a0b5141af2e5@intel.com>

On Wed, 17 Jun 2026 14:26:25 -0700 Jacob Keller wrote:
> It does claim that it has benefit since you get the error code emitted
> symbolically. But we have %pe for that. I wonder if dev_err_probe
> predates %pe?

I'd argue

  No of match data provided: -EINVAL

is more confusing than just:

  No of match data provided

the EINVAL is meaningless and hardcoded in this case?

^ permalink raw reply

* Re: [PATCH net-next] r8169: migrate Rx path to page_pool
From: Francois Romieu @ 2026-06-17 22:25 UTC (permalink / raw)
  To: Atharva Potdar
  Cc: hkallweit1, nic_swsd, andrew+netdev, davem, edumazet, kuba,
	pabeni, netdev
In-Reply-To: <CAF9AHva0TSFz5tedMEgJTkhThzDGqmW7MJshAtf3ULbLY4wd=w@mail.gmail.com>

Atharva Potdar <atharvapotdar07@gmail.com> :
[...]
> Francois:
> > You may consider fdd7b4c3302c93f6833e338903ea77245eb510b4 and some related
> > changes around that time.
> 
> I am sorry but I don't fully understand the context of this commit or
> the behaviour it addresses. Could you please help me regarding what I
> need to watch out for this change?

It should be clearer with c0cd884af045338476b8e69a61fceb3f34ff22f1 (and
may be 6f0333b8fde44b8c04a53b2461504f0e8f1cebe6 as well).

Old chipsets did not correctly implement receive buffer size limit.

I have no idea how mildly recent ones (< 15 years) behave.

Reinstatement of the bug on old chipset imho deserves some warning.

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH v13 5/6] tls: add hardware offload key update support
From: Rishikesh Jethwani @ 2026-06-17 22:28 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: netdev, saeedm, tariqt, mbloch, borisp, john.fastabend, kuba,
	davem, pabeni, edumazet, leon
In-Reply-To: <ag8PQ2pxvKMHglWV@krikkit>

On Thu, May 21, 2026 at 6:57 AM Sabrina Dubroca <sd@queasysnail.net> wrote:
>
> 2026-05-12, 10:55:35 -0700, Rishikesh Jethwani wrote:
> > > Not blaming you for NIC behavior, but... the NIC passes up as
> > > "decrypted" records that have failed decryption (because it was using
> > > the wrong (old) key), or passes as "encrypted" the incorrectly
> > > decrypted data (that it has "decrypted" with the old key)?
> > >
> > > Or this is only the first record(s) after the KeyUpdate message, if
> > > they fall within the same packet, the whole packet was "decrypted"
> > > with the old key but only the KeyUpdate itself (and maybe some more
> > > records before it) decrypted correctly ; but subsequent packets get
> > > passed as !decrypted and don't need this reencrypt dance?
> > >
> > > (this is maybe more of a question for Tariq or the other @nvidia
> > > folks)
> > >
> > >
> > > I haven't reviewed the whole patch at this point, because of Paolo's
> > > suggestion and this confusion with the RX rekey.
> > >
> > > The traces show how both NICs behave during the key transition:
> >
> >   Broadcom (NIC preserves decrypted flags):
> >   - decrypted=1: NIC fully decrypted these with the old key; one reencrypt
> > pass (retry=0) re-encrypts those frags back to ciphertext, then SW decrypts
> > with the new key.
> >   - encrypted=0, decrypted=0: boundary-straddling record; same single
> > reencrypt pass.
> >   - encrypted=1: NIC never touched these; SW decrypts directly with the new
> > key.
> >
> >   Mellanox (NIC clears decrypted flags on auth failure):
> >   - encrypted=0, decrypted=0: NIC partially processed the record but
> > cleared all decrypted flags on auth failure. retry=0 reencrypts with the
> > wrong frag interpretation and gets EBADMSG; retry=1 toggles the flags and
> > succeeds.
> >   - encrypted=1: NIC reported these as untouched; SW decrypts directly.
> >
> >   The retry path exists specifically for the Mellanox case: cleared flags
> > hide which frags the NIC actually touched, so the first pass may pick the
> > wrong interpretation and need to flip it.
>
> Thanks for describing this.
>
> I don't think the core TLS implementation should implement workarounds
> for HW/drivers doing strange things. IMO this should instead be
> stuffed into a driver CB, which would implement the necessary steps to
> perform during a rekey.
>
> The core can provide a helper, but I would move the whole
> "reencrypt+maybe retry" logic to a CB.

The driver callback approach sounds good.

^ permalink raw reply

* Re: [PATCH v14 6/9] tls: device: add TX KeyUpdate support
From: Rishikesh Jethwani @ 2026-06-17 22:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, saeedm, tariqt, mbloch, borisp, john.fastabend, sd, davem,
	pabeni, edumazet, leon
In-Reply-To: <20260525211608.2508449-1-kuba@kernel.org>

On Mon, May 25, 2026 at 2:16 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> tls: device: add TX KeyUpdate support
>
> This adds TX-side TLS 1.3 KeyUpdate support for HW-offloaded sockets by
> installing a temporary SW context with the new key, deferring the HW
> reinstall until all old-key records have been ACKed, and falling back to
> permanent SW mode if the HW reinstall fails.
>
> Thanks for the patch.  A few questions inline.
>
> > diff --git a/include/net/tls.h b/include/net/tls.h
> > index 2512a3799b21..c1085873ee01 100644
> > --- a/include/net/tls.h
> > +++ b/include/net/tls.h
>
> [ ... ]
>
> > diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> > index 7a98d2f6cbd3..c435b3450872 100644
> > --- a/net/tls/tls_device.c
> > +++ b/net/tls/tls_device.c
>
> [ ... ]
>
> > @@ -222,6 +246,19 @@ static void tls_tcp_clean_acked(struct sock *sk, u32 acked_seq)
> >       }
> >
> >       ctx->unacked_record_sn += deleted_records;
> > +
> > +     /* Once all old-key HW records are ACKed, set REKEY_READY to
> > +      * let sendmsg know it can finish the rekey and switch back
> > +      * to HW offload.
> > +      */
> > +     if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags) &&
> > +         !test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags)) {
> > +             u32 boundary_seq = READ_ONCE(tls_ctx->rekey.boundary_seq);
> > +
> > +             if (!before(acked_seq, boundary_seq))
> > +                     set_bit(TLS_TX_REKEY_READY, &tls_ctx->flags);
> > +     }
> > +
> >       spin_unlock_irqrestore(&ctx->lock, flags);
> >  }
>
> [Medium]
> Is the ordering between the boundary_seq publish and the PENDING bit
> publish in tls_device_start_rekey() strong enough for this read site?
>
> In tls_device_start_rekey() the sequence is:
>
>         WRITE_ONCE(ctx->rekey.boundary_seq, tcp_sk(sk)->write_seq);
>         ...
>         set_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
>
> set_bit() does not document any implicit memory barrier
> (Documentation/atomic_bitops.txt).  On weakly-ordered architectures, can
> this reader observe PENDING=1 with a stale boundary_seq, set REKEY_READY
> prematurely, and let tls_device_complete_rekey() run before the old-key
> records are actually ACKed?
>
> Would an smp_wmb() (or smp_mb__before_atomic()) between the write of
> boundary_seq and the set_bit, or moving both writes under
> offload_ctx->lock that this reader already holds, be appropriate?

I'll move both writes under offload_ctx->lock

> > @@ -253,6 +290,14 @@ void tls_device_free_resources_tx(struct sock *sk)
> >       struct tls_context *tls_ctx = tls_get_ctx(sk);
> >
> >       tls_free_partial_record(sk, tls_ctx);
> > +
> > +     if (unlikely(tls_ctx->rekey.sw_ctx))
> > +             tls_sw_release_resources_tx(sk);
> > +
> > +     if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags)) {
> > +             TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYOK);
> > +             TLS_DEC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> > +     }
> >  }
>
> [Medium]
> Should a rekey that was still PENDING at socket teardown count as
> TlsTxRekeyOk?  The rekey never completed, but the counter is bumped as
> if it had succeeded.  Should this be a separate "aborted" counter, or
> just omit the OK increment in this path?

I'll add a separate "aborted" counter.

> > @@ -624,6 +672,19 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> >                       goto out;
> >       }
> >
> > +     /* Old-key records all ACKed; switch back to HW. */
> > +     if (test_bit(TLS_TX_REKEY_READY, &tls_ctx->flags))
> > +             tls_device_complete_rekey(sk, tls_ctx, true);
>
> [Medium]
> Should the return value of tls_device_complete_rekey() be checked here?
>
> tls_device_complete_rekey() can return non-zero from its early
> tls_sw_drain_tx() path:
>
>         rc = tls_sw_drain_tx(sk, ctx);
>         if (rc)
>                 return rc;
>
> In that early-return case none of REKEY_PENDING, REKEY_READY, or
> REKEY_FAILED is updated.  The caller falls through to
> tls_sw_sendmsg_locked(), and on the next sendmsg the same READY bit is
> still set, so complete_rekey is retried, drain returns -EAGAIN again,
> and so on.  Is there a way out of this state?  Should persistent drain
> failure transition to FAILED, or at least bump a counter so it is
> observable?

Looking at the actual failure modes, I don't think persistent failure
or a FAILED transition is the right fix here, let me lay out why, and
what I'm proposing instead.

tls_sw_drain_tx() returns non-zero in three cases:
1. tls_tx_records() returns < 0
2. tls_is_partially_sent_record() still true after the push attempt
3. tls_is_pending_open_record() still true after the push attempt

The key thing is that tls_tx_records() (tls_sw.c:460-462) already
collapses every non-EAGAIN failure into tls_err_abort(sk, rc), which
sets sk->sk_err. From that point the socket is poisoned: the next
sendmsg returns the saved sk_err to userspace via the sk->sk_err check
at the top of tls_push_data() (and the SW equivalent), and the loop
exits naturally, userspace stops sending.

That means the failure modes that actually retry on the next sendmsg
are exclusively -EAGAIN-class transients:

- TCP send buffer full → clears as ACKs free space; tls_write_space()
flushes the partial.
- Open record not fully pushed → flushes on the next sendmsg once
buffer space is available.

Transitioning to REKEY_FAILED on -EAGAIN would permanently downgrade a
connection that's just waiting one RTT for TCP buffer space, which I
don't think is what we want, once FAILED, we never retry the HW
transition.

For observability without committing a new SNMP counter, I'll add a
tracepoint at the sendmsg call site so a stuck loop is visible via
tracefs

> > +
> > +     /* Use SW path if rekey is in progress (PENDING) or if HW rekey
> > +      * failed (FAILED).
> > +      */
> > +     if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags) ||
> > +         test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags)) {
> > +             rc = tls_sw_sendmsg_locked(sk, msg, size);
> > +             goto out;
> > +     }
> > +
> >       rc = tls_push_data(sk, &msg->msg_iter, size, msg->msg_flags,
> >                          record_type);
>
> [High]
> What happens to a HW open_record across this transition?
>
> If userspace previously sent with MSG_MORE, offload_ctx->open_record
> holds buffered plaintext and tls_ctx->pending_open_record_frags is true.
> A subsequent setsockopt KeyUpdate then routes future sendmsg calls into
> tls_sw_sendmsg_locked(), which builds into sw_ctx->open_rec (initially
> NULL) and never touches the HW open_record.  At socket close,
> tls_device_sk_destruct() does:
>
>         if (ctx->open_record)
>                 destroy_record(ctx->open_record);
>
> so the buffered plaintext is freed without ever being transmitted.
>
> Also, since pending_open_record_frags stays true, would
> tls_sw_drain_tx() see tls_is_pending_open_record(ctx) as true and return
> -EAGAIN, blocking tls_device_complete_rekey() until something on the SW
> side coincidentally clears it?
>
> The tcp_write_collapse_fence() in tls_device_start_rekey() only handles
> records already in the TCP write queue; it does not appear to flush this
> HW open_record.

Right, flushing the HW open_record is required.

> > @@ -1103,6 +1164,260 @@ static struct tls_offload_context_tx *alloc_offload_ctx_tx(struct tls_context *c
> [ ... ]
> > +static int tls_device_start_rekey(struct sock *sk,
> > +                               struct tls_context *ctx,
> > +                               struct tls_offload_context_tx *offload_ctx,
> > +                               struct tls_crypto_info *new_crypto_info)
> > +{
> > +     bool rekey_pending = test_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> > +     bool rekey_failed = test_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
> [ ... ]
> > +     if (rekey_pending || rekey_failed) {
> > +             rc = crypto_aead_setkey(offload_ctx->rekey.sw.aead_send,
> > +                                     key, cipher_desc->key);
> > +             if (rc)
> > +                     return rc;
>
> [High]
> Can there be in-flight async crypto requests on this aead_send when
> crypto_aead_setkey() is called here?
>
> While a rekey is PENDING, sendmsg has been encrypting via
> offload_ctx->rekey.sw.aead_send through tls_sw_sendmsg_locked().  Async
> backends can return -EINPROGRESS and complete later.  Other call sites
> that mutate AEAD state (tls_sw_release_resources_tx, tls_sw_drain_tx,
> tls_set_sw_offload) call tls_encrypt_async_wait() first.  Should the
> same wait happen here before re-keying the same tfm, so in-flight
> encryptions cannot pick up K2 instead of the K1 they were submitted
> with?
>
> [ ... ]
> > +     } else {
> > +             rc = tls_device_init_rekey_sw(sk, ctx, offload_ctx,
> > +                                           new_crypto_info);
> [ ... ]
> > +             WRITE_ONCE(ctx->rekey.boundary_seq, tcp_sk(sk)->write_seq);
> > +
> > +             /* Prevent a partial record straddling the SW/HW boundary. */
> > +             tcp_write_collapse_fence(sk);
> > +
> > +             ctx->rekey.sw_ctx = &offload_ctx->rekey.sw;
> > +             ctx->rekey.cipher_ctx = &offload_ctx->rekey.tx;
> > +
> > +             set_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> [ ... ]
> > +     unsafe_memcpy(&offload_ctx->rekey.crypto_send.info, new_crypto_info,
> > +                   cipher_desc->crypto_info,
> > +                   /* checked in do_tls_setsockopt_conf */);
> > +     memzero_explicit(new_crypto_info, cipher_desc->crypto_info);
> > +
> > +     return 0;
> > +}

Yes, I'll add a wait for in-flight async encryptions submitted to this
tfm with the previous key before changing it.

> [Medium]
> new_crypto_info is zeroized after the copy, but is the staging area
> itself ever scrubbed?
>
> offload_ctx->rekey.crypto_send.info, offload_ctx->rekey.tx.iv, and
> offload_ctx->rekey.tx.rec_seq receive the same key bytes via
> unsafe_memcpy/memcpy.  tls_device_complete_rekey() does not scrub them
> on success, and tls_device_free_resources_tx() / tls_device_tx_del_task()
> kfree the offload_ctx without a memzero_explicit().  Should those sites
> zero the staging area to limit the window in which key bytes are
> recoverable from slab memory after rotation?

Yes, I'll scrub the rekey staging area.

> > +static int tls_device_complete_rekey(struct sock *sk, struct tls_context *ctx,
> > +                                  bool deferred)
> > +{
> [ ... ]
> > +     if (!test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
> > +             netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> > +                                             TLS_OFFLOAD_CTX_DIR_TX);
> > +             set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
> > +     }
> > +
> > +     memcpy(crypto_info_rec_seq(&offload_ctx->rekey.crypto_send.info, cipher_desc),
> > +            offload_ctx->rekey.tx.rec_seq, cipher_desc->rec_seq);
> > +
> > +     rc = tls_device_dev_add_tx(sk, netdev, &offload_ctx->rekey.crypto_send.info,
> > +                                tcp_sk(sk)->write_seq);
> > +
> > +release_lock:
> > +     up_read(&device_offload_lock);
> > +
> > +     spin_lock_irqsave(&offload_ctx->lock, flags);
> > +     memcpy(&rcd_sn, offload_ctx->rekey.tx.rec_seq, sizeof(rcd_sn));
> > +     offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
> > +     spin_unlock_irqrestore(&offload_ctx->lock, flags);
> > +
> > +     memcpy(ctx->tx.iv, offload_ctx->rekey.tx.iv,
> > +            cipher_desc->salt + cipher_desc->iv);
> > +     memcpy(ctx->tx.rec_seq, offload_ctx->rekey.tx.rec_seq,
> > +            cipher_desc->rec_seq);
> > +     unsafe_memcpy(&ctx->crypto_send.info,
> > +                   &offload_ctx->rekey.crypto_send.info,
> > +                   cipher_desc->crypto_info,
> > +                   /* checked during rekey setup */);
> > +
> > +     if (rc)
> > +             goto rekey_fail;
> > +
> > +     clear_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
> > +
> > +     key = crypto_info_key(&offload_ctx->rekey.crypto_send.info, cipher_desc);
> > +     rc = crypto_aead_setkey(offload_ctx->aead_send, key, cipher_desc->key);
> > +     if (rc)
> > +             goto rekey_fail;
>
> [High]
> Can these unconditional copies of the new key material into ctx->tx.iv,
> ctx->tx.rec_seq, and ctx->crypto_send.info before the rc check leave
> inconsistent state on failure?
>
> If tls_device_dev_add_tx() failed, control reaches rekey_fail with:
>
>   - HW context removed (TLS_TX_DEV_CLOSED set)
>   - ctx->crypto_send.info advertising the NEW key
>   - ctx->tx.iv and ctx->tx.rec_seq holding the NEW salt/IV/rec_seq
>   - offload_ctx->aead_send still holding the OLD key (the setkey for it
>     is reached only after the rc check)
>
> Userspace getsockopt(SOL_TLS, TLS_TX) would then read NEW key bytes that
> no kernel AEAD ever absorbed.  Any subsequent fallback that builds IVs
> from ctx->tx and encrypts via offload_ctx->aead_send would mix NEW salt
> with the OLD key.
>
> Could the memcpy block move below the rc check?
>

I'll move the memcpys below the rc check.

> [High]
> On the success path, is there a window where tls_device_down() sees
> stale TLS_TX_DEV_CLOSED?
>
> The success path here:
>
>         if (!test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
>                 netdev->tlsdev_ops->tls_dev_del(...);
>                 set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
>         }
>         ...
>         rc = tls_device_dev_add_tx(...);   /* installs NEW HW ctx */
>
> release_lock:
>         up_read(&device_offload_lock);
>         ...
>         if (rc) goto rekey_fail;
>         clear_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
>
> Between dropping device_offload_lock and clear_bit(), can
> tls_device_down() take down_write(&device_offload_lock) and run:
>
>         if (ctx->tx_conf == TLS_HW &&
>             !test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
>                 netdev->tlsdev_ops->tls_dev_del(...);
>                 set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
>         }
>
> It sees TLS_TX_DEV_CLOSED still set, skips tls_dev_del() for the
> freshly-installed NEW HW context, and proceeds to clear ctx->netdev.
> Would the new NIC-side TLS state then be leaked until the netdev is
> unregistered?
>
> Should TLS_TX_DEV_CLOSED be cleared while still holding
> device_offload_lock?

I will fix this TLS_TX_DEV_CLOSED issue.

>
> > +
> > +     /* Start marker: the NIC passes through everything before
> > +      * write_seq unencrypted (already SW-encrypted during rekey),
> > +      * same as during initial offload setup.
> > +      */
> > +     tls_device_commit_start_marker(sk, offload_ctx, start_marker_record);
> [ ... ]
> > +rekey_fail:
> > +     kfree(start_marker_record);
> > +     set_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
> > +     clear_bit(TLS_TX_REKEY_READY, &ctx->flags);
> > +     clear_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> > +     if (deferred)
> > +             TLS_DEC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> > +     TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYFALLBACK);
> > +
> > +     return 0;
> > +}
>
> [Medium]
> After REKEY_FAILED, the connection is structurally encrypting in SW even
> though ctx->tx_conf stays TLS_HW.  TLSCURRTXDEVICE is never decremented
> and TLSCURRTXSW is never incremented in this path.  Should the
> current-mode gauges in /proc/net/tls_stat reflect the actual encryption
> mode for these connections?

I'll make suggested change

> > +static int tls_set_device_offload_rekey(struct sock *sk,
> > +                                     struct tls_context *ctx,
> > +                                     struct net_device *netdev,
> > +                                     struct tls_crypto_info *new_crypto_info)
> > +{
> > +     struct tls_offload_context_tx *offload_ctx = tls_offload_ctx_tx(ctx);
> > +     bool rekey_pending = test_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> > +     bool rekey_failed = test_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
> > +     bool defer = true;
> > +     int rc;
> > +
> > +     if (!rekey_pending && !rekey_failed)
> > +             defer = tls_has_unacked_records(offload_ctx);
> > +
> > +     down_read(&device_offload_lock);
> > +
> > +     rc = tls_device_start_rekey(sk, ctx, offload_ctx, new_crypto_info);
> > +     if (rc) {
> > +             up_read(&device_offload_lock);
> > +             return rc;
> > +     }
> > +
> > +     up_read(&device_offload_lock);
> > +
> > +     if (defer) {
> > +             if (!rekey_pending)
> > +                     TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> > +             else
> > +                     TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYOK);
> > +             return 0;
> > +     }
> > +
> > +     return tls_device_complete_rekey(sk, ctx, false);
> > +}
>
> [Medium]
> On the rekey_pending path here, TLSTXREKEYOK is incremented before any
> HW reinstall actually happens.  Combined with the increment in
> tls_device_complete_rekey() on success and the increment in
> tls_device_free_resources_tx() when PENDING is set at close, can a
> single logical successful rekey produce more than one TLSTXREKEYOK
> increment?

The accounting is per-setsockopt, not per-HW-install, same convention
as the SW path (tls_main.c:743 also bumps TlsTxRekeyOk once per
successful setsockopt). When two
setsockopt calls overlap a single HW transition, each call gets one
final outcome. The early OK++ in the rekey_pending branch
credits the previous in-flight setsockopt (which is implicitly
"completed" from userspace's view, since its key is now being
replaced). The trailing complete_rekey
OK++ credits the current one. No setsockopt produces zero or two
final-outcome bumps, each produces exactly one.

> [Medium]
> tls_has_unacked_records() inspects offload_ctx->records_list only.  Is
> that sufficient to decide whether to defer?  pending_open_record_frags
> or offload_ctx->open_record can still hold pre-rekey state, in which
> case the !defer path falls into tls_device_complete_rekey(deferred=false),
> tls_sw_drain_tx() returns -EAGAIN, setsockopt returns the error to
> userspace via err_crypto_info, and none of the state transitions from
> tls_device_start_rekey() (PENDING bit, sw_ctx, cipher_ctx, boundary_seq,
> sk_validate_xmit_skb) are rolled back.  Should the failure path either
> roll those back or transition to a defined recovery state?

I'll check tls_is_pending_open_record also to decide whether to defer

> [ ... ]
>
> > diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
> > index 1110f7ac6bcb..64ac4ef4012b 100644
> > --- a/net/tls/tls_device_fallback.c
> > +++ b/net/tls/tls_device_fallback.c
> > @@ -435,6 +435,30 @@ struct sk_buff *tls_validate_xmit_skb_sw(struct sock *sk,
> >       return tls_sw_fallback(sk, skb);
> >  }
> >
> > +struct sk_buff *tls_validate_xmit_skb_rekey(struct sock *sk,
> > +                                         struct net_device *dev,
> > +                                         struct sk_buff *skb)
> > +{
> > +     struct tls_context *tls_ctx = tls_get_ctx(sk);
> > +     u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
> > +     u32 boundary_seq;
> > +
> > +     if (test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags))
> > +             return skb;
> > +
> > +     /* If this packet is at or after the rekey boundary, it's already
> > +      * SW-encrypted with the new key, pass through unchanged
> > +      */
> > +     boundary_seq = READ_ONCE(tls_ctx->rekey.boundary_seq);
> > +     if (!before(tcp_seq, boundary_seq))
> > +             return skb;
> > +
> > +     /* Packet before boundary means retransmit of old data,
> > +      * use SW fallback with the old key
> > +      */
> > +     return tls_sw_fallback(sk, skb);
> > +}
>
> [Critical]
> Can this leak plaintext on the wire after REKEY_FAILED?
>
> The first branch returns the skb unchanged whenever
> TLS_TX_REKEY_FAILED is set, regardless of where tcp_seq is relative to
> boundary_seq.
>
> After tls_device_complete_rekey() has called tls_dev_del() and then
> landed in rekey_fail, there is no NIC TLS context for the socket.
> SKBs from before the boundary are still sitting in the socket write
> queue with plaintext payload, because HW-offload semantics expected the
> NIC to encrypt them at TX time.  When TCP retransmits one of those old
> skbs:
>
>         tcp_seq < boundary_seq, but TLS_TX_REKEY_FAILED is set
>         -> early return skb unchanged
>         -> NIC has no TLS context for this flow
>         -> driver transmits the skb as-is
>
> Should the FAILED branch still route pre-boundary packets through
> tls_sw_fallback() with the old key, the same way the non-failed branch
> does, so that pre-boundary retransmits are encrypted?
>
> > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> > index fd04857fa0ab..2548ad2b2219 100644
> > --- a/net/tls/tls_main.c
> > +++ b/net/tls/tls_main.c
> > @@ -371,6 +371,8 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
> >
> >       if (ctx->tx_conf == TLS_SW)
> >               tls_sw_cancel_work_tx(ctx);
> > +     else if (ctx->tx_conf == TLS_HW && ctx->rekey.sw_ctx)
> > +             tls_sw_cancel_work_tx(ctx);
> >
> >       lock_sock(sk);

boundary_seq captures tcp_sk(sk)->write_seq at start_rekey;
clean_acked_data only sets REKEY_READY when acked_seq >= boundary_seq;
complete_rekey is only invoked when READY is set i.e.
complete_rekey only runs when all pre-boundary data is already ACKed.
So when complete_rekey fails and sets REKEY_FAILED, the pre-boundary
plaintext SKBs have already been ACKed and removed from
sk_write_queue. The early return in tls_validate_xmit_skb_rekey is
safe because no pre-boundary data remains in the write queue for
retransmission.

> [High]
> Is reading ctx->rekey.sw_ctx before lock_sock(sk) safe?
>
> ctx->rekey.sw_ctx is published by tls_device_start_rekey() under
> lock_sock.  Here close reads it before taking lock_sock, so a concurrent
> setsockopt rekey can publish sw_ctx after this read but before close
> acquires the socket lock.  Close then proceeds without canceling the
> delayed tx_work, while later cleanup
> (tls_device_free_resources_tx -> tls_sw_release_resources_tx) frees
> aead_send.
>
> If a tx_work_handler scheduled by the rekey path runs after that, can it
> touch the freed delayed_work or aead_send?
>
> Should the read be moved under lock_sock, or should
> tls_sw_release_resources_tx() internally cancel the delayed work?
>

Why setsockopt and close can't be concurrent here:

  tls_sk_proto_close is proto->close. It is reached only via
inet_release → __sock_release → run from __fput when the file's last
reference drops. Any syscall that could mutate ctx->rekey.sw_ctx
(setsockopt, sendmsg) holds a file reference for its entire duration. So while
one thread is inside do_tls_setsockopt_conf publishing sw_ctx, the
file refcount cannot reach 0, and tls_sk_proto_close cannot start. By
the time tls_sk_proto_close runs, no setsockopt/sendmsg is in flight
on this socket.

  The existing line proves the pattern is intended:

  The pre-existing check at tls_main.c:372 (if (ctx->tx_conf ==
TLS_SW) tls_sw_cancel_work_tx(ctx)) already reads ctx->tx_conf before
lock_sock. tx_conf is likewise only written under setsockopt. The new
ctx->rekey.sw_ctx read at line 374 relies on the same invariant, so
if you accept line 372 as safe, line 374 is too.

^ permalink raw reply

* Re: [PATCH bpf v2] bpf, sockmap: fix use-after-free when the stream parser resizes the skb
From: Bobby Eshleman @ 2026-06-17 22:36 UTC (permalink / raw)
  To: Sechang Lim
  Cc: John Fastabend, Jakub Sitnicki, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, bpf,
	linux-kernel
In-Reply-To: <20260612123553.2724240-1-rhkrqnwk98@gmail.com>

On Fri, Jun 12, 2026 at 12:35:51PM +0000, Sechang Lim wrote:
> sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
> to find the length of the next message. strparser assembles a message out
> of several received skbs by chaining them onto the head's frag_list and
> recording where to append the next one in strp->skb_nextp:
> 
> 	*strp->skb_nextp = skb;
> 	strp->skb_nextp = &skb->next;
> 
> and then calls the parser on the head:
> 
> 	len = (*strp->cb.parse_msg)(strp, head);
> 
> The parser is only meant to inspect the skb, but the program may call
> bpf_skb_change_tail() -- or the sibling bpf_skb_pull_data(),
> bpf_skb_change_head(), bpf_skb_adjust_room(), all allowed for SK_SKB.
> Once the head carries a frag_list these go
> 
> 	... -> skb_ensure_writable -> pskb_may_pull -> __pskb_pull_tail
> 
> and __pskb_pull_tail() frees the frag_list skbs that strparser still
> tracks through skb_nextp:
> 
> 	while ((list = skb_shinfo(skb)->frag_list) != insp) {
> 		skb_shinfo(skb)->frag_list = list->next;
> 		consume_skb(list);
> 	}
> 
> strp->skb_nextp now points into a freed sk_buff. The next segment of
> the same message arrives in __strp_recv(), which links it with
> *strp->skb_nextp = skb, an 8-byte write into the freed skb. The free
> and the write happen in different __strp_recv() calls, so the message
> has to span at least three segments before it triggers.
> 
>   BUG: KASAN: slab-use-after-free in __strp_recv+0x447/0xda0
>   Write of size 8 at addr ffff88810db86140 by task repro/349
> 
>   Call Trace:
>    <IRQ>
>    __strp_recv+0x447/0xda0
>    __tcp_read_sock+0x13d/0x590
>    tcp_bpf_strp_read_sock+0x195/0x320
>    strp_data_ready+0x267/0x340
>    sk_psock_strp_data_ready+0x1ce/0x350
>    tcp_data_queue+0x1364/0x2fd0
>    tcp_rcv_established+0xe07/0x1640
>    [...]
> 
>   Allocated by task 349:
>    skb_clone+0x17b/0x210
>    __strp_recv+0x2c3/0xda0
>    __tcp_read_sock+0x13d/0x590
>    [...]
> 
>   Freed by task 349:
>    kmem_cache_free+0x150/0x570
>    __pskb_pull_tail+0x57b/0xc20
>    skb_ensure_writable+0x236/0x260
>    __bpf_skb_change_tail+0x1d4/0x590
>    sk_skb_change_tail+0x2a/0x40
>    bpf_prog_1b285dcd6c41373e+0x27/0x30
>    bpf_prog_run_pin_on_cpu+0xf3/0x260
>    sk_psock_strp_parse+0x118/0x1e0
>    __strp_recv+0x4f6/0xda0
>    [...]
> 
> The same resize also leaves the head's length inconsistent with its
> frags, so a later __pskb_pull_tail() can instead hit the
> BUG_ON(skb_copy_bits(...)) in net/core/skbuff.c.
> 
> Run the parser on a private clone of the head when the message spans more
> than one skb and the program can modify the packet
> (prog->aux->changes_pkt_data), so a resizing helper can only touch the
> clone and strparser's head and skb_nextp stay valid. Single-skb messages
> have no frag_list and read-only parsers cannot resize, so both are still
> parsed in place. If the clone cannot be allocated, return 0 so the caller
> retries on the next read rather than failing the parser.
> 
> Fixes: 8a31db561566 ("bpf: add access to sock fields and pkt data from sk_skb programs")
> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
> ---
> v2:
>  - clone only when prog->aux->changes_pkt_data (Bobby Eshleman)
>  - return 0 on clone failure instead of -ENOMEM (Bobby Eshleman)
>  - free the clone with consume_skb() instead of kfree_skb()
>  - drop the unrelated guard(rcu)() change (Bobby Eshleman)
> 
> v1:
>  - https://lore.kernel.org/all/20260609112316.3685738-1-rhkrqnwk98@gmail.com/
> 
>  net/core/skmsg.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index e1850caf1a71..97e5bc5f38c3 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -1149,9 +1149,29 @@ static int sk_psock_strp_parse(struct strparser *strp, struct sk_buff *skb)
>  	rcu_read_lock();
>  	prog = READ_ONCE(psock->progs.stream_parser);
>  	if (likely(prog)) {
> -		skb->sk = psock->sk;
> -		ret = bpf_prog_run_pin_on_cpu(prog, skb);
> -		skb->sk = NULL;
> +		struct sk_buff *parse_skb = skb;
> +
> +		/*
> +		 * strparser chains the message skbs through skb->frag_list and
> +		 * keeps a pointer into that list in strp->skb_nextp.  The parser
> +		 * program may call bpf_skb_change_tail() and friends, which go
> +		 * through __pskb_pull_tail() and free the frag_list skbs that
> +		 * strparser still tracks.  Run the program on a clone when the head
> +		 * has a frag_list and the program can modify the packet, so it
> +		 * cannot drop frags strparser owns.
> +		 */
> +		if (skb_has_frag_list(skb) && prog->aux->changes_pkt_data) {
> +			parse_skb = skb_clone(skb, GFP_ATOMIC);
> +			if (!parse_skb) {
> +				rcu_read_unlock();
> +				return 0;
> +			}
> +		}
> +		parse_skb->sk = psock->sk;
> +		ret = bpf_prog_run_pin_on_cpu(prog, parse_skb);
> +		parse_skb->sk = NULL;
> +		if (parse_skb != skb)
> +			consume_skb(parse_skb);
>  	}
>  	rcu_read_unlock();
>  	return ret;
> -- 
> 2.43.0
> 

Hey Sechang,

I'm still on the fence about "return 0" vs ENOMEM. I hate to flip-flop
on you here, but now I'm not sure if it is worth the complication to
return 0 since we're really only buying a single timer interval in which
we need 1) suddenly more memory to alloc the clone, and 2) another data
ready event to cause the stream parsing to pick up again. If any one
doesn't happen, the end result is the same. Not sure its a good
trade-off for the complexity of basically tricking the caller with the
zero return. Maybe let's go back to ENOMEM?

BTW, based on the comm name "repro", it sounds like you have a decent
reproducer for this. I wonder if it is possible to add something to the
selftests to catch this?

Best,
Bobby

^ permalink raw reply

* Re: [PATCH] net: stmmac: loongson1: Use dev_err_probe()
From: Jacob Keller @ 2026-06-17 22:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: keguang.zhang, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Maxime Coquelin, Alexandre Torgue, linux-mips,
	netdev, linux-stm32, linux-arm-kernel, linux-kernel
In-Reply-To: <20260617150744.05756f0b@kernel.org>

On 6/17/2026 3:07 PM, Jakub Kicinski wrote:
> On Wed, 17 Jun 2026 14:26:25 -0700 Jacob Keller wrote:
>> It does claim that it has benefit since you get the error code emitted
>> symbolically. But we have %pe for that. I wonder if dev_err_probe
>> predates %pe?
> 
> I'd argue
> 
>   No of match data provided: -EINVAL
> 
> is more confusing than just:
> 
>   No of match data provided
> 
> the EINVAL is meaningless and hardcoded in this case?

Yea, it is a bit more pointless with a hard-coded error.

^ permalink raw reply

* Re: [PATCH v14 7/9] tls: device: add RX KeyUpdate support
From: Rishikesh Jethwani @ 2026-06-17 22:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, saeedm, tariqt, mbloch, borisp, john.fastabend, sd, davem,
	pabeni, edumazet, leon
In-Reply-To: <20260525211610.2508479-1-kuba@kernel.org>

On Mon, May 25, 2026 at 2:16 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> tls: device: add RX KeyUpdate support
>
> This adds RX-side TLS 1.3 KeyUpdate support to the device offload path,
> keeping the old AEAD/IV/rec_seq around to handle in-flight records that
> the NIC already decrypted with the retired key, and deferring tls_dev_add
> of the new key until the old key has fully drained.
>
> > diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> > index c435b3450872..1c58cbd55ffb 100644
> > --- a/net/tls/tls_device.c
> > +++ b/net/tls/tls_device.c
>
> [ ... ]
>
> > @@ -1552,13 +1722,30 @@ int tls_set_device_offload(struct sock *sk,
> >       return rc;
> >  }
> >
> > -int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
> > +int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx,
> > +                           struct tls_crypto_info *new_crypto_info)
> >  {
>
> [ ... ]
>
> > +     if (new_crypto_info) {
> > +             struct tls_sw_context_rx *sw_ctx = tls_sw_ctx_rx(ctx);
> > +
> > +             if (!test_bit(TLS_RX_DEV_CLOSED, &ctx->flags)) {
> > +                     set_bit(TLS_RX_DEV_CLOSED, &ctx->flags);
> > +                     synchronize_net();
> > +                     netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> > +                                                     TLS_OFFLOAD_CTX_DIR_RX);
> > +             }
> > +
> > +             if (context->rekey.old_aead_recv &&
> > +                 before(copied_seq, context->rekey.old_nic_boundary)) {
>
> [ ... ]
>
> > +                     context->dev_add_pending = 1;
> > +             } else {
> > +                     u32 rcv_nxt;
> > +
> > +                     if (context->rekey.old_aead_recv) {
> > +                             crypto_free_aead(context->rekey.old_aead_recv);
> > +                             context->rekey.old_aead_recv = NULL;
> > +                     }
> > +
> > +                     /* flush the backlog so rcv_nxt is accurate */
> > +                     __sk_flush_backlog(sk);
> > +                     rcv_nxt = tcp_sk(sk)->rcv_nxt;
> > +
> > +                     if (before(copied_seq, rcv_nxt)) {
> > +                             context->rekey.old_aead_recv = sw_ctx->aead_recv;
> > +                             sw_ctx->aead_recv = NULL;
> > +                             memcpy(context->rekey.old_iv, ctx->rx.iv,
> > +                                    sizeof(context->rekey.old_iv));
> > +                             memcpy(context->rekey.old_rec_seq, ctx->rx.rec_seq,
> > +                                    sizeof(context->rekey.old_rec_seq));
> > +                             context->rekey.old_nic_boundary = rcv_nxt;
> > +                             context->dev_add_pending = 1;
> > +                     }
> > +             }
> > +     }
> > +
> > +     rc = tls_sw_ctx_init(sk, 0, new_crypto_info);
> >       if (rc)
> >               goto release_ctx;
>
> [High]
> In the rekey branch, sw_ctx->aead_recv has just been moved into
> context->rekey.old_aead_recv (with sw_ctx->aead_recv set to NULL) and the
> NIC offload has already been torn down via tls_dev_del with
> TLS_RX_DEV_CLOSED set.
>
> If tls_sw_ctx_init() then fails (for example crypto_alloc_aead, setkey,
> setauthsize, or an ENOMEM in the strparser init), control jumps to
> release_ctx, and for the new_crypto_info case that label is essentially a
> no-op:
>
>         release_ctx:
>                 if (!new_crypto_info) {
>                         kfree(ctx->priv_ctx_rx);
>                         ctx->priv_ctx_rx = NULL;
>                 }
>
> So sw_ctx->aead_recv stays NULL, the rekey state stays populated,
> TLS_RX_DEV_CLOSED stays set, and tls_sw_ctx_finalize() is never called.
>
> The error returns to do_tls_setsockopt_conf() and, because
> update && rx_conf == TLS_HW, it goes straight to err_crypto_info without
> restoring anything, while rx_conf is still TLS_HW and the socket remains
> open.
>
> Can a subsequent recvmsg on this socket then NULL-deref in tls_decrypt_sg(),
> which uses ctx->aead_recv unconditionally to compute
> sizeof(*aead_req) + crypto_aead_reqsize(ctx->aead_recv)?
>
> Should this path roll back the rekey state (restore sw_ctx->aead_recv from
> rekey.old_aead_recv, clear rekey state, clear dev_add_pending) on
> tls_sw_ctx_init() failure?
>
> [High]
> The error path renaming from tls_sw_free_resources_rx() to
> tls_sw_release_resources_rx() in free_sw_resources, combined with a later
> kfree(ctx->priv_ctx_rx), looks like it drops the strparser anchor cleanup:
>
> > free_sw_resources:
> >       up_read(&device_offload_lock);
> > -     tls_sw_free_resources_rx(sk);
> > +     tls_sw_release_resources_rx(sk);
> >       down_read(&device_offload_lock);
> >  release_ctx:
> > -     ctx->priv_ctx_rx = NULL;
> > +     if (!new_crypto_info) {
> > +             kfree(ctx->priv_ctx_rx);
> > +             ctx->priv_ctx_rx = NULL;
> > +     }
>
> tls_sw_free_resources_rx() previously did three things: release_resources_rx
> (frees aead_recv, stops strparser), __tls_strp_done() ->
> tls_strp_anchor_free() which consume_skb()s the anchor allocated in
> tls_strp_init() during tls_sw_ctx_init(), and finally tls_sw_free_ctx_rx().
>
> The new code only does steps one and three — calling kfree(priv_ctx_rx)
> without first doing the anchor consume_skb().
>
> Since tls_strp_init() runs inside tls_sw_ctx_init() during the
> !new_crypto_info path, and free_sw_resources is reached only after
> tls_device_dev_add_rx() fails (i.e. after init succeeded), is the
> strp->anchor SKB leaked on every initial-setup failure here?
>
> > +     if (!context->dev_add_pending) {
> > +             rc = tls_device_dev_add_rx(sk, ctx, netdev, src_crypto_info,
> > +                                        copied_seq, !!new_crypto_info);
> > +             if (!new_crypto_info) {
> > +                     if (rc)
> > +                             goto free_sw_resources;
> > +                     tls_device_attach(ctx, sk, netdev);
> > +             }
> > +     } else if (!was_dev_add_pending) {
> > +             TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSRXREKEYINPROGRESS);
> > +     } else {
> > +             TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSRXREKEYOK);
> > +     }
>
> [Medium, Medium]
> For the rekey path, context->dev_add_pending is set to 1 before
> tls_sw_ctx_init() is called, but the matching
> TLS_INC_STATS(LINUX_MIB_TLSRXREKEYINPROGRESS) only happens later in the
> "else if (!was_dev_add_pending)" branch after init has succeeded.
>
> If tls_sw_ctx_init() fails, the goto release_ctx is a no-op for
> new_crypto_info, leaving dev_add_pending == 1 with no prior
> REKEYINPROGRESS increment.
>
> When the socket closes, tls_device_offload_cleanup_rx() then runs:
>
>         if (rx_ctx && rx_ctx->dev_add_pending) {
>                 rx_ctx->dev_add_pending = 0;
>                 TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSRXREKEYOK);
>                 TLS_DEC_STATS(sock_net(sk), LINUX_MIB_TLSRXREKEYINPROGRESS);
>         }
>
> Does this decrement TLSRXREKEYINPROGRESS without a prior matching
> increment, wrapping the per-CPU SNMP counter to a very large unsigned
> value, and bump TLSRXREKEYOK for a rekey that never installed a new key
> on the NIC?
>
> Separately, in the same cleanup tail, dev_add_pending == 1 means the
> deferred tls_dev_add never ran, so the new key was not installed on the
> NIC. Should this path really increment LINUX_MIB_TLSRXREKEYOK, given that
> tls_device_dev_add_rx() is the documented place that bumps REKEYOK after
> the NIC accepts the new key?
>
> [Medium]
> In the same hunk, when was_dev_add_pending is 1 (a previous rekey is still
> draining) and a new rekey arrives, the third arm runs:
>
>         } else {
>                 TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSRXREKEYOK);
>         }
>
> dev_add_pending is still 1 here and the new key has not been installed on
> the NIC yet. The eventual tls_device_deferred_dev_add_rx() will also call
> tls_device_dev_add_rx() with is_rekey=true, which on success increments
> LINUX_MIB_TLSRXREKEYOK again.
>
> Is the same successful rekey getting counted twice in this case, and a
> never-completed nested rekey getting counted as success?

Not double-counted, the totals balance. REKEYINPROGRESS is raised
exactly once per draining-chain (by the first rekey that defers) and
lowered exactly once (by the single deferred tls_device_dev_add_rx,
since nested rekeys hit the draining branch and don't re-arm
old_nic_boundary). Each nested rekey is counted immediately in the
else arm; the deferred add's REKEYOK is credited to the rekey that
opened the
INPROGRESS window. So N accepted rekeys → N REKEYOK, gauge back to 0.

^ permalink raw reply

* [PATCH bpf] bpf: zero-initialize the fib lookup flow struct
From: Avinash Duduskar @ 2026-06-17 22:47 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: bpf, davem, dsahern, eddyz87, edumazet, emil, horms,
	john.fastabend, jolsa, kuba, linux-kernel, martin.lau, memxor,
	netdev, pabeni, sdf, song, toke, yonghong.song

bpf_ipv4_fib_lookup() and bpf_ipv6_fib_lookup() build the flow key on
the stack with a bare "struct flowi4 fl4;" / "struct flowi6 fl6;" and
fill it field by field, but never set flowi4_l3mdev / flowi6_l3mdev.

On the non-DIRECT path the lookup goes through the fib rules whenever the
netns has custom rules, which a VRF installs:

	bpf_ipv4_fib_lookup() -> fib_lookup() -> __fib_lookup()
	  -> l3mdev_update_flow()   reads !fl->flowi_l3mdev
	  -> fib_rules_lookup() -> fib_rule_match()
	       -> l3mdev_fib_rule_match()   uses fl->flowi_l3mdev

l3mdev_update_flow() resolves the l3mdev master from the ingress device
only while the field is still zero. Left at a nonzero stack value the
resolution is skipped, and l3mdev_fib_rule_match() then tests that value
as an ifindex, so the VRF master is not resolved and the rule fails to
match: an ingress enslaved to a VRF can fail to select its table. FIB
rules matching on an L3 master device (l3mdev_fib_rule_iif_match()/
_oif_match()) read the same value, so an "ip rule iif/oif <vrf>"
mismatches the same way.

Zero-initialize the whole flow struct rather than adding one more
field assignment, so any flowi field added later is covered too.
ip_route_input_slow() likewise zeroes the field before its input lookup.

CONFIG_INIT_STACK_ALL_ZERO masks this by default, but it depends on
compiler support (CC_HAS_AUTO_VAR_INIT_ZERO), so INIT_STACK_NONE builds,
including older toolchains that fall back to it, are exposed. Built with
INIT_STACK_ALL_PATTERN, a plain bpf_fib_lookup (no VLAN, no DIRECT) over a
VRF slave whose destination is routed only in the VRF table returns
BPF_FIB_LKUP_RET_NOT_FWDED, and resolves with this patch. On the default
config the lookup succeeds either way, so ordinary testing does not catch
the bug.

Fixes: 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices")
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 net/core/filter.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 9590877b0714..7c58df589826 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6139,7 +6139,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct in_device *in_dev;
 	struct net_device *dev;
 	struct fib_result res;
-	struct flowi4 fl4;
+	struct flowi4 fl4 = {};
 	u32 mtu = 0;
 	int err;

@@ -6279,7 +6279,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct neighbour *neigh;
 	struct net_device *dev;
 	struct inet6_dev *idev;
-	struct flowi6 fl6;
+	struct flowi6 fl6 = {};
 	int strict = 0;
 	int oif, err;
 	u32 mtu = 0;
-- 
2.54.0

^ permalink raw reply related

* [PATCH bpf-next v3 0/3] bpf: bidirectional VLAN support for bpf_fib_lookup()
From: Avinash Duduskar @ 2026-06-17 22:47 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: ameryhung, a.s.protopopov, bpf, davem, dsahern, eddyz87, edumazet,
	emil, eyal.birger, hawk, horms, john.fastabend, jolsa, kpsingh,
	kuba, leon.hwang, linux-kernel, linux-kselftest, martin.lau,
	memxor, netdev, pabeni, rongtao, sdf, shuah, song, toke, yatsenko,
	yonghong.song

This series adds VLAN awareness to bpf_fib_lookup() in both directions.
BPF_FIB_LOOKUP_VLAN resolves a VLAN egress to its underlying real device
plus the VLAN tag (XDP programs need this because VLAN devices have no XDP
xmit), and BPF_FIB_LOOKUP_VLAN_INPUT runs the lookup as if a tagged frame
had arrived on the matching VLAN subinterface, for iif policy routing and
VRF table selection.

The l3mdev/VRF flow-init fix that was patch 1 in v1 and v2 has been split
out and sent to bpf on its own, since it is an independent Fixes:-tagged
fix that routes to stable on its own schedule. This series is otherwise
independent of it: on the default CONFIG_INIT_STACK_ALL_ZERO the VRF
selftests pass with or without the fix. Only the one full-lookup VRF arm
("IPv4 VLAN input, tag selects VRF table") depends on it, and only on
INIT_STACK_ALL_PATTERN or NONE builds, where the uninitialized
flowi_l3mdev otherwise misses the l3mdev rule and the lookup falls
through to the main table. Applying the l3mdev fix first closes that
window.

Changes v2 -> v3 (all from Toke's review unless noted):

- Split the l3mdev/VRF flow-init fix out to a standalone bpf submission
  (it was patch 1 in v2).

- Patch 2 (VLAN_INPUT): bpf_fib_vlan_input_dev() returns a
  struct net_device * with ERR_PTR() for the -EINVAL case and NULL for
  NOT_FWDED, instead of an int return and a **dev out-parameter.

- Trim the BPF_FIB_LOOKUP_VLAN and BPF_FIB_LOOKUP_VLAN_INPUT UAPI doc
  blocks, and drop the in-function comments that restated the commit
  message or the flag doc.

- Patch 1 (VLAN egress): on the skb path without tot_len, the deferred mtu
  check now runs against the resolved egress (VLAN) device, not the parent
  params->ifindex was swapped to, so a VLAN device with a smaller mtu than
  its parent is no longer checked against, or reported as, the parent's
  larger mtu. Found by the bpf ci bot; this was an open question in v2.

- Patch 3 (selftests): re-run every case through bpf_xdp_fib_lookup() as
  well, since the feature targets XDP; and flip the no-tot_len mtu arm to
  expect the VLAN device's mtu after the fix above.

Open questions (defaults chosen, noted here in case a maintainer
prefers otherwise):

1. An unmatched, down, or foreign-netns tag returns
   BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
   fib_get_table() finds no table, rather than a new return code.

2. BPF_FIB_LOOKUP_OUTPUT | BPF_FIB_LOOKUP_VLAN_INPUT is rejected with
   -EINVAL; restricting now keeps relaxing later backward-compatible.

3. The name BPF_FIB_LOOKUP_VLAN_INPUT reads oddly next to
   BPF_FIB_LOOKUP_OUTPUT. A pair like _VLAN_EGRESS/_VLAN_INGRESS is an
   option while nothing is merged.

4. The egress flag leaves a VLAN it cannot reduce to a physical parent
   plus one tag (QinQ, or a parent in another namespace) as SUCCESS with
   the VLAN device's ifindex and the vlan fields zero, like a plain
   lookup. The input side instead fails closed (NOT_FWDED) on the
   cross-namespace case. An XDP caller cannot xmit on a VLAN device, and
   a zero h_vlan_proto does not distinguish this result from a physical
   egress, so returning NOT_FWDED would be safer for XDP. But the two
   cases differ: a foreign-netns parent is clearly fail-worthy, while a
   QinQ egress is still a forwardable route (tc xmits on the inner VLAN
   device), so failing it closed would reject a usable route. Should
   egress signal NOT_FWDED, for both or only foreign-netns? I left it
   best-effort, but will change it if you prefer.

Taking the tag as lookup input follows the approach David Ahern
suggested in the 2021 fwmark discussion:
https://lore.kernel.org/bpf/6248c547-ad64-04d6-fcec-374893cc1ef2@gmail.com/

v2: https://lore.kernel.org/all/20260616223426.3568080-1-avinash.duduskar@gmail.com/
v1: https://lore.kernel.org/all/20260609172052.81613-1-avinash.duduskar@gmail.com/

Avinash Duduskar (3):
  bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
  bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
  selftests/bpf: Add bpf_fib_lookup() VLAN flag tests

 include/uapi/linux/bpf.h                      |  41 +-
 net/core/filter.c                             | 125 +++-
 tools/include/uapi/linux/bpf.h                |  41 +-
 .../selftests/bpf/prog_tests/fib_lookup.c     | 554 +++++++++++++++++-
 .../testing/selftests/bpf/progs/fib_lookup.c  |   9 +
 5 files changed, 741 insertions(+), 29 deletions(-)

base-commit: e771677c937da5808f7b6c1f0e4a97ec1a84f8a8
-- 
2.54.0

^ permalink raw reply

* [PATCH bpf-next v3 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Avinash Duduskar @ 2026-06-17 22:47 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: ameryhung, a.s.protopopov, bpf, davem, dsahern, eddyz87, edumazet,
	emil, eyal.birger, hawk, horms, john.fastabend, jolsa, kpsingh,
	kuba, leon.hwang, linux-kernel, linux-kselftest, martin.lau,
	memxor, netdev, pabeni, rongtao, sdf, shuah, song, toke, yatsenko,
	yonghong.song
In-Reply-To: <20260617224729.1428662-1-avinash.duduskar@gmail.com>

bpf_fib_lookup() returns the FIB-resolved egress ifindex straight
from the fib result. When the egress is a VLAN device, the returned
ifindex is the VLAN netdev's, which has no XDP xmit handler; XDP
programs that want to forward the frame (e.g. xdp-forward) must
instead target the underlying physical device and push the VLAN tag
themselves. Today the program has no way to learn either the
underlying ifindex or the VLAN tag without maintaining its own
VLAN-to-ifindex map in userspace and refreshing it on netlink
events.

Add BPF_FIB_LOOKUP_VLAN. When the caller sets this flag and the fib
result is a VLAN device whose immediate parent is a real (non-VLAN)
device in the same network namespace, populate the existing output
fields params->h_vlan_proto and params->h_vlan_TCI from the VLAN
device and replace params->ifindex with the parent's ifindex.
params->h_vlan_TCI carries the VID only, with PCP and DEI bits zero; a
consumer wanting to set egress priority writes PCP itself.
params->smac is the VLAN device's own address, which can differ from
the parent's.

Only the immediate parent is resolved, via vlan_dev_priv(dev)->real_dev
and not vlan_dev_real_dev(), which walks to the bottom of a stack. For a
stacked VLAN (QinQ) the immediate parent is itself a VLAN device; since
one h_vlan_proto/h_vlan_TCI pair cannot describe two tags, ifindex is
left unchanged and the vlan fields remain zero in that case. The swap
is also skipped when the parent lives in another network namespace (a
VLAN device can be moved while its parent stays), since its ifindex
would be meaningless or match an unrelated device in the caller's
namespace. The swap and the vlan fields are written only on success;
other output fields keep their existing behaviour, so a frag-needed
result still reports the route mtu in params->mtu_result.

On the skb path without tot_len the deferred mtu check is done against
the resolved egress device. To keep that the VLAN device rather than
the parent after the swap, bpf_ipv4_fib_lookup()/bpf_ipv6_fib_lookup()
hand the FIB-result device back to the caller; the XDP path always
runs the route-mtu check and passes NULL. When the flag is not set,
behaviour is unchanged: h_vlan_proto and h_vlan_TCI are zeroed and
ifindex is left at the FIB result.

The new block is compiled only under CONFIG_VLAN_8021Q since
vlan_dev_priv() is not defined otherwise; without that config
is_vlan_dev() is constant false and the flag is accepted but never
acts.

This lets an XDP redirect target the physical device and learn the
tag to push in a single lookup, which xdp-forward's optional VLAN
mode (xdp-project/xdp-tools#504) wants from the kernel side.

The helper's input semantics are unchanged; the reverse direction
(supplying a tag as lookup input) is added in the following patch.

Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 include/uapi/linux/bpf.h       | 22 +++++++++++-
 net/core/filter.c              | 61 +++++++++++++++++++++++-----------
 tools/include/uapi/linux/bpf.h | 22 +++++++++++-
 3 files changed, 84 insertions(+), 21 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 89b36de5fdbb..f1ac9266a2ab 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3532,6 +3532,21 @@ union bpf_attr {
  *			Use the mark present in *params*->mark for the fib lookup.
  *			This option should not be used with BPF_FIB_LOOKUP_DIRECT,
  *			as it only has meaning for full lookups.
+ *		**BPF_FIB_LOOKUP_VLAN**
+ *			If the fib lookup resolves to a VLAN device whose
+ *			parent is a real (non-VLAN) device, set
+ *			*params*->h_vlan_proto and *params*->h_vlan_TCI from
+ *			the VLAN device and replace *params*->ifindex with the
+ *			parent's ifindex. *params*->h_vlan_TCI carries the VID
+ *			only, with PCP and DEI bits zero; a consumer wanting to
+ *			set egress priority writes PCP itself. *params*->smac is
+ *			the VLAN device's own address, which can differ from the
+ *			parent's. Only the immediate parent is resolved (QinQ is
+ *			not supported), and the swap is skipped if the parent is
+ *			in a different namespace. The swap and the vlan fields
+ *			are written only on success; other output fields keep
+ *			the helper's existing behaviour, so a frag-needed result
+ *			still reports the route mtu in *params*->mtu_result.
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7327,6 +7342,7 @@ enum {
 	BPF_FIB_LOOKUP_TBID    = (1U << 3),
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
+	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
 };
 
 enum {
@@ -7393,7 +7409,11 @@ struct bpf_fib_lookup {
 
 	union {
 		struct {
-			/* output */
+			/*
+			 * output with BPF_FIB_LOOKUP_VLAN: set from the
+			 * resolved egress VLAN device (see the flag); zeroed
+			 * on other successful lookups.
+			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
 		};
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e96b4b847ce..27e4792f11e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6201,10 +6201,26 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #endif
 
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
-static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
+static int bpf_fib_set_fwd_params(struct net_device *dev,
+				  struct bpf_fib_lookup *params,
+				  u32 flags, u32 mtu)
 {
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+
+#if IS_ENABLED(CONFIG_VLAN_8021Q)
+	if ((flags & BPF_FIB_LOOKUP_VLAN) && is_vlan_dev(dev)) {
+		struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
+
+		if (!is_vlan_dev(real_dev) &&
+		    net_eq(dev_net(real_dev), dev_net(dev))) {
+			params->h_vlan_proto = vlan_dev_vlan_proto(dev);
+			params->h_vlan_TCI = htons(vlan_dev_vlan_id(dev));
+			params->ifindex = real_dev->ifindex;
+		}
+	}
+#endif
+
 	if (mtu)
 		params->mtu_result = mtu; /* union with tot_len */
 
@@ -6214,7 +6230,8 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
 
 #if IS_ENABLED(CONFIG_INET)
 static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
-			       u32 flags, bool check_mtu)
+			       u32 flags, bool check_mtu,
+			       struct net_device **fwd_dev)
 {
 	struct neighbour *neigh = NULL;
 	struct fib_nh_common *nhc;
@@ -6347,13 +6364,16 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 
 set_fwd_params:
-	return bpf_fib_set_fwd_params(params, mtu);
+	if (fwd_dev)
+		*fwd_dev = dev;
+	return bpf_fib_set_fwd_params(dev, params, flags, mtu);
 }
 #endif
 
 #if IS_ENABLED(CONFIG_IPV6)
 static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
-			       u32 flags, bool check_mtu)
+			       u32 flags, bool check_mtu,
+			       struct net_device **fwd_dev)
 {
 	struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
 	struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
@@ -6486,13 +6506,16 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 
 set_fwd_params:
-	return bpf_fib_set_fwd_params(params, mtu);
+	if (fwd_dev)
+		*fwd_dev = dev;
+	return bpf_fib_set_fwd_params(dev, params, flags, mtu);
 }
 #endif
 
 #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
 			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
-			     BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK)
+			     BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK | \
+			     BPF_FIB_LOOKUP_VLAN)
 
 BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
@@ -6507,12 +6530,12 @@ BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
 		return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params,
-					   flags, true);
+					   flags, true, NULL);
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
 		return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params,
-					   flags, true);
+					   flags, true, NULL);
 #endif
 	}
 	return -EAFNOSUPPORT;
@@ -6532,6 +6555,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
 {
 	struct net *net = dev_net(skb->dev);
+	struct net_device *fwd_dev = NULL;
 	int rc = -EAFNOSUPPORT;
 	bool check_mtu = false;
 
@@ -6547,29 +6571,28 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	switch (params->family) {
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
-		rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu);
+		rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu,
+					 &fwd_dev);
 		break;
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu);
+		rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu,
+					 &fwd_dev);
 		break;
 #endif
 	}
 
 	if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) {
-		struct net_device *dev;
-
-		/* When tot_len isn't provided by user, check skb
-		 * against MTU of FIB lookup resulting net_device
+		/*
+		 * Without tot_len, check the skb against the FIB result
+		 * device's MTU, which BPF_FIB_LOOKUP_VLAN keeps as the VLAN
+		 * device even though params->ifindex was swapped to the parent.
 		 */
-		dev = dev_get_by_index_rcu(net, params->ifindex);
-		if (unlikely(!dev))
-			return -ENODEV;
-		if (!is_skb_forwardable(dev, skb))
+		if (!is_skb_forwardable(fwd_dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
 
-		params->mtu_result = dev->mtu; /* union with tot_len */
+		params->mtu_result = fwd_dev->mtu; /* union with tot_len */
 	}
 
 	return rc;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 89b36de5fdbb..f1ac9266a2ab 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3532,6 +3532,21 @@ union bpf_attr {
  *			Use the mark present in *params*->mark for the fib lookup.
  *			This option should not be used with BPF_FIB_LOOKUP_DIRECT,
  *			as it only has meaning for full lookups.
+ *		**BPF_FIB_LOOKUP_VLAN**
+ *			If the fib lookup resolves to a VLAN device whose
+ *			parent is a real (non-VLAN) device, set
+ *			*params*->h_vlan_proto and *params*->h_vlan_TCI from
+ *			the VLAN device and replace *params*->ifindex with the
+ *			parent's ifindex. *params*->h_vlan_TCI carries the VID
+ *			only, with PCP and DEI bits zero; a consumer wanting to
+ *			set egress priority writes PCP itself. *params*->smac is
+ *			the VLAN device's own address, which can differ from the
+ *			parent's. Only the immediate parent is resolved (QinQ is
+ *			not supported), and the swap is skipped if the parent is
+ *			in a different namespace. The swap and the vlan fields
+ *			are written only on success; other output fields keep
+ *			the helper's existing behaviour, so a frag-needed result
+ *			still reports the route mtu in *params*->mtu_result.
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7327,6 +7342,7 @@ enum {
 	BPF_FIB_LOOKUP_TBID    = (1U << 3),
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
+	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
 };
 
 enum {
@@ -7393,7 +7409,11 @@ struct bpf_fib_lookup {
 
 	union {
 		struct {
-			/* output */
+			/*
+			 * output with BPF_FIB_LOOKUP_VLAN: set from the
+			 * resolved egress VLAN device (see the flag); zeroed
+			 * on other successful lookups.
+			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
 		};
-- 
2.54.0


^ permalink raw reply related

* [PATCH bpf-next v3 2/3] bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
From: Avinash Duduskar @ 2026-06-17 22:47 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: ameryhung, a.s.protopopov, bpf, davem, dsahern, eddyz87, edumazet,
	emil, eyal.birger, hawk, horms, john.fastabend, jolsa, kpsingh,
	kuba, leon.hwang, linux-kernel, linux-kselftest, martin.lau,
	memxor, netdev, pabeni, rongtao, sdf, shuah, song, toke, yatsenko,
	yonghong.song
In-Reply-To: <20260617224729.1428662-1-avinash.duduskar@gmail.com>

BPF_FIB_LOOKUP_VLAN resolves a VLAN egress. The reverse is also
useful: an XDP program receiving a VLAN-tagged frame on a physical
device wants the lookup to behave as if the packet had arrived on the
corresponding VLAN subinterface, so iif-based policy routing and VRF
table selection use the right ingress.

Add BPF_FIB_LOOKUP_VLAN_INPUT. When set, params->h_vlan_proto and
params->h_vlan_TCI are read as an input VLAN tag and the matching VLAN
device of params->ifindex is resolved with __vlan_find_dev_deep_rcu().
The device must be up and in the same network namespace as
params->ifindex (a VLAN device can be moved to another netns while
registered on its parent; receive would deliver into that other
namespace, which a lookup here cannot represent). If params->ifindex
is itself a VLAN device, its inner (QinQ) subinterface is matched.
For a bond or team, a tag on a port matches no device and returns
NOT_FWDED; pass the master's ifindex.
The lookup then runs with the resolved device as the ingress;
params->ifindex itself is not modified on the input side. When the
resolved device is enslaved to a VRF, both the full lookup (via the
l3mdev rule) and BPF_FIB_LOOKUP_DIRECT (via l3mdev_fib_table_rcu())
select the VRF's table from the resolved ingress. That follows from
feeding the resolved device to the flow as the ingress
(fl4.flowi4_iif = dev->ifindex), which is what makes l3mdev resolve
the VRF master from the subinterface rather than from
params->ifindex.

The two failure classes get different treatment on purpose. A
h_vlan_proto other than 802.1Q/802.1ad is API misuse and returns
-EINVAL, since it would otherwise reach the WARN in vlan_proto_idx()
with a program-controlled value. An unmatched VID, a device that is
down, or one in another namespace is a data outcome and returns
BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
fib_get_table() finds no table and mirroring real ingress, where the
receive path drops such frames. A VID of 0 (a priority tag) is looked
up literally and normally fails the same way; receive instead
processes such frames untagged, so callers should not set the flag for
priority tags. Proceeding on the physical device for any of these
would be fail-open for the policy-routing cases above.

The h_vlan fields share a union with tbid, so the flag cannot be
combined with BPF_FIB_LOOKUP_TBID. It describes ingress, so it also
cannot be combined with BPF_FIB_LOOKUP_OUTPUT. Both combinations
return -EINVAL; restricting now keeps a later relaxation backward
compatible. Combining with BPF_FIB_LOOKUP_VLAN is allowed: the tag is
consumed on the ingress side and the egress tag is written on
success.

Under !CONFIG_VLAN_8021Q the __vlan_find_dev_deep_rcu() stub returns
NULL, so every lookup with the flag returns NOT_FWDED, which is
correct since no VLAN device can exist.

Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 include/uapi/linux/bpf.h       | 21 ++++++++++-
 net/core/filter.c              | 66 +++++++++++++++++++++++++++++++---
 tools/include/uapi/linux/bpf.h | 21 ++++++++++-
 3 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f1ac9266a2ab..23bc3109619d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3547,6 +3547,22 @@ union bpf_attr {
  *			are written only on success; other output fields keep
  *			the helper's existing behaviour, so a frag-needed result
  *			still reports the route mtu in *params*->mtu_result.
+ *		**BPF_FIB_LOOKUP_VLAN_INPUT**
+ *			Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ *			as an input VLAN tag and run the lookup as if ingress
+ *			had happened on the VLAN subinterface carrying that tag
+ *			on *params*->ifindex. The VID is the low 12 bits of
+ *			*params*->h_vlan_TCI; *params*->h_vlan_proto must be
+ *			ETH_P_8021Q or ETH_P_8021AD in network byte order, else
+ *			**-EINVAL**. If *params*->ifindex is itself a VLAN
+ *			device, its inner (QinQ) subinterface is matched; for a
+ *			bond or team, pass the master's ifindex. An unmatched
+ *			tag, a down device, or one in another namespace returns
+ *			**BPF_FIB_LKUP_RET_NOT_FWDED**, mirroring real ingress.
+ *			A VID of 0 is looked up literally, so do not set this
+ *			flag for priority-tagged frames. Cannot be combined with
+ *			**BPF_FIB_LOOKUP_TBID** or **BPF_FIB_LOOKUP_OUTPUT**
+ *			(returns **-EINVAL**).
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7343,6 +7359,7 @@ enum {
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
 	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
+	BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
 };
 
 enum {
@@ -7412,7 +7429,9 @@ struct bpf_fib_lookup {
 			/*
 			 * output with BPF_FIB_LOOKUP_VLAN: set from the
 			 * resolved egress VLAN device (see the flag); zeroed
-			 * on other successful lookups.
+			 * on other successful lookups. input with
+			 * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+			 * the lookup by.
 			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
diff --git a/net/core/filter.c b/net/core/filter.c
index 27e4792f11e9..399adf2a824a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6226,6 +6226,25 @@ static int bpf_fib_set_fwd_params(struct net_device *dev,
 
 	return 0;
 }
+
+static struct net_device *bpf_fib_vlan_input_dev(struct net_device *dev,
+						 const struct bpf_fib_lookup *params)
+{
+	__be16 proto = params->h_vlan_proto;
+	struct net_device *vlan_dev;
+	u16 vid;
+
+	if (proto != htons(ETH_P_8021Q) && proto != htons(ETH_P_8021AD))
+		return ERR_PTR(-EINVAL);
+
+	vid = ntohs(params->h_vlan_TCI) & VLAN_VID_MASK;
+	vlan_dev = __vlan_find_dev_deep_rcu(dev, proto, vid);
+	if (!vlan_dev || !(vlan_dev->flags & IFF_UP) ||
+	    !net_eq(dev_net(vlan_dev), dev_net(dev)))
+		return NULL;
+
+	return vlan_dev;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_INET)
@@ -6246,6 +6265,14 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (unlikely(!dev))
 		return -ENODEV;
 
+	if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		dev = bpf_fib_vlan_input_dev(dev, params);
+		if (IS_ERR(dev))
+			return PTR_ERR(dev);
+		if (!dev)
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
+	}
+
 	/* verify forwarding is enabled on this interface */
 	in_dev = __in_dev_get_rcu(dev);
 	if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
@@ -6255,7 +6282,11 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 		fl4.flowi4_iif = 1;
 		fl4.flowi4_oif = params->ifindex;
 	} else {
-		fl4.flowi4_iif = params->ifindex;
+		/*
+		 * dev->ifindex, not params->ifindex: VLAN_INPUT may have
+		 * resolved dev to a subinterface above.
+		 */
+		fl4.flowi4_iif = dev->ifindex;
 		fl4.flowi4_oif = 0;
 	}
 	fl4.flowi4_dscp = inet_dsfield_to_dscp(params->tos);
@@ -6394,6 +6425,14 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (unlikely(!dev))
 		return -ENODEV;
 
+	if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		dev = bpf_fib_vlan_input_dev(dev, params);
+		if (IS_ERR(dev))
+			return PTR_ERR(dev);
+		if (!dev)
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
+	}
+
 	idev = __in6_dev_get_safely(dev);
 	if (unlikely(!idev || !READ_ONCE(idev->cnf.forwarding)))
 		return BPF_FIB_LKUP_RET_FWD_DISABLED;
@@ -6402,7 +6441,12 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 		fl6.flowi6_iif = 1;
 		oif = fl6.flowi6_oif = params->ifindex;
 	} else {
-		oif = fl6.flowi6_iif = params->ifindex;
+		/*
+		 * dev->ifindex, not params->ifindex: VLAN_INPUT may have
+		 * resolved dev to a subinterface above.
+		 */
+		oif = dev->ifindex;
+		fl6.flowi6_iif = oif;
 		fl6.flowi6_oif = 0;
 		strict = RT6_LOOKUP_F_HAS_SADDR;
 	}
@@ -6515,7 +6559,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
 			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
 			     BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK | \
-			     BPF_FIB_LOOKUP_VLAN)
+			     BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_VLAN_INPUT)
+
+static bool bpf_fib_lookup_flags_ok(u32 flags)
+{
+	if (flags & ~BPF_FIB_LOOKUP_MASK)
+		return false;
+
+	if ((flags & BPF_FIB_LOOKUP_VLAN_INPUT) &&
+	    (flags & (BPF_FIB_LOOKUP_TBID | BPF_FIB_LOOKUP_OUTPUT)))
+		return false;
+
+	return true;
+}
 
 BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
@@ -6523,7 +6579,7 @@ BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	if (plen < sizeof(*params))
 		return -EINVAL;
 
-	if (flags & ~BPF_FIB_LOOKUP_MASK)
+	if (!bpf_fib_lookup_flags_ok(flags))
 		return -EINVAL;
 
 	switch (params->family) {
@@ -6562,7 +6618,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	if (plen < sizeof(*params))
 		return -EINVAL;
 
-	if (flags & ~BPF_FIB_LOOKUP_MASK)
+	if (!bpf_fib_lookup_flags_ok(flags))
 		return -EINVAL;
 
 	if (params->tot_len)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f1ac9266a2ab..23bc3109619d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3547,6 +3547,22 @@ union bpf_attr {
  *			are written only on success; other output fields keep
  *			the helper's existing behaviour, so a frag-needed result
  *			still reports the route mtu in *params*->mtu_result.
+ *		**BPF_FIB_LOOKUP_VLAN_INPUT**
+ *			Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ *			as an input VLAN tag and run the lookup as if ingress
+ *			had happened on the VLAN subinterface carrying that tag
+ *			on *params*->ifindex. The VID is the low 12 bits of
+ *			*params*->h_vlan_TCI; *params*->h_vlan_proto must be
+ *			ETH_P_8021Q or ETH_P_8021AD in network byte order, else
+ *			**-EINVAL**. If *params*->ifindex is itself a VLAN
+ *			device, its inner (QinQ) subinterface is matched; for a
+ *			bond or team, pass the master's ifindex. An unmatched
+ *			tag, a down device, or one in another namespace returns
+ *			**BPF_FIB_LKUP_RET_NOT_FWDED**, mirroring real ingress.
+ *			A VID of 0 is looked up literally, so do not set this
+ *			flag for priority-tagged frames. Cannot be combined with
+ *			**BPF_FIB_LOOKUP_TBID** or **BPF_FIB_LOOKUP_OUTPUT**
+ *			(returns **-EINVAL**).
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -7343,6 +7359,7 @@ enum {
 	BPF_FIB_LOOKUP_SRC     = (1U << 4),
 	BPF_FIB_LOOKUP_MARK    = (1U << 5),
 	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
+	BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
 };
 
 enum {
@@ -7412,7 +7429,9 @@ struct bpf_fib_lookup {
 			/*
 			 * output with BPF_FIB_LOOKUP_VLAN: set from the
 			 * resolved egress VLAN device (see the flag); zeroed
-			 * on other successful lookups.
+			 * on other successful lookups. input with
+			 * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+			 * the lookup by.
 			 */
 			__be16	h_vlan_proto;
 			__be16	h_vlan_TCI;
-- 
2.54.0


^ permalink raw reply related

* [PATCH bpf-next v3 3/3] selftests/bpf: Add bpf_fib_lookup() VLAN flag tests
From: Avinash Duduskar @ 2026-06-17 22:47 UTC (permalink / raw)
  To: ast, daniel, andrii
  Cc: ameryhung, a.s.protopopov, bpf, davem, dsahern, eddyz87, edumazet,
	emil, eyal.birger, hawk, horms, john.fastabend, jolsa, kpsingh,
	kuba, leon.hwang, linux-kernel, linux-kselftest, martin.lau,
	memxor, netdev, pabeni, rongtao, sdf, shuah, song, toke, yatsenko,
	yonghong.song
In-Reply-To: <20260617224729.1428662-1-avinash.duduskar@gmail.com>

Cover both directions of the new VLAN flags in the fib_lookup test,
36 table cases plus a dedicated cross-netns subtest.

For BPF_FIB_LOOKUP_VLAN the egress cases assert: without the flag the
lookup returns the VLAN netdev's ifindex and zeroed vlan fields, with
the flag it returns the parent's ifindex plus the tag (including via
a neighbour resolved on the VLAN device, in OUTPUT mode, over a bond,
and through a DIRECT|TBID table), with the flag on a non-VLAN egress
it changes nothing, for a stacked VLAN it leaves ifindex untouched
with the vlan fields zero, and a frag-needed return reports the route
mtu in mtu_result while leaving the swap unwritten.

For BPF_FIB_LOOKUP_VLAN_INPUT, an iif rule on the subinterface routes
the same destination to a different gateway, so the asserted gateway
shows which device the lookup used as ingress: without the flag the
main table answers, with a matching tag the subinterface's table
does, with or without SKIP_NEIGH, and BPF_FIB_LOOKUP_SRC selects the
subinterface's address. A VRF-enslaved subinterface selects the VRF
table through the l3mdev rule and, with DIRECT, through
l3mdev_fib_table_rcu(). One case sets BPF_FIB_LOOKUP_VLAN as well and
asserts both directions work in a single lookup. Resolution semantics
are pinned: an 802.1ad tag resolves its device, PCP and DEI bits in
h_vlan_TCI are ignored, a VLAN ifindex resolves the inner QinQ
device, a tag on a bond master resolves while the same tag on the
bond port does not.

The error cases assert -EINVAL for an invalid h_vlan_proto on both
address families, for the TBID and OUTPUT flag combinations and for
an unknown flag bit, and BPF_FIB_LKUP_RET_NOT_FWDED for a VID with no
configured device on both families, for a VID-0 priority tag and for
a device that exists but is down. The failure cases also assert that
params is left untouched.

A separate subtest moves a VLAN device into a second netns while it
stays registered on its parent, and checks both directions refuse to
cross the boundary: the input flag fails closed with the tag and
ifindex untouched, and the egress flag does not publish the foreign
parent's ifindex.

The tbid read-back check is skipped for DIRECT cases that set
BPF_FIB_LOOKUP_VLAN, since a successful swap packs the vlan fields
into the union the check reads.

Re-run the cases through bpf_xdp_fib_lookup() as well: the egress flag
exists because VLAN devices have no XDP xmit, so XDP is the primary
consumer. bpf_prog_test_run uses the netns' loopback for the xdp context's
device, so the lookup runs against the test netns' FIB, and the
path-independent results (return code, swapped ifindex, vlan tag, gateway)
are asserted to match the skb path.

Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
---
 .../selftests/bpf/prog_tests/fib_lookup.c     | 554 +++++++++++++++++-
 .../testing/selftests/bpf/progs/fib_lookup.c  |   9 +
 2 files changed, 559 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/fib_lookup.c b/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
index bd7658958004..987a691fe078 100644
--- a/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
+++ b/tools/testing/selftests/bpf/prog_tests/fib_lookup.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
 
 #include <linux/rtnetlink.h>
+#include <linux/if_ether.h>
 #include <sys/types.h>
 #include <net/if.h>
 
@@ -37,6 +38,41 @@
 #define IPV6_LOCAL		"fd01::3"
 #define IPV6_GW1		"fd01::1"
 #define IPV6_GW2		"fd01::2"
+#define VLAN_ID			100
+#define VLAN_IFACE		"veth1.100"
+#define VLAN_ID_DOWN		102
+#define VLAN_IFACE_DOWN		"veth1.102"
+#define QINQ_OUTER_IFACE	"veth1.200"
+#define QINQ_INNER_IFACE	"veth1.200.300"
+#define VLAN_TABLE		"300"
+#define IPV4_VLAN_IFACE_ADDR	"10.5.0.254"
+#define IPV4_VLAN_EGRESS_DST	"10.5.0.2"
+#define IPV4_QINQ_DST		"10.7.0.2"
+#define IPV4_VLAN_DST		"10.6.0.2"
+#define IPV4_VLAN_GW		"10.5.0.1"
+#define IPV6_VLAN_IFACE_ADDR	"fd02::254"
+#define IPV6_VLAN_EGRESS_DST	"fd02::2"
+#define IPV6_VLAN_DST		"fd03::2"
+#define IPV6_VLAN_GW		"fd02::1"
+#define VLAN_VID_UNUSED		999
+#define VRF_IFACE		"vrf-blue"
+#define VRF_TABLE		"1000"
+#define VRF_VLAN_ID		101
+#define VRF_VLAN_IFACE		"veth1.101"
+#define IPV4_VRF_IFACE_ADDR	"10.8.0.254"
+#define IPV4_VRF_GW		"10.8.0.1"
+#define IPV4_VRF_DST		"10.9.0.2"
+#define TBID_VLAN_ID		50
+#define TBID_VLAN_IFACE		"veth2.50"
+#define IPV4_TBID_VLAN_DST	"172.2.0.2"
+#define IPV4_BOND_VLAN_DST	"10.11.0.2"
+#define IPV4_VLAN_MTU_DST	"10.5.9.2"
+#define QINQ_AD_VLAN_ID		200
+#define QINQ_INNER_VLAN_ID	300
+#define BOND_IFACE		"bond99"
+#define BOND_PORT		"veth3"
+#define BOND_PORT_PEER		"veth4"
+#define BOND_VLAN_ID		500
 #define DMAC			"11:11:11:11:11:11"
 #define DMAC_INIT { 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, }
 #define DMAC2			"01:01:01:01:01:01"
@@ -52,6 +88,17 @@ struct fib_lookup_test {
 	__u32 tbid;
 	__u8 dmac[6];
 	__u32 mark;
+	/*
+	 * input tag with BPF_FIB_LOOKUP_VLAN_INPUT; expected output tag
+	 * with BPF_FIB_LOOKUP_VLAN (checked when check_vlan is set)
+	 */
+	__u16 vlan_proto;
+	__u16 vlan_id;
+	bool check_vlan;
+	const char *expected_dev; /* expected params->ifindex after lookup */
+	const char *iif;	  /* override the default veth1 input device */
+	__u16 tot_len;		  /* triggers the in-lookup mtu check when set */
+	__u16 expected_mtu;	  /* expected mtu_result (union with tot_len) */
 };
 
 static const struct fib_lookup_test tests[] = {
@@ -142,6 +189,209 @@ static const struct fib_lookup_test tests[] = {
 	  .expected_dst = IPV6_GW1,
 	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
 	  .mark = MARK, },
+	/* vlan egress resolution */
+	{ .desc = "IPv4 VLAN egress, no flag",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = VLAN_IFACE, .check_vlan = true, },
+	{ .desc = "IPv4 VLAN egress, single VLAN",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	/*
+	 * skb path without tot_len: mtu_result is the FIB result (VLAN)
+	 * device's mtu (1400) with or without the swap, not the parent's (1500)
+	 */
+	{ .desc = "IPv4 VLAN egress, skb-path mtu is the VLAN device's without the flag",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = VLAN_IFACE, .check_vlan = true, .expected_mtu = 1400, },
+	{ .desc = "IPv4 VLAN egress, skb-path mtu stays the VLAN device's after the swap",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .expected_mtu = 1400, },
+	{ .desc = "IPv4 VLAN egress, flag set but egress is not a VLAN",
+	  .daddr = IPV4_NUD_FAILED_ADDR, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true, },
+	{ .desc = "IPv4 VLAN egress, stacked VLAN untouched",
+	  .daddr = IPV4_QINQ_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = QINQ_INNER_IFACE, .check_vlan = true, },
+	{ .desc = "IPv6 VLAN egress, single VLAN",
+	  .daddr = IPV6_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, neighbour on the VLAN device",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .dmac = DMAC_INIT, },
+	{ .desc = "IPv4 VLAN egress in OUTPUT mode",
+	  .daddr = IPV4_VLAN_EGRESS_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .iif = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_OUTPUT | BPF_FIB_LOOKUP_VLAN |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress over a bond",
+	  .daddr = IPV4_BOND_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = BOND_IFACE, .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress via TBID table",
+	  .daddr = IPV4_TBID_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .lookup_flags = BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_TBID |
+			  BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .tbid = 100,
+	  .expected_dev = "veth2", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = TBID_VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, success writes mtu_result with the swap",
+	  .daddr = IPV4_VLAN_MTU_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .tot_len = 500, .expected_mtu = 1000,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN egress, FRAG_NEEDED reports mtu, swap unwritten",
+	  .daddr = IPV4_VLAN_MTU_DST, .expected_ret = BPF_FIB_LKUP_RET_FRAG_NEEDED,
+	  .tot_len = 1400, .expected_mtu = 1000,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .expected_dev = "veth1", .check_vlan = true, },
+	/* vlan tag as lookup input */
+	{ .desc = "IPv4 VLAN input, no flag",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_GW1,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input, tag selects subinterface route",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv6 VLAN input, tag selects subinterface route",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV6_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input and egress combined",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = "veth1",
+	  .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_VLAN |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, neighbour resolved on the route",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW, .expected_dev = VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, .dmac = DMAC_INIT2, },
+	{ .desc = "IPv4 VLAN input, source address from the subinterface",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_src = IPV4_VLAN_IFACE_ADDR,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SRC |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	/*
+	 * VRF: the resolved subinterface is enslaved, so the l3mdev rule
+	 * (full lookup) and l3mdev_fib_table_rcu() (DIRECT) must select
+	 * the VRF table from the resolved ingress
+	 */
+	{ .desc = "IPv4 VLAN input, VRF subinterface, no flag",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_GW1,
+	  .lookup_flags = BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input, tag selects VRF table",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VRF_GW, .expected_dev = VRF_VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VRF_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, DIRECT uses VRF table from resolved ingress",
+	  .daddr = IPV4_VRF_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VRF_GW, .expected_dev = VRF_VLAN_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_DIRECT |
+			  BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VRF_VLAN_ID, },
+	/*
+	 * failure arms also assert params is left untouched: ifindex still
+	 * names the physical device and the input tag bytes survive
+	 */
+	{ .desc = "IPv4 VLAN input, invalid proto",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = 0x1234, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, unmatched VID",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_VID_UNUSED, },
+	{ .desc = "IPv4 VLAN input, subinterface down",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID_DOWN, },
+	/*
+	 * the resolver runs before the forwarding check, so on devices
+	 * with forwarding off FWD_DISABLED (not NOT_FWDED) proves the tag
+	 * resolved to that device and the lookup used it as ingress
+	 */
+	{ .desc = "IPv4 VLAN input, 802.1ad tag",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021AD, .vlan_id = QINQ_AD_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, PCP and DEI bits ignored in TCI",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_SUCCESS,
+	  .expected_dst = IPV4_VLAN_GW,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = 0xe000 | VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, inner QinQ device from VLAN ifindex",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .iif = QINQ_OUTER_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = QINQ_INNER_VLAN_ID, },
+	/*
+	 * bonding: the VLANs live on the master, as on receive, where the
+	 * frame is steered to the master before VLAN processing; a port
+	 * ifindex does not match (ports carry vid state but no VLAN devs)
+	 */
+	{ .desc = "IPv4 VLAN input, tag on bond master resolves",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_FWD_DISABLED,
+	  .iif = BOND_IFACE,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, tag on bond port does not match",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .iif = BOND_PORT, .expected_dev = BOND_PORT, .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = BOND_VLAN_ID, },
+	{ .desc = "IPv6 VLAN input, invalid proto",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = -EINVAL,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = 0x1234, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input, VID 0 priority tag fails closed",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = 0, },
+	{ .desc = "IPv6 VLAN input, unmatched VID",
+	  .daddr = IPV6_VLAN_DST, .expected_ret = BPF_FIB_LKUP_RET_NOT_FWDED,
+	  .expected_dev = "veth1", .check_vlan = true,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_SKIP_NEIGH,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_VID_UNUSED, },
+	{ .desc = "unknown flag bit rejected",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = (1 << 14) | BPF_FIB_LOOKUP_SKIP_NEIGH, },
+	{ .desc = "IPv4 VLAN input rejected with TBID",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_TBID,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
+	{ .desc = "IPv4 VLAN input rejected with OUTPUT",
+	  .daddr = IPV4_VLAN_DST, .expected_ret = -EINVAL,
+	  .lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT | BPF_FIB_LOOKUP_OUTPUT,
+	  .vlan_proto = ETH_P_8021Q, .vlan_id = VLAN_ID, },
 };
 
 static int setup_netns(void)
@@ -204,6 +454,110 @@ static int setup_netns(void)
 	SYS(fail, "ip rule add prio 2 fwmark %d lookup %s", MARK, MARK_TABLE);
 	SYS(fail, "ip -6 rule add prio 2 fwmark %d lookup %s", MARK, MARK_TABLE);
 
+	/*
+	 * Setup for vlan tests: a subinterface for egress resolution and
+	 * tag-as-input, a QinQ stack, and an iif rule so the input tests
+	 * observe which device the lookup used as ingress.
+	 */
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VLAN_IFACE, VLAN_ID);
+	SYS(fail, "ip link set dev %s up", VLAN_IFACE);
+	/*
+	 * lower than the veth1 parent (1500): the skb-path mtu check uses the
+	 * FIB result (VLAN) device, so mtu_result is this value with or
+	 * without the egress swap, which two arms below pin
+	 */
+	SYS(fail, "ip link set dev %s mtu 1400", VLAN_IFACE);
+	SYS(fail, "ip addr add %s/24 dev %s", IPV4_VLAN_IFACE_ADDR, VLAN_IFACE);
+	SYS(fail, "ip addr add %s/64 dev %s nodad", IPV6_VLAN_IFACE_ADDR, VLAN_IFACE);
+
+	/*
+	 * stays down: the input flag must treat its tag the way real
+	 * ingress treats a frame arriving on a down VLAN device (drop)
+	 */
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VLAN_IFACE_DOWN, VLAN_ID_DOWN);
+
+	err = write_sysctl("/proc/sys/net/ipv4/conf/" VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.conf." VLAN_IFACE ".forwarding)"))
+		goto fail;
+
+	err = write_sysctl("/proc/sys/net/ipv6/conf/" VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv6.conf." VLAN_IFACE ".forwarding)"))
+		goto fail;
+
+	SYS(fail, "ip link add link veth1 name %s type vlan proto 802.1ad id 200",
+	    QINQ_OUTER_IFACE);
+	SYS(fail, "ip link add link %s name %s type vlan id 300",
+	    QINQ_OUTER_IFACE, QINQ_INNER_IFACE);
+	SYS(fail, "ip link set dev %s up", QINQ_OUTER_IFACE);
+	SYS(fail, "ip link set dev %s up", QINQ_INNER_IFACE);
+	SYS(fail, "ip route add %s/32 dev %s", IPV4_QINQ_DST, QINQ_INNER_IFACE);
+
+	SYS(fail, "ip route add %s/32 via %s", IPV4_VLAN_DST, IPV4_GW1);
+	SYS(fail, "ip route add table %s %s/32 via %s",
+	    VLAN_TABLE, IPV4_VLAN_DST, IPV4_VLAN_GW);
+	SYS(fail, "ip rule add prio 3 iif %s lookup %s", VLAN_IFACE, VLAN_TABLE);
+	SYS(fail, "ip -6 route add %s/128 via %s", IPV6_VLAN_DST, IPV6_GW1);
+	SYS(fail, "ip -6 route add table %s %s/128 via %s",
+	    VLAN_TABLE, IPV6_VLAN_DST, IPV6_VLAN_GW);
+	SYS(fail, "ip -6 rule add prio 3 iif %s lookup %s", VLAN_IFACE, VLAN_TABLE);
+
+	/*
+	 * a bond with one port and a VLAN on the bond: VLANs on a bond
+	 * live on the master, so resolution succeeds for the master's
+	 * ifindex and fails closed for a port's, matching receive, which
+	 * steers the frame to the master before VLAN processing
+	 */
+	SYS(fail, "ip link add %s type bond", BOND_IFACE);
+	SYS(fail, "ip link add %s type veth peer name %s", BOND_PORT, BOND_PORT_PEER);
+	SYS(fail, "ip link set %s master %s", BOND_PORT, BOND_IFACE);
+	SYS(fail, "ip link set dev %s up", BOND_IFACE);
+	SYS(fail, "ip link set dev %s up", BOND_PORT);
+	SYS(fail, "ip link add link %s name %s.%d type vlan id %d",
+	    BOND_IFACE, BOND_IFACE, BOND_VLAN_ID, BOND_VLAN_ID);
+	SYS(fail, "ip link set dev %s.%d up", BOND_IFACE, BOND_VLAN_ID);
+	SYS(fail, "ip route add %s/32 dev %s.%d",
+	    IPV4_BOND_VLAN_DST, BOND_IFACE, BOND_VLAN_ID);
+
+	/*
+	 * a VRF with its own dedicated subinterface (the iif rules above
+	 * must not see it), for the table-selection-by-ingress cases
+	 */
+	SYS(fail, "ip link add %s type vrf table %s", VRF_IFACE, VRF_TABLE);
+	SYS(fail, "ip link set dev %s up", VRF_IFACE);
+	SYS(fail, "ip link add link veth1 name %s type vlan id %d",
+	    VRF_VLAN_IFACE, VRF_VLAN_ID);
+	SYS(fail, "ip link set %s master %s", VRF_VLAN_IFACE, VRF_IFACE);
+	SYS(fail, "ip link set dev %s up", VRF_VLAN_IFACE);
+	SYS(fail, "ip addr add %s/24 dev %s", IPV4_VRF_IFACE_ADDR, VRF_VLAN_IFACE);
+	err = write_sysctl("/proc/sys/net/ipv4/conf/" VRF_VLAN_IFACE "/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.conf." VRF_VLAN_IFACE ".forwarding)"))
+		goto fail;
+	SYS(fail, "ip route add %s/32 via %s", IPV4_VRF_DST, IPV4_GW1);
+	SYS(fail, "ip route add table %s %s/32 via %s",
+	    VRF_TABLE, IPV4_VRF_DST, IPV4_VRF_GW);
+
+	/* neighbours on the VLAN subinterface for the non-SKIP_NEIGH cases */
+	err = write_sysctl("/proc/sys/net/ipv4/neigh/" VLAN_IFACE "/gc_stale_time", "900");
+	if (!ASSERT_OK(err, "write_sysctl(net.ipv4.neigh." VLAN_IFACE ".gc_stale_time)"))
+		goto fail;
+	SYS(fail, "ip neigh add %s dev %s lladdr %s nud stale",
+	    IPV4_VLAN_EGRESS_DST, VLAN_IFACE, DMAC);
+	SYS(fail, "ip neigh add %s dev %s lladdr %s nud stale",
+	    IPV4_VLAN_GW, VLAN_IFACE, DMAC2);
+
+	/* a VLAN on veth2 with a route in the tbid test table */
+	SYS(fail, "ip link add link veth2 name %s type vlan id %d",
+	    TBID_VLAN_IFACE, TBID_VLAN_ID);
+	SYS(fail, "ip link set dev %s up", TBID_VLAN_IFACE);
+	SYS(fail, "ip route add table 100 %s/32 dev %s",
+	    IPV4_TBID_VLAN_DST, TBID_VLAN_IFACE);
+
+	/* a locked-mtu route via the subinterface for the FRAG_NEEDED case */
+	SYS(fail, "ip route add %s/32 dev %s mtu lock 1000",
+	    IPV4_VLAN_MTU_DST, VLAN_IFACE);
+
 	return 0;
 fail:
 	return -1;
@@ -218,9 +572,16 @@ static int set_lookup_params(struct bpf_fib_lookup *params,
 	memset(params, 0, sizeof(*params));
 
 	params->l4_protocol = IPPROTO_TCP;
-	params->ifindex = ifindex;
+	params->ifindex = test->iif ? if_nametoindex(test->iif) : ifindex;
 	params->tbid = test->tbid;
 	params->mark = test->mark;
+	params->tot_len = test->tot_len;
+
+	/* h_vlan_proto/h_vlan_TCI union with tbid */
+	if (test->lookup_flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+		params->h_vlan_proto = htons(test->vlan_proto);
+		params->h_vlan_TCI = htons(test->vlan_id);
+	}
 
 	if (inet_pton(AF_INET6, test->daddr, params->ipv6_dst) == 1) {
 		params->family = AF_INET6;
@@ -298,7 +659,7 @@ void test_fib_lookup(void)
 	struct nstoken *nstoken = NULL;
 	struct __sk_buff skb = { };
 	struct fib_lookup *skel;
-	int prog_fd, err, ret, i;
+	int prog_fd, xdp_fd, err, ret, i;
 
 	/* The test does not use the skb->data, so
 	 * use pkt_v6 for both v6 and v4 test.
@@ -309,11 +670,16 @@ void test_fib_lookup(void)
 		    .ctx_in = &skb,
 		    .ctx_size_in = sizeof(skb),
 	);
+	LIBBPF_OPTS(bpf_test_run_opts, xdp_opts,
+		    .data_in = &pkt_v6,
+		    .data_size_in = sizeof(pkt_v6),
+	);
 
 	skel = fib_lookup__open_and_load();
 	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
 		return;
 	prog_fd = bpf_program__fd(skel->progs.fib_lookup);
+	xdp_fd = bpf_program__fd(skel->progs.fib_lookup_xdp);
 
 	SYS(fail, "ip netns add %s", NS_TEST);
 
@@ -352,6 +718,21 @@ void test_fib_lookup(void)
 		if (tests[i].expected_dst)
 			assert_dst_ip(fib_params, tests[i].expected_dst);
 
+		if (tests[i].expected_dev)
+			ASSERT_EQ(fib_params->ifindex,
+				  if_nametoindex(tests[i].expected_dev), "ifindex");
+
+		if (tests[i].expected_mtu)
+			ASSERT_EQ(fib_params->mtu_result, tests[i].expected_mtu,
+				  "mtu_result");
+
+		if (tests[i].check_vlan) {
+			ASSERT_EQ(fib_params->h_vlan_proto,
+				  htons(tests[i].vlan_proto), "h_vlan_proto");
+			ASSERT_EQ(fib_params->h_vlan_TCI,
+				  htons(tests[i].vlan_id), "h_vlan_TCI");
+		}
+
 		ret = memcmp(tests[i].dmac, fib_params->dmac, sizeof(tests[i].dmac));
 		if (!ASSERT_EQ(ret, 0, "dmac not match")) {
 			char expected[18], actual[18];
@@ -361,17 +742,182 @@ void test_fib_lookup(void)
 			printf("dmac expected %s actual %s ", expected, actual);
 		}
 
-		// ensure tbid is zero'd out after fib lookup.
-		if (tests[i].lookup_flags & BPF_FIB_LOOKUP_DIRECT) {
+		/*
+		 * ensure tbid is zero'd out after fib lookup. With
+		 * BPF_FIB_LOOKUP_VLAN the union holds the packed vlan
+		 * fields instead, so skip the check for those.
+		 */
+		if ((tests[i].lookup_flags & BPF_FIB_LOOKUP_DIRECT) &&
+		    !(tests[i].lookup_flags & BPF_FIB_LOOKUP_VLAN)) {
 			if (!ASSERT_EQ(skel->bss->fib_params.tbid, 0,
 					"expected fib_params.tbid to be zero"))
 				goto fail;
 		}
 	}
 
+	/*
+	 * Re-run the cases through bpf_xdp_fib_lookup(). test_run uses the
+	 * current netns' loopback for ctx->rxq->dev, so dev_net() is NS_TEST
+	 * and the lookup runs against its FIB. The path-independent results
+	 * (return code, swapped ifindex, vlan tag, gateway) must match the skb
+	 * path; the no-tot_len mtu_result is skb-specific and not rechecked.
+	 */
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		if (set_lookup_params(fib_params, &tests[i], skb.ifindex))
+			continue;
+
+		skel->bss->fib_lookup_ret = -1;
+		skel->bss->lookup_flags = tests[i].lookup_flags;
+
+		err = bpf_prog_test_run_opts(xdp_fd, &xdp_opts);
+		if (!ASSERT_OK(err, "xdp test_run"))
+			continue;
+
+		if (!ASSERT_EQ(skel->bss->fib_lookup_ret, tests[i].expected_ret,
+			       "xdp fib_lookup_ret"))
+			printf("(xdp) %s\n", tests[i].desc);
+
+		if (tests[i].expected_dev)
+			ASSERT_EQ(fib_params->ifindex,
+				  if_nametoindex(tests[i].expected_dev),
+				  "xdp ifindex");
+
+		if (tests[i].expected_dst)
+			assert_dst_ip(fib_params, tests[i].expected_dst);
+
+		if (tests[i].check_vlan) {
+			ASSERT_EQ(fib_params->h_vlan_proto,
+				  htons(tests[i].vlan_proto), "xdp h_vlan_proto");
+			ASSERT_EQ(fib_params->h_vlan_TCI,
+				  htons(tests[i].vlan_id), "xdp h_vlan_TCI");
+		}
+	}
+
 fail:
 	if (nstoken)
 		close_netns(nstoken);
 	SYS_NOFAIL("ip netns del " NS_TEST);
 	fib_lookup__destroy(skel);
 }
+
+#define NS_VLAN_A	"fib_lookup_vlan_ns_a"
+#define NS_VLAN_B	"fib_lookup_vlan_ns_b"
+
+/*
+ * A VLAN device can be moved to another netns while staying registered
+ * on its parent. Neither direction may then cross the boundary: the
+ * egress flag must not publish the foreign parent's ifindex, and the
+ * input flag must fail closed rather than use a foreign ingress.
+ */
+void test_fib_lookup_vlan_netns(void)
+{
+	struct bpf_fib_lookup *fib_params;
+	struct nstoken *nstoken = NULL;
+	struct __sk_buff skb = { };
+	struct fib_lookup *skel = NULL;
+	int prog_fd, err, parent_idx, vlan_idx;
+
+	LIBBPF_OPTS(bpf_test_run_opts, run_opts,
+		    .data_in = &pkt_v6,
+		    .data_size_in = sizeof(pkt_v6),
+		    .ctx_in = &skb,
+		    .ctx_size_in = sizeof(skb),
+	);
+
+	skel = fib_lookup__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
+		return;
+	prog_fd = bpf_program__fd(skel->progs.fib_lookup);
+	fib_params = &skel->bss->fib_params;
+
+	SYS(fail, "ip netns add %s", NS_VLAN_A);
+	SYS(fail, "ip netns add %s", NS_VLAN_B);
+
+	nstoken = open_netns(NS_VLAN_A);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns(a)"))
+		goto fail;
+
+	SYS(fail, "ip link add veth7 type veth peer name veth8");
+	SYS(fail, "ip link set dev veth7 up");
+	SYS(fail, "ip link add link veth7 name veth7.66 type vlan id 66");
+	SYS(fail, "ip link set veth7.66 netns %s", NS_VLAN_B);
+
+	parent_idx = if_nametoindex("veth7");
+	if (!ASSERT_NEQ(parent_idx, 0, "if_nametoindex(veth7)"))
+		goto fail;
+
+	/*
+	 * input: the moved device is still in veth7's VLAN group, but it
+	 * lives in another netns, so the lookup must fail closed
+	 */
+	skb.ifindex = parent_idx;
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = parent_idx;
+	fib_params->h_vlan_proto = htons(ETH_P_8021Q);
+	fib_params->h_vlan_TCI = htons(66);
+	if (!ASSERT_EQ(inet_pton(AF_INET, "10.66.0.2", &fib_params->ipv4_dst),
+		       1, "inet_pton(dst)"))
+		goto fail;
+
+	skel->bss->fib_lookup_ret = -1;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN_INPUT |
+				  BPF_FIB_LOOKUP_SKIP_NEIGH;
+	err = bpf_prog_test_run_opts(prog_fd, &run_opts);
+	if (!ASSERT_OK(err, "test_run(input)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->fib_lookup_ret, BPF_FIB_LKUP_RET_NOT_FWDED,
+		  "input across netns fails closed");
+	ASSERT_EQ(fib_params->ifindex, parent_idx, "ifindex untouched");
+	ASSERT_EQ(fib_params->h_vlan_TCI, htons(66), "tag untouched");
+
+	close_netns(nstoken);
+	nstoken = open_netns(NS_VLAN_B);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns(b)"))
+		goto fail;
+
+	/*
+	 * egress: the fib result is the VLAN device here, but its parent
+	 * is in the other netns, so the swap must not happen
+	 */
+	SYS(fail, "ip link set dev veth7.66 up");
+	SYS(fail, "ip addr add 10.66.0.1/24 dev veth7.66");
+	err = write_sysctl("/proc/sys/net/ipv4/conf/veth7.66/forwarding", "1");
+	if (!ASSERT_OK(err, "write_sysctl(forwarding)"))
+		goto fail;
+
+	vlan_idx = if_nametoindex("veth7.66");
+	if (!ASSERT_NEQ(vlan_idx, 0, "if_nametoindex(veth7.66)"))
+		goto fail;
+
+	skb.ifindex = vlan_idx;
+	memset(fib_params, 0, sizeof(*fib_params));
+	fib_params->family = AF_INET;
+	fib_params->l4_protocol = IPPROTO_TCP;
+	fib_params->ifindex = vlan_idx;
+	if (!ASSERT_EQ(inet_pton(AF_INET, "10.66.0.2", &fib_params->ipv4_dst),
+		       1, "inet_pton(dst)") ||
+	    !ASSERT_EQ(inet_pton(AF_INET, "10.66.0.1", &fib_params->ipv4_src),
+		       1, "inet_pton(src)"))
+		goto fail;
+
+	skel->bss->fib_lookup_ret = -1;
+	skel->bss->lookup_flags = BPF_FIB_LOOKUP_VLAN |
+				  BPF_FIB_LOOKUP_SKIP_NEIGH;
+	err = bpf_prog_test_run_opts(prog_fd, &run_opts);
+	if (!ASSERT_OK(err, "test_run(egress)"))
+		goto fail;
+	ASSERT_EQ(skel->bss->fib_lookup_ret, BPF_FIB_LKUP_RET_SUCCESS,
+		  "egress lookup succeeds");
+	ASSERT_EQ(fib_params->ifindex, vlan_idx,
+		  "foreign parent not published");
+	ASSERT_EQ(fib_params->h_vlan_TCI, 0, "vlan fields zero");
+
+fail:
+	if (nstoken)
+		close_netns(nstoken);
+	SYS_NOFAIL("ip netns del " NS_VLAN_A);
+	SYS_NOFAIL("ip netns del " NS_VLAN_B);
+	fib_lookup__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/fib_lookup.c b/tools/testing/selftests/bpf/progs/fib_lookup.c
index 7b5dd2214ff4..f43e22d33814 100644
--- a/tools/testing/selftests/bpf/progs/fib_lookup.c
+++ b/tools/testing/selftests/bpf/progs/fib_lookup.c
@@ -19,4 +19,13 @@ int fib_lookup(struct __sk_buff *skb)
 	return TC_ACT_SHOT;
 }
 
+SEC("xdp")
+int fib_lookup_xdp(struct xdp_md *ctx)
+{
+	fib_lookup_ret = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params),
+					lookup_flags);
+
+	return XDP_DROP;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v14 6/9] tls: device: add TX KeyUpdate support
From: Jakub Kicinski @ 2026-06-17 22:56 UTC (permalink / raw)
  To: Rishikesh Jethwani
  Cc: netdev, saeedm, tariqt, mbloch, borisp, john.fastabend, sd, davem,
	pabeni, edumazet, leon
In-Reply-To: <CAKaoeS3mTGUo-dvtdmQfz3JAnn69-e36xrj9YmjbWTnsiw9uqg@mail.gmail.com>

On Wed, 17 Jun 2026 15:32:58 -0700 Rishikesh Jethwani wrote:
> From: Rishikesh Jethwani <rjethwani@everpuredata.com>

Please keep in mind that net-next is currently closed.
You need to wait until the merge window is over (2 weeks)
before reposting.

^ permalink raw reply

* Re: [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Gustavo A. R. Silva @ 2026-06-17 22:59 UTC (permalink / raw)
  To: Ilya Maximets, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kees Cook, Gustavo A. R. Silva, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt, linux-kernel,
	linux-hardening, llvm, Johan Thomsen
In-Reply-To: <b16e8bac-e149-4052-b1cb-8fd3e1137f9c@ovn.org>



On 6/17/26 16:01, Ilya Maximets wrote:
> On 6/17/26 10:08 PM, Gustavo A. R. Silva wrote:
>> Hi,
>>
>> On 6/16/26 04:03, Ilya Maximets wrote:
>>> kmalloc_flex() in metadata_dst_alloc() sets __counted_by for the
>>> structure to the options_len, which is then initialized to zero.
>>> Later, we're initializing the structure by copying the tunnel info
>>> together with the options, and this triggers a warning for a potential
>>> memcpy overflow, since the compiler estimates that the options can't
>>> fit into the structure, even though the memory for them is actually
>>> allocated.
>>>
>>>    memcpy: detected buffer overflow: 104 byte write of buffer size 96
>>>    WARNING: CPU: X PID: Y at lib/string_helpers.c:1036 __fortify_report
>>>     skb_tunnel_info_unclone+0x179/0x190
>>>     geneve_xmit+0x7fe/0xe00
>>
>> This warning has nothing to do with counted_by. See below for more
>> comments.
>>
>>>
>>> The issue is triggered when built with clang and source fortification.
>>>
>>> Fix that by doing the copy in two stages: first - the main data with
>>> the options_len, then the options.  This way the correct length should
>>> be known at the time of the copy.
>>>
>>> It would be better if the options_len never changed after allocation,
>>> but the allocation code is a little separate from the initialization
>>> and it would be awkward and potentially dangerous to return a struct
>>> with options_len set to a non-zero value from the metadata_dst_alloc().
>>>
>>> Another option would be to use ip_tunnel_info_opts_set(), but it is
>>> doing too many unnecessary operations for the use case here.
>>>
>>> Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
>>> Reported-by: Johan Thomsen <write@ownrisk.dk>
>>> Closes: https://lore.kernel.org/netdev/CAKv6aAM8_EWgXScnKmKYm_4SwGDVBK++dzfP+Y6msUXbp99QUw@mail.gmail.com/
>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>> ---
>>>
>>> Johan, if you can test this one in your setup as well, that would
>>> be great.  Thanks.
>>>
>>>    include/net/dst_metadata.h | 7 +++++--
>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>>> index 1fc2fb03ce3f..f45d1e3163f0 100644
>>> --- a/include/net/dst_metadata.h
>>> +++ b/include/net/dst_metadata.h
>>> @@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>>>    	if (!new_md)
>>>    		return ERR_PTR(-ENOMEM);
>>>    
>>> -	memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>>> -	       sizeof(struct ip_tunnel_info) + md_size);
>>
>> What's going on here is that, internally, fortified memcpy() retrieves
>> the destination size via __builtin_dynamic_object_size() in mode 1.
>>
>> That is:
>>
>> __builtin_dynamic_object_size(&new_md->u.tun_info, 1)
>>
>> For the above case, Clang returns sizeof(new_md->u.tun_info) == 96.
>>
>> So the warning is reporting that 104 bytes don't fit in an object of
>> size 96 bytes, regardless of any counted_by annotation or allocation.
> 
> Hmm.  Does __builtin_dynamic_object_size(&new_md->u.tun_info, 1) return
> 104 when the options_len is 8?  If so, isn't that because it is counted
> by that field?  Asking because the fortification doesn't complain if we
> keep the full 104-byte copy as-is, but set the options_len beforehand,
> as tested by Johan.

I see. If that is the case, then, internally, fortified memcpy() ends up
using mode 0 instead of mode 1. Something like this:

__builtin_dynamic_object_size(&new_md->u.tun_info, 0)

The above will effectively consider the allocation and counted_by because
it will interpret new_md->u.tun_info as an open-ended object due to the
flexible-array member (in struct ip_tunnel_info) whose size is determined
by counted_by.

I'm not entirely convinced we really want this.

-Gustavo

> 
>>
>> Of course, in this case, the write of 104 bytes into new_md->u.tun_info
>> is intentional and controlled, but what if it weren't?
>>
>> On the other hand, for this same case, GCC currently returns SIZE_MAX,
>> which translates to -1 inside fortified memcpy(). Thus, bounds-checking
>> is bypassed, which is why this warning doesn't show up with GCC.
>>
>> However, this is a bug in GCC. We're already looking into that.
>>
>> I think we've had just a handful of cases like this across the whole
>> kernel tree. We can deal with them as you did here (by directly copying
>> the composite structure first, and then using memcpy() to copy into the
>> flexible-array member). If these cases ever become more common, we
>> could create some kind of helper to do both things at once. :)
>>
>>> +	/* Copy in two stages to keep the __counted_by happy. */
>>
>> So based on my comments above, this code comment is not correct.
> 
> I feel like some comment is still needed, do you have some suggestions
> for what would be a better wording?
> 
>>
>>> +	new_md->u.tun_info = md_dst->u.tun_info;
>>
>> This is fine.
>>
>>> +	memcpy(ip_tunnel_info_opts(&new_md->u.tun_info),
>>> +	       ip_tunnel_info_opts(&md_dst->u.tun_info), md_size);
>>
>> Is ip_tunnel_info_opts() really needed here?
>>
>> Probably this works just fine:
>>
>> memcpy(new_md->u.tun_info.options, md_dst->u.tun_info.options, md_size);
> 
> The logic here is: we have the access function, therefore we should use it.
> It gives a bad example if we don't.
> 
> Best regards, Ilya Maximets.


^ permalink raw reply

* Re: [ANN] netdev development stats for 7.2
From: Jacob Keller @ 2026-06-17 23:19 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
In-Reply-To: <20260617115319.43a5942d@kernel.org>

On 6/17/2026 11:53 AM, Jakub Kicinski wrote:
> Top scores (positive):               Top scores (negative):              
>    1 (   ) [768] Jakub Kicinski         1 ( +1) [91] Tariq Toukan        
>    2 (   ) [376] Simon Horman           2 ( +8) [86] Wei Fang            
>    3 (   ) [346] Andrew Lunn            3 ( +4) [67] Ratheesh Kannoth    
>    4 (   ) [265] Paolo Abeni            4 (***) [54] javen               
>    5 ( +4) [ 91] Ido Schimmel           5 ( +6) [49] Lorenzo Bianconi    
>    6 (+14) [ 74] David Laight           6 (***) [48] Luiz Angelo Daros de Luca
>    7 (   ) [ 62] Krzysztof Kozlowski    7 (***) [43] Simon Wunderlich    
>    8 ( +2) [ 57] Aleksandr Loktionov    8 (***) [38] Chuck Lever         
>    9 (+12) [ 50] Nikolay Aleksandrov    9 (+18) [38] Grzegorz Nitka      
>   10 ( -4) [ 49] Willem de Bruijn      10 (***) [35] Pablo Neira Ayuso   
>   11 ( +3) [ 49] Sabrina Dubroca       11 (***) [35] Markus Stockhausen  
>   12 (+41) [ 47] Alexander Lobakin     12 (***) [34] Selvamani Rajagopal 
>   13 (+24) [ 47] Maxime Chevallier     13 (***) [34] Jason Xing          
>   14 ( -6) [ 46] David Ahern           14 ( -8) [33] Illusion Wang       
>   15 (***) [ 43] Jiayuan Chen          15 (***) [30] Minxi Hou       
>  
> One process note on the reviewer score. Tariq tops the negative list. 
> I've been returning to the question of whether it's fair since 
> he has to handle submissions of most of nVidia's patches.
> Still, I don't understand why reading thru the list and reviewing
> one patchset from another company a day is too much to ask.
> 

This is a difficult question. When I've covered for Tony in a similar
position, I've felt like it is hard enough to keep an eye on our own
list let alone also finding time to review other places.

A positive note here is that nVidia is now green overall, so at least
there is some participation from the company as a whole. On the other
hand, Tony isn't in the top negatives despite performing a somewhat
similar role.

I know I was lacking myself in the last cycle due to a bunch of
unrelated work and issues. I've been working to get review back into my
daily flow.

^ permalink raw reply

page:              | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox