Netdev List

* Re: [PATCH net-next 1/3] net: busy-poll: introduce sk_tx_busy_loop()
From: Jason Xing @ 2026-06-17 10:19 UTC
  To: Maciej Fijalkowski
  Cc: Menglong Dong, menglong8.dong, Jakub Kicinski, jasowang, mst,
	xuanzhuo, eperezma, andrew+netdev, davem, edumazet, pabeni,
	magnus.karlsson, sdf, horms, ast, daniel, hawk, john.fastabend,
	bjorn, netdev, virtualization, linux-kernel, bpf
In-Reply-To: <ajJrckiXEUztBQDz@boxer>

On Wed, Jun 17, 2026 at 5:40 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Sun, Jun 14, 2026 at 06:12:46PM +0800, Menglong Dong wrote:
> > On 2026/6/14 02:21, Jakub Kicinski wrote:
> > > On Thu, 11 Jun 2026 15:12:40 +0800 menglong8.dong@gmail.com wrote:
> > > > For now, we use sk_busy_loop() for both rx and tx path. The sk_busy_loop()
> > > > will call napi_busy_loop() for the specified napi_id. However, some
> > > > nic drivers have tx napi, such as virtio-net. In this case, sk_busy_loop()
> > > > doesn't work, as it can only schedule the NAPI for the rx queue.
> > > >
> > > > Therefore, introduce sk_tx_busy_loop() for the nic drivers that support tx
> > > > napi, which will schedule the tx napi if available.
> > >
> > > First, I thought the only difference with Tx NAPI is that it can't be
> > > busy polled. So if you want to poll an instance don't register it as
> > > a Tx one instead of adding all this "tx polling" stuff in the core?
> >
> > I see. Registering the tx NAPI with netif_napi_add_config() allows us
> > to busy poll it. But we still have two NAPI instances: rx NAPI and tx
> > NAPI. sk_busy_loop() can only busy poll one of them.
> >
> > Before AF_XDP, we didn't need to send packets via the tx NAPI, which
> > meant that we didn't need to busy poll it.
> >
> > I analyzed how some NIC drivers implement AF_XDP. Some of them
> > check the xsk tx ring of the current queue and send the data in it from
> > the rx NAPI, such as mlx5. Some of them allocate an extra "rxtx" NAPI
> > for the AF_XDP zero-copy queue, which polls both data receiving
> > and sending.
> >
> > In the cases above, the data sending and receiving for AF_XDP are
> > done in a single NAPI instance.
> >
> > However, some drivers receive data in the rx NAPI and send data in
> > the tx NAPI for AF_XDP. In this case, we can't use sk_busy_loop() for
> > both the rx path and the tx path, as we need to wake different NAPI
> > instances.
> >
> > >
> > > Second, can this problem happen for any other NIC or is it purely
> > > an artifact of virtio's delayed Tx completion handling?
> >
> > According to my analysis, only the virtio-net and ICSSG drivers have
> > split NAPI for AF_XDP. I don't have an ICSSG NIC, but the codex tells
> > me that it does have the same problem.
> >
> > I'm not sure if it is a good idea to introduce sk_tx_busy_loop().
> > Maybe we can modify the driver instead to use the same NAPI
> > for both data sending and receiving, just like others do. The
> > advantage of introducing sk_tx_busy_loop() is that we can split
> > data sending and receiving, which may be more efficient.
>
> It would be good if you backed your changes with performance numbers. I
> believe that drivers do tx processing via rx napi because before AF_XDP it
> was only about cleaning up writebacks; AF_XDP added more weight via actual
> tx descriptor submission.
>
> Maybe you can vibe-code virtio-net to work only with rx napi and see what
> the results are.
>
> Side note/question - Do you have a tx-only use case for AF_XDP? I am
> planning (for a long time actually) to implement asymmetric AF_XDP
> sockets. Currently for ZC scenarios xsk socket occupies both rx and tx
> queues even when you do rx or tx only.

As far as I know, since I use TCP as the userspace protocol, I don't
have any idea how we could apply this. It seems you have a real-world
requirement for it? Interesting.

Thanks,
Jason

>
> >
> > >
> > > Third, this series does not apply.
> >
> > Ah, I'll rebase this series if a V2 is acceptable.
> >
> > Thanks!
> > Menglong Dong
> >
> > >
> > >
> >
> >
> >
> >

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Petr Mladek @ 2026-06-17 10:12 UTC
  To: Sebastian Andrzej Siewior
  Cc: Jakub Kicinski, John Ogness, Sergey Senozhatsky, Peter Zijlstra,
	Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260616153122.keHMKvVT@linutronix.de>

On Tue 2026-06-16 17:31:22, Sebastian Andrzej Siewior wrote:
> On 2026-06-16 08:11:28 [-0700], Jakub Kicinski wrote:
> > > 
> > > Adding sched and printk folks for opinions while eyeballing
> > > WARN_ON_DEFERRED().
> > 
> > Thanks a lot for looking into this! To be clear - the printk_deferred /
> > WARN_DEFERRED would be just for stable? Or there's still some
> > sensitivity even with nbcon?
> 
> We already have printk_deferred(). WARN_DEFERRED() would be new. I
> *think* this is not limited to netpoll/netconsole but affects all console
> drivers not using CON_NBCON if the printk (via WARN) occurs with the rq
> lock held.
> I don't remember all the details but printk_deferred() was introduced to
> circumvent this until printk is fixed.

Just to make it clear. The problem with the legacy consoles is that
they are called under console_lock() which is a semaphore. And it
calls wake_up_process() in console_unlock() when there is another
waiter on the lock.

> Once we get rid of those legacy drivers and NBCON is the default we can
> get rid of printk_deferred() :)

Yup.

Best Regards,
Petr

* [PATCH net v2] net/mlx5e: macsec: fix use-after-free of metadata_dst on RX SC delete
From: Doruk Tan Ozturk @ 2026-06-17 10:05 UTC
  To: saeedm, leon, tariqt, mbloch, sd, andrew+netdev, davem, edumazet,
	kuba, pabeni
  Cc: borisp, raeds, ehakim, netdev, linux-rdma, linux-kernel,
	Doruk Tan Ozturk, stable

When an offloaded MACsec RX SC is deleted, macsec_del_rxsc_ctx() releases
the per-SC metadata_dst with metadata_dst_free(), which calls kfree()
unconditionally and ignores the dst reference count. The RX datapath in
mlx5e_macsec_offload_handle_rx_skb() looks up the SC under rcu_read_lock()
via xa_load() and, while still holding only the RCU read lock, takes a
reference with dst_hold() and attaches the dst to the skb with
skb_dst_set().

A reader that has already obtained the rx_sc pointer can therefore race
with the delete path:

  CPU0 (del_rxsc)                 CPU1 (rx datapath)
  --------------                  ------------------
                                  rcu_read_lock();
                                  rx_sc = xa_load(...)->rx_sc;
  xa_erase(...);
  metadata_dst_free(rx_sc->md_dst); /* kfree(), ignores refcount */
                                  dst_hold(&rx_sc->md_dst->dst); /* UAF */
                                  skb_dst_set(skb, &rx_sc->md_dst->dst);

metadata_dst_free() frees the object even though the datapath still holds
(or is about to take) a reference, so the subsequent dst_hold() /
skb_dst_set() and the later skb free operate on freed memory.

Fix the owner side by dropping the reference with dst_release() instead of
freeing unconditionally. dst_release() only schedules the RCU-deferred
dst_destroy() once the reference count reaches zero, so a concurrent reader
that still holds a reference keeps the object alive.

Dropping the owner reference is not sufficient on its own: once the owner
reference is the last one, dst_release() drops the count to zero and the
destroy is merely RCU-deferred. A racing reader that runs plain dst_hold()
on that already-dead dst gets rcuref_get() == false but dst_hold() only
WARNs and attaches the dying dst to the skb anyway; the later skb free then
calls dst_release() on an object whose destroy is already
scheduled, again a use-after-free.

Convert the RX datapath to dst_hold_safe(), which returns false
(without warning) when the dst is already dead, and only attach it to
the skb when a reference was successfully taken. When the SC is being
deleted the in-flight packet simply proceeds without the offload
metadata_dst: skb_metadata_dst() returns NULL, the MACsec core sees
!is_macsec_md_dst and skips this secy (rx_uses_md_dst path), which is
the correct behaviour for a packet whose SC is going away.

Fixes: b7c9400cbc48 ("net/mlx5e: Implement MACsec Rx data path using MACsec skb_metadata_dst")
Cc: stable@vger.kernel.org
Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
---
v2: also convert the RX datapath dst_hold() to dst_hold_safe() so a reader
    racing the SC delete cannot attach a dst whose last reference was just
    dropped (per the automated review forwarded by Simon Horman).

v1: https://lore.kernel.org/netdev/20260615140534.52691-1-doruk@0sec.ai/

 drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
index 71b3a059c964..e5d9a14c92b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
@@ -829,7 +829,7 @@ static void macsec_del_rxsc_ctx(struct mlx5e_macsec *macsec, struct mlx5e_macsec
 	 */
 	list_del_rcu(&rx_sc->rx_sc_list_element);
 	xa_erase(&macsec->sc_xarray, rx_sc->sc_xarray_element->fs_id);
-	metadata_dst_free(rx_sc->md_dst);
+	dst_release(&rx_sc->md_dst->dst);
 	kfree(rx_sc->sc_xarray_element);
 	kfree_rcu_mightsleep(rx_sc);
 }
@@ -1697,8 +1697,8 @@ void mlx5e_macsec_offload_handle_rx_skb(struct net_device *netdev,
 	sc_xarray_element = xa_load(&macsec->sc_xarray, fs_id);
 	rx_sc = sc_xarray_element->rx_sc;
 	if (rx_sc) {
-		dst_hold(&rx_sc->md_dst->dst);
-		skb_dst_set(skb, &rx_sc->md_dst->dst);
+		if (dst_hold_safe(&rx_sc->md_dst->dst))
+			skb_dst_set(skb, &rx_sc->md_dst->dst);
 	}

 	rcu_read_unlock();
-- 
2.43.0
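
The two halves of the fix described in this patch (owner drops a reference
instead of freeing, reader attaches only if it could take a reference) can
be modelled in plain userspace C. This is a minimal sketch under stated
assumptions, not the kernel code: struct toy_dst and the toy_* helpers are
hypothetical names, a C11 atomic counter stands in for the kernel's rcuref,
and the RCU-deferred destroy is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a refcounted dst; NOT the kernel's struct dst_entry. */
struct toy_dst {
	atomic_int refcnt;	/* 0 means "dead, destroy already scheduled" */
};

/* Analogue of dst_hold_safe(): take a reference only if the count is nonzero. */
static bool toy_hold_safe(struct toy_dst *dst)
{
	int old = atomic_load(&dst->refcnt);

	while (old > 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&dst->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Analogue of dst_release(): returns true when the last reference was dropped. */
static bool toy_release(struct toy_dst *dst)
{
	int old = atomic_fetch_sub(&dst->refcnt, 1);

	assert(old > 0);	/* releasing a dead object is a caller bug */
	return old == 1;
}

/* Reader side: attach the dst only when a reference was actually taken. */
static struct toy_dst *toy_rx_attach(struct toy_dst *dst)
{
	return toy_hold_safe(dst) ? dst : NULL;
}
```

In this model a reader that arrives after the owner's final toy_release()
gets NULL and carries on without the metadata, which mirrors the behaviour
the commit message describes for a packet whose SC is going away.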

* [PATCH net v2] ice: eswitch: fix use-after-free of metadata_dst in repr release
From: Doruk Tan Ozturk @ 2026-06-17 10:05 UTC
  To: anthony.l.nguyen, przemyslaw.kitszel, andrew+netdev, davem,
	edumazet, kuba, pabeni
  Cc: michal.swiatkowski, wojciech.drewek, horms, intel-wired-lan,
	netdev, linux-kernel, Doruk Tan Ozturk, stable

ice_eswitch_release_repr() frees the port representor metadata_dst via
metadata_dst_free(), which directly kfree()s the object and ignores the
dst_entry refcount. The eswitch slow-path TX routine
ice_eswitch_port_start_xmit() takes a reference on this dst with
dst_hold() and attaches it to the skb via skb_dst_set(). If such an skb
is still in flight (e.g. queued in a qdisc) when the representor is torn
down, the metadata_dst is freed while the skb still points at it. When
the skb is later freed, dst_release() operates on already-freed memory.

Replace metadata_dst_free() with dst_release() so the metadata_dst is
freed only after the last reference is dropped. The dst subsystem frees
metadata_dst objects from dst_destroy() once the refcount reaches zero
(DST_METADATA is set by metadata_dst_alloc()).

Same class of bug and fix as commit c32b26aaa2f9 ("netfilter:
nft_tunnel: fix use-after-free on object destroy").

Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment")
Cc: stable@vger.kernel.org
Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
Reviewed-by: Simon Horman <horms@kernel.org>
---
 v2:
  - Correct the Fixes: tag to the commit that introduced the switchdev
    teardown (Simon Horman); add his Reviewed-by. No functional change.
 v1: https://lore.kernel.org/netdev/20260615140532.52676-1-doruk@0sec.ai/

 drivers/net/ethernet/intel/ice/ice_eswitch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
index 2e4f0969035f..41b30a7ca4a9 100644
--- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
+++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
@@ -95,7 +95,7 @@ ice_eswitch_release_repr(struct ice_pf *pf, struct ice_repr *repr)
 		return;

 	ice_vsi_update_security(vsi, ice_vsi_ctx_set_antispoof);
-	metadata_dst_free(repr->dst);
+	dst_release(&repr->dst->dst);
 	repr->dst = NULL;
 	ice_fltr_add_mac_and_broadcast(vsi, repr->parent_mac,
 				       ICE_FWD_TO_VSI);
-- 
2.43.0

* Re: [PATCH net-next 0/3] selftests/xsk: stabilize timeout test behavior
From: Jason Xing @ 2026-06-17 10:03 UTC
  To: Maciej Fijalkowski
  Cc: Tushar Vyavahare, netdev, magnus.karlsson, stfomichev, kernelxing,
	davem, kuba, pabeni, ast, daniel, tirthendu.sarkar, bpf
In-Reply-To: <ajJsMj0QMOF5I8qq@boxer>

On Wed, Jun 17, 2026 at 5:43 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Wed, Jun 17, 2026 at 07:39:06AM +0800, Jason Xing wrote:
> > Hi Tushar,
> >
> > On Tue, Jun 16, 2026 at 11:50 PM Tushar Vyavahare
> > <tushar.vyavahare@intel.com> wrote:
> > >
> > > This series improves AF_XDP selftests by making timeout handling
> > > explicit and fixing sources of non-determinism in xsk timeout tests.
> > >
> > > Patch 1 introduces test_spec::poll_tmout and removes implicit
> > > dependence on RX UMEM setup state for timeout behavior.
> > >
> > > Patch 2 fixes thread harness sequencing by attaching XDP programs
> > > before worker startup, removing signal-based termination, and using
> > > barrier synchronization only for dual-thread runs.
> > >
> > > Patch 3 restores shared_umem after POLL_TXQ_FULL so test-local
> > > configuration does not leak into subsequent cases on shared-netdev
> > > runs.
> > >
> > > Together these changes make timeout handling easier to follow and
> > > improve selftest stability, especially on real NIC runs.
> >
> > net-next is closed, but in the meantime I'll review the series ASAP.
> >
> > BTW, another thing about selftests I had in my mind is that are you
> > planning to work on this [1]?
>
> This one is on me. I took your changes Jason and aligned ZC batching side
> to this behavior, followed by xskxceiver adjustment. I am planning to send
> this today EOD, however let's see how badly internal Sashiko will kick my
> ass.

Thanks. We'll see :)

Thanks,
Jason

>
> >
> > [1]: https://lore.kernel.org/all/20260520004244.55663-1-kerneljasonxing@gmail.com/
> >
> > Thanks,
> > Jason
> >
> > >
> > > Tushar Vyavahare (3):
> > >   selftests/xsk: make poll timeout mode explicit
> > >   selftests/xsk: fix timeout thread harness sequencing
> > >   selftests/xsk: restore shared_umem after POLL_TXQ_FULL
> > >
> > >  .../selftests/bpf/prog_tests/test_xsk.c       | 96 +++++++++++--------
> > >  .../selftests/bpf/prog_tests/test_xsk.h       |  2 +
> > >  2 files changed, 56 insertions(+), 42 deletions(-)
> > >
> > > --
> > > 2.43.0
> > >
> > >
> >

* Re: [syzbot] [net?] WARNING in tls_err_abort
From: Sabrina Dubroca @ 2026-06-17 10:00 UTC
  To: Jakub Kicinski
  Cc: syzbot, davem, edumazet, horms, john.fastabend, linux-kernel,
	netdev, pabeni, syzkaller-bugs
In-Reply-To: <ajHENGWdBcbQUpWS@krikkit>

2026-06-16, 23:46:28 +0200, Sabrina Dubroca wrote:
> 2026-06-16, 14:23:59 -0700, Jakub Kicinski wrote:
> > On Tue, 16 Jun 2026 23:00:54 +0200 Sabrina Dubroca wrote:
> > > 2026-06-16, 08:28:16 -0700, Jakub Kicinski wrote:
> > > > On Tue, 16 Jun 2026 17:19:22 +0200 Sabrina Dubroca wrote:  
> > > > Don't we depend on it being set
> > > > to avoid further state transitions once we hit a crypto error?  
> > > 
> > > I kind of thought so too.
> > 
> > In which case the question is whether we should try to remove 
> > the sock_error() instead? (stating the obvious I guess)
> 
> That would make sense, but we can't prevent sock_error() being called
> from some helper.

Actually, getsockopt(SO_ERROR) will also clear sk_err. If we want to
prevent further state transitions, we'll have to use something else
(probably a flag in tls_context set by tls_err_abort()).

So I'd go with 2 separate patches. The 2nd one will be a change in
userspace-visible behavior, but hopefully not one that users would be
upset about.

-- 
Sabrina
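
The point above (sk_err is cleared by readers such as sock_error() or
getsockopt(SO_ERROR), so it cannot double as a "no further state
transitions" marker) can be sketched as a small userspace model. Everything
here is an assumption made for illustration: the toy_* types and the
'aborted' field are hypothetical and do not reflect the real tls_context
layout or any posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy socket: the error is consumed (cleared) when it is read. */
struct toy_sock {
	int sk_err;
};

/* Toy TLS context with a sticky flag; 'aborted' is a hypothetical name. */
struct toy_tls_ctx {
	bool aborted;
};

/* Model of an abort: report the error AND latch the sticky flag. */
static void toy_tls_err_abort(struct toy_sock *sk, struct toy_tls_ctx *ctx,
			      int err)
{
	assert(err > 0);
	sk->sk_err = err;
	ctx->aborted = true;
}

/* Model of a read-and-clear error fetch: returns -err and resets it to 0. */
static int toy_sock_error(struct toy_sock *sk)
{
	int err = sk->sk_err;

	sk->sk_err = 0;
	return -err;
}

/* State transitions are gated on the sticky flag, not on sk_err. */
static bool toy_tls_may_proceed(const struct toy_tls_ctx *ctx)
{
	return !ctx->aborted;
}
```

After one toy_sock_error() call the error is gone from the socket, yet
toy_tls_may_proceed() still reports false, which is the property a separate
flag would provide.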

* Re: [PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
From: Jonas Jelonek @ 2026-06-17  9:55 UTC
  To: kory.maincent, github, corey
  Cc: andrew+netdev, davem, edumazet, hkallweit1, horms, kuba,
	linux-kernel, linux, netdev, o.rempel, pabeni
In-Reply-To: <20260616184220.72d1b814@kmaincent-XPS-13-7390>

Hi,

Just a note for whoever continues with this series: please take care of
an issue I noticed with one of the patches here, see [1].

Apart from that, I'd really like to see this move forward since the PSE 
driver I'm currently working on would benefit from it.

Best regards,
Jonas

[1] https://lore.kernel.org/netdev/e00048dd-1ed3-40c3-9912-59bccf015ad5@gmail.com/

* Re: [PATCH net v3 2/2] net: macb: drop in-flight Tx SKBs on close
From: Nicolai Buchwitz @ 2026-06-17  9:49 UTC
  To: Théo Lebrun
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Haavard Skinnemoen,
	Jeff Garzik, Conor Dooley, Paolo Valerio, netdev, linux-kernel,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier, stable
In-Reply-To: <20260617-macb-drop-tx-v3-2-d4c7e57d890b@bootlin.com>

On 17.6.2026 11:17, Théo Lebrun wrote:
> The MACB driver has since forever leaked the outgoing SKBs that
> have not yet been marked as completed. They live in queue->tx_skb
> which gets freed without remorse nor checking.
> 
> macb_free_consistent() gets called in a few codepaths, but only
> close will trigger the added expressions. In macb_open() and
> macb_alloc_consistent() failure cases, tx_skb just got allocated
> and is empty.
> 
> Use the new macb_tx_unmap() prototype to report our error as
> SKB_DROP_REASON_NOT_SPECIFIED rather than SKB_CONSUMED which makes it
> sound like no error occurred. Equivalent to dev_kfree_skb_any().
> 
> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
> ---

> [...]

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>

Thanks,
Nicolai

* Re: [PATCH net v3 1/2] net: macb: give reasons for Tx SKB kfree
From: Nicolai Buchwitz @ 2026-06-17  9:49 UTC
  To: Théo Lebrun
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Haavard Skinnemoen,
	Jeff Garzik, Conor Dooley, Paolo Valerio, netdev, linux-kernel,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier, stable
In-Reply-To: <20260617-macb-drop-tx-v3-1-d4c7e57d890b@bootlin.com>

On 17.6.2026 11:17, Théo Lebrun wrote:
> Using dev_consume_skb_any() marks the drop reason as SKB_CONSUMED every
> time we free a Tx SKB. Instead, replace by SKB_DROP_REASON_NOT_SPECIFIED
> when packet has been dropped without sending.
> 
> It is not precise but at least differs from SKB_CONSUMED and is used by
> many drivers for their error codepaths through dev_kfree_skb_{any,irq}().
> 
> Pass a reason around rather than call dev_consume_skb_any() or
> dev_kfree_skb_any() because macb_tx_unmap() is called for cleanup in
> all cases.
> 
> macb_tx_error_task() is made complex because some SKBs encountered have
> been successfully sent.
> 
> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
> ---

Looks like my r-b from v2 was lost, but here it goes again :)

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>

Thanks,
Nicolai

* Re: [PATCH RFC 8/9] arm64: dts: qcom: shikra-cqs-evk: Enable ethernet0
From: Andrew Lunn @ 2026-06-17  9:48 UTC
  To: Konrad Dybcio
  Cc: Mohd Ayaan Anwar, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Richard Cochran, Bjorn Andersson, Konrad Dybcio,
	Maxime Coquelin, Alexandre Torgue, Russell King, linux-arm-msm,
	netdev, devicetree, linux-kernel, linux-stm32, linux-arm-kernel
In-Reply-To: <4f3c6bee-3ccb-467e-a466-89fece0e6a7f@oss.qualcomm.com>

> >>> +	emac0_phy_en_hog: emac0-phy-en-hog {
> >>> +		gpio-hog;
> >>> +		gpios = <149 GPIO_ACTIVE_HIGH>;
> >>> +		output-high;
> >>> +		line-name = "emac0-phy-en";
> >>> +	};
> >>
> >> This looks like a hack - what does this pin actually do?
> >>
> > 
> > The power supply to both PHYs on Shikra is gated by a GPIO pin. I am
> > unsure whether they should be modelled as a fixed, enable-on-boot
> > regulator or just like this. They need to be powered on early so that
> > MDIO can detect them.
> 
> If it's a regulator, then it should be described as a regulator.

Agreed.

> There
> was some discussion regarding the power resources of PHYs over here:
> 
> https://lore.kernel.org/linux-arm-msm/SN7PR19MB67369F7DD02F702437C0F1919D1B2@SN7PR19MB6736.namprd19.prod.outlook.com/

MDIO detection is nice to have, but only works well on simple
boards. I would suggest hard coding the PHY ID in the compatible.

	Andrew

* Re: [PATCH iwl-next v1] ixgbe: Implement PCI reset handler
From: Andrew Lunn @ 2026-06-17  9:44 UTC
  To: Sergey Temerkhanov; +Cc: intel-wired-lan, netdev
In-Reply-To: <20260617084329.199110-1-sergey.temerkhanov@intel.com>

> +static void ixgbe_reset_prep(struct pci_dev *pdev)
> +{
> +	struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
> +	unsigned int timeout = IXGBE_PCIE_RESET_RETRIES;
> +
> +	if (!adapter)
> +		return;
> +
> +	/* Prevent the service task from being requeued in the timer callback
> +	 * while we're resetting.
> +	 */
> +	if (test_bit(__IXGBE_SERVICE_INITED, &adapter->state)) {
> +		timer_delete_sync(&adapter->service_timer);
> +		/* Prevent the service task from running while we're resetting. */
> +		cancel_work_sync(&adapter->service_task);
> +	}
> +
> +	pci_clear_master(pdev);
> +
> +	while (test_and_set_bit(__IXGBE_RESETTING, &adapter->state) && --timeout)
> +		usleep_range(1000, 2000);

Please consider using something from iopoll.h

> +
> +	if (!timeout) {
> +		e_err(drv, "Timed out waiting for __IXGBE_RESETTING to be released. Reset is needed\n");
> +		pci_set_master(pdev);
> +		return;
> +	}

because this is broken. You need to retest the condition before
declaring ETIMEDOUT.

	Andrew
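
The "retest the condition before declaring a timeout" rule can be shown
with a deterministic userspace model. This is only a sketch of the general
pattern that the iopoll.h helpers follow, not the ixgbe code and not the
kernel macros themselves: struct toy_state, its simulated clock, and the
'jitter' field (time lost between the test and the clock read, e.g. to
preemption) are assumptions made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simulated world: a resource that becomes free at a known tick. */
struct toy_state {
	unsigned long clock;	/* simulated time, in ticks */
	unsigned long free_at;	/* tick at which the current owner lets go */
	unsigned long nap;	/* ticks consumed by one sleep */
	unsigned long jitter;	/* ticks lost between the test and the clock read */
	bool held;		/* set once we own the resource */
};

/* Stand-in for !test_and_set_bit(): succeeds once the resource is free. */
static bool toy_try_acquire(struct toy_state *s)
{
	if (s->clock < s->free_at)
		return false;
	s->held = true;
	return true;
}

/* Reading the clock may itself take time (models preemption or a long sleep). */
static unsigned long toy_now(struct toy_state *s)
{
	s->clock += s->jitter;
	return s->clock;
}

/* Poll until acquired or timed out, retesting once after the deadline. */
static int toy_acquire_timeout(struct toy_state *s, unsigned long timeout)
{
	unsigned long deadline = s->clock + timeout;
	bool ok;

	assert(s->nap > 0);	/* each sleep must make progress */

	for (;;) {
		ok = toy_try_acquire(s);
		if (ok)
			break;
		if (toy_now(s) > deadline) {
			/* Deadline passed: test one last time before giving up. */
			ok = toy_try_acquire(s);
			break;
		}
		s->clock += s->nap;	/* stands in for usleep_range() */
	}
	return ok ? 0 : -ETIMEDOUT;
}
```

The final retest matters when time passes between the last failed test and
the deadline check: without it, the function could report -ETIMEDOUT even
though the resource was already free by the time the deadline was noticed.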

* Re: [PATCH net-next 0/3] selftests/xsk: stabilize timeout test behavior
From: Maciej Fijalkowski @ 2026-06-17  9:43 UTC
  To: Jason Xing
  Cc: Tushar Vyavahare, netdev, magnus.karlsson, stfomichev, kernelxing,
	davem, kuba, pabeni, ast, daniel, tirthendu.sarkar, bpf
In-Reply-To: <CAL+tcoDr0gtCPeGi1yOUtg+ZD2YxEbjAy41LBgG63b8-=CStcw@mail.gmail.com>

On Wed, Jun 17, 2026 at 07:39:06AM +0800, Jason Xing wrote:
> Hi Tushar,
> 
> On Tue, Jun 16, 2026 at 11:50 PM Tushar Vyavahare
> <tushar.vyavahare@intel.com> wrote:
> >
> > This series improves AF_XDP selftests by making timeout handling
> > explicit and fixing sources of non-determinism in xsk timeout tests.
> >
> > Patch 1 introduces test_spec::poll_tmout and removes implicit
> > dependence on RX UMEM setup state for timeout behavior.
> >
> > Patch 2 fixes thread harness sequencing by attaching XDP programs
> > before worker startup, removing signal-based termination, and using
> > barrier synchronization only for dual-thread runs.
> >
> > Patch 3 restores shared_umem after POLL_TXQ_FULL so test-local
> > configuration does not leak into subsequent cases on shared-netdev
> > runs.
> >
> > Together these changes make timeout handling easier to follow and
> > improve selftest stability, especially on real NIC runs.
> 
> net-next is closed, but in the meantime I'll review the series ASAP.
> 
> BTW, another thing about selftests I had in my mind is that are you
> planning to work on this [1]?

This one is on me. I took your changes Jason and aligned ZC batching side
to this behavior, followed by xskxceiver adjustment. I am planning to send
this today EOD, however let's see how badly internal Sashiko will kick my
ass.

> 
> [1]: https://lore.kernel.org/all/20260520004244.55663-1-kerneljasonxing@gmail.com/
> 
> Thanks,
> Jason
> 
> >
> > Tushar Vyavahare (3):
> >   selftests/xsk: make poll timeout mode explicit
> >   selftests/xsk: fix timeout thread harness sequencing
> >   selftests/xsk: restore shared_umem after POLL_TXQ_FULL
> >
> >  .../selftests/bpf/prog_tests/test_xsk.c       | 96 +++++++++++--------
> >  .../selftests/bpf/prog_tests/test_xsk.h       |  2 +
> >  2 files changed, 56 insertions(+), 42 deletions(-)
> >
> > --
> > 2.43.0
> >
> >
> 

* Re: [PATCH RFC 8/9] arm64: dts: qcom: shikra-cqs-evk: Enable ethernet0
From: Konrad Dybcio @ 2026-06-17  9:42 UTC
  To: Mohd Ayaan Anwar
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Richard Cochran, Bjorn Andersson, Konrad Dybcio, Maxime Coquelin,
	Alexandre Torgue, Russell King, linux-arm-msm, netdev, devicetree,
	linux-kernel, linux-stm32, linux-arm-kernel
In-Reply-To: <ajF+xlipLuZtf4HL@oss.qualcomm.com>

On 6/16/26 6:50 PM, Mohd Ayaan Anwar wrote:
> On Tue, Jun 16, 2026 at 11:50:26AM +0200, Konrad Dybcio wrote:
>> On 6/11/26 8:37 PM, Mohd Ayaan Anwar wrote:
>>
>>> +&tlmm {
>>> +	ethernet0_defaults: ethernet0-defaults-state {
>>
>> s/defaults/default
>>
>> Please move this definition to shikra.dtsi
>>
> 
> The CQM and CQS variants have identical GPIO mapping but the IQS is
> different. So should I keep this in shikra.dtsi and overwrite for IQS in
> shikra-iqs-evk.dts?
> 
> 
>>> +
>>> +	emac0_phy_en_hog: emac0-phy-en-hog {
>>> +		gpio-hog;
>>> +		gpios = <149 GPIO_ACTIVE_HIGH>;
>>> +		output-high;
>>> +		line-name = "emac0-phy-en";
>>> +	};
>>
>> This looks like a hack - what does this pin actually do?
>>
> 
> The power supply to both PHYs on Shikra is gated by a GPIO pin. I am
> unsure whether they should be modelled as a fixed, enable-on-boot
> regulator or just like this. They need to be powered on early so that
> MDIO can detect them.

If it's a regulator, then it should be described as a regulator. There
was some discussion regarding the power resources of PHYs over here:

https://lore.kernel.org/linux-arm-msm/SN7PR19MB67369F7DD02F702437C0F1919D1B2@SN7PR19MB6736.namprd19.prod.outlook.com/

Konrad

* Re: [PATCH bpf-next v2 3/4] bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
From: Toke Høiland-Jørgensen @ 2026-06-17  9:42 UTC
  To: Avinash Duduskar, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko
  Cc: Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis,
	John Fastabend, Stanislav Fomichev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, David Ahern,
	Shuah Khan, Jesper Dangaard Brouer, Mykyta Yatsenko, Leon Hwang,
	KP Singh, Anton Protopopov, Amery Hung, Eyal Birger, Rong Tao,
	bpf, netdev, linux-kselftest, linux-kernel
In-Reply-To: <20260616223426.3568080-4-avinash.duduskar@gmail.com>

Avinash Duduskar <avinash.duduskar@gmail.com> writes:

> BPF_FIB_LOOKUP_VLAN resolves a VLAN egress. The reverse is also
> useful: an XDP program receiving a VLAN-tagged frame on a physical
> device wants the lookup to behave as if the packet had arrived on the
> corresponding VLAN subinterface, so iif-based policy routing and VRF
> table selection use the right ingress.
>
> Add BPF_FIB_LOOKUP_VLAN_INPUT. When set, params->h_vlan_proto and
> params->h_vlan_TCI are read as an input VLAN tag and the matching VLAN
> device of params->ifindex is resolved with __vlan_find_dev_deep_rcu().
> The device must be up and in the same network namespace as
> params->ifindex (a VLAN device can be moved to another netns while
> registered on its parent; receive would deliver into that other
> namespace, which a lookup here cannot represent). If params->ifindex
> is itself a VLAN device, its inner (QinQ) subinterface is matched.
> For a bond or team, a tag on a port matches no device and returns
> NOT_FWDED; pass the master's ifindex.
> The lookup then runs with the resolved device as the ingress;
> params->ifindex itself is not modified on the input side. When the
> resolved device is enslaved to a VRF, both the full lookup (via the
> l3mdev rule) and BPF_FIB_LOOKUP_DIRECT (via l3mdev_fib_table_rcu())
> select the VRF's table from the resolved ingress. That follows from
> feeding the resolved device to the flow as the ingress
> (fl4.flowi4_iif = dev->ifindex), which is what makes l3mdev resolve
> the VRF master from the subinterface rather than from
> params->ifindex.
>
> The two failure classes get different treatment on purpose. A
> h_vlan_proto other than 802.1Q/802.1ad is API misuse and returns
> -EINVAL, since it would otherwise reach the WARN in vlan_proto_idx()
> with a program-controlled value. An unmatched VID, a device that is
> down, or one in another namespace is a data outcome and returns
> BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
> fib_get_table() finds no table and mirroring real ingress, where the
> receive path drops such frames. A VID of 0 (a priority tag) is looked
> up literally and normally fails the same way; receive instead
> processes such frames untagged, so callers should not set the flag for
> priority tags. Proceeding on the physical device for any of these
> would be fail-open for the policy-routing cases above.
>
> The h_vlan fields share a union with tbid, so the flag cannot be
> combined with BPF_FIB_LOOKUP_TBID. It describes ingress, so it also
> cannot be combined with BPF_FIB_LOOKUP_OUTPUT. Both combinations
> return -EINVAL; restricting now keeps a later relaxation backward
> compatible. Combining with BPF_FIB_LOOKUP_VLAN is allowed: the tag is
> consumed on the ingress side and the egress tag is written on
> success.
>
> Under !CONFIG_VLAN_8021Q the __vlan_find_dev_deep_rcu() stub returns
> NULL, so every lookup with the flag returns NOT_FWDED, which is
> correct since no VLAN device can exist.
>
> Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
> ---
>  include/uapi/linux/bpf.h       | 34 ++++++++++++++-
>  net/core/filter.c              | 80 +++++++++++++++++++++++++++++++---
>  tools/include/uapi/linux/bpf.h | 34 ++++++++++++++-
>  3 files changed, 141 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f77aa9472bf1..57e28da3336a 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3552,6 +3552,35 @@ union bpf_attr {
>   *			reports the route mtu in *params*->mtu_result, and on
>   *			the tc path without tot_len the mtu check runs after
>   *			the swap, against the parent device.
> + *		**BPF_FIB_LOOKUP_VLAN_INPUT**
> + *			Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
> + *			as an input VLAN tag (e.g. parsed from the packet) and
> + *			run the lookup as if ingress had happened on the VLAN
> + *			subinterface carrying that tag for *params*->ifindex,
> + *			rather than on *params*->ifindex itself. The VID is the
> + *			low 12 bits of *params*->h_vlan_TCI;
> + *			*params*->h_vlan_proto must be ETH_P_8021Q or
> + *			ETH_P_8021AD in network byte order (any other value
> + *			returns **-EINVAL**). The
> + *			subinterface is the one configured for that tag on
> + *			*params*->ifindex; if *params*->ifindex is itself a
> + *			VLAN device, its inner (QinQ) subinterface is matched.
> + *			For a bond or team, a tag on a port matches no
> + *			device and returns NOT_FWDED; pass the master's
> + *			ifindex.
> + *			If no matching subinterface exists, or it is not up,
> + *			or it was moved to another network namespace, the
> + *			lookup returns **BPF_FIB_LKUP_RET_NOT_FWDED**,
> + *			mirroring real ingress, which drops a frame whose tag
> + *			is unconfigured or whose VLAN device is down. A VID of
> + *			0 (a priority-tagged frame) is looked up literally like
> + *			any other VID; receive instead processes such frames
> + *			untagged on the device itself, so do not set this flag
> + *			for priority tags.
> + *			Cannot be combined with **BPF_FIB_LOOKUP_TBID** (both
> + *			use the same input fields) or **BPF_FIB_LOOKUP_OUTPUT**
> + *			(this flag is ingress-only); doing so returns
> + *			**-EINVAL**.

This comment is also overly long - please trim.
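
For anyone following along, the tag handling described in the quoted
text (VID in the low 12 bits of the TCI, protocol restricted to 802.1Q or
802.1AD, both fields in network byte order) can be sketched in userspace
C. This is only an illustration under stated assumptions: vlan_tag_vid()
and the SKETCH_* names are hypothetical and not part of the patch, and the
constants are redefined locally with the values of ETH_P_8021Q and
ETH_P_8021AD.

```c
#include <arpa/inet.h>
#include <stdint.h>

#define SKETCH_ETH_P_8021Q  0x8100 /* same value as ETH_P_8021Q */
#define SKETCH_ETH_P_8021AD 0x88A8 /* same value as ETH_P_8021AD */
#define SKETCH_VID_MASK     0x0fff /* low 12 bits of the TCI */

/* Hypothetical helper, not from the patch. Both arguments are in
 * network byte order, as in struct bpf_fib_lookup. Returns the VID
 * (0..4095), or -1 when the protocol is neither 802.1Q nor 802.1AD.
 */
static int vlan_tag_vid(uint16_t proto_be, uint16_t tci_be)
{
	uint16_t proto = ntohs(proto_be);

	if (proto != SKETCH_ETH_P_8021Q && proto != SKETCH_ETH_P_8021AD)
		return -1;

	/* PCP and DEI live in the top 4 bits and are masked off. */
	return ntohs(tci_be) & SKETCH_VID_MASK;
}
```

A TCI of 0xA064 (PCP 5, VID 100) yields 100 here, and a VID of 0 is
returned as-is, which matches the "looked up literally" wording above.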

>   *
>   *		*ctx* is either **struct xdp_md** for XDP programs or
>   *		**struct sk_buff** tc cls_act programs.
> @@ -7348,6 +7377,7 @@ enum {
>  	BPF_FIB_LOOKUP_SRC     = (1U << 4),
>  	BPF_FIB_LOOKUP_MARK    = (1U << 5),
>  	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
> +	BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
>  };
>  
>  enum {
> @@ -7416,7 +7446,9 @@ struct bpf_fib_lookup {
>  		struct {
>  			/* output with BPF_FIB_LOOKUP_VLAN: set from the
>  			 * resolved egress VLAN device (see the flag); zeroed
> -			 * on other successful lookups.
> +			 * on other successful lookups. input with
> +			 * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
> +			 * the lookup by.
>  			 */
>  			__be16	h_vlan_proto;
>  			__be16	h_vlan_TCI;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index b37a12321fba..cfbdd842ce61 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6158,6 +6158,41 @@ static int bpf_fib_set_fwd_params(struct net_device *dev,
>  
>  	return 0;
>  }
> +
> +/* With BPF_FIB_LOOKUP_VLAN_INPUT the caller passes the packet's VLAN tag in
> + * params->h_vlan_proto and params->h_vlan_TCI; the lookup is done as if
> + * ingress had happened on the matching VLAN subinterface of *dev. Resolve
> + * it and store it in *dev. params is not modified.
> + *
> + * A protocol other than 802.1Q/802.1AD is API misuse (it would otherwise
> + * reach the WARN in vlan_proto_idx()), so it is rejected with -EINVAL. An
> + * unmatched VID, a matching device that is down, or one that was moved
> + * to another netns (receive would deliver into that netns' stack, which
> + * a lookup here cannot represent) is a data outcome, reported as
> + * NOT_FWDED, the same way the DIRECT path reports a missing table. Under
> + * !CONFIG_VLAN_8021Q __vlan_find_dev_deep_rcu() returns NULL, so every
> + * call returns NOT_FWDED, which is correct since no subinterface can
> + * exist.
> + */

As in the previous patch, please drop this comment.

> +static int bpf_fib_vlan_input_dev(struct net_device **dev,
> +				  const struct bpf_fib_lookup *params)
> +{

Just return the dev pointer and use ERR_PTR for errors? That's what we
usually do for these kinds of functions.
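
A rough userspace sketch of that convention, in case it helps. The
sketch_* helpers below are stand-ins for the kernel's ERR_PTR(), IS_ERR()
and PTR_ERR(), and the lookup function and device table are hypothetical.
Mapping the "no matching device" case to NULL is only one possible choice
(an assumption here, since the caller still has to turn it into
NOT_FWDED).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define SKETCH_MAX_ERRNO 4095

static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int sketch_is_err(const void *p)
{
	/* Error codes occupy the top SKETCH_MAX_ERRNO addresses. */
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_dev {
	int ifindex;
};

/* Hypothetical: a single VLAN subinterface configured for VID 100. */
static struct sketch_dev sketch_vlan100 = { .ifindex = 7 };

/* Hypothetical lookup. Invalid protocol is API misuse and becomes
 * an error pointer; an unmatched VID is a data outcome and becomes
 * NULL (assumed mapping); otherwise the device is returned.
 */
static struct sketch_dev *sketch_vlan_input_dev(uint16_t proto, uint16_t vid)
{
	if (proto != 0x8100 && proto != 0x88A8)
		return sketch_err_ptr(-EINVAL);
	if (vid != 100)
		return NULL;
	return &sketch_vlan100;
}
```

The caller then checks sketch_is_err() first, treats NULL as
not-forwarded, and uses the pointer otherwise, with no out-parameter
needed.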

-Toke



* Re: [PATCH v27 4/5] sfc: obtain and map cxl range using devm_cxl_probe_mem
From: Alejandro Lucero Palau @ 2026-06-17  9:42 UTC (permalink / raw)
  To: Dan Williams (nvidia), alejandro.lucero-palau, linux-cxl, netdev,
	edward.cree, davem, kuba, pabeni, edumazet, dave.jiang
In-Reply-To: <6a31a948d61f5_9b8551006b@djbw-dev.notmuch>


On 6/16/26 20:51, Dan Williams (nvidia) wrote:
> Alejandro Lucero Palau wrote:
>> On 6/10/26 14:56, Alejandro Lucero Palau wrote:
>>> On 6/10/26 07:10, Alejandro Lucero Palau wrote:
>>>> On 6/10/26 00:30, Dan Williams (nvidia) wrote:
>>>>> alejandro.lucero-palau@ wrote:
>>>>>> From: Alejandro Lucero <alucerop@amd.com>
>>>>>>
>>>>>> Use core API for safely obtain the CXL range linked to an HDM
>>>>>> committed
>>>>>> by the BIOS. Map such a range for being used as the ctpio buffer.
>>>>>>
>>>>>> A potential user space action through sysfs unbinding or core cxl
>>>>>> modules remove will trigger sfc driver device detachment, with that
>>>>>> case
>>>>>> not racing with this mapping as this is done during driver probe and
>>>>>> therefore protected with device lock against those user space actions.
>>>>>>
>>>>>> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
>>>>>> ---
>>>>>>    drivers/net/ethernet/sfc/efx.c     |  1 +
>>>>>>    drivers/net/ethernet/sfc/efx_cxl.c | 24 ++++++++++++++++++++++++
>>>>>>    drivers/net/ethernet/sfc/efx_cxl.h |  3 +++
>>>>>>    3 files changed, 28 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/sfc/efx.c
>>>>>> b/drivers/net/ethernet/sfc/efx.c
>>>>>> index 90ccbe310386..578054c21e79 100644
>>>>>> --- a/drivers/net/ethernet/sfc/efx.c
>>>>>> +++ b/drivers/net/ethernet/sfc/efx.c
>>>>>> @@ -984,6 +984,7 @@ static void efx_pci_remove(struct pci_dev
>>>>>> *pci_dev)
>>>>>>        efx_fini_io(efx);
>>>>>>          probe_data = container_of(efx, struct efx_probe_data, efx);
>>>>>> +    efx_cxl_exit(probe_data);
>>>>>>          pci_dbg(efx->pci_dev, "shutdown successful\n");
>>>>>>    diff --git a/drivers/net/ethernet/sfc/efx_cxl.c
>>>>>> b/drivers/net/ethernet/sfc/efx_cxl.c
>>>>>> index 4d55c08cf2a1..d5766a40e2cf 100644
>>>>>> --- a/drivers/net/ethernet/sfc/efx_cxl.c
>>>>>> +++ b/drivers/net/ethernet/sfc/efx_cxl.c
>>>>>> @@ -18,6 +18,7 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
>>>>>>    {
>>>>>>        struct efx_nic *efx = &probe_data->efx;
>>>>>>        struct pci_dev *pci_dev = efx->pci_dev;
>>>>>> +    struct range cxl_pio_range;
>>>>>>        struct efx_cxl *cxl;
>>>>>>        u16 dvsec;
>>>>>>        int rc;
>>>>>> @@ -75,9 +76,32 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
>>>>>>            return -ENODEV;
>>>>>>        }
>>>>>>    +    cxl->cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &cxl_pio_range);
>>>>>> +    if (IS_ERR(cxl->cxlmd)) {
>>>>>> +        pci_err(pci_dev, "CXL accel memdev creation failed\n");
>>>>>> +        return PTR_ERR(cxl->cxlmd);
>>>>>> +    }
>>>>>> +
>>>>>> +    cxl->ctpio_cxl = ioremap_wc(cxl_pio_range.start,
>>>>>> +                    range_len(&cxl_pio_range));
>>>>>> +    if (!cxl->ctpio_cxl) {
>>>>>> +        pci_err(pci_dev, "CXL ioremap region (%pra) failed\n",
>>>>>> +            &cxl_pio_range);
>>>>>> +        return -ENOMEM;
>>>>> Dave caught the iounmap leak, but another concern is since you want to
>>>>> continue operation if efx_cxl_init() fails then you probably also want
>>>>> to release the successful attachment to the CXL domain if this happens.
>>>>
>>>> I will do that.
>>>>
>>> Looking at this issue, I think an error when creating the memdev or
>>> during the region attach triggers the memdev removal, but ...
>>>
>>>
>>>>> Minor since something else is likely to fail if ioremap is not
>>>>> reliable.
>>>
>>> .. if we want to specifically do that with an unlikely (but possible)
>>> ioremap error something else needs to be exported like
>>> cxl_memdev_unregister(). Are you happy with that approach?
>>>
>> I have just tested with this:
>>
>> +void cxl_memdev_remove(void *_cxlmd)
>> +{
>> +       struct cxl_memdev *cxlmd = _cxlmd;
>> +       struct device *dev = &cxlmd->dev;
>> +
>> +       devm_remove_action_nowarn(cxlmd->cxlds->dev, cxl_memdev_unregister,
>> +                                 cxlmd);
>> +
>> +       cdev_device_del(&cxlmd->cdev, dev);
>> +       cxl_memdev_shutdown(dev);
>> +       put_device(dev);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_memdev_remove, "CXL");
>>
>>
>> only called if the ioremap fails.
>>
>>
>> Please, let me know if you like this approach before sending another
>> version.
> A devres group can automatically clean up after devm_cxl_probe_mem()
> in the error path with no new exports needed from the CXL core.
> Something like:
>
>          void *group = devres_open_group(cxl->cxlds.dev, NULL, GFP_KERNEL);
>          int rc = 0;
>
>          if (!group)
>                  return -ENOMEM;
>          
>          cxl->cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &cxl_pio_range);
>          if (IS_ERR(cxl->cxlmd)) {
>                  pci_err(pci_dev, "CXL accel memdev creation failed\n");
>                  rc = PTR_ERR(cxl->cxlmd);
>                  goto out;
>          }
>
>          cxl->ctpio_cxl =
>                  ioremap_wc(cxl_pio_range.start, range_len(&cxl_pio_range));
>          if (!cxl->ctpio_cxl) {
>                  pci_err(pci_dev, "CXL ioremap region (%pra) failed\n",
>                          &cxl_pio_range);
>                  rc = -ENOMEM;
>          }
>
> out:
>          if (rc)
>                  devres_release_group(group);
>          else
>                  devres_remove_group(group);
>          return rc;


OK. I will use this in v28 instead of that export.


Thanks



* Re: [PATCH net] net: rnpgbe: fix mailbox endianness handling
From: Andrew Lunn @ 2026-06-17  9:40 UTC (permalink / raw)
  To: Dong Yibo
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, vadim.fedorenko,
	netdev, linux-kernel, yaojun
In-Reply-To: <20260617083531.251119-1-dong100@mucse.com>

On Wed, Jun 17, 2026 at 04:35:31PM +0800, Dong Yibo wrote:
> Mailbox data is exchanged through 32-bit MMIO accesses but the
> mailbox payload is defined using little-endian FW structures with
> __le16 and __le32 fields.

Given you are using __le16 and __le32, why did sparse not find these
issues? It would be good to understand this, because if sparse missed
this, what else has sparse missed which is also broken?

	Andrew


* Re: [PATCH net-next 1/3] net: busy-poll: introduce sk_tx_busy_loop()
From: Maciej Fijalkowski @ 2026-06-17  9:40 UTC (permalink / raw)
  To: Menglong Dong
  Cc: menglong8.dong, Jakub Kicinski, jasowang, mst, xuanzhuo, eperezma,
	andrew+netdev, davem, edumazet, pabeni, magnus.karlsson, sdf,
	horms, ast, daniel, hawk, john.fastabend, bjorn, kerneljasonxing,
	netdev, virtualization, linux-kernel, bpf
In-Reply-To: <TYn10tJ2SIGF1pAhF26DRQ@linux.dev>

On Sun, Jun 14, 2026 at 06:12:46PM +0800, Menglong Dong wrote:
> On 2026/6/14 02:21, Jakub Kicinski wrote:
> > On Thu, 11 Jun 2026 15:12:40 +0800 menglong8.dong@gmail.com wrote:
> > > For now, we use sk_busy_loop() for both rx and tx path. The sk_busy_loop()
> > > will call napi_busy_loop() for the specified napi_id. However, some
> > > nic drivers have tx napi, such as virtio-net. In this case, sk_busy_loop()
> > > doesn't work, as it can only schedule the NAPI for the rx queue.
> > > 
> > > Therefore, introduce sk_tx_busy_loop() for the nic drivers that support tx
> > > napi, which will schedule the tx napi if available.
> > 
> > First, I thought the only difference with Tx NAPI is that it can't be
> > busy polled. So if you want to poll an instance don't register it as 
> > a Tx one instead of adding all this "tx polling" stuff in the core?
> 
> I see. Register the tx NAPI with netif_napi_add_config() allow us
> busy poll it. But we still have two NAPI instance: rx NAPI and tx NAPI.
> sk_busy_loop() can only busy poll on one of them.
> 
> Before AF_XDP, we don't have the need to send packet via tx NAPI, which
> means that we don't need to busy poll it.
> 
> I analyst some nic drivers on the implement of AF_XDP. Some of them
> will check xsk tx ring of current queue and send the data in it in the
> rx NAPI, such as mlx5. Some of them will allocate a extra "rxtx" NAPI
> for the AF_XDP zero-copy queue, which will poll both the data receiving
> and sending.
> 
> In the case about, they will do the data sending and receiving for the
> AF_XDP in a single NAPI instance.
> 
> However, some driver receiving the data in rx NAPI and send data in
> tx NAPI for AF_XDP. In this case, we can't use sk_busy_loop() for both
> rx path and tx path, as we need to wake different NAPI instance.
> 
> > 
> > Second, can this problem happen for any other NIC or is it purely 
> > an artifact of virtio's delayed Tx completion handling?
> 
> According to my analysis, only virtio-net and ICSSG driver have
> split NAPI for AF_XDP. I don't have a ICSSG nic, but the codex tell
> me that it does have the same problem.
> 
> I'm not sure if it is a good idea to introduce the sk_tx_busy_loop().
> Maybe we can modify the driver instead by using the same NAPI
> for both data sending and receiving, just like others do. The
> advantage of introduce sk_tx_busy_loop() is that we can split the
> data sending and receiving, which maybe more efficient.

It would be good to back your changes with performance numbers. I
believe drivers do tx processing via the rx napi because, before AF_XDP,
it was only about cleaning up writebacks; AF_XDP added more weight via
actual tx descriptor submission.

Maybe you can vibe-code virtio-net to work only with rx napi and see what
the results are.

Side note/question - Do you have a tx-only use case for AF_XDP? I have
been planning (for a long time, actually) to implement asymmetric AF_XDP
sockets. Currently, for ZC scenarios, an xsk socket occupies both rx and
tx queues even when you do rx or tx only.

> 
> > 
> > Third, this series does not apply.
> 
> Ah, I'll rebase this series if a V2 is acceptable.
> 
> Thanks!
> Menglong Dong
> 
> > 
> > 
> 
> 
> 
> 


* [PATCH bpf v3 2/2] selftests/bpf: Cover partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-17  9:35 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo, paul.chaignon, Sun Jian
In-Reply-To: <20260617093557.63880-1-sun.jian.kdev@gmail.com>

prog_run_opts already verifies that BPF_PROG_TEST_RUN returns -ENOSPC
for a short data_out buffer while still reporting the full output size
through data_size_out.

Add the same coverage for non-linear test_run output. Use pass-through
TC and XDP programs with a 9000-byte packet, a 64-byte linear data area,
and a 100-byte data_out buffer. The expected output spans both the linear
data and the first fragment.

Verify that test_run returns -ENOSPC, reports the full packet length
through data_size_out, and copies the packet prefix into data_out for
both non-linear skb and XDP frags paths.

Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
---
 .../selftests/bpf/prog_tests/prog_run_opts.c  | 70 +++++++++++++++++++
 .../selftests/bpf/progs/test_pkt_access.c     | 12 ++++
 2 files changed, 82 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
index 01f1d1b6715a..9cc898e6a9f7 100644
--- a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
+++ b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
@@ -4,6 +4,10 @@
 
 #include "test_pkt_access.skel.h"
 
+#define NONLINEAR_PKT_LEN 9000
+#define NONLINEAR_LINEAR_DATA_LEN 64
+#define SHORT_OUT_LEN 100
+
 static const __u32 duration;
 
 static void check_run_cnt(int prog_fd, __u64 run_cnt)
@@ -20,6 +24,69 @@ static void check_run_cnt(int prog_fd, __u64 run_cnt)
 	      "incorrect number of repetitions, want %llu have %llu\n", run_cnt, info.run_cnt);
 }
 
+static void init_pkt(__u8 *pkt, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		pkt[i] = i & 0xff;
+}
+
+static void test_skb_nonlinear_data_out_partial(struct test_pkt_access *skel)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+	__u8 pkt[NONLINEAR_PKT_LEN];
+	__u8 out[SHORT_OUT_LEN];
+	struct __sk_buff skb = {};
+	int prog_fd, err;
+
+	init_pkt(pkt, sizeof(pkt));
+
+	skb.data_end = NONLINEAR_LINEAR_DATA_LEN;
+
+	topts.data_in = pkt;
+	topts.data_size_in = sizeof(pkt);
+	topts.data_out = out;
+	topts.data_size_out = sizeof(out);
+	topts.ctx_in = &skb;
+	topts.ctx_size_in = sizeof(skb);
+
+	prog_fd = bpf_program__fd(skel->progs.tc_pass_prog);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+	ASSERT_EQ(err, -ENOSPC, "skb_nonlinear_partial_err");
+	ASSERT_EQ(topts.data_size_out, sizeof(pkt), "skb_nonlinear_partial_data_size_out");
+	ASSERT_OK(memcmp(out, pkt, sizeof(out)), "skb_nonlinear_partial_data_out");
+}
+
+static void test_xdp_nonlinear_data_out_partial(struct test_pkt_access *skel)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+	__u8 pkt[NONLINEAR_PKT_LEN];
+	__u8 out[SHORT_OUT_LEN];
+	struct xdp_md ctx = {};
+	int prog_fd, err;
+
+	init_pkt(pkt, sizeof(pkt));
+
+	ctx.data = 0;
+	ctx.data_end = NONLINEAR_LINEAR_DATA_LEN;
+
+	topts.data_in = pkt;
+	topts.data_size_in = sizeof(pkt);
+	topts.data_out = out;
+	topts.data_size_out = sizeof(out);
+	topts.ctx_in = &ctx;
+	topts.ctx_size_in = sizeof(ctx);
+
+	prog_fd = bpf_program__fd(skel->progs.xdp_frags_pass_prog);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+	ASSERT_EQ(err, -ENOSPC, "xdp_nonlinear_partial_err");
+	ASSERT_EQ(topts.data_size_out, sizeof(pkt), "xdp_nonlinear_partial_data_size_out");
+	ASSERT_OK(memcmp(out, pkt, sizeof(out)), "xdp_nonlinear_partial_data_out");
+}
+
 void test_prog_run_opts(void)
 {
 	struct test_pkt_access *skel;
@@ -69,6 +136,9 @@ void test_prog_run_opts(void)
 	run_cnt += topts.repeat;
 	check_run_cnt(prog_fd, run_cnt);
 
+	test_skb_nonlinear_data_out_partial(skel);
+	test_xdp_nonlinear_data_out_partial(skel);
+
 cleanup:
 	if (skel)
 		test_pkt_access__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/test_pkt_access.c b/tools/testing/selftests/bpf/progs/test_pkt_access.c
index bce7173152c6..cd284401eebd 100644
--- a/tools/testing/selftests/bpf/progs/test_pkt_access.c
+++ b/tools/testing/selftests/bpf/progs/test_pkt_access.c
@@ -150,3 +150,15 @@ int test_pkt_access(struct __sk_buff *skb)
 
 	return TC_ACT_UNSPEC;
 }
+
+SEC("tc")
+int tc_pass_prog(struct __sk_buff *skb)
+{
+	return TC_ACT_OK;
+}
+
+SEC("xdp.frags")
+int xdp_frags_pass_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
-- 
2.43.0



* [PATCH bpf v3 1/2] bpf: Fix partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-17  9:35 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo, paul.chaignon, Sun Jian
In-Reply-To: <20260617093557.63880-1-sun.jian.kdev@gmail.com>

For non-linear test_run output, bpf_test_finish() derives the linear
data copy length from copy_size - frag_size. This only matches the
linear data length when copy_size is the full packet size.

When userspace provides a short data_out buffer, copy_size is clamped to
that buffer size. If copy_size is smaller than frag_size, the computed
length becomes negative and bpf_test_finish() returns -ENOSPC before
copying the packet prefix or updating data_size_out.

Compute the linear data length from the packet layout instead, and clamp
the linear copy length to copy_size. This preserves the expected
partial-copy semantics: return -ENOSPC, copy the packet prefix that fits
in data_out, and report the full packet length through data_size_out.
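
As a rough illustration of the arithmetic for the non-linear case only,
here is a userspace sketch. The function names are hypothetical, and the
numbers in the note below come from the selftest in patch 2/2 (9000-byte
packet, 64-byte linear area, 100-byte data_out), so frag_size is 8936.

```c
#include <stdint.h>

/* Old logic (non-linear case): the linear length is derived from the
 * already-clamped copy_size, so it goes negative whenever
 * copy_size < frag_size and the caller bails out with -ENOSPC.
 */
static int old_linear_len(uint32_t copy_size, uint32_t frag_size)
{
	return (int)copy_size - (int)frag_size;
}

/* New logic: the linear length comes from the packet layout
 * (size - frag_size) and is then clamped to what fits in data_out.
 */
static uint32_t new_linear_len(uint32_t size, uint32_t frag_size,
			       uint32_t copy_size)
{
	uint32_t head_len = size - frag_size;

	return copy_size < head_len ? copy_size : head_len;
}
```

With size = 9000, frag_size = 8936 and copy_size = 100, the old
computation gives a negative length, while the new one gives 64, leaving
the remaining 36 bytes to come from the first fragment.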

Fixes: 7855e0db150ad ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature")
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
---
 net/bpf/test_run.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2bc04feadfab..f15c613aaa4e 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -453,12 +453,8 @@ static int bpf_test_finish(const union bpf_attr *kattr,
 	}

 	if (data_out) {
-		int len = sinfo ? copy_size - frag_size : copy_size;
-
-		if (len < 0) {
-			err = -ENOSPC;
-			goto out;
-		}
+		u32 head_len = size - frag_size;
+		u32 len = min(copy_size, head_len);

 		if (copy_to_user(data_out, data, len))
 			goto out;
-- 
2.43.0


* Re: [PATCH] e1000: Remove redundant else after return
From: Andrew Lunn @ 2026-06-17  9:36 UTC (permalink / raw)
  To: Lovekesh Solanki
  Cc: andrew+netdev, anthony.l.nguyen, davem, edumazet, kuba, netdev,
	pabeni, przemyslaw.kitszel
In-Reply-To: <20260617075855.113719-1-lovekeshsolanki00@gmail.com>

On Wed, Jun 17, 2026 at 01:28:55PM +0530, Lovekesh Solanki wrote:
> Hi Andrew,
> 
> I read the documentation you linked and understand simple standalone
> cleanups are discouraged.

You also said in the commit message it reduced the indentation level,
but you did not actually reduce the indentation!

    Andrew


* [PATCH bpf v3 0/2] Fix partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-17  9:35 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo, paul.chaignon, Sun Jian

When BPF_PROG_TEST_RUN returns non-linear output and userspace provides a
short data_out buffer, bpf_test_finish() can return -ENOSPC before copying
the packet prefix or updating data_size_out.

Fix this by deriving the linear copy length from the packet layout rather
than from the already-clamped copy_size. Add selftest coverage for both
non-linear skb and XDP frags paths.

Changes in v3:

* Keep the fix patch minimal by leaving the existing offset declaration
  unchanged.
* Drop unnecessary memset() calls from the new selftests.
* Keep the pass-through TC program and larger test packet for the skb
  case. pkt_v4 is too small once the short IPv4 input check is accounted
  for, and the existing packet-access program fails before reaching the
  partial copy-out path with such a short linear area.

Changes in v2:

* Fix the Fixes tag to point to the commit that introduced the shared
  non-linear copy-out logic.
* Drop skb-specific wording from the fix commit.
* Move the selftest from skb_load_bytes.c to prog_run_opts.c.
* Add XDP frags coverage in addition to non-linear skb coverage.

v2:
https://lore.kernel.org/bpf/20260616093103.471444-1-sun.jian.kdev@gmail.com/

v1:
https://lore.kernel.org/bpf/20260615073856.152479-1-sun.jian.kdev@gmail.com/

Tested with:
  ./test_progs -t prog_run_opts -v
  ./test_progs -t skb_load_bytes -v
  ./test_progs -t xdp_pull_data -v

Sun Jian (2):
  bpf: Fix partial copy of non-linear test_run output
  selftests/bpf: Cover partial copy of non-linear test_run output

 net/bpf/test_run.c                            |  8 +--
 .../selftests/bpf/prog_tests/prog_run_opts.c  | 70 +++++++++++++++++++
 .../selftests/bpf/progs/test_pkt_access.c     | 12 ++++
 3 files changed, 84 insertions(+), 6 deletions(-)

-- 
2.43.0


* [PATCH net-next v1] net: wangxun: don't advertise IFF_SUPP_NOFCS
From: Rongguang Wei @ 2026-06-17  9:28 UTC (permalink / raw)
  To: netdev; +Cc: jiawenwu, mengyuanlou, pabeni, kuba, Rongguang Wei

From: Rongguang Wei <weirongguang@kylinos.cn>

As with commit a24162f18825 ("i40e: don't advertise IFF_SUPP_NOFCS"),
ngbe and txgbe advertise IFF_SUPP_NOFCS, which allows users to set the
SO_NOFCS socket option. But the drivers do not check skb->no_fcs, so
the option is silently ignored.

With this change, send() fails with -EPROTONOSUPPORT when the SO_NOFCS
option is set on an AF_PACKET socket.

Signed-off-by: Rongguang Wei <weirongguang@kylinos.cn>
---
 drivers/net/ethernet/wangxun/ngbe/ngbe_main.c   | 1 -
 drivers/net/ethernet/wangxun/txgbe/txgbe_main.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
index d8e3827a8b1f..1e4ebac8e495 100644
--- a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
@@ -713,7 +713,6 @@ static int ngbe_probe(struct pci_dev *pdev,
 	netdev->features |= NETIF_F_GRO;
 
 	netdev->priv_flags |= IFF_UNICAST_FLT;
-	netdev->priv_flags |= IFF_SUPP_NOFCS;
 	netdev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
 
 	netdev->min_mtu = ETH_MIN_MTU;
diff --git a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
index 8b7c3753bb6a..db9262b00a66 100644
--- a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
+++ b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
@@ -801,7 +801,6 @@ static int txgbe_probe(struct pci_dev *pdev,
 	netdev->features |= NETIF_F_RX_UDP_TUNNEL_PORT;
 
 	netdev->priv_flags |= IFF_UNICAST_FLT;
-	netdev->priv_flags |= IFF_SUPP_NOFCS;
 	netdev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
 
 	netdev->min_mtu = ETH_MIN_MTU;
-- 
2.25.1



* Re: [PATCH] rocker: Fix memory leak in ofdpa_port_fdb()
From: Andrew Lunn @ 2026-06-17  9:26 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Ziran Zhang, Jiri Pirko, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <1446e974-0df0-4956-b2af-7a9403da3c8d@intel.com>

On Tue, Jun 16, 2026 at 04:29:59PM -0700, Jacob Keller wrote:
> On 6/15/2026 6:32 PM, Ziran Zhang wrote:
> > In ofdpa_port_fdb(), the hash_del() only unlinks the node from
> > hash table, but does not free it.
> > 
> > Fix this by adding kfree(found) after the !found == removing check,
> > where the pointer value is no longer needed.
> > 
> > Found by Coccinelle kfree script.
> > 

Is rocker actually used any more? I'm not too sure of the history, but
was it not added as a way to develop the early switchdev code? There
was a qemu implementation of the 'hardware'?

Is it still useful? Should we actually just remove the driver?

	Andrew


* Re: [PATCH bpf-next v2 2/4] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Toke Høiland-Jørgensen @ 2026-06-17  9:26 UTC (permalink / raw)
  To: Avinash Duduskar, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko
  Cc: Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis,
	John Fastabend, Stanislav Fomichev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, David Ahern,
	Shuah Khan, Jesper Dangaard Brouer, Mykyta Yatsenko, Leon Hwang,
	KP Singh, Anton Protopopov, Amery Hung, Eyal Birger, Rong Tao,
	bpf, netdev, linux-kselftest, linux-kernel
In-Reply-To: <20260616223426.3568080-3-avinash.duduskar@gmail.com>

Avinash Duduskar <avinash.duduskar@gmail.com> writes:

> bpf_fib_lookup() returns the FIB-resolved egress ifindex straight
> from the fib result. When the egress is a VLAN device, the returned
> ifindex is the VLAN netdev's, which has no XDP xmit handler; XDP
> programs that want to forward the frame (e.g. xdp-forward) must
> instead target the underlying physical device and push the VLAN tag
> themselves. Today the program has no way to learn either the
> underlying ifindex or the VLAN tag without maintaining its own
> VLAN-to-ifindex map in userspace and refreshing it on netlink
> events.
>
> Add BPF_FIB_LOOKUP_VLAN. When the caller sets this flag and the fib
> result is a VLAN device whose immediate parent is a real (non-VLAN)
> device in the same network namespace, populate the existing output
> fields params->h_vlan_proto and params->h_vlan_TCI from the VLAN
> device and replace params->ifindex with the parent's ifindex.
> params->h_vlan_TCI carries the VID only, with PCP and DEI bits zero; a
> consumer wanting to set egress priority writes PCP itself.
> params->smac is the VLAN device's own address, which can differ from
> the parent's.
>
> Only the immediate parent is resolved, via vlan_dev_priv(dev)->real_dev
> and not vlan_dev_real_dev(), which walks to the bottom of a stack. For a
> stacked VLAN (QinQ) the immediate parent is itself a VLAN device; since
> one h_vlan_proto/h_vlan_TCI pair cannot describe two tags, ifindex is
> left unchanged and the vlan fields remain zero in that case. The swap
> is also skipped when the parent lives in another network namespace (a
> VLAN device can be moved while its parent stays), since its ifindex
> would be meaningless or match an unrelated device in the caller's
> namespace. The swap and the vlan fields are written only on success;
> other output fields keep their existing behaviour, so a frag-needed
> result still reports the route mtu in params->mtu_result. When the
> flag is not set, behaviour is unchanged: h_vlan_proto and h_vlan_TCI
> are zeroed and ifindex is left at the FIB result.
>
> The new block is compiled only under CONFIG_VLAN_8021Q since
> vlan_dev_priv() is not defined otherwise; without that config
> is_vlan_dev() is constant false and the flag is accepted but never
> acts.
>
> This lets an XDP redirect target the physical device and learn the
> tag to push in a single lookup, which xdp-forward's optional VLAN
> mode (xdp-project/xdp-tools#504) wants from the kernel side.
>
> The helper's input semantics are unchanged; the reverse direction
> (supplying a tag as lookup input) is added in the following patch.
>
> Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
> ---
>  include/uapi/linux/bpf.h       | 31 ++++++++++++++++++++++++++-
>  net/core/filter.c              | 39 ++++++++++++++++++++++++++++++----
>  tools/include/uapi/linux/bpf.h | 31 ++++++++++++++++++++++++++-
>  3 files changed, 95 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 11dd610fa5fa..f77aa9472bf1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3527,6 +3527,31 @@ union bpf_attr {
>   *			Use the mark present in *params*->mark for the fib lookup.
>   *			This option should not be used with BPF_FIB_LOOKUP_DIRECT,
>   *			as it only has meaning for full lookups.
> + *		**BPF_FIB_LOOKUP_VLAN**
> + *			If the fib lookup resolves to a VLAN device whose
> + *			parent is a real (non-VLAN) device, set
> + *			*params*->h_vlan_proto and *params*->h_vlan_TCI from
> + *			the VLAN device and replace *params*->ifindex with the
> + *			parent's ifindex. This lets XDP programs that target
> + *			the underlying physical device (VLAN devices have no
> + *			XDP xmit) discover both the real egress ifindex and
> + *			the VLAN tag to push in one call. *params*->h_vlan_TCI
> + *			carries the VID only, with PCP and DEI bits zero; a
> + *			consumer wanting to set egress priority writes PCP
> + *			itself. *params*->smac is the VLAN device's own
> + *			address, which can differ from the parent's. Only the
> + *			immediate parent is resolved: for a stacked VLAN (QinQ)
> + *			the parent is itself a VLAN device, and since one tag
> + *			pair cannot describe two tags, *params*->ifindex is
> + *			left unchanged and the vlan fields remain zero. The
> + *			same applies when the parent is in another network
> + *			namespace, where its ifindex would be meaningless.
> + *			The swap and the vlan fields are written only on
> + *			success; other output fields keep the helper's
> + *			existing behaviour, so a frag-needed result still
> + *			reports the route mtu in *params*->mtu_result, and on
> + *			the tc path without tot_len the mtu check runs after
> + *			the swap, against the parent device.

This comment is quite long, please trim. At the very least drop:

"This lets XDP programs that target the underlying physical device (VLAN
devices have no XDP xmit) discover both the real egress ifindex and the
VLAN tag to push in one call."

and shorten:

"Only the immediate parent is resolved: for a stacked VLAN
(QinQ) the parent is itself a VLAN device, and since one tag pair cannot
describe two tags, *params*->ifindex is left unchanged and the vlan
fields remain zero. The same applies when the parent is in another
network namespace, where its ifindex would be meaningless."

to:

"The lookup only resolves the immediate parent (QinQ is not supported),
and fails if the parent is in a different namespace."

>   *
>   *		*ctx* is either **struct xdp_md** for XDP programs or
>   *		**struct sk_buff** tc cls_act programs.
> @@ -7322,6 +7347,7 @@ enum {
>  	BPF_FIB_LOOKUP_TBID    = (1U << 3),
>  	BPF_FIB_LOOKUP_SRC     = (1U << 4),
>  	BPF_FIB_LOOKUP_MARK    = (1U << 5),
> +	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
>  };
>  
>  enum {
> @@ -7388,7 +7414,10 @@ struct bpf_fib_lookup {
>  
>  	union {
>  		struct {
> -			/* output */
> +			/* output with BPF_FIB_LOOKUP_VLAN: set from the
> +			 * resolved egress VLAN device (see the flag); zeroed
> +			 * on other successful lookups.
> +			 */
>  			__be16	h_vlan_proto;
>  			__be16	h_vlan_TCI;
>  		};
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6fa172cb1348..b37a12321fba 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6119,10 +6119,40 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
>  #endif
>  
>  #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
> -static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
> +static int bpf_fib_set_fwd_params(struct net_device *dev,
> +				  struct bpf_fib_lookup *params,
> +				  u32 flags, u32 mtu)
>  {
>  	params->h_vlan_TCI = 0;
>  	params->h_vlan_proto = 0;
> +
> +#if IS_ENABLED(CONFIG_VLAN_8021Q)
> +	/* vlan_dev_priv() is only defined when 8021q is built in or as a
> +	 * module; under !CONFIG_VLAN_8021Q is_vlan_dev() is constant false
> +	 * so this would be dead, but it still has to compile.
> +	 */

Superfluous comment - please drop.

> +	if ((flags & BPF_FIB_LOOKUP_VLAN) && is_vlan_dev(dev)) {
> +		struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
> +
> +		/* Resolve the immediate parent only. For a stacked VLAN
> +		 * (QinQ) the parent is itself a VLAN device, and a single
> +		 * h_vlan_proto/h_vlan_TCI pair cannot describe both tags;
> +		 * leave ifindex and the vlan fields untouched in that case
> +		 * rather than report the lower device with only one tag.
> +		 * The same applies when the parent lives in another netns
> +		 * (a VLAN device can be moved while its parent stays):
> +		 * its ifindex would be meaningless, or match an unrelated
> +		 * device, in the caller's namespace.
> +		 */

And this one - it's redundant with the flag description (and commit message).

-Toke



* [PATCH v2] net: mvneta: free/request IRQ across suspend/resume
From: Yun Zhou @ 2026-06-17  9:20 UTC (permalink / raw)
  To: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	bigeasy, clrkwllms, rostedt
  Cc: netdev, linux-kernel, linux-rt-devel, yun.zhou

On PREEMPT_RT, the mvneta IRQ handler is force-threaded. Under high
network traffic, the IRQ can enter suspend with desc->depth == 1
(masked by the oneshot mechanism between handler invocations).

During suspend, the kernel increments depth to 2 and masks the
interrupt at the MPIC level (clearing the SRC_CTL CPU routing bit,
due to IRQCHIP_MASK_ON_SUSPEND). On resume, depth is decremented
back to 1, but since it does not reach 0, the unmask is never
called. The MPIC CPU routing remains cleared, permanently disabling
interrupt delivery.

Fix by freeing the IRQ in suspend and re-requesting it in resume.
This ensures a clean IRQ state (depth=0, proper hardware routing)
on every resume cycle, regardless of the pre-suspend depth. This
follows the approach used by other drivers (e.g. igb).

Fixes: 9768b45ceb0b ("net: mvneta: support suspend and resume")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v2:
  - Move request_irq before cpuhp registration in resume (matching
    mvneta_open ordering) so that failure does not leave cpuhp
    callbacks registered on a non-functional device.
  - On request_irq failure, call netif_device_detach() to prevent
    further traffic on the dead interface.

 drivers/net/ethernet/marvell/mvneta.c | 29 +++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b4a845f04c05..02ea867d07a3 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -5826,6 +5826,20 @@ static int mvneta_suspend(struct device *device)
 	mvneta_stop_dev(pp);
 	rtnl_unlock();
 
+	/* Release IRQ to avoid stale MPIC mask state on resume.
+	 * On PREEMPT_RT, forced-threaded oneshot IRQs may leave the
+	 * interrupt masked (depth>0) at suspend time. This prevents
+	 * resume_device_irqs() from restoring the MPIC CPU routing,
+	 * permanently disabling the interrupt. Re-requesting the IRQ
+	 * on resume guarantees a clean state.
+	 */
+	if (pp->neta_armada3700)
+		free_irq(dev->irq, pp);
+	else {
+		on_each_cpu(mvneta_percpu_disable, pp, true);
+		free_percpu_irq(dev->irq, pp->ports);
+	}
+
 	for (queue = 0; queue < rxq_number; queue++) {
 		struct mvneta_rx_queue *rxq = &pp->rxqs[queue];
 
@@ -5892,6 +5906,21 @@ static int mvneta_resume(struct device *device)
 		mvneta_txq_hw_init(pp, txq);
 	}
 
+	/* Re-request IRQ (see comment in mvneta_suspend) */
+	if (pp->neta_armada3700) {
+		err = request_irq(dev->irq, mvneta_isr, 0, dev->name, pp);
+	} else {
+		err = request_percpu_irq(dev->irq, mvneta_percpu_isr,
+					dev->name, pp->ports);
+		if (!err)
+			on_each_cpu(mvneta_percpu_enable, pp, true);
+	}
+	if (err) {
+		netdev_err(dev, "cannot request irq %d\n", dev->irq);
+		netif_device_detach(dev);
+		return err;
+	}
+
 	if (!pp->neta_armada3700) {
 		spin_lock(&pp->lock);
 		pp->is_stopped = false;
-- 
2.43.0
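
The depth accounting described in the commit message above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `struct irq_model` and the `model_*` and `delivered_after_cycle` names are hypothetical, `depth` stands in for desc->depth, and `routed` stands in for the MPIC CPU routing bit. It follows the sequence as the commit message describes it and ignores everything else the genirq core does.

```c
#include <stdbool.h>

/* Toy model of one interrupt line; hypothetical, not kernel code. */
struct irq_model {
	unsigned int depth;	/* disable nesting count, like desc->depth */
	bool routed;		/* stands in for the MPIC CPU routing bit */
};

/* Suspend with IRQCHIP_MASK_ON_SUSPEND: one more disable, mask in hardware. */
static void model_suspend(struct irq_model *irq)
{
	irq->depth++;
	irq->routed = false;
}

/* Resume: drop one disable; unmask only when depth reaches zero. */
static void model_resume(struct irq_model *irq)
{
	if (irq->depth > 0)
		irq->depth--;
	if (irq->depth == 0)
		irq->routed = true;
}

/* Free in suspend, re-request in resume: restart from a clean state. */
static void model_free_and_request(struct irq_model *irq)
{
	irq->depth = 0;
	irq->routed = true;
}

/* Returns whether interrupts are delivered after one suspend/resume cycle. */
static bool delivered_after_cycle(unsigned int pre_depth, bool free_in_suspend)
{
	struct irq_model irq = { .depth = pre_depth, .routed = true };

	if (free_in_suspend) {
		model_free_and_request(&irq);
	} else {
		model_suspend(&irq);
		model_resume(&irq);
	}
	return irq.routed;
}
```

In this model, delivered_after_cycle(1, false) is false, because depth goes 1 -> 2 -> 1 and the unmask step is never reached. delivered_after_cycle(1, true) is true regardless of the pre-suspend depth, which is the property the patch relies on.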


