Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 10/11] vfio: selftests: Add mlx5 driver - data path and memcpy ops
From: Jason Gunthorpe @ 2026-05-05 15:49 UTC (permalink / raw)
  To: David Matlack
  Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
	linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
	Tariq Toukan, patches
In-Reply-To: <afkghFm9Vo-SfF5c@google.com>

On Mon, May 04, 2026 at 10:41:08PM +0000, David Matlack wrote:
> On 2026-04-30 09:08 PM, Jason Gunthorpe wrote:
> 
> > @@ -1368,6 +1716,11 @@ static void mlx5st_init(struct vfio_pci_device *device)
> >  	mlx5st_alloc_pd(dev);
> >  	mlx5st_create_mkey(dev);
> >  
> > +	mlx5st_setup_datapath(dev);
> > +
> > +	device->driver.max_memcpy_size = 1 << 20;
> > +	device->driver.max_memcpy_count = SQ_WQE_CNT - 1;
> 
> What are these limits a function of? e.g. Is the 1MB size a hardware
> limit? Can we change SQ_WQE_CNT in the future to increase max count?

I have to check, I don't think there is a small limit on RDMA WRITE,
and SQ_WQE_CNT is just a number it could be somewhat bigger.

This may be a hold over from an ealier version where it was trying to
make a PAS array, and the PAS array was capped. I removed that but
would have missed this.

Jason

^ permalink raw reply

* Re: [PATCH 09/11] vfio: selftests: Add mlx5 driver - HW init and command interface
From: Jason Gunthorpe @ 2026-05-05 15:45 UTC (permalink / raw)
  To: David Matlack
  Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
	linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
	Tariq Toukan, patches
In-Reply-To: <afkfP-8UHaoyLd5Y@google.com>

On Mon, May 04, 2026 at 10:35:43PM +0000, David Matlack wrote:
> On 2026-04-30 09:08 PM, Jason Gunthorpe wrote:
> 
> > +/*
> > + * Driver state — overlaid on device->driver.region.vaddr.
> > + *
> > + * Contains both software-only state and HW-visible DMA buffers. HW buffers need
> > + * strict IOVA alignment.
> > + */
> > +struct mlx5st_device {
> 
> Can we do s/mlx5st/mlx5/ on the series?

No, I don't want to do this. Since it is in tree I want to reserve the
mlx5_ prefix only for the main driver. The driver is huge, I do not
want to harm or confuse grep - that team will get mad.

Jason

^ permalink raw reply

* Re: [PATCH net v5 0/8] xsk: fix bugs around xsk skb allocation
From: Alexander Lobakin @ 2026-05-05 15:44 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
	maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
	john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
In-Reply-To: <20260502200722.53960-1-kerneljasonxing@gmail.com>

From: Jason Xing <kerneljasonxing@gmail.com>
Date: Sat,  2 May 2026 23:07:14 +0300

> From: Jason Xing <kernelxing@tencent.com>
> 
> There are rare issues around xsk_build_skb(). Some of them
> were founded by Sashiko[1][2].

For the series:

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Just don't forget to look at Sashiko's replies, I haven't looked deep
into them, so can't say whether anything there makes sense.

> 
> [1]: https://lore.kernel.org/all/20260415082654.21026-1-kerneljasonxing@gmail.com/
> [2]: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/
> 
> ---
> v5
> Link: https://lore.kernel.org/all/20260424053816.27965-1-kerneljasonxing@gmail.com/
> 1. reword patch 3
> 2. adjust the order of patch 6 that is now moved up to be patch 2)
> 3. use helper in patch (Stan)
> 
> v4
> Link: https://lore.kernel.org/all/20260422033650.68457-1-kerneljasonxing@gmail.com/
> 1. fix 32-bit arch issue in patch 8 (Alexander && Maciej)
> 2. add acked-by tags (Stan)
> 
> v3
> Link: https://lore.kernel.org/all/20260420082805.14844-1-kerneljasonxing@gmail.com/
> 1. use !xs->skb as the indicator of first frag in patch 3 (Stan)
> 2. move init skb after it's allocated successfully, so patch 4,5,6 have
>    been adjusted accordingly (Stan)
> 3. do not support 32-bit arch (Stan)
> 4. add acked-by tags (Stan)
> 
> v2
> Link: https://lore.kernel.org/all/20260418045644.28612-1-kerneljasonxing@gmail.com/#t
> 1. add four patches spotted by sashiko to fix buggy pre-existing
> behavior
> 2. adjust the order of 8 patches.
> 
> 
> 
> Jason Xing (8):
>   xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
>   xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
>   xsk: handle NULL dereference of the skb without frags issue
>   xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
>   xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
>   xsk: avoid skb leak in XDP_TX_METADATA case
>   xsk: fix xsk_addrs slab leak on multi-buffer error path
>   xsk: fix u64 descriptor address truncation on 32-bit architectures
> 
>  net/xdp/xsk.c           | 115 ++++++++++++++++++++++++++--------------
>  net/xdp/xsk_buff_pool.c |   3 ++
>  2 files changed, 77 insertions(+), 41 deletions(-)

Thanks,
Olek

^ permalink raw reply

* Re: [PATCH 05/11] selftests: Add additional kernel functions to tools/include/
From: Jason Gunthorpe @ 2026-05-05 15:43 UTC (permalink / raw)
  To: David Matlack
  Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
	linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
	Tariq Toukan, patches
In-Reply-To: <afkUO56H6KPy5afA@google.com>

On Mon, May 04, 2026 at 09:48:43PM +0000, David Matlack wrote:
> On 2026-04-30 09:08 PM, Jason Gunthorpe wrote:
> > These are needed by the VFIO mlx5 selftest in the following patches,
> > which includes some headers from mlx5 and also needs a few more
> > MMIO-related features.
> > 
> > - DECLARE_FLEX_ARRAY in new tools/include/linux/stddef.h (wraps
> >   existing __DECLARE_FLEX_ARRAY from uapi/linux/stddef.h)
> 
> Is this needed? I don't see it used anywhere.
> 
>   $ git grep DECLARE_FLEX_ARRAY tools/testing/selftests/vfio

I think it was used in an earlier version, I can drop it

Jason

^ permalink raw reply

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-05-05 15:43 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <afmK531eRcPCecKm@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Mon, May 04, 2026 at 11:15:03PM -0700, Shradha Gupta wrote:
> On Sat, May 02, 2026 at 01:15:36PM -0400, Yury Norov wrote:
> > On Sat, May 02, 2026 at 07:37:43AM -0700, Shradha Gupta wrote:
> > > On Fri, May 01, 2026 at 12:22:20PM -0400, Yury Norov wrote:
> > > > On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> > > > > In mana driver, the number of IRQs allocated is capped by the
> > > > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > > > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > > > > their NUMA/core bindings.
> > > > > 
> > > > > This is important, especially in the envs where number of vCPUs are so
> > > > > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > > > > much more than their overheads if they were spread across sibling vCPUs.
> > > > > 
> > > > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > > > IRQs are assigned at a later stage compared to static allocation, other
> > > > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > > > same vCPU, while some vCPUs have none.
> > > > > 
> > > > > In such cases when many parallel TCP connections are tested, the
> > > > > throughput drops significantly.
> > > > > 
> > > > > Test envs:
> > > > > =======================================================
> > > > > Case 1: without this patch
> > > > > =======================================================
> > > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > > 
> > > > > 	TYPE		effective vCPU aff
> > > > > =======================================================
> > > > > IRQ0:	HWC		0
> > > > > IRQ1:	mana_q1		0
> > > > > IRQ2:	mana_q2		2
> > > > > IRQ3:	mana_q3		0
> > > > > IRQ4:	mana_q4		3
> > > > > 
> > > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > > vCPU		0	1	2	3
> > > > > =======================================================
> > > > > pass 1:		38.85	0.03	24.89	24.65
> > > > > pass 2:		39.15	0.03	24.57	25.28
> > > > > pass 3:		40.36	0.03	23.20	23.17
> > > > > 
> > > > > =======================================================
> > > > > Case 2: with this patch
> > > > > =======================================================
> > > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > > 
> > > > >         TYPE            effective vCPU aff
> > > > > =======================================================
> > > > > IRQ0:   HWC             0
> > > > > IRQ1:   mana_q1         0
> > > > > IRQ2:   mana_q2         1
> > > > > IRQ3:   mana_q3         2
> > > > > IRQ4:   mana_q4         3
> > > > > 
> > > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > > vCPU            0       1       2       3
> > > > > =======================================================
> > > > > pass 1:         15.42	15.85	14.99	14.51
> > > > > pass 2:         15.53	15.94	15.81	15.93
> > > > > pass 3:         16.41	16.35	16.40	16.36
> > > > > 
> > > > > =======================================================
> > > > > Throughput Impact(in Gbps, same env)
> > > > > =======================================================
> > > > > TCP conn	with patch	w/o patch
> > > > > 20480		15.65		7.73
> > > > > 10240		15.63		8.93
> > > > > 8192		15.64		9.69
> > > > > 6144		15.64		13.16
> > > > > 4096		15.69		15.75
> > > > > 2048		15.69		15.83
> > > > > 1024		15.71		15.28
> > > > > 
> > > > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > > > Cc: stable@vger.kernel.org
> > > > > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > > > ---
> > > > > Changes in v2
> > > > >  * Removed the unused skip_first_cpu variable
> > > > >  * fixed exit condition in irq_setup_linear() with len == 0
> > > > >  * changed return type of irq_setup_linear() as it will always be 0
> > > > >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> > > > >  * added appropriate comments to indicate expected behaviour when
> > > > >    IRQs are more than or equal to num_online_cpus()
> > > > > ---
> > > > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
> > > > >  1 file changed, 40 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > index 098fbda0d128..d740d1dc43da 100644
> > > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > > @@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > > > >  	} else {
> > > > >  		/* If dynamic allocation is enabled we have already allocated
> > > > >  		 * hwc msi
> > > > > +		 * Also, we make sure in this case the following is always true
> > > > > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> > > > >  		 */
> > > > >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> > > > >  	}
> > > > > @@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > +/* should be called with cpus_read_lock() held */
> > > > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > > > +{
> > > > > +	int cpu;
> > > > > +
> > > > > +	for_each_online_cpu(cpu) {
> > > > > +		if (len == 0)
> > > > > +			break;
> > > > > +
> > > > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > > > +		len--;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > > >  {
> > > > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > > > >  	struct gdma_irq_context *gic;
> > > > > -	bool skip_first_cpu = false;
> > > > >  	int *irqs, irq, err, i;
> > > > >  
> > > > >  	irqs = kmalloc_objs(int, nvec);
> > > > 
> > > > So what about WARN_ON() and nvec adjustment before kmalloc?
> > > Hey Yury,
> > > 
> > > I am still a bit unsure about the WARN_ON() before kmalloc, as after
> > > that also, in the same function till we take the cpus_read_lock() the
> > > num_online_cpus() can change(or reduce). That's why I introduced the
> > > dev_dbg() to capture hot-remove edge case.
> > 
> > OK.
> >  
> > > Do you still think it adds more value?
> > 
> > It's your driver, so you know better. I just wonder because you said
> > it's good to add WARN_ON(), and then didn't do that.
> > 
> > > > 
> > > > > @@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > > > >  	 */
> > > > >  	cpus_read_lock();
> > > > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > > > -		skip_first_cpu = true;
> > > > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > > > > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > > > > +		if (err) {
> > > > > +			cpus_read_unlock();
> > > > > +			goto free_irq;
> > > > 
> > > > One thing puzzles me: if you skip first CPU with this 'true', and the
> > > > gc->num_msix_usable == num_online_cpus(), it's one more than you can
> > > > distribute. What do I miss?
> > > > 
> > > 
> > > Let me explain this case a bit better then,
> > > 
> > > - num_msix_usable = HWC IRQ + Queue IRQ
> > > - nvec in this functions is only Queue IRQ (HWC already setup)
> > > 
> > > When num_online_cpus == num_msix_usable:
> > > - nvec = num_online_cpus - 1
> > > - first CPU is already assigned to HWC IRQ, so skip it
> > > - Queue IRQs fit in the remaining CPUs
> > > 
> > > please let me know if I did not get your question right
> > 
> > Can you put that in a comment?
> 
> Sure I will. thanks
> 
> > 
> > > > > +		}
> > > > > +	} else {
> > > > > +		/*
> > > > > +		 * When num_msix_usable are more than num_online_cpus, we try to
> > > > > +		 * make sure we are using all vcpus. In such a case NUMA or
> > > > > +		 * CPU core affinity does not matter.
> > > > 
> > > > If it doesn't matter, why don't you assign each IRQ to all CPUs then?
> > > > In theory, the system would have most of flexibility to balance them.
> > > > 
> > > 
> > > Okay, let me fix the comment and elaborate on this. It doesn't matter
> > > because in such a case we want to anyway exhaust and distribute the
> > > Queue IRQs to all vCPUs.
> > > We don't want to rely on the system's balancer in this case as it could
> > > be skewed by other devices' IRQ weights
> > 
> > I don't understand this. If I want to reserve some CPUs to solely
> > handle IRQs from my high-priority hardware, then I configure my system
> > accordingly. For example, assign all non-networking IRQs on CPU0, and
> > all networking IRQs to all CPUs.
> > 
> > In your case, you distribute IRQs evenly, which means you've no
> > preferred CPUs. So, assuming the system is only running your IRQ
> > driver, it's at max is as good as all-CPU distribution. In case of
> > heavy loading some particular CPU, your scheme could cause
> > corresponding IRQs to starve.
> > 
> > I recall, when we was working on irq_setup(), the original idea was to
> > distribute IRQs one-to-one, but than I suggested the 
> > 
> >         irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > 
> > and after experiments, you agreed on that.
> > 
> > Can you please run your throughput test for my suggested distribution
> > too? Would be also nice to see how each distribution works when some
> > CPUs are under stress.
> > 
> > Thanks,
> > Yury
> 
> The design of irq_setup() works exactly how we want it for our IRQs for
> almost all of our usecases, so we want to keep that as is. The only
> scenarios where this is an issue in terms of significant throughput drop
> is when we are working with low vCPU VMs (vCPU <= 4 with high TCP
> connection counts) and where there are additional NVMe devices attached
> to the VM.
> 
> The current patch about utilizing all the vCPUs helps in that case and
> doesn't cause any regression for other cases.
> 
> This linear path is only taken when num_msix_usable > num_online_cpus(),
> which is limited to low-vCPU VMs. Larger VMs continue using irq_setup()
> as before.
> 
> We can definately get our throughput run results on other suggestions
> you have. And about that, I just needed a bit more clarity on what to
> test against. Are you suggesting, with irq_setup() intact and in use, we
> configure the non-mana IRQs to say CPU0 and capture the numbers?

Can you try this:

       while(len--)
               // Or cpu_online_mask or cpu_all_mask?
               irq_set_affinity_and_hint(*irqs++, NULL);

And compare it to the linear version under your vCPU scenario?

Can you run your throughput test alone and on parallel with some
IRQ torture test?

        stress-ng --timer 4 --timeout 60s

And maybe pin the stress test to the default CPU. Assuming it's 0:

        taskset -c 0 stress-ng --timer 4 --timeout 60s

Unless the 'linear' version is significantly faster, I'd stick to the
above.

Thanks,
Yury

^ permalink raw reply

* [PATCH net] tcp: tcp_child_process() related UAF
From: Eric Dumazet @ 2026-05-05 15:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Neal Cardwell, Kuniyuki Iwashima, netdev,
	eric.dumazet, Eric Dumazet, Damiano Melotti

tcp_child_process( .. child ...) currently calls sock_put(child).

Unfortunately @child (named @nsk in callers) can be used after
this point to send a RST packet.

To fix this UAF, I remove the sock_put() from tcp_child_process()
and let the callers handle this after it is safe.

Remove @rsk variable in tcp_v4_do_rcv() and change tcp_v6_do_rcv()
so that both functions look the same.

Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Reported-by: Damiano Melotti <melotti@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_ipv4.c      | 14 ++++++--------
 net/ipv4/tcp_minisocks.c |  2 +-
 net/ipv6/tcp_ipv6.c      | 13 ++++++++-----
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 8fc24c3743c5f905f8e07a26fb0edb40fb6ab767..c0526cc0398049fb34b5de20a1175d54942e80cd 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1827,7 +1827,6 @@ INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
 int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 {
 	enum skb_drop_reason reason;
-	struct sock *rsk;
 
 	reason = psp_sk_rx_policy_check(sk, skb);
 	if (reason)
@@ -1863,24 +1862,21 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 			return 0;
 		if (nsk != sk) {
 			reason = tcp_child_process(sk, nsk, skb);
-			if (reason) {
-				rsk = nsk;
+			sock_put(nsk);
+			if (reason)
 				goto reset;
-			}
 			return 0;
 		}
 	} else
 		sock_rps_save_rxhash(sk, skb);
 
 	reason = tcp_rcv_state_process(sk, skb);
-	if (reason) {
-		rsk = sk;
+	if (reason)
 		goto reset;
-	}
 	return 0;
 
 reset:
-	tcp_v4_send_reset(rsk, skb, sk_rst_convert_drop_reason(reason));
+	tcp_v4_send_reset(sk, skb, sk_rst_convert_drop_reason(reason));
 discard:
 	sk_skb_reason_drop(sk, skb, reason);
 	/* Be careful here. If this function gets more complicated and
@@ -2193,8 +2189,10 @@ int tcp_v4_rcv(struct sk_buff *skb)
 
 				rst_reason = sk_rst_convert_drop_reason(drop_reason);
 				tcp_v4_send_reset(nsk, skb, rst_reason);
+				sock_put(nsk);
 				goto discard_and_relse;
 			}
+			sock_put(nsk);
 			sock_put(sk);
 			return 0;
 		}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 199f0b579e89cf25689e74a8d37bb0c022a6c92d..e6092c3ac840bdc1f62d4435c414e7f79edc10c2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -1012,6 +1012,6 @@ enum skb_drop_reason tcp_child_process(struct sock *parent, struct sock *child,
 	}
 
 	bh_unlock_sock(child);
-	sock_put(child);
+
 	return reason;
 }
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2c3f7a739709d7b89f376f79b71173e5f2d8e64e..51583aef0643e92c961fc00f48f1192184d087ed 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1617,12 +1617,13 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
 	if (sk->sk_state == TCP_LISTEN) {
 		struct sock *nsk = tcp_v6_cookie_check(sk, skb);
 
+		if (!nsk)
+			return 0;
 		if (nsk != sk) {
-			if (nsk) {
-				reason = tcp_child_process(sk, nsk, skb);
-				if (reason)
-					goto reset;
-			}
+			reason = tcp_child_process(sk, nsk, skb);
+			sock_put(nsk);
+			if (reason)
+				goto reset;
 			return 0;
 		}
 	} else
@@ -1827,8 +1828,10 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 				rst_reason = sk_rst_convert_drop_reason(drop_reason);
 				tcp_v6_send_reset(nsk, skb, rst_reason);
+				sock_put(nsk);
 				goto discard_and_relse;
 			}
+			sock_put(nsk);
 			sock_put(sk);
 			return 0;
 		}
-- 
2.54.0.545.g6539524ca2-goog


^ permalink raw reply related

* Re: [PATCH net 2/5] net: dsa: mt7530: preserve VLAN tags on trapped link-local frames
From: Chester A. Unal @ 2026-05-05 15:37 UTC (permalink / raw)
  To: Daniel Golle, Andrew Lunn, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Matthias Brugger,
	AngeloGioacchino Del Regno, DENG Qingfang, Florian Fainelli,
	Arınç ÜNAL, Sean Wang, netdev, linux-kernel,
	linux-arm-kernel, linux-mediatek
In-Reply-To: <3ad4b724d7ae6250b8429d50fe913d2dca07a3f9.1777986341.git.daniel@makrotopia.org>

Hey Daniel.

On 05/05/2026 15:16, Daniel Golle wrote:
> The BPC, RGAC1 and RGAC2 registers control the handling of link-local
> frames with reserved MAC DAs (01:80:C2:00:00:0x). These frames are
> correctly trapped to the CPU port, but the egress VLAN tag attribute was
> set to MT7530_VLAN_EG_UNTAGGED which causes the switch to strip any
> VLAN tags from trapped frames before they reach the CPU.
> 
> This causes VLAN-tagged link-local frames (STP BPDUs, LLDP, PTP Peer
> Delay Requests) to arrive at the CPU without their VLAN tag, so they
> are delivered to the base network interface instead of the VLAN
> sub-interface. The DSA local_termination selftest confirms this: all
> link-local protocol tests on VLAN upper interfaces fail.
> 
> Set the EG_TAG attribute to MT7530_VLAN_EG_DISABLED (system default)
> so that the switch does not modify VLAN tags in trapped frames. This
> way VLAN-tagged frames retain their original tag and are delivered to
> the correct VLAN sub-interface, matching the behavior of non-trapped
> frames which pass through without VLAN tag modification.
> 
> Fixes: 69ddba9d170b ("net: dsa: mt7530: fix handling of all link-local frames")
> Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Thank you for this patch. Could you please confirm that it conforms to the
findings documented on this patch log?

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=e8bf353577f382c7066c661fed41b2adc0fc7c40

Cheers.
Chester A.

^ permalink raw reply

* [PATCH iwl-next v5 5/5] ice: add support for transmitting unreadable frags
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
	nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
In-Reply-To: <20260505152923.1040589-1-aleksander.lobakin@intel.com>

Advertise netmem Tx support in ice. The only change needed is to set
ICE_TX_BUF_FRAG conditionally, only when skb_frag_is_net_iov() is
false. Otherwise, the Tx buffer type will be ICE_TX_BUF_EMPTY and
the driver will skip the DMA unmapping operation.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_main.c   |  1 +
 drivers/net/ethernet/intel/ice/ice_sf_eth.c |  1 +
 drivers/net/ethernet/intel/ice/ice_txrx.c   | 17 +++++++++++++----
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index a88f9e3c0077..0e61b38e53a5 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3453,6 +3453,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
 
 	netdev->netdev_ops = &ice_netdev_ops;
 	netdev->queue_mgmt_ops = &ice_queue_mgmt_ops;
+	netdev->netmem_tx = true;
 	netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
 	netdev->xdp_metadata_ops = &ice_xdp_md_ops;
 	ice_set_ethtool_ops(netdev);
diff --git a/drivers/net/ethernet/intel/ice/ice_sf_eth.c b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
index 8d1541712abc..fdf98aa431ba 100644
--- a/drivers/net/ethernet/intel/ice/ice_sf_eth.c
+++ b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
@@ -59,6 +59,7 @@ static int ice_sf_cfg_netdev(struct ice_dynamic_port *dyn_port,
 	ether_addr_copy(netdev->perm_addr, dyn_port->hw_addr);
 	netdev->netdev_ops = &ice_sf_netdev_ops;
 	netdev->queue_mgmt_ops = &ice_queue_mgmt_ops;
+	netdev->netmem_tx = true;
 	SET_NETDEV_DEVLINK_PORT(netdev, devlink_port);
 
 	err = register_netdev(netdev);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 43b467076027..1d97e2cc2ade 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -113,11 +113,17 @@ ice_prgm_fdir_fltr(struct ice_vsi *vsi, struct ice_fltr_desc *fdir_desc,
 static void
 ice_unmap_and_free_tx_buf(struct ice_tx_ring *ring, struct ice_tx_buf *tx_buf)
 {
-	if (tx_buf->type != ICE_TX_BUF_XDP_TX && dma_unmap_len(tx_buf, len))
+	switch (tx_buf->type) {
+	case ICE_TX_BUF_DUMMY:
+	case ICE_TX_BUF_FRAG:
+	case ICE_TX_BUF_SKB:
+	case ICE_TX_BUF_XDP_XMIT:
 		dma_unmap_page(ring->dev,
 			       dma_unmap_addr(tx_buf, dma),
 			       dma_unmap_len(tx_buf, len),
 			       DMA_TO_DEVICE);
+		break;
+	}
 
 	switch (tx_buf->type) {
 	case ICE_TX_BUF_DUMMY:
@@ -338,12 +344,14 @@ static bool ice_clean_tx_irq(struct ice_tx_ring *tx_ring, int napi_budget)
 			}
 
 			/* unmap any remaining paged data */
-			if (dma_unmap_len(tx_buf, len)) {
+			if (tx_buf->type != ICE_TX_BUF_EMPTY) {
 				dma_unmap_page(tx_ring->dev,
 					       dma_unmap_addr(tx_buf, dma),
 					       dma_unmap_len(tx_buf, len),
 					       DMA_TO_DEVICE);
+
 				dma_unmap_len_set(tx_buf, len, 0);
+				tx_buf->type = ICE_TX_BUF_EMPTY;
 			}
 		}
 		ice_trace(clean_tx_irq_unmap_eop, tx_ring, tx_desc, tx_buf);
@@ -1495,7 +1503,8 @@ ice_tx_map(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first,
 				       DMA_TO_DEVICE);
 
 		tx_buf = &tx_ring->tx_buf[i];
-		tx_buf->type = ICE_TX_BUF_FRAG;
+		if (!skb_frag_is_net_iov(frag))
+			tx_buf->type = ICE_TX_BUF_FRAG;
 	}
 
 	/* record SW timestamp if HW timestamp is not available */
@@ -2377,7 +2386,7 @@ void ice_clean_ctrl_tx_irq(struct ice_tx_ring *tx_ring)
 		}
 
 		/* unmap the data header */
-		if (dma_unmap_len(tx_buf, len))
+		if (tx_buf->type != ICE_TX_BUF_EMPTY)
 			dma_unmap_single(tx_ring->dev,
 					 dma_unmap_addr(tx_buf, dma),
 					 dma_unmap_len(tx_buf, len),
-- 
2.54.0


^ permalink raw reply related

* [PATCH iwl-next v5 4/5] ice: implement Rx queue management ops
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
	nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
In-Reply-To: <20260505152923.1040589-1-aleksander.lobakin@intel.com>

Now ice is ready to get queue_mgmt_ops support. It already has API
to disable/reconfig/enable one particular queue (for XSk). Reuse as
much of its code as possible to implement Rx queue management
callbacks and vice versa -- ice_queue_mem_{alloc,free}() can be
reused during ifup/ifdown to elide code duplication.
With this, ice passes the io_uring zcrx selftests, meaning the Rx
part of netmem/MP support is done.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_lib.h    |   5 +
 drivers/net/ethernet/intel/ice/ice_txrx.h   |   2 +
 drivers/net/ethernet/intel/ice/ice_base.c   | 182 +++++++++++++++-----
 drivers/net/ethernet/intel/ice/ice_main.c   |   2 +-
 drivers/net/ethernet/intel/ice/ice_sf_eth.c |   2 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c   |  26 ++-
 6 files changed, 165 insertions(+), 54 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.h b/drivers/net/ethernet/intel/ice/ice_lib.h
index 476fa54ec4e8..e1e85976e523 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_lib.h
@@ -4,6 +4,8 @@
 #ifndef _ICE_LIB_H_
 #define _ICE_LIB_H_
 
+#include <net/netdev_queues.h>
+
 #include "ice.h"
 #include "ice_vlan.h"
 
@@ -135,4 +137,7 @@ void ice_clear_feature_support(struct ice_pf *pf, enum ice_feature f);
 void ice_init_feature_support(struct ice_pf *pf);
 bool ice_vsi_is_rx_queue_active(struct ice_vsi *vsi);
 void ice_vsi_update_l2tsel(struct ice_vsi *vsi, enum ice_l2tsel l2tsel);
+
+extern const struct netdev_queue_mgmt_ops ice_queue_mgmt_ops;
+
 #endif /* !_ICE_LIB_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 5e517f219379..7c22b2015067 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -462,6 +462,8 @@ u16
 ice_select_queue(struct net_device *dev, struct sk_buff *skb,
 		 struct net_device *sb_dev);
 void ice_clean_tx_ring(struct ice_tx_ring *tx_ring);
+void ice_queue_mem_free(struct net_device *dev, void *per_queue_mem);
+void ice_zero_rx_ring(struct ice_rx_ring *rx_ring);
 void ice_clean_rx_ring(struct ice_rx_ring *rx_ring);
 int ice_setup_tx_ring(struct ice_tx_ring *tx_ring);
 int ice_setup_rx_ring(struct ice_rx_ring *rx_ring);
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 1add82d894bb..4e0b8895c303 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -653,6 +653,43 @@ static int ice_rxq_pp_create(struct ice_rx_ring *rq)
 	return err;
 }
 
+static int ice_queue_mem_alloc(struct net_device *dev,
+			       struct netdev_queue_config *qcfg,
+			       void *per_queue_mem, int idx)
+{
+	const struct ice_netdev_priv *priv = netdev_priv(dev);
+	const struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
+	struct ice_rx_ring *new = per_queue_mem;
+	int ret;
+
+	new->count = real->count;
+	new->netdev = real->netdev;
+	new->q_index = real->q_index;
+	new->q_vector = real->q_vector;
+	new->vsi = real->vsi;
+
+	ret = ice_rxq_pp_create(new);
+	if (ret)
+		return ret;
+
+	if (!netif_running(dev))
+		return 0;
+
+	ret = __xdp_rxq_info_reg(&new->xdp_rxq, new->netdev, new->q_index,
+				 new->q_vector->napi.napi_id, new->truesize);
+	if (ret)
+		goto err_destroy_fq;
+
+	xdp_rxq_info_attach_page_pool(&new->xdp_rxq, new->pp);
+
+	return 0;
+
+err_destroy_fq:
+	ice_rxq_pp_destroy(new);
+
+	return ret;
+}
+
 /**
  * ice_vsi_cfg_rxq - Configure an Rx queue
  * @ring: the ring being configured
@@ -691,19 +728,10 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 			dev_info(dev, "Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
 				 ring->q_index);
 		} else {
-			err = ice_rxq_pp_create(ring);
+			err = ice_queue_mem_alloc(ring->netdev, NULL, ring,
+						  ring->q_index);
 			if (err)
 				return err;
-
-			err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
-						 ring->q_index,
-						 ring->q_vector->napi.napi_id,
-						 ring->truesize);
-			if (err)
-				goto err_destroy_fq;
-
-			xdp_rxq_info_attach_page_pool(&ring->xdp_rxq,
-						      ring->pp);
 		}
 	}
 
@@ -712,7 +740,7 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 	if (err) {
 		dev_err(dev, "ice_setup_rx_ctx failed for RxQ %d, err %d\n",
 			ring->q_index, err);
-		goto err_destroy_fq;
+		goto err_clean_rq;
 	}
 
 	if (ring->xsk_pool) {
@@ -743,12 +771,12 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 		err = ice_alloc_rx_bufs(ring, num_bufs);
 
 	if (err)
-		goto err_destroy_fq;
+		goto err_clean_rq;
 
 	return 0;
 
-err_destroy_fq:
-	ice_rxq_pp_destroy(ring);
+err_clean_rq:
+	ice_clean_rx_ring(ring);
 
 	return err;
 }
@@ -1460,27 +1488,7 @@ static void ice_qp_reset_stats(struct ice_vsi *vsi, u16 q_idx)
 		       sizeof(vsi->xdp_rings[q_idx]->ring_stats->stats));
 }
 
-/**
- * ice_qp_clean_rings - Cleans all the rings of a given index
- * @vsi: VSI that contains rings of interest
- * @q_idx: ring index in array
- */
-static void ice_qp_clean_rings(struct ice_vsi *vsi, u16 q_idx)
-{
-	ice_clean_tx_ring(vsi->tx_rings[q_idx]);
-	if (vsi->xdp_rings)
-		ice_clean_tx_ring(vsi->xdp_rings[q_idx]);
-	ice_clean_rx_ring(vsi->rx_rings[q_idx]);
-}
-
-/**
- * ice_qp_dis - Disables a queue pair
- * @vsi: VSI of interest
- * @q_idx: ring index in array
- *
- * Returns 0 on success, negative on failure.
- */
-int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
+static int __ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
 {
 	struct ice_txq_meta txq_meta = { };
 	struct ice_q_vector *q_vector;
@@ -1519,23 +1527,35 @@ int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
 	}
 
 	ice_vsi_ctrl_one_rx_ring(vsi, false, q_idx, false);
-	ice_qp_clean_rings(vsi, q_idx);
 	ice_qp_reset_stats(vsi, q_idx);
 
+	ice_clean_tx_ring(vsi->tx_rings[q_idx]);
+	if (vsi->xdp_rings)
+		ice_clean_tx_ring(vsi->xdp_rings[q_idx]);
+
 	return fail;
 }
 
 /**
- * ice_qp_ena - Enables a queue pair
+ * ice_qp_dis - Disables a queue pair
  * @vsi: VSI of interest
  * @q_idx: ring index in array
  *
  * Returns 0 on success, negative on failure.
  */
-int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
+int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
+{
+	int ret;
+
+	ret = __ice_qp_dis(vsi, q_idx);
+	ice_clean_rx_ring(vsi->rx_rings[q_idx]);
+
+	return ret;
+}
+
+static int __ice_qp_ena(struct ice_vsi *vsi, u16 q_idx, int fail)
 {
 	struct ice_q_vector *q_vector;
-	int fail = 0;
 	bool link_up;
 	int err;
 
@@ -1553,10 +1573,6 @@ int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
 		ice_tx_xsk_pool(vsi, q_idx);
 	}
 
-	err = ice_vsi_cfg_single_rxq(vsi, q_idx);
-	if (!fail)
-		fail = err;
-
 	q_vector = vsi->rx_rings[q_idx]->q_vector;
 	ice_qvec_cfg_msix(vsi, q_vector, q_idx);
 
@@ -1577,3 +1593,81 @@ int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
 
 	return fail;
 }
+
+/**
+ * ice_qp_ena - Enables a queue pair
+ * @vsi: VSI of interest
+ * @q_idx: ring index in array
+ *
+ * Returns 0 on success, negative on failure.
+ */
+int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
+{
+	return __ice_qp_ena(vsi, q_idx, ice_vsi_cfg_single_rxq(vsi, q_idx));
+}
+
+static int ice_queue_start(struct net_device *dev,
+			   struct netdev_queue_config *qcfg,
+			   void *per_queue_mem, int idx)
+{
+	const struct ice_netdev_priv *priv = netdev_priv(dev);
+	struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
+	struct ice_rx_ring *new = per_queue_mem;
+	struct napi_struct *napi;
+	int ret;
+
+	real->pp = new->pp;
+	real->rx_fqes = new->rx_fqes;
+	real->hdr_fqes = new->hdr_fqes;
+	real->hdr_pp = new->hdr_pp;
+
+	real->hdr_truesize = new->hdr_truesize;
+	real->truesize = new->truesize;
+	real->rx_hdr_len = new->rx_hdr_len;
+	real->rx_buf_len = new->rx_buf_len;
+
+	memcpy(&real->xdp_rxq, &new->xdp_rxq, sizeof(new->xdp_rxq));
+
+	ret = ice_setup_rx_ctx(real);
+	if (ret)
+		return ret;
+
+	napi = &real->q_vector->napi;
+
+	page_pool_enable_direct_recycling(real->pp, napi);
+	if (real->hdr_pp)
+		page_pool_enable_direct_recycling(real->hdr_pp, napi);
+
+	ret = ice_alloc_rx_bufs(real, ICE_DESC_UNUSED(real));
+
+	return __ice_qp_ena(priv->vsi, idx, ret);
+}
+
+static int ice_queue_stop(struct net_device *dev, void *per_queue_mem,
+			  int idx)
+{
+	const struct ice_netdev_priv *priv = netdev_priv(dev);
+	struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
+	int ret;
+
+	ret = __ice_qp_dis(priv->vsi, idx);
+	if (ret)
+		return ret;
+
+	page_pool_disable_direct_recycling(real->pp);
+	if (real->hdr_pp)
+		page_pool_disable_direct_recycling(real->hdr_pp);
+
+	ice_zero_rx_ring(real);
+	memcpy(per_queue_mem, real, sizeof(*real));
+
+	return 0;
+}
+
+const struct netdev_queue_mgmt_ops ice_queue_mgmt_ops = {
+	.ndo_queue_mem_alloc	= ice_queue_mem_alloc,
+	.ndo_queue_mem_free	= ice_queue_mem_free,
+	.ndo_queue_mem_size	= sizeof(struct ice_rx_ring),
+	.ndo_queue_start	= ice_queue_start,
+	.ndo_queue_stop		= ice_queue_stop,
+};
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 50975fe7cab7..a88f9e3c0077 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3452,7 +3452,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
 	}
 
 	netdev->netdev_ops = &ice_netdev_ops;
-	netdev->request_ops_lock = true;
+	netdev->queue_mgmt_ops = &ice_queue_mgmt_ops;
 	netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
 	netdev->xdp_metadata_ops = &ice_xdp_md_ops;
 	ice_set_ethtool_ops(netdev);
diff --git a/drivers/net/ethernet/intel/ice/ice_sf_eth.c b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
index cd6ba53a873b..8d1541712abc 100644
--- a/drivers/net/ethernet/intel/ice/ice_sf_eth.c
+++ b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
@@ -58,7 +58,7 @@ static int ice_sf_cfg_netdev(struct ice_dynamic_port *dyn_port,
 	eth_hw_addr_set(netdev, dyn_port->hw_addr);
 	ether_addr_copy(netdev->perm_addr, dyn_port->hw_addr);
 	netdev->netdev_ops = &ice_sf_netdev_ops;
-	netdev->request_ops_lock = true;
+	netdev->queue_mgmt_ops = &ice_queue_mgmt_ops;
 	SET_NETDEV_DEVLINK_PORT(netdev, devlink_port);
 
 	err = register_netdev(netdev);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 4ca1a0602307..43b467076027 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -531,17 +531,13 @@ void ice_rxq_pp_destroy(struct ice_rx_ring *rq)
 	rq->hdr_pp = NULL;
 }
 
-/**
- * ice_clean_rx_ring - Free Rx buffers
- * @rx_ring: ring to be cleaned
- */
-void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
+void ice_queue_mem_free(struct net_device *dev, void *per_queue_mem)
 {
-	u32 size;
+	struct ice_rx_ring *rx_ring = per_queue_mem;
 
 	if (rx_ring->xsk_pool) {
 		ice_xsk_clean_rx_ring(rx_ring);
-		goto rx_skip_free;
+		return;
 	}
 
 	/* ring already cleared, nothing to do */
@@ -570,8 +566,12 @@ void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
 	}
 
 	ice_rxq_pp_destroy(rx_ring);
+}
+
+void ice_zero_rx_ring(struct ice_rx_ring *rx_ring)
+{
+	size_t size;
 
-rx_skip_free:
 	/* Zero out the descriptor ring */
 	size = ALIGN(rx_ring->count * sizeof(union ice_32byte_rx_desc),
 		     PAGE_SIZE);
@@ -581,6 +581,16 @@ void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
 	rx_ring->next_to_use = 0;
 }
 
+/**
+ * ice_clean_rx_ring - Free Rx buffers
+ * @rx_ring: ring to be cleaned
+ */
+void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
+{
+	ice_queue_mem_free(rx_ring->netdev, rx_ring);
+	ice_zero_rx_ring(rx_ring);
+}
+
 /**
  * ice_free_rx_ring - Free Rx resources
  * @rx_ring: ring to clean the resources from
-- 
2.54.0


^ permalink raw reply related

* [PATCH iwl-next v5 3/5] ice: migrate to netdev ops lock
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
	nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
In-Reply-To: <20260505152923.1040589-1-aleksander.lobakin@intel.com>

Queue management ops unconditionally enable netdev locking. The same
lock is taken by default by several NAPI configuration functions,
such as napi_enable() and netif_napi_set_irq().
Request ops locking in advance and make sure we use the _locked
counterparts of those functions to avoid deadlocks, taking the lock
manually where needed (suspend/resume, queue rebuild and resets).

Co-developed-by: Kohei Enju <kohei@enjuk.jp> # ice_xdp_setup_prog(), safe mode
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_base.h    |   2 +
 drivers/net/ethernet/intel/ice/ice_lib.h     |  13 +-
 drivers/net/ethernet/intel/ice/ice_base.c    |  63 ++++-
 drivers/net/ethernet/intel/ice/ice_dcb_lib.c |  15 +-
 drivers/net/ethernet/intel/ice/ice_eswitch.c |  26 ++-
 drivers/net/ethernet/intel/ice/ice_lib.c     | 227 +++++++++++++++----
 drivers/net/ethernet/intel/ice/ice_main.c    |  78 ++++---
 drivers/net/ethernet/intel/ice/ice_sf_eth.c  |   3 +
 drivers/net/ethernet/intel/ice/ice_xsk.c     |   4 +-
 9 files changed, 324 insertions(+), 107 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.h b/drivers/net/ethernet/intel/ice/ice_base.h
index d28294247599..99b2c7232829 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.h
+++ b/drivers/net/ethernet/intel/ice/ice_base.h
@@ -12,8 +12,10 @@ int __ice_vsi_get_qs(struct ice_qs_cfg *qs_cfg);
 int
 ice_vsi_ctrl_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx, bool wait);
 int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx);
+int ice_vsi_alloc_q_vectors_locked(struct ice_vsi *vsi);
 int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi);
 void ice_vsi_map_rings_to_vectors(struct ice_vsi *vsi);
+void ice_vsi_free_q_vectors_locked(struct ice_vsi *vsi);
 void ice_vsi_free_q_vectors(struct ice_vsi *vsi);
 int ice_vsi_cfg_single_txq(struct ice_vsi *vsi, struct ice_tx_ring **tx_rings,
 			   u16 q_idx);
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.h b/drivers/net/ethernet/intel/ice/ice_lib.h
index 49454d98dcfe..476fa54ec4e8 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_lib.h
@@ -53,19 +53,24 @@ struct ice_vsi *
 ice_vsi_setup(struct ice_pf *pf, struct ice_vsi_cfg_params *params);
 
 void ice_vsi_set_napi_queues(struct ice_vsi *vsi);
-void ice_napi_add(struct ice_vsi *vsi);
-
+void ice_vsi_set_napi_queues_locked(struct ice_vsi *vsi);
 void ice_vsi_clear_napi_queues(struct ice_vsi *vsi);
+void ice_vsi_clear_napi_queues_locked(struct ice_vsi *vsi);
+
+void ice_napi_add(struct ice_vsi *vsi);
 
 int ice_vsi_release(struct ice_vsi *vsi);
 
 void ice_vsi_close(struct ice_vsi *vsi);
 
-int ice_ena_vsi(struct ice_vsi *vsi, bool locked);
+int ice_ena_vsi_locked(struct ice_vsi *vsi);
+int ice_ena_vsi(struct ice_vsi *vsi);
 
 void ice_vsi_decfg(struct ice_vsi *vsi);
-void ice_dis_vsi(struct ice_vsi *vsi, bool locked);
+void ice_dis_vsi_locked(struct ice_vsi *vsi);
+void ice_dis_vsi(struct ice_vsi *vsi);
 
+int ice_vsi_rebuild_locked(struct ice_vsi *vsi, u32 vsi_flags);
 int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags);
 int ice_vsi_cfg(struct ice_vsi *vsi);
 struct ice_vsi *ice_vsi_alloc(struct ice_pf *pf);
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index f162cdfc62a7..1add82d894bb 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -155,8 +155,8 @@ static int ice_vsi_alloc_q_vector(struct ice_vsi *vsi, u16 v_idx)
 	 * handler here (i.e. resume, reset/rebuild, etc.)
 	 */
 	if (vsi->netdev)
-		netif_napi_add_config(vsi->netdev, &q_vector->napi,
-				      ice_napi_poll, v_idx);
+		netif_napi_add_config_locked(vsi->netdev, &q_vector->napi,
+					     ice_napi_poll, v_idx);
 
 out:
 	/* tie q_vector and VSI together */
@@ -198,7 +198,7 @@ static void ice_free_q_vector(struct ice_vsi *vsi, int v_idx)
 
 	/* only VSI with an associated netdev is set up with NAPI */
 	if (vsi->netdev)
-		netif_napi_del(&q_vector->napi);
+		netif_napi_del_locked(&q_vector->napi);
 
 	/* release MSIX interrupt if q_vector had interrupt allocated */
 	if (q_vector->irq.index < 0)
@@ -886,13 +886,15 @@ int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
 }
 
 /**
- * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
+ * ice_vsi_alloc_q_vectors_locked - Allocate memory for interrupt vectors
  * @vsi: the VSI being configured
  *
- * We allocate one q_vector per queue interrupt. If allocation fails we
- * return -ENOMEM.
+ * Should be called only under the netdev lock.
+ * We allocate one q_vector per queue interrupt.
+ *
+ * Return: 0 on success, -ENOMEM if allocation fails.
  */
-int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
+int ice_vsi_alloc_q_vectors_locked(struct ice_vsi *vsi)
 {
 	struct device *dev = ice_pf_to_dev(vsi->back);
 	u16 v_idx;
@@ -919,6 +921,30 @@ int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
 	return v_idx ? 0 : err;
 }
 
+/**
+ * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
+ * @vsi: the VSI being configured
+ *
+ * We allocate one q_vector per queue interrupt.
+ *
+ * Return: 0 on success, -ENOMEM if allocation fails.
+ */
+int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
+{
+	struct net_device *dev = vsi->netdev;
+	int ret;
+
+	if (dev)
+		netdev_lock(dev);
+
+	ret = ice_vsi_alloc_q_vectors_locked(vsi);
+
+	if (dev)
+		netdev_unlock(dev);
+
+	return ret;
+}
+
 /**
  * ice_vsi_map_rings_to_vectors - Map VSI rings to interrupt vectors
  * @vsi: the VSI being configured
@@ -982,10 +1008,12 @@ void ice_vsi_map_rings_to_vectors(struct ice_vsi *vsi)
 }
 
 /**
- * ice_vsi_free_q_vectors - Free memory allocated for interrupt vectors
+ * ice_vsi_free_q_vectors_locked - Free memory allocated for interrupt vectors
  * @vsi: the VSI having memory freed
+ *
+ * Should be called only under the netdev lock.
  */
-void ice_vsi_free_q_vectors(struct ice_vsi *vsi)
+void ice_vsi_free_q_vectors_locked(struct ice_vsi *vsi)
 {
 	int v_idx;
 
@@ -995,6 +1023,23 @@ void ice_vsi_free_q_vectors(struct ice_vsi *vsi)
 	vsi->num_q_vectors = 0;
 }
 
+/**
+ * ice_vsi_free_q_vectors - Free memory allocated for interrupt vectors
+ * @vsi: the VSI having memory freed
+ */
+void ice_vsi_free_q_vectors(struct ice_vsi *vsi)
+{
+	struct net_device *dev = vsi->netdev;
+
+	if (dev)
+		netdev_lock(dev);
+
+	ice_vsi_free_q_vectors_locked(vsi);
+
+	if (dev)
+		netdev_unlock(dev);
+}
+
 /**
  * ice_cfg_tstamp - Configure Tx time stamp queue
  * @tx_ring: Tx ring to be configured with timestamping
diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
index 16aa25535152..7d89c0acc5d8 100644
--- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
@@ -273,14 +273,13 @@ void ice_vsi_cfg_dcb_rings(struct ice_vsi *vsi)
  * ice_dcb_ena_dis_vsi - disable certain VSIs for DCB config/reconfig
  * @pf: pointer to the PF instance
  * @ena: true to enable VSIs, false to disable
- * @locked: true if caller holds RTNL lock, false otherwise
  *
  * Before a new DCB configuration can be applied, VSIs of type PF, SWITCHDEV
  * and CHNL need to be brought down. Following completion of DCB configuration
  * the VSIs that were downed need to be brought up again. This helper function
  * does both.
  */
-static void ice_dcb_ena_dis_vsi(struct ice_pf *pf, bool ena, bool locked)
+static void ice_dcb_ena_dis_vsi(struct ice_pf *pf, bool ena)
 {
 	int i;
 
@@ -294,9 +293,9 @@ static void ice_dcb_ena_dis_vsi(struct ice_pf *pf, bool ena, bool locked)
 		case ICE_VSI_CHNL:
 		case ICE_VSI_PF:
 			if (ena)
-				ice_ena_vsi(vsi, locked);
+				ice_ena_vsi(vsi);
 			else
-				ice_dis_vsi(vsi, locked);
+				ice_dis_vsi(vsi);
 			break;
 		default:
 			continue;
@@ -416,7 +415,7 @@ int ice_pf_dcb_cfg(struct ice_pf *pf, struct ice_dcbx_cfg *new_cfg, bool locked)
 		rtnl_lock();
 
 	/* disable VSIs affected by DCB changes */
-	ice_dcb_ena_dis_vsi(pf, false, true);
+	ice_dcb_ena_dis_vsi(pf, false);
 
 	memcpy(curr_cfg, new_cfg, sizeof(*curr_cfg));
 	memcpy(&curr_cfg->etsrec, &curr_cfg->etscfg, sizeof(curr_cfg->etsrec));
@@ -445,7 +444,7 @@ int ice_pf_dcb_cfg(struct ice_pf *pf, struct ice_dcbx_cfg *new_cfg, bool locked)
 
 out:
 	/* enable previously downed VSIs */
-	ice_dcb_ena_dis_vsi(pf, true, true);
+	ice_dcb_ena_dis_vsi(pf, true);
 	if (!locked)
 		rtnl_unlock();
 free_cfg:
@@ -1107,7 +1106,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
 
 	rtnl_lock();
 	/* disable VSIs affected by DCB changes */
-	ice_dcb_ena_dis_vsi(pf, false, true);
+	ice_dcb_ena_dis_vsi(pf, false);
 
 	ret = ice_query_port_ets(pi, &buf, sizeof(buf), NULL);
 	if (ret) {
@@ -1119,7 +1118,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
 	ice_pf_dcb_recfg(pf, false);
 
 	/* enable previously downed VSIs */
-	ice_dcb_ena_dis_vsi(pf, true, true);
+	ice_dcb_ena_dis_vsi(pf, true);
 unlock_rtnl:
 	rtnl_unlock();
 out:
diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
index 2e4f0969035f..af0cc77fbf71 100644
--- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
+++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
@@ -23,10 +23,16 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
 	struct net_device *netdev = uplink_vsi->netdev;
 	bool if_running = netif_running(netdev);
 	struct ice_vsi_vlan_ops *vlan_ops;
+	int ret;
+
+	if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state)) {
+		netdev_lock(netdev);
+		ret = ice_down(uplink_vsi);
+		netdev_unlock(netdev);
 
-	if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state))
-		if (ice_down(uplink_vsi))
+		if (ret)
 			return -ENODEV;
+	}
 
 	ice_remove_vsi_fltr(&pf->hw, uplink_vsi->idx);
 	ice_vsi_cfg_sw_lldp(uplink_vsi, true, false);
@@ -53,8 +59,14 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
 	if (ice_vsi_update_local_lb(uplink_vsi, true))
 		goto err_override_local_lb;
 
-	if (if_running && ice_up(uplink_vsi))
-		goto err_up;
+	if (if_running) {
+		netdev_lock(netdev);
+		ret = ice_up(uplink_vsi);
+		netdev_unlock(netdev);
+
+		if (ret)
+			goto err_up;
+	}
 
 	return 0;
 
@@ -74,8 +86,12 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
 	ice_fltr_add_mac_and_broadcast(uplink_vsi,
 				       uplink_vsi->port_info->mac.perm_addr,
 				       ICE_FWD_TO_VSI);
-	if (if_running)
+
+	if (if_running) {
+		netdev_lock(netdev);
 		ice_up(uplink_vsi);
+		netdev_unlock(netdev);
+	}
 
 	return -ENODEV;
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 837b71b7b2b7..6760aac609f2 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2304,10 +2304,14 @@ static int ice_vsi_cfg_tc_lan(struct ice_pf *pf, struct ice_vsi *vsi)
 }
 
 /**
- * ice_vsi_cfg_def - configure default VSI based on the type
+ * ice_vsi_cfg_def_locked - configure default VSI based on the type
  * @vsi: pointer to VSI
+ *
+ * Should be called only with the netdev lock taken.
+ *
+ * Return: 0 on success, -errno on failure.
  */
-static int ice_vsi_cfg_def(struct ice_vsi *vsi)
+static int ice_vsi_cfg_def_locked(struct ice_vsi *vsi)
 {
 	struct device *dev = ice_pf_to_dev(vsi->back);
 	struct ice_pf *pf = vsi->back;
@@ -2350,7 +2354,7 @@ static int ice_vsi_cfg_def(struct ice_vsi *vsi)
 	case ICE_VSI_CTRL:
 	case ICE_VSI_SF:
 	case ICE_VSI_PF:
-		ret = ice_vsi_alloc_q_vectors(vsi);
+		ret = ice_vsi_alloc_q_vectors_locked(vsi);
 		if (ret)
 			goto unroll_vsi_init;
 
@@ -2400,7 +2404,7 @@ static int ice_vsi_cfg_def(struct ice_vsi *vsi)
 		 * creates a VSI and corresponding structures for bookkeeping
 		 * purpose
 		 */
-		ret = ice_vsi_alloc_q_vectors(vsi);
+		ret = ice_vsi_alloc_q_vectors_locked(vsi);
 		if (ret)
 			goto unroll_vsi_init;
 
@@ -2424,7 +2428,7 @@ static int ice_vsi_cfg_def(struct ice_vsi *vsi)
 		}
 		break;
 	case ICE_VSI_LB:
-		ret = ice_vsi_alloc_q_vectors(vsi);
+		ret = ice_vsi_alloc_q_vectors_locked(vsi);
 		if (ret)
 			goto unroll_vsi_init;
 
@@ -2451,7 +2455,7 @@ static int ice_vsi_cfg_def(struct ice_vsi *vsi)
 unroll_vector_base:
 	/* reclaim SW interrupts back to the common pool */
 unroll_alloc_q_vector:
-	ice_vsi_free_q_vectors(vsi);
+	ice_vsi_free_q_vectors_locked(vsi);
 unroll_vsi_init:
 	ice_vsi_delete_from_hw(vsi);
 unroll_get_qs:
@@ -2463,6 +2467,28 @@ static int ice_vsi_cfg_def(struct ice_vsi *vsi)
 	return ret;
 }
 
+/**
+ * ice_vsi_cfg_def - configure default VSI based on the type
+ * @vsi: pointer to VSI
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+static int ice_vsi_cfg_def(struct ice_vsi *vsi)
+{
+	struct net_device *dev = vsi->netdev;
+	int ret;
+
+	if (dev)
+		netdev_lock(dev);
+
+	ret = ice_vsi_cfg_def_locked(vsi);
+
+	if (dev)
+		netdev_unlock(dev);
+
+	return ret;
+}
+
 /**
  * ice_vsi_cfg - configure a previously allocated VSI
  * @vsi: pointer to VSI
@@ -2497,10 +2523,12 @@ int ice_vsi_cfg(struct ice_vsi *vsi)
 }
 
 /**
- * ice_vsi_decfg - remove all VSI configuration
+ * ice_vsi_decfg_locked - remove all VSI configuration
  * @vsi: pointer to VSI
+ *
+ * Should be called only under the netdev lock.
  */
-void ice_vsi_decfg(struct ice_vsi *vsi)
+static void ice_vsi_decfg_locked(struct ice_vsi *vsi)
 {
 	struct ice_pf *pf = vsi->back;
 	int err;
@@ -2518,7 +2546,7 @@ void ice_vsi_decfg(struct ice_vsi *vsi)
 		ice_destroy_xdp_rings(vsi, ICE_XDP_CFG_PART);
 
 	ice_vsi_clear_rings(vsi);
-	ice_vsi_free_q_vectors(vsi);
+	ice_vsi_free_q_vectors_locked(vsi);
 	ice_vsi_put_qs(vsi);
 	ice_vsi_free_arrays(vsi);
 
@@ -2533,6 +2561,23 @@ void ice_vsi_decfg(struct ice_vsi *vsi)
 		vsi->agg_node->num_vsis--;
 }
 
+/**
+ * ice_vsi_decfg - remove all VSI configuration
+ * @vsi: pointer to VSI
+ */
+void ice_vsi_decfg(struct ice_vsi *vsi)
+{
+	struct net_device *dev = vsi->netdev;
+
+	if (dev)
+		netdev_lock(dev);
+
+	ice_vsi_decfg_locked(vsi);
+
+	if (dev)
+		netdev_unlock(dev);
+}
+
 /**
  * ice_vsi_setup - Set up a VSI by a given type
  * @pf: board private structure
@@ -2706,18 +2751,19 @@ void ice_vsi_close(struct ice_vsi *vsi)
 	if (!test_and_set_bit(ICE_VSI_DOWN, vsi->state))
 		ice_down(vsi);
 
-	ice_vsi_clear_napi_queues(vsi);
+	ice_vsi_clear_napi_queues_locked(vsi);
 	ice_vsi_free_irq(vsi);
 	ice_vsi_free_tx_rings(vsi);
 	ice_vsi_free_rx_rings(vsi);
 }
 
 /**
- * ice_ena_vsi - resume a VSI
- * @vsi: the VSI being resume
- * @locked: is the rtnl_lock already held
+ * ice_ena_vsi_locked - resume a VSI (without taking the netdev lock)
+ * @vsi: VSI to resume
+ *
+ * Return: 0 on success, -errno on failure.
  */
-int ice_ena_vsi(struct ice_vsi *vsi, bool locked)
+int ice_ena_vsi_locked(struct ice_vsi *vsi)
 {
 	int err = 0;
 
@@ -2728,15 +2774,8 @@ int ice_ena_vsi(struct ice_vsi *vsi, bool locked)
 
 	if (vsi->netdev && (vsi->type == ICE_VSI_PF ||
 			    vsi->type == ICE_VSI_SF)) {
-		if (netif_running(vsi->netdev)) {
-			if (!locked)
-				rtnl_lock();
-
+		if (netif_running(vsi->netdev))
 			err = ice_open_internal(vsi->netdev);
-
-			if (!locked)
-				rtnl_unlock();
-		}
 	} else if (vsi->type == ICE_VSI_CTRL) {
 		err = ice_vsi_open_ctrl(vsi);
 	}
@@ -2745,11 +2784,34 @@ int ice_ena_vsi(struct ice_vsi *vsi, bool locked)
 }
 
 /**
- * ice_dis_vsi - pause a VSI
+ * ice_ena_vsi - resume a VSI
+ * @vsi: VSI to resume
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int ice_ena_vsi(struct ice_vsi *vsi)
+{
+	struct net_device *dev = vsi->netdev;
+	int ret;
+
+	if (dev)
+		netdev_lock(dev);
+
+	ret = ice_ena_vsi_locked(vsi);
+
+	if (dev)
+		netdev_unlock(dev);
+
+	return ret;
+}
+
+/**
+ * ice_dis_vsi_locked - pause a VSI
  * @vsi: the VSI being paused
- * @locked: is the rtnl_lock already held
+ *
+ * The caller must always hold the netdev lock.
  */
-void ice_dis_vsi(struct ice_vsi *vsi, bool locked)
+void ice_dis_vsi_locked(struct ice_vsi *vsi)
 {
 	bool already_down = test_bit(ICE_VSI_DOWN, vsi->state);
 
@@ -2758,14 +2820,9 @@ void ice_dis_vsi(struct ice_vsi *vsi, bool locked)
 	if (vsi->netdev && (vsi->type == ICE_VSI_PF ||
 			    vsi->type == ICE_VSI_SF)) {
 		if (netif_running(vsi->netdev)) {
-			if (!locked)
-				rtnl_lock();
 			already_down = test_bit(ICE_VSI_DOWN, vsi->state);
 			if (!already_down)
 				ice_vsi_close(vsi);
-
-			if (!locked)
-				rtnl_unlock();
 		} else if (!already_down) {
 			ice_vsi_close(vsi);
 		}
@@ -2775,12 +2832,30 @@ void ice_dis_vsi(struct ice_vsi *vsi, bool locked)
 }
 
 /**
- * ice_vsi_set_napi_queues - associate netdev queues with napi
+ * ice_dis_vsi - pause a VSI
+ * @vsi: the VSI being paused
+ */
+void ice_dis_vsi(struct ice_vsi *vsi)
+{
+	struct net_device *dev = vsi->netdev;
+
+	if (dev)
+		netdev_lock(dev);
+
+	ice_dis_vsi_locked(vsi);
+
+	if (dev)
+		netdev_unlock(dev);
+}
+
+/**
+ * ice_vsi_set_napi_queues_locked - associate netdev queues with napi
  * @vsi: VSI pointer
  *
  * Associate queue[s] with napi for all vectors.
+ * Must be called only with the netdev_lock taken.
  */
-void ice_vsi_set_napi_queues(struct ice_vsi *vsi)
+void ice_vsi_set_napi_queues_locked(struct ice_vsi *vsi)
 {
 	struct net_device *netdev = vsi->netdev;
 	int q_idx, v_idx;
@@ -2788,7 +2863,6 @@ void ice_vsi_set_napi_queues(struct ice_vsi *vsi)
 	if (!netdev)
 		return;
 
-	ASSERT_RTNL();
 	ice_for_each_rxq(vsi, q_idx)
 		if (vsi->rx_rings[q_idx] && vsi->rx_rings[q_idx]->q_vector)
 			netif_queue_set_napi(netdev, q_idx, NETDEV_QUEUE_TYPE_RX,
@@ -2802,17 +2876,37 @@ void ice_vsi_set_napi_queues(struct ice_vsi *vsi)
 	ice_for_each_q_vector(vsi, v_idx) {
 		struct ice_q_vector *q_vector = vsi->q_vectors[v_idx];
 
-		netif_napi_set_irq(&q_vector->napi, q_vector->irq.virq);
+		netif_napi_set_irq_locked(&q_vector->napi, q_vector->irq.virq);
 	}
 }
 
 /**
- * ice_vsi_clear_napi_queues - dissociate netdev queues from napi
+ * ice_vsi_set_napi_queues - associate VSI queues with NAPIs
  * @vsi: VSI pointer
  *
+ * Version of ice_vsi_set_napi_queues_locked() that takes the netdev_lock,
+ * to use it outside of the net_device_ops context.
+ */
+void ice_vsi_set_napi_queues(struct ice_vsi *vsi)
+{
+	struct net_device *netdev = vsi->netdev;
+
+	if (!netdev)
+		return;
+
+	netdev_lock(netdev);
+	ice_vsi_set_napi_queues_locked(vsi);
+	netdev_unlock(netdev);
+}
+
+/**
+ * ice_vsi_clear_napi_queues_locked - dissociate netdev queues from napi
+ * @vsi: VSI to process
+ *
  * Clear the association between all VSI queues queue[s] and napi.
+ * Must be called only with the netdev_lock taken.
  */
-void ice_vsi_clear_napi_queues(struct ice_vsi *vsi)
+void ice_vsi_clear_napi_queues_locked(struct ice_vsi *vsi)
 {
 	struct net_device *netdev = vsi->netdev;
 	int q_idx, v_idx;
@@ -2820,12 +2914,11 @@ void ice_vsi_clear_napi_queues(struct ice_vsi *vsi)
 	if (!netdev)
 		return;
 
-	ASSERT_RTNL();
 	/* Clear the NAPI's interrupt number */
 	ice_for_each_q_vector(vsi, v_idx) {
 		struct ice_q_vector *q_vector = vsi->q_vectors[v_idx];
 
-		netif_napi_set_irq(&q_vector->napi, -1);
+		netif_napi_set_irq_locked(&q_vector->napi, -1);
 	}
 
 	ice_for_each_txq(vsi, q_idx)
@@ -2835,6 +2928,25 @@ void ice_vsi_clear_napi_queues(struct ice_vsi *vsi)
 		netif_queue_set_napi(netdev, q_idx, NETDEV_QUEUE_TYPE_RX, NULL);
 }
 
+/**
+ * ice_vsi_clear_napi_queues - dissociate VSI queues from NAPIs
+ * @vsi: VSI to process
+ *
+ * Version of ice_vsi_clear_napi_queues_locked() that takes the netdev lock,
+ * to use it outside of the net_device_ops context.
+ */
+void ice_vsi_clear_napi_queues(struct ice_vsi *vsi)
+{
+	struct net_device *netdev = vsi->netdev;
+
+	if (!netdev)
+		return;
+
+	netdev_lock(netdev);
+	ice_vsi_clear_napi_queues_locked(vsi);
+	netdev_unlock(netdev);
+}
+
 /**
  * ice_napi_add - register NAPI handler for the VSI
  * @vsi: VSI for which NAPI handler is to be registered
@@ -3072,16 +3184,17 @@ ice_vsi_realloc_stat_arrays(struct ice_vsi *vsi)
 }
 
 /**
- * ice_vsi_rebuild - Rebuild VSI after reset
+ * ice_vsi_rebuild_locked - Rebuild VSI after reset
  * @vsi: VSI to be rebuild
  * @vsi_flags: flags used for VSI rebuild flow
  *
  * Set vsi_flags to ICE_VSI_FLAG_INIT to initialize a new VSI, or
  * ICE_VSI_FLAG_NO_INIT to rebuild an existing VSI in hardware.
+ * Should be called only under the netdev lock.
  *
- * Returns 0 on success and negative value on failure
+ * Return: 0 on success, -errno on failure.
  */
-int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
+int ice_vsi_rebuild_locked(struct ice_vsi *vsi, u32 vsi_flags)
 {
 	struct ice_coalesce_stored *coalesce;
 	int prev_num_q_vectors;
@@ -3102,8 +3215,8 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
 	if (ret)
 		goto unlock;
 
-	ice_vsi_decfg(vsi);
-	ret = ice_vsi_cfg_def(vsi);
+	ice_vsi_decfg_locked(vsi);
+	ret = ice_vsi_cfg_def_locked(vsi);
 	if (ret)
 		goto unlock;
 
@@ -3133,12 +3246,38 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
 	kfree(coalesce);
 decfg:
 	if (ret)
-		ice_vsi_decfg(vsi);
+		ice_vsi_decfg_locked(vsi);
 unlock:
 	mutex_unlock(&vsi->xdp_state_lock);
 	return ret;
 }
 
+/**
+ * ice_vsi_rebuild - Rebuild VSI after reset
+ * @vsi: VSI to be rebuild
+ * @vsi_flags: flags used for VSI rebuild flow
+ *
+ * Set vsi_flags to ICE_VSI_FLAG_INIT to initialize a new VSI, or
+ * ICE_VSI_FLAG_NO_INIT to rebuild an existing VSI in hardware.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
+{
+	struct net_device *dev = vsi->netdev;
+	int ret;
+
+	if (dev)
+		netdev_lock(dev);
+
+	ret = ice_vsi_rebuild_locked(vsi, vsi_flags);
+
+	if (dev)
+		netdev_unlock(dev);
+
+	return ret;
+}
+
 /**
  * ice_is_reset_in_progress - check for a reset in progress
  * @state: PF state field
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 1d1947a7fe11..50975fe7cab7 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -507,16 +507,15 @@ static void ice_sync_fltr_subtask(struct ice_pf *pf)
 /**
  * ice_pf_dis_all_vsi - Pause all VSIs on a PF
  * @pf: the PF
- * @locked: is the rtnl_lock already held
  */
-static void ice_pf_dis_all_vsi(struct ice_pf *pf, bool locked)
+static void ice_pf_dis_all_vsi(struct ice_pf *pf)
 {
 	int node;
 	int v;
 
 	ice_for_each_vsi(pf, v)
 		if (pf->vsi[v])
-			ice_dis_vsi(pf->vsi[v], locked);
+			ice_dis_vsi(pf->vsi[v]);
 
 	for (node = 0; node < ICE_MAX_PF_AGG_NODES; node++)
 		pf->pf_agg_node[node].num_vsis = 0;
@@ -605,7 +604,7 @@ ice_prepare_for_reset(struct ice_pf *pf, enum ice_reset_req reset_type)
 	ice_clear_hw_tbls(hw);
 	/* disable the VSIs and their queues that are not already DOWN */
 	set_bit(ICE_VSI_REBUILD_PENDING, ice_get_main_vsi(pf)->state);
-	ice_pf_dis_all_vsi(pf, false);
+	ice_pf_dis_all_vsi(pf);
 
 	if (test_bit(ICE_FLAG_PTP_SUPPORTED, pf->flags))
 		ice_ptp_prepare_for_reset(pf, reset_type);
@@ -2943,9 +2942,9 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
 				goto resume_if;
 			}
 		}
-		xdp_features_set_redirect_target(vsi->netdev, true);
+		xdp_features_set_redirect_target_locked(vsi->netdev, true);
 	} else if (ice_is_xdp_ena_vsi(vsi) && !prog) {
-		xdp_features_clear_redirect_target(vsi->netdev);
+		xdp_features_clear_redirect_target_locked(vsi->netdev);
 		xdp_ring_err = ice_destroy_xdp_rings(vsi, ICE_XDP_CFG_FULL);
 		if (xdp_ring_err)
 			NL_SET_ERR_MSG_MOD(extack, "Freeing XDP Tx resources failed");
@@ -3447,11 +3446,13 @@ static void ice_set_ops(struct ice_vsi *vsi)
 
 	if (ice_is_safe_mode(pf)) {
 		netdev->netdev_ops = &ice_netdev_safe_mode_ops;
+		netdev->request_ops_lock = true;
 		ice_set_ethtool_safe_mode_ops(netdev);
 		return;
 	}
 
 	netdev->netdev_ops = &ice_netdev_ops;
+	netdev->request_ops_lock = true;
 	netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
 	netdev->xdp_metadata_ops = &ice_xdp_md_ops;
 	ice_set_ethtool_ops(netdev);
@@ -4058,6 +4059,7 @@ bool ice_is_wol_supported(struct ice_hw *hw)
  * @locked: is adev device_lock held
  *
  * Only change the number of queues if new_tx, or new_rx is non-0.
+ * Note that it should be called only with the netdev lock taken.
  *
  * Returns 0 on success.
  */
@@ -4083,7 +4085,7 @@ int ice_vsi_recfg_qs(struct ice_vsi *vsi, int new_rx, int new_tx, bool locked)
 
 	/* set for the next time the netdev is started */
 	if (!netif_running(vsi->netdev)) {
-		err = ice_vsi_rebuild(vsi, ICE_VSI_FLAG_NO_INIT);
+		err = ice_vsi_rebuild_locked(vsi, ICE_VSI_FLAG_NO_INIT);
 		if (err)
 			goto rebuild_err;
 		dev_dbg(ice_pf_to_dev(pf), "Link is down, queue count change happens when link is brought up\n");
@@ -4091,7 +4093,7 @@ int ice_vsi_recfg_qs(struct ice_vsi *vsi, int new_rx, int new_tx, bool locked)
 	}
 
 	ice_vsi_close(vsi);
-	err = ice_vsi_rebuild(vsi, ICE_VSI_FLAG_NO_INIT);
+	err = ice_vsi_rebuild_locked(vsi, ICE_VSI_FLAG_NO_INIT);
 	if (err)
 		goto rebuild_err;
 
@@ -5437,7 +5439,7 @@ static void ice_prepare_for_shutdown(struct ice_pf *pf)
 	dev_dbg(ice_pf_to_dev(pf), "Tearing down internal switch for shutdown\n");
 
 	/* disable the VSIs and their queues that are not already DOWN */
-	ice_pf_dis_all_vsi(pf, false);
+	ice_pf_dis_all_vsi(pf);
 
 	ice_for_each_vsi(pf, v)
 		if (pf->vsi[v])
@@ -5473,16 +5475,17 @@ static int ice_reinit_interrupt_scheme(struct ice_pf *pf)
 
 	/* Remap vectors and rings, after successful re-init interrupts */
 	ice_for_each_vsi(pf, v) {
-		if (!pf->vsi[v])
+		struct ice_vsi *vsi = pf->vsi[v];
+
+		if (!vsi)
 			continue;
 
-		ret = ice_vsi_alloc_q_vectors(pf->vsi[v]);
+		ret = ice_vsi_alloc_q_vectors(vsi);
 		if (ret)
 			goto err_reinit;
-		ice_vsi_map_rings_to_vectors(pf->vsi[v]);
-		rtnl_lock();
-		ice_vsi_set_napi_queues(pf->vsi[v]);
-		rtnl_unlock();
+
+		ice_vsi_map_rings_to_vectors(vsi);
+		ice_vsi_set_napi_queues(vsi);
 	}
 
 	ret = ice_req_irq_msix_misc(pf);
@@ -5495,13 +5498,15 @@ static int ice_reinit_interrupt_scheme(struct ice_pf *pf)
 	return 0;
 
 err_reinit:
-	while (v--)
-		if (pf->vsi[v]) {
-			rtnl_lock();
-			ice_vsi_clear_napi_queues(pf->vsi[v]);
-			rtnl_unlock();
-			ice_vsi_free_q_vectors(pf->vsi[v]);
-		}
+	while (v--) {
+		struct ice_vsi *vsi = pf->vsi[v];
+
+		if (!vsi)
+			continue;
+
+		ice_vsi_clear_napi_queues(vsi);
+		ice_vsi_free_q_vectors(vsi);
+	}
 
 	return ret;
 }
@@ -5564,14 +5569,17 @@ static int ice_suspend(struct device *dev)
 	 * to CPU0.
 	 */
 	ice_free_irq_msix_misc(pf);
+
 	ice_for_each_vsi(pf, v) {
-		if (!pf->vsi[v])
+		struct ice_vsi *vsi = pf->vsi[v];
+
+		if (!vsi)
 			continue;
-		rtnl_lock();
-		ice_vsi_clear_napi_queues(pf->vsi[v]);
-		rtnl_unlock();
-		ice_vsi_free_q_vectors(pf->vsi[v]);
+
+		ice_vsi_clear_napi_queues(vsi);
+		ice_vsi_free_q_vectors(vsi);
 	}
+
 	ice_clear_interrupt_scheme(pf);
 
 	pci_save_state(pdev);
@@ -6699,7 +6707,7 @@ static void ice_napi_enable_all(struct ice_vsi *vsi)
 		ice_init_moderation(q_vector);
 
 		if (q_vector->rx.rx_ring || q_vector->tx.tx_ring)
-			napi_enable(&q_vector->napi);
+			napi_enable_locked(&q_vector->napi);
 	}
 }
 
@@ -7198,7 +7206,7 @@ static void ice_napi_disable_all(struct ice_vsi *vsi)
 		struct ice_q_vector *q_vector = vsi->q_vectors[q_idx];
 
 		if (q_vector->rx.rx_ring || q_vector->tx.tx_ring)
-			napi_disable(&q_vector->napi);
+			napi_disable_locked(&q_vector->napi);
 
 		cancel_work_sync(&q_vector->tx.dim.work);
 		cancel_work_sync(&q_vector->rx.dim.work);
@@ -7498,7 +7506,7 @@ int ice_vsi_open(struct ice_vsi *vsi)
 		if (err)
 			goto err_set_qs;
 
-		ice_vsi_set_napi_queues(vsi);
+		ice_vsi_set_napi_queues_locked(vsi);
 	}
 
 	err = ice_up_complete(vsi);
@@ -7584,7 +7592,7 @@ static int ice_vsi_rebuild_by_type(struct ice_pf *pf, enum ice_vsi_type type)
 		vsi->vsi_num = ice_get_hw_vsi_num(&pf->hw, vsi->idx);
 
 		/* enable the VSI */
-		err = ice_ena_vsi(vsi, false);
+		err = ice_ena_vsi(vsi);
 		if (err) {
 			dev_err(dev, "enable VSI failed, err %d, VSI index %d, type %s\n",
 				err, vsi->idx, ice_vsi_type_str(type));
@@ -9190,7 +9198,7 @@ static int ice_setup_tc_mqprio_qdisc(struct net_device *netdev, void *type_data)
 		return 0;
 
 	/* Pause VSI queues */
-	ice_dis_vsi(vsi, true);
+	ice_dis_vsi_locked(vsi);
 
 	if (!hw && !test_bit(ICE_FLAG_TC_MQPRIO, pf->flags))
 		ice_remove_q_channels(vsi, true);
@@ -9229,14 +9237,14 @@ static int ice_setup_tc_mqprio_qdisc(struct net_device *netdev, void *type_data)
 	cur_rxq = vsi->num_rxq;
 
 	/* proceed with rebuild main VSI using correct number of queues */
-	ret = ice_vsi_rebuild(vsi, ICE_VSI_FLAG_NO_INIT);
+	ret = ice_vsi_rebuild_locked(vsi, ICE_VSI_FLAG_NO_INIT);
 	if (ret) {
 		/* fallback to current number of queues */
 		dev_info(dev, "Rebuild failed with new queues, try with current number of queues\n");
 		vsi->req_txq = cur_txq;
 		vsi->req_rxq = cur_rxq;
 		clear_bit(ICE_RESET_FAILED, pf->state);
-		if (ice_vsi_rebuild(vsi, ICE_VSI_FLAG_NO_INIT)) {
+		if (ice_vsi_rebuild_locked(vsi, ICE_VSI_FLAG_NO_INIT)) {
 			dev_err(dev, "Rebuild of main VSI failed again\n");
 			return ret;
 		}
@@ -9292,7 +9300,7 @@ static int ice_setup_tc_mqprio_qdisc(struct net_device *netdev, void *type_data)
 		vsi->all_enatc = 0;
 	}
 	/* resume VSI */
-	ice_ena_vsi(vsi, true);
+	ice_ena_vsi_locked(vsi);
 
 	return ret;
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_sf_eth.c b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
index a730aa368c92..cd6ba53a873b 100644
--- a/drivers/net/ethernet/intel/ice/ice_sf_eth.c
+++ b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
@@ -58,6 +58,7 @@ static int ice_sf_cfg_netdev(struct ice_dynamic_port *dyn_port,
 	eth_hw_addr_set(netdev, dyn_port->hw_addr);
 	ether_addr_copy(netdev->perm_addr, dyn_port->hw_addr);
 	netdev->netdev_ops = &ice_sf_netdev_ops;
+	netdev->request_ops_lock = true;
 	SET_NETDEV_DEVLINK_PORT(netdev, devlink_port);
 
 	err = register_netdev(netdev);
@@ -183,7 +184,9 @@ static void ice_sf_dev_remove(struct auxiliary_device *adev)
 	devlink = priv_to_devlink(sf_dev->priv);
 	devl_lock(devlink);
 
+	netdev_lock(vsi->netdev);
 	ice_vsi_close(vsi);
+	netdev_unlock(vsi->netdev);
 
 	ice_sf_decfg_netdev(vsi);
 	ice_devlink_destroy_sf_dev_port(sf_dev);
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 0643017541c3..be0cb548487a 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -33,9 +33,9 @@ ice_qvec_toggle_napi(struct ice_vsi *vsi, struct ice_q_vector *q_vector,
 		return;
 
 	if (enable)
-		napi_enable(&q_vector->napi);
+		napi_enable_locked(&q_vector->napi);
 	else
-		napi_disable(&q_vector->napi);
+		napi_disable_locked(&q_vector->napi);
 }
 
 /**
-- 
2.54.0


^ permalink raw reply related

* [PATCH iwl-next v5 2/5] libeth: handle creating pools with unreadable buffers
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
	nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
In-Reply-To: <20260505152923.1040589-1-aleksander.lobakin@intel.com>

libeth uses netmems for quite some time already, so in order to
support unreadable frags / memory providers, it only needs to set
PP_FLAG_ALLOW_UNREADABLE_NETMEM when needed.
Also add a couple sanity checks to make sure the driver didn't mess
up the configuration options and, in case when an MP is installed,
return the truesize always equal to PAGE_SIZE, so that
libeth_rx_alloc() will never try to allocate frags. Memory providers
manage buffers on their own and expect 1:1 buffer / HW Rx descriptor
association.

Bonus: mention in the libeth_sqe_type description that
LIBETH_SQE_EMPTY should also be used for netmem Tx SQEs -- they
don't need DMA unmapping.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/libeth/tx.h                |  2 +-
 drivers/net/ethernet/intel/libeth/rx.c | 42 ++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/net/libeth/tx.h b/include/net/libeth/tx.h
index c3db5c6f1641..a66fc2b3a114 100644
--- a/include/net/libeth/tx.h
+++ b/include/net/libeth/tx.h
@@ -12,7 +12,7 @@
 
 /**
  * enum libeth_sqe_type - type of &libeth_sqe to act on Tx completion
- * @LIBETH_SQE_EMPTY: unused/empty OR XDP_TX/XSk frame, no action required
+ * @LIBETH_SQE_EMPTY: empty OR netmem/XDP_TX/XSk frame, no action required
  * @LIBETH_SQE_CTX: context descriptor with empty SQE, no action required
  * @LIBETH_SQE_SLAB: kmalloc-allocated buffer, unmap and kfree()
  * @LIBETH_SQE_FRAG: mapped skb frag, only unmap DMA
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index 8874b714cdcc..11e6e8f353ef 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -6,6 +6,7 @@
 #include <linux/export.h>
 
 #include <net/libeth/rx.h>
+#include <net/netdev_queues.h>
 
 /* Rx buffer management */
 
@@ -139,9 +140,47 @@ static bool libeth_rx_page_pool_params_zc(struct libeth_fq *fq,
 	fq->buf_len = clamp(mtu, LIBETH_RX_BUF_STRIDE, max);
 	fq->truesize = fq->buf_len;
 
+	/*
+	 * Allow frags only for kernel pages. `fq->truesize == pp->max_len`
+	 * will always fall back to regular page_pool_alloc_netmems()
+	 * regardless of the MTU / FQ buffer size.
+	 */
+	if (pp->flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM)
+		fq->truesize = pp->max_len;
+
 	return true;
 }
 
+/**
+ * libeth_rx_page_pool_check_unread - check input params for unreadable MPs
+ * @fq: buffer queue to check
+ * @pp: &page_pool_params for the queue
+ *
+ * Make sure we don't create an invalid pool with full-frame unreadable
+ * buffers, bidirectional unreadable buffers or so, and configure the
+ * ZC payload pool accordingly.
+ *
+ * Return: true on success, false on invalid input params.
+ */
+static bool libeth_rx_page_pool_check_unread(const struct libeth_fq *fq,
+					     struct page_pool_params *pp)
+{
+	if (!netif_rxq_has_unreadable_mp(pp->netdev, pp->queue_idx))
+		return true;
+
+	/* For now, the core stack doesn't allow XDP with unreadable frags */
+	if (fq->xdp)
+		return false;
+
+	/* It should be either a header pool or a ZC payload pool */
+	if (fq->type == LIBETH_FQE_HDR)
+		return !fq->hsplit;
+
+	pp->flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
+
+	return fq->hsplit;
+}
+
 /**
  * libeth_rx_fq_create - create a PP with the default libeth settings
  * @fq: buffer queue struct to fill
@@ -165,6 +204,9 @@ int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi)
 	struct page_pool *pool;
 	int ret;
 
+	if (!libeth_rx_page_pool_check_unread(fq, &pp))
+		return -EINVAL;
+
 	pp.dma_dir = fq->xdp ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 
 	if (!fq->hsplit)
-- 
2.54.0


^ permalink raw reply related

* [PATCH iwl-next v5 1/5] libeth: pass Rx queue index to PP when creating a fill queue
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
	nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
In-Reply-To: <20260505152923.1040589-1-aleksander.lobakin@intel.com>

Since recently, page_pool_create() accepts optional stack index of
the Rx queue which the pool will be created for. It can then be
used on control path for stuff like memory providers.
Add the same field to libeth_fq and pass the index from all the
drivers using libeth for managing Rx to simplify implementing MP
support later.
idpf has one libeth_fq per buffer/fill queue and each Rx queue has
two fill queues, but since fill queues can never be shared, we can
store the corresponding Rx queue index there during the
initialization to pass it to libeth.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/idpf/idpf_txrx.h |  2 ++
 include/net/libeth/rx.h                     |  2 ++
 drivers/net/ethernet/intel/iavf/iavf_txrx.c |  1 +
 drivers/net/ethernet/intel/ice/ice_base.c   |  2 ++
 drivers/net/ethernet/intel/idpf/idpf_txrx.c | 13 +++++++++++++
 drivers/net/ethernet/intel/libeth/rx.c      |  1 +
 6 files changed, 21 insertions(+)

diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index 4be5b3b6d3ed..a0d92adf11c4 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -748,6 +748,7 @@ libeth_cacheline_set_assert(struct idpf_tx_queue, 64,
  * @size: Length of descriptor ring in bytes
  * @dma: Physical address of ring
  * @q_vector: Backreference to associated vector
+ * @rxq_idx: stack index of the corresponding Rx queue
  * @rx_buffer_low_watermark: RX buffer low watermark
  * @rx_hbuf_size: Header buffer size
  * @rx_buf_size: Buffer size
@@ -791,6 +792,7 @@ struct idpf_buf_queue {
 	dma_addr_t dma;
 
 	struct idpf_q_vector *q_vector;
+	u16 rxq_idx;
 
 	u16 rx_buffer_low_watermark;
 	u16 rx_hbuf_size;
diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h
index 5d991404845e..3b3d7acd13c9 100644
--- a/include/net/libeth/rx.h
+++ b/include/net/libeth/rx.h
@@ -71,6 +71,7 @@ enum libeth_fqe_type {
  * @xdp: flag indicating whether XDP is enabled
  * @buf_len: HW-writeable length per each buffer
  * @nid: ID of the closest NUMA node with memory
+ * @idx: stack index of the corresponding Rx queue
  */
 struct libeth_fq {
 	struct_group_tagged(libeth_fq_fp, fp,
@@ -88,6 +89,7 @@ struct libeth_fq {
 
 	u32			buf_len;
 	int			nid;
+	u32			idx;
 };
 
 int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi);
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 363c42bf3dcf..d3c68659162b 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -771,6 +771,7 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
 		.count		= rx_ring->count,
 		.buf_len	= LIBIE_MAX_RX_BUF_LEN,
 		.nid		= NUMA_NO_NODE,
+		.idx		= rx_ring->queue_index,
 	};
 	int ret;
 
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 1667f686ff75..f162cdfc62a7 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -609,6 +609,7 @@ static int ice_rxq_pp_create(struct ice_rx_ring *rq)
 	struct libeth_fq fq = {
 		.count		= rq->count,
 		.nid		= NUMA_NO_NODE,
+		.idx		= rq->q_index,
 		.hsplit		= rq->vsi->hsplit,
 		.xdp		= ice_is_xdp_ena_vsi(rq->vsi),
 		.buf_len	= LIBIE_MAX_RX_BUF_LEN,
@@ -631,6 +632,7 @@ static int ice_rxq_pp_create(struct ice_rx_ring *rq)
 		.count		= rq->count,
 		.type		= LIBETH_FQE_HDR,
 		.nid		= NUMA_NO_NODE,
+		.idx		= rq->q_index,
 		.xdp		= ice_is_xdp_ena_vsi(rq->vsi),
 	};
 
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index f6b3b15364ff..95930edc566f 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -557,6 +557,7 @@ static int idpf_rx_hdr_buf_alloc_all(struct idpf_buf_queue *bufq)
 		.type	= LIBETH_FQE_HDR,
 		.xdp	= idpf_xdp_enabled(bufq->q_vector->vport),
 		.nid	= idpf_q_vector_to_mem(bufq->q_vector),
+		.idx	= bufq->rxq_idx,
 	};
 	int ret;
 
@@ -699,6 +700,7 @@ static int idpf_rx_bufs_init_singleq(struct idpf_rx_queue *rxq)
 		.type		= LIBETH_FQE_MTU,
 		.buf_len	= IDPF_RX_MAX_BUF_SZ,
 		.nid		= idpf_q_vector_to_mem(rxq->q_vector),
+		.idx		= rxq->idx,
 	};
 	int ret;
 
@@ -759,6 +761,7 @@ static int idpf_rx_bufs_init(struct idpf_buf_queue *bufq,
 		.hsplit		= idpf_queue_has(HSPLIT_EN, bufq),
 		.xdp		= idpf_xdp_enabled(bufq->q_vector->vport),
 		.nid		= idpf_q_vector_to_mem(bufq->q_vector),
+		.idx		= bufq->rxq_idx,
 	};
 	int ret;
 
@@ -1913,6 +1916,16 @@ static int idpf_rxq_group_alloc(struct idpf_vport *vport,
 							LIBETH_RX_LL_LEN;
 			idpf_rxq_set_descids(rsrc, q);
 		}
+
+		if (!idpf_is_queue_model_split(rsrc->rxq_model))
+			continue;
+
+		for (u32 j = 0; j < rsrc->num_bufqs_per_qgrp; j++) {
+			struct idpf_buf_queue *bufq;
+
+			bufq = &rx_qgrp->splitq.bufq_sets[j].bufq;
+			bufq->rxq_idx = rx_qgrp->splitq.rxq_sets[0]->rxq.idx;
+		}
 	}
 
 err_alloc:
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index 62521a1f4ec9..8874b714cdcc 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -156,6 +156,7 @@ int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi)
 		.order		= LIBETH_RX_PAGE_ORDER,
 		.pool_size	= fq->count,
 		.nid		= fq->nid,
+		.queue_idx	= fq->idx,
 		.dev		= napi->dev->dev.parent,
 		.netdev		= napi->dev,
 		.napi		= napi,
-- 
2.54.0


^ permalink raw reply related

* [PATCH iwl-next v5 0/5] ice: add support for devmem/io_uring Rx and Tx
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
	nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel

Now that ice uses libeth for managing Rx buffers and supports
configurable header split, it's ready to get support for sending
and receiving packets with unreadable (to the kernel) frags.

Extend libeth just a little bit to allow creating PPs with custom
memory providers and make sure ice works correctly with the netdev
ops locking. Then add the full set of queue_mgmt_ops and don't
unmap unreadable frags on Tx completion.
No perf regressions for the regular flows and no code duplication
implied.

Credits to the fbnic developers, whose code helped me understand
the memory providers and queue_mgmt_ops logics and served as
a reference.

Alexander Lobakin (5):
  libeth: pass Rx queue index to PP when creating a fill queue
  libeth: handle creating pools with unreadable buffers
  ice: migrate to netdev ops lock
  ice: implement Rx queue management ops
  ice: add support for transmitting unreadable frags

 drivers/net/ethernet/intel/ice/ice_base.h    |   2 +
 drivers/net/ethernet/intel/ice/ice_lib.h     |  18 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h    |   2 +
 drivers/net/ethernet/intel/idpf/idpf_txrx.h  |   2 +
 include/net/libeth/rx.h                      |   2 +
 include/net/libeth/tx.h                      |   2 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c  |   1 +
 drivers/net/ethernet/intel/ice/ice_base.c    | 247 +++++++++++++++----
 drivers/net/ethernet/intel/ice/ice_dcb_lib.c |  15 +-
 drivers/net/ethernet/intel/ice/ice_eswitch.c |  26 +-
 drivers/net/ethernet/intel/ice/ice_lib.c     | 227 +++++++++++++----
 drivers/net/ethernet/intel/ice/ice_main.c    |  79 +++---
 drivers/net/ethernet/intel/ice/ice_sf_eth.c  |   4 +
 drivers/net/ethernet/intel/ice/ice_txrx.c    |  43 +++-
 drivers/net/ethernet/intel/ice/ice_xsk.c     |   4 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.c  |  13 +
 drivers/net/ethernet/intel/libeth/rx.c       |  43 ++++
 17 files changed, 566 insertions(+), 164 deletions(-)

---
Note: apply to net-next, not Tony's next-queue (ready to be sent as
a PR).

From v4[0]:
* rebase on top of the latest net-next;
* 3/5: fix the last [hopefully] missing netdev lock (E-Switch code,
       Simon), rechecked with our internal Intel's Sashiko setup;
* 3/5: pick fixes for .ndo_bpf() and safe mode from Kohei.

From v3[1]:
* rebase on top of recent Larysa's changes;
* 3/5: fix the last locking inconsistencies (Jakub);
* 3/5: pick a kdoc fix from Tony.

From v2[2]:
* rebase on top of net-next-7.0;
* 3/5: fix [hopefully] all inconsistent locking (Jakub, Tony);
* 4/5: pick a hotfix from Kohei.

From v1[3]:
* rebase on top of the latest next-queue;
* fix a typo 'rxq_ixd' -> 'rxq_idx' (Tony).

Testing hints:
* regular Rx and Tx for regressions;
* <kernel root>/tools/testing/selftests/drivers/net/hw/ contains
  scripts for testing netmem Rx and Tx, namely devmem.py and
  iou-zcrx.py (read the documentation first).

[0] https://lore.kernel.org/intel-wired-lan/20260318163505.31765-1-aleksander.lobakin@intel.com
[1] https://lore.kernel.org/intel-wired-lan/20260224174618.2780516-1-aleksander.lobakin@intel.com
[2] https://lore.kernel.org/intel-wired-lan/20251204155133.2437621-1-aleksander.lobakin@intel.com
[3] https://lore.kernel.org/intel-wired-lan/20251125173603.3834486-1-aleksander.lobakin@intel.com
-- 
2.54.0

^ permalink raw reply

* Re: [PATCH net-next v2] declance: Remove IRQF_ONESHOT
From: Sebastian Andrzej Siewior @ 2026-05-05 15:24 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: netdev, linux-mips, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni
In-Reply-To: <alpine.DEB.2.21.2605051410280.46195@angie.orcam.me.uk>

On 2026-05-05 15:00:56 [+0100], Maciej W. Rozycki wrote:
> 
>  Ah, I missed that IRQF_ONESHOT bit in irq_setup_forced_threading().  So 
> it seems like we're good.  I'll proceed with the original change then to 
> just amend IOASIC DMA IRQ documentation.  Thank you for patience again.

No worries, you are welcome.
I'm not if sure if you may need to change the primary handler if the
interrupt flow is EOI and cascading based on what you wrote. If you have
access to the HW then you it should be easy to test given the
`threadirqs' argument should expose problems.

>   Maciej

Sebastian

^ permalink raw reply

* Re: [PATCH net v2] net: rtsn: fix mdio_node leak in rtsn_mdio_alloc()
From: Niklas Söderlund @ 2026-05-05 15:19 UTC (permalink / raw)
  To: Shitalkumar Gandhi
  Cc: Geert Uytterhoeven, Andrew Lunn, Jakub Kicinski, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, netdev,
	linux-renesas-soc, linux-kernel, Shitalkumar Gandhi
In-Reply-To: <20260505123236.406000-1-shitalkumar.gandhi@cambiumnetworks.com>

Hi Shitalkumar,

Thanks for your work.

On 2026-05-05 18:02:36 +0530, Shitalkumar Gandhi wrote:
> of_get_child_by_name() takes a reference. The rtsn_reset() and
> rtsn_change_mode() failure paths jump to out_free_bus and leak
> mdio_node.
> 
> Add out_put_node to drop it before falling through.
> 
> Fixes: b0d3969d2b4d ("net: ethernet: rtsn: Add support for Renesas Ethernet-TSN")
> Signed-off-by: Shitalkumar Gandhi <shitalkumar.gandhi@cambiumnetworks.com>
> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>

Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>

> ---
> Changes in v2:
> - Restore blank line between `return 0;` and `out_put_node:` label (Geert)
> - Add Reviewed-by: Geert Uytterhoeven
> 
> Resent as a new thread (no code changes) so netdev CI picks it up
> (Andrew).
> 
> Link to v1: https://lore.kernel.org/netdev/20260504200356.3529873-1-shitalkumar.gandhi@cambiumnetworks.com/
> Link to v2 (mis-threaded): https://lore.kernel.org/netdev/20260505085840.352206-1-shitalkumar.gandhi@cambiumnetworks.com/
> 
>  drivers/net/ethernet/renesas/rtsn.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/renesas/rtsn.c b/drivers/net/ethernet/renesas/rtsn.c
> index 03a2669f0518..ee8381b60b8d 100644
> --- a/drivers/net/ethernet/renesas/rtsn.c
> +++ b/drivers/net/ethernet/renesas/rtsn.c
> @@ -797,11 +797,11 @@ static int rtsn_mdio_alloc(struct rtsn_private *priv)
>  	/* Enter config mode before registering the MDIO bus */
>  	ret = rtsn_reset(priv);
>  	if (ret)
> -		goto out_free_bus;
> +		goto out_put_node;
>  
>  	ret = rtsn_change_mode(priv, OCR_OPC_CONFIG);
>  	if (ret)
> -		goto out_free_bus;
> +		goto out_put_node;
>  
>  	rtsn_modify(priv, MPIC, MPIC_PSMCS_MASK | MPIC_PSMHT_MASK,
>  		    MPIC_PSMCS_DEFAULT | MPIC_PSMHT_DEFAULT);
> @@ -824,6 +824,8 @@ static int rtsn_mdio_alloc(struct rtsn_private *priv)
>  
>  	return 0;
>  
> +out_put_node:
> +	of_node_put(mdio_node);
>  out_free_bus:
>  	mdiobus_free(mii);
>  	return ret;
> -- 
> 2.25.1
> 

-- 
Kind Regards,
Niklas Söderlund

^ permalink raw reply

* RE: [PATCH net-next 1/5] dt-bindings: net: add onsemi's TS2500/NCN26010 10BASE-T1S MACPHY
From: Selvamani Rajagopal @ 2026-05-05 15:11 UTC (permalink / raw)
  To: Rob Herring
  Cc: Piergiorgio Beruto, andrew+netdev@lunn.ch, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	krzk+dt@kernel.org, conor+dt@kernel.org, netdev@vger.kernel.org,
	devicetree@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260505134434.GA2493310-robh@kernel.org>



> -----Original Message-----
> From: Rob Herring <robh@kernel.org>
> Sent: Tuesday, May 5, 2026 6:45 AM
> To: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
> Cc: Piergiorgio Beruto <Pier.Beruto@onsemi.com>; andrew+netdev@lunn.ch;
> davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com;
> krzk+dt@kernel.org; conor+dt@kernel.org; netdev@vger.kernel.org;
> devicetree@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next 1/5] dt-bindings: net: add onsemi's TS2500/NCN26010
> 10BASE-T1S MACPHY
> 
> 
> This Message Is From an External Sender
> This message came from outside your organization.
> 
> On Fri, May 01, 2026 at 07:15:17PM +0000, Selvamani Rajagopal wrote:
> > Add YAML device tree binding for the onsemi NCN26010 and TS2500
> > IEEE 802.3cg compliant Ethernet transceiver devices.
> >
> > Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
> > ---
> > .../bindings/net/onnn,ncn260xx.yaml | 71 +++++++++++++++++++
> > 1 file changed, 71 insertions(+)
> > create mode 100644 Documentation/devicetree/bindings/net/onnn,ncn260xx.yaml
> >
> > diff --git a/Documentation/devicetree/bindings/net/onnn,ncn260xx.yaml
> b/Documentation/devicetree/bindings/net/onnn,ncn260xx.yaml
> > new file mode 100644
> > index 000000000..198cd7e9d
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/net/onnn,ncn260xx.yaml
> > @@ -0,0 +1,71 @@
> > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/net/onnn,ncn260xx.yaml#
> <http://devicetree.org/schemas/net/onnn,ncn260xx.yaml#
> vicetree.org>
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> <http://devicetree.org/meta-schemas/core.yaml#
> n=devicetree.org>
> > +
> > +title: onsemi NCN26010/TS2500 10BASE-T1S MACPHY Ethernet Controllers
> > +
> > +maintainers:
> > + - Piergiorgio Beruto <Pier.Beruto@onsemi.com>
> > + - Selva Rajagopal <Selvamani.Rajagopal@onsemi.com>
> > +
> > +description: |
> > + The NCN26010 and TS2500 combine a Media Access Controller (MAC) and an
> > + Ethernet PHY to enable 10BASE‑T1S networks. The Ethernet Media Access
> > + Controller (MAC) module implements a 10 Mbps half duplex Ethernet MAC,
> > + compatible with the IEEE 802.3 standard and a 10BASE-T1S physical layer
> > + transceiver integrated into the NCN26010. The communication between
> > + the host and the MAC-PHY is specified in the OPEN Alliance 10BASE-T1x
> > + MACPHY Serial Interface (TC6).
> > +
> > + Specifications about the NCN26010 can be found at:
> > + https://www.onsemi.com/download/data-sheet/pdf/ncn26010-d.pdf
> <https://www.onsemi.com/download/data-sheet/pdf/ncn26010-d.pdf
> n=onsemi.com>
> > + https://www.onsemi.com/products/interfaces/ethernet-controllers/t30hm1ts2500
> <https://www.onsemi.com/products/interfaces/ethernet-controllers/t30hm1ts2500
> semi.com>
> > +
> > +allOf:
> > + - $ref: /schemas/net/ethernet-controller.yaml#
> > + - $ref: /schemas/spi/spi-peripheral-props.yaml#
> > +
> > +properties:
> > + compatible:
> > + const: onnn,ncn260xx
> 
> Don't use wildcards in compatible strings.
> 
> > +
> > + reg:
> > + maxItems: 1
> > +
> > + interrupts:
> > + description: |
> 
> Don't need '|'.
> 
> > + Interrupt from MAC-PHY asserted in the event of Receive Chunks
> > + Available, Transmit Chunk Credits Available and Extended Status
> > + Event.
> > + maxItems: 1
> > +
> > + spi-max-frequency:
> > + minimum: 15000000
> 
> A minimum is strange. What if you have a board issue requiring lower
> frequency?

Had the same question in internal review. Datasheet says the minimum speed 15 MHz is needed. That's why we had placed.

> 
> > + maximum: 25000000
> > +
> > +required:
> > + - compatible
> > + - reg
> > + - interrupts
> > + - spi-max-frequency
> 
> Normally this is not required. It's only for boards which can't operate
> at the maximum frequency of the device.
> 
> > +
> > +additionalProperties: false
> > +
> > +examples:
> > + - |
> > + spi {
> > + #address-cells = <1>;
> > + #size-cells = <0>;
> > +
> > + ethernet@0 {
> > + compatible = "onnn,ncn260xx";
> > + reg = <0>;
> > + pinctrl-names = "default";
> > + interrupt-parent = <&gpio>;
> > + interrupts = <25 2>;
> > + status = "okay";
> 
> Drop. Examples are always enabled.
> 
> > + spi-max-frequency = <25000000>;
> > + };
> > + };
> > --
> > 2.43.0
> >
> >
> > Public Information


^ permalink raw reply

* Re: [REGRESSION] aquantia: Sunshine/Moonlight UDP video streaming broken since 5b4015ad833c ("net: aquantia: Remove redundant UDP length adjustment with GSO_PARTIAL")
From: Gal Pressman @ 2026-05-05 15:05 UTC (permalink / raw)
  To: Matthew Schwartz, Dragos Tatulea, Jakub Kicinski
  Cc: regressions, netdev, linux-kernel@vger.kernel.org
In-Reply-To: <cdce7fbb-671d-48b0-95ff-56b882d3e9a5@linux.dev>

On 27/04/2026 21:26, Matthew Schwartz wrote:
> On 4/27/26 11:09 AM, Gal Pressman wrote:
>> Hello Matthew,
>>
>> On 27/04/2026 2:20, Matthew Schwartz wrote:
>>> Hello,
>>>
>>> When using a previously working setup of remote streaming from my workstation to another device via Sunshine (the host server) and Moonlight (the client app) on my home network, I no longer receive any video output on the client app after upgrading my host workstation to kernel 7.0. Reverting back to kernel 6.19 on the host restored my setup to a working state.
>>>
>>> After bisecting, I landed on 5b4015ad833c ("net: aquantia: Remove redundant UDP length adjustment with GSO_PARTIAL") as the first bad commit. I confirmed this by moving the cable to my second on-board NIC (Intel) on the same workstation, which restored video output without any other kernel changes. My affected on-board NIC is Aquantia AQC113 [1d6a:04c0] (rev 03), atlantic driver, firmware 1.3.34, MTU 1500.
>>>
>>> Looking into it a bit further, ethtool -K enp97s0 tx-udp-segmentation off also serves as a workaround on my Aquantia port without changing to my other ethernet port. The working Intel NIC reports tx-udp-segmentation as "off [fixed]", so traffic falls back to software UDP segmentation on there.
>>>
>>> Please let me know if there's any additional info I can provide.
>>>
>>> Thanks,
>>> Matt
>>>
>>> #regzbot introduced: 5b4015ad833c
>>
>> Thank you for the report and the bisect!
>>
>> I will take a look and try to figure out what's wrong (though I don't
>> have real hardware to test on).
>> Is the userspace app open source? can I see its code and try to run it
>> myself?
> 
> Thanks for the reply. The code for Sunshine is available here: https://github.com/LizardByte/Sunshine and the code for Moonlight is here: https://github.com/moonlight-stream/moonlight-qt.
> 
> I have been using the Arch Linux Sunshine package which I installed by following the Linux instructions here: https://docs.lizardbyte.dev/projects/sunshine/latest/md_docs_2getting__started.html, but there are also binaries for other distros or it's buildable from source. For Moonlight, I have been using the Flatpak distributed on Flathub because the client device runs an atomic rootfs, but you can also use any other device that Moonlight supports.
> 
>>
>> I will be OOO for the rest of the week, hope to have some meaningful
>> reply by the end of next week.
> 

I think I see the issue, do you mind testing the following diff?

index a0813d425b71..5bd1706b11b0 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -599,10 +599,22 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 		uh = udp_hdr(seg);
 	}

-	/* last packet can be partial gso_size, account for that in checksum */
-	newlen = htons(skb_tail_pointer(seg) - skb_transport_header(seg) +
-		       seg->data_len);
-	check = csum16_add(csum16_sub(uh->check, uh->len), newlen);
+	if (skb_is_gso(seg)) {
+		newlen = msslen;
+	} else {
+		/* last packet can be partial gso_size, account for that in
+		 * checksum.
+		 */
+		newlen = htons(skb_tail_pointer(seg) -
+			       skb_transport_header(seg) + seg->data_len);
+		check = csum16_add(csum16_sub(uh->check, uh->len), newlen);
+	}

 	uh->len = newlen;
 	uh->check = check;

^ permalink raw reply related

* Re: [PATCH wireguard] wireguard: prevent ipv6 addrconf via IFF_NO_ADDRCONF flag
From: Jason A. Donenfeld @ 2026-05-05 15:05 UTC (permalink / raw)
  To: Valentin Spreckels
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, wireguard, netdev, linux-kernel
In-Reply-To: <afefejiY8SX8UfTm@zx2c4.com>

On Sun, May 03, 2026 at 09:18:18PM +0200, Jason A. Donenfeld wrote:
> On Sat, Mar 21, 2026 at 08:20:53PM +0100, Valentin Spreckels wrote:
> > Hi Jason,
> > 
> > On 11/03/2026 23:59, Jason A. Donenfeld wrote:
> > > Hi Valentin,
> > > 
> > > On Sun, Feb 08, 2026 at 06:05:45PM +0100, Valentin Spreckels wrote:
> > >> Use the flag introduced in commit 8a321cf7becc6 ("net: add
> > >> IFF_NO_ADDRCONF and use it in bonding to prevent ipv6 addrconf")
> > >> instead of mangling the addr_gen_mode to prevent ipv6 addrconf.
> > > 
> > > Can you give some more context here? Why was IFF_NO_ADDRCONF added when
> > > the IN6_ADDR_GEN_MODE_NONE method has been working fine? What's the
> > > difference between these approaches? I don't doubt that your patch is
> > > correct, but I would like to better understand this.
> > 
> > Only wireguard configures addr_gen_mode inside the kernel, otherwise it 
> > is only set by userspace; userspace is also able to overwrite the 
> > IFF_NO_ADDRCONF set by wireguard.
> > 
> > Commit 8a321cf7becc ("net: add IFF_NO_ADDRCONF and use it in bonding to 
> > prevent ipv6 addrconf") introduces the private interface flag 
> > IFF_NO_ADDRCONF, which isn't accessible by userspace.
> > 
> > Thus use the IFF_NO_ADDRCONF flag in wireguard.
> > 
> > 
> > Does that answer your questions? If yes, I will submit a v2 with this as 
> > commit message.
> 
> I applied this here:
> https://git.zx2c4.com/wireguard-linux/commit/?id=88427bcbe5bd3711de387b1c1f6540ef6fc05a78
> 
> Sorry for the delay! Patch looks good as-is, once I looked into the
> internal mechanism.

I'm backing this patch out for now. It seems to break the selftests:

    [+] NS2: ping6 -c 10 -f -W 1 fd00::1
    ping6: connect: Network unreachable

Try it yourself with:

    $ make -C tools/testing/selftests/wireguard/qemu -j$(nproc) 

I assume it's because of:

        case NETDEV_UP:
        case NETDEV_CHANGE:
                if (idev && idev->cnf.disable_ipv6)
                        break;

                if (dev->priv_flags & IFF_NO_ADDRCONF) {
			[...]
                        break;
                }

Feel free to submit a v2 if you think this is fixable or if the tests
themselves are wrong.

Jason

^ permalink raw reply

* [PATCH net 11/11] selftests: mptcp: pm: restrict 'unknown' check to pm_nl_ctl
From: Matthieu Baerts (NGI0) @ 2026-05-05 15:00 UTC (permalink / raw)
  To: Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Christoph Paasch
  Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0), stable,
	Shuah Khan, linux-kselftest
In-Reply-To: <20260505-net-mptcp-pm-fixes-7-1-rc3-v1-0-fca8091060a4@kernel.org>

When pm_netlink.sh is executed with '-i', 'ip mptcp' is used instead of
'pm_nl_ctl'. IPRoute2 doesn't support the 'unknown' flag, which has only
been added to 'pm_nl_ctl' for this specific check: to ensure that the
kernel ignores such unsupported flag.

No reason to add this flag to 'ip mptcp'. Then, this check should be
skipped when 'ip mptcp' is used.

Fixes: 0cef6fcac24d ("selftests: mptcp: ip_mptcp option for more scripts")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-kselftest@vger.kernel.org
---
 tools/testing/selftests/net/mptcp/pm_netlink.sh | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/net/mptcp/pm_netlink.sh b/tools/testing/selftests/net/mptcp/pm_netlink.sh
index b69f30fcb91e..04594dfc22b1 100755
--- a/tools/testing/selftests/net/mptcp/pm_netlink.sh
+++ b/tools/testing/selftests/net/mptcp/pm_netlink.sh
@@ -194,9 +194,13 @@ check "show_endpoints" \
 flush_endpoint
 check "show_endpoints" "" "flush addrs"
 
-add_endpoint 10.0.1.1 flags unknown
-check "show_endpoints" "$(format_endpoints "1,10.0.1.1")" "ignore unknown flags"
-flush_endpoint
+# "unknown" flag is only supported by pm_nl_ctl
+if ! mptcp_lib_is_ip_mptcp; then
+	add_endpoint 10.0.1.1 flags unknown
+	check "show_endpoints" "$(format_endpoints "1,10.0.1.1")" \
+	      "ignore unknown flags"
+	flush_endpoint
+fi
 
 set_limits 9 1 2>/dev/null
 check "get_limits" "${default_limits}" "rcv addrs above hard limit"

-- 
2.53.0


^ permalink raw reply related

* [PATCH net 10/11] selftests: mptcp: check output: catch cmd errors
From: Matthieu Baerts (NGI0) @ 2026-05-05 15:00 UTC (permalink / raw)
  To: Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Christoph Paasch
  Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0), stable,
	Shuah Khan, linux-kselftest
In-Reply-To: <20260505-net-mptcp-pm-fixes-7-1-rc3-v1-0-fca8091060a4@kernel.org>

Using '${?}' inside the if-statement to check the returned value from
the command that was evaluated as part of the if-statement is not
correct: here, '${?}' will be linked to the previous instruction, not
the one that is expected here (${cmd}).

Instead, simply mark the error, except if an error is expected. If
that's the case, 1 can be passed as the 4th argument of this helper.
Three checks from pm_netlink.sh expect an error.

While at it, improve the error message when the command unexpectedly
fails or succeeds.

Note that we could expect a specific returned value, but the checks
currently expecting an error can be used with 'ip mptcp' or 'pm_nl_ctl',
and these two tools don't return the same error code.

Fixes: 2d0c1d27ea4e ("selftests: mptcp: add mptcp_lib_check_output helper")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-kselftest@vger.kernel.org
---
 tools/testing/selftests/net/mptcp/mptcp_lib.sh  | 16 ++++++++++------
 tools/testing/selftests/net/mptcp/pm_netlink.sh | 10 ++++++----
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_lib.sh b/tools/testing/selftests/net/mptcp/mptcp_lib.sh
index 5fea7e7df628..989a5975dcea 100644
--- a/tools/testing/selftests/net/mptcp/mptcp_lib.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_lib.sh
@@ -474,20 +474,24 @@ mptcp_lib_wait_local_port_listen() {
 	wait_local_port_listen "${@}" "tcp"
 }
 
+# $1: error file, $2: cmd, $3: expected msg, [$4: expected error]
 mptcp_lib_check_output() {
 	local err="${1}"
 	local cmd="${2}"
 	local expected="${3}"
+	local exp_error="${4:-0}"
 	local cmd_ret=0
 	local out
 
-	if ! out=$(${cmd} 2>"${err}"); then
-		cmd_ret=${?}
-	fi
+	out=$(${cmd} 2>"${err}") || cmd_ret=1
 
-	if [ ${cmd_ret} -ne 0 ]; then
-		mptcp_lib_pr_fail "command execution '${cmd}' stderr"
-		cat "${err}"
+	if [ "${cmd_ret}" != "${exp_error}" ]; then
+		mptcp_lib_pr_fail "unexpected returned code for '${cmd}', info:"
+		if [ "${exp_error}" = 0 ]; then
+			cat "${err}"
+		else
+			echo "${out}"
+		fi
 		return 2
 	elif [ "${out}" = "${expected}" ]; then
 		return 0
diff --git a/tools/testing/selftests/net/mptcp/pm_netlink.sh b/tools/testing/selftests/net/mptcp/pm_netlink.sh
index 123d9d7a0278..b69f30fcb91e 100755
--- a/tools/testing/selftests/net/mptcp/pm_netlink.sh
+++ b/tools/testing/selftests/net/mptcp/pm_netlink.sh
@@ -122,10 +122,12 @@ check()
 	local cmd="$1"
 	local expected="$2"
 	local msg="$3"
+	local exp_error="$4"
 	local rc=0
 
 	mptcp_lib_print_title "$msg"
-	mptcp_lib_check_output "${err}" "${cmd}" "${expected}" || rc=${?}
+	mptcp_lib_check_output "${err}" "${cmd}" "${expected}" "${exp_error}" ||
+		rc=${?}
 	if [ ${rc} -eq 2 ]; then
 		mptcp_lib_result_fail "${msg} # error ${rc}"
 		ret=${KSFT_FAIL}
@@ -158,13 +160,13 @@ check "show_endpoints" \
 			    "3,10.0.1.3,signal backup")" "dump addrs"
 
 del_endpoint 2
-check "get_endpoint 2" "" "simple del addr"
+check "get_endpoint 2" "" "simple del addr" 1
 check "show_endpoints" \
 	"$(format_endpoints "1,10.0.1.1" \
 			    "3,10.0.1.3,signal backup")" "dump addrs after del"
 
 add_endpoint 10.0.1.3 2>/dev/null
-check "get_endpoint 4" "" "duplicate addr"
+check "get_endpoint 4" "" "duplicate addr" 1
 
 add_endpoint 10.0.1.4 flags signal
 check "get_endpoint 4" "$(format_endpoints "4,10.0.1.4,signal")" "id addr increment"
@@ -173,7 +175,7 @@ for i in $(seq 5 9); do
 	add_endpoint "10.0.1.${i}" flags signal >/dev/null 2>&1
 done
 check "get_endpoint 9" "$(format_endpoints "9,10.0.1.9,signal")" "hard addr limit"
-check "get_endpoint 10" "" "above hard addr limit"
+check "get_endpoint 10" "" "above hard addr limit" 1
 
 del_endpoint 9
 for i in $(seq 10 255); do

-- 
2.53.0


^ permalink raw reply related

* [PATCH net 09/11] mptcp: pm: prio: skip closed subflows
From: Matthieu Baerts (NGI0) @ 2026-05-05 15:00 UTC (permalink / raw)
  To: Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Christoph Paasch
  Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0), stable
In-Reply-To: <20260505-net-mptcp-pm-fixes-7-1-rc3-v1-0-fca8091060a4@kernel.org>

When sending an MP_PRIO, closed subflows need to be skipped.

This fixes the case where the initial subflow got closed, re-opened
later, then an MP_PRIO is needed for the same local address.

Note that explicit MP_PRIO cannot be sent during the 3WHS, so it is fine
to use __mptcp_subflow_active().

Fixes: 067065422fcd ("mptcp: add the outgoing MP_PRIO support")
Cc: stable@vger.kernel.org
Fixes: b29fcfb54cd7 ("mptcp: full disconnect implementation")
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 net/mptcp/pm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 4a6e5ab30d80..3c152bf66cd5 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -284,6 +284,9 @@ int mptcp_pm_mp_prio_send_ack(struct mptcp_sock *msk,
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 		struct mptcp_addr_info local, remote;
 
+		if (!__mptcp_subflow_active(subflow))
+			continue;
+
 		mptcp_local_address((struct sock_common *)ssk, &local);
 		if (!mptcp_addresses_equal(&local, addr, addr->port))
 			continue;

-- 
2.53.0


^ permalink raw reply related

* [PATCH net 08/11] mptcp: pm: ADD_ADDR rtx: return early if no retrans
From: Matthieu Baerts (NGI0) @ 2026-05-05 15:00 UTC (permalink / raw)
  To: Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Christoph Paasch
  Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0), stable
In-Reply-To: <20260505-net-mptcp-pm-fixes-7-1-rc3-v1-0-fca8091060a4@kernel.org>

No need to iterate over all subflows if there is no retransmission
needed.

Exit early in this case then.

Fixes: 30549eebc4d8 ("mptcp: make ADD_ADDR retransmission timeout adaptive")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 net/mptcp/pm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 8a5dba7fe66e..4a6e5ab30d80 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -308,6 +308,9 @@ static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk)
 	struct mptcp_subflow_context *subflow;
 	unsigned int max = 0, max_stale = 0;
 
+	if (!rto)
+		return 0;
+
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 		struct inet_connection_sock *icsk = inet_csk(ssk);

-- 
2.53.0


^ permalink raw reply related

* [PATCH net 07/11] mptcp: pm: ADD_ADDR rtx: skip inactive subflows
From: Matthieu Baerts (NGI0) @ 2026-05-05 15:00 UTC (permalink / raw)
  To: Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Christoph Paasch
  Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0), stable
In-Reply-To: <20260505-net-mptcp-pm-fixes-7-1-rc3-v1-0-fca8091060a4@kernel.org>

When looking at the maximum RTO amongst the subflows, inactive subflows
were taken into account: that includes stale ones, and the initial one
if it has been already been closed.

Unusable subflows are now simply skipped. Stale ones are used as an
alternative: if there are only stale ones, to take their maximum RTO and
avoid to eventually fallback to net.mptcp.add_addr_timeout, which is set
to 2 minutes by default.

Fixes: 30549eebc4d8 ("mptcp: make ADD_ADDR retransmission timeout adaptive")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 net/mptcp/pm.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 29d1bb6a69cf..8a5dba7fe66e 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -306,18 +306,28 @@ static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk)
 	const struct net *net = sock_net((struct sock *)msk);
 	unsigned int rto = mptcp_get_add_addr_timeout(net);
 	struct mptcp_subflow_context *subflow;
-	unsigned int max = 0;
+	unsigned int max = 0, max_stale = 0;
 
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 		struct inet_connection_sock *icsk = inet_csk(ssk);
 
-		if (icsk->icsk_rto > max)
+		if (!__mptcp_subflow_active(subflow))
+			continue;
+
+		if (unlikely(subflow->stale)) {
+			if (icsk->icsk_rto > max_stale)
+				max_stale = icsk->icsk_rto;
+		} else if (icsk->icsk_rto > max) {
 			max = icsk->icsk_rto;
+		}
 	}
 
-	if (max && max < rto)
-		rto = max;
+	if (max)
+		return min(max, rto);
+
+	if (max_stale)
+		return min(max_stale, rto);
 
 	return rto;
 }

-- 
2.53.0


^ permalink raw reply related

* [PATCH net 06/11] mptcp: pm: ADD_ADDR rtx: resched blocked ADD_ADDR quicker
From: Matthieu Baerts (NGI0) @ 2026-05-05 15:00 UTC (permalink / raw)
  To: Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Christoph Paasch
  Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0), stable
In-Reply-To: <20260505-net-mptcp-pm-fixes-7-1-rc3-v1-0-fca8091060a4@kernel.org>

When an ADD_ADDR needs to be retransmitted and another one has already
been prepared -- e.g. multiple ADD_ADDRs have been sent in a row and
need to be retransmitted later -- this additional retransmission will
need to wait.

In this case, the timer was reset to TCP_RTO_MAX / 8, which is ~15
seconds. This delay is unnecessary long: it should just be rescheduled
at the next opportunity, e.g. after the retransmission timeout.

Without this modification, some issues can be seen from time to time in
the selftests when multiple ADD_ADDRs are sent, and the host takes time
to process them, e.g. the "signal addresses, ADD_ADDR timeout" MPTCP
Join selftest, especially with a debug kernel config.

Note that on older kernels, 'timeout' is not available. It should be
enough to replace it by one second (HZ).

Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 net/mptcp/pm.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 8899327e59a1..29d1bb6a69cf 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -342,13 +342,8 @@ static void mptcp_pm_add_timer(struct timer_list *timer)
 		goto out;
 	}

-	if (mptcp_pm_should_add_signal_addr(msk)) {
-		timeout = TCP_RTO_MAX / 8;
-		goto out;
-	}
-
 	timeout = mptcp_adjust_add_addr_timeout(msk);
-	if (!timeout)
+	if (!timeout || mptcp_pm_should_add_signal_addr(msk))
 		goto out;

 	spin_lock_bh(&msk->pm.lock);

-- 
2.53.0

^ permalink raw reply related

* [PATCH net 05/11] mptcp: pm: ADD_ADDR rtx: free sk if last
From: Matthieu Baerts (NGI0) @ 2026-05-05 15:00 UTC (permalink / raw)
  To: Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Christoph Paasch
  Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0), stable
In-Reply-To: <20260505-net-mptcp-pm-fixes-7-1-rc3-v1-0-fca8091060a4@kernel.org>

When an ADD_ADDR is retransmitted, the sk is held in sk_reset_timer(),
and released at the end.

If at that moment, it was the last reference being held, the sk would
not be freed. sock_put() should then be called instead of __sock_put().

But that's not enough: if it is the last reference, sock_put() will call
sk_free(), which will end up calling sk_stop_timer_sync() on the same
timer, and waiting indefinitely to finish. So it is needed to mark that
the timer is done at the end of the timer handler when it has not been
rescheduled, not to call sk_stop_timer_sync() on "itself".

Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 net/mptcp/pm.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 2a01bf1b5bfd..8899327e59a1 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -16,6 +16,7 @@ struct mptcp_pm_add_entry {
 	struct list_head	list;
 	struct mptcp_addr_info	addr;
 	u8			retrans_times;
+	bool			timer_done;
 	struct timer_list	add_timer;
 	struct mptcp_sock	*sock;
 	struct rcu_head		rcu;
@@ -327,22 +328,22 @@ static void mptcp_pm_add_timer(struct timer_list *timer)
 							      add_timer);
 	struct mptcp_sock *msk = entry->sock;
 	struct sock *sk = (struct sock *)msk;
-	unsigned int timeout;
+	unsigned int timeout = 0;
 
 	pr_debug("msk=%p\n", msk);
 
-	if (unlikely(inet_sk_state_load(sk) == TCP_CLOSE))
-		goto exit;
-
 	bh_lock_sock(sk);
+	if (unlikely(inet_sk_state_load(sk) == TCP_CLOSE))
+		goto out;
+
 	if (sock_owned_by_user(sk)) {
 		/* Try again later. */
-		sk_reset_timer(sk, timer, jiffies + HZ / 20);
+		timeout = HZ / 20;
 		goto out;
 	}
 
 	if (mptcp_pm_should_add_signal_addr(msk)) {
-		sk_reset_timer(sk, timer, jiffies + TCP_RTO_MAX / 8);
+		timeout = TCP_RTO_MAX / 8;
 		goto out;
 	}
 
@@ -360,8 +361,9 @@ static void mptcp_pm_add_timer(struct timer_list *timer)
 	}
 
 	if (entry->retrans_times < ADD_ADDR_RETRANS_MAX)
-		sk_reset_timer(sk, timer,
-			       jiffies + (timeout << entry->retrans_times));
+		timeout <<= entry->retrans_times;
+	else
+		timeout = 0;
 
 	spin_unlock_bh(&msk->pm.lock);
 
@@ -369,9 +371,13 @@ static void mptcp_pm_add_timer(struct timer_list *timer)
 		mptcp_pm_subflow_established(msk);
 
 out:
+	if (timeout)
+		sk_reset_timer(sk, timer, jiffies + timeout);
+	else
+		/* if sock_put calls sk_free: avoid waiting for this timer */
+		entry->timer_done = true;
 	bh_unlock_sock(sk);
-exit:
-	__sock_put(sk);
+	sock_put(sk);
 }
 
 struct mptcp_pm_add_entry *
@@ -434,6 +440,7 @@ bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk,
 
 	timer_setup(&add_entry->add_timer, mptcp_pm_add_timer, 0);
 reset_timer:
+	add_entry->timer_done = false;
 	timeout = mptcp_adjust_add_addr_timeout(msk);
 	if (timeout)
 		sk_reset_timer(sk, &add_entry->add_timer, jiffies + timeout);
@@ -454,7 +461,8 @@ static void mptcp_pm_free_anno_list(struct mptcp_sock *msk)
 	spin_unlock_bh(&msk->pm.lock);
 
 	list_for_each_entry_safe(entry, tmp, &free_list, list) {
-		sk_stop_timer_sync(sk, &entry->add_timer);
+		if (!entry->timer_done)
+			sk_stop_timer_sync(sk, &entry->add_timer);
 		kfree_rcu(entry, rcu);
 	}
 }

-- 
2.53.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox