Netdev List
* Re: [PATCH net-next v7 0/4] net: rnpgbe: Add TX/RX and link status support
From: Simon Horman @ 2026-06-12 18:44 UTC (permalink / raw)
  To: Dong Yibo
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, danishanwar,
	vadim.fedorenko, u.kleine-koenig, linux-kernel, netdev, yaojun
In-Reply-To: <20260611100036.36370-1-dong100@mucse.com>

On Thu, Jun 11, 2026 at 06:00:32PM +0800, Dong Yibo wrote:
> Hi maintainers,
> 
> This patch series adds the packet transmission, reception, and link status
> management features to the RNPGBE driver, building upon the previously
> introduced mailbox communication and basic driver infrastructure.
> 
> The series introduces:
> - Msix/msi interrupt handling with NAPI support
> - TX path with scatter-gather DMA and completion handling
> - RX path with page pool buffer management
> - Link status monitoring and carrier management
> 
> These changes enable the RNPGBE driver to support basic tx/rx
> network operations.
> 
> Changelog:
> v6 -> v7:
> [patch 2/4]:
> 1. Fix 'frag_idx' error in rnpgbe_tx_map. (Sashiko-gemini)
> [patch 3/4]:
> 1. Fix skb leak in invalid size path in rnpgbe_clean_rx_irq.
>    (Sashiko-gemini)
> 2. Fix invalid size range check for rxdesc. (Sashiko-gemini)
> [patch 4/4]:
> 1. Fix 'data race on the reply payload'. (Sashiko-gemini)
> 2. Fix 'asymmetric behaviour' when report up/down. (andrew)
> 
> links:
> ---
> v1: https://lore.kernel.org/netdev/20260325091204.94015-1-dong100@mucse.com/
> v2: https://lore.kernel.org/netdev/20260403025713.527841-1-dong100@mucse.com/
> v3: https://lore.kernel.org/netdev/20260507081539.171844-1-dong100@mucse.com/
> v4: https://lore.kernel.org/netdev/20260526033539.164061-1-dong100@mucse.com/
> v5: https://lore.kernel.org/netdev/20260528023150.239532-1-dong100@mucse.com/
> v6: https://lore.kernel.org/netdev/20260604112750.769215-1-dong100@mucse.com/
> 
> Additional Notes:

Thanks for the update and the notes.

There is another round of AI-generated review of this patch-set available
on both https://sashiko.dev and https://netdev-ai.bots.linux.dev/sashiko/

I would appreciate it if you could look over that too, with a view to
addressing any issues that directly affect this patch.

...


* Re: [PATCH bpf-next v3 0/7] bpf, skmsg: some fixes for skmsg
From: Kuniyuki Iwashima @ 2026-06-12 18:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiayuan Chen, Kuniyuki Iwashima, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Emil Tsalapatis, John Fastabend,
	Stanislav Fomichev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jakub Sitnicki, Shuah Khan,
	Jesper Dangaard Brouer, Ihor Solodrai, Sechang Lim, Cong Wang,
	LKML, Network Development, open list:KERNEL SELFTEST FRAMEWORK
In-Reply-To: <CAADnVQKL26zPepqOf_FtuA6ALQEtFJmiKU12iahdvfeAyJ7XhA@mail.gmail.com>

On Fri, Jun 12, 2026 at 10:10 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jun 12, 2026 at 6:09 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
> >
> >
> > All fixes are from previous patches sent by Weiming Shi, Zhang Cen,
> > Kuniyuki and Sechang Lim, which have already been reviewed by me and John and Jakub.
> >
> > https://lore.kernel.org/bpf/20260610081218.506709-2-rhkrqnwk98@gmail.com/
> > https://lore.kernel.org/bpf/20260520102715.3033936-1-rollkingzzc@gmail.com/
> > https://lore.kernel.org/bpf/20260424190310.1520555-2-bestswngs@gmail.com/
> > https://lore.kernel.org/bpf/20260424191602.1522411-3-bestswngs@gmail.com/
> > https://lore.kernel.org/bpf/20260423155807.1245644-2-bestswngs@gmail.com/
> > https://lore.kernel.org/bpf/20260221233234.3814768-4-kuniyu@google.com/
> >
> > The automated reviewer (sashiko) may still flag a few other potential
> > issues on top of this series. After looking into them, they are either
> > already covered by the patches here, or only reachable under very narrow
> > conditions that require a specially crafted BPF program and an unusual
> > sk_msg ring state, so they are not practical to trigger and are left out
> > of this series. I'm collecting these fixes together because the same
> > problems have been re-sent many times in slightly different forms, and I
> > hope this series can be prioritized for merging so the duplicates can
> > finally settle. With so many AI-generated patches floating around for
> > these spots, leaving them unmerged just keeps wasting maintainer review
> > cycles on the same issues.
> >
> >
> > v2->v3: Target to bpf-next and carry Emil's reviewed-by tag.
> >         Reverse xmas tree style is used suggested by Cong.
> >         (not all code match reverse xmas tree due to variable dependency)
> > v1->v2: fix problem when fix the conflict.
> >
>
> Kuniyuki,
>
> one patch is yours. thanks.
> Please help review the rest.

Sure, will look into it.

Thanks


* Re: [net-next 8/9] dt-bindings: net: renesas,etheravb: Add optional gPTP phandle for Gen4
From: Sergey Shtylyov @ 2026-06-12 18:37 UTC (permalink / raw)
  To: Niklas Söderlund, Paul Barker, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Richard Cochran,
	Geert Uytterhoeven, Magnus Damm, netdev, linux-renesas-soc,
	devicetree, linux-kernel
In-Reply-To: <20260610102432.3538432-9-niklas.soderlund+renesas@ragnatech.se>

On 6/10/26 1:24 PM, Niklas Söderlund wrote:

> The RAVB module on Gen4 have no gPTP clock as part of the RAVB module
> itself, instead it relies on an external system wide gPTP clock. The
> gPTP clock is shared with RTSN on V4H and RSWITCH on S4.
> 
> Add an optional phandle so that the RAVB driver can find and use the
> gPTP clock. Ideally this should have been an mandatory property but for

   s/an/a/.

> backward compatible it is optional. The RAVB module is capable of
> functioning without it, but can in such cases not provided PTP
> functionality.
> 
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>

Reviewed-by: Sergey Shtylyov <sergei.shtylyov@gmail.com>

[...]

> diff --git a/Documentation/devicetree/bindings/net/renesas,etheravb.yaml b/Documentation/devicetree/bindings/net/renesas,etheravb.yaml
> index 1e00ef5b3acd..7bc910ab3ae0 100644
> --- a/Documentation/devicetree/bindings/net/renesas,etheravb.yaml
> +++ b/Documentation/devicetree/bindings/net/renesas,etheravb.yaml
> @@ -122,6 +122,13 @@ properties:
>        Specify when the AVB_LINK signal is active-low instead of normal
>        active-high.
>  
> +  renesas,gptp:
> +    $ref: /schemas/types.yaml#/definitions/phandle
> +    description:
> +      A phandle to an external gPTP clock for Gen4 platforms. The property is

   You're sure we can't handle that with the usual "clocks" prop?

> +      optional for backwards compatibility, but without it gPTP timestamps are
> +      disabled as Gen4 have no gPTP as part of the RAVB module itself.

   Again, I'd prefer EtherAVB -- to comply with the binding file name...

[...]

MBR, Sergey



* Re: [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
From: Alexei Starovoitov @ 2026-06-12 18:34 UTC (permalink / raw)
  To: Cong Wang
  Cc: Cong Wang, Jakub Kicinski, Network Development, bpf,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Hemanth Malla,
	zijianzhang
In-Reply-To: <aixMETjhANSYBZP_@pop-os.localdomain>

On Fri, Jun 12, 2026 at 11:12 AM Cong Wang <cwang@multikernel.io> wrote:
>
> On Fri, Jun 12, 2026 at 09:01:43AM -0700, Alexei Starovoitov wrote:
> > Just saying that the code is free nowadays, so whether it's 1k lines
> > or 10 lines is irrelevant for the discussion.
> >
> > As far as the idea goes, I think, it would be interesting in pre-AI era,
> > but today splice and friends are a prime target for bugs and more bugs.
> > skmsg and tcp_bpf are reeling from unfixed bugs too,
> > so my take is that we should not add any new features to skmsg
> > and instead deprecate what is already there.
>
> I guess maybe the name misleads you, it has nothing related to splice()
> syscall. Its ring buffer was developed on top of include/linux/circ_buf.h
> which again has nothing related to splice()/vmsplice()/pipe().
>
> In case it is not obvious, this patchset does not add any new user-space
> interface, only a kfunc which is visible to only sockmap eBPF programs
> which already require CAP_BPF privilege.

Not the name, but the concept. Taking from one socket and feeding
into another already caused a ton of issues for the networking stack.
If you can convince Kuba we can entertain it.


* Re: [net-next 5/9] net: ethernet: ravb: Replace gPTP flags with callbacks
From: Sergey Shtylyov @ 2026-06-12 18:31 UTC (permalink / raw)
  To: Niklas Söderlund, Paul Barker, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Richard Cochran,
	Geert Uytterhoeven, Magnus Damm, netdev, linux-renesas-soc,
	devicetree, linux-kernel
In-Reply-To: <20260610102432.3538432-6-niklas.soderlund+renesas@ragnatech.se>

On 6/10/26 1:24 PM, Niklas Söderlund wrote:

> Prepare for adding Gen4 support which will add a third and new way to
> interact with the gPTP clock by replacing the flags for Gen2 behavior
> (info->gptp) and Gen3 behavior (info->ccc_gac) with callbacks.
> 
> This will make adding Gen4 support cleaner as the code will not have "if
> else if else" sprinkled all over to handle each generations special
> cases.
> 
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>

Reviewed-by: Sergey Shtylyov <sergei.shtylyov@gmail.com>

[...]

> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
> index 013ced6dcf29..70bef3b31d38 100644
> --- a/drivers/net/ethernet/renesas/ravb.h
> +++ b/drivers/net/ethernet/renesas/ravb.h
> @@ -1034,6 +1034,27 @@ struct ravb_ptp {
>  	struct ravb_ptp_perout perout[N_PER_OUT];
>  };
>  
> +/**
> + * struct ravb_gptp_info - Platform specific gPTP behavior
> + *
> + * Each generation of RAVB have slightly different behaviors when interacting

   Well, I haven't seen the word RAVB in any Renesas' manuals, have you?
   I personally prefer calling it EtherAVB; the manuals had "Ethernet AVB", IIRC... :-)

[...]

MBR, Sergey
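
The quoted commit message describes replacing per-generation flags with
callbacks. As a rough user-space illustration of that pattern only (this is
not the ravb code, and every struct, function and field name below is
hypothetical), a table of function pointers lets the core call one hook
instead of branching on the generation:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-generation hooks; names are illustrative only. */
struct gptp_ops {
	int (*init)(int *state);	/* may be NULL: nothing to do */
	int (*stop)(int *state);	/* may be NULL: nothing to do */
};

static int gen2_init(int *state) { *state = 2; return 0; }
static int gen3_init(int *state) { *state = 3; return 0; }
static int common_stop(int *state) { *state = 0; return 0; }

static const struct gptp_ops gen2_ops = { .init = gen2_init, .stop = common_stop };
static const struct gptp_ops gen3_ops = { .init = gen3_init, .stop = common_stop };
static const struct gptp_ops no_gptp_ops = { .init = NULL, .stop = NULL };

/* Core code: one call site, no "if gen2 ... else if gen3 ..." chain. */
static int core_open(const struct gptp_ops *ops, int *state)
{
	return ops->init ? ops->init(state) : 0;
}

static int core_close(const struct gptp_ops *ops, int *state)
{
	return ops->stop ? ops->stop(state) : 0;
}
```

In this sketch a further generation would only add one more ops table; the
core_open() and core_close() call sites stay unchanged.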



* Re: [PATCH iwl-net] idpf: decrease statistics refresh interval
From: Tantilov, Emil S @ 2026-06-12 18:28 UTC (permalink / raw)
  To: Alexander Lobakin, Danny Gonzalez
  Cc: Tony Nguyen, Przemek Kitszel, David S. Miller, Jakub Kicinski,
	Eric Dumazet, intel-wired-lan, netdev, linux-kernel,
	David Decotigny, Anjali Singhai, Sridhar Samudrala, Brian Vazquez,
	Li Li, stable
In-Reply-To: <7a6a3f63-0b69-4e1f-997e-f198e2bc43e9@intel.com>



On 6/12/2026 2:29 AM, Alexander Lobakin wrote:
> From: Danny Gonzalez <digonzal@google.com>
> Date: Thu, 11 Jun 2026 11:26:18 -0700
> 
>> On Thu, Jun 11, 2026 at 8:57 AM Alexander Lobakin
>> <aleksander.lobakin@intel.com> wrote:
>>>
>>> From: Danny Gonzalez <digonzal@google.com>
>>> Date: Thu, 11 Jun 2026 00:24:37 +0000
>>>
>>>> The default 10s statistics refresh interval is too slow for real-time
>>>> monitoring and causes network selftests (e.g., uso.py) to fail when
>>>> verifying traffic immediately after transmission.
>>>>
>>>> A 10s delay also causes aliasing in telemetry tools polling at shorter
>>>> intervals (e.g., 5s), leading to inaccurate rate calculations on
>>>> high-throughput NICs.
>>>>
>>>> Decrease the refresh interval to 250ms to ensure fresh stats and fix
>>>> test failures.
>>>
>>> Have you tried a bit more conservate value like 1s? Wouldn't it be
>>> enough for tests to pass?
>>>
>>> 250 ms is also okay, just curious.
>>
>> Yes, 1s also allows the tests to pass.
>>
>> We have a preference for 250 ms since High-Freq Telemetry (1s poll)
>> 1s driver refresh rate causes aliasing:
>>
>> # sar -n DEV 1 | grep eth1
>> 10:52:15         eth1    390.00    339.00     51.92     55.54      0.00      0.00      0.00      0.00
>> 10:52:16         eth1    409.00    360.00     54.72     58.64      0.00      0.00      0.00      0.00
>> 10:52:17         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
> 
> Ack!
> 
> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
>>
>> Thanks,
>> Danny
> 
> Thanks,
> Olek

Unfortunately this doesn't scale very well, as it adds overhead: virtchnl
has to update stats at a higher frequency, which is why the delay was so
big to begin with. You can have multiple vports and thousands of VFs. We
do have a different solution for this case in the OOT driver, where we
only speed it up a bit when there is an actual request from the user, via
mod_delayed_work():
https://github.com/intel/ethernet-linux-idpf/blob/main/idpf/src/idpf_ethtool.c#L1735

That would be a better approach for this issue, IMHO.

Thanks,
Emil
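
As a rough illustration of the on-demand refresh idea above, here is a
minimal user-space model in C. It is not the idpf code and does not call
mod_delayed_work(); it only models "keep the slow periodic refresh, but pull
the next run in when the user asks". The struct, the function names and the
250 ms on-demand delay are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define STATS_PERIOD_MS    10000u	/* slow background refresh */
#define STATS_ON_DEMAND_MS   250u	/* hypothetical short delay on request */

struct stats_task {
	uint64_t deadline_ms;	/* when the next refresh should run */
	unsigned int runs;	/* number of refreshes performed */
};

static void stats_task_init(struct stats_task *t, uint64_t now_ms)
{
	t->deadline_ms = now_ms + STATS_PERIOD_MS;
	t->runs = 0;
}

/* User asked for stats: pull the pending refresh in, never push it out. */
static void stats_request(struct stats_task *t, uint64_t now_ms)
{
	uint64_t soon = now_ms + STATS_ON_DEMAND_MS;

	if (soon < t->deadline_ms)
		t->deadline_ms = soon;
}

/* Timer tick: run the refresh if due, then re-arm with the slow period. */
static int stats_tick(struct stats_task *t, uint64_t now_ms)
{
	if (now_ms < t->deadline_ms)
		return 0;
	t->runs++;
	t->deadline_ms = now_ms + STATS_PERIOD_MS;
	return 1;
}
```

With no user requests the model refreshes once per slow period, so the
steady-state message rate does not grow with the number of vports or VFs;
only an explicit request shortens the wait.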


* Re: [net-next 4/9] net: ethernet: ravb: Remove redundant argument to ravb_ptp_init()
From: Sergey Shtylyov @ 2026-06-12 18:17 UTC (permalink / raw)
  To: Niklas Söderlund, Paul Barker, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Richard Cochran,
	Geert Uytterhoeven, Magnus Damm, netdev, linux-renesas-soc,
	devicetree, linux-kernel
In-Reply-To: <20260610102432.3538432-5-niklas.soderlund+renesas@ragnatech.se>

On 6/10/26 1:24 PM, Niklas Söderlund wrote:

> There is no need to explicitly pass the struct platform_device pointer
> to ravb_ptp_init(), it can retrieve it directly from the private data
> structure.
> 
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>

Reviewed-by: Sergey Shtylyov <sergei.shtylyov@gmail.com>

[...]

MBR, Sergey



* Re: [net-next 2/9] net: ethernet: ravb: Move programming of gPTP timer interval
From: Sergey Shtylyov @ 2026-06-12 18:16 UTC (permalink / raw)
  To: Niklas Söderlund, Paul Barker, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Richard Cochran,
	Geert Uytterhoeven, Magnus Damm, netdev, linux-renesas-soc,
	devicetree, linux-kernel
In-Reply-To: <20260610102432.3538432-3-niklas.soderlund+renesas@ragnatech.se>

On 6/10/26 1:24 PM, Niklas Söderlund wrote:

> Commit f384ab481cab ("net: ravb: Split GTI computation and set
> operations") broke apart the operations of computing the timer interval
> and programming of it. However it kept the programming of the interval
> in the RAVB main logic.
> 
> Having split the two apart this can be improved further by moving the
> programming to the gPTP initialization function, as the first action of
> the gPTP init function is to wait for the timer interval programming to
> be acknowledge by the hardware.
> 
> As an added bonus the interaction with the gPTP registers for the
> programming can then also be done while holding the gPTP registers lock.
> 
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>

Reviewed-by: Sergey Shtylyov <sergei.shtylyov@gmail.com>

[...]

MBR, Sergey



* Re: [PATCH v2 5/5] tcp: Remove mmap_lock fallback path
From: Vlastimil Babka (SUSE) @ 2026-06-12 18:13 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: Alice Ryhl, Andrew Morton, Arve Hjønnevåg,
	Carlos Llamas, Christian Brauner, David Ahern, David S. Miller,
	Greg Kroah-Hartman, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
	netdev, Shakeel Butt, Suren Baghdasaryan, Todd Kjos
In-Reply-To: <20260610230419.F746395F@davehans-spike.ostc.intel.com>

On 6/11/26 01:04, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Previously, the per-VMA locking could fail in the face of writers
> which necessitates a fallback to mmap_lock. The new
> lock_vma_under_rcu_wait() will wait for writers instead of failing.

vma_start_read_unlocked()

> 
> Use the new helper. Wait for writers. Remove the fallback to mmap_lock.
> 
> This really is a nice cleanup.

[obama_medal_meme.jpg] ;)

But yeah, it is!

> It removes the need to pass the lock
> state back and forth to find_tcp_vma().
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Acked-by: Lorenzo Stoakes <ljs@kernel.org>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Arve Hjønnevåg <arve@android.com>
> Cc: Todd Kjos <tkjos@android.com>
> Cc: Christian Brauner <christian@brauner.io>
> Cc: Carlos Llamas <cmllamas@google.com>
> Cc: Alice Ryhl <aliceryhl@google.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: David Ahern <dsahern@kernel.org>
> Cc: netdev@vger.kernel.org
> ---
> 
>  b/net/ipv4/tcp.c |   31 +++++++++----------------------
>  1 file changed, 9 insertions(+), 22 deletions(-)
> 
> diff -puN net/ipv4/tcp.c~ipv4-tcp-vma-waiter net/ipv4/tcp.c
> --- a/net/ipv4/tcp.c~ipv4-tcp-vma-waiter	2026-06-10 15:57:56.972472379 -0700
> +++ b/net/ipv4/tcp.c	2026-06-10 15:57:56.976472521 -0700
> @@ -2171,27 +2171,18 @@ static void tcp_zc_finalize_rx_tstamp(st
>  }
>  
>  static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
> -					   unsigned long address,
> -					   bool *mmap_locked)
> +					   unsigned long address)
>  {
> -	struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
> +	struct vm_area_struct *vma = vma_start_read_unlocked(mm, address);
>  
> -	if (vma) {
> -		if (vma->vm_ops != &tcp_vm_ops) {
> -			vma_end_read(vma);
> -			return NULL;
> -		}
> -		*mmap_locked = false;
> -		return vma;
> -	}
> +	if (!vma)
> +		return NULL;
>  
> -	mmap_read_lock(mm);
> -	vma = vma_lookup(mm, address);
> -	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> -		mmap_read_unlock(mm);
> +	if (vma->vm_ops != &tcp_vm_ops) {
> +		vma_end_read(vma);
>  		return NULL;
>  	}
> -	*mmap_locked = true;
> +
>  	return vma;
>  }
>  
> @@ -2212,7 +2203,6 @@ static int tcp_zerocopy_receive(struct s
>  	u32 seq = tp->copied_seq;
>  	u32 total_bytes_to_map;
>  	int inq = tcp_inq(sk);
> -	bool mmap_locked;
>  	int ret;
>  
>  	zc->copybuf_len = 0;
> @@ -2237,7 +2227,7 @@ static int tcp_zerocopy_receive(struct s
>  		return 0;
>  	}
>  
> -	vma = find_tcp_vma(current->mm, address, &mmap_locked);
> +	vma = find_tcp_vma(current->mm, address);
>  	if (!vma)
>  		return -EINVAL;
>  
> @@ -2319,10 +2309,7 @@ static int tcp_zerocopy_receive(struct s
>  						   zc, total_bytes_to_map);
>  	}
>  out:
> -	if (mmap_locked)
> -		mmap_read_unlock(current->mm);
> -	else
> -		vma_end_read(vma);
> +	vma_end_read(vma);
>  	/* Try to copy straggler data. */
>  	if (!ret)
>  		copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
> _



* Re: [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
From: Cong Wang @ 2026-06-12 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Cong Wang, Jakub Kicinski, Network Development, bpf,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Hemanth Malla,
	zijianzhang
In-Reply-To: <CAADnVQ+KTNKkf_Tc-RZR-g8wEfJU4qWcOPnjDbA2=PEtZsYnYg@mail.gmail.com>

On Fri, Jun 12, 2026 at 09:01:43AM -0700, Alexei Starovoitov wrote:
> Just saying that the code is free nowadays, so whether it's 1k lines
> or 10 lines is irrelevant for the discussion.
> 
> As far as the idea goes, I think, it would be interesting in pre-AI era,
> but today splice and friends are a prime target for bugs and more bugs.
> skmsg and tcp_bpf are reeling from unfixed bugs too,
> so my take is that we should not add any new features to skmsg
> and instead deprecate what is already there.

I guess maybe the name misled you; it has nothing to do with the splice()
syscall. Its ring buffer was developed on top of include/linux/circ_buf.h,
which again has nothing to do with splice()/vmsplice()/pipe().

In case it is not obvious, this patchset does not add any new user-space
interface, only a kfunc which is visible to only sockmap eBPF programs
which already require CAP_BPF privilege.

If you have a better name in mind, I am happy to change it.

Thanks.


* Re: [net-next 1/9] net: ethernet: ravb: Remove gPTP control from WoL setup and restore
From: Sergey Shtylyov @ 2026-06-12 18:12 UTC (permalink / raw)
  To: Niklas Söderlund, Paul Barker, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Richard Cochran,
	Geert Uytterhoeven, Magnus Damm, netdev, linux-renesas-soc,
	devicetree, linux-kernel
In-Reply-To: <20260610102432.3538432-2-niklas.soderlund+renesas@ragnatech.se>

On 6/10/26 1:24 PM, Niklas Söderlund wrote:

> Since commit a6a85ba36fd0 ("net: ravb: Move PTP initialization in the
> driver's ndo_open API for ccc_gac platorms") the gPTP clock (if
> supported) is stopped and started by opening and closing the ndev.
> 
> This makes the special case to stop and start it when resuming from WoL
> redundant. As the ndev will always be closed and re-opened when
> suspending and resuming the system.
> 
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>

Reviewed-by: Sergey Shtylyov <sergei.shtylyov@gmail.com>

[...]

MBR, Sergey



* Re: [PATCH v2 4/5] binder: Remove mmap_lock fallback
From: Vlastimil Babka (SUSE) @ 2026-06-12 18:07 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: Alice Ryhl, Andrew Morton, Arve Hjønnevåg,
	Carlos Llamas, Christian Brauner, David Ahern, David S. Miller,
	Greg Kroah-Hartman, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
	netdev, Shakeel Butt, Suren Baghdasaryan, Todd Kjos
In-Reply-To: <20260610230417.77D64DBB@davehans-spike.ostc.intel.com>

On 6/11/26 01:04, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Previously, the per-VMA locking could fail in the face of writers
> which necessitate a fallback to mmap_lock. The new
> vma_start_read_unlocked() will wait for writers instead of failing.
> 
> Use the new helper. Wait for writers. Remove the fallback to mmap_lock.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

The usage seems fine to me. But since it handles cases where vma is NULL, it
seems to support my point for patch 3 that vma_start_read_unlocked() should
also allow for and handle vma_lookup() returning NULL.

Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

> Acked-by: Lorenzo Stoakes <ljs@kernel.org>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Arve Hjønnevåg <arve@android.com>
> Cc: Todd Kjos <tkjos@android.com>
> Cc: Christian Brauner <christian@brauner.io>
> Cc: Carlos Llamas <cmllamas@google.com>
> Cc: Alice Ryhl <aliceryhl@google.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: David Ahern <dsahern@kernel.org>
> Cc: netdev@vger.kernel.org
> 
> ---
> 
>  b/drivers/android/binder_alloc.c |   17 +++++------------
>  1 file changed, 5 insertions(+), 12 deletions(-)
> 
> diff -puN drivers/android/binder_alloc.c~binder-vma-waiter drivers/android/binder_alloc.c
> --- a/drivers/android/binder_alloc.c~binder-vma-waiter	2026-06-10 15:57:56.419452721 -0700
> +++ b/drivers/android/binder_alloc.c	2026-06-10 15:57:56.423452863 -0700
> @@ -259,21 +259,14 @@ static int binder_page_insert(struct bin
>  	struct vm_area_struct *vma;
>  	int ret = -ESRCH;
>  
> -	/* attempt per-vma lock first */
> -	vma = lock_vma_under_rcu(mm, addr);
> -	if (vma) {
> -		if (binder_alloc_is_mapped(alloc))
> -			ret = vm_insert_page(vma, addr, page);
> -		vma_end_read(vma);
> +	vma = vma_start_read_unlocked(mm, addr);
> +	if (!vma)
>  		return ret;
> -	}
>  
> -	/* fall back to mmap_lock */
> -	mmap_read_lock(mm);
> -	vma = vma_lookup(mm, addr);
> -	if (vma && binder_alloc_is_mapped(alloc))
> +	if (binder_alloc_is_mapped(alloc))
>  		ret = vm_insert_page(vma, addr, page);
> -	mmap_read_unlock(mm);
> +
> +	vma_end_read(vma);
>  
>  	return ret;
>  }
> _



* Re: [PATCH v2 3/5] mm: Add RCU-based VMA lookup helper that waits for writers
From: Vlastimil Babka (SUSE) @ 2026-06-12 18:00 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: Alice Ryhl, Andrew Morton, Arve Hjønnevåg,
	Carlos Llamas, Christian Brauner, David Ahern, David S. Miller,
	Greg Kroah-Hartman, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
	netdev, Shakeel Butt, Suren Baghdasaryan, Todd Kjos
In-Reply-To: <20260610230415.C0521C88@davehans-spike.ostc.intel.com>

On 6/11/26 01:04, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> == Background ==
> 
> There are basically two parallel ways to look up a VMA: the
> traditional way, which is protected by mmap_lock, and the RCU-based
> per-VMA lock way which is based on RCU and refcounts.
> 
> == Problem ==
> 
> The mmap_lock one is more straightforward to use but it has a big
> disadvantage in that it can not be mixed with page faults since those
> can take mmap_lock for read, which can deadlock when mixed with page
> faults. For example:

... mixed with nested page faults and parallel writers, perhaps?

> 
> 	mmap_read_lock(mm);
> 	// Another thread does mmap_write_lock().
> 	// New mmap_lock readers are blocked.
> 	vma = vma_lookup(mm, address);
> 	// This deadlocks on mmap_read_lock() if it faults:
> 	copy_from_user(address);
> 	mmap_read_unlock(mm);
> 
> The RCU one can be mixed with faults, but it is not available in all

I'd stick to the per-VMA lock term rather than "RCU one"

> configs, so all RCU users need to be able to fall back to the
> traditional way.

This is now an obsolete statement as patch 1 makes them available? But the
problem is that they can fail and using mmap read lock as a fallback in a
simple way has the above issue?

> == Solution ==
> 
> Add a variant of the RCU-based lookup that waits for writers. This is
> basically the same as the existing RCU-based lookup, but it also takes
> mmap_lock for read and waits for writers to finish before returning
> the VMA. This has some advantages:

I would stress the part that the mmap lock is taken *only temporarily* to
wait for the writers and ensure we obtain a per-vma read lock, and then
dropped again? As that's the main trick IIUC.

>  1. Callers do not need to have a fallback path for when they
>     collide with writers.
>  2. It can be used in contexts where page faults can happen because
>     it can take the mmap_lock for read but never *holds* it.
>  3. Its fast path does not require taking mmap_lock for read.
> 
> Basically, when applied correctly, this approach results in faster
> *and* simpler code.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Arve Hjønnevåg <arve@android.com>
> Cc: Todd Kjos <tkjos@android.com>
> Cc: Christian Brauner <christian@brauner.io>
> Cc: Carlos Llamas <cmllamas@google.com>
> Cc: Alice Ryhl <aliceryhl@google.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: David Ahern <dsahern@kernel.org>
> Cc: netdev@vger.kernel.org
> 
> --
> 
> Changes from v1:
>  * Add a comment explaining that this can not be mixed with other
>    per-VMA lock or mmap_lock users. It is prone to deadlocks if so.
>  * Add a FIXME about making the mmap_read_lock() killable

I don't see it anywhere?

>  * Add more chaneglog bits about the possibility for an infinite goto
>    loop.
>  * Adopt vma_start_read_unlocked() implementation from Lorenzo
> ---
> 
>  b/include/linux/mmap_lock.h |    3 +++
>  b/mm/mmap_lock.c            |   27 +++++++++++++++++++++++++++
>  2 files changed, 30 insertions(+)
> 
> diff -puN include/linux/mmap_lock.h~lock-vma-under-rcu-wait include/linux/mmap_lock.h
> --- a/include/linux/mmap_lock.h~lock-vma-under-rcu-wait	2026-06-10 15:57:55.828431712 -0700
> +++ b/include/linux/mmap_lock.h	2026-06-10 15:57:55.834431925 -0700
> @@ -257,6 +257,9 @@ static inline bool vma_start_read_locked
>  	return vma_start_read_locked_nested(vma, 0);
>  }
>  
> +struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
> +					       unsigned long address);
> +
>  static inline void vma_end_read(struct vm_area_struct *vma)
>  {
>  	vma_refcount_put(vma);
> diff -puN mm/mmap_lock.c~lock-vma-under-rcu-wait mm/mmap_lock.c
> --- a/mm/mmap_lock.c~lock-vma-under-rcu-wait	2026-06-10 15:57:55.831431819 -0700
> +++ b/mm/mmap_lock.c	2026-06-10 16:02:50.723860779 -0700
> @@ -338,6 +338,33 @@ inval:
>  	return NULL;
>  }
>  
> +/*
> + * Find the VMA covering 'address' and lock it for reading. Waits for writers to
> + * finish if the VMA is being modified. Returns NULL if there is no VMA covering
> + * 'address'.
> + *
> + * Use only in code paths where no mmap_lock and no VMA lock is held.

I think we have various asserts that could be used and are stronger than a
comment ;)

> + *
> + * The fast path does not take mmap_lock.
> + */
> +struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
> +					       unsigned long address)
> +{
> +	struct vm_area_struct *vma;
> +
> +	/* Fast path: return stable VMA covering 'address': */
> +	vma = lock_vma_under_rcu(mm, address);
> +	if (vma)
> +		return vma;
> +
> +	/* Slow path: preclude VMA writers by getting mmap read lock. */

Again I would say "temporarily".

> +	guard(rwsem_read)(&mm->mmap_lock);

Aside from the missing vma_lookup(), I'm not sure we should trust the
result of the lookup blindly. Should we also verify we found a vma? Some
callers might never fail the lookup because they only look up something
that's sure to be present, but some might fail?

> +	if (!vma_start_read_locked(vma))
> +		return NULL;

You can count me on the side that would rather see explicit operations than
the guard. Exactly because it's a subtle usage of the mmap sem, and yeah
also the tracing that Suren pointed out.

Seems to me uffd_lock_vma() mostly does all this right (but also does more
stuff that we don't want to do). I'm just not sure right now when
vma_start_read_locked() failures can happen in practice here (can it be only
refcount overflow or also refcount being zero? hopefully not zero if we
found the vma under mmap lock for read? comment in
lock_next_vma_under_mmap_lock() seems to hint at that) and what to do about
them. We seem to have some unhelpfully stale comments around.

- uffd_lock_vma() doesn't document that it can return -EAGAIN. (and should
the caller then retry or what?)

- vma_start_read_locked() has a comment saying how it cannot fail, but it in
fact can.

> +
> +	return vma;
> +}
> +
>  static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm,
>  							    struct vma_iterator *vmi,
>  							    unsigned long from_addr)
> _
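
To make the review points above concrete (explicit lock and unlock rather
than a guard, a real lookup in the slow path, and a NULL check on its
result), here is a single-threaded user-space model in C. It is a sketch
under stated assumptions, not kernel code: all types and functions are
hypothetical stand-ins, "waiting for writers" is modelled by clearing the
writer flags when the read lock is granted, and the refcount-overflow
failure of vma_start_read_locked() is not modelled.

```c
#include <assert.h>
#include <stddef.h>

struct vma_model {
	unsigned long start, end;	/* covers [start, end) */
	int write_locked;		/* a writer is modifying this VMA */
	int readers;			/* per-VMA read locks held */
};

struct mm_model {
	struct vma_model *vmas;
	size_t nr;
	int mmap_readers;		/* mmap read locks currently held */
};

static struct vma_model *lookup(struct mm_model *mm, unsigned long addr)
{
	for (size_t i = 0; i < mm->nr; i++)
		if (addr >= mm->vmas[i].start && addr < mm->vmas[i].end)
			return &mm->vmas[i];
	return NULL;
}

/* Fast path: fails (NULL) if there is no VMA or a writer is active. */
static struct vma_model *lock_fast(struct mm_model *mm, unsigned long addr)
{
	struct vma_model *vma = lookup(mm, addr);

	if (!vma || vma->write_locked)
		return NULL;
	vma->readers++;
	return vma;
}

/* Model only: a granted read lock implies the writers have finished. */
static void mmap_read_lock_model(struct mm_model *mm)
{
	for (size_t i = 0; i < mm->nr; i++)
		mm->vmas[i].write_locked = 0;
	mm->mmap_readers++;
}

static void mmap_read_unlock_model(struct mm_model *mm)
{
	mm->mmap_readers--;
}

static struct vma_model *lock_vma_waiting(struct mm_model *mm,
					  unsigned long addr)
{
	struct vma_model *vma = lock_fast(mm, addr);

	if (vma)
		return vma;

	mmap_read_lock_model(mm);	/* held only temporarily */
	vma = lookup(mm, addr);		/* redo the lookup under the lock */
	if (vma)
		vma->readers++;		/* NULL is a valid outcome */
	mmap_read_unlock_model(mm);	/* dropped before returning */

	return vma;
}
```

The caller ends up holding only the per-VMA read lock, never the mmap lock,
which is the property that lets it take page faults afterwards.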



* [PATCH net] sctp: hold socket lock when dumping endpoints in sctp_diag
From: Xin Long @ 2026-06-12 17:59 UTC (permalink / raw)
  To: network dev, linux-sctp
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
	Marcelo Ricardo Leitner, Willy Tarreau, Zero Day Initiative

SCTP_DIAG endpoint dumping currently walks the endpoint hash table
without taking the socket lock before calling inet_sctp_diag_fill().

This is problematic because inet_sctp_diag_fill() eventually calls
inet_diag_msg_sctpladdrs_fill(), which traverses the endpoint's local
address list twice: once to count entries for nla_reserve(), and once
again to copy the addresses into the netlink buffer.

Since these two traversals are protected only by separate RCU read-side
critical sections, concurrent socket operations such as
SCTP_SOCKOPT_BINDX_REM may remove entries from the address list between
them. In that case, the number of copied addresses becomes smaller than
the originally reserved buffer size, leaving part of the netlink payload
uninitialized and potentially leaking kernel memory to user space.
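
The count-then-copy hazard described above can be sketched in plain C. This
is only a simplified userspace model: `demo_addr` and `demo_fill` are
hypothetical names, and the shrinking list is simulated by passing a smaller
length to the second pass rather than by a concurrent writer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one bound local address. */
struct demo_addr {
	unsigned char bytes[4];
};

/*
 * Second pass of a count-then-copy scheme: buf was sized for `reserved`
 * entries during an earlier pass, but the list holds `list_len` entries
 * now. Returns how many reserved slots were never written.
 */
static size_t demo_fill(struct demo_addr *buf, size_t reserved,
			const struct demo_addr *list, size_t list_len)
{
	size_t n = list_len < reserved ? list_len : reserved;

	assert(buf && list);
	memcpy(buf, list, n * sizeof(*list));

	/* A non-zero result means part of buf still holds stale bytes. */
	return reserved - n;
}
```

A non-zero return corresponds to the leftover count that the
WARN_ON_ONCE(addrcnt) added by this patch is meant to catch.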

Fix this by changing sctp_for_each_endpoint() to iterate with net and
position awareness while taking a reference on each socket, then release
the endpoint hash bucket read_lock_bh() before invoking the callback.

A socket reference is required because the callback acquires lock_sock(),
which must be called outside of read_lock_bh() since lock_sock() may
sleep. Holding a socket reference ensures the socket remains valid after
dropping the bucket lock and before acquiring the socket lock.

With the socket lock held, concurrent bind-address modifications are
serialized against the diagnostic dump, ensuring the local address list
remains stable during buffer sizing and initialization.

This also simplifies endpoint traversal by removing the temporary
callback local position tracking args[4] and moving dump progress
tracking into sctp_for_each_endpoint() itself.
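
The traversal scheme can be illustrated with a single-bucket, single-threaded
model. This is a rough sketch under stated assumptions, not the kernel code:
`demo_for_each`, `demo_dump` and `demo_dump_cb` are hypothetical names, an
int array stands in for the hash chain, and the locking and socket reference
are only marked by comments.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*demo_cb_t)(int item, void *p);

/* Hypothetical dump state: a buffer with limited room, like a filling skb. */
struct demo_dump {
	int room;
	int sum;
};

static int demo_dump_cb(int item, void *p)
{
	struct demo_dump *d = p;

	if (d->room == 0)
		return -1;	/* no room left, ask the caller to stop */
	d->room--;
	d->sum += item;
	return 0;
}

/*
 * Walk items[0..n), resuming at *pos. Every round rescans from the start,
 * picks the first entry whose index is >= *pos, and only then invokes the
 * callback, outside the scan.
 */
static int demo_for_each(const int *items, size_t n, int *pos,
			 demo_cb_t cb, void *p)
{
	assert(items && pos && cb);

	for (;;) {
		int idx = 0, found = 0, item = 0, err;
		size_t i;

		/* In the kernel this scan would run under the bucket lock. */
		for (i = 0; i < n; i++) {
			if (idx++ >= *pos) {
				item = items[i];	/* would also take a ref */
				found = 1;
				break;
			}
		}
		/* Lock dropped here, so the callback is free to sleep. */
		if (!found)
			return 0;

		err = cb(item, p);
		if (err)
			return err;	/* *pos still names the failed entry */
		(*pos)++;
	}
}
```

Because *pos only advances after a successful callback, a later call with
the same *pos resumes at the entry that could not be dumped.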

While at it, fix the idiag_states check in sctp_ep_dump() and skip ep
dumping when non LISTEN|CLOSE states are also requested and the ep has
assocs, since such cases will be handled later by sctp_sock_dump().

Reported-by: Zero Day Initiative <zdi-disclosures@trendmicro.com>
Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/sctp/sctp.h |  3 +-
 net/sctp/diag.c         | 62 +++++++++++++++++++----------------------
 net/sctp/socket.c       | 34 +++++++++++++++++-----
 3 files changed, 57 insertions(+), 42 deletions(-)

diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 58242b37b47a..cd82b05354a3 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -111,7 +111,8 @@ int sctp_transport_lookup_process(sctp_callback_t cb, struct net *net,
 				  const union sctp_addr *paddr, void *p, int dif);
 int sctp_transport_traverse_process(sctp_callback_t cb, sctp_callback_t cb_done,
 				    struct net *net, int *pos, void *p);
-int sctp_for_each_endpoint(int (*cb)(struct sctp_endpoint *, void *), void *p);
+int sctp_for_each_endpoint(int (*cb)(struct sctp_endpoint *, void *),
+			   struct net *net, int *pos, void *p);
 int sctp_get_sctp_info(struct sock *sk, struct sctp_association *asoc,
 		       struct sctp_info *info);
 
diff --git a/net/sctp/diag.c b/net/sctp/diag.c
index d758f5c3e06e..9108272ca527 100644
--- a/net/sctp/diag.c
+++ b/net/sctp/diag.c
@@ -92,6 +92,7 @@ static int inet_diag_msg_sctpladdrs_fill(struct sk_buff *skb,
 		if (!--addrcnt)
 			break;
 	}
+	WARN_ON_ONCE(addrcnt);
 	rcu_read_unlock();
 
 	return 0;
@@ -373,42 +374,36 @@ static int sctp_ep_dump(struct sctp_endpoint *ep, void *p)
 	struct sk_buff *skb = commp->skb;
 	struct netlink_callback *cb = commp->cb;
 	const struct inet_diag_req_v2 *r = commp->r;
-	struct net *net = sock_net(skb->sk);
 	struct inet_sock *inet = inet_sk(sk);
 	int err = 0;
 
-	if (!net_eq(sock_net(sk), net))
+	lock_sock(sk);
+	if (sctp_sstate(sk, CLOSED))
 		goto out;
 
-	if (cb->args[4] < cb->args[1])
-		goto next;
-
-	if (!(r->idiag_states & TCPF_LISTEN) && !list_empty(&ep->asocs))
-		goto next;
+	if ((r->idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)) &&
+	    !list_empty(&ep->asocs))
+		goto out;
 
 	if (r->sdiag_family != AF_UNSPEC &&
 	    sk->sk_family != r->sdiag_family)
-		goto next;
+		goto out;
 
 	if (r->id.idiag_sport != inet->inet_sport &&
 	    r->id.idiag_sport)
-		goto next;
+		goto out;
 
 	if (r->id.idiag_dport != inet->inet_dport &&
 	    r->id.idiag_dport)
-		goto next;
-
-	if (inet_sctp_diag_fill(sk, NULL, skb, r,
-				sk_user_ns(NETLINK_CB(cb->skb).sk),
-				NETLINK_CB(cb->skb).portid,
-				cb->nlh->nlmsg_seq, NLM_F_MULTI,
-				cb->nlh, commp->net_admin) < 0) {
-		err = 2;
 		goto out;
-	}
-next:
-	cb->args[4]++;
+
+	err = inet_sctp_diag_fill(sk, NULL, skb, r,
+				  sk_user_ns(NETLINK_CB(cb->skb).sk),
+				  NETLINK_CB(cb->skb).portid,
+				  cb->nlh->nlmsg_seq, NLM_F_MULTI,
+				  cb->nlh, commp->net_admin);
 out:
+	release_sock(sk);
 	return err;
 }
 
@@ -479,41 +474,40 @@ static void sctp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 		.r = r,
 		.net_admin = netlink_net_capable(cb->skb, CAP_NET_ADMIN),
 	};
-	int pos = cb->args[2];
+	int pos;
 
 	/* eps hashtable dumps
 	 * args:
 	 * 0 : if it will traversal listen sock
 	 * 1 : to record the sock pos of this time's traversal
-	 * 4 : to work as a temporary variable to traversal list
 	 */
 	if (cb->args[0] == 0) {
-		if (!(idiag_states & TCPF_LISTEN))
-			goto skip;
-		if (sctp_for_each_endpoint(sctp_ep_dump, &commp))
-			goto done;
-skip:
+		if (idiag_states & TCPF_LISTEN) {
+			pos = cb->args[1];
+			if (sctp_for_each_endpoint(sctp_ep_dump, net, &pos,
+						   &commp)) {
+				cb->args[1] = pos;
+				return;
+			}
+		}
 		cb->args[0] = 1;
 		cb->args[1] = 0;
-		cb->args[4] = 0;
 	}
 
+	if (!(idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)))
+		return;
+
 	/* asocs by transport hashtable dump
 	 * args:
 	 * 1 : to record the assoc pos of this time's traversal
 	 * 2 : to record the transport pos of this time's traversal
 	 * 3 : to mark if we have dumped the ep info of the current asoc
 	 * 4 : to work as a temporary variable to traversal list
-	 * 5 : to save the sk we get from travelsing the tsp list.
 	 */
-	if (!(idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)))
-		goto done;
-
+	pos = cb->args[2];
 	sctp_transport_traverse_process(sctp_sock_filter, sctp_sock_dump,
 					net, &pos, &commp);
 	cb->args[2] = pos;
-
-done:
 	cb->args[1] = cb->args[4];
 	cb->args[4] = 0;
 }
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 66e12fb0c646..1ed405dedc01 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -5369,24 +5369,44 @@ struct sctp_transport *sctp_transport_get_idx(struct net *net,
 }
 
 int sctp_for_each_endpoint(int (*cb)(struct sctp_endpoint *, void *),
-			   void *p) {
-	int err = 0;
-	int hash = 0;
-	struct sctp_endpoint *ep;
+			   struct net *net, int *pos, void *p) {
+	int err, hash = 0, idx = 0, start;
 	struct sctp_hashbucket *head;
+	struct sctp_endpoint *ep;
+	struct sock *sk;
 
 	for (head = sctp_ep_hashtable; hash < sctp_ep_hashsize;
 	     hash++, head++) {
+		start = idx;
+again:
+		sk = NULL;
 		read_lock_bh(&head->lock);
 		sctp_for_each_hentry(ep, &head->chain) {
-			err = cb(ep, p);
-			if (err)
+			if (sock_net(ep->base.sk) != net)
+				continue;
+			if (idx++ >= *pos) {
+				sk = ep->base.sk;
+				sock_hold(sk);
 				break;
+			}
 		}
 		read_unlock_bh(&head->lock);
+
+		if (sk) {
+			err = cb(ep, p);
+			if (err) {
+				sock_put(sk);
+				return err;
+			}
+			sock_put(sk);
+			(*pos)++;
+
+			idx = start;
+			goto again;
+		}
 	}
 
-	return err;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(sctp_for_each_endpoint);
 
-- 
2.47.1


^ permalink raw reply related

* Re: [PATCH v2] nfc: nci: add data_len bound checks to activation parameter extractors
From: Bryam Vargas @ 2026-06-12 17:57 UTC (permalink / raw)
  To: Simon Horman, David Heidelberg; +Cc: oe-linux-nfc, netdev, linux-kernel
In-Reply-To: <20260612131813.674622-3-horms@kernel.org>

Thanks for the review. v3 is posted as a new top-level thread:
https://lore.kernel.org/all/20260612-b4-disp-6d52d8b0-v3-1-ae1f21cdd8ab@proton.me/
Point-by-point below.

> [Severity: Medium] Is the correct error type being returned here?
> [...] -EINVAL [...] nci_to_errno() will truncate it [...] -ENOSYS [...]
> Should a valid NCI status code [...] be used instead?

Correct. These extractors return into the int 'err' in
nci_rf_intf_activated_ntf_packet(), which is handed to nci_req_complete()
and later run through nci_to_errno(). nci_to_errno() takes a __u8, so the
-EINVAL is truncated (to 234) at that argument and falls through to the
default case as -ENOSYS. The contract here is an NCI_STATUS_* code: 'err'
is initialized to NCI_STATUS_OK and the functions' own default cases
already return NCI_STATUS_RF_PROTOCOL_ERROR.
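
The truncation can be reproduced in a few lines of userspace C. This is a
minimal sketch, not the kernel function: `demo_to_errno` is a hypothetical,
heavily reduced model of an 8-bit status-to-errno mapper, and the DEMO_*
status values are illustrative rather than copied from the NCI headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative status values, assumed for this sketch only. */
#define DEMO_STATUS_OK			0x00
#define DEMO_STATUS_RF_PROTOCOL_ERROR	0xb1

/* Reduced model of a mapper that accepts an 8-bit status code. */
static int demo_to_errno(uint8_t code)
{
	switch (code) {
	case DEMO_STATUS_OK:
		return 0;
	case DEMO_STATUS_RF_PROTOCOL_ERROR:
		return -EPROTO;
	default:
		/* Anything unrecognized, including a truncated errno. */
		return -ENOSYS;
	}
}
```

Passing -EINVAL to such a function converts it to an unsigned 8-bit value
first, so the mapper never sees a negative errno, only an unknown status.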

v3 returns NCI_STATUS_RF_PROTOCOL_ERROR from the new short-region guards,
which matches the sibling default cases and maps to -EPROTO via
nci_to_errno(). NCI_STATUS_INVALID_PARAM is equally valid (also -EPROTO) --
I went with RF_PROTOCOL_ERROR for consistency with the surrounding code;
glad to switch if you prefer INVALID_PARAM. v3 is otherwise identical to v2
and is posted as a new top-level thread.

> [Severity: Critical] [...] atr_res_len [...] integer underflow [...]
> up to 47 uninitialized stack bytes into ndev->remote_gb [...] broadcast
> to userspace?

The uninitialized-stack read is real: in nci_store_general_bytes_nfc_dep()
an atr_res_len below NFC_ATR_RES_GT_OFFSET makes min_t() clamp to
NFC_ATR_RES_GB_MAXSIZE and memcpy() copies stale bytes of the on-stack
atr_res[] into ndev->remote_gb. I could not, however, confirm the
"broadcast to userspace" part: remote_gb has no netlink attribute (there is
no NFC_ATTR_TARGET_GENERAL_BYTES), and its only path out is LLCP, which is
gated by the 3-byte llcp_magic memcmp on those same uninitialized bytes --
so a deterministic leak of the 47 bytes isn't reachable that way. It's
still a genuine uninitialized read worth closing; see below.
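
The clamp behaviour can be modelled in userspace C. This is a sketch under
assumptions: the DEMO_* values are my reading of the NFC headers (an offset
of 15 and a maximum of 47) and should be checked against the tree, and
`demo_min_u8` only approximates min_t(__u8, ...) by casting both operands to
an unsigned 8-bit type before comparing.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this sketch; verify against the kernel headers. */
#define DEMO_ATR_RES_GT_OFFSET	15
#define DEMO_ATR_RES_GB_MAXSIZE	47

/* Approximates min_t(__u8, a, b): cast both operands, then compare. */
static uint8_t demo_min_u8(int a, int b)
{
	uint8_t x = (uint8_t)a, y = (uint8_t)b;

	return x < y ? x : y;
}

/* General-bytes length that would be copied for a given atr_res_len. */
static uint8_t demo_gb_len(uint8_t atr_res_len)
{
	/* The subtraction goes negative when atr_res_len is below the offset. */
	return demo_min_u8(atr_res_len - DEMO_ATR_RES_GT_OFFSET,
			   DEMO_ATR_RES_GB_MAXSIZE);
}
```

With these values, any atr_res_len below the offset wraps to a large 8-bit
number and is clamped to the full maximum instead of to zero.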

> [Severity: High] [...] sel_res [...] uninitialized [...] nci_add_new_protocol()
> [...] nfc_genl_send_target() [...] sent out unconditionally [...]

Confirmed, and this one does reach userspace: with sel_res_len == 0,
nci_extract_rf_params_nfca_passive_poll() leaves sel_res unwritten in the
on-stack ntf, nci_add_new_protocol() copies it unconditionally, and
nfc_genl_send_target() puts it into NFC_ATTR_TARGET_SEL_RES on
NFC_CMD_GET_TARGET -- a one-byte stack disclosure (the discover-NTF path
reaches it without the err==NCI_STATUS_OK gate).

Both pre-existing issues are closed by zero-initializing the on-stack
notification structs:

    struct nci_rf_discover_ntf ntf = {};        /* nci_rf_discover_ntf_packet */
    struct nci_rf_intf_activated_ntf ntf = {};  /* nci_rf_intf_activated_ntf_packet */

I'll send that as a separate patch rather than fold it into this one, since
it's an independent change in different functions.


^ permalink raw reply

* [PATCH v3] nfc: nci: add data_len bound checks to activation parameter extractors
From: Bryam Vargas via B4 Relay @ 2026-06-12 17:50 UTC (permalink / raw)
  To: David Heidelberg; +Cc: linux-kernel, Simon Horman, netdev, oe-linux-nfc

From: Bryam Vargas <hexlabsecurity@proton.me>

nci_extract_activation_params_iso_dep() and
nci_extract_activation_params_nfc_dep() read an inner length byte from
the NCI RF_INTF_ACTIVATED_NTF payload and use it to memcpy() into fixed
kernel buffers, but neither function receives the caller-validated
activation_params_len.  A crafted NCI notification with
activation_params_len=1 and an inner length byte of up to 20 (NFC-A) or
50 (NFC-B) causes memcpy() to read that many bytes past the one valid
byte in the activation params region -- a slab out-of-bounds read of
kernel memory adjacent to the NCI skb.

The sibling nci_extract_rf_params_*() family was given equivalent
protection by commit 571dcbeb8e63 ("net: nfc: nci: Fix parameter
validation for packet data"), but the two activation parameter
extractors were not updated at that time.

Add a data_len parameter to both functions, guard against an empty
region before consuming the inner length byte, decrement the remaining
count after consuming it, and clamp the copy length to what is actually
available.  Update both call sites to pass ntf.activation_params_len,
which is already validated against the skb at ntf.c:801.
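
The guard, decrement and clamp sequence can be sketched as a standalone
helper. This is an illustrative userspace model, not the patched functions:
`demo_extract` and DEMO_MAXSIZE are hypothetical, and -1 stands in for the
NCI status code that the real extractors return.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAXSIZE 20	/* illustrative destination capacity */

/*
 * Copy a length-prefixed field out of data[0..data_len) into dst, which
 * must hold DEMO_MAXSIZE bytes. Returns the number of bytes copied, or -1
 * when the region cannot even hold the length byte.
 */
static int demo_extract(uint8_t *dst, const uint8_t *data, uint8_t data_len)
{
	uint8_t len;

	assert(dst && data);
	if (data_len < 1)
		return -1;		/* nothing to read the length from */

	len = *data++;
	data_len--;			/* the length byte has been consumed */
	if (len > DEMO_MAXSIZE)
		len = DEMO_MAXSIZE;	/* never overflow the destination */
	if (len > data_len)
		len = data_len;		/* never read past the source */

	if (len > 0)
		memcpy(dst, data, len);
	return len;
}
```

A one-byte region that claims 20 bytes of payload then yields a zero-length
copy rather than a read past the end of the region.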

Fixes: e8c0dacd9836 ("NFC: Update names and structs to NCI spec 1.0 d18")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
---
 net/nfc/nci/ntf.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/net/nfc/nci/ntf.c b/net/nfc/nci/ntf.c
index c96512bb8653..8bc3adbc5b6b 100644
--- a/net/nfc/nci/ntf.c
+++ b/net/nfc/nci/ntf.c
@@ -525,15 +525,19 @@ static int nci_rf_discover_ntf_packet(struct nci_dev *ndev,
 
 static int nci_extract_activation_params_iso_dep(struct nci_dev *ndev,
 						 struct nci_rf_intf_activated_ntf *ntf,
-						 const __u8 *data)
+						 const __u8 *data, __u8 data_len)
 {
 	struct activation_params_nfca_poll_iso_dep *nfca_poll;
 	struct activation_params_nfcb_poll_iso_dep *nfcb_poll;
 
 	switch (ntf->activation_rf_tech_and_mode) {
 	case NCI_NFC_A_PASSIVE_POLL_MODE:
+		if (data_len < 1)
+			return NCI_STATUS_RF_PROTOCOL_ERROR;
 		nfca_poll = &ntf->activation_params.nfca_poll_iso_dep;
 		nfca_poll->rats_res_len = min_t(__u8, *data++, NFC_ATS_MAXSIZE);
+		data_len--;
+		nfca_poll->rats_res_len = min_t(__u8, nfca_poll->rats_res_len, data_len);
 		pr_debug("rats_res_len %d\n", nfca_poll->rats_res_len);
 		if (nfca_poll->rats_res_len > 0) {
 			memcpy(nfca_poll->rats_res,
@@ -542,8 +546,12 @@ static int nci_extract_activation_params_iso_dep(struct nci_dev *ndev,
 		break;
 
 	case NCI_NFC_B_PASSIVE_POLL_MODE:
+		if (data_len < 1)
+			return NCI_STATUS_RF_PROTOCOL_ERROR;
 		nfcb_poll = &ntf->activation_params.nfcb_poll_iso_dep;
 		nfcb_poll->attrib_res_len = min_t(__u8, *data++, 50);
+		data_len--;
+		nfcb_poll->attrib_res_len = min_t(__u8, nfcb_poll->attrib_res_len, data_len);
 		pr_debug("attrib_res_len %d\n", nfcb_poll->attrib_res_len);
 		if (nfcb_poll->attrib_res_len > 0) {
 			memcpy(nfcb_poll->attrib_res,
@@ -562,7 +570,7 @@ static int nci_extract_activation_params_iso_dep(struct nci_dev *ndev,
 
 static int nci_extract_activation_params_nfc_dep(struct nci_dev *ndev,
 						 struct nci_rf_intf_activated_ntf *ntf,
-						 const __u8 *data)
+						 const __u8 *data, __u8 data_len)
 {
 	struct activation_params_poll_nfc_dep *poll;
 	struct activation_params_listen_nfc_dep *listen;
@@ -570,9 +578,13 @@ static int nci_extract_activation_params_nfc_dep(struct nci_dev *ndev,
 	switch (ntf->activation_rf_tech_and_mode) {
 	case NCI_NFC_A_PASSIVE_POLL_MODE:
 	case NCI_NFC_F_PASSIVE_POLL_MODE:
+		if (data_len < 1)
+			return NCI_STATUS_RF_PROTOCOL_ERROR;
 		poll = &ntf->activation_params.poll_nfc_dep;
 		poll->atr_res_len = min_t(__u8, *data++,
 					  NFC_ATR_RES_MAXSIZE - 2);
+		data_len--;
+		poll->atr_res_len = min_t(__u8, poll->atr_res_len, data_len);
 		pr_debug("atr_res_len %d\n", poll->atr_res_len);
 		if (poll->atr_res_len > 0)
 			memcpy(poll->atr_res, data, poll->atr_res_len);
@@ -580,9 +592,13 @@ static int nci_extract_activation_params_nfc_dep(struct nci_dev *ndev,
 
 	case NCI_NFC_A_PASSIVE_LISTEN_MODE:
 	case NCI_NFC_F_PASSIVE_LISTEN_MODE:
+		if (data_len < 1)
+			return NCI_STATUS_RF_PROTOCOL_ERROR;
 		listen = &ntf->activation_params.listen_nfc_dep;
 		listen->atr_req_len = min_t(__u8, *data++,
 					    NFC_ATR_REQ_MAXSIZE - 2);
+		data_len--;
+		listen->atr_req_len = min_t(__u8, listen->atr_req_len, data_len);
 		pr_debug("atr_req_len %d\n", listen->atr_req_len);
 		if (listen->atr_req_len > 0)
 			memcpy(listen->atr_req, data, listen->atr_req_len);
@@ -806,12 +822,14 @@ static int nci_rf_intf_activated_ntf_packet(struct nci_dev *ndev,
 		switch (ntf.rf_interface) {
 		case NCI_RF_INTERFACE_ISO_DEP:
 			err = nci_extract_activation_params_iso_dep(ndev,
-								    &ntf, data);
+								    &ntf, data,
+								    ntf.activation_params_len);
 			break;
 
 		case NCI_RF_INTERFACE_NFC_DEP:
 			err = nci_extract_activation_params_nfc_dep(ndev,
-								    &ntf, data);
+								    &ntf, data,
+								    ntf.activation_params_len);
 			break;
 
 		case NCI_RF_INTERFACE_FRAME:

---
base-commit: 8e65320d91cdc3b241d4b94855c88459b91abf66
change-id: 20260612-b4-disp-6d52d8b0-ae1f21cdd8ab

Best regards,
-- 
Bryam Vargas <hexlabsecurity@proton.me>



^ permalink raw reply related

* Re: [net-next 3/9] net: ethernet: ravb: Simplify gPTP start and stop
From: Sergey Shtylyov @ 2026-06-12 17:49 UTC (permalink / raw)
  To: Niklas Söderlund, Paul Barker, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Richard Cochran,
	Geert Uytterhoeven, Magnus Damm, netdev, linux-renesas-soc,
	devicetree, linux-kernel
In-Reply-To: <20260610102432.3538432-4-niklas.soderlund+renesas@ragnatech.se>

On 6/10/26 1:24 PM, Niklas Söderlund wrote:

> For devices that do not support the gPTP clock in config mode the
> somewhat oddly named flag gptp is set, compared to devices that do
> support the gPTP clock in config and operation mode where the flag
> ccc_gac is set instead. The two flags are mutually exclusive.
> 
> For the gptp-flag devices (Gen2) the clock is tied to the AVB-DMAC, when
> it is stopped so is the gPTP clock. For ccc_gac-flag devices (Gen3) the
> gPTP clock is available whenever the ndev is open.
> 
> Prepare to add Gen4 support which will add a third way by cleaning the
> Gen2 and Gen3 cases up a bit.
> 
> Fold the gptp-flag start and stop calls into ravb_dmac_init() and
> ravb_stop_dma(), which start and stops the AVB-DMAC. There are no

   s/stops/stop/.

> functional change as all call sites to the construct,

   s/,/:/?

> 
>     if (info->gptp)
>         ravb_ptp_init(ndev, priv->pdev);
> 
> Are always just after a call to into ravb_dmac_init() and all call sites

   s/Are/are/.

> to the to the construct,

   "to the" repeated... And s/,/:/?

> 
>     if (info->gptp)
>         ravb_ptp_stop(ndev);
> 
> Are always directly followed by a call to ravb_stop_dma().

   s/Are/are/.

> 
> There are two special cases where the calling construct covers both the
> gptp-flag and info->ccc_gac devices, one for start and one for stop. The
> condition that it is preceded by a call to ravb_dmac_init(), or followed
> by a call to ravb_stop_dma() are however true for them too. Reworked the
> two special cases to drop the check of info->gptp.
> 
> The end result is that the gPTP clock will be started or stopped for the
> gptp-flag devices in tandem with the AVB-DMAC, while the info->ccc_gac
> devices will be controlled, as before, when the ndev is opened or
> closed.
> 
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
[...]
MBR, Sergey


^ permalink raw reply

* Re: [PATCH v2 2/5] binder: Make shrinker rely solely on per-VMA lock
From: Suren Baghdasaryan @ 2026-06-12 17:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Vlastimil Babka (SUSE), Alice Ryhl, Dave Hansen, linux-kernel,
	Andrew Morton, Arve Hjønnevåg, Carlos Llamas,
	Christian Brauner, David Ahern, David S. Miller,
	Greg Kroah-Hartman, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
	netdev, Shakeel Butt, Todd Kjos
In-Reply-To: <e9e196ff-7428-43bd-8e06-dc2cf0628c9e@intel.com>

On Fri, Jun 12, 2026 at 9:55 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/12/26 09:41, Suren Baghdasaryan wrote:
> >> I think the key to distinguishing between:
> >>
> >>         vma==NULL because there's no VMA
> >> and
> >>         vma==NULL because of a trylock failure
> >>
> >> is binder_alloc_is_mapped(). It won't return false until vm_ops->close()
> >> finishes. vm_ops->close() shouldn't be able to happen while
> >> lock_vma_under_rcu() is held. So if you've got a non-NULL VMA, you've
> >> also got a stable is binder_alloc_is_mapped().
> > By "stable binder_alloc_is_mapped()" do you mean it would always be
> > true?
>
> By stable, I meant that it can't change.
>
>         vma = lock_vma_under_rcu()
>         mapped = binder_alloc_is_mapped();
>         <window>
>         vma_end_read(vma);
>
> During <window> it can't go from true=>false or false=>true.
>
> false=>true never happens from what I can tell. It's just plain
> impossible given the current code.
>
> true=>false is locked out because when lock_vma_under_rcu() is held.
>
> > Asking because in your patch you removed this condition:
> >
> > -         if (vma && !binder_alloc_is_mapped(alloc))
> > -                  goto err_invalid_vma;
> >
> > So, previously if we found the VMA but binder_alloc_is_mapped()==false
> > we would bail out and now we don't. Are you reasoning that this
> > combination is impossible?
>
> It's not impossible, but I do think it is irrelevant. Or at least that
> the *VMA* is irrelevant in this case. binder_alloc_is_mapped()==false
> means that the binder VMA is gone. It's not in the maple tree, and it's
> not coming back. If a VMA is found, it's an impostor.

Right, but before your change we were bailing out early. With your
change we would be generating the traces and freeing the page. I think
that's a functional change. Was that your intention?

>
> That's why I did:
>
> -        if (vma) {
> +        if (mapped) {
>
> The question isn't whether a VMA was found. The question is whether the
> binder VMA is still mapped at page_addr. *That* is best inferred from
> binder_alloc_is_mapped(), not the VMA lookup.
>
> At least that's what I decided after staring at it for far too long.

^ permalink raw reply

* [RFC net-next v1 2/2] selftests: net: Add kthread preserving test in napi_threaded and busy_poll_test
From: Shuhao Tan @ 2026-06-12 17:36 UTC (permalink / raw)
  To: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Shuah Khan
  Cc: Shuhao Tan, Mina Almasry, Samiullah Khawaja, Kuniyuki Iwashima,
	netdev, linux-kernel, linux-kselftest
In-Reply-To: <20260612173644.380972-1-tanshuhao@google.com>

Add a test case to ensure the kthread stays the same across a NIC link
flap.

Add a test case to ensure the same kthread can poll different NAPIs
across a NIC link flap.

Signed-off-by: Shuhao Tan <tanshuhao@google.com>
---
 .../selftests/drivers/net/napi_threaded.py    | 41 ++++++++++++++++++-
 tools/testing/selftests/net/busy_poll_test.sh | 24 +++++++++++
 2 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/drivers/net/napi_threaded.py b/tools/testing/selftests/drivers/net/napi_threaded.py
index f4be72b2145a..20110fb6942e 100755
--- a/tools/testing/selftests/drivers/net/napi_threaded.py
+++ b/tools/testing/selftests/drivers/net/napi_threaded.py
@@ -127,6 +127,44 @@ def change_num_queues(cfg, nl) -> None:
     _assert_napi_threaded_enabled(nl, napi0_id)
     _assert_napi_threaded_enabled(nl, napi1_id)
 
+def nic_link_flap(cfg, nl) -> None:
+    """
+    Test that if threaded is enabled, and NIC goes through
+    a reset, the kthread stays unchanged across the link flap.
+    """
+    napis = nl.napi_get({'ifindex': cfg.ifindex}, dump=True)
+    ksft_ge(len(napis), 2)
+
+    napi0_id = napis[0]['id']
+    napi1_id = napis[1]['id']
+
+    _setup_deferred_cleanup(cfg)
+
+    # set threaded
+    _set_threaded_state(cfg, 1)
+    napis = nl.napi_get({'ifindex': cfg.ifindex}, dump=True)
+
+    # check napi threaded is set for both napis
+    _assert_napi_threaded_enabled(nl, napi0_id)
+    _assert_napi_threaded_enabled(nl, napi1_id)
+
+    pid0 = napis[0].get('pid')
+    pid1 = napis[1].get('pid')
+
+    cmd(f"ip link set {cfg.ifname} down")
+    cmd(f"ip link set {cfg.ifname} up")
+
+    # re-acquire napi info
+    napis = nl.napi_get({'ifindex': cfg.ifindex}, dump=True)
+    ksft_ge(len(napis), 2)
+
+    # check napi threaded is set for both napis
+    _assert_napi_threaded_enabled(nl, napi0_id)
+    _assert_napi_threaded_enabled(nl, napi1_id)
+
+    # check the kthread remains the same
+    ksft_eq(napis[0].get('pid'), pid0)
+    ksft_eq(napis[1].get('pid'), pid1)
 
 def main() -> None:
     """ Ksft boiler plate main """
@@ -134,7 +172,8 @@ def main() -> None:
     with NetDrvEnv(__file__, queue_count=2) as cfg:
         ksft_run([napi_init,
                   change_num_queues,
-                  enable_dev_threaded_disable_napi_threaded],
+                  enable_dev_threaded_disable_napi_threaded,
+                  nic_link_flap],
                  args=(cfg, NetdevFamily()))
     ksft_exit()
 
diff --git a/tools/testing/selftests/net/busy_poll_test.sh b/tools/testing/selftests/net/busy_poll_test.sh
index 5ec1c85c1623..897ce6700601 100755
--- a/tools/testing/selftests/net/busy_poll_test.sh
+++ b/tools/testing/selftests/net/busy_poll_test.sh
@@ -124,6 +124,23 @@ test_busypoll_with_napi_threaded()
 	return $?
 }
 
+test_busypoll_with_napi_threaded_link_flap()
+{
+	# Only enable napi threaded poll. Set suspend timeout and prefer busy
+	# poll to 0. Run again after a link flap.
+	test_busypoll 0 ${NAPI_THREADED_MODE_BUSY_POLL} 0 || return $?
+
+	ip netns exec nssv ip link set dev $NSIM_SV_NAME down
+	ip netns exec nscl ip link set dev $NSIM_CL_NAME down
+
+	ip netns exec nssv ip link set dev $NSIM_SV_NAME up
+	ip netns exec nscl ip link set dev $NSIM_CL_NAME up
+
+	test_busypoll 0 ${NAPI_THREADED_MODE_BUSY_POLL} 0
+
+	return $?
+}
+
 ###
 ### Code start
 ###
@@ -176,6 +193,13 @@ if [ $? -ne 0 ]; then
 	exit 1
 fi
 
+test_busypoll_with_napi_threaded_link_flap
+if [ $? -ne 0 ]; then
+	echo "test_busypoll_with_napi_threaded_link_flap failed"
+	cleanup_ns
+	exit 1
+fi
+
 echo "$NSIM_SV_FD:$NSIM_SV_IFIDX" > $NSIM_DEV_SYS_UNLINK
 
 echo $NSIM_CL_ID > $NSIM_DEV_SYS_DEL
-- 
2.54.0.1136.gdb2ca164c4-goog


^ permalink raw reply related

* [RFC net-next v1 1/2] net: Save kthread of threaded NAPI in napi_config in napi_del and restore in napi_add
From: Shuhao Tan @ 2026-06-12 17:36 UTC (permalink / raw)
  To: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Shuah Khan
  Cc: Shuhao Tan, Mina Almasry, Samiullah Khawaja, Kuniyuki Iwashima,
	netdev, linux-kernel, linux-kselftest
In-Reply-To: <20260612173644.380972-1-tanshuhao@google.com>

Replace napi->thread with a new thread_node struct that has a back
pointer to napi_struct.

Make the NAPI thread use the thread_node as its data pointer so that
it can poll different NAPIs throughout its lifetime.

Park the thread and save the thread_node in napi_config on napi_del.
Restore the node and unpark the thread on napi_add_config.
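
The ownership handoff can be modelled without any threads. This is a loose
sketch under assumptions: all demo_* names are hypothetical, `worker_id`
stands in for the kthread identity, clearing the back pointer on delete is
my assumption, and parking, unparking and RCU are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

struct demo_napi;

/* Stand-in for the thread node: a worker plus a back pointer. */
struct demo_thread_node {
	int worker_id;			/* models the kthread identity */
	struct demo_napi *napi;		/* NAPI served now, NULL when parked */
};

/* Persistent per-queue config that outlives any one NAPI instance. */
struct demo_config {
	struct demo_thread_node *parked;
};

struct demo_napi {
	struct demo_config *config;
	struct demo_thread_node *node;
};

/* Teardown: detach the node from the NAPI and keep it in the config. */
static void demo_napi_del(struct demo_napi *n)
{
	assert(n && n->config);
	if (!n->node)
		return;
	n->node->napi = NULL;
	n->config->parked = n->node;
	n->node = NULL;
}

/* Setup: reuse a parked node if the config holds one. Returns 1 on reuse. */
static int demo_napi_add(struct demo_napi *n, struct demo_config *cfg)
{
	assert(n && cfg);
	n->config = cfg;
	n->node = NULL;
	if (!cfg->parked)
		return 0;		/* caller would create a fresh worker */
	n->node = cfg->parked;
	cfg->parked = NULL;
	n->node->napi = n;		/* re-point the worker at the new NAPI */
	return 1;
}
```

The same worker_id surviving a delete followed by an add on a different
demo_napi object is the property the series aims for across a link flap.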

Signed-off-by: Shuhao Tan <tanshuhao@google.com>
---
 include/linux/netdevice.h |  13 +++-
 net/core/dev.c            | 151 +++++++++++++++++++++++++++++---------
 net/core/netdev-genl.c    |  12 ++-
 3 files changed, 139 insertions(+), 37 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7f4f0837c09f..1cda88607e99 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -63,6 +63,7 @@ struct dsa_port;
 struct ip_tunnel_parm_kern;
 struct macsec_context;
 struct macsec_ops;
+struct napi_struct;
 struct netdev_config;
 struct netdev_name_node;
 struct sd_flow_limit;
@@ -363,10 +364,20 @@ struct gro_node {
 	u32			cached_napi_id;
 };
 
+/*
+ * Structure for persisting threaded NAPI kthread
+ */
+struct napi_thread_node {
+	struct task_struct *thread;
+	struct napi_struct *napi;
+	struct rcu_head rcu;
+};
+
 /*
  * Structure for per-NAPI config
  */
 struct napi_config {
+	struct napi_thread_node *thread_node;
 	u64 gro_flush_timeout;
 	u64 irq_suspend_timeout;
 	u32 defer_hard_irqs;
@@ -403,7 +414,7 @@ struct napi_struct {
 	struct gro_node		gro;
 	struct hrtimer		timer;
 	/* all fields past this point are write-protected by netdev_lock */
-	struct task_struct	*thread;
+	struct napi_thread_node __rcu *thread_node;
 	unsigned long		gro_flush_timeout;
 	unsigned long		irq_suspend_timeout;
 	u32			defer_hard_irqs;
diff --git a/net/core/dev.c b/net/core/dev.c
index 202e35acb15b..f5e3b9e526af 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1645,25 +1645,62 @@ EXPORT_SYMBOL(netdev_notify_peers);
 
 static int napi_threaded_poll(void *data);
 
-static int napi_kthread_create(struct napi_struct *n)
+static int napi_thread_node_create(struct napi_struct *n)
 {
+	struct napi_thread_node *thread_node = NULL;
+	struct task_struct *thread = NULL;
 	int err = 0;
 
+	thread_node = kvzalloc_obj(*thread_node);
+	if (!thread_node)
+		return -ENOMEM;
+
 	/* Create and wake up the kthread once to put it in
 	 * TASK_INTERRUPTIBLE mode to avoid the blocked task
 	 * warning and work with loadavg.
 	 */
-	n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
-				n->dev->name, n->napi_id);
-	if (IS_ERR(n->thread)) {
-		err = PTR_ERR(n->thread);
+	thread_node->napi = n;
+	thread = kthread_run(napi_threaded_poll, thread_node, "napi/%s-%d",
+			     n->dev->name, n->napi_id);
+	if (IS_ERR(thread)) {
+		err = PTR_ERR(thread);
 		pr_err("kthread_run failed with err %d\n", err);
-		n->thread = NULL;
+		goto free_thread_node;
 	}
 
+	thread_node->thread = thread;
+	rcu_assign_pointer(n->thread_node, thread_node);
+
+	return 0;
+
+free_thread_node:
+	kvfree(thread_node);
+
 	return err;
 }
 
+static void napi_thread_node_stop(struct napi_thread_node *thread_node)
+{
+	kthread_stop(thread_node->thread);
+	kvfree_rcu(thread_node, rcu);
+}
+
+static int napi_kthread_create(struct napi_struct *n)
+{
+	struct napi_thread_node *thread_node;
+
+	if (n->config && n->config->thread_node) {
+		thread_node = n->config->thread_node;
+		rcu_assign_pointer(n->thread_node, thread_node);
+		n->config->thread_node = NULL;
+		WRITE_ONCE(thread_node->napi, n);
+		kthread_unpark(thread_node->thread);
+		return 0;
+	}
+
+	return napi_thread_node_create(n);
+}
+
 static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -4949,7 +4986,7 @@ EXPORT_SYMBOL(__dev_direct_xmit);
 /*************************************************************************
  *			Receiver routines
  *************************************************************************/
-static DEFINE_PER_CPU(struct task_struct *, backlog_napi);
+static DEFINE_PER_CPU(struct napi_thread_node, backlog_napi);
 
 int weight_p __read_mostly = 64;           /* old backlog weight */
 int dev_weight_rx_bias __read_mostly = 1;  /* bias for backlog weight */
@@ -4959,10 +4996,11 @@ int dev_weight_tx_bias __read_mostly = 1;  /* bias for output_queue quota */
 static inline void ____napi_schedule(struct softnet_data *sd,
 				     struct napi_struct *napi)
 {
-	struct task_struct *thread;
+	struct napi_thread_node *thread_node;
 
 	lockdep_assert_irqs_disabled();
 
+	rcu_read_lock();
 	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
 		/* Paired with smp_mb__before_atomic() in
 		 * napi_enable()/netif_set_threaded().
@@ -4970,18 +5008,21 @@ static inline void ____napi_schedule(struct softnet_data *sd,
 		 * read on napi->thread. Only call
 		 * wake_up_process() when it's not NULL.
 		 */
-		thread = READ_ONCE(napi->thread);
-		if (thread) {
-			if (use_backlog_threads() && thread == raw_cpu_read(backlog_napi))
+		thread_node = rcu_dereference(napi->thread_node);
+		if (thread_node) {
+			if (use_backlog_threads() &&
+			    thread_node == this_cpu_ptr(&backlog_napi))
 				goto use_local_napi;
 
 			set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
-			wake_up_process(thread);
+			wake_up_process(thread_node->thread);
+			rcu_read_unlock();
 			return;
 		}
 	}
 
 use_local_napi:
+	rcu_read_unlock();
 	DEBUG_NET_WARN_ON_ONCE(!list_empty(&napi->poll_list));
 	list_add_tail(&napi->poll_list, &sd->poll_list);
 	WRITE_ONCE(napi->list_owner, smp_processor_id());
@@ -7148,6 +7189,7 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
 
 static void napi_stop_kthread(struct napi_struct *napi)
 {
+	struct napi_thread_node *thread_node;
 	unsigned long val, new;
 
 	/* Wait until the napi STATE_THREADED is unset. */
@@ -7180,8 +7222,9 @@ static void napi_stop_kthread(struct napi_struct *napi)
 		msleep(20);
 	}
 
-	kthread_stop(napi->thread);
-	napi->thread = NULL;
+	thread_node = netdev_lock_dereference(napi->thread_node, napi->dev);
+	rcu_assign_pointer(napi->thread_node, NULL);
+	napi_thread_node_stop(thread_node);
 }
 
 static void napi_set_threaded_state(struct napi_struct *napi,
@@ -7197,9 +7240,13 @@ static void napi_set_threaded_state(struct napi_struct *napi,
 int napi_set_threaded(struct napi_struct *napi,
 		      enum netdev_napi_threaded threaded)
 {
+	struct napi_thread_node *thread_node;
+
+	thread_node = netdev_lock_dereference(napi->thread_node, napi->dev);
+
 	if (threaded) {
-		if (!napi->thread) {
-			int err = napi_kthread_create(napi);
+		if (!thread_node) {
+			int err = napi_thread_node_create(napi);
 
 			if (err)
 				return err;
@@ -7215,7 +7262,7 @@ int napi_set_threaded(struct napi_struct *napi,
 	 * softirq mode will happen in the next round of napi_schedule().
 	 * This should not cause hiccups/stalls to the live traffic.
 	 */
-	if (!threaded && napi->thread) {
+	if (!threaded && thread_node) {
 		napi_stop_kthread(napi);
 	} else {
 		/* Make sure kthread is created before THREADED bit is set. */
@@ -7236,8 +7283,9 @@ int netif_set_threaded(struct net_device *dev,
 
 	if (threaded) {
 		list_for_each_entry(napi, &dev->napi_list, dev_list) {
-			if (!napi->thread) {
-				err = napi_kthread_create(napi);
+			/* protected by assertion above */
+			if (!rcu_dereference_protected(napi->thread_node, 1)) {
+				err = napi_thread_node_create(napi);
 				if (err) {
 					threaded = NETDEV_NAPI_THREADED_DISABLED;
 					break;
@@ -7253,8 +7301,14 @@ int netif_set_threaded(struct net_device *dev,
 		WARN_ON_ONCE(napi_set_threaded(napi, threaded));
 
 	/* Override the config for all NAPIs even if currently not listed */
-	for (i = 0; i < dev->num_napi_configs; i++)
+	for (i = 0; i < dev->num_napi_configs; i++) {
 		dev->napi_config[i].threaded = threaded;
+		/* Stop parked threads in inactive napi_configs */
+		if (!threaded && dev->napi_config[i].thread_node) {
+			napi_thread_node_stop(dev->napi_config[i].thread_node);
+			dev->napi_config[i].thread_node = NULL;
+		}
+	}
 
 	return err;
 }
@@ -7657,7 +7711,7 @@ void napi_enable_locked(struct napi_struct *n)
 		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));
 
 		new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
-		if (n->dev->threaded && n->thread)
+		if (n->dev->threaded && n->thread_node)
 			new |= NAPIF_STATE_THREADED;
 	} while (!try_cmpxchg(&n->state, &val, new));
 }
@@ -7682,6 +7736,8 @@ EXPORT_SYMBOL(napi_enable);
 /* Must be called in process context */
 void __netif_napi_del_locked(struct napi_struct *napi)
 {
+	struct napi_thread_node *thread_node;
+
 	netdev_assert_locked(napi->dev);
 
 	if (!test_and_clear_bit(NAPI_STATE_LISTED, &napi->state))
@@ -7693,6 +7749,18 @@ void __netif_napi_del_locked(struct napi_struct *napi)
 	if (test_and_clear_bit(NAPI_STATE_HAS_NOTIFIER, &napi->state))
 		irq_set_affinity_notifier(napi->irq, NULL);
 
+	thread_node = netdev_lock_dereference(napi->thread_node, napi->dev);
+	if (thread_node) {
+		rcu_assign_pointer(napi->thread_node, NULL);
+		if (napi->config) {
+			kthread_park(thread_node->thread);
+			napi->config->thread_node = thread_node;
+			napi->config->thread_node->napi = NULL;
+		} else {
+			napi_thread_node_stop(thread_node);
+		}
+	}
+
 	if (napi->config) {
 		napi->index = -1;
 		napi->config = NULL;
@@ -7702,11 +7770,6 @@ void __netif_napi_del_locked(struct napi_struct *napi)
 	napi_free_frags(napi);
 
 	gro_cleanup(&napi->gro);
-
-	if (napi->thread) {
-		kthread_stop(napi->thread);
-		napi->thread = NULL;
-	}
 }
 EXPORT_SYMBOL(__netif_napi_del_locked);
 
@@ -7802,11 +7865,21 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 	return work;
 }
 
-static int napi_thread_wait(struct napi_struct *napi)
+static struct napi_struct *
+napi_thread_wait(struct napi_thread_node *thread_node)
 {
+	struct napi_struct *napi = READ_ONCE(thread_node->napi);
+
 	set_current_state(TASK_INTERRUPTIBLE);
 
 	while (!kthread_should_stop()) {
+		if (kthread_should_park()) {
+			kthread_parkme();
+			napi = READ_ONCE(thread_node->napi);
+			/* Might be awakened for stopping */
+			continue;
+		}
+
 		/* Testing SCHED_THREADED bit here to make sure the current
 		 * kthread owns this napi and could poll on this napi.
 		 * Testing SCHED bit is not enough because SCHED bit might be
@@ -7815,7 +7888,7 @@ static int napi_thread_wait(struct napi_struct *napi)
 		if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state)) {
 			WARN_ON(!list_empty(&napi->poll_list));
 			__set_current_state(TASK_RUNNING);
-			return 0;
+			return napi;
 		}
 
 		schedule();
@@ -7823,7 +7896,7 @@ static int napi_thread_wait(struct napi_struct *napi)
 	}
 	__set_current_state(TASK_RUNNING);
 
-	return -1;
+	return NULL;
 }
 
 static void napi_threaded_poll_loop(struct napi_struct *napi,
@@ -7880,13 +7953,19 @@ static void napi_threaded_poll_loop(struct napi_struct *napi,
 
 static int napi_threaded_poll(void *data)
 {
-	struct napi_struct *napi = data;
+	struct napi_thread_node *thread_node = data;
 	unsigned long last_qs = jiffies;
+	struct napi_struct *napi;
 	bool want_busy_poll;
 	bool in_busy_poll;
 	unsigned long val;
 
-	while (!napi_thread_wait(napi)) {
+	while (1) {
+		napi = napi_thread_wait(thread_node);
+
+		if (!napi)
+			break;
+
 		val = READ_ONCE(napi->state);
 
 		want_busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
@@ -12155,6 +12234,8 @@ EXPORT_SYMBOL(alloc_netdev_mqs);
 
 static void netdev_napi_exit(struct net_device *dev)
 {
+	unsigned int i;
+
 	if (!list_empty(&dev->napi_list)) {
 		struct napi_struct *p, *n;
 
@@ -12166,6 +12247,10 @@ static void netdev_napi_exit(struct net_device *dev)
 		synchronize_net();
 	}
 
+	for (i = 0; i < dev->num_napi_configs; i++) {
+		if (dev->napi_config[i].thread_node)
+			napi_thread_node_stop(dev->napi_config[i].thread_node);
+	}
 	kvfree(dev->napi_config);
 }
 
@@ -13204,12 +13289,12 @@ static void backlog_napi_setup(unsigned int cpu)
 	struct softnet_data *sd = per_cpu_ptr(&softnet_data, cpu);
 	struct napi_struct *napi = &sd->backlog;
 
-	napi->thread = this_cpu_read(backlog_napi);
+	rcu_assign_pointer(napi->thread_node, this_cpu_ptr(&backlog_napi));
 	set_bit(NAPI_STATE_THREADED, &napi->state);
 }
 
 static struct smp_hotplug_thread backlog_threads = {
-	.store			= &backlog_napi,
+	.store			= &backlog_napi.thread,
 	.thread_should_run	= backlog_napi_should_run,
 	.thread_fn		= run_backlog_napi,
 	.thread_comm		= "backlog_napi/%u",
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 11b0b91683d7..f2ecdb26d6f1 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -162,6 +162,7 @@ static int
 netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			const struct genl_info *info)
 {
+	struct napi_thread_node *thread_node;
 	unsigned long irq_suspend_timeout;
 	unsigned long gro_flush_timeout;
 	u32 napi_defer_hard_irqs;
@@ -188,11 +189,16 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			 napi_get_threaded(napi)))
 		goto nla_put_failure;
 
-	if (napi->thread) {
-		pid = task_pid_nr(napi->thread);
-		if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
+	rcu_read_lock();
+	thread_node = rcu_dereference(napi->thread_node);
+	if (thread_node) {
+		pid = task_pid_nr(thread_node->thread);
+		if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid)) {
+			rcu_read_unlock();
 			goto nla_put_failure;
+		}
 	}
+	rcu_read_unlock();
 
 	napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi);
 	if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS,
-- 
2.54.0.1136.gdb2ca164c4-goog



* [RFC net-next v1 0/2] Reuse threaded NAPI kthread across napi_del()/napi_add().
From: Shuhao Tan @ 2026-06-12 17:36 UTC (permalink / raw)
  To: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Shuah Khan
  Cc: Shuhao Tan, Mina Almasry, Samiullah Khawaja, Kuniyuki Iwashima,
	netdev, linux-kernel, linux-kselftest

Currently the lifetime of the kthread of a threaded NAPI is tied to
the napi_struct. netif_napi_del() stops the kthread when it destroys
the NAPI struct.

This patch series reuses the same kthread (thus preserving all of its
attributes) across napi_del/napi_add. The series ties the lifetime
of the kthread to the net_device instead of the napi_struct.

A normal use case for threaded NAPI will be "enable threaded" ->
"configure the thread". Driver reset (that can be caused by a NIC
configuration change) often destroys the configuration and causes a
usability issue. This series aims to solve the issue.

There is a downside to this approach: if a device reduces its number of
queues while its NAPIs are threaded, the kthreads associated with the
removed queues will not be stopped. Since the mapping between the
index passed to napi_add_config() and NAPI is an implementation
detail of individual drivers, it is not straightforward to perform
garbage collection and stop kthreads that are no longer associated
with a queue. This patch series tries to minimize the effect by
parking the kthread in napi_del. The kthread still shows up in
/proc, but should not consume CPU cycles.

There was a discussion
https://lore.kernel.org/CAAywjhR0TPKZ-xzqjSP709OVmZWUisDNv2CVc_VxgOrXRtop+g@mail.gmail.com/
around what to do with the kthread between napi_disable/napi_enable.
It seems that there was an intention to keep user configuration for
the kthread across NIC configuration change. This patch series extends
the effort to cover more NIC configuration changes. Roughly tracing
through the call hierarchy of napi_del reveals that at least the
following drivers will not preserve the user configurations:
- idpf: idpf_set_channels(), idpf_set_ringparam()
- mlx4: mlx4_en_set_ringparam(), mlx4_en_set_rxfh(),
  mlx4_en_set_channels()
- bnx2: bnx2_set_ringparam(), bnx2_set_channels(), bnx2_change_mtu()
- gve: gve_set_channels(), gve_set_ringparam()
- ena: ena_set_ringparam(), ena_set_channels()
- (non-exhaustive)

These drivers destroy and recreate queues during configuration
changes. If a NAPI was threaded before destruction, during the
creation, a new kthread will be spawned for the NAPI.

Some drivers do not have this problem, e.g. fbnic, netdevsim. But
these drivers and the drivers mentioned above will still lose the
kthread during link flap (ndo_stop/ndo_open).

Because the kthreads before and after these configuration changes are
different, all the attributes associated with the kthread are lost.
These include the CPU mask, priority, scheduler policy, etc. If the
threaded state is preserved for a NAPI, it makes sense to want to
preserve the attributes of the thread as well.
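
To make the save/restore idea above concrete, here is a toy userspace
model of it. It is only a sketch: every demo_* name is hypothetical,
the "parked" flag stands in for kthread_park(), and no locking or RCU
is modelled. The real logic lives in the patch itself.

```c
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real types. */
struct demo_napi;

struct demo_thread_node {
	int pid;                 /* stands in for the kthread identity */
	int parked;              /* 1 while no NAPI is attached */
	struct demo_napi *napi;  /* NAPI currently served, or NULL */
};

struct demo_config {
	struct demo_thread_node *saved; /* parked node kept across del/add */
};

struct demo_napi {
	struct demo_config *config;
	struct demo_thread_node *thread_node;
};

/* Delete path: park the node and stash it in the persistent config.
 * A node without a config is simply detached here; the series
 * describes stopping the kthread in that case.
 */
static void demo_napi_del(struct demo_napi *napi)
{
	struct demo_thread_node *node = napi->thread_node;

	napi->thread_node = NULL;
	if (node && napi->config) {
		node->parked = 1;
		node->napi = NULL;
		napi->config->saved = node;
	}
	napi->config = NULL;
}

/* Add path: reuse the stashed node, if any, instead of creating one.
 * Returns 1 when a node was restored, 0 otherwise.
 */
static int demo_napi_add(struct demo_napi *napi, struct demo_config *config)
{
	struct demo_thread_node *node = config->saved;

	napi->config = config;
	napi->thread_node = NULL;
	if (!node)
		return 0;

	config->saved = NULL;
	node->napi = napi;
	node->parked = 0;
	napi->thread_node = node;
	return 1;
}
```

Because the node object survives the del/add cycle, anything attached
to it (in the real series: the kthread and therefore its CPU mask,
priority and scheduler policy) survives as well.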

Link: https://lore.kernel.org/CAAywjhR0TPKZ-xzqjSP709OVmZWUisDNv2CVc_VxgOrXRtop+g@mail.gmail.com/

Shuhao Tan (2):
  net: Save kthread of threaded NAPI in napi_config in napi_del and
    restore in napi_add
  selftests: net: Add kthread preserving test in napi_threaded and
    busy_poll_test

 include/linux/netdevice.h                     |  13 +-
 net/core/dev.c                                | 151 ++++++++++++++----
 net/core/netdev-genl.c                        |  12 +-
 .../selftests/drivers/net/napi_threaded.py    |  41 ++++-
 tools/testing/selftests/net/busy_poll_test.sh |  24 +++
 5 files changed, 203 insertions(+), 38 deletions(-)

-- 
2.54.0.1136.gdb2ca164c4-goog



* Re: [PATCH net] net: bcmgenet: Use weighted round-robin TX DMA arbitration
From: Justin Chen @ 2026-06-12 17:34 UTC (permalink / raw)
  To: Nicolai Buchwitz
  Cc: Ovidiu Panait, opendmb, florian.fainelli,
	bcm-kernel-feedback-list, andrew+netdev, davem, edumazet, kuba,
	pabeni, netdev, linux-kernel
In-Reply-To: <69d920a3b52cc049dc1b96a9c7d5e3b6@tipi-net.de>



On 6/12/26 12:37 AM, Nicolai Buchwitz wrote:
> On 11.6.2026 19:27, Justin Chen wrote:
> 
>> [...]
> 
>>> AFAIK the existing queue mechanism dates back to the STB QoS use 
>>> cases Florian
>>> had in mind when he wrote the driver, so let's hear what he has to 
>>> say on this.
>>>
>>> The change itself looks correct to me. My concern is the targeting. This
>>> flips the default policy for every GENET user rather than fixing a 
>>> specific
>>> defect, and Justin's series already called the persistent timeouts a 
>>> design
>>> issue rather than a bug.
>>>
>>> So isn't this more net-next material than net with a broad Fixes: tag?
>>> Please also add to the commit message that it drops the existing 
>>> priority
>>> handling for rings 1-4.
>>>
>>
>> I'm ok with these changes. Internally we no longer require priority 
>> queues. It is a legacy use case we no longer have. My idea was to 
>> remove the queues entirely and have one big queue. But I figured I 
>> would wait for Nicolai's XDP changes so we do not need to remove and 
>> re-add queues for XDP. This could be a stop-gap solution until that is 
>> done.
> 
> Thanks Justin. IMHO XDP could benefit from a queue-refactoring prep. It 
> gets simpler on top of a single queue, and dropping q1-4 frees up a 
> bunch of BDs for the XDP ring instead of the 32 it has atm.
> 
> Happy to take it on, or leave it to you :)
> 

I am not going to say no to less work! Please do.

Thanks,
Justin

>> [...]
> 
> Thanks
> Nicolai



* Re: [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support
From: Jonas Köppeler @ 2026-06-12 17:21 UTC (permalink / raw)
  To: Simon Schippers, hawk, netdev
  Cc: kernel-team, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Chris Arges, Mike Freemon,
	Toke Høiland-Jørgensen, Breno Leitao,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Stanislav Fomichev, bpf
In-Reply-To: <5c326cb0-9b7c-4d67-8f80-0c9be1e8fcba@tu-dortmund.de>

On 6/12/26 16:10, Simon Schippers wrote:
> On 6/12/26 10:35, hawk@kernel.org wrote:
>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>
>> This series adds BQL (Byte Queue Limits) to the veth driver, reducing
>> latency by dynamically limiting in-flight packets in the ptr_ring and
>> moving buffering into the qdisc where AQM algorithms can act on it.
> 
> LGTM, thanks for the detailed changelog :)
> 
> Maybe we should stop searching for the perfect tx-usecs value.
> 100us is probably fine for most hardware to not have a performance
> regression. And lowering it does not really improve the RTT anyways.
> Do you agree?
I agree, I already thought that it just might be a very lucky case when 
using 50us where something accidentally aligns nicely. Interestingly, I 
could also reproduce that 50us was consistently a little better compared 
to 100us on an Intel CPU. Maybe if I get the time, I'll have another 
look at it, but in general I think 50us or 100us does not really matter.

> 
> Nevertheless, I will compile and run the benchmarks again.
> 
> I will go on vacation from 15th to 24th of June, so I will not be able
> to contribute code or run benchmarks then.
> 
> Thanks,
> Simon
> 



* Re: [PATCH] net: qrtr: fix 32-bit integer overflow in qrtr_endpoint_post()
From: Simon Horman @ 2026-06-12 17:12 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: Manivannan Sadhasivam, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-arm-msm, stable,
	linux-kernel
In-Reply-To: <20260611125455.2352279-1-michael.bommarito@gmail.com>

On Thu, Jun 11, 2026 at 08:54:55AM -0400, Michael Bommarito wrote:
> qrtr_endpoint_post() validates an incoming packet with
> 
> 	if (!size || len != ALIGN(size, 4) + hdrlen)
> 		goto err;
> 
> where size comes from the wire. On 32-bit, size_t is 32 bits and
> ALIGN(size, 4) wraps to 0 for size >= 0xfffffffd, so the check
> passes and skb_put_data(skb, data + hdrlen, size) writes past the
> hdrlen-sized skb and oopses the kernel. 64-bit is unaffected.
> 
> This is the 32-bit residual of ad9d24c9429e2 ("net: qrtr: fix OOB
> Read in qrtr_endpoint_post"), which fixed only the 64-bit case.
> 
> Reject any size that cannot fit the buffer before the ALIGN.
> 
> Fixes: ad9d24c9429e2 ("net: qrtr: fix OOB Read in qrtr_endpoint_post")
> Cc: stable@vger.kernel.org
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
> 32-bit only; reachable via /dev/qrtr-tun (CONFIG_QRTR_TUN) or a QMI modem.
> Reproduced on i386 (a 32-byte write with size 0xfffffffd faults; well-formed
> writes are unaffected).  QRTR mostly runs on 64-bit now, so this is a
> correctness fix completing ad9d24c9429e2, not a high-severity bug.

Reviewed-by: Simon Horman <horms@kernel.org>
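
For readers following along, the wraparound described in the commit
message can be modelled in userspace with fixed 32-bit arithmetic. This
is a sketch only: align4_u32() stands in for ALIGN(size, 4) with a
32-bit size_t, and strict_check_passes() is a hypothetical form of
"reject any size that cannot fit the buffer before the ALIGN", not the
code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* 32-bit stand-in for ALIGN(x, 4) when size_t is 32 bits wide. */
static uint32_t align4_u32(uint32_t x)
{
	return (x + 3u) & ~3u;
}

/* The check quoted in the commit message, modelled in 32 bits. */
static bool old_check_passes(uint32_t size, uint32_t len, uint32_t hdrlen)
{
	return size != 0 && len == align4_u32(size) + hdrlen;
}

/* Hypothetical stricter check: bound size by the payload room first,
 * so the aligned value can no longer wrap into a match.
 */
static bool strict_check_passes(uint32_t size, uint32_t len, uint32_t hdrlen)
{
	if (size == 0 || len < hdrlen || size > len - hdrlen)
		return false;
	return len == align4_u32(size) + hdrlen;
}
```

With size = 0xfffffffd and len = hdrlen = 32, align4_u32() wraps to 0,
so the old check accepts the packet while the stricter one rejects it.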



* Re: [PATCH v7 4/5] vfio/pci: implement get_tph and DMA_BUF_TPH feature
From: Alex Williamson @ 2026-06-12 17:10 UTC (permalink / raw)
  To: Zhiping Zhang; +Cc: netdev, kvm, linux-rdma, linux-pci, dri-devel, alex
In-Reply-To: <20260611161546.4075580-5-zhipingz@meta.com>

On Thu, 11 Jun 2026 09:11:19 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:

> Implement dma-buf get_tph for vfio-pci exported dma-bufs and add
> VFIO_DEVICE_FEATURE_DMA_BUF_TPH so userspace can publish TPH metadata
> for a VFIO-owned device.
> 
> 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces; the
> uAPI carries both with explicit validity flags, and get_tph() returns
> the value matching the importer's requested namespace or -EOPNOTSUPP.
> 
> Publish and read the TPH descriptor under dmabuf->resv, matching the
> locking used for other importer-visible dma-buf state. The SET ioctl
> takes dma_resv_lock_interruptible(), while the callback runs under
> DMA-buf's asserted resv lock.
> 
> Reject requests the device cannot consume as a completer:
> pcie_tph_completer_type() must report at least
> PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY, and Extended ST requires
> PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH. Validate fields before the completer
> check so userspace gets the narrowest errno.
> 
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c   |  3 +
>  drivers/vfio/pci/vfio_pci_dmabuf.c | 94 +++++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_priv.h   | 12 ++++
>  include/uapi/linux/vfio.h          | 37 ++++++++++++
>  4 files changed, 145 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 050e7542952e..4fa36f2f7555 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1569,6 +1569,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_DMA_BUF:
>  		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
> +		return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
> +							 argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index 1a177ce7de54..0a0705c8dbea 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -3,6 +3,7 @@
>   */
>  #include <linux/dma-buf-mapping.h>
>  #include <linux/pci-p2pdma.h>
> +#include <linux/pci-tph.h>
>  #include <linux/dma-resv.h>
>  
>  #include "vfio_pci_priv.h"
> @@ -19,7 +20,12 @@ struct vfio_pci_dma_buf {
>  	u32 nr_ranges;
>  	struct kref kref;
>  	struct completion comp;
> -	u8 revoked : 1;
> +	u8 tph_st_valid:1;
> +	u8 tph_st_ext_valid:1;
> +	u8 tph_ph:2;
> +	u8 tph_st;
> +	u16 tph_st_ext;
> +	u8 revoked:1;

If these bitfields are now all protected under dma_resv_lock they
should be grouped together with a comment to that effect, no need for
revoked to get kicked out to its own storage unit.  In [1] I'm
proposing runtime modified flags each get their own storage unit, but
for more isolated cases, so long as we keep track and enforce serialized
updates, I'm ok with runtime bitfields.  Thanks,

Alex

[1]https://lore.kernel.org/all/20260611213539.4100590-1-alex.williamson@nvidia.com/
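
A rough userspace illustration of the grouping suggested above. The
struct and helper names are hypothetical, the field set is copied from
the quoted hunk, and the "held under dma_resv_lock" rule is only stated
in a comment, not enforced.

```c
#include <stdint.h>

#define DEMO_TPH_ST     (1u << 0)
#define DEMO_TPH_ST_EXT (1u << 1)

/* Hypothetical mirror of the fields under discussion. */
struct demo_dma_buf_flags {
	/* Bitfields below are assumed to be written only with the
	 * reservation lock held, so they can share one storage unit.
	 */
	uint8_t revoked:1;
	uint8_t tph_st_valid:1;
	uint8_t tph_st_ext_valid:1;
	uint8_t tph_ph:2;
	uint8_t tph_st;
	uint16_t tph_st_ext;
};

/* Publish new TPH values; caller is assumed to hold the lock. */
static void demo_publish_tph(struct demo_dma_buf_flags *p, unsigned int flags,
			     uint8_t st, uint16_t st_ext, uint8_t ph)
{
	p->tph_st = st;
	p->tph_st_ext = st_ext;
	p->tph_ph = ph & 0x3;
	p->tph_st_valid = !!(flags & DEMO_TPH_ST);
	p->tph_st_ext_valid = !!(flags & DEMO_TPH_ST_EXT);
}
```

Updating the TPH bits is a read-modify-write of the storage unit shared
with revoked, which is why all writers need to be serialized by the
same lock.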

>  };
>  
>  static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> @@ -69,6 +75,26 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
>  	return ret;
>  }
>  
> +static int vfio_pci_dma_buf_get_tph(struct dma_buf *dmabuf, bool extended,
> +				    u16 *steering_tag, u8 *ph)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +	dma_resv_assert_held(dmabuf->resv);
> +
> +	if (extended) {
> +		if (!priv->tph_st_ext_valid)
> +			return -EOPNOTSUPP;
> +		*steering_tag = priv->tph_st_ext;
> +	} else {
> +		if (!priv->tph_st_valid)
> +			return -EOPNOTSUPP;
> +		*steering_tag = priv->tph_st;
> +	}
> +	*ph = priv->tph_ph;
> +	return 0;
> +}
> +
>  static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
>  				   struct sg_table *sgt,
>  				   enum dma_data_direction dir)
> @@ -101,6 +127,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
>  
>  static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
>  	.attach = vfio_pci_dma_buf_attach,
> +	.get_tph = vfio_pci_dma_buf_get_tph,
>  	.map_dma_buf = vfio_pci_dma_buf_map,
>  	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
>  	.release = vfio_pci_dma_buf_release,
> @@ -333,6 +360,71 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  	return ret;
>  }
>  
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> +				      u32 flags,
> +				      struct vfio_device_feature_dma_buf_tph __user *arg,
> +				      size_t argsz)
> +{
> +	struct vfio_device_feature_dma_buf_tph set_tph;
> +	struct vfio_pci_dma_buf *priv;
> +	struct dma_buf *dmabuf;
> +	u8 comp;
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> +				 sizeof(set_tph));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
> +		return -EFAULT;
> +
> +	if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
> +		return -EINVAL;
> +
> +	if (set_tph.ph & ~0x3)
> +		return -EINVAL;
> +
> +	comp = pcie_tph_completer_type(vdev->pdev);
> +	if (comp == PCI_EXP_DEVCAP2_TPH_COMP_NONE)
> +		return -EOPNOTSUPP;
> +	if ((set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT) &&
> +	    comp != PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH)
> +		return -EOPNOTSUPP;
> +
> +	dmabuf = dma_buf_get(set_tph.dmabuf_fd);
> +	if (IS_ERR(dmabuf))
> +		return PTR_ERR(dmabuf);
> +
> +	if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
> +		ret = -EINVAL;
> +		goto out_put;
> +	}
> +
> +	priv = dmabuf->priv;
> +	if (priv->vdev != vdev) {
> +		ret = -EINVAL;
> +		goto out_put;
> +	}
> +
> +	ret = dma_resv_lock_interruptible(dmabuf->resv, NULL);
> +	if (ret)
> +		goto out_put;
> +
> +	priv->tph_st         = set_tph.steering_tag;
> +	priv->tph_st_ext     = set_tph.steering_tag_ext;
> +	priv->tph_ph         = set_tph.ph;
> +	priv->tph_st_valid   = !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST);
> +	priv->tph_st_ext_valid =
> +		!!(set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT);
> +	dma_resv_unlock(dmabuf->resv);
> +	ret = 0;
> +
> +out_put:
> +	dma_buf_put(dmabuf);
> +	return ret;
> +}
> +
>  void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
>  {
>  	struct vfio_pci_dma_buf *priv;
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index fca9d0dfac90..c58f369be4b3 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -118,6 +118,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  				  struct vfio_device_feature_dma_buf __user *arg,
>  				  size_t argsz);
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> +				      u32 flags,
> +				      struct vfio_device_feature_dma_buf_tph __user *arg,
> +				      size_t argsz);
>  void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
>  void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
>  #else
> @@ -128,6 +132,14 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  {
>  	return -ENOTTY;
>  }
> +
> +static inline int
> +vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf_tph __user *arg,
> +				  size_t argsz)
> +{
> +	return -ENOTTY;
> +}
>  static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
>  {
>  }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 5de618a3a5ee..5dd693220a0d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1534,6 +1534,43 @@ struct vfio_device_feature_dma_buf {
>   */
>  #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2  12
>  
> +/**
> + * Upon VFIO_DEVICE_FEATURE_SET associate TPH (TLP Processing Hints) metadata
> + * with a vfio-exported dma-buf. The dma-buf must have been created by
> + * VFIO_DEVICE_FEATURE_DMA_BUF on this device, and the device must report
> + * TPH Completer support in Device Capabilities 2 (bits 13:12); requests
> + * carrying VFIO_DMA_BUF_TPH_ST_EXT additionally require the device to
> + * report the Extended TPH Completer encoding. Otherwise the ioctl
> + * returns -EOPNOTSUPP.
> + *
> + * dmabuf_fd is the file descriptor returned by VFIO_DEVICE_FEATURE_DMA_BUF.
> + *
> + * 8-bit ST (steering_tag) and 16-bit Extended ST (steering_tag_ext) are
> + * distinct namespaces. Userspace supplies whichever values are valid and sets
> + * the matching VFIO_DMA_BUF_TPH_ST / VFIO_DMA_BUF_TPH_ST_EXT bits in @flags;
> + * an importer requests one namespace and receives the matching value.
> + *
> + * @flags == 0 marks any previously published ST / Extended-ST as invalid
> + * for future get_tph() requests on this dma-buf.
> + *
> + * ph is the 2-bit TLP Processing Hint and must be in the range [0, 3].
> + *
> + * Userspace must publish TPH before handing the dma-buf fd to an importer.
> + * Calling SET again replaces the published values.
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_BUF_TPH 13
> +
> +#define VFIO_DMA_BUF_TPH_ST		(1 << 0)
> +#define VFIO_DMA_BUF_TPH_ST_EXT		(1 << 1)
> +
> +struct vfio_device_feature_dma_buf_tph {
> +	__s32	dmabuf_fd;
> +	__u32	flags;
> +	__u16	steering_tag_ext;
> +	__u8	steering_tag;
> +	__u8	ph;
> +};
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**

