Netdev List

* Re: net: thunderbolt: tbnet_poll() can overflow skb_shinfo()->frags[]
From: Mika Westerberg @ 2026-06-16  9:25 UTC (permalink / raw)
  To: Maoyi Xie
  Cc: Mika Westerberg, Yehezkel Bernat, Andrew Lunn, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel
In-Reply-To: <178159529251.2170936.1136950368069628844@maoyixie.com>

Hi,

On Tue, Jun 16, 2026 at 03:34:52PM +0800, Maoyi Xie wrote:
> Hi all,
> 
> After the recent skb frags[] overflow fixes (t7xx, cdc-phonet, f_phonet), I
> went looking for the same pattern. I think tbnet_poll() in
> drivers/net/thunderbolt/main.c has it too. I would appreciate it if you could
> take a look.
> 
> tbnet_poll() reassembles a ThunderboltIP packet that spans several frames into
> one skb. It adds one rx fragment per frame.
> 
> 	skb = net->skb;
> 	if (!skb) {
> 		skb = build_skb(...);
> 		...
> 		net->skb = skb;
> 	} else {
> 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> 				page, hdr_size, frame_size,
> 				TBNET_RX_PAGE_SIZE - hdr_size);
> 	}
> 
> Nothing checks skb_shinfo(skb)->nr_frags against MAX_SKB_FRAGS here. The frame
> count comes from the peer, in the frame header. tbnet_check_frame() only bounds
> it at the start of a packet.
> 
> 	if (frame_count == 0 || frame_count > TBNET_RING_SIZE / 4) {
> 		net->stats.rx_length_errors++;
> 		return false;
> 	}
> 
> TBNET_RING_SIZE is 256, so frame_count can be as large as 64. MAX_SKB_FRAGS is 17
> by default. Frame 0 builds the skb and every frame after it adds a fragment, so
> nr_frags can reach 63. Once nr_frags hits MAX_SKB_FRAGS, skb_add_rx_frag() writes
> one entry past skb_shinfo()->frags[]. The frame_size and MTU checks do not stop
> this. With small frames, 64 fragments stay well under TBNET_MAX_MTU.
> 
> So a malicious or buggy peer can send a packet with frame_count between 19 and
> 64. The frames only need to increment the way tbnet_check_frame() wants. That
> drives nr_frags past frags[] and overruns skb_shared_info.

I agree this can happen.

> The fix I had in mind mirrors f0813bcd2d9d ("net: wwan: t7xx: fix potential
> skb->frags overflow in RX path") and 600dc40554dc ("net: usb: cdc-phonet: fix
> skb frags[] overflow in rx_complete()"). Add the fragment only while there is
> room, and drop the packet otherwise.
> 
> 	-	} else {
> 	+	} else if (skb_shinfo(skb)->nr_frags < MAX_SKB_FRAGS) {
> 			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> 					page, hdr_size, frame_size,
> 					TBNET_RX_PAGE_SIZE - hdr_size);
> 	+	} else {
> 	+		net->stats.rx_length_errors++;
> 	+		__free_pages(page, TBNET_RX_PAGE_ORDER);
> 	+		dev_kfree_skb_any(net->skb);
> 	+		net->skb = NULL;
> 	+		continue;
> 		}
> 
> I do not have two Thunderbolt hosts, so this is from reading the code. I can put
> together a focused reproducer if that helps.
> 
> Does this look like a real overflow? And is the MAX_SKB_FRAGS guard the right
> place, or would you rather tighten the frame_count bound in tbnet_check_frame()?
> It has been there since the driver was added (e69b6c02b4c3), so it is a stable
> candidate. Happy to send a proper patch once you confirm.

I would prefer to do this in tbnet_check_frame(). Thanks!
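
A minimal userspace sketch of what a tighter bound in tbnet_check_frame()
could look like, assuming the limit should be the smaller of
TBNET_RING_SIZE / 4 and MAX_SKB_FRAGS + 1 (frame 0 builds the skb, so a
packet of frame_count frames adds frame_count - 1 fragments). The constants
are the values quoted above, and tbnet_frame_count_ok() is a hypothetical
helper, not the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values quoted in the thread; MAX_SKB_FRAGS is the default (configurable). */
#define TBNET_RING_SIZE	256
#define MAX_SKB_FRAGS	17

/*
 * Hypothetical helper modelling a tightened frame_count check.
 * Frame 0 becomes the skb head; each later frame adds one fragment, so a
 * packet of frame_count frames needs frame_count - 1 entries in frags[].
 */
static bool tbnet_frame_count_ok(uint32_t frame_count)
{
	uint32_t limit = TBNET_RING_SIZE / 4;

	if (limit > MAX_SKB_FRAGS + 1)
		limit = MAX_SKB_FRAGS + 1;

	return frame_count != 0 && frame_count <= limit;
}
```

With the default values this accepts 1 to 18 frames and rejects the 19 to 64
range described in the report.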

* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Pedro Falcato @ 2026-06-16  9:20 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: rizzo.unipi, m.szyprowski, robin.murphy, willemb, kuniyu, davem,
	edumazet, kuba, pabeni, gregkh, rafael, akpm, david, netdev,
	linux-mm, iommu, driver-core, linux-kernel,
	Jesper Dangaard Brouer, Ilias Apalodimas
In-Reply-To: <20260615234220.3946885-1-lrizzo@google.com>

(+cc page pool maintainers)
On Mon, Jun 15, 2026 at 11:42:20PM +0000, Luigi Rizzo wrote:
> The use of swiotlb causes an extra data copy on I/O.  For tx sockets,
> especially with greedy senders, this has a high chance of happening in
> the softirq handler for tx network interrupts, creating a significant
> performance bottleneck.
> 
> Allow tx sockets to allocate socket buffers directly from the bounce
> buffers. This avoids the second copy and removes the above bottleneck.
> The fraction of swiotlb buffers allowed for this feature is set with
>    /sys/module/swiotlb/parameters/zerocopy_tx_percent
> (0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
> 
> Implementation:
> - define a new page type to unambiguously identify bounce buffers used
>   as backing storage for socket buffers
> - modify skb_page_frag_refill to perform the modified allocation
> - modify the destructors __free_frozen_pages(), free_unref_folio() to
>   handle those pages and return them to the pool.
> 
> The savings are especially visible with fewer queues. In synthetic
> benchmarks, senders with 1-2 queues would cap around 50Gbps with
> conventional swiotlb, and reach over 170Gbps with the feature enabled.

I could be wrong, but I genuinely think that the way to go about this is
using page_pool for regular TX as well. page_pool pages are all dma-mapped
(so whatever swiotlb optimization you want can be done there), and the net
stack already has awareness of these special pages and special skbs, so it
won't Just Return Them back to the page allocator.

Otherwise you can easily go all over the place, and that's just not great.
Also this could possibly benefit setups that use IOMMU as well.

-- 
Pedro

* Re: [PATCH net-next v5 0/3] airoha: add the capability to configure GDM3/GDM4 as WAN/LAN on demand
From: Lorenzo Bianconi @ 2026-06-16  9:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	linux-arm-kernel, linux-mediatek, netdev, Madhur Agrawal,
	Alexander Lobakin
In-Reply-To: <20260615163713.665271a2@kernel.org>

> On Thu, 11 Jun 2026 23:55:50 +0200 Lorenzo Bianconi wrote:
> >       net: airoha: use int instead of atomic_t for qdma users counter
> >       net: airoha: refactor QDMA start/stop into reusable helpers
> >       net: airoha: defer GDM3/GDM4 WAN mode and GDM2 loopback to QoS offload
> 
> only the first patch applies cleanly right now

ack, I will repost missing ones as soon as net-next is open again.

Regards,
Lorenzo


* [PATCH bpf] bpf, sockmap: fix lock inversion between stab->lock and sk_callback_lock
From: Sechang Lim @ 2026-06-16  9:11 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Simon Horman, netdev, bpf,
	linux-kernel

sock_map_update_common() and __sock_map_delete() hold stab->lock and call
sock_map_unref() -> sock_map_del_link() under it. sock_map_del_link() takes
sk_callback_lock for write to stop the strparser and verdict, giving the
lock order stab->lock -> sk_callback_lock.

The opposite order comes from an SK_SKB stream parser. On RX,
sk_psock_strp_data_ready() holds sk_callback_lock for read while running
the parser. The verdict redirects the skb to egress, where a sched_cls
program calls bpf_map_delete_elem() on a sockmap, which takes stab->lock:

  WARNING: possible circular locking dependency detected
  7.1.0-rc6 Not tainted
  ------------------------------------------------------
  syz.9.8824 is trying to acquire lock:
  (&stab->lock){+.-.}-{3:3}, at: __sock_map_delete net/core/sock_map.c:421
  but task is already holding lock:
  (clock-AF_INET){++.-}-{3:3}, at: sk_psock_strp_data_ready net/core/skmsg.c:1173

  -> #1 (clock-AF_INET){++.-}-{3:3}:
         _raw_write_lock_bh
         sock_map_del_link net/core/sock_map.c:167
         sock_map_unref net/core/sock_map.c:184
         sock_map_update_common net/core/sock_map.c:509
         sock_map_update_elem_sys net/core/sock_map.c:588
         map_update_elem kernel/bpf/syscall.c:1805

  -> #0 (&stab->lock){+.-.}-{3:3}:
         _raw_spin_lock_bh
         __sock_map_delete net/core/sock_map.c:421
         sock_map_delete_elem net/core/sock_map.c:452
         bpf_prog_06044d24140080b6
         tcx_run net/core/dev.c:4451
         sch_handle_egress net/core/dev.c:4541
         __dev_queue_xmit net/core/dev.c:4808
         ...
         tcp_bpf_strp_read_sock net/ipv4/tcp_bpf.c:701
         strp_data_ready net/strparser/strparser.c:402
         sk_psock_strp_data_ready net/core/skmsg.c:1174
         tcp_data_queue net/ipv4/tcp_input.c:5661

  Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    rlock(clock-AF_INET);
                                 lock(&stab->lock);
                                 lock(clock-AF_INET);
    lock(&stab->lock);

   *** DEADLOCK ***

sk_callback_lock is an rwlock and the established side takes it for write,
so the read side cannot re-enter once a writer is queued.

sock_map_del_link() uses psock->link_lock and sk_callback_lock, not
stab->lock. The socket is removed from the slot with xchg() under
stab->lock, which leaves a single deleter owning it, and its reference is
dropped only by sk_psock_put() in sock_map_unref(). Release stab->lock
right after the xchg() and run sock_map_unref() outside it. Do the same
for the replaced socket in sock_map_update_common(). sock_map_free()
already unrefs without stab->lock.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 net/core/sock_map.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 99e3789492a0..390bd5ee46d4 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -421,13 +421,13 @@ static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
 	spin_lock_bh(&stab->lock);
 	if (!sk_test || sk_test == *psk)
 		sk = xchg(psk, NULL);
+	spin_unlock_bh(&stab->lock);
 
 	if (likely(sk))
 		sock_map_unref(sk, psk);
 	else
 		err = -EINVAL;
 
-	spin_unlock_bh(&stab->lock);
 	return err;
 }
 
@@ -505,9 +505,10 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 
 	sock_map_add_link(psock, link, map, &stab->sks[idx]);
 	stab->sks[idx] = sk;
+	spin_unlock_bh(&stab->lock);
+
 	if (osk)
 		sock_map_unref(osk, &stab->sks[idx]);
-	spin_unlock_bh(&stab->lock);
 	return 0;
 out_unlock:
 	spin_unlock_bh(&stab->lock);
-- 
2.43.0
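
The ordering the patch aims for can be modelled in userspace. This is only a
sketch under simplifying assumptions: the locks are toy flags rather than
real spinlocks or rwlocks, the sk_test comparison is left out, and every
toy_* name is hypothetical. It shows the socket being detached from its slot
under the map lock and the reference being dropped only after that lock is
released, so the callback lock is never taken while the map lock is held.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy lock: a flag records whether the lock is held, nothing more. */
struct toy_lock { int held; };

static void toy_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_release(struct toy_lock *l) { assert(l->held);  l->held = 0; }

static struct toy_lock stab_lock;	/* stands in for stab->lock */
static struct toy_lock callback_lock;	/* stands in for sk_callback_lock */

struct toy_sock { int refs; };

/*
 * Stands in for sock_map_unref(): it takes the callback lock, so it must
 * never run while stab_lock is held.
 */
static void toy_unref(struct toy_sock *sk)
{
	assert(!stab_lock.held);
	toy_acquire(&callback_lock);
	sk->refs--;
	toy_release(&callback_lock);
}

/* Detach the socket under stab_lock, drop the reference after unlocking. */
static int toy_map_delete(struct toy_sock **slot)
{
	struct toy_sock *sk;

	toy_acquire(&stab_lock);
	sk = *slot;
	*slot = NULL;
	toy_release(&stab_lock);

	if (!sk)
		return -EINVAL;

	toy_unref(sk);
	return 0;
}
```

A second delete on the same slot finds NULL and returns -EINVAL, which
matches the single-deleter argument in the commit message.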


* Re: [PATCH] vhost/net: fix clear_user start address in VHOST_GET_FEATURES_ARRAY
From: rom.wang @ 2026-06-16  9:01 UTC (permalink / raw)
  To: r4o5m6e8o
  Cc: eperezma, jasowang, kvm, linux-kernel, mst, netdev, pabeni,
	virtualization, wangyufeng
In-Reply-To: <20260526080336.61296-1-r4o5m6e8o@163.com>

Gentle ping. Any comments on this patch?

                                  Thanks
                             Yufeng Wang


* Re: [PATCH net] net: dsa: Fix skb ownership in taggers
From: Linus Walleij @ 2026-06-16  9:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, Vladimir Oltean, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, Florian Fainelli, Jonas Gorski,
	Hauke Mehrtens, Kurt Kanzenbach, Woojung Huh, UNGLinuxDriver,
	Chester A. Unal, Daniel Golle, Matthias Brugger,
	AngeloGioacchino Del Regno, Wei Fang, Clark Wang,
	Clément Léger, George McCollister, David Yang, netdev,
	Sashiko AI Review
In-Reply-To: <20260615180113.13fca89f@kernel.org>

On Tue, Jun 16, 2026 at 3:01 AM Jakub Kicinski <kuba@kernel.org> wrote:

> Impressive. Thanks a lot for doing this.

The grunt work was actually AI-assisted: I took a deep breath, jumped in
at the deep end and vibe-coded it. If AI finds the bug, AI should help
fix it...

> patchwork says it doesn't apply to net. Is it on top of net or net-next?

It's on net-next so that it covers the new tagger. Since the merge window
had opened, I guessed it needed to be tagged "net" at this point.

> Since the merge window started already net-next is probably better but
> you need to designate in the subject correctly. Feel free to repost
> without the 24h wait, maybe we can still slip this into our main PR.

OK, I will make the improvements pointed out by Wei and Quingfang and
repost tagged for net-next.

Yours,
Linus Walleij

* Re: [PATCH] nfc: fdp: reject an oversized device-reported packet length
From: Simon Horman @ 2026-06-16  9:00 UTC (permalink / raw)
  To: hexlabsecurity
  Cc: David Heidelberg, linux-kernel, Robert Dolca, netdev,
	oe-linux-nfc, Samuel Ortiz, Kang Chen
In-Reply-To: <20260615-b4-disp-f42dce2d-v1-1-186ff3dcbf37@proton.me>

On Mon, Jun 15, 2026 at 03:04:02AM -0500, Bryam Vargas via B4 Relay wrote:
> From: Bryam Vargas <hexlabsecurity@proton.me>
> 
> fdp_nci_i2c_read() reads the length of the next packet from the device
> into phy->next_read_size and uses it as the i2c_master_recv() byte count
> into a fixed on-stack buffer:
> 
> 	u8 tmp[FDP_NCI_I2C_MAX_PAYLOAD];		/* 261 bytes */
> 	...
> 	len = phy->next_read_size;
> 	r = i2c_master_recv(client, tmp, len);
> 
> When a "length packet" arrives (tmp[0] == 0 && tmp[1] == 0), the next
> length is taken verbatim from two device-supplied bytes:
> 
> 	phy->next_read_size = (tmp[2] << 8) + tmp[3] + 3;
> 
> next_read_size is a u16, so this can be driven as high as 65535 - far
> larger than the 261-byte tmp[] buffer - and it is never bounded before
> the next iteration's i2c_master_recv(). A malfunctioning, malicious or
> counterfeit FDP NFC controller (or an attacker tampering with the I2C
> bus) that sends such a length packet makes i2c_master_recv() write up to
> about 64 KB into the 261-byte on-stack buffer: a stack out-of-bounds
> write that clobbers the stack canary, saved registers and the return
> address.
> 
> Reject a next_read_size larger than the receive buffer the same way a
> corrupted packet is already handled - drop it and force resynchronization
> - so a device can never drive an over-length read.
> 
> Fixes: a06347c04c13 ("NFC: Add Intel Fields Peak NFC solution driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
> ---
> I reproduced the out-of-bounds write with an in-kernel test that drives
> the fdp_nci_i2c_read() buffer geometry verbatim under KASAN
> (CONFIG_KASAN_STACK=y), modelling i2c_master_recv() delivering
> next_read_size device bytes into the 261-byte tmp[] buffer:
> 
>   next_read_size = 281, no bound:
>     BUG: KASAN: stack-out-of-bounds in i2c_master_recv...
>     Write of size 281 ... [48, 309) 'tmp'   (the 261-byte buffer)
>   with the device length bounded to <= FDP_NCI_I2C_MAX_PAYLOAD (what this
>     patch enforces): no KASAN report.
>   a well-formed packet (length <= 261) is unaffected, no KASAN report.
> 
> The full device range - next_read_size = 65535 (tmp[2] = 0xff,
> tmp[3] = 0xfc; the u16 field truncates the + 3), a 65535-byte write =
> 65274 bytes past the buffer, smashing the stack canary and the return
> address - reproduces the same way under userspace AddressSanitizer on
> both -m32 and -m64.
> ---
>  drivers/nfc/fdp/i2c.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/nfc/fdp/i2c.c b/drivers/nfc/fdp/i2c.c
> index c1896a1d978c..0392bb49bb4b 100644
> --- a/drivers/nfc/fdp/i2c.c
> +++ b/drivers/nfc/fdp/i2c.c
> @@ -166,6 +166,20 @@ static int fdp_nci_i2c_read(struct fdp_i2c_phy *phy, struct sk_buff **skb)
>  		/* Packet that contains a length */
>  		if (tmp[0] == 0 && tmp[1] == 0) {
>  			phy->next_read_size = (tmp[2] << 8) + tmp[3] + 3;


Thanks Bryam,

I agree with your analysis regarding overrunning tmp and that the
fix for that is correct.

But I am concerned that the code also expects next_read_size to always be
at least FDP_NCI_I2C_MIN_PAYLOAD (5), and smaller values are possible if
either:

* tmp[2] is 0 and tmp[3] is < 2.
* the addition above overflows 16 bits, e.g. both tmp[2] and tmp[3] are 255.

So I wonder if the check you are adding below should also guard
against phy->next_read_size < FDP_NCI_I2C_MIN_PAYLOAD.

> +
> +			/*
> +			 * next_read_size is taken from the device and is used
> +			 * as the i2c_master_recv() count on the next iteration.
> +			 * A value larger than the receive buffer would overflow
> +			 * tmp[]; treat it like a corrupted packet and force
> +			 * resynchronization.
> +			 */
> +			if (phy->next_read_size > FDP_NCI_I2C_MAX_PAYLOAD) {
> +				dev_dbg(&client->dev, "%s: corrupted packet\n",
> +					__func__);
> +				phy->next_read_size = FDP_NCI_I2C_MIN_PAYLOAD;
> +				goto flush;
> +			}
>  		} else {
>  			phy->next_read_size = FDP_NCI_I2C_MIN_PAYLOAD;
>  
> 
> ---
> base-commit: 8e65320d91cdc3b241d4b94855c88459b91abf66
> change-id: 20260615-b4-disp-f42dce2d-055035ea37ba
> 
> Best regards,
> -- 
> Bryam Vargas <hexlabsecurity@proton.me>
> 
> 
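
The lower-bound concern raised in the reply can be checked with a small
userspace model. This is a sketch, assuming the two limits quoted in the
thread (5 and 261); fdp_next_read_size() repeats the driver's arithmetic
with 16-bit truncation, while fdp_next_read_size_ok() is a hypothetical
helper that checks both bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values quoted in the thread for drivers/nfc/fdp/i2c.c. */
#define FDP_NCI_I2C_MIN_PAYLOAD	5
#define FDP_NCI_I2C_MAX_PAYLOAD	261

/* Same arithmetic as the driver; the result is stored in a 16-bit field. */
static uint16_t fdp_next_read_size(uint8_t hi, uint8_t lo)
{
	return (uint16_t)((hi << 8) + lo + 3);
}

/* Hypothetical helper: accept only sizes within both bounds. */
static bool fdp_next_read_size_ok(uint16_t size)
{
	return size >= FDP_NCI_I2C_MIN_PAYLOAD &&
	       size <= FDP_NCI_I2C_MAX_PAYLOAD;
}
```

For example, the bytes (0, 1) give 4 and the bytes (255, 255) wrap around to
2, so both fall below the minimum, while (1, 3) gives 262 and exceeds the
buffer.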

* RE: [Intel-wired-lan] [PATCH net-next] i40e: add devlink parameter for Flow Director ATR sample rate
From: Kwapulinski, Piotr @ 2026-06-16  8:53 UTC (permalink / raw)
  To: mheib@redhat.com, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, jiri@resnulli.us, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	horms@kernel.org, corbet@lwn.net, Nguyen, Anthony L,
	Kitszel, Przemyslaw, andrew+netdev@lunn.ch
In-Reply-To: <20260614161131.192068-1-mheib@redhat.com>

>-----Original Message-----
>From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of mheib@redhat.com
>Sent: Sunday, June 14, 2026 6:12 PM
>To: intel-wired-lan@lists.osuosl.org
>Cc: netdev@vger.kernel.org; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; horms@kernel.org; corbet@lwn.net; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; andrew+netdev@lunn.ch; Mohammad Heib <mheib@redhat.com>
>Subject: [Intel-wired-lan] [PATCH net-next] i40e: add devlink parameter for Flow Director ATR sample rate
>
>From: Mohammad Heib <mheib@redhat.com>
>
>The i40e driver uses Flow Director ATR to periodically update flow steering information for active TCP flows. The update frequency is currently controlled by I40E_DEFAULT_ATR_SAMPLE_RATE and is fixed at driver build time.
>
>On systems with a large number of queues and high-rate TCP workloads, the default sampling interval can result in frequent Flow Director reprogramming for long-lived flows.
>
>The amount of TCP packet reordering observed on some systems is sensitive to the ATR sampling interval. Increasing the interval reduces Flow Director programming activity and can significantly reduce the associated reordering.
>
>Since the optimal sampling interval depends on the workload and system configuration, a single fixed value is not suitable for all deployments.
>
>Add a devlink parameter to allow administrators to tune the ATR sample rate at runtime without rebuilding the driver or disabling ATR functionality entirely.
>
>Signed-off-by: Mohammad Heib <mheib@redhat.com>
>---
> Documentation/networking/devlink/i40e.rst     | 19 ++++++
> drivers/net/ethernet/intel/i40e/i40e.h        |  1 +
> .../net/ethernet/intel/i40e/i40e_devlink.c    | 65 +++++++++++++++++++
> drivers/net/ethernet/intel/i40e/i40e_main.c   |  4 +-
> drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  4 +-
> 5 files changed, 90 insertions(+), 3 deletions(-)
>
>diff --git a/Documentation/networking/devlink/i40e.rst b/Documentation/networking/devlink/i40e.rst
>index 51c887f0dc83..704469aa9acf 100644
>--- a/Documentation/networking/devlink/i40e.rst
>+++ b/Documentation/networking/devlink/i40e.rst
>@@ -40,6 +40,25 @@ Parameters
> 
>         The default value is ``0`` (internal calculation is used).
> 
>+.. list-table:: Driver specific parameters implemented
>+    :widths: 5 5 90
>+
>+    * - Name
>+      - Mode
>+      - Description
>+    * - ``atr_sample_rate``
>+      - runtime
>+      - Controls how frequently Flow Director ATR updates flow steering
>+        information for active TCP flows.
>+
>+        ATR programs Flow Director entries based on sampled transmitted
>+        packets. The sampling interval is specified as the number of
>+        transmitted packets between ATR updates.
>+
>+        Lower values increase Flow Director programming activity, while
>+        higher values reduce the update frequency.
>+
>+        The default value is ``20``.
> 
> Info versions
> =============
>diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>index 1b6a8fbaa648..88eb40ee45f0 100644
>--- a/drivers/net/ethernet/intel/i40e/i40e.h
>+++ b/drivers/net/ethernet/intel/i40e/i40e.h
>@@ -487,6 +487,7 @@ struct i40e_pf {
> 	u16 rss_size_max;          /* HW defined max RSS queues */
> 	u16 fdir_pf_filter_count;  /* num of guaranteed filters for this PF */
> 	u16 num_alloc_vsi;         /* num VSIs this driver supports */
>+	u32 atr_sample_rate;
> 	bool wol_en;
> 
> 	struct hlist_head fdir_filter_list;
>diff --git a/drivers/net/ethernet/intel/i40e/i40e_devlink.c b/drivers/net/ethernet/intel/i40e/i40e_devlink.c
>index 229179ccc131..16e51762db45 100644
>--- a/drivers/net/ethernet/intel/i40e/i40e_devlink.c
>+++ b/drivers/net/ethernet/intel/i40e/i40e_devlink.c
>@@ -33,12 +33,77 @@ static int i40e_max_mac_per_vf_get(struct devlink *devlink,
> 	return 0;
> }
> 
>+static int i40e_atr_sample_rate_set(struct devlink *devlink,
>+				    u32 id,
>+				    struct devlink_param_gset_ctx *ctx,
>+				    struct netlink_ext_ack *extack)
>+{
>+	struct i40e_pf *pf = devlink_priv(devlink);
>+	struct i40e_vsi *vsi;
>+	u32 sample_rate = ctx->val.vu32;
>+	int i;
Please keep the declarations in reverse Christmas tree (RCT) order and declare 'i' within the loop.
Thank you.
Piotr

>+
>+	pf->atr_sample_rate = sample_rate;
>+
>+	if (!test_bit(I40E_FLAG_FD_ATR_ENA, pf->flags))
>+		return 0;
>+
>+	vsi = i40e_pf_get_main_vsi(pf);
>+	if (!vsi)
>+		return 0;
>+
>+	for (i = 0; i < vsi->num_queue_pairs; i++) {
>+		if (!vsi->tx_rings[i])
>+			continue;
>+		vsi->tx_rings[i]->atr_sample_rate = sample_rate;
>+		vsi->tx_rings[i]->atr_count = 0;
>+	}
>+
>+	return 0;
>+}
>+
>+static int i40e_atr_sample_rate_get(struct devlink *devlink,
>+				    u32 id,
>+				    struct devlink_param_gset_ctx *ctx,
>+				    struct netlink_ext_ack *extack)
>+{
>+	struct i40e_pf *pf = devlink_priv(devlink);
>+
>+	ctx->val.vu32 = pf->atr_sample_rate;
>+
>+	return 0;
>+}
>+
>+static int i40e_atr_sample_rate_validate(struct devlink *devlink, u32 id,
>+					 union devlink_param_value val,
>+					 struct netlink_ext_ack *extack)
>+{
>+	if (!val.vu32) {
>+		NL_SET_ERR_MSG_MOD(extack,
>+				   "ATR sample rate must be greater than 0");
>+		return -EINVAL;
>+	}
>+	return 0;
>+}
>+
>+enum i40e_dl_param_id {
>+	I40E_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
>+	I40E_DEVLINK_PARAM_ID_ATR_SAMPLE_RATE,
>+};
>+
> static const struct devlink_param i40e_dl_params[] = {
> 	DEVLINK_PARAM_GENERIC(MAX_MAC_PER_VF,
> 			      BIT(DEVLINK_PARAM_CMODE_RUNTIME),
> 			      i40e_max_mac_per_vf_get,
> 			      i40e_max_mac_per_vf_set,
> 			      NULL),
>+	DEVLINK_PARAM_DRIVER(I40E_DEVLINK_PARAM_ID_ATR_SAMPLE_RATE,
>+			     "atr_sample_rate",
>+			     DEVLINK_PARAM_TYPE_U32,
>+			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
>+			     i40e_atr_sample_rate_get,
>+			     i40e_atr_sample_rate_set,
>+			     i40e_atr_sample_rate_validate),
> };
> 
> static void i40e_info_get_dsn(struct i40e_pf *pf, char *buf, size_t len)
>diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>index d59750c490f4..9c8144970a34 100644
>--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>@@ -3458,7 +3458,7 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
> 
> 	/* some ATR related tx ring init */
> 	if (test_bit(I40E_FLAG_FD_ATR_ENA, vsi->back->flags)) {
>-		ring->atr_sample_rate = I40E_DEFAULT_ATR_SAMPLE_RATE;
>+		ring->atr_sample_rate = vsi->back->atr_sample_rate;
> 		ring->atr_count = 0;
> 	} else {
> 		ring->atr_sample_rate = 0;
>@@ -12745,6 +12745,8 @@ static int i40e_sw_init(struct i40e_pf *pf)
> 		}
> 	}
> 
>+	pf->atr_sample_rate = I40E_DEFAULT_ATR_SAMPLE_RATE;
>+
> 	if ((pf->hw.func_caps.fd_filters_guaranteed > 0) ||
> 	    (pf->hw.func_caps.fd_filters_best_effort > 0)) {
> 		set_bit(I40E_FLAG_FD_ATR_ENA, pf->flags);
>diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>index bb741ff3e5f2..7e29e9244c3a 100644
>--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>@@ -372,8 +372,8 @@ struct i40e_ring {
> 	u16 next_to_clean;
> 	u16 xdp_tx_active;
> 
>-	u8 atr_sample_rate;
>-	u8 atr_count;
>+	u32 atr_sample_rate;
>+	u32 atr_count;
> 
> 	bool ring_active;		/* is ring online or not */
> 	bool arm_wb;		/* do something to arm write back */
>--
>2.53.0
>
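
The review comment above can be illustrated with a userspace model of the
setter. This is only a sketch: the toy_* structures are hypothetical
stand-ins that keep just the fields the loop touches, the devlink plumbing
is left out, and the I40E_FLAG_FD_ATR_ENA check is omitted. Local
declarations are ordered longest line first, and the loop counter is
declared in the for statement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the driver structures; only the fields used here. */
struct toy_ring {
	uint32_t atr_sample_rate;
	uint32_t atr_count;
};

struct toy_vsi {
	struct toy_ring **tx_rings;
	int num_queue_pairs;
};

struct toy_pf {
	struct toy_vsi *main_vsi;
	uint32_t atr_sample_rate;
};

/* Hypothetical helper: store the rate and apply it to every allocated ring. */
static void toy_set_atr_sample_rate(struct toy_pf *pf, uint32_t sample_rate)
{
	struct toy_vsi *vsi = pf->main_vsi;
	struct toy_ring *ring;

	pf->atr_sample_rate = sample_rate;
	if (!vsi)
		return;

	for (int i = 0; i < vsi->num_queue_pairs; i++) {
		ring = vsi->tx_rings[i];
		if (!ring)
			continue;
		ring->atr_sample_rate = sample_rate;
		ring->atr_count = 0;	/* restart sampling with the new rate */
	}
}
```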

* Re: [PATCH net] net: psample: fix info leak in PSAMPLE_ATTR_DATA
From: Jiri Pirko @ 2026-06-16  8:44 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
	Weiming Shi, yotam.gi, jhs
In-Reply-To: <20260616003046.1099490-1-kuba@kernel.org>

Tue, Jun 16, 2026 at 02:30:46AM +0200, kuba@kernel.org wrote:
>psample open codes nla_put() presumably to avoid wiping
>the data with 0s just to override it with packet data.
>This open coding is missing clearing the pad, however,
>each netlink attr is padded to 4B and data_len may
>not be divisible by 4B.
>
>Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling")
>Reported-by: Weiming Shi <bestswngs@gmail.com>
>Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
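
The padding issue described in the commit message can be sketched in plain
C. This is a userspace illustration, not the psample code:
put_padded_payload() and the ATTR_* macros are hypothetical names, and the
only fact taken from the thread is that attributes are padded to 4 bytes.
The helper copies the payload and then clears the pad bytes, which is the
step the open-coded path was missing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Netlink attributes are aligned to 4 bytes. */
#define ATTR_ALIGNTO	((size_t)4)
#define ATTR_ALIGN(len)	(((len) + ATTR_ALIGNTO - 1) & ~(ATTR_ALIGNTO - 1))

/*
 * Hypothetical helper: copy len payload bytes into dst and zero the padding
 * up to the next 4-byte boundary. dst must have room for ATTR_ALIGN(len)
 * bytes. Returns the number of bytes written, padding included.
 */
static size_t put_padded_payload(unsigned char *dst, const void *src, size_t len)
{
	size_t padded = ATTR_ALIGN(len);

	memcpy(dst, src, len);
	memset(dst + len, 0, padded - len);
	return padded;
}
```

Without the memset(), the one to three pad bytes after a payload whose
length is not a multiple of 4 would keep whatever the buffer held before.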

* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: David Hildenbrand (Arm) @ 2026-06-16  8:36 UTC (permalink / raw)
  To: Luigi Rizzo, rizzo.unipi, m.szyprowski, robin.murphy, willemb,
	kuniyu, davem, edumazet, kuba, pabeni
  Cc: gregkh, rafael, akpm, netdev, linux-mm, iommu, driver-core,
	linux-kernel
In-Reply-To: <20260615234220.3946885-1-lrizzo@google.com>

On 6/16/26 01:42, Luigi Rizzo wrote:
> The use of swiotlb causes an extra data copy on I/O.  For tx sockets,
> especially with greedy senders, this has a high chance of happening in
> the softirq handler for tx network interrupts, creating a significant
> performance bottleneck.
> 
> Allow tx sockets to allocate socket buffers directly from the bounce
> buffers. This avoids the second copy and removes the above bottleneck.
> The fraction of swiotlb buffers allowed for this feature is set with
>    /sys/module/swiotlb/parameters/zerocopy_tx_percent
> (0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
> 
> Implementation:
> - define a new page type to unambiguously identify bounce buffers used
>   as backing storage for socket buffers
> - modify skb_page_frag_refill to perform the modified allocation
> - modify the destructors __free_frozen_pages(), free_unref_folio() to
>   handle those pages and return them to the pool.
> 
> The savings are especially visible with fewer queues. In synthetic
> benchmarks, senders with 1-2 queues would cap around 50Gbps with
> conventional swiotlb, and reach over 170Gbps with the feature enabled.
> 
> Signed-off-by: Luigi Rizzo <lrizzo@google.com>
> ---
>  drivers/base/core.c        |   1 +
>  include/linux/netdevice.h  |  22 ++++
>  include/linux/page-flags.h |   4 +
>  include/linux/skbuff.h     |   7 +-
>  include/linux/swiotlb.h    |  74 ++++++++++++
>  include/net/sock.h         |  29 +++++
>  kernel/dma/swiotlb.c       | 227 +++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c            |  32 ++++++
>  net/core/sock.c            |  98 ++++++++++++++--
>  9 files changed, 485 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index bd2ddf2aab505..e1257dea37ba0 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -3855,6 +3855,7 @@ void device_del(struct device *dev)
>  	unsigned int noio_flag;
>  
>  	device_lock(dev);
> +	swiotlb_device_deleted();
>  	kill_device(dev);
>  	device_unlock(dev);
>  
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0e1e581efc5ac..d7e5929e73c92 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -5368,13 +5368,35 @@ static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
>  	return ops->ndo_start_xmit(skb, dev);
>  }
>  
> +struct sock;
> +
> +#ifdef CONFIG_SWIOTLB
> +/* Per-CPU pointer to the socket currently performing transmission.
> + * Used to bridge the networking and DMA layers, allowing the dma_map_page()
> + * path to identify the socket originating the packet and apply SWIOTLB optimizations.
> + */
> +DECLARE_PER_CPU(struct sock *, current_tx_socket);
> +static inline struct sock *__set_current_tx_socket(struct sock *sk)
> +{
> +	struct sock *old_sk = this_cpu_read(current_tx_socket);
> +
> +	this_cpu_write(current_tx_socket, sk);
> +	return old_sk;
> +}
> +#else
> +static inline struct sock *__set_current_tx_socket(struct sock *sk) { return NULL; }
> +#endif
> +
>  static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_device *dev,
>  					    struct netdev_queue *txq, bool more)
>  {
>  	const struct net_device_ops *ops = dev->netdev_ops;
> +	struct sock *old_sk;
>  	netdev_tx_t rc;
>  
> +	old_sk = __set_current_tx_socket(skb->sk);
>  	rc = __netdev_start_xmit(ops, skb, dev, more);
> +	__set_current_tx_socket(old_sk);
>  	if (rc == NETDEV_TX_OK)
>  		txq_trans_update(dev, txq);
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 7223f6f4e2b40..0ecbb404038a0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -923,6 +923,7 @@ enum pagetype {
>  	PGTY_zsmalloc		= 0xf6,
>  	PGTY_unaccepted		= 0xf7,
>  	PGTY_large_kmalloc	= 0xf8,
> +	PGTY_zcswiotlb		= 0xf9,
>  
>  	PGTY_mapcount_underflow = 0xff
>  };
> @@ -1055,6 +1056,9 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>  PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted)
>  PAGE_TYPE_OPS(LargeKmalloc, large_kmalloc, large_kmalloc)
>  
> +/* Pages in socket buffers from the swiotlb pool. */
> +PAGE_TYPE_OPS(ZCSwiotlb, zcswiotlb, zcswiotlb)
> +
>  /**
>   * PageHuge - Determine if the page belongs to hugetlbfs
>   * @page: The page to test.
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 3f06254ab1b72..62340909409e5 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3787,7 +3787,12 @@ static inline void skb_frag_page_copy(skb_frag_t *fragto,
>  	fragto->netmem = fragfrom->netmem;
>  }
>  
> -bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
> +/* zerocopy swiotlb uses an additional non-null struct sock pointer. */
> +bool __skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio, struct sock *sk);
> +static inline bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
> +{
> +	return __skb_page_frag_refill(sz, pfrag, prio, NULL);
> +}
>  
>  /**
>   * __skb_frag_dma_map - maps a paged fragment via the DMA API
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 3dae0f592063e..bd2d0e160a9d8 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -7,8 +7,10 @@
>  #include <linux/init.h>
>  #include <linux/types.h>
>  #include <linux/limits.h>
> +#include <linux/percpu.h>
>  #include <linux/spinlock.h>
>  #include <linux/workqueue.h>
> +#include <linux/atomic.h>
>  
>  struct device;
>  struct page;
> @@ -122,6 +124,9 @@ struct io_tlb_mem {
>  	atomic_long_t total_used;
>  	atomic_long_t used_hiwater;
>  	atomic_long_t transient_nslabs;
> +#else
> +	unsigned long last_used_slots;
> +	unsigned long last_used_jiffies;
>  #endif
>  };
>  
> @@ -185,6 +190,69 @@ bool is_swiotlb_active(struct device *dev);
>  void __init swiotlb_adjust_size(unsigned long size);
>  phys_addr_t default_swiotlb_base(void);
>  phys_addr_t default_swiotlb_limit(void);
> +
> +/* Helpers for zerocopy swiotlb. */
> +/* Control allocation fraction. */
> +extern unsigned int swiotlb_zc_tx_percent;
> +
> +/* Track freshness of the leaf device info. */
> +extern atomic_t global_device_serial;
> +
> +static inline u32 swiotlb_get_device_serial(void)
> +{
> +	return atomic_read(&global_device_serial);
> +}
> +
> +static inline void swiotlb_device_deleted(void)
> +{
> +	atomic_inc(&global_device_serial);
> +}
> +
> +struct page *swiotlb_alloc_pages(struct device *dev, unsigned int order);
> +bool swiotlb_free_pages(struct page *page, bool where_debug_only);
> +void swiotlb_safe_put_device(struct device *dev);
> +
> +static inline void swiotlb_set_page_dev(struct page *page, struct device *dev)
> +{
> +	page->private = (unsigned long)dev;
> +}
> +
> +static inline struct device *swiotlb_page_to_dev(struct page *page)
> +{
> +	return (struct device *)compound_head(page)->private;
> +}
> +
> +static inline bool is_zerocopy_swiotlb_folio(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	return folio_test_zcswiotlb(folio) && folio->private != 0;
> +}
> +
> +/* These two are in mm/page_alloc.c */
> +void swiotlb_prep_compound_page(struct page *page, unsigned int order);
> +void swiotlb_destroy_compound_page(struct page *page, unsigned int order);
> +
> +#if defined(CONFIG_NET)
> +/*
> + * Track the socket for the currently transmitted packet, so the dma mapping
> + * function can record there the leaf device if it needs bounce buffers.
> + */
> +struct sock;
> +DECLARE_PER_CPU(struct sock *, current_tx_socket);
> +void sk_set_bounce_device(struct sock *sk, struct device *dev);
> +static inline void dma_learn_bounce_device(struct device *dev)
> +{
> +	struct sock *sk = this_cpu_read(current_tx_socket);
> +
> +	if (sk)
> +		sk_set_bounce_device(sk, dev);
> +}
> +#else
> +static inline void dma_learn_bounce_device(struct device *dev) {}
> +#endif
> +/* End helpers for zerocopy swiotlb. */
> +
>  #else
>  static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
>  {
> @@ -234,6 +302,12 @@ static inline phys_addr_t default_swiotlb_limit(void)
>  {
>  	return 0;
>  }
> +
> +/* zerocopy swiotlb stubs */
> +static inline bool swiotlb_free_pages(struct page *page, int reason) { return false; }
> +static inline u32 swiotlb_get_device_serial(void) { return 0; }
> +static inline void swiotlb_device_deleted(void) {}
> +
>  #endif /* CONFIG_SWIOTLB */
>  
>  phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
> diff --git a/include/net/sock.h b/include/net/sock.h
> index dccd3738c3687..1e6caf4bd1366 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -47,6 +47,7 @@
>  #include <linux/skbuff.h>	/* struct sk_buff */
>  #include <linux/mm.h>
>  #include <linux/security.h>
> +#include <linux/swiotlb.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
>  #include <linux/page_counter.h>
> @@ -70,6 +71,14 @@
>  #include <net/l3mdev.h>
>  #include <uapi/linux/socket.h>
>  
> +#ifdef CONFIG_SWIOTLB
> +struct sk_swiotlb_info {
> +	struct device		*dev;
> +	u32			serial;
> +	unsigned long		jiffies;
> +};
> +#endif
> +
>  /*
>   * This structure really needs to be cleaned up.
>   * Most of it is for TCP, and not used by any of
> @@ -602,8 +611,28 @@ struct sock {
>  #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
>  	struct module		*sk_owner;
>  #endif
> +#ifdef CONFIG_SWIOTLB
> +	struct sk_swiotlb_info	sk_swiotlb;
> +#endif
>  };
>  
> +#ifdef CONFIG_SWIOTLB
> +static inline void sk_init_bounce_device(struct sock *sk)
> +{
> +	sk->sk_swiotlb.dev = NULL;
> +}
> +static inline void sk_cleanup_bounce_device(struct sock *sk)
> +{
> +	if (sk->sk_swiotlb.dev) {
> +		swiotlb_safe_put_device(sk->sk_swiotlb.dev);
> +		sk->sk_swiotlb.dev = NULL;
> +	}
> +}
> +#else
> +static inline void sk_init_bounce_device(struct sock *sk) {}
> +static inline void sk_cleanup_bounce_device(struct sock *sk) {}
> +#endif
> +
>  struct sock_bh_locked {
>  	struct sock *sock;
>  	local_lock_t bh_lock;
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 1abd3e6146f45..e27f23d03c482 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -37,12 +37,16 @@
>  #include <linux/mm.h>
>  #include <linux/pfn.h>
>  #include <linux/rculist.h>
> +#include <linux/refcount.h>
>  #include <linux/scatterlist.h>
>  #include <linux/set_memory.h>
>  #include <linux/spinlock.h>
>  #include <linux/string.h>
>  #include <linux/swiotlb.h>
> +#include <linux/moduleparam.h>
> +#include <linux/percpu.h>
>  #include <linux/types.h>
> +#include <linux/atomic.h>
>  #ifdef CONFIG_DMA_RESTRICTED_POOL
>  #include <linux/of.h>
>  #include <linux/of_fdt.h>
> @@ -81,6 +85,17 @@ struct io_tlb_slot {
>  static bool swiotlb_force_bounce;
>  static bool swiotlb_force_disable;
>  
> +/**
> + * global_device_serial - Global sequence number for device deletions
> + *
> + * Incremented every time a device is unregistered (in device_del()).
> + * Used by subsystems (like SWIOTLB zero-copy sockets) as a fast, lockless
> + * O(1) cache invalidation serial to detect when a cached device pointer
> + * might have been deleted and needs to be expired to prevent Use-After-Free.
> + */
> +atomic_t global_device_serial = ATOMIC_INIT(0);
> +EXPORT_SYMBOL(global_device_serial);
> +
>  #ifdef CONFIG_SWIOTLB_DYNAMIC
>  
>  static void swiotlb_dyn_alloc(struct work_struct *work);
> @@ -1442,6 +1457,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>  	offset &= (IO_TLB_SIZE - 1);
>  	index += pad_slots;
>  	pool->slots[index].pad_slots = pad_slots;
> +	/* Fix an upstream bug with alloc_align_mask = 0xffff */
> +	pool->slots[index].alloc_size = mapping_size;
>  	for (i = 0; i < (nr_slots(size) - pad_slots); i++)
>  		pool->slots[index + i].orig_addr = slot_addr(orig_addr, i);
>  	tlb_addr = slot_addr(pool->start, index) + offset;
> @@ -1554,6 +1571,13 @@ void __swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
>  		size_t mapping_size, enum dma_data_direction dir,
>  		unsigned long attrs, struct io_tlb_pool *pool)
>  {
> +	/*
> +	 * Recognize and avoid unmapping pages allocated for Zero-Copy SWIOTLB Page Bypass.
> +	 * They will be eventually released when the page reference count drops to 0.
> +	 */
> +	if (is_zerocopy_swiotlb_folio(pfn_to_page(PHYS_PFN(tlb_addr))))
> +		return;
> +
>  	/*
>  	 * First, sync the memory before unmapping the entry
>  	 */
> @@ -1597,6 +1621,21 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
>  	phys_addr_t swiotlb_addr;
>  	dma_addr_t dma_addr;
>  
> +	dma_learn_bounce_device(dev);
> +
> +	/*
> +	 * If the page was allocated via Zero-Copy SWIOTLB Page Bypass, it is likely
> +	 * already good for DMA so we can return its dma address.
> +	 */
> +	if (is_zerocopy_swiotlb_folio(pfn_to_page(PHYS_PFN(paddr)))) {
> +		dma_addr = phys_to_dma_unencrypted(dev, paddr);
> +		if (likely(dma_capable(dev, dma_addr, size, true))) {
> +			if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +				arch_sync_dma_for_device(paddr, size, dir);
> +			return dma_addr;
> +		}
> +	}
> +
>  	trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size);
>  
>  	swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, attrs);
> @@ -1899,3 +1938,191 @@ static const struct reserved_mem_ops rmem_swiotlb_ops = {
>  
>  RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", &rmem_swiotlb_ops);
>  #endif /* CONFIG_DMA_RESTRICTED_POOL */
> +
> +/*
> + * Asynchronous/Deferred Device Release.
> + * put_device() can trigger the final release path of a device which may sleep.
> + * Since SWIOTLB pages can be freed in atomic or interrupt context (e.g. TX completion),
> + * we must defer the put_device() call to task context using a workqueue.
> + */
> +struct swiotlb_deferred_put {
> +	struct work_struct work;
> +	struct device *dev;
> +};
> +
> +static void swiotlb_deferred_put_work(struct work_struct *work)
> +{
> +	struct swiotlb_deferred_put *dp = container_of(work, struct swiotlb_deferred_put, work);
> +
> +	put_device(dp->dev);
> +	kfree(dp);
> +}
> +
> +/**
> + * swiotlb_safe_put_device() - Safely release device reference from atomic/interrupt context
> + * @dev: The device structure to release.
> + *
> + * Enqueues a deferred put_device() call on a workqueue using GFP_ATOMIC.
> + * If memory allocation fails, the reference is leaked to avoid an immediate crash.
> + */
> +void swiotlb_safe_put_device(struct device *dev)
> +{
> +	struct swiotlb_deferred_put *dp;
> +
> +	if (!dev)
> +		return;
> +
> +	/*
> +	 * FAST PATH (O(1) lockless): If this is not the last reference,
> +	 * we can decrement it atomically and safely in any context
> +	 * without allocating memory or scheduling work!
> +	 */
> +	if (refcount_dec_not_one(&dev->kobj.kref.refcount))
> +		return;
> +
> +	/*
> +	 * SLOW PATH: It is the last reference (refcount == 1). We must
> +	 * defer the final put_device() to task context because it will
> +	 * trigger device_release() which can sleep.
> +	 */
> +	dp = kmalloc_obj(*dp, GFP_ATOMIC);
> +	if (dp) {
> +		INIT_WORK(&dp->work, swiotlb_deferred_put_work);
> +		dp->dev = dev;
> +		schedule_work(&dp->work);
> +	} else {
> +		pr_warn_ratelimited("swiotlb: failed to allocate deferred put, leaking device ref\n");
> +	}
> +}
> +EXPORT_SYMBOL_GPL(swiotlb_safe_put_device);
> +
> +unsigned int swiotlb_zc_tx_percent;
> +module_param_named(zerocopy_tx_percent, swiotlb_zc_tx_percent, uint, 0644);
> +
> +static unsigned long fast_mem_used(struct io_tlb_mem *mem)
> +{
> +#ifdef CONFIG_DEBUG_FS
> +	return mem_used(mem);
> +#else
> +	unsigned long last_j = READ_ONCE(mem->last_used_jiffies);
> +	unsigned long now = jiffies;
> +
> +	if (time_after(now, last_j + HZ / 100) &&
> +	    try_cmpxchg(&mem->last_used_jiffies, &last_j, now)) {
> +		WRITE_ONCE(mem->last_used_slots, mem_used(mem));
> +	}
> +	return READ_ONCE(mem->last_used_slots);
> +#endif
> +}
> +
> +/**
> + * swiotlb_alloc_pages() - Allocate long-lived contiguous pages from SWIOTLB pool
> + * @dev: Device which requires the SWIOTLB bounce buffers.
> + * @order: Allocation order (log2 of number of pages).
> + */
> +struct page *swiotlb_alloc_pages(struct device *dev, unsigned int order)
> +{
> +	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> +	struct io_tlb_pool *pool;
> +	int npages = 1 << order;
> +	unsigned int max_pct;
> +	phys_addr_t tlb_addr;
> +	struct page *page;
> +	int index;
> +
> +	if (!mem || !mem->nslabs)
> +		return NULL;
> +
> +	max_pct = clamp(READ_ONCE(swiotlb_zc_tx_percent), 0u, 90u);
> +	if (max_pct == 0 || max_pct * mem->nslabs <= fast_mem_used(mem) * 100)
> +		return NULL;
> +
> +	/*
> +	 * Enforce natural alignment for compound pages. The mask-based
> +	 * compound_head() optimization (used when HVO is enabled and struct page
> +	 * size is a power of 2) assumes that compound pages are naturally aligned
> +	 * to their size. Without this, compound_head() on tail pages can return
> +	 * a wrong head page pointer, leading to refcount corruption.
> +	 */
> +	index = swiotlb_find_slots(dev, 0, PAGE_SIZE * npages, ~(PAGE_MASK << order), &pool);
> +	if (index == -1)
> +		return NULL;
> +
> +	tlb_addr = slot_addr(pool->start, index);
> +
> +	pool->slots[index].pad_slots = 0;
> +	pool->slots[index].alloc_size = PAGE_SIZE * npages;
> +
> +	page = pfn_to_page(PHYS_PFN(tlb_addr));
> +
> +	set_page_count(page, 1);
> +
> +	/* Strictly tag page[0] to prevent clobbering folio tail overlays */
> +	__SetPageZCSwiotlb(page);
> +
> +	swiotlb_set_page_dev(page, dev);
> +	get_device(dev);
> +	swiotlb_prep_compound_page(page, order);
> +	return page;
> +}
> +EXPORT_SYMBOL_GPL(swiotlb_alloc_pages);
> +
> +/*
> + * Debugging to track how swiotlb_free_pages() was called.
> + * b2: 0 from __free_frozen_pages(), 1 from free_unref_folios()
> + * b1: pool found b0: dev present,
> + */
> +static unsigned long zc_debug[8];
> +static int ctrs_num = 8;
> +module_param_array(zc_debug, ulong, &ctrs_num, 0644);
> +static void __zc_debug_stats(bool where, bool has_dev, bool has_pool)
> +{
> +	zc_debug[has_dev + has_pool * 2 + where * 4]++;
> +}
> +
> +/**
> + * swiotlb_free_pages() - Free pages allocated via swiotlb_alloc_pages()
> + * @page: The starting struct page to release.
> + */
> +bool swiotlb_free_pages(struct page *page, bool where_debug_only)
> +{
> +	struct page *head = compound_head(page);
> +	struct device *dev = swiotlb_page_to_dev(head);
> +	phys_addr_t head_tlb_addr = page_to_phys(head);
> +	struct io_tlb_pool *pool;
> +	int index, npages, i;
> +
> +	if (!folio_test_zcswiotlb(page_folio(head)))
> +		return false;
> +
> +	pool = dev ? swiotlb_find_pool(dev, head_tlb_addr) : NULL;
> +	__zc_debug_stats(where_debug_only, !!dev, !!pool);
> +
> +	/* Check for any false positives. */
> +	if (!pool)
> +		return false;
> +
> +	/* Read alloc_size first, it is reset by swiotlb_release_slots(). */
> +	index = (head_tlb_addr - pool->start) >> IO_TLB_SHIFT;
> +	npages = pool->slots[index].alloc_size >> PAGE_SHIFT;
> +
> +	WARN_ON_ONCE(!is_power_of_2(npages));
> +
> +	/* Step 1: Sever compound links (clobbers compound_info / lru.next) */
> +	swiotlb_destroy_compound_page(head, ilog2(npages));
> +
> +	/* Step 2: Re-init LRU, drop refcounts, and strip flag across all constituent pages */
> +	for (i = 0; i < npages; i++) {
> +		INIT_LIST_HEAD(&head[i].lru);
> +		set_page_count(&head[i], 0);
> +		head[i].private = 0;
> +		__ClearPageZCSwiotlb(&head[i]);
> +	}
> +
> +	/* Step 3: Safely release slots back to the pool */
> +	swiotlb_release_slots(dev, head_tlb_addr, pool);
> +	swiotlb_del_transient(dev, head_tlb_addr, pool);
> +	swiotlb_safe_put_device(dev);
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(swiotlb_free_pages);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d49c254174da7..eaba683b5b2a8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -16,6 +16,7 @@
>  
>  #include <linux/stddef.h>
>  #include <linux/mm.h>
> +#include <linux/swiotlb.h>
>  #include <linux/highmem.h>
>  #include <linux/interrupt.h>
>  #include <linux/jiffies.h>
> @@ -705,6 +706,31 @@ void prep_compound_page(struct page *page, unsigned int order)
>  	prep_compound_head(page, order);
>  }
>  
> +#ifdef CONFIG_SWIOTLB
> +void swiotlb_prep_compound_page(struct page *page, unsigned int order)
> +{
> +	if (order > 0)
> +		prep_compound_page(page, order);
> +}

Gah.

> +
> +void swiotlb_destroy_compound_page(struct page *page, unsigned int order)
> +{
> +	if (order > 0) {
> +		struct folio *folio = (struct folio *)page;
> +
> +		__ClearPageHead(page);
> +		page[1].flags.f &= ~PAGE_FLAGS_SECOND;
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> +		folio->_nr_pages = 0;
> +#endif
> +		for (int i = 1; i < (1 << order); i++) {
> +			page[i].mapping = NULL;
> +			clear_compound_head(&page[i]);
> +		}
> +	}
> +}

Gah.

> +#endif /* CONFIG_SWIOTLB */
> +
>  static inline void set_buddy_order(struct page *page, unsigned int order)
>  {
>  	set_page_private(page, order);
> @@ -2930,6 +2956,9 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>  	unsigned long pfn = page_to_pfn(page);
>  	int migratetype;
>  
> +	if (unlikely(swiotlb_free_pages(page, false)))
> +		return;
> +

Oh my.

We shouldn't be handling random swiotlb stuff in the page allocator like that.

IIUC, you are writing your own pool+allocator, roughly mimicking what hugetlb +
ZONE_DEVICE do.

The creation+destruction of compound pages should very likely be factored out
from other code in a type-unspecific fashion, if really required.

You should probably look into

https://lore.kernel.org/all/20250318161823.4005529-2-tabba@google.com/

to see how to possibly hook into the page freeing path in a cleaner way.

-- 
Cheers,

David


* Re: [PATCH net-next] virtio-net: support xsk wake up
From: Eugenio Perez Martin @ 2026-06-16  8:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Menglong Dong, xuanzhuo, mst, jasowang, andrew+netdev, davem,
	edumazet, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <20260613144612.0c5b7ba4@kernel.org>

On Sat, Jun 13, 2026 at 11:46 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 10 Jun 2026 10:27:28 +0200 Eugenio Perez Martin wrote:
> > And the From and Signed-off-by emails don't match, which I'm not sure is valid.
>
> It's clearly the same person. Please focus on the code, not trivial
> process issues.
>
> Quoting documentation:
>
>   Reviewer guidance
>   -----------------
>
>   [...]
>
>   Reviewers are highly encouraged to do more in-depth review of submissions
>   and not focus exclusively on process issues, trivial or subjective
>   matters like code formatting, tags etc.
>
> See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#reviewer-guidance
>

Ack'd, it was just a nitpick since the fixes tag was already needed.
Thanks for the doc pointer, I agree with that so I'll try to avoid
these nits in the future!



* RE: [Intel-wired-lan] [PATCH net] ice: Fix use-after-scope in ice_sched_add_nodes_to_layer()
From: Kwapulinski, Piotr @ 2026-06-16  8:27 UTC (permalink / raw)
  To: NeKon69, Nguyen, Anthony L, Kitszel, Przemyslaw
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, victor.raj@intel.com,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260613101440.80190-1-nobodqwe@gmail.com>

>-----Original Message-----
>From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of NeKon69
>Sent: Saturday, June 13, 2026 12:15 PM
>To: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>
>Cc: andrew+netdev@lunn.ch; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; victor.raj@intel.com; intel-wired-lan@lists.osuosl.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; NeKon69 <nobodqwe@gmail.com>
>Subject: [Intel-wired-lan] [PATCH net] ice: Fix use-after-scope in ice_sched_add_nodes_to_layer()
>
>Commit 7fb09a737536 ("ice: Modify recursive way of adding nodes") changed ice_sched_add_nodes_to_layer() from recursive control flow to an iterative loop.
>
>Inside the loop, first_teid_ptr may be set to the address of a block-local variable:
>
>	u32 temp;
>	...
>	if (num_added)
>		first_teid_ptr = &temp;
>
>On the next loop iteration, first_teid_ptr may be passed to ice_sched_add_nodes_to_hw_layer(), after temp from the previous iteration has gone out of scope.
>
>Move temp outside the loop so the pointer remains valid for the lifetime of ice_sched_add_nodes_to_layer().
>
>This was found by Clang with LifetimeSafety enabled while testing C language support on a Linux allmodconfig build.
>
>Fixes: 7fb09a737536 ("ice: Modify recursive way of adding nodes")
>Link: https://github.com/llvm/llvm-project/pull/203270
>Signed-off-by: NeKon69 <nobodqwe@gmail.com>
>---
> drivers/net/ethernet/intel/ice/ice_sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c b/drivers/net/ethernet/intel/ice/ice_sched.c
>index fff0c1afdb41..089ad3967be5 100644
>--- a/drivers/net/ethernet/intel/ice/ice_sched.c
>+++ b/drivers/net/ethernet/intel/ice/ice_sched.c
>@@ -1074,11 +1074,11 @@ ice_sched_add_nodes_to_layer(struct ice_port_info *pi,
> 	u32 *first_teid_ptr = first_node_teid;
> 	u16 new_num_nodes = num_nodes;
> 	int status = 0;
>+	u32 temp;
> 
> 	*num_nodes_added = 0;
> 	while (*num_nodes_added < num_nodes) {
> 		u16 max_child_nodes, num_added = 0;
>-		u32 temp;
> 
> 		status = ice_sched_add_nodes_to_hw_layer(pi, tc_node, parent,
> 							 layer,	new_num_nodes,
>--
>2.54.0
>

Reviewed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
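
For readers following the thread, the pattern the patch addresses can be
sketched outside the kernel. This is only an illustration under stated
assumptions: add_batch(), add_nodes() and the starting id 100 are hypothetical
stand-ins, not the ice driver's functions or values. The point it shows is that
a pointer carried from one loop iteration into the next must refer to storage
declared outside the loop body.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for ice_sched_add_nodes_to_hw_layer(): adds at most
 * `batch` nodes per call and reports the id of the first one via *first_id.
 */
static uint16_t add_batch(uint16_t want, uint16_t batch, uint32_t next_id,
			  uint32_t *first_id)
{
	uint16_t n = want < batch ? want : batch;

	if (n && first_id)
		*first_id = next_id;
	return n;
}

/* Hypothetical caller that mirrors the shape of the fixed loop. */
uint16_t add_nodes(uint16_t num_nodes, uint16_t batch, uint32_t *first_node_id)
{
	uint32_t *first_id_ptr = first_node_id;
	uint32_t next_id = 100;	/* arbitrary starting id for the sketch */
	uint16_t added = 0;
	uint32_t temp = 0;	/* function scope: valid in every iteration */

	while (added < num_nodes) {
		uint16_t n = add_batch((uint16_t)(num_nodes - added), batch,
				       next_id, first_id_ptr);

		if (n == 0)
			break;
		added += n;
		next_id += n;
		/*
		 * Later batches must not overwrite the caller's first id, so
		 * they write into temp. If temp were declared inside the loop
		 * body, this pointer would dangle on the next iteration.
		 */
		first_id_ptr = &temp;
	}
	return added;
}
```

With a batch size of 2 and five nodes requested, the caller's first id is
written only by the first batch; the two later batches write into temp.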



* Re: [PATCH v2 4/5] binder: Remove mmap_lock fallback
From: Alice Ryhl @ 2026-06-16  8:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Andrew Morton, Arve Hjønnevåg,
	Carlos Llamas, Christian Brauner, David Ahern, David S. Miller,
	Greg Kroah-Hartman, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
	netdev, Shakeel Butt, Suren Baghdasaryan, Todd Kjos,
	Vlastimil Babka
In-Reply-To: <20260610230417.77D64DBB@davehans-spike.ostc.intel.com>

On Wed, Jun 10, 2026 at 04:04:17PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Previously, the per-VMA locking could fail in the face of writers
> which necessitate a fallback to mmap_lock. The new
> vma_start_read_unlocked() will wait for writers instead of failing.
> 
> Use the new helper. Wait for writers. Remove the fallback to mmap_lock.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Acked-by: Lorenzo Stoakes <ljs@kernel.org>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Arve Hjønnevåg <arve@android.com>
> Cc: Todd Kjos <tkjos@android.com>
> Cc: Christian Brauner <christian@brauner.io>
> Cc: Carlos Llamas <cmllamas@google.com>
> Cc: Alice Ryhl <aliceryhl@google.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: David Ahern <dsahern@kernel.org>
> Cc: netdev@vger.kernel.org
> 
> ---
> 
>  b/drivers/android/binder_alloc.c |   17 +++++------------
>  1 file changed, 5 insertions(+), 12 deletions(-)
> 
> diff -puN drivers/android/binder_alloc.c~binder-vma-waiter drivers/android/binder_alloc.c
> --- a/drivers/android/binder_alloc.c~binder-vma-waiter	2026-06-10 15:57:56.419452721 -0700
> +++ b/drivers/android/binder_alloc.c	2026-06-10 15:57:56.423452863 -0700
> @@ -259,21 +259,14 @@ static int binder_page_insert(struct bin
>  	struct vm_area_struct *vma;
>  	int ret = -ESRCH;
>  
> -	/* attempt per-vma lock first */
> -	vma = lock_vma_under_rcu(mm, addr);
> -	if (vma) {
> -		if (binder_alloc_is_mapped(alloc))
> -			ret = vm_insert_page(vma, addr, page);
> -		vma_end_read(vma);
> +	vma = vma_start_read_unlocked(mm, addr);
> +	if (!vma)
>  		return ret;
> -	}
>  
> -	/* fall back to mmap_lock */
> -	mmap_read_lock(mm);
> -	vma = vma_lookup(mm, addr);
> -	if (vma && binder_alloc_is_mapped(alloc))
> +	if (binder_alloc_is_mapped(alloc))
>  		ret = vm_insert_page(vma, addr, page);
> -	mmap_read_unlock(mm);
> +
> +	vma_end_read(vma);
>  
>  	return ret;
>  }
> _

It would be nice if we could update Rust Binder as well.

diff --git a/drivers/android/binder/page_range.rs b/drivers/android/binder/page_range.rs
index e54a90e62402..8d56b991744f 100644
--- a/drivers/android/binder/page_range.rs
+++ b/drivers/android/binder/page_range.rs
@@ -439,22 +439,9 @@ unsafe fn use_page_slow(&self, i: usize) -> Result<()> {
         // workqueue.
         let mm = MmWithUser::into_mmput_async(self.mm.mmget_not_zero().ok_or(ESRCH)?);
         {
-            let vma_read;
-            let mmap_read;
-            let vma = if let Some(ret) = mm.lock_vma_under_rcu(vma_addr) {
-                vma_read = ret;
-                check_vma(&vma_read, self)
-            } else {
-                mmap_read = mm.mmap_read_lock();
-                mmap_read
-                    .vma_lookup(vma_addr)
-                    .and_then(|vma| check_vma(vma, self))
-            };
-
-            match vma {
-                Some(vma) => vma.vm_insert_page(user_page_addr, &new_page)?,
-                None => return Err(ESRCH),
-            }
+            let vma_read_guard = mm.vma_start_read_unlocked(vma_addr).ok_or(ESRCH)?;
+            let vma = check_vma(&vma_read_guard, self).ok_or(ESRCH)?;
+            vma.vm_insert_page(user_page_addr, &new_page)?;
         }
 
         let inner = self.lock.lock();
diff --git a/rust/kernel/mm.rs b/rust/kernel/mm.rs
index 16f617d11479..2973718af48e 100644
--- a/rust/kernel/mm.rs
+++ b/rust/kernel/mm.rs
@@ -188,6 +188,24 @@ pub fn lock_vma_under_rcu(&self, vma_addr: usize) -> Option<VmaReadGuard<'_>> {
         })
     }
 
+    /// Find the VMA covering 'address' and lock it for reading. Waits for writers to finish if the
+    /// VMA is being modified.
+    #[inline]
+    pub fn vma_start_read_unlocked(&self, vma_addr: usize) -> Option<VmaReadGuard<'_>> {
+        // SAFETY: We may invoke `vma_start_read_unlocked` because we know this `mm` has non-zero
+        // `mm_users`.
+        let vma = unsafe { bindings::vma_start_read_unlocked(self.as_raw(), vma_addr) };
+        if vma.is_null() {
+            return None;
+        }
+        Some(VmaReadGuard {
+            // SAFETY: If `vma_start_read_unlocked` returns a non-null ptr, then it points at a
+            // valid vma. The vma is stable for as long as the vma read lock is held.
+            vma: unsafe { VmaRef::from_raw(vma) },
+            _nts: NotThreadSafe,
+        })
+    }
+
     /// Lock the mmap read lock.
     #[inline]
     pub fn mmap_read_lock(&self) -> MmapReadGuard<'_> {

Alice


* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: kernel test robot @ 2026-06-16  8:01 UTC (permalink / raw)
  To: Luigi Rizzo, rizzo.unipi, m.szyprowski, robin.murphy, willemb,
	kuniyu, davem, edumazet, kuba, pabeni
  Cc: llvm, oe-kbuild-all, gregkh, rafael, akpm, david, netdev,
	linux-mm, iommu, driver-core, linux-kernel
In-Reply-To: <20260615234220.3946885-1-lrizzo@google.com>

Hi Luigi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v7.1 next-20260615]
[cannot apply to driver-core/driver-core-testing driver-core/driver-core-next driver-core/driver-core-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/swiotlb-avoid-double-copy-with-swiotlb-on-tx-socket/20260616-074655
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260615234220.3946885-1-lrizzo%40google.com
patch subject: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
config: loongarch-allnoconfig (https://download.01.org/0day-ci/archive/20260616/202606161519.z7SY98jp-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260616/202606161519.z7SY98jp-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606161519.z7SY98jp-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/page_alloc.c:721:17: warning: unused variable 'folio' [-Wunused-variable]
     721 |                 struct folio *folio = (struct folio *)page;
         |                               ^~~~~
   1 warning generated.
--
>> Warning: kernel/dma/swiotlb.c:95 cannot understand function prototype: 'atomic_t global_device_serial = ATOMIC_INIT(0);'
>> Warning: kernel/dma/swiotlb.c:2087 function parameter 'where_debug_only' not described in 'swiotlb_free_pages'
>> Warning: kernel/dma/swiotlb.c:2087 function parameter 'where_debug_only' not described in 'swiotlb_free_pages'


vim +/folio +721 mm/page_alloc.c

   717	
   718	void swiotlb_destroy_compound_page(struct page *page, unsigned int order)
   719	{
   720		if (order > 0) {
 > 721			struct folio *folio = (struct folio *)page;
   722	
   723			__ClearPageHead(page);
   724			page[1].flags.f &= ~PAGE_FLAGS_SECOND;
   725	#ifdef NR_PAGES_IN_LARGE_FOLIO
   726			folio->_nr_pages = 0;
   727	#endif
   728			for (int i = 1; i < (1 << order); i++) {
   729				page[i].mapping = NULL;
   730				clear_compound_head(&page[i]);
   731			}
   732		}
   733	}
   734	#endif /* CONFIG_SWIOTLB */
   735	
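
One possible way to address the -Wunused-variable warning above is to declare
the variable inside the same #ifdef that uses it. The sketch below shows that
idea outside the kernel and is an assumption, not the author's fix: struct
mini_page, mini_destroy_compound() and MINI_TRACK_NR_PAGES are hypothetical
names standing in for struct page, swiotlb_destroy_compound_page() and
NR_PAGES_IN_LARGE_FOLIO.

```c
#include <stddef.h>

/* Hypothetical miniature of a page descriptor; not the kernel's layout. */
struct mini_page {
	unsigned long flags;
	void *mapping;
	unsigned long nr_pages;	/* meaningful on the head page only */
};

void mini_destroy_compound(struct mini_page *page, unsigned int order)
{
	if (order == 0)
		return;

#ifdef MINI_TRACK_NR_PAGES
	/*
	 * Declared inside the #ifdef that uses it, so a build without
	 * MINI_TRACK_NR_PAGES never sees an unused variable.
	 */
	struct mini_page *head = page;

	head->nr_pages = 0;
#endif
	/* Clear the per-tail state, as the loop in the reported code does. */
	for (unsigned int i = 1; i < (1u << order); i++)
		page[i].mapping = NULL;
}
```

The sketch compiles without the warning whether or not MINI_TRACK_NR_PAGES is
defined, because the declaration and its only use are under the same condition.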

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* RE: [PATCH net] tipc: free bearer discoverer via RCU to fix tipc_disc_rcv UAF
From: Tung Quang Nguyen @ 2026-06-16  7:50 UTC (permalink / raw)
  To: Samuel Page
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev@vger.kernel.org,
	tipc-discussion@lists.sourceforge.net,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org, Jon Maloy
In-Reply-To: <20260615150009.1734270-1-sam@bynar.io>

>Subject: [PATCH net] tipc: free bearer discoverer via RCU to fix tipc_disc_rcv
>UAF
>
>bearer_disable() tears down a bearer's discovery object with
>tipc_disc_delete(), which frees the struct tipc_discoverer with a plain,
>synchronous kfree(). The discovery receive path, however, still reads that
>object under RCU in softirq context:
>
>  tipc_udp_recv()            // udp_media.c, rcu_dereference(ub->bearer)
>    -> tipc_rcv()            // node.c
>      -> tipc_disc_rcv()     // discover.c
>        -> tipc_disc_addr_trial_msg(b->disc, ...)  // reads d->net etc.
>
>tipc_udp_recv() only gates this path on test_bit(0, &b->up), which is a TOCTOU
>check: an RX softirq that observes b->up == 1 before
>bearer_disable() does clear_bit_unlock(0, &b->up) can still be executing inside
>tipc_disc_rcv() when bearer_disable() reaches
>
>	if (b->disc)
>		tipc_disc_delete(b->disc);
>
>and kfree()s the discoverer. The reader then dereferences freed memory
>(d->net, inlined into tipc_disc_rcv()) in softirq context [0].
>
>The bearer itself is freed RCU-safely (tipc_bearer_put() -> kfree_rcu(b, rcu))
>because the RX path runs under RCU, but the discoverer hanging off b->disc is
>freed synchronously. The same b->disc is also touched under rcu_read_lock()
>by tipc_disc_add_dest()/tipc_disc_remove_dest().
>
>Free the discoverer with the same RCU lifetime as its bearer. Add an rcu_head
>to struct tipc_discoverer and defer the kfree_skb()/kfree() to an RCU callback
>so any in-flight reader that already loaded b->disc completes before the
>memory is released. The timer is still shut down synchronously up front with
>timer_shutdown_sync() (which can sleep and must not run from the RCU
>callback), and shutting it down before the grace period prevents the periodic
>LINK_REQUEST timer from rearming or re-entering the object.
>
>This mirrors the existing TIPC pattern of pairing call_rcu() with a cleanup
>callback (see tipc_node_free()/tipc_aead_free()).
>
>[0]: (trailing page/memory-state dump trimmed)
>BUG: KASAN: slab-use-after-free in tipc_disc_addr_trial_msg net/tipc/discover.c:149 [inline]
>BUG: KASAN: slab-use-after-free in tipc_disc_rcv+0xe7c/0x103c net/tipc/discover.c:236
>Read of size 8 at addr ffff000028f07428 by task ksoftirqd/0/15
>
>CPU: 0 UID: 0 PID: 15 Comm: ksoftirqd/0 Not tainted 7.0.11 #3 PREEMPT
>Hardware name: linux,dummy-virt (DT)
>Call trace:
> show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:499 (C)
> __dump_stack lib/dump_stack.c:94 [inline]
> dump_stack_lvl+0xb4/0xd4 lib/dump_stack.c:120
> print_address_description mm/kasan/report.c:378 [inline]
> print_report+0x118/0x5d8 mm/kasan/report.c:482
> kasan_report+0xb0/0xf4 mm/kasan/report.c:595
> __asan_report_load8_noabort+0x20/0x2c mm/kasan/report_generic.c:381
> tipc_disc_addr_trial_msg net/tipc/discover.c:149 [inline]
> tipc_disc_rcv+0xe7c/0x103c net/tipc/discover.c:236
> tipc_rcv+0x1884/0x2b1c net/tipc/node.c:2126
> tipc_udp_recv+0x22c/0x684 net/tipc/udp_media.c:393
> udp_queue_rcv_one_skb+0x898/0x1798 net/ipv4/udp.c:2441
> udp_queue_rcv_skb+0x1b0/0xa44 net/ipv4/udp.c:2518
> udp_unicast_rcv_skb+0x13c/0x348 net/ipv4/udp.c:2678
> __udp4_lib_rcv+0x1aec/0x246c net/ipv4/udp.c:2754
> udp_rcv+0x78/0xa0 net/ipv4/udp.c:2936
> ip_protocol_deliver_rcu+0x68/0x410 net/ipv4/ip_input.c:207
> ip_local_deliver_finish+0x28c/0x4b4 net/ipv4/ip_input.c:241
> NF_HOOK include/linux/netfilter.h:318 [inline]
> NF_HOOK include/linux/netfilter.h:312 [inline]
> ip_local_deliver+0x29c/0x2ec net/ipv4/ip_input.c:262
> dst_input include/net/dst.h:480 [inline]
> ip_rcv_finish net/ipv4/ip_input.c:453 [inline]
> ip_rcv_finish net/ipv4/ip_input.c:439 [inline]
> NF_HOOK include/linux/netfilter.h:318 [inline]
> NF_HOOK include/linux/netfilter.h:312 [inline]
> ip_rcv+0x21c/0x258 net/ipv4/ip_input.c:573
> __netif_receive_skb_one_core+0x110/0x184 net/core/dev.c:6195
> __netif_receive_skb+0x2c/0x170 net/core/dev.c:6308
> process_backlog+0x178/0x488 net/core/dev.c:6659
> __napi_poll+0xa8/0x540 net/core/dev.c:7726
> napi_poll net/core/dev.c:7789 [inline]
> net_rx_action+0x360/0x964 net/core/dev.c:7946
> handle_softirqs+0x2f0/0x7b0 kernel/softirq.c:622
> run_ksoftirqd kernel/softirq.c:1063 [inline]
> run_ksoftirqd+0x6c/0x88 kernel/softirq.c:1055
> smpboot_thread_fn+0x65c/0x958 kernel/smpboot.c:160
> kthread+0x39c/0x444 kernel/kthread.c:436
> ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860
>
>Allocated by task 68873:
> kasan_save_stack+0x3c/0x64 mm/kasan/common.c:57
> kasan_save_track+0x20/0x3c mm/kasan/common.c:78
> kasan_save_alloc_info+0x40/0x54 mm/kasan/generic.c:570
> poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
> __kasan_kmalloc+0xd4/0xd8 mm/kasan/common.c:415
> kasan_kmalloc include/linux/kasan.h:263 [inline]
> __kmalloc_cache_noprof+0x1b0/0x458 mm/slub.c:5385
> kmalloc_noprof include/linux/slab.h:950 [inline]
> tipc_disc_create+0xdc/0x5e0 net/tipc/discover.c:356
> tipc_enable_bearer+0x8b8/0xf94 net/tipc/bearer.c:348
> __tipc_nl_bearer_enable+0x2a8/0x398 net/tipc/bearer.c:1047
> tipc_nl_bearer_enable+0x2c/0x48 net/tipc/bearer.c:1056
> genl_family_rcv_msg_doit+0x1e4/0x2c0 net/netlink/genetlink.c:1114
> genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
> genl_rcv_msg+0x4e8/0x750 net/netlink/genetlink.c:1209
> netlink_rcv_skb+0x204/0x3cc net/netlink/af_netlink.c:2550
> genl_rcv+0x3c/0x54 net/netlink/genetlink.c:1218
> netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
> netlink_unicast+0x638/0x930 net/netlink/af_netlink.c:1344
> netlink_sendmsg+0x798/0xc68 net/netlink/af_netlink.c:1894
> sock_sendmsg_nosec net/socket.c:727 [inline]
> __sock_sendmsg+0xe0/0x128 net/socket.c:742
> __sys_sendto+0x230/0x2f4 net/socket.c:2206
> __do_sys_sendto net/socket.c:2213 [inline]
> __se_sys_sendto net/socket.c:2209 [inline]
> __arm64_sys_sendto+0xc4/0x13c net/socket.c:2209
> __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
> invoke_syscall+0x84/0x2a8 arch/arm64/kernel/syscall.c:49
> el0_svc_common.constprop.0+0xe4/0x294 arch/arm64/kernel/syscall.c:132
> do_el0_svc+0x44/0x5c arch/arm64/kernel/syscall.c:151
> el0_svc+0x38/0xac arch/arm64/kernel/entry-common.c:724
> el0t_64_sync_handler+0xa0/0xe4 arch/arm64/kernel/entry-common.c:743
> el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:596
>
>Freed by task 60072:
> kasan_save_stack+0x3c/0x64 mm/kasan/common.c:57
> kasan_save_track+0x20/0x3c mm/kasan/common.c:78
> kasan_save_free_info+0x4c/0x74 mm/kasan/generic.c:584
> poison_slab_object mm/kasan/common.c:253 [inline]
> __kasan_slab_free+0x88/0xb8 mm/kasan/common.c:285
> kasan_slab_free include/linux/kasan.h:235 [inline]
> slab_free_hook mm/slub.c:2685 [inline]
> slab_free mm/slub.c:6170 [inline]
> kfree+0x14c/0x458 mm/slub.c:6488
> tipc_disc_delete+0x50/0x68 net/tipc/discover.c:393
> bearer_disable+0x18c/0x278 net/tipc/bearer.c:418
> tipc_bearer_stop+0xe0/0x198 net/tipc/bearer.c:757
> tipc_net_stop+0x110/0x178 net/tipc/net.c:159
> tipc_exit_net+0x80/0x19c net/tipc/core.c:112
> ops_exit_list net/core/net_namespace.c:199 [inline]
> ops_undo_list+0x244/0x694 net/core/net_namespace.c:252
> cleanup_net+0x3a0/0x830 net/core/net_namespace.c:702
> process_one_work+0x628/0xd38 kernel/workqueue.c:3289
> process_scheduled_works kernel/workqueue.c:3372 [inline]
> worker_thread+0x7a8/0xac0 kernel/workqueue.c:3453
> kthread+0x39c/0x444 kernel/kthread.c:436
> ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860
>
>Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash values")
>Cc: stable@vger.kernel.org
>Assisted-by: Bynario AI
>Signed-off-by: Samuel Page <sam@bynar.io>
>---
> net/tipc/discover.c | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
>
>diff --git a/net/tipc/discover.c b/net/tipc/discover.c
>index 3e54d2df5683..844975b691ef 100644
>--- a/net/tipc/discover.c
>+++ b/net/tipc/discover.c
>@@ -49,6 +49,7 @@
>
> /**
>  * struct tipc_discoverer - information about an ongoing link setup request
>+ * @rcu: RCU head used to free the structure after a grace period
>  * @bearer_id: identity of bearer issuing requests
>  * @net: network namespace instance
>  * @dest: destination address for request messages
>@@ -60,6 +61,7 @@
>  * @timer_intv: current interval between requests (in ms)
>  */
> struct tipc_discoverer {
>+	struct rcu_head rcu;
> 	u32 bearer_id;
> 	struct tipc_media_addr dest;
> 	struct net *net;
>@@ -382,6 +384,17 @@ int tipc_disc_create(struct net *net, struct tipc_bearer *b,
> 	return 0;
> }
>
>+/* RCU callback: free the discoverer only after any concurrent
>+ * tipc_disc_rcv() softirq reader of bearer->disc has finished.
>+ */
>+static void tipc_disc_free_rcu(struct rcu_head *rp)
>+{
>+	struct tipc_discoverer *d = container_of(rp, struct tipc_discoverer, rcu);

A similar patch was submitted 6 days ago: https://patchwork.kernel.org/project/netdevbpf/patch/20260610153349.2546041-2-bestswngs@gmail.com/

I have not received an updated patch from that submitter yet.
Your patch has the same coding style issue (a line longer than 80 columns); see linux/Documentation/process/coding-style.rst.

If you break the long line into two lines and resubmit, I think I can acknowledge your patch.

>+
>+	kfree_skb(d->skb);
>+	kfree(d);
>+}
>+
> /**
>  * tipc_disc_delete - destroy object sending periodic link setup requests
>  * @d: ptr to link dest structure
>@@ -389,8 +402,7 @@ int tipc_disc_create(struct net *net, struct tipc_bearer *b,
> void tipc_disc_delete(struct tipc_discoverer *d)
> {
> 	timer_shutdown_sync(&d->timer);
>-	kfree_skb(d->skb);
>-	kfree(d);
>+	call_rcu(&d->rcu, tipc_disc_free_rcu);
> }
>
> /**
>
>base-commit: 47186409c092cd7dd70350999186c700233e854d
>--
>2.54.0
>

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Hildenbrand (Arm) @ 2026-06-16  7:38 UTC (permalink / raw)
  To: Joanne Koong, Askar Safin
  Cc: akpm, axboe, bernd, brauner, dhowells, fuse-devel, hch, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos, netdev,
	patches, pfalcato, rostedt, torvalds, val, viro, willy
In-Reply-To: <CAJnrk1Z_V8TShvWV6zwTMQqXra3J4J5CL5ofFMm9DGoLj9whEw@mail.gmail.com>

On 6/16/26 08:38, Joanne Koong wrote:
> On Mon, Jun 15, 2026 at 9:15 PM Askar Safin <safinaskar@gmail.com> wrote:
>>
>> Joanne Koong <joannelkoong@gmail.com>:
>>>
>>> I think this is because of how libfuse handles eof / short reads. When
>>> it detects a short read, it fixes up the header length after the
>>> header was already vmspliced to the pipe because it assumes vmsplice
>>> mapped the header's page into the pipe by reference. It assumes that
>>> modifying the header length in place gets then reflected in what the
>>> pipe later splices out.
>>>
>>> The logic for this happens in fuse_send_data_iov() [1]:
>>> a) sets out->len = headerlen (16) + len (16384) = 16400 in the
>>> stack-allocated fuse_out_header
>>> b) vmsplices the header to the pipe
>>> c) splices the backing file to the pipe. if this hits EOF, it'll get
>>> back 15510 instead of 16384
>>> d) detects the short read [2], fixes up the stack out->len = 16 + 15510 = 15526
>>> e) splices the pipe to /dev/fuse
>>>
>>> After this patch, step b) is a straight copy which means step d)'s
>>> fixup doesn't modify what's in the pipe. This could be fixed up in
>>> libfuse to not depend on modify-after-vmsplice, but I don't think this
>>> helps for applications using already-released libfuse versions. I
>>> think this patch needs to be reverted.
>>>
>>> Thanks,
>>> Joanne
>>>
>>> [1] https://github.com/libfuse/libfuse/blob/master/lib/fuse_lowlevel.c#L846
>>> [2] https://github.com/libfuse/libfuse/blob/master/lib/fuse_lowlevel.c#L956
>>
>> Uh, this is very unfortunate. But I still want to remove vmsplice.
>> Maybe we can somehow save my patchsets? For example, let's return EINVAL
>> for this particular combination (writable pipe + SPLICE_F_NONBLOCK).
> 
> writable pipe + SPLICE_F_NONBLOCK is a valid vmsplice call today, so I
> think returning -EINVAL would still cause regressions.

I recall that, after the vmsplice vs. fork security issue happened, vmsplice was
blocked in some container runtimes. For example, [1] still seems to disable it;
that rule was added in 2021 [2].

So maybe one could at least assume that many containerized workloads should be
able to deal with vmsplice not being available nowadays. But in the general case
I'm afraid Joanne is right.

[1] https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json
[2] https://github.com/containers/common/commit/7ced5daafa0e36102eb931050ba3ff99f42bdfac

-- 
Cheers,

David

^ permalink raw reply

* net: thunderbolt: tbnet_poll() can overflow skb_shinfo()->frags[]
From: Maoyi Xie @ 2026-06-16  7:34 UTC (permalink / raw)
  To: Mika Westerberg, Yehezkel Bernat
  Cc: Andrew Lunn, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel

Hi all,

After the recent skb frags[] overflow fixes (t7xx, cdc-phonet, f_phonet), I
went looking for the same pattern. I think tbnet_poll() in
drivers/net/thunderbolt/main.c has it too. I would appreciate it if you could
take a look.

tbnet_poll() reassembles a ThunderboltIP packet that spans several frames into
one skb. It adds one rx fragment per frame.

	skb = net->skb;
	if (!skb) {
		skb = build_skb(...);
		...
		net->skb = skb;
	} else {
		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
				page, hdr_size, frame_size,
				TBNET_RX_PAGE_SIZE - hdr_size);
	}

Nothing checks skb_shinfo(skb)->nr_frags against MAX_SKB_FRAGS here. The frame
count comes from the peer, in the frame header. tbnet_check_frame() only bounds
it at the start of a packet.

	if (frame_count == 0 || frame_count > TBNET_RING_SIZE / 4) {
		net->stats.rx_length_errors++;
		return false;
	}

TBNET_RING_SIZE is 256, so frame_count can be as large as 64. MAX_SKB_FRAGS is 17
by default. Frame 0 builds the skb and every frame after it adds a fragment, so
nr_frags can reach 63. Once nr_frags hits MAX_SKB_FRAGS, skb_add_rx_frag() writes
one entry past skb_shinfo()->frags[]. The frame_size and MTU checks do not stop
this. With small frames, 64 fragments stay well under TBNET_MAX_MTU.

So a malicious or buggy peer can send a packet with frame_count between 19 and
64. The frames only need to increment the way tbnet_check_frame() wants. That
drives nr_frags past frags[] and overruns skb_shared_info.
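
The arithmetic above can be checked with a small user-space sketch. It is
only a model of the numbers quoted in this report, not driver code: the two
constants are copied from the text (MAX_SKB_FRAGS is configurable, 17 being
the default mentioned above), and the helper names are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as quoted in the report; MAX_SKB_FRAGS depends on kernel config. */
#define TBNET_RING_SIZE 256
#define MAX_SKB_FRAGS   17

/* Largest frame_count that the tbnet_check_frame() bound lets through. */
static inline unsigned int max_accepted_frame_count(void)
{
	return TBNET_RING_SIZE / 4;
}

/* Frame 0 becomes the skb head, so n frames need n - 1 fragments. */
static inline unsigned int frags_needed(unsigned int frame_count)
{
	return frame_count ? frame_count - 1 : 0;
}

/* True when reassembly would need more slots than frags[] provides. */
static inline bool overflows_frags(unsigned int frame_count)
{
	return frags_needed(frame_count) > MAX_SKB_FRAGS;
}

/* Hypothetical self-check that restates the figures from the report. */
static inline void check_report_numbers(void)
{
	assert(max_accepted_frame_count() == 64);
	assert(frags_needed(64) == 63);
	assert(!overflows_frags(18));	/* 17 fragments still fit */
	assert(overflows_frags(19));	/* 18th fragment is out of bounds */
}
```

Under these assumptions the smallest overflowing frame_count is
MAX_SKB_FRAGS + 2, which matches the 19..64 range given above.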

The fix I had in mind mirrors f0813bcd2d9d ("net: wwan: t7xx: fix potential
skb->frags overflow in RX path") and 600dc40554dc ("net: usb: cdc-phonet: fix
skb frags[] overflow in rx_complete()"). Add the fragment only while there is
room, and drop the packet otherwise.

	-	} else {
	+	} else if (skb_shinfo(skb)->nr_frags < MAX_SKB_FRAGS) {
			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
					page, hdr_size, frame_size,
					TBNET_RX_PAGE_SIZE - hdr_size);
	+	} else {
	+		net->stats.rx_length_errors++;
	+		__free_pages(page, TBNET_RX_PAGE_ORDER);
	+		dev_kfree_skb_any(net->skb);
	+		net->skb = NULL;
	+		continue;
		}
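
To illustrate how the guard behaves, here is a user-space model of the
reassembly loop with the proposed check. It is a sketch under simplifying
assumptions: struct model_skb and struct model_stats are hypothetical
stand-ins for the real skb and net->stats, page handling is left out, and the
model stops at the first dropped frame rather than following what the driver
does with the remaining frames of that packet.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SKB_FRAGS 17	/* default value cited in the report */

/* Hypothetical stand-in for the skb being reassembled. */
struct model_skb {
	bool active;		/* models net->skb != NULL */
	unsigned int nr_frags;	/* models skb_shinfo(skb)->nr_frags */
};

/* Hypothetical stand-in for the counters touched by the guard. */
struct model_stats {
	unsigned int rx_length_errors;
	unsigned int delivered;
};

/* Feed one frame into the model; returns false if the guard drops it. */
static inline bool model_rx_frame(struct model_skb *skb,
				  struct model_stats *st,
				  unsigned int index, unsigned int count)
{
	if (!skb->active) {
		/* First frame: becomes the skb head, no fragment used. */
		skb->active = true;
		skb->nr_frags = 0;
	} else if (skb->nr_frags < MAX_SKB_FRAGS) {
		/* Room left: models skb_add_rx_frag(). */
		skb->nr_frags++;
	} else {
		/* No room: count the error and drop the partial packet. */
		st->rx_length_errors++;
		skb->active = false;
		skb->nr_frags = 0;
		return false;
	}

	/* The guard keeps the fragment count within the array bounds. */
	assert(skb->nr_frags <= MAX_SKB_FRAGS);

	if (index == count - 1) {
		/* Last frame: hand the packet up and reset the state. */
		st->delivered++;
		skb->active = false;
	}
	return true;
}

/* Run a whole packet of `count` frames through the model. */
static inline bool model_rx_packet(struct model_stats *st, unsigned int count)
{
	struct model_skb skb = { 0 };
	unsigned int i;

	for (i = 0; i < count; i++)
		if (!model_rx_frame(&skb, st, i, count))
			return false;
	return true;
}
```

In this model a packet of 18 frames is delivered, while packets of 19 or more
frames are dropped with rx_length_errors incremented once per packet.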

I do not have two Thunderbolt hosts, so this is from reading the code. I can put
together a focused reproducer if that helps.

Does this look like a real overflow? And is the MAX_SKB_FRAGS guard the right
place, or would you rather tighten the frame_count bound in tbnet_check_frame()?
The missing check has been there since the driver was added (e69b6c02b4c3), so
this is a stable candidate. Happy to send a proper patch once you confirm.

Thanks,
Maoyi
https://maoyixie.com/

^ permalink raw reply

* Re: [PATCH v2 1/5] mm: Make per-VMA locks available universally
From: Alice Ryhl @ 2026-06-16  7:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Andrew Morton, Arve Hjønnevåg,
	Carlos Llamas, Christian Brauner, David Ahern, David S. Miller,
	Greg Kroah-Hartman, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
	netdev, Shakeel Butt, Suren Baghdasaryan, Todd Kjos,
	Vlastimil Babka
In-Reply-To: <20260610230411.617DD5E7@davehans-spike.ostc.intel.com>

On Wed, Jun 10, 2026 at 04:04:11PM -0700, Dave Hansen wrote:
> --- a/rust/kernel/mm.rs~unconditional-vma-locks	2026-06-10 15:57:54.051368539 -0700
> +++ b/rust/kernel/mm.rs	2026-06-10 15:57:54.078369499 -0700
> @@ -174,7 +174,6 @@ impl MmWithUser {
>      /// When per-vma locks are disabled, this always returns `None`.
>      #[inline]
>      pub fn lock_vma_under_rcu(&self, vma_addr: usize) -> Option<VmaReadGuard<'_>> {
> -        #[cfg(CONFIG_PER_VMA_LOCK)]
>          {
>              // SAFETY: Calling `bindings::lock_vma_under_rcu` is always okay given an mm where
>              // `mm_users` is non-zero.
> @@ -188,12 +187,6 @@ impl MmWithUser {
>                  });
>              }
>          }
> -
> -        // Silence warnings about unused variables.
> -        #[cfg(not(CONFIG_PER_VMA_LOCK))]
> -        let _ = vma_addr;
> -
> -        None

This isn't quite right:

    error[E0317]: `if` may be missing an `else` clause
       --> rust/kernel/mm.rs:181:13
        |
    181 | /             if !vma.is_null() {
    182 | |                 return Some(VmaReadGuard {
    ...   |
    187 | |                 });
    188 | |             }
        | |_____________^ expected `Option<VmaReadGuard<'_>>`, found `()`
        |
        = note:   expected enum `core::option::Option<mm::VmaReadGuard<'_>>`
                found unit type `()`
        = note: `if` expressions without `else` evaluate to `()`
        = help: consider adding an `else` block that evaluates to the expected type

This error is triggered because you deleted the return 'None' at the end
of the function.

I would like to suggest the following implementation:

        // SAFETY: Calling `bindings::lock_vma_under_rcu` is always okay given an mm where
        // `mm_users` is non-zero.
        let vma = unsafe { bindings::lock_vma_under_rcu(self.as_raw(), vma_addr) };
        if vma.is_null() {
            return None;
        }
        Some(VmaReadGuard {
            // SAFETY: If `lock_vma_under_rcu` returns a non-null ptr, then it points at a valid
            // vma. The vma is stable for as long as the vma read lock is held.
            vma: unsafe { VmaRef::from_raw(vma) },
            _nts: NotThreadSafe,
        })

Thanks!
Alice

^ permalink raw reply

* [PATCH] net: mvneta: free/request IRQ across suspend/resume
From: Yun Zhou @ 2026-06-16  7:26 UTC (permalink / raw)
  To: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	bigeasy, clrkwllms, rostedt
  Cc: netdev, linux-kernel, linux-rt-devel, yun.zhou

On PREEMPT_RT, the mvneta IRQ handler is force-threaded. Under high
network traffic, the IRQ can enter suspend with desc->depth == 1
(masked by the oneshot mechanism between handler invocations).

During suspend, the kernel increments depth to 2 and masks the
interrupt at the MPIC level (clearing the SRC_CTL CPU routing bit,
due to IRQCHIP_MASK_ON_SUSPEND). On resume, depth is decremented
back to 1, but since it does not reach 0, the unmask is never
called. The MPIC CPU routing remains cleared, permanently disabling
interrupt delivery.

Fix by freeing the IRQ in suspend and re-requesting it in resume.
This ensures a clean IRQ state (depth=0, proper hardware routing)
on every resume cycle, regardless of the pre-suspend depth. This
follows the approach used by other drivers (e.g. igb).
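
The depth accounting described above can be pictured with a toy model. This
is only a sketch of the sequence as this commit message describes it, not the
genirq implementation: struct model_irq and all helpers are hypothetical, and
re-requesting the IRQ is modelled simply as a reset to a clean state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-IRQ state discussed above. */
struct model_irq {
	unsigned int depth;	/* models desc->depth */
	bool hw_masked;		/* models the cleared MPIC routing bit */
};

/* Suspend: depth is bumped and the line is masked at the chip. */
static inline void model_suspend(struct model_irq *irq)
{
	irq->depth++;
	irq->hw_masked = true;
}

/* Resume: depth is dropped; unmask only happens when it reaches 0. */
static inline void model_resume(struct model_irq *irq)
{
	assert(irq->depth > 0);
	if (--irq->depth == 0)
		irq->hw_masked = false;
}

/* free_irq() + request_irq(): modelled as a fresh, unmasked IRQ. */
static inline void model_rerequest(struct model_irq *irq)
{
	irq->depth = 0;
	irq->hw_masked = false;
}
```

Starting from depth 0, a suspend/resume cycle ends unmasked. Starting from
depth 1, the same cycle ends at depth 1 with the line still masked, which is
the stuck state this patch avoids by re-requesting the IRQ.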

Fixes: 9768b45ceb0b ("net: mvneta: support suspend and resume")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 28 +++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 0c061fb0ed07..e7e9a58dbe55 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -5819,6 +5819,20 @@ static int mvneta_suspend(struct device *device)
 	mvneta_stop_dev(pp);
 	rtnl_unlock();
 
+	/* Release IRQ to avoid stale MPIC mask state on resume.
+	 * On PREEMPT_RT, forced-threaded oneshot IRQs may leave the
+	 * interrupt masked (depth>0) at suspend time. This prevents
+	 * resume_device_irqs() from restoring the MPIC CPU routing,
+	 * permanently disabling the interrupt. Re-requesting the IRQ
+	 * on resume guarantees a clean state.
+	 */
+	if (pp->neta_armada3700)
+		free_irq(dev->irq, pp);
+	else {
+		on_each_cpu(mvneta_percpu_disable, pp, true);
+		free_percpu_irq(dev->irq, pp->ports);
+	}
+
 	for (queue = 0; queue < rxq_number; queue++) {
 		struct mvneta_rx_queue *rxq = &pp->rxqs[queue];
 
@@ -5895,6 +5909,20 @@ static int mvneta_resume(struct device *device)
 						 &pp->node_dead);
 	}
 
+	/* Re-request IRQ (see comment in mvneta_suspend) */
+	if (pp->neta_armada3700) {
+		err = request_irq(dev->irq, mvneta_isr, 0, dev->name, pp);
+	} else {
+		err = request_percpu_irq(dev->irq, mvneta_percpu_isr,
+					dev->name, pp->ports);
+		if (!err)
+			on_each_cpu(mvneta_percpu_enable, pp, true);
+	}
+	if (err) {
+		netdev_err(dev, "cannot request irq %d\n", dev->irq);
+		return err;
+	}
+
 	rtnl_lock();
 	mvneta_start_dev(pp);
 	rtnl_unlock();
-- 
2.43.0


^ permalink raw reply related

* RE: [PATCH net v3] tipc: fix slab-use-after-free Read in tipc_aead_decrypt_done
From: Tung Quang Nguyen @ 2026-06-16  7:24 UTC (permalink / raw)
  To: Doruk Tan Ozturk
  Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, horms@kernel.org, aleksander.lobakin@intel.com,
	tipc-discussion@lists.sourceforge.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	jmaloy@redhat.com
In-Reply-To: <20260615114618.71249-1-doruk@0sec.ai>

>Subject: [PATCH net v3] tipc: fix slab-use-after-free Read in
>tipc_aead_decrypt_done
>
>tipc_aead_decrypt() goes straight from tipc_bearer_hold(b) to
>crypto_aead_decrypt(req) without taking a reference on the netns, unlike the
>encrypt path. When crypto_aead_decrypt() is offloaded asynchronously (e.g.
>the SIMD aead wrapper queuing to cryptd), the cryptd worker runs
>tipc_aead_decrypt_done() later. If the bearer's netns is torn down in the
>meantime, cleanup_net() -> tipc_exit_net() -> tipc_crypto_stop() frees the
>per-netns tipc_crypto, and the completion then reads it:
>tipc_aead_decrypt_done() dereferences aead->crypto->stats and
>aead->crypto->net, and tipc_crypto_rcv_complete() dereferences
>aead->crypto->aead[] and the node table -- reading freed memory.
>
>Decoded KASAN splat (v7.1-rc7, CONFIG_KASAN_INLINE + TIPC +
>TIPC_CRYPTO):
>
>  BUG: KASAN: slab-use-after-free in tipc_aead_decrypt_done (net/tipc/crypto.c:999)
>  Read of size 8 at addr ffff8881056258a8 by task kworker/u16:2/51
>  Workqueue: events_unbound
>  Call Trace:
>   tipc_aead_decrypt_done (net/tipc/crypto.c:999)
>   process_one_work (kernel/workqueue.c:3314)
>   worker_thread (kernel/workqueue.c:3397 kernel/workqueue.c:3478)
>   kthread (kernel/kthread.c:436)
>   ret_from_fork (arch/x86/kernel/process.c:158)
>   ret_from_fork_asm (arch/x86/entry/entry_64.S:245)
>
>  Allocated by task 169:
>   __kasan_kmalloc (mm/kasan/common.c:398 mm/kasan/common.c:415)
>   tipc_crypto_start (net/tipc/crypto.c:1502)
>   tipc_init_net (net/tipc/core.c:72)
>   ops_init (net/core/net_namespace.c:137)
>   setup_net (net/core/net_namespace.c:446)
>   copy_net_ns (net/core/net_namespace.c:579)
>   create_new_namespaces (kernel/nsproxy.c:132)
>   __x64_sys_unshare (kernel/fork.c:3316)
>   do_syscall_64 (arch/x86/entry/syscall_64.c:63)
>   entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
>
>  Freed by task 8:
>   kfree (mm/slub.c:6566)
>   tipc_exit_net (net/tipc/core.c:119)
>   cleanup_net (net/core/net_namespace.c:704)
>   process_one_work (kernel/workqueue.c:3314)
>   kthread (kernel/kthread.c:436)
>
>This is the same class of bug that commit e279024617134 ("net/tipc: fix
>slab-use-after-free Read in tipc_aead_encrypt_done") fixed for the encrypt side.
>The encrypt path takes maybe_get_net(aead->crypto->net) before
>crypto_aead_encrypt() and drops it with put_net() on the synchronous return
>paths and in tipc_aead_encrypt_done(); the -EINPROGRESS/-EBUSY return
>keeps the reference for the async callback to release. The decrypt path was left
>without the equivalent guard.
>
>Mirror the encrypt-side fix on the decrypt path: take a net reference before
>crypto_aead_decrypt() (failing with -ENODEV and the matching bearer put if it
>cannot be acquired), keep it across the -EINPROGRESS/-EBUSY async return,
>and drop it with put_net() on the synchronous success/error return and at the
>end of tipc_aead_decrypt_done().
>
>Reproduced under KASAN on v7.1-rc7: a UDP bearer with a cluster key is
>flooded with crafted encrypted frames from an unknown peer (driving the
>cluster-key decrypt path) while the bearer's netns is repeatedly torn down. The
>completion must run asynchronously to outlive tipc_crypto_stop(); on x86 the
>stock aesni gcm(aes) now decrypts synchronously, so the async path was
>exercised via cryptd offload. The unguarded aead->crypto dereference in
>tipc_aead_decrypt_done() is the unpatched upstream path;
>tipc_aead_decrypt() still lacks maybe_get_net(aead->crypto->net), so the
>completion can outlive the free on any config where crypto_aead_decrypt()
>goes async.
>
>Found by 0sec automated security-research tooling (https://0sec.ai).
>
>Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication")
>Cc: stable@vger.kernel.org
>Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
>Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>---
>v3:
> - Rewrite the changelog with the decoded stack trace and frame the
>   reproduction on the current tree (v7.1-rc7); drop the v6.12.92
>   references (Tung Quang Nguyen).
>v2:
> - Add Cc: stable@vger.kernel.org and Alexander Lobakin's Reviewed-by.
>   No functional change.
> net/tipc/crypto.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
>diff --git a/net/tipc/crypto.c b/net/tipc/crypto.c
>index 6d3b6b89b1d1..84a6489da036 100644
>--- a/net/tipc/crypto.c
>+++ b/net/tipc/crypto.c
>@@ -941,12 +941,20 @@ static int tipc_aead_decrypt(struct net *net, struct tipc_aead *aead,
> 		goto exit;
> 	}
>
>+	/* Get net to avoid freed tipc_crypto when delete namespace */
>+	if (!maybe_get_net(aead->crypto->net)) {
>+		tipc_bearer_put(b);
>+		rc = -ENODEV;
>+		goto exit;
>+	}
>+
> 	/* Now, do decrypt */
> 	rc = crypto_aead_decrypt(req);
> 	if (rc == -EINPROGRESS || rc == -EBUSY)
> 		return rc;
>
> 	tipc_bearer_put(b);
>+	put_net(aead->crypto->net);
>
> exit:
> 	kfree(ctx);
>@@ -984,6 +992,7 @@ static void tipc_aead_decrypt_done(void *data, int err)
> 	}
>
> 	tipc_bearer_put(b);
>+	put_net(net);
> }
>
> static inline int tipc_ehdr_size(struct tipc_ehdr *ehdr)
>--
>2.43.0
>

Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
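
For readers following the thread, the reference pattern in the quoted patch
can be sketched in user space. This is a simplified model and not the TIPC
code: struct model_net and the model_* helpers are hypothetical, the bearer
reference is left out, and the crypto result is passed in as a parameter
instead of coming from crypto_aead_decrypt().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical netns stand-in; refcnt 0 means it is being torn down. */
struct model_net {
	int refcnt;
};

/* Models maybe_get_net(): take a reference only if one still exists. */
static inline bool model_maybe_get_net(struct model_net *net)
{
	if (net->refcnt == 0)
		return false;
	net->refcnt++;
	return true;
}

/* Models put_net(). */
static inline void model_put_net(struct model_net *net)
{
	assert(net->refcnt > 0);
	net->refcnt--;
}

/* Models the patched decrypt path; crypto_rc is the crypto layer result. */
static inline int model_decrypt(struct model_net *net, int crypto_rc)
{
	if (!model_maybe_get_net(net))
		return -ENODEV;

	/* Async: keep the reference for the completion callback. */
	if (crypto_rc == -EINPROGRESS || crypto_rc == -EBUSY)
		return crypto_rc;

	/* Sync success or error: drop the reference right away. */
	model_put_net(net);
	return crypto_rc;
}

/* Models the end of the completion callback. */
static inline void model_decrypt_done(struct model_net *net)
{
	model_put_net(net);
}
```

In this model the count returns to its starting value on both the
synchronous and the asynchronous path, and a netns that is already going
away is refused with -ENODEV.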

^ permalink raw reply

* Re: [PATCH net] sctp: hold socket lock when dumping endpoints in sctp_diag
From: Simon Horman @ 2026-06-16  7:24 UTC (permalink / raw)
  To: Xin Long
  Cc: netdev, linux-sctp, davem, kuba, edumazet, pabeni,
	marcelo.leitner, w, zdi-disclosures
In-Reply-To: <CADvbK_e062WLNVy+BbuNTNoJGBvQBR7PHp_BmxLwwSGq4O9_dw@mail.gmail.com>

On Mon, Jun 15, 2026 at 02:24:34PM -0400, Xin Long wrote:
> On Mon, Jun 15, 2026 at 7:04 AM Simon Horman <horms@kernel.org> wrote:
> >
> > This is an AI-generated review of your patch. The human sending this
> > email has considered the AI review valid, or at least plausible.
> > Full review at: https://netdev-ai.bots.linux.dev/sashiko/

...

> Low: #1, #2, #5, not really issues,
> but worth mentioning in the changelog.
> 
> Critical: #3, not valid.
> socket refcnt can't be 0 when traversing the chain under read_lock_bh().
> 
> But it seems better to hold ep instead of sk, and also to check
> ep->base.dead instead of sk_state CLOSED.
> 
> Medium: #4, not valid.
> it's completely okay to dump duplicate or skip socks because of
> concurrent close() and listen() in diag.
> 
> will post v2 with some improvements mentioned above.

Thanks, much appreciated.

^ permalink raw reply

* Re: [PATCH net-next 0/2] appletalk: move the protocol out of tree
From: Carsten Strotmann @ 2026-06-16  7:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: John Paul Adrian Glaubitz, davem, netdev, edumazet, pabeni,
	andrew+netdev, horms, geert, chleroy, npiggin, mpe, maddy,
	linux-mips, linux-m68k, linuxppc-dev
In-Reply-To: <20260615175535.5bc56cfc@kernel.org>

Hi,

I'm a user of AppleTalk and other "Retro"-Features in the Linux Kernel.

On 16 Jun 2026, at 2:55, Jakub Kicinski wrote:

> We can complain about the AI slop til the cows come home.
> I don't like it, you don't like it. What difference does it make?
>
> If y'all have real solutions please share. Complaining about
> "commercial interests" and "nuk[ing] everything in a panic reaction"
> is not helpful.

The solution, as Adrian pointed out, is to leave these features in the Linux kernel but have them disabled by default. Maybe put a warning message in the kernel config tools that people should only enable these if they know what they are doing.

These "retro" features should not pose any security risk if they are not compiled into a kernel.

Greetings

Carsten

^ permalink raw reply

* Re: [PATCH] net: ethtool: mm: Increase FPE verification retry count
From: Simon Horman @ 2026-06-16  7:19 UTC (permalink / raw)
  To: muhammad.nazim.amirul.nazle.asmade
  Cc: netdev, andrew, kuba, davem, edumazet, pabeni, vladimir.oltean,
	faizal.abdul.rahim, linux-kernel
In-Reply-To: <20260615072436.26128-1-muhammad.nazim.amirul.nazle.asmade@altera.com>

+ Vladimir

On Mon, Jun 15, 2026 at 12:24:36AM -0700, muhammad.nazim.amirul.nazle.asmade@altera.com wrote:
> From: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
> 
> The current FPE verification retry count is set to 3. However,
> the IEEE 802.3br standard does not specify a fixed value for this.
> A retry count of 3 may be insufficient when the remote device is
> slow to respond during link-up. Increase the retry count to 20 to
> improve robustness.
> 
> Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
> Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>

Vladimir, I'm wondering if you could take a look at this one.

> ---
>  include/linux/ethtool.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index f51346a6a686..9a1b1f5d37a4 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -23,7 +23,7 @@
>  #include <uapi/linux/net_tstamp.h>
>  
>  #define ETHTOOL_MM_MAX_VERIFY_TIME_MS		128
> -#define ETHTOOL_MM_MAX_VERIFY_RETRIES		3
> +#define ETHTOOL_MM_MAX_VERIFY_RETRIES		20
>  
>  struct compat_ethtool_rx_flow_spec {
>  	u32		flow_type;
> -- 
> 2.43.7
> 

^ permalink raw reply

* Re: [PATCH net] net/smc: fix out-of-bounds read in smc_clcsock_data_ready()
From: D. Wythe @ 2026-06-16  7:16 UTC (permalink / raw)
  To: Sechang Lim
  Cc: D . Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David S . Miller, Mahanta Jambigi,
	Tony Lu, Wen Gu, Simon Horman, Ursula Braun, Karsten Graul,
	Guvenc Gulce, netdev, linux-rdma, linux-s390, bpf, linux-kernel
In-Reply-To: <20260614120931.4041687-1-rhkrqnwk98@gmail.com>

On Sun, Jun 14, 2026 at 12:09:30PM +0000, Sechang Lim wrote:
> smc_clcsock_data_ready() is installed on the listen socket and reads its
> sk_user_data as an smc_sock. A passive-open child inherits this callback,
> but sk_clone_lock() clears the child's sk_user_data because it is tagged
> SK_USER_DATA_NOCOPY. smc_tcp_syn_recv_sock() restores the child's af_ops,
> but the inherited sk_data_ready() is left in place until accept.
> 
> In that window the child is established. A cgroup sock_ops program can run
> bpf_sock_hash_update() on it from tcp_init_transfer(); sk_psock_init()
> stores a sk_psock in the NULL sk_user_data. The inherited callback then
> reads sk_user_data via smc_clcsock_user_data(), which masks only
> SK_USER_DATA_NOCOPY, mistakes the sk_psock for an smc_sock, and reads a
> callback pointer past the end of the sk_psock:
> 
>   BUG: KASAN: slab-out-of-bounds in smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
>   Read of size 8 at addr ffff8880013b8674 by task syz.6.12484/67930
>    <IRQ>
>    smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
>    tcp_urg+0x24d/0x360 net/ipv4/tcp_input.c:6264
>    tcp_rcv_state_process+0x280d/0x4940 net/ipv4/tcp_input.c:7336
>    tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
>    tcp_v4_rcv+0x1eaa/0x2a00 net/ipv4/tcp_ipv4.c:2186
>    ip_protocol_deliver_rcu+0x226/0x420 net/ipv4/ip_input.c:207
>    ip_local_deliver_finish+0x35a/0x5f0 net/ipv4/ip_input.c:241
>    __netif_receive_skb_one_core+0x1e5/0x210 net/core/dev.c:6216
>    process_backlog+0x631/0x1470 net/core/dev.c:6682
>    __napi_poll+0xb3/0x320 net/core/dev.c:7749
>    net_rx_action+0x4fa/0xcb0 net/core/dev.c:7969
>    handle_softirqs+0x236/0x800 kernel/softirq.c:622
>    </IRQ>
> 
>   Allocated by task 67930:
>    sk_psock_init+0x142/0x740 net/core/skmsg.c:766
>    sock_map_link+0x646/0xdf0 net/core/sock_map.c:279
>    sock_hash_update_common+0xd3/0x990 net/core/sock_map.c:1010
>    bpf_sock_hash_update+0x114/0x170 net/core/sock_map.c:1229
>    __cgroup_bpf_run_filter_sock_ops+0x74/0xa0 kernel/bpf/cgroup.c:1727
>    tcp_init_transfer+0x1085/0x1100 net/ipv4/tcp_input.c:6693
>    tcp_rcv_state_process+0x241e/0x4940 net/ipv4/tcp_input.c:7231
>    tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
> 
> Restore the inherited sk_data_ready() in smc_tcp_syn_recv_sock(), where the
> child's sk_user_data is already cleared, rather than only at accept.
> 
> Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers")
> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
> ---
>  net/smc/af_smc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index b5db69073e20..152971e8ad17 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -156,6 +156,12 @@ static struct sock *smc_tcp_syn_recv_sock(const struct sock *sk,
>  	if (child) {
>  		rcu_assign_sk_user_data(child, NULL);
>  
> +		/*
> +		 * the child inherited the listen-specific sk_data_ready();
> +		 * restore it here, as sk_user_data may be reused before accept
> +		 */
> +		child->sk_data_ready = smc->clcsk_data_ready;

One concern:

smc_clcsock_user_data_rcu() together with refcount_inc_not_zero() only
pins the smc_sock; it does not guarantee anything about the lifetime or
consistency of smc->clcsk_data_ready. In the listen-close path,
smc_clcsock_restore_cb() clears that field under sk_callback_lock,
while smc_tcp_syn_recv_sock() reads it without any lock. These are
independent protection domains. If close wins the race,
child->sk_data_ready can end up NULL and the next data arrival will
crash.

Also, I don't object to this fix, but I'd rather see the underlying cause
addressed directly. The real issue seems to be the conflict between
SMC's sk_user_data and sk_psock. Maybe there is a cleaner solution, e.g.
always setting user_data.

> +
>  		/* v4-mapped sockets don't inherit parent ops. Don't restore. */
>  		if (inet_csk(child)->icsk_af_ops == inet_csk(sk)->icsk_af_ops)
>  			inet_csk(child)->icsk_af_ops = smc->ori_af_ops;
> -- 
> 2.43.0

^ permalink raw reply

* Re: [PATCH net] octeontx2-af: npc: Log successful MCAM drop-on-non-hit install at debug level
From: Simon Horman @ 2026-06-16  7:14 UTC (permalink / raw)
  To: Ratheesh Kannoth
  Cc: kuba, linux-kernel, netdev, andrew+netdev, davem, edumazet,
	pabeni, sgoutham
In-Reply-To: <20260615033157.535237-1-rkannoth@marvell.com>

On Mon, Jun 15, 2026 at 09:01:57AM +0530, Ratheesh Kannoth wrote:
> npc_install_mcam_drop_rule() used dev_err() after a successful
> rvu_mbox_handler_npc_mcam_write_entry() call, so normal installs appeared
> as errors in dmesg.  Use dev_dbg() for the success path and keep dev_err()
> for real failures.
> 
> Fixes: 3571fe07a090 ("octeontx2-af: Drop rules for NPC MCAM")
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>

Reviewed-by: Simon Horman <horms@kernel.org>

^ permalink raw reply
