Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net] xfrm: fix stack-out-of-bounds in xfrm_tmpl_resolve_one
From: Eric Dumazet @ 2026-06-25  9:24 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, netdev, eric.dumazet, Eric Dumazet,
	syzbot+0ac4d84afe1066a1f3e9, Steffen Klassert, Herbert Xu

syzbot reported a stack-out-of-bounds read in xfrm_state_find()
which flows from xfrm_tmpl_resolve_one().

The issue occurs when a policy has a mix of family-changing templates
(e.g. BEET or IPTFS) and transport templates. If an optional
family-changing template is skipped because no state is found, the
current family of the flow (`family`) is not updated. The subsequent
transport template is then evaluated using the unchanged family (e.g.
AF_INET), but it uses the template's `encap_family` (e.g. AF_INET6)
to perform the state lookup.

This causes `xfrm_state_find()` to interpret the IPv4 flow addresses
(allocated on the stack as `struct flowi4` in `raw_sendmsg` or
`udp_sendmsg`) as IPv6 addresses (`xfrm_address_t`), leading to a
16-byte read from the 4-byte stack variables, triggering KASAN.

Fix this by tracking the active family of the flow (`cur_family`)
during template resolution:
1. Initialize `cur_family` to the flow's original family.
2. For transport templates, verify that `tmpl->encap_family` matches
   `cur_family`. If they mismatch, abort with -EINVAL.
3. When a template that can change the family (tunnel, beet, iptfs) is
   successfully resolved, update `cur_family` to `tmpl->encap_family`.
4. If a template is skipped (optional), `cur_family` remains unchanged.

This prevents mismatched transport lookups and makes the resolution
robust against any family-transition gaps.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+0ac4d84afe1066a1f3e9@syzkaller.appspotmail.com
Closes: https://www.spinics.net/lists/netdev/msg1200923.html
Assisted-by: Jetski:gemini-3.1-pro-preview
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
 net/xfrm/xfrm_policy.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 7ef861a0e8231b63ece816b5237b03fa1367ccf9..95e30670303d34598ba164dff59a65c14489d5f3 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2485,6 +2485,7 @@ xfrm_tmpl_resolve_one(struct xfrm_policy *policy, const struct flowi *fl,
 	int i, error;
 	xfrm_address_t *daddr = xfrm_flowi_daddr(fl, family);
 	xfrm_address_t *saddr = xfrm_flowi_saddr(fl, family);
+	unsigned short cur_family = family;
 	xfrm_address_t tmp;

 	for (nx = 0, i = 0; i < policy->xfrm_nr; i++) {
@@ -2511,6 +2512,11 @@ xfrm_tmpl_resolve_one(struct xfrm_policy *policy, const struct flowi *fl,
 					goto fail;
 				local = &tmp;
 			}
+		} else {
+			if (tmpl->encap_family != cur_family) {
+				error = -EINVAL;
+				goto fail;
+			}
 		}

 		x = xfrm_state_find(remote, local, fl, tmpl, policy, &error,
@@ -2526,6 +2532,11 @@ xfrm_tmpl_resolve_one(struct xfrm_policy *policy, const struct flowi *fl,
 			xfrm[nx++] = x;
 			daddr = remote;
 			saddr = local;
+			if (tmpl->mode == XFRM_MODE_TUNNEL ||
+			    tmpl->mode == XFRM_MODE_IPTFS ||
+			    tmpl->mode == XFRM_MODE_BEET) {
+				cur_family = tmpl->encap_family;
+			}
 			continue;
 		}
 		if (x) {
-- 
2.55.0.rc0.799.gd6f94ed593-goog

^ permalink raw reply related

* RE: [PATCH net v2] tipc: fix out-of-bounds read in broadcast Gap ACK blocks
From: Tung Quang Nguyen @ 2026-06-25  9:23 UTC (permalink / raw)
  To: Samuel Page
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev@vger.kernel.org,
	tipc-discussion@lists.sourceforge.net,
	linux-kernel@vger.kernel.org, Jon Maloy
In-Reply-To: <20260624135629.727262-1-sam@bynar.io>

>Subject: [PATCH net v2] tipc: fix out-of-bounds read in broadcast Gap ACK
>blocks
>
>A broadcast PROTOCOL/STATE_MSG can carry a Gap ACK blocks record in its
>data area. tipc_get_gap_ack_blks() only verifies that the record's len field is
>self-consistent with its ugack_cnt/bgack_cnt counts (sz == struct_size(p, gacks,
>ugack_cnt + bgack_cnt)); it does not check that the record actually fits in the
>message data area, msg_data_sz().
>
>The unicast caller tipc_link_proto_rcv() bounds it ("if (glen > dlen) break;"), but
>the broadcast caller tipc_bcast_sync_rcv() discards the returned size, so
>tipc_link_advance_transmq() copies the record off the receive skb with an
>attacker-controlled count:
>
>	this_ga = kmemdup(ga, struct_size(ga, gacks, ga->bgack_cnt),
>			  GFP_ATOMIC);
>
>A TIPC neighbour that negotiated TIPC_GAP_ACK_BLOCK triggers it with one
>ordinary broadcast STATE_MSG (msg_bc_ack_invalid() clear), sized so its data
>area is short, carrying a Gap ACK record with len = 0x400, bgack_cnt = 0xff and
>ugack_cnt = 0. len then equals struct_size(p, gacks, 255), so the consistency
>check passes and ga is non-NULL; kmemdup() reads struct_size(ga, gacks, 255)
>= 1024 bytes out of the much smaller skb:
>
>  BUG: KASAN: slab-out-of-bounds in kmemdup_noprof+0x48/0x60
>  Read of size 1024 at addr ffff0000c7030d38 by task poc864/69
>  Call trace:
>   kmemdup_noprof+0x48/0x60
>   tipc_link_advance_transmq+0x86c/0xb80
>   tipc_link_bc_ack_rcv+0x19c/0x1e0
>   tipc_bcast_sync_rcv+0x1c4/0x2c4
>   tipc_rcv+0x85c/0x1340
>   tipc_l2_rcv_msg+0xac/0x104
>  The buggy address belongs to the object at ffff0000c7030d00
>   which belongs to the cache skbuff_small_head of size 704
>  The buggy address is located 56 bytes inside of
>   allocated 704-byte region [ffff0000c7030d00, ffff0000c7030fc0)
>
>The copied-out bytes are subsequently consumed as gap/ack values, but the
>read is already out of bounds at the kmemdup() regardless of how they are
>used.
>
>The unicast STATE path drops such a message: "if (glen > dlen) break;"
>skips the rest of STATE_MSG handling and the skb is freed. Make the broadcast
>path drop it too. tipc_bcast_sync_rcv() now bounds the record against
>msg_data_sz() and, when it does not fit, reports it back through
>tipc_node_bc_sync_rcv() to tipc_rcv() so the skb is discarded rather than
>processed. ga is not cleared on this path: ga == NULL already means "legacy
>peer without Selective ACK", a distinct legitimate state.
>
>Fixes: d7626b5acff9 ("tipc: introduce Gap ACK blocks for broadcast link")
>Cc: stable@vger.kernel.org
>Assisted-by: Bynario AI
>Signed-off-by: Samuel Page <sam@bynar.io>
>---
>v2, per review of v1 [1]:
> - v1 cleared 'ga' on an oversized Gap ACK record, which let the malformed
>   STATE message be processed as a legacy (no Selective ACK) one rather than
>   dropped.  v2 drops it instead, matching the unicast STATE path:
>   tipc_bcast_sync_rcv() reports the bad record through a bool output
>   parameter, propagated by tipc_node_bc_sync_rcv() to tipc_rcv(), which
>   discards the skb.
> - v1 touched only net/tipc/bcast.c; v2 also touches net/tipc/{bcast.h,node.c}.
>
>[1] https://lore.kernel.org/netdev/20260623134137.3641275-1-sam@bynar.io/
>
>For reference, an earlier thread proposed validating inside
>tipc_get_gap_ack_blks():
>
>https://lore.kernel.org/netdev/1316452e465e9a96fce44ec15130a14f3872149f.
>1775809727.git.caoruide123@gmail.com/
>
> net/tipc/bcast.c | 22 ++++++++++++++--------  net/tipc/bcast.h |  2 +-
>net/tipc/node.c  | 13 ++++++++++---
> 3 files changed, 25 insertions(+), 12 deletions(-)
>
>diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index
>76a1585d3f6b..08637c3c9db0 100644
>--- a/net/tipc/bcast.c
>+++ b/net/tipc/bcast.c
>@@ -497,11 +497,12 @@ void tipc_bcast_ack_rcv(struct net *net, struct
>tipc_link *l,
>  */
> int tipc_bcast_sync_rcv(struct net *net, struct tipc_link *l,
> 			struct tipc_msg *hdr,
>-			struct sk_buff_head *retrq)
>+			struct sk_buff_head *retrq, bool *valid)
> {
> 	struct sk_buff_head *inputq = &tipc_bc_base(net)->inputq;
> 	struct tipc_gap_ack_blks *ga;
> 	struct sk_buff_head xmitq;
>+	u16 glen;

Move this variable declaration to the bottom to follow reverse xmas tree style.

> 	int rc = 0;
>
> 	__skb_queue_head_init(&xmitq);
>@@ -510,13 +511,18 @@ int tipc_bcast_sync_rcv(struct net *net, struct
>tipc_link *l,
> 	if (msg_type(hdr) != STATE_MSG) {
> 		tipc_link_bc_init_rcv(l, hdr);
> 	} else if (!msg_bc_ack_invalid(hdr)) {
>-		tipc_get_gap_ack_blks(&ga, l, hdr, false);
>-		if (!sysctl_tipc_bc_retruni)
>-			retrq = &xmitq;
>-		rc = tipc_link_bc_ack_rcv(l, msg_bcast_ack(hdr),
>-					  msg_bc_gap(hdr), ga, &xmitq,
>-					  retrq);
>-		rc |= tipc_link_bc_sync_rcv(l, hdr, &xmitq);
>+		glen = tipc_get_gap_ack_blks(&ga, l, hdr, false);
>+		if (glen > msg_data_sz(hdr)) {
>+			/* Malformed Gap ACK blocks; caller drops the msg */
>+			*valid = false;
>+		} else {
>+			if (!sysctl_tipc_bc_retruni)
>+				retrq = &xmitq;
>+			rc = tipc_link_bc_ack_rcv(l, msg_bcast_ack(hdr),
>+						  msg_bc_gap(hdr), ga, &xmitq,
>+						  retrq);
>+			rc |= tipc_link_bc_sync_rcv(l, hdr, &xmitq);
>+		}
> 	}
> 	tipc_bcast_unlock(net);
>
>diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index
>2d9352dc7b0e..55d17b5413e1 100644
>--- a/net/tipc/bcast.h
>+++ b/net/tipc/bcast.h
>@@ -97,7 +97,7 @@ void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l,
> 			struct tipc_msg *hdr);
> int tipc_bcast_sync_rcv(struct net *net, struct tipc_link *l,
> 			struct tipc_msg *hdr,
>-			struct sk_buff_head *retrq);
>+			struct sk_buff_head *retrq, bool *valid);
> int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg,
> 			struct tipc_link *bcl);
> int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]); diff --git
>a/net/tipc/node.c b/net/tipc/node.c index 97aa970a0d83..2887f94ee28f
>100644
>--- a/net/tipc/node.c
>+++ b/net/tipc/node.c
>@@ -1831,12 +1831,13 @@ static void tipc_node_mcast_rcv(struct tipc_node
>*n)  }
>
> static void tipc_node_bc_sync_rcv(struct tipc_node *n, struct tipc_msg *hdr,
>-				  int bearer_id, struct sk_buff_head *xmitq)
>+				  int bearer_id, struct sk_buff_head *xmitq,
>+				  bool *valid)
> {
> 	struct tipc_link *ucl;
> 	int rc;
>
>-	rc = tipc_bcast_sync_rcv(n->net, n->bc_entry.link, hdr, xmitq);
>+	rc = tipc_bcast_sync_rcv(n->net, n->bc_entry.link, hdr, xmitq, valid);

'valid' needs to be checked after this call. Then, return immediately if it is false.

>
> 	if (rc & TIPC_LINK_DOWN_EVT) {
> 		tipc_node_reset_links(n);
>@@ -2140,12 +2141,18 @@ void tipc_rcv(struct net *net, struct sk_buff *skb,
>struct tipc_bearer *b)
>
> 	/* Ensure broadcast reception is in synch with peer's send state */
> 	if (unlikely(usr == LINK_PROTOCOL)) {
>+		bool valid = true;
>+
> 		if (unlikely(skb_linearize(skb))) {
> 			tipc_node_put(n);
> 			goto discard;
> 		}
> 		hdr = buf_msg(skb);
>-		tipc_node_bc_sync_rcv(n, hdr, bearer_id, &xmitq);
>+		tipc_node_bc_sync_rcv(n, hdr, bearer_id, &xmitq, &valid);
>+		if (!valid) {
>+			tipc_node_put(n);
>+			goto discard;
>+		}
> 	} else if (unlikely(tipc_link_acked(n->bc_entry.link) != bc_ack)) {
> 		tipc_bcast_ack_rcv(net, n->bc_entry.link, hdr);
> 	}
>
>base-commit: a986fde914d88af47eb78fd29c5d1af7952c3500
>--
>2.54.0


^ permalink raw reply

* Re: [PATCH net] nfc: nci: fix uninit-value in the RF discover/activated NTF handlers
From: Simon Horman @ 2026-06-25  9:18 UTC (permalink / raw)
  To: sam
  Cc: Simon Horman, david, davem, edumazet, kuba, pabeni, oe-linux-nfc,
	netdev, linux-kernel, stable
In-Reply-To: <20260623234126.214667-1-sam@bynar.io>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
nfc: nci: fix uninit-value in the RF discover/activated NTF handlers

This zero-initialises the on-stack nci_rf_discover_ntf and
nci_rf_intf_activated_ntf structs in the RF discover and interface
activated NTF handlers so that the technology- and activation-specific
unions read back as zero when the corresponding length fields are zero.

Link: https://lore.kernel.org/netdev/20260623172109.1105965-2-horms@kernel.org/

>  - activated: with rf_tech_specific_params_len == 0 the same union is read
>    via nci_target_auto_activated(); with activation_params_len == 0 the
>    activation_params union is read by nci_store_ats_nfc_iso_dep() into
>    ndev->target_ats.

[Low]
Does the third scenario in this paragraph match the code?

Looking at nci_store_ats_nfc_iso_dep():

    static int nci_store_ats_nfc_iso_dep(struct nci_dev *ndev,
                         const struct nci_rf_intf_activated_ntf *ntf)
    {
        ndev->target_ats_len = 0;

        if (ntf->activation_params_len <= 0)
            return NCI_STATUS_OK;

        if (ntf->activation_params.nfca_poll_iso_dep.rats_res_len > NFC_ATS_MAXSIZE) {
        ...

The function returns NCI_STATUS_OK before any access to the
activation_params union when activation_params_len is 0, so the union
is not read in that case.

The sibling nci_store_general_bytes_nfc_dep() has the same early return
on activation_params_len <= 0.

The two rf_tech_specific_params_len == 0 cases listed earlier in the
commit message do match the code (nci_add_new_protocol() and
nci_target_auto_activated() unconditionally pass
&ntf.rf_tech_specific_params even when rf_tech_specific_params_len is
0), and the KMSAN trace quoted in the commit message only mentions the
nci_add_new_protocol() path.

The ntf = {} change still looks correct and necessary for the
rf_tech_specific_params union. Could the third bullet be dropped or
adjusted so the commit message reflects only the paths that actually
read uninitialised bytes?

^ permalink raw reply

* [PATCH net] net: libwx: fix VMDQ mask for 1-queue mode
From: Jiawen Wu @ 2026-06-25  9:08 UTC (permalink / raw)
  To: netdev
  Cc: Mengyuan Lou, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Larysa Zaremba,
	Kees Cook, Jiawen Wu

In wx_set_vmdq_queues(), the VMDQ mask was not set for the devices not
support WX_FLAG_MULTI_64_FUNC, i.e., NGBE devices. A mask of 0 causes
__ALIGN_MASK(1, ~vmdq->mask) to return 0, which incorrectly sets
q_per_pool to 0 in wx_write_qde().

Fix the VMDQ 1-queue mask to 0x7F then ensures that __ALIGN_MASK(1,
0x7F) correctly evaluates to 1.

Fixes: c52d4b898901 ("net: libwx: Redesign flow when sriov is enabled")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
---
 drivers/net/ethernet/wangxun/libwx/wx_lib.c  | 1 +
 drivers/net/ethernet/wangxun/libwx/wx_type.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
index d042567b8128..814d88d2aee4 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
@@ -1802,6 +1802,7 @@ static bool wx_set_vmdq_queues(struct wx *wx)
 			rss_i = 4;
 		}
 	} else {
+		vmdq_m = WX_VMDQ_1Q_MASK;
 		/* double check we are limited to maximum pools */
 		vmdq_i = min_t(u16, 8, vmdq_i);
 
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_type.h b/drivers/net/ethernet/wangxun/libwx/wx_type.h
index c7befe4cdfe9..65e3e55db1cf 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_type.h
+++ b/drivers/net/ethernet/wangxun/libwx/wx_type.h
@@ -486,6 +486,7 @@ enum WX_MSCA_CMD_value {
 
 #define WX_VMDQ_4Q_MASK              0x7C
 #define WX_VMDQ_2Q_MASK              0x7E
+#define WX_VMDQ_1Q_MASK              0x7F
 
 /****************** Manageablility Host Interface defines ********************/
 #define WX_HI_MAX_BLOCK_BYTE_LENGTH  256 /* Num of bytes in range */
-- 
2.51.0


^ permalink raw reply related

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-25  9:03 UTC (permalink / raw)
  To: val
  Cc: akpm, axboe, brauner, david, dhowells, fuse-devel, hch, jack,
	joannelkoong, linux-api, linux-fsdevel, linux-kernel, linux-mm,
	miklos, netdev, patches, pfalcato, rostedt, safinaskar, torvalds,
	viro, willy
In-Reply-To: <83f05c55-efba-4bf5-abfe-d2ab0819e904@packett.cool>

Val Packett <val@packett.cool>:
> speaking of fuse_dev_splice……_write actually, this series has broken 
> xdg-document-portal!
> 
> https://github.com/flatpak/xdg-desktop-portal/issues/2026
> 
> Specifically what happens is that the EINVAL is returned due to oh.len 
> != nbytes:
> 
> fuse_dev_do_write: oh.len 16400 != nbytes 15526
> 
> (where 16400 == 16384 (read len) + 16, 15526 == 15510 (file len) + 16)
> 
> After reverting the series, there is no error because oh.len 
> becomes 15526 too.

Please, test v2 version of my fixes:
https://lore.kernel.org/lkml/20260625083409.3769242-1-safinaskar@gmail.com/ .

This should fix this bug.

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-25  8:53 UTC (permalink / raw)
  To: avagin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, safinaskar, torvalds, val, viro, willy
In-Reply-To: <CANaxB-xUrLQYGiRJZc4Boi+KX=0TJSWymErNovANVko20fMDVA@mail.gmail.com>

Andrei Vagin <avagin@gmail.com>:
> On Wed, Jun 24, 2026 at 12:12 AM Askar Safin <safinaskar@gmail.com> wrote:
> > Does CRIU actually rely on ability to do SPLICE_F_NONBLOCK vmsplice into
> > named fifos? Or this is merely a test?
> 
> Yes, it does.

I. e. CRIU relies on that named fifo behavior? Okay, I just sent
v2 version of my fixes. The patchset contains fix for named fifos.

Please, test that this fixes that named fifo problem.

> I already explained that this isn't just a perfomance degradation, it
> actually breaks the pre-dump mechanism in CRIU. vmsplice is invoked from
> our parasite code within the context of a user process, where execution
> speed is critical. A heavy performance penalty completely invalidates
> the pre-dump logic, making the feature useless.

This is very unfortunate. But I still want to remove vmsplice.

> At a minimum, we may need to consider a deprecation plan where vmsplice
> with SPLICE_F_GIFT triggers a warning for a few releases before these
> changes are applied. Alternatively, we could introduce the proposed
> behavior alongside a sysctl to fall back to the old behavior and explicitly
> state that this fallback path will be completely deprecated in a future kernel
> version.

My patches change not only SPLICE_F_GIFT behavior, but also vmsplice
behavior in general.

Let other developers decide what to do (i. e. do nothing, remove
vmsplice now or implement some deprecation scheme).

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH v2] net: meth: Fix skb allocation failure handling in RX init
From: Simon Horman @ 2026-06-25  8:47 UTC (permalink / raw)
  To: haoxiang_li2024
  Cc: Simon Horman, andrew+netdev, davem, edumazet, kuba, pabeni,
	pavan.chebbi, netdev, linux-kernel
In-Reply-To: <20260624031940.2690439-1-haoxiang_li2024@163.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev

Simon says:

Hi Haoxiang,

I think that the pre-existing issues flagged below do not need
to be considered in the context of this patch. But I am wondering
if you could take a look over the others.

---
net: meth: Fix skb allocation failure handling in RX init

meth_init_rx_ring() does not check the return value of alloc_skb().
If the allocation fails, the NULL skb is passed to skb_reserve() and
then dereferenced through skb->head.

> --- a/drivers/net/ethernet/sgi/meth.c
> +++ b/drivers/net/ethernet/sgi/meth.c
> @@ -228,6 +228,9 @@ static int meth_init_rx_ring(struct meth_private *priv)
>  
>  	for (i = 0; i < RX_RING_ENTRIES; i++) {
>  		priv->rx_skbs[i] = alloc_skb(METH_RX_BUFF_SIZE, 0);
> +		if (!priv->rx_skbs[i])
> +			goto err_free_skbs;
> +
>  		/* 8byte status vector + 3quad padding + 2byte padding,
>  		 * to put data on 64bit aligned boundary */
>  		skb_reserve(priv->rx_skbs[i],METH_RX_HEAD);

[ ... ]

> @@ -240,6 +243,17 @@ static int meth_init_rx_ring(struct meth_private *priv)
>  	}
>          priv->rx_write = 0;
>  	return 0;
> +
> +err_free_skbs:
> +	while (i--) {
> +		dma_unmap_single(&priv->pdev->dev, priv->rx_ring_dmas[i],
> +				 METH_RX_BUFF_SIZE, DMA_FROM_DEVICE);

[Severity: Critical]
During the allocation loop, each successfully mapped DMA address is pushed
immediately to the hardware via the mace->eth.rx_fifo register. Unmapping
them here frees the memory, but the hardware FIFO still retains the physical
addresses.

When meth_init_rx_ring() returns -ENOMEM to meth_tx_timeout(), the error is
ignored and the RX DMA engine is enabled anyway:

meth_tx_timeout()
    ...
    meth_init_rx_ring(priv);
    priv->dma_ctrl |= METH_DMA_TX_EN | METH_DMA_RX_EN | METH_DMA_RX_INT_EN;

Will this cause the hardware to DMA incoming packet data into the unmapped
and potentially reallocated memory?

> +		priv->rx_ring[i] = 0;

[Severity: High]
If the hardware remains active as described above, will the RX interrupt
handler dereference this NULL pointer?

When the hardware receives a packet, it raises an RX interrupt and meth_rx()
attempts to process it:

meth_rx()
    ...
    status = priv->rx_ring[priv->rx_write]->status.raw;

Since the pointer at this index was cleared in the error path, this looks
like it will cause a kernel panic.

> +		priv->rx_ring_dmas[i] = 0;
> +		kfree_skb(priv->rx_skbs[i]);
> +		priv->rx_skbs[i] = NULL;

[Severity: Critical]
Does this partial cleanup lead to a double free of socket buffers?

When meth_init_rx_ring() fails at index N, this loop zeroes out elements 0
through N-1. The elements from N through 63 retain their dangling pointers
from the previous execution of meth_free_rx_ring(), which fails to clear SKB
pointers after freeing them.

Since meth_tx_timeout() ignores the error and leaves the device running, the
next time meth_free_rx_ring() is called, it iterates over all 64 elements
again. Will this call kfree_skb() on the dangling pointers in indices N
through 63?

> +	}
> +	return -ENOMEM;
>  }

[Severity: High]
This isn't a bug introduced by this patch, but dma_map_single() is called in
meth_init_rx_ring() without checking for mapping errors:

meth_init_rx_ring()
    ...
    priv->rx_ring_dmas[i] = dma_map_single(&priv->pdev->dev,
                                           priv->rx_ring[i],
                                           METH_RX_BUFF_SIZE,
                                           DMA_FROM_DEVICE);
    mace->eth.rx_fifo = priv->rx_ring_dmas[i];

If the mapping fails, an error cookie is passed directly to the hardware.
Could this lead to IOMMU faults or memory corruption when DMA bounce buffers
are exhausted?

[Severity: High]
This is a pre-existing issue, but there appears to be a race condition due
to lockless processing in meth_rx().

The interrupt handler drops the spinlock before processing the ring:

meth_rx()
    ...
    spin_unlock_irqrestore(&priv->meth_lock, flags);
    ...
    while (priv->rx_write != fifo_rptr) {

Concurrently, meth_tx_timeout() can acquire the lock and destroy the RX rings,
falsely assuming it is protected:

meth_tx_timeout()
    ...
    spin_lock_irqsave(&priv->meth_lock,flags);
    meth_reset(dev);
    ...
    meth_free_rx_ring(priv);

If meth_free_rx_ring() frees the SKBs while meth_rx() is still executing the
lockless while loop, could this result in a use-after-free or NULL pointer
dereference?

^ permalink raw reply

* [syzbot] [net?] KASAN: stack-out-of-bounds Read in xfrm_state_find (6)
From: syzbot @ 2026-06-25  8:49 UTC (permalink / raw)
  To: davem, edumazet, herbert, horms, kuba, linux-kernel, netdev,
	pabeni, steffen.klassert, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    a975094bf98c Merge tag 'exfat-for-7.2-rc1' of git://git.ke..
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=14c0eba1580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=9519196b0a0d47bc
dashboard link: https://syzkaller.appspot.com/bug?extid=0ac4d84afe1066a1f3e9
compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/25ab9553b7ce/disk-a975094b.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/5498f6d5131a/vmlinux-a975094b.xz
kernel image: https://storage.googleapis.com/syzbot-assets/90c6fca52c8c/bzImage-a975094b.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+0ac4d84afe1066a1f3e9@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: stack-out-of-bounds in jhash2 include/linux/jhash.h:138 [inline]
BUG: KASAN: stack-out-of-bounds in __xfrm6_addr_hash net/xfrm/xfrm_hash.h:16 [inline]
BUG: KASAN: stack-out-of-bounds in __xfrm6_daddr_saddr_hash net/xfrm/xfrm_hash.h:29 [inline]
BUG: KASAN: stack-out-of-bounds in __xfrm_dst_hash net/xfrm/xfrm_hash.h:95 [inline]
BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x590d/0x5ec0 net/xfrm/xfrm_state.c:1421
Read of size 4 at addr ffffc900061a7908 by task syz.8.4714/27393

CPU: 1 UID: 0 PID: 27393 Comm: syz.8.4714 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 jhash2 include/linux/jhash.h:138 [inline]
 __xfrm6_addr_hash net/xfrm/xfrm_hash.h:16 [inline]
 __xfrm6_daddr_saddr_hash net/xfrm/xfrm_hash.h:29 [inline]
 __xfrm_dst_hash net/xfrm/xfrm_hash.h:95 [inline]
 xfrm_state_find+0x590d/0x5ec0 net/xfrm/xfrm_state.c:1421
 xfrm_tmpl_resolve_one net/xfrm/xfrm_policy.c:2513 [inline]
 xfrm_tmpl_resolve net/xfrm/xfrm_policy.c:2564 [inline]
 xfrm_resolve_and_create_bundle+0x7f3/0x3070 net/xfrm/xfrm_policy.c:2862
 xfrm_bundle_lookup net/xfrm/xfrm_policy.c:3097 [inline]
 xfrm_lookup_with_ifid+0x576/0x1b40 net/xfrm/xfrm_policy.c:3228
 xfrm_lookup net/xfrm/xfrm_policy.c:3327 [inline]
 xfrm_lookup_route+0x3c/0x1c0 net/xfrm/xfrm_policy.c:3338
 raw_sendmsg+0x110d/0x1a20 net/ipv4/raw.c:628
 sock_sendmsg_nosec+0x10e/0x180 net/socket.c:776
 __sock_sendmsg net/socket.c:790 [inline]
 ____sys_sendmsg+0x54e/0x850 net/socket.c:2684
 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2738
 __sys_sendmsg net/socket.c:2770 [inline]
 __do_sys_sendmsg net/socket.c:2775 [inline]
 __se_sys_sendmsg net/socket.c:2773 [inline]
 __x64_sys_sendmsg+0x1b1/0x290 net/socket.c:2773
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7feeebb9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007feeeca26028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007feeebe15fa0 RCX: 00007feeebb9ce59
RDX: 000000000400c894 RSI: 0000200000000900 RDI: 0000000000000007
RBP: 00007feeebc32e6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007feeebe16038 R14: 00007feeebe15fa0 R15: 00007fff5f12fac8
 </TASK>

The buggy address belongs to stack of task syz.8.4714/27393
 and is located at offset 328 in frame:
 raw_sendmsg+0x0/0x1a20 net/ipv4/raw.c:909

This frame has 5 objects:
 [32, 104) 'opt_copy_u'
 [144, 200) 'ipc'
 [240, 248) 'rt'
 [272, 328) 'fl4'
 [368, 392) 'rfv'

The buggy address belongs to a 8-page vmalloc region starting at 0xffffc900061a0000 allocated at copy_process+0x81b/0x42e0 kernel/fork.c:2110
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8b157
memcg:ffff88807cfe5e02
flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000000 0000000000000000 ffffea00022c55c8 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff ffff88807cfe5e02
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x29c2(GFP_NOWAIT|__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_ZERO), pid 27358, tgid 27357 (syz.3.4705), ts 776904415615, free_ts 773990322016
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x1f9/0x250 mm/page_alloc.c:1859
 prep_new_page mm/page_alloc.c:1867 [inline]
 get_page_from_freelist+0x21fa/0x2270 mm/page_alloc.c:3946
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5304
 alloc_pages_mpol+0x212/0x380 mm/mempolicy.c:2490
 alloc_frozen_pages_noprof mm/mempolicy.c:2561 [inline]
 alloc_pages_noprof+0xac/0x2a0 mm/mempolicy.c:2581
 vm_area_alloc_pages mm/vmalloc.c:3667 [inline]
 __vmalloc_area_node mm/vmalloc.c:3892 [inline]
 __vmalloc_node_range_noprof+0x795/0x1730 mm/vmalloc.c:4082
 __vmalloc_node_noprof+0xc2/0x100 mm/vmalloc.c:4143
 alloc_thread_stack_node kernel/fork.c:359 [inline]
 dup_task_struct+0x28e/0x850 kernel/fork.c:929
 copy_process+0x81b/0x42e0 kernel/fork.c:2110
 kernel_clone+0x2d7/0x940 kernel/fork.c:2746
 __do_sys_clone kernel/fork.c:2887 [inline]
 __se_sys_clone kernel/fork.c:2871 [inline]
 __x64_sys_clone+0x1b6/0x230 kernel/fork.c:2871
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 27224 tgid 27224 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1406 [inline]
 __free_frozen_pages+0xc1f/0xd10 mm/page_alloc.c:2950
 __slab_free+0x274/0x2c0 mm/slub.c:5672
 qlink_free mm/kasan/quarantine.c:163 [inline]
 qlist_free_all+0x99/0x100 mm/kasan/quarantine.c:179
 kasan_quarantine_reduce+0x148/0x160 mm/kasan/quarantine.c:286
 __kasan_slab_alloc+0x22/0x80 mm/kasan/common.c:350
 kasan_slab_alloc include/linux/kasan.h:253 [inline]
 slab_post_alloc_hook mm/slub.c:4610 [inline]
 slab_alloc_node mm/slub.c:4939 [inline]
 __do_kmalloc_node mm/slub.c:5333 [inline]
 __kmalloc_noprof+0x312/0x750 mm/slub.c:5347
 _kmalloc_noprof include/linux/slab.h:973 [inline]
 tomoyo_realpath_from_path+0xef/0x640 security/tomoyo/realpath.c:251
 tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
 tomoyo_check_open_permission+0x229/0x470 security/tomoyo/file.c:776
 security_file_open+0xa9/0x240 security/security.c:2739
 do_dentry_open+0x4a0/0x1380 fs/open.c:924
 vfs_open+0x3b/0x340 fs/open.c:1079
 do_open fs/namei.c:4700 [inline]
 path_openat+0x2e44/0x3830 fs/namei.c:4859
 do_file_open+0x23e/0x4a0 fs/namei.c:4888
 do_sys_openat2+0x115/0x200 fs/open.c:1395
 do_sys_open fs/open.c:1401 [inline]
 __do_sys_openat fs/open.c:1417 [inline]
 __se_sys_openat fs/open.c:1412 [inline]
 __x64_sys_openat+0x138/0x170 fs/open.c:1412
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94

Memory state around the buggy address:
 ffffc900061a7800: 00 00 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00
 ffffc900061a7880: 00 f2 f2 f2 f2 f2 00 f2 f2 f2 00 00 00 00 00 00
>ffffc900061a7900: 00 f2 f2 f2 f2 f2 00 00 00 f3 f3 f3 f3 f3 f3 f3
                      ^
 ffffc900061a7980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffffc900061a7a00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply

* Re: [PATCH v2 0/7] vmsplice: fix some problems in my previous vmsplice patchset
From: David Hildenbrand (Arm) @ 2026-06-25  8:46 UTC (permalink / raw)
  To: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, Pedro Falcato, Miklos Szeredi,
	Andy Lutomirski, Collin Funk, David Laight, Stefan Metzmacher,
	The 8472, Willy Tarreau, Joanne Koong, Val Packett, Andrei Vagin,
	patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

On 6/25/26 10:34, Askar Safin wrote:
> This patchset is for VFS. Of course, it depends on my previous vmsplice
> patchset ( https://lore.kernel.org/all/20260531010107.1953702-1-safinaskar@gmail.com/ ).
> 
> I fix some problems in my previous patchset.

I think we concluded that we cannot rip out vmsplice that way at this point, and
I suspect that Christian will drop that topic branch from -next after -rc1.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH net-next v5 1/4] dpll: add DPLL_PIN_TYPE_INT_NCO pin type
From: Jiri Pirko @ 2026-06-25  8:45 UTC (permalink / raw)
  To: Vadim Fedorenko
  Cc: Ivan Vecera, Kubalewski, Arkadiusz, Jakub Kicinski,
	netdev@vger.kernel.org, Jiri Pirko, David S. Miller,
	Donald Hunter, Eric Dumazet, Schmidt, Michal, Paolo Abeni,
	Vaananen, Pasi, Oros, Petr, Prathosh Satish, Simon Horman,
	linux-kernel@vger.kernel.org
In-Reply-To: <0f8fe4e0-72d8-48a6-96ad-d1650919d2df@linux.dev>

Wed, Jun 24, 2026 at 05:57:35PM +0200, vadim.fedorenko@linux.dev wrote:
>On 19/06/2026 18:07, Ivan Vecera wrote:

[...]

>> 
>> Proposal:
>> 1) new pin capability
>>     - name: state-connected-override
>>     - doc: pin state can be changed to connected in any DPLL mode
>> 
>> 2) new NCO pin type to switch the DPLL to NCO mode when connected
>> 
>> 3) automatic-only DPLL
>>     - should expose NCO pin with state-connected-override capability
>> 
>> 4) manual-only DPLL
>>    - does not need to expose NCO pin with state-connected-override cap
>> 
>> 5) dual-mode DPLL (supporting mode switching)
>>    - if it exposes NCO pin with the override cap then it has to support
>>      switching to NCO mode directly from AUTO mode
>>    - if does not expose NCO pin with the override cap then a user MUST
>>      switch the DPLL mode from AUTO to MANUAL to be able to make NCO
>>      pin connected to the DPLL
>
>I still don't see good reasoning for the pin. Even this sentence says
>"DPLL mode" which keeps me thinking whether we have to move it to a
>special DPLL mode. All these items look like overcomplication of a
>simple function of the device itself. DPLL can be either in the closed
>loop when one of the pins provides a signal to align to, or in the open
>loop meaning that software can control adjustments to phase/frequency.
>But it's definitely a property of the device, and it's not a pin in any
>kind...

Vadim, did you see this:
https://lore.kernel.org/all/aiftnkuT9IP31qUm@FV6GYCPJ69/ ?
I very thoroughly described what you are questioning. There is 0 reply
to that email so perhaps you missed it? IDK.



^ permalink raw reply

* RE: [PATCH net 4/4] selftests: bonding: add a test for VLAN propagation over a bonded real device
From: Loktionov, Aleksandr @ 2026-06-25  8:41 UTC (permalink / raw)
  To: Jakub Kicinski, davem@davemloft.net
  Cc: netdev@vger.kernel.org, edumazet@google.com, pabeni@redhat.com,
	andrew+netdev@lunn.ch, horms@kernel.org, jv@jvosburgh.net,
	sdf@fomichev.me, dongchenchen2@huawei.com, idosch@nvidia.com,
	n05ec@lzu.edu.cn, yuantan098@gmail.com, kuniyu@google.com,
	nb@tipi-net.de, dtatulea@nvidia.com
In-Reply-To: <20260624182018.2445732-5-kuba@kernel.org>



> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, June 24, 2026 8:20 PM
> To: davem@davemloft.net
> Cc: netdev@vger.kernel.org; edumazet@google.com; pabeni@redhat.com;
> andrew+netdev@lunn.ch; horms@kernel.org; jv@jvosburgh.net;
> sdf@fomichev.me; dongchenchen2@huawei.com; idosch@nvidia.com;
> n05ec@lzu.edu.cn; yuantan098@gmail.com; kuniyu@google.com; nb@tipi-
> net.de; Loktionov, Aleksandr <aleksandr.loktionov@intel.com>;
> dtatulea@nvidia.com; Jakub Kicinski <kuba@kernel.org>
> Subject: [PATCH net 4/4] selftests: bonding: add a test for VLAN
> propagation over a bonded real device
> 
> Add a regression test for the VLAN notifier handling that the
> netdev_work deferral fixed.
> 
> A VLAN's real device propagates its UP/DOWN, MTU and feature changes
> onto the VLANs stacked on top of it. This used to be done
> synchronously from the real device's notifier and deadlocked when the
> real device was brought up while enslaved to a bond (instance lock
> held across NETDEV_UP) and the VLAN on top was itself a bond member:
> the synchronous propagation re-entered the stack and took the same
> instance lock again.
> 
> The test covers both halves:
>  - that the deferred UP/DOWN, MTU and feature propagation actually
> lands on
>    the VLAN (link state and MTU use an ops-locked dummy, i.e. the
> deferral
>    path; features use veth, which exports vlan_features to inherit),
> and
>  - that the deadlock-prone topology - a VLAN on a dummy, with the VLAN
> and
>    the dummy each enslaved to a different bond - can be built without
>    hanging.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  .../selftests/drivers/net/bonding/Makefile    |   1 +
>  .../drivers/net/bonding/bond_vlan_real_dev.sh | 180
> ++++++++++++++++++
>  2 files changed, 181 insertions(+)
>  create mode 100755
> tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh
> 
> diff --git a/tools/testing/selftests/drivers/net/bonding/Makefile
> b/tools/testing/selftests/drivers/net/bonding/Makefile
> index be130bf585a4..6364ca02642d 100644
> --- a/tools/testing/selftests/drivers/net/bonding/Makefile
> +++ b/tools/testing/selftests/drivers/net/bonding/Makefile
> @@ -13,6 +13,7 @@ TEST_PROGS := \
>  	bond_options.sh \
>  	bond_passive_lacp.sh \
>  	bond_stacked_header_parse.sh \
> +	bond_vlan_real_dev.sh \
>  	dev_addr_lists.sh \
>  	mode-1-recovery-updelay.sh \
>  	mode-2-recovery-updelay.sh \
> diff --git
> a/tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh
> b/tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh
> new file mode 100755
> index 000000000000..542d9ffc4819
> --- /dev/null
> +++
> b/tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh
> @@ -0,0 +1,180 @@

...

> +exit "$EXIT_STATUS"
> --
> 2.54.0


Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>


^ permalink raw reply

* [PATCH v2 7/7] pipe: set FMODE_NOWAIT for named FIFOs
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

CRIU relies on ability to do vmsplice(SPLICE_F_NONBLOCK) on named FIFOs.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/pipe.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index c0ccf21b9..a8e9b4459 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1156,6 +1156,12 @@ static int fifo_open(struct inode *inode, struct file *filp)
 	/* We can only do regular read/write on fifos */
 	stream_open(inode, filp);
 
+	/*
+	 * CRIU relies on ability to do vmsplice(SPLICE_F_NONBLOCK)
+	 * on named FIFOs.
+	 */
+	filp->f_mode |= FMODE_NOWAIT;
+
 	switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
 	case FMODE_READ:
 	/*
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 6/7] vmsplice: return -EINVAL for particular combination of flags
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

See comment for details.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index dbd0debc2..b1f71b142 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1258,6 +1258,16 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		bool non_block = (flags & SPLICE_F_NONBLOCK) || (fd_file(f)->f_flags & O_NONBLOCK);
 		ssize_t ret;
 
+		/*
+		 * libfuse relies on sharing vmsplice behavior.
+		 * So we detect particular combination of flags to
+		 * pipe2(2) and vmsplice(2) and return -EINVAL.
+		 * This forces libfuse to fail back to non-vmsplice
+		 * code path.
+		 */
+		if ((flags == SPLICE_F_NONBLOCK) && (fd_file(f)->f_flags & O_NONBLOCK))
+			return -EINVAL;
+
 		do {
 			pipe_lock(pipe);
 			ret = pipe_wait_for_space(pipe, non_block);
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 5/7] vmsplice: make sure we don't wait after writing some data
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

Make sure we don't wait for space in pipe after writing some data.
This is needed for compatibility with previous version of vmsplice.
Found by LTP vmsplice01.
See comments in the code and links below for details.

Link: https://lore.kernel.org/all/20260603-raumfahrt-unmerklich-ertrugen-c4ecae70d5f9@brauner/
Link: https://lore.kernel.org/all/CAHk-=wgV-j-G3d+899Zm1pQ=NaJrddPz=GKcL5Yw5DTUM=GaUw@mail.gmail.com/
Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 77487b307..dbd0debc2 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1221,6 +1221,8 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		unsigned long, vlen, unsigned int, flags)
 {
+	struct pipe_inode_info *pipe;
+
 	if (unlikely(flags & ~SPLICE_F_ALL))
 		return -EINVAL;
 
@@ -1229,11 +1231,44 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		return -EBADF;
 
 	/* We do vfs_writev/vfs_readv, so it is okay to pass "false" here */
-	if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
+	pipe = get_pipe_info(fd_file(f), /* for_splice = */ false);
+
+	if (!pipe)
 		return -EBADF;
 
 	if (fd_file(f)->f_mode & FMODE_WRITE) {
-		ssize_t ret = vfs_writev(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+		/*
+		 * When writing to the pipe, previous implementation of vmsplice
+		 * first waited for space in the pipe to appear
+		 * (depending on whether SPLICE_F_NONBLOCK was passed),
+		 * then did unconditional non-blocking write to the pipe.
+		 *
+		 * This differs from what pwritev2 does.
+		 *
+		 * For compatibility we do the same thing previous
+		 * implementation did.
+		 *
+		 * We lock the pipe, do pipe_wait_for_space, then unlock
+		 * the pipe, and then do vfs_writev. vfs_writev internally
+		 * locks the pipe again. This may cause TOCTOU: when we
+		 * do vfs_writev, the pipe may become full again. So we
+		 * do a loop.
+		 */
+
+		bool non_block = (flags & SPLICE_F_NONBLOCK) || (fd_file(f)->f_flags & O_NONBLOCK);
+		ssize_t ret;
+
+		do {
+			pipe_lock(pipe);
+			ret = pipe_wait_for_space(pipe, non_block);
+			pipe_unlock(pipe);
+
+			if (ret < 0)
+				break;
+
+			ret = vfs_writev(fd_file(f), vec, vlen, NULL, RWF_NOWAIT);
+		} while (!non_block && ret == -EAGAIN);
+
 		if (ret > 0)
 			add_wchar(current, ret);
 		inc_syscw(current);
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 4/7] pipe: move wait_for_space to fs/pipe.c and rename it
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

This is needed, because I plan to use it in fs/read_write.c.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/pipe.c                 | 17 +++++++++++++++++
 fs/splice.c               | 19 +------------------
 include/linux/pipe_fs_i.h |  2 ++
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 9841648c9..c0ccf21b9 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1451,6 +1451,23 @@ long pipe_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 	return ret;
 }
 
+int pipe_wait_for_space(struct pipe_inode_info *pipe, bool non_block)
+{
+	for (;;) {
+		if (unlikely(!pipe->readers)) {
+			send_sig(SIGPIPE, current, 0);
+			return -EPIPE;
+		}
+		if (!pipe_is_full(pipe))
+			return 0;
+		if (non_block)
+			return -EAGAIN;
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+		pipe_wait_writable(pipe);
+	}
+}
+
 static const struct super_operations pipefs_ops = {
 	.destroy_inode = free_inode_nonrcu,
 	.statfs = simple_statfs,
diff --git a/fs/splice.c b/fs/splice.c
index 707db2c2c..d12243d19 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1239,23 +1239,6 @@ ssize_t splice_file_range(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(splice_file_range);
 
-static int wait_for_space(struct pipe_inode_info *pipe, bool non_block)
-{
-	for (;;) {
-		if (unlikely(!pipe->readers)) {
-			send_sig(SIGPIPE, current, 0);
-			return -EPIPE;
-		}
-		if (!pipe_is_full(pipe))
-			return 0;
-		if (non_block)
-			return -EAGAIN;
-		if (signal_pending(current))
-			return -ERESTARTSYS;
-		pipe_wait_writable(pipe);
-	}
-}
-
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
@@ -1268,7 +1251,7 @@ ssize_t splice_file_to_pipe(struct file *in,
 	ssize_t ret;
 
 	pipe_lock(opipe);
-	ret = wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
+	ret = pipe_wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
 	if (!ret)
 		ret = do_splice_read(in, offset, opipe, len, flags);
 	pipe_unlock(opipe);
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index a1eeed800..be653625d 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -335,4 +335,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file, bool for_splice);
 int create_pipe_files(struct file **, int);
 unsigned int round_pipe_size(unsigned int size);
 
+int pipe_wait_for_space(struct pipe_inode_info *pipe, bool non_block);
+
 #endif
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 3/7] splice: turn wait_for_space flags argument into bool
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

I want to do this, because I will move this function to fs/pipe.c.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/splice.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 6ddf7dd72..707db2c2c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1239,7 +1239,7 @@ ssize_t splice_file_range(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(splice_file_range);
 
-static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
+static int wait_for_space(struct pipe_inode_info *pipe, bool non_block)
 {
 	for (;;) {
 		if (unlikely(!pipe->readers)) {
@@ -1248,7 +1248,7 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 		}
 		if (!pipe_is_full(pipe))
 			return 0;
-		if (flags & SPLICE_F_NONBLOCK)
+		if (non_block)
 			return -EAGAIN;
 		if (signal_pending(current))
 			return -ERESTARTSYS;
@@ -1268,7 +1268,7 @@ ssize_t splice_file_to_pipe(struct file *in,
 	ssize_t ret;
 
 	pipe_lock(opipe);
-	ret = wait_for_space(opipe, flags);
+	ret = wait_for_space(opipe, flags & SPLICE_F_NONBLOCK);
 	if (!ret)
 		ret = do_splice_read(in, offset, opipe, len, flags);
 	pipe_unlock(opipe);
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 2/7] vmsplice: change argument type back to "int"
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

My previous vmsplice patchset changed vmsplice argument from
"int" to "unsigned long". This may cause problems, so let's
change it back.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c          | 2 +-
 include/linux/syscalls.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index e224e7cb8..77487b307 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1218,7 +1218,7 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 /*
  * Legacy preadv2/pwritev2 wrapper.
  */
-SYSCALL_DEFINE4(vmsplice, unsigned long, fd, const struct iovec __user *, vec,
+SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, vec,
 		unsigned long, vlen, unsigned int, flags)
 {
 	if (unlikely(flags & ~SPLICE_F_ALL))
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a86a88207..46a3ec954 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -514,7 +514,7 @@ asmlinkage long sys_ppoll_time32(struct pollfd __user *, unsigned int,
 			  struct old_timespec32 __user *, const sigset_t __user *,
 			  size_t);
 asmlinkage long sys_signalfd4(int ufd, sigset_t __user *user_mask, size_t sizemask, int flags);
-asmlinkage long sys_vmsplice(unsigned long fd, const struct iovec __user *vec,
+asmlinkage long sys_vmsplice(int fd, const struct iovec __user *vec,
 			     unsigned long vlen, unsigned int flags);
 asmlinkage long sys_splice(int fd_in, loff_t __user *off_in,
 			   int fd_out, loff_t __user *off_out,
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 1/7] vmsplice: open-code do_writev and do_readv
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches
In-Reply-To: <20260625083409.3769242-1-safinaskar@gmail.com>

My previous vmsplice patch did the following mistake: I did
"CLASS(fd, f)(fd)", then did some checks on resulting "struct file",
then passed numeric (!) file descriptor to a function.

This is somewhat okay in this particular case, but I still think
this is code smell, so I fix this by open-coding do_writev and do_readv.

Also I insert a comment to warn other developers to keep
do_writev and do_readv in sync with vmsplice(2).

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 1e5444f4d..e224e7cb8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1070,6 +1070,7 @@ static ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 			unsigned long vlen, rwf_t flags)
 {
+	/* All future changes to this function should be kept in sync with vmsplice(2). */
 	CLASS(fd_pos, f)(fd);
 	ssize_t ret = -EBADF;
 
@@ -1093,6 +1094,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 			 unsigned long vlen, rwf_t flags)
 {
+	/* All future changes to this function should be kept in sync with vmsplice(2). */
 	CLASS(fd_pos, f)(fd);
 	ssize_t ret = -EBADF;
 
@@ -1226,14 +1228,24 @@ SYSCALL_DEFINE4(vmsplice, unsigned long, fd, const struct iovec __user *, vec,
 	if (fd_empty(f))
 		return -EBADF;
 
-	/* We do do_writev/do_readv, so it is okay to pass "false" here */
+	/* We do vfs_writev/vfs_readv, so it is okay to pass "false" here */
 	if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
 		return -EBADF;
 
-	if (fd_file(f)->f_mode & FMODE_WRITE)
-		return do_writev(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
-	else
-		return do_readv(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+	if (fd_file(f)->f_mode & FMODE_WRITE) {
+		ssize_t ret = vfs_writev(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+		if (ret > 0)
+			add_wchar(current, ret);
+		inc_syscw(current);
+		return ret;
+	} else {
+		ssize_t ret = vfs_readv(fd_file(f), vec, vlen, NULL, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+
+		if (ret > 0)
+			add_rchar(current, ret);
+		inc_syscr(current);
+		return ret;
+	}
 }
 
 /*
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 0/7] vmsplice: fix some problems in my previous vmsplice patchset
From: Askar Safin @ 2026-06-25  8:34 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, fuse-devel,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, Andy Lutomirski, Collin Funk, David Laight,
	Stefan Metzmacher, The 8472, Willy Tarreau, Joanne Koong,
	Val Packett, Andrei Vagin, patches

This patchset is for VFS. Of course, it depends on my previous vmsplice
patchset ( https://lore.kernel.org/all/20260531010107.1953702-1-safinaskar@gmail.com/ ).

I fix some problems in my previous patchset.

1. Fix problem with CLASS(fd, f)(fd). See first patch in this patchset
for details. This is probably not so important, but I fix it anyway.

2. Change "unsigned long" back to "int". See second patch for details.
Again, this is probably not important, but I want to fix this anyway.

3. Fix that LTP vmsplice01 bug.

4. libfuse relies on sharing vmsplice behavior. So we detect particular
combination of flags to pipe2(2) and vmsplice(2) and return -EINVAL.
This forces libfuse to fail back to non-vmsplice code path.
I. e. we fix libfuse-related regression [1].
I did debian code search for regex "vmsplice.*SPLICE_F_NONBLOCK" and
I found no other packages with this particular combination of flags
except for fuse itself. (Okay, other packages are fio and stress-ng,
but these are merely testers.) So, I think this is okay to return
EINVAL here, breakage will be minimal.

5. Set FMODE_NOWAIT for named FIFOs. CRIU relies on ability to do
vmsplice(SPLICE_F_NONBLOCK) on named FIFOs. So, I fix this CRIU-related
regression [2]. But there is another CRIU-related regression, which I do not
fix [3]: CRIU behavior in splice mode becomes so slow that splice mode
becomes useless. I personally still believe that removing vmsplice is
right thing to do. Other option is doing nothing. Yet another option
is to implement some deprecation period [3]. Let other developers
decide.

See patches for details.

Please, run that LTP vmsplice01 test again.

Notes:

- I want to repeat: I change behavior around SPLICE_F_NONBLOCK.
Previously, vmsplice ignored whether pipe itself was opened as
non-blocking file. Now it is not ignored. And in my opinion
new behavior is better.
- vmsplice(2) now is in fs/read_write.c . It is very similar to
preadv2 and pwritev2 now, so I think it belongs to fs/read_write.c now.

Please, review this patchset carefully. I'm still new contributor.
In particular, please, review that do-while loop, I'm not sure I did
everything right.

Tested in Qemu.

[1] https://lore.kernel.org/all/CAJnrk1Y9egYizkx1H9K0cqxSYuB+7vLvQbV7Tf4C5eHFqnnC-A@mail.gmail.com/
[2] https://lore.kernel.org/all/CANaxB-zK5q=Xw6UZTmeFtXsDZjUsPkFk=p485m-wtNTBnf4hgg@mail.gmail.com/
[3] https://lore.kernel.org/all/CANaxB-xUrLQYGiRJZc4Boi+KX=0TJSWymErNovANVko20fMDVA@mail.gmail.com/

v1: https://lore.kernel.org/lkml/20260606061031.3744880-1-safinaskar@gmail.com/

Changes since v1: fix fuse-related and CRIU-related regressions (see above).

Askar Safin (7):
  vmsplice: open-code do_writev and do_readv
  vmsplice: change argument type back to "int"
  splice: turn wait_for_space flags argument into bool
  pipe: move wait_for_space to fs/pipe.c and rename it
  vmsplice: make sure we don't wait after writing some data
  vmsplice: return -EINVAL for particular combination of flags
  pipe: set FMODE_NOWAIT for named FIFOs

 fs/pipe.c                 | 23 +++++++++++++
 fs/read_write.c           | 71 +++++++++++++++++++++++++++++++++++----
 fs/splice.c               | 19 +----------
 include/linux/pipe_fs_i.h |  2 ++
 include/linux/syscalls.h  |  2 +-
 5 files changed, 91 insertions(+), 26 deletions(-)

base-commit: 8d86fcfc2857d64af85f5c87c193c25655c970af
-- 
2.47.3

^ permalink raw reply

* Re: [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Sidraya Jayagond @ 2026-06-25  8:32 UTC (permalink / raw)
  To: Runyu Xiao, D. Wythe, Dust Li, Wenjia Zhang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Karsten Graul,
	linux-rdma, linux-s390, netdev, linux-kernel, jianhao.xu, stable
In-Reply-To: <20260617152855.1039151-1-runyu.xiao@seu.edu.cn>



On 17/06/26 8:58 pm, Runyu Xiao wrote:
> smc_listen() installs smc_clcsock_data_ready() as the underlying TCP
> listen socket's sk_data_ready callback.  smc_clcsock_data_ready() then
> immediately takes sk_callback_lock before looking up the SMC listener and
> queuing smc_tcp_listen_work().
> 
> That is unsafe once the TCP listen socket is leaving TCP_LISTEN.  The TCP
> close/flush path can run the installed sk_data_ready callback with
> sk_callback_lock already held, so entering smc_clcsock_data_ready() again
> tries to take the same rwlock recursively in the same thread.  The nvmet
> TCP listener had to make the same state check before taking
> sk_callback_lock for this reason.
> 
> This issue was found by our static analysis tool and then manually
> reviewed against the current tree.
> 
> The grounded PoC kept the SMC listen callback installation path:
> 
>   smc_listen()
>   smc_clcsock_replace_cb()
>   sk_data_ready = smc_clcsock_data_ready()
> 
> It then modeled the close/flush carrier that invokes the installed
> sk_data_ready callback while sk_callback_lock is already held.  Lockdep
> reported the same-thread recursive acquisition:
> 
>   WARNING: possible recursive locking detected
>   smc_clcsock_data_ready+0xa/0x4d [vuln_msv]
>   smc_close_flush_work+0x1f/0x30 [vuln_msv]
>   *** DEADLOCK ***
> 
> Return before taking sk_callback_lock when the underlying TCP socket is no
> longer in TCP_LISTEN.  In that state there is no listen accept work to
> queue for SMC, and avoiding the callback lock mirrors the fix used by the
> TCP nvmet listener.
> 
> Fixes: 0558226cebee ("net/smc: Fix slab-out-of-bounds issue in fallback")
> Cc: stable@vger.kernel.org
> Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
> ---
>  net/smc/af_smc.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 6421c2e1c84d..1af4e3c333ff 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -2631,6 +2631,9 @@ static void smc_clcsock_data_ready(struct sock *listen_clcsock)
>  {
>  	struct smc_sock *lsmc;
>  
> +	if (READ_ONCE(listen_clcsock->sk_state) != TCP_LISTEN)
> +		return;
> +

In smc_close_active(), the TCP socket remains in TCP_LISTEN state while
holding write_lock_bh(&smc->clcsock->sk->sk_callback_lock);. The patch's
state check would pass during this window, not preventing the recursive
lock scenario.
It's unclear whether it fully prevents the recursive locking scenario
described in the commit message for the specific code path in
smc_close_active().
Could you come up with exact deadlock scenario and how the patch
addresses it?

>  	read_lock_bh(&listen_clcsock->sk_callback_lock);
>  	lsmc = smc_clcsock_user_data(listen_clcsock);
>  	if (!lsmc)


^ permalink raw reply

* [PATCH 6.12 2/2] sctp: disable BH before calling udp_tunnel_xmit_skb()
From: Alexander Martyniuk @ 2026-06-25  8:24 UTC (permalink / raw)
  To: stable, gregkh
  Cc: Alexander Martyniuk, Marcelo Ricardo Leitner, Xin Long,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Weiming Shi, linux-sctp, netdev, linux-kernel
In-Reply-To: <20260625082442.96390-1-alexevgmart@gmail.com>

From: Xin Long <lucien.xin@gmail.com>

commit 2cd7e6971fc2787408ceef17906ea152791448cf upstream.

udp_tunnel_xmit_skb() / udp_tunnel6_xmit_skb() are expected to run with
BH disabled.  After commit 6f1a9140ecda ("add xmit recursion limit to
tunnel xmit functions"), on the path:

  udp(6)_tunnel_xmit_skb() -> ip(6)tunnel_xmit()

dev_xmit_recursion_inc()/dec() must stay balanced on the same CPU.

Without local_bh_disable(), the context may move between CPUs, which can
break the inc/dec pairing. This may lead to incorrect recursion level
detection and cause packets to be dropped in ip(6)_tunnel_xmit() or
__dev_queue_xmit().

Fix it by disabling BH around both IPv4 and IPv6 SCTP UDP xmit paths.

In my testing, after enabling the SCTP over UDP:

  # ip net exec ha sysctl -w net.sctp.udp_port=9899
  # ip net exec ha sysctl -w net.sctp.encap_port=9899
  # ip net exec hb sysctl -w net.sctp.udp_port=9899
  # ip net exec hb sysctl -w net.sctp.encap_port=9899

  # ip net exec ha iperf3 -s

- without this patch:

  # ip net exec hb iperf3 -c 192.168.0.1 --sctp
  [  5]   0.00-10.00  sec  37.2 MBytes  31.2 Mbits/sec  sender
  [  5]   0.00-10.00  sec  37.1 MBytes  31.1 Mbits/sec  receiver

- with this patch:

  # ip net exec hb iperf3 -c 192.168.0.1 --sctp
  [  5]   0.00-10.00  sec  3.14 GBytes  2.69 Gbits/sec  sender
  [  5]   0.00-10.00  sec  3.14 GBytes  2.69 Gbits/sec  receiver

Fixes: 6f1a9140ecda ("net: add xmit recursion limit to tunnel xmit functions")
Fixes: 046c052b475e ("sctp: enable udp tunneling socks")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/c874a8548221dcd56ff03c65ba75a74e6cf99119.1776017727.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Martyniuk <alexevgmart@gmail.com>
---
 net/sctp/ipv6.c     | 2 ++
 net/sctp/protocol.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index b4c321bad033..b45cc51dfc35 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -261,9 +261,11 @@ static int sctp_v6_xmit(struct sk_buff *skb, struct sctp_transport *t)
 	skb_set_inner_ipproto(skb, IPPROTO_SCTP);
 	label = ip6_make_flowlabel(sock_net(sk), skb, fl6->flowlabel, true, fl6);
 
+	local_bh_disable();
 	udp_tunnel6_xmit_skb(dst, sk, skb, NULL, &fl6->saddr, &fl6->daddr,
 			     tclass, ip6_dst_hoplimit(dst), label,
 			     sctp_sk(sk)->udp_port, t->encap_port, false);
+	local_bh_enable();
 	return 0;
 }
 
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 39ca5403d4d7..6ea15361088b 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1086,9 +1086,11 @@ static inline int sctp_v4_xmit(struct sk_buff *skb, struct sctp_transport *t)
 	skb_reset_inner_mac_header(skb);
 	skb_reset_inner_transport_header(skb);
 	skb_set_inner_ipproto(skb, IPPROTO_SCTP);
+	local_bh_disable();
 	udp_tunnel_xmit_skb(dst_rtable(dst), sk, skb, fl4->saddr,
 			    fl4->daddr, dscp, ip4_dst_hoplimit(dst), df,
 			    sctp_sk(sk)->udp_port, t->encap_port, false, false);
+	local_bh_enable();
 	return 0;
 }
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH 6.12 1/2] net: ipv6: Make udp_tunnel6_xmit_skb() void
From: Alexander Martyniuk @ 2026-06-25  8:24 UTC (permalink / raw)
  To: stable, gregkh
  Cc: Alexander Martyniuk, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, David Ahern,
	Marcelo Ricardo Leitner, Xin Long, Jon Maloy, Ying Xue, netdev,
	linux-kernel, linux-sctp, tipc-discussion, Petr Machata,
	Ido Schimmel, Nikolay Aleksandrov
In-Reply-To: <20260625082442.96390-1-alexevgmart@gmail.com>

From: Petr Machata <petrm@nvidia.com>

commit 6a7d88ca15f73c5c570c372238f71d63da1fda55 upstream.

The function always returns zero, thus the return value does not carry any
signal. Just make it void.

Most callers already ignore the return value. However:

- Refold arguments of the call from sctp_v6_xmit() so that they fit into
  the 80-column limit.

- tipc_udp_xmit() initializes err from the return value, but that should
  already be always zero at that point. So there's no practical change, but
  elision of the assignment prompts a couple more tweaks to clean up the
  function.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/7facacf9d8ca3ca9391a4aee88160913671b868d.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Martyniuk <alexevgmart@gmail.com>
---
 include/net/udp_tunnel.h  | 14 +++++++-------
 net/ipv6/ip6_udp_tunnel.c | 15 +++++++--------
 net/sctp/ipv6.c           |  7 ++++---
 net/tipc/udp_media.c      | 10 +++++-----
 4 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index 6e2c5c77031f..8ed36ec520d7 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -152,13 +152,13 @@ void udp_tunnel_xmit_skb(struct rtable *rt, struct sock *sk, struct sk_buff *skb
 			 __be16 df, __be16 src_port, __be16 dst_port,
 			 bool xnet, bool nocheck);
 
-int udp_tunnel6_xmit_skb(struct dst_entry *dst, struct sock *sk,
-			 struct sk_buff *skb,
-			 struct net_device *dev,
-			 const struct in6_addr *saddr,
-			 const struct in6_addr *daddr,
-			 __u8 prio, __u8 ttl, __be32 label,
-			 __be16 src_port, __be16 dst_port, bool nocheck);
+void udp_tunnel6_xmit_skb(struct dst_entry *dst, struct sock *sk,
+			  struct sk_buff *skb,
+			  struct net_device *dev,
+			  const struct in6_addr *saddr,
+			  const struct in6_addr *daddr,
+			  __u8 prio, __u8 ttl, __be32 label,
+			  __be16 src_port, __be16 dst_port, bool nocheck);
 
 void udp_tunnel_sock_release(struct socket *sock);
 
diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index 2acf1bb93fc0..f22eff2ba77c 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -74,13 +74,13 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
 }
 EXPORT_SYMBOL_GPL(udp_sock_create6);
 
-int udp_tunnel6_xmit_skb(struct dst_entry *dst, struct sock *sk,
-			 struct sk_buff *skb,
-			 struct net_device *dev,
-			 const struct in6_addr *saddr,
-			 const struct in6_addr *daddr,
-			 __u8 prio, __u8 ttl, __be32 label,
-			 __be16 src_port, __be16 dst_port, bool nocheck)
+void udp_tunnel6_xmit_skb(struct dst_entry *dst, struct sock *sk,
+			  struct sk_buff *skb,
+			  struct net_device *dev,
+			  const struct in6_addr *saddr,
+			  const struct in6_addr *daddr,
+			  __u8 prio, __u8 ttl, __be32 label,
+			  __be16 src_port, __be16 dst_port, bool nocheck)
 {
 	struct udphdr *uh;
 	struct ipv6hdr *ip6h;
@@ -109,7 +109,6 @@ int udp_tunnel6_xmit_skb(struct dst_entry *dst, struct sock *sk,
 	ip6h->saddr	  = *saddr;
 
 	ip6tunnel_xmit(sk, skb, dev);
-	return 0;
 }
 EXPORT_SYMBOL_GPL(udp_tunnel6_xmit_skb);
 
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 9f835e674c59..b4c321bad033 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -261,9 +261,10 @@ static int sctp_v6_xmit(struct sk_buff *skb, struct sctp_transport *t)
 	skb_set_inner_ipproto(skb, IPPROTO_SCTP);
 	label = ip6_make_flowlabel(sock_net(sk), skb, fl6->flowlabel, true, fl6);
 
-	return udp_tunnel6_xmit_skb(dst, sk, skb, NULL, &fl6->saddr,
-				    &fl6->daddr, tclass, ip6_dst_hoplimit(dst),
-				    label, sctp_sk(sk)->udp_port, t->encap_port, false);
+	udp_tunnel6_xmit_skb(dst, sk, skb, NULL, &fl6->saddr, &fl6->daddr,
+			     tclass, ip6_dst_hoplimit(dst), label,
+			     sctp_sk(sk)->udp_port, t->encap_port, false);
+	return 0;
 }
 
 /* Returns the dst cache entry for the given source and destination ip
diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index 258d6aa4f21a..1b8d6bbf8a8e 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -172,7 +172,7 @@ static int tipc_udp_xmit(struct net *net, struct sk_buff *skb,
 			 struct udp_media_addr *dst, struct dst_cache *cache)
 {
 	struct dst_entry *ndst;
-	int ttl, err = 0;
+	int ttl, err;
 
 	local_bh_disable();
 	ndst = dst_cache_get(cache);
@@ -217,13 +217,13 @@ static int tipc_udp_xmit(struct net *net, struct sk_buff *skb,
 			dst_cache_set_ip6(cache, ndst, &fl6.saddr);
 		}
 		ttl = ip6_dst_hoplimit(ndst);
-		err = udp_tunnel6_xmit_skb(ndst, ub->ubsock->sk, skb, NULL,
-					   &src->ipv6, &dst->ipv6, 0, ttl, 0,
-					   src->port, dst->port, false);
+		udp_tunnel6_xmit_skb(ndst, ub->ubsock->sk, skb, NULL,
+				     &src->ipv6, &dst->ipv6, 0, ttl, 0,
+				     src->port, dst->port, false);
 #endif
 	}
 	local_bh_enable();
-	return err;
+	return 0;
 
 tx_error:
 	local_bh_enable();
-- 
2.43.0


^ permalink raw reply related

* RE: [PATCH] ice: propagate ETH56G deskew read errors
From: Jagielski, Jedrzej @ 2026-06-25  7:55 UTC (permalink / raw)
  To: Pengpeng Hou, Nguyen, Anthony L, Kitszel, Przemyslaw
  Cc: Andrew Lunn, davem@davemloft.net, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Richard Cochran, intel-wired-lan@lists.osuosl.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260625030305.85304-1-pengpeng@iscas.ac.cn>

From: Pengpeng Hou <pengpeng@iscas.ac.cn> 
Sent: Thursday, June 25, 2026 5:03 AM

>ice_ptp_calc_deskew_eth56g() returns a u32 deskew value, but it also
>returns the negative read_poll_timeout() error when the DESKEW valid bit
>never appears. That converts the negative error into a large unsigned
>deskew contribution, which can then be folded into the RX timestamp
>offset and programmed into hardware.
>
>Return the deskew value through an output parameter and propagate the
>read error from ice_phy_set_offsets_eth56g() instead of using it as
>offset data.

Hi
looks like fix so please add fixes tag

>
>Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
>---
> drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 27 +++++++++++++++------
> 1 file changed, 19 insertions(+), 8 deletions(-)
>
>diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
>index 8e5f97835954..bd2e31b816a8 100644
>--- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
>+++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
>@@ -1736,17 +1736,21 @@ static u32 ice_ptp_calc_bitslip_eth56g(struct ice_hw *hw, u8 port, u32 bs,
>  * @ds: deskew multiplier
>  * @rs: RS-FEC enabled
>  * @spd: link speed
>+ * @deskew: calculated deskew value
>  *
>- * Return: calculated deskew value
>+ * Return: 0 on success, negative error code otherwise

please state it clear that @deskew is also an output 

>  */
>-static u32 ice_ptp_calc_deskew_eth56g(struct ice_hw *hw, u8 port, u32 ds,
>-				      bool rs, enum ice_eth56g_link_spd spd)
>+static int ice_ptp_calc_deskew_eth56g(struct ice_hw *hw, u8 port, u32 ds,
>+				      bool rs, enum ice_eth56g_link_spd spd,
>+				      u32 *deskew)
> {
> 	u32 deskew_i, deskew_f;
> 	int err;
> 
>-	if (!ds)
>+	if (!ds) {
>+		*deskew = 0;
> 		return 0;
>+	}
> 
> 	read_poll_timeout(ice_read_ptp_reg_eth56g, err,
> 			  FIELD_GET(PHY_REG_DESKEW_0_VALID, deskew_i), 500,
>@@ -1766,7 +1770,9 @@ static u32 ice_ptp_calc_deskew_eth56g(struct ice_hw *hw, u8 port, u32 ds,
> 	deskew_i = FIELD_PREP(ICE_ETH56G_MAC_CFG_RX_OFFSET_INT, deskew_i);
> 	/* Shift 3 fractional bits to the end of the integer part */
> 	deskew_f <<= ICE_ETH56G_MAC_CFG_FRAC_W - PHY_REG_DESKEW_0_RLEVEL_FRAC_W;
>-	return mul_u32_u32_fx_q9(deskew_i | deskew_f, ds);
>+	*deskew = mul_u32_u32_fx_q9(deskew_i | deskew_f, ds);
>+
>+	return 0;
> }
> 
> /**
>@@ -1789,6 +1795,7 @@ static int ice_phy_set_offsets_eth56g(struct ice_hw *hw, u8 port,
> {
> 	u32 rx_offset, tx_offset, bs_ds;
> 	bool onestep, sfd;
>+	int err;
> 
> 	onestep = hw->ptp.phy.eth56g.onestep_ena;
> 	sfd = hw->ptp.phy.eth56g.sfd_ena;
>@@ -1805,11 +1812,15 @@ static int ice_phy_set_offsets_eth56g(struct ice_hw *hw, u8 port,
> 	if (sfd)
> 		rx_offset = add_u32_u32_fx(rx_offset, cfg->rx_offset.sfd);
> 
>-	if (spd < ICE_ETH56G_LNK_SPD_40G)
>+	if (spd < ICE_ETH56G_LNK_SPD_40G) {
> 		bs_ds = ice_ptp_calc_bitslip_eth56g(hw, port, bs_ds, fc, rs,
> 						    spd);
>-	else
>-		bs_ds = ice_ptp_calc_deskew_eth56g(hw, port, bs_ds, rs, spd);
>+	} else {
>+		err = ice_ptp_calc_deskew_eth56g(hw, port, bs_ds, rs, spd,
>+						 &bs_ds);
>+		if (err)
>+			return err;
>+	}
> 	rx_offset = add_u32_u32_fx(rx_offset, bs_ds);
> 	rx_offset &= ICE_ETH56G_MAC_CFG_RX_OFFSET_INT |
> 		     ICE_ETH56G_MAC_CFG_RX_OFFSET_FRAC;
>-- 
>2.50.1 (Apple Git-155)



^ permalink raw reply

* Re: [net] ipv6: honor per-interface proxy_ndp in forward and NA paths
From: Ido Schimmel @ 2026-06-25  7:53 UTC (permalink / raw)
  To: Chenguang Zhao
  Cc: David Ahern, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netdev
In-Reply-To: <20260623085600.396401-1-zhaochenguang@kylinos.cn>

Subject prefix is incorrect. See:

https://docs.kernel.org/process/maintainer-netdev.html

On Tue, Jun 23, 2026 at 04:56:00PM +0800, Chenguang Zhao wrote:
> ndisc_recv_ns() has always checked both devconf_all and idev->cnf for
> proxy_ndp, but ip6_forward() and ndisc_recv_na() only looked at the
> global setting.
> 
> Honor per-interface proxy_ndp in both places to match the NS path and
> allow setups that only enable proxy_ndp on specific interfaces.
> 
> Fixes: fbea49e1e240 ("[IPV6] NDISC: Add proxy_ndp sysctl.")
> Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>

Given that this never worked and that the patch changes a 20 years old
user-visible behavior, I prefer that you target it at net-next (without
the Fixes tag) when it opens next week.

Also, did you look into why these "XXX" comments were added in the
original commit from 2006? I *assume* that it's because back then both
ndisc_recv_na() and ip6_forward() were missing an idev, unlike
ndisc_recv_ns().

> ---
>  net/ipv6/ip6_output.c | 4 ++--
>  net/ipv6/ndisc.c      | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 368e4fa3b43c..c4ca4a813479 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -579,8 +579,8 @@ int ip6_forward(struct sk_buff *skb)
>  		return -ETIMEDOUT;
>  	}
>  
> -	/* XXX: idev->cnf.proxy_ndp? */
> -	if (READ_ONCE(net->ipv6.devconf_all->proxy_ndp) &&
> +	if ((READ_ONCE(net->ipv6.devconf_all->proxy_ndp) ||
> +	     (idev && READ_ONCE(idev->cnf.proxy_ndp))) &&
>  	    pneigh_lookup(&nd_tbl, net, &hdr->daddr, skb->dev)) {

Note that idev doesn't necessarily correspond to the device with which
the neighbour lookup is performed (skb->dev). See 0857d6f8c759d.

vrf_ip6_rcv() does not modify skb->dev for neighbour discovery packets,
so this happens to be OK in this case, but you need to explain this in
the commit message.

>  		int proxied = ip6_forward_proxy_check(skb);
>  
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index f867ec8d3d90..e03e94681738 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -1096,9 +1096,9 @@ static enum skb_drop_reason ndisc_recv_na(struct sk_buff *skb)
>  		 */
>  		if (lladdr && !memcmp(lladdr, dev->dev_addr, dev->addr_len) &&
>  		    READ_ONCE(net->ipv6.devconf_all->forwarding) &&
> -		    READ_ONCE(net->ipv6.devconf_all->proxy_ndp) &&
> +		    (READ_ONCE(net->ipv6.devconf_all->proxy_ndp) ||
> +		     (idev && READ_ONCE(idev->cnf.proxy_ndp))) &&
>  		    pneigh_lookup(&nd_tbl, net, &msg->target, dev)) {
> -			/* XXX: idev->cnf.proxy_ndp */
>  			goto out;
>  		}
>  
> -- 
> 2.25.1
> 

^ permalink raw reply

* Re: [PATCH iwl v3] ice: retry reading NVM if admin queue returns EBUSY
From: Przemek Kitszel @ 2026-06-25  7:53 UTC (permalink / raw)
  To: Robert Malz, Simon Horman, Grzegorz Nitka
  Cc: anthony.l.nguyen, intel-wired-lan, netdev
In-Reply-To: <CADcc-bydFL4KNDQEznStE41NFXuCey9S+kyXg0usbonwyWpiAQ@mail.gmail.com>

>> This is an AI-generated review of your patch. The human sending this


>> [Severity: Low]
>> Does this check allow an extra retry execution? Since retry_cnt starts at 0,
>> using a strictly greater-than check allows the loop to retry 4 times
>> (5 total executions) instead of the 3 executions intended by
>> ICE_SQ_SEND_MAX_EXECUTE.
>>
>>> +                             break;
>>> +                     ice_debug(hw, ICE_DBG_NVM,
>>> +                               "NVM read EBUSY error, retry %d\n",
>>> +                               retry_cnt + 1);
>>> +                     last_cmd = false;
>>> +                     ice_release_nvm(hw);
>>> +                     msleep(ICE_SQ_SEND_DELAY_TIME_MS);
>>
> The review is correct that this allows one extra execution. However, I
> intentionally kept this verbatim from the OOT driver (1.15.4) to

This OOT change was made entirely by Grzegorz, for E825C. Would be great 
to acknowledge his authorship (+ SoB ofc, below).
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>

What was the platform that you have encountered the bug on?

> retain the significant test exposure it has there. The absolute worst
> case is one harmless extra retry before giving up. If you strongly
> prefer strict adherence to ICE_SQ_SEND_MAX_EXECUTE, I can update the
> condition in a v4.

Please update (to have one attempt less).

> 
>> [Severity: High]
>> Does dropping the NVM lock here violate the caller's assumption of mutual
>> exclusion?
>>
>> Callers of ice_read_flat_nvm() typically acquire the NVM lock beforehand to
>> ensure an atomic read of multiple sectors without interleaved writes. By
>> explicitly dropping the lock during the msleep() window, another thread, VF,
>> or PF could acquire the NVM lock and modify the NVM before the read resumes,
>> potentially resulting in reading torn or inconsistent NVM data.
>>
>>> +                     status = ice_acquire_nvm(hw, ICE_RES_READ);
>>> +                     if (status)
>>> +                             break;
>>
> 
> I don't think this is an issue. The NVM resource (req_res 0x0008) is
> an ownership lease, not a content snapshot. The FW grants it with a
> hold timeout (see the timeout field handling in ice_aq_req_res). It
> never guaranteed a point-in-time view of the flash.
> Very similar behavior already exists in ice_devlink_nvm_snapshot(),
> which reads the NVM in chunks via ice_read_flat_nvm() and deliberately
> acquires/releases the lock between chunks (with a comment explaining
> the FW reclaim).
> Additionally, the lock is only dropped on the failure path, where
> neither bytes_read nor offset have been advanced. The driver simply
> re-reads the failed chunk after re-acquiring. So no data that is
> returned to the caller is committed across the drop window.
> Overall the lock drop does not weaken any guarantee callers actually
> have. If we wanted to introduce a true snapshot guarantee, that would
> probably require different approach and separate patch.
> 
> 
>> [Severity: High]
>> Can this lead to an unbalanced NVM resource lock release and potential data
>> corruption?
>>
>> If ice_acquire_nvm() fails here, the loop breaks and returns an error with the
>> lock currently unheld. However, callers like ice_read_flash_module() are
>> unaware of this internal lock state change and will unconditionally call
>> ice_release_nvm() upon failure:
>>
>> drivers/net/ethernet/intel/ice/ice_nvm.c:ice_read_flash_module() {
>>          status = ice_acquire_nvm(hw, ICE_RES_READ);
>>          if (status)
>>                  return status;
>>
>>          status = ice_read_flat_nvm(hw, start + offset, &length, data, false);
>>
>>          ice_release_nvm(hw);
>>      ...
>>
>> Because firmware tracks resource locks per-PF rather than per-OS-thread, this
>> errant second release could forcefully strip the lock from another thread on
>> the same PF that successfully acquired it, exposing the NVM to concurrent
>> modification.
>>
> 
> Agreed, this might be a real bug, and the one of the three I think is
> worth investigating.
> If ice_acquire_nvm() fails after the drop, ice_read_flat_nvm() returns
> with the lock unheld while callers unconditionally call
> ice_release_nvm(), so a stray release is issued.
> 
> On probability, though, the window is very small. Reaching it requires
> sustained EBUSY across the retry budget plus a failed re-acquire
> (which itself polls up to ICE_NVM_TIMEOUT), and concurrently another
> requester taking the lock. Most reads happen during init (ice_probe,
> and reset/rebuild via ice_init_nvm), and NVM writes only happen on an
> already initialized driver. The devlink/ethtool nvm_read paths are
> also exposed, but hitting this race would require precise timing
> against a concurrent NVM owner on the device.
> 
> I'd prefer to keep the scope of this patch limited to the EBUSY retry
> path and not take on the unbalanced-release fix here. A proper fix
> should change the lock-ownership contract of ice_read_flat_nvm() (on
> error, the lock must be released by ice_read_flat_nvm(), callers
> release only on success) and update all callers. Code change sould be
> simple for all callers but ice_discover_flash_size(), it intentionally
> holds one lease across a read loop and would need to re-acquire after
> each expected boundary failure.
> 
> Given how small the original window is, I'd rather not trade tested
> OOT behavior for the risk of a complex unbalanced NVM lock fix. I
> actually have a patch mostly ready that fixes the lock-ownership
> contract, but I really don't like it. It changes the design of
> ice_read_flat_nvm(), making it less intuitive for callers. More
> importantly, I just don't have the resources or test coverage right
> now to properly verify such change.
> 
> However, I can modify the failure path for ice_acquire_nvm inside
> ice_read_flat_nvm. Instead of bailing out immediately, we can just
> retry it within the existing retry budget. In this case, the
> probability of leaving ice_read_flat_nvm without holding the lock is
> reduced even further without needing a refactor.
> 
> Please let me know what you think about my thought process on this.

I think that both AI-reported issues against the lock are valid
concerns.

I think that sleep was the actual fix, and re-locking were merely
a necessity due to their expiration (as you said).

A proper fix would be to just increase lock-timeout to accommodate all
attempts (and still do the retries&sleep, but without unlocking).


> 
> 
> 
>>> +                     retry_cnt++;
>>> +             } else {
>>> +                     bytes_read += read_size;
>>> +                     offset += read_size;
>>> +                     retry_cnt = 0;
>>> +             }
>>>        } while (!last_cmd);
>>>
>>>        *length = bytes_read;
> 
> Thanks,
> Robert


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox