Netdev List


* Re: [External Mail] [PATCH v2 1/7] net: wwan: t9xx: Add PCIe core
From: Jakub Kicinski @ 2026-06-24 23:35 UTC (permalink / raw)
  To: Wu. JackBB (GSM)
  Cc: Loic Poulain, Sergey Ryazanov, Johannes Berg, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Wen-Zhi Huang,
	Shi-Wei Yeh, Minano Tseng, Matthias Brugger,
	AngeloGioacchino Del Regno, Simon Horman, Jonathan Corbet,
	Shuah Khan, linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-mediatek@lists.infradead.org, linux-doc@vger.kernel.org
In-Reply-To: <b02c0e1e9f0449f2b819197e4329373b@compal.com>

On Wed, 24 Jun 2026 09:15:17 +0000 Wu. JackBB (GSM) wrote:
> ================================================================================================================================================================
> This message may contain information which is private, privileged or confidential of Compal Electronics, Inc. If you are not the intended recipient of this message, please notify the sender and destroy/delete the message. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon this information, by persons or entities other than the intended recipient is prohibited.
> ================================================================================================================================================================

If you want to do anything upstream you have to get rid of this first.


* [PATCH net v3] sctp: add INIT verification after cookie unpacking
From: Xin Long @ 2026-06-24 22:53 UTC (permalink / raw)
  To: network dev, linux-sctp
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
	Marcelo Ricardo Leitner

In the SCTP handshake, the INIT chunk is initially processed by the server
and embedded into the cookie carried in INIT-ACK. The client then
returns this cookie via COOKIE-ECHO, where the server unpacks it and
reconstructs the original INIT chunk.

When cookie authentication is enabled, the cookie contents are protected
against tampering, so reusing the unpacked INIT without re-verification
is safe.

However, when cookie authentication is disabled, the reconstructed INIT
can no longer be trusted. In this case, the INIT must be explicitly
validated after unpacking to avoid processing potentially tampered data.

Add sctp_verify_init() checks after cookie unpacking in COOKIE-ECHO
processing paths (sctp_sf_do_5_1D_ce() and sctp_sf_do_5_2_4_dupcook())
when cookie_auth_enable is disabled. On failure, the new association is
freed and the packet is discarded.

Also tighten cookie validation in sctp_unpack_cookie() by verifying the
embedded chunk type is SCTP_CID_INIT before treating it as an INIT
chunk.

Finally, update sctp_verify_init() to validate parameter bounds using
the actual embedded INIT length instead of chunk->chunk_end, since the
INIT stored in COOKIE-ECHO may not span the entire chunk buffer.
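
A minimal user-space sketch of that last bounds check, for illustration
only. It is not the kernel code: PAD4(), rd_be16() and
params_end_at_chunk_end() are hypothetical names, and the size of the fixed
part is passed in by the caller and assumed to be a multiple of 4. The walk
is limited by the length found in the embedded chunk header, and the final
comparison uses the padded embedded length rather than the end of the
enclosing buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a length up to the next multiple of 4 (SCTP pads chunks and params). */
#define PAD4(len) (((size_t)(len) + 3u) & ~(size_t)3u)

#define CHUNK_HDR_LEN 4u	/* type(1) + flags(1) + length(2) */
#define PARAM_HDR_LEN 4u	/* type(2) + length(2) */

/* Read a big-endian 16-bit value without alignment assumptions. */
static uint16_t rd_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical helper: `chunk` points at an embedded chunk inside a larger
 * buffer with `avail` bytes left. `fixed_len` is the size of the chunk
 * header plus fixed fields that precede the TLV parameters.
 * Returns 1 if the parameters end exactly at chunk + PAD4(chunk length),
 * 0 otherwise.
 */
static int params_end_at_chunk_end(const uint8_t *chunk, size_t avail,
				   size_t fixed_len)
{
	size_t chunk_len, off;

	assert(chunk != NULL);
	if (avail < CHUNK_HDR_LEN || fixed_len < CHUNK_HDR_LEN)
		return 0;

	chunk_len = rd_be16(chunk + 2);
	if (chunk_len < fixed_len || PAD4(chunk_len) > avail)
		return 0;

	off = fixed_len;
	while (off + PARAM_HDR_LEN <= chunk_len) {
		size_t plen = rd_be16(chunk + off + 2);

		/* A parameter must hold its header and fit in the chunk. */
		if (plen < PARAM_HDR_LEN || plen > chunk_len - off)
			break;
		off += PAD4(plen);
	}

	/* Compare against the embedded chunk's own end, not the buffer's. */
	return off == PAD4(chunk_len);
}
```

With a 20-byte fixed part, one 8-byte parameter and a chunk length of 28,
the check passes even when the chunk sits inside a larger buffer. A
parameter length that overruns the chunk makes it fail.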

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
v2:
  - Because of sctp_abort_on_init_err() param change in patch 1/2,
    pass cid and not chunk.
  - Use SCTP_PAD4() around ntohs(peer_init->chunk_hdr.length) when
    checking param.v in sctp_verify_init() to make Sashiko happy.
v3:
  - Validate the embedded INIT chunk type in sctp_unpack_cookie(), as
    noted by Sashiko.
  - Discard the packet if embedded INIT chunk validation fails,
    consistent with malformed cookie handling.
---
 net/sctp/sm_make_chunk.c |  5 ++++-
 net/sctp/sm_statefuns.c  | 36 +++++++++++++++++++++++++++++++++---
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 41958b8e59fd..8adac9e0cd66 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1761,6 +1761,8 @@ struct sctp_association *sctp_unpack_cookie(
 	bear_cookie = &cookie->c;
 
 	ch = (struct sctp_chunkhdr *)(bear_cookie + 1);
+	if (ch->type != SCTP_CID_INIT)
+		goto malformed;
 	chlen = ntohs(ch->length);
 	if (chlen < sizeof(struct sctp_init_chunk))
 		goto malformed;
@@ -2298,7 +2300,8 @@ int sctp_verify_init(struct net *net, const struct sctp_endpoint *ep,
 	 * VIOLATION error.  We build the ERROR chunk here and let the normal
 	 * error handling code build and send the packet.
 	 */
-	if (param.v != (void *)chunk->chunk_end)
+	if (param.v != (void *)peer_init +
+		       SCTP_PAD4(ntohs(peer_init->chunk_hdr.length)))
 		return sctp_process_inv_paramlength(asoc, param.p, chunk, errp);
 
 	/* The only missing mandatory param possible today is
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 8e920cef0858..d23d935e128e 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -707,11 +707,12 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
 					 struct sctp_cmd_seq *commands)
 {
 	struct sctp_ulpevent *ev, *ai_ev = NULL, *auth_ev = NULL;
+	struct sctp_chunk *err_chk_p = NULL;
 	struct sctp_association *new_asoc;
 	struct sctp_init_chunk *peer_init;
 	struct sctp_chunk *chunk = arg;
-	struct sctp_chunk *err_chk_p;
 	struct sctp_chunk *repl;
+	enum sctp_cid cid;
 	struct sock *sk;
 	int error = 0;
 
@@ -785,6 +786,19 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
 		}
 	}
 
+	peer_init = (struct sctp_init_chunk *)(chunk->subh.cookie_hdr + 1);
+	cid = peer_init->chunk_hdr.type;
+	if (!sctp_sk(sk)->cookie_auth_enable &&
+	    !sctp_verify_init(net, ep, asoc, cid, peer_init, chunk,
+			      &err_chk_p)) {
+		sctp_association_free(new_asoc);
+		if (err_chk_p)
+			sctp_chunk_free(err_chk_p);
+		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
+	}
+	if (err_chk_p)
+		sctp_chunk_free(err_chk_p);
+
 	if (security_sctp_assoc_request(new_asoc, chunk->head_skb ?: chunk->skb)) {
 		sctp_association_free(new_asoc);
 		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
@@ -798,7 +812,6 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
 	/* This is a brand-new association, so these are not yet side
 	 * effects--it is safe to run them here.
 	 */
-	peer_init = (struct sctp_init_chunk *)(chunk->subh.cookie_hdr + 1);
 	if (!sctp_process_init(new_asoc, chunk,
 			       &chunk->subh.cookie_hdr->c.peer_addr,
 			       peer_init, GFP_ATOMIC))
@@ -2215,10 +2228,12 @@ enum sctp_disposition sctp_sf_do_5_2_4_dupcook(
 					void *arg,
 					struct sctp_cmd_seq *commands)
 {
+	struct sctp_chunk *err_chk_p = NULL;
 	struct sctp_association *new_asoc;
+	struct sctp_init_chunk *peer_init;
 	struct sctp_chunk *chunk = arg;
 	enum sctp_disposition retval;
-	struct sctp_chunk *err_chk_p;
+	enum sctp_cid cid;
 	int error = 0;
 	char action;
 
@@ -2287,6 +2302,21 @@ enum sctp_disposition sctp_sf_do_5_2_4_dupcook(
 	switch (action) {
 	case 'A': /* Association restart. */
 	case 'B': /* Collision case B. */
+		peer_init = (struct sctp_init_chunk *)
+				(chunk->subh.cookie_hdr + 1);
+		cid = peer_init->chunk_hdr.type;
+		if (!sctp_sk(ep->base.sk)->cookie_auth_enable &&
+		    !sctp_verify_init(net, ep, asoc, cid, peer_init, chunk,
+				      &err_chk_p)) {
+			sctp_association_free(new_asoc);
+			if (err_chk_p)
+				sctp_chunk_free(err_chk_p);
+			return sctp_sf_pdiscard(net, ep, asoc, type, arg,
+						commands);
+		}
+		if (err_chk_p)
+			sctp_chunk_free(err_chk_p);
+		fallthrough;
 	case 'D': /* Collision case D. */
 		/* Update socket peer label if first association. */
 		if (security_sctp_assoc_request((struct sctp_association *)asoc,
-- 
2.47.1



* RE: [EXTERNAL] Re: [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-06-24 22:50 UTC (permalink / raw)
  To: Simon Horman
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Long Li,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Konstantin Taranov,
	ernis@linux.microsoft.com, dipayanroy@linux.microsoft.com,
	kees@kernel.org, jacob.e.keller@intel.com,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260619090514.GT827683@horms.kernel.org>

> From: Simon Horman <horms@kernel.org>
> Sent: Friday, June 19, 2026 2:05 AM
> > ...
> > Also validate the packet length reported in the RX CQE before using it as
> > a DMA sync length or passing it to skb processing. The CQE is supplied
> > by the device and should not be blindly trusted by Confidential VMs.
> 
> I think this last part warrants being split out into a separate patch.

Sorry for the late reply. I split v1 into two patches for v2, which I just posted:
https://lwn.net/ml/linux-kernel/20260624222605.1794719-1-decui@microsoft.com/
 
Thanks,
Dexuan


* [PATCH net v2] nfc: nci: fix uninit-value in nci_core_init_rsp_packet()
From: Samuel Page @ 2026-06-24 22:44 UTC (permalink / raw)
  To: David Heidelberg
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, oe-linux-nfc, netdev, linux-kernel, stable

The CORE_INIT_RSP handlers walk the response using length fields taken
from the packet itself, without checking they stay within skb->len:

 - v1 computes
	rsp_2 = skb->data + 6 + rsp_1->num_supported_rf_interfaces;
   from the on-wire (unclamped) interface count and then dereferences
   rsp_2, and memcpy()s the advertised interfaces - both can run past the
   received data;
 - v2 walks supported_rf_interfaces[], advancing the cursor by an
   in-packet rf_extension_cnt with no bound.

A short CORE_INIT_RSP therefore makes the parser read past the packet
(into the uninitialised tail of the RX skb); the values are stored into
struct nci_dev and consumed while bringing the device up:

  BUG: KMSAN: uninit-value in nci_dev_up+0x10f3/0x1720
   nci_dev_up+0x10f3/0x1720
   nfc_dev_up+0x187/0x380
   nfc_genl_dev_up+0xdc/0x1a0
   genl_rcv_msg+0x5d4/0x9e0
   netlink_rcv_skb+0x28f/0x530
  Uninit was stored to memory at:
   nci_rsp_packet+0x68f/0x2310
   nci_rx_work+0x25f/0x5d0
  Uninit was created at:
   __alloc_skb+0x540/0xd40
   virtual_ncidev_write+0x65/0x210

Validate the response length before parsing or storing the
variable-length parts, rejecting truncated responses with
NCI_STATUS_SYNTAX_ERROR.  In v1 the check is done before
num_supported_rf_interfaces is stored into ndev, so a truncated response
cannot leave ndev->num_supported_rf_interfaces holding the unclamped
on-wire count, which nci_init_complete_req() would otherwise use as a
bound for the fixed-size supported_rf_interfaces[] array.
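
The same pattern as a small user-space sketch, for illustration only. The
names v1_len_ok(), parse_ifaces() and MAX_IFACES are hypothetical and the
layout is simplified: each entry is one interface byte, one extension-count
byte, then that many extension bytes. Every in-packet length is checked
against the end of the buffer before the cursor moves:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IFACES 4	/* stand-in for NCI_MAX_SUPPORTED_RF_INTERFACES */

/*
 * v1-style check: the fixed header, the advertised interface bytes and the
 * trailing fixed part must all fit before anything is parsed or stored.
 */
static int v1_len_ok(size_t len, size_t hdr_len, uint8_t n_ifaces,
		     size_t trailer_len)
{
	return len >= hdr_len + n_ifaces + trailer_len;
}

/*
 * v2-style walk: copies up to MAX_IFACES interface ids into `out`.
 * Returns the number of ids stored, or -1 if the buffer is truncated.
 */
static int parse_ifaces(const uint8_t *data, size_t len, uint8_t count,
			uint8_t out[MAX_IFACES])
{
	const uint8_t *cur = data;
	const uint8_t *end = data + len;
	int n = 0;

	assert(data != NULL && out != NULL);
	if (count > MAX_IFACES)
		count = MAX_IFACES;	/* clamp before use as an array bound */

	while (n < count) {
		uint8_t ext_cnt;

		/* interface byte + extension-count byte must be present */
		if (end - cur < 2)
			return -1;
		out[n++] = *cur++;

		/* skip the extension bytes only if they are all present */
		ext_cnt = *cur++;
		if (ext_cnt > end - cur)
			return -1;
		cur += ext_cnt;
	}
	return n;
}
```

A complete two-entry buffer parses to two ids. Cutting the buffer short, in
either an entry header or its extension bytes, returns -1 instead of reading
past the end.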

Fixes: 6a2968aaf50c ("NFC: basic NCI protocol implementation")
Fixes: bcd684aace34 ("net/nfc/nci: Support NCI 2.x initial sequence")
Cc: stable@vger.kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Assisted-by: Bynario AI
Signed-off-by: Samuel Page <sam@bynar.io>
---
v2: validate the response length before storing num_supported_rf_interfaces
    into @ndev.  In v1 the unclamped on-wire count was stored first and the
    length check returned early on a truncated response, leaving
    ndev->num_supported_rf_interfaces > NCI_MAX_SUPPORTED_RF_INTERFACES; a
    subsequent CORE_INIT completion then walked it in nci_init_complete_req(),
    which the syzbot CI run on v1 flagged as a UBSAN array-index-out-of-bounds.
    https://ci.syzbot.org/series/2a9a8657-37a3-4dce-8cb5-2035027791dd
    v1: https://lore.kernel.org/all/20260623222402.175798-1-sam@bynar.io

 net/nfc/nci/rsp.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/net/nfc/nci/rsp.c b/net/nfc/nci/rsp.c
index 9eeb862825c5..6b2fa6bdbd14 100644
--- a/net/nfc/nci/rsp.c
+++ b/net/nfc/nci/rsp.c
@@ -50,11 +50,25 @@ static u8 nci_core_init_rsp_packet_v1(struct nci_dev *ndev,
 	const struct nci_core_init_rsp_1 *rsp_1 = (void *)skb->data;
 	const struct nci_core_init_rsp_2 *rsp_2;
 
+	if (skb->len < sizeof(*rsp_1))
+		return NCI_STATUS_SYNTAX_ERROR;
+
 	pr_debug("status 0x%x\n", rsp_1->status);
 
 	if (rsp_1->status != NCI_STATUS_OK)
 		return rsp_1->status;
 
+	/*
+	 * supported_rf_interfaces[] and the trailing nci_core_init_rsp_2 are
+	 * addressed using the on-wire (unclamped) interface count, so the
+	 * response must be long enough for both before any of it is parsed or
+	 * stored into @ndev - otherwise a truncated response would leave
+	 * ndev->num_supported_rf_interfaces holding the unclamped count.
+	 */
+	if (skb->len < sizeof(*rsp_1) +
+	    rsp_1->num_supported_rf_interfaces + sizeof(*rsp_2))
+		return NCI_STATUS_SYNTAX_ERROR;
+
 	ndev->nfcc_features = __le32_to_cpu(rsp_1->nfcc_features);
 	ndev->num_supported_rf_interfaces = rsp_1->num_supported_rf_interfaces;
 
@@ -88,9 +102,13 @@ static u8 nci_core_init_rsp_packet_v2(struct nci_dev *ndev,
 {
 	const struct nci_core_init_rsp_nci_ver2 *rsp = (void *)skb->data;
 	const u8 *supported_rf_interface = rsp->supported_rf_interfaces;
+	const u8 *end = skb->data + skb->len;
 	u8 rf_interface_idx = 0;
 	u8 rf_extension_cnt = 0;
 
+	if (skb->len < sizeof(*rsp))
+		return NCI_STATUS_SYNTAX_ERROR;
+
 	pr_debug("status %x\n", rsp->status);
 
 	if (rsp->status != NCI_STATUS_OK)
@@ -104,10 +122,16 @@ static u8 nci_core_init_rsp_packet_v2(struct nci_dev *ndev,
 		    NCI_MAX_SUPPORTED_RF_INTERFACES);
 
 	while (rf_interface_idx < ndev->num_supported_rf_interfaces) {
-		ndev->supported_rf_interfaces[rf_interface_idx++] = *supported_rf_interface++;
+		/* one interface byte + one extension-count byte must be present */
+		if (end - supported_rf_interface < 2)
+			return NCI_STATUS_SYNTAX_ERROR;
+		ndev->supported_rf_interfaces[rf_interface_idx++] =
+			*supported_rf_interface++;
 
-		/* skip rf extension parameters */
+		/* skip rf extension parameters, bounded by the packet */
 		rf_extension_cnt = *supported_rf_interface++;
+		if (rf_extension_cnt > end - supported_rf_interface)
+			return NCI_STATUS_SYNTAX_ERROR;
 		supported_rf_interface += rf_extension_cnt;
 	}
 

base-commit: a986fde914d88af47eb78fd29c5d1af7952c3500
-- 
2.54.0



* [PATCH net v2 1/1] net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF
From: Jamal Hadi Salim @ 2026-06-24 22:40 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, horms, jiri, victor, security,
	zdi-disclosures, stable, Jamal Hadi Salim, kernel test robot

The teql master->slaves singly linked list is not protected against
concurrent writes. It can be modified concurrently from teql_master_xmit(),
teql_dequeue(), teql_qdisc_init() and teql_destroy() without holding any list
lock or RCU protection.

zdi-disclosures@trendmicro.com has demonstrated that the qdisc is freed
after an RCU grace period, but teql_master_xmit() running on another
CPU can still hold a stale pointer into the list, resulting in a
slab-use-after-free:

BUG: KASAN: slab-use-after-free in teql_destroy+0x3ca/0x440 linux/net/sched/sch_teql.c:142
Read of size 8 at addr ffff88802923aa80 by task ip/10024

The zdi-disclosures@trendmicro.com repro created concurrent AF_PACKET
senders on a teql device against a thread that repeatedly adds/deletes the
slave qdisc, together with a SLUB spray that reclaims the freed slot; the
resulting UAF is controllable enough to be turned into a read/write
primitive against the freed qdisc object.

The fix:
Add a per-master slaves_lock spinlock that serializes all mutations of
master->slaves and the NEXT_SLAVE() links in teql_destroy() and
teql_qdisc_init(). teql_master_xmit() also takes the same slaves_lock
around those updates.
Annotate master->slaves and the per-slave ->next pointer with __rcu and
use the appropriate RCU accessors everywhere they are touched:
rcu_assign_pointer() on the writer side (under slaves_lock),
rcu_dereference_protected() for the writer-side loads (also under
slaves_lock), rcu_dereference_bh() for the loads in teql_master_xmit() and
rtnl_dereference() for the loads in teql_master_open()/teql_master_mtu(),
which run under RTNL.
Pair this with rcu_read_lock_bh()/rcu_read_unlock_bh() around the list
traversal in teql_master_xmit(), so that readers either observe a fully
linked list or are deferred until the in-flight mutation completes. The two
early-return paths in teql_master_xmit() are updated to release the RCU-bh
read-side critical section before returning, since leaving it held would
disable BH on that CPU for good.
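
The unlink step in teql_destroy() can be sketched in user space as below,
for illustration only. Locking and RCU are deliberately left out, and
struct node and ring_remove() are hypothetical names. The sketch shows only
the circular-list bookkeeping: unlink the victim, move the head if it
pointed at the victim, and report through a flag when the list became empty
so that the follow-up work can run after the lock is dropped (the role
reset_master_queue plays in the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;	/* circular: the last node points to the first */
};

/*
 * Remove `victim` from the circular list whose head is *headp.
 * Returns true if the victim was found and unlinked. *emptied is set to
 * true when the removal left the list empty, so the caller can do any
 * follow-up work outside of its critical section.
 */
static bool ring_remove(struct node **headp, struct node *victim,
			bool *emptied)
{
	struct node *prev;

	assert(headp != NULL && victim != NULL && emptied != NULL);
	*emptied = false;

	prev = *headp;
	if (!prev)
		return false;

	do {
		struct node *q = prev->next;
		struct node *next;

		if (q != victim) {
			prev = q;
			continue;	/* jumps to the loop condition */
		}

		next = q->next;
		prev->next = next;
		if (q == *headp) {
			*headp = next;
			if (q == next) {	/* victim was the only node */
				*headp = NULL;
				*emptied = true;
			}
		}
		return true;
	} while (prev != *headp);

	return false;
}
```

In a real concurrent version the whole function body would run under the
writer lock, and the pointer stores would need the publication semantics
that rcu_assign_pointer() provides in the patch.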

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: zdi-disclosures@trendmicro.com
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202606241501.XQBMu4b8-lkp@intel.com/
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
Changes in v2:
Address sparse issues found by kernel test robot <lkp@intel.com>
 - rcu annotated struct teql_master->slaves and struct teql_sched_data->next
 - teql_destroy/teql_qdisc_init: replace all READ/WRITE_ONCE() with rcu_dereference_protected()/rcu_assign_pointer()
---
 net/sched/sch_teql.c | 119 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 80 insertions(+), 39 deletions(-)

diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index e7bbc9e5174d..f4654f1b1200 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -52,7 +52,8 @@
 struct teql_master {
 	struct Qdisc_ops qops;
 	struct net_device *dev;
-	struct Qdisc *slaves;
+	struct Qdisc __rcu	*slaves;
+	spinlock_t		slaves_lock;	/* serializes writes to ->slaves */
 	struct list_head master_list;
 	unsigned long	tx_bytes;
 	unsigned long	tx_packets;
@@ -61,7 +62,7 @@ struct teql_master {
 };
 
 struct teql_sched_data {
-	struct Qdisc *next;
+	struct Qdisc __rcu	*next;
 	struct teql_master *m;
 	struct sk_buff_head q;
 };
@@ -101,7 +102,9 @@ teql_dequeue(struct Qdisc *sch)
 	if (skb == NULL) {
 		struct net_device *m = qdisc_dev(q);
 		if (m) {
-			dat->m->slaves = sch;
+			spin_lock_bh(&dat->m->slaves_lock);
+			rcu_assign_pointer(dat->m->slaves, sch);
+			spin_unlock_bh(&dat->m->slaves_lock);
 			netif_wake_queue(m);
 		}
 	} else {
@@ -132,34 +135,49 @@ teql_destroy(struct Qdisc *sch)
 	struct Qdisc *q, *prev;
 	struct teql_sched_data *dat = qdisc_priv(sch);
 	struct teql_master *master = dat->m;
+	struct netdev_queue *txq = NULL;
+	bool reset_master_queue = false;
 
 	if (!master)
 		return;
 
-	prev = master->slaves;
+	spin_lock_bh(&master->slaves_lock);
+	prev = rcu_dereference_protected(master->slaves,
+					 lockdep_is_held(&master->slaves_lock));
 	if (prev) {
 		do {
-			q = NEXT_SLAVE(prev);
-			if (q == sch) {
-				NEXT_SLAVE(prev) = NEXT_SLAVE(q);
-				if (q == master->slaves) {
-					master->slaves = NEXT_SLAVE(q);
-					if (q == master->slaves) {
-						struct netdev_queue *txq;
-
-						txq = netdev_get_tx_queue(master->dev, 0);
-						master->slaves = NULL;
-
-						dev_reset_queue(master->dev,
-								txq, NULL);
-					}
-				}
-				skb_queue_purge(&dat->q);
-				break;
+			struct Qdisc *head, *next;
+
+			q = rcu_dereference_protected(NEXT_SLAVE(prev),
+						      lockdep_is_held(&master->slaves_lock));
+			if (q != sch) {
+				prev = q;
+				continue;
 			}
 
-		} while ((prev = q) != master->slaves);
+			next = rcu_dereference_protected(NEXT_SLAVE(q),
+							 lockdep_is_held(&master->slaves_lock));
+			rcu_assign_pointer(NEXT_SLAVE(prev), next);
+
+			head = rcu_dereference_protected(master->slaves,
+							 lockdep_is_held(&master->slaves_lock));
+			if (q == head) {
+				rcu_assign_pointer(master->slaves, next);
+				if (q == next) {
+					txq = netdev_get_tx_queue(master->dev, 0);
+					rcu_assign_pointer(master->slaves, NULL);
+					reset_master_queue = true;
+				}
+			}
+			skb_queue_purge(&dat->q);
+			break;
+		} while (prev != rcu_dereference_protected(master->slaves,
+							   lockdep_is_held(&master->slaves_lock)));
 	}
+	spin_unlock_bh(&master->slaves_lock);
+
+	if (reset_master_queue)
+		dev_reset_queue(master->dev, txq, NULL);
 }
 
 static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
@@ -168,6 +186,7 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 	struct net_device *dev = qdisc_dev(sch);
 	struct teql_master *m = (struct teql_master *)sch->ops;
 	struct teql_sched_data *q = qdisc_priv(sch);
+	struct Qdisc *first;
 
 	if (dev->hard_header_len > m->dev->hard_header_len)
 		return -EINVAL;
@@ -184,7 +203,9 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 
 	skb_queue_head_init(&q->q);
 
-	if (m->slaves) {
+	spin_lock_bh(&m->slaves_lock);
+	first = rcu_dereference_protected(m->slaves, lockdep_is_held(&m->slaves_lock));
+	if (first) {
 		if (m->dev->flags & IFF_UP) {
 			if ((m->dev->flags & IFF_POINTOPOINT &&
 			     !(dev->flags & IFF_POINTOPOINT)) ||
@@ -192,8 +213,10 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 			     !(dev->flags & IFF_BROADCAST)) ||
 			    (m->dev->flags & IFF_MULTICAST &&
 			     !(dev->flags & IFF_MULTICAST)) ||
-			    dev->mtu < m->dev->mtu)
+			    dev->mtu < m->dev->mtu) {
+				spin_unlock_bh(&m->slaves_lock);
 				return -EINVAL;
+			}
 		} else {
 			if (!(dev->flags&IFF_POINTOPOINT))
 				m->dev->flags &= ~IFF_POINTOPOINT;
@@ -204,14 +227,17 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 			if (dev->mtu < m->dev->mtu)
 				m->dev->mtu = dev->mtu;
 		}
-		q->next = NEXT_SLAVE(m->slaves);
-		NEXT_SLAVE(m->slaves) = sch;
+		rcu_assign_pointer(q->next,
+				   rcu_dereference_protected(NEXT_SLAVE(first),
+							     lockdep_is_held(&m->slaves_lock)));
+		rcu_assign_pointer(NEXT_SLAVE(first), sch);
 	} else {
-		q->next = sch;
-		m->slaves = sch;
+		rcu_assign_pointer(q->next, sch);
+		rcu_assign_pointer(m->slaves, sch);
 		m->dev->mtu = dev->mtu;
 		m->dev->flags = (m->dev->flags&~FMASK)|(dev->flags&FMASK);
 	}
+	spin_unlock_bh(&m->slaves_lock);
 	return 0;
 }
 
@@ -285,7 +311,9 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 	int subq = skb_get_queue_mapping(skb);
 	struct sk_buff *skb_res = NULL;
 
-	start = master->slaves;
+	rcu_read_lock_bh();
+
+	start = rcu_dereference_bh(master->slaves);
 
 restart:
 	nores = 0;
@@ -317,10 +345,14 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 				    netdev_start_xmit(skb, slave, slave_txq, false) ==
 				    NETDEV_TX_OK) {
 					__netif_tx_unlock(slave_txq);
-					master->slaves = NEXT_SLAVE(q);
+					spin_lock_bh(&master->slaves_lock);
+					rcu_assign_pointer(master->slaves,
+							   rcu_dereference_bh(NEXT_SLAVE(q)));
+					spin_unlock_bh(&master->slaves_lock);
 					netif_wake_queue(dev);
 					master->tx_packets++;
 					master->tx_bytes += length;
+					rcu_read_unlock_bh();
 					return NETDEV_TX_OK;
 				}
 				__netif_tx_unlock(slave_txq);
@@ -329,14 +361,18 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 				busy = 1;
 			break;
 		case 1:
-			master->slaves = NEXT_SLAVE(q);
+			spin_lock_bh(&master->slaves_lock);
+			rcu_assign_pointer(master->slaves,
+					   rcu_dereference_bh(NEXT_SLAVE(q)));
+			spin_unlock_bh(&master->slaves_lock);
+			rcu_read_unlock_bh();
 			return NETDEV_TX_OK;
 		default:
 			nores = 1;
 			break;
 		}
 		__skb_pull(skb, skb_network_offset(skb));
-	} while ((q = NEXT_SLAVE(q)) != start);
+	} while ((q = rcu_dereference_bh(NEXT_SLAVE(q))) != start);
 
 	if (nores && skb_res == NULL) {
 		skb_res = skb;
@@ -345,29 +381,32 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	if (busy) {
 		netif_stop_queue(dev);
+		rcu_read_unlock_bh();
 		return NETDEV_TX_BUSY;
 	}
 	master->tx_errors++;
 
 drop:
 	master->tx_dropped++;
+	rcu_read_unlock_bh();
 	dev_kfree_skb(skb);
 	return NETDEV_TX_OK;
 }
 
 static int teql_master_open(struct net_device *dev)
 {
-	struct Qdisc *q;
+	struct Qdisc *q, *first;
 	struct teql_master *m = netdev_priv(dev);
 	int mtu = 0xFFFE;
 	unsigned int flags = IFF_NOARP | IFF_MULTICAST;
 
-	if (m->slaves == NULL)
+	first = rtnl_dereference(m->slaves);
+	if (!first)
 		return -EUNATCH;
 
 	flags = FMASK;
 
-	q = m->slaves;
+	q = first;
 	do {
 		struct net_device *slave = qdisc_dev(q);
 
@@ -389,7 +428,7 @@ static int teql_master_open(struct net_device *dev)
 			flags &= ~IFF_BROADCAST;
 		if (!(slave->flags&IFF_MULTICAST))
 			flags &= ~IFF_MULTICAST;
-	} while ((q = NEXT_SLAVE(q)) != m->slaves);
+	} while ((q = rtnl_dereference(NEXT_SLAVE(q))) != first);
 
 	m->dev->mtu = mtu;
 	m->dev->flags = (m->dev->flags&~FMASK) | flags;
@@ -417,14 +456,15 @@ static void teql_master_stats64(struct net_device *dev,
 static int teql_master_mtu(struct net_device *dev, int new_mtu)
 {
 	struct teql_master *m = netdev_priv(dev);
-	struct Qdisc *q;
+	struct Qdisc *q, *first;
 
-	q = m->slaves;
+	first = rtnl_dereference(m->slaves);
+	q = first;
 	if (q) {
 		do {
 			if (new_mtu > qdisc_dev(q)->mtu)
 				return -EINVAL;
-		} while ((q = NEXT_SLAVE(q)) != m->slaves);
+		} while ((q = rtnl_dereference(NEXT_SLAVE(q))) != first);
 	}
 
 	WRITE_ONCE(dev->mtu, new_mtu);
@@ -444,6 +484,7 @@ static __init void teql_master_setup(struct net_device *dev)
 	struct teql_master *master = netdev_priv(dev);
 	struct Qdisc_ops *ops = &master->qops;
 
+	spin_lock_init(&master->slaves_lock);
 	master->dev	= dev;
 	ops->priv_size  = sizeof(struct teql_sched_data);
 


* Re: [PATCH net] net: udp_tunnel: fix use-after-free by refcounting udp_tunnel_nic
From: Jakub Kicinski @ 2026-06-24 22:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Paolo Abeni, Simon Horman, Ido Schimmel,
	David Ahern, netdev, eric.dumazet, Yue Sun
In-Reply-To: <20260624145722.083632b6@kernel.org>

On Wed, 24 Jun 2026 14:57:22 -0700 Jakub Kicinski wrote:
> so we just need:
> 
> diff --git a/net/ipv4/udp_tunnel_nic.c b/net/ipv4/udp_tunnel_nic.c
> index 9944ed923ddf..d7db89a222f8 100644
> --- a/net/ipv4/udp_tunnel_nic.c
> +++ b/net/ipv4/udp_tunnel_nic.c
> @@ -863,6 +863,7 @@ static void
>  udp_tunnel_nic_unregister(struct net_device *dev, struct udp_tunnel_nic *utn)
>  {
>         const struct udp_tunnel_nic_info *info = dev->udp_tunnel_nic_info;
> +       bool pending;
>  
>         udp_tunnel_nic_lock(dev);
>  
> @@ -899,12 +900,14 @@ udp_tunnel_nic_unregister(struct net_device *dev, struct udp_tunnel_nic *utn)
>          * from the work which we will boot immediately.
>          */
>         udp_tunnel_nic_flush(dev, utn);
> +
> +       pending = utn->work_pending;
>         udp_tunnel_nic_unlock(dev);
>  
>         /* Wait for the work to be done using the state, netdev core will
>          * retry unregister until we give up our reference on this device.
>          */
> -       if (utn->work_pending)
> +       if (pending)
>                 return;
>  
>         udp_tunnel_nic_free(utn);

Maybe not even.. by definition the driver should not race with its own
netdev's unregister? I don't get what the race is..


* [PATCH net v2 1/2] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable
In-Reply-To: <20260624222605.1794719-1-decui@microsoft.com>

MANA allocates RX buffers from page pool fragments when frag_count is
greater than 1. In that case the buffers remain DMA mapped by page pool
and the RX completion path does not call dma_unmap_single(). As a result,
the implicit sync-for-CPU normally performed by dma_unmap_single() is
missing before the packet data is passed to the networking stack.

This breaks RX on configurations which require explicit DMA syncing, for
example when booted with swiotlb=force.

Fix this by recording the page pool page and DMA sync offset when the RX
buffer is allocated, and syncing the received packet range for CPU access
before handing the RX buffer to the stack.
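
Why the missing sync matters with bounce buffers can be shown with a toy
user-space model, for illustration only. This is not how swiotlb or the
page pool is implemented. struct toy_mapping and both helpers are
hypothetical: the "device" writes into a bounce copy, the CPU reads a
separate copy, and only an explicit sync of the received range makes the
data visible:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64

/* Toy bounce-buffered mapping: the device sees `bounce`, the CPU sees `cpu`. */
struct toy_mapping {
	unsigned char bounce[BUF_SIZE];
	unsigned char cpu[BUF_SIZE];
};

/* The "device" writes a received packet at a given offset. */
static void toy_device_write(struct toy_mapping *m, size_t off,
			     const void *src, size_t len)
{
	assert(m != NULL && off <= BUF_SIZE && len <= BUF_SIZE - off);
	memcpy(m->bounce + off, src, len);
}

/* Make `len` bytes at `off` visible to the CPU, like a sync-for-CPU call. */
static void toy_sync_for_cpu(struct toy_mapping *m, size_t off, size_t len)
{
	assert(m != NULL && off <= BUF_SIZE && len <= BUF_SIZE - off);
	memcpy(m->cpu + off, m->bounce + off, len);
}
```

Until toy_sync_for_cpu() runs, the CPU-side copy still holds its old
contents. The offset recorded at allocation time and the received length
are exactly the two values needed to sync only the bytes that were written.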

Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Changes since v1:
    v1 is split into two patches in the v2.
    Add Haiyang's Reviewed-by.

 drivers/net/ethernet/microsoft/mana/mana_en.c | 39 +++++++++++++++----
 include/net/mana/mana.h                       |  8 ++++
 2 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c9b1df1ed109..1875bffd82b7 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2044,12 +2044,16 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
 }
 
 static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
-			     dma_addr_t *da, bool *from_pool)
+			     dma_addr_t *da, bool *from_pool,
+			     struct page **pp_page, u32 *dma_sync_offset)
 {
 	struct page *page;
 	u32 offset;
 	void *va;
+
 	*from_pool = false;
+	*pp_page = NULL;
+	*dma_sync_offset = 0;
 
 	/* Don't use fragments for jumbo frames or XDP where it's 1 fragment
 	 * per page.
@@ -2087,31 +2091,47 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 	va  = page_to_virt(page) + offset;
 	*da = page_pool_get_dma_addr(page) + offset + rxq->headroom;
 	*from_pool = true;
+	*pp_page = page;
+	*dma_sync_offset = offset + rxq->headroom;
 
 	return va;
 }
 
 /* Allocate frag for rx buffer, and save the old buf */
 static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
-			       struct mana_recv_buf_oob *rxoob, void **old_buf,
-			       bool *old_fp)
+			       struct mana_recv_buf_oob *rxoob, u32 pktlen,
+			       void **old_buf, bool *old_fp)
 {
+	struct page *pp_page;
+	u32 dma_sync_offset;
 	bool from_pool;
 	dma_addr_t da;
 	void *va;
 
-	va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+	va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+			     &dma_sync_offset);
 	if (!va)
 		return;
-	if (!rxoob->from_pool || rxq->frag_count == 1)
+	if (!rxoob->from_pool || rxq->frag_count == 1) {
 		dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
 				 DMA_FROM_DEVICE);
+	} else {
+		/* The page pool maps the whole page and only syncs for device
+		 * automatically (PP_FLAG_DMA_SYNC_DEV). Sync the received bytes
+		 * for the CPU before they are read: this is required if DMA
+		 * is incoherent or bounce buffers are used.
+		 */
+		page_pool_dma_sync_for_cpu(rxq->page_pool, rxoob->pp_page,
+					   rxoob->dma_sync_offset, pktlen);
+	}
 	*old_buf = rxoob->buf_va;
 	*old_fp = rxoob->from_pool;
 
 	rxoob->buf_va = va;
 	rxoob->sgl[0].address = da;
 	rxoob->from_pool = from_pool;
+	rxoob->pp_page = pp_page;
+	rxoob->dma_sync_offset = dma_sync_offset;
 }
 
 static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -2170,7 +2190,7 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		rxbuf_oob = &rxq->rx_oobs[curr];
 		WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-		mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+		mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
 
 		/* Unsuccessful refill will have old_buf == NULL.
 		 * In this case, mana_rx_skb() will drop the packet.
@@ -2566,6 +2586,8 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 			    struct mana_rxq *rxq, struct device *dev)
 {
 	struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+	struct page *pp_page = NULL;
+	u32 dma_sync_offset = 0;
 	bool from_pool = false;
 	dma_addr_t da;
 	void *va;
@@ -2573,13 +2595,16 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 	if (mpc->rxbufs_pre)
 		va = mana_get_rxbuf_pre(rxq, &da);
 	else
-		va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+		va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+				     &dma_sync_offset);
 
 	if (!va)
 		return -ENOMEM;
 
 	rx_oob->buf_va = va;
 	rx_oob->from_pool = from_pool;
+	rx_oob->pp_page = pp_page;
+	rx_oob->dma_sync_offset = dma_sync_offset;
 
 	rx_oob->sgl[0].address = da;
 	rx_oob->sgl[0].size = rxq->datasize;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..4111b93169d2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -305,6 +305,14 @@ struct mana_recv_buf_oob {
 
 	void *buf_va;
 	bool from_pool; /* allocated from a page pool */
+	/* head page of the page_pool fragment; valid only when
+	 * from_pool && frag_count > 1.
+	 */
+	struct page *pp_page;
+	/* Fragment offset plus rxq->headroom, passed to
+	 * page_pool_dma_sync_for_cpu().
+	 */
+	u32 dma_sync_offset;
 
 	/* SGL of the buffer going to be sent as part of the work request. */
 	u32 num_sge;
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v2 2/2] net: mana: Validate the packet length reported by the NIC
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable
In-Reply-To: <20260624222605.1794719-1-decui@microsoft.com>

Validate the packet length reported in the RX CQE before using it as a DMA
sync length or passing it to skb processing. The CQE is supplied by the
NIC device and should not be blindly trusted.

Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Changes since v1:
    v1 is split into two patches in v2.
    Add Haiyang's Reviewed-by.
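
For readers skimming the thread, the validation idea can be sketched in
plain userspace C. This is only an illustration under stated assumptions:
struct rx_queue_sketch and rx_len_ok() are hypothetical names, not the
driver's types or functions, and the real patch additionally bumps
work_done and reuses the RX buffer.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for per-queue state; not the MANA driver's struct. */
struct rx_queue_sketch {
	uint32_t datasize;   /* size of each posted RX buffer */
	uint64_t rx_dropped; /* packets rejected by the length check */
	bool warned;         /* emulates a warn-once message */
};

/*
 * Return true if the device-reported length may be used as a sync or
 * copy length. A length larger than the posted buffer is counted as a
 * drop and reported once.
 */
static bool rx_len_ok(struct rx_queue_sketch *q, uint32_t pktlen)
{
	if (pktlen <= q->datasize)
		return true;

	q->rx_dropped++;
	if (!q->warned) {
		q->warned = true;
		fprintf(stderr,
			"Dropped oversized RX packet: len=%u, datasize=%u\n",
			(unsigned int)pktlen, (unsigned int)q->datasize);
	}
	return false;
}
```

The point of the sketch is ordering: the check runs before the length is
handed to anything that reads or syncs the buffer.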

 drivers/net/ethernet/microsoft/mana/mana_en.c | 24 +++++++++++++++----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1875bffd82b7..0b44c51ae6ec 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2190,12 +2190,26 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		rxbuf_oob = &rxq->rx_oobs[curr];
 		WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-		mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
+		if (unlikely(pktlen > rxq->datasize)) {
+			/* Increase it even if mana_rx_skb() isn't called. */
+			rxq->rx_cq.work_done++;
 
-		/* Unsuccessful refill will have old_buf == NULL.
-		 * In this case, mana_rx_skb() will drop the packet.
-		 */
-		mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+			++ndev->stats.rx_dropped;
+			netdev_warn_once(ndev,
+				"Dropped oversized RX packet: len=%u, datasize=%u\n",
+				pktlen, rxq->datasize);
+
+			/* Reuse the RX buffer since rxbuf_oob is unchanged. */
+		} else {
+
+			mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen,
+					   &old_buf, &old_fp);
+
+			/* Unsuccessful refill will have old_buf == NULL.
+			 * In this case, mana_rx_skb() will drop the packet.
+			 */
+			mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+		}
 
 		mana_move_wq_tail(rxq->gdma_rq,
 				  rxbuf_oob->wqe_inf.wqe_size_in_bu);
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v2 0/2] Fix MANA RX with bounce buffering
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma

With swiotlb=force, the MANA NIC fails to work properly due to commit
730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead
of full pages to improve memory efficiency.")

Dipayaan tried to fix this by avoiding page pool frags when bounce
buffering is in use [1][2]. However, that is not a clean solution: no
other NIC drivers need to explicitly check whether bounce buffering is
in use. It is also not good for throughput, since
dma_map_single()/dma_unmap_single() are then called for each incoming
packet.

In fact, page pool frags can still be used with the standard MTU of
1500: all we need is to add page_pool_dma_sync_for_cpu() before the CPU
reads the incoming packet, so I implemented that in v1 [3].

As Simon suggested [4], this version splits v1 into two patches:
Patch 1 adds page_pool_dma_sync_for_cpu().
Patch 2 validates the packet length reported by the NIC.

There is no functional difference between v1 and v2, so I am keeping
Haiyang's Reviewed-by tag in v2.

Please review. Thanks!

Note that, with jumbo MTU and XDP, page pool frags are not used, and
dma_map_single()/dma_unmap_single() are still called for each incoming
packet, causing poor throughput with swiotlb=force; see
mana_get_rxbuf_cfg() and mana_refill_rx_oob() -> mana_get_rxfrag().
The jumbo MTU/XDP issue will be addressed later since that needs more
consideration if we want to use page pool with PP_FLAG_DMA_MAP there:
e.g., for XDP, the received packet can be transmitted in place, i.e. the
same RX buffer can be used as a TX buffer:
mana_rx_skb() -> mana_xdp_tx() -> mana_start_xmit() -> mana_map_skb().

In mana_create_page_pool(), we may have to set pprm.dma_dir to
DMA_BIDIRECTIONAL if XDP is in use:
pprm.dma_dir = mana_xdp_get(mpc) ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;

In the case of XDP, the next issue is that mana_rx_skb() -> ... ->
mana_map_skb() appears to call dma_map_single() on an RX buffer allocated
from a page pool created with PP_FLAG_DMA_MAP, which seems incorrect.
Any thoughts?

[1] https://lore.kernel.org/all/ae91hyrLf4n23XE6@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/#r
[2] https://lore.kernel.org/all/ae9pxvJfkAZYfKMf@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
[3] https://lore.kernel.org/all/20260618035029.249361-1-decui@microsoft.com/
[4] https://lore.kernel.org/all/20260619090514.GT827683@horms.kernel.org/

Dexuan Cui (2):
  net: mana: Sync page pool RX frags for CPU
  net: mana: Validate the packet length reported by the NIC

 drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
 include/net/mana/mana.h                       |  8 +++
 2 files changed, 58 insertions(+), 11 deletions(-)

-- 
2.34.1

^ permalink raw reply

* Re: [PATCH net] eth: fbnic: fix race between concurrent hwmon sensor reads
From: Alexander Duyck @ 2026-06-24 22:22 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Zinc Lim, Alexander Duyck, Jakub Kicinski, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, kernel-team, netdev,
	linux-kernel, zinclim
In-Reply-To: <f2ecb668-ec56-4c84-ac97-474a2d761005@lunn.ch>

On Wed, Jun 24, 2026 at 2:56 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Wed, Jun 24, 2026 at 11:51:29PM +0200, Andrew Lunn wrote:
> > On Wed, Jun 24, 2026 at 01:05:14PM -0700, Zinc Lim wrote:
> > > From: Zinc Lim <limzhineng2@gmail.com>
> > >
> > > Reading an hwmon sensor
> >
> > Since this is a hwmon related patch, it is a good idea to Cc: the
> > hwmon Maintainer.
> >
> > I don't know hwmon too well, I've never had to look at locking, but...
> >
> > static int hwmon_thermal_get_temp(struct thermal_zone_device *tz, int *temp)
> > {
> >       struct hwmon_thermal_data *tdata = thermal_zone_device_priv(tz);
> >       struct hwmon_device *hwdev = to_hwmon_device(tdata->dev);
> >       int ret;
> >       long t;
> >
> >       guard(mutex)(&hwdev->lock);
> >
> >       ret = hwdev->chip->ops->read(tdata->dev, hwmon_temp, hwmon_temp_input,
> >                                    tdata->index, &t);
> >
> > Could you explain why this lock in the hwmon core is not sufficient?
> > It could be I'm reading it wrong, but it looks like the lock is shared
> > by all hwmon instances registered by a
> > hwmon_device_register_with_info() call.
>
> Documentation/hwmon/hwmon-kernel-api.rst
>
> When using ``[devm_]hwmon_device_register_with_info()`` to register the
> hardware monitoring device, accesses using the associated access functions
> are serialised by the hardware monitoring core. If a driver needs locking
> for other functions such as interrupt handlers or for attributes which are
> fully implemented in the driver, hwmon_lock() and hwmon_unlock() can be used
> to ensure that calls to those functions are serialized.

I did some digging and it looks like 3ad2a7b9b15d ("hwmon: Serialize
accesses in hwmon core") landed in 6.18. Our patch was put together
based on issues reported on an earlier kernel version, which explains
how this got missed. So what we have is a redundant fix. We can drop
this for upstream and probably look at pulling the upstream fix into
our kernels that were having the issues.

Thanks for the review feedback.

- Alex

^ permalink raw reply

* Re: [PATCH v29 4/5] sfc: obtain and map cxl range using devm_cxl_probe_mem
From: Dan Williams (nvidia) @ 2026-06-24 22:10 UTC (permalink / raw)
  To: alejandro.lucero-palau, linux-cxl, netdev, dan.j.williams,
	edward.cree, davem, kuba, pabeni, edumazet, dave.jiang
  Cc: Alejandro Lucero, Edward Cree
In-Reply-To: <20260622124010.2192888-5-alejandro.lucero-palau@amd.com>

alejandro.lucero-palau@ wrote:
> From: Alejandro Lucero <alucerop@amd.com>
> 
> Use the core API to safely obtain the CXL range linked to an HDM committed
> by the BIOS. Map that range for use as the ctpio buffer.
> 
> A user space action such as a sysfs unbind or removal of the core cxl
> modules triggers sfc driver device detachment. That case does not race
> with this mapping, because the mapping is done during driver probe and is
> therefore protected by the device lock against those user space actions.
> 
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Acked-by: Edward Cree <ecree.xilinx@gmail.com>
> ---
>  drivers/net/ethernet/sfc/efx.c     |  2 ++
>  drivers/net/ethernet/sfc/efx_cxl.c | 23 +++++++++++++++++++++++
>  drivers/net/ethernet/sfc/efx_cxl.h |  3 +++
>  3 files changed, 28 insertions(+)
> 
> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
> index 61cbb6cfc360..3806cd3dd7f4 100644
> --- a/drivers/net/ethernet/sfc/efx.c
> +++ b/drivers/net/ethernet/sfc/efx.c
> @@ -984,6 +984,7 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
>  	efx_fini_io(efx);
>  
>  	probe_data = container_of(efx, struct efx_probe_data, efx);
> +	efx_cxl_exit(probe_data);
>  
>  	pci_dbg(efx->pci_dev, "shutdown successful\n");
>  
> @@ -1242,6 +1243,7 @@ static int efx_pci_probe(struct pci_dev *pci_dev,
>  	return 0;
>  
>   fail3:
> +	efx_cxl_exit(probe_data);
>  	efx_fini_io(efx);
>   fail2:
>  	efx_fini_struct(efx);
> diff --git a/drivers/net/ethernet/sfc/efx_cxl.c b/drivers/net/ethernet/sfc/efx_cxl.c
> index 18b535b3ea40..3e7c950f83e9 100644
> --- a/drivers/net/ethernet/sfc/efx_cxl.c
> +++ b/drivers/net/ethernet/sfc/efx_cxl.c
> @@ -18,6 +18,7 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
>  {
>  	struct efx_nic *efx = &probe_data->efx;
>  	struct pci_dev *pci_dev = efx->pci_dev;
> +	struct range cxl_pio_range;
>  	struct efx_cxl *cxl;
>  	u16 dvsec;
>  	int rc;
> @@ -73,9 +74,31 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
>  		return -ENODEV;
>  	}
>  
> +	cxl->cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &cxl_pio_range);
> +	if (IS_ERR(cxl->cxlmd)) {
> +		pci_err(pci_dev, "CXL accel memdev creation failed\n");
> +		return PTR_ERR(cxl->cxlmd);
> +	}
> +
> +	cxl->ctpio_cxl = ioremap_wc(cxl_pio_range.start,
> +				    range_len(&cxl_pio_range));
> +	if (!cxl->ctpio_cxl) {
> +		pci_err(pci_dev, "CXL ioremap region (%pra) failed\n",
> +			&cxl_pio_range);
> +		return -ENOMEM;
> +	}
> +
>  	probe_data->cxl = cxl;
>  
>  	return 0;
>  }
>  
> +void efx_cxl_exit(struct efx_probe_data *probe_data)
> +{

If you are going to have an explicit efx_cxl_exit() then I would also
add an explicit unregistration of the memdev. This would also fix the
Sashiko report about pci_disable_device() running while the cxl_memdev
is still registered. Unfortunately, mixing devm and explicit unwind is
always fraught.

Let me know if this passes your testing, and I can send it out as a
standalone patch. You could also use it to unwind if the ioremap()
fails.
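
As a rough illustration of the "release one managed action early" pattern
the patch relies on, here is a userspace sketch. Everything in it is
hypothetical (struct dev_sketch, add_action(), release_action(),
release_all()); it only mimics the general devres idea that an action can
be unlinked and run immediately, so that it does not run a second time
when the device is later torn down.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_ACTIONS 8

typedef void (*action_fn)(void *);

struct action {
	action_fn fn;
	void *data;
};

/* Hypothetical stand-in for a device with a list of managed actions. */
struct dev_sketch {
	struct action actions[MAX_ACTIONS];
	size_t count;
};

/* Register a cleanup action to run at teardown. */
static bool add_action(struct dev_sketch *dev, action_fn fn, void *data)
{
	if (dev->count == MAX_ACTIONS)
		return false;
	dev->actions[dev->count].fn = fn;
	dev->actions[dev->count].data = data;
	dev->count++;
	return true;
}

/* Unlink the matching action and run it now; false if not registered. */
static bool release_action(struct dev_sketch *dev, action_fn fn, void *data)
{
	for (size_t i = 0; i < dev->count; i++) {
		if (dev->actions[i].fn != fn || dev->actions[i].data != data)
			continue;
		for (size_t j = i + 1; j < dev->count; j++)
			dev->actions[j - 1] = dev->actions[j];
		dev->count--;
		fn(data);
		return true;
	}
	return false;
}

/* Teardown: run whatever is still registered, newest first. */
static void release_all(struct dev_sketch *dev)
{
	while (dev->count) {
		struct action a = dev->actions[--dev->count];

		a.fn(a.data);
	}
}

/* Example action: count how often it was invoked. */
static void count_call(void *data)
{
	++*(int *)data;
}
```

In this sketch an explicit exit path would call release_action() for the
memdev action, after which release_all() no longer touches it.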

-- 8< --
From 47940f7d6de6dd1039d0e445eb944ca96ad5eb3a Mon Sep 17 00:00:00 2001
From: Dan Williams <djbw@kernel.org>
Date: Wed, 24 Jun 2026 11:47:24 -0700
Subject: [PATCH] cxl: Add devm_cxl_remove_mem()

Undo devm_cxl_probe_mem() without causing the host driver to unload.

The expectation of devm_cxl_probe_mem() is that upon attach the caller
depends on the CXL topology not being torn down. If the topology is torn
down, that is escalated to a 'remove' event on the caller. However, if
the caller wants to cancel its interest in CXL operation, it should be
able to do so without that side effect.

This is useful for setup scenarios where CXL successfully attaches, but
some other event precludes CXL operation.

Cc: Alejandro Lucero <alucerop@amd.com>
Signed-off-by: Dan Williams <djbw@kernel.org>
---
 include/cxl/cxl.h         |  1 +
 drivers/cxl/core/memdev.c | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 016c74fb747c..cff2922b0d4a 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -226,4 +226,5 @@ struct cxl_dev_state *_devm_cxl_dev_state_create(struct device *dev,
 
 struct cxl_memdev *devm_cxl_probe_mem(struct cxl_dev_state *cxlds,
 				      struct range *range);
+void devm_cxl_remove_mem(struct cxl_memdev *cxlmd);
 #endif /* __CXL_CXL_H__ */
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 33a3d2e7b13a..aa35876f0067 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -1185,6 +1185,31 @@ struct cxl_memdev *__devm_cxl_add_memdev(struct cxl_dev_state *cxlds,
 }
 EXPORT_SYMBOL_FOR_MODULES(__devm_cxl_add_memdev, "cxl_mem");
 
+static struct cxl_attach_region *cancel_probe_mem_teardown(struct cxl_memdev *cxlmd)
+{
+	struct cxl_attach_region *attach;
+
+	guard(device)(&cxlmd->dev);
+	attach = container_of(cxlmd->attach, typeof(*attach), attach);
+	cxlmd->attach = NULL;
+	return attach;
+}
+
+void devm_cxl_remove_mem(struct cxl_memdev *cxlmd)
+{
+	struct cxl_attach_region *attach;
+	struct cxl_dev_state *cxlds;
+
+	if (IS_ERR(cxlmd))
+		return;
+
+	attach = cancel_probe_mem_teardown(cxlmd);
+	cxlds = cxlmd->cxlds;
+	devm_release_action(cxlds->dev, cxl_memdev_unregister, cxlmd);
+	devm_kfree(cxlds->dev, attach);
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_remove_mem, "CXL");
+
 static void sanitize_teardown_notifier(void *data)
 {
 	struct cxl_memdev_state *mds = data;
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH v4] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Michael S. Tsirkin @ 2026-06-24 22:10 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <CAMuQ4bX-iDvcUOPPY+NLz95tkRJYwWqvzAr=U48uNaub_HZLGw@mail.gmail.com>

On Wed, Jun 24, 2026 at 03:02:02PM -0700, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. On 32-bit systems,
> where unsigned long is narrower than u64, that addition can overflow and
> the code can pin and map fewer pages than the requested IOTLB range.
> 
> Reject sizes that overflow the unsigned long page-count calculation.
> 
> Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
> vhost_vdpa_pa_unmap()")

still 2 lines?

> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> ---
> Changes in v4:
> - Keep the Fixes tag on one line.
> - Add Michael's Acked-by tag.
> 
> Changes in v3:
> - Add the Fixes tag.
> 
> Changes in v2:
> - Clarify that the overflow is on 32-bit systems.
> - Drop the unrelated memlock check change.
> 
>  drivers/vhost/vdpa.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..38b28ed3d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	unsigned int gup_flags = FOLL_LONGTERM;
>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>  	unsigned long lock_limit, sz2pin, nchunks, i;
> +	unsigned long page_offset;
>  	u64 start = iova;
>  	long pinned;
>  	int ret = 0;
> @@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	if (perm & VHOST_ACCESS_WO)
>  		gup_flags |= FOLL_WRITE;
> 
> -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> +	page_offset = iova & ~PAGE_MASK;
> +	if (size > ULONG_MAX - page_offset) {
> +		ret = -EINVAL;
> +		goto free;
> +	}
> +
> +	npages = PFN_UP(size + page_offset);
>  	if (!npages) {
>  		ret = -EINVAL;
>  		goto free;
> -- 
> 2.54.0


^ permalink raw reply

* Re: [PATCH v12 07/12] static_call: Define EXPORT_STATIC_CALL_FOR_MODULES()
From: Sean Christopherson @ 2026-06-24 22:03 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, KP Singh,
	Jiri Olsa, David S. Miller, David Laight, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, David Ahern, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	Stanislav Fomichev, Hao Luo, Paolo Bonzini, Jonathan Corbet,
	Jason Baron, Alice Ryhl, Steven Rostedt, Ard Biesheuvel,
	Shuah Khan, linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf,
	netdev, linux-doc
In-Reply-To: <20260624214955.6kkivefeuapcocib@desk>

On Wed, Jun 24, 2026, Pawan Gupta wrote:
> On Wed, Jun 24, 2026 at 05:59:19AM -0700, Sean Christopherson wrote:
> > On Tue, Jun 23, 2026, Pawan Gupta wrote:
> > > There is EXPORT_STATIC_CALL_TRAMP() that hides the static key from all
> > > modules. But there is no equivalent of EXPORT_SYMBOL_FOR_MODULES() to
> > > restrict symbol visibility to only certain modules.
> > > 
> > > Add EXPORT_STATIC_CALL_FOR_MODULES(name, mods) that wraps both the key and
> > > the trampoline with EXPORT_SYMBOL_FOR_MODULES(), allowing only a limited
> > > set of modules to see and update the static key.
> > > 
> > > The immediate user is KVM, in the following commit.
> > > 
> > > checkpatch reported below warnings with this change that I believe don't
> > > apply in this case:
> > > 
> > >   include/linux/static_call.h:219: WARNING: Non-declarative macros with multiple statements should be enclosed in a do - while loop
> > >   include/linux/static_call.h:220: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
> > > 
> > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > > ---

...

> > Drat, I forgot about this.  Exporting static call trampolines for KVM came up in
> > another conversation[*].  I had already put together patches to effectively default
> > to exporting only the trampoline, and also to deduplicate this code so that the
> > CONFIG_HAVE_STATIC_CALL_INLINE=y / CONFIG_HAVE_STATIC_CALL=y / CONFIG_HAVE_STATIC_CALL=n
> > implementations don't need to copy+paste the same lines of code.
> > 
> > The attached patches touch a lot more code, and will conflict mightily with KVM
> > changes I want to land in 7.3 (more use of a static_call in KVM).  But if we get
> > them applied (to tip tree) shortly after 7.2-rc1 and provide a topic branch/tag,
> > then there shouldn't be too much juggling needed?
> > 
> > If we want to go with the more aggressive cleanup, I'll formally post the patches.
> > 
> > [*] https://lore.kernel.org/all/ahhoDGUz39KSGZ6o@google.com
> 
> Thanks for the context.
> 
> Earlier making the key ro-after-init came up as an option in a thread with
> Peter. Does it look like a good option to you?

No, it won't work for KVM.  kvm.ko (owner of the keys) updates the keys only when
a vendor module (kvm-intel.ko or kvm-amd.ko) is loaded, and updates keys *every*
time a vendor module is loaded.  So for KVM, the static calls need to be __read_mostly,
not __ro_after_init.

> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> index b610afd1ed55..ea56da8fb446 100644
> --- a/include/linux/static_call.h
> +++ b/include/linux/static_call.h
> @@ -200,6 +200,14 @@ extern long __static_call_return0(void);
>  	};								\
>  	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
>  
> +#define DEFINE_STATIC_CALL_NULL_RO_AFTER_INIT(name, _func)		\
> +	DECLARE_STATIC_CALL(name, _func);				\
> +	struct static_call_key STATIC_CALL_KEY(name) __ro_after_init = {\
> +		.func = _func,						\
> +		.type = 1,						\
> +	};								\
> +	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
> +
>  #define DEFINE_STATIC_CALL_RET0(name, _func)				\
>  	DECLARE_STATIC_CALL(name, _func);				\
>  	struct static_call_key STATIC_CALL_KEY(name) = {		\

^ permalink raw reply

* [PATCH v4] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Yousef Alhouseen @ 2026-06-24 22:02 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez
  Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen

vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. On 32-bit systems,
where unsigned long is narrower than u64, that addition can overflow and
the code can pin and map fewer pages than the requested IOTLB range.

Reject sizes that overflow the unsigned long page-count calculation.

Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
vhost_vdpa_pa_unmap()")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
Changes in v4:
- Keep the Fixes tag on one line.
- Add Michael's Acked-by tag.

Changes in v3:
- Add the Fixes tag.

Changes in v2:
- Clarify that the overflow is on 32-bit systems.
- Drop the unrelated memlock check change.
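
The guard can be tried out in userspace with the sketch below. It is an
illustration only: sketch_pa_map_npages() and SKETCH_PAGE_SIZE are
hypothetical names, the page size is assumed to be 4096, and the round-up
is done by division rather than with the kernel's PFN_UP().

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for the sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Compute how many pages cover [iova, iova + size), refusing inputs
 * whose byte count does not fit in unsigned long. Returns false for
 * overflowing or empty ranges.
 */
static bool sketch_pa_map_npages(uint64_t iova, uint64_t size,
				 unsigned long *npages)
{
	unsigned long page_offset =
		(unsigned long)(iova & (SKETCH_PAGE_SIZE - 1));
	unsigned long total;

	/* Same shape as the check in the patch: reject before adding. */
	if (size > ULONG_MAX - page_offset)
		return false;

	total = (unsigned long)size + page_offset;
	*npages = total / SKETCH_PAGE_SIZE + (total % SKETCH_PAGE_SIZE != 0);
	return *npages != 0;
}
```

Comparing against ULONG_MAX - page_offset avoids performing the addition
that might wrap, which is why the check is written as a subtraction.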

 drivers/vhost/vdpa.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..38b28ed3d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
+	unsigned long page_offset;
 	u64 start = iova;
 	long pinned;
 	int ret = 0;
@@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;

-	npages = PFN_UP(size + (iova & ~PAGE_MASK));
+	page_offset = iova & ~PAGE_MASK;
+	if (size > ULONG_MAX - page_offset) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	npages = PFN_UP(size + page_offset);
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH v2 bpf-next 2/2] selftests/bpf: Add tests for bpf_redirect_peer with BPF_F_EGRESS
From: Paul Chaignon @ 2026-06-24 21:59 UTC (permalink / raw)
  To: Jordan Rife
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Stanislav Fomichev, Jiayuan Chen
In-Reply-To: <20260618182035.43811-3-jordan@jrife.io>

On Thu, Jun 18, 2026 at 11:20:33AM -0700, Jordan Rife wrote:
> Extend redirect tests to cover bpf_redirect_peer(BPF_F_EGRESS). SRC
> redirects to DST using bpf_redirect_peer(BPF_F_EGRESS) then traffic is
> hairpinned into DST using bpf_redirect.
> 
> Signed-off-by: Jordan Rife <jordan@jrife.io>

Acked-by: Paul Chaignon <paul.chaignon@gmail.com>


^ permalink raw reply

* Re: [PATCH v3] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Michael S. Tsirkin @ 2026-06-24 21:59 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <CAMuQ4bV8OeSTOVnAPRh6ygKdogFjqEiDNj1Vbh623KBBkZgxiw@mail.gmail.com>

On Wed, Jun 24, 2026 at 02:56:20PM -0700, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. On 32-bit systems,
> where unsigned long is narrower than u64, that addition can overflow and
> the code can pin and map fewer pages than the requested IOTLB range.
> 
> Reject sizes that overflow the unsigned long page-count calculation.
> 
> Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
> vhost_vdpa_pa_unmap()")

weirdly wrapped. will likely break some tools.

> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> Changes in v3:
> - Add the Fixes tag.
> 
> Changes in v2:
> - Clarify that the overflow is on 32-bit systems.
> - Drop the unrelated memlock check change.
> 
>  drivers/vhost/vdpa.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..38b28ed3d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	unsigned int gup_flags = FOLL_LONGTERM;
>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>  	unsigned long lock_limit, sz2pin, nchunks, i;
> +	unsigned long page_offset;
>  	u64 start = iova;
>  	long pinned;
>  	int ret = 0;
> @@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	if (perm & VHOST_ACCESS_WO)
>  		gup_flags |= FOLL_WRITE;
> 
> -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> +	page_offset = iova & ~PAGE_MASK;
> +	if (size > ULONG_MAX - page_offset) {
> +		ret = -EINVAL;
> +		goto free;
> +	}
> +
> +	npages = PFN_UP(size + page_offset);
>  	if (!npages) {
>  		ret = -EINVAL;
>  		goto free;
> -- 
> 2.54.0


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/2] bpf: Support BPF_F_EGRESS with bpf_redirect_peer
From: Paul Chaignon @ 2026-06-24 21:58 UTC (permalink / raw)
  To: Jordan Rife
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Stanislav Fomichev, Jiayuan Chen
In-Reply-To: <20260618182035.43811-2-jordan@jrife.io>

On Thu, Jun 18, 2026 at 11:20:32AM -0700, Jordan Rife wrote:
> We have several use cases where a pod injects traffic into the datapath
> of another so that the traffic appears to have originated from that
> pod. One such use case is a synthetic flow generator which injects
> synthetic traffic into a pod's datapath to enable dynamic probing and
> debugging. Another is a transparent proxy where connections originating
> from one pod are redirected towards another which proxies that
> connection. The new connection is bound to the IP of the original pod
> using IP_TRANSPARENT and its traffic is injected into that pod's
> datapath and handled as if it had originated there. This can be used for
> mTLS, etc.
> 
> We use bpf_redirect(BPF_F_INGRESS) to direct traffic leaving the proxy,
> flow generator, etc. towards the target pod, ensuring that eBPF programs
> that are meant to intercept traffic leaving that pod are executed.
> However, this doesn't work with netkit.
> 
> With netkit, an ingress redirection from proxy to workload skips eBPF
> programs that are meant to intercept traffic leaving the pod, since they
> reside on the netkit peer device. One workaround is to attach the
> same program to both the netkit peer device and the TCX ingress hook for
> the netkit pair's primary interface, but
> 
> a) This seems hacky and we need to be careful not to run the same
>    program twice for the same skb in cases where we want to pass that
>    traffic to the host stack.
> b) We're trying to keep the proxy redirection / traffic injection
>    systems as modular and separated from Cilium as possible, the system
>    that manages netkit setup and core eBPF programming.
> 
> It would be handy if instead we could redirect traffic directly from
> one netkit peer device to another. This patch proposes an extension
> to bpf_redirect_peer to allow us to do just that.
> 
> With this patch, the BPF_F_EGRESS flag tells bpf_redirect_peer to emit
> the skb in the egress direction of the target interface's peer device.
> While the main use case is netkit, I suppose you could also use this
> mode with veth if, e.g., there were some eBPF programs attached
> to that side of the veth pair that needed to intercept traffic.
> 
>  +---------------------------------------------------------------------+
>  | +-------------------------+         6. bpf_redirect_neigh(eth0)     |
>  | | pod (10.244.0.10)       |           ------------------------      |
>  | |                         |          |                        |     |
>  | |              +--------+ |          |      +---------+       |     |
>  | | 1. packet -->|        | |          |      |         |       |     |
>  | |    leaves ^  | netkit |<===========|======| netkit  |       |     |
>  | |           |  | peer   |=======(eBPF)=====>| primary |       |     |
>  | |           |  |        | |          |      |         |       |     |
>  | |           |  +--------+ |          |      +---------+       |     |
>  | |           |             |          | 2. bpf_redirect        v     |
>  | +-----------|-------------+          |___________________   +-------|
>  |             |                                            |  | eth0  |
>  |             | 5. bpf_redirect_peer(BPF_F_EGRESS)         |  +-------|
>  |             |________________________                    |          |
>  | +-------------------------+          |                   |          |
>  | | proxy (10.244.0.11)     |          |                   |          |
>  | | IP_TRANSPARENT          |          |                   |          |
>  | |              +--------+ |          |      +---------+  |          |
>  | | 3. packet <--|        | |          |      |         |<--          |
>  | |    enters    | netkit |<===========|======| netkit  |             |
>  | |    [proxy]   | peer   |=======(eBPF)=====>| primary |             |
>  | | 4. packet -->|        | |                 |         |             |
>  | |    leaves    +--------+ |                 +---------+             |
>  | |    sip=10.244.0.10      |                                         |
>  | +-------------------------+                                         |
>  +---------------------------------------------------------------------+
> 
> Using the proxy use case as an example, in step 5 we would redirect
> traffic leaving the proxy towards the pod's peer device using
> bpf_redirect_peer(BPF_F_EGRESS).
> 
> As a bonus, since the skb doesn't have to go through the backlog queue
> it can take full advantage of netkit's performance benefits. I set up a
> test where outgoing iperf3 traffic is injected into the datapath of
> another pod using either bpf_redirect_peer(BPF_F_EGRESS) or
> bpf_redirect(BPF_F_INGRESS). I used Cilium's eBPF host routing mode
> which skips the host stack and uses BPF redirect helpers to do all the
> routing.
> 
>   (net.ipv4.tcp_congestion_control=cubic,mtu=1500,100GiB link,Cilium
>    eBPF host routing mode)
> 
> BASELINE [bpf_redirect(BPF_F_INGRESS)]
>   1. [iperf pod] ==bpf_redirect([pod b], BPF_F_INGRESS)==> [pod b]
>   2. [pod b]     ==bpf_redirect_neigh([eth0])==>           eth0
>   3. eth0        ==over network==>                         [host b]
> 
>   [ ID] Interval           Transfer     Bitrate         Retr
>   [  5]   0.00-60.00  sec   231 GBytes  33.0 Gbits/sec  12060     sender
>   [  5]   0.00-60.00  sec   230 GBytes  33.0 Gbits/sec            receiver
> 
> TEST [bpf_redirect_peer(BPF_F_EGRESS)]
>   1. [iperf pod] ==bpf_redirect_peer([pod b], BPF_F_EGRESS)==> [pod b]
>   2. [pod b]     ==bpf_redirect_neigh([eth0])==>               eth0
>   3. eth0        ==over network==>                             [host b]
> 
>   [ ID] Interval           Transfer     Bitrate         Retr
>   [  5]   0.00-60.00  sec   272 GBytes  38.9 Gbits/sec    0       sender
>   [  5]   0.00-60.00  sec   272 GBytes  38.9 Gbits/sec            receiver
> 
> In this test, using bpf_redirect_peer(BPF_F_EGRESS) for the hop from
> [iperf pod] to [pod b] led to ~18% more throughput compared to
> bpf_redirect(BPF_F_INGRESS).
> 
> Signed-off-by: Jordan Rife <jordan@jrife.io>

Acked-by: Paul Chaignon <paul.chaignon@gmail.com>



* Re: [PATCH net] net: udp_tunnel: fix use-after-free by refcounting udp_tunnel_nic
From: Jakub Kicinski @ 2026-06-24 21:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Paolo Abeni, Simon Horman, Ido Schimmel,
	David Ahern, netdev, eric.dumazet, Yue Sun
In-Reply-To: <20260624171034.4117423-1-edumazet@google.com>

On Wed, 24 Jun 2026 17:10:34 +0000 Eric Dumazet wrote:
> Yue Sun reported a use-after-free and debugobjects warning in
> udp_tunnel_nic_device_sync_work() during concurrent device operations.
> 
> The state flags of struct udp_tunnel_nic were originally bitfields
> sharing a byte, modified concurrently without locking (RCU vs worker).

Can you clarify the path where the bits are modified without locks?
My mental model is that this is basically all under rtnl_lock, and
Stan added _another_ lock so that drivers can call "sync" / replay
without needing rtnl lock, but any changes are still under rtnl_lock.

The gap seems to be that we don't check pending under Stan's new lock,
since commit 1ead7501094c6 ("udp_tunnel: remove rtnl_lock dependency")
did:

+++ b/drivers/net/netdevsim/udp_tunnels.c
@@ -112,12 +112,10 @@ nsim_udp_tunnels_info_reset_write(struct file *file, const char __user *data,
        struct net_device *dev = file->private_data;
        struct netdevsim *ns = netdev_priv(dev);
 
-       rtnl_lock();
        if (dev->reg_state == NETREG_REGISTERED) {
                memset(ns->udp_ports.ports, 0, sizeof(ns->udp_ports.__ports));
                udp_tunnel_nic_reset_ntf(dev);
        }
-       rtnl_unlock();


so we just need:

diff --git a/net/ipv4/udp_tunnel_nic.c b/net/ipv4/udp_tunnel_nic.c
index 9944ed923ddf..d7db89a222f8 100644
--- a/net/ipv4/udp_tunnel_nic.c
+++ b/net/ipv4/udp_tunnel_nic.c
@@ -863,6 +863,7 @@ static void
 udp_tunnel_nic_unregister(struct net_device *dev, struct udp_tunnel_nic *utn)
 {
        const struct udp_tunnel_nic_info *info = dev->udp_tunnel_nic_info;
+       bool pending;
 
        udp_tunnel_nic_lock(dev);
 
@@ -899,12 +900,14 @@ udp_tunnel_nic_unregister(struct net_device *dev, struct udp_tunnel_nic *utn)
         * from the work which we will boot immediately.
         */
        udp_tunnel_nic_flush(dev, utn);
+
+       pending = utn->work_pending;
        udp_tunnel_nic_unlock(dev);
 
        /* Wait for the work to be done using the state, netdev core will
         * retry unregister until we give up our reference on this device.
         */
-       if (utn->work_pending)
+       if (pending)
                return;
 
        udp_tunnel_nic_free(utn);
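
In userspace terms the idea is roughly the following. This is only a
sketch: struct demo_state and demo_can_free() are made-up names, and a
pthread mutex stands in for udp_tunnel_nic_lock(). The point is that the
decision to free is made on a value sampled while the lock is still held,
not on a fresh read after unlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state guarded by udp_tunnel_nic_lock(). */
struct demo_state {
	pthread_mutex_t lock;
	bool work_pending;	/* assumed to be written only with lock held */
};

/*
 * Returns true if the caller may free the state now, false if work is
 * still pending and the teardown has to be retried later.
 */
static bool demo_can_free(struct demo_state *st)
{
	bool pending;

	assert(st != NULL);

	pthread_mutex_lock(&st->lock);
	/* ... the flush step would run here ... */
	pending = st->work_pending;	/* snapshot while still serialized */
	pthread_mutex_unlock(&st->lock);

	/* Decide on the snapshot, not on an unlocked re-read of the flag. */
	return !pending;
}
```

A caller would free the state only when demo_can_free() returns true and
otherwise come back later, which mirrors the early return in the diff.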



> Even after converting to atomic bitops, a single WORK_PENDING flag
> races: the workqueue core clears the pending bit before running the
> worker. A concurrent queueing sets the flag, but the running worker
> clears it, leading to premature freeing in unregister() while the
> re-queued work is still active.

> +	if (utn->dev)
> +		dev_put(utn->dev);

nit: cocci complains that null check is not needed here.


* [PATCH v3] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Yousef Alhouseen @ 2026-06-24 21:56 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez
  Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen

vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. On 32-bit systems,
where unsigned long is narrower than u64, that addition can overflow and
the code can pin and map fewer pages than the requested IOTLB range.

Reject sizes that overflow the unsigned long page-count calculation.

Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()")
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
Changes in v3:
- Add the Fixes tag.

Changes in v2:
- Clarify that the overflow is on 32-bit systems.
- Drop the unrelated memlock check change.

 drivers/vhost/vdpa.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..38b28ed3d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
+	unsigned long page_offset;
 	u64 start = iova;
 	long pinned;
 	int ret = 0;
@@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;

-	npages = PFN_UP(size + (iova & ~PAGE_MASK));
+	page_offset = iova & ~PAGE_MASK;
+	if (size > ULONG_MAX - page_offset) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	npages = PFN_UP(size + page_offset);
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
-- 
2.54.0


* Re: [PATCH net] eth: fbnic: fix race between concurrent hwmon sensor reads
From: Andrew Lunn @ 2026-06-24 21:56 UTC (permalink / raw)
  To: Zinc Lim
  Cc: Alexander Duyck, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, alexander.duyck, kernel-team, netdev,
	linux-kernel, zinclim
In-Reply-To: <0673af7d-51a6-484c-be38-3f46849ebb30@lunn.ch>

On Wed, Jun 24, 2026 at 11:51:29PM +0200, Andrew Lunn wrote:
> On Wed, Jun 24, 2026 at 01:05:14PM -0700, Zinc Lim wrote:
> > From: Zinc Lim <limzhineng2@gmail.com>
> > 
> > Reading an hwmon sensor
> 
> Since this is a hwmon related patch, it is a good idea to Cc: the
> hwmon Maintainer.
> 
> I don't know hwmon too well, I've never had to look at locking, but...
> 
> static int hwmon_thermal_get_temp(struct thermal_zone_device *tz, int *temp)
> {
> 	struct hwmon_thermal_data *tdata = thermal_zone_device_priv(tz);
> 	struct hwmon_device *hwdev = to_hwmon_device(tdata->dev);
> 	int ret;
> 	long t;
> 
> 	guard(mutex)(&hwdev->lock);
> 
> 	ret = hwdev->chip->ops->read(tdata->dev, hwmon_temp, hwmon_temp_input,
> 				     tdata->index, &t);
> 
> Could you explain why this lock in the hwmon core is not sufficient?
> It could be I'm reading it wrong, but it looks like the lock is shared
> by all hwmon instances registered by a
> hwmon_device_register_with_info() call.

Documentation/hwmon/hwmon-kernel-api.rst says:

When using ``[devm_]hwmon_device_register_with_info()`` to register the
hardware monitoring device, accesses using the associated access functions
are serialised by the hardware monitoring core. If a driver needs locking
for other functions such as interrupt handlers or for attributes which are
fully implemented in the driver, hwmon_lock() and hwmon_unlock() can be used
to ensure that calls to those functions are serialized.
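
As a rough userspace analogy of what that paragraph describes (not the
hwmon code itself; struct demo_core, demo_core_read() and
demo_driver_read() are invented names, and a pthread mutex stands in for
the core's lock), the core takes the lock around every driver callback,
so the callback does not need a lock of its own for that path:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical miniature of a "core" that owns the lock for its driver. */
struct demo_core {
	pthread_mutex_t lock;
	int (*read)(void *priv, long *val);	/* driver callback */
	void *priv;
};

/* Every read goes through here, so callbacks never run concurrently. */
static int demo_core_read(struct demo_core *core, long *val)
{
	int ret;

	assert(core != NULL && core->read != NULL);

	pthread_mutex_lock(&core->lock);
	ret = core->read(core->priv, val);
	pthread_mutex_unlock(&core->lock);
	return ret;
}

/* Driver side: no extra locking for accesses that come via the core. */
static int demo_driver_read(void *priv, long *val)
{
	long *sensor = priv;

	*val = *sensor;
	return 0;
}
```

Paths that bypass demo_core_read(), such as an interrupt handler in the
analogy, would still need to take the same lock explicitly.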

	Andrew


* Re: [PATCH v2] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Michael S. Tsirkin @ 2026-06-24 21:54 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <CAMuQ4bURwyoJtKUskzXpbUtrxXT1vBZZxpKpnnJY6qWsLtTBMA@mail.gmail.com>

On Wed, Jun 24, 2026 at 02:51:53PM -0700, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. On 32-bit systems,
> where unsigned long is narrower than u64, that addition can overflow and
> the code can pin and map fewer pages than the requested IOTLB range.
> 
> Reject sizes that overflow the unsigned long page-count calculation.
> 


And a Fixes: tag, please.

> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> ---
> Changes in v2:
> - Clarify that the overflow is on 32-bit systems.
> - Drop the unrelated memlock check change.
> 
>  drivers/vhost/vdpa.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..38b28ed3d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	unsigned int gup_flags = FOLL_LONGTERM;
>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>  	unsigned long lock_limit, sz2pin, nchunks, i;
> +	unsigned long page_offset;
>  	u64 start = iova;
>  	long pinned;
>  	int ret = 0;
> @@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	if (perm & VHOST_ACCESS_WO)
>  		gup_flags |= FOLL_WRITE;
> 
> -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> +	page_offset = iova & ~PAGE_MASK;
> +	if (size > ULONG_MAX - page_offset) {
> +		ret = -EINVAL;
> +		goto free;
> +	}
> +
> +	npages = PFN_UP(size + page_offset);
>  	if (!npages) {
>  		ret = -EINVAL;
>  		goto free;
> -- 
> 2.54.0



* [PATCH v2] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Yousef Alhouseen @ 2026-06-24 21:51 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez
  Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen

vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. On 32-bit systems,
where unsigned long is narrower than u64, that addition can overflow and
the code can pin and map fewer pages than the requested IOTLB range.

Reject sizes that overflow the unsigned long page-count calculation.

Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
Changes in v2:
- Clarify that the overflow is on 32-bit systems.
- Drop the unrelated memlock check change.

 drivers/vhost/vdpa.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..38b28ed3d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
+	unsigned long page_offset;
 	u64 start = iova;
 	long pinned;
 	int ret = 0;
@@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;

-	npages = PFN_UP(size + (iova & ~PAGE_MASK));
+	page_offset = iova & ~PAGE_MASK;
+	if (size > ULONG_MAX - page_offset) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	npages = PFN_UP(size + page_offset);
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
-- 
2.54.0


* Re: [PATCH net] eth: fbnic: fix race between concurrent hwmon sensor reads
From: Andrew Lunn @ 2026-06-24 21:51 UTC (permalink / raw)
  To: Zinc Lim
  Cc: Alexander Duyck, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, alexander.duyck, kernel-team, netdev,
	linux-kernel, zinclim
In-Reply-To: <20260624200514.1332788-1-zinclim@meta.com>

On Wed, Jun 24, 2026 at 01:05:14PM -0700, Zinc Lim wrote:
> From: Zinc Lim <limzhineng2@gmail.com>
> 
> Reading an hwmon sensor

Since this is a hwmon related patch, it is a good idea to Cc: the
hwmon Maintainer.

I don't know hwmon too well, I've never had to look at locking, but...

static int hwmon_thermal_get_temp(struct thermal_zone_device *tz, int *temp)
{
	struct hwmon_thermal_data *tdata = thermal_zone_device_priv(tz);
	struct hwmon_device *hwdev = to_hwmon_device(tdata->dev);
	int ret;
	long t;

	guard(mutex)(&hwdev->lock);

	ret = hwdev->chip->ops->read(tdata->dev, hwmon_temp, hwmon_temp_input,
				     tdata->index, &t);

Could you explain why this lock in the hwmon core is not sufficient?
It could be I'm reading it wrong, but it looks like the lock is shared
by all hwmon instances registered by a
hwmon_device_register_with_info() call.

	Andrew


* Re: [PATCH v12 07/12] static_call: Define EXPORT_STATIC_CALL_FOR_MODULES()
From: Pawan Gupta @ 2026-06-24 21:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, KP Singh,
	Jiri Olsa, David S. Miller, David Laight, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, David Ahern, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	Stanislav Fomichev, Hao Luo, Paolo Bonzini, Jonathan Corbet,
	Jason Baron, Alice Ryhl, Steven Rostedt, Ard Biesheuvel,
	Shuah Khan, linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf,
	netdev, linux-doc
In-Reply-To: <ajvUp_kPJBRZ7k_p@google.com>

On Wed, Jun 24, 2026 at 05:59:19AM -0700, Sean Christopherson wrote:
> On Tue, Jun 23, 2026, Pawan Gupta wrote:
> > There is EXPORT_STATIC_CALL_TRAMP() that hides the static key from all
> > modules. But there is no equivalent of EXPORT_SYMBOL_FOR_MODULES() to
> > restrict symbol visibility to only certain modules.
> > 
> > Add EXPORT_STATIC_CALL_FOR_MODULES(name, mods) that wraps both the key and
> > the trampoline with EXPORT_SYMBOL_FOR_MODULES(), allowing only a limited
> > set of modules to see and update the static key.
> > 
> > The immediate user is KVM, in the following commit.
> > 
> > checkpatch reported below warnings with this change that I believe don't
> > apply in this case:
> > 
> >   include/linux/static_call.h:219: WARNING: Non-declarative macros with multiple statements should be enclosed in a do - while loop
> >   include/linux/static_call.h:220: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
> > 
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > ---
> >  include/linux/static_call.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> > index 78a77a4ae0ea..b610afd1ed55 100644
> > --- a/include/linux/static_call.h
> > +++ b/include/linux/static_call.h
> > @@ -216,6 +216,9 @@ extern long __static_call_return0(void);
> >  #define EXPORT_STATIC_CALL_GPL(name)					\
> >  	EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name));			\
> >  	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(name))
> > +#define EXPORT_STATIC_CALL_FOR_MODULES(name, mods)			\
> > +	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_KEY(name), mods);		\
> > +	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_TRAMP(name), mods)
> >  
> >  /* Leave the key unexported, so modules can't change static call targets: */
> >  #define EXPORT_STATIC_CALL_TRAMP(name)					\
> > @@ -276,6 +279,9 @@ extern long __static_call_return0(void);
> >  #define EXPORT_STATIC_CALL_GPL(name)					\
> >  	EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name));			\
> >  	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(name))
> > +#define EXPORT_STATIC_CALL_FOR_MODULES(name, mods)			\
> > +	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_KEY(name), mods);		\
> > +	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_TRAMP(name), mods)
> >  
> >  /* Leave the key unexported, so modules can't change static call targets: */
> >  #define EXPORT_STATIC_CALL_TRAMP(name)					\
> > @@ -346,6 +352,8 @@ static inline int static_call_text_reserved(void *start, void *end)
> >  
> >  #define EXPORT_STATIC_CALL(name)	EXPORT_SYMBOL(STATIC_CALL_KEY(name))
> >  #define EXPORT_STATIC_CALL_GPL(name)	EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name))
> > +#define EXPORT_STATIC_CALL_FOR_MODULES(name, mods)			\
> > +	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_KEY(name), mods)
> >  
> >  #endif /* CONFIG_HAVE_STATIC_CALL */
> 
> Drat, I forgot about this.  Exporting static call trampolines for KVM came up in
> another conversation[*].  I had already put together patches to effectively default
> to exporting only the trampoline, and also to deduplicate this code so that the
> CONFIG_HAVE_STATIC_CALL_INLINE=y / CONFIG_HAVE_STATIC_CALL=y / CONFIG_HAVE_STATIC_CALL=n
> implementations don't need to copy+paste the same lines of code.
> 
> The attached patches touch a lot more code, and will conflict mightily with KVM
> changes I want to land in 7.3 (more use of a static_call in KVM).  But if we get
> them applied (to tip tree) shortly after 7.2-rc1 and provide a topic branch/tag,
> then there shouldn't be too much juggling needed?
> 
> If we want to go with the more aggressive cleanup, I'll formally post the patches.
> 
> [*] https://lore.kernel.org/all/ahhoDGUz39KSGZ6o@google.com

Thanks for the context.

Earlier making the key ro-after-init came up as an option in a thread with
Peter. Does it look like a good option to you?

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index b610afd1ed55..ea56da8fb446 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -200,6 +200,14 @@ extern long __static_call_return0(void);
 	};								\
 	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
 
+#define DEFINE_STATIC_CALL_NULL_RO_AFTER_INIT(name, _func)		\
+	DECLARE_STATIC_CALL(name, _func);				\
+	struct static_call_key STATIC_CALL_KEY(name) __ro_after_init = {\
+		.func = _func,						\
+		.type = 1,						\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
+
 #define DEFINE_STATIC_CALL_RET0(name, _func)				\
 	DECLARE_STATIC_CALL(name, _func);				\
 	struct static_call_key STATIC_CALL_KEY(name) = {		\


* Re: [PATCH] vhost/vdpa: reject overflowing PA map page counts
From: Yousef Alhouseen @ 2026-06-24 21:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <20260624153850-mutt-send-email-mst@kernel.org>

On Wed, Jun 24, 2026 at 03:53:38PM -0400, Michael S. Tsirkin wrote:
> You should add "on 32 bit systems" - I do not see how it can
> overflow on 64 bit.

Right, the overflow I was trying to cover is the unsigned long
page-count calculation on 32-bit systems, where size can be wider than
unsigned long and the page offset is added before PFN_UP(). I should
have made that scope explicit in the changelog.
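
To illustrate the case I have in mind, here is a small userspace model.
It is only a sketch under assumptions: demo_ulong is a uint32_t standing
in for a 32-bit unsigned long, 4 KiB pages are assumed, and
npages_unchecked() / npages_checked() are made-up helpers, not the
driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for the model: 4 KiB pages and a 32-bit unsigned long. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)

typedef uint32_t demo_ulong;

/* Unchecked: the 64-bit page count is narrowed to 32 bits on assignment. */
static demo_ulong npages_unchecked(uint64_t iova, uint64_t size)
{
	uint64_t page_offset = iova & (DEMO_PAGE_SIZE - 1);

	return (demo_ulong)((size + page_offset + DEMO_PAGE_SIZE - 1)
			    >> DEMO_PAGE_SHIFT);
}

/* Checked: reject sizes that do not fit once the page offset is added. */
static bool npages_checked(uint64_t iova, uint64_t size, demo_ulong *npages)
{
	demo_ulong page_offset = (demo_ulong)(iova & (DEMO_PAGE_SIZE - 1));

	assert(npages != NULL);

	if (size > (uint64_t)(UINT32_MAX - page_offset))
		return false;	/* would be -EINVAL in the driver */

	*npages = (demo_ulong)((size + page_offset + DEMO_PAGE_SIZE - 1)
			       >> DEMO_PAGE_SHIFT);
	return *npages != 0;	/* a zero count is rejected as well */
}
```

In this model a request of 2^44 + 4096 bytes at iova 0 yields a count of
1 page from npages_unchecked(), while npages_checked() refuses it.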

> I don't see how this can happen at all - pinned_vm is in units of pages.

Agreed, that part is not needed for this fix. I'll drop the memlock
check change and send a v2 with the changelog clarified to say this is
for 32-bit systems.

Thanks,
Yousef


On Wed, 24 Jun 2026 15:53:38 -0400, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Jun 24, 2026 at 09:06:53PM +0200, Yousef Alhouseen wrote:
> > vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> > size before computing the number of pages to pin. If that addition wraps,
> > the code can pin and map fewer pages than the requested IOTLB range.
> >
> > Reject sizes that overflow the page-count calculation.
>
> You should add "on 32 bit systems" - I do not see how it can
> overflow on 64 bit.
>
> > Also make the
> > memlock check subtraction-based so a large page count cannot wrap the
> > pinned page total.
>
> I don't see how this can happen at all - pinned_vm is in units of pages.
>
> > Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> > ---
> >  drivers/vhost/vdpa.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index ac55275fa..090cb8693 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -1102,6 +1102,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> >  	unsigned int gup_flags = FOLL_LONGTERM;
> >  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
> >  	unsigned long lock_limit, sz2pin, nchunks, i;
> > +	unsigned long page_offset;
> > +	u64 pinned_vm;
> >  	u64 start = iova;
> >  	long pinned;
> >  	int ret = 0;
> > @@ -1114,7 +1116,12 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> >  	if (perm & VHOST_ACCESS_WO)
> >  		gup_flags |= FOLL_WRITE;
> >
> > -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> > +	page_offset = iova & ~PAGE_MASK;
> > +	if (size > ULONG_MAX - page_offset) {
> > +		ret = -EINVAL;
> > +		goto free;
> > +	}
> > +	npages = PFN_UP(size + page_offset);
> >  	if (!npages) {
> >  		ret = -EINVAL;
> >  		goto free;
> > @@ -1123,7 +1130,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> >  	mmap_read_lock(dev->mm);
> >
> >  	lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> > -	if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
> > +	pinned_vm = atomic64_read(&dev->mm->pinned_vm);
> > +	if (npages > lock_limit || pinned_vm > lock_limit - npages) {
> >  		ret = -ENOMEM;
> >  		goto unlock;
> >  	}
> > --
> > 2.54.0

