Linux-HyperV List
* [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-06-02 20:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
	pavan.chebbi

On some ARM64 platforms with 4K PAGE_SIZE, using page_pool fragments
for allocation in the RX refill path (~2kB buffer per fragment) causes
a 15-20% throughput regression under high connection counts
(>16 TCP streams at 180+ Gbps). Using full-page buffers on these
platforms shows no regression and restores line-rate performance.

This behavior is observed on a single platform; other platforms
perform better with page_pool fragments, indicating that this is a
platform-specific issue rather than a page_pool issue.

This series adds an ethtool private flag "full-page-rx" to let the
user opt in to one RX buffer per page:

  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag can be persisted
via udev rule for affected platforms.
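
For readers who want the flag handling in isolation, below is a rough
user-space sketch of how a private-flag word like this can be validated
before it is applied. It only models the bit checks (no-op when nothing
changed, reject unknown bits, report whether the RX queues would need to
be reconfigured). The names priv_flags_update(), PRIV_FLAG_KNOWN_MASK and
the -1 error value are assumptions made for this illustration, not the
driver's actual symbols; the mask stands in for the kernel's GENMASK().

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions, modelled on the enum this series adds (names are illustrative). */
enum priv_flag_bits {
	PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
	PRIV_FLAG_MAX,
};

/* Mask of all known flag bits; user-space stand-in for GENMASK(). */
#define PRIV_FLAG_KNOWN_MASK ((UINT32_C(1) << PRIV_FLAG_MAX) - 1)

/*
 * Apply a requested flag word to *cur.
 * Returns 0 on success, -1 if the request contains unknown bits.
 * *reconfig is set to true only when the full-page bit was toggled,
 * i.e. when the RX buffers would have to be rebuilt.
 */
int priv_flags_update(uint32_t *cur, uint32_t requested, bool *reconfig)
{
	uint32_t changed = *cur ^ requested;

	*reconfig = false;

	/* Nothing to do if the request matches the current state. */
	if (!changed)
		return 0;

	/* Reject bits outside the known range without touching *cur. */
	if (requested & ~PRIV_FLAG_KNOWN_MASK)
		return -1;

	*reconfig = (changed & (UINT32_C(1) << PRIV_FLAG_USE_FULL_PAGE_RXBUF)) != 0;
	*cur = requested;
	return 0;
}
```

In this model, a caller would only go through the detach/attach cycle
when the function returns 0 with *reconfig set and the port is up.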

This series depends on the following fixes now merged in net-next:
  17bfe0a8c014 ("net: mana: add NULL guards in teardown path to prevent panic")
  5b05aa36ee24 ("net: mana: skip redundant detach on already-detached port")

Changes in v10:
  - Rebased on net-next which now includes the prerequisite fixes.
  - Recovery logic in mana_set_priv_flags() leverages the idempotent
    mana_detach() from the merged fixes.
Changes in v9:
  - Added correct tree.
Changes in v8:
  - Fixed queue_reset_work recovery by restoring port_is_up before
    scheduling reset so the handler can properly re-attach.
  - Simplified "err && schedule_port_reset" to "schedule_port_reset".
Changes in v7:
  - Rebased onto net-next.
  - Retained private flag approach after David Wei's testing on
    Grace (ARM64) confirmed that fragment mode outperforms
    full-page mode on other platforms, validating this is a
    single-platform workaround rather than a generic issue.
Changes in v6:
  - Added missed maintainers.
Changes in v5:
  - Split prep refactor into separate patch (patch 1/2)
Changes in v4:
  - Dropped the SMBIOS string parsing and added an ethtool private flag
    to reconfigure the queues with full-page RX buffers.
Changes in v3:
  - changed u8* to char*
Changes in v2:
  - separate reading string index and the string, remove inline.

Dipayaan Roy (2):
  net: mana: refactor mana_get_strings() and mana_get_sset_count() to
    use switch
  net: mana: force full-page RX buffers via ethtool private flag

 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 178 +++++++++++++++---
 include/net/mana/mana.h                       |   8 +
 3 files changed, 177 insertions(+), 31 deletions(-)

-- 
2.43.0



* [PATCH net-next v10 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch
From: Dipayaan Roy @ 2026-06-02 20:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
	pavan.chebbi
In-Reply-To: <20260602202801.1873742-1-dipayanroy@linux.microsoft.com>

Refactor mana_get_strings() and mana_get_sset_count() from if/else to
switch statements in preparation for adding ethtool private flags
support which requires handling ETH_SS_PRIV_FLAGS.

No functional change.
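
To illustrate the shape of the refactor outside the kernel, here is a
minimal user-space sketch, assuming a fixed-width string table similar to
what ethtool uses. The point it shows is why the per-set helper takes a
pointer to the cursor: each write advances the caller's position, so
further helpers can append after it. GSTRING_LEN, SS_STATS, put_name()
and the other names are hypothetical stand-ins, not kernel symbols.

```c
#include <stdio.h>

#define GSTRING_LEN 32	/* stand-in for ETH_GSTRING_LEN */

enum { SS_STATS = 1 };	/* hypothetical string-set ID */

/* Copy one name into the current slot and advance the caller's cursor. */
void put_name(unsigned char **data, const char *name)
{
	snprintf((char *)*data, GSTRING_LEN, "%s", name);
	*data += GSTRING_LEN;
}

/* Per-set helper: emits one name per queue through the shared cursor. */
void get_strings_stats(unsigned char **data, int num_queues)
{
	char name[GSTRING_LEN];

	for (int i = 0; i < num_queues; i++) {
		snprintf(name, sizeof(name), "rx_%d_packets", i);
		put_name(data, name);
	}
}

/* Dispatcher: one case per string set; unknown sets write nothing. */
void get_strings(int stringset, unsigned char *data, int num_queues)
{
	switch (stringset) {
	case SS_STATS:
		get_strings_stats(&data, num_queues);
		break;
	default:
		break;
	}
}
```

Adding another string set in this model is one more case label plus one
more helper, which is the property the switch conversion is after.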

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../ethernet/microsoft/mana/mana_ethtool.c    | 75 ++++++++++++-------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 04350973e19e..7e79681634db 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -138,53 +138,70 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 
-	if (stringset != ETH_SS_STATS)
+	switch (stringset) {
+	case ETH_SS_STATS:
+		return ARRAY_SIZE(mana_eth_stats) +
+		       ARRAY_SIZE(mana_phy_stats) +
+		       ARRAY_SIZE(mana_hc_stats)  +
+		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	default:
 		return -EINVAL;
-
-	return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
-			num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	}
 }
 
-static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 {
-	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 	int i, j;
 
-	if (stringset != ETH_SS_STATS)
-		return;
 	for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
-		ethtool_puts(&data, mana_eth_stats[i].name);
+		ethtool_puts(data, mana_eth_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
-		ethtool_puts(&data, mana_hc_stats[i].name);
+		ethtool_puts(data, mana_hc_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
-		ethtool_puts(&data, mana_phy_stats[i].name);
+		ethtool_puts(data, mana_phy_stats[i].name);
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "rx_%d_packets", i);
-		ethtool_sprintf(&data, "rx_%d_bytes", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
-		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		ethtool_sprintf(data, "rx_%d_packets", i);
+		ethtool_sprintf(data, "rx_%d_bytes", i);
+		ethtool_sprintf(data, "rx_%d_xdp_drop", i);
+		ethtool_sprintf(data, "rx_%d_xdp_tx", i);
+		ethtool_sprintf(data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(data, "rx_%d_pkt_len0_err", i);
 		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
-			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
+			ethtool_sprintf(data,
+					"rx_%d_coalesced_cqe_%d",
+					i,
+					j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "tx_%d_packets", i);
-		ethtool_sprintf(&data, "tx_%d_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_xdp_xmit", i);
-		ethtool_sprintf(&data, "tx_%d_tso_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_long_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_short_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_csum_partial", i);
-		ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
+		ethtool_sprintf(data, "tx_%d_packets", i);
+		ethtool_sprintf(data, "tx_%d_bytes", i);
+		ethtool_sprintf(data, "tx_%d_xdp_xmit", i);
+		ethtool_sprintf(data, "tx_%d_tso_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_bytes", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_bytes", i);
+		ethtool_sprintf(data, "tx_%d_long_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_short_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_csum_partial", i);
+		ethtool_sprintf(data, "tx_%d_mana_map_err", i);
+	}
+}
+
+static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		mana_get_strings_stats(apc, &data);
+		break;
+	default:
+		break;
 	}
 }
 
-- 
2.43.0



* [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-06-02 20:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
	pavan.chebbi
In-Reply-To: <20260602202801.1873742-1-dipayanroy@linux.microsoft.com>

On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
allocation in the RX refill path can cause a 15-20% throughput
regression under high connection counts (>16 TCP streams).

Add an ethtool private flag "full-page-rx" that allows the user to
force one RX buffer per page, bypassing the page_pool fragment path.
This restores line-rate (180+ Gbps) performance on affected platforms.

Usage:
  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag must be explicitly
enabled by the user or by a udev rule.

The existing single-buffer-per-page logic for XDP and jumbo frames is
consolidated into a new helper mana_use_single_rxbuf_per_page() which
is now the single decision point for both the automatic and
user-controlled paths.
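
As a quick way to reason about the consolidated decision, the following
user-space sketch models the three conditions under which a single RX
buffer per page is chosen: the private flag is set, the MTU plus padding
does not fit in half a page, or XDP is attached. SKETCH_RXBUF_PAD is a
hypothetical value picked for illustration and is not the driver's
MANA_RXBUF_PAD; struct port_cfg and use_single_rxbuf_per_page() are
likewise assumed names.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* 4K pages, as on the affected platforms */
#define SKETCH_RXBUF_PAD 512u	/* hypothetical padding, for illustration only */
#define FLAG_FULL_PAGE_RXBUF (UINT32_C(1) << 0)

/* Minimal stand-in for the per-port state the decision depends on. */
struct port_cfg {
	uint32_t priv_flags;
	bool xdp_attached;
};

/* Return true when each RX buffer should occupy a whole page. */
bool use_single_rxbuf_per_page(const struct port_cfg *cfg, uint32_t mtu)
{
	/* User opted in explicitly via the private flag. */
	if (cfg->priv_flags & FLAG_FULL_PAGE_RXBUF)
		return true;

	/* Jumbo frames (no two buffers fit in a page) or XDP in use. */
	if (mtu + SKETCH_RXBUF_PAD > SKETCH_PAGE_SIZE / 2 || cfg->xdp_attached)
		return true;

	return false;
}
```

With these assumed constants, an MTU of 1500 and no flag or XDP stays on
the fragment path, while setting the flag alone is enough to switch to
full pages.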

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 +++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 103 ++++++++++++++++++
 include/net/mana/mana.h                       |   8 ++
 3 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index db14357d3732..447cecfd3f67 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	/* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
+	 * in the RX refill path (~2kB buffer) can cause significant throughput
+	 * regression under high connection counts. Allow user to force one RX
+	 * buffer per page via ethtool private flag to bypass the fragment
+	 * path.
+	 */
+	if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 7e79681634db..f22bbb325948 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
 	{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
 };
 
+static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
+	[MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
+};
+
 static int mana_get_sset_count(struct net_device *ndev, int stringset)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 		       ARRAY_SIZE(mana_phy_stats) +
 		       ARRAY_SIZE(mana_hc_stats)  +
 		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+
+	case ETH_SS_PRIV_FLAGS:
+		return MANA_PRIV_FLAG_MAX;
+
 	default:
 		return -EINVAL;
 	}
@@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 	}
 }
 
+static void mana_get_strings_priv_flags(u8 **data)
+{
+	int i;
+
+	for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
+		ethtool_puts(data, mana_priv_flags[i]);
+}
+
 static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 	case ETH_SS_STATS:
 		mana_get_strings_stats(apc, &data);
 		break;
+	case ETH_SS_PRIV_FLAGS:
+		mana_get_strings_priv_flags(&data);
+		break;
 	default:
 		break;
 	}
@@ -590,6 +609,88 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 	return 0;
 }
 
+static u32 mana_get_priv_flags(struct net_device *ndev)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	return apc->priv_flags;
+}
+
+static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u32 changed = apc->priv_flags ^ priv_flags;
+	u32 old_priv_flags = apc->priv_flags;
+	bool schedule_port_reset = false;
+	int err = 0;
+
+	if (!changed)
+		return 0;
+
+	/* Reject unknown bits */
+	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
+		return -EINVAL;
+
+	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
+		apc->priv_flags = priv_flags;
+
+		if (!apc->port_is_up) {
+			/* Port is down, flag updated to apply on next up
+			 * so just return.
+			 */
+			return 0;
+		}
+
+		/* Pre-allocate buffers to prevent failure in mana_attach
+		 * later
+		 */
+		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+		if (err) {
+			netdev_err(ndev,
+				   "Insufficient memory for new allocations\n");
+			apc->priv_flags = old_priv_flags;
+			return err;
+		}
+
+		err = mana_detach(ndev, false);
+		if (err) {
+			netdev_err(ndev, "mana_detach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+
+			/* Port is in an inconsistent state. Restore
+			 * 'port_is_up' so that queue reset work handler
+			 * can properly detach and re-attach.
+			 */
+			apc->port_is_up = true;
+			schedule_port_reset = true;
+			goto out;
+		}
+
+		err = mana_attach(ndev);
+		if (err) {
+			netdev_err(ndev, "mana_attach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+
+			/* Restore 'port_is_up' so the reset work handler
+			 * can properly detach/attach. Without this,
+			 * the handler sees port_is_up=false and skips
+			 * queue allocation, leaving the port dead.
+			 */
+			apc->port_is_up = true;
+			schedule_port_reset = true;
+		}
+	}
+
+out:
+	mana_pre_dealloc_rxbufs(apc);
+
+	if (schedule_port_reset)
+		queue_work(apc->ac->per_port_queue_reset_wq,
+			   &apc->queue_reset_work);
+
+	return err;
+}
+
 const struct ethtool_ops mana_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
@@ -608,4 +709,6 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
 	.get_link		= ethtool_op_get_link,
+	.get_priv_flags		= mana_get_priv_flags,
+	.set_priv_flags		= mana_set_priv_flags,
 };
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index d9c27310fd04..26fd5e041a47 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,6 +30,12 @@ enum TRI_STATE {
 	TRI_STATE_TRUE = 1
 };
 
+/* MANA ethtool private flag bit positions */
+enum mana_priv_flag_bits {
+	MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
+	MANA_PRIV_FLAG_MAX,
+};
+
 /* Number of entries for hardware indirection table must be in power of 2 */
 #define MANA_INDIRECT_TABLE_MAX_SIZE 512
 #define MANA_INDIRECT_TABLE_DEF_SIZE 64
@@ -531,6 +537,8 @@ struct mana_port_context {
 	u32 rxbpre_headroom;
 	u32 rxbpre_frag_count;
 
+	u32 priv_flags;
+
 	struct bpf_prog *bpf_prog;
 
 	/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
-- 
2.43.0



* Re: [PATCH v8 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-06-02 20:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260529105739.5fc5fdf5@kernel.org>

On Fri, May 29, 2026 at 10:57:39AM -0700, Jakub Kicinski wrote:
> On Thu, 28 May 2026 19:55:23 -0700 Dipayaan Roy wrote:
> > As the pre-requisite fixes patches are accepted now:
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=17bfe0a8c014
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5b05aa36ee24
> > 
> > Can this series be merged now? Let me know if it needs a rebase or
> > anything else.
> 
> If there's any dependency, functional or otherwise on the fixes you
> need to wait until they make their way to net-next.
> 
> Otherwise you can rebase and repost now.

Hi Jakub,

I see the fixes have landed in net-next, so I rebased this series and posted it as v10:
https://lore.kernel.org/all/20260602202801.1873742-1-dipayanroy@linux.microsoft.com/


Regards
Dipayaan Roy


* RE: [EXTERNAL] Re: [PATCH net-next v11 0/6] net: mana: Per-vPort EQ and MSI-X management
From: Long Li @ 2026-06-02 22:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Konstantin Taranov, David S . Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Jason Gunthorpe, Leon Romanovsky, Haiyang Zhang,
	KY Srinivasan, Wei Liu, Dexuan Cui,
	shradhagupta@linux.microsoft.com, Simon Horman,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB6683A7B2415BEAF17BD0EB4ECE092@SA1PR21MB6683.namprd21.prod.outlook.com>

> 
> > On Fri, 22 May 2026 19:02:50 -0700 Long Li wrote:
> > > The following changes since commit 95fab46aea57d6d7b76b319341acbefe8a9293c8:
> > >
> > >   Merge branch 'net-convert-atm-xdp-af_iucv-l2tp_ppp-rxrpc-tipc-to-getsockopt_iter' (2026-05-22 11:11:12 -0700)
> > >
> > > are available in the Git repository at:
> > >
> > >   https://github.com/longlimsft/linux.git tags/mana-eq-msi-v11
> > >
> > > for you to fetch changes up to a26d11135abba51e81ae8b9689e288718af95088:
> > >
> > >   RDMA/mana_ib: Allocate interrupt contexts on EQs (2026-05-22 20:35:43 +0000)
> >
> > The branch is no good, it needs to be your patches applied on top of a
> > commit already in Linus's tree. The current branch is on top of
> > net-next, RDMA would have to pull in 100s of networking commits together
> > with your changes.
> 
> Hi Jakub,
> 
> Thanks for looking into this. Since the RDMA patch (patch 6) depends on the
> networking changes in patches 1-5, could this series go through net-next? I've
> verified that the tag pulls cleanly into the latest net-next.
> 
> Leon, Jason - could you provide an Acked-by for patch 6 ("RDMA/mana_ib:
> Allocate interrupt contexts on EQs") so it can be taken through the networking
> tree?
> 
> Thanks,
> Long

Hi Jakub,

Thank you for the feedback. Since the RDMA patch builds on the networking changes, would it be possible to take this series through net-next? I've confirmed the tag merges cleanly into the current net-next head.

Leon has acked patch 6 on v2 [1]. There are no significant structural changes to the RDMA driver since then - the differences are minor:

Changes in patch 6 from v2 to v11:

 - Error handling updated from NULL/-ENOMEM to IS_ERR()/PTR_ERR() for mana_gd_get_gic() return values
 - Added mdev->eqs[i]->eq.irq = gic->irq to populate the irq field on all RDMA EQs for consistency with the Ethernet path
 - Introduced a separate msi variable instead of modifying spec.eq.msix_index directly
 - Commit message updated

The gdma.h changes are identical to v2.

[1] https://lore.kernel.org/all/20260304145923.GE12611@unreal/

Thanks,
Long


* RE: [EXTERNAL] [PATCH net] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Haiyang Zhang @ 2026-06-02 22:54 UTC (permalink / raw)
  To: LeantionX, netdev@vger.kernel.org
  Cc: linux-hyperv@vger.kernel.org, KY Srinivasan, wei.liu@kernel.org,
	Dexuan Cui, Long Li, andrew+netdev@lunn.ch, kuba@kernel.org,
	pabeni@redhat.com, edumazet@google.com, davem@davemloft.net,
	stable@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260602155210.90987-1-leontyevanton1995@gmail.com>



> -----Original Message-----
> From: LeantionX <leontyevantony@gmail.com>
> Sent: Tuesday, June 2, 2026 11:52 AM
> To: netdev@vger.kernel.org
> Cc: linux-hyperv@vger.kernel.org; KY Srinivasan <kys@microsoft.com>;
> Haiyang Zhang <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> andrew+netdev@lunn.ch; kuba@kernel.org; pabeni@redhat.com;
> edumazet@google.com; davem@davemloft.net; stable@vger.kernel.org; linux-
> kernel@vger.kernel.org; Anton Leontev <leontyevantony@gmail.com>
> Subject: [EXTERNAL] [PATCH net] hv_netvsc: use kmap_local_page in
> netvsc_copy_to_send_buf
> 
> From: Anton Leontev <leontyevantony@gmail.com>
> 
> netvsc_copy_to_send_buf() copies skb fragment pages into the shared
> VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
> x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
> for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
> address that has no kernel page table entry and lies outside the
> kernel direct map, so the subsequent memcpy() faults. As this happens
> on the transmit softirq path, the fault is fatal.

Please include the stack trace in the patch description.

> A HIGHMEM fragment reaches this path whenever the page backing an skb
> fragment lives above the LOWMEM boundary, which is common on a 32-bit
> guest with several GiB of RAM (for example when the in-kernel NFS
> server splices page cache pages directly into the reply skb).
> 
> Map the fragment page on demand with kmap_local_page()/kunmap_local()
> instead. Using pfn_to_page() on pb[i].pfn maps exactly the page
> described by the page buffer entry. On configurations without HIGHMEM
> (amd64, i386 without CONFIG_HIGHMEM) kmap_local_page() reduces to
> page_address(), so this is a no-op there.

So on a 64-bit kernel, it has no performance impact?

Thanks,
- Haiyang



* Re: [EXTERNAL] Re: [PATCH net-next v11 0/6] net: mana: Per-vPort EQ and MSI-X management
From: Jakub Kicinski @ 2026-06-03  0:36 UTC (permalink / raw)
  To: Long Li
  Cc: Konstantin Taranov, David S . Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Jason Gunthorpe, Leon Romanovsky, Haiyang Zhang,
	KY Srinivasan, Wei Liu, Dexuan Cui,
	shradhagupta@linux.microsoft.com, Simon Horman,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB6683230E973519C12E2AB797CE122@SA1PR21MB6683.namprd21.prod.outlook.com>

On Tue, 2 Jun 2026 22:48:05 +0000 Long Li wrote:
> Changes in patch 6 from v2 to v11:
> 
>  - Error handling updated from NULL/-ENOMEM to IS_ERR()/PTR_ERR() for mana_gd_get_gic() return values
>  - Added mdev->eqs[i]->eq.irq = gic->irq to populate the irq field on all RDMA EQs for consistency with the Ethernet path
>  - Introduced a separate msi variable instead of modifying spec.eq.msix_index directly
>  - Commit message updated
> 
> The gdma.h changes are identical to v2.

Hm, yes, Leon seems to be AFK since May 19th.
Please repost with his tag included, the list of changes you provided
does seem immaterial. I don't want to merge v11 as is, there's a good
chance people marked this thread as ignored by now.


* Re: [EXTERNAL] [PATCH net] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Anton Leontev @ 2026-06-03  5:29 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	KY Srinivasan, wei.liu@kernel.org, Dexuan Cui, Long Li,
	andrew+netdev@lunn.ch, kuba@kernel.org, pabeni@redhat.com,
	edumazet@google.com, davem@davemloft.net, stable@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <BY1PR21MB38709E89497445EECE3C931DCA122@BY1PR21MB3870.namprd21.prod.outlook.com>

>
>
>
> > -----Original Message-----
> > From: LeantionX <leontyevantony@gmail.com>
> > Sent: Tuesday, June 2, 2026 11:52 AM
> > To: netdev@vger.kernel.org
> > Cc: linux-hyperv@vger.kernel.org; KY Srinivasan <kys@microsoft.com>;
> > Haiyang Zhang <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> > <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> > andrew+netdev@lunn.ch; kuba@kernel.org; pabeni@redhat.com;
> > edumazet@google.com; davem@davemloft.net; stable@vger.kernel.org; linux-
> > kernel@vger.kernel.org; Anton Leontev <leontyevantony@gmail.com>
> > Subject: [EXTERNAL] [PATCH net] hv_netvsc: use kmap_local_page in
> > netvsc_copy_to_send_buf
> >
> > From: Anton Leontev <leontyevantony@gmail.com>
> >
> > netvsc_copy_to_send_buf() copies skb fragment pages into the shared
> > VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
> > x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
> > for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
> > address that has no kernel page table entry and lies outside the
> > kernel direct map, so the subsequent memcpy() faults. As this happens
> > on the transmit softirq path, the fault is fatal.
> Please include the stack trace in patch description.
>
> > A HIGHMEM fragment reaches this path whenever the page backing an skb
> > fragment lives above the LOWMEM boundary, which is common on a 32-bit
> > guest with several GiB of RAM (for example when the in-kernel NFS
> > server splices page cache pages directly into the reply skb).
> >
> > Map the fragment page on demand with kmap_local_page()/kunmap_local()
> > instead. Using pfn_to_page() on pb[i].pfn maps exactly the page
> > described by the page buffer entry. On configurations without HIGHMEM
> > (amd64, i386 without CONFIG_HIGHMEM) kmap_local_page() reduces to
> > page_address(), so this is a no-op there.
>
> So, on 64bit kernel, it has no performance impact?
>
> Thanks,
> - Haiyang
>

Correct. On 64-bit (and any !CONFIG_HIGHMEM config) all pages are
permanently present in the kernel direct map, so kmap_local_page()
folds to page_address() and kunmap_local() is a no-op. The generated
code is therefore equivalent to the previous direct-map access, with
no extra mapping cost on the tx path.

The kmap is only meaningful on 32-bit CONFIG_HIGHMEM, where the
fragment page may live above the LOWMEM boundary and the old
phys_to_virt() result is invalid.

Thanks,
Anton


* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-03  7:08 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260602102811.A83AD1F00893@smtp.kernel.org>

On Tue, Jun 02, 2026 at 10:28:11AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> 
> Pre-existing issues:
> - [High] Missing error check for xa_store() leads to resource leak and potential kernel oops on allocation failure.
> --
> 
> commit 853688f50dc4ca56dd74698a3ecfdfeaf857e716
> Author: Shradha Gupta <shradhagupta@linux.microsoft.com>
> 
> net: mana: Optimize irq affinity for low vcpu configs
> 
> This patch optimizes irq affinity in the mana driver for low vCPU
> environments. When the number of allocated irqs is greater than the vCPU
> count, it ensures that queue irqs are spread across all available vCPUs
> to prevent softirq handling imbalances.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> [ ... ]
> > @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> >  	 */
> >  	cpus_read_lock();
> > -	if (gc->num_msix_usable <= num_online_cpus())
> > -		skip_first_cpu = true;
> > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > +		if (err) {
> > +			cpus_read_unlock();
> > +			goto free_irq;
> 
> [Severity: High]
> This isn't a bug introduced by this patch, but looking at the error handling
> path for goto free_irq, is there a potential resource leak if xa_store()
> fails earlier in the function?
> 
> When setting up the irqs in mana_gd_setup_dyn_irqs(), we call xa_store()
> with GFP_KERNEL:
> 
>     xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
> 
> However, the return value is not checked. If this allocation fails, the gic
> pointer is not saved. If a subsequent operation like irq_setup() fails and
> we jump to the free_irq label, the cleanup loop attempts to retrieve it:
> 
>     gic = xa_load(&gc->irq_contexts, i);
>     if (WARN_ON(!gic))
>         continue;
> 
> Because the store failed, xa_load() will return NULL, causing the loop
> to continue without calling free_irq() or kfree(gic). Could this result
> in an un-freed irq handler that might trigger a panic later if the
> module is unloaded and the hardware raises an interrupt?
> 
> > +		}
> > +	} else {
> [ ... ]
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260601102749.1768304-1-shradhagupta@linux.microsoft.com?part=1

Since this isn't a bug introduced by this patch, I will fix it in a
separate patch. I will submit the fix by next week.

Thanks,
Shradha.


* Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
From: Mike Rapoport @ 2026-06-03  9:29 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, linux-mm, kexec, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Pasha Tatashin, Pratyush Yadav,
	Alexander Graf, Jason Miu, Andrew Morton, David Hildenbrand,
	Muchun Song, Oscar Salvador, Baoquan He, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Kees Cook, Ran Xiaokai,
	Justinien Bouron, Sourabh Jain, Pingfan Liu, Rafael J. Wysocki,
	Mario Limonciello, linux-arm-kernel, x86, linux-kernel,
	Michael Kelley
In-Reply-To: <f9d95fb1-ef3-d4a-19e4-afe7cdde5d1f@linux.microsoft.com>

On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote:
> On Sun, 31 May 2026, Mike Rapoport wrote:
> 
> > > Patch 19:      Export kexec_in_progress for modules
> > 
> > Isn't there another way to differentiate kexec reboot?

There's that "kexec reboot" string passed as the cmd to the reboot
notifier.
Maybe we can turn it into a better-defined API and use that?
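
A rough userspace sketch of what "better defined" could mean: keep the
string in one named constant behind a helper, so callers never spell it out
themselves. The names EX_KEXEC_REBOOT_CMD and ex_is_kexec_reboot() are
hypothetical, and no such kernel helper is implied to exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* The cmd string mentioned above, kept in exactly one place. */
#define EX_KEXEC_REBOOT_CMD	"kexec reboot"

/* True when the notifier cmd identifies a kexec reboot; NULL is allowed. */
bool ex_is_kexec_reboot(const char *cmd)
{
	return cmd && strcmp(cmd, EX_KEXEC_REBOOT_CMD) == 0;
}

/* Small demonstration of the helper. */
void ex_kexec_demo(void)
{
	assert(ex_is_kexec_reboot("kexec reboot"));
	assert(!ex_is_kexec_reboot("kexec-reboot"));	/* typo is not caught at build time */
	assert(!ex_is_kexec_reboot(NULL));
}
```

A misspelled literal still compiles, which is the weakness of string
matching; the helper only narrows it to a single definition.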
 
> I could not find one, unfortunately.
> 
> > Sincerely yours,
> > Mike.
> 
> Best,
> Jork

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v4 07/47] x86/tdx: Force TSC frequency with CPUID-based info provided by the TDX-Module
From: Kiryl Shutsemau @ 2026-06-03 10:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-8-seanjc@google.com>

On Fri, May 29, 2026 at 07:43:54AM -0700, Sean Christopherson wrote:
> When running as a TDX guest, explicitly set the TSC frequency to a known
> value, using CPUID-based information, instead of potentially relying on a
> hypervisor-controlled PV routine.  For TDX guests, CPUID.0x15 is always
> emulated by the TDX-Module, i.e. the information from CPUID is more
> trustworthy than the information provided by the hypervisor.

Right. EBX is configurable by TD_PARAMS.TSC_FREQUENCY at TD build. The
rest is fixed.
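
For readers following along, the CPUID.0x15 arithmetic is TSC Hz =
ECX * EBX / EAX, with EBX/EAX the TSC/core crystal ratio and ECX the crystal
frequency in Hz. A minimal userspace sketch, assuming the register values
are passed in by the caller; tsc_hz_from_cpuid15() is a hypothetical name
and the numbers in the demo are illustrative only, not taken from the
TDX-Module spec.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the TSC frequency from CPUID leaf 0x15 register values.
 * Returns 0 when the leaf does not enumerate a usable ratio or crystal.
 */
uint64_t tsc_hz_from_cpuid15(uint32_t eax_denominator,
			     uint32_t ebx_numerator,
			     uint32_t ecx_crystal_hz)
{
	if (eax_denominator == 0 || ebx_numerator == 0 || ecx_crystal_hz == 0)
		return 0;

	/* Widen before multiplying so the product cannot overflow 32 bits. */
	return (uint64_t)ecx_crystal_hz * ebx_numerator / eax_denominator;
}

/* Illustrative values: 25 MHz crystal with a 100/1 ratio gives 2.5 GHz. */
void tsc_demo(void)
{
	assert(tsc_hz_from_cpuid15(1, 100, 25000000) == 2500000000ULL);
	assert(tsc_hz_from_cpuid15(0, 100, 25000000) == 0);
}
```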

> To maintain backwards compatibility with TDX guest kernels that use native
> calibration, and because it's the least awful option, retain
> native_calibrate_tsc()'s stuffing of the local APIC bus period using the
> core crystal frequency.  While it's entirely possible for the hypervisor
> to emulate the APIC timer at a different frequency than the core crystal
> frequency, the commonly accepted interpretation of Intel's SDM is that APIC
> timer runs at the core crystal frequency when the latter is enumerated via
> CPUID:
> 
>   The APIC timer frequency will be the processor’s bus clock or core
>   crystal clock frequency (when TSC/core crystal clock ratio is enumerated
>   in CPUID leaf 0x15).
> 
> If the hypervisor is malicious and deliberately runs the APIC timer at the
> wrong frequency, nothing would stop the hypervisor from modifying the
> frequency at any time, i.e. attempting to manually calibrate the frequency
> out of paranoia would be futile.

Agreed.

> Deliberately leave CPU frequency calibration as is, since the TDX-Module
> doesn't provide any guarantees with respect to CPUID.0x16.

It is fixed to zeros. Sounds like a guarantee to me :P

> Signed-off-by: Sean Christopherson <seanjc@google.com>

Looks sane to me. Including your reasoning about tsc_early_khz= in reply
to Sashiko.

Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH net] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: sashiko-bot @ 2026-06-03 15:53 UTC (permalink / raw)
  To: LeantionX; +Cc: linux-hyperv
In-Reply-To: <20260602155210.90987-1-leontyevanton1995@gmail.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Passing a Hyper-V PFN directly to `pfn_to_page()` causes out-of-bounds memory accesses on architectures where `PAGE_SIZE > 4KB` (e.g., ARM64 with 64KB pages), and loses the intra-page chunk offset.
--

commit a844fbf018cd37eedc9a1f0aea79d6235c00bec3
Author: Anton Leontev <leontyevantony@gmail.com>

hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf

This commit replaces phys_to_virt() with kmap_local_page() in
netvsc_copy_to_send_buf() to map skb fragment pages. This is intended to
fix a fatal transmit softirq fault on 32-bit x86 configurations with
HIGHMEM, where phys_to_virt() is only valid for LOWMEM addresses.

> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 59e95341f9b1e..6984f6c97257d 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c

[ ... ]

> @@ -965,11 +966,13 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
>  	}
>  
>  	for (i = 0; i < page_count; i++) {
> -		char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
> +		struct page *page = pfn_to_page(pb[i].pfn);
> +		char *src = kmap_local_page(page);

[Severity: Critical]
Does this cause out-of-bounds memory accesses on architectures with a native
page size greater than 4KB?

The pb[i].pfn field contains a Hyper-V PFN, which always operates at 4KB
granularity. Passing this directly to pfn_to_page() assumes it is a native
Linux PFN.

On architectures where the native page size is larger than 4KB (such as
ARM64 with 64KB pages), the Hyper-V PFN is significantly larger than the
corresponding native Linux PFN. Using it with pfn_to_page() would return an
invalid struct page pointer.

Calling kmap_local_page() on this invalid pointer in
netvsc_copy_to_send_buf() and subsequently reading from it could lead to
memory corruption, out-of-bounds accesses, or kernel panics.

Could this also lose the native page offset of the original 4KB chunk within
the larger Linux page, resulting in an incorrect read offset during the
subsequent memcpy?

>  		u32 offset = pb[i].offset;
>  		u32 len = pb[i].len;
>  
>  		memcpy(dest, (src + offset), len);
> +		kunmap_local(src);
>  		dest += len;
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260602155210.90987-1-leontyevanton1995@gmail.com?part=1

^ permalink raw reply

* Re: [PATCH net] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Anton Leontev @ 2026-06-03 16:29 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260603155344.7CE871F00893@smtp.kernel.org>

You are right, thanks for catching this. pb[i].pfn is a Hyper-V PFN at
4K granularity, so passing it directly to pfn_to_page() is wrong where
PAGE_SIZE > 4K, and it also drops the intra-page offset. Fixed in v2 by
reconstructing the physical address and using phys_to_page() +
offset_in_page(). Built for i386 (HIGHMEM) and arm64 (64K pages).
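
The arithmetic can be checked in userspace with a small sketch. It assumes
native 64K pages; EX_PAGE_SHIFT, struct ex_loc and ex_hv_pfn_to_native() are
hypothetical names for illustration and only mirror the shift and mask
steps, not the kernel helpers themselves.

```c
#include <assert.h>
#include <stdint.h>

#define HV_HYP_PAGE_SHIFT	12	/* Hyper-V PFNs are always 4K granular */
#define EX_PAGE_SHIFT		16	/* assumed native page size: 64K */
#define EX_PAGE_SIZE		(1ULL << EX_PAGE_SHIFT)

struct ex_loc {
	uint64_t native_pfn;	/* native page that contains the 4K chunk */
	uint32_t offset;	/* byte offset inside that native page */
};

/* Rebuild the physical address, then split it at the native page size. */
struct ex_loc ex_hv_pfn_to_native(uint64_t hv_pfn, uint32_t pb_offset)
{
	uint64_t paddr = hv_pfn << HV_HYP_PAGE_SHIFT;
	struct ex_loc loc;

	loc.native_pfn = paddr >> EX_PAGE_SHIFT;
	loc.offset = (uint32_t)(paddr & (EX_PAGE_SIZE - 1)) + pb_offset;
	return loc;
}

/* Hyper-V PFN 0x12345 is the sixth 4K chunk of native page 0x1234. */
void ex_pfn_demo(void)
{
	struct ex_loc loc = ex_hv_pfn_to_native(0x12345, 0x10);

	assert(loc.native_pfn == 0x1234);
	assert(loc.offset == 0x5010);
}
```

Treating 0x12345 as a native PFN would point 16 times too far into memory
and drop the 0x5000 in-page offset, which are the two problems noted in the
review.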

Thanks,
Anton

On Wed, Jun 3, 2026 at 18:53, <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Critical] Passing a Hyper-V PFN directly to `pfn_to_page()` causes out-of-bounds memory accesses on architectures where `PAGE_SIZE > 4KB` (e.g., ARM64 with 64KB pages), and loses the intra-page chunk offset.
> --
>
> commit a844fbf018cd37eedc9a1f0aea79d6235c00bec3
> Author: Anton Leontev <leontyevantony@gmail.com>
>
> hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
>
> This commit replaces phys_to_virt() with kmap_local_page() in
> netvsc_copy_to_send_buf() to map skb fragment pages. This is intended to
> fix a fatal transmit softirq fault on 32-bit x86 configurations with
> HIGHMEM, where phys_to_virt() is only valid for LOWMEM addresses.
>
> > diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> > index 59e95341f9b1e..6984f6c97257d 100644
> > --- a/drivers/net/hyperv/netvsc.c
> > +++ b/drivers/net/hyperv/netvsc.c
>
> [ ... ]
>
> > @@ -965,11 +966,13 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
> >       }
> >
> >       for (i = 0; i < page_count; i++) {
> > -             char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
> > +             struct page *page = pfn_to_page(pb[i].pfn);
> > +             char *src = kmap_local_page(page);
>
> [Severity: Critical]
> Does this cause out-of-bounds memory accesses on architectures with a native
> page size greater than 4KB?
>
> The pb[i].pfn field contains a Hyper-V PFN, which always operates at 4KB
> granularity. Passing this directly to pfn_to_page() assumes it is a native
> Linux PFN.
>
> On architectures where the native page size is larger than 4KB (such as
> ARM64 with 64KB pages), the Hyper-V PFN is significantly larger than the
> corresponding native Linux PFN. Using it with pfn_to_page() would return an
> invalid struct page pointer.
>
> Calling kmap_local_page() on this invalid pointer in
> netvsc_copy_to_send_buf() and subsequently reading from it could lead to
> memory corruption, out-of-bounds accesses, or kernel panics.
>
> Could this also lose the native page offset of the original 4KB chunk within
> the larger Linux page, resulting in an incorrect read offset during the
> subsequent memcpy?
>
> >               u32 offset = pb[i].offset;
> >               u32 len = pb[i].len;
> >
> >               memcpy(dest, (src + offset), len);
> > +             kunmap_local(src);
> >               dest += len;
> >       }
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260602155210.90987-1-leontyevanton1995@gmail.com?part=1

^ permalink raw reply

* [PATCH net v2] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Anton Leontev @ 2026-06-03 16:38 UTC (permalink / raw)
  To: netdev
  Cc: linux-hyperv, haiyangz, kys, wei.liu, decui, longli, kuba, pabeni,
	edumazet, davem, stable, linux-kernel, Anton Leontev
In-Reply-To: <20260602155210.90987-1-leontyevanton1995@gmail.com>

netvsc_copy_to_send_buf() copies skb fragment pages into the shared
VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
address that has no kernel page table entry and lies outside the
kernel direct map, so the subsequent memcpy() faults. As this happens
on the transmit softirq path, the fault is fatal.

A HIGHMEM fragment reaches this path whenever the page backing an skb
fragment lives above the LOWMEM boundary, which is common on a 32-bit
guest with several GiB of RAM (for example when the in-kernel NFS
server splices page cache pages directly into the reply skb).

pb[i].pfn is a Hyper-V PFN at HV_HYP_PAGE_SIZE (4K) granularity. The
physical address is reconstructed first and phys_to_page() is used to
obtain the native struct page, with offset_in_page() added so the
in-page offset stays correct where PAGE_SIZE > HV_HYP_PAGE_SIZE (e.g.
arm64 with 64K pages). The page is then mapped on demand with
kmap_local_page()/kunmap_local(). On !CONFIG_HIGHMEM configs
kmap_local_page() reduces to page_address(), so this is a no-op there.

Fixes: c25aaf814a63 ("hyperv: Enable sendbuf mechanism on the send path")
Cc: stable@vger.kernel.org
Signed-off-by: Anton Leontev <leontyevantony@gmail.com>
---
v2:
 - Reconstruct the physical address from the Hyper-V PFN and use
   phys_to_page() + offset_in_page() instead of pfn_to_page() on the
   raw PFN, correct where PAGE_SIZE > 4K (e.g. arm64 64K pages).
   Reported by Haiyang Zhang.
 - Built for i386 (CONFIG_HIGHMEM) and arm64 (64K pages).
 drivers/net/hyperv/netvsc.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 59e95341f9b1..2038d9f5c9f9 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -12,6 +12,7 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/delay.h>
 #include <linux/io.h>
 #include <linux/slab.h>
@@ -965,11 +966,14 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
 	}
 
 	for (i = 0; i < page_count; i++) {
-		char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
-		u32 offset = pb[i].offset;
+		phys_addr_t paddr = pb[i].pfn << HV_HYP_PAGE_SHIFT;
+		struct page *page = phys_to_page(paddr);
+		u32 offset = offset_in_page(paddr) + pb[i].offset;
 		u32 len = pb[i].len;
+		char *src = kmap_local_page(page);
 
 		memcpy(dest, (src + offset), len);
+		kunmap_local(src);
 		dest += len;
 	}
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH net v2] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Anton Leontev @ 2026-06-03 17:25 UTC (permalink / raw)
  To: netdev
  Cc: linux-hyperv, haiyangz, kys, wei.liu, decui, longli, kuba, pabeni,
	edumazet, davem, stable, linux-kernel, Anton Leontev

netvsc_copy_to_send_buf() copies skb fragment pages into the shared
VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
address that has no kernel page table entry and lies outside the
kernel direct map, so the subsequent memcpy() faults. As this happens
on the transmit softirq path, the fault is fatal.

A HIGHMEM fragment reaches this path whenever the page backing an skb
fragment lives above the LOWMEM boundary, which is common on a 32-bit
guest with several GiB of RAM (for example when the in-kernel NFS
server splices page cache pages directly into the reply skb).

pb[i].pfn is a Hyper-V PFN at HV_HYP_PAGE_SIZE (4K) granularity. The
physical address is reconstructed first and phys_to_page() is used to
obtain the native struct page, with offset_in_page() added so the
in-page offset stays correct where PAGE_SIZE > HV_HYP_PAGE_SIZE (e.g.
arm64 with 64K pages). The page is then mapped on demand with
kmap_local_page()/kunmap_local(). On !CONFIG_HIGHMEM configs
kmap_local_page() reduces to page_address(), so this is a no-op there.

Fixes: c25aaf814a63 ("hyperv: Enable sendbuf mechanism on the send path")
Cc: stable@vger.kernel.org
Signed-off-by: Anton Leontev <leontyevantony@gmail.com>
---
v2:
 - Reconstruct the physical address from the Hyper-V PFN and use
   phys_to_page() + offset_in_page() instead of pfn_to_page() on the
   raw PFN, correct where PAGE_SIZE > 4K (e.g. arm64 64K pages).
   Reported by Haiyang Zhang.
 - Built for i386 (CONFIG_HIGHMEM) and arm64 (64K pages).
 drivers/net/hyperv/netvsc.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 59e95341f9b1..2038d9f5c9f9 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -12,6 +12,7 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/delay.h>
 #include <linux/io.h>
 #include <linux/slab.h>
@@ -965,11 +966,14 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
 	}
 
 	for (i = 0; i < page_count; i++) {
-		char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
-		u32 offset = pb[i].offset;
+		phys_addr_t paddr = pb[i].pfn << HV_HYP_PAGE_SHIFT;
+		struct page *page = phys_to_page(paddr);
+		u32 offset = offset_in_page(paddr) + pb[i].offset;
 		u32 len = pb[i].len;
+		char *src = kmap_local_page(page);
 
 		memcpy(dest, (src + offset), len);
+		kunmap_local(src);
 		dest += len;
 	}
 
-- 
2.43.0


^ permalink raw reply related

* Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
From: Jork Loeser @ 2026-06-03 17:25 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-hyperv, linux-mm, kexec, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Pasha Tatashin, Pratyush Yadav,
	Alexander Graf, Jason Miu, Andrew Morton, David Hildenbrand,
	Muchun Song, Oscar Salvador, Baoquan He, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Kees Cook, Ran Xiaokai,
	Justinien Bouron, Sourabh Jain, Pingfan Liu, Rafael J. Wysocki,
	Mario Limonciello, linux-arm-kernel, x86, linux-kernel,
	Michael Kelley
In-Reply-To: <ah_z3GV55RY3ZnT-@kernel.org>



On Wed, 3 Jun 2026, Mike Rapoport wrote:

> On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote:
>> On Sun, 31 May 2026, Mike Rapoport wrote:
>>
>>>> Patch 19:      Export kexec_in_progress for modules
>>>
>>> Isn't there another way to differentiate kexec reboot?
>
> There's that "kexec reboot" string passed as the cmd to the reboot
> notifier.
> Maybe we can make it somehow more well defined API and use it?

A string? Oh dear - the compiler won't flag it on an API change then, which
is clearly not ideal. What's wrong with exporting kexec_in_progress?

Best,
Jork

^ permalink raw reply

* Re: [PATCH 0/2] Remove stack ib_udata's
From: Jason Gunthorpe @ 2026-06-03 18:16 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Junxian Huang, Kai Shen, Kalesh AP, Konstantin Taranov,
	Krzysztof Czurylo, Leon Romanovsky, linux-hyperv, linux-rdma,
	Long Li, Michal Kalderon, Nelson Escobar, Satish Kharat,
	Selvin Xavier, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas, Zhu Yanjun
  Cc: Leon Romanovsky, patches
In-Reply-To: <0-v1-922fa8e828ba+f7-ib_udata_stack_jgg@nvidia.com>

On Tue, May 26, 2026 at 01:15:04PM -0300, Jason Gunthorpe wrote:
> Sashiko pointed out these are dangerous, and the create_qp() one is in
> fact a bug. The query_device is just ugly old code.
> 
> Remove the stack ib_udata's from both places.
> 
> Jason Gunthorpe (2):
>   RDMA/core: Don't make a dummy ib_udata on the stack in create_qp
>   RDMA: Update the query_device() op
> 
>  drivers/infiniband/core/core_priv.h           |  2 +-
>  drivers/infiniband/core/device.c              |  3 +--
>  drivers/infiniband/core/ib_core_uverbs.c      | 12 +++++++++++
>  drivers/infiniband/core/rdma_core.h           |  7 +++++++
>  drivers/infiniband/core/uverbs_cmd.c          | 14 +------------
>  drivers/infiniband/core/uverbs_std_types_qp.c |  3 +--
>  drivers/infiniband/core/verbs.c               | 20 ++++++++++---------
>  drivers/infiniband/hw/bnxt_re/ib_verbs.c      |  5 ++++-
>  drivers/infiniband/hw/cxgb4/provider.c        |  8 +++++---
>  drivers/infiniband/hw/erdma/erdma_verbs.c     |  9 +++++++--
>  drivers/infiniband/hw/hns/hns_roce_main.c     |  7 ++++++-
>  drivers/infiniband/hw/ionic/ionic_ibdev.c     |  7 ++++++-
>  drivers/infiniband/hw/irdma/verbs.c           |  8 +++++---
>  drivers/infiniband/hw/mana/main.c             |  7 ++++++-
>  drivers/infiniband/hw/mlx4/main.c             | 13 ++++++------
>  drivers/infiniband/hw/mthca/mthca_provider.c  | 13 +++++++-----
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   |  8 +++++---
>  drivers/infiniband/hw/qedr/verbs.c            |  7 ++++++-
>  drivers/infiniband/hw/usnic/usnic_ib_verbs.c  |  8 +++++---
>  .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.c   |  8 +++++---
>  drivers/infiniband/sw/rdmavt/vt.c             |  9 ++++++---
>  drivers/infiniband/sw/rxe/rxe_verbs.c         | 14 ++++---------
>  drivers/infiniband/sw/siw/siw_verbs.c         |  8 +++++---
>  23 files changed, 124 insertions(+), 76 deletions(-)

Applied to for-next

Jason

^ permalink raw reply

* Re: [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: sashiko-bot @ 2026-06-03 20:28 UTC (permalink / raw)
  To: Dipayaan Roy; +Cc: linux-hyperv, bpf
In-Reply-To: <20260602202801.1873742-3-dipayanroy@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Forcing `apc->port_is_up = true` on the `mana_attach` error path causes a double-destroy of the hardware vport when the interface is later brought down.
- [Medium] Latent bug: Changes to future private flags will be silently ignored if MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF is not simultaneously toggled.
--

commit d7983f6668f209a0b7bfb427c64d5517b326c7b1
Author: Dipayaan Roy <dipayanroy@linux.microsoft.com>

net: mana: force full-page RX buffers via ethtool private flag

Adds an ethtool private flag full-page-rx to bypass page_pool fragment
allocation and force one RX buffer per page. Includes logic in
mana_set_priv_flags to dynamically reconfigure the port on flag change.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c

[ ... ]

> +static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
> +{
> +	struct mana_port_context *apc = netdev_priv(ndev);
> +	u32 changed = apc->priv_flags ^ priv_flags;
> +	u32 old_priv_flags = apc->priv_flags;
> +	bool schedule_port_reset = false;
> +	int err = 0;
> +
> +	if (!changed)
> +		return 0;
> +
> +	/* Reject unknown bits */
> +	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
> +		return -EINVAL;
> +
> +	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
> +		apc->priv_flags = priv_flags;
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[Severity: Medium]
When additional private flags are added in the future, will changes to those
new flags be silently ignored if MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF isn't
also toggled?

Since this assignment is nested inside the FULL_PAGE_RXBUF check, changing
only a future new flag will bypass this assignment and the function will
return success while discarding the new flag state.

Should this assignment be moved outside the flag-specific conditional block?
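
One possible shape of that change, as a userspace sketch rather than the
driver code: the new flag word is stored before any flag-specific handling,
and rolled back only if that handling fails. All names here (ex_port,
EX_FLAG_FUTURE, ex_reconfigure(), fail_reconfig) are hypothetical, and
ex_reconfigure() merely stands in for the detach/attach sequence.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum {
	EX_FLAG_FULL_PAGE_RX,	/* stands in for MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF */
	EX_FLAG_FUTURE,		/* hypothetical flag added later */
	EX_FLAG_MAX,
};

#define EX_BIT(n)	(1u << (n))
#define EX_VALID_MASK	(EX_BIT(EX_FLAG_MAX) - 1)

struct ex_port {
	uint32_t priv_flags;
	int reconfig_count;	/* how often the port was reconfigured */
	int fail_reconfig;	/* test knob: make reconfiguration fail */
};

/* Stand-in for the detach/attach sequence. */
int ex_reconfigure(struct ex_port *p)
{
	p->reconfig_count++;
	return p->fail_reconfig ? -EIO : 0;
}

int ex_set_priv_flags(struct ex_port *p, uint32_t new_flags)
{
	uint32_t old = p->priv_flags;
	uint32_t changed = old ^ new_flags;
	int err;

	if (!changed)
		return 0;

	/* Reject unknown bits */
	if (new_flags & ~EX_VALID_MASK)
		return -EINVAL;

	/* Store every requested bit, independent of which flag changed. */
	p->priv_flags = new_flags;

	if (changed & EX_BIT(EX_FLAG_FULL_PAGE_RX)) {
		err = ex_reconfigure(p);
		if (err) {
			p->priv_flags = old;	/* roll back on failure */
			return err;
		}
	}
	return 0;
}
```

With this layout, toggling only EX_FLAG_FUTURE is recorded without a
reconfiguration, while toggling EX_FLAG_FULL_PAGE_RX still goes through the
reconfigure-and-rollback path.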

> +
> +		if (!apc->port_is_up) {

[ ... ]

> +		err = mana_attach(ndev);
> +		if (err) {
> +			netdev_err(ndev, "mana_attach failed: %d\n", err);
> +			apc->priv_flags = old_priv_flags;
> +
> +			/* Restore 'port_is_up' so the reset work handler
> +			 * can properly detach/attach. Without this,
> +			 * the handler sees port_is_up=false and skips
> +			 * queue allocation, leaving the port dead.
> +			 */
> +			apc->port_is_up = true;
                        ^^^^^^^^^^^^^^^^^^^^^^

[Severity: High]
Does forcing port_is_up to true here cause a double-destroy of the hardware
vport when the interface is later brought down?

If mana_attach fails, it cleans up any partial state by destroying the vport.

Leaving the port artificially marked as up means a subsequent interface down
event or reset worker will see apc->port_is_up as true and call
mana_detach with from_close set to true.

This bypasses the netif_device_present check and triggers a second call to
mana_destroy_vport, which could drop apc->vport_use_count below zero and
send a duplicate deregister message to the firmware for the already-freed
hw_vport_handle.

> +			schedule_port_reset = true;
> +		}
> +	}
> +
> +out:
> +	mana_pre_dealloc_rxbufs(apc);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260602202801.1873742-1-dipayanroy@linux.microsoft.com?part=2

^ permalink raw reply

* [PATCH v2 0/4] Convert remaining buses to generic driver_override handling
From: Runyu Xiao @ 2026-06-04  3:52 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: dakr, driver-core, linux, andersson, mathieu.poirier, kys,
	haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
	linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
	jianhao.xu, runyu.xiao
In-Reply-To: <20260602160829.560904-1-runyu.xiao@seu.edu.cn>

This series converts four remaining buses from bus-private
driver_override handling to the generic driver-core infrastructure:

  - AMBA
  - RPMSG
  - VMBUS
  - CDX

These buses still keep private driver_override storage and read it
directly from their match paths. However, bus match() callbacks can be
reached from __driver_attach() without the device lock held, so those
raw reads can race with updates that replace and free the override
string.

The driver core already provides generic driver_override storage and
matching helpers with the required internal locking. Other buses have
already been converted to that model. This series switches the
remaining users above to the same infrastructure by:

  - removing bus-private driver_override storage
  - dropping bus-local driver_override sysfs handling
  - enabling struct bus_type.driver_override
  - using device_match_driver_override() in match paths
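
To illustrate why the locked helper matters, here is a small userspace
model of the idea, not the driver-core implementation. The names ex_device,
ex_set_override() and ex_match_override() are hypothetical, and a pthread
mutex stands in for the core's internal locking. The return convention
(negative when no override is set, otherwise 0 or 1) is assumed from the
"ret >= 0" checks used in this series.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct ex_device {
	pthread_mutex_t lock;	/* protects override */
	char *override;		/* NULL when no override is set */
};

/* Swap in a new override under the lock; an empty name clears it. */
int ex_set_override(struct ex_device *dev, const char *name)
{
	char *fresh = NULL, *old;

	if (name && name[0] != '\0') {
		size_t len = strlen(name) + 1;

		fresh = malloc(len);
		if (!fresh)
			return -1;
		memcpy(fresh, name, len);
	}

	pthread_mutex_lock(&dev->lock);
	old = dev->override;
	dev->override = fresh;
	pthread_mutex_unlock(&dev->lock);

	/* Readers only dereference under the lock, so this free is safe. */
	free(old);
	return 0;
}

/*
 * Returns 1 on match, 0 on mismatch and -1 when no override is set, so a
 * caller can fall through to ID-table matching on a negative result.
 */
int ex_match_override(struct ex_device *dev, const char *drv_name)
{
	int ret = -1;

	pthread_mutex_lock(&dev->lock);
	if (dev->override)
		ret = strcmp(dev->override, drv_name) == 0;
	pthread_mutex_unlock(&dev->lock);
	return ret;
}
```

A raw read of dev->override outside the lock could observe a pointer that
ex_set_override() is about to free, which is the race described above.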

Bus-specific behavior is preserved where needed:

  - VMBUS keeps its dummy-id fallback for override-based binding
  - CDX keeps its override_only matching semantics
  - RPMSG converts its in-kernel override registration path to
    device_set_driver_override() and drops the old transport-local
    frees of bus-private override storage

Before preparing this v2 series, I rechecked the affected source paths
against v7.1-rc6. I also reran the existing report-specific no-device
KCSAN stand-ins on a local v7.1-rc6 guest for all four buses. Those
reruns again produced target-stack reports for the corresponding
driver_override update/match paths.

That runtime validation is still stand-in based rather than direct
hardware execution, but it reuses the real driver_set_override() helper
from the running v7.1-rc6 guest kernel and preserves the relevant
patch-local reader/writer contracts and caller chains.

Since v1:
  - reworked the series around the generic driver_override
    infrastructure instead of trying to serialize bus match() with
    device_lock(dev)
  - split the changes by bus
  - preserved VMBUS dummy-id fallback behavior explicitly
  - preserved CDX override_only matching semantics explicitly
  - converted the RPMSG in-kernel override registration path to the
    core helper
  - reran the four report-specific no-device KCSAN stand-ins on a
    local v7.1-rc6 guest and refreshed the validation basis
  - refreshed the commit messages accordingly

Runyu Xiao (4):
  amba: use generic driver_override infrastructure
  rpmsg: core: use generic driver_override infrastructure
  vmbus: use generic driver_override infrastructure
  cdx: use generic driver_override infrastructure

 drivers/amba/bus.c                | 35 +++++--------------------------
 drivers/cdx/cdx.c                 | 40 +++++-------------------------------
 drivers/hv/vmbus_drv.c            | 36 +++++---------------------------
 drivers/rpmsg/qcom_glink_native.c |  2 --
 drivers/rpmsg/rpmsg_core.c        | 41 ++++++-------------------------------
 drivers/rpmsg/virtio_rpmsg_bus.c  |  1 -
 include/linux/amba/bus.h          |  5 -----
 include/linux/cdx/cdx_bus.h       |  1 -
 include/linux/hyperv.h            |  6 ------
 include/linux/rpmsg.h             |  4 ----
 10 files changed, 21 insertions(+), 150 deletions(-)

-- 
2.34.1

^ permalink raw reply

* [PATCH v2 1/4] amba: use generic driver_override infrastructure
From: Runyu Xiao @ 2026-06-04  3:52 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: dakr, driver-core, linux, andersson, mathieu.poirier, kys,
	haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
	linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
	jianhao.xu, runyu.xiao, stable
In-Reply-To: <20260604035239.1711889-1-runyu.xiao@seu.edu.cn>

AMBA devices still keep driver_override in bus-private storage.

The sysfs write side updates that string through driver_set_override(),
which replaces the pointer and frees the old value. However,
driver_match_device() can call amba_match() from __driver_attach()
without holding the device lock, and amba_match() still dereferences
that private pointer directly.

That means a bind/unbind or reprobe path can race with a concurrent
driver_override update and make amba_match() compare against freed
memory.

Fix this by switching AMBA to the driver-core driver_override
infrastructure. This lets the core own the sysfs attribute and storage,
and uses device_match_driver_override() for the locked read in the match
path.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/
Fixes: 3cf385713460 ("ARM: 8256/1: driver coamba: add device binding path 'driver_override'")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
 drivers/amba/bus.c       | 35 +++++------------------------------
 include/linux/amba/bus.h |  5 -----
 2 files changed, 5 insertions(+), 35 deletions(-)

diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
index 6d479caf89cb..df8333f90906 100644
--- a/drivers/amba/bus.c
+++ b/drivers/amba/bus.c
@@ -82,33 +82,6 @@ static void amba_put_disable_pclk(struct amba_device *pcdev)
 }
 
 
-static ssize_t driver_override_show(struct device *_dev,
-				    struct device_attribute *attr, char *buf)
-{
-	struct amba_device *dev = to_amba_device(_dev);
-	ssize_t len;
-
-	device_lock(_dev);
-	len = sprintf(buf, "%s\n", dev->driver_override);
-	device_unlock(_dev);
-	return len;
-}
-
-static ssize_t driver_override_store(struct device *_dev,
-				     struct device_attribute *attr,
-				     const char *buf, size_t count)
-{
-	struct amba_device *dev = to_amba_device(_dev);
-	int ret;
-
-	ret = driver_set_override(_dev, &dev->driver_override, buf, count);
-	if (ret)
-		return ret;
-
-	return count;
-}
-static DEVICE_ATTR_RW(driver_override);
-
 #define amba_attr_func(name,fmt,arg...)					\
 static ssize_t name##_show(struct device *_dev,				\
 			   struct device_attribute *attr, char *buf)	\
@@ -126,7 +99,6 @@ amba_attr_func(resource, "\t%016llx\t%016llx\t%016lx\n",
 static struct attribute *amba_dev_attrs[] = {
 	&dev_attr_id.attr,
 	&dev_attr_resource.attr,
-	&dev_attr_driver_override.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(amba_dev);
@@ -209,6 +181,7 @@ static int amba_match(struct device *dev, const struct device_driver *drv)
 {
 	struct amba_device *pcdev = to_amba_device(dev);
 	const struct amba_driver *pcdrv = to_amba_driver(drv);
+	int ret;
 
 	mutex_lock(&pcdev->periphid_lock);
 	if (!pcdev->periphid) {
@@ -230,8 +203,9 @@ static int amba_match(struct device *dev, const struct device_driver *drv)
 	mutex_unlock(&pcdev->periphid_lock);
 
 	/* When driver_override is set, only bind to the matching driver */
-	if (pcdev->driver_override)
-		return !strcmp(pcdev->driver_override, drv->name);
+	ret = device_match_driver_override(dev, drv);
+	if (ret >= 0)
+		return ret;
 
 	return amba_lookup(pcdrv->id_table, pcdev) != NULL;
 }
@@ -435,6 +409,7 @@ static const struct dev_pm_ops amba_pm = {
  */
 const struct bus_type amba_bustype = {
 	.name		= "amba",
+	.driver_override = true,
 	.dev_groups	= amba_dev_groups,
 	.match		= amba_match,
 	.uevent		= amba_uevent,
diff --git a/include/linux/amba/bus.h b/include/linux/amba/bus.h
index 9946276aff73..6c54d5c0d21f 100644
--- a/include/linux/amba/bus.h
+++ b/include/linux/amba/bus.h
@@ -71,11 +71,6 @@ struct amba_device {
 	unsigned int		cid;
 	struct amba_cs_uci_id	uci;
 	unsigned int		irq[AMBA_NR_IRQS];
-	/*
-	 * Driver name to force a match.  Do not set directly, because core
-	 * frees it.  Use driver_set_override() to set or clear it.
-	 */
-	const char		*driver_override;
 };
 
 struct amba_driver {
-- 
2.34.1

^ permalink raw reply related

* [PATCH v2 2/4] rpmsg: core: use generic driver_override infrastructure
From: Runyu Xiao @ 2026-06-04  3:52 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: dakr, driver-core, linux, andersson, mathieu.poirier, kys,
	haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
	linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
	jianhao.xu, runyu.xiao, stable
In-Reply-To: <20260604035239.1711889-1-runyu.xiao@seu.edu.cn>

RPMSG still keeps driver_override in bus-private storage.

That private pointer can be updated from the sysfs driver_override
attribute, and also from rpmsg_register_device_override(). Both paths
replace the pointer and can free the old value.

However, driver_match_device() can call rpmsg_dev_match() from
__driver_attach() without holding the device lock, and rpmsg_dev_match()
still dereferences that private pointer directly.

This leaves the match path racing with concurrent driver_override
updates, with the usual risk of comparing against freed memory.

Switch rpmsg to the driver-core driver_override infrastructure. This
removes the private storage, uses device_match_driver_override() for the
locked read in rpmsg_dev_match(), and converts
rpmsg_register_device_override() to device_set_driver_override() so the
in-kernel override path uses the same core-managed storage. With that
storage now owned by struct device, drop the remaining rpmsg transport
release-path frees of rpdev->driver_override as well.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/
Fixes: 39e47767ec9b ("rpmsg: Add driver_override device attribute for rpmsg_device")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
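Not part of the patch: a minimal user-space sketch of the tri-state
return convention that the new rpmsg_dev_match() appears to rely on, as
inferred from the hunks below (negative when no override is set, 0 when
an override names another driver, positive when it names this driver).
model_match_override() and model_rpmsg_match() are hypothetical
stand-ins, not the driver-core implementation. The locking done by the
real helper is omitted, and only the ID-table loop is modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the return convention assumed for
 * device_match_driver_override():
 *   < 0  no override is set
 *     0  an override is set but names a different driver
 *   > 0  an override is set and names this driver
 */
static int model_match_override(const char *override, const char *drv_name)
{
	if (!override)
		return -1;
	return strcmp(override, drv_name) == 0;
}

/*
 * Hypothetical model of the rpmsg match flow: an override, when set,
 * decides the result on its own; otherwise the ID table is consulted.
 * ids is a NULL-terminated array of names (an assumption of this model).
 */
static bool model_rpmsg_match(const char *override, const char *drv_name,
			      const char *dev_id, const char *const *ids)
{
	int ret = model_match_override(override, drv_name);
	size_t i;

	if (ret >= 0)
		return ret;

	for (i = 0; ids && ids[i]; i++)
		if (strcmp(ids[i], dev_id) == 0)
			return true;

	return false;
}
```

In this model a set override short-circuits the ID table in both
directions: a mismatch rejects a device whose ID would otherwise match,
and a match accepts a device whose ID is not in the table.
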
 drivers/rpmsg/qcom_glink_native.c |  2 --
 drivers/rpmsg/rpmsg_core.c        | 41 ++++++--------------------------------
 drivers/rpmsg/virtio_rpmsg_bus.c  |  1 -
 include/linux/rpmsg.h             |  4 ----
 4 files changed, 6 insertions(+), 42 deletions(-)

diff --git a/drivers/rpmsg/rpmsg_core.c b/drivers/rpmsg/rpmsg_core.c
index e7f7831d37f8..11d3007db5cd 100644
--- a/drivers/rpmsg/rpmsg_core.c
+++ b/drivers/rpmsg/rpmsg_core.c
@@ -358,33 +358,6 @@ rpmsg_show_attr(src, src, "0x%x\n");
 rpmsg_show_attr(dst, dst, "0x%x\n");
 rpmsg_show_attr(announce, announce ? "true" : "false", "%s\n");
 
-static ssize_t driver_override_store(struct device *dev,
-				     struct device_attribute *attr,
-				     const char *buf, size_t count)
-{
-	struct rpmsg_device *rpdev = to_rpmsg_device(dev);
-	int ret;
-
-	ret = driver_set_override(dev, &rpdev->driver_override, buf, count);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
-				    struct device_attribute *attr, char *buf)
-{
-	struct rpmsg_device *rpdev = to_rpmsg_device(dev);
-	ssize_t len;
-
-	device_lock(dev);
-	len = sysfs_emit(buf, "%s\n", rpdev->driver_override);
-	device_unlock(dev);
-	return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
 static ssize_t modalias_show(struct device *dev,
 			     struct device_attribute *attr, char *buf)
 {
@@ -405,7 +378,6 @@ static struct attribute *rpmsg_dev_attrs[] = {
 	&dev_attr_dst.attr,
 	&dev_attr_src.attr,
 	&dev_attr_announce.attr,
-	&dev_attr_driver_override.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(rpmsg_dev);
@@ -424,9 +396,11 @@ static int rpmsg_dev_match(struct device *dev, const struct device_driver *drv)
 	const struct rpmsg_driver *rpdrv = to_rpmsg_driver(drv);
 	const struct rpmsg_device_id *ids = rpdrv->id_table;
 	unsigned int i;
+	int ret;
 
-	if (rpdev->driver_override)
-		return !strcmp(rpdev->driver_override, drv->name);
+	ret = device_match_driver_override(dev, drv);
+	if (ret >= 0)
+		return ret;
 
 	if (ids)
 		for (i = 0; ids[i].name[0]; i++)
@@ -533,6 +507,7 @@ static void rpmsg_dev_remove(struct device *dev)
 
 static const struct bus_type rpmsg_bus = {
 	.name		= "rpmsg",
+	.driver_override = true,
 	.match		= rpmsg_dev_match,
 	.dev_groups	= rpmsg_dev_groups,
 	.uevent		= rpmsg_uevent,
@@ -560,9 +535,7 @@ int rpmsg_register_device_override(struct rpmsg_device *rpdev,
 
 	device_initialize(dev);
 	if (driver_override) {
-		ret = driver_set_override(dev, &rpdev->driver_override,
-					  driver_override,
-					  strlen(driver_override));
+		ret = device_set_driver_override(dev, driver_override);
 		if (ret) {
 			dev_err(dev, "device_set_override failed: %d\n", ret);
 			put_device(dev);
@@ -573,8 +546,6 @@ int rpmsg_register_device_override(struct rpmsg_device *rpdev,
 	ret = device_add(dev);
 	if (ret) {
 		dev_err(dev, "device_add failed: %d\n", ret);
-		kfree(rpdev->driver_override);
-		rpdev->driver_override = NULL;
 		put_device(dev);
 	}
 
diff --git a/drivers/rpmsg/qcom_glink_native.c b/drivers/rpmsg/qcom_glink_native.c
index 401a4ece0c97..d9d4468e4cbd 100644
--- a/drivers/rpmsg/qcom_glink_native.c
+++ b/drivers/rpmsg/qcom_glink_native.c
@@ -1626,7 +1626,6 @@ static void qcom_glink_rpdev_release(struct device *dev)
 {
 	struct rpmsg_device *rpdev = to_rpmsg_device(dev);
 
-	kfree(rpdev->driver_override);
 	kfree(rpdev);
 }
 
@@ -1862,7 +1861,6 @@ static void qcom_glink_device_release(struct device *dev)
 
 	/* Release qcom_glink_alloc_channel() reference */
 	kref_put(&channel->refcount, qcom_glink_channel_release);
-	kfree(rpdev->driver_override);
 	kfree(rpdev);
 }
 
diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c
index 5ae15111fb4f..1b8bb05924af 100644
--- a/drivers/rpmsg/virtio_rpmsg_bus.c
+++ b/drivers/rpmsg/virtio_rpmsg_bus.c
@@ -374,7 +374,6 @@ static void virtio_rpmsg_release_device(struct device *dev)
 	struct rpmsg_device *rpdev = to_rpmsg_device(dev);
 	struct virtio_rpmsg_channel *vch = to_virtio_rpmsg_channel(rpdev);
 
-	kfree(rpdev->driver_override);
 	kfree(vch);
 }
 
diff --git a/include/linux/rpmsg.h b/include/linux/rpmsg.h
index 83266ce14642..2e40eb54155e 100644
--- a/include/linux/rpmsg.h
+++ b/include/linux/rpmsg.h
@@ -41,9 +41,6 @@ struct rpmsg_channel_info {
  * rpmsg_device - device that belong to the rpmsg bus
  * @dev: the device struct
  * @id: device id (used to match between rpmsg drivers and devices)
- * @driver_override: driver name to force a match; do not set directly,
- *                   because core frees it; use driver_set_override() to
- *                   set or clear it.
  * @src: local address
  * @dst: destination address
  * @ept: the rpmsg endpoint of this channel
@@ -53,7 +50,6 @@ struct rpmsg_channel_info {
 struct rpmsg_device {
 	struct device dev;
 	struct rpmsg_device_id id;
-	const char *driver_override;
 	u32 src;
 	u32 dst;
 	struct rpmsg_endpoint *ept;
-- 
2.34.1


* [PATCH v2 4/4] cdx: use generic driver_override infrastructure
From: Runyu Xiao @ 2026-06-04  3:52 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: dakr, driver-core, linux, andersson, mathieu.poirier, kys,
	haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
	linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
	jianhao.xu, runyu.xiao, stable
In-Reply-To: <20260604035239.1711889-1-runyu.xiao@seu.edu.cn>

CDX devices still keep driver_override in bus-private storage.

The sysfs write side updates that string through driver_set_override(),
which replaces the pointer and frees the old value. However,
driver_match_device() can call cdx_bus_match() from __driver_attach()
without holding the device lock, and cdx_bus_match() still dereferences
that private pointer directly.

That means the CDX match path can race with a concurrent
driver_override update and compare against freed memory.

Switch CDX to the driver-core driver_override infrastructure. This
removes the private driver_override storage, lets the core provide the
sysfs attribute, and uses device_match_driver_override() for the locked
read in cdx_bus_match().

Preserve the existing CDX override_only semantics: entries marked
override_only still require a matching driver_override, but ordinary ID
matches continue to work unchanged.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/
Fixes: 48a6c7bced2a ("cdx: add device attributes")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
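Not part of the patch: a minimal user-space sketch of the override_only
semantics described above, as inferred from the cdx_bus_match() hunks
below. struct model_cdx_id, model_match_override() and model_cdx_match()
are hypothetical simplifications; the real code walks the table with
cdx_match_id() and reads the override under a lock.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified ID entry; vendor == 0 terminates a table. */
struct model_cdx_id {
	unsigned int vendor;
	unsigned int device;
	bool override_only;
};

/* Hypothetical tri-state helper: < 0 unset, 0 mismatch, > 0 match. */
static int model_match_override(const char *override, const char *drv_name)
{
	if (!override)
		return -1;
	return strcmp(override, drv_name) == 0;
}

static bool model_cdx_match(const char *override, const char *drv_name,
			    unsigned int vendor, unsigned int device,
			    const struct model_cdx_id *ids)
{
	int ret = model_match_override(override, drv_name);

	/* An override naming another driver vetoes the match outright. */
	if (ret == 0)
		return false;

	for (; ids && ids->vendor; ids++) {
		if (ids->vendor != vendor || ids->device != device)
			continue;
		/* Ordinary entries match with or without an override. */
		if (!ids->override_only)
			return true;
		/* override_only entries additionally need a matching override. */
		if (ret > 0)
			return true;
	}

	return false;
}
```

Note that in this model a matching override alone does not bind the
device: an ID entry must still match, which mirrors the placement of the
"ret > 0" check inside the found_id branch.
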
 drivers/cdx/cdx.c           | 40 +++++--------------------------------
 include/linux/cdx/cdx_bus.h |  1 -
 2 files changed, 5 insertions(+), 36 deletions(-)

diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c
index 9196dc50a48d..d3d230247262 100644
--- a/drivers/cdx/cdx.c
+++ b/drivers/cdx/cdx.c
@@ -156,8 +156,6 @@ static int cdx_unregister_device(struct device *dev,
 	} else {
 		cdx_destroy_res_attr(cdx_dev, MAX_CDX_DEV_RESOURCES);
 		debugfs_remove_recursive(cdx_dev->debugfs_dir);
-		kfree(cdx_dev->driver_override);
-		cdx_dev->driver_override = NULL;
 	}
 
 	/*
@@ -268,6 +266,7 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
 	const struct cdx_driver *cdx_drv = to_cdx_driver(drv);
 	const struct cdx_device_id *found_id = NULL;
 	const struct cdx_device_id *ids;
+	int ret;
 
 	if (cdx_dev->is_bus)
 		return false;
@@ -275,7 +274,8 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
 	ids = cdx_drv->match_id_table;
 
 	/* When driver_override is set, only bind to the matching driver */
-	if (cdx_dev->driver_override && strcmp(cdx_dev->driver_override, drv->name))
+	ret = device_match_driver_override(dev, drv);
+	if (ret == 0)
 		return false;
 
 	found_id = cdx_match_id(ids, cdx_dev);
@@ -289,7 +289,7 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
 		 */
 		if (!found_id->override_only)
 			return true;
-		if (cdx_dev->driver_override)
+		if (ret > 0)
 			return true;
 
 		ids = found_id + 1;
@@ -453,36 +453,6 @@ static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RO(modalias);
 
-static ssize_t driver_override_store(struct device *dev,
-				     struct device_attribute *attr,
-				     const char *buf, size_t count)
-{
-	struct cdx_device *cdx_dev = to_cdx_device(dev);
-	int ret;
-
-	if (WARN_ON(dev->bus != &cdx_bus_type))
-		return -EINVAL;
-
-	ret = driver_set_override(dev, &cdx_dev->driver_override, buf, count);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
-				    struct device_attribute *attr, char *buf)
-{
-	struct cdx_device *cdx_dev = to_cdx_device(dev);
-	ssize_t len;
-
-	device_lock(dev);
-	len = sysfs_emit(buf, "%s\n", cdx_dev->driver_override);
-	device_unlock(dev);
-	return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
 static ssize_t enable_store(struct device *dev, struct device_attribute *attr,
 			    const char *buf, size_t count)
 {
@@ -552,7 +522,6 @@ static struct attribute *cdx_dev_attrs[] = {
 	&dev_attr_class.attr,
 	&dev_attr_revision.attr,
 	&dev_attr_modalias.attr,
-	&dev_attr_driver_override.attr,
 	NULL,
 };
 
@@ -646,6 +615,7 @@ ATTRIBUTE_GROUPS(cdx_bus);
 
 const struct bus_type cdx_bus_type = {
 	.name		= "cdx",
+	.driver_override = true,
 	.match		= cdx_bus_match,
 	.probe		= cdx_probe,
 	.remove		= cdx_remove,
diff --git a/include/linux/cdx/cdx_bus.h b/include/linux/cdx/cdx_bus.h
index b1ba97f6c9ad..f1a107b232da 100644
--- a/include/linux/cdx/cdx_bus.h
+++ b/include/linux/cdx/cdx_bus.h
@@ -165,7 +165,6 @@ struct cdx_device {
 	bool enabled;
 	u32 msi_dev_id;
 	u32 num_msi;
-	const char *driver_override;
 	struct mutex irqchip_lock;
 	bool msi_write_pending;
 };
-- 
2.34.1


* [PATCH v2 3/4] vmbus: use generic driver_override infrastructure
From: Runyu Xiao @ 2026-06-04  3:52 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: dakr, driver-core, linux, andersson, mathieu.poirier, kys,
	haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
	linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
	jianhao.xu, runyu.xiao, stable
In-Reply-To: <20260604035239.1711889-1-runyu.xiao@seu.edu.cn>

VMBUS devices still keep driver_override in bus-private storage.

The sysfs write side updates that string through driver_set_override(),
which replaces the pointer and frees the old value. However,
driver_match_device() can call into hv_vmbus_get_id() from
__driver_attach() without holding the device lock, and hv_vmbus_get_id()
still dereferences that private pointer directly.

That means a bind/reprobe path can race with a concurrent
driver_override update and make the match logic inspect freed memory.

Switch vmbus to the driver-core driver_override infrastructure. This
removes the private driver_override storage and uses
device_match_driver_override() for the locked read in the match path.

Keep the existing vmbus semantics intact: if driver_override matches but
no dynamic or static device ID matches, continue to return the dummy
vmbus_device_null ID so override-only binding still works as before.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/
Fixes: d765edbb301c ("vmbus: add driver_override support")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
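Not part of the patch: a minimal user-space sketch of the dummy-ID
fallback described above, as inferred from the hv_vmbus_get_id() hunks
below. struct model_id, model_null_id, model_match_override() and
model_get_id() are hypothetical; the real code compares guid_t values,
checks dynamic IDs before static ones, and returns vmbus_device_null.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical ID entry; guid == NULL terminates a table. */
struct model_id {
	const char *guid;
};

/* Stands in for vmbus_device_null in this model. */
static const struct model_id model_null_id = { "" };

/* Hypothetical tri-state helper: < 0 unset, 0 mismatch, > 0 match. */
static int model_match_override(const char *override, const char *drv_name)
{
	if (!override)
		return -1;
	return strcmp(override, drv_name) == 0;
}

static const struct model_id *model_get_id(const char *override,
					   const char *drv_name,
					   const char *guid,
					   const struct model_id *table)
{
	const struct model_id *id = NULL;
	int ret = model_match_override(override, drv_name);

	/* An override naming another driver means no ID at all. */
	if (ret == 0)
		return NULL;

	/* Single table here; dynamic and static lookups are collapsed. */
	for (; table && table->guid; table++) {
		if (strcmp(table->guid, guid) == 0) {
			id = table;
			break;
		}
	}

	/* A matching override always binds: hand back the dummy ID. */
	if (!id && ret > 0)
		id = &model_null_id;

	return id;
}
```

Unlike a bus that still requires an ID hit, this model lets a matching
override bind a device with no table entry, which is the override-only
behaviour the commit message says is preserved.
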
 drivers/hv/vmbus_drv.c | 36 +++++-------------------------------
 include/linux/hyperv.h |  6 ------
 2 files changed, 5 insertions(+), 37 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index d28ff45d4cfd..a81e2b097636 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -538,34 +538,6 @@ static ssize_t device_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(device);
 
-static ssize_t driver_override_store(struct device *dev,
-				     struct device_attribute *attr,
-				     const char *buf, size_t count)
-{
-	struct hv_device *hv_dev = device_to_hv_device(dev);
-	int ret;
-
-	ret = driver_set_override(dev, &hv_dev->driver_override, buf, count);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
-				    struct device_attribute *attr, char *buf)
-{
-	struct hv_device *hv_dev = device_to_hv_device(dev);
-	ssize_t len;
-
-	device_lock(dev);
-	len = sysfs_emit(buf, "%s\n", hv_dev->driver_override);
-	device_unlock(dev);
-
-	return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
 /* Set up per device attributes in /sys/bus/vmbus/devices/<bus device> */
 static struct attribute *vmbus_dev_attrs[] = {
 	&dev_attr_id.attr,
@@ -596,7 +568,6 @@ static struct attribute *vmbus_dev_attrs[] = {
 	&dev_attr_channel_vp_mapping.attr,
 	&dev_attr_vendor.attr,
 	&dev_attr_device.attr,
-	&dev_attr_driver_override.attr,
 	NULL,
 };
 
@@ -708,9 +679,11 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
 {
 	const guid_t *guid = &dev->dev_type;
 	const struct hv_vmbus_device_id *id;
+	int ret;
 
 	/* When driver_override is set, only bind to the matching driver */
-	if (dev->driver_override && strcmp(dev->driver_override, drv->name))
+	ret = device_match_driver_override(&dev->device, &drv->driver);
+	if (ret == 0)
 		return NULL;
 
 	/* Look at the dynamic ids first, before the static ones */
@@ -719,7 +692,7 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
 		id = hv_vmbus_dev_match(drv->id_table, guid);
 
 	/* driver_override will always match, send a dummy id */
-	if (!id && dev->driver_override)
+	if (!id && ret > 0)
 		id = &vmbus_device_null;
 
 	return id;
@@ -1021,6 +994,7 @@ static const struct dev_pm_ops vmbus_pm = {
 /* The one and only one */
 static const struct bus_type  hv_bus = {
 	.name =		"vmbus",
+	.driver_override = true,
 	.match =		vmbus_match,
 	.shutdown =		vmbus_shutdown,
 	.remove =		vmbus_remove,
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 964f1be8150c..f9ede569602d 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1272,12 +1272,6 @@ struct hv_device {
 	u16 device_id;
 
 	struct device device;
-	/*
-	 * Driver name to force a match.  Do not set directly, because core
-	 * frees it.  Use driver_set_override() to set or clear it.
-	 */
-	const char *driver_override;
-
 	struct vmbus_channel *channel;
 	struct kset	     *channels_kset;
 	struct device_dma_parameters dma_parms;
-- 
2.34.1


* Re: [PATCH v2 4/4] cdx: use generic driver_override infrastructure
From: sashiko-bot @ 2026-06-04  4:08 UTC (permalink / raw)
  To: Runyu Xiao; +Cc: linux-hyperv
In-Reply-To: <20260604035239.1711889-5-runyu.xiao@seu.edu.cn>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] Stale kernel-doc entry for removed `@driver_override` member in `struct cdx_device`.
--

commit c9a2a632f33a0544d4ef0145d2020070218d6838
Author: Runyu Xiao <runyu.xiao@seu.edu.cn>

cdx: use generic driver_override infrastructure

This commit switches the CDX subsystem to use the generic driver-core
driver_override infrastructure. It removes the private driver_override
storage and uses device_match_driver_override() to prevent races during
bus matching.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/

[ ... ]

> diff --git a/include/linux/cdx/cdx_bus.h b/include/linux/cdx/cdx_bus.h
> index b1ba97f6c9ad6..f1a107b232daa 100644
> --- a/include/linux/cdx/cdx_bus.h
> +++ b/include/linux/cdx/cdx_bus.h
> @@ -165,7 +165,6 @@ struct cdx_device {
>  	bool enabled;
>  	u32 msi_dev_id;
>  	u32 num_msi;
> -	const char *driver_override;

[Severity: Low]
Does this leave a stale kernel-doc entry for @driver_override in the
struct cdx_device documentation block?

The struct documentation still references the removed field:

 * @driver_override: driver name to force a match; do not set directly,
 *                   because core frees it; use driver_set_override() to
 *                   set or clear it.

This might trigger an excess kerneldoc parameter warning during the kernel
build (with W=1).

>  	struct mutex irqchip_lock;
>  	bool msi_write_pending;
>  };

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604035239.1711889-1-runyu.xiao@seu.edu.cn?part=4


* Re: [PATCH v2 3/4] vmbus: use generic driver_override infrastructure
From: sashiko-bot @ 2026-06-04  4:13 UTC (permalink / raw)
  To: Runyu Xiao; +Cc: linux-hyperv
In-Reply-To: <20260604035239.1711889-4-runyu.xiao@seu.edu.cn>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] A time-of-check to time-of-use (TOCTOU) race condition in `vmbus_probe()` allows `drv->probe` to be called with a NULL `dev_id`, leading to a NULL pointer dereference in drivers like `storvsc_probe()`.
--

commit 76d8c5c515ad429741f331c31ec380bf773956cb
Author: Runyu Xiao <runyu.xiao@seu.edu.cn>

vmbus: use generic driver_override infrastructure

This commit switches the vmbus subsystem to use the generic driver core
driver_override infrastructure rather than bus-private storage. It removes
the private driver_override storage and updates the match path to use
device_match_driver_override() for locked reads, preventing the match
logic from inspecting freed memory during concurrent updates.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/

> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index d28ff45d4cfdd..a81e2b0976367 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
[ ... ]
> @@ -708,9 +679,11 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
>  {
>  	const guid_t *guid = &dev->dev_type;
>  	const struct hv_vmbus_device_id *id;
> +	int ret;
>  
>  	/* When driver_override is set, only bind to the matching driver */
> -	if (dev->driver_override && strcmp(dev->driver_override, drv->name))
> +	ret = device_match_driver_override(&dev->device, &drv->driver);
> +	if (ret == 0)
>  		return NULL;

[Severity: High]
This is a pre-existing issue, but does this behavior still leave a
time-of-check to time-of-use race condition open when called later
during probing?

The driver core __driver_attach() calls driver_match_device() locklessly,
which delegates to vmbus_match() and eventually hv_vmbus_get_id(). If a
matching driver name is written to the driver_override sysfs attribute (or
a matching dynamic ID is present), it returns a match.

Before __driver_attach() acquires the device lock to call
driver_probe_device(), a concurrent sysfs write can clear the
driver_override attribute or remove the dynamic ID.

Inside vmbus_probe(), which executes with the device lock held,
hv_vmbus_get_id() is called again. Because the state was changed
concurrently, it can evaluate to NULL:

drivers/hv/vmbus_drv.c:vmbus_probe() {
    ...
    dev_id = hv_vmbus_get_id(drv, dev);
    if (drv->probe) {
        ret = drv->probe(dev, dev_id);
    ...
}

Since vmbus_probe() lacks a NULL check for dev_id, it directly calls
drv->probe(dev, dev_id) with dev_id equal to NULL.

Sub-drivers like storvsc_probe() then unconditionally dereference dev_id
(for example, checking dev_id->driver_data), which results in a kernel panic.

Could a NULL check for dev_id be added in vmbus_probe() to fully prevent
this race?
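
For illustration only, the kind of guard being asked about can be
sketched in user space as follows. The types and function names are
hypothetical stand-ins for the vmbus ones, and returning -ENODEV for a
NULL dev_id is an assumption of this sketch, not a tested kernel change.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the vmbus types. */
struct model_dev_id {
	unsigned long driver_data;
};

struct model_drv {
	int (*probe)(const struct model_dev_id *id);
};

/* A sub-driver probe that dereferences id unconditionally. */
static int model_sub_probe(const struct model_dev_id *id)
{
	return id->driver_data ? 0 : -EINVAL;
}

/*
 * Hypothetical guarded bus probe. dev_id is the result of the second,
 * locked lookup; if the override or dynamic ID vanished after the
 * lockless match, that lookup yields NULL and the probe is refused
 * instead of being passed down to the sub-driver.
 */
static int model_bus_probe(const struct model_drv *drv,
			   const struct model_dev_id *dev_id)
{
	if (!dev_id)
		return -ENODEV;	/* assumed error code for a lost match */
	if (!drv->probe)
		return -ENODEV;

	return drv->probe(dev_id);
}
```

With the guard in place, the sub-driver in this model is only ever
called with a valid ID, so the unconditional dereference stays safe.
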

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604035239.1711889-1-runyu.xiao@seu.edu.cn?part=3


