Linux-HyperV List


* Re: [PATCH v2] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: Stanislav Kinsburskii @ 2026-03-30 21:13 UTC
  To: Junrui Luo
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Nuno Das Neves, Anirudh Rayabharam, Mukesh Rathor, Muminul Islam,
	Praveen K Paladugu, Jinank Jain, linux-hyperv, linux-kernel,
	Yuhao Jiang, Roman Kisel, stable
In-Reply-To: <SYBPR01MB788138A30BC69B0F5C3316E5AF54A@SYBPR01MB7881.ausprd01.prod.outlook.com>

On Sat, Mar 28, 2026 at 05:18:45PM +0800, Junrui Luo wrote:
> mshv_partition_create_region() computes mem->guest_pfn + nr_pages to
> check for overlapping regions without verifying u64 wraparound. A
> sufficiently large guest_pfn can cause the addition to overflow,
> bypassing the overlap check and allowing creation of regions that wrap
> around the address space.
> 
> Fix by using check_add_overflow() to reject such regions early, and
> validate that the region end does not exceed MAX_PHYSMEM_BITS. These
> checks also protect downstream callers that compute start_gfn +
> nr_pages on stored regions without overflow guards.
> 
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Reported-by: Yuhao Jiang <danisjiang@gmail.com>
> Suggested-by: Roman Kisel <romank@linux.microsoft.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
> ---
> Changes in v2:
> - Add a maximum check suggested by Roman Kisel
> - Link to v1: https://lore.kernel.org/all/SYBPR01MB7881689C0F58149DD986A6D1AF49A@SYBPR01MB7881.ausprd01.prod.outlook.com/
> ---
>  drivers/hv/mshv_root_main.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 6f42423f7faa..32826247dbce 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1174,11 +1174,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
>  {
>  	struct mshv_mem_region *rg;
>  	u64 nr_pages = HVPFN_DOWN(mem->size);
> +	u64 new_region_end;
> +

Minor nit: just "end" or even "tmp" would be sufficient, since it's only
used for the overflow checks. "new_region_end" is a bit verbose and it's
not really "new" per se.

> +	/* Reject regions whose end address would wrap around */
> +	if (check_add_overflow(mem->guest_pfn, nr_pages, &new_region_end))
> +		return -EOVERFLOW;
> +
> +	/* Reject regions beyond the maximum physical address */
> +	if (new_region_end > HVPFN_DOWN(1ULL << MAX_PHYSMEM_BITS))

This is a PFN, so the check should be against MAX_PHYSMEM_BITS -
PAGE_SHIFT, right?
Or maybe it's even better to use "pfn_valid"?

Thanks,
Stanislav

> +		return -EINVAL;
>  
>  	/* Reject overlapping regions */
>  	spin_lock(&partition->pt_mem_regions_lock);
>  	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
> -		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
> +		if (new_region_end <= rg->start_gfn ||
>  		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
>  			continue;
>  		spin_unlock(&partition->pt_mem_regions_lock);
> 
> ---
> base-commit: c369299895a591d96745d6492d4888259b004a9e
> change-id: 20260328-fixes-0296eb3dbb52
> 
> Best regards,
> -- 
> Junrui Luo <moonafterrain@outlook.com>
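
For reference, the two checks discussed in this thread can be sketched in
userspace C. This is only an illustration under stated assumptions, not the
driver code: sketch_check_region(), SKETCH_PAGE_SHIFT and
SKETCH_MAX_PHYSMEM_BITS are hypothetical names with placeholder values, and
the GCC/Clang builtin __builtin_add_overflow() stands in for the kernel's
check_add_overflow(). The limit is expressed in page frames, i.e.
1 << (MAX_PHYSMEM_BITS - PAGE_SHIFT), as suggested in the review comment.

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder values for illustration only; the kernel's are per-arch. */
#define SKETCH_PAGE_SHIFT        12
#define SKETCH_MAX_PHYSMEM_BITS  52

/* Highest permitted exclusive end, in page frames rather than bytes. */
#define SKETCH_MAX_PFN \
	(1ULL << (SKETCH_MAX_PHYSMEM_BITS - SKETCH_PAGE_SHIFT))

/* Hypothetical helper: validates [guest_pfn, guest_pfn + nr_pages). */
static int sketch_check_region(uint64_t guest_pfn, uint64_t nr_pages,
			       uint64_t *end)
{
	/* Reject regions whose end would wrap around 64 bits. */
	if (__builtin_add_overflow(guest_pfn, nr_pages, end))
		return -EOVERFLOW;

	/* Reject regions that end beyond the last addressable frame. */
	if (*end > SKETCH_MAX_PFN)
		return -EINVAL;

	return 0;
}
```

Computing the end once and reusing it in the later overlap loop keeps the
wraparound check and the comparison working on the same value.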


* [PATCH net-next,v4] net: mana: Force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-03-30 21:01 UTC
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, dipayanroy

On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
allocation in the RX refill path can cause 15-20% throughput
regression under high connection counts (>16 TCP streams).

Add an ethtool private flag "full-page-rx" that allows the user to
force one RX buffer per page, bypassing the page_pool fragment path.
This restores line-rate (180+ Gbps) performance on affected platforms.

Usage:
  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag must be explicitly
enabled by the user or udev rule.

The existing single-buffer-per-page logic for XDP and jumbo frames is
consolidated into a new helper mana_use_single_rxbuf_per_page().

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
Changes in v4:
  - Drop the SMBIOS string parsing and add an ethtool private flag
    to reconfigure the queues with full-page RX buffers.
Changes in v3:
  - changed u8* to char*
Changes in v2:
  - separate reading string index and the string, remove inline.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 159 +++++++++++++++---
 include/net/mana/mana.h                       |   8 +
 3 files changed, 159 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49c65cc1697c..59a1626c2be1 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	/* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
+	 * in the RX refill path (~2kB buffer) can cause significant throughput
+	 * regression under high connection counts. Allow user to force one RX
+	 * buffer per page via ethtool private flag to bypass the fragment
+	 * path.
+	 */
+	if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 6a4b42fe0944..9f7393b71a34 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -133,58 +133,91 @@ static const struct mana_stats_desc mana_phy_stats[] = {
 	{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
 };
 
+static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
+	[MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
+};
+
 static int mana_get_sset_count(struct net_device *ndev, int stringset)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 
-	if (stringset != ETH_SS_STATS)
+	switch (stringset) {
+	case ETH_SS_STATS:
+		return ARRAY_SIZE(mana_eth_stats) +
+		       ARRAY_SIZE(mana_phy_stats) +
+		       ARRAY_SIZE(mana_hc_stats)  +
+		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	case ETH_SS_PRIV_FLAGS:
+		return MANA_PRIV_FLAG_MAX;
+	default:
 		return -EINVAL;
+	}
+}
+
+static void mana_get_strings_priv_flags(u8 **data)
+{
+	int i;
 
-	return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
-			num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
+		ethtool_puts(data, mana_priv_flags[i]);
 }
 
-static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 {
-	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 	int i, j;
 
-	if (stringset != ETH_SS_STATS)
-		return;
 	for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
-		ethtool_puts(&data, mana_eth_stats[i].name);
+		ethtool_puts(data, mana_eth_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
-		ethtool_puts(&data, mana_hc_stats[i].name);
+		ethtool_puts(data, mana_hc_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
-		ethtool_puts(&data, mana_phy_stats[i].name);
+		ethtool_puts(data, mana_phy_stats[i].name);
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "rx_%d_packets", i);
-		ethtool_sprintf(&data, "rx_%d_bytes", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
-		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		ethtool_sprintf(data, "rx_%d_packets", i);
+		ethtool_sprintf(data, "rx_%d_bytes", i);
+		ethtool_sprintf(data, "rx_%d_xdp_drop", i);
+		ethtool_sprintf(data, "rx_%d_xdp_tx", i);
+		ethtool_sprintf(data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(data, "rx_%d_pkt_len0_err", i);
 		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
-			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
+			ethtool_sprintf(data,
+					"rx_%d_coalesced_cqe_%d",
+					i,
+					j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "tx_%d_packets", i);
-		ethtool_sprintf(&data, "tx_%d_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_xdp_xmit", i);
-		ethtool_sprintf(&data, "tx_%d_tso_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_long_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_short_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_csum_partial", i);
-		ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
+		ethtool_sprintf(data, "tx_%d_packets", i);
+		ethtool_sprintf(data, "tx_%d_bytes", i);
+		ethtool_sprintf(data, "tx_%d_xdp_xmit", i);
+		ethtool_sprintf(data, "tx_%d_tso_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_bytes", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_bytes", i);
+		ethtool_sprintf(data, "tx_%d_long_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_short_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_csum_partial", i);
+		ethtool_sprintf(data, "tx_%d_mana_map_err", i);
+	}
+}
+
+static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	switch (stringset) {
+	case ETH_SS_PRIV_FLAGS:
+		mana_get_strings_priv_flags(&data);
+		break;
+
+	case ETH_SS_STATS:
+		mana_get_strings_stats(apc, &data);
+		break;
 	}
 }
 
@@ -573,6 +606,74 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 	return 0;
 }
 
+static u32 mana_get_priv_flags(struct net_device *ndev)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	return apc->priv_flags;
+}
+
+static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u32 changed = apc->priv_flags ^ priv_flags;
+	u32 old_priv_flags = apc->priv_flags;
+	bool schedule_port_reset = false;
+	int err = 0;
+
+	if (!changed)
+		return 0;
+
+	/* Reject unknown bits */
+	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
+		return -EINVAL;
+
+	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
+		apc->priv_flags = priv_flags;
+
+		if (!apc->port_is_up) {
+			/* Port is down, flag updated to apply on next up
+			 * so just return.
+			 */
+			return 0;
+		}
+
+		/* Pre-allocate buffers to prevent failure in mana_attach
+		 * later
+		 */
+		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+		if (err) {
+			netdev_err(ndev,
+				   "Insufficient memory for new allocations\n");
+			apc->priv_flags = old_priv_flags;
+			return err;
+		}
+
+		err = mana_detach(ndev, false);
+		if (err) {
+			netdev_err(ndev, "mana_detach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			goto out;
+		}
+
+		err = mana_attach(ndev);
+		if (err) {
+			netdev_err(ndev, "mana_attach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			schedule_port_reset = true;
+		}
+	}
+
+out:
+	mana_pre_dealloc_rxbufs(apc);
+
+	if (err && schedule_port_reset)
+		queue_work(apc->ac->per_port_queue_reset_wq,
+			   &apc->queue_reset_work);
+
+	return err;
+}
+
 const struct ethtool_ops mana_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
@@ -591,4 +692,6 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
 	.get_link		= ethtool_op_get_link,
+	.get_priv_flags		= mana_get_priv_flags,
+	.set_priv_flags		= mana_set_priv_flags,
 };
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3336688fed5e..fd87e3d6c1f4 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,6 +30,12 @@ enum TRI_STATE {
 	TRI_STATE_TRUE = 1
 };
 
+/* MANA ethtool private flag bit positions */
+enum mana_priv_flag_bits {
+	MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
+	MANA_PRIV_FLAG_MAX,
+};
+
 /* Number of entries for hardware indirection table must be in power of 2 */
 #define MANA_INDIRECT_TABLE_MAX_SIZE 512
 #define MANA_INDIRECT_TABLE_DEF_SIZE 64
@@ -531,6 +537,8 @@ struct mana_port_context {
 	u32 rxbpre_headroom;
 	u32 rxbpre_frag_count;
 
+	u32 priv_flags;
+
 	struct bpf_prog *bpf_prog;
 
 	/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
-- 
2.43.0
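
The decision logic and flag validation described in this patch can be
sketched in userspace C as follows. This is an illustration under stated
assumptions, not the driver code: every sketch_* name, SKETCH_PAGE_SIZE and
the SKETCH_RXBUF_PAD value are hypothetical placeholders, and the setter
only validates and stores the flags (it checks unknown bits first and does
not model the detach/attach sequence).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE  4096u
#define SKETCH_RXBUF_PAD  256u	/* placeholder, not the real MANA_RXBUF_PAD */

enum sketch_priv_flag_bits {
	SKETCH_FLAG_FULL_PAGE_RX = 0,
	SKETCH_FLAG_MAX,
};

struct sketch_port {
	uint32_t priv_flags;
	bool xdp_attached;
};

/* Same order of checks as the helper: user flag first, then MTU or XDP. */
static bool sketch_single_rxbuf_per_page(const struct sketch_port *p,
					 uint32_t mtu)
{
	if (p->priv_flags & (1u << SKETCH_FLAG_FULL_PAGE_RX))
		return true;

	return mtu + SKETCH_RXBUF_PAD > SKETCH_PAGE_SIZE / 2 ||
	       p->xdp_attached;
}

/*
 * Validate and store the requested flags.
 * Returns -EINVAL for unknown bits, 0 if nothing changed, and 1 if the
 * caller would need to reconfigure the queues.
 */
static int sketch_set_priv_flags(struct sketch_port *p, uint32_t requested)
{
	uint32_t known = (1u << SKETCH_FLAG_MAX) - 1;

	if (requested & ~known)
		return -EINVAL;
	if (requested == p->priv_flags)
		return 0;

	p->priv_flags = requested;
	return 1;
}
```

With these placeholder values, a 1500-byte MTU shares a page by default and
switches to one buffer per page only once the flag is set.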



* Re: [PATCH 05/12] PCI: use generic driver_override infrastructure
From: Alex Williamson @ 2026-03-30 20:10 UTC
  To: Danilo Krummrich
  Cc: Jason Gunthorpe, Russell King, Greg Kroah-Hartman,
	Rafael J. Wysocki, Ioana Ciornei, Nipun Gupta, Nikhil Agarwal,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Bjorn Helgaas, Armin Wolf, Bjorn Andersson, Mathieu Poirier,
	Vineeth Vijayan, Peter Oberparleiter, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Harald Freudenberger, Holger Dengler, Mark Brown,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Christophe Leroy (CS GROUP), linux-kernel, driver-core,
	linuxppc-dev, linux-hyperv, linux-pci, platform-driver-x86,
	linux-arm-msm, linux-remoteproc, linux-s390, linux-spi,
	virtualization, kvm, xen-devel, linux-arm-kernel, Gui-Dong Han,
	alex
In-Reply-To: <DHGATG6LJOM1.2AI7BYQ2O4DFU@kernel.org>

On Mon, 30 Mar 2026 19:38:41 +0200
"Danilo Krummrich" <dakr@kernel.org> wrote:

> (Cc: Jason)
> 
> On Tue Mar 24, 2026 at 1:59 AM CET, Danilo Krummrich wrote:
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index d43745fe4c84..460852f79f29 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1987,9 +1987,8 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
> >  	    pdev->is_virtfn && physfn == vdev->pdev) {
> >  		pci_info(vdev->pdev, "Captured SR-IOV VF %s driver_override\n",
> >  			 pci_name(pdev));
> > -		pdev->driver_override = kasprintf(GFP_KERNEL, "%s",
> > -						  vdev->vdev.ops->name);
> > -		WARN_ON(!pdev->driver_override);
> > +		WARN_ON(device_set_driver_override(&pdev->dev,
> > +						   vdev->vdev.ops->name));  
> 
> Technically, this is a change in behavior. If vdev->vdev.ops->name is NULL, it
> will trigger the WARN_ON(), whereas before it would have just written "(null)"
> into driver_override.

It's worse than that.  Looking at the implementation in [1], we have:

+static inline int device_set_driver_override(struct device *dev, const char *s)
+{
+	return __device_set_driver_override(dev, s, strlen(s));
+}

So if name is NULL, we oops in strlen() before we even hit the -EINVAL
and WARN_ON().

I don't believe we have any vfio-pci variant drivers where the name is
NULL, but kasprintf() handling NULL as "(null)" was a consideration in
this design: even if there is no name, the device is sequestered
with a driver_override that won't match an actual driver.

> I assume that vfio_pci_core drivers are expected to set the name in struct
> vfio_device_ops in the first place and this code (silently) relies on this
> invariant?

We do expect that, but it was previously safe either way to make sure
VFs are only bound to the same ops driver or, barring that, at least
don't perform a standard driver match.  The last thing we want to
happen automatically is for a user-owned PF to create SR-IOV VFs that
automatically bind to native kernel drivers.
 
> Alex, Jason: Should we keep this hunk above as is and check for a proper name in
> struct vfio_device_ops in vfio_pci_core_register_device() with a subsequent
> patch?

Given the oops, my preference would be to roll it in here.  This change
is what makes it a requirement that name cannot be NULL, where this was
safely handled with kasprintf().  Thanks,

Alex


[1] https://lore.kernel.org/all/20260302002729.19438-2-dakr@kernel.org/

> 
> >  	} else if (action == BUS_NOTIFY_BOUND_DRIVER &&
> >  		   pdev->is_virtfn && physfn == vdev->pdev) {
> >  		struct pci_driver *drv = pci_dev_driver(pdev);  
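
The NULL hazard described above can be illustrated with a small userspace
sketch. It is not the driver-core implementation: sketch_set_override() and
its buffer-based signature are hypothetical, and it only shows the ordering
point, namely that the NULL check has to happen before strlen() is called.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical setter: copies @name into @dst if it is valid and fits. */
static int sketch_set_override(char *dst, size_t dst_len, const char *name)
{
	size_t len;

	/* Guard first: strlen(NULL) is undefined behaviour. */
	if (!name)
		return -EINVAL;

	len = strlen(name);
	if (len >= dst_len)
		return -EINVAL;

	memcpy(dst, name, len + 1);	/* copy including the terminating NUL */
	return 0;
}
```

A caller that passes a NULL name gets -EINVAL back instead of faulting, so
a WARN_ON()-style check on the return value can still fire.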



* [PATCH 7/7] mshv: Add tracepoint for map GPA hypercall
From: Stanislav Kinsburskii @ 2026-03-30 20:04 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177490099488.81669.3758562641675983608.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Add tracing for GPA mapping hypercalls to aid in debugging memory
management issues in child partitions. The tracepoint captures both
successful and failed mapping attempts, including the number of pages
successfully mapped before any failure occurred.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |    3 +++
 drivers/hv/mshv_trace.h        |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index a95f2cfc5da5..7ed623668c8e 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -260,6 +260,9 @@ static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,
 		done += completed;
 	}
 
+	trace_mshv_map_pfns(partition_id, gfn, pfns_count, page_count,
+			    flags, mmio_spa, done, ret);
+
 	if (ret && done) {
 		u32 unmap_flags = 0;
 
diff --git a/drivers/hv/mshv_trace.h b/drivers/hv/mshv_trace.h
index 6b8fa477fa3b..efd2b5d4ab73 100644
--- a/drivers/hv/mshv_trace.h
+++ b/drivers/hv/mshv_trace.h
@@ -538,6 +538,42 @@ TRACE_EVENT(mshv_handle_gpa_intercept,
 	    )
 );
 
+TRACE_EVENT(mshv_map_pfns,
+	    TP_PROTO(u64 partition_id, u64 gfn, u64 pfn_count, u64 page_count, u32 flags,
+		     u64 mmio_spa, int done, int ret),
+	    TP_ARGS(partition_id, gfn, pfn_count, page_count, flags, mmio_spa, done, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, gfn)
+		    __field(u64, pfn_count)
+		    __field(u64, page_count)
+		    __field(u32, flags)
+		    __field(u64, mmio_spa)
+		    __field(int, done)
+		    __field(int, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->gfn = gfn;
+		    __entry->page_count = page_count;
+		    __entry->pfn_count = pfn_count;
+		    __entry->flags = flags;
+		    __entry->mmio_spa = mmio_spa;
+		    __entry->done = done;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu gfn=0x%llx pfn_count=%llu page_count=%llu flags=0x%x mmio_spa=0x%llx done=%d ret=%d",
+		    __entry->partition_id,
+		    __entry->gfn,
+		    __entry->pfn_count,
+		    __entry->page_count,
+		    __entry->flags,
+		    __entry->mmio_spa,
+		    __entry->done,
+		    __entry->ret
+	    )
+);
+
 #endif /* _MSHV_TRACE_H_ */
 
 /* This part must be outside protection */




* [PATCH 6/7] mshv: Extract MMIO region mapping into separate function
From: Stanislav Kinsburskii @ 2026-03-30 20:04 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177490099488.81669.3758562641675983608.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Extract the MMIO region mapping logic from mshv_map_user_memory() into
a dedicated mshv_map_mmio_region() function. This improves code
organization and consistency with the existing mshv_map_pinned_region()
and mshv_map_movable_region() functions.

The new function encapsulates the hv_call_map_mmio_pfns() call,
making the switch statement in mshv_map_user_memory() more concise
and maintaining a uniform pattern for all region types.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |    9 +++++++++
 drivers/hv/mshv_root.h      |    2 ++
 drivers/hv/mshv_root_main.c |    5 +----
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 28d3f488d89f..6b703b269a4f 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -831,3 +831,12 @@ int mshv_map_movable_region(struct mshv_mem_region *region)
 	return mshv_region_collect_and_map(region, 0, region->nr_pfns,
 					   false);
 }
+
+int mshv_map_mmio_region(struct mshv_mem_region *region,
+			 unsigned long mmio_pfn)
+{
+	struct mshv_partition *partition = region->partition;
+
+	return hv_call_map_mmio_pfns(partition->pt_id, region->start_gfn,
+				     mmio_pfn, region->nr_pfns);
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 02c1c11f701c..1f92b9f85b60 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -375,5 +375,7 @@ void mshv_region_movable_fini(struct mshv_mem_region *region);
 bool mshv_region_movable_init(struct mshv_mem_region *region);
 int mshv_map_pinned_region(struct mshv_mem_region *region);
 int mshv_map_movable_region(struct mshv_mem_region *region);
+int mshv_map_mmio_region(struct mshv_mem_region *region,
+			 unsigned long mmio_pfn);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 91dab2a3bc92..adb09350205a 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1302,10 +1302,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		ret = mshv_map_movable_region(region);
 		break;
 	case MSHV_REGION_TYPE_MMIO:
-		ret = hv_call_map_mmio_pfns(partition->pt_id,
-					    region->start_gfn,
-					    mmio_pfn,
-					    region->nr_pfns);
+		ret = mshv_map_mmio_region(region, mmio_pfn);
 		break;
 	}
 




* [PATCH 5/7] mshv: Map populated pages on movable region creation
From: Stanislav Kinsburskii @ 2026-03-30 20:04 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177490099488.81669.3758562641675983608.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Map any populated pages into the hypervisor upfront when creating a
movable region, rather than waiting for faults. Previously, movable
regions were created with all pages marked as HV_MAP_GPA_NO_ACCESS
regardless of whether the userspace mapping contained populated pages.

This guarantees that if the caller passes a populated mapping, those
present pages will be mapped into the hypervisor immediately during
region creation instead of being faulted in later.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   65 ++++++++++++++++++++++++++++++++-----------
 drivers/hv/mshv_root.h      |    1 +
 drivers/hv/mshv_root_main.c |   10 +------
 3 files changed, 50 insertions(+), 26 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 133ec7771812..28d3f488d89f 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -519,7 +519,8 @@ int mshv_region_get(struct mshv_mem_region *region)
 static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 					  unsigned long start,
 					  unsigned long end,
-					  unsigned long *pfns)
+					  unsigned long *pfns,
+					  bool do_fault)
 {
 	struct hmm_range range = {
 		.notifier = &region->mreg_mni,
@@ -540,9 +541,12 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 		range.hmm_pfns = pfns;
 		range.start = start;
 		range.end = min(vma->vm_end, end);
-		range.default_flags = HMM_PFN_REQ_FAULT;
-		if (vma->vm_flags & VM_WRITE)
-			range.default_flags |= HMM_PFN_REQ_WRITE;
+		range.default_flags = 0;
+		if (do_fault) {
+			range.default_flags = HMM_PFN_REQ_FAULT;
+			if (vma->vm_flags & VM_WRITE)
+				range.default_flags |= HMM_PFN_REQ_WRITE;
+		}
 
 		ret = hmm_range_fault(&range);
 		if (ret)
@@ -567,26 +571,40 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 }
 
 /**
- * mshv_region_range_fault - Handle memory range faults for a given region.
- * @region: Pointer to the memory region structure.
- * @pfn_offset: Offset of the page within the region.
- * @pfn_count: Number of pages to handle.
+ * mshv_region_collect_and_map - Collect PFNs for a user range and map them
+ * @region    : memory region being processed
+ * @pfn_offset: PFNs offset within the region
+ * @pfn_count : number of PFNs to process
+ * @do_fault  : if true, fault in missing pages;
+ *              if false, collect only present pages
  *
- * This function resolves memory faults for a specified range of pages
- * within a memory region. It uses HMM (Heterogeneous Memory Management)
- * to fault in the required pages and updates the region's page array.
+ * Collects PFNs for the specified portion of @region from the
+ * corresponding userspace VMA and maps them into the hypervisor. The
+ * behavior depends on @do_fault:
  *
- * Return: 0 on success, negative error code on failure.
+ * - true: Fault in missing pages from userspace, ensuring all pages in the
+ *   range are present. Used for on-demand page population.
+ * - false: Collect PFNs only for pages already present in userspace,
+ *   leaving missing pages as invalid PFN markers.
+ *   Used for initial region setup.
+ *
+ * Collected PFNs are stored in region->mreg_pfns[] with HMM bookkeeping
+ * flags cleared, then the range is mapped into the hypervisor. Present
+ * PFNs get mapped with region access permissions; missing PFNs (zero
+ * entries) get mapped with no-access permissions.
+ *
+ * Return: 0 on success, negative errno on failure.
  */
-static int mshv_region_range_fault(struct mshv_mem_region *region,
-				   u64 pfn_offset, u64 pfn_count)
+static int mshv_region_collect_and_map(struct mshv_mem_region *region,
+				       u64 pfn_offset, u64 pfn_count,
+				       bool do_fault)
 {
 	unsigned long start, end;
 	unsigned long *pfns;
 	int ret;
 	u64 i;
 
-	pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
+	pfns = vmalloc_array(pfn_count, sizeof(unsigned long));
 	if (!pfns)
 		return -ENOMEM;
 
@@ -595,7 +613,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 
 	do {
 		ret = mshv_region_hmm_fault_and_lock(region, start, end,
-						     pfns);
+						     pfns, do_fault);
 	} while (ret == -EBUSY);
 
 	if (ret)
@@ -613,10 +631,17 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 
 	mutex_unlock(&region->mreg_mutex);
 out:
-	kfree(pfns);
+	vfree(pfns);
 	return ret;
 }
 
+static int mshv_region_range_fault(struct mshv_mem_region *region,
+				   u64 pfn_offset, u64 pfn_count)
+{
+	return mshv_region_collect_and_map(region, pfn_offset, pfn_count,
+					   true);
+}
+
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
 {
 	u64 pfn_offset, pfn_count;
@@ -800,3 +825,9 @@ int mshv_map_pinned_region(struct mshv_mem_region *region)
 err_out:
 	return ret;
 }
+
+int mshv_map_movable_region(struct mshv_mem_region *region)
+{
+	return mshv_region_collect_and_map(region, 0, region->nr_pfns,
+					   false);
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index d2e65a137bf4..02c1c11f701c 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -374,5 +374,6 @@ bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
 void mshv_region_movable_fini(struct mshv_mem_region *region);
 bool mshv_region_movable_init(struct mshv_mem_region *region);
 int mshv_map_pinned_region(struct mshv_mem_region *region);
+int mshv_map_movable_region(struct mshv_mem_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index c393b5144e0b..91dab2a3bc92 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1299,15 +1299,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		ret = mshv_map_pinned_region(region);
 		break;
 	case MSHV_REGION_TYPE_MEM_MOVABLE:
-		/*
-		 * For movable memory regions, remap with no access to let
-		 * the hypervisor track dirty pages, enabling pre-copy live
-		 * migration.
-		 */
-		ret = hv_call_map_ram_pfns(partition->pt_id,
-					   region->start_gfn,
-					   region->nr_pfns,
-					   HV_MAP_GPA_NO_ACCESS, NULL);
+		ret = mshv_map_movable_region(region);
 		break;
 	case MSHV_REGION_TYPE_MMIO:
 		ret = hv_call_map_mmio_pfns(partition->pt_id,
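
The effect of the new do_fault parameter on the request flags can be
sketched in userspace C. This is an illustration only: the SKETCH_REQ_*
constants are hypothetical stand-ins with made-up values for
HMM_PFN_REQ_FAULT and HMM_PFN_REQ_WRITE, and sketch_default_flags() is not
a function in the patch.

```c
#include <stdbool.h>

/* Placeholder bit values; the real HMM flags are defined by the kernel. */
#define SKETCH_REQ_FAULT  (1UL << 0)
#define SKETCH_REQ_WRITE  (1UL << 1)

/*
 * With do_fault == false no fault is requested, so only pages already
 * present are collected. With do_fault == true missing pages are faulted
 * in, and write access is requested when the VMA is writable.
 */
static unsigned long sketch_default_flags(bool do_fault, bool vma_writable)
{
	unsigned long flags = 0;

	if (do_fault) {
		flags = SKETCH_REQ_FAULT;
		if (vma_writable)
			flags |= SKETCH_REQ_WRITE;
	}

	return flags;
}
```

Region creation corresponds to the do_fault == false case, while the GFN
fault path keeps the previous behaviour by passing true.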




* [PATCH 4/7] mshv: Move pinned region setup to mshv_regions.c
From: Stanislav Kinsburskii @ 2026-03-30 20:04 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177490099488.81669.3758562641675983608.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Move mshv_prepare_pinned_region() from mshv_root_main.c to
mshv_regions.c and rename it to mshv_map_pinned_region(). This
co-locates the pinned region logic with the rest of the memory region
operations.

Make mshv_region_pin(), mshv_region_map(), mshv_region_share(),
mshv_region_unshare(), and mshv_region_invalidate() static, as they are
no longer called outside of mshv_regions.c.

Also fix a bug in the error handling where a mshv_region_map() failure
on a non-encrypted partition would be silently ignored, returning
success instead of propagating the error code.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   79 ++++++++++++++++++++++++++++++++++++++++---
 drivers/hv/mshv_root.h      |    6 +--
 drivers/hv/mshv_root_main.c |   70 +-------------------------------------
 3 files changed, 76 insertions(+), 79 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 1bb1bfe177e2..133ec7771812 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -287,7 +287,7 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
 					      flags, true);
 }
 
-int mshv_region_share(struct mshv_mem_region *region)
+static int mshv_region_share(struct mshv_mem_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
@@ -313,7 +313,7 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 					      flags, false);
 }
 
-int mshv_region_unshare(struct mshv_mem_region *region)
+static int mshv_region_unshare(struct mshv_mem_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
@@ -353,7 +353,7 @@ static int mshv_region_remap_pfns(struct mshv_mem_region *region,
 					 mshv_region_chunk_remap);
 }
 
-int mshv_region_map(struct mshv_mem_region *region)
+static int mshv_region_map(struct mshv_mem_region *region)
 {
 	u32 map_flags = region->hv_map_flags;
 
@@ -377,12 +377,12 @@ static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
 	}
 }
 
-void mshv_region_invalidate(struct mshv_mem_region *region)
+static void mshv_region_invalidate(struct mshv_mem_region *region)
 {
 	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
 }
 
-int mshv_region_pin(struct mshv_mem_region *region)
+static int mshv_region_pin(struct mshv_mem_region *region)
 {
 	u64 done_count, nr_pfns, i;
 	unsigned long *pfns;
@@ -731,3 +731,72 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
 
 	return true;
 }
+
+/**
+ * mshv_map_pinned_region - Pin and map memory regions
+ * @region: Pointer to the memory region structure
+ *
+ * This function processes memory regions that are explicitly marked as pinned.
+ * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
+ * population. The function ensures the region is properly populated, handles
+ * encryption requirements for SNP partitions if applicable, maps the region,
+ * and performs necessary sharing or eviction operations based on the mapping
+ * result.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int mshv_map_pinned_region(struct mshv_mem_region *region)
+{
+	struct mshv_partition *partition = region->partition;
+	int ret;
+
+	ret = mshv_region_pin(region);
+	if (ret) {
+		pt_err(partition, "Failed to pin memory region: %d\n",
+		       ret);
+		goto err_out;
+	}
+
+	/*
+	 * For an SNP partition it is a requirement that for every memory region
+	 * that we are going to map for this partition we should make sure that
+	 * host access to that region is released. This is ensured by doing an
+	 * additional hypercall which will update the SLAT to release host
+	 * access to guest memory regions.
+	 */
+	if (mshv_partition_encrypted(partition)) {
+		ret = mshv_region_unshare(region);
+		if (ret) {
+			pt_err(partition,
+			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
+			       region->start_gfn, ret);
+			goto invalidate_region;
+		}
+	}
+
+	ret = mshv_region_map(region);
+	if (!ret)
+		return 0;
+
+	if (mshv_partition_encrypted(partition)) {
+		int shrc;
+
+		shrc = mshv_region_share(region);
+		if (!shrc)
+			goto invalidate_region;
+
+		pt_err(partition,
+		       "Failed to share memory region (guest_pfn: %llu): %d\n",
+		       region->start_gfn, shrc);
+		/*
+		 * Don't unpin if marking shared failed because pages are no
+		 * longer mapped in the host, ie root, anymore.
+		 */
+		goto err_out;
+	}
+
+invalidate_region:
+	mshv_region_invalidate(region);
+err_out:
+	return ret;
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index f1d4bee97a3f..d2e65a137bf4 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -368,15 +368,11 @@ extern u8 * __percpu *hv_synic_eventring_tail;
 
 struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 					   u64 uaddr, u32 flags);
-int mshv_region_share(struct mshv_mem_region *region);
-int mshv_region_unshare(struct mshv_mem_region *region);
-int mshv_region_map(struct mshv_mem_region *region);
-void mshv_region_invalidate(struct mshv_mem_region *region);
-int mshv_region_pin(struct mshv_mem_region *region);
 void mshv_region_put(struct mshv_mem_region *region);
 int mshv_region_get(struct mshv_mem_region *region);
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
 void mshv_region_movable_fini(struct mshv_mem_region *region);
 bool mshv_region_movable_init(struct mshv_mem_region *region);
+int mshv_map_pinned_region(struct mshv_mem_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 685e4b562186..c393b5144e0b 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1254,74 +1254,6 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	return 0;
 }
 
-/**
- * mshv_prepare_pinned_region - Pin and map memory regions
- * @region: Pointer to the memory region structure
- *
- * This function processes memory regions that are explicitly marked as pinned.
- * Pinned regions are preallocated, mapped upfront, and do not rely on fault-based
- * population. The function ensures the region is properly populated, handles
- * encryption requirements for SNP partitions if applicable, maps the region,
- * and performs necessary sharing or eviction operations based on the mapping
- * result.
- *
- * Return: 0 on success, negative error code on failure.
- */
-static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
-{
-	struct mshv_partition *partition = region->partition;
-	int ret;
-
-	ret = mshv_region_pin(region);
-	if (ret) {
-		pt_err(partition, "Failed to pin memory region: %d\n",
-		       ret);
-		goto err_out;
-	}
-
-	/*
-	 * For an SNP partition it is a requirement that for every memory region
-	 * that we are going to map for this partition we should make sure that
-	 * host access to that region is released. This is ensured by doing an
-	 * additional hypercall which will update the SLAT to release host
-	 * access to guest memory regions.
-	 */
-	if (mshv_partition_encrypted(partition)) {
-		ret = mshv_region_unshare(region);
-		if (ret) {
-			pt_err(partition,
-			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
-			       region->start_gfn, ret);
-			goto invalidate_region;
-		}
-	}
-
-	ret = mshv_region_map(region);
-	if (ret && mshv_partition_encrypted(partition)) {
-		int shrc;
-
-		shrc = mshv_region_share(region);
-		if (!shrc)
-			goto invalidate_region;
-
-		pt_err(partition,
-		       "Failed to share memory region (guest_pfn: %llu): %d\n",
-		       region->start_gfn, shrc);
-		/*
-		 * Don't unpin if marking shared failed because pages are no
-		 * longer mapped in the host, ie root, anymore.
-		 */
-		goto err_out;
-	}
-
-	return 0;
-
-invalidate_region:
-	mshv_region_invalidate(region);
-err_out:
-	return ret;
-}
-
 /*
  * This maps two things: guest RAM and for pci passthru mmio space.
  *
@@ -1364,7 +1296,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 
 	switch (region->mreg_type) {
 	case MSHV_REGION_TYPE_MEM_PINNED:
-		ret = mshv_prepare_pinned_region(region);
+		ret = mshv_map_pinned_region(region);
 		break;
 	case MSHV_REGION_TYPE_MEM_MOVABLE:
 		/*




* [PATCH 3/7] mshv: Support regions with different VMAs
From: Stanislav Kinsburskii @ 2026-03-30 20:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177490099488.81669.3758562641675983608.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Allow HMM fault handling across memory regions that span multiple VMAs
with different protection flags. The previous implementation assumed a
single VMA per region, which would fail when guest memory crosses VMA
boundaries.

Iterate through VMAs within the range and handle each separately with
appropriate protection flags, enabling more flexible memory region
configurations for partitions.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   72 +++++++++++++++++++++++++++++++++------------
 1 file changed, 52 insertions(+), 20 deletions(-)
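
Note, not part of the patch: a userspace sketch of the per-VMA walk that
the commit message describes. Everything here is an assumption made for
illustration: struct sk_vma, sk_vma_lookup(), sk_walk_range() and the
SK_* constants are hypothetical and only mimic the shape of vma_lookup()
and the HMM request flags. The sketch treats the end address as exclusive
and expects page-aligned inputs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SK_PAGE_SIZE	4096UL
#define SK_REQ_FAULT	0x1u	/* stand-in for HMM_PFN_REQ_FAULT */
#define SK_REQ_WRITE	0x2u	/* stand-in for HMM_PFN_REQ_WRITE */

/* Hypothetical VMA model: [start, end) plus a writable bit. */
struct sk_vma {
	unsigned long start;
	unsigned long end;
	int writable;
};

static const struct sk_vma *sk_vma_lookup(const struct sk_vma *vmas,
					  size_t n, unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= vmas[i].start && addr < vmas[i].end)
			return &vmas[i];
	return NULL;
}

/*
 * Walk [start, end) one VMA at a time and record, per page, the request
 * flags that would be used for that segment. Returns 0, or -EFAULT when
 * an address in the range is not covered by any VMA.
 */
static int sk_walk_range(const struct sk_vma *vmas, size_t n,
			 unsigned long start, unsigned long end,
			 unsigned int *flags_out)
{
	assert(start % SK_PAGE_SIZE == 0 && end % SK_PAGE_SIZE == 0);

	while (start < end) {
		const struct sk_vma *vma = sk_vma_lookup(vmas, n, start);
		unsigned int flags = SK_REQ_FAULT;
		unsigned long seg_end, npages;

		if (!vma)
			return -EFAULT;

		seg_end = vma->end < end ? vma->end : end;
		if (vma->writable)
			flags |= SK_REQ_WRITE;

		npages = (seg_end - start) / SK_PAGE_SIZE;
		for (unsigned long i = 0; i < npages; i++)
			flags_out[i] = flags;

		/* Advance the output cursor and resume at the segment end. */
		flags_out += npages;
		start = seg_end;
	}

	return 0;
}
```

For a writable VMA at 0x10000-0x12000 followed by a read-only VMA at
0x12000-0x14000, walking 0x11000-0x13000 records FAULT|WRITE for the
first page and FAULT only for the second.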

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index ed9c55841140..1bb1bfe177e2 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -492,37 +492,72 @@ int mshv_region_get(struct mshv_mem_region *region)
 }
 
 /**
- * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
+ * mshv_region_hmm_fault_and_lock - Handle HMM faults across VMAs and lock
+ *                                  the memory region
  * @region: Pointer to the memory region structure
- * @range: Pointer to the HMM range structure
+ * @start : Starting virtual address of the range to fault
+ * @end   : Ending virtual address of the range to fault (exclusive)
+ * @pfns  : Output array for page frame numbers with HMM flags
  *
  * This function performs the following steps:
  * 1. Reads the notifier sequence for the HMM range.
  * 2. Acquires a read lock on the memory map.
- * 3. Handles HMM faults for the specified range.
- * 4. Releases the read lock on the memory map.
- * 5. If successful, locks the memory region mutex.
- * 6. Verifies if the notifier sequence has changed during the operation.
- *    If it has, releases the mutex and returns -EBUSY to match with
- *    hmm_range_fault() return code for repeating.
+ * 3. Iterates through VMAs in the specified range, handling each
+ *    separately with appropriate protection flags (HMM_PFN_REQ_WRITE set
+ *    based on VMA flags).
+ * 4. Handles HMM faults for each VMA segment.
+ * 5. Releases the read lock on the memory map.
+ * 6. If successful, locks the memory region mutex.
+ * 7. Verifies if the notifier sequence has changed during the operation.
+ *    If it has, releases the mutex and returns -EBUSY to signal retry.
+ *
+ * The function expects the range [start, end) to be backed by valid VMAs.
+ * Returns -EFAULT if any address in the range is not covered by a VMA.
  *
  * Return: 0 on success, a negative error code otherwise.
  */
 static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
-					  struct hmm_range *range)
+					  unsigned long start,
+					  unsigned long end,
+					  unsigned long *pfns)
 {
+	struct hmm_range range = {
+		.notifier = &region->mreg_mni,
+	};
 	int ret;
 
-	range->notifier_seq = mmu_interval_read_begin(range->notifier);
+	range.notifier_seq = mmu_interval_read_begin(range.notifier);
 	mmap_read_lock(region->mreg_mni.mm);
-	ret = hmm_range_fault(range);
+	while (start < end) {
+		struct vm_area_struct *vma;
+
+		vma = vma_lookup(current->mm, start);
+		if (!vma) {
+			ret = -EFAULT;
+			break;
+		}
+
+		range.hmm_pfns = pfns;
+		range.start = start;
+		range.end = min(vma->vm_end, end);
+		range.default_flags = HMM_PFN_REQ_FAULT;
+		if (vma->vm_flags & VM_WRITE)
+			range.default_flags |= HMM_PFN_REQ_WRITE;
+
+		ret = hmm_range_fault(&range);
+		if (ret)
+			break;
+
+		start = range.end;
+		pfns += DIV_ROUND_UP(range.end - range.start, PAGE_SIZE);
+	}
 	mmap_read_unlock(region->mreg_mni.mm);
 	if (ret)
 		return ret;
 
 	mutex_lock(&region->mreg_mutex);
 
-	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
+	if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
 		mutex_unlock(&region->mreg_mutex);
 		cond_resched();
 		return -EBUSY;
@@ -546,10 +581,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 static int mshv_region_range_fault(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count)
 {
-	struct hmm_range range = {
-		.notifier = &region->mreg_mni,
-		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
-	};
+	unsigned long start, end;
 	unsigned long *pfns;
 	int ret;
 	u64 i;
@@ -558,12 +590,12 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	if (!pfns)
 		return -ENOMEM;
 
-	range.hmm_pfns = pfns;
-	range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
-	range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
+	start = region->start_uaddr + pfn_offset * PAGE_SIZE;
+	end = start + pfn_count * PAGE_SIZE;
 
 	do {
-		ret = mshv_region_hmm_fault_and_lock(region, &range);
+		ret = mshv_region_hmm_fault_and_lock(region, start, end,
+						     pfns);
 	} while (ret == -EBUSY);
 
 	if (ret)




* [PATCH 2/7] mshv: Add support to address range holes remapping
From: Stanislav Kinsburskii @ 2026-03-30 20:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177490099488.81669.3758562641675983608.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Consolidate memory region processing to handle both valid and invalid PFNs
uniformly. This eliminates code duplication across remap, unmap, share, and
unshare operations by using a common range processing interface.

Holes are now remapped with no-access permissions to enable
hypervisor dirty page tracking for precopy live migration.

This refactoring is a precursor to an upcoming change that will map
present pages in movable regions upon region creation, requiring
consistent handling of both mapped and unmapped ranges.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |  108 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 95 insertions(+), 13 deletions(-)
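
Note, not part of the patch: the "handle valid and invalid PFNs
uniformly" idea amounts to splitting a PFN array into maximal runs of
equal validity and handing each run to one handler. The userspace sketch
below shows only that splitting step. struct sk_run, sk_split_runs() and
sk_pfn_valid() are hypothetical names, and sk_pfn_valid() merely compares
against a sentinel, which is an assumption and much weaker than the
kernel's pfn_valid().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_INVALID_PFN	ULONG_MAX	/* sentinel for "no page here" */

/* Simplified stand-in: only the sentinel counts as invalid. */
static bool sk_pfn_valid(unsigned long pfn)
{
	return pfn != SK_INVALID_PFN;
}

/* One maximal run of entries that are all valid or all holes. */
struct sk_run {
	size_t offset;
	size_t count;
	bool valid;
};

/*
 * Split pfns[0..count) into runs of equal validity.
 * Returns the number of runs written, or -1 if out[] is too small.
 */
static int sk_split_runs(const unsigned long *pfns, size_t count,
			 struct sk_run *out, size_t max_runs)
{
	size_t start = 0, nruns = 0;

	assert(out != NULL);

	while (start < count) {
		bool valid = sk_pfn_valid(pfns[start]);
		size_t end = start + 1;

		/* Extend the run while validity stays the same. */
		while (end < count && sk_pfn_valid(pfns[end]) == valid)
			end++;

		if (nruns == max_runs)
			return -1;

		out[nruns++] = (struct sk_run){ start, end - start, valid };
		start = end;
	}

	return (int)nruns;
}
```

In terms of the commit message, a caller would pass each valid run to the
normal mapping handler and each hole to the no-access remap. The array
{10, 11, hole, hole, 12} splits into three runs: two valid entries, two
holes, one valid entry.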

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index b1a707d16c07..ed9c55841140 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -119,6 +119,57 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
 	return count;
 }
 
+/**
+ * mshv_region_process_hole - Handle a hole (invalid PFNs) in a memory
+ *                            region
+ * @region    : Memory region containing the hole
+ * @flags     : Flags to pass to the handler function
+ * @pfn_offset: Starting PFN offset within the region
+ * @pfn_count : Number of PFNs in the hole
+ * @handler   : Callback function to invoke for the hole
+ *
+ * Invokes the handler function for a contiguous hole with the specified
+ * parameters.
+ *
+ * Return: Number of PFNs handled, or negative error code.
+ */
+static long mshv_region_process_hole(struct mshv_mem_region *region,
+				     u32 flags,
+				     u64 pfn_offset, u64 pfn_count,
+				     int (*handler)(struct mshv_mem_region *region,
+						    u32 flags,
+						    u64 pfn_offset,
+						    u64 pfn_count,
+						    bool huge_page))
+{
+	long ret;
+
+	ret = handler(region, flags, pfn_offset, pfn_count, false);
+	if (ret)
+		return ret;
+
+	return pfn_count;
+}
+
+static long mshv_region_process_chunk(struct mshv_mem_region *region,
+				      u32 flags,
+				      u64 pfn_offset, u64 pfn_count,
+				      int (*handler)(struct mshv_mem_region *region,
+						     u32 flags,
+						     u64 pfn_offset,
+						     u64 pfn_count,
+						     bool huge_page))
+{
+	if (pfn_valid(region->mreg_pfns[pfn_offset]))
+		return mshv_region_process_pfns(region, flags,
+				pfn_offset, pfn_count,
+				handler);
+	else
+		return mshv_region_process_hole(region, flags,
+				pfn_offset, pfn_count,
+				handler);
+}
+
 /**
  * mshv_region_process_range - Processes a range of PFNs in a region.
  * @region    : Pointer to the memory region structure.
@@ -146,33 +197,47 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 						    u64 pfn_count,
 						    bool huge_page))
 {
-	u64 pfn_end;
+	u64 start, end;
 	long ret;
 
-	if (check_add_overflow(pfn_offset, pfn_count, &pfn_end))
+	if (!pfn_count)
+		return 0;
+
+	if (check_add_overflow(pfn_offset, pfn_count, &end))
 		return -EOVERFLOW;
 
-	if (pfn_end > region->nr_pfns)
+	if (end > region->nr_pfns)
 		return -EINVAL;
 
-	while (pfn_count) {
-		/* Skip non-present pages */
-		if (!pfn_valid(region->mreg_pfns[pfn_offset])) {
-			pfn_offset++;
-			pfn_count--;
+	start = pfn_offset;
+	end = pfn_offset + 1;
+
+	while (end < pfn_offset + pfn_count) {
+		/*
+		 * Accumulate contiguous pfns with the same validity
+		 * (valid or not).
+		 */
+		if (pfn_valid(region->mreg_pfns[start]) ==
+		    pfn_valid(region->mreg_pfns[end])) {
+			end++;
 			continue;
 		}
 
-		ret = mshv_region_process_pfns(region, flags,
-					       pfn_offset, pfn_count,
-					       handler);
+		ret = mshv_region_process_chunk(region, flags,
+						start, end - start,
+						handler);
 		if (ret < 0)
 			return ret;
 
-		pfn_offset += ret;
-		pfn_count -= ret;
+		start += ret;
 	}
 
+	ret = mshv_region_process_chunk(region, flags,
+					start, end - start,
+					handler);
+	if (ret < 0)
+		return ret;
+
 	return 0;
 }
 
@@ -208,6 +273,9 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
+	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
+		return -EINVAL;
+
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
@@ -233,6 +301,9 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 				     u64 pfn_offset, u64 pfn_count,
 				     bool huge_page)
 {
+	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
+		return -EINVAL;
+
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
@@ -256,6 +327,14 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
+	/*
+	 * Remap missing pages with no access to let the
+	 * hypervisor track dirty pages, enabling precopy live
+	 * migration.
+	 */
+	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
+		flags = HV_MAP_GPA_NO_ACCESS;
+
 	if (huge_page)
 		flags |= HV_MAP_GPA_LARGE_PAGE;
 
@@ -357,6 +436,9 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
+	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
+		return 0;
+
 	if (huge_page)
 		flags |= HV_UNMAP_GPA_LARGE_PAGE;
 




* [PATCH 1/7] mshv: Convert from page pointers to PFNs
From: Stanislav Kinsburskii @ 2026-03-30 20:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177490099488.81669.3758562641675983608.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The HMM interface returns PFNs from hmm_range_fault(), and the
hypervisor hypercalls operate on PFNs. Storing page pointers in
between these interfaces requires unnecessary conversions and
temporary allocations.

Store PFNs directly in memory regions to match the natural data flow.
This eliminates the temporary PFN array allocation in the HMM fault
path and reduces page_to_pfn() conversions throughout the driver.
Convert to page structs via pfn_to_page() only when operations like
unpin_user_page() require them.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c      |  297 ++++++++++++++++++++++------------------
 drivers/hv/mshv_root.h         |   20 +--
 drivers/hv/mshv_root_hv_call.c |   50 +++----
 drivers/hv/mshv_root_main.c    |   30 ++--
 4 files changed, 212 insertions(+), 185 deletions(-)
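
Note, not part of the patch: a userspace sketch of the storage change the
commit message describes, assuming a flexible array of PFNs with an
explicit "no page" sentinel. struct sk_region, sk_region_create() and
sk_region_invalidate() are hypothetical names. A sentinel is needed
because 0 can be a legitimate frame number, so zero-filled memory cannot
mean "absent" the way a NULL page pointer did.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_INVALID_PFN	ULONG_MAX	/* sentinel for "no page here" */

/* Hypothetical region: one PFN slot per guest page. */
struct sk_region {
	size_t nr_pfns;
	unsigned long pfns[];
};

/* Allocate a region and mark every slot as empty. */
static struct sk_region *sk_region_create(size_t nr_pfns)
{
	struct sk_region *r;

	r = malloc(sizeof(*r) + nr_pfns * sizeof(r->pfns[0]));
	if (!r)
		return NULL;

	r->nr_pfns = nr_pfns;
	for (size_t i = 0; i < nr_pfns; i++)
		r->pfns[i] = SK_INVALID_PFN;

	return r;
}

/*
 * Reset slots [offset, offset + count) to the sentinel. Returns how many
 * slots held a real PFN, i.e. where a driver would release the page.
 */
static size_t sk_region_invalidate(struct sk_region *r,
				   size_t offset, size_t count)
{
	size_t released = 0;

	assert(r != NULL);

	for (size_t i = offset; i < offset + count && i < r->nr_pfns; i++) {
		if (r->pfns[i] == SK_INVALID_PFN)
			continue;

		released++;
		r->pfns[i] = SK_INVALID_PFN;
	}

	return released;
}
```

After storing PFN 0 and PFN 42 in a four-slot region, invalidating the
whole region reports two released slots, and a second pass reports none.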

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6..b1a707d16c07 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -18,12 +18,13 @@
 #include "mshv_root.h"
 
 #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
+#define MSHV_INVALID_PFN				ULONG_MAX
 
 /**
  * mshv_chunk_stride - Compute stride for mapping guest memory
  * @page      : The page to check for huge page backing
  * @gfn       : Guest frame number for the mapping
- * @page_count: Total number of pages in the mapping
+ * @pfn_count: Total number of pages in the mapping
  *
  * Determines the appropriate stride (in pages) for mapping guest memory.
  * Uses huge page stride if the backing page is huge and the guest mapping
@@ -32,18 +33,18 @@
  * Return: Stride in pages, or -EINVAL if page order is unsupported.
  */
 static int mshv_chunk_stride(struct page *page,
-			     u64 gfn, u64 page_count)
+			     u64 gfn, u64 pfn_count)
 {
 	unsigned int page_order;
 
 	/*
 	 * Use single page stride by default. For huge page stride, the
 	 * page must be compound and point to the head of the compound
-	 * page, and both gfn and page_count must be huge-page aligned.
+	 * page, and both gfn and pfn_count must be huge-page aligned.
 	 */
 	if (!PageCompound(page) || !PageHead(page) ||
 	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
-	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
+	    !IS_ALIGNED(pfn_count, PTRS_PER_PMD))
 		return 1;
 
 	page_order = folio_order(page_folio(page));
@@ -57,60 +58,61 @@ static int mshv_chunk_stride(struct page *page,
 /**
  * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
  *                             in a region.
- * @region     : Pointer to the memory region structure.
- * @flags      : Flags to pass to the handler.
- * @page_offset: Offset into the region's pages array to start processing.
- * @page_count : Number of pages to process.
- * @handler    : Callback function to handle the chunk.
+ * @region    : Pointer to the memory region structure.
+ * @flags     : Flags to pass to the handler.
+ * @pfn_offset: Offset into the region's PFNs array to start processing.
+ * @pfn_count : Number of PFNs to process.
+ * @handler   : Callback function to handle the chunk.
  *
- * This function scans the region's pages starting from @page_offset,
- * checking for contiguous present pages of the same size (normal or huge).
- * It invokes @handler for the chunk of contiguous pages found. Returns the
- * number of pages handled, or a negative error code if the first page is
- * not present or the handler fails.
+ * This function scans the region's PFNs starting from @pfn_offset,
+ * checking for contiguous valid PFNs backed by pages of the same size
+ * (normal or huge). It invokes @handler for the chunk of contiguous valid
+ * PFNs found. Returns the number of PFNs handled, or a negative error code
+ * if the first PFN is invalid or the handler fails.
  *
- * Note: The @handler callback must be able to handle both normal and huge
- * pages.
+ * Note: The @handler callback must be able to handle valid PFNs backed by
+ * both normal and huge pages.
  *
  * Return: Number of pages handled, or negative error code.
  */
-static long mshv_region_process_chunk(struct mshv_mem_region *region,
-				      u32 flags,
-				      u64 page_offset, u64 page_count,
-				      int (*handler)(struct mshv_mem_region *region,
-						     u32 flags,
-						     u64 page_offset,
-						     u64 page_count,
-						     bool huge_page))
+static long mshv_region_process_pfns(struct mshv_mem_region *region,
+				     u32 flags,
+				     u64 pfn_offset, u64 pfn_count,
+				     int (*handler)(struct mshv_mem_region *region,
+						    u32 flags,
+						    u64 pfn_offset,
+						    u64 pfn_count,
+						    bool huge_page))
 {
-	u64 gfn = region->start_gfn + page_offset;
+	u64 gfn = region->start_gfn + pfn_offset;
 	u64 count;
-	struct page *page;
+	unsigned long pfn;
 	int stride, ret;
 
-	page = region->mreg_pages[page_offset];
-	if (!page)
+	pfn = region->mreg_pfns[pfn_offset];
+	if (!pfn_valid(pfn))
 		return -EINVAL;
 
-	stride = mshv_chunk_stride(page, gfn, page_count);
+	stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
 	if (stride < 0)
 		return stride;
 
 	/* Start at stride since the first stride is validated */
-	for (count = stride; count < page_count; count += stride) {
-		page = region->mreg_pages[page_offset + count];
+	for (count = stride; count < pfn_count; count += stride) {
+		pfn = region->mreg_pfns[pfn_offset + count];
 
-		/* Break if current page is not present */
-		if (!page)
+		/* Break if current pfn is invalid */
+		if (!pfn_valid(pfn))
 			break;
 
 		/* Break if stride size changes */
-		if (stride != mshv_chunk_stride(page, gfn + count,
-						page_count - count))
+		if (stride != mshv_chunk_stride(pfn_to_page(pfn),
+						gfn + count,
+						pfn_count - count))
 			break;
 	}
 
-	ret = handler(region, flags, page_offset, count, stride > 1);
+	ret = handler(region, flags, pfn_offset, count, stride > 1);
 	if (ret)
 		return ret;
 
@@ -118,70 +120,73 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
 }
 
 /**
- * mshv_region_process_range - Processes a range of memory pages in a
- *                             region.
- * @region     : Pointer to the memory region structure.
- * @flags      : Flags to pass to the handler.
- * @page_offset: Offset into the region's pages array to start processing.
- * @page_count : Number of pages to process.
- * @handler    : Callback function to handle each chunk of contiguous
- *               pages.
+ * mshv_region_process_range - Processes a range of PFNs in a region.
+ * @region    : Pointer to the memory region structure.
+ * @flags     : Flags to pass to the handler.
+ * @pfn_offset: Offset into the region's PFNs array to start processing.
+ * @pfn_count : Number of PFNs to process.
+ * @handler   : Callback function to handle each chunk of contiguous
+ *              valid PFNs.
  *
- * Iterates over the specified range of pages in @region, skipping
- * non-present pages. For each contiguous chunk of present pages, invokes
- * @handler via mshv_region_process_chunk.
+ * Iterates over the specified range of PFNs in @region, skipping
+ * invalid PFNs. For each contiguous chunk of valid PFNs, invokes
+ * @handler via mshv_region_process_pfns.
  *
- * Note: The @handler callback must be able to handle both normal and huge
- * pages.
+ * Note: The @handler callback must be able to handle PFNs backed by both
+ * normal and huge pages.
  *
  * Returns 0 on success, or a negative error code on failure.
  */
 static int mshv_region_process_range(struct mshv_mem_region *region,
 				     u32 flags,
-				     u64 page_offset, u64 page_count,
+				     u64 pfn_offset, u64 pfn_count,
 				     int (*handler)(struct mshv_mem_region *region,
 						    u32 flags,
-						    u64 page_offset,
-						    u64 page_count,
+						    u64 pfn_offset,
+						    u64 pfn_count,
 						    bool huge_page))
 {
+	u64 pfn_end;
 	long ret;
 
-	if (page_offset + page_count > region->nr_pages)
+	if (check_add_overflow(pfn_offset, pfn_count, &pfn_end))
+		return -EOVERFLOW;
+
+	if (pfn_end > region->nr_pfns)
 		return -EINVAL;
 
-	while (page_count) {
+	while (pfn_count) {
 		/* Skip non-present pages */
-		if (!region->mreg_pages[page_offset]) {
-			page_offset++;
-			page_count--;
+		if (!pfn_valid(region->mreg_pfns[pfn_offset])) {
+			pfn_offset++;
+			pfn_count--;
 			continue;
 		}
 
-		ret = mshv_region_process_chunk(region, flags,
-						page_offset,
-						page_count,
-						handler);
+		ret = mshv_region_process_pfns(region, flags,
+					       pfn_offset, pfn_count,
+					       handler);
 		if (ret < 0)
 			return ret;
 
-		page_offset += ret;
-		page_count -= ret;
+		pfn_offset += ret;
+		pfn_count -= ret;
 	}
 
 	return 0;
 }
 
-struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
+struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
 					   u64 uaddr, u32 flags)
 {
 	struct mshv_mem_region *region;
+	u64 i;
 
-	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
+	region = vzalloc(sizeof(*region) + sizeof(unsigned long) * nr_pfns);
 	if (!region)
 		return ERR_PTR(-ENOMEM);
 
-	region->nr_pages = nr_pages;
+	region->nr_pfns = nr_pfns;
 	region->start_gfn = guest_pfn;
 	region->start_uaddr = uaddr;
 	region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
@@ -190,6 +195,9 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
 		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
 
+	for (i = 0; i < nr_pfns; i++)
+		region->mreg_pfns[i] = MSHV_INVALID_PFN;
+
 	kref_init(&region->mreg_refcount);
 
 	return region;
@@ -197,15 +205,15 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 
 static int mshv_region_chunk_share(struct mshv_mem_region *region,
 				   u32 flags,
-				   u64 page_offset, u64 page_count,
+				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pages + page_offset,
-					      page_count,
+					      region->mreg_pfns + pfn_offset,
+					      pfn_count,
 					      HV_MAP_GPA_READABLE |
 					      HV_MAP_GPA_WRITABLE,
 					      flags, true);
@@ -216,21 +224,21 @@ int mshv_region_share(struct mshv_mem_region *region)
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
 	return mshv_region_process_range(region, flags,
-					 0, region->nr_pages,
+					 0, region->nr_pfns,
 					 mshv_region_chunk_share);
 }
 
 static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 				     u32 flags,
-				     u64 page_offset, u64 page_count,
+				     u64 pfn_offset, u64 pfn_count,
 				     bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pages + page_offset,
-					      page_count, 0,
+					      region->mreg_pfns + pfn_offset,
+					      pfn_count, 0,
 					      flags, false);
 }
 
@@ -239,30 +247,30 @@ int mshv_region_unshare(struct mshv_mem_region *region)
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
 	return mshv_region_process_range(region, flags,
-					 0, region->nr_pages,
+					 0, region->nr_pfns,
 					 mshv_region_chunk_unshare);
 }
 
 static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 				   u32 flags,
-				   u64 page_offset, u64 page_count,
+				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_MAP_GPA_LARGE_PAGE;
 
-	return hv_call_map_gpa_pages(region->partition->pt_id,
-				     region->start_gfn + page_offset,
-				     page_count, flags,
-				     region->mreg_pages + page_offset);
+	return hv_call_map_ram_pfns(region->partition->pt_id,
+				    region->start_gfn + pfn_offset,
+				    pfn_count, flags,
+				    region->mreg_pfns + pfn_offset);
 }
 
-static int mshv_region_remap_pages(struct mshv_mem_region *region,
-				   u32 map_flags,
-				   u64 page_offset, u64 page_count)
+static int mshv_region_remap_pfns(struct mshv_mem_region *region,
+				  u32 map_flags,
+				  u64 pfn_offset, u64 pfn_count)
 {
 	return mshv_region_process_range(region, map_flags,
-					 page_offset, page_count,
+					 pfn_offset, pfn_count,
 					 mshv_region_chunk_remap);
 }
 
@@ -270,38 +278,50 @@ int mshv_region_map(struct mshv_mem_region *region)
 {
 	u32 map_flags = region->hv_map_flags;
 
-	return mshv_region_remap_pages(region, map_flags,
-				       0, region->nr_pages);
+	return mshv_region_remap_pfns(region, map_flags,
+				      0, region->nr_pfns);
 }
 
-static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
-					 u64 page_offset, u64 page_count)
+static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
+					u64 pfn_offset, u64 pfn_count)
 {
-	if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
-		unpin_user_pages(region->mreg_pages + page_offset, page_count);
+	u64 i;
+
+	for (i = pfn_offset; i < pfn_offset + pfn_count; i++) {
+		if (!pfn_valid(region->mreg_pfns[i]))
+			continue;
+
+		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
+			unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
 
-	memset(region->mreg_pages + page_offset, 0,
-	       page_count * sizeof(struct page *));
+		region->mreg_pfns[i] = MSHV_INVALID_PFN;
+	}
 }
 
 void mshv_region_invalidate(struct mshv_mem_region *region)
 {
-	mshv_region_invalidate_pages(region, 0, region->nr_pages);
+	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
 }
 
 int mshv_region_pin(struct mshv_mem_region *region)
 {
-	u64 done_count, nr_pages;
+	u64 done_count, nr_pfns, i;
+	unsigned long *pfns;
 	struct page **pages;
 	__u64 userspace_addr;
 	int ret;
 
-	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
-		pages = region->mreg_pages + done_count;
+	pages = kmalloc_array(MSHV_PIN_PAGES_BATCH_SIZE,
+			      sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	for (done_count = 0; done_count < region->nr_pfns; done_count += ret) {
+		pfns = region->mreg_pfns + done_count;
 		userspace_addr = region->start_uaddr +
 				 done_count * HV_HYP_PAGE_SIZE;
-		nr_pages = min(region->nr_pages - done_count,
-			       MSHV_PIN_PAGES_BATCH_SIZE);
+		nr_pfns = min(region->nr_pfns - done_count,
+			      MSHV_PIN_PAGES_BATCH_SIZE);
 
 		/*
 		 * Pinning assuming 4k pages works for large pages too.
@@ -311,39 +331,44 @@ int mshv_region_pin(struct mshv_mem_region *region)
 		 * with the FOLL_LONGTERM flag does a large temporary
 		 * allocation of contiguous memory.
 		 */
-		ret = pin_user_pages_fast(userspace_addr, nr_pages,
+		ret = pin_user_pages_fast(userspace_addr, nr_pfns,
 					  FOLL_WRITE | FOLL_LONGTERM,
 					  pages);
-		if (ret != nr_pages)
+		if (ret != nr_pfns)
 			goto release_pages;
+
+		for (i = 0; i < ret; i++)
+			pfns[i] = page_to_pfn(pages[i]);
 	}
 
+	kfree(pages);
 	return 0;
 
 release_pages:
 	if (ret > 0)
 		done_count += ret;
-	mshv_region_invalidate_pages(region, 0, done_count);
+	mshv_region_invalidate_pfns(region, 0, done_count);
+	kfree(pages);
 	return ret < 0 ? ret : -ENOMEM;
 }
 
 static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
 				   u32 flags,
-				   u64 page_offset, u64 page_count,
+				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
 {
 	if (huge_page)
 		flags |= HV_UNMAP_GPA_LARGE_PAGE;
 
-	return hv_call_unmap_gpa_pages(region->partition->pt_id,
-				       region->start_gfn + page_offset,
-				       page_count, flags);
+	return hv_call_unmap_pfns(region->partition->pt_id,
+				  region->start_gfn + pfn_offset,
+				  pfn_count, flags);
 }
 
 static int mshv_region_unmap(struct mshv_mem_region *region)
 {
 	return mshv_region_process_range(region, 0,
-					 0, region->nr_pages,
+					 0, region->nr_pfns,
 					 mshv_region_chunk_unmap);
 }
 
@@ -427,8 +452,8 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 /**
  * mshv_region_range_fault - Handle memory range faults for a given region.
  * @region: Pointer to the memory region structure.
- * @page_offset: Offset of the page within the region.
- * @page_count: Number of pages to handle.
+ * @pfn_offset: Offset of the page within the region.
+ * @pfn_count: Number of pages to handle.
  *
  * This function resolves memory faults for a specified range of pages
  * within a memory region. It uses HMM (Heterogeneous Memory Management)
@@ -437,7 +462,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
  * Return: 0 on success, negative error code on failure.
  */
 static int mshv_region_range_fault(struct mshv_mem_region *region,
-				   u64 page_offset, u64 page_count)
+				   u64 pfn_offset, u64 pfn_count)
 {
 	struct hmm_range range = {
 		.notifier = &region->mreg_mni,
@@ -447,13 +472,13 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	int ret;
 	u64 i;
 
-	pfns = kmalloc_array(page_count, sizeof(*pfns), GFP_KERNEL);
+	pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
 	if (!pfns)
 		return -ENOMEM;
 
 	range.hmm_pfns = pfns;
-	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
-	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
+	range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
+	range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
 
 	do {
 		ret = mshv_region_hmm_fault_and_lock(region, &range);
@@ -462,11 +487,15 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 	if (ret)
 		goto out;
 
-	for (i = 0; i < page_count; i++)
-		region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
+	for (i = 0; i < pfn_count; i++) {
+		if (!(pfns[i] & HMM_PFN_VALID))
+			continue;
+		/* Drop HMM_PFN_* flags to ensure PFNs are valid. */
+		region->mreg_pfns[pfn_offset + i] = pfns[i] & ~HMM_PFN_FLAGS;
+	}
 
-	ret = mshv_region_remap_pages(region, region->hv_map_flags,
-				      page_offset, page_count);
+	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
+				     pfn_offset, pfn_count);
 
 	mutex_unlock(&region->mreg_mutex);
 out:
@@ -476,24 +505,24 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
 {
-	u64 page_offset, page_count;
+	u64 pfn_offset, pfn_count;
 	int ret;
 
 	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
-	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
-				 MSHV_MAP_FAULT_IN_PAGES);
+	pfn_offset = ALIGN_DOWN(gfn - region->start_gfn,
+				MSHV_MAP_FAULT_IN_PAGES);
 
 	/* Map more pages than requested to reduce the number of faults. */
-	page_count = min(region->nr_pages - page_offset,
-			 MSHV_MAP_FAULT_IN_PAGES);
+	pfn_count = min(region->nr_pfns - pfn_offset,
+			MSHV_MAP_FAULT_IN_PAGES);
 
-	ret = mshv_region_range_fault(region, page_offset, page_count);
+	ret = mshv_region_range_fault(region, pfn_offset, pfn_count);
 
 	WARN_ONCE(ret,
-		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
+		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, pfn_offset %llu, pfn_count %llu\n",
 		  region->partition->pt_id, region->start_uaddr,
-		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
-		  gfn, page_offset, page_count);
+		  region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
+		  gfn, pfn_offset, pfn_count);
 
 	return !ret;
 }
@@ -523,16 +552,16 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 	struct mshv_mem_region *region = container_of(mni,
 						      struct mshv_mem_region,
 						      mreg_mni);
-	u64 page_offset, page_count;
+	u64 pfn_offset, pfn_count;
 	unsigned long mstart, mend;
 	int ret = -EPERM;
 
 	mstart = max(range->start, region->start_uaddr);
 	mend = min(range->end, region->start_uaddr +
-		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
+		   (region->nr_pfns << HV_HYP_PAGE_SHIFT));
 
-	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
-	page_count = HVPFN_DOWN(mend - mstart);
+	pfn_offset = HVPFN_DOWN(mstart - region->start_uaddr);
+	pfn_count = HVPFN_DOWN(mend - mstart);
 
 	if (mmu_notifier_range_blockable(range))
 		mutex_lock(&region->mreg_mutex);
@@ -541,12 +570,12 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 
 	mmu_interval_set_seq(mni, cur_seq);
 
-	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
-				      page_offset, page_count);
+	ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
+				     pfn_offset, pfn_count);
 	if (ret)
 		goto out_unlock;
 
-	mshv_region_invalidate_pages(region, page_offset, page_count);
+	mshv_region_invalidate_pfns(region, pfn_offset, pfn_count);
 
 	mutex_unlock(&region->mreg_mutex);
 
@@ -558,9 +587,9 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 	WARN_ONCE(ret,
 		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
 		  region->start_uaddr,
-		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
+		  region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
 		  range->start, range->end, range->event,
-		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
+		  pfn_offset, pfn_offset + pfn_count - 1, (u64)range->mm, ret);
 	return false;
 }
 
@@ -579,7 +608,7 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
 
 	ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
 					   region->start_uaddr,
-					   region->nr_pages << HV_HYP_PAGE_SHIFT,
+					   region->nr_pfns << HV_HYP_PAGE_SHIFT,
 					   &mshv_region_mni_ops);
 	if (ret)
 		return false;
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 947dfb76bb19..f1d4bee97a3f 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -84,15 +84,15 @@ enum mshv_region_type {
 struct mshv_mem_region {
 	struct hlist_node hnode;
 	struct kref mreg_refcount;
-	u64 nr_pages;
+	u64 nr_pfns;
 	u64 start_gfn;
 	u64 start_uaddr;
 	u32 hv_map_flags;
 	struct mshv_partition *partition;
 	enum mshv_region_type mreg_type;
 	struct mmu_interval_notifier mreg_mni;
-	struct mutex mreg_mutex;	/* protects region pages remapping */
-	struct page *mreg_pages[];
+	struct mutex mreg_mutex;	/* protects region PFNs remapping */
+	unsigned long mreg_pfns[];
 };
 
 struct mshv_irq_ack_notifier {
@@ -282,11 +282,11 @@ int hv_call_create_partition(u64 flags,
 int hv_call_initialize_partition(u64 partition_id);
 int hv_call_finalize_partition(u64 partition_id);
 int hv_call_delete_partition(u64 partition_id);
-int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
-int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
-			  u32 flags, struct page **pages);
-int hv_call_unmap_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
-			    u32 flags);
+int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
+int hv_call_map_ram_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
+			 u32 flags, unsigned long *pfns);
+int hv_call_unmap_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
+		       u32 flags);
 int hv_call_delete_vp(u64 partition_id, u32 vp_index);
 int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
 				     u64 dest_addr,
@@ -329,8 +329,8 @@ int hv_map_stats_page(enum hv_stats_object_type type,
 int hv_unmap_stats_page(enum hv_stats_object_type type,
 			struct hv_stats_page *page_addr,
 			const union hv_stats_object_identity *identity);
-int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
-				   u64 page_struct_count, u32 host_access,
+int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
+				   u64 pfns_count, u32 host_access,
 				   u32 flags, u8 acquire);
 int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
 				      void *property_value, size_t property_value_sz);
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index cb55d4d4be2e..a95f2cfc5da5 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -188,17 +188,16 @@ int hv_call_delete_partition(u64 partition_id)
 	return hv_result_to_errno(status);
 }
 
-/* Ask the hypervisor to map guest ram pages or the guest mmio space */
-static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
-			       u32 flags, struct page **pages, u64 mmio_spa)
+static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,
+			  u32 flags, unsigned long *pfns, u64 mmio_spa)
 {
 	struct hv_input_map_gpa_pages *input_page;
 	u64 status, *pfnlist;
 	unsigned long irq_flags, large_shift = 0;
 	int ret = 0, done = 0;
-	u64 page_count = page_struct_count;
+	u64 page_count = pfns_count;
 
-	if (page_count == 0 || (pages && mmio_spa))
+	if (page_count == 0 || (pfns && mmio_spa))
 		return -EINVAL;
 
 	if (flags & HV_MAP_GPA_LARGE_PAGE) {
@@ -227,14 +226,14 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 		for (i = 0; i < rep_count; i++)
 			if (flags & HV_MAP_GPA_NO_ACCESS) {
 				pfnlist[i] = 0;
-			} else if (pages) {
+			} else if (pfns) {
 				u64 index = (done + i) << large_shift;
 
-				if (index >= page_struct_count) {
+				if (index >= pfns_count) {
 					ret = -EINVAL;
 					break;
 				}
-				pfnlist[i] = page_to_pfn(pages[index]);
+				pfnlist[i] = pfns[index];
 			} else {
 				pfnlist[i] = mmio_spa + done + i;
 			}
@@ -266,37 +265,37 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 
 		if (flags & HV_MAP_GPA_LARGE_PAGE)
 			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
+		hv_call_unmap_pfns(partition_id, gfn, done, unmap_flags);
 	}
 
 	return ret;
 }
 
 /* Ask the hypervisor to map guest ram pages */
-int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
-			  u32 flags, struct page **pages)
+int hv_call_map_ram_pfns(u64 partition_id, u64 gfn, u64 pfn_count,
+			 u32 flags, unsigned long *pfns)
 {
-	return hv_do_map_gpa_hcall(partition_id, gpa_target, page_count,
-				   flags, pages, 0);
+	return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
+			      pfns, 0);
 }
 
-/* Ask the hypervisor to map guest mmio space */
-int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
+int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa,
+			  u64 pfn_count)
 {
 	int i;
 	u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
 		    HV_MAP_GPA_NOT_CACHED;
 
-	for (i = 0; i < numpgs; i++)
+	for (i = 0; i < pfn_count; i++)
 		if (page_is_ram(mmio_spa + i))
 			return -EINVAL;
 
-	return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
-				   mmio_spa);
+	return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
+			      NULL, mmio_spa);
 }
 
-int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
-			    u32 flags)
+int hv_call_unmap_pfns(u64 partition_id, u64 gfn, u64 page_count_4k,
+		       u32 flags)
 {
 	struct hv_input_unmap_gpa_pages *input_page;
 	u64 status, page_count = page_count_4k;
@@ -1009,15 +1008,15 @@ int hv_unmap_stats_page(enum hv_stats_object_type type,
 	return ret;
 }
 
-int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
-				   u64 page_struct_count, u32 host_access,
+int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
+				   u64 pfns_count, u32 host_access,
 				   u32 flags, u8 acquire)
 {
 	struct hv_input_modify_sparse_spa_page_host_access *input_page;
 	u64 status;
 	int done = 0;
 	unsigned long irq_flags, large_shift = 0;
-	u64 page_count = page_struct_count;
+	u64 page_count = pfns_count;
 	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
 			     HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS;
 
@@ -1051,11 +1050,10 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 		for (i = 0; i < rep_count; i++) {
 			u64 index = (done + i) << large_shift;
 
-			if (index >= page_struct_count)
+			if (index >= pfns_count)
 				return -EINVAL;
 
-			input_page->spa_page_list[i] =
-						page_to_pfn(pages[index]);
+			input_page->spa_page_list[i] = pfns[index];
 		}
 
 		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index f2d83d6c8c4f..685e4b562186 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -619,7 +619,7 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 
 	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
 		if (gfn >= region->start_gfn &&
-		    gfn < region->start_gfn + region->nr_pages)
+		    gfn < region->start_gfn + region->nr_pfns)
 			return region;
 	}
 
@@ -1221,20 +1221,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 					bool is_mmio)
 {
 	struct mshv_mem_region *rg;
-	u64 nr_pages = HVPFN_DOWN(mem->size);
+	u64 nr_pfns = HVPFN_DOWN(mem->size);
 
 	/* Reject overlapping regions */
 	spin_lock(&partition->pt_mem_regions_lock);
 	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
-		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
-		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
+		if (mem->guest_pfn + nr_pfns <= rg->start_gfn ||
+		    rg->start_gfn + rg->nr_pfns <= mem->guest_pfn)
 			continue;
 		spin_unlock(&partition->pt_mem_regions_lock);
 		return -EEXIST;
 	}
 	spin_unlock(&partition->pt_mem_regions_lock);
 
-	rg = mshv_region_create(mem->guest_pfn, nr_pages,
+	rg = mshv_region_create(mem->guest_pfn, nr_pfns,
 				mem->userspace_addr, mem->flags);
 	if (IS_ERR(rg))
 		return PTR_ERR(rg);
@@ -1372,21 +1372,21 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		 * the hypervisor track dirty pages, enabling pre-copy live
 		 * migration.
 		 */
-		ret = hv_call_map_gpa_pages(partition->pt_id,
-					    region->start_gfn,
-					    region->nr_pages,
-					    HV_MAP_GPA_NO_ACCESS, NULL);
+		ret = hv_call_map_ram_pfns(partition->pt_id,
+					   region->start_gfn,
+					   region->nr_pfns,
+					   HV_MAP_GPA_NO_ACCESS, NULL);
 		break;
 	case MSHV_REGION_TYPE_MMIO:
-		ret = hv_call_map_mmio_pages(partition->pt_id,
-					     region->start_gfn,
-					     mmio_pfn,
-					     region->nr_pages);
+		ret = hv_call_map_mmio_pfns(partition->pt_id,
+					    region->start_gfn,
+					    mmio_pfn,
+					    region->nr_pfns);
 		break;
 	}
 
 	trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
-				   region->start_gfn, region->nr_pages,
+				   region->start_gfn, region->nr_pfns,
 				   region->hv_map_flags, ret);
 
 	if (ret)
@@ -1424,7 +1424,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
 	/* Paranoia check */
 	if (region->start_uaddr != mem.userspace_addr ||
 	    region->start_gfn != mem.guest_pfn ||
-	    region->nr_pages != HVPFN_DOWN(mem.size)) {
+	    region->nr_pfns != HVPFN_DOWN(mem.size)) {
 		spin_unlock(&partition->pt_mem_regions_lock);
 		return -EINVAL;
 	}




* [PATCH 0/7] mshv: Refactor memory region management and map pages at creation
From: Stanislav Kinsburskii @ 2026-03-30 20:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

This series refactors the mshv memory region subsystem in preparation
for mapping populated pages into the hypervisor at movable region
creation time, rather than relying solely on demand faulting.

The primary motivation is to ensure that when userspace passes a
pre-populated mapping for a movable memory region, those pages are
immediately visible to the hypervisor. Previously, all movable regions
were created with HV_MAP_GPA_NO_ACCESS on every page regardless of
whether the backing pages were already present, deferring all mapping
to the fault handler. This added unnecessary fault overhead and
complicated the initial setup of child partitions with pre-populated
memory.

The series takes a bottom-up approach:

- Patches 1-2 lay the groundwork by converting internal data structures
from page pointers to PFNs and teaching the range processing
infrastructure to handle holes (invalid PFNs) uniformly. The PFN
conversion eliminates redundant page_to_pfn()/pfn_to_page() conversions
between the HMM interface (which returns PFNs) and the hypervisor
hypercalls (which consume PFNs). The hole handling enables mapping
regions that contain a mix of present and absent pages, remapping holes
with no-access permissions to preserve hypervisor dirty page tracking
for precopy live migration.

- Patch 3 extends HMM fault handling to support memory regions that span
multiple VMAs with different protection flags, which is required for
flexible guest memory layouts.

- Patch 4 consolidates region setup by moving pinned region preparation
into mshv_regions.c, making five helper functions static, and fixing
a pre-existing bug where mshv_region_map() failures on non-encrypted
partitions were silently ignored.

- Patch 5 is the core functional change: movable regions now collect
already-present PFNs from userspace at creation time and map them
into the hypervisor immediately. A new do_fault parameter controls
whether hmm_range_fault() should fault in missing pages or only
collect those already present.

- Patches 6-7 are cleanups: extracting the MMIO mapping path into its
own function for consistency with the pinned and movable paths, and
adding a tracepoint for GPA mapping hypercalls to aid debugging.
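As a rough illustration of the hole handling described for patches 1-2,
the sketch below splits a PFN array into maximal runs of present PFNs
and holes and hands each run to a callback. It is a user-space model
under stated assumptions, not the code from the series:
SKETCH_INVALID_PFN stands in for MSHV_INVALID_PFN (whose value is not
shown here), and every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "no page present"; the series has its own. */
#define SKETCH_INVALID_PFN UINT64_MAX

/* Tallies filled in by the example handler below. */
struct sketch_counts {
	size_t runs;		/* contiguous runs seen */
	size_t mapped;		/* PFNs that would get real permissions */
	size_t no_access;	/* hole entries that would be no-access */
};

typedef int (*sketch_run_fn)(size_t offset, size_t count, bool is_hole,
			     void *ctx);

/* Example handler: only counts what a real handler would map. */
static int sketch_count_run(size_t offset, size_t count, bool is_hole,
			    void *ctx)
{
	struct sketch_counts *c = ctx;

	(void)offset;
	c->runs++;
	if (is_hole)
		c->no_access += count;
	else
		c->mapped += count;
	return 0;
}

/*
 * Walk pfns[offset .. offset + count) and call the handler once per
 * maximal run of entries that are either all present or all holes.
 * Stops at the first non-zero handler return value.
 */
static int sketch_process_range(const uint64_t *pfns, size_t offset,
				size_t count, sketch_run_fn handler,
				void *ctx)
{
	size_t end = offset + count;
	size_t i = offset;

	assert(pfns != NULL && handler != NULL);

	while (i < end) {
		bool is_hole = pfns[i] == SKETCH_INVALID_PFN;
		size_t start = i;
		int ret;

		while (i < end && (pfns[i] == SKETCH_INVALID_PFN) == is_hole)
			i++;

		ret = handler(start, i - start, is_hole, ctx);
		if (ret)
			return ret;
	}
	return 0;
}
```

Passing sketch_count_run as the handler over an array with two present
PFNs, two holes, and one more present PFN would report three runs. A
real handler would presumably issue a map call per run, using no-access
flags for the hole runs.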

---

Stanislav Kinsburskii (7):
      mshv: Convert from page pointers to PFNs
      mshv: Add support to address range holes remapping
      mshv: Support regions with different VMAs
      mshv: Move pinned region setup to mshv_regions.c
      mshv: Map populated pages on movable region creation
      mshv: Extract MMIO region mapping into separate function
      mshv: Add tracepoint for map GPA hypercall


 drivers/hv/mshv_regions.c      |  580 +++++++++++++++++++++++++++++-----------
 drivers/hv/mshv_root.h         |   29 +-
 drivers/hv/mshv_root_hv_call.c |   53 ++--
 drivers/hv/mshv_root_main.c    |   99 +------
 drivers/hv/mshv_trace.h        |   36 ++
 5 files changed, 503 insertions(+), 294 deletions(-)



* Re: [PATCH net-next, v3] net: mana: Force full-page RX buffers for 4K page size on specific systems.
From: Dipayaan Roy @ 2026-03-30 19:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, dipayanroy
In-Reply-To: <20260320172908.1840229d@kernel.org>

On Fri, Mar 20, 2026 at 05:29:08PM -0700, Jakub Kicinski wrote:
> On Fri, 20 Mar 2026 11:37:36 -0700 Dipayaan Roy wrote:
> > On Sat, Mar 14, 2026 at 12:50:53PM -0700, Jakub Kicinski wrote:
> > > On Tue, 10 Mar 2026 21:00:49 -0700 Dipayaan Roy wrote:  
> > > > On certain systems configured with 4K PAGE_SIZE, utilizing page_pool
> > > > fragments for RX buffers results in a significant throughput regression.
> > > > Profiling reveals that this regression correlates with high overhead in the
> > > > fragment allocation and reference counting paths on these specific
> > > > platforms, rendering the multi-buffer-per-page strategy counterproductive.  
> > > 
> > > Can you say more ? We could technically take two references on the page
> > > right away if MTU is small and avoid some of the cost.  
> > 
> > There is a 15-20% shortfall in achieving line rate for MANA (180+ Gbps)
> > on a particular ARM64 SKU. The issue is only specific to this processor SKU —
> > not seen on other ARM64 SKUs (e.g., GB200) or x86 SKUs. Critically, the
> > regression only manifests beyond 16 TCP connections, which strongly indicates
> > that it appears only under high contention and traffic.
> > 
> >   no. of     | rx buf backed       | rx buf backed
> >  connections | with page fragments | with full page
> > -------------+---------------------+---------------
> >            4 |         139 Gbps    |     138 Gbps
> >            8 |         140 Gbps    |     162 Gbps
> >           16 |         186 Gbps    |     186 Gbps
> 
> These results look a bit odd, 4 and 16 streams have the same perf,
> while all other cases indeed show a delta. What I was hoping for was
> a more precise attribution of the performance issue. Like perf top
> showing that it's indeed the atomic ops on the refcount that stall.
> 
> >           32 |         136 Gbps    |     183 Gbps
> >           48 |         159 Gbps    |     185 Gbps
> >           64 |         165 Gbps    |     184 Gbps
> >          128 |         170 Gbps    |     180 Gbps
> >  
> > HW team is still working to RCA this hw behaviour.
> > 
> > Regarding "We could technically take two references on the page right
> > away", are you suggesting moving the page reference counting logic into
> > the driver instead of relying on page pool?
> 
> Yes, either that or adjust the page pool APIs. 
> page_pool_alloc_frag_netmem() currently sets the refcount to BIAS
> which it then has to subtract later. So we get:
> 
>   set(BIAS)
>   .. driver allocates chunks ..
>   sub(BIAS_MAX - pool->frag_users)
> 
> Instead of using BIAS we could make the page pool guess that the caller
> will keep asking for the same frame size. So initially take
> (PAGE_SIZE/size) references.
> 
OK, I will experiment with this approach to see whether it helps in the
current scenario.
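
To make the two schemes concrete, here is a toy user-space model of the
refcount bookkeeping discussed above. It is only a sketch: it is not
the page_pool implementation, it does not model atomics, and every
sketch_* name as well as the bias value is a hypothetical placeholder.
It merely counts how often the refcount is written under each scheme.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_BIAS      (1L << 20)	/* hypothetical bias value */

struct sketch_page {
	long refcount;
	unsigned int refcount_updates;	/* number of refcount writes */
};

/* Bias scheme as described in the thread: set(BIAS), then sub(...). */
static void sketch_bias_scheme(struct sketch_page *p, unsigned int frag_users)
{
	p->refcount = SKETCH_BIAS;			/* set(BIAS) */
	p->refcount_updates = 1;
	/* .. driver allocates frag_users chunks .. */
	p->refcount -= SKETCH_BIAS - (long)frag_users;	/* drop unused bias */
	p->refcount_updates++;
}

/*
 * Guess scheme: assume the caller keeps asking for frag_size and take
 * PAGE_SIZE / frag_size references up front. A second refcount write
 * is needed only when the guess turns out to be too high.
 */
static void sketch_guess_scheme(struct sketch_page *p, unsigned int frag_size,
				unsigned int frag_users)
{
	unsigned int guess;

	assert(frag_size > 0 && frag_size <= SKETCH_PAGE_SIZE);
	guess = SKETCH_PAGE_SIZE / frag_size;
	assert(frag_users <= guess);

	p->refcount = (long)guess;
	p->refcount_updates = 1;
	if (frag_users != guess) {
		p->refcount -= (long)(guess - frag_users);
		p->refcount_updates++;
	}
}
```

In this model both schemes end with refcount equal to the number of
fragments handed out. The guess scheme skips the second write whenever
the page is fully used at the guessed size, which is the case the
suggestion is hoping to make common.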

> > > The driver doesn't seem to set skb->truesize accordingly after this
> > > change. So you're lying to the stack about how much memory each packet
> > > consumes. This is a blocker for the change.
> > >   
> > ACK. I will send out a separate patch with a Fixes tag to fix the skb
> > truesize.
> > 
> > > > To mitigate this, bypass the page_pool fragment path and force a single RX
> > > > packet per page allocation when all the following conditions are met:
> > > >   1. The system is configured with a 4K PAGE_SIZE.
> > > >   2. A processor-specific quirk is detected via SMBIOS Type 4 data.  
> > > 
> > > I don't think we want the kernel to be in the business of carrying
> > > matching on platform names and providing optimal config by default.
> > > This sort of logic needs to live in user space or the hypervisor 
> > > (which can then pass a single bit to the driver to enable the behavior)
> > >   
> > As per our internal discussion, the hypervisor cannot provide the CPU
> > version info (in VM as well as bare-metal offerings).
> 
> Why? I suppose it's much more effort for you but it's much more effort
> for the community to carry the workaround. So..
>
According to the hypervisor team, that would not solve the issue for the
bare-metal offering, so I will go ahead with the alternative solution
you suggested: "This sort of logic needs to live in user space..,
which can then pass a single bit to the driver to enable the behavior"

> > On handling it from the user side, are you suggesting introducing a new
> > ethtool private flag and having udev rules set that flag so the driver
> > switches to full-page RX buffers? Given the wide range of distros we
> > support, this might be harder to maintain/backport. 
> > 
> > Also, the DMI parsing design was influenced by other wireless network
> > drivers such as /wireless/ath/ath10k/core.c. If this approach is not
> > acceptable for the MANA driver, then we will have to take an alternate
> > route based on the discussion right above.
> 
> Plenty of ugly hacks in the kernel, it's no excuse.

Hi Jakub,

As we are still working with the HW team to root-cause the actual issue,
we would like to give users a tunable option to achieve line rate by
running with full-page RX buffers. I will send out a next version that
introduces an ethtool private flag for mana that allows the user to
force one RX buffer per page.


Regards


* Re: [PATCH net-next v4] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-03-30 19:09 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, shradhagupta, shirazsaleem,
	dipayanroy, yury.norov, kees, ssengar, gargaditya, linux-hyperv,
	netdev, linux-kernel, linux-rdma
In-Reply-To: <acK56AlPfVW8cDPe@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Tue, Mar 24, 2026 at 09:20:56AM -0700, Erni Sri Satya Vennela wrote:
> On Mon, Mar 23, 2026 at 05:44:44PM -0700, Jakub Kicinski wrote:
> > On Thu, 19 Mar 2026 00:09:13 -0700 Erni Sri Satya Vennela wrote:
> > > Add debugfs entries to expose hardware configuration and diagnostic
> > > information that aids in debugging driver initialization and runtime
> > > operations without adding noise to dmesg.
> > > 
> > > The debugfs directory creation and removal for each PCI device is
> > > integrated into mana_gd_setup() and mana_gd_cleanup_device()
> > > respectively, so that all callers (probe, remove, suspend, resume,
> > > shutdown) share a single code path.
> > > 
> > > Device-level entries (under /sys/kernel/debug/mana/<slot>/):
> > >   - num_msix_usable, max_num_queues: Max resources from hardware
> > >   - gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
> > >   - num_vports, bm_hostmode: Device configuration
> > > 
> > > Per-vPort entries (under /sys/kernel/debug/mana/<slot>/vportN/):
> > >   - port_handle: Hardware vPort handle
> > >   - max_sq, max_rq: Max queues from vPort config
> > >   - indir_table_sz: Indirection table size
> > >   - steer_rx, steer_rss, steer_update_tab, steer_cqe_coalescing:
> > >     Last applied steering configuration parameters
> > 
> > AI says:
> > 
> > > @@ -1918,15 +1930,23 @@ static int mana_gd_setup(struct pci_dev *pdev)
> > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > >  	int err;
> > >  
> > > +	if (gc->is_pf)
> > > +		gc->mana_pci_debugfs = debugfs_create_dir("0", mana_debugfs_root);
> > > +	else
> > > +		gc->mana_pci_debugfs = debugfs_create_dir(pci_slot_name(pdev->slot),
> > > +							  mana_debugfs_root);
> > 
> > If pdev->slot is NULL (which can happen for VFs in environments like generic
> > VFIO passthrough or nested KVM), will pci_slot_name(pdev->slot) cause a
> > NULL pointer dereference?
> > 
> > Also, could this naming scheme cause name collisions? If multiple PFs are
> > present, they would all try to use "0". Similarly, VFs across different
> > PCI domains or buses might share the same physical slot identifier, leading
> > to -EEXIST errors. Would it be safer to use the unique PCI BDF via
> > pci_name(pdev) instead?
> Yes, that is a better way to handle it. I will use pci_name(). We can
> remove the if-else and use a single call for both cases.
> > 
> > > @@ -3141,6 +3149,24 @@ static int mana_init_port(struct net_device *ndev)
> > >  	eth_hw_addr_set(ndev, apc->mac_addr);
> > >  	sprintf(vport, "vport%d", port_idx);
> > >  	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
> > > +
> > > +	debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
> > > +			   &apc->port_handle);
> > 
> > When operations like changing the MTU or setting an XDP program trigger a
> > detach/attach cycle, mana_detach() invokes mana_cleanup_port_context(),
> > which recursively removes the apc->mana_port_debugfs directory.
> > During re-attachment, mana_attach() calls mana_init_port(), which
> > recreates the directory and the new files added in this patch. However, the
> > pre-existing current_speed file (created in mana_probe_port()) is not
> > recreated here.
> > 
> > Does this cause the current_speed file to be permanently lost after a
> > detach/attach cycle? Should the creation of current_speed be moved to
> > mana_init_port() so it survives the cycle?
> Yes that is correct.
> 
> Since these issues are pre-existing and were not introduced by my patch,
> I plan to send the fixes as a separate patch with a Fixes tag.
> > -- 
> > pw-bot: cr


Hi Jakub,

Just a quick follow-up on this. Since these issues were pre-existing and
not introduced by this patch, would you prefer that I send them as a
separate fix patch, or fold the fixes into the current patch?
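
On the naming point raised earlier in the thread, the user-space sketch
below shows why a domain:bus:device.function name of the kind
pci_name() returns avoids the collisions that a bare slot identifier
can produce. It is an illustration only: struct sketch_pci_addr and the
sketch_* helpers are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_BDF_LEN 16	/* "0000:00:00.0" plus NUL, with slack */

/* Hypothetical stand-in for the address parts of a PCI device. */
struct sketch_pci_addr {
	uint16_t domain;
	uint8_t bus;
	uint8_t devfn;	/* device number in bits 7:3, function in bits 2:0 */
};

/* Format the address as domain:bus:device.function. */
static int sketch_bdf_name(const struct sketch_pci_addr *a, char *buf,
			   size_t len)
{
	assert(a != NULL && buf != NULL);
	return snprintf(buf, len, "%04x:%02x:%02x.%d",
			(unsigned int)a->domain, (unsigned int)a->bus,
			(unsigned int)((a->devfn >> 3) & 0x1f),
			(int)(a->devfn & 0x07));
}

/* True if two devices would end up with the same directory name. */
static bool sketch_names_collide(const struct sketch_pci_addr *a,
				 const struct sketch_pci_addr *b)
{
	char na[SKETCH_BDF_LEN], nb[SKETCH_BDF_LEN];

	sketch_bdf_name(a, na, sizeof(na));
	sketch_bdf_name(b, nb, sizeof(nb));
	return strcmp(na, nb) == 0;
}
```

Two devices that share device number 2 but sit on different buses get
distinct names here, whereas a name derived from the device number
alone would be identical for both.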

Thanks,
Vennela


* Re: [PATCH 05/12] PCI: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-30 17:38 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
	Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
	Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
	Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Harald Freudenberger, Holger Dengler, Mark Brown,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Christophe Leroy (CS GROUP), linux-kernel, driver-core,
	linuxppc-dev, linux-hyperv, linux-pci, platform-driver-x86,
	linux-arm-msm, linux-remoteproc, linux-s390, linux-spi,
	virtualization, kvm, xen-devel, linux-arm-kernel,
	Danilo Krummrich, Gui-Dong Han
In-Reply-To: <20260324005919.2408620-6-dakr@kernel.org>

(Cc: Jason)

On Tue Mar 24, 2026 at 1:59 AM CET, Danilo Krummrich wrote:
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index d43745fe4c84..460852f79f29 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1987,9 +1987,8 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
>  	    pdev->is_virtfn && physfn == vdev->pdev) {
>  		pci_info(vdev->pdev, "Captured SR-IOV VF %s driver_override\n",
>  			 pci_name(pdev));
> -		pdev->driver_override = kasprintf(GFP_KERNEL, "%s",
> -						  vdev->vdev.ops->name);
> -		WARN_ON(!pdev->driver_override);
> +		WARN_ON(device_set_driver_override(&pdev->dev,
> +						   vdev->vdev.ops->name));

Technically, this is a change in behavior. If vdev->vdev.ops->name is NULL, it
will trigger the WARN_ON(), whereas before it would have just written "(null)"
into driver_override.

I assume that vfio_pci_core drivers are expected to set the name in struct
vfio_device_ops in the first place and this code (silently) relies on this
invariant?

Alex, Jason: Should we keep this hunk above as is and check for a proper name in
struct vfio_device_ops in vfio_pci_core_register_device() with a subsequent
patch?

>  	} else if (action == BUS_NOTIFY_BOUND_DRIVER &&
>  		   pdev->is_virtfn && physfn == vdev->pdev) {
>  		struct pci_driver *drv = pci_dev_driver(pdev);

^ permalink raw reply

* Re: [PATCH 05/12] PCI: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-30 16:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
	Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
	Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
	Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Harald Freudenberger, Holger Dengler, Mark Brown,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Alex Williamson, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Christophe Leroy (CS GROUP), linux-kernel,
	driver-core, linuxppc-dev, linux-hyperv, linux-pci,
	platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
	linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
	Gui-Dong Han
In-Reply-To: <20260326180825.GA1330769@bhelgaas>

On Thu Mar 26, 2026 at 7:08 PM CET, Bjorn Helgaas wrote:
> On Tue, Mar 24, 2026 at 01:59:09AM +0100, Danilo Krummrich wrote:
>> When a driver is probed through __driver_attach(), the bus' match()
>> callback is called without the device lock held, thus accessing the
>> driver_override field without a lock, which can cause a UAF.
>> 
>> Fix this by using the driver-core driver_override infrastructure taking
>> care of proper locking internally.
>> 
>> Note that calling match() from __driver_attach() without the device lock
>> held is intentional. [1]
>> 
>> Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
>> Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
>> Fixes: 782a985d7af2 ("PCI: Introduce new device binding path using pci_dev.driver_override")
>> Signed-off-by: Danilo Krummrich <dakr@kernel.org>
>> ---
>>  drivers/pci/pci-driver.c           | 11 +++++++----
>>  drivers/pci/pci-sysfs.c            | 28 ----------------------------
>>  drivers/pci/probe.c                |  1 -
>>  include/linux/pci.h                |  6 ------
>
> For the above:
>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
>
> "driver_override" is mentioned several places in
> Documentation/ABI/testing/sysfs-bus-*.  I assume this series doesn't
> change the behavior documented there?

Correct, none of this is altered.

^ permalink raw reply

* Re: [PATCH rdma v3] RDMA/mana_ib: Disable RX steering on RSS QP destroy
From: Leon Romanovsky @ 2026-03-30 13:00 UTC (permalink / raw)
  To: Konstantin Taranov, Jakub Kicinski, David S . Miller, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Jason Gunthorpe, Haiyang Zhang,
	K . Y . Srinivasan, Wei Liu, Dexuan Cui, Long Li
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel,
	stable
In-Reply-To: <20260325194100.1929056-1-longli@microsoft.com>


On Wed, 25 Mar 2026 12:40:57 -0700, Long Li wrote:
> When an RSS QP is destroyed (e.g. DPDK exit), mana_ib_destroy_qp_rss()
> destroys the RX WQ objects but does not disable vPort RX steering in
> firmware. This leaves stale steering configuration that still points to
> the destroyed RX objects.
> 
> If traffic continues to arrive (e.g. peer VM is still transmitting) and
> the VF interface is subsequently brought up (mana_open), the firmware
> may deliver completions using stale CQ IDs from the old RX objects.
> These CQ IDs can be reused by the ethernet driver for new TX CQs,
> causing RX completions to land on TX CQs:
> 
> [...]

Applied, thanks!

[1/1] RDMA/mana_ib: Disable RX steering on RSS QP destroy
      https://git.kernel.org/rdma/rdma/c/187c8bd5e571f5

Best regards,
-- 
Leon Romanovsky <leon@kernel.org>


^ permalink raw reply

* Re: [PATCH v2 03/16] RDMA: Consolidate patterns with sizeof() to ib_copy_validate_udata_in()
From: Bernard Metzler @ 2026-03-29 11:59 UTC (permalink / raw)
  To: Jason Gunthorpe, Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bryan Tan, Cheng Xu,
	Gal Pressman, Junxian Huang, Kai Shen, Konstantin Taranov,
	Krzysztof Czurylo, Leon Romanovsky, linux-hyperv, linux-rdma,
	Michal Kalderon, Michael Margolin, Nelson Escobar, Satish Kharat,
	Selvin Xavier, Yossi Leybovich, Chengchang Tang, Tatyana Nikolova,
	Vishnu Dasa, Yishai Hadas, Zhu Yanjun
  Cc: Long Li, patches
In-Reply-To: <3-v2-f4ac6f418bd6+12c5-rdma_udata_req_jgg@nvidia.com>

On 25.03.2026 22:26, Jason Gunthorpe wrote:
> Similar to the prior patch, these patterns are open coding an
> offsetofend() using sizeof(), which targets the last member of the
> current struct.
> 
> Reviewed-by: Long Li <longli@microsoft.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/infiniband/hw/mana/qp.c       | 27 +++++++++------------------
>   drivers/infiniband/hw/mana/wq.c       | 10 ++--------
>   drivers/infiniband/hw/mlx4/main.c     |  6 ++----
>   drivers/infiniband/hw/mlx5/cq.c       |  2 +-
>   drivers/infiniband/sw/rxe/rxe_verbs.c | 13 ++-----------
>   drivers/infiniband/sw/siw/siw_verbs.c |  6 +-----
>   6 files changed, 17 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 82f84f7ad37a90..69c8d4f7a1f46b 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -111,16 +111,12 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>   	u32 port;
>   	int ret;
>   
> -	if (!udata || udata->inlen < sizeof(ucmd))
> +	if (!udata)
>   		return -EINVAL;
>   
> -	ret = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> -	if (ret) {
> -		ibdev_dbg(&mdev->ib_dev,
> -			  "Failed copy from udata for create rss-qp, err %d\n",
> -			  ret);
> +	ret = ib_copy_validate_udata_in(udata, ucmd, port);
> +	if (ret)
>   		return ret;
> -	}
>   
>   	if (attr->cap.max_recv_wr > mdev->adapter_caps.max_qp_wr) {
>   		ibdev_dbg(&mdev->ib_dev,
> @@ -282,15 +278,12 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
>   	u32 port;
>   	int err;
>   
> -	if (!mana_ucontext || udata->inlen < sizeof(ucmd))
> +	if (!mana_ucontext)
>   		return -EINVAL;
>   
> -	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> -	if (err) {
> -		ibdev_dbg(&mdev->ib_dev,
> -			  "Failed to copy from udata create qp-raw, %d\n", err);
> +	err = ib_copy_validate_udata_in(udata, ucmd, port);
> +	if (err)
>   		return err;
> -	}
>   
>   	if (attr->cap.max_send_wr > mdev->adapter_caps.max_qp_wr) {
>   		ibdev_dbg(&mdev->ib_dev,
> @@ -535,17 +528,15 @@ static int mana_ib_create_rc_qp(struct ib_qp *ibqp, struct ib_pd *ibpd,
>   	u64 flags = 0;
>   	u32 doorbell;
>   
> -	if (!udata || udata->inlen < sizeof(ucmd))
> +	if (!udata)
>   		return -EINVAL;
>   
>   	mana_ucontext = rdma_udata_to_drv_context(udata, struct mana_ib_ucontext, ibucontext);
>   	doorbell = mana_ucontext->doorbell;
>   	flags = MANA_RC_FLAG_NO_FMR;
> -	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> -	if (err) {
> -		ibdev_dbg(&mdev->ib_dev, "Failed to copy from udata, %d\n", err);
> +	err = ib_copy_validate_udata_in(udata, ucmd, queue_size);
> +	if (err)
>   		return err;
> -	}
>   
>   	for (i = 0, j = 0; i < MANA_RC_QUEUE_TYPE_MAX; ++i) {
>   		/* skip FMR for user-level RC QPs */
> diff --git a/drivers/infiniband/hw/mana/wq.c b/drivers/infiniband/hw/mana/wq.c
> index 6206244f762e42..aceeea7f17b339 100644
> --- a/drivers/infiniband/hw/mana/wq.c
> +++ b/drivers/infiniband/hw/mana/wq.c
> @@ -15,15 +15,9 @@ struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
>   	struct mana_ib_wq *wq;
>   	int err;
>   
> -	if (udata->inlen < sizeof(ucmd))
> -		return ERR_PTR(-EINVAL);
> -
> -	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> -	if (err) {
> -		ibdev_dbg(&mdev->ib_dev,
> -			  "Failed to copy from udata for create wq, %d\n", err);
> +	err = ib_copy_validate_udata_in(udata, ucmd, reserved);
> +	if (err)
>   		return ERR_PTR(err);
> -	}
>   
>   	wq = kzalloc_obj(*wq);
>   	if (!wq)
> diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
> index 73e17b4339eb60..16e4cffbd7a84d 100644
> --- a/drivers/infiniband/hw/mlx4/main.c
> +++ b/drivers/infiniband/hw/mlx4/main.c
> @@ -50,6 +50,7 @@
>   #include <rdma/ib_user_verbs.h>
>   #include <rdma/ib_addr.h>
>   #include <rdma/ib_cache.h>
> +#include <rdma/uverbs_ioctl.h>
>   
>   #include <net/bonding.h>
>   
> @@ -445,10 +446,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
>   	struct mlx4_clock_params clock_params;
>   
>   	if (uhw->inlen) {
> -		if (uhw->inlen < sizeof(cmd))
> -			return -EINVAL;
> -
> -		err = ib_copy_from_udata(&cmd, uhw, sizeof(cmd));
> +		err = ib_copy_validate_udata_in(uhw, cmd, reserved);
>   		if (err)
>   			return err;
>   
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index 643b3b7d387834..f5e75e51c6763f 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -1229,7 +1229,7 @@ static int resize_user(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
>   	struct ib_umem *umem;
>   	int err;
>   
> -	err = ib_copy_from_udata(&ucmd, udata, sizeof(ucmd));
> +	err = ib_copy_validate_udata_in(udata, ucmd, reserved1);
>   	if (err)
>   		return err;
>   
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index fe41362c51444c..c9fd40bfa09eb2 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -452,18 +452,9 @@ static int rxe_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
>   	int err;
>   
>   	if (udata) {
> -		if (udata->inlen < sizeof(cmd)) {
> -			err = -EINVAL;
> -			rxe_dbg_srq(srq, "malformed udata\n");
> +		err = ib_copy_validate_udata_in(udata, cmd, mmap_info_addr);
> +		if (err)
>   			goto err_out;
> -		}
> -
> -		err = ib_copy_from_udata(&cmd, udata, sizeof(cmd));
> -		if (err) {
> -			err = -EFAULT;
> -			rxe_dbg_srq(srq, "unable to read udata\n");
> -			goto err_out;
> -		}
>   	}
>   
>   	err = rxe_srq_chk_attr(rxe, srq, attr, mask);
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
> index ef504db8f2b48b..1e1d262a4ae2db 100644
> --- a/drivers/infiniband/sw/siw/siw_verbs.c
> +++ b/drivers/infiniband/sw/siw/siw_verbs.c
> @@ -1373,11 +1373,7 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
>   		struct siw_uresp_reg_mr uresp = {};
>   		struct siw_mem *mem = mr->mem;
>   
> -		if (udata->inlen < sizeof(ureq)) {
> -			rv = -EINVAL;
> -			goto err_out;
> -		}
> -		rv = ib_copy_from_udata(&ureq, udata, sizeof(ureq));
> +		rv = ib_copy_validate_udata_in(udata, ureq, pad);
>   		if (rv)
>   			goto err_out;
>   
Looks good for siw driver. Thank you.

Reviewed-by: Bernard Metzler <bernard.metzler@linux.dev>
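
For readers following the commit message, the relationship between sizeof() and offsetofend() on the last struct member can be sketched in plain C. This is only an illustration under stated assumptions: EXAMPLE_OFFSETOFEND, struct example_ucmd and example_check_inlen() are hypothetical names, and the sketch does not reproduce how ib_copy_validate_udata_in() is actually implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset of the first byte past MEMBER, in the style of offsetofend(). */
#define EXAMPLE_OFFSETOFEND(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/* Hypothetical user command layout; the field names are illustrative only. */
struct example_ucmd {
	uint64_t buf_addr;
	uint32_t buf_size;
	uint32_t reserved;	/* last member, no trailing padding */
};

/*
 * Return 0 if the caller supplied at least the bytes up to and including
 * the named last member, -1 otherwise.
 */
static int example_check_inlen(size_t inlen)
{
	if (inlen < EXAMPLE_OFFSETOFEND(struct example_ucmd, reserved))
		return -1;
	return 0;
}
```

When the struct has no trailing padding, EXAMPLE_OFFSETOFEND() on the last member equals sizeof() of the whole struct, which is why an "inlen < sizeof(ucmd)" check can be expressed by naming the last member instead.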

^ permalink raw reply

* Re: [PATCH 02/12] bus: fsl-mc: use generic driver_override infrastructure
From: Christophe Leroy (CS GROUP) @ 2026-03-28 12:10 UTC (permalink / raw)
  To: Ioana Ciornei, Danilo Krummrich
  Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki, Nipun Gupta,
	Nikhil Agarwal, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Bjorn Helgaas, Armin Wolf, Bjorn Andersson,
	Mathieu Poirier, Vineeth Vijayan, Peter Oberparleiter,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Harald Freudenberger,
	Holger Dengler, Mark Brown, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Alex Williamson,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	linux-kernel, driver-core, linuxppc-dev, linux-hyperv, linux-pci,
	platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
	linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
	Gui-Dong Han
In-Reply-To: <cvcetxkxjq2tz6n2vsofhyzove3qdi2e4r6rq6yxou3joejk2h@rmt5ygav7ssu>



On 25/03/2026 at 13:01, Ioana Ciornei wrote:
> On Tue, Mar 24, 2026 at 01:59:06AM +0100, Danilo Krummrich wrote:
>> When a driver is probed through __driver_attach(), the bus' match()
>> callback is called without the device lock held, thus accessing the
>> driver_override field without a lock, which can cause a UAF.
>>
>> Fix this by using the driver-core driver_override infrastructure taking
>> care of proper locking internally.
>>
>> Note that calling match() from __driver_attach() without the device lock
>> held is intentional. [1]
>>
>> Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
>> Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
>> Fixes: 1f86a00c1159 ("bus/fsl-mc: add support for 'driver_override' in the mc-bus")
>> Signed-off-by: Danilo Krummrich <dakr@kernel.org>
> 
> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
> 


Applied, thanks

^ permalink raw reply

* [PATCH v2] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: Junrui Luo @ 2026-03-28  9:18 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Nuno Das Neves, Anirudh Rayabharam, Stanislav Kinsburskii,
	Mukesh Rathor
  Cc: Muminul Islam, Praveen K Paladugu, Jinank Jain, linux-hyperv,
	linux-kernel, Yuhao Jiang, Roman Kisel, stable, Junrui Luo

mshv_partition_create_region() computes mem->guest_pfn + nr_pages to
check for overlapping regions without verifying u64 wraparound. A
sufficiently large guest_pfn can cause the addition to overflow,
bypassing the overlap check and allowing creation of regions that wrap
around the address space.

Fix by using check_add_overflow() to reject such regions early, and
validate that the region end does not exceed MAX_PHYSMEM_BITS. These
checks also protect downstream callers that compute start_gfn +
nr_pages on stored regions without overflow guards.

Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Suggested-by: Roman Kisel <romank@linux.microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
---
Changes in v2:
- Add a maximum check suggested by Roman Kisel
- Link to v1: https://lore.kernel.org/all/SYBPR01MB7881689C0F58149DD986A6D1AF49A@SYBPR01MB7881.ausprd01.prod.outlook.com/
---
 drivers/hv/mshv_root_main.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 6f42423f7faa..32826247dbce 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1174,11 +1174,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 {
 	struct mshv_mem_region *rg;
 	u64 nr_pages = HVPFN_DOWN(mem->size);
+	u64 new_region_end;
+
+	/* Reject regions whose end address would wrap around */
+	if (check_add_overflow(mem->guest_pfn, nr_pages, &new_region_end))
+		return -EOVERFLOW;
+
+	/* Reject regions beyond the maximum physical address */
+	if (new_region_end > HVPFN_DOWN(1ULL << MAX_PHYSMEM_BITS))
+		return -EINVAL;
 
 	/* Reject overlapping regions */
 	spin_lock(&partition->pt_mem_regions_lock);
 	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
-		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
+		if (new_region_end <= rg->start_gfn ||
 		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
 			continue;
 		spin_unlock(&partition->pt_mem_regions_lock);

---
base-commit: c369299895a591d96745d6492d4888259b004a9e
change-id: 20260328-fixes-0296eb3dbb52

Best regards,
-- 
Junrui Luo <moonafterrain@outlook.com>
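
The wraparound described in the commit message can be reproduced outside the kernel. The following is a minimal user-space sketch, assuming a GCC- or Clang-compatible compiler: __builtin_add_overflow() stands in for check_add_overflow(), and EXAMPLE_MAX_PFN, example_region_end() and example_regions_overlap() are hypothetical names, with the limit chosen arbitrarily for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed upper bound on page frame numbers; not the kernel's real limit. */
#define EXAMPLE_MAX_PFN (UINT64_C(1) << 40)

/*
 * Compute the exclusive end of a region without wrapping.
 * Returns false if start_pfn + nr_pages overflows u64 or exceeds
 * EXAMPLE_MAX_PFN; on success the end is stored in *end.
 */
static bool example_region_end(uint64_t start_pfn, uint64_t nr_pages,
			       uint64_t *end)
{
	if (__builtin_add_overflow(start_pfn, nr_pages, end))
		return false;
	return *end <= EXAMPLE_MAX_PFN;
}

/* Half-open interval overlap test: [a_start, a_end) vs [b_start, b_end). */
static bool example_regions_overlap(uint64_t a_start, uint64_t a_end,
				    uint64_t b_start, uint64_t b_end)
{
	return !(a_end <= b_start || b_end <= a_start);
}
```

With start_pfn = UINT64_MAX - 1 and nr_pages = 4, an unchecked sum wraps to 2, so a comparison such as "end <= rg->start_gfn" would wrongly report no overlap with almost any existing region. Rejecting the request before the loop avoids relying on that wrapped value.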


^ permalink raw reply related

* Re: [PATCH net-next v2] net: mana: Use at least SZ_4K in doorbell ID range check
From: patchwork-bot+netdevbpf @ 2026-03-28  4:00 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, horms, shradhagupta, kotaranov,
	dipayanroy, yury.norov, kees, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260325180423.1923060-1-ernis@linux.microsoft.com>

Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 25 Mar 2026 11:04:17 -0700 you wrote:
> mana_gd_ring_doorbell() accesses offsets up to DOORBELL_OFFSET_EQ
> (0xFF8) + 8 bytes = 4KB within each doorbell page. A db_page_size
> smaller than SZ_4K is fundamentally incompatible with the driver:
> doorbell pages would overlap and the device cannot function correctly.
> 
> Validate db_page_size at the source and fail the
> probe early if the value is below SZ_4K. This ensures the doorbell ID
> range check in mana_gd_register_device() can rely on db_page_size
> being valid.
> 
> [...]

Here is the summary with links:
  - [net-next,v2] net: mana: Use at least SZ_4K in doorbell ID range check
    https://git.kernel.org/netdev/net-next/c/fb4b4a05aeeb

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
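
The arithmetic in the quoted commit message (0xFF8 + 8 bytes = 4KB) can be written as a small check. This is a sketch only: the EXAMPLE_* constants and example_db_page_size_ok() are hypothetical names, the 0xFF8 offset is the value quoted above, and the 8-byte width is taken from the same sentence rather than from the driver source.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_SZ_4K              0x1000u
#define EXAMPLE_DOORBELL_OFFSET_EQ 0xFF8u	/* offset quoted in the commit message */
#define EXAMPLE_DOORBELL_WIDTH     8u		/* bytes accessed at that offset */

/* A doorbell page is usable only if the highest access fits inside it. */
static bool example_db_page_size_ok(uint32_t db_page_size)
{
	return db_page_size >=
	       EXAMPLE_DOORBELL_OFFSET_EQ + EXAMPLE_DOORBELL_WIDTH;
}
```

Any page size below 4096 bytes fails the check, which matches the commit message's reasoning for failing the probe early.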



^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH net-next v2] net: mana: Set default number of queues to 16
From: Long Li @ 2026-03-28  0:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Konstantin Taranov, David S . Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Jason Gunthorpe, Leon Romanovsky, Haiyang Zhang,
	KY Srinivasan, Wei Liu, Dexuan Cui, Simon Horman,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260327165512.08f7b6f9@kernel.org>

> On Fri, 27 Mar 2026 04:00:31 +0000 Long Li wrote:
> >   We considered netif_get_num_default_rss_queues() but chose a fixed
> >   default based on our performance testing. On Azure VMs, typical
> >   workloads plateau at around 16 queues - adding more queues beyond that
> >   doesn't improve throughput but increases memory usage and
> >   interrupt overhead.
> >
> >   netif_get_num_default_rss_queues() would return 32-64 on large VMs
> >   (64-128 vCPUs), which wastes resources without benefit.
> >
> >   That said, I agree that completely ignoring the core-based heuristic isn't
> >   ideal for consistency. One option is to use
> >   netif_get_num_default_rss_queues() but clamp it to a maximum of
> >   MANA_DEF_NUM_QUEUES (16), so small VMs still get enough queues and
> >   large VMs don't over-allocate. Something like:
> >
> >    apc->num_queues = min(netif_get_num_default_rss_queues(),
> >                          MANA_DEF_NUM_QUEUES);
> >    apc->num_queues = min(apc->num_queues, gc->max_num_queues);
> >
> >   For reference, it seems mlx4 does something similar - it caps at
> >   DEF_RX_RINGS (16) regardless of core count.
> 
> mlx4 is a bit ancient. And mlx5 does the wrong thing, which is why I'm so
> sensitive to this issue :(
> 
> >   Do you want me to send a v2?
> 
> Please send a follow up, let's leave this patch be and make an incremental
> change.
> 
> Thanks!

I will send a follow-up patch.

Thanks,
Long
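
The clamping proposed in the quoted text can be sketched as a pure function, which makes the behaviour on small and large VMs easy to check. The names below (EXAMPLE_DEF_NUM_QUEUES, example_min(), example_default_queues()) are hypothetical, and the core-based default is passed in as a plain argument instead of calling netif_get_num_default_rss_queues().

```c
/* Hypothetical cap mirroring the value 16 discussed in the thread. */
#define EXAMPLE_DEF_NUM_QUEUES 16u

static unsigned int example_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Take the core-based default, cap it at EXAMPLE_DEF_NUM_QUEUES, then
 * cap it again at what the hardware reports as its maximum.
 */
static unsigned int example_default_queues(unsigned int core_based_default,
					   unsigned int hw_max_queues)
{
	unsigned int n = example_min(core_based_default,
				     EXAMPLE_DEF_NUM_QUEUES);

	return example_min(n, hw_max_queues);
}
```

Under these assumptions a core-based default of 4 stays at 4, a default of 64 is reduced to 16, and a hardware maximum of 8 wins over both.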

^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH net-next v2] net: mana: Set default number of queues to 16
From: Jakub Kicinski @ 2026-03-27 23:55 UTC (permalink / raw)
  To: Long Li
  Cc: Konstantin Taranov, David S . Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Jason Gunthorpe, Leon Romanovsky, Haiyang Zhang,
	KY Srinivasan, Wei Liu, Dexuan Cui, Simon Horman,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB668314B1AF002E40B379F1C0CE57A@SA1PR21MB6683.namprd21.prod.outlook.com>

On Fri, 27 Mar 2026 04:00:31 +0000 Long Li wrote:
>   We considered netif_get_num_default_rss_queues() but chose a fixed default based on our performance testing. On Azure VMs, typical
>   workloads plateau at around 16 queues - adding more queues beyond that doesn't improve throughput but increases memory usage and
>   interrupt overhead.
> 
>   netif_get_num_default_rss_queues() would return 32-64 on large VMs (64-128 vCPUs), which wastes resources without benefit.
> 
>   That said, I agree that completely ignoring the core-based heuristic isn't ideal for consistency. One option is to use
>   netif_get_num_default_rss_queues() but clamp it to a maximum of MANA_DEF_NUM_QUEUES (16), so small VMs still get enough queues and
>   large VMs don't over-allocate. Something like:
> 
>    apc->num_queues = min(netif_get_num_default_rss_queues(), MANA_DEF_NUM_QUEUES);
>    apc->num_queues = min(apc->num_queues, gc->max_num_queues);
> 
>   For reference, it seems mlx4 does something similar - it caps at DEF_RX_RINGS (16) regardless of core count.

mlx4 is a bit ancient. And mlx5 does the wrong thing, which is why 
I'm so sensitive to this issue :(

>   Do you want me to send a v2?

Please send a follow up, let's leave this patch be and make an
incremental change. 

Thanks!

^ permalink raw reply

* Re: [PATCH] net: mana: fix use-after-free in add_adev() error path
From: Hardik Garg @ 2026-03-27 21:27 UTC (permalink / raw)
  To: Guangshuo Li, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Saurabh Sengar,
	Erni Sri Satya Vennela, Shradha Gupta, Dipayaan Roy, Aditya Garg,
	Shiraz Saleem, Leon Romanovsky, linux-hyperv, netdev,
	linux-kernel
  Cc: stable
In-Reply-To: <20260318154041.638747-1-lgs201920130244@gmail.com>



On 3/18/2026 8:40 AM, Guangshuo Li wrote:
> If auxiliary_device_add() fails, add_adev() calls
> auxiliary_device_uninit(adev), whose release callback adev_release()
> frees the containing struct mana_adev.
> 
> The current error path then falls through to init_fail and accesses
> adev->id. Since adev is embedded in struct mana_adev, this may lead
> to a use-after-free.
> 
> Fix it by storing the allocated auxiliary device id in a local
> variable and using that saved id in the cleanup path after
> auxiliary_device_uninit().
> 
> Fixes: a69839d4327d ("net: mana: Add support for auxiliary device")
> Cc: stable@vger.kernel.org
> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 1ad154f9db1a..70d71594c599 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -3362,6 +3362,7 @@ static int add_adev(struct gdma_dev *gd, const char *name)
>  {
>  	struct auxiliary_device *adev;
>  	struct mana_adev *madev;
> +	int id;
>  	int ret;
>  
>  	madev = kzalloc(sizeof(*madev), GFP_KERNEL);
> @@ -3372,7 +3373,8 @@ static int add_adev(struct gdma_dev *gd, const char *name)
>  	ret = mana_adev_idx_alloc();
>  	if (ret < 0)
>  		goto idx_fail;
> -	adev->id = ret;
> +	id = ret;
> +	adev->id = id;
>  
>  	adev->name = name;
>  	adev->dev.parent = gd->gdma_context->dev;
> @@ -3398,7 +3400,7 @@ static int add_adev(struct gdma_dev *gd, const char *name)
>  	auxiliary_device_uninit(adev);
>  
>  init_fail:
> -	mana_adev_idx_free(adev->id);
> +	mana_adev_idx_free(id);
>  
>  idx_fail:
>  	kfree(madev);

Reviewed-by: Hardik Garg <hargar@linux.microsoft.com>


Thanks,
Hardik
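
The pattern behind the fix, copying a field into a local before a call that may free the containing object, can be shown in isolation. This is a user-space sketch with hypothetical names (struct example_dev, example_release(), example_idx_free(), example_add_failing()); it models only the "save the id first" step and not the rest of add_adev().

```c
#include <stdlib.h>

/* Hypothetical object whose release callback frees the whole allocation. */
struct example_dev {
	int id;
};

/* Records the last id handed back to the (pretend) index allocator. */
static int example_last_freed_id = -1;

static void example_idx_free(int id)
{
	example_last_freed_id = id;
}

/* Stand-in for a release callback: dev must not be touched afterwards. */
static void example_release(struct example_dev *dev)
{
	free(dev);
}

/* Simulates the error path in which the add step fails; always returns -1. */
static int example_add_failing(int allocated_id)
{
	struct example_dev *dev = calloc(1, sizeof(*dev));
	int id;

	if (!dev)
		return -1;

	id = allocated_id;	/* keep a copy outside the object */
	dev->id = id;

	example_release(dev);	/* pretend the add step failed; dev is freed */

	example_idx_free(id);	/* uses the local copy, never dev->id */
	return -1;
}
```

Reading dev->id after example_release() would be the use-after-free; the local copy makes the cleanup independent of the object's lifetime.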

^ permalink raw reply

* [PATCH 6/6] mshv: unmap debugfs stats pages on kexec
From: Jork Loeser @ 2026-03-27 20:19 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Roman Kisel,
	Michael Kelley, linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260327201920.2100427-1-jloeser@linux.microsoft.com>

On L1VH, debugfs stats pages are overlay pages: the kernel allocates
them and registers the GPAs with the hypervisor via
HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
hypervisor across kexec. If the kexec'd kernel reuses those physical
pages, the hypervisor's overlay semantics cause a machine check
exception.

Fix this by calling mshv_debugfs_exit() from the reboot notifier,
which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
kexec. This releases the overlay bindings so the physical pages can be
safely reused. Guard mshv_debugfs_exit() against being called when
init failed.

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_debugfs.c   | 7 ++++++-
 drivers/hv/mshv_root_main.c | 1 +
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
index ebf2549eb44d..f9a4499cf8f3 100644
--- a/drivers/hv/mshv_debugfs.c
+++ b/drivers/hv/mshv_debugfs.c
@@ -676,8 +676,10 @@ int __init mshv_debugfs_init(void)
 
 	mshv_debugfs = debugfs_create_dir("mshv", NULL);
 	if (IS_ERR(mshv_debugfs)) {
+		err = PTR_ERR(mshv_debugfs);
+		mshv_debugfs = NULL;
 		pr_err("%s: failed to create debugfs directory\n", __func__);
-		return PTR_ERR(mshv_debugfs);
+		return err;
 	}
 
 	if (hv_root_partition()) {
@@ -712,6 +714,9 @@ int __init mshv_debugfs_init(void)
 
 void mshv_debugfs_exit(void)
 {
+	if (!mshv_debugfs)
+		return;
+
 	mshv_debugfs_parent_partition_remove();
 
 	if (hv_root_partition()) {
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 281f530b68a9..7038fd830646 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2252,6 +2252,7 @@ root_scheduler_deinit(void)
 static int mshv_reboot_notify(struct notifier_block *nb,
 			      unsigned long code, void *unused)
 {
+	mshv_debugfs_exit();
 	cpuhp_remove_state(mshv_cpuhp_online);
 	return 0;
 }
-- 
2.43.0
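
The guard added to mshv_debugfs_exit() follows a common pattern: the init path leaves the handle NULL on failure, and the exit path returns early when it sees NULL. A minimal user-space sketch, with hypothetical names throughout, could look like this. Clearing the handle inside the exit function is an extra assumption of the sketch (it makes repeated calls harmless) and is not something the diff above shows.

```c
#include <stddef.h>

static int example_token;		/* dummy object the handle points at */
static void *example_dir;		/* NULL means "not initialised" */
static int example_teardown_calls;	/* counts real teardown work */

/* Returns 0 on success; on failure leaves example_dir NULL and returns -1. */
static int example_init(int simulate_failure)
{
	if (simulate_failure) {
		example_dir = NULL;
		return -1;
	}
	example_dir = &example_token;
	return 0;
}

/* Safe to call after a failed init, and safe to call more than once. */
static void example_exit(void)
{
	if (!example_dir)
		return;

	example_teardown_calls++;
	example_dir = NULL;	/* assumption: make a second call a no-op */
}
```

With this shape, a caller such as a reboot notifier does not need to know whether initialisation succeeded before invoking the exit path.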


^ permalink raw reply related

* [PATCH 5/6] mshv: clean up SynIC state on kexec for L1VH
From: Jork Loeser @ 2026-03-27 20:19 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Roman Kisel,
	Michael Kelley, linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260327201920.2100427-1-jloeser@linux.microsoft.com>

Register the mshv reboot notifier for all parent partitions, not just
root. Previously the notifier was gated on hv_root_partition(), so on
L1VH (where hv_root_partition() is false) SINT0, SINT5, and SIRBP were
never cleaned up before kexec. The kexec'd kernel then inherited stale
unmasked SINTs and an enabled SIRBP pointing to freed memory.

The L1VH SIRBP also needs special handling: unlike the root partition
where the hypervisor provides the SIRBP page, L1VH must allocate its
own page and program the GPA into the MSR. Add this allocation to
mshv_synic_init() and the corresponding free to mshv_synic_cleanup().

Remove the unnecessary mshv_root_partition_init/exit wrappers and
register the reboot notifier directly in mshv_parent_partition_init().
Make mshv_reboot_nb static since it no longer needs external linkage.

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c | 21 ++++-----------------
 drivers/hv/mshv_synic.c     | 37 ++++++++++++++++++++++++++++++-------
 2 files changed, 34 insertions(+), 24 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e6509c980763..281f530b68a9 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2256,20 +2256,10 @@ static int mshv_reboot_notify(struct notifier_block *nb,
 	return 0;
 }
 
-struct notifier_block mshv_reboot_nb = {
+static struct notifier_block mshv_reboot_nb = {
 	.notifier_call = mshv_reboot_notify,
 };
 
-static void mshv_root_partition_exit(void)
-{
-	unregister_reboot_notifier(&mshv_reboot_nb);
-}
-
-static int __init mshv_root_partition_init(struct device *dev)
-{
-	return register_reboot_notifier(&mshv_reboot_nb);
-}
-
 static int __init mshv_init_vmm_caps(struct device *dev)
 {
 	int ret;
@@ -2339,8 +2329,7 @@ static int __init mshv_parent_partition_init(void)
 	if (ret)
 		goto remove_cpu_state;
 
-	if (hv_root_partition())
-		ret = mshv_root_partition_init(dev);
+	ret = register_reboot_notifier(&mshv_reboot_nb);
 	if (ret)
 		goto remove_cpu_state;
 
@@ -2368,8 +2357,7 @@ static int __init mshv_parent_partition_init(void)
 deinit_root_scheduler:
 	root_scheduler_deinit();
 exit_partition:
-	if (hv_root_partition())
-		mshv_root_partition_exit();
+	unregister_reboot_notifier(&mshv_reboot_nb);
 remove_cpu_state:
 	cpuhp_remove_state(mshv_cpuhp_online);
 free_synic_pages:
@@ -2387,8 +2375,7 @@ static void __exit mshv_parent_partition_exit(void)
 	misc_deregister(&mshv_dev);
 	mshv_irqfd_wq_cleanup();
 	root_scheduler_deinit();
-	if (hv_root_partition())
-		mshv_root_partition_exit();
+	unregister_reboot_notifier(&mshv_reboot_nb);
 	cpuhp_remove_state(mshv_cpuhp_online);
 	free_percpu(mshv_root.synic_pages);
 }
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 8a7d76a10dc3..32f91a714c97 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -495,13 +495,29 @@ int mshv_synic_init(unsigned int cpu)
 
 	/* Setup the Synic's event ring page */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
-	sirbp.sirbp_enabled = true;
-	*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
-				    PAGE_SIZE, MEMREMAP_WB);
 
-	if (!(*event_ring_page))
-		goto cleanup_siefp;
+	if (hv_root_partition()) {
+		*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
+					    PAGE_SIZE, MEMREMAP_WB);
+
+		if (!(*event_ring_page))
+			goto cleanup_siefp;
+	} else {
+		/*
+		 * On L1VH the hypervisor does not provide a SIRBP page.
+		 * Allocate one and program its GPA into the MSR.
+		 */
+		*event_ring_page = (struct hv_synic_event_ring_page *)
+			get_zeroed_page(GFP_KERNEL);
+
+		if (!(*event_ring_page))
+			goto cleanup_siefp;
 
+		sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
+				>> PAGE_SHIFT;
+	}
+
+	sirbp.sirbp_enabled = true;
 	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
 
 #ifdef HYPERVISOR_CALLBACK_VECTOR
@@ -581,8 +597,15 @@ int mshv_synic_cleanup(unsigned int cpu)
 	/* Disable SYNIC event ring page owned by MSHV */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
 	sirbp.sirbp_enabled = false;
-	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
-	memunmap(*event_ring_page);
+
+	if (hv_root_partition()) {
+		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+		memunmap(*event_ring_page);
+	} else {
+		sirbp.base_sirbp_gpa = 0;
+		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+		free_page((unsigned long)*event_ring_page);
+	}
 
 	/*
 	 * Release our mappings of the message and event flags pages.
-- 
2.43.0


^ permalink raw reply related
