Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* [PATCH net v2 4/4] net: mana: Fix EQ leak in mana_remove on NULL port
From: Erni Sri Satya Vennela @ 2026-04-13  5:08 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260413050843.605789-1-ernis@linux.microsoft.com>

In mana_remove(), when a NULL port is encountered in the port iteration
loop, 'goto out' skips the mana_destroy_eq(ac) call, leaking the event
queues allocated earlier by mana_create_eq().

This can happen when mana_probe_port() fails for port 0, leaving
ac->ports[0] as NULL. On driver unload or error cleanup, mana_remove()
hits the NULL entry and jumps past mana_destroy_eq().

Change 'goto out' to 'break' so the for-loop exits normally and
mana_destroy_eq() is always reached. Remove the now-unreferenced out:
label.

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1a141c46ac27..97237d137cbf 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3747,7 +3747,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 		if (!ndev) {
 			if (i == 0)
 				dev_err(dev, "No net device to remove\n");
-			goto out;
+			break;
 		}
 
 		apc = netdev_priv(ndev);
@@ -3778,7 +3778,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 	}
 
 	mana_destroy_eq(ac);
-out:
+
 	if (ac->per_port_queue_reset_wq) {
 		destroy_workqueue(ac->per_port_queue_reset_wq);
 		ac->per_port_queue_reset_wq = NULL;
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v2 3/4] net: mana: Don't overwrite port probe error with add_adev result
From: Erni Sri Satya Vennela @ 2026-04-13  5:08 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260413050843.605789-1-ernis@linux.microsoft.com>

In mana_probe(), if mana_probe_port() fails for any port, the error
is stored in 'err' and the loop breaks. However, the subsequent
unconditional 'err = add_adev(gd, "eth")' overwrites this error.
If add_adev() succeeds, mana_probe() returns success despite ports
being left in a partially initialized state (ac->ports[i] == NULL).

Only call add_adev() when there is no prior error, so the probe
correctly fails and triggers mana_remove() cleanup.

Fixes: ced82fce77e9 ("net: mana: Probe rdma device in mana driver")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index f6ad46736418..1a141c46ac27 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3680,10 +3680,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 	if (!resuming) {
 		for (i = 0; i < ac->num_ports; i++) {
 			err = mana_probe_port(ac, i, &ac->ports[i]);
-			/* we log the port for which the probe failed and stop
-			 * probes for subsequent ports.
-			 * Note that we keep running ports, for which the probes
-			 * were successful, unless add_adev fails too
+			/* Log the port for which the probe failed, stop probing
+			 * subsequent ports, and skip add_adev.
+			 * Already-probed ports remain functional.
 			 */
 			if (err) {
 				dev_err(dev, "Probe Failed for port %d\n", i);
@@ -3697,10 +3696,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 			enable_work(&apc->queue_reset_work);
 			err = mana_attach(ac->ports[i]);
 			rtnl_unlock();
-			/* we log the port for which the attach failed and stop
-			 * attach for subsequent ports
-			 * Note that we keep running ports, for which the attach
-			 * were successful, unless add_adev fails too
+			/* Log the port for which the attach failed, stop
+			 * attaching subsequent ports, and skip add_adev.
+			 * Already-attached ports remain functional.
 			 */
 			if (err) {
 				dev_err(dev, "Attach Failed for port %d\n", i);
@@ -3709,7 +3707,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 		}
 	}
 
-	err = add_adev(gd, "eth");
+	if (!err)
+		err = add_adev(gd, "eth");
 
 	schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
 
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v2 2/4] net: mana: Init gf_stats_work before potential error paths in probe
From: Erni Sri Satya Vennela @ 2026-04-13  5:08 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260413050843.605789-1-ernis@linux.microsoft.com>

Move INIT_DELAYED_WORK(gf_stats_work) to before mana_create_eq(),
while keeping schedule_delayed_work() at its original location.

Previously, if any function between mana_create_eq() and the
INIT_DELAYED_WORK call failed, mana_probe() would call mana_remove()
which unconditionally calls cancel_delayed_work_sync(gf_stats_work)
in __flush_work() or debug object warnings with
CONFIG_DEBUG_OBJECTS_WORK enabled.

Fixes: be4f1d67ec56 ("net: mana: Add standard counter rx_missed_errors")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 57f146ea6f66..f6ad46736418 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3635,6 +3635,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 		INIT_WORK(&ac->link_change_work, mana_link_state_handle);
 	}
 
+	INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
+
 	err = mana_create_eq(ac);
 	if (err) {
 		dev_err(dev, "Failed to create EQs: %d\n", err);
@@ -3709,7 +3711,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	err = add_adev(gd, "eth");
 
-	INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
 	schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
 
 out:
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v2 1/4] net: mana: Init link_change_work before potential error paths in probe
From: Erni Sri Satya Vennela @ 2026-04-13  5:08 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260413050843.605789-1-ernis@linux.microsoft.com>

Move INIT_WORK(link_change_work) to right after the mana_context
allocation, before any error path that could reach mana_remove().

Previously, if mana_create_eq() or mana_query_device_cfg() failed,
mana_probe() would jump to the error path which calls mana_remove().
mana_remove() unconditionally calls disable_work_sync(link_change_work),
but the work struct had not been initialized yet. This can trigger
CONFIG_DEBUG_OBJECTS_WORK enabled.

Fixes: 54133f9b4b53 ("net: mana: Support HW link state events")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 07630322545f..57f146ea6f66 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3631,6 +3631,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 		ac->gdma_dev = gd;
 		gd->driver_data = ac;
+
+		INIT_WORK(&ac->link_change_work, mana_link_state_handle);
 	}
 
 	err = mana_create_eq(ac);
@@ -3648,8 +3650,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	if (!resuming) {
 		ac->num_ports = num_ports;
-
-		INIT_WORK(&ac->link_change_work, mana_link_state_handle);
 	} else {
 		if (ac->num_ports != num_ports) {
 			dev_err(dev, "The number of vPorts changed: %d->%d\n",
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v2 0/4] net: mana: Fix probe/remove error path bugs
From: Erni Sri Satya Vennela @ 2026-04-13  5:08 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel

Fix four pre-existing bugs in mana_probe()/mana_remove() error handling
that can cause warnings on uninitialized work structs, masked errors,
and resource leaks when early probe steps fail.

Patches 1-2 move work struct initialization (link_change_work and
gf_stats_work) to before any error path that could trigger
mana_remove(), preventing WARN_ON in __flush_work() or debug object
warnings when sync cancellation runs on uninitialized work structs.

Patch 3 prevents add_adev() from overwriting a port probe error,
which could leave the driver in a broken state with NULL ports while
reporting success.

Patch 4 changes 'goto out' to 'break' in mana_remove()'s port loop
so that mana_destroy_eq() is always reached, preventing EQ leaks when
a NULL port is encountered.
---
Changes in v2:
* Apply the patch in net instead of net-next.
---
Erni Sri Satya Vennela (4):
  net: mana: Init link_change_work before potential error paths in probe
  net: mana: Init gf_stats_work before potential error paths in probe
  net: mana: Don't overwrite port probe error with add_adev result
  net: mana: Fix EQ leak in mana_remove on NULL port

 drivers/net/ethernet/microsoft/mana/mana_en.c | 28 +++++++++----------
 1 file changed, 14 insertions(+), 14 deletions(-)

-- 
2.34.1

^ permalink raw reply

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-04-12 19:59 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260409183509.0b24dea6@kernel.org>

On Thu, 9 Apr 2026 18:35:09 -0700 Jakub Kicinski wrote:
> On Tue,  7 Apr 2026 12:59:17 -0700 Dipayaan Roy wrote:
> > This behavior is observed on a single platform; other platforms
> > perform better with page_pool fragments, indicating this is not a
> > page_pool issue but platform-specific.  
> 
> Well, someone has to run some experiments and confirm other ARM
> platforms are not impacted, with data. I was hoping to do it myself
> but doesn't look like that will happen in time for the merge window :(

Please repost with the perf analysis on other commercially available
ARM platform. Something like:

  This is a workaround applicable to only some platforms. Modifying
  driver X to use a similar workaround on [Ampere Max|nVidia
  Grace|Amazon Graviton 3|..] the performance for split pages is
  y% higher than when using single pages.
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH net 0/2] net: mana: Fix debugfs directory naming and file lifecycle
From: patchwork-bot+netdevbpf @ 2026-04-12 18:40 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ssengar, dipayanroy, gargaditya,
	shradhagupta, kees, kotaranov, yury.norov, linux-hyperv, netdev,
	linux-kernel
In-Reply-To: <20260408081224.302308-1-ernis@linux.microsoft.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed,  8 Apr 2026 01:12:18 -0700 you wrote:
> This series fixes two pre-existing debugfs issues in the MANA driver.
> 
> Patch 1 fixes the per-device debugfs directory naming to use the unique
> PCI BDF address via pci_name(), avoiding a potential NULL pointer
> dereference when pdev->slot is NULL (e.g. VFIO passthrough, nested KVM)
> and preventing name collisions across multiple PFs or VFs.
> 
> [...]

Here is the summary with links:
  - [net,1/2] net: mana: Use pci_name() for debugfs directory naming
    https://git.kernel.org/netdev/net/c/c116f07ab9d2
  - [net,2/2] net: mana: Move current_speed debugfs file to mana_init_port()
    https://git.kernel.org/netdev/net/c/3b7c7fc97aea

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net 2/2] net: mana: Move current_speed debugfs file to mana_init_port()
From: Simon Horman @ 2026-04-12 12:52 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ssengar, dipayanroy, gargaditya,
	shradhagupta, kees, kotaranov, yury.norov, linux-hyperv, netdev,
	linux-kernel
In-Reply-To: <20260408081224.302308-3-ernis@linux.microsoft.com>

On Wed, Apr 08, 2026 at 01:12:20AM -0700, Erni Sri Satya Vennela wrote:
> Move the current_speed debugfs file creation from mana_probe_port() to
> mana_init_port(). The file was previously created only during initial
> probe, but mana_cleanup_port_context() removes the entire vPort debugfs
> directory during detach/attach cycles. Since mana_init_port() recreates
> the directory on re-attach, moving current_speed here ensures it survives
> these cycles.
> 
> Fixes: 75cabb46935b ("net: mana: Add support for net_shaper_ops")
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>

Reviewed-by: Simon Horman <horms@kernel.org>


^ permalink raw reply

* Re: [PATCH net 1/2] net: mana: Use pci_name() for debugfs directory naming
From: Simon Horman @ 2026-04-12 12:52 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ssengar, dipayanroy, gargaditya,
	shradhagupta, kees, kotaranov, yury.norov, linux-hyperv, netdev,
	linux-kernel
In-Reply-To: <20260408081224.302308-2-ernis@linux.microsoft.com>

On Wed, Apr 08, 2026 at 01:12:19AM -0700, Erni Sri Satya Vennela wrote:
> Use pci_name(pdev) for the per-device debugfs directory instead of
> hardcoded "0" for PFs and pci_slot_name(pdev->slot) for VFs. The
> previous approach had two issues:
> 
> 1. pci_slot_name() dereferences pdev->slot, which can be NULL for VFs
>    in environments like generic VFIO passthrough or nested KVM,
>    causing a NULL pointer dereference.
> 
> 2. Multiple PFs would all use "0", and VFs across different PCI
>    domains or buses could share the same slot name, leading to
>    -EEXIST errors from debugfs_create_dir().
> 
> pci_name(pdev) returns the unique BDF address, is always valid, and is
> unique across the system.
> 
> Fixes: 6607c17c6c5e ("net: mana: Enable debugfs files for MANA device")
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>

Reviewed-by: Simon Horman <horms@kernel.org>


^ permalink raw reply

* Re: [PATCH net-next v6] net: mana: Expose hardware diagnostic info via debugfs
From: Simon Horman @ 2026-04-12 12:49 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, shradhagupta, shirazsaleem,
	yury.norov, kees, ssengar, dipayanroy, gargaditya, linux-hyperv,
	netdev, linux-kernel, linux-rdma
In-Reply-To: <20260408081555.302620-1-ernis@linux.microsoft.com>

On Wed, Apr 08, 2026 at 01:15:46AM -0700, Erni Sri Satya Vennela wrote:
> Add debugfs entries to expose hardware configuration and diagnostic
> information that aids in debugging driver initialization and runtime
> operations without adding noise to dmesg.
> 
> The debugfs directory for each PCI device is named using pci_name()
> (the unique BDF address), and its creation and removal is integrated
> into mana_gd_setup() and mana_gd_cleanup_device() respectively, so
> that all callers (probe, remove, suspend, resume, shutdown) share a
> single code path.
> 
> Device-level entries (under /sys/kernel/debug/mana/<BDF>/):
>   - num_msix_usable, max_num_queues: Max resources from hardware
>   - gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
>   - num_vports, bm_hostmode: Device configuration
> 
> Per-vPort entries (under /sys/kernel/debug/mana/<BDF>/vportN/):
>   - port_handle: Hardware vPort handle
>   - max_sq, max_rq: Max queues from vPort config
>   - indir_table_sz: Indirection table size
>   - steer_rx, steer_rss, steer_update_tab, steer_cqe_coalescing:
>     Last applied steering configuration parameters
> 
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> ---
> This patch depends on the following fixes submitted to net:
>   - "net: mana: Use pci_name() for debugfs directory naming"
>   - "net: mana: Move current_speed debugfs file to mana_init_port()"
> Conflict resolution may be needed when net merges into net-next.

Unfortunately this patch doesn't apply to net-next,
which is a requirement for our workflow.

-- 
pw-bot: changes-requested

^ permalink raw reply

* Re: [PATCH v2] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: vdso @ 2026-04-11  5:05 UTC (permalink / raw)
  To: Junrui Luo, Stanislav Kinsburskii
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Nuno Das Neves, Anirudh Rayabharam, Mukesh Rathor, Muminul Islam,
	Praveen K Paladugu, Jinank Jain, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yuhao Jiang, stable@vger.kernel.org
In-Reply-To: <89730D18-D9A3-4A18-87DD-E7A51625FF69@outlook.com>

> On 04/09/2026 8:06 PM PDT Junrui Luo <moonafterrain@outlook.com> wrote:
> 
>  
> On Thu, Apr 02, 2026 at 04:25:02PM -0700, Stanislav Kinsburskii wrote:
> > nit: both comments are redundant - the meaning is clear from the code
> > itself.
> 
> I will drop them in v3.
> 
> > This maximum value check bugs me a bit.
> > 
> > First of all, why does it matter what is the region end? Potentially, there can be
> > regions not backed by host address space (leave alone host RAM), so why
> > intropducing this limitation?
> > 
> > Second, this check takes a host-specific constant (MAX_PHYSMEM_BITS) and rounds it down
> > to hypervisor-specific units which may not be aligned with the host page
> > size. Should this be host pages instead?
>  
> This check was suggested by Roman in v1 review. Roman, could you
> share your thoughts on Stanislav's concerns? I'd like to align on whether an upper
> bound check is needed here.

Hey Junrui, Stanislav,

The idea was that there is a max PFN/GFN going over which is _architecturally_ impossible.
It might be ((1 << 52) - 1) << 12 or ((1 << 48) - 1) << 12 or similar depending on how
many bits are used for the guest phys. address (52, 48, 36) and page sizes. The gist is
that max is way below 0xFFFF_FFFF_FFFF_FFFF (1<<64 - 1) used for overflow checking in the
original v1 patch. Hence, instead of checking for overflow, one may check for that max
PFN/GFN catching way more bad input. Checking for that max also feels as more
domain-specific compared to the "generic" overflow check.

I went for the trade-off where the "impossible" PFN/GFN is computed simply as
(1 << max_possible_bits_in_the_address)/the_min_possible_page_size. Again, that's simple
and better than the oveflow check as it catches the region with PFNs/GFNs that just cannot
be used. I agree that may overshoot, and an even more specific check would have to fetch the
CPUID for the VM (or its ARM64 MMFR regs), etc. to get more specifics and a lower bound
on the bad GFN (but at the cost of more computation).

All in all, from the three options of (generic check for overflow, simple check
for arch bad PFNs/GFNs, an elaborated check with all specifics) I suggested the simple check.
Fast and still more useful than checking for overflow in my opinion.

Below I go into various details on "why impossible" and these 48 and 52 bits, and I most
likely will be re-iterating things that you know - just for the convenience of other
folks who don't play with MMU/EPT/etc often, and might find the topic interesting.
It may happen that I'll also learn something from your critique, thanks in advance :)

That max PFN/GFN is dictated by the maximum number of bits allowed to be used in
the physical address. Above that, an address would not be mapped or cached by
the machines that run the code in question. The page tables won't work as the number
of bits used for the physical address here is the sum of the bits used to index
into a page table + the offset bits. That also gives the number of bits in the virtual
address which the CPU needs to address the memory (again, in this modern case;
there were many years ago machines with 32 bit virt addresses and 36 bit phys
addresses used via PAE/AWE).

E.g. for the x64_86 and 48 bits in the phys addresses: the VAs have 9+9+9+9 bits per
each page table hierarchy + 12 bits for the page offset in 4KiB (1<<12) pages. That
also can be played as 9+9+9 and 21 bits which gives 3 level paging and 2MiB (1<<21)
pages but still _48_ bits for the VA and the phys. address. Sure can be 9+9 bits for
2-level pages and 30 bits for the page offset (1GiB pages, 1 << 30). Still, _48_
bits. Similar aruphmetic goes for _52_ bit addresses but going to need 5-level pages
this time.

Now, can you have 53 bits in the phys address (to address 1<<53 bytes)? No, you
cannot as that doesn't work with the existing x86_64 MMUs (and ARM64) - no virtual
address can be constructed to address 53 bits - not enough levels in the page table
hierarchy. As no VA can be constructed, the code won't be able to access the data.

Neither the EPT/2nd level set up by the hypervisor, nor for 1st level translation
set up by the guest or the host can go higher than 52 bits, even for MMIO. When MMIO
happens, that's a 2nd level fault/not present page and goes to the hypervisor but
_still_ the hv builds the EPT/2nd level map by the rules of the architecture and
cannot go above the architecturally imposed limits.

--
Cheers,
Roman

> 
> Thanks,
> Junrui Luo

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Long Li @ 2026-04-10 22:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Erni Sri Satya Vennela, Konstantin Taranov,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260410154327.GA2551565@ziepe.ca>

> On Sat, Mar 21, 2026 at 12:56:39AM +0000, Long Li wrote:
> 
> > How we rephrase this in this way: the driver should not corrupt or
> > overflow other parts of the kernel if its device is misbehaving (or
> > has a bug).
> 
> If we are going to do this CC hardening stuff I think I want to see a more
> comphrensive approach, like if we detect an attack then the kernel instantly
> crashes or something. Or at least an approach in general agreed to by the CC and
> kernel community.
> 
> Igoring the issue and continuing seems just wrong.
> 
> This sprinkling of random checks in this series doesn't feel comprehensive or
> cohesive to me.
> 
> Jason

Can we follow the virtio BAD_RING()/vq->broken pattern in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/virtio/virtio_ring.c#n57.

Add a broken flag to mana_ib_dev. When any hardware response contains out-of-range values, mark the device broken and fail the operation - during probe this prevents device registration entirely, at runtime all subsequent operations return -EIO.

Long

^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH rdma-next 0/8] RDMA/mana_ib: Handle service reset for RDMA resources
From: Jason Gunthorpe @ 2026-04-10 15:49 UTC (permalink / raw)
  To: Long Li
  Cc: Leon Romanovsky, Konstantin Taranov, Jakub Kicinski,
	David S . Miller, Paolo Abeni, Eric Dumazet, Andrew Lunn,
	Haiyang Zhang, KY Srinivasan, Wei Liu, Dexuan Cui, Simon Horman,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB66832D0A369DE7E411ACCDEDCE41A@SA1PR21MB6683.namprd21.prod.outlook.com>

On Tue, Mar 17, 2026 at 11:43:49PM +0000, Long Li wrote:

>    Today a DPC event on one NIC kills all RDMA connections and can
>    crash entire training jobs. 

All rdma connections on that nic, right?

>    If the ib_device persists and the driver
>    recreates firmware resources after recovery, raw verbs users can
>    resume without full teardown, and RDMA-CM users get the same
>    disconnect/reconnect behavior they have today.

No, I don't think this is feasible. There is too much state, the
kernel cannot just recreate things and transparently keep going
without userspace handshaking this. IMHO It is just the wrong model.

We have always gone for the model that userspace has to be involved in
the RAS and it has to recreate its operations on a fresh new verbs
FD. I think anything else is going to be so complicated and fragile.

I can't see any sensible way an already open verbs FD can survive a
device reset.

Jason

^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Jason Gunthorpe @ 2026-04-10 15:43 UTC (permalink / raw)
  To: Long Li
  Cc: Leon Romanovsky, Erni Sri Satya Vennela, Konstantin Taranov,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB66833EBAF447BA0B102862FCCE4DA@SA1PR21MB6683.namprd21.prod.outlook.com>

On Sat, Mar 21, 2026 at 12:56:39AM +0000, Long Li wrote:

> How we rephrase this in this way: the driver should not corrupt or
> overflow other parts of the kernel if its device is misbehaving (or
> has a bug).

If we are going to do this CC hardening stuff I think I want to see a
more comphrensive approach, like if we detect an attack then the
kernel instantly crashes or something. Or at least an approach in
general agreed to by the CC and kernel community.

Igoring the issue and continuing seems just wrong.

This sprinkling of random checks in this series doesn't feel
comprehensive or cohesive to me.

Jason

^ permalink raw reply

* Re: [PATCH 04/61] ext4: Prefer IS_ERR_OR_NULL over manual NULL check
From: Theodore Ts'o @ 2026-04-10 15:18 UTC (permalink / raw)
  To: amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel, dri-devel,
	gfs2, intel-gfx, intel-wired-lan, iommu, kvm, linux-arm-kernel,
	linux-block, linux-bluetooth, linux-btrfs, linux-cifs, linux-clk,
	linux-erofs, linux-ext4, linux-fsdevel, linux-gpio, linux-hyperv,
	linux-input, linux-kernel, linux-leds, linux-media, linux-mips,
	linux-mm, linux-modules, linux-mtd, linux-nfs, linux-omap,
	linux-phy, linux-pm, linux-rockchip, linux-s390, linux-scsi,
	linux-sctp, linux-security-module, linux-sh, linux-sound,
	linux-stm32, linux-trace-kernel, linux-usb, linux-wireless,
	netdev, ntfs3, samba-technical, sched-ext, target-devel,
	tipc-discussion, v9fs, Philipp Hahn
  Cc: Theodore Ts'o, Andreas Dilger
In-Reply-To: <20260310-b4-is_err_or_null-v1-4-bd63b656022d@avm.de>


On Tue, 10 Mar 2026 12:48:30 +0100, Philipp Hahn wrote:
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.
> 
> Change generated with coccinelle.

Applied, thanks!

[04/61] ext4: Prefer IS_ERR_OR_NULL over manual NULL check
        commit: 1d749e110277ce4103f27bd60d6181e52c0cc1e3

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH net-next] net: mana: hardening: Reject zero max_num_queues from MANA_QUERY_VPORT_CONFIG
From: Erni Sri Satya Vennela @ 2026-04-10  5:16 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260326174815.2012137-1-ernis@linux.microsoft.com>

On Thu, Mar 26, 2026 at 10:48:10AM -0700, Erni Sri Satya Vennela wrote:
> As a part of MANA hardening for CVM, validate that max_num_sq and
> max_num_rq returned by MANA_QUERY_VPORT_CONFIG are not zero. These
> values flow into apc->num_queues, which is used as an allocation count
> and loop bound. A zero value would result in zero-size allocations and
> incorrect driver behavior.
> 
> Return -EPROTO if either value is zero.
> 
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index b39e8b920791..a4197b4b0597 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1249,6 +1249,12 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
>  
>  	*max_sq = resp.max_num_sq;
>  	*max_rq = resp.max_num_rq;
> +
> +	if (*max_sq == 0 || *max_rq == 0) {
> +		netdev_err(apc->ndev, "Invalid max queues from vPort config\n");
> +		return -EPROTO;
> +	}
> +
>  	if (resp.num_indirection_ent > 0 &&
>  	    resp.num_indirection_ent <= MANA_INDIRECT_TABLE_MAX_SIZE &&
>  	    is_power_of_2(resp.num_indirection_ent)) {
> -- 
> 2.34.1

Hi,

Gentle reminder regarding this patch.

I would really appreciate any feedback whenever you get a chance.
Please let me know if any changes are required from my side.

Thanks for your time.

Regards,
Vennela


^ permalink raw reply

* Re: [PATCH v2] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: Junrui Luo @ 2026-04-10  3:06 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Nuno Das Neves, Anirudh Rayabharam, Mukesh Rathor, Muminul Islam,
	Praveen K Paladugu, Jinank Jain, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yuhao Jiang, Roman Kisel,
	stable@vger.kernel.org
In-Reply-To: <ac76zlXjXhPVkA6f@skinsburskii.localdomain>

On Thu, Apr 02, 2026 at 04:25:02PM -0700, Stanislav Kinsburskii wrote:
> nit: both comments are redundant - the meaning is clear from the code
> itself.

I will drop them in v3.

> This maximum value check bugs me a bit.
> 
> First of all, why does it matter what is the region end? Potentially, there can be
> regions not backed by host address space (leave alone host RAM), so why
> intropducing this limitation?
> 
> Second, this check takes a host-specific constant (MAX_PHYSMEM_BITS) and rounds it down
> to hypervisor-specific units which may not be aligned with the host page
> size. Should this be host pages instead?
 
This check was suggested by Roman in v1 review. Roman, could you
share your thoughts on Stanislav's concerns? I'd like to align on whether an upper
bound check is needed here.

Thanks,
Junrui Luo

^ permalink raw reply

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-04-10  1:35 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260407200216.272659-1-dipayanroy@linux.microsoft.com>

On Tue,  7 Apr 2026 12:59:17 -0700 Dipayaan Roy wrote:
> This behavior is observed on a single platform; other platforms
> perform better with page_pool fragments, indicating this is not a
> page_pool issue but platform-specific.

Well, someone has to run some experiments and confirm other ARM
platforms are not impacted, with data. I was hoping to do it myself
but doesn't look like that will happen in time for the merge window :(

> Changes in v6:
>  - Added missed maintainers.

STOP REPOSTING PATCHES FOR NO REASON.

^ permalink raw reply

* [PATCH] Drivers: hv: vmbus: Export hv_vmbus_exists() and use it in pci-hyperv
From: Dexuan Cui @ 2026-04-09 21:52 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, lpieralisi, kwilczynski,
	mani, robh, bhelgaas, linux-hyperv, linux-pci
  Cc: linux-kernel, Mukesh Rathor

With commit f84b21da3624 ("PCI: hv: Don't load the driver for baremetal root partition"),
the bare metal Linux root partition won't use the pci-hyperv driver, but
when a Linux VM runs on the Linux root partition, pci-hyperv's module_init
function init_hv_pci_drv() can still run, e.g. in the case of
CONFIG_PCI_HYPERV=y, even if the VMBus driver is not used in such a VM
(i.e. the hv_vmbus driver's init function returns -ENODEV due to
vmbus_root_device being NULL).

In such a Linux VM, init_hv_pci_drv() runs with a side effect: the 3
hvpci_block_ops callbacks are set to functions that depend on hv_vmbus.

Later, when the MLX driver in such a VM invokes the callbacks, e.g. in
drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c:
mlx5_hv_register_invalidate(), hvpci_block_ops.reg_blk_invalidate() is
hv_register_block_invalidate() rather than a NULL function pointer, and
hv_register_block_invalidate() assumes that it can find a struct
hv_pcibus_device from pdev->bus->sysdata, which is false in such a VM.

Consequently, hv_register_block_invalidate() -> get_pcichild_wslot() ->
spin_lock_irqsave() may hang since it can be accessing an invalid
spinlock pointer.

Fix the issue by exporting hv_vmbus_exists() and using it in pci-hyperv:

    hv_root_partition() is true and hv_nested is false ==>
	hv_vmbus_exists() is false.

    hv_root_partition() is true and hv_nested is true ==>
	hv_vmbus_exists() is true.

    hv_root_partition() is false ==> hv_vmbus_exists() is true.

While at it, rename vmbus_exists() to hv_vmbus_exists() to follow the
convention that all public functions have the hv_ prefix; also change
the return value's type from int to bool to make the code more readable;
also move the two pr_info() calls.

Reported-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/hv/vmbus_drv.c              | 20 ++++++++------------
 drivers/pci/controller/pci-hyperv.c |  2 +-
 include/linux/hyperv.h              |  2 ++
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index bc4fc1951ae1..2c8936efc8d1 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -101,13 +101,11 @@ struct device *hv_get_vmbus_root_device(void)
 }
 EXPORT_SYMBOL_GPL(hv_get_vmbus_root_device);

-static int vmbus_exists(void)
+bool hv_vmbus_exists(void)
 {
-	if (vmbus_root_device == NULL)
-		return -ENODEV;
-
-	return 0;
+	return vmbus_root_device != NULL;
 }
+EXPORT_SYMBOL_GPL(hv_vmbus_exists);

 static u8 channel_monitor_group(const struct vmbus_channel *channel)
 {
@@ -1582,11 +1580,10 @@ int __vmbus_driver_register(struct hv_driver *hv_driver, struct module *owner, c
 {
 	int ret;

-	pr_info("registering driver %s\n", hv_driver->name);
+	if (!hv_vmbus_exists())
+		return -ENODEV;

-	ret = vmbus_exists();
-	if (ret < 0)
-		return ret;
+	pr_info("registering driver %s\n", hv_driver->name);

 	hv_driver->driver.name = hv_driver->name;
 	hv_driver->driver.owner = owner;
@@ -1612,9 +1609,8 @@ EXPORT_SYMBOL_GPL(__vmbus_driver_register);
  */
 void vmbus_driver_unregister(struct hv_driver *hv_driver)
 {
-	pr_info("unregistering driver %s\n", hv_driver->name);
-
-	if (!vmbus_exists()) {
+	if (hv_vmbus_exists()) {
+		pr_info("unregistering driver %s\n", hv_driver->name);
 		driver_unregister(&hv_driver->driver);
 		vmbus_free_dynids(hv_driver);
 	}
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 2c7a406b4ba8..226b8bb802f3 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -4166,7 +4166,7 @@ static int __init init_hv_pci_drv(void)
 	if (!hv_is_hyperv_initialized())
 		return -ENODEV;

-	if (hv_root_partition() && !hv_nested)
+	if (!hv_vmbus_exists())
 		return -ENODEV;

 	ret = hv_pci_irqchip_init();
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dfc516c1c719..5459e776ec17 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1304,6 +1304,8 @@ static inline void *hv_get_drvdata(struct hv_device *dev)

 struct device *hv_get_vmbus_root_device(void);

+bool hv_vmbus_exists(void);
+
 struct hv_ring_buffer_debug_info {
 	u32 current_interrupt_mask;
 	u32 current_read_index;
-- 
2.43.0

^ permalink raw reply related

* RE: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Michael Kelley @ 2026-04-09 19:01 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
	robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <2dabc1b8-0cf0-4fc8-9cd4-cce60adfc05e@linux.microsoft.com>

From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Wednesday, April 8, 2026 1:21 PM
> 
> On 1/11/2026 9:36 AM, Michael Kelley wrote:
> > From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Friday, January 9, 2026 10:41 AM
> >>
> >> On 1/8/2026 10:46 AM, Michael Kelley wrote:
> >>> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> >>>>
> >>>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> >>>>
> >>>> Hyper-V uses a logical device ID to identify a PCI endpoint device for
> >>>> child partitions. This ID will also be required for future hypercalls
> >>>> used by the Hyper-V IOMMU driver.
> >>>>
> >>>> Refactor the logic for building this logical device ID into a standalone
> >>>> helper function and export the interface for wider use.
> >>>>
> >>>> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> >>>> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> >>>> ---
> >>>>  drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++--------
> >>>>  include/asm-generic/mshyperv.h      |  2 ++
> >>>>  2 files changed, 22 insertions(+), 8 deletions(-)
> >>>>
> >>>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> >>>> index 146b43981b27..4b82e06b5d93 100644
> >>>> --- a/drivers/pci/controller/pci-hyperv.c
> >>>> +++ b/drivers/pci/controller/pci-hyperv.c
> >>>> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
> >>>>
> >>>>  #define hv_msi_prepare		pci_msi_prepare
> >>>>
> >>>> +/**
> >>>> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the
> >>>> + * function number of the device.
> >>>> + */
> >>>> +u64 hv_build_logical_dev_id(struct pci_dev *pdev)
> >>>> +{
> >>>> +	struct pci_bus *pbus = pdev->bus;
> >>>> +	struct hv_pcibus_device *hbus = container_of(pbus->sysdata,
> >>>> +						struct hv_pcibus_device, sysdata);
> >>>> +
> >>>> +	return (u64)((hbus->hdev->dev_instance.b[5] << 24) |
> >>>> +		     (hbus->hdev->dev_instance.b[4] << 16) |
> >>>> +		     (hbus->hdev->dev_instance.b[7] << 8)  |
> >>>> +		     (hbus->hdev->dev_instance.b[6] & 0xf8) |
> >>>> +		     PCI_FUNC(pdev->devfn));
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id);
> >>>
> >>> This change is fine for hv_irq_retarget_interrupt(), it doesn't help for the
> >>> new IOMMU driver because pci-hyperv.c can (and often is) built as a module.
> >>> The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't
> >>> use this symbol in that case -- you'll get a link error on vmlinux when building
> >>> the kernel. Requiring pci-hyperv.c to *not* be built as a module would also
> >>> require that the VMBus driver not be built as a module, so I don't think that's
> >>> the right solution.
> >>>
> >>> This is a messy problem. The new IOMMU driver needs to start with a generic
> >>> "struct device" for the PCI device, and somehow find the corresponding VMBus
> >>> PCI pass-thru device from which it can get the VMBus instance ID. I'm thinking
> >>> about ways to do this that don't depend on code and data structures that are
> >>> private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion.
> >>
> >> Thank you, Michael. FWIW, I did try to pull out the device ID components out of
> >> pci-hyperv into include/linux/hyperv.h and/or a new include/linux/pci-hyperv.h
> >> but it was just too messy as you say.
> >
> > Yes, the current approach for getting the device ID wanders through struct
> > hv_pcibus_device (which is private to the pci-hyperv driver), and through
> > struct hv_device (which is a VMBus data structure). That makes the linkage
> > between the PV IOMMU driver and the pci-hyperv and VMBus drivers rather
> > substantial, which is not good.
> 
> Hi Michael,
> 
> I missed this, or made a mental note to follow up but forgot. Either way, Yu reminded
> me about this email chain and I started looking at it this week.
> 
> >
> > But here's an idea for an alternate approach. The PV IOMMU driver doesn't
> > have to generate the logical device ID on-the-fly by going to the dev_instance
> > field of struct hv_device. Instead, the pci-hyperv driver can generate the logical
> > device ID in hv_pci_probe(), and put it somewhere that's easy for the IOMMU
> > driver to access. The logical device ID doesn't change while Linux is running, so
> > stashing another copy somewhere isn't a problem.
> 
> In my exploration and consulting with Dexuan, I realized that one of the components of
> the logical device ID, the PCI function number is set only in pci_scan_device(), well into
> pci_scan_root_bus_bridge() that you call out as the point by which the communication
> must have occurred.
> 
> But then, Dexuan also pointed me to hv_pci_assign_slots() with its call to wslot_to_devfn() and I'm
> honestly confused how these two interact. With the current approach, it looks like whatever
> devfn pci_scan_device() set is the correct function number to use for the logical device
> ID, in which case, the best I can do with your suggested approach below is to inform the
> pvIOMMU driver of the GUID, rather than the logical device ID itself.
> 
> Perhaps with your history, you can clarify the interaction, and/or share your thoughts
> on the above?

During hv_pci_probe(), hv_pci_query_relations() is called to ask the Hyper-V
host about what PCI devices are present. hv_pci_query_relations() sends a
PCI_QUERY_BUS_RELATIONS message to the host, and the host send back a
PCI_BUS_RELATIONS or PCI_BUS_RELATIONS2 message. The response message
is handled in hv_pci_onchannelcallback(), which calls hv_pci_devices_present()
or hv_pci_devices_present2().  The latter two functions both call
hv_pci_start_relations_work() to add a request to a workqueue that runs
pci_devices_present_work().  Finally, pci_devices_present_work() calls
pc_scan_child_bus(), followed by hv_pci_assign_slots().

In hv_pci_assign_slots, you can see that the PCI_BUS_RELATIONS[2]
info from the Hyper-V host contains a function number encoded in the
win_slot field. So the Hyper-V host *does* tell the guest the function number.
However, the generic Linux PCI subsystem doesn't use this function number.
It still scans the PCI device, trying successive function numbers to see which
ones work. The scan should find the same function number that the Hyper-V
host originally reported.

As you noted, there's a sequencing problem in waiting for
pci_scan_single_device() to find the function number. In the hv_pci_probe()
path, after hv_pci_query_relations() runs and before create_root_hv_pci_bus()
is called, it seems feasible to use the function number provided by the
Hyper-V host to construct the logical device ID. That should work. But there's
another path, in that the Hyper-V host can generate a PCI_BUS_RELATIONS[2]
message without a request from Linux when something on the host side changes
the PCI device setup. There's a code path where pci_devices_present_work()
finds the state is "hv_pcibus_installed", and directly calls pci_scan_child_bus().
This path would presumably also need to construct (or re-construct) the
logical device ID using the information from the Hyper-V host before calling
pci_scan_child_bus(). I'm vague on the scenario for this latter case, but the
code is obviously there to handle it.

The other approach is as you suggest. The Hyper-V PCI driver can tell
the IOMMU driver the almost complete logical device ID, using just the
GUID bits. Then the IOMMU driver can then construct the full logical
device ID by adding the function number from the struct pci_dev. I don't
see a problem with this approach -- other IOMMU drivers are referencing
the struct pci_dev, and pulling out the function number doesn't seem like
a violation of layering.

> 
> >
> > So have the Hyper-V PV IOMMU driver provide an EXPORTed function to accept
> > a PCI domain ID and the related logical device ID. The PV IOMMU driver is
> > responsible for storing this data in a form that it can later search. hv_pci_probe()
> > calls this new function when it instantiates a new PCI pass-thru device. Then when
> > the IOMMU driver needs to attach a new device, it can get the PCI domain ID
> > from the struct pci_dev (or struct pci_bus), search for the related logical device
> > ID in its own data structure, and use it. The pci-hyperv driver has a dependency
> > on the IOMMU driver, but that's a dependency in the desired direction. The
> > PCI domain ID and logical device ID are just integers, so no data structures are
> > shared.
> 
> In a previous reply on this thread, you raised the uniqueness issue of bytes 4 and 5
> of the GUID being used to create the domain number. I thought this approach could
> help with that too, but as I coded it up, I realized that using the domain number
> (not guaranteed to be unique) to search for the bus instance GUID (guaranteed to be unique)
> is the wrong way around. It is unfortunately the only available key in the pci_dev
> handed to the pvIOMMU driver in this approach though...
> 
> Do you think that's a fatal flaw?

There are two uniqueness problems, which I didn't fully separate conceptually
until writing this. One problem is constructing a PCI domain ID that Linux can use
to identify the virtual PCI bus that the Hyper-V PCI driver creates for each vPCI
device. The Hyper-V virtual PCI driver uses GUID bytes 4 and 5, and recognizes
that they might not be unique. So there's code in hv_pci_probe() to pick another
number if there's a duplicate. Hyper-V doesn't really care how Linux picks the
domain ID for the virtual PCI bus as it's purely a Linux construct.

The second problem is the logical device ID that Hyper-V interprets to
identify a vPCI device in hypercalls such a HVCALL_RETARGET_INTERRUPT
and the new pvIOMMU related hypercalls. This logical device ID uses
GUID bytes 4 thru 7 (minus 1 bit).  I don’t think Linux uses the
logical device ID for anything. Since only Hyper-V interprets it, Hyper-V
must somehow be ensuring uniqueness of bytes 4 thru 7 (minus 1 bit).
That's something to confirm with the Hyper-V team. If they are just hoping
for the best, I don't know how Linux can solve the problem.

My original comment about uniqueness somewhat conflated the two problems,
and that's misleading. The use of the logical device ID has been around for years
in hv_irq_retarget_interrupt(). Extending its use to the new pvIOMMU
hypercalls doesn't make things any worse. But I'm still curious about
what the Hyper-V team says about the uniqueness of bytes 4 thru 7.

Michael

> 
> >
> > Note that the pci-hyperv must inform the PV IOMMU driver of the logical
> > device ID *before* create_root_hv_pci_bus() calls pci_scan_root_bus_bridge().
> > The latter function eventually invokes hv_iommu_attach_dev(), which will
> > need the logical device ID. See example stack trace. [1]
> >
> > I don't think the pci-hyperv driver even needs to tell the IOMMU driver to
> > remove the information if a PCI pass-thru device is unbound or removed, as
> > the logical device ID will be the same if the device ever comes back. At worst,
> > the IOMMU driver can simply replace an existing logical device ID if a new one
> > is provided for the same PCI domain ID.
> 
> As above, replacing a unique GUID when a result is found for a non-unique
> key value may be prone to failure if it happens that the device that came "back"
> is not in fact the same device (or class of device) that went away and just happens
> to, either due to bytes 4 and 5 being identical, or due to collision in the
> pci_domain_nr_dynamic_ida, have the same domain number.
> 
> Thanks,
> Easwar (he/him)
> 
> >
> > An include file must provide a stub for the new function if
> > CONFIG_HYPERV_PVIOMMU is not defined, so that the pci-hyperv driver still
> > builds and works.
> >
> > I haven't coded this up, but it seems like it should be pretty clean.
> >
> > Michael
> >
> > [1] Example stack trace, starting with vmbus_add_channel_work() as a
> > result of Hyper-V offering the PCI pass-thru device to the guest.
> > hv_pci_probe() runs, and ends up in the generic Linux code for adding
> > a PCI device, which in turn sets up the IOMMU.
> >
> > [    1.731786]  hv_iommu_attach_dev+0xf0/0x1d0
> > [    1.731788]  __iommu_attach_device+0x21/0xb0
> > [    1.731790]  __iommu_device_set_domain+0x65/0xd0
> > [    1.731792]  __iommu_group_set_domain_internal+0x61/0x120
> > [    1.731795]  iommu_setup_default_domain+0x3a4/0x530
> > [    1.731796]  __iommu_probe_device.part.0+0x15d/0x1d0
> > [    1.731798]  iommu_probe_device+0x81/0xb0
> > [    1.731799]  iommu_bus_notifier+0x2c/0x80
> > [    1.731800]  notifier_call_chain+0x66/0xe0
> > [    1.731802]  blocking_notifier_call_chain+0x47/0x70
> > [    1.731804]  bus_notify+0x3b/0x50
> > [    1.731805]  device_add+0x631/0x850
> > [    1.731807]  pci_device_add+0x2db/0x670
> > [    1.731809]  pci_scan_single_device+0xc3/0x100
> > [    1.731810]  pci_scan_slot+0x97/0x230
> > [    1.731812]  pci_scan_child_bus_extend+0x3b/0x2f0
> > [    1.731814]  pci_scan_root_bus_bridge+0xc0/0xf0
> > [    1.731816]  hv_pci_probe+0x398/0x5f0
> > [    1.731817]  vmbus_probe+0x42/0xa0
> > [    1.731819]  really_probe+0xe5/0x3e0
> > [    1.731822]  __driver_probe_device+0x7e/0x170
> > [    1.731823]  driver_probe_device+0x23/0xa0
> > [    1.731824]  __device_attach_driver+0x92/0x130
> > [    1.731826]  bus_for_each_drv+0x8c/0xe0
> > [    1.731828]  __device_attach+0xc0/0x200
> > [    1.731830]  device_initial_probe+0x4c/0x50
> > [    1.731831]  bus_probe_device+0x32/0x90
> > [    1.731832]  device_add+0x65b/0x850
> > [    1.731836]  device_register+0x1f/0x30
> > [    1.731837]  vmbus_device_register+0x87/0x130
> > [    1.731840]  vmbus_add_channel_work+0x139/0x1a0
> > [    1.731841]  process_one_work+0x19f/0x3f0
> > [    1.731843]  worker_thread+0x188/0x2f0
> > [    1.731845]  kthread+0x119/0x230
> > [    1.731852]  ret_from_fork+0x1b4/0x1e0
> > [    1.731854]  ret_from_fork_asm+0x1a/0x30
> >
> >>


^ permalink raw reply

* Re: [PATCH 00/61] treewide: Use IS_ERR_OR_NULL over manual NULL check - refactor
From: Al Viro @ 2026-04-09 18:16 UTC (permalink / raw)
  To: Philipp Hahn
  Cc: amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel, dri-devel,
	gfs2, intel-gfx, intel-wired-lan, iommu, kvm, linux-arm-kernel,
	linux-block, linux-bluetooth, linux-btrfs, linux-cifs, linux-clk,
	linux-erofs, linux-ext4, linux-fsdevel, linux-gpio, linux-hyperv,
	linux-input, linux-kernel, linux-leds, linux-media, linux-mips,
	linux-mm, linux-modules, linux-mtd, linux-nfs, linux-omap,
	linux-phy, linux-pm, linux-rockchip, linux-s390, linux-scsi,
	linux-sctp, linux-security-module, linux-sh, linux-sound,
	linux-stm32, linux-trace-kernel, linux-usb, linux-wireless,
	netdev, ntfs3, samba-technical, sched-ext, target-devel,
	tipc-discussion, v9fs, Julia Lawall, Nicolas Palix, Chris Mason,
	David Sterba, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Theodore Ts'o, Andreas Dilger, Steve French, Paulo Alcantara,
	Ronnie Sahlberg, Shyam Prasad N, Tom Talpey, Bharath SM,
	Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu,
	Sandeep Dhavale, Hongbo Li, Chunhai Guo, Miklos Szeredi,
	Konstantin Komarov, Andreas Gruenbacher, Kees Cook, Tony Luck,
	Guilherme G. Piccoli, Jan Kara, Phillip Lougher,
	Christian Brauner, Jan Kara, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
	Sami Tolvanen, Aaron Tomlin, Sylwester Nawrocki, Liam Girdwood,
	Mark Brown, Jaroslav Kysela, Takashi Iwai, Max Filippov,
	Paolo Bonzini, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andrew Morton, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Benjamin Marzinski, David S. Miller, David Ahern,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Marcel Holtmann, Johan Hedberg, Luiz Augusto von Dentz,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Jamal Hadi Salim, Jiri Pirko,
	Marcelo Ricardo Leitner, Xin Long, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Jon Maloy, Johannes Berg,
	Catalin Marinas, Russell King, John Crispin, Thomas Bogendoerfer,
	Yoshinori Sato, Rich Felker, John Paul Adrian Glaubitz,
	Andrzej Hajda, Neil Armstrong, Robert Foss, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Zhenyu Wang,
	Zhi Wang, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, Alex Deucher, Christian König, Sandy Huang,
	Heiko Stübner, Andy Yan, Igor Russkikh, Andrew Lunn,
	Pavan Chebbi, Michael Chan, Potnuri Bharat Teja, Tony Nguyen,
	Przemek Kitszel, Taras Chornyi, Maxime Coquelin, Alexandre Torgue,
	Iyappan Subramanian, Keyur Chudgar, Quan Nguyen, Heiner Kallweit,
	Marc Zyngier, Thomas Gleixner, Andrew Lunn, Gregory Clement,
	Sebastian Hesselbarth, Vinod Koul, Linus Walleij, Ulf Hansson,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Martin K. Petersen,
	Eduardo Valentin, Keerthy, Rafael J. Wysocki, Daniel Lezcano,
	Zhang Rui, Lukasz Luba, Alex Williamson, Mark Greer,
	Miquel Raynal, Richard Weinberger, Vignesh Raghavendra,
	Shuah Khan, Kieran Bingham, Mauro Carvalho Chehab, Joerg Roedel,
	Will Deacon, Robin Murphy, Lee Jones, Pavel Machek, Dave Penkler,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Justin Sanders, Jens Axboe, Georgi Djakov, Michael Turquette,
	Stephen Boyd, Philipp Zabel, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Pali Rohár, Dmitry Torokhov
In-Reply-To: <20260310-b4-is_err_or_null-v1-0-bd63b656022d@avm.de>

On Tue, Mar 10, 2026 at 12:48:26PM +0100, Philipp Hahn wrote:
> While doing some static code analysis I stumbled over a common pattern,
> where IS_ERR() is combined with a NULL check. For that there is
> IS_ERR_OR_NULL().

... and valid uses of IS_ERR_OR_NULL are rare as hen teeth.
Most of those are "I'm not sure how this function returns an
error, let's use that just in case".

Please, do not introduce more of that crap.

^ permalink raw reply

* [PATCH v3 7/7] mshv: Allocate pfns array only for pinned regions
From: Stanislav Kinsburskii @ 2026-04-09 15:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177574802240.19719.4873018419452139691.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Convert pfns to a pointer allocated only for pinned regions that
actually need it for share/unshare/evict operations. Unpinned
regions use NULL since HMM handles their mappings dynamically.

The pfns array was previously a flexible array member, forcing
allocation for all regions regardless of memory type. This wastes
significant memory for unpinned HMM-managed regions which don't
need persistent PFN tracking - a 1GB region wastes 2MB for an
unused array.

This also allows using kzalloc for the main structure instead of
vzalloc, improving allocation efficiency and cache locality.

Simplify unpinned region invalidation by calling the hypervisor
directly rather than tracking PFNs. The tradeoff of skipping huge
page optimization is acceptable since invalidation ranges are
typically small and not performance-critical.

Add NULL checks where pfns array is required and update cleanup
to handle conditional allocation.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   61 ++++++++++++++++++++++++---------------------
 drivers/hv/mshv_root.h    |    6 +++-
 2 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 5a1a06ee83d2..44eb6dfd7142 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -243,7 +243,7 @@ struct mshv_region *mshv_region_create(enum mshv_region_type type,
 	int ret = 0;
 	u64 i;
 
-	region = vzalloc(sizeof(*region) + sizeof(unsigned long) * nr_pfns);
+	region = kzalloc_obj(*region);
 	if (!region)
 		return ERR_PTR(-ENOMEM);
 
@@ -255,6 +255,13 @@ struct mshv_region *mshv_region_create(enum mshv_region_type type,
 						   &mshv_region_mni_ops);
 		break;
 	case MSHV_REGION_TYPE_MEM_PINNED:
+		region->mreg_pfns = vmalloc_array(nr_pfns, sizeof(unsigned long));
+		if (!region->mreg_pfns) {
+			ret = -ENOMEM;
+			break;
+		}
+		for (i = 0; i < nr_pfns; i++)
+			region->mreg_pfns[i] = MSHV_INVALID_PFN;
 		break;
 	case MSHV_REGION_TYPE_MMIO:
 		region->mreg_mmio_pfn = mmio_pfn;
@@ -276,16 +283,13 @@ struct mshv_region *mshv_region_create(enum mshv_region_type type,
 	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
 		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
 
-	for (i = 0; i < nr_pfns; i++)
-		region->mreg_pfns[i] = MSHV_INVALID_PFN;
-
 	mutex_init(&region->mreg_mutex);
 	kref_init(&region->mreg_refcount);
 
 	return region;
 
 free_region:
-	vfree(region);
+	kfree(region);
 	return ERR_PTR(ret);
 }
 
@@ -312,6 +316,9 @@ static int mshv_region_share(struct mshv_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
+	if (!region->mreg_pfns)
+		return -EINVAL;
+
 	return mshv_region_process_range(region, flags,
 					 0, region->nr_pfns,
 					 region->mreg_pfns,
@@ -340,6 +347,9 @@ static int mshv_region_unshare(struct mshv_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
+	if (!region->mreg_pfns)
+		return -EINVAL;
+
 	return mshv_region_process_range(region, flags,
 					 0, region->nr_pfns,
 					 region->mreg_pfns,
@@ -380,27 +390,19 @@ static int mshv_region_remap_pfns(struct mshv_region *region,
 					 mshv_region_chunk_remap);
 }
 
-static int mshv_region_map(struct mshv_region *region)
-{
-	u32 map_flags = region->hv_map_flags;
-
-	return mshv_region_remap_pfns(region, map_flags,
-				      0, region->nr_pfns,
-				      region->mreg_pfns);
-}
-
 static void mshv_region_invalidate_pfns(struct mshv_region *region,
 					u64 pfn_offset, u64 pfn_count)
 {
 	u64 i;
 
+	if (region->mreg_type != MSHV_REGION_TYPE_MEM_PINNED)
+		return;
+
 	for (i = pfn_offset; i < pfn_offset + pfn_count; i++) {
 		if (!pfn_valid(region->mreg_pfns[i]))
 			continue;
 
-		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
-			unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
-
+		unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
 		region->mreg_pfns[i] = MSHV_INVALID_PFN;
 	}
 }
@@ -517,7 +519,9 @@ static void mshv_region_destroy(struct kref *ref)
 
 	mshv_region_invalidate(region);
 
-	vfree(region);
+	if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
+		vfree(region->mreg_pfns);
+	kfree(region);
 }
 
 void mshv_region_put(struct mshv_region *region)
@@ -627,10 +631,9 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_region *region,
  *   leaving missing pages as invalid PFN markers.
  *   Used for initial region setup.
  *
- * Collected PFNs are stored in region->mreg_pfns[] with HMM bookkeeping
- * flags cleared, then the range is mapped into the hypervisor. Present
- * PFNs get mapped with region access permissions; missing PFNs (zero
- * entries) get mapped with no-access permissions.
+ * HMM bookkeeping flags are stripped from collected PFNs before mapping.
+ * Present PFNs get mapped with region access permissions; missing PFNs
+ * (marked as MSHV_INVALID_PFN) get mapped with no-access permissions.
  *
  * Return: 0 on success, negative errno on failure.
  */
@@ -659,15 +662,17 @@ static int mshv_region_collect_and_map(struct mshv_region *region,
 		goto out;
 
 	for (i = 0; i < pfn_count; i++) {
-		if (!(pfns[i] & HMM_PFN_VALID))
+		if (!(pfns[i] & HMM_PFN_VALID)) {
+			pfns[i] = MSHV_INVALID_PFN;
 			continue;
+		}
 		/* Drop HMM_PFN_* flags to ensure PFNs are valid. */
-		region->mreg_pfns[pfn_offset + i] = pfns[i] & ~HMM_PFN_FLAGS;
+		pfns[i] &= ~HMM_PFN_FLAGS;
 	}
 
 	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
 				     pfn_offset, pfn_count,
-				     region->mreg_pfns + pfn_offset);
+				     pfns);
 
 	mutex_unlock(&region->mreg_mutex);
 out:
@@ -792,8 +797,6 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 	if (ret)
 		goto out_unlock;
 
-	mshv_region_invalidate_pfns(region, pfn_offset, pfn_count);
-
 	mutex_unlock(&region->mreg_mutex);
 
 	return true;
@@ -856,7 +859,9 @@ static int mshv_map_pinned_region(struct mshv_region *region)
 		}
 	}
 
-	ret = mshv_region_map(region);
+	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
+				     0, region->nr_pfns,
+				     region->mreg_pfns);
 	if (!ret)
 		return 0;
 
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 97659ba55418..e43bdbf1ada8 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -92,8 +92,10 @@ struct mshv_region {
 	enum mshv_region_type mreg_type;
 	struct mmu_interval_notifier mreg_mni;
 	struct mutex mreg_mutex;	/* protects region PFNs remapping */
-	u64 mreg_mmio_pfn;
-	unsigned long mreg_pfns[];
+	union {
+		unsigned long *mreg_pfns;
+		u64 mreg_mmio_pfn;
+	};
 };
 
 struct mshv_irq_ack_notifier {



^ permalink raw reply related

* [PATCH v3 6/7] mshv: Simplify pfn array handling in region processing
From: Stanislav Kinsburskii @ 2026-04-09 15:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177574802240.19719.4873018419452139691.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The current code requires passing both the full pfn array and an offset
parameter to region processing functions, forcing callees to manually
index into arrays. This approach is inflexible and makes it difficult
to work with different sources of pfn arrays.

Upcoming changes will need to pass pfn arrays obtained from the HMM
framework directly to these functions. The HMM framework returns arrays
that represent specific ranges rather than full region arrays with
offsets, making the current offset-based indexing pattern incompatible.

Refactor by having callers pass pre-offset pointers to pfn arrays and
removing offset-based indexing from callees. This allows functions to
work with any pfn array starting at index 0, regardless of its source,
and prepares the code for HMM integration.

No functional change intended.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 1c318d1020fc..5a1a06ee83d2 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -92,7 +92,7 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 	unsigned long pfn;
 	int stride, ret;
 
-	pfn = pfns[pfn_offset];
+	pfn = pfns[0];
 	if (!pfn_valid(pfn))
 		return -EINVAL;
 
@@ -102,7 +102,7 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 
 	/* Start at stride since the first stride is validated */
 	for (count = stride; count < pfn_count ; count += stride) {
-		pfn = pfns[pfn_offset + count];
+		pfn = pfns[count];
 
 		/* Break if current pfn is invalid */
 		if (!pfn_valid(pfn))
@@ -157,7 +157,7 @@ static long mshv_region_process_chunk(struct mshv_region *region,
 				      unsigned long *pfns,
 				      pfn_handler_t handler)
 {
-	if (pfn_valid(pfns[pfn_offset]))
+	if (pfn_valid(pfns[0]))
 		return mshv_region_process_pfns(region, flags,
 				pfn_offset, pfn_count, pfns,
 				handler);
@@ -204,10 +204,7 @@ static int mshv_region_process_range(struct mshv_region *region,
 	if (end > region->nr_pfns)
 		return -EINVAL;
 
-	start = pfn_offset;
-	end = pfn_offset + 1;
-
-	while (end < pfn_offset + pfn_count) {
+	for (start = 0, end = 1; end < pfn_count; ) {
 		/*
 		 * Accumulate contiguous pfns with the same validity
 		 * (valid or not).
@@ -218,8 +215,9 @@ static int mshv_region_process_range(struct mshv_region *region,
 		}
 
 		ret = mshv_region_process_chunk(region, flags,
-						start, end - start,
-						pfns, handler);
+						pfn_offset + start,
+						end - start,
+						pfns + start, handler);
 		if (ret < 0)
 			return ret;
 
@@ -227,8 +225,9 @@ static int mshv_region_process_range(struct mshv_region *region,
 	}
 
 	ret = mshv_region_process_chunk(region, flags,
-					start, end - start,
-					pfns, handler);
+					pfn_offset + start,
+					end - start,
+					pfns + start, handler);
 	if (ret < 0)
 		return ret;
 
@@ -296,15 +295,14 @@ static int mshv_region_chunk_share(struct mshv_region *region,
 				   unsigned long *pfns,
 				   bool huge_page)
 {
-	if (!pfn_valid(pfns[pfn_offset]))
+	if (!pfn_valid(pfns[0]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      pfns + pfn_offset,
-					      pfn_count,
+					      pfns, pfn_count,
 					      HV_MAP_GPA_READABLE |
 					      HV_MAP_GPA_WRITABLE,
 					      flags, true);
@@ -326,15 +324,15 @@ static int mshv_region_chunk_unshare(struct mshv_region *region,
 				     unsigned long *pfns,
 				     bool huge_page)
 {
-	if (!pfn_valid(pfns[pfn_offset]))
+	if (!pfn_valid(pfns[0]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      pfns + pfn_offset,
-					      pfn_count, 0,
+					      pfns, pfn_count,
+					      0,
 					      flags, false);
 }
 
@@ -359,7 +357,7 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	 * hypervisor track dirty pages, enabling precopy live
 	 * migration.
 	 */
-	if (!pfn_valid(pfns[pfn_offset]))
+	if (!pfn_valid(pfns[0]))
 		flags = HV_MAP_GPA_NO_ACCESS;
 
 	if (huge_page)
@@ -368,7 +366,7 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	return hv_call_map_ram_pfns(region->partition->pt_id,
 				    region->start_gfn + pfn_offset,
 				    pfn_count, flags,
-				    pfns + pfn_offset);
+				    pfns);
 }
 
 static int mshv_region_remap_pfns(struct mshv_region *region,
@@ -669,7 +667,7 @@ static int mshv_region_collect_and_map(struct mshv_region *region,
 
 	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
 				     pfn_offset, pfn_count,
-				     region->mreg_pfns);
+				     region->mreg_pfns + pfn_offset);
 
 	mutex_unlock(&region->mreg_mutex);
 out:



^ permalink raw reply related

* [PATCH v3 5/7] mshv: Pass pfns array explicitly through processing chain
From: Stanislav Kinsburskii @ 2026-04-09 15:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177574802240.19719.4873018419452139691.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The current implementation relies on accessing region->pfns directly
within the pfn processing chain, making it difficult to use these
handlers with alternative pfn sources. This tight coupling limits
flexibility when processing pfns from different locations, such as
temporary arrays or external sources.

By threading the pfns pointer through the entire processing chain
(mshv_region_process_range, mshv_region_process_chunk, and all
handlers), we decouple the processing logic from the storage location.
This enables future enhancements like processing pfns from multiple
sources or implementing more sophisticated memory management strategies
without duplicating the core processing logic.

No functional change intended.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   61 +++++++++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 24 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index f209a34afb3a..1c318d1020fc 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -22,7 +22,7 @@
 
 typedef int (*pfn_handler_t)(struct mshv_region *region, u32 flags,
 			     u64 pfn_offset, u64 pfn_count,
-			     bool huge_page);
+			     unsigned long *pfns, bool huge_page);
 
 static const struct mmu_interval_notifier_ops mshv_region_mni_ops;
 
@@ -84,6 +84,7 @@ static int mshv_chunk_stride(struct page *page,
 static long mshv_region_process_pfns(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
+				     unsigned long *pfns,
 				     pfn_handler_t handler)
 {
 	u64 gfn = region->start_gfn + pfn_offset;
@@ -91,7 +92,7 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 	unsigned long pfn;
 	int stride, ret;
 
-	pfn = region->mreg_pfns[pfn_offset];
+	pfn = pfns[pfn_offset];
 	if (!pfn_valid(pfn))
 		return -EINVAL;
 
@@ -101,7 +102,7 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 
 	/* Start at stride since the first stride is validated */
 	for (count = stride; count < pfn_count ; count += stride) {
-		pfn = region->mreg_pfns[pfn_offset + count];
+		pfn = pfns[pfn_offset + count];
 
 		/* Break if current pfn is invalid */
 		if (!pfn_valid(pfn))
@@ -114,7 +115,7 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 			break;
 	}
 
-	ret = handler(region, flags, pfn_offset, count, stride > 1);
+	ret = handler(region, flags, pfn_offset, count, pfns, stride > 1);
 	if (ret)
 		return ret;
 
@@ -138,11 +139,12 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 static long mshv_region_process_hole(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
+				     unsigned long *pfns,
 				     pfn_handler_t handler)
 {
 	long ret;
 
-	ret = handler(region, flags, pfn_offset, pfn_count, 0);
+	ret = handler(region, flags, pfn_offset, pfn_count, pfns, 0);
 	if (ret)
 		return ret;
 
@@ -152,15 +154,16 @@ static long mshv_region_process_hole(struct mshv_region *region,
 static long mshv_region_process_chunk(struct mshv_region *region,
 				      u32 flags,
 				      u64 pfn_offset, u64 pfn_count,
+				      unsigned long *pfns,
 				      pfn_handler_t handler)
 {
-	if (pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (pfn_valid(pfns[pfn_offset]))
 		return mshv_region_process_pfns(region, flags,
-				pfn_offset, pfn_count,
+				pfn_offset, pfn_count, pfns,
 				handler);
 	else
 		return mshv_region_process_hole(region, flags,
-				pfn_offset, pfn_count,
+				pfn_offset, pfn_count, pfns,
 				handler);
 }
 
@@ -170,12 +173,13 @@ static long mshv_region_process_chunk(struct mshv_region *region,
  * @flags     : Flags to pass to the handler.
  * @pfn_offset: Offset into the region's PFNs array to start processing.
  * @pfn_count : Number of PFNs to process.
+ * @pfns      : Pointer to an array of PFNs corresponding to the region.
  * @handler   : Callback function to handle each chunk of contiguous
  *              valid PFNs.
  *
- * Iterates over the specified range of PFNs in @region, skipping
- * invalid PFNs. For each contiguous chunk of valid PFNS, invokes
- * @handler via mshv_region_process_pfns.
+ * Iterates over the specified range of PFNs, skipping invalid PFNs.
+ * For each contiguous chunk of valid PFNS, invokes @handler via
+ * mshv_region_process_pfns.
  *
  * Note: The @handler callback must be able to handle PFNs backed by both
  * normal and huge pages.
@@ -185,6 +189,7 @@ static long mshv_region_process_chunk(struct mshv_region *region,
 static int mshv_region_process_range(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
+				     unsigned long *pfns,
 				     pfn_handler_t handler)
 {
 	u64 start, end;
@@ -207,15 +212,14 @@ static int mshv_region_process_range(struct mshv_region *region,
 		 * Accumulate contiguous pfns with the same validity
 		 * (valid or not).
 		 */
-		if (pfn_valid(region->mreg_pfns[start]) ==
-		    pfn_valid(region->mreg_pfns[end])) {
+		if (pfn_valid(pfns[start]) == pfn_valid(pfns[end])) {
 			end++;
 			continue;
 		}
 
 		ret = mshv_region_process_chunk(region, flags,
 						start, end - start,
-						handler);
+						pfns, handler);
 		if (ret < 0)
 			return ret;
 
@@ -224,7 +228,7 @@ static int mshv_region_process_range(struct mshv_region *region,
 
 	ret = mshv_region_process_chunk(region, flags,
 					start, end - start,
-					handler);
+					pfns, handler);
 	if (ret < 0)
 		return ret;
 
@@ -289,16 +293,17 @@ struct mshv_region *mshv_region_create(enum mshv_region_type type,
 static int mshv_region_chunk_share(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
+				   unsigned long *pfns,
 				   bool huge_page)
 {
-	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (!pfn_valid(pfns[pfn_offset]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pfns + pfn_offset,
+					      pfns + pfn_offset,
 					      pfn_count,
 					      HV_MAP_GPA_READABLE |
 					      HV_MAP_GPA_WRITABLE,
@@ -311,22 +316,24 @@ static int mshv_region_share(struct mshv_region *region)
 
 	return mshv_region_process_range(region, flags,
 					 0, region->nr_pfns,
+					 region->mreg_pfns,
 					 mshv_region_chunk_share);
 }
 
 static int mshv_region_chunk_unshare(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
+				     unsigned long *pfns,
 				     bool huge_page)
 {
-	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (!pfn_valid(pfns[pfn_offset]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pfns + pfn_offset,
+					      pfns + pfn_offset,
 					      pfn_count, 0,
 					      flags, false);
 }
@@ -337,12 +344,14 @@ static int mshv_region_unshare(struct mshv_region *region)
 
 	return mshv_region_process_range(region, flags,
 					 0, region->nr_pfns,
+					 region->mreg_pfns,
 					 mshv_region_chunk_unshare);
 }
 
 static int mshv_region_chunk_remap(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
+				   unsigned long *pfns,
 				   bool huge_page)
 {
 	/*
@@ -350,7 +359,7 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	 * hypervisor track dirty pages, enabling precopy live
 	 * migration.
 	 */
-	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (!pfn_valid(pfns[pfn_offset]))
 		flags = HV_MAP_GPA_NO_ACCESS;
 
 	if (huge_page)
@@ -359,15 +368,17 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	return hv_call_map_ram_pfns(region->partition->pt_id,
 				    region->start_gfn + pfn_offset,
 				    pfn_count, flags,
-				    region->mreg_pfns + pfn_offset);
+				    pfns + pfn_offset);
 }
 
 static int mshv_region_remap_pfns(struct mshv_region *region,
 				  u32 map_flags,
-				  u64 pfn_offset, u64 pfn_count)
+				  u64 pfn_offset, u64 pfn_count,
+				  unsigned long *pfns)
 {
 	return mshv_region_process_range(region, map_flags,
 					 pfn_offset, pfn_count,
+					 pfns,
 					 mshv_region_chunk_remap);
 }
 
@@ -376,7 +387,8 @@ static int mshv_region_map(struct mshv_region *region)
 	u32 map_flags = region->hv_map_flags;
 
 	return mshv_region_remap_pfns(region, map_flags,
-				      0, region->nr_pfns);
+				      0, region->nr_pfns,
+				      region->mreg_pfns);
 }
 
 static void mshv_region_invalidate_pfns(struct mshv_region *region,
@@ -656,7 +668,8 @@ static int mshv_region_collect_and_map(struct mshv_region *region,
 	}
 
 	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
-				     pfn_offset, pfn_count);
+				     pfn_offset, pfn_count,
+				     region->mreg_pfns);
 
 	mutex_unlock(&region->mreg_mutex);
 out:



^ permalink raw reply related

* [PATCH v3 4/7] mshv: Optimize memory region mapping operations
From: Stanislav Kinsburskii @ 2026-04-09 15:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177574802240.19719.4873018419452139691.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Two specific operations don't require PFN iteration: region unmapping
and region remapping with no access. For unmapping, all frames in MSHV
memory regions are guaranteed to be mapped with page access, so we can
unmap them all without checking individual PFNs. For remapping with no
access, all frames are already mapped with page access, allowing us to
unmap them all in one pass.

Since neither operation needs PFN validation, iterating over PFNs is
redundant. Batch operations into large page-aligned chunks followed by
remaining pages. This eliminates PFN traversal for these operations,
requires no additional hypercalls compared to the PFN-checking approach,
and provides the simplest possible sequential execution path.

The optimization utilizes HV_MAP_GPA_LARGE_PAGE and
HV_UNMAP_GPA_LARGE_PAGE flags for aligned portions, processing only the
remainder with base page granularity. This removes
mshv_region_chunk_unmap() and eliminates PFN iteration for unmap and
no-access operations, reducing code complexity.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   87 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 68 insertions(+), 19 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 2c4215381e0b..f209a34afb3a 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -449,27 +449,38 @@ static int mshv_region_pin(struct mshv_region *region)
 	return ret < 0 ? ret : -ENOMEM;
 }
 
-static int mshv_region_chunk_unmap(struct mshv_region *region,
-				   u32 flags,
-				   u64 pfn_offset, u64 pfn_count,
-				   bool huge_page)
+static int mshv_region_unmap(struct mshv_region *region)
 {
-	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
-		return 0;
+	u64 gfn, nr_pfns, starting_pfns, aligned_pfns, remaining_pfns;
+	int ret = 0;
 
-	if (huge_page)
-		flags |= HV_UNMAP_GPA_LARGE_PAGE;
+	gfn = region->start_gfn;
+	nr_pfns = region->nr_pfns;
 
-	return hv_call_unmap_pfns(region->partition->pt_id,
-				  region->start_gfn + pfn_offset,
-				  pfn_count, flags);
-}
+	starting_pfns = min(ALIGN(gfn, PTRS_PER_PMD) - gfn, nr_pfns);
+	aligned_pfns = ALIGN_DOWN(nr_pfns - starting_pfns, PTRS_PER_PMD);
+	remaining_pfns = nr_pfns - aligned_pfns - starting_pfns;
 
-static int mshv_region_unmap(struct mshv_region *region)
-{
-	return mshv_region_process_range(region, 0,
-					 0, region->nr_pfns,
-					 mshv_region_chunk_unmap);
+	if (starting_pfns)
+		ret = hv_call_unmap_pfns(region->partition->pt_id,
+					 gfn, starting_pfns,
+					 0);
+
+	gfn += starting_pfns;
+
+	if (!ret && aligned_pfns)
+		ret = hv_call_unmap_pfns(region->partition->pt_id,
+					 gfn, aligned_pfns,
+					 HV_UNMAP_GPA_LARGE_PAGE);
+
+	gfn += aligned_pfns;
+
+	if (!ret && remaining_pfns)
+		ret = hv_call_unmap_pfns(region->partition->pt_id,
+					 gfn, remaining_pfns,
+					 0);
+
+	return ret;
 }
 
 static void mshv_region_destroy(struct kref *ref)
@@ -684,6 +695,45 @@ bool mshv_region_handle_gfn_fault(struct mshv_region *region, u64 gfn)
 	return !ret;
 }
 
+static int mshv_region_map_no_access(struct mshv_region *region,
+				     u64 pfn_offset, u64 pfn_count)
+{
+	u64 gfn, nr_pfns, starting_pfns, aligned_pfns, remaining_pfns;
+	int ret = 0;
+
+	gfn = region->start_gfn + pfn_offset;
+	nr_pfns = pfn_count;
+
+	starting_pfns = min(ALIGN(gfn, PTRS_PER_PMD) - gfn, nr_pfns);
+	aligned_pfns = ALIGN_DOWN(nr_pfns - starting_pfns, PTRS_PER_PMD);
+	remaining_pfns = nr_pfns - aligned_pfns - starting_pfns;
+
+	if (starting_pfns)
+		ret = hv_call_map_ram_pfns(region->partition->pt_id,
+					   gfn, starting_pfns,
+					   HV_MAP_GPA_NO_ACCESS,
+					   NULL);
+
+	gfn += starting_pfns;
+
+	if (!ret && aligned_pfns)
+		ret = hv_call_map_ram_pfns(region->partition->pt_id,
+					   gfn, aligned_pfns,
+					   HV_MAP_GPA_NO_ACCESS |
+					   HV_MAP_GPA_LARGE_PAGE,
+					   NULL);
+
+	gfn += aligned_pfns;
+
+	if (!ret && remaining_pfns)
+		ret = hv_call_map_ram_pfns(region->partition->pt_id,
+					   gfn, remaining_pfns,
+					   HV_MAP_GPA_NO_ACCESS,
+					   NULL);
+
+	return ret;
+}
+
 /**
  * mshv_region_interval_invalidate - Invalidate a range of memory region
  * @mni: Pointer to the mmu_interval_notifier structure
@@ -727,8 +777,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 
 	mmu_interval_set_seq(mni, cur_seq);
 
-	ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
-				     pfn_offset, pfn_count);
+	ret = mshv_region_map_no_access(region, pfn_offset, pfn_count);
 	if (ret)
 		goto out_unlock;
 



^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox