Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* [PATCH net v2 4/4] net: mana: Fix EQ leak in mana_remove on NULL port
From: Erni Sri Satya Vennela @ 2026-04-13  5:08 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
	shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260413050843.605789-1-ernis@linux.microsoft.com>

In mana_remove(), when a NULL port is encountered in the port iteration
loop, 'goto out' skips the mana_destroy_eq(ac) call, leaking the event
queues allocated earlier by mana_create_eq().

This can happen when mana_probe_port() fails for port 0, leaving
ac->ports[0] as NULL. On driver unload or error cleanup, mana_remove()
hits the NULL entry and jumps past mana_destroy_eq().

Change 'goto out' to 'break' so the for-loop exits normally and
mana_destroy_eq() is always reached. Remove the now-unreferenced out:
label.

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v2:
* Apply the patch in net instead of net-next.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1a141c46ac27..97237d137cbf 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3747,7 +3747,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 		if (!ndev) {
 			if (i == 0)
 				dev_err(dev, "No net device to remove\n");
-			goto out;
+			break;
 		}
 
 		apc = netdev_priv(ndev);
@@ -3778,7 +3778,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 	}
 
 	mana_destroy_eq(ac);
-out:
+
 	if (ac->per_port_queue_reset_wq) {
 		destroy_workqueue(ac->per_port_queue_reset_wq);
 		ac->per_port_queue_reset_wq = NULL;
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH v4 19/21] uio: replace deprecated mmap hook with mmap_prepare in uio_info
From: Shinichiro Kawasaki @ 2026-04-13  5:14 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-stm32@st-md-mailman.stormreply.com,
	linux-arm-kernel@lists.infradead.org,
	linux-mtd@lists.infradead.org, linux-staging@lists.linux.dev,
	linux-scsi@vger.kernel.org, target-devel@vger.kernel.org,
	linux-afs@lists.infradead.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Ryan Roberts
In-Reply-To: <157583e4477705b496896c7acd4ac88a937b8fa6.1774045440.git.ljs@kernel.org>

On Mar 20, 2026 / 22:39, Lorenzo Stoakes (Oracle) wrote:
> The f_op->mmap interface is deprecated, so update uio_info to use its
> successor, mmap_prepare.
> 
> Therefore, replace the uio_info->mmap hook with a new
> uio_info->mmap_prepare hook, and update its one user, target_core_user,
> to both specify this new mmap_prepare hook and also to use the new
> vm_ops->mapped() hook to continue to maintain a correct udev->kref
> refcount.
> 
> Then update uio_mmap() to utilise the mmap_prepare compatibility layer to
> invoke this callback from the uio mmap invocation.
> 
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Hello Lorenzo, since two weeks ago, I observe a failure during my kernel test
set targeting Linux for-next branch. On failure, kernel reported a WARN at
__vma_check_mmap_hook [1]. I bisected and found that this patch is the trigger.
Here I share my observations of the failure. Actions or advices for fix will be
appreciated.


The failure happens when TCMU device is set up with targetcli. When tcmu-runner
is running, the command lines below should successfully create a backstore for a
TCMU device, but it fails.

  $ sudo targetcli
  targetcli shell version 3.0.1
  Copyright 2011-2013 by Datera, Inc and others.
  For help on commands, type 'help'.
  
  /> cd /backstores/user:zbc
  /backstores/user:zbc> create name=test size=1M cfgstring=@/tmp/tmp.img
  UserBackedStorageObject creation failed.

On failure, tcmu-runner reports mmap failures:

  2026-04-13 12:23:49.271 1103 [CRIT] main:1302: Starting...
  2026-04-13 12:23:49.461 1103 [INFO] load_our_module:575: Inserted module 'target_core_user'
  2026-04-13 12:23:49.480 1103 [INFO] tcmur_register_handler:92: Handler fbo is registered
  2026-04-13 12:23:49.486 1103 [INFO] tcmur_register_handler:92: Handler zbc is registered
  2026-04-13 12:23:51.202 1103 [INFO] tcmur_register_handler:92: Handler rbd is registered
  2026-04-13 12:27:24.522 1103 [ERROR] device_open_shm:523: could not mmap /dev/uio0
  2026-04-13 12:27:24.550 1103 [ERROR] device_open_shm:523: could not mmap /dev/uio0

The failure was found with user:zbc handler. I confirmed that the failure is
recreated with fbo handler also. Then, this failure looks common for all TCMU
users.

At the failrue, kernel reported the WARN at __vma_check_mmap_hook [1]. The line
1287 in mm/util.c reported the WARN:

  1284 int __vma_check_mmap_hook(struct vm_area_struct *vma)
  1285 {
  1286         /* vm_ops->mapped is not valid if mmap() is specified. */
  1287         if (vma->vm_ops && WARN_ON_ONCE(vma->vm_ops->mapped))
  1288                 return -EINVAL;
  1289
  1290         return 0;
  1291 }
  1292 EXPORT_SYMBOL(__vma_check_mmap_hook);

When I reverted the commit from the kernel tag next-20260409, the failrue
disappeared.

If other information is required for fix, please let me know. Thanks in advance.


[1] dmesg

WARNING: mm/util.c:1287 at __vma_check_mmap_hook+0x61/0x90, CPU#0: tcmu-runner/1332
Modules linked in: target_core_pscsi target_core_file target_core_iblock xfs target_core_user target_core_mod rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security nf_tables ip6table_filter ip6_tables iptable_filter ip_tables qrtr irdma intel_rapl_msr intel_rapl_common ice intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common libie_fwlog nfit sunrpc gnss idpf libnvdimm x86_pkg_temp_thermal libeth_xdp intel_powerclamp libeth ib_core spi_nor mtd coretemp kvm_intel kvm i40e iTCO_wdt irqbypass intel_pmc_bxt rapl ses vfat intel_cstate libie fat intel_uncore libie_adminq enclosure i2c_i801 spi_intel_pci i2c_smbus spi_intel lpc_ich wmi joydev mei_me ioatdma acpi_power_meter acpi_pad mei intel_pch_thermal dca fuse loop dm_multipath nfnetlink zram
 lz4hc_compress lz4_compress zstd_compress ast drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper nvme drm mpi3mr nvme_core scsi_transport_sas nvme_keyring nvme_auth scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkcs8_key_parser i2c_dev [last unloaded: null_blk]
CPU: 0 UID: 0 PID: 1332 Comm: tcmu-runner Not tainted 7.0.0-rc6-next-20260401-kts #1 PREEMPT(lazy) 
Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.5 05/18/2021
RIP: 0010:__vma_check_mmap_hook+0x61/0x90
Code: 00 00 00 00 fc ff df 48 8d 78 10 48 89 f9 48 c1 e9 03 80 3c 11 00 75 2a 48 83 78 10 00 75 0b 31 c0 48 83 c4 08 c3 cc cc cc cc <0f> 0b b8 ea ff ff ff eb ee 48 89 04 24 e8 6d 4c 1f 00 48 8b 04 24
RSP: 0018:ffff8881391f7488 EFLAGS: 00010282
RAX: ffffffffc2abca40 RBX: 0000000000000000 RCX: 1ffffffff855794a
RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffffffffc2abca50
RBP: ffff8881391f76a0 R08: ffffffffa10016e9 R09: ffffed102723ee44
R10: ffffed102723ee45 R11: 0000000000000000 R12: ffff8881391f78e0
R13: ffff8881391f78f0 R14: ffff88810d1ec780 R15: ffff8881391f7a78
FS:  00007f154f1a9840(0000) GS:ffff888e9b440000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5efd40fe88 CR3: 000000010cfbf005 CR4: 00000000007726f0
PKRU: 55555554
Call Trace:
 <TASK>
 __mmap_new_vma+0x116e/0x18d0
 ? __pfx___mmap_new_vma+0x10/0x10
 ? vma_merge_new_range+0x495/0xa00
 ? __pfx_vma_merge_new_range+0x10/0x10
 ? lock_acquire+0x126/0x140
 __mmap_region+0x651/0xa00
 ? __pfx_process_measurement+0x10/0x10
 ? __pfx___mmap_region+0x10/0x10
 ? __lock_acquire+0x55d/0xbd0
 ? __lock_acquire+0x55d/0xbd0
 ? lock_is_held_type+0x9a/0x110
 ? mas_find+0xc9/0x690
 ? arch_get_unmapped_area_topdown+0x2a7/0x890
 mmap_region+0x3c2/0x4c0
 ? __pfx_mmap_region+0x10/0x10
 ? security_mmap_addr+0x54/0xd0
 ? __get_unmapped_area+0x18c/0x300
 ? __pfx_uio_mmap+0x10/0x10
 do_mmap+0xa26/0x10f0
 ? lock_acquire+0x126/0x140
 ? __pfx_do_mmap+0x10/0x10
 ? __pfx_down_write_killable+0x10/0x10
 ? __lock_acquire+0x55d/0xbd0
 vm_mmap_pgoff+0x218/0x3a0
 ? __pfx_vm_mmap_pgoff+0x10/0x10
 ? __fget_files+0x1b4/0x2f0
 ksys_mmap_pgoff+0x229/0x570
 ? clockevents_program_event+0x144/0x370
 do_syscall_64+0xf4/0x1560
 ? do_syscall_64+0x1d7/0x1560
 ? __lock_release.isra.0+0x59/0x170
 ? do_syscall_64+0x34/0x1560
 ? lockdep_hardirqs_on_prepare.part.0+0x9b/0x140
 ? do_syscall_64+0x34/0x1560
 ? trace_hardirqs_on+0x19/0x1a0
 ? do_syscall_64+0xab/0x1560
 ? clear_bhb_loop+0x30/0x80
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f154f5728dc
Code: 1e fa 41 f7 c1 ff 0f 00 00 75 33 55 48 89 e5 41 54 41 89 cc 53 48 89 fb 48 85 ff 74 41 45 89 e2 48 89 df b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7c 5b 41 5c 5d c3 0f 1f 80 00 00 00 00 48 8b
RSP: 002b:00007ffeb6ac04f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f154f5728dc
RDX: 0000000000000003 RSI: 0000000040800000 RDI: 0000000000000000
RBP: 00007ffeb6ac0500 R08: 000000000000000c R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
R13: 000000001ddac980 R14: 00007f154fa629c0 R15: 00007f154fa62940
 </TASK>
irq event stamp: 62665
hardirqs last  enabled at (62679): [<ffffffff9e1cd23e>] __up_console_sem+0x5e/0x70
hardirqs last disabled at (62698): [<ffffffff9e1cd223>] __up_console_sem+0x43/0x70
softirqs last  enabled at (62692): [<ffffffff9dfc5d01>] handle_softirqs+0x5c1/0x8b0
softirqs last disabled at (62721): [<ffffffff9dfc6152>] __irq_exit_rcu+0x152/0x280
---[ end trace 0000000000000000 ]---
scsi host74: TCM_Loopback

^ permalink raw reply

* Re: [PATCH v4 19/21] uio: replace deprecated mmap hook with mmap_prepare in uio_info
From: Lorenzo Stoakes @ 2026-04-13  5:37 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-stm32@st-md-mailman.stormreply.com,
	linux-arm-kernel@lists.infradead.org,
	linux-mtd@lists.infradead.org, linux-staging@lists.linux.dev,
	linux-scsi@vger.kernel.org, target-devel@vger.kernel.org,
	linux-afs@lists.infradead.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Ryan Roberts
In-Reply-To: <adx2ws5z0NMIe5Yj@shinmob>

On Mon, Apr 13, 2026 at 05:14:08AM +0000, Shinichiro Kawasaki wrote:
> On Mar 20, 2026 / 22:39, Lorenzo Stoakes (Oracle) wrote:
> > The f_op->mmap interface is deprecated, so update uio_info to use its
> > successor, mmap_prepare.
> >
> > Therefore, replace the uio_info->mmap hook with a new
> > uio_info->mmap_prepare hook, and update its one user, target_core_user,
> > to both specify this new mmap_prepare hook and also to use the new
> > vm_ops->mapped() hook to continue to maintain a correct udev->kref
> > refcount.
> >
> > Then update uio_mmap() to utilise the mmap_prepare compatibility layer to
> > invoke this callback from the uio mmap invocation.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>
> Hello Lorenzo, since two weeks ago, I observe a failure during my kernel test
> set targeting Linux for-next branch. On failure, kernel reported a WARN at
> __vma_check_mmap_hook [1]. I bisected and found that this patch is the trigger.
> Here I share my observations of the failure. Actions or advices for fix will be
> appreciated.

Ugh yeah thanks, this actually needs to account for use of compatibility layer,
so probably we shouldn't even assert this as that isn't easily detectable.

I'll send a hotfix for this that can be bundled up with 7.1 patches.

Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH 1/8] hv: Select CONFIG_SYSFB only for CONFIG_HYPERV_VMBUS
From: Thomas Zimmermann @ 2026-04-13  7:17 UTC (permalink / raw)
  To: Saurabh Singh Sengar, javierm@redhat.com, arnd@arndb.de,
	ardb@kernel.org, ilias.apalodimas@linaro.org,
	chenhuacai@kernel.org, kernel@xen0n.name,
	maarten.lankhorst@linux.intel.com, mripard@kernel.org,
	airlied@gmail.com, simona@ffwll.ch, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, Long Li, deller@gmx.de
  Cc: linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
	linux-efi@vger.kernel.org, linux-riscv@lists.infradead.org,
	dri-devel@lists.freedesktop.org, linux-hyperv@vger.kernel.org,
	linux-fbdev@vger.kernel.org, Michael Kelley, Saurabh Sengar,
	stable@vger.kernel.org
In-Reply-To: <KUZP153MB14449BBE44CBAEEA7621A4A0BE51A@KUZP153MB1444.APCP153.PROD.OUTLOOK.COM>

Hi

Am 02.04.26 um 12:50 schrieb Saurabh Singh Sengar:
>> Hyperv's sysfb access only exists in the VMBUS support. Therefore only select
>> CONFIG_SYSFB for CONFIG_HYPERV_VMBUS. Avoids sysfb code on systems
>> that don't need it.
>>
>> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
>> Fixes: 96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V
>> guests")
>> Cc: Michael Kelley <mhklinux@outlook.com>
>> Cc: Saurabh Sengar <ssengar@linux.microsoft.com>
>> Cc: Wei Liu <wei.liu@kernel.org>
>> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
>> Cc: Haiyang Zhang <haiyangz@microsoft.com>
>> Cc: Dexuan Cui <decui@microsoft.com>
>> Cc: Long Li <longli@microsoft.com>
>> Cc: linux-hyperv@vger.kernel.org
>> Cc: <stable@vger.kernel.org> # v6.16+
>> ---
>>   drivers/hv/Kconfig | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig index
>> 7937ac0cbd0f..2d0b3fcb0ff8 100644
>> --- a/drivers/hv/Kconfig
>> +++ b/drivers/hv/Kconfig
>> @@ -9,7 +9,6 @@ config HYPERV
>>   	select PARAVIRT
>>   	select X86_HV_CALLBACK_VECTOR if X86
>>   	select OF_EARLY_FLATTREE if OF
>> -	select SYSFB if EFI && !HYPERV_VTL_MODE
>>   	select IRQ_MSI_LIB if X86
>>   	help
>>   	  Select this option to run Linux as a Hyper-V client operating @@ -62,6
>> +61,7 @@ config HYPERV_VMBUS
>>   	tristate "Microsoft Hyper-V VMBus driver"
>>   	depends on HYPERV
>>   	default HYPERV
>> +	select SYSFB if EFI && !HYPERV_VTL_MODE
>>   	help
>>   	  Select this option to enable Hyper-V Vmbus driver.
>>
>> --
>> 2.53.0
> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>

This fix is independent from the rest of the series. Do you want to 
merge it or can I take it into DRM trees?

Best regards
Thomas

>

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)



^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH 1/8] hv: Select CONFIG_SYSFB only for CONFIG_HYPERV_VMBUS
From: Saurabh Singh Sengar @ 2026-04-13  8:22 UTC (permalink / raw)
  To: Thomas Zimmermann, javierm@redhat.com, arnd@arndb.de,
	ardb@kernel.org, ilias.apalodimas@linaro.org,
	chenhuacai@kernel.org, kernel@xen0n.name,
	maarten.lankhorst@linux.intel.com, mripard@kernel.org,
	airlied@gmail.com, simona@ffwll.ch, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, Long Li, deller@gmx.de
  Cc: linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
	linux-efi@vger.kernel.org, linux-riscv@lists.infradead.org,
	dri-devel@lists.freedesktop.org, linux-hyperv@vger.kernel.org,
	linux-fbdev@vger.kernel.org, Michael Kelley, Saurabh Sengar,
	stable@vger.kernel.org, Wei Liu
In-Reply-To: <2fe8ce91-2dc5-4cf2-b7cf-d495e5cff14b@suse.de>

> Hi
> 
> Am 02.04.26 um 12:50 schrieb Saurabh Singh Sengar:
> >> Hyperv's sysfb access only exists in the VMBUS support. Therefore
> >> only select CONFIG_SYSFB for CONFIG_HYPERV_VMBUS. Avoids sysfb code
> >> on systems that don't need it.
> >>
> >> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> >> Fixes: 96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for
> >> Hyper-V
> >> guests")
> >> Cc: Michael Kelley <mhklinux@outlook.com>
> >> Cc: Saurabh Sengar <ssengar@linux.microsoft.com>
> >> Cc: Wei Liu <wei.liu@kernel.org>
> >> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> >> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> >> Cc: Dexuan Cui <decui@microsoft.com>
> >> Cc: Long Li <longli@microsoft.com>
> >> Cc: linux-hyperv@vger.kernel.org
> >> Cc: <stable@vger.kernel.org> # v6.16+
> >> ---
> >>   drivers/hv/Kconfig | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig index
> >> 7937ac0cbd0f..2d0b3fcb0ff8 100644
> >> --- a/drivers/hv/Kconfig
> >> +++ b/drivers/hv/Kconfig
> >> @@ -9,7 +9,6 @@ config HYPERV
> >>   	select PARAVIRT
> >>   	select X86_HV_CALLBACK_VECTOR if X86
> >>   	select OF_EARLY_FLATTREE if OF
> >> -	select SYSFB if EFI && !HYPERV_VTL_MODE
> >>   	select IRQ_MSI_LIB if X86
> >>   	help
> >>   	  Select this option to run Linux as a Hyper-V client operating @@
> >> -62,6
> >> +61,7 @@ config HYPERV_VMBUS
> >>   	tristate "Microsoft Hyper-V VMBus driver"
> >>   	depends on HYPERV
> >>   	default HYPERV
> >> +	select SYSFB if EFI && !HYPERV_VTL_MODE
> >>   	help
> >>   	  Select this option to enable Hyper-V Vmbus driver.
> >>
> >> --
> >> 2.53.0
> > Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> 
> This fix is independent from the rest of the series. Do you want to merge it or
> can I take it into DRM trees?

Please feel free to take it via DRM tree.
CC : Wei Liu

- Saurabh


^ permalink raw reply

* Re: [PATCH v2] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: Junrui Luo @ 2026-04-13  8:43 UTC (permalink / raw)
  To: vdso@mailbox.org, Stanislav Kinsburskii
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Nuno Das Neves, Anirudh Rayabharam, Mukesh Rathor, Muminul Islam,
	Praveen K Paladugu, Jinank Jain, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yuhao Jiang, stable@vger.kernel.org
In-Reply-To: <319614096.43465.1775883935863@app.mailbox.org>

On Fri, Apr 10, 2026 at 09:05:35PM -0800, vdso@mailbox.org wrote:
> All in all, from the three options of (generic check for overflow, simple check
> for arch bad PFNs/GFNs, an elaborated check with all specifics) I suggested the simple check.
> Fast and still more useful than checking for overflow in my opinion.
 
Thanks Roman for the thorough write-up. Since the original patch mixes
host and hypervisor-side constants with an unclear unit, IMO we should
do the bounds check in bytes instead.

For instance:

	u64 start_gpa, end_gpa;

	if (check_mul_overflow(mem->guest_pfn, HV_HYP_PAGE_SIZE,
						   &start_gpa) ||
		check_add_overflow(start_gpa, mem->size, &end_gpa) ||
		end_gpa > (1ULL << MAX_PHYSMEM_BITS))
		return -EINVAL;

Both sides of the final comparison are bytes, so no host-vs-hv page
unit conversion is needed.

In addition, it changes return value from -EOVERFLOW to -EINVAL.

Does this approach look reasonable? Happy to iterate if either of you
would prefer a different choice.

Thanks,
Junrui Luo


^ permalink raw reply

* Re: [PATCH 01/11] arch: arm64: Export arch_smp_send_reschedule for mshv_vtl module
From: Naman Jain @ 2026-04-13 11:44 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB41570A9050B3EB6A905DFF56D450A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/1/2026 10:24 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
> 
> Nit: For the patch "Subject", the most common prefix for the file
> arch/arm64/kernel/smp.c is "arm64: smp:".  I'd suggest using that
> prefix for historical consistency.

Acked. Will change in v2.

> 
>> mshv_vtl_main.c calls smp_send_reschedule() which expands to
>> arch_smp_send_reschedule(). When CONFIG_MSHV_VTL=m, the module cannot
>> access this symbol since it is not exported on arm64.
>>
>> smp_send_reschedule() is used in mshv_vtl_cancel() to interrupt a vCPU
>> thread running on another CPU. When a vCPU is looping in
>> mshv_vtl_ioctl_return_to_lower_vtl(), it checks a per-CPU cancel flag
>> before each VTL0 entry. Setting cancel=1 alone is not enough if the
>> target CPU thread is sleeping - the IPI from smp_send_reschedule() kicks
>> the remote CPU out of idle/sleep so it re-checks the cancel flag and
>> exits the loop promptly.
>>
>> Other architectures (riscv, loongarch, powerpc) already export this
>> symbol. Add the same EXPORT_SYMBOL_GPL for arm64. This is required
>> for adding arm64 support in MSHV_VTL.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/arm64/kernel/smp.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index 1aa324104afb..26b1a4456ceb 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -1152,6 +1152,7 @@ void arch_smp_send_reschedule(int cpu)
>>   {
>>   	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
>>   }
>> +EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);
>>
>>   #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
>>   void arch_send_wakeup_ipi(unsigned int cpu)
>> --
>> 2.43.0
>>
> 
> The "Subject" nit notwithstanding,
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>

Thanks,
Naman

^ permalink raw reply

* Re: [PATCH 03/11] Drivers: hv: Add support to setup percpu vmbus handler
From: Naman Jain @ 2026-04-13 11:45 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB41570E0F113FE28CFC839476D450A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/1/2026 10:25 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> Add a wrapper function - hv_setup_percpu_vmbus_handler(), similar to
>> hv_setup_vmbus_handler() to allow setting up custom per-cpu VMBus
>> interrupt handler. This is required for arm64 support, to be added
>> in MSHV_VTL driver, where per-cpu VMBus interrupt handler will be
>> set to mshv_vtl_vmbus_isr() for VTL2 (Virtual Trust Level 2).
> 
> Needing both hv_setup_vmbus_handler() and
> hv_setup_percpu_vmbus_handler() seems unfortunate. Here's an
> alternate approach to consider:
> 
> 1. I think the x86 VMBus sysvec handler and the vmbus_percpu_isr()
> functions could both use the same vmbus_handler global variable.
> Looking at your changes in this patch set, hv_setup_vmbus_handler()
> and hv_setup_percpu_vmbus_handler() are used together and always
> set the same value.
> 
> 2. So move the global variable vmbus_handler out from arch/x86
> and into hv_common.c, and export it. The x86 sysvec handler can
> still reference it, and vmbus_percpu_isr() in vmbus_drv.c can
> also reference it.  No need to have vmbus_percpu_isr() under
> arch/arm64 or have a stub in hv_common.c.
> 
> 3. hv_setup_vmbus_handler() and hv_remove_vmbus_handler()
> also move to hv_common.c.  The __weak stubs go away.
> 
> With these changes, only hv_setup_vmbus_handler() needs to
> be called, and it works for both x86 with the sysvec handler and
> for arm64 with vmbus_percpu_isr().
> 
> I haven't coded this up, so maybe there's some problematic detail,
> but the idea seems like it would work. If it does work, some of my
> comments below are no longer applicable.
> 

This is a great suggestion. Current implementation looked complex in 
design and it was becoming more complex with the changes I was making 
while addressing Sashiko's AI review comments. However your suggestion 
looks much better. I'll implement it. Thanks for suggesting.

>>
>> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/arm64/hyperv/mshyperv.c   | 13 +++++++++++++
>>   drivers/hv/hv_common.c         | 11 +++++++++++
>>   drivers/hv/vmbus_drv.c         |  7 +------
>>   include/asm-generic/mshyperv.h |  3 +++
>>   4 files changed, 28 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
>> index 4fdc26ade1d7..d4494ceeaad0 100644
>> --- a/arch/arm64/hyperv/mshyperv.c
>> +++ b/arch/arm64/hyperv/mshyperv.c
>> @@ -134,3 +134,16 @@ bool hv_is_hyperv_initialized(void)
>>   	return hyperv_initialized;
>>   }
>>   EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
>> +
>> +static void (*vmbus_percpu_handler)(void);
>> +void hv_setup_percpu_vmbus_handler(void (*handler)(void))
>> +{
>> +	vmbus_percpu_handler = handler;
>> +}
>> +
>> +irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
>> +{
>> +	if (vmbus_percpu_handler)
>> +		vmbus_percpu_handler();
>> +	return IRQ_HANDLED;
>> +}
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index d1ebc0ebd08f..a5064f558bf6 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -759,6 +759,17 @@ void __weak hv_setup_vmbus_handler(void (*handler)(void))
>>   }
>>   EXPORT_SYMBOL_GPL(hv_setup_vmbus_handler);
>>
>> +irqreturn_t __weak vmbus_percpu_isr(int irq, void *dev_id)
>> +{
>> +	return IRQ_HANDLED;
>> +}
>> +EXPORT_SYMBOL_GPL(vmbus_percpu_isr);
>> +
>> +void __weak hv_setup_percpu_vmbus_handler(void (*handler)(void))
>> +{
>> +}
>> +EXPORT_SYMBOL_GPL(hv_setup_percpu_vmbus_handler);
> 
> You've implemented hv_setup_percpu_vmbus_handler() following
> the pattern of hv_setup_vmbus_handler(), which is reasonable.
> But that turns out to be unnecessarily complicated. The existing
> hv_setup_vmbus_handler() has a portion in
> arch/x86/kernel/cpu/mshyperv.c as a special case because it uses a
> hard-coded interrupt vector on x86/x64, and has its own custom
> sysvec code. And there's a need for a __weak stub in hv_common.c
> so that vmbus_drv.c will compile on arm64.
> 
> But hv_setup_percpu_vmbus_handler() does not have the same
> requirements. It could be implemented entirely in vmbus_drv.c,
> with no code under arch/x86 or arch/arm64, and no __weak stub
> in hv_common.c.  vmbus_drv.c would just need to
> EXPORT_SYMBOL_FOR_MODULES, like it already does with vmbus_isr.
> I didn't code it up, but I think that approach would be simpler with
> fewer piece-parts scattered all over. If so, it would be worth
> breaking the symmetry with hv_setup_vmbus_handler().
> 

No longer applicable.

Regards,
Naman

^ permalink raw reply

* Re: [PATCH 04/11] Drivers: hv: Refactor mshv_vtl for ARM64 support to be added
From: Naman Jain @ 2026-04-13 11:46 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB41573C4A21BA96A534E5429CD450A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/1/2026 10:26 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> Refactor MSHV_VTL driver to move some of the x86 specific code to arch
>> specific files, and add corresponding functions for arm64.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/arm64/include/asm/mshyperv.h |  10 +++
>>   arch/x86/hyperv/hv_vtl.c          |  98 ++++++++++++++++++++++++++++
>>   arch/x86/include/asm/mshyperv.h   |   1 +
>>   drivers/hv/mshv_vtl_main.c        | 102 +-----------------------------
>>   4 files changed, 111 insertions(+), 100 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/mshyperv.h
>> b/arch/arm64/include/asm/mshyperv.h
>> index b721d3134ab6..804068e0941b 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -60,6 +60,16 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>>   				ARM_SMCCC_SMC_64,		\
>>   				ARM_SMCCC_OWNER_VENDOR_HYP,	\
>>   				HV_SMCCC_FUNC_NUMBER)
>> +#ifdef CONFIG_HYPERV_VTL_MODE
>> +/*
>> + * Get/Set the register. If the function returns `1`, that must be done via
>> + * a hypercall. Returning `0` means success.
>> + */
>> +static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared)
>> +{
>> +	return 1;
>> +}
>> +#endif
>>
>>   #include <asm-generic/mshyperv.h>
>>
>> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
>> index 9b6a9bc4ab76..72a0bb4ae0c7 100644
>> --- a/arch/x86/hyperv/hv_vtl.c
>> +++ b/arch/x86/hyperv/hv_vtl.c
>> @@ -17,6 +17,8 @@
>>   #include <asm/realmode.h>
>>   #include <asm/reboot.h>
>>   #include <asm/smap.h>
>> +#include <uapi/asm/mtrr.h>
>> +#include <asm/debugreg.h>
>>   #include <linux/export.h>
>>   #include <../kernel/smpboot.h>
>>   #include "../../kernel/fpu/legacy.h"
>> @@ -281,3 +283,99 @@ void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
>>   	kernel_fpu_end();
>>   }
>>   EXPORT_SYMBOL(mshv_vtl_return_call);
>> +
>> +/* Static table mapping register names to their corresponding actions */
>> +static const struct {
>> +	enum hv_register_name reg_name;
>> +	int debug_reg_num;  /* -1 if not a debug register */
>> +	u32 msr_addr;       /* 0 if not an MSR */
>> +} reg_table[] = {
>> +	/* Debug registers */
>> +	{HV_X64_REGISTER_DR0, 0, 0},
>> +	{HV_X64_REGISTER_DR1, 1, 0},
>> +	{HV_X64_REGISTER_DR2, 2, 0},
>> +	{HV_X64_REGISTER_DR3, 3, 0},
>> +	{HV_X64_REGISTER_DR6, 6, 0},
>> +	/* MTRR MSRs */
>> +	{HV_X64_REGISTER_MSR_MTRR_CAP, -1, MSR_MTRRcap},
>> +	{HV_X64_REGISTER_MSR_MTRR_DEF_TYPE, -1, MSR_MTRRdefType},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE0, -1, MTRRphysBase_MSR(0)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE1, -1, MTRRphysBase_MSR(1)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE2, -1, MTRRphysBase_MSR(2)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE3, -1, MTRRphysBase_MSR(3)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE4, -1, MTRRphysBase_MSR(4)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE5, -1, MTRRphysBase_MSR(5)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE6, -1, MTRRphysBase_MSR(6)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE7, -1, MTRRphysBase_MSR(7)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE8, -1, MTRRphysBase_MSR(8)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE9, -1, MTRRphysBase_MSR(9)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEA, -1, MTRRphysBase_MSR(0xa)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEB, -1, MTRRphysBase_MSR(0xb)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEC, -1, MTRRphysBase_MSR(0xc)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASED, -1, MTRRphysBase_MSR(0xd)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEE, -1, MTRRphysBase_MSR(0xe)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEF, -1, MTRRphysBase_MSR(0xf)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK0, -1, MTRRphysMask_MSR(0)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK1, -1, MTRRphysMask_MSR(1)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK2, -1, MTRRphysMask_MSR(2)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK3, -1, MTRRphysMask_MSR(3)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK4, -1, MTRRphysMask_MSR(4)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK5, -1, MTRRphysMask_MSR(5)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK6, -1, MTRRphysMask_MSR(6)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK7, -1, MTRRphysMask_MSR(7)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK8, -1, MTRRphysMask_MSR(8)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK9, -1, MTRRphysMask_MSR(9)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKA, -1, MTRRphysMask_MSR(0xa)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKB, -1, MTRRphysMask_MSR(0xb)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKC, -1, MTRRphysMask_MSR(0xc)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKD, -1, MTRRphysMask_MSR(0xd)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKE, -1, MTRRphysMask_MSR(0xe)},
>> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKF, -1, MTRRphysMask_MSR(0xf)},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX64K00000, -1, MSR_MTRRfix64K_00000},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX16K80000, -1, MSR_MTRRfix16K_80000},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX16KA0000, -1, MSR_MTRRfix16K_A0000},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KC0000, -1, MSR_MTRRfix4K_C0000},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KC8000, -1, MSR_MTRRfix4K_C8000},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KD0000, -1, MSR_MTRRfix4K_D0000},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KD8000, -1, MSR_MTRRfix4K_D8000},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KE0000, -1, MSR_MTRRfix4K_E0000},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KE8000, -1, MSR_MTRRfix4K_E8000},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KF0000, -1, MSR_MTRRfix4K_F0000},
>> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KF8000, -1, MSR_MTRRfix4K_F8000},
>> +};
>> +
>> +int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared)
>> +{
>> +	u64 *reg64;
>> +	enum hv_register_name gpr_name;
>> +	int i;
>> +
>> +	gpr_name = regs->name;
>> +	reg64 = &regs->value.reg64;
>> +
>> +	/* Search for the register in the table */
>> +	for (i = 0; i < ARRAY_SIZE(reg_table); i++) {
>> +		if (reg_table[i].reg_name != gpr_name)
>> +			continue;
>> +		if (reg_table[i].debug_reg_num != -1) {
>> +			/* Handle debug registers */
>> +			if (gpr_name == HV_X64_REGISTER_DR6 && !shared)
>> +				goto hypercall;
>> +			if (set)
>> +				native_set_debugreg(reg_table[i].debug_reg_num, *reg64);
>> +			else
>> +				*reg64 = native_get_debugreg(reg_table[i].debug_reg_num);
>> +		} else {
>> +			/* Handle MSRs */
>> +			if (set)
>> +				wrmsrl(reg_table[i].msr_addr, *reg64);
>> +			else
>> +				rdmsrl(reg_table[i].msr_addr, *reg64);
>> +		}
>> +		return 0;
>> +	}
>> +
>> +hypercall:
>> +	return 1;
>> +}
>> +EXPORT_SYMBOL(hv_vtl_get_set_reg);
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index f64393e853ee..d5355a5b7517 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -304,6 +304,7 @@ void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>   void mshv_vtl_return_call_init(u64 vtl_return_offset);
>>   void mshv_vtl_return_hypercall(void);
>>   void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> +int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared);
> 
> Can this move to asm-generic/mshyperv.h?  The function is no longer specific
> to x86/x64, so one would want to not declare it in the arch/x86 version
> of mshyperv.h. But maybe moving it to asm-generic/mshyperv.h breaks
> compilation on arm64 because there's also the static inline stub there.

This is still arch specific (x86 to be precise). For ARM64, we always 
want to return 1, which is to tell the client to use hypercall as a 
fallback option. Moving this x86 specific implementation (X64 registers, 
MTRR etc) to a common code, would not be right. One thing that could be 
done here was to move the "return 1" stub code from arm64 to asm-generic 
mshyperv.h, but that would not provide much benefit.

I am currently not planning to make any changes here. If I misunderstood 
your comment, please let me know.

Regards,
Naman


^ permalink raw reply

* Re: [PATCH 05/11] drivers: hv: Export vmbus_interrupt for mshv_vtl module
From: Naman Jain @ 2026-04-13 11:46 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157F1DAEF3BC14A67D59FB8D450A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/1/2026 10:26 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
> 
> Nit:  For the patch Subject, capitalize "Drivers:" in the prefix.

Acked.

Thanks,
Naman

> 
>> vmbus_interrupt is used in mshv_vtl_main.c to set the SINT vector.
>> When CONFIG_MSHV_VTL=m and CONFIG_HYPERV_VMBUS=y (built-in), the module
>> cannot access vmbus_interrupt at load time since it is not exported.
>>
>> Export it using EXPORT_SYMBOL_FOR_MODULES consistent with the existing
>> pattern used for vmbus_isr.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   drivers/hv/vmbus_drv.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
>> index f99d4f2d3862..de191799a8f6 100644
>> --- a/drivers/hv/vmbus_drv.c
>> +++ b/drivers/hv/vmbus_drv.c
>> @@ -57,6 +57,7 @@ static DEFINE_PER_CPU(long, vmbus_evt);
>>   /* Values parsed from ACPI DSDT */
>>   int vmbus_irq;
>>   int vmbus_interrupt;
>> +EXPORT_SYMBOL_FOR_MODULES(vmbus_interrupt, "mshv_vtl");
>>
>>   /*
>>    * If the Confidential VMBus is used, the data on the "wire" is not
>> --
>> 2.43.0
>>
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>


^ permalink raw reply

* Re: [PATCH 06/11] Drivers: hv: Make sint vector architecture neutral in MSHV_VTL
From: Naman Jain @ 2026-04-13 11:47 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157521DEF9EA2471B6F3359D450A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/1/2026 10:27 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> Generalize Synthetic interrupt source vector (sint) to use
>> vmbus_interrupt variable instead, which automatically takes care of
>> architectures where HYPERVISOR_CALLBACK_VECTOR is not present (arm64).
> 
> Sashiko AI raised an interesting question about the startup timing --
> whether the vmbus_platform_driver_probe() is guaranteed to have
> set vmbus_interrupt before the VTL functions below run and use it.
> What causes the mshv_vtl.ko module to be loaded, and hence run
> mshv_vtl_init()?

There is no race condition here. The init ordering guarantees that
vmbus_interrupt is always set before mshv_vtl_synic_enable_regs()
reads it.

The call chain for setting vmbus_interrupt:

   subsys_initcall(hv_acpi_init)                          [level 4]
     -> platform_driver_register(&vmbus_platform_driver) and so on.


The call chain for reading vmbus_interrupt:

   module_init(mshv_vtl_init)                             [level 6]
     -> hv_vtl_setup_synic()
       -> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ..., 
mshv_vtl_alloc_context, ...)
         -> mshv_vtl_alloc_context()
           -> mshv_vtl_synic_enable_regs()
             -> sint.vector = vmbus_interrupt

do_initcalls() processes sections in order 0 through 7, so 
hv_acpi_init() (level 4) is guaranteed to complete before 
mshv_vtl_init() (level 6) runs.



Regarding memory leak on cpu offline/online or module load/unload- it is 
beyond the scope of this series, I will fix it separately.

I may need some more time in addressing comments on rest of the patches. 
Please bear with me.

Regards,
Naman

> 
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   drivers/hv/mshv_vtl_main.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
>> index b607b6e7e121..91517b45d526 100644
>> --- a/drivers/hv/mshv_vtl_main.c
>> +++ b/drivers/hv/mshv_vtl_main.c
>> @@ -234,7 +234,7 @@ static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>>   	union hv_synic_sint sint;
>>
>>   	sint.as_uint64 = 0;
>> -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
>> +	sint.vector = vmbus_interrupt;
>>   	sint.masked = false;
>>   	sint.auto_eoi = hv_recommend_using_aeoi();
>>
>> @@ -753,7 +753,7 @@ static void mshv_vtl_synic_mask_vmbus_sint(void *info)
>>   	const u8 *mask = info;
>>
>>   	sint.as_uint64 = 0;
>> -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
>> +	sint.vector = vmbus_interrupt;
>>   	sint.masked = (*mask != 0);
>>   	sint.auto_eoi = hv_recommend_using_aeoi();
>>
>> --
>> 2.43.0
>>
> 
> Assuming there's no timing problem vs. the VMBus driver,
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>


^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Jason Gunthorpe @ 2026-04-13 13:46 UTC (permalink / raw)
  To: Long Li
  Cc: Leon Romanovsky, Erni Sri Satya Vennela, Konstantin Taranov,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <LV0PR21MB66700DC2FB827B93ED6A5714CE592@LV0PR21MB6670.namprd21.prod.outlook.com>

On Fri, Apr 10, 2026 at 10:29:45PM +0000, Long Li wrote:
> > On Sat, Mar 21, 2026 at 12:56:39AM +0000, Long Li wrote:
> > 
> > > How we rephrase this in this way: the driver should not corrupt or
> > > overflow other parts of the kernel if its device is misbehaving (or
> > > has a bug).
> > 
> > If we are going to do this CC hardening stuff I think I want to see a more
> > comphrensive approach, like if we detect an attack then the kernel instantly
> > crashes or something. Or at least an approach in general agreed to by the CC and
> > kernel community.
> > 
> > Igoring the issue and continuing seems just wrong.
> > 
> > This sprinkling of random checks in this series doesn't feel comprehensive or
> > cohesive to me.
> > 
> > Jason
> 
> Can we follow the virtio BAD_RING()/vq->broken pattern in
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/virtio/virtio_ring.c#n57.
> 
> Add a broken flag to mana_ib_dev. When any hardware response
> contains out-of-range values, mark the device broken and fail the
> operation - during probe this prevents device registration entirely,
> at runtime all subsequent operations return -EIO.

If that's the plan I would think it should be struct device based, but
yeah, I'm more comfortable with this sort of direction as a CC
hardening plan.

Jason

^ permalink raw reply

* RE: [PATCH 04/11] Drivers: hv: Refactor mshv_vtl for ARM64 support to be added
From: Michael Kelley @ 2026-04-13 15:19 UTC (permalink / raw)
  To: Naman Jain, Michael Kelley, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <fe4c0663-4ece-439c-bcb6-cd5780c15ed9@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, April 13, 2026 4:46 AM
> 
> On 4/1/2026 10:26 PM, Michael Kelley wrote:
> > From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026
> 5:13 AM
> >>
> >> Refactor MSHV_VTL driver to move some of the x86 specific code to arch
> >> specific files, and add corresponding functions for arm64.
> >>
> >> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
> >> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> >> ---
> >>   arch/arm64/include/asm/mshyperv.h |  10 +++
> >>   arch/x86/hyperv/hv_vtl.c          |  98 ++++++++++++++++++++++++++++
> >>   arch/x86/include/asm/mshyperv.h   |   1 +
> >>   drivers/hv/mshv_vtl_main.c        | 102 +-----------------------------
> >>   4 files changed, 111 insertions(+), 100 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/mshyperv.h
> >> b/arch/arm64/include/asm/mshyperv.h
> >> index b721d3134ab6..804068e0941b 100644
> >> --- a/arch/arm64/include/asm/mshyperv.h
> >> +++ b/arch/arm64/include/asm/mshyperv.h
> >> @@ -60,6 +60,16 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
> >>   				ARM_SMCCC_SMC_64,		\
> >>   				ARM_SMCCC_OWNER_VENDOR_HYP,	\
> >>   				HV_SMCCC_FUNC_NUMBER)
> >> +#ifdef CONFIG_HYPERV_VTL_MODE
> >> +/*
> >> + * Get/Set the register. If the function returns `1`, that must be done via
> >> + * a hypercall. Returning `0` means success.
> >> + */
> >> +static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared)
> >> +{
> >> +	return 1;
> >> +}
> >> +#endif
> >>
> >>   #include <asm-generic/mshyperv.h>
> >>
> >> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
> >> index 9b6a9bc4ab76..72a0bb4ae0c7 100644
> >> --- a/arch/x86/hyperv/hv_vtl.c
> >> +++ b/arch/x86/hyperv/hv_vtl.c
> >> @@ -17,6 +17,8 @@
> >>   #include <asm/realmode.h>
> >>   #include <asm/reboot.h>
> >>   #include <asm/smap.h>
> >> +#include <uapi/asm/mtrr.h>
> >> +#include <asm/debugreg.h>
> >>   #include <linux/export.h>
> >>   #include <../kernel/smpboot.h>
> >>   #include "../../kernel/fpu/legacy.h"
> >> @@ -281,3 +283,99 @@ void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
> >>   	kernel_fpu_end();
> >>   }
> >>   EXPORT_SYMBOL(mshv_vtl_return_call);
> >> +
> >> +/* Static table mapping register names to their corresponding actions */
> >> +static const struct {
> >> +	enum hv_register_name reg_name;
> >> +	int debug_reg_num;  /* -1 if not a debug register */
> >> +	u32 msr_addr;       /* 0 if not an MSR */
> >> +} reg_table[] = {
> >> +	/* Debug registers */
> >> +	{HV_X64_REGISTER_DR0, 0, 0},
> >> +	{HV_X64_REGISTER_DR1, 1, 0},
> >> +	{HV_X64_REGISTER_DR2, 2, 0},
> >> +	{HV_X64_REGISTER_DR3, 3, 0},
> >> +	{HV_X64_REGISTER_DR6, 6, 0},
> >> +	/* MTRR MSRs */
> >> +	{HV_X64_REGISTER_MSR_MTRR_CAP, -1, MSR_MTRRcap},
> >> +	{HV_X64_REGISTER_MSR_MTRR_DEF_TYPE, -1, MSR_MTRRdefType},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE0, -1, MTRRphysBase_MSR(0)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE1, -1, MTRRphysBase_MSR(1)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE2, -1, MTRRphysBase_MSR(2)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE3, -1, MTRRphysBase_MSR(3)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE4, -1, MTRRphysBase_MSR(4)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE5, -1, MTRRphysBase_MSR(5)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE6, -1, MTRRphysBase_MSR(6)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE7, -1, MTRRphysBase_MSR(7)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE8, -1, MTRRphysBase_MSR(8)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASE9, -1, MTRRphysBase_MSR(9)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEA, -1, MTRRphysBase_MSR(0xa)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEB, -1, MTRRphysBase_MSR(0xb)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEC, -1, MTRRphysBase_MSR(0xc)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASED, -1, MTRRphysBase_MSR(0xd)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEE, -1, MTRRphysBase_MSR(0xe)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_BASEF, -1, MTRRphysBase_MSR(0xf)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK0, -1, MTRRphysMask_MSR(0)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK1, -1, MTRRphysMask_MSR(1)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK2, -1, MTRRphysMask_MSR(2)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK3, -1, MTRRphysMask_MSR(3)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK4, -1, MTRRphysMask_MSR(4)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK5, -1, MTRRphysMask_MSR(5)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK6, -1, MTRRphysMask_MSR(6)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK7, -1, MTRRphysMask_MSR(7)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK8, -1, MTRRphysMask_MSR(8)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASK9, -1, MTRRphysMask_MSR(9)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKA, -1, MTRRphysMask_MSR(0xa)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKB, -1, MTRRphysMask_MSR(0xb)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKC, -1, MTRRphysMask_MSR(0xc)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKD, -1, MTRRphysMask_MSR(0xd)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKE, -1, MTRRphysMask_MSR(0xe)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_PHYS_MASKF, -1, MTRRphysMask_MSR(0xf)},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX64K00000, -1, MSR_MTRRfix64K_00000},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX16K80000, -1, MSR_MTRRfix16K_80000},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX16KA0000, -1, MSR_MTRRfix16K_A0000},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KC0000, -1, MSR_MTRRfix4K_C0000},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KC8000, -1, MSR_MTRRfix4K_C8000},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KD0000, -1, MSR_MTRRfix4K_D0000},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KD8000, -1, MSR_MTRRfix4K_D8000},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KE0000, -1, MSR_MTRRfix4K_E0000},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KE8000, -1, MSR_MTRRfix4K_E8000},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KF0000, -1, MSR_MTRRfix4K_F0000},
> >> +	{HV_X64_REGISTER_MSR_MTRR_FIX4KF8000, -1, MSR_MTRRfix4K_F8000},
> >> +};
> >> +
> >> +int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared)
> >> +{
> >> +	u64 *reg64;
> >> +	enum hv_register_name gpr_name;
> >> +	int i;
> >> +
> >> +	gpr_name = regs->name;
> >> +	reg64 = &regs->value.reg64;
> >> +
> >> +	/* Search for the register in the table */
> >> +	for (i = 0; i < ARRAY_SIZE(reg_table); i++) {
> >> +		if (reg_table[i].reg_name != gpr_name)
> >> +			continue;
> >> +		if (reg_table[i].debug_reg_num != -1) {
> >> +			/* Handle debug registers */
> >> +			if (gpr_name == HV_X64_REGISTER_DR6 && !shared)
> >> +				goto hypercall;
> >> +			if (set)
> >> +				native_set_debugreg(reg_table[i].debug_reg_num, *reg64);
> >> +			else
> >> +				*reg64 = native_get_debugreg(reg_table[i].debug_reg_num);
> >> +		} else {
> >> +			/* Handle MSRs */
> >> +			if (set)
> >> +				wrmsrl(reg_table[i].msr_addr, *reg64);
> >> +			else
> >> +				rdmsrl(reg_table[i].msr_addr, *reg64);
> >> +		}
> >> +		return 0;
> >> +	}
> >> +
> >> +hypercall:
> >> +	return 1;
> >> +}
> >> +EXPORT_SYMBOL(hv_vtl_get_set_reg);
> >> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> >> index f64393e853ee..d5355a5b7517 100644
> >> --- a/arch/x86/include/asm/mshyperv.h
> >> +++ b/arch/x86/include/asm/mshyperv.h
> >> @@ -304,6 +304,7 @@ void mshv_vtl_return_call(struct mshv_vtl_cpu_context
> *vtl0);
> >>   void mshv_vtl_return_call_init(u64 vtl_return_offset);
> >>   void mshv_vtl_return_hypercall(void);
> >>   void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> >> +int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared);
> >
> > Can this move to asm-generic/mshyperv.h?  The function is no longer specific
> > to x86/x64, so one would want to not declare it in the arch/x86 version
> > of mshyperv.h. But maybe moving it to asm-generic/mshyperv.h breaks
> > compilation on arm64 because there's also the static inline stub there.
> 
> This is still arch specific (x86 to be precise). For ARM64, we always
> want to return 1, which is to tell the client to use hypercall as a
> fallback option. Moving this x86 specific implementation (X64 registers,
> MTRR etc) to a common code, would not be right. One thing that could be
> done here was to move the "return 1" stub code from arm64 to asm-generic
> mshyperv.h, but that would not provide much benefit.
> 
> I am currently not planning to make any changes here. If I misunderstood
> your comment, please let me know.
> 

Yes, the implementation of hv_vtl_get_set_reg() is arch-specific. But the
one-line declaration is not. If there was a full implementation on arm64
like with x86, then the declaration could move to asm-generic/mshyperv.h.
But the arm64 side is a "static inline" function, and I'm pretty sure the
above declaration would cause a compiler conflict.  So not making any
changes is appropriate. If the arm64 side should ever change to a full
implementation that isn't static inline, then the one-line declaration
should move to asm-generic/mshyperv.h.

I probably should have omitted my comment entirely as it was just
musing about a situation that doesn't actually exist at the moment. :-(

Michael

^ permalink raw reply

* RE: [PATCH 06/11] Drivers: hv: Make sint vector architecture neutral in MSHV_VTL
From: Michael Kelley @ 2026-04-13 15:49 UTC (permalink / raw)
  To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <b5125f61-173f-45d0-a6dc-d795ba0f8693@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, April 13, 2026 4:48 AM
> 
> On 4/1/2026 10:27 PM, Michael Kelley wrote:
> > From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
> >>
> >> Generalize Synthetic interrupt source vector (sint) to use
> >> vmbus_interrupt variable instead, which automatically takes care of
> >> architectures where HYPERVISOR_CALLBACK_VECTOR is not present (arm64).
> >
> > Sashiko AI raised an interesting question about the startup timing --
> > whether the vmbus_platform_driver_probe() is guaranteed to have
> > set vmbus_interrupt before the VTL functions below run and use it.
> > What causes the mshv_vtl.ko module to be loaded, and hence run
> > mshv_vtl_init()?
> 
> There is no race condition here. The init ordering guarantees that
> vmbus_interrupt is always set before mshv_vtl_synic_enable_regs()
> reads it.
> 
> The call chain for setting vmbus_interrupt:
> 
>    subsys_initcall(hv_acpi_init)                          [level 4]
>      -> platform_driver_register(&vmbus_platform_driver) and so on.
> 
> 
> The call chain for reading vmbus_interrupt:
> 
>    module_init(mshv_vtl_init)                             [level 6]
>      -> hv_vtl_setup_synic()
>        -> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ..., mshv_vtl_alloc_context, ...)
>          -> mshv_vtl_alloc_context()
>            -> mshv_vtl_synic_enable_regs()
>              -> sint.vector = vmbus_interrupt
> 
> do_initcalls() processes sections in order 0 through 7, so
> hv_acpi_init() (level 4) is guaranteed to complete before
> mshv_vtl_init() (level 6) runs.
> 

I think the situation is more complex than what you describe, depending
on whether the VMBus driver and/or MSHV_VTL are built as modules vs.
being built-in to the kernel image. In include/linux/module.h, see the
comment for module_init() and how subsys_initcall() is mapped
to module_init() when built as a module.

If both are built-in, then what you describe is correct. But if either or
both are modules, then the respective init functions (hv_acpi_init
and mshv_vtl_init) get called at the time the module is loaded, and
not by do_initcalls(). I think hv_vmbus.ko gets loaded when an attempt
is first made to access a disk, but I would need to look more closely to
be sure. I don't have any understanding of what causes mshv_vtl.ko
to be loaded. And what is the ordering if MSHV_VTL is built-in while
VMBus is built as a module, or vice versa?

Michael





^ permalink raw reply

* Re: [PATCH 06/11] Drivers: hv: Make sint vector architecture neutral in MSHV_VTL
From: Naman Jain @ 2026-04-13 16:51 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157DA00A31F0BA8585B9B69D4242@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/13/2026 9:19 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, April 13, 2026 4:48 AM
>>
>> On 4/1/2026 10:27 PM, Michael Kelley wrote:
>>> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>>>
>>>> Generalize Synthetic interrupt source vector (sint) to use
>>>> vmbus_interrupt variable instead, which automatically takes care of
>>>> architectures where HYPERVISOR_CALLBACK_VECTOR is not present (arm64).
>>>
>>> Sashiko AI raised an interesting question about the startup timing --
>>> whether the vmbus_platform_driver_probe() is guaranteed to have
>>> set vmbus_interrupt before the VTL functions below run and use it.
>>> What causes the mshv_vtl.ko module to be loaded, and hence run
>>> mshv_vtl_init()?
>>
>> There is no race condition here. The init ordering guarantees that
>> vmbus_interrupt is always set before mshv_vtl_synic_enable_regs()
>> reads it.
>>
>> The call chain for setting vmbus_interrupt:
>>
>>     subsys_initcall(hv_acpi_init)                          [level 4]
>>       -> platform_driver_register(&vmbus_platform_driver) and so on.
>>
>>
>> The call chain for reading vmbus_interrupt:
>>
>>     module_init(mshv_vtl_init)                             [level 6]
>>       -> hv_vtl_setup_synic()
>>         -> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ..., mshv_vtl_alloc_context, ...)
>>           -> mshv_vtl_alloc_context()
>>             -> mshv_vtl_synic_enable_regs()
>>               -> sint.vector = vmbus_interrupt
>>
>> do_initcalls() processes sections in order 0 through 7, so
>> hv_acpi_init() (level 4) is guaranteed to complete before
>> mshv_vtl_init() (level 6) runs.
>>
> 
> I think the situation is more complex than what you describe, depending
> on whether the VMBus driver and/or MSHV_VTL are built as modules vs.
> being built-in to the kernel image. In include/linux/module.h, see the
> comment for module_init() and how subsys_initcall() is mapped
> to module_init() when built as a module.
> 
> If both are built-in, then what you describe is correct. But if either or
> both are modules, then the respective init functions (hv_acpi_init
> and mshv_vtl_init) get called at the time the module is loaded, and
> not by do_initcalls(). I think hv_vmbus.ko gets loaded when an attempt
> is first made to access a disk, but I would need to look more closely to
> be sure. I don't have any understanding of what causes mshv_vtl.ko
> to be loaded. And what is the ordering if MSHV_VTL is built-in while
> VMBus is built as a module, or vice versa?
> 
> Michael
> 

Based on this, I still feel that this race is not possible.

hv_vmbus   mshv_vtl
    y          y  -> different initcall levels, no issues
    y          m  -> use without initialization is not possible
    m          y  -> config dependencies take care of this, and mshv_vtl 
is forced to compile as a module in this case.
    m          m  -> config and symbol dependencies should take care of 
it. mshv_vtl has symbol and config dependencies on hv_vmbus, and it 
won't allow loading mshv_vtl if hv_vmbus module is not loaded.

Relevant code here: kernel/module/main.c

Regards,
Naman

^ permalink raw reply

* Re: [PATCH 07/11] arch: arm64: Add support for mshv_vtl_return_call
From: Naman Jain @ 2026-04-13 16:52 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157D3C4F6F376C8D6C3D234D450A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/1/2026 10:27 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
> 
> Nit: For historical consistency, use "arm64: hyperv:" as the prefix in the patch Subject.

Acked.

> 
>> Add support for arm64 specific variant of mshv_vtl_return_call function
>> to be able to add support for arm64 in MSHV_VTL driver. This would
>> help enable the transition between Virtual Trust Levels (VTL) in
>> MSHV_VTL when the kernel acts as a paravisor.
> 
> This commit message has a fair number of "filler" words. Suggest simplifying to:
> 
> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
> driver on arm64.  This function enables the transition between Virtual Trust
> Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
> 

I can see the difference clearly :-) Will use this in commit message.

>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/arm64/hyperv/Makefile        |   1 +
>>   arch/arm64/hyperv/hv_vtl.c        | 144 ++++++++++++++++++++++++++++++
>>   arch/arm64/include/asm/mshyperv.h |  13 +++
>>   3 files changed, 158 insertions(+)
>>   create mode 100644 arch/arm64/hyperv/hv_vtl.c
>>
>> diff --git a/arch/arm64/hyperv/Makefile b/arch/arm64/hyperv/Makefile
>> index 87c31c001da9..9701a837a6e1 100644
>> --- a/arch/arm64/hyperv/Makefile
>> +++ b/arch/arm64/hyperv/Makefile
>> @@ -1,2 +1,3 @@
>>   # SPDX-License-Identifier: GPL-2.0
>>   obj-y		:= hv_core.o mshyperv.o
>> +obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
>> diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
>> new file mode 100644
>> index 000000000000..66318672c242
>> --- /dev/null
>> +++ b/arch/arm64/hyperv/hv_vtl.c
>> @@ -0,0 +1,144 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2026, Microsoft, Inc.
>> + *
>> + * Authors:
>> + *     Roman Kisel <romank@linux.microsoft.com>
>> + *     Naman Jain <namjain@linux.microsoft.com>
>> + */
>> +
>> +#include <asm/boot.h>
>> +#include <asm/mshyperv.h>
>> +#include <asm/cpu_ops.h>
>> +
>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
>> +{
>> +	u64 base_ptr = (u64)vtl0->x;
>> +
>> +	/*
>> +	 * VTL switch for ARM64 platform - managing VTL0's CPU context.
>> +	 * We explicitly use the stack to save the base pointer, and use x16
>> +	 * as our working register for accessing the context structure.
>> +	 *
>> +	 * Register Handling:
>> +	 * - X0-X17: Saved/restored (general-purpose, shared for VTL communication)
>> +	 * - X18: NOT touched - hypervisor-managed per-VTL (platform register)
>> +	 * - X19-X30: Saved/restored (part of VTL0's execution context)
>> +	 * - Q0-Q31: Saved/restored (128-bit NEON/floating-point registers, shared)
>> +	 * - SP: Not in structure, hypervisor-managed per-VTL
>> +	 *
>> +	 * Note: X29 (FP) and X30 (LR) are in the structure and must be saved/restored
>> +	 * as part of VTL0's complete execution state.
> 
> Could drop "Note:".  That's what comments are. :-)

Acked.

> 
> 
>> +	 */
>> +	asm __volatile__ (
>> +		/* Save base pointer to stack explicitly, then load into x16 */
>> +		"str %0, [sp, #-16]!\n\t"     /* Push base pointer onto stack */
>> +		"mov x16, %0\n\t"             /* Load base pointer into x16 */
>> +		/* Volatile registers (Windows ARM64 ABI: x0-x15) */
>> +		"ldp x0, x1, [x16]\n\t"
>> +		"ldp x2, x3, [x16, #(2*8)]\n\t"
> 
> On the x86 side, there's machinery to generate a series of constants that are
> the offsets of the individual fields in struct mshv_vtl_cpu_context. The x86
> asm code then uses these constants. Here on the arm64 side, you've calculated
> the offsets directly in the asm code. Any reason for the difference? I can see
> it's fairly easy to eyeball the offsets here and compare against the registers
> that are being loaded, where it's not so easy the way registers are named
> on x86. So maybe the additional machinery that's helpful on the x86 side
> is less necessary here. Just wondering ....

There were complexities around static call etc. in x86 which led to that 
redesign. Here things are much simpler and the offsets we see are 1-1 
mapped to registers. But I still tried to prototype it, and it looked 
more complex than it has to be. I think we can keep it in current form.

> 
>> +		"ldp x4, x5, [x16, #(4*8)]\n\t"
>> +		"ldp x6, x7, [x16, #(6*8)]\n\t"
>> +		"ldp x8, x9, [x16, #(8*8)]\n\t"
>> +		"ldp x10, x11, [x16, #(10*8)]\n\t"
>> +		"ldp x12, x13, [x16, #(12*8)]\n\t"
>> +		"ldp x14, x15, [x16, #(14*8)]\n\t"
>> +		/* x16 will be loaded last, after saving base pointer */
>> +		"ldr x17, [x16, #(17*8)]\n\t"
>> +		/* x18 is hypervisor-managed per-VTL - DO NOT LOAD */
>> +
>> +		/* General-purpose registers: x19-x30 */
>> +		"ldp x19, x20, [x16, #(19*8)]\n\t"
>> +		"ldp x21, x22, [x16, #(21*8)]\n\t"
>> +		"ldp x23, x24, [x16, #(23*8)]\n\t"
>> +		"ldp x25, x26, [x16, #(25*8)]\n\t"
>> +		"ldp x27, x28, [x16, #(27*8)]\n\t"
>> +
>> +		/* Frame pointer and link register */
>> +		"ldp x29, x30, [x16, #(29*8)]\n\t"
>> +
>> +		/* Shared NEON/FP registers: Q0-Q31 (128-bit) */
>> +		"ldp q0, q1, [x16, #(32*8)]\n\t"
>> +		"ldp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
>> +		"ldp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
>> +		"ldp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
>> +		"ldp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
>> +		"ldp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
>> +		"ldp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
>> +		"ldp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
>> +		"ldp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
>> +		"ldp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
>> +		"ldp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
>> +		"ldp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
>> +		"ldp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
>> +		"ldp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
>> +		"ldp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
>> +		"ldp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
>> +
>> +		/* Now load x16 itself */
>> +		"ldr x16, [x16, #(16*8)]\n\t"
>> +
>> +		/* Return to the lower VTL */
>> +		"hvc #3\n\t"
>> +
>> +		/* Save context after return - reload base pointer from stack */
>> +		"stp x16, x17, [sp, #-16]!\n\t" /* Save x16, x17 temporarily */
>> +		"ldr x16, [sp, #16]\n\t"        /* Reload base pointer (skip saved x16,x17) */
>> +
>> +		/* Volatile registers */
>> +		"stp x0, x1, [x16]\n\t"
>> +		"stp x2, x3, [x16, #(2*8)]\n\t"
>> +		"stp x4, x5, [x16, #(4*8)]\n\t"
>> +		"stp x6, x7, [x16, #(6*8)]\n\t"
>> +		"stp x8, x9, [x16, #(8*8)]\n\t"
>> +		"stp x10, x11, [x16, #(10*8)]\n\t"
>> +		"stp x12, x13, [x16, #(12*8)]\n\t"
>> +		"stp x14, x15, [x16, #(14*8)]\n\t"
>> +		"ldp x0, x1, [sp], #16\n\t"      /* Recover saved x16, x17 */
>> +		"stp x0, x1, [x16, #(16*8)]\n\t"
>> +		/* x18 is hypervisor-managed - DO NOT SAVE */
>> +
>> +		/* General-purpose registers: x19-x30 */
>> +		"stp x19, x20, [x16, #(19*8)]\n\t"
>> +		"stp x21, x22, [x16, #(21*8)]\n\t"
>> +		"stp x23, x24, [x16, #(23*8)]\n\t"
>> +		"stp x25, x26, [x16, #(25*8)]\n\t"
>> +		"stp x27, x28, [x16, #(27*8)]\n\t"
>> +		"stp x29, x30, [x16, #(29*8)]\n\t"  /* Frame pointer and link register */
>> +
>> +		/* Shared NEON/FP registers: Q0-Q31 (128-bit) */
>> +		"stp q0, q1, [x16, #(32*8)]\n\t"
>> +		"stp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
>> +		"stp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
>> +		"stp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
>> +		"stp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
>> +		"stp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
>> +		"stp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
>> +		"stp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
>> +		"stp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
>> +		"stp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
>> +		"stp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
>> +		"stp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
>> +		"stp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
>> +		"stp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
>> +		"stp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
>> +		"stp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
>> +
>> +		/* Clean up stack - pop base pointer */
>> +		"add sp, sp, #16\n\t"
>> +
>> +		: /* No outputs */
>> +		: /* Input */ "r"(base_ptr)
>> +		: /* Clobber list - x16 used as base, x18 is hypervisor-managed (not touched) */
>> +		"memory", "cc",
>> +		"x0", "x1", "x2", "x3", "x4", "x5",
>> +		"x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13",
>> +		"x14", "x15", "x16", "x17", "x19", "x20", "x21",
>> +		"x22", "x23", "x24", "x25", "x26", "x27", "x28",
>> +		"x29", "x30",
>> +		"v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
>> +		"v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
>> +		"v16", "v17", "v18", "v19", "v20", "v21", "v22", "v23",
>> +		"v24", "v25", "v26", "v27", "v28", "v29", "v30", "v31");
>> +}
>> +EXPORT_SYMBOL(mshv_vtl_return_call);
>> diff --git a/arch/arm64/include/asm/mshyperv.h
>> b/arch/arm64/include/asm/mshyperv.h
>> index 804068e0941b..de7f3a41a8ea 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -60,6 +60,17 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>>   				ARM_SMCCC_SMC_64,		\
>>   				ARM_SMCCC_OWNER_VENDOR_HYP,	\
>>   				HV_SMCCC_FUNC_NUMBER)
>> +
>> +struct mshv_vtl_cpu_context {
>> +/*
>> + * NOTE: x18 is managed by the hypervisor. It won't be reloaded from this array.
>> + * It is included here for convenience in the common case.
> 
> I'm not getting your point in this last sentence. What is the "common case"?
> 

This was really odd :-) I should have spotted it. I'll change it to:
It is included here for convenience in array indexing.

> You could also drop the "NOTE: " prefix.

Acked.

> 
>> + */
>> +	__u64 x[31];
>> +	__u64 rsvd;
>> +	__uint128_t q[32];
>> +};
> 
> struct mshv_vtl_run reserves 1024 bytes for cpu_context. It would be nice to
> have a compile-time check that the size of struct mshv_vtl_cpu_context fits in
> that 1024 bytes. That check might be better added where struct mshv_vtl_run
> is defined so that it works for both x86 and arm64.

Acked, will add it.

> 
>> +
>>   #ifdef CONFIG_HYPERV_VTL_MODE
>>   /*
>>    * Get/Set the register. If the function returns `1`, that must be done via
>> @@ -69,6 +80,8 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u
>>   {
>>   	return 1;
>>   }
>> +
>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> 
> This declaration now duplicated in mshyperv.h under arch/arm64 and under
> arch/x86.  Instead, it should be added to asm-generic/mshyperv.h, and
> removed from the arch/x86 mshyperv.h, so that there's only a single
> instance of the declaration.

Acked.

> 
>>   #endif
>>
>>   #include <asm-generic/mshyperv.h>
>> --
>> 2.43.0
>>

Regards,
Naman

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Long Li @ 2026-04-13 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Erni Sri Satya Vennela, Konstantin Taranov,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260413134602.GL3694781@ziepe.ca>

> On Fri, Apr 10, 2026 at 10:29:45PM +0000, Long Li wrote:
> > > On Sat, Mar 21, 2026 at 12:56:39AM +0000, Long Li wrote:
> > >
> > > > How we rephrase this in this way: the driver should not corrupt or
> > > > overflow other parts of the kernel if its device is misbehaving
> > > > (or has a bug).
> > >
> > > If we are going to do this CC hardening stuff I think I want to see
> > > a more comphrensive approach, like if we detect an attack then the
> > > kernel instantly crashes or something. Or at least an approach in
> > > general agreed to by the CC and kernel community.
> > >
> > > Igoring the issue and continuing seems just wrong.
> > >
> > > This sprinkling of random checks in this series doesn't feel
> > > comprehensive or cohesive to me.
> > >
> > > Jason
> >
> > Can we follow the virtio BAD_RING()/vq->broken pattern in
> >
> https://git.kernel/
> .org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2
> Fdrivers%2Fvirtio%2Fvirtio_ring.c%23n57&data=05%7C02%7Clongli%40microsoft
> .com%7C698adb98daa64e20184708de996302b5%7C72f988bf86f141af91ab2d7c
> d011db47%7C1%7C0%7C639116847704528406%7CUnknown%7CTWFpbGZsb3d
> 8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoi
> TWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=HkjUfCDysbQKTuiCsiQai
> gySStd%2BI3VrHUnfMC%2FBORc%3D&reserved=0.
> >
> > Add a broken flag to mana_ib_dev. When any hardware response contains
> > out-of-range values, mark the device broken and fail the operation -
> > during probe this prevents device registration entirely, at runtime
> > all subsequent operations return -EIO.
>
> If that's the plan I would think it should be struct device based, but yeah, I'm
> more comfortable with this sort of direction as a CC hardening plan.
>
> Jason

Will do, thank you.

Long

^ permalink raw reply

* Re: [PATCH v2] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: vdso @ 2026-04-13 18:10 UTC (permalink / raw)
  To: Junrui Luo, Stanislav Kinsburskii
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Nuno Das Neves, Anirudh Rayabharam, Mukesh Rathor, Muminul Islam,
	Praveen K Paladugu, Jinank Jain, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yuhao Jiang, stable@vger.kernel.org
In-Reply-To: <19EDB8B0-A6F4-460F-8ABA-E9D3E239511B@outlook.com>


> On 04/13/2026 1:43 AM PDT Junrui Luo <moonafterrain@outlook.com> wrote:
> 
>  
> On Fri, Apr 10, 2026 at 09:05:35PM -0800, vdso@mailbox.org wrote:
> > All in all, from the three options of (generic check for overflow, simple check
> > for arch bad PFNs/GFNs, an elaborated check with all specifics) I suggested the simple check.
> > Fast and still more useful than checking for overflow in my opinion.
>  
> Thanks Roman for the thorough write-up. Since the original patch mixes
> host and hypervisor-side constants with an unclear unit, IMO we should
> do the bounds check in bytes instead.
> 
> For instance:
> 
> 	u64 start_gpa, end_gpa;
> 
> 	if (check_mul_overflow(mem->guest_pfn, HV_HYP_PAGE_SIZE,
> 						   &start_gpa) ||
> 		check_add_overflow(start_gpa, mem->size, &end_gpa) ||
> 		end_gpa > (1ULL << MAX_PHYSMEM_BITS))
> 		return -EINVAL;
> 
> Both sides of the final comparison are bytes, so no host-vs-hv page
> unit conversion is needed.

I like that better indeed!

> 
> In addition, it changes return value from -EOVERFLOW to -EINVAL.

I think that good, too: -EOVERFLOW originated iiuc and is more used
in VFS from my cursory glance.

> 
> Does this approach look reasonable? Happy to iterate if either of you
> would prefer a different choice.

I agree with all your points, feels like a better place now :)

I'd defer the final smell check to Stanislav. Stanislav maintains this code
as the daily job, and might have a better feel and perspective for it. I've
been happy to add my 2c!

> 
> Thanks,
> Junrui Luo

^ permalink raw reply

* RE: [PATCH 0/7] mshv: Refactor memory region management and map pages at creation
From: Michael Kelley @ 2026-04-13 21:07 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177490099488.81669.3758562641675983608.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> 
> This series refactors the mshv memory region subsystem in preparation
> for mapping populated pages into the hypervisor at movable region
> creation time, rather than relying solely on demand faulting.
> 
> The primary motivation is to ensure that when userspace passes a
> pre-populated mapping for a movable memory region, those pages are
> immediately visible to the hypervisor. Previously, all movable regions
> were created with HV_MAP_GPA_NO_ACCESS on every page regardless of
> whether the backing pages were already present, deferring all mapping
> to the fault handler. This added unnecessary fault overhead and
> complicated the initial setup of child partitions with pre-populated
> memory.
> 

This is a nice set of changes. Independent of the new functionality
for pre-populating, it improves the code organization and makes
it more regular.

See a few comments on individual patches. I noticed that Sashiko
wasn't able to review the series because it wouldn't apply. Hopefully
your v2 will apply. From what I've seen so far of Sashiko, it finds some
good issues. I did run the patch set through Co-Pilot, but that didn't
have the benefit of the AI prompts that Sashiko provides.

Michael

^ permalink raw reply

* RE: [PATCH 1/7] mshv: Convert from page pointers to PFNs
From: Michael Kelley @ 2026-04-13 21:08 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177490105758.81669.969284388846280218.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> 
> The HMM interface returns PFNs from hmm_range_fault(), and the
> hypervisor hypercalls operate on PFNs. Storing page pointers in
> between these interfaces requires unnecessary conversions and
> temporary allocations.
> 
> Store PFNs directly in memory regions to match the natural data flow.
> This eliminates the temporary PFN array allocation in the HMM fault
> path and reduces page_to_pfn() conversions throughout the driver.
> Convert to page structs via pfn_to_page() only when operations like
> unpin_user_page() require them.

General comment for this series:  PFN fields are typed as "unsigned long".
But pfn_offset and pfn_count are "u64".  GFNs are also "u64".  Any
reason not to make PFNs also "u64"? I know that pfn_valid() takes
an "unsigned long" input, but see comment below about pfn_valid().

> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_regions.c      |  297 ++++++++++++++++++++++------------------
>  drivers/hv/mshv_root.h         |   20 +--
>  drivers/hv/mshv_root_hv_call.c |   50 +++----
>  drivers/hv/mshv_root_main.c    |   30 ++--
>  4 files changed, 212 insertions(+), 185 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index fdffd4f002f6..b1a707d16c07 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -18,12 +18,13 @@
>  #include "mshv_root.h"
> 
>  #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
> +#define MSHV_INVALID_PFN				ULONG_MAX
> 
>  /**
>   * mshv_chunk_stride - Compute stride for mapping guest memory
>   * @page      : The page to check for huge page backing
>   * @gfn       : Guest frame number for the mapping
> - * @page_count: Total number of pages in the mapping
> + * @pfn_count: Total number of pages in the mapping

Nit: The colons are misaligned after this change.

>   *
>   * Determines the appropriate stride (in pages) for mapping guest memory.
>   * Uses huge page stride if the backing page is huge and the guest mapping
> @@ -32,18 +33,18 @@
>   * Return: Stride in pages, or -EINVAL if page order is unsupported.
>   */
>  static int mshv_chunk_stride(struct page *page,
> -			     u64 gfn, u64 page_count)
> +			     u64 gfn, u64 pfn_count)
>  {
>  	unsigned int page_order;
> 
>  	/*
>  	 * Use single page stride by default. For huge page stride, the
>  	 * page must be compound and point to the head of the compound
> -	 * page, and both gfn and page_count must be huge-page aligned.
> +	 * page, and both gfn and pfn_count must be huge-page aligned.
>  	 */
>  	if (!PageCompound(page) || !PageHead(page) ||
>  	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
> -	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
> +	    !IS_ALIGNED(pfn_count, PTRS_PER_PMD))
>  		return 1;
> 
>  	page_order = folio_order(page_folio(page));
> @@ -57,60 +58,61 @@ static int mshv_chunk_stride(struct page *page,
>  /**
>   * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
>   *                             in a region.
> - * @region     : Pointer to the memory region structure.
> - * @flags      : Flags to pass to the handler.
> - * @page_offset: Offset into the region's pages array to start processing.
> - * @page_count : Number of pages to process.
> - * @handler    : Callback function to handle the chunk.
> + * @region    : Pointer to the memory region structure.
> + * @flags     : Flags to pass to the handler.
> + * @pfn_offset: Offset into the region's PFNs array to start processing.
> + * @pfn_count : Number of PFNs to process.
> + * @handler   : Callback function to handle the chunk.
>   *
> - * This function scans the region's pages starting from @page_offset,
> - * checking for contiguous present pages of the same size (normal or huge).
> - * It invokes @handler for the chunk of contiguous pages found. Returns the
> - * number of pages handled, or a negative error code if the first page is
> - * not present or the handler fails.
> + * This function scans the region's PFNs starting from @pfn_offset,
> + * checking for contiguous valid PFNs backed by pages of the same size
> + * (normal or huge). It invokes @handler for the chunk of contiguous valid
> + * PFNs found. Returns the number of PFNs handled, or a negative error code
> + * if the first PFN is invalid or the handler fails.
>   *
> - * Note: The @handler callback must be able to handle both normal and huge
> - * pages.
> + * Note: The @handler callback must be able to handle valid PFNs backed by
> + * both normal and huge pages.
>   *
>   * Return: Number of pages handled, or negative error code.
>   */
> -static long mshv_region_process_chunk(struct mshv_mem_region *region,
> -				      u32 flags,
> -				      u64 page_offset, u64 page_count,
> -				      int (*handler)(struct mshv_mem_region *region,
> -						     u32 flags,
> -						     u64 page_offset,
> -						     u64 page_count,
> -						     bool huge_page))
> +static long mshv_region_process_pfns(struct mshv_mem_region *region,
> +				     u32 flags,
> +				     u64 pfn_offset, u64 pfn_count,
> +				     int (*handler)(struct mshv_mem_region *region,
> +						    u32 flags,
> +						    u64 pfn_offset,
> +						    u64 pfn_count,
> +						    bool huge_page))
>  {
> -	u64 gfn = region->start_gfn + page_offset;
> +	u64 gfn = region->start_gfn + pfn_offset;
>  	u64 count;
> -	struct page *page;
> +	unsigned long pfn;
>  	int stride, ret;
> 
> -	page = region->mreg_pages[page_offset];
> -	if (!page)
> +	pfn = region->mreg_pfns[pfn_offset];
> +	if (!pfn_valid(pfn))
>  		return -EINVAL;
> 
> -	stride = mshv_chunk_stride(page, gfn, page_count);
> +	stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
>  	if (stride < 0)
>  		return stride;
> 
>  	/* Start at stride since the first stride is validated */
> -	for (count = stride; count < page_count; count += stride) {
> -		page = region->mreg_pages[page_offset + count];
> +	for (count = stride; count < pfn_count ; count += stride) {
> +		pfn = region->mreg_pfns[pfn_offset + count];
> 
> -		/* Break if current page is not present */
> -		if (!page)
> +		/* Break if current pfn is invalid */
> +		if (!pfn_valid(pfn))

pfn_valid() is a relatively expensive test to be doing in a loop
on what may be every single page. It does an RCU lock/unlock
and make other checks that aren't necessary here. Since
mreg_pfns[] is populated from mm calls, the only invalid PFNs
would be MSHV_INVALID_PFN that code in this module has
explicitly put there. Just testing against MSHV_INVALID_PFN
would be a lot faster here and elsewhere in this module. It's
really a "pfn set/not set" test. Defining a pfn_set() macro
here in this module that tests against MSHV_INVALID_PFN
would accomplish the same thing more efficiently.

>  			break;
> 
>  		/* Break if stride size changes */
> -		if (stride != mshv_chunk_stride(page, gfn + count,
> -						page_count - count))
> +		if (stride != mshv_chunk_stride(pfn_to_page(pfn),
> +						gfn + count,
> +						pfn_count - count))
>  			break;
>  	}
> 
> -	ret = handler(region, flags, page_offset, count, stride > 1);
> +	ret = handler(region, flags, pfn_offset, count, stride > 1);
>  	if (ret)
>  		return ret;
> 
> @@ -118,70 +120,73 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>  }
> 
>  /**
> - * mshv_region_process_range - Processes a range of memory pages in a
> - *                             region.
> - * @region     : Pointer to the memory region structure.
> - * @flags      : Flags to pass to the handler.
> - * @page_offset: Offset into the region's pages array to start processing.
> - * @page_count : Number of pages to process.
> - * @handler    : Callback function to handle each chunk of contiguous
> - *               pages.
> + * mshv_region_process_range - Processes a range of PFNs in a region.
> + * @region    : Pointer to the memory region structure.
> + * @flags     : Flags to pass to the handler.
> + * @pfn_offset: Offset into the region's PFNs array to start processing.
> + * @pfn_count : Number of PFNs to process.
> + * @handler   : Callback function to handle each chunk of contiguous
> + *              valid PFNs.
>   *
> - * Iterates over the specified range of pages in @region, skipping
> - * non-present pages. For each contiguous chunk of present pages, invokes
> - * @handler via mshv_region_process_chunk.
> + * Iterates over the specified range of PFNs in @region, skipping
> + * invalid PFNs. For each contiguous chunk of valid PFNS, invokes
> + * @handler via mshv_region_process_pfns.
>   *
> - * Note: The @handler callback must be able to handle both normal and huge
> - * pages.
> + * Note: The @handler callback must be able to handle PFNs backed by both
> + * normal and huge pages.
>   *
>   * Returns 0 on success, or a negative error code on failure.
>   */
>  static int mshv_region_process_range(struct mshv_mem_region *region,
>  				     u32 flags,
> -				     u64 page_offset, u64 page_count,
> +				     u64 pfn_offset, u64 pfn_count,
>  				     int (*handler)(struct mshv_mem_region *region,
>  						    u32 flags,
> -						    u64 page_offset,
> -						    u64 page_count,
> +						    u64 pfn_offset,
> +						    u64 pfn_count,
>  						    bool huge_page))
>  {
> +	u64 pfn_end;

In Patch 2 of this series, "pfn_end" is changed to just "end", and
the references are adjusted. Patch 2 could be a few lines smaller if it
was named "end" here and Patch 2 didn't have to change it.

>  	long ret;
> 
> -	if (page_offset + page_count > region->nr_pages)
> +	if (check_add_overflow(pfn_offset, pfn_count, &pfn_end))
> +		return -EOVERFLOW;
> +
> +	if (pfn_end > region->nr_pfns)
>  		return -EINVAL;
> 
> -	while (page_count) {
> +	while (pfn_count) {
>  		/* Skip non-present pages */
> -		if (!region->mreg_pages[page_offset]) {
> -			page_offset++;
> -			page_count--;
> +		if (!pfn_valid(region->mreg_pfns[pfn_offset])) {
> +			pfn_offset++;
> +			pfn_count--;
>  			continue;
>  		}
> 
> -		ret = mshv_region_process_chunk(region, flags,
> -						page_offset,
> -						page_count,
> -						handler);
> +		ret = mshv_region_process_pfns(region, flags,
> +					       pfn_offset, pfn_count,
> +					       handler);
>  		if (ret < 0)
>  			return ret;
> 
> -		page_offset += ret;
> -		page_count -= ret;
> +		pfn_offset += ret;
> +		pfn_count -= ret;
>  	}
> 
>  	return 0;
>  }
> 
> -struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> +struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
>  					   u64 uaddr, u32 flags)
>  {
>  	struct mshv_mem_region *region;
> +	u64 i;
> 
> -	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
> +	region = vzalloc(sizeof(*region) + sizeof(unsigned long) * nr_pfns);

Use struct_size(region, mreg_pfns, nr_pfns) instead of open coding the arithmetic?

>  	if (!region)
>  		return ERR_PTR(-ENOMEM);
> 
> -	region->nr_pages = nr_pages;
> +	region->nr_pfns = nr_pfns;
>  	region->start_gfn = guest_pfn;
>  	region->start_uaddr = uaddr;
>  	region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
> @@ -190,6 +195,9 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
>  	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
>  		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
> 
> +	for (i = 0; i < nr_pfns; i++)
> +		region->mreg_pfns[i] = MSHV_INVALID_PFN;
> +
>  	kref_init(&region->mreg_refcount);
> 
>  	return region;
> @@ -197,15 +205,15 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> 
>  static int mshv_region_chunk_share(struct mshv_mem_region *region,
>  				   u32 flags,
> -				   u64 page_offset, u64 page_count,
> +				   u64 pfn_offset, u64 pfn_count,
>  				   bool huge_page)
>  {
>  	if (huge_page)
>  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> 
>  	return hv_call_modify_spa_host_access(region->partition->pt_id,
> -					      region->mreg_pages + page_offset,
> -					      page_count,
> +					      region->mreg_pfns + pfn_offset,
> +					      pfn_count,
>  					      HV_MAP_GPA_READABLE |
>  					      HV_MAP_GPA_WRITABLE,
>  					      flags, true);
> @@ -216,21 +224,21 @@ int mshv_region_share(struct mshv_mem_region *region)
>  	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
> 
>  	return mshv_region_process_range(region, flags,
> -					 0, region->nr_pages,
> +					 0, region->nr_pfns,
>  					 mshv_region_chunk_share);
>  }
> 
>  static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
>  				     u32 flags,
> -				     u64 page_offset, u64 page_count,
> +				     u64 pfn_offset, u64 pfn_count,
>  				     bool huge_page)
>  {
>  	if (huge_page)
>  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> 
>  	return hv_call_modify_spa_host_access(region->partition->pt_id,
> -					      region->mreg_pages + page_offset,
> -					      page_count, 0,
> +					      region->mreg_pfns + pfn_offset,
> +					      pfn_count, 0,
>  					      flags, false);
>  }
> 
> @@ -239,30 +247,30 @@ int mshv_region_unshare(struct mshv_mem_region *region)
>  	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
> 
>  	return mshv_region_process_range(region, flags,
> -					 0, region->nr_pages,
> +					 0, region->nr_pfns,
>  					 mshv_region_chunk_unshare);
>  }
> 
>  static int mshv_region_chunk_remap(struct mshv_mem_region *region,
>  				   u32 flags,
> -				   u64 page_offset, u64 page_count,
> +				   u64 pfn_offset, u64 pfn_count,
>  				   bool huge_page)
>  {
>  	if (huge_page)
>  		flags |= HV_MAP_GPA_LARGE_PAGE;
> 
> -	return hv_call_map_gpa_pages(region->partition->pt_id,
> -				     region->start_gfn + page_offset,
> -				     page_count, flags,
> -				     region->mreg_pages + page_offset);
> +	return hv_call_map_ram_pfns(region->partition->pt_id,
> +				    region->start_gfn + pfn_offset,
> +				    pfn_count, flags,
> +				    region->mreg_pfns + pfn_offset);
>  }
> 
> -static int mshv_region_remap_pages(struct mshv_mem_region *region,
> -				   u32 map_flags,
> -				   u64 page_offset, u64 page_count)
> +static int mshv_region_remap_pfns(struct mshv_mem_region *region,
> +				  u32 map_flags,
> +				  u64 pfn_offset, u64 pfn_count)
>  {
>  	return mshv_region_process_range(region, map_flags,
> -					 page_offset, page_count,
> +					 pfn_offset, pfn_count,
>  					 mshv_region_chunk_remap);
>  }
> 
> @@ -270,38 +278,50 @@ int mshv_region_map(struct mshv_mem_region *region)
>  {
>  	u32 map_flags = region->hv_map_flags;
> 
> -	return mshv_region_remap_pages(region, map_flags,
> -				       0, region->nr_pages);
> +	return mshv_region_remap_pfns(region, map_flags,
> +				      0, region->nr_pfns);
>  }
> 
> -static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> -					 u64 page_offset, u64 page_count)
> +static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
> +					u64 pfn_offset, u64 pfn_count)
>  {
> -	if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> -		unpin_user_pages(region->mreg_pages + page_offset, page_count);
> +	u64 i;
> +
> +	for (i = pfn_offset; i < pfn_offset + pfn_count; i++) {
> +		if (!pfn_valid(region->mreg_pfns[i]))
> +			continue;
> +
> +		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> +			unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
> 
> -	memset(region->mreg_pages + page_offset, 0,
> -	       page_count * sizeof(struct page *));
> +		region->mreg_pfns[i] = MSHV_INVALID_PFN;
> +	}
>  }
> 
>  void mshv_region_invalidate(struct mshv_mem_region *region)
>  {
> -	mshv_region_invalidate_pages(region, 0, region->nr_pages);
> +	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
>  }
> 
>  int mshv_region_pin(struct mshv_mem_region *region)
>  {
> -	u64 done_count, nr_pages;
> +	u64 done_count, nr_pfns, i;
> +	unsigned long *pfns;
>  	struct page **pages;
>  	__u64 userspace_addr;
>  	int ret;
> 
> -	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
> -		pages = region->mreg_pages + done_count;
> +	pages = kmalloc_array(MSHV_PIN_PAGES_BATCH_SIZE,
> +			      sizeof(struct page *), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	for (done_count = 0; done_count < region->nr_pfns; done_count += ret) {
> +		pfns = region->mreg_pfns + done_count;
>  		userspace_addr = region->start_uaddr +
>  				 done_count * HV_HYP_PAGE_SIZE;
> -		nr_pages = min(region->nr_pages - done_count,
> -			       MSHV_PIN_PAGES_BATCH_SIZE);
> +		nr_pfns = min(region->nr_pfns - done_count,
> +			      MSHV_PIN_PAGES_BATCH_SIZE);
> 
>  		/*
>  		 * Pinning assuming 4k pages works for large pages too.
> @@ -311,39 +331,44 @@ int mshv_region_pin(struct mshv_mem_region *region)
>  		 * with the FOLL_LONGTERM flag does a large temporary
>  		 * allocation of contiguous memory.
>  		 */
> -		ret = pin_user_pages_fast(userspace_addr, nr_pages,
> +		ret = pin_user_pages_fast(userspace_addr, nr_pfns,
>  					  FOLL_WRITE | FOLL_LONGTERM,
>  					  pages);
> -		if (ret != nr_pages)
> +		if (ret != nr_pfns)
>  			goto release_pages;
> +
> +		for (i = 0; i < ret; i++)
> +			pfns[i] = page_to_pfn(pages[i]);
>  	}
> 
> +	kfree(pages);
>  	return 0;
> 
>  release_pages:
>  	if (ret > 0)
>  		done_count += ret;
> -	mshv_region_invalidate_pages(region, 0, done_count);
> +	mshv_region_invalidate_pfns(region, 0, done_count);
> +	kfree(pages);
>  	return ret < 0 ? ret : -ENOMEM;
>  }
> 
>  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
>  				   u32 flags,
> -				   u64 page_offset, u64 page_count,
> +				   u64 pfn_offset, u64 pfn_count,
>  				   bool huge_page)
>  {
>  	if (huge_page)
>  		flags |= HV_UNMAP_GPA_LARGE_PAGE;
> 
> -	return hv_call_unmap_gpa_pages(region->partition->pt_id,
> -				       region->start_gfn + page_offset,
> -				       page_count, flags);
> +	return hv_call_unmap_pfns(region->partition->pt_id,
> +				  region->start_gfn + pfn_offset,
> +				  pfn_count, flags);
>  }
> 
>  static int mshv_region_unmap(struct mshv_mem_region *region)
>  {
>  	return mshv_region_process_range(region, 0,
> -					 0, region->nr_pages,
> +					 0, region->nr_pfns,
>  					 mshv_region_chunk_unmap);
>  }
> 
> @@ -427,8 +452,8 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
>  /**
>   * mshv_region_range_fault - Handle memory range faults for a given region.
>   * @region: Pointer to the memory region structure.
> - * @page_offset: Offset of the page within the region.
> - * @page_count: Number of pages to handle.
> + * @pfn_offset: Offset of the page within the region.
> + * @pfn_count: Number of pages to handle.
>   *
>   * This function resolves memory faults for a specified range of pages
>   * within a memory region. It uses HMM (Heterogeneous Memory Management)
> @@ -437,7 +462,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
>   * Return: 0 on success, negative error code on failure.
>   */
>  static int mshv_region_range_fault(struct mshv_mem_region *region,
> -				   u64 page_offset, u64 page_count)
> +				   u64 pfn_offset, u64 pfn_count)
>  {
>  	struct hmm_range range = {
>  		.notifier = &region->mreg_mni,
> @@ -447,13 +472,13 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
>  	int ret;
>  	u64 i;
> 
> -	pfns = kmalloc_array(page_count, sizeof(*pfns), GFP_KERNEL);
> +	pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
>  	if (!pfns)
>  		return -ENOMEM;
> 
>  	range.hmm_pfns = pfns;
> -	range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
> -	range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
> +	range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
> +	range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
> 
>  	do {
>  		ret = mshv_region_hmm_fault_and_lock(region, &range);
> @@ -462,11 +487,15 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
>  	if (ret)
>  		goto out;
> 
> -	for (i = 0; i < page_count; i++)
> -		region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> +	for (i = 0; i < pfn_count; i++) {
> +		if (!(pfns[i] & HMM_PFN_VALID))
> +			continue;
> +		/* Drop HMM_PFN_* flags to ensure PFNs are valid. */
> +		region->mreg_pfns[pfn_offset + i] = pfns[i] & ~HMM_PFN_FLAGS;
> +	}
> 
> -	ret = mshv_region_remap_pages(region, region->hv_map_flags,
> -				      page_offset, page_count);
> +	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
> +				     pfn_offset, pfn_count);
> 
>  	mutex_unlock(&region->mreg_mutex);
>  out:
> @@ -476,24 +505,24 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> 
>  bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
>  {
> -	u64 page_offset, page_count;
> +	u64 pfn_offset, pfn_count;
>  	int ret;
> 
>  	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
> -	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
> -				 MSHV_MAP_FAULT_IN_PAGES);
> +	pfn_offset = ALIGN_DOWN(gfn - region->start_gfn,
> +				MSHV_MAP_FAULT_IN_PAGES);
> 
>  	/* Map more pages than requested to reduce the number of faults. */
> -	page_count = min(region->nr_pages - page_offset,
> -			 MSHV_MAP_FAULT_IN_PAGES);
> +	pfn_count = min(region->nr_pfns - pfn_offset,
> +			MSHV_MAP_FAULT_IN_PAGES);
> 
> -	ret = mshv_region_range_fault(region, page_offset, page_count);
> +	ret = mshv_region_range_fault(region, pfn_offset, pfn_count);
> 
>  	WARN_ONCE(ret,
> -		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
> +		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, pfn_offset %llu, pfn_count %llu\n",
>  		  region->partition->pt_id, region->start_uaddr,
> -		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> -		  gfn, page_offset, page_count);
> +		  region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
> +		  gfn, pfn_offset, pfn_count);
> 
>  	return !ret;
>  }
> @@ -523,16 +552,16 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
>  	struct mshv_mem_region *region = container_of(mni,
>  						      struct mshv_mem_region,
>  						      mreg_mni);
> -	u64 page_offset, page_count;
> +	u64 pfn_offset, pfn_count;
>  	unsigned long mstart, mend;
>  	int ret = -EPERM;
> 
>  	mstart = max(range->start, region->start_uaddr);
>  	mend = min(range->end, region->start_uaddr +
> -		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
> +		   (region->nr_pfns << HV_HYP_PAGE_SHIFT));
> 
> -	page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> -	page_count = HVPFN_DOWN(mend - mstart);
> +	pfn_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> +	pfn_count = HVPFN_DOWN(mend - mstart);
> 
>  	if (mmu_notifier_range_blockable(range))
>  		mutex_lock(&region->mreg_mutex);
> @@ -541,12 +570,12 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> 
>  	mmu_interval_set_seq(mni, cur_seq);
> 
> -	ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
> -				      page_offset, page_count);
> +	ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
> +				     pfn_offset, pfn_count);
>  	if (ret)
>  		goto out_unlock;
> 
> -	mshv_region_invalidate_pages(region, page_offset, page_count);
> +	mshv_region_invalidate_pfns(region, pfn_offset, pfn_count);
> 
>  	mutex_unlock(&region->mreg_mutex);
> 
> @@ -558,9 +587,9 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
>  	WARN_ONCE(ret,
>  		  "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
>  		  region->start_uaddr,
> -		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> +		  region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
>  		  range->start, range->end, range->event,
> -		  page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
> +		  pfn_offset, pfn_offset + pfn_count - 1, (u64)range->mm, ret);
>  	return false;
>  }
> 
> @@ -579,7 +608,7 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
> 
>  	ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
>  					   region->start_uaddr,
> -					   region->nr_pages << HV_HYP_PAGE_SHIFT,
> +					   region->nr_pfns << HV_HYP_PAGE_SHIFT,
>  					   &mshv_region_mni_ops);
>  	if (ret)
>  		return false;
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 947dfb76bb19..f1d4bee97a3f 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -84,15 +84,15 @@ enum mshv_region_type {
>  struct mshv_mem_region {
>  	struct hlist_node hnode;
>  	struct kref mreg_refcount;
> -	u64 nr_pages;
> +	u64 nr_pfns;
>  	u64 start_gfn;
>  	u64 start_uaddr;
>  	u32 hv_map_flags;
>  	struct mshv_partition *partition;
>  	enum mshv_region_type mreg_type;
>  	struct mmu_interval_notifier mreg_mni;
> -	struct mutex mreg_mutex;	/* protects region pages remapping */
> -	struct page *mreg_pages[];
> +	struct mutex mreg_mutex;	/* protects region PFNs remapping */
> +	unsigned long mreg_pfns[];
>  };
> 
>  struct mshv_irq_ack_notifier {
> @@ -282,11 +282,11 @@ int hv_call_create_partition(u64 flags,
>  int hv_call_initialize_partition(u64 partition_id);
>  int hv_call_finalize_partition(u64 partition_id);
>  int hv_call_delete_partition(u64 partition_id);
> -int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64
> numpgs);
> -int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> -			  u32 flags, struct page **pages);
> -int hv_call_unmap_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> -			    u32 flags);
> +int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
> +int hv_call_map_ram_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
> +			 u32 flags, unsigned long *pfns);
> +int hv_call_unmap_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
> +		       u32 flags);
>  int hv_call_delete_vp(u64 partition_id, u32 vp_index);
>  int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
>  				     u64 dest_addr,
> @@ -329,8 +329,8 @@ int hv_map_stats_page(enum hv_stats_object_type type,
>  int hv_unmap_stats_page(enum hv_stats_object_type type,
>  			struct hv_stats_page *page_addr,
>  			const union hv_stats_object_identity *identity);
> -int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> -				   u64 page_struct_count, u32 host_access,
> +int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
> +				   u64 pfns_count, u32 host_access,
>  				   u32 flags, u8 acquire);
>  int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
>  				      void *property_value, size_t property_value_sz);
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index cb55d4d4be2e..a95f2cfc5da5 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -188,17 +188,16 @@ int hv_call_delete_partition(u64 partition_id)
>  	return hv_result_to_errno(status);
>  }
> 
> -/* Ask the hypervisor to map guest ram pages or the guest mmio space */
> -static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> -			       u32 flags, struct page **pages, u64 mmio_spa)
> +static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,
> +			  u32 flags, unsigned long *pfns, u64 mmio_spa)
>  {
>  	struct hv_input_map_gpa_pages *input_page;
>  	u64 status, *pfnlist;
>  	unsigned long irq_flags, large_shift = 0;
>  	int ret = 0, done = 0;
> -	u64 page_count = page_struct_count;
> +	u64 page_count = pfns_count;
> 
> -	if (page_count == 0 || (pages && mmio_spa))
> +	if (page_count == 0 || (pfns && mmio_spa))
>  		return -EINVAL;
> 
>  	if (flags & HV_MAP_GPA_LARGE_PAGE) {
> @@ -227,14 +226,14 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>  		for (i = 0; i < rep_count; i++)
>  			if (flags & HV_MAP_GPA_NO_ACCESS) {
>  				pfnlist[i] = 0;
> -			} else if (pages) {
> +			} else if (pfns) {
>  				u64 index = (done + i) << large_shift;
> 
> -				if (index >= page_struct_count) {
> +				if (index >= pfns_count) {
>  					ret = -EINVAL;
>  					break;
>  				}
> -				pfnlist[i] = page_to_pfn(pages[index]);
> +				pfnlist[i] = pfns[index];
>  			} else {
>  				pfnlist[i] = mmio_spa + done + i;
>  			}
> @@ -266,37 +265,37 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> 
>  		if (flags & HV_MAP_GPA_LARGE_PAGE)
>  			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> -		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> +		hv_call_unmap_pfns(partition_id, gfn, done, unmap_flags);
>  	}
> 
>  	return ret;
>  }
> 
>  /* Ask the hypervisor to map guest ram pages */
> -int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> -			  u32 flags, struct page **pages)
> +int hv_call_map_ram_pfns(u64 partition_id, u64 gfn, u64 pfn_count,
> +			 u32 flags, unsigned long *pfns)
>  {
> -	return hv_do_map_gpa_hcall(partition_id, gpa_target, page_count,
> -				   flags, pages, 0);
> +	return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
> +			      pfns, 0);
>  }
> 
> -/* Ask the hypervisor to map guest mmio space */
> -int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
> +int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa,
> +			  u64 pfn_count)
>  {
>  	int i;
>  	u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
>  		    HV_MAP_GPA_NOT_CACHED;
> 
> -	for (i = 0; i < numpgs; i++)
> +	for (i = 0; i < pfn_count; i++)
>  		if (page_is_ram(mmio_spa + i))
>  			return -EINVAL;
> 
> -	return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
> -				   mmio_spa);
> +	return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
> +			      NULL, mmio_spa);
>  }
> 
> -int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> -			    u32 flags)
> +int hv_call_unmap_pfns(u64 partition_id, u64 gfn, u64 page_count_4k,
> +		       u32 flags)
>  {
>  	struct hv_input_unmap_gpa_pages *input_page;
>  	u64 status, page_count = page_count_4k;
> @@ -1009,15 +1008,15 @@ int hv_unmap_stats_page(enum hv_stats_object_type type,
>  	return ret;
>  }
> 
> -int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> -				   u64 page_struct_count, u32 host_access,
> +int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
> +				   u64 pfns_count, u32 host_access,
>  				   u32 flags, u8 acquire)
>  {
>  	struct hv_input_modify_sparse_spa_page_host_access *input_page;
>  	u64 status;
>  	int done = 0;
>  	unsigned long irq_flags, large_shift = 0;
> -	u64 page_count = page_struct_count;
> +	u64 page_count = pfns_count;
>  	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
>  			     HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS;
> 
> @@ -1051,11 +1050,10 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
>  		for (i = 0; i < rep_count; i++) {
>  			u64 index = (done + i) << large_shift;
> 
> -			if (index >= page_struct_count)
> +			if (index >= pfns_count)
>  				return -EINVAL;
> 
> -			input_page->spa_page_list[i] =
> -						page_to_pfn(pages[index]);
> +			input_page->spa_page_list[i] = pfns[index];
>  		}
> 
>  		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index f2d83d6c8c4f..685e4b562186 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -619,7 +619,7 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> 
>  	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
>  		if (gfn >= region->start_gfn &&
> -		    gfn < region->start_gfn + region->nr_pages)
> +		    gfn < region->start_gfn + region->nr_pfns)
>  			return region;
>  	}
> 
> @@ -1221,20 +1221,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
>  					bool is_mmio)
>  {
>  	struct mshv_mem_region *rg;
> -	u64 nr_pages = HVPFN_DOWN(mem->size);
> +	u64 nr_pfns = HVPFN_DOWN(mem->size);
> 
>  	/* Reject overlapping regions */
>  	spin_lock(&partition->pt_mem_regions_lock);
>  	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
> -		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
> -		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
> +		if (mem->guest_pfn + nr_pfns <= rg->start_gfn ||
> +		    rg->start_gfn + rg->nr_pfns <= mem->guest_pfn)
>  			continue;
>  		spin_unlock(&partition->pt_mem_regions_lock);
>  		return -EEXIST;
>  	}
>  	spin_unlock(&partition->pt_mem_regions_lock);
> 
> -	rg = mshv_region_create(mem->guest_pfn, nr_pages,
> +	rg = mshv_region_create(mem->guest_pfn, nr_pfns,
>  				mem->userspace_addr, mem->flags);
>  	if (IS_ERR(rg))
>  		return PTR_ERR(rg);
> @@ -1372,21 +1372,21 @@ mshv_map_user_memory(struct mshv_partition *partition,
>  		 * the hypervisor track dirty pages, enabling pre-copy live
>  		 * migration.
>  		 */
> -		ret = hv_call_map_gpa_pages(partition->pt_id,
> -					    region->start_gfn,
> -					    region->nr_pages,
> -					    HV_MAP_GPA_NO_ACCESS, NULL);
> +		ret = hv_call_map_ram_pfns(partition->pt_id,
> +					   region->start_gfn,
> +					   region->nr_pfns,
> +					   HV_MAP_GPA_NO_ACCESS, NULL);
>  		break;
>  	case MSHV_REGION_TYPE_MMIO:
> -		ret = hv_call_map_mmio_pages(partition->pt_id,
> -					     region->start_gfn,
> -					     mmio_pfn,
> -					     region->nr_pages);
> +		ret = hv_call_map_mmio_pfns(partition->pt_id,
> +					    region->start_gfn,
> +					    mmio_pfn,
> +					    region->nr_pfns);
>  		break;
>  	}
> 
>  	trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
> -				   region->start_gfn, region->nr_pages,
> +				   region->start_gfn, region->nr_pfns,
>  				   region->hv_map_flags, ret);
> 
>  	if (ret)
> @@ -1424,7 +1424,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
>  	/* Paranoia check */
>  	if (region->start_uaddr != mem.userspace_addr ||
>  	    region->start_gfn != mem.guest_pfn ||
> -	    region->nr_pages != HVPFN_DOWN(mem.size)) {
> +	    region->nr_pfns != HVPFN_DOWN(mem.size)) {
>  		spin_unlock(&partition->pt_mem_regions_lock);
>  		return -EINVAL;
>  	}
> 
> 


^ permalink raw reply

* RE: [PATCH 2/7] mshv: Add support to address range holes remapping
From: Michael Kelley @ 2026-04-13 21:08 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177490106397.81669.13650500489059864399.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> 
> Consolidate memory region processing to handle both valid and invalid PFNs
> uniformly. This eliminates code duplication across remap, unmap, share, and
> unshare operations by using a common range processing interface.
> 
> Holes are now remapped with no-access permissions to enable
> hypervisor dirty page tracking for precopy live migration.
> 
> This refactoring is a precursor to an upcoming change that will map
> present pages in movable regions upon region creation, requiring
> consistent handling of both mapped and unmapped ranges.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_regions.c |  108
> ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 95 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index b1a707d16c07..ed9c55841140 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -119,6 +119,57 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
>  	return count;
>  }
> 
> +/**
> + * mshv_region_process_hole - Handle a hole (invalid PFNs) in a memory
> + *                            region
> + * @region    : Memory region containing the hole
> + * @flags     : Flags to pass to the handler function
> + * @pfn_offset: Starting PFN offset within the region
> + * @pfn_count : Number of PFNs in the hole
> + * @handler   : Callback function to invoke for the hole
> + *
> + * Invokes the handler function for a contiguous hole with the specified
> + * parameters.
> + *
> + * Return: Number of PFNs handled, or negative error code.
> + */
> +static long mshv_region_process_hole(struct mshv_mem_region *region,
> +				     u32 flags,
> +				     u64 pfn_offset, u64 pfn_count,
> +				     int (*handler)(struct mshv_mem_region *region,
> +						    u32 flags,
> +						    u64 pfn_offset,
> +						    u64 pfn_count,
> +						    bool huge_page))
> +{
> +	long ret;
> +
> +	ret = handler(region, flags, pfn_offset, pfn_count, 0);
> +	if (ret)
> +		return ret;
> +
> +	return pfn_count;
> +}
> +
> +static long mshv_region_process_chunk(struct mshv_mem_region *region,
> +				      u32 flags,
> +				      u64 pfn_offset, u64 pfn_count,
> +				      int (*handler)(struct mshv_mem_region *region,
> +						     u32 flags,
> +						     u64 pfn_offset,
> +						     u64 pfn_count,
> +						     bool huge_page))
> +{
> +	if (pfn_valid(region->mreg_pfns[pfn_offset]))
> +		return mshv_region_process_pfns(region, flags,
> +				pfn_offset, pfn_count,
> +				handler);
> +	else
> +		return mshv_region_process_hole(region, flags,
> +				pfn_offset, pfn_count,
> +				handler);
> +}
> +
>  /**
>   * mshv_region_process_range - Processes a range of PFNs in a region.
>   * @region    : Pointer to the memory region structure.
> @@ -146,33 +197,47 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
>  						    u64 pfn_count,
>  						    bool huge_page))
>  {
> -	u64 pfn_end;
> +	u64 start, end;
>  	long ret;
> 
> -	if (check_add_overflow(pfn_offset, pfn_count, &pfn_end))
> +	if (!pfn_count)
> +		return 0;
> +
> +	if (check_add_overflow(pfn_offset, pfn_count, &end))
>  		return -EOVERFLOW;
> 
> -	if (pfn_end > region->nr_pfns)
> +	if (end > region->nr_pfns)
>  		return -EINVAL;
> 
> -	while (pfn_count) {
> -		/* Skip non-present pages */
> -		if (!pfn_valid(region->mreg_pfns[pfn_offset])) {
> -			pfn_offset++;
> -			pfn_count--;
> +	start = pfn_offset;
> +	end = pfn_offset + 1;
> +
> +	while (end < pfn_offset + pfn_count) {
> +		/*
> +		 * Accumulate contiguous pfns with the same validity
> +		 * (valid or not).
> +		 */
> +		if (pfn_valid(region->mreg_pfns[start]) ==
> +		    pfn_valid(region->mreg_pfns[end])) {
> +			end++;
>  			continue;
>  		}
> 
> -		ret = mshv_region_process_pfns(region, flags,
> -					       pfn_offset, pfn_count,
> -					       handler);
> +		ret = mshv_region_process_chunk(region, flags,
> +						start, end - start,
> +						handler);
>  		if (ret < 0)
>  			return ret;
> 
> -		pfn_offset += ret;
> -		pfn_count -= ret;
> +		start += ret;
>  	}
> 
> +	ret = mshv_region_process_chunk(region, flags,
> +					start, end - start,
> +					handler);
> +	if (ret < 0)
> +		return ret;
> +
>  	return 0;
>  }
> 
> @@ -208,6 +273,9 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
>  				   u64 pfn_offset, u64 pfn_count,
>  				   bool huge_page)
>  {
> +	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> +		return -EINVAL;
> +
>  	if (huge_page)
>  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> 
> @@ -233,6 +301,9 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
>  				     u64 pfn_offset, u64 pfn_count,
>  				     bool huge_page)
>  {
> +	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> +		return -EINVAL;
> +
>  	if (huge_page)
>  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> 
> @@ -256,6 +327,14 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
>  				   u64 pfn_offset, u64 pfn_count,
>  				   bool huge_page)
>  {
> +	/*
> +	 * Remap missing pages with no access to let the
> +	 * hypervisor track dirty pages, enabling precopy live
> +	 * migration.
> +	 */
> +	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> +		flags = HV_MAP_GPA_NO_ACCESS;

Is it OK to wipe out any other flags that might be set? Certainly, any previous
flags in PERMISSIONS_MASK should be removed, but what about ADJUSTABLE
and NOT_CACHED?

> +
>  	if (huge_page)
>  		flags |= HV_MAP_GPA_LARGE_PAGE;
> 
> @@ -357,6 +436,9 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
>  				   u64 pfn_offset, u64 pfn_count,
>  				   bool huge_page)
>  {
> +	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> +		return 0;
> +
>  	if (huge_page)
>  		flags |= HV_UNMAP_GPA_LARGE_PAGE;
> 
> 
> 


^ permalink raw reply

* RE: [PATCH 3/7] mshv: Support regions with different VMAs
From: Michael Kelley @ 2026-04-13 21:08 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177490106973.81669.17734971204992890360.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> 
> Allow HMM fault handling across memory regions that span multiple VMAs
> with different protection flags. The previous implementation assumed a
> single VMA per region, which would fail when guest memory crosses VMA
> boundaries.
> 
> Iterate through VMAs within the range and handle each separately with
> appropriate protection flags, enabling more flexible memory region
> configurations for partitions.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_regions.c |   72 +++++++++++++++++++++++++++++++++------------
>  1 file changed, 52 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index ed9c55841140..1bb1bfe177e2 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -492,37 +492,72 @@ int mshv_region_get(struct mshv_mem_region *region)
>  }
> 
>  /**
> - * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
> + * mshv_region_hmm_fault_and_lock - Handle HMM faults across VMAs and lock
> + *                                  the memory region
>   * @region: Pointer to the memory region structure
> - * @range: Pointer to the HMM range structure
> + * @start : Starting virtual address of the range to fault
> + * @end   : Ending virtual address of the range to fault (exclusive)
> + * @pfns  : Output array for page frame numbers with HMM flags
>   *
>   * This function performs the following steps:
>   * 1. Reads the notifier sequence for the HMM range.
>   * 2. Acquires a read lock on the memory map.
> - * 3. Handles HMM faults for the specified range.
> - * 4. Releases the read lock on the memory map.
> - * 5. If successful, locks the memory region mutex.
> - * 6. Verifies if the notifier sequence has changed during the operation.
> - *    If it has, releases the mutex and returns -EBUSY to match with
> - *    hmm_range_fault() return code for repeating.
> + * 3. Iterates through VMAs in the specified range, handling each
> + *    separately with appropriate protection flags (HMM_PFN_REQ_WRITE set
> + *    based on VMA flags).
> + * 4. Handles HMM faults for each VMA segment.
> + * 5. Releases the read lock on the memory map.
> + * 6. If successful, locks the memory region mutex.
> + * 7. Verifies if the notifier sequence has changed during the operation.
> + *    If it has, releases the mutex and returns -EBUSY to signal retry.
> + *
> + * The function expects the range [start, end] is backed by valid VMAs.

Use "[start, end)" to describe the range since end is exclusive.

> + * Returns -EFAULT if any address in the range is not covered by a VMA.
>   *
>   * Return: 0 on success, a negative error code otherwise.
>   */
>  static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> -					  struct hmm_range *range)
> +					  unsigned long start,
> +					  unsigned long end,
> +					  unsigned long *pfns)
>  {
> +	struct hmm_range range = {
> +		.notifier = &region->mreg_mni,
> +	};
>  	int ret;
> 
> -	range->notifier_seq = mmu_interval_read_begin(range->notifier);
> +	range.notifier_seq = mmu_interval_read_begin(range.notifier);
>  	mmap_read_lock(region->mreg_mni.mm);
> -	ret = hmm_range_fault(range);
> +	while (start < end) {
> +		struct vm_area_struct *vma;
> +
> +		vma = vma_lookup(current->mm, start);

The mmap_read_lock() was obtained on region->mreg_mni.mm, but the
lookup is done against current->mm. Maybe these are the same, but
it looks wrong.  (Pointed out by a Co-Pilot AI review.)

> +		if (!vma) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		range.hmm_pfns = pfns;
> +		range.start = start;
> +		range.end = min(vma->vm_end, end);
> +		range.default_flags = HMM_PFN_REQ_FAULT;
> +		if (vma->vm_flags & VM_WRITE)
> +			range.default_flags |= HMM_PFN_REQ_WRITE;
> +
> +		ret = hmm_range_fault(&range);
> +		if (ret)
> +			break;
> +
> +		start = range.end + 1;

Since range.end is exclusive, the +1 should not be done.

> +		pfns += DIV_ROUND_UP(range.end - range.start, PAGE_SIZE);

Just to confirm, range.end and range.start should always be page aligned,
right? So the ROUND_UP should never kick in.

> +	}
>  	mmap_read_unlock(region->mreg_mni.mm);
>  	if (ret)
>  		return ret;
> 
>  	mutex_lock(&region->mreg_mutex);
> 
> -	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
> +	if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
>  		mutex_unlock(&region->mreg_mutex);
>  		cond_resched();
>  		return -EBUSY;
> @@ -546,10 +581,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
>  static int mshv_region_range_fault(struct mshv_mem_region *region,
>  				   u64 pfn_offset, u64 pfn_count)
>  {
> -	struct hmm_range range = {
> -		.notifier = &region->mreg_mni,
> -		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> -	};
> +	unsigned long start, end;
>  	unsigned long *pfns;
>  	int ret;
>  	u64 i;
> @@ -558,12 +590,12 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
>  	if (!pfns)
>  		return -ENOMEM;
> 
> -	range.hmm_pfns = pfns;
> -	range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
> -	range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
> +	start = region->start_uaddr + pfn_offset * PAGE_SIZE;
> +	end = start + pfn_count * PAGE_SIZE;
> 
>  	do {
> -		ret = mshv_region_hmm_fault_and_lock(region, &range);
> +		ret = mshv_region_hmm_fault_and_lock(region, start, end,
> +						     pfns);
>  	} while (ret == -EBUSY);
> 
>  	if (ret)
> 
> 


^ permalink raw reply

* RE: [PATCH 5/7] mshv: Map populated pages on movable region creation
From: Michael Kelley @ 2026-04-13 21:09 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177490108104.81669.2129535961068627665.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:05 PM
> 
> Map any populated pages into the hypervisor upfront when creating a
> movable region, rather than waiting for faults. Previously, movable
> regions were created with all pages marked as HV_MAP_GPA_NO_ACCESS
> regardless of whether the userspace mapping contained populated pages.
> 
> This guarantees that if the caller passes a populated mapping, those
> present pages will be mapped into the hypervisor immediately during
> region creation instead of being faulted in later.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_regions.c   |   65 ++++++++++++++++++++++++++++++++-----------
>  drivers/hv/mshv_root.h      |    1 +
>  drivers/hv/mshv_root_main.c |   10 +------
>  3 files changed, 50 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 133ec7771812..28d3f488d89f 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -519,7 +519,8 @@ int mshv_region_get(struct mshv_mem_region *region)
>  static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
>  					  unsigned long start,
>  					  unsigned long end,
> -					  unsigned long *pfns)
> +					  unsigned long *pfns,
> +					  bool do_fault)
>  {
>  	struct hmm_range range = {
>  		.notifier = &region->mreg_mni,
> @@ -540,9 +541,12 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
>  		range.hmm_pfns = pfns;
>  		range.start = start;
>  		range.end = min(vma->vm_end, end);
> -		range.default_flags = HMM_PFN_REQ_FAULT;
> -		if (vma->vm_flags & VM_WRITE)
> -			range.default_flags |= HMM_PFN_REQ_WRITE;
> +		range.default_flags = 0;
> +		if (do_fault) {
> +			range.default_flags = HMM_PFN_REQ_FAULT;
> +			if (vma->vm_flags & VM_WRITE)
> +				range.default_flags |= HMM_PFN_REQ_WRITE;
> +		}
> 
>  		ret = hmm_range_fault(&range);
>  		if (ret)
> @@ -567,26 +571,40 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
>  }
> 
>  /**
> - * mshv_region_range_fault - Handle memory range faults for a given region.
> - * @region: Pointer to the memory region structure.
> - * @pfn_offset: Offset of the page within the region.
> - * @pfn_count: Number of pages to handle.
> + * mshv_region_collect_and_map - Collect PFNs for a user range and map them
> + * @region    : memory region being processed
> + * @pfn_offset: PFNs offset within the region
> + * @pfn_count : number of PFNs to process
> + * @do_fault  : if true, fault in missing pages;
> + *              if false, collect only present pages
>   *
> - * This function resolves memory faults for a specified range of pages
> - * within a memory region. It uses HMM (Heterogeneous Memory Management)
> - * to fault in the required pages and updates the region's page array.
> + * Collects PFNs for the specified portion of @region from the
> + * corresponding userspace VMA and maps them into the hypervisor. The

Actually, this should be "userspace VMAs" (i.e., plural)

> + * behavior depends on @do_fault:
>   *
> - * Return: 0 on success, negative error code on failure.
> + * - true: Fault in missing pages from userspace, ensuring all pages in the
> + *   range are present. Used for on-demand page population.
> + * - false: Collect PFNs only for pages already present in userspace,
> + *   leaving missing pages as invalid PFN markers.
> + *   Used for initial region setup.
> + *
> + * Collected PFNs are stored in region->mreg_pfns[] with HMM bookkeeping
> + * flags cleared, then the range is mapped into the hypervisor. Present
> + * PFNs get mapped with region access permissions; missing PFNs (zero
> + * entries) get mapped with no-access permissions.

Hmmm. The missing PFNs are just skipped and the mreg_pfns[] array
is not updated. Is the corresponding entry in mreg_pfns[] known to
already be set to MSHV_INVALID_PFN? When mapping a new movable
region, that appears to be so. I'm less sure about the 
mshv_region_range_fault() case, though mshv_region_invalidate_pfns()
does such initialization of any entries that are invalidated. At that point
in the code, I'd add a comment about that assumption, as it took me a
bit to figure it out.

So does the comment about "zero entries" refer to what is returned
by hmm_range_fault() via mshv_region_hmm_fault_and_lock()?
The mention of "zero entries" here is a bit confusing.

> + *
> + * Return: 0 on success, negative errno on failure.
>   */
> -static int mshv_region_range_fault(struct mshv_mem_region *region,
> -				   u64 pfn_offset, u64 pfn_count)
> +static int mshv_region_collect_and_map(struct mshv_mem_region *region,
> +				       u64 pfn_offset, u64 pfn_count,
> +				       bool do_fault)
>  {
>  	unsigned long start, end;
>  	unsigned long *pfns;
>  	int ret;
>  	u64 i;
> 
> -	pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
> +	pfns = vmalloc_array(pfn_count, sizeof(unsigned long));
>  	if (!pfns)
>  		return -ENOMEM;
> 
> @@ -595,7 +613,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> 
>  	do {
>  		ret = mshv_region_hmm_fault_and_lock(region, start, end,
> -						     pfns);
> +						     pfns, do_fault);
>  	} while (ret == -EBUSY);
> 
>  	if (ret)
> @@ -613,10 +631,17 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> 
>  	mutex_unlock(&region->mreg_mutex);
>  out:
> -	kfree(pfns);
> +	vfree(pfns);
>  	return ret;
>  }
> 
> +static int mshv_region_range_fault(struct mshv_mem_region *region,
> +				   u64 pfn_offset, u64 pfn_count)
> +{
> +	return mshv_region_collect_and_map(region, pfn_offset, pfn_count,
> +					   true);
> +}
> +
>  bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
>  {
>  	u64 pfn_offset, pfn_count;
> @@ -800,3 +825,9 @@ int mshv_map_pinned_region(struct mshv_mem_region
> *region)
>  err_out:
>  	return ret;
>  }
> +
> +int mshv_map_movable_region(struct mshv_mem_region *region)
> +{
> +	return mshv_region_collect_and_map(region, 0, region->nr_pfns,
> +					   false);
> +}
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index d2e65a137bf4..02c1c11f701c 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -374,5 +374,6 @@ bool mshv_region_handle_gfn_fault(struct mshv_mem_region
> *region, u64 gfn);
>  void mshv_region_movable_fini(struct mshv_mem_region *region);
>  bool mshv_region_movable_init(struct mshv_mem_region *region);
>  int mshv_map_pinned_region(struct mshv_mem_region *region);
> +int mshv_map_movable_region(struct mshv_mem_region *region);
> 
>  #endif /* _MSHV_ROOT_H_ */
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index c393b5144e0b..91dab2a3bc92 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1299,15 +1299,7 @@ mshv_map_user_memory(struct mshv_partition
> *partition,
>  		ret = mshv_map_pinned_region(region);
>  		break;
>  	case MSHV_REGION_TYPE_MEM_MOVABLE:
> -		/*
> -		 * For movable memory regions, remap with no access to let
> -		 * the hypervisor track dirty pages, enabling pre-copy live
> -		 * migration.
> -		 */
> -		ret = hv_call_map_ram_pfns(partition->pt_id,
> -					   region->start_gfn,
> -					   region->nr_pfns,
> -					   HV_MAP_GPA_NO_ACCESS, NULL);
> +		ret = mshv_map_movable_region(region);
>  		break;
>  	case MSHV_REGION_TYPE_MMIO:
>  		ret = hv_call_map_mmio_pfns(partition->pt_id,
> 
> 


^ permalink raw reply

* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Easwar Hariharan @ 2026-04-13 21:29 UTC (permalink / raw)
  To: Michael Kelley
  Cc: easwar.hariharan, Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
	robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB4157DC555A9EC7A73E2CB8CBD4582@SN6PR02MB4157.namprd02.prod.outlook.com>

On 4/9/2026 12:01 PM, Michael Kelley wrote:
> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Wednesday, April 8, 2026 1:21 PM
>>
>> On 1/11/2026 9:36 AM, Michael Kelley wrote:
>>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Friday, January 9, 2026 10:41 AM
>>>>
>>>> On 1/8/2026 10:46 AM, Michael Kelley wrote:
>>>>> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
>>>>>>
>>>>>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
>>>>>>
>>>>>> Hyper-V uses a logical device ID to identify a PCI endpoint device for
>>>>>> child partitions. This ID will also be required for future hypercalls
>>>>>> used by the Hyper-V IOMMU driver.
>>>>>>
>>>>>> Refactor the logic for building this logical device ID into a standalone
>>>>>> helper function and export the interface for wider use.
>>>>>>
>>>>>> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
>>>>>> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
>>>>>> ---
>>>>>>  drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++--------
>>>>>>  include/asm-generic/mshyperv.h      |  2 ++
>>>>>>  2 files changed, 22 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
>>>>>> index 146b43981b27..4b82e06b5d93 100644
>>>>>> --- a/drivers/pci/controller/pci-hyperv.c
>>>>>> +++ b/drivers/pci/controller/pci-hyperv.c
>>>>>> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>>>>>>
>>>>>>  #define hv_msi_prepare		pci_msi_prepare
>>>>>>
>>>>>> +/**
>>>>>> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the
>>>>>> + * function number of the device.
>>>>>> + */
>>>>>> +u64 hv_build_logical_dev_id(struct pci_dev *pdev)
>>>>>> +{
>>>>>> +	struct pci_bus *pbus = pdev->bus;
>>>>>> +	struct hv_pcibus_device *hbus = container_of(pbus->sysdata,
>>>>>> +						struct hv_pcibus_device, sysdata);
>>>>>> +
>>>>>> +	return (u64)((hbus->hdev->dev_instance.b[5] << 24) |
>>>>>> +		     (hbus->hdev->dev_instance.b[4] << 16) |
>>>>>> +		     (hbus->hdev->dev_instance.b[7] << 8)  |
>>>>>> +		     (hbus->hdev->dev_instance.b[6] & 0xf8) |
>>>>>> +		     PCI_FUNC(pdev->devfn));
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id);
>>>>>
>>>>> This change is fine for hv_irq_retarget_interrupt(), it doesn't help for the
>>>>> new IOMMU driver because pci-hyperv.c can (and often is) built as a module.
>>>>> The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't
>>>>> use this symbol in that case -- you'll get a link error on vmlinux when building
>>>>> the kernel. Requiring pci-hyperv.c to *not* be built as a module would also
>>>>> require that the VMBus driver not be built as a module, so I don't think that's
>>>>> the right solution.
>>>>>
>>>>> This is a messy problem. The new IOMMU driver needs to start with a generic
>>>>> "struct device" for the PCI device, and somehow find the corresponding VMBus
>>>>> PCI pass-thru device from which it can get the VMBus instance ID. I'm thinking
>>>>> about ways to do this that don't depend on code and data structures that are
>>>>> private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion.
>>>>
>>>> Thank you, Michael. FWIW, I did try to pull out the device ID components out of
>>>> pci-hyperv into include/linux/hyperv.h and/or a new include/linux/pci-hyperv.h
>>>> but it was just too messy as you say.
>>>
>>> Yes, the current approach for getting the device ID wanders through struct
>>> hv_pcibus_device (which is private to the pci-hyperv driver), and through
>>> struct hv_device (which is a VMBus data structure). That makes the linkage
>>> between the PV IOMMU driver and the pci-hyperv and VMBus drivers rather
>>> substantial, which is not good.
>>
>> Hi Michael,
>>
>> I missed this, or made a mental note to follow up but forgot. Either way, Yu reminded
>> me about this email chain and I started looking at it this week.
>>
>>>
>>> But here's an idea for an alternate approach. The PV IOMMU driver doesn't
>>> have to generate the logical device ID on-the-fly by going to the dev_instance
>>> field of struct hv_device. Instead, the pci-hyperv driver can generate the logical
>>> device ID in hv_pci_probe(), and put it somewhere that's easy for the IOMMU
>>> driver to access. The logical device ID doesn't change while Linux is running, so
>>> stashing another copy somewhere isn't a problem.
>>
>> In my exploration and consulting with Dexuan, I realized that one of the components of
>> the logical device ID, the PCI function number is set only in pci_scan_device(), well into
>> pci_scan_root_bus_bridge() that you call out as the point by which the communication
>> must have occurred.
>>
>> But then, Dexuan also pointed me to hv_pci_assign_slots() with its call to wslot_to_devfn() and I'm
>> honestly confused how these two interact. With the current approach, it looks like whatever
>> devfn pci_scan_device() set is the correct function number to use for the logical device
>> ID, in which case, the best I can do with your suggested approach below is to inform the
>> pvIOMMU driver of the GUID, rather than the logical device ID itself.
>>
>> Perhaps with your history, you can clarify the interaction, and/or share your thoughts
>> on the above?
> 
> During hv_pci_probe(), hv_pci_query_relations() is called to ask the Hyper-V
> host about what PCI devices are present. hv_pci_query_relations() sends a
> PCI_QUERY_BUS_RELATIONS message to the host, and the host send back a
> PCI_BUS_RELATIONS or PCI_BUS_RELATIONS2 message. The response message
> is handled in hv_pci_onchannelcallback(), which calls hv_pci_devices_present()
> or hv_pci_devices_present2().  The latter two functions both call
> hv_pci_start_relations_work() to add a request to a workqueue that runs
> pci_devices_present_work().  Finally, pci_devices_present_work() calls
> pc_scan_child_bus(), followed by hv_pci_assign_slots().
> 
> In hv_pci_assign_slots, you can see that the PCI_BUS_RELATIONS[2]
> info from the Hyper-V host contains a function number encoded in the
> win_slot field. So the Hyper-V host *does* tell the guest the function number.
> However, the generic Linux PCI subsystem doesn't use this function number.
> It still scans the PCI device, trying successive function numbers to see which
> ones work. The scan should find the same function number that the Hyper-V
> host originally reported.
> 
> As you noted, there's a sequencing problem in waiting for
> pci_scan_single_device() to find the function number. In the hv_pci_probe()
> path, after hv_pci_query_relations() runs and before create_root_hv_pci_bus()
> is called, it seems feasible to use the function number provided by the
> Hyper-V host to construct the logical device ID. That should work. But there's
> another path, in that the Hyper-V host can generate a PCI_BUS_RELATIONS[2]
> message without a request from Linux when something on the host side changes
> the PCI device setup. There's a code path where pci_devices_present_work()
> finds the state is "hv_pcibus_installed", and directly calls pci_scan_child_bus().
> This path would presumably also need to construct (or re-construct) the
> logical device ID using the information from the Hyper-V host before calling
> pci_scan_child_bus(). I'm vague on the scenario for this latter case, but the
> code is obviously there to handle it.
> 
> The other approach is as you suggest. The Hyper-V PCI driver can tell
> the IOMMU driver the almost complete logical device ID, using just the
> GUID bits. Then the IOMMU driver can then construct the full logical
> device ID by adding the function number from the struct pci_dev. I don't
> see a problem with this approach -- other IOMMU drivers are referencing
> the struct pci_dev, and pulling out the function number doesn't seem like
> a violation of layering.
> 

Thanks for that explanation, that makes sense. I didn't see any serialization
that would ensure that the VMBus path to communicate the child devices on the bus
would complete before pci_scan_device() finds and finalizes the pci_dev. I think it's
safest to take the approach to communicate the GUID, and find the function number from
the pci_dev. This does mean that there will be an essentially identical copy of
hv_build_logical_dev_id() in the IOMMU code, but a comment can explain that.

>>
>>>
>>> So have the Hyper-V PV IOMMU driver provide an EXPORTed function to accept
>>> a PCI domain ID and the related logical device ID. The PV IOMMU driver is
>>> responsible for storing this data in a form that it can later search. hv_pci_probe()
>>> calls this new function when it instantiates a new PCI pass-thru device. Then when
>>> the IOMMU driver needs to attach a new device, it can get the PCI domain ID
>>> from the struct pci_dev (or struct pci_bus), search for the related logical device
>>> ID in its own data structure, and use it. The pci-hyperv driver has a dependency
>>> on the IOMMU driver, but that's a dependency in the desired direction. The
>>> PCI domain ID and logical device ID are just integers, so no data structures are
>>> shared.
>>
>> In a previous reply on this thread, you raised the uniqueness issue of bytes 4 and 5
>> of the GUID being used to create the domain number. I thought this approach could
>> help with that too, but as I coded it up, I realized that using the domain number
>> (not guaranteed to be unique) to search for the bus instance GUID (guaranteed to be unique)
>> is the wrong way around. It is unfortunately the only available key in the pci_dev
>> handed to the pvIOMMU driver in this approach though...
>>
>> Do you think that's a fatal flaw?
> 
> There are two uniqueness problems, which I didn't fully separate conceptually
> until writing this. One problem is constructing a PCI domain ID that Linux can use
> to identify the virtual PCI bus that the Hyper-V PCI driver creates for each vPCI
> device. The Hyper-V virtual PCI driver uses GUID bytes 4 and 5, and recognizes
> that they might not be unique. So there's code in hv_pci_probe() to pick another
> number if there's a duplicate. Hyper-V doesn't really care how Linux picks the
> domain ID for the virtual PCI bus as it's purely a Linux construct.

This part matters for the IOMMU driver as it is the key we will use to search the data
structure to find the right GUID to construct the logical dev ID that Hyper-V recognizes.

> 
> The second problem is the logical device ID that Hyper-V interprets to
> identify a vPCI device in hypercalls such a HVCALL_RETARGET_INTERRUPT
> and the new pvIOMMU related hypercalls. This logical device ID uses
> GUID bytes 4 thru 7 (minus 1 bit).  I don’t think Linux uses the
> logical device ID for anything. Since only Hyper-V interprets it, Hyper-V
> must somehow be ensuring uniqueness of bytes 4 thru 7 (minus 1 bit).
> That's something to confirm with the Hyper-V team. If they are just hoping
> for the best, I don't know how Linux can solve the problem.

I checked with the Hyper-V vPCI team on this aspect and the only guarantee that
they provide is that, at any given time, there will only be 1 device with a given
logical ID attached to a VM. Once a device has been removed, everything about it is
forgotten from the Hyper-V stack's perspective, and nothing in the Hyper-V stack would
prevent a scenario where, for example, a data movement accelerator is attached with
logical ID X, then revoked, and let's say a NIC is attached with the same logical ID X.

Also, FWIW, they also stated that the GUID is not unique and cannot be guaranteed to be
unique because it's the GUID for the bus, not the individual devices.

> My original comment about uniqueness somewhat conflated the two problems,
> and that's misleading. The use of the logical device ID has been around for years
> in hv_irq_retarget_interrupt(). Extending its use to the new pvIOMMU
> hypercalls doesn't make things any worse. But I'm still curious about
> what the Hyper-V team says about the uniqueness of bytes 4 thru 7.
> 
> Michael
> c
>>
>>>
>>> Note that the pci-hyperv must inform the PV IOMMU driver of the logical
>>> device ID *before* create_root_hv_pci_bus() calls pci_scan_root_bus_bridge().
>>> The latter function eventually invokes hv_iommu_attach_dev(), which will
>>> need the logical device ID. See example stack trace. [1]
>>>
>>> I don't think the pci-hyperv driver even needs to tell the IOMMU driver to
>>> remove the information if a PCI pass-thru device is unbound or removed, as
>>> the logical device ID will be the same if the device ever comes back. At worst,
>>> the IOMMU driver can simply replace an existing logical device ID if a new one
>>> is provided for the same PCI domain ID.
>>
>> As above, replacing a unique GUID when a result is found for a non-unique
>> key value may be prone to failure if it happens that the device that came "back"
>> is not in fact the same device (or class of device) that went away and just happens
>> to, either due to bytes 4 and 5 being identical, or due to collision in the
>> pci_domain_nr_dynamic_ida, have the same domain number.

Given the vPCI team's statements (above), I think we will need to handle unbind or
removal and ensure the pvIOMMU drivers data structure is invalidated when either happens.

<snip>

Thanks,
Easwar (he/him)

^ permalink raw reply

* Re: [PATCH] scsi: storvsc: Handle PERSISTENT_RESERVE_IN truncation for Hyper-V vFC
From: Martin K. Petersen @ 2026-04-14  2:19 UTC (permalink / raw)
  To: linux-scsi, Li Tian
  Cc: Martin K . Petersen, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, James E.J. Bottomley, linux-hyperv,
	linux-kernel
In-Reply-To: <20260406015344.12566-1-litian@redhat.com>

On Mon, 06 Apr 2026 09:53:44 +0800, Li Tian wrote:

> The storvsc driver has become stricter in handling
> SRB status codes returned by the Hyper-V host. When using Virtual
> Fibre Channel (vFC) passthrough, the host may return
> SRB_STATUS_DATA_OVERRUN for PERSISTENT_RESERVE_IN commands if the
> allocation length in the CDB does not match the host's expected
> response size.
> 
> [...]

Applied to 7.1/scsi-queue, thanks!

[1/1] scsi: storvsc: Handle PERSISTENT_RESERVE_IN truncation for Hyper-V vFC
      https://git.kernel.org/mkp/scsi/c/9cf351b289fb

-- 
Martin K. Petersen

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox