Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH v2 09/15] Drivers: hv: mshv_vtl: Move hv_vtl_configure_reg_page() to x86
From: Naman Jain @ 2026-04-29  9:57 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157467FDBC0203C67A67042D4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:10 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move hv_vtl_configure_reg_page() from drivers/hv/mshv_vtl_main.c to
>> arch/x86/hyperv/hv_vtl.c. The register page overlay is an x86-specific
>> feature that uses HV_X64_REGISTER_REG_PAGE, so its configuration belongs
>> in architecture-specific code.
>>
>> Move struct mshv_vtl_per_cpu and union hv_synic_overlay_page_msr to
>> include/asm-generic/mshyperv.h so they are visible to both arch and
>> driver code.
>>
>> Change the return type from void to bool so the caller can determine
>> whether the register page was successfully configured and set
>> mshv_has_reg_page accordingly.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/x86/hyperv/hv_vtl.c       | 32 ++++++++++++++++++++++
>>   drivers/hv/mshv_vtl_main.c     | 49 +++-------------------------------
>>   include/asm-generic/mshyperv.h | 17 ++++++++++++
>>   3 files changed, 53 insertions(+), 45 deletions(-)
>>

<snip>

>>   #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
> 
> This comment pre-dates your patch, but I don't understand the point
> it is trying to make. The comment is factually true, but I don't know
> why calling that out is relevant. The REG_PAGE MSR seems to be
> conceptually separate and distinct from the SIMP MSR, so the fact
> that the layouts are the same is just a coincidence. Or is there some
> relationship between the two MSRs that I'm not aware of, and the
> comment is trying (and failing?) to point out?

This was added as per suggestion from Nuno in my initial series for 
MSHV_VTL. If the reference in "identical to" is misleading, I should 
remove it.

https://lore.kernel.org/all/68143eb0-e6a7-4579-bedb-4c2ec5aaef6b@linux.microsoft.com/

Quoting:
"""
it is a generic structure that
appears to be used for several overlay page MSRs (SIMP, SIEF, etc).

But, the type doesn't appear in the hv*dk headers explicitly; it's just
used internally by the hypervisor.

I think it should be renamed with a hv_ prefix to indicate it's part of
the hypervisor ABI, and a brief comment with the provenance:

/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
union hv_synic_overlay_page_msr {
	/* <snip> */
};
"""

> 
>> +union hv_synic_overlay_page_msr {
>> +	u64 as_uint64;
>> +	struct {
>> +		u64 enabled: 1;
>> +		u64 reserved: 11;
>> +		u64 pfn: 52;
>> +	} __packed;
>> +};
>> +
>>   u8 __init get_vtl(void);
>>   void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
>>   #else
>>   static inline u8 get_vtl(void) { return 0; }
>>   static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> +static inline bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu) { return false; }
> 
> As with Patch 8, if CONFIG_HYPERV_VTL_MODE caused mshv_common.o
> to be built, this stub wouldn't be needed.
> 

Acked.


>>   #endif
>>
>>   #endif
>> --
>> 2.43.0
>>

Regards,
Naman

^ permalink raw reply

* Re: [PATCH v4 1/3] mshv: limit SynIC management to MSHV-owned resources
From: Anirudh Rayabharam @ 2026-04-29  9:58 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-2-jloeser@linux.microsoft.com>

On Mon, Apr 27, 2026 at 02:38:52PM -0700, Jork Loeser wrote:
> The SynIC is shared between VMBus and MSHV. VMBus owns the message
> page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
> and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
> 
> Currently mshv_synic_cpu_init() redundantly enables SIMP, SIEFP, and
> SCONTROL that VMBus already configured, and mshv_synic_cpu_exit()
> disables all of them. This is wrong because MSHV can be torn down
> while VMBus is still active. In particular, a kexec reboot notifier
> tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
> from under VMBus causes its later cleanup to write SynIC MSRs while
> SynIC is disabled, which the hypervisor does not tolerate.
> 
> Restrict MSHV to managing only the resources it owns:
> - SINT0, SINT5: mask on cleanup, unmask on init
> - SIRBP: enable/disable as before
> - SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
>   and nested root partition); on a non-nested root partition VMBus
>   does not run, so MSHV must enable/disable them
> 
> While here, fix the SIEFP and SIRBP memremap() and virt_to_phys()
> calls to use HV_HYP_PAGE_SHIFT/HV_HYP_PAGE_SIZE instead of
> PAGE_SHIFT/PAGE_SIZE. The hypervisor always uses 4K pages for SynIC
> register GPAs regardless of the kernel page size, so using PAGE_SHIFT
> produces wrong addresses on ARM64 with 64K pages.
> 
> Note that initialization order matters - VMBUS first, MSHV second,
> and the reverse on de-init. Ideally, we would want a dedicated SYNIC
> driver that replaces the cross-dependencies with a clear API and
> dynamic tracking. Such refactor should go into its own dedicated
> series, outside of this kexec fix series.
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  drivers/hv/hv.c         |   3 +
>  drivers/hv/mshv_synic.c | 150 ++++++++++++++++++++++++++--------------
>  2 files changed, 103 insertions(+), 50 deletions(-)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH v4 2/3] mshv: clean up SynIC state on kexec for L1VH
From: Anirudh Rayabharam @ 2026-04-29  9:58 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-3-jloeser@linux.microsoft.com>

On Mon, Apr 27, 2026 at 02:38:53PM -0700, Jork Loeser wrote:
> The reboot notifier that tears down the SynIC cpuhp state guards the
> cleanup with hv_root_partition(), so on L1VH (where
> hv_root_partition() is false) SINT0, SINT5, and SIRBP are never
> cleaned up before kexec. The kexec'd kernel then inherits stale
> unmasked SINTs and an enabled SIRBP pointing to freed memory.
> 
> Remove the hv_root_partition() guard so the cleanup runs for all
> parent partitions.
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 2db3b0192eac..978a1cace341 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -723,9 +723,6 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
>  static int mshv_synic_reboot_notify(struct notifier_block *nb,
>  			      unsigned long code, void *unused)
>  {
> -	if (!hv_root_partition())
> -		return 0;
> -
>  	cpuhp_remove_state(synic_cpuhp_online);
>  	return 0;
>  }
> -- 
> 2.43.0
> 

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH v2 12/15] mshv_vtl: Move VSM code page offset logic to x86 files
From: Naman Jain @ 2026-04-29 10:00 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157E0525DDDD153888F5AFBD4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:10 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> The VSM code page offset register (HV_REGISTER_VSM_CODE_PAGE_OFFSETS)
>> is x86 specific, its value configures the static call used to return
>> to VTL0 via the hypercall page. Move the register read from the common
>> mshv_vtl_get_vsm_regs() into the x86 mshv_vtl_return_call_init(),
>> which is the sole consumer of the offset.
>>
>> Change mshv_vtl_return_call_init() from taking a u64 parameter
>> to taking no arguments, and rename mshv_vtl_get_vsm_regs() to
>> mshv_vtl_get_vsm_cap_reg() since it now only fetches
>> HV_REGISTER_VSM_CAPABILITIES.
>>
>> No functional change on x86. This prepares the common driver code for
>> ARM64 where VSM code page offsets do not apply.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/x86/hyperv/hv_vtl.c        | 19 +++++++++++++++++--
>>   arch/x86/include/asm/mshyperv.h |  4 ++--
>>   drivers/hv/mshv_vtl_main.c      | 24 +++++++++++++-----------
>>   3 files changed, 32 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
>> index f3ffb6a7cb2d..7c10b34cf8a4 100644
>> --- a/arch/x86/hyperv/hv_vtl.c
>> +++ b/arch/x86/hyperv/hv_vtl.c
>> @@ -293,10 +293,25 @@ EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
>>
>>   DEFINE_STATIC_CALL_NULL(__mshv_vtl_return_hypercall, void (*)(void));
>>
>> -void mshv_vtl_return_call_init(u64 vtl_return_offset)
>> +int mshv_vtl_return_call_init(void)
>>   {
>> +	struct hv_register_assoc vsm_pg_offset_reg;
>> +	union hv_register_vsm_page_offsets offsets;
>> +	int ret;
>> +
>> +	vsm_pg_offset_reg.name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>> +
>> +	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> +				       1, input_vtl_zero, &vsm_pg_offset_reg);
>> +	if (ret)
>> +		return ret;
>> +
>> +	offsets.as_uint64 = vsm_pg_offset_reg.value.reg64;
>> +
>>   	static_call_update(__mshv_vtl_return_hypercall,
>> -			   (void *)((u8 *)hv_hypercall_pg + vtl_return_offset));
>> +			   (void *)((u8 *)hv_hypercall_pg + offsets.vtl_return_offset));
>> +
>> +	return 0;
>>   }
>>   EXPORT_SYMBOL(mshv_vtl_return_call_init);
>>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index b4d80c9a673a..b48f115c1292 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -286,14 +286,14 @@ struct mshv_vtl_cpu_context {
>>   #ifdef CONFIG_HYPERV_VTL_MODE
>>   void __init hv_vtl_init_platform(void);
>>   int __init hv_vtl_early_init(void);
>> -void mshv_vtl_return_call_init(u64 vtl_return_offset);
>> +int mshv_vtl_return_call_init(void);
>>   void mshv_vtl_return_hypercall(void);
>>   void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>   int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, bool shared);
>>   #else
>>   static inline void __init hv_vtl_init_platform(void) {}
>>   static inline int __init hv_vtl_early_init(void) { return 0; }
>> -static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>> +static inline int mshv_vtl_return_call_init(void) { return 0; }
>>   static inline void mshv_vtl_return_hypercall(void) {}
>>   static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>>   #endif
>> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
>> index 4c9ae65ad3e8..be498c9234fd 100644
>> --- a/drivers/hv/mshv_vtl_main.c
>> +++ b/drivers/hv/mshv_vtl_main.c
>> @@ -79,7 +79,6 @@ struct mshv_vtl {
>>   };
>>
>>   static struct mutex mshv_vtl_poll_file_lock;
>> -static union hv_register_vsm_page_offsets mshv_vsm_page_offsets;
>>   static union hv_register_vsm_capabilities mshv_vsm_capabilities;
>>
>>   static DEFINE_PER_CPU(struct mshv_vtl_poll_file, mshv_vtl_poll_file);
>> @@ -203,21 +202,19 @@ static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>>   	/* VTL2 Host VSP SINT is (un)masked when the user mode requests that */
>>   }
>>
>> -static int mshv_vtl_get_vsm_regs(void)
>> +static int mshv_vtl_get_vsm_cap_reg(void)
>>   {
>> -	struct hv_register_assoc registers[2];
>> -	int ret, count = 2;
>> +	struct hv_register_assoc vsm_capability_reg;
>> +	int ret;
>>
>> -	registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>> -	registers[1].name = HV_REGISTER_VSM_CAPABILITIES;
>> +	vsm_capability_reg.name = HV_REGISTER_VSM_CAPABILITIES;
>>
>>   	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> -				       count, input_vtl_zero, registers);
>> +				       1, input_vtl_zero, &vsm_capability_reg);
>>   	if (ret)
>>   		return ret;
>>
>> -	mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
>> -	mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;
>> +	mshv_vsm_capabilities.as_uint64 = vsm_capability_reg.value.reg64;
>>
>>   	return ret;
> 
> Nit: This could be just "return 0".

Acked.

> 
>>   }
>> @@ -1139,13 +1136,18 @@ static int __init mshv_vtl_init(void)
>>   	tasklet_init(&msg_dpc, mshv_vtl_sint_on_msg_dpc, 0);
>>   	init_waitqueue_head(&fd_wait_queue);
>>
>> -	if (mshv_vtl_get_vsm_regs()) {
>> +	if (mshv_vtl_get_vsm_cap_reg()) {
>>   		dev_emerg(dev, "Unable to get VSM capabilities !!\n");
> 
> Why is this failure an emergency message, while the other failures
> here in mshv_vtl_init() are just error messages? When there's lack
> of consistency, I always wonder if there is a reason ..... :-)

It might be because I didn’t pay enough attention to the old code :)
dev_err() should work just fine, I'll change it.


> 
>>   		ret = -ENODEV;
>>   		goto free_dev;
>>   	}
>>
>> -	mshv_vtl_return_call_init(mshv_vsm_page_offsets.vtl_return_offset);
>> +	ret = mshv_vtl_return_call_init();
>> +	if (ret) {
>> +		dev_err(dev, "mshv_vtl_return_call_init failed: %d\n", ret);
>> +		goto free_dev;
>> +	}
>> +
>>   	ret = hv_vtl_setup_synic();
>>   	if (ret)
>>   		goto free_dev;
>> --
>> 2.43.0
>>

Regards,
Naman

^ permalink raw reply

* Re: [PATCH v4 3/3] mshv: unmap debugfs stats pages on kexec
From: Anirudh Rayabharam @ 2026-04-29 10:10 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-4-jloeser@linux.microsoft.com>

On Mon, Apr 27, 2026 at 02:38:54PM -0700, Jork Loeser wrote:
> On L1VH, debugfs stats pages are overlay pages: the kernel allocates
> them and registers the GPAs with the hypervisor via
> HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
> hypervisor across kexec. If the kexec'd kernel reuses those physical
> pages, the hypervisor's overlay semantics cause a machine check
> exception.
> 
> Fix this by calling mshv_debugfs_exit() from the reboot notifier,
> which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
> kexec. This releases the overlay bindings so the physical pages can be
> safely reused. Guard mshv_debugfs_exit() against being called when
> init failed.
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  drivers/hv/mshv_debugfs.c | 7 ++++++-
>  drivers/hv/mshv_synic.c   | 1 +
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
> index 418b6dc8f3c2..3c3e02237ae9 100644
> --- a/drivers/hv/mshv_debugfs.c
> +++ b/drivers/hv/mshv_debugfs.c
> @@ -674,8 +674,10 @@ int __init mshv_debugfs_init(void)
>  
>  	mshv_debugfs = debugfs_create_dir("mshv", NULL);
>  	if (IS_ERR(mshv_debugfs)) {
> +		err = PTR_ERR(mshv_debugfs);
> +		mshv_debugfs = NULL;
>  		pr_err("%s: failed to create debugfs directory\n", __func__);

Might as well print err here.

Nevertheless:

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

Thanks,
Anirudh.


^ permalink raw reply

* Re: [PATCH V1 04/13] mshv: Provide a way to get partition id if running in a VMM process
From: Anirudh Rayabharam @ 2026-04-29 10:35 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260422023239.1171963-5-mrathor@linux.microsoft.com>

On Tue, Apr 21, 2026 at 07:32:30PM -0700, Mukesh R wrote:
> Many PCI passthru related hypercalls require partition id of the target
> guest. Guests are actually managed by MSHV driver and the partition id
> is only maintained there. Add a field in the partition struct in MSHV
> driver to save the tgid of the VMM process creating the partition,
> and add a function there to retrieve partition id if current process
> is a VMM process.
> 
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root.h         |  1 +
>  drivers/hv/mshv_root_main.c    | 22 ++++++++++++++++++++++
>  include/asm-generic/mshyperv.h |  5 +++++
>  3 files changed, 28 insertions(+)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH] mshv: Simplify GPA map/unmap hypercall helpers
From: Anirudh Rayabharam @ 2026-04-29 11:02 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177741845948.632922.14128507833980339307.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Tue, Apr 28, 2026 at 11:21:12PM +0000, Stanislav Kinsburskii wrote:
> Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
> preceding bug-fix patches:
> 
> Move "done += completed" before the status checks so that pages mapped
> by a partially-successful batch are included in the error cleanup unmap.
> Previously these mappings were leaked on failure.
> 
> While here, improve type safety and readability:
>  - Change "int done" to "u64 done" to match the u64 page_count it is
>    compared against, avoiding signed/unsigned comparison hazards.
>  - Use u64 for loop iteration and batch size variables consistently.
>  - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
>  - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
>  - Simplify the error-path unmap to use "done << large_shift" directly
>    instead of mutating done in place.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_hv_call.c |   55 +++++++++++++++-------------------------
>  1 file changed, 20 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index e5992c324904a..f5f205a397834 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>  	struct hv_input_map_gpa_pages *input_page;
>  	u64 status, *pfnlist;
>  	unsigned long irq_flags, large_shift = 0;
> -	int ret = 0, done = 0;
> -	u64 page_count = page_struct_count;
> +	u64 done = 0, page_count = page_struct_count;
> +	int ret = 0;
>  
>  	if (page_count == 0 || (pages && mmio_spa))
>  		return -EINVAL;
> @@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>  	}
>  
>  	while (done < page_count) {
> -		ulong i, completed, remain = page_count - done;
> -		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> +		u64 i, completed, remain = page_count - done;
> +		u64 rep_count = min(remain, (u64)HV_MAP_GPA_BATCH_SIZE);
>  
>  		local_irq_save(irq_flags);
>  		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> @@ -224,23 +224,13 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>  		input_page->map_flags = flags;
>  		pfnlist = input_page->source_gpa_page_list;
>  
> -		for (i = 0; i < rep_count; i++)
> -			if (flags & HV_MAP_GPA_NO_ACCESS) {
> +		for (i = 0; i < rep_count; i++) {
> +			if (flags & HV_MAP_GPA_NO_ACCESS)
>  				pfnlist[i] = 0;
> -			} else if (pages) {
> -				u64 index = (done + i) << large_shift;
> -
> -				if (index >= page_struct_count) {
> -					ret = -EINVAL;
> -					break;
> -				}
> -				pfnlist[i] = page_to_pfn(pages[index]);
> -			} else {
> +			else if (pages)
> +				pfnlist[i] = page_to_pfn(pages[(done + i) << large_shift]);
> +			else
>  				pfnlist[i] = mmio_spa + done + i;
> -			}
> -		if (ret) {
> -			local_irq_restore(irq_flags);
> -			break;
>  		}
>  
>  		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
> @@ -248,29 +238,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>  		local_irq_restore(irq_flags);
>  
>  		completed = hv_repcomp(status);
> +		done += completed;
>  
>  		if (hv_result_needs_memory(status)) {
>  			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
>  						    HV_MAP_GPA_DEPOSIT_PAGES);
>  			if (ret)
>  				break;
> -
>  		} else if (!hv_result_success(status)) {
>  			ret = hv_result_to_errno(status);
>  			break;
>  		}
> -
> -		done += completed;
>  	}
>  
>  	if (ret && done) {
>  		u32 unmap_flags = 0;
>  
> -		if (flags & HV_MAP_GPA_LARGE_PAGE) {
> +		if (flags & HV_MAP_GPA_LARGE_PAGE)
>  			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> -			done <<= large_shift;
> -		}
> -		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> +		hv_call_unmap_gpa_pages(partition_id, gfn,
> +					done << large_shift, unmap_flags);

How does this work? Earlier we were doing "done << large_shift" only if
HV_MAP_GPA_LARGE_PAGE is set but now we always do it.

Thanks,
Anirudh.


^ permalink raw reply

* Re: [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Olaf Hering @ 2026-04-29 12:27 UTC (permalink / raw)
  To: Thorsten Blum
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Greg Kroah-Hartman, stable, Ky Srinivasan, linux-hyperv,
	linux-kernel
In-Reply-To: <20260414111008.307220-2-thorsten.blum@linux.dev>

[-- Attachment #1: Type: text/plain, Size: 235 bytes --]

Tue, 14 Apr 2026 13:10:08 +0200 Thorsten Blum <thorsten.blum@linux.dev>:

> Fixes: 245ba56a52a3 ("Staging: hv: Implement key/value pair (KVP)")

Please do not abuse the Fixes tag when it fact this change is "cosmetics".


Olaf

[-- Attachment #2: Digitale Signatur von OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Thorsten Blum @ 2026-04-29 12:36 UTC (permalink / raw)
  To: Olaf Hering
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Greg Kroah-Hartman, stable, Ky Srinivasan, linux-hyperv,
	linux-kernel
In-Reply-To: <20260429142724.4d74641a.olaf@aepfle.de>

On Wed, Apr 29, 2026 at 02:27:24PM +0200, Olaf Hering wrote:
> Tue, 14 Apr 2026 13:10:08 +0200 Thorsten Blum <thorsten.blum@linux.dev>:
> 
> > Fixes: 245ba56a52a3 ("Staging: hv: Implement key/value pair (KVP)")
> 
> Please do not abuse the Fixes tag when it fact this change is "cosmetics".

What makes you think this is just "cosmetics"?

^ permalink raw reply

* Re: [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Olaf Hering @ 2026-04-29 12:44 UTC (permalink / raw)
  To: Thorsten Blum
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Greg Kroah-Hartman, stable, linux-hyperv, linux-kernel
In-Reply-To: <afH7VELGgh8eGBUC@linux.dev>

[-- Attachment #1: Type: text/plain, Size: 252 bytes --]

Wed, 29 Apr 2026 14:36:36 +0200 Thorsten Blum <thorsten.blum@linux.dev>:

> What makes you think this is just "cosmetics"?

It does fix an unlikely bug indeed, but it does not need to trigger the whole paperwork attached to a Fixes tag.


Olaf

[-- Attachment #2: Digitale Signatur von OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] mshv: Add dedicated ioctl for GVA to GPA translation
From: Anirudh Rayabharam @ 2026-04-29 13:11 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177741648871.626779.11067281081219290277.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Tue, Apr 28, 2026 at 10:48:24PM +0000, Stanislav Kinsburskii wrote:
> Add an MSHV_TRANSLATE_GVA ioctl on the VP fd that wraps
> HVCALL_TRANSLATE_VIRTUAL_ADDRESS_EX with transparent fault-in handling for
> movable memory regions. The passthrough path for this hypercall is retained
> for backward compatibility.
> 
> When guest-backing pages reside in movable memory regions, the mmu_notifier
> invalidation path remaps them to NO_ACCESS in the hypervisor's second-level
> address translation tables. If the VMM issues a GVA translation (e.g.
> during MMIO emulation) while a page-table page is invalidated, the
> hypervisor returns HV_TRANSLATE_GVA_GPA_NO_READ_ACCESS. The VMM cannot
> resolve this on its own.
> 
> The new ioctl detects this transient GPA access failure, faults the page
> back in via mshv_region_handle_gfn_fault(), and retries the translation
> until it succeeds or an unrecoverable error occurs.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root.h         |    3 ++
>  drivers/hv/mshv_root_hv_call.c |   37 +++++++++++++++++++++
>  drivers/hv/mshv_root_main.c    |   69 ++++++++++++++++++++++++++++++++++++++++
>  include/hyperv/hvgdk_mini.h    |    1 +
>  include/hyperv/hvhdk.h         |   41 ++++++++++++++++++++++++
>  include/uapi/linux/mshv.h      |   10 ++++++
>  6 files changed, 161 insertions(+)
> 
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 1f086dcb7aa1a..2e6c4414740cc 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -290,6 +290,9 @@ int hv_call_delete_vp(u64 partition_id, u32 vp_index);
>  int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
>  				     u64 dest_addr,
>  				     union hv_interrupt_control control);
> +int hv_call_translate_virtual_address_ex(u32 vp_index, u64 partition_id,
> +					 u64 flags, u64 gva, u64 *gfn,
> +					 struct hv_translate_gva_result_ex *result);
>  int hv_call_clear_virtual_interrupt(u64 partition_id);
>  int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
>  				  union hv_gpa_page_access_state_flags state_flags,
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index e5992c324904a..9ff4ba5373f59 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -692,6 +692,43 @@ int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code,
>  	return 0;
>  }
>  
> +int hv_call_translate_virtual_address_ex(u32 vp_index, u64 partition_id,
> +					 u64 flags, u64 gva, u64 *gfn,
> +					 struct hv_translate_gva_result_ex *result)
> +{
> +	struct hv_input_translate_virtual_address *input;
> +	struct hv_output_translate_virtual_address_ex *output;
> +	unsigned long irq_flags;
> +	u64 status;
> +
> +	local_irq_save(irq_flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = partition_id;
> +	input->vp_index = vp_index;
> +	input->control_flags = flags;
> +	input->gva_page = gva >> HV_HYP_PAGE_SHIFT;
> +
> +	status = hv_do_hypercall(HVCALL_TRANSLATE_VIRTUAL_ADDRESS_EX,
> +				 input, output);
> +
> +	if (!hv_result_success(status)) {
> +		local_irq_restore(irq_flags);
> +		pr_err("%s: %s\n", __func__, hv_result_to_string(status));
> +		return hv_result_to_errno(status);
> +	}
> +
> +	*result = output->translation_result;
> +	*gfn = output->gpa_page;
> +
> +	local_irq_restore(irq_flags);
> +
> +	return 0;
> +}
> +
>  int
>  hv_call_clear_virtual_interrupt(u64 partition_id)
>  {
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index bd1359eb58dd4..2d7b6923415a8 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -898,6 +898,72 @@ mshv_vp_ioctl_get_set_state(struct mshv_vp *vp,
>  	return 0;
>  }
>  
> +static bool mshv_gpa_fault_retryable(u32 result_code)
> +{
> +	/*
> +	 * Note: HV_TRANSLATE_GVA_GPA_UNMAPPED is intentionally not handled
> +	 * here. The guest page table cannot be unmapped under normal
> +	 * operation. It may be mapped with no access during page moves,
> +	 * but a truly unmapped state indicates a kernel driver bug.
> +	 * Retrying in this case would only mask the underlying problem of
> +	 * an unmapped guest page table.
> +	 */
> +	return result_code == HV_TRANSLATE_GVA_GPA_NO_READ_ACCESS;
> +}
> +
> +static long
> +mshv_vp_ioctl_translate_gva(struct mshv_vp *vp, void __user *user_args)
> +{
> +	struct mshv_partition *partition = vp->vp_partition;
> +	struct mshv_translate_gva args;
> +	struct hv_translate_gva_result_ex result;
> +	u64 gfn, gpa;
> +	int ret;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	do {
> +		ret = hv_call_translate_virtual_address_ex(vp->vp_index,
> +							   partition->pt_id,
> +							   args.flags, args.gva,
> +							   &gfn, &result);
> +		if (ret)
> +			return ret;
> +
> +		if (mshv_gpa_fault_retryable(result.result_code)) {
> +			struct mshv_mem_region *region;
> +			bool faulted;
> +
> +			region = mshv_partition_region_by_gfn_get(partition,
> +								  gfn);
> +			if (!region)
> +				return -EFAULT;
> +
> +			faulted = false;
> +			if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
> +				faulted = mshv_region_handle_gfn_fault(region,
> +								       gfn);
> +			mshv_region_put(region);
> +
> +			if (!faulted)
> +				return -EFAULT;
> +
> +			cond_resched();
> +		}
> +	} while (mshv_gpa_fault_retryable(result.result_code));
> +
> +	gpa = (gfn << PAGE_SHIFT) | (args.gva & ~PAGE_MASK);
> +
> +        if (copy_to_user(args.result, &result, sizeof(*args.result)))

Indentation is a bit off here.

With that fixed:

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH] mshv: Simplify GPA map/unmap hypercall helpers
From: Stanislav Kinsburskii @ 2026-04-29 15:15 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260429-orca-of-legal-symmetry-3c72bc@anirudhrb>

On Wed, Apr 29, 2026 at 11:02:37AM +0000, Anirudh Rayabharam wrote:
> On Tue, Apr 28, 2026 at 11:21:12PM +0000, Stanislav Kinsburskii wrote:
> > Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
> > preceding bug-fix patches:
> > 
> > Move "done += completed" before the status checks so that pages mapped
> > by a partially-successful batch are included in the error cleanup unmap.
> > Previously these mappings were leaked on failure.
> > 
> > While here, improve type safety and readability:
> >  - Change "int done" to "u64 done" to match the u64 page_count it is
> >    compared against, avoiding signed/unsigned comparison hazards.
> >  - Use u64 for loop iteration and batch size variables consistently.
> >  - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
> >  - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
> >  - Simplify the error-path unmap to use "done << large_shift" directly
> >    instead of mutating done in place.
> > 
> > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_hv_call.c |   55 +++++++++++++++-------------------------
> >  1 file changed, 20 insertions(+), 35 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > index e5992c324904a..f5f205a397834 100644
> > --- a/drivers/hv/mshv_root_hv_call.c
> > +++ b/drivers/hv/mshv_root_hv_call.c
> > @@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  	struct hv_input_map_gpa_pages *input_page;
> >  	u64 status, *pfnlist;
> >  	unsigned long irq_flags, large_shift = 0;
> > -	int ret = 0, done = 0;
> > -	u64 page_count = page_struct_count;
> > +	u64 done = 0, page_count = page_struct_count;
> > +	int ret = 0;
> >  
> >  	if (page_count == 0 || (pages && mmio_spa))
> >  		return -EINVAL;
> > @@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  	}
> >  
> >  	while (done < page_count) {
> > -		ulong i, completed, remain = page_count - done;
> > -		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> > +		u64 i, completed, remain = page_count - done;
> > +		u64 rep_count = min(remain, (u64)HV_MAP_GPA_BATCH_SIZE);
> >  
> >  		local_irq_save(irq_flags);
> >  		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > @@ -224,23 +224,13 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  		input_page->map_flags = flags;
> >  		pfnlist = input_page->source_gpa_page_list;
> >  
> > -		for (i = 0; i < rep_count; i++)
> > -			if (flags & HV_MAP_GPA_NO_ACCESS) {
> > +		for (i = 0; i < rep_count; i++) {
> > +			if (flags & HV_MAP_GPA_NO_ACCESS)
> >  				pfnlist[i] = 0;
> > -			} else if (pages) {
> > -				u64 index = (done + i) << large_shift;
> > -
> > -				if (index >= page_struct_count) {
> > -					ret = -EINVAL;
> > -					break;
> > -				}
> > -				pfnlist[i] = page_to_pfn(pages[index]);
> > -			} else {
> > +			else if (pages)
> > +				pfnlist[i] = page_to_pfn(pages[(done + i) << large_shift]);
> > +			else
> >  				pfnlist[i] = mmio_spa + done + i;
> > -			}
> > -		if (ret) {
> > -			local_irq_restore(irq_flags);
> > -			break;
> >  		}
> >  
> >  		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
> > @@ -248,29 +238,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  		local_irq_restore(irq_flags);
> >  
> >  		completed = hv_repcomp(status);
> > +		done += completed;
> >  
> >  		if (hv_result_needs_memory(status)) {
> >  			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
> >  						    HV_MAP_GPA_DEPOSIT_PAGES);
> >  			if (ret)
> >  				break;
> > -
> >  		} else if (!hv_result_success(status)) {
> >  			ret = hv_result_to_errno(status);
> >  			break;
> >  		}
> > -
> > -		done += completed;
> >  	}
> >  
> >  	if (ret && done) {
> >  		u32 unmap_flags = 0;
> >  
> > -		if (flags & HV_MAP_GPA_LARGE_PAGE) {
> > +		if (flags & HV_MAP_GPA_LARGE_PAGE)
> >  			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > -			done <<= large_shift;
> > -		}
> > -		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> > +		hv_call_unmap_gpa_pages(partition_id, gfn,
> > +					done << large_shift, unmap_flags);
> 
> How does this work? Earlier we were doing "done << large_shift" only if
> HV_MAP_GPA_LARGE_PAGE is set but now we always do it.
> 

It works becuase large_shift in initialized to 0 when
HV_MAP_GPA_LARGE_PAGE is not set.

Thanks,
Stanislav

> Thanks,
> Anirudh.

^ permalink raw reply

* [PATCH v2] mshv: Simplify GPA map/unmap hypercall helpers
From: Stanislav Kinsburskii @ 2026-04-29 16:48 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
preceding bug-fix patches:

Move "done += completed" before the status checks so that pages mapped
by a partially-successful batch are included in the error cleanup unmap.
Previously these mappings were leaked on failure.

While here, improve type safety and readability:
 - Change "int done" to "u64 done" to match the u64 page_count it is
   compared against, avoiding signed/unsigned comparison hazards.
 - Use u64 for loop iteration and batch size variables consistently.
 - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
 - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
 - Simplify the error-path unmap to use "done << large_shift" directly
   instead of mutating done in place.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |   55 +++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 35 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index e5992c324904a..1f19a4ca824f0 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 	struct hv_input_map_gpa_pages *input_page;
 	u64 status, *pfnlist;
 	unsigned long irq_flags, large_shift = 0;
-	int ret = 0, done = 0;
-	u64 page_count = page_struct_count;
+	u64 done = 0, page_count = page_struct_count;
+	int ret = 0;
 
 	if (page_count == 0 || (pages && mmio_spa))
 		return -EINVAL;
@@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 	}
 
 	while (done < page_count) {
-		ulong i, completed, remain = page_count - done;
-		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
+		u64 i, completed, remain = page_count - done;
+		u64 rep_count = min_t(u64, remain, HV_MAP_GPA_BATCH_SIZE);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -224,23 +224,13 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 		input_page->map_flags = flags;
 		pfnlist = input_page->source_gpa_page_list;
 
-		for (i = 0; i < rep_count; i++)
-			if (flags & HV_MAP_GPA_NO_ACCESS) {
+		for (i = 0; i < rep_count; i++) {
+			if (flags & HV_MAP_GPA_NO_ACCESS)
 				pfnlist[i] = 0;
-			} else if (pages) {
-				u64 index = (done + i) << large_shift;
-
-				if (index >= page_struct_count) {
-					ret = -EINVAL;
-					break;
-				}
-				pfnlist[i] = page_to_pfn(pages[index]);
-			} else {
+			else if (pages)
+				pfnlist[i] = page_to_pfn(pages[(done + i) << large_shift]);
+			else
 				pfnlist[i] = mmio_spa + done + i;
-			}
-		if (ret) {
-			local_irq_restore(irq_flags);
-			break;
 		}
 
 		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
@@ -248,29 +238,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 		local_irq_restore(irq_flags);
 
 		completed = hv_repcomp(status);
+		done += completed;
 
 		if (hv_result_needs_memory(status)) {
 			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
 						    HV_MAP_GPA_DEPOSIT_PAGES);
 			if (ret)
 				break;
-
 		} else if (!hv_result_success(status)) {
 			ret = hv_result_to_errno(status);
 			break;
 		}
-
-		done += completed;
 	}
 
 	if (ret && done) {
 		u32 unmap_flags = 0;
 
-		if (flags & HV_MAP_GPA_LARGE_PAGE) {
+		if (flags & HV_MAP_GPA_LARGE_PAGE)
 			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-			done <<= large_shift;
-		}
-		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
+		hv_call_unmap_gpa_pages(partition_id, gfn,
+					done << large_shift, unmap_flags);
 	}
 
 	return ret;
@@ -305,7 +292,7 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
 	struct hv_input_unmap_gpa_pages *input_page;
 	u64 status, page_count = page_count_4k;
 	unsigned long irq_flags, large_shift = 0;
-	int ret = 0, done = 0;
+	u64 done = 0;
 
 	if (page_count == 0)
 		return -EINVAL;
@@ -319,8 +306,8 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
 	}
 
 	while (done < page_count) {
-		ulong completed, remain = page_count - done;
-		int rep_count = min(remain, HV_UMAP_GPA_PAGES);
+		u64 completed, remain = page_count - done;
+		u64 rep_count = min_t(u64, remain, HV_UMAP_GPA_PAGES);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -333,15 +320,13 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
 		local_irq_restore(irq_flags);
 
 		completed = hv_repcomp(status);
-		if (!hv_result_success(status)) {
-			ret = hv_result_to_errno(status);
-			break;
-		}
-
 		done += completed;
+
+		if (!hv_result_success(status))
+			return hv_result_to_errno(status);
 	}
 
-	return ret;
+	return 0;
 }
 
 int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,



^ permalink raw reply related

* RE: [PATCH] Drivers: hv: vmbus: Improve the logc of reserving fb_mmio on Gen2 VMs
From: Michael Kelley @ 2026-04-29 18:01 UTC (permalink / raw)
  To: Dexuan Cui, Michael Kelley, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Long Li, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, matthew.ruffell@canonical.com,
	johansen@templeofstupid.com
  Cc: stable@vger.kernel.org
In-Reply-To: <SA1PR21MB69214DC322549834104D26E0BF342@SA1PR21MB6921.namprd21.prod.outlook.com>

From: Dexuan Cui <DECUI@microsoft.com> Sent: Tuesday, April 28, 2026 8:13 PM
> > From: Michael Kelley <mhklinux@outlook.com> Sent: Thursday, April 23, 2026 10:40 AM

[snip]

> > > +	/* Hyper-V CoCo guests do not have a framebuffer device. */
> > > +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> > > +		return;
> >
> > This test is testing feature "A" (mem encryption) in order to determine
> > the presence of feature "B" (no framebuffer), because current
> > configurations happen to always have "A" and "B" at the same time. But
> > the linkage between the features is tenuous, and if configurations should
> > change in the future, testing this way could be bogus. It works now, but I'm
> > leery of depending on the linkage between "A" and "B".
> >
> > You could set up a "can_have_framebuffer" flag in ms_hyperv_init_platform()
> > if running in a CVM, and test that flag here. But I'd suggest just dropping
> > this optimization. CVMs are always Gen2 (and that's not going to change),
> > so they have plenty of low mmio space.
> 
> This is not true on a lab host, e.g. I have a TDX VM on a lab host created
> by these 2 commands (without the 2nd command, Hyper-V won't allow
> the TDX VM to start):
> 
>     New-VM -Generation 2 -GuestStateIsolationType Tdx -Name $vmName
>     Disable-VMConsoleSupport -VMName $vmName
> 
> The low_mmio_base is still 4GB-128MB. In this case, it's not a good idea
> to try to reserve the 128MB:
> 
> 1) the available low MMIO size is smaller than 128MB due to the vTPM
> MMIO range.
> 
> 2) even if we can reserve the 109.25 low mmio range
> [0xf8000000-0xfed3ffff], we may not want to do that, just in case
> some assigned PCI device has 32-bit BARs.
> 
> So, IMO we need to keep the check:
>  +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
>  +		return;
> 
> BTW, I think this may be a slightly better check here:
> +        if (hv_is_isolation_supported())
> +                return;

Agreed. Using hv_is_isolation_supported() seems better than
cc_platform_has() for this purpose.

> 
> A CVM on Hyper-V won't start without the command line
>     Disable-VMConsoleSupport -VMName $vmName

Unfortunately, on my laptop Hyper-V, a VM with VBS Isolation appears
to *not* require Disable-VMConsoleSupport. I can start the VM, and the
VM is offered the VMBus synthvid, mouse, and keyboard devices.

But what's weird in this case is that vmbus_reserved_fb() sees lfb_base
and lfb_start as 0. Furthermore, as a test, I changed the "allowed_in_isolated"
flag to true for the synthvid device, and the Hyper-V DRM driver loads and
initializes. In doing so, the vmconnect.exe window is resized larger, as is
done in a normal VM. /proc/iomem shows that the DRM driver claimed
the expected MMIO range at the start of low MMIO space. I can run a user
space program that mmaps /dev/fb0 and writes pixels to the mmap'ed
memory, and that succeeds as it would in a normal VM, but the
vmconnect.exe window doesn't show anything. It appears that the Hyper-V
host has allocated memory for the frame buffer, but is ignoring anything
that is written to it.

Running Disable-VMConsoleSupport works as expected -- the synthvid,
mouse, and keyboard devices are no longer offered to the VM.

> 
> IMO this is very unlikely to change in the future, because the Hyper-V
> synthetic framebuffer VMBus device is not a trusted device for a CVM,
> so there is no reason for Hyper-V to offer such a device to CVMs; even
> if the host offers it, currently the guest hv_vmbus driver ignores it.
> 

In the case of VBS Isolation, if such a VM also had a PCI pass-thru device,
the core problem could recur. I.e., not reserving space for the framebuffer
could allow the PCI device to try to use MMIO space that Hyper-V has
set up for the frame buffer, causing the PCI device to fail. And that's a
worse problem than just having the graphics console not function. I
can't actually try the failure case because I don't have an assignable PCI
device on my laptop, but it seems likely based on the evidence that
Hyper-V is setting up a framebuffer device.

So instead of not reserving any MMIO space for the framebuffer on
CVMs, the code you already have limits the reservation to half of the
MMIO space below 4 GB. Won't that work to avoid exhausting the low
MMIO space in a CVM that's running on a local Hyper-V with only 128
MiB of low MMIO space?

> When we assign a physical PCI GPU device to a CVM, I'm not sure if there
> is any framebuffer from the GPU or not. Even if there is, that's a completely
> different scenario and not reserving some low MMIO for "framebuffer"
> is unrelated: I think hyperv_drm (or the deprecated hyperv_fb) is the only
> driver that sets the fb_overlap_ok parameter of vmbus_allocate_mmio().
> 
> > And at the moment, CVMs don't
> > support PCI devices,
> 
> This is not true: recently I created a "Standard DC16eds v6" TDX CVM
> on Azure, and I did see two NVMe local temporary disks in "nvme list"
>  (here TDISP is not used). In 2023, we added the commit
> 2c6ba4216844 ("PCI: hv: Enable PCI pass-thru devices in Confidential VMs")
> and I believe some users are running CVMs with GPUs.

Interesting! I worked on commit 2c6ba4216844, but had not noticed
that Azure now has offerings that makes use of it. I'll take a look at
that TDX VM size.

Thanks,

Michael

^ permalink raw reply

* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Michael Kelley @ 2026-04-29 18:01 UTC (permalink / raw)
  To: Dexuan Cui, Michael Kelley, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Long Li, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, Jake Oshins, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	matthew.ruffell@canonical.com, kjlx@templeofstupid.com
  Cc: Krister Johansen, stable@vger.kernel.org
In-Reply-To: <SA1PR21MB69213486F821CA5A2C793C81BF342@SA1PR21MB6921.namprd21.prod.outlook.com>

From: Dexuan Cui <DECUI@microsoft.com> Sent: Tuesday, April 28, 2026 6:58 PM
> > From: Michael Kelley <mhklinux@outlook.com> Sent: Thursday, April 23, 2026 10:40 AM

[snip]

> 
> > Question about Gen 1 VMs: If the Linux frame buffer driver moves
> > the frame buffer somewhere other than the default location, and
> > then the VM does a kexec/kdump, what does the legacy PCI graphic
> > device BAR report as the frame buffer location? Does it *always*
> > report 4G-128MB, or does it report the new location? I can run
> 
> It always reports 4G-128MB.

OK, good to know. I was hoping it might report the new location. :-(

> BTW,  I suspect a Gen2 VM may have the same issue, i.e.
> currently we only reserve 8MB below 4GB; if hyperv_drm uses
> high MMIO, I suspect the UEFI firmware would still report the
> same original low MMIO framebuffer base/size to the kdump kernel,
> but there is no easy way to verify this for Gen2 VMs...
> 

[snip]

> 
> However,  when the kdump kernel starts to run, and I print the
> pci_resource_start(pdev, 0) and pci_resource_len(pdev, 0)
> from vmbus_reserve_fb(), I still see 4G-128MB:
> [   12.506159] Gen1 VM: start=0xf8000000, size=0x4000000
> 
> In this case, we can't really fix the MMIO conflict, e.g.
> if both hv_pci and hyperv_drm are built as modules, then
> the order of loading them can be nondeterministic:if the order
> in the first kernel is different from the order in
> the kdump kernel, we run into trouble.

Yep.

> 
> If the order is deterministic (e.g. hv_pci is
> built-in, and hyperv_drm is built as a module),
> we should be good since both allocates MMIO from
> the high MMIO range in a deterministic way.
> 

Yep.

Thanks,

Michael

^ permalink raw reply

* [PATCH 00/10] mshv: Bug fixes across the mshv_root module
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

 This series addresses bugs found during a review of the mshv_root module
 introduced by commit 621191d709b14 ("Drivers: hv: Introduce mshv_root
 module to expose /dev/mshv to VMMs").

 The fixes range from data corruption and use-after-free to silent
 functional failures:

  - IRQ state leak and type truncation in hypercall helpers
    (hv_call_modify_spa_host_access)
  - Integer overflow on userspace-controlled allocation size
    (mshv_region_create)
  - Missing locking, broken seqcount read protection, and a check on
    uninitialized data in the irqfd path — the latter makes
    level-triggered interrupt resampling completely non-functional
  - Duplicate GSI 0 detection using the wrong predicate
  - Use-after-RCU in port ID lookup
  - Missing VP index bounds check in intercept ISR (OOB in interrupt
    context)
  - Missing error code on VP allocation failure (silent success to
    userspace)

---

Stanislav Kinsburskii (10):
      mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
      mshv: Fix potential integer overflow in mshv_region_create
      mshv: Fix missing lock in mshv_irqfd_deassign
      mshv: Fix broken seqcount read protection
      mshv: Fix level-triggered check on uninitialized data
      mshv: Fix duplicate GSI detection for GSI 0
      mshv: Fix use-after-RCU in mshv_portid_lookup
      mshv: Use kfree_rcu in mshv_portid_free
      mshv: Add missing vp_index bounds check in intercept ISR
      mshv: Fix missing error code on VP allocation failure


 drivers/hv/mshv_eventfd.c      |   75 ++++++++++++++++++++++------------------
 drivers/hv/mshv_irq.c          |    2 +
 drivers/hv/mshv_portid_table.c |    6 +--
 drivers/hv/mshv_regions.c      |    2 +
 drivers/hv/mshv_root_hv_call.c |   18 +++-------
 drivers/hv/mshv_root_main.c    |    4 ++
 drivers/hv/mshv_synic.c        |    4 ++
 7 files changed, 59 insertions(+), 52 deletions(-)


^ permalink raw reply

* [PATCH 01/10] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The bounds check inside the PFN-filling loop can return -EINVAL while
interrupts are disabled via local_irq_save(), leaking IRQ state.

Remove the check — it is redundant because the loop invariant
(done + i < page_count == page_struct_count >> large_shift) guarantees
(done + i) << large_shift < page_struct_count always holds.

While here, fix type mismatches: change 'int done' to 'u64 done' and
use u64 for loop and batch-size variables so they match the u64
page_count they are compared against.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |   18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index f8c2341193da5..61871ad131b4b 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -1041,7 +1041,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 {
 	struct hv_input_modify_sparse_spa_page_host_access *input_page;
 	u64 status;
-	int done = 0;
+	u64 done = 0;
 	unsigned long irq_flags, large_shift = 0;
 	u64 page_count = page_struct_count;
 	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
@@ -1058,9 +1058,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 	}
 
 	while (done < page_count) {
-		ulong i, completed, remain = page_count - done;
-		int rep_count = min(remain,
-				    HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
+		u64 i, completed, remain = page_count - done;
+		u64 rep_count = min_t(u64, remain,
+				      HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -1074,15 +1074,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 		input_page->flags = flags;
 		input_page->host_access = host_access;
 
-		for (i = 0; i < rep_count; i++) {
-			u64 index = (done + i) << large_shift;
-
-			if (index >= page_struct_count)
-				return -EINVAL;
-
+		for (i = 0; i < rep_count; i++)
 			input_page->spa_page_list[i] =
-						page_to_pfn(pages[index]);
-		}
+				page_to_pfn(pages[(done + i) << large_shift]);
 
 		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
 					     NULL);



^ permalink raw reply related

* [PATCH 02/10] mshv: Fix potential integer overflow in mshv_region_create
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The allocation size is computed as:

  sizeof(*region) + sizeof(struct page *) * nr_pages

where nr_pages is a u64 originating from userspace. A sufficiently
large nr_pages can overflow the multiplication, resulting in a small
allocation followed by out-of-bounds writes when populating mreg_pages.

Use struct_size() which returns SIZE_MAX on overflow, causing vzalloc
to safely return NULL — caught by the existing error check.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6f..1d04a97980b8b 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -177,7 +177,7 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 {
 	struct mshv_mem_region *region;
 
-	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
+	region = vzalloc(struct_size(region, mreg_pages, nr_pages));
 	if (!region)
 		return ERR_PTR(-ENOMEM);
 



^ permalink raw reply related

* [PATCH 03/10] mshv: Fix missing lock in mshv_irqfd_deassign
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_irqfd_deactivate() and the hlist traversal of pt_irqfds_list
require pt->pt_irqfds_lock to be held, but mshv_irqfd_deassign()
omits it. This races with the EPOLLHUP path in mshv_irqfd_wakeup(),
which does take the lock before calling mshv_irqfd_deactivate().

Add the missing spin_lock_irq/spin_unlock_irq around the list
traversal, matching the pattern in mshv_irqfd_release().

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 90959f639dc32..704c229ee3b19 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -541,13 +541,14 @@ static int mshv_irqfd_deassign(struct mshv_partition *pt,
 	if (IS_ERR(eventfd))
 		return PTR_ERR(eventfd);
 
+	spin_lock_irq(&pt->pt_irqfds_lock);
 	hlist_for_each_entry_safe(irqfd, n, &pt->pt_irqfds_list,
 				  irqfd_hnode) {
 		if (irqfd->irqfd_eventfd_ctx == eventfd &&
 		    irqfd->irqfd_irqnum == args->gsi)
-
 			mshv_irqfd_deactivate(irqfd);
 	}
+	spin_unlock_irq(&pt->pt_irqfds_lock);
 
 	eventfd_ctx_put(eventfd);
 



^ permalink raw reply related

* [PATCH 04/10] mshv: Fix broken seqcount read protection
From: Stanislav Kinsburskii @ 2026-04-29 18:17 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_irqfd_update() writes both irqfd_girq_ent and irqfd_lapic_irq as a
logical unit under seqcount write protection. Readers must snapshot these
fields inside the seqcount begin/retry loop to obtain a consistent
point-in-time view — otherwise a concurrent update can produce a torn
read where one field comes from the old state and the other from the new.

Both mshv_assert_irq_slow() and mshv_irqfd_wakeup() get this wrong: the
seqcount loop bodies are empty (just spinning until a stable sequence is
observed), and all reads of the protected fields happen after the loop
with no protection from concurrent writes. If mshv_irqfd_update() races
with interrupt assertion, the caller may use a stale or mixed
vector/apic_id/control combination — delivering an interrupt to the
wrong vCPU, with the wrong vector, or with the wrong trigger mode. This
can cause spurious or lost interrupts in the guest, or a stuck interrupt
line in the level-triggered case.

Fix mshv_assert_irq_slow() by snapshotting both irqfd_girq_ent and
irqfd_lapic_irq into local variables inside the seqcount loop, then
using those locals for the validity check and the hypercall.

Fix mshv_irqfd_wakeup() by snapshotting irqfd_lapic_irq inside its
seqcount loop and passing the snapshot to mshv_try_assert_irq_fast(),
so the fast path operates on the consistent copy rather than reading
the field directly outside seqcount protection.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |   47 +++++++++++++++++++++++++--------------------
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 704c229ee3b19..d9491a14f30f1 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -151,10 +151,10 @@ static int mshv_vp_irq_set_vector(struct mshv_vp *vp, u32 vector)
  * Try to raise irq for guest via shared vector array. hyp does the actual
  * inject of the interrupt.
  */
-static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
+static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
+				    const struct mshv_lapic_irq *irq)
 {
 	struct mshv_partition *partition = irqfd->irqfd_partn;
-	struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
 	struct mshv_vp *vp;
 
 	if (!(ms_hyperv.ext_features &
@@ -186,7 +186,8 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
 	return 0;
 }
 #else /* CONFIG_X86_64 */
-static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
+static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
+				    const struct mshv_lapic_irq *irq)
 {
 	return -EOPNOTSUPP;
 }
@@ -195,30 +196,32 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
 static void mshv_assert_irq_slow(struct mshv_irqfd *irqfd)
 {
 	struct mshv_partition *partition = irqfd->irqfd_partn;
-	struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
+	struct mshv_guest_irq_ent girq_ent;
+	struct mshv_lapic_irq irq;
 	unsigned int seq;
 	int idx;
 
-#if IS_ENABLED(CONFIG_X86)
-	WARN_ON(irqfd->irqfd_resampler &&
-		!irq->lapic_control.level_triggered);
-#endif
-
 	idx = srcu_read_lock(&partition->pt_irq_srcu);
-	if (irqfd->irqfd_girq_ent.guest_irq_num) {
-		if (!irqfd->irqfd_girq_ent.girq_entry_valid) {
-			srcu_read_unlock(&partition->pt_irq_srcu, idx);
-			return;
-		}
 
-		do {
-			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
-		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+	do {
+		seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+		girq_ent = irqfd->irqfd_girq_ent;
+		irq = irqfd->irqfd_lapic_irq;
+	} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+
+	if (girq_ent.guest_irq_num && !girq_ent.girq_entry_valid) {
+		srcu_read_unlock(&partition->pt_irq_srcu, idx);
+		return;
 	}
 
-	hv_call_assert_virtual_interrupt(irqfd->irqfd_partn->pt_id,
-					 irq->lapic_vector, irq->lapic_apic_id,
-					 irq->lapic_control);
+#if IS_ENABLED(CONFIG_X86)
+	WARN_ON(irqfd->irqfd_resampler &&
+		!irq.lapic_control.level_triggered);
+#endif
+
+	hv_call_assert_virtual_interrupt(partition->pt_id,
+					 irq.lapic_vector, irq.lapic_apic_id,
+					 irq.lapic_control);
 	srcu_read_unlock(&partition->pt_irq_srcu, idx);
 }
 
@@ -304,16 +307,18 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
 	int ret = 0;
 
 	if (flags & EPOLLIN) {
+		struct mshv_lapic_irq irq;
 		u64 cnt;
 
 		eventfd_ctx_do_read(irqfd->irqfd_eventfd_ctx, &cnt);
 		idx = srcu_read_lock(&pt->pt_irq_srcu);
 		do {
 			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+			irq = irqfd->irqfd_lapic_irq;
 		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
 
 		/* An event has been signaled, raise an interrupt */
-		ret = mshv_try_assert_irq_fast(irqfd);
+		ret = mshv_try_assert_irq_fast(irqfd, &irq);
 		if (ret)
 			mshv_assert_irq_slow(irqfd);
 



^ permalink raw reply related

* [PATCH 05/10] mshv: Fix level-triggered check on uninitialized data
From: Stanislav Kinsburskii @ 2026-04-29 18:18 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

In mshv_irqfd_assign(), the level-triggered validation for resample
irqfds checks irqfd_lapic_irq.lapic_control.level_triggered before
mshv_irqfd_update() has populated the field. Since the irqfd struct is
zero-allocated, level_triggered is always 0 at that point, causing the
check to always reject resample irqfds with -EINVAL. This makes
level-triggered interrupt resampling — used to avoid interrupt storms
with assigned devices — completely non-functional.

Move the check after the mshv_irqfd_update() call, which resolves the
IRQ routing entry and populates irqfd_lapic_irq with the actual trigger
mode.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_eventfd.c |   25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index d9491a14f30f1..fd594acce3235 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -478,6 +478,19 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
 	init_poll_funcptr(&irqfd->irqfd_polltbl, mshv_irqfd_queue_proc);
 
 	spin_lock_irq(&pt->pt_irqfds_lock);
+	ret = 0;
+	hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) {
+		if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx)
+			continue;
+		/* This fd is used for another irq already. */
+		ret = -EBUSY;
+		spin_unlock_irq(&pt->pt_irqfds_lock);
+		goto fail;
+	}
+
+	idx = srcu_read_lock(&pt->pt_irq_srcu);
+	mshv_irqfd_update(pt, irqfd);
+
 #if IS_ENABLED(CONFIG_X86)
 	if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) &&
 	    !irqfd->irqfd_lapic_irq.lapic_control.level_triggered) {
@@ -486,22 +499,12 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
 		 * Otherwise return with failure
 		 */
 		spin_unlock_irq(&pt->pt_irqfds_lock);
+		srcu_read_unlock(&pt->pt_irq_srcu, idx);
 		ret = -EINVAL;
 		goto fail;
 	}
 #endif
-	ret = 0;
-	hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) {
-		if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx)
-			continue;
-		/* This fd is used for another irq already. */
-		ret = -EBUSY;
-		spin_unlock_irq(&pt->pt_irqfds_lock);
-		goto fail;
-	}
 
-	idx = srcu_read_lock(&pt->pt_irq_srcu);
-	mshv_irqfd_update(pt, irqfd);
 	hlist_add_head(&irqfd->irqfd_hnode, &pt->pt_irqfds_list);
 	spin_unlock_irq(&pt->pt_irqfds_lock);
 



^ permalink raw reply related

* [PATCH 06/10] mshv: Fix duplicate GSI detection for GSI 0
From: Stanislav Kinsburskii @ 2026-04-29 18:18 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The duplicate routing entry check in mshv_update_routing_table() uses
guest_irq_num != 0 to detect whether a GSI slot is already occupied.
This fails for GSI 0 because its guest_irq_num is 0 both when the slot
is unused (zero-initialized) and when legitimately assigned. As a
result, duplicate entries for GSI 0 are silently accepted, with the
second entry overwriting the first — corrupting the routing table
without any error reported to userspace.

While GSI 0 (legacy timer) is unlikely to appear in MSI-based routing
in practice, the check is semantically wrong — it conflates
"uninitialized" with "GSI number 0." Use girq_entry_valid instead,
which is explicitly set to true when an entry is populated and remains
zero for unused slots regardless of the GSI number.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_irq.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
index b3142c84dcbc2..65a4ffc82d566 100644
--- a/drivers/hv/mshv_irq.c
+++ b/drivers/hv/mshv_irq.c
@@ -51,7 +51,7 @@ int mshv_update_routing_table(struct mshv_partition *partition,
 		/*
 		 * Allow only one to one mapping between GSI and MSI routing.
 		 */
-		if (girq->guest_irq_num != 0) {
+		if (girq->girq_entry_valid) {
 			r = -EINVAL;
 			goto out;
 		}



^ permalink raw reply related

* [PATCH 07/10] mshv: Fix use-after-RCU in mshv_portid_lookup
From: Stanislav Kinsburskii @ 2026-04-29 18:18 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_portid_lookup() drops the RCU read lock before copying the
port_table_info struct found by idr_find(). If mshv_portid_free() runs
concurrently on another CPU, it can remove the entry and free it (via
synchronize_rcu + kfree) before the copy at line *info = *_info
completes — resulting in a use-after-free.

Move rcu_read_unlock() after the struct copy so the object remains
protected for the entire duration of the read-side access.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_portid_table.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
index c349af1f0aaac..f1aaef69eb9b7 100644
--- a/drivers/hv/mshv_portid_table.c
+++ b/drivers/hv/mshv_portid_table.c
@@ -72,12 +72,11 @@ mshv_portid_lookup(int port_id, struct port_table_info *info)
 
 	rcu_read_lock();
 	_info = idr_find(&port_table_idr, port_id);
-	rcu_read_unlock();
-
 	if (_info) {
 		*info = *_info;
 		ret = 0;
 	}
+	rcu_read_unlock();
 
 	return ret;
 }



^ permalink raw reply related

* [PATCH 08/10] mshv: Use kfree_rcu in mshv_portid_free
From: Stanislav Kinsburskii @ 2026-04-29 18:18 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_portid_free() uses synchronize_rcu() followed by kfree() to
reclaim port table entries. This blocks the caller until a full RCU
grace period elapses, which is unnecessary since the same module already
uses the non-blocking kfree_rcu() pattern in mshv_port_table_fini().

Replace with kfree_rcu() to avoid the blocking wait and keep the
reclamation strategy consistent across the file.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_portid_table.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
index f1aaef69eb9b7..7ffe7a1c1e518 100644
--- a/drivers/hv/mshv_portid_table.c
+++ b/drivers/hv/mshv_portid_table.c
@@ -60,8 +60,7 @@ mshv_portid_free(int port_id)
 	WARN_ON(!info);
 	idr_unlock(&port_table_idr);
 
-	synchronize_rcu();
-	kfree(info);
+	kfree_rcu(info, portbl_rcu);
 }
 
 int



^ permalink raw reply related

* [PATCH 09/10] mshv: Add missing vp_index bounds check in intercept ISR
From: Stanislav Kinsburskii @ 2026-04-29 18:18 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177748522635.144491.1565666089881726479.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

mshv_intercept_isr() reads vp_index from the intercept message payload
and uses it directly to index into partition->pt_vp_array without
validating it against MSHV_MAX_VPS. A malformed or corrupted hypervisor
message with a vp_index beyond the array bounds would cause an
out-of-bounds memory access in interrupt context, likely crashing the
host.

Both handle_bitset_message() and handle_pair_message() already validate
vp_index before use. Add the same check to mshv_intercept_isr() for
consistency.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_synic.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 43f1bcbbf2d34..5bceb81229817 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -384,6 +384,10 @@ mshv_intercept_isr(struct hv_message *msg)
 	 */
 	vp_index =
 	       ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
+	if (unlikely(vp_index >= MSHV_MAX_VPS)) {
+		pr_debug("VP index %u out of bounds\n", vp_index);
+		goto unlock_out;
+	}
 	vp = partition->pt_vp_array[vp_index];
 	if (unlikely(!vp)) {
 		pr_debug("failed to find VP %u\n", vp_index);



^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox