Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-02-02 17:10 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aX0Vbfocwa4WgXUw@anirudh-surface.localdomain>

On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > hypervisor deposited pages.
> > > > > > > > 
> > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > management is implemented.
> > > > > > > 
> > > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > > and would work without any issue for L1VH.
> > > > > > > 
> > > > > > 
> > > > > > No, it won't work and hypervsisor depostied pages won't be withdrawn.
> > > > > 
> > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > right? What other deposited pages would be left?
> > > > > 
> > > > 
> > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > upon gust shutdown) and the other - for the host itself (never
> > > > withdrawn).
> > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > host partition.
> > > 
> > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > Also, can't we forcefully kill all running partitions in module_exit and
> > > then reclaim memory? Would this help with kernel consistency
> > > irrespective of userspace behavior?
> > > 
> > 
> > It would, but this is sloppy and cannot be a long-term solution.
> > 
> > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > to kill the guest or reclaim the memory for any reason, the new kernel
> > may still crash.
> 
> Actually guests won't be running by the time we reach our module_exit
> function during a kexec. Userspace processes would've been killed by
> then.
> 

No, they will not: "kexec -e" doesn't kill user processes.
We must not rely on OS to do graceful shutdown before doing
kexec.

> Also, why is this sloppy? Isn't this what module_exit should be
> doing anyway? If someone unloads our module we should be trying to
> clean everything up (including killing guests) and reclaim memory.
> 

Kexec does not unload modules, but it doesn't really matter even if it
would.
There are other means to plug into the reboot flow, but neither of them
is robust or reliable.

> In any case, we can BUG() out if we fail to reclaim the memory. That would
> stop the kexec.
> 

By killing the whole system? This is not a good user experience and I
don't see how can this be justified.

> This is a better solution since instead of disabling KEXEC outright: our
> driver made the best possible efforts to make kexec work.
> 

How an unrealiable feature leading to potential system crashes is better
that disabling kexec outright?

It's a complete opposite story for me: the latter provides a limited,
but robust functionality, while the former provides an unreliable and
unpredictable behavior.

> > 
> > There are two long-term solutions:
> >  1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.
> 
> I honestly think we should focus efforts on making kexec work rather
> than finding ways to prevent it.
> 

There is no argument about it. But until we have it fixed properly, we
have two options: either disable kexec or stop claiming we have our
driver up and ready for external customers. Giving the importance of
this driver for current projects, I believe the better way would be to
explicitly limit the functionality instead of postponing the
productization of the driver.

In other words, this is not about our fillings about kexec support: it's
about what we can reliably provide to our customers today.

Thanks,
Stanislav

> Thanks,
> Anirudh
> 
> >  2. Hand the shared kernel state over to the new kernel.
> > 
> > I sent a series for the first one. The second one is not ready yet.
> > Anything else is neither robust nor reliable, so I don’t think it makes
> > sense to pursue it.
> > 
> > Thanks,
> > Stanislav
> > 
> > 
> > > Thanks,
> > > Anirudh.
> > > 
> > > > 
> > > > Thanks,
> > > > Stanislav
> > > > 
> > > > > Thanks,
> > > > > Anirudh.
> > > > > 
> > > > > > Also, kernel consisntency must no depend on use space behavior. 
> > > > > > 
> > > > > > > Also, I don't think it is reasonable at all that someone needs to
> > > > > > > disable basic kernel functionality such as kexec in order to use our
> > > > > > > driver.
> > > > > > > 
> > > > > > 
> > > > > > It's a temporary measure until proper page lifecycle management is
> > > > > > supported in the driver.
> > > > > > Mutual exclusion of the driver and kexec is given and thus should be
> > > > > > expclitily stated in the Kconfig.
> > > > > > 
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > > 
> > > > > > > Thanks,
> > > > > > > Anirudh.
> > > > > > > 
> > > > > > > > 
> > > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > > ---
> > > > > > > >  drivers/hv/Kconfig |    1 +
> > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > > > --- a/drivers/hv/Kconfig
> > > > > > > > +++ b/drivers/hv/Kconfig
> > > > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > > > >  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > > > >  	# no particular order, making it impossible to reassemble larger pages
> > > > > > > >  	depends on PAGE_SIZE_4KB
> > > > > > > > +	depends on !KEXEC
> > > > > > > >  	select EVENTFD
> > > > > > > >  	select VIRT_XFER_TO_GUEST_WORK
> > > > > > > >  	select HMM_MIRROR
> > > > > > > > 
> > > > > > > > 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Naman Jain @ 2026-02-02 16:56 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <176920684805.250171.6817228088359793537.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>



On 1/24/2026 3:50 AM, Stanislav Kinsburskii wrote:
> The MSHV driver deposits kernel-allocated pages to the hypervisor during
> runtime and never withdraws them. This creates a fundamental incompatibility
> with KEXEC, as these deposited pages remain unavailable to the new kernel
> loaded via KEXEC, leading to potential system crashes upon kernel accessing
> hypervisor deposited pages.
> 
> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> management is implemented.
> 

I have not gone through entire conversation that has happened already on 
this, but if you send a next version for this, please change commit msg 
and subject to include MSHV_ROOT instead of MSHV, to avoid confusion.

Regards,
Naman

> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>   drivers/hv/Kconfig |    1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 7937ac0cbd0f..cfd4501db0fa 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -74,6 +74,7 @@ config MSHV_ROOT
>   	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>   	# no particular order, making it impossible to reassemble larger pages
>   	depends on PAGE_SIZE_4KB
> +	depends on !KEXEC
>   	select EVENTFD
>   	select VIRT_XFER_TO_GUEST_WORK
>   	select HMM_MIRROR
> 
> 


^ permalink raw reply

* [PATCH 1/1] mshv: Add comment about huge page mappings in guest physical address space
From: mhkelley58 @ 2026-02-02 16:51 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, skinsburskii, linux-hyperv
  Cc: linux-kernel

From: Michael Kelley <mhklinux@outlook.com>

Huge page mappings in the guest physical address space depend on having
matching alignment of the userspace address in the parent partition and
of the guest physical address. Add a comment that captures this
information. See the link to the mailing list thread.

No code or functional change.

Link: https://lore.kernel.org/linux-hyperv/aUrC94YvscoqBzh3@skinsburskii.localdomain/T/#m0871d2cae9b297fd397ddb8459e534981307c7dc
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 drivers/hv/mshv_root_main.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 681b58154d5e..bc738ff4508e 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1389,6 +1389,20 @@ mshv_partition_ioctl_set_memory(struct mshv_partition *partition,
 	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))
 		return mshv_unmap_user_memory(partition, mem);
 
+	/*
+	 * If the userspace_addr and the guest physical address (as derived
+	 * from the guest_pfn) have the same alignment modulo PMD huge page
+	 * size, the MSHV driver can map any PMD huge pages to the guest
+	 * physical address space as PMD huge pages. If the alignments do
+	 * not match, PMD huge pages must be mapped as single pages in the
+	 * guest physical address space. The MSHV driver does not enforce
+	 * that the alignments match, and it invokes the hypervisor to set
+	 * up correct functional mappings either way. See mshv_chunk_stride().
+	 * The caller of the ioctl is responsible for providing userspace_addr
+	 * and guest_pfn values with matching alignments if it wants the guest
+	 * to get the performance benefits of PMD huge page mappings of its
+	 * physical address space to real system memory.
+	 */
 	return mshv_map_user_memory(partition, mem);
 }
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH 1/1] x86/hyperv: Update comment in hyperv_cleanup()
From: mhkelley58 @ 2026-02-02 16:48 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
	dave.hansen, x86, hpa, linux-hyperv, linux-kernel

From: Michael Kelley <mhklinux@outlook.com>

The comment in hyperv_cleanup() became out-of-date as a result of
commit c8ed0812646e ("x86/hyperv: Use direct call to hypercall-page").

Update the comment. No code or functional change.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/hyperv/hv_init.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 14de43f4bc6c..a777e43a5de1 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -633,9 +633,13 @@ void hyperv_cleanup(void)
 	hv_ivm_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
 
 	/*
-	 * Reset hypercall page reference before reset the page,
-	 * let hypercall operations fail safely rather than
-	 * panic the kernel for using invalid hypercall page
+	 * Reset hv_hypercall_pg before resetting it in the hypervisor.
+	 * hv_set_hypercall_pg(NULL) is not used because at this point in the
+	 * panic path other CPUs have been stopped, causing static_call_update()
+	 * to hang. So resetting hv_hypercall_pg to cause hypercalls to fail
+	 * cleanly is only operative on 32-bit builds. But this is OK as it is
+	 * just a preventative measure to ease detecting a hypercall being made
+	 * after this point, which shouldn't be happening anyway.
 	 */
 	hv_hypercall_pg = NULL;
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-02-02 16:43 UTC (permalink / raw)
  To: Mukesh R
  Cc: Anirudh Rayabharam, kys, haiyangz, wei.liu, decui, longli,
	linux-hyperv, linux-kernel
In-Reply-To: <2efb7fc8-994f-7cbf-6b7c-a1e645bdf638@linux.microsoft.com>

On Fri, Jan 30, 2026 at 11:47:48AM -0800, Mukesh R wrote:
> On 1/30/26 10:41, Stanislav Kinsburskii wrote:
> > On Fri, Jan 30, 2026 at 05:17:52PM +0000, Anirudh Rayabharam wrote:
> > > On Thu, Jan 29, 2026 at 06:59:31PM -0800, Mukesh R wrote:
> > > > On 1/28/26 15:08, Stanislav Kinsburskii wrote:
> > > > > On Tue, Jan 27, 2026 at 11:56:02AM -0800, Mukesh R wrote:
> > > > > > On 1/27/26 09:47, Stanislav Kinsburskii wrote:
> > > > > > > On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
> > > > > > > > On 1/26/26 16:21, Stanislav Kinsburskii wrote:
> > > > > > > > > On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
> > > > > > > > > > On 1/26/26 12:43, Stanislav Kinsburskii wrote:
> > > > > > > > > > > On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
> > > > > > > > > > > > On 1/25/26 14:39, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
> > > > > > > > > > > > > > On 1/23/26 14:20, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > > > > > > > hypervisor deposited pages.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > > > > > management is implemented.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >         drivers/hv/Kconfig |    1 +
> > > > > > > > > > > > > > >         1 file changed, 1 insertion(+)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > > > > > > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > > > > > > > > > > --- a/drivers/hv/Kconfig
> > > > > > > > > > > > > > > +++ b/drivers/hv/Kconfig
> > > > > > > > > > > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > > > > > > > > > > >         	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > > > > > > > > > > >         	# no particular order, making it impossible to reassemble larger pages
> > > > > > > > > > > > > > >         	depends on PAGE_SIZE_4KB
> > > > > > > > > > > > > > > +	depends on !KEXEC
> > > > > > > > > > > > > > >         	select EVENTFD
> > > > > > > > > > > > > > >         	select VIRT_XFER_TO_GUEST_WORK
> > > > > > > > > > > > > > >         	select HMM_MIRROR
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
> > > > > > > > > > > > > > implying that crash dump might be involved. Or did you test kdump
> > > > > > > > > > > > > > and it was fine?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Yes, it will. Crash kexec depends on normal kexec functionality, so it
> > > > > > > > > > > > > will be affected as well.
> > > > > > > > > > > > 
> > > > > > > > > > > > So not sure I understand the reason for this patch. We can just block
> > > > > > > > > > > > kexec if there are any VMs running, right? Doing this would mean any
> > > > > > > > > > > > further developement would be without a ver important and major feature,
> > > > > > > > > > > > right?
> > > > > > > > > > > 
> > > > > > > > > > > This is an option. But until it's implemented and merged, a user mshv
> > > > > > > > > > > driver gets into a situation where kexec is broken in a non-obvious way.
> > > > > > > > > > > The system may crash at any time after kexec, depending on whether the
> > > > > > > > > > > new kernel touches the pages deposited to hypervisor or not. This is a
> > > > > > > > > > > bad user experience.
> > > > > > > > > > 
> > > > > > > > > > I understand that. But with this we cannot collect core and debug any
> > > > > > > > > > crashes. I was thinking there would be a quick way to prohibit kexec
> > > > > > > > > > for update via notifier or some other quick hack. Did you already
> > > > > > > > > > explore that and didn't find anything, hence this?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > This quick hack you mention isn't quick in the upstream kernel as there
> > > > > > > > > is no hook to interrupt kexec process except the live update one.
> > > > > > > > 
> > > > > > > > That's the one we want to interrupt and block right? crash kexec
> > > > > > > > is ok and should be allowed. We can document we don't support kexec
> > > > > > > > for update for now.
> > > > > > > > 
> > > > > > > > > I sent an RFC for that one but given todays conversation details is
> > > > > > > > > won't be accepted as is.
> > > > > > > > 
> > > > > > > > Are you taking about this?
> > > > > > > > 
> > > > > > > >            "mshv: Add kexec safety for deposited pages"
> > > > > > > > 
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > > Making mshv mutually exclusive with kexec is the only viable option for
> > > > > > > > > now given time constraints.
> > > > > > > > > It is intended to be replaced with proper page lifecycle management in
> > > > > > > > > the future.
> > > > > > > > 
> > > > > > > > Yeah, that could take a long time and imo we cannot just disable KEXEC
> > > > > > > > completely. What we want is just block kexec for updates from some
> > > > > > > > mshv file for now, we an print during boot that kexec for updates is
> > > > > > > > not supported on mshv. Hope that makes sense.
> > > > > > > > 
> > > > > > > 
> > > > > > > The trade-off here is between disabling kexec support and having the
> > > > > > > kernel crash after kexec in a non-obvious way. This affects both regular
> > > > > > > kexec and crash kexec.
> > > > > > 
> > > > > > crash kexec on baremetal is not affected, hence disabling that
> > > > > > doesn't make sense as we can't debug crashes then on bm.
> > > > > > 
> > > > > 
> > > > > Bare metal support is not currently relevant, as it is not available.
> > > > > This is the upstream kernel, and this driver will be accessible to
> > > > > third-party customers beginning with kernel 6.19 for running their
> > > > > kernels in Azure L1VH, so consistency is required.
> > > > 
> > > > Well, without crashdump support, customers will not be running anything
> > > > anywhere.
> > > 
> > > This is my concern too. I don't think customers will be particularly
> > > happy that kexec doesn't work with our driver.
> > > 
> > 
> > I wasn?t clear earlier, so let me restate it. Today, kexec is not
> > supported in L1VH. This is a bug we have not fixed yet. Disabling kexec
> > is not a long-term solution. But it is better to disable it explicitly
> > than to have kernel crashes after kexec.
> 
> I don't think there is disagreement on this. The undesired part is turning
> off KEXEC config completely.
> 

There is no disagreement on this either. If you have a better solution
that can be implemented and merged before next kernel merge window,
please propose it. Otherwise, this patch will remain as is for now.

Thanks,
Stanislav

> Thanks,
> -Mukesh
> 
> 
> > This does not mean the bug should not be fixed. But the upstream kernel
> > has its own policies and merge windows. For kernel 6.19, it is better to
> > have a clear kexec error than random crashes after kexec.
> > 
> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > Anirudh
> > > 
> > > > 
> > > > Thanks,
> > > > -Mukesh
> > > > 
> > > > > Thanks,
> > > > > Stanislav
> > > > > 
> > > > > > Let me think and explore a bit, and if I come up with something, I'll
> > > > > > send a patch here. If nothing, then we can do this as last resort.
> > > > > > 
> > > > > > Thanks,
> > > > > > -Mukesh
> > > > > > 
> > > > > > 
> > > > > > > It?s a pity we can?t apply a quick hack to disable only regular kexec.
> > > > > > > However, since crash kexec would hit the same issues, until we have a
> > > > > > > proper state transition for deposted pages, the best workaround for now
> > > > > > > is to reset the hypervisor state on every kexec, which needs design,
> > > > > > > work, and testing.
> > > > > > > 
> > > > > > > Disabling kexec is the only consistent way to handle this in the
> > > > > > > upstream kernel at the moment.
> > > > > > > 
> > > > > > > Thanks, Stanislav
> > > > > > > 
> > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > -Mukesh
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Stanislav
> > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > -Mukesh
> > > > > > > > > > 
> > > > > > > > > > > Therefor it should be explicitly forbidden as it's essentially not
> > > > > > > > > > > supported yet.
> > > > > > > > > > > 
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Stanislav
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Stanislav
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > -Mukesh
> > > > 
> 

^ permalink raw reply

* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
From: Stanislav Kinsburskii @ 2026-02-02 16:30 UTC (permalink / raw)
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux
In-Reply-To: <8d798da6-1720-ceea-f1b0-62ca675085c8@linux.microsoft.com>

On Fri, Jan 30, 2026 at 02:17:24PM -0800, Mukesh R wrote:
> On 1/27/26 10:57, Stanislav Kinsburskii wrote:
> > On Mon, Jan 26, 2026 at 07:07:22PM -0800, Mukesh R wrote:
> > > On 1/26/26 10:15, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
> > > > > On 1/20/26 17:53, Stanislav Kinsburskii wrote:
> > > > > > On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
> > > > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > > > 
> > > > > > > Upon guest access, in case of missing mmio mapping, the hypervisor
> > > > > > > generates an unmapped gpa intercept. In this path, lookup the PCI
> > > > > > > resource pfn for the guest gpa, and ask the hypervisor to map it
> > > > > > > via hypercall. The PCI resource pfn is maintained by the VFIO driver,
> > > > > > > and obtained via fixup_user_fault call (similar to KVM).
> > > > > > > 
> > > > > > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > > > ---
> > > > > > >     drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
> > > > > > >     1 file changed, 115 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > > > > index 03f3aa9f5541..4c8bc7cd0888 100644
> > > > > > > --- a/drivers/hv/mshv_root_main.c
> > > > > > > +++ b/drivers/hv/mshv_root_main.c
> > > > > > > @@ -56,6 +56,14 @@ struct hv_stats_page {
> > > > > > >     	};
> > > > > > >     } __packed;
> > > > > > > +bool hv_nofull_mmio;   /* don't map entire mmio region upon fault */
> > > > > > > +static int __init setup_hv_full_mmio(char *str)
> > > > > > > +{
> > > > > > > +	hv_nofull_mmio = true;
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > +__setup("hv_nofull_mmio", setup_hv_full_mmio);
> > > > > > > +
> > > > > > >     struct mshv_root mshv_root;
> > > > > > >     enum hv_scheduler_type hv_scheduler_type;
> > > > > > > @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > > > > > >     }
> > > > > > >     #ifdef CONFIG_X86_64
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
> > > > > > > + * else just return -errno.
> > > > > > > + */
> > > > > > > +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
> > > > > > > +				       u64 *mmio_pfnp)
> > > > > > > +{
> > > > > > > +	struct vm_area_struct *vma;
> > > > > > > +	bool is_mmio;
> > > > > > > +	u64 uaddr;
> > > > > > > +	struct mshv_mem_region *mreg;
> > > > > > > +	struct follow_pfnmap_args pfnmap_args;
> > > > > > > +	int rc = -EINVAL;
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Do not allow mem region to be deleted beneath us. VFIO uses
> > > > > > > +	 * useraddr vma to lookup pci bar pfn.
> > > > > > > +	 */
> > > > > > > +	spin_lock(&pt->pt_mem_regions_lock);
> > > > > > > +
> > > > > > > +	/* Get the region again under the lock */
> > > > > > > +	mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > > > > +	if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > > > +		goto unlock_pt_out;
> > > > > > > +
> > > > > > > +	uaddr = mreg->start_uaddr +
> > > > > > > +		((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
> > > > > > > +
> > > > > > > +	mmap_read_lock(current->mm);
> > > > > > 
> > > > > > Semaphore can't be taken under spinlock.
> > > > 
> > > > > 
> > > > > Yeah, something didn't feel right here and I meant to recheck, now regret
> > > > > rushing to submit the patch.
> > > > > 
> > > > > Rethinking, I think the pt_mem_regions_lock is not needed to protect
> > > > > the uaddr because unmap will properly serialize via the mm lock.
> > > > > 
> > > > > 
> > > > > > > +	vma = vma_lookup(current->mm, uaddr);
> > > > > > > +	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
> > > > > > 
> > > > > > Why this check is needed again?
> > > > > 
> > > > > To make sure region did not change. This check is under lock.
> > > > > 
> > > > 
> > > > How can this happen? One can't change VMA type without unmapping it
> > > > first. And unmapping it leads to a kernel MMIO region state dangling
> > > > around without corresponding user space mapping.
> > > 
> > > Right, and vm_flags would not be mmio expected then.
> > > 
> > > > This is similar to dangling pinned regions and should likely be
> > > > addressed the same way by utilizing MMU notifiers to destpoy memoty
> > > > regions is VMA is detached.
> > > 
> > > I don't think we need that. Either it succeeds if the region did not
> > > change at all, or just fails.
> > > 
> > 
> > I'm afraid we do, as if the driver mapped a page with the previous
> > memory region, and then the region is unmapped, the page will stay
> > mapped in the hypervisor, but will be considered free by kernel, which
> > in turn will lead to GPF upn next allocation.
> 
> There are no ram pages for mmio regions. Also, we don't do much with
> mmio regions other than tell the hyp about it.
> 

So, are you saying that the hypervisor does not use these pages and only
tracks them? That would make things easier.
However, if we later try to map a GPA that is already mapped, will the
hypervisor return an error?

Thanks,
Stanislav

> Thanks,
> -Mukesh
> 
> 
> > With pinned regions we issue is similar but less impacting: pages can't
> > be released by user space unmapping and thus will be simply leaked, but
> > the system stays intact.
> > 
> > MMIO regions are simila to movable region in this regard: they don't
> > reference the user pages, and thus this guest region replaement is a
> > stright wat to kernel panic.
> > 
> > > 
> > > > > > The region type is stored on the region itself.
> > > > > > And the type is checked on the caller side.
> > > > > > 
> > > > > > > +	if (!is_mmio)
> > > > > > > +		goto unlock_mmap_out;
> > > > > > > +
> > > > > > > +	pfnmap_args.vma = vma;
> > > > > > > +	pfnmap_args.address = uaddr;
> > > > > > > +
> > > > > > > +	rc = follow_pfnmap_start(&pfnmap_args);
> > > > > > > +	if (rc) {
> > > > > > > +		rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
> > > > > > > +				      NULL);
> > > > > > > +		if (rc)
> > > > > > > +			goto unlock_mmap_out;
> > > > > > > +
> > > > > > > +		rc = follow_pfnmap_start(&pfnmap_args);
> > > > > > > +		if (rc)
> > > > > > > +			goto unlock_mmap_out;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	*mmio_pfnp = pfnmap_args.pfn;
> > > > > > > +	follow_pfnmap_end(&pfnmap_args);
> > > > > > > +d
> > > > > > > +unlock_mmap_out:
> > > > > > > +	mmap_read_unlock(current->mm);
> > > > > > > +unlock_pt_out:
> > > > > > > +	spin_unlock(&pt->pt_mem_regions_lock);
> > > > > > > +	return rc;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
> > > > > > > + * and resolve if possible.
> > > > > > > + * Returns: True if valid mmio intercept and it was handled, else false
> > > > > > > + */
> > > > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
> > > > > > > +{
> > > > > > > +	struct hv_message *hvmsg = vp->vp_intercept_msg_page;
> > > > > > > +	struct hv_x64_memory_intercept_message *msg;
> > > > > > > +	union hv_x64_memory_access_info accinfo;
> > > > > > > +	u64 gfn, mmio_spa, numpgs;
> > > > > > > +	struct mshv_mem_region *mreg;
> > > > > > > +	int rc;
> > > > > > > +	struct mshv_partition *pt = vp->vp_partition;
> > > > > > > +
> > > > > > > +	msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
> > > > > > > +	accinfo = msg->memory_access_info;
> > > > > > > +
> > > > > > > +	if (!accinfo.gva_gpa_valid)
> > > > > > > +		return false;
> > > > > > > +
> > > > > > > +	/* Do a fast check and bail if non mmio intercept */
> > > > > > > +	gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
> > > > > > > +	mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > > > 
> > > > > > This call needs to be protected by the spinlock.
> > > > > 
> > > > > This is sorta fast path to bail. We recheck under partition lock above.
> > > > > 
> > > > 
> > > > Accessing the list of regions without lock is unsafe.
> > > 
> > > I am not sure why? This check is done by a vcpu thread, so regions
> > > will not have just gone away.
> > > 
> > 
> > This is shared resources. Multiple VP thread get into this function
> > simultaneously, so there is a race already. But this one we can live
> > with without locking as they don't mutate the list of the regions.
> > 
> > The issue happens when VMM adds or removed another region as it mutates
> > the list and races with VP threads doing this lookup.
> > 
> > Thanks,
> > Stanislav
> > 
> > 
> > > Thanks,
> > > -Mukesh
> > > 
> > > 
> > > > Thanks,
> > > > Stanislav
> > > > 
> > > > > Thanks,
> > > > > -Mukesh
> > > > > 
> > > > > 
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > > 
> > > > > > > +	if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > > > +		return false;
> > > > > > > +
> > > > > > > +	rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
> > > > > > > +	if (rc)
> > > > > > > +		return false;
> > > > > > > +
> > > > > > > +	if (!hv_nofull_mmio) {		/* default case */
> > > > > > > +		gfn = mreg->start_gfn;
> > > > > > > +		mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
> > > > > > > +		numpgs = mreg->nr_pages;
> > > > > > > +	} else
> > > > > > > +		numpgs = 1;
> > > > > > > +
> > > > > > > +	rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
> > > > > > > +
> > > > > > > +	return rc == 0;
> > > > > > > +}
> > > > > > > +
> > > > > > >     static struct mshv_mem_region *
> > > > > > >     mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> > > > > > >     {
> > > > > > > @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > > > > >     	return ret;
> > > > > > >     }
> > > > > > > +
> > > > > > >     #else  /* CONFIG_X86_64 */
> > > > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
> > > > > > >     static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > > > > > >     #endif /* CONFIG_X86_64 */
> > > > > > >     static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > > > > > >     {
> > > > > > >     	switch (vp->vp_intercept_msg_page->header.message_type) {
> > > > > > > +	case HVMSG_UNMAPPED_GPA:
> > > > > > > +		return mshv_handle_unmapped_gpa(vp);
> > > > > > >     	case HVMSG_GPA_INTERCEPT:
> > > > > > >     		return mshv_handle_gpa_intercept(vp);
> > > > > > >     	}
> > > > > > > -- 
> > > > > > > 2.51.2.vfs.0.1
> > > > > > > 

^ permalink raw reply

* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Stanislav Kinsburskii @ 2026-02-02 16:20 UTC (permalink / raw)
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, romank
In-Reply-To: <8298de1d-648c-bbc6-b3c9-1cbc9b5d7e72@linux.microsoft.com>

On Fri, Jan 30, 2026 at 02:51:19PM -0800, Mukesh R wrote:
> On 1/27/26 10:46, Stanislav Kinsburskii wrote:
> > On Mon, Jan 26, 2026 at 07:02:29PM -0800, Mukesh R wrote:
> > > On 1/26/26 07:57, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
> > > > > On 1/20/26 16:12, Stanislav Kinsburskii wrote:
> > > > > > On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
> > > > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > > > 
> > > > > > > Add a new file to implement management of device domains, mapping and
> > > > > > > unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> > > > > > > framework for PCI passthru on Hyper-V running Linux as root or L1VH
> > > > > > > parent. This also implements direct attach mechanism for PCI passthru,
> > > > > > > and it is also made to work within the VFIO framework.
> > > > > > > 
> > > > > > > At a high level, during boot the hypervisor creates a default identity
> > > > > > > domain and attaches all devices to it. This nicely maps to Linux iommu
> > > > > > > subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> > > > > > > need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> > > > > > > during boot. As mentioned previously, Hyper-V supports two ways to do
> > > > > > > PCI passthru:
> > > > > > > 
> > > > > > >      1. Device Domain: root must create a device domain in the hypervisor,
> > > > > > >         and do map/unmap hypercalls for mapping and unmapping guest RAM.
> > > > > > >         All hypervisor communications use device id of type PCI for
> > > > > > >         identifying and referencing the device.
> > > > > > > 
> > > > > > >      2. Direct Attach: the hypervisor will simply use the guest's HW
> > > > > > >         page table for mappings, thus the host need not do map/unmap
> > > > > > >         device memory hypercalls. As such, direct attach passthru setup
> > > > > > >         during guest boot is extremely fast. A direct attached device
> > > > > > >         must be referenced via logical device id and not via the PCI
> > > > > > >         device id.
> > > > > > > 
> > > > > > > At present, L1VH root/parent only supports direct attaches. Also direct
> > > > > > > attach is default in non-L1VH cases because there are some significant
> > > > > > > performance issues with device domain implementation currently for guests
> > > > > > > with higher RAM (say more than 8GB), and that unfortunately cannot be
> > > > > > > addressed in the short term.
> > > > > > > 
> > > > > > 
> > > > > > <snip>
> > > > > > 
> > > > 
> > > > <snip>
> > > > 
> > > > > > > +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> > > > > > > +{
> > > > > > > +	struct pci_dev *pdev;
> > > > > > > +	struct hv_domain *hvdom = to_hv_domain(immdom);
> > > > > > > +
> > > > > > > +	/* See the attach function, only PCI devices for now */
> > > > > > > +	if (!dev_is_pci(dev))
> > > > > > > +		return;
> > > > > > > +
> > > > > > > +	if (hvdom->num_attchd == 0)
> > > > > > > +		pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> > > > > > > +
> > > > > > > +	pdev = to_pci_dev(dev);
> > > > > > > +
> > > > > > > +	if (hvdom->attached_dom) {
> > > > > > > +		hv_iommu_det_dev_from_guest(hvdom, pdev);
> > > > > > > +
> > > > > > > +		/* Do not reset attached_dom, hv_iommu_unmap_pages happens
> > > > > > > +		 * next.
> > > > > > > +		 */
> > > > > > > +	} else {
> > > > > > > +		hv_iommu_det_dev_from_dom(hvdom, pdev);
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	hvdom->num_attchd--;
> > > > > > 
> > > > > > Shouldn't this be modified iff the detach succeeded?
> > > > > 
> > > > > We want to still free the domain and not let it get stuck. The purpose
> > > > > is more to make sure detach was called before domain free.
> > > > > 
> > > > 
> > > > How can one debug subseqent errors if num_attchd is decremented
> > > > unconditionally? In reality the device is left attached, but the related
> > > > kernel metadata is gone.
> > > 
> > > Error is printed in case of failed detach. If there is panic, at least
> > > you can get some info about the device. Metadata in hypervisor is
> > > around if failed.
> > > 
> > 
> > With this approach the only thing left is a kernel message.
> > But if the state is kept intact, one could collect a kernel core and
> > analyze it.
> 
> Again, most of linux stuff is cleaned up, the only state is in
> hypervisor, and hypervisor can totally protect itself and devices.
> So there is not much in kernel core as it got cleaned up already.
> Think of this as additional check, we can remove in future after
> it stands the test of time, until then, every debugging bit helps.
> 

Again, the hypervisor state is not accessible from the kernel core in
L1VH.

> > And note, that there won't be a hypervisor core by default: our main
> > context with the usptreamed version of the driver is L1VH and a kernel
> > core is the only thing a third party customer can provide for our
> > analysis.
> 
> Wei can correct me, but we are not only l1vh focused here. There is
> work going on on all fronts.
> 

In upstream, it does not matter what the work is focused on. The only
thing that matters is that the functionality is merged and available.
Once the code is merged upstream, it becomes available to third-party
customers. They can use it in any way they see fit.
The only product we support with upstream code is L1VH. We should keep
it crisp, clear, and easy to debug.
This change does not help debugging. It only sweeps the issue under the
carpet, with no justification.

Thanks,
Stanislav

> Thanks,
> -Mukesh
> 
> > Thanks,
> > Stanislav

^ permalink raw reply

* Re: [PATCH 1/3] x86/x2apic: disable x2apic on resume if the kernel expects so
From: kernel test robot @ 2026-02-02 15:02 UTC (permalink / raw)
  To: Shashank Balaji, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Suresh Siddha, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Broadcom internal kernel review list, Jan Kiszka,
	Paolo Bonzini, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: oe-kbuild-all, linux-kernel, linux-hyperv, virtualization,
	jailhouse-dev, kvm, xen-devel, Rahul Bukte, Shashank Balaji,
	Daniel Palmer, Tim Bird
In-Reply-To: <20260202-x2apic-fix-v1-1-71c8f488a88b@sony.com>

Hi Shashank,

kernel test robot noticed the following build errors:

[auto build test ERROR on 18f7fcd5e69a04df57b563360b88be72471d6b62]

url:    https://github.com/intel-lab-lkp/linux/commits/Shashank-Balaji/x86-x2apic-disable-x2apic-on-resume-if-the-kernel-expects-so/20260202-181147
base:   18f7fcd5e69a04df57b563360b88be72471d6b62
patch link:    https://lore.kernel.org/r/20260202-x2apic-fix-v1-1-71c8f488a88b%40sony.com
patch subject: [PATCH 1/3] x86/x2apic: disable x2apic on resume if the kernel expects so
config: x86_64-buildonly-randconfig-001-20260202 (https://download.01.org/0day-ci/archive/20260202/202602022242.iSdFHMDI-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260202/202602022242.iSdFHMDI-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602022242.iSdFHMDI-lkp@intel.com/

All errors (new ones prefixed by >>):

   arch/x86/kernel/apic/apic.c: In function 'lapic_resume':
>> arch/x86/kernel/apic/apic.c:2463:17: error: implicit declaration of function '__x2apic_disable'; did you mean '__x2apic_enable'? [-Wimplicit-function-declaration]
    2463 |                 __x2apic_disable();
         |                 ^~~~~~~~~~~~~~~~
         |                 __x2apic_enable


vim +2463 arch/x86/kernel/apic/apic.c

  2435	
  2436	static void lapic_resume(void *data)
  2437	{
  2438		unsigned int l, h;
  2439		unsigned long flags;
  2440		int maxlvt;
  2441	
  2442		if (!apic_pm_state.active)
  2443			return;
  2444	
  2445		local_irq_save(flags);
  2446	
  2447		/*
  2448		 * IO-APIC and PIC have their own resume routines.
  2449		 * We just mask them here to make sure the interrupt
  2450		 * subsystem is completely quiet while we enable x2apic
  2451		 * and interrupt-remapping.
  2452		 */
  2453		mask_ioapic_entries();
  2454		legacy_pic->mask_all();
  2455	
  2456		if (x2apic_mode) {
  2457			__x2apic_enable();
  2458		} else {
  2459			/*
  2460			 * x2apic may have been re-enabled by the
  2461			 * firmware on resuming from s2ram
  2462			 */
> 2463			__x2apic_disable();
  2464	
  2465			/*
  2466			 * Make sure the APICBASE points to the right address
  2467			 *
  2468			 * FIXME! This will be wrong if we ever support suspend on
  2469			 * SMP! We'll need to do this as part of the CPU restore!
  2470			 */
  2471			if (boot_cpu_data.x86 >= 6) {
  2472				rdmsr(MSR_IA32_APICBASE, l, h);
  2473				l &= ~MSR_IA32_APICBASE_BASE;
  2474				l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr;
  2475				wrmsr(MSR_IA32_APICBASE, l, h);
  2476			}
  2477		}
  2478	
  2479		maxlvt = lapic_get_maxlvt();
  2480		apic_write(APIC_LVTERR, ERROR_APIC_VECTOR | APIC_LVT_MASKED);
  2481		apic_write(APIC_ID, apic_pm_state.apic_id);
  2482		apic_write(APIC_DFR, apic_pm_state.apic_dfr);
  2483		apic_write(APIC_LDR, apic_pm_state.apic_ldr);
  2484		apic_write(APIC_TASKPRI, apic_pm_state.apic_taskpri);
  2485		apic_write(APIC_SPIV, apic_pm_state.apic_spiv);
  2486		apic_write(APIC_LVT0, apic_pm_state.apic_lvt0);
  2487		apic_write(APIC_LVT1, apic_pm_state.apic_lvt1);
  2488	#ifdef CONFIG_X86_THERMAL_VECTOR
  2489		if (maxlvt >= 5)
  2490			apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr);
  2491	#endif
  2492	#ifdef CONFIG_X86_MCE_INTEL
  2493		if (maxlvt >= 6)
  2494			apic_write(APIC_LVTCMCI, apic_pm_state.apic_cmci);
  2495	#endif
  2496		if (maxlvt >= 4)
  2497			apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc);
  2498		apic_write(APIC_LVTT, apic_pm_state.apic_lvtt);
  2499		apic_write(APIC_TDCR, apic_pm_state.apic_tdcr);
  2500		apic_write(APIC_TMICT, apic_pm_state.apic_tmict);
  2501		apic_write(APIC_ESR, 0);
  2502		apic_read(APIC_ESR);
  2503		apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr);
  2504		apic_write(APIC_ESR, 0);
  2505		apic_read(APIC_ESR);
  2506	
  2507		irq_remapping_reenable(x2apic_mode);
  2508	
  2509		local_irq_restore(flags);
  2510	}
  2511	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH 2/3] x86/defconfig: add CONFIG_IRQ_REMAP
From: Jan Kiszka @ 2026-02-02 13:50 UTC (permalink / raw)
  To: Andrew Cooper, Shashank Balaji, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Suresh Siddha,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Broadcom internal kernel review list,
	Paolo Bonzini, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: Ingo Molnar, linux-kernel, linux-hyperv, virtualization,
	jailhouse-dev, kvm, xen-devel, Rahul Bukte, Daniel Palmer,
	Tim Bird
In-Reply-To: <7469ab46-94c9-48c8-b6e7-b500550768a9@citrix.com>

On 02.02.26 13:12, Andrew Cooper wrote:
> On 02/02/2026 11:54 am, Jan Kiszka wrote:
>> On 02.02.26 12:35, Andrew Cooper wrote:
>>> On 02/02/2026 9:51 am, Shashank Balaji wrote:
>>>> Interrupt remapping is an architectural dependency of x2apic, which is already
>>>> enabled in the defconfig.
>>> There is no such dependency.  VMs for example commonly have x2APIC and
>>> no IOMMU, and even native system with fewer than 254 CPUs does not need
>>> interrupt remapping for IO-APIC interrupts to function correctly.
>>>
>> It is theoretically possible with less than 254 CPUs, and that is why
>> virtualization uses it, but the Intel SDM clearly states:
>>
>> "Routing of device interrupts to local APIC units operating in x2APIC
>> mode requires use of the interrupt-remapping architecture specified in
>> the Intel® Virtualization Technology for Directed I/O (Revision 1.3
>> and/or later versions)."
> 
> This statement is misleading and has been argued over before.  It's
> missing the key word "all".
> 
> What IR gets you in this case is the ability to target CPU 255 and higher.
> 
> The OS-side access mechanism (xAPIC MMIO vs x2APIC MSRs) has no baring
> on how external interrupts are handled in the fabric.
> 
> There are plenty of good reasons to have Interrupt Remapping enabled
> when available, but it is not a hard requirement architecturally.
> 

If that is true, then this patch is the wrong one to blame because it
only reacts on existing kernel logic and repeats the arguments that are
in the code and even provided to kernel users. If you have hard proof
that the existing code is wrong (some confirmation from Intel folks
would be "nice" I guess), then propose a patch to change that logic.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply

* Re: [PATCH 2/3] x86/defconfig: add CONFIG_IRQ_REMAP
From: Andrew Cooper @ 2026-02-02 12:12 UTC (permalink / raw)
  To: Jan Kiszka, Shashank Balaji, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Suresh Siddha,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Broadcom internal kernel review list,
	Paolo Bonzini, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: Andrew Cooper, Ingo Molnar, linux-kernel, linux-hyperv,
	virtualization, jailhouse-dev, kvm, xen-devel, Rahul Bukte,
	Daniel Palmer, Tim Bird
In-Reply-To: <f875676e-c878-4d3e-9eae-1f74f24cdedd@siemens.com>

On 02/02/2026 11:54 am, Jan Kiszka wrote:
> On 02.02.26 12:35, Andrew Cooper wrote:
>> On 02/02/2026 9:51 am, Shashank Balaji wrote:
>>> Interrupt remapping is an architectural dependency of x2apic, which is already
>>> enabled in the defconfig.
>> There is no such dependency.  VMs for example commonly have x2APIC and
>> no IOMMU, and even native system with fewer than 254 CPUs does not need
>> interrupt remapping for IO-APIC interrupts to function correctly.
>>
> It is theoretically possible with less than 254 CPUs, and that is why
> virtualization uses it, but the Intel SDM clearly states:
>
> "Routing of device interrupts to local APIC units operating in x2APIC
> mode requires use of the interrupt-remapping architecture specified in
> the Intel® Virtualization Technology for Directed I/O (Revision 1.3
> and/or later versions)."

This statement is misleading and has been argued over before.  It's
missing the key word "all".

What IR gets you in this case is the ability to target CPU 255 and higher.

The OS-side access mechanism (xAPIC MMIO vs x2APIC MSRs) has no baring
on how external interrupts are handled in the fabric.

There are plenty of good reasons to have Interrupt Remapping enabled
when available, but it is not a hard requirement architecturally.

~Andrew

^ permalink raw reply

* Re: [PATCH 2/3] x86/defconfig: add CONFIG_IRQ_REMAP
From: Jan Kiszka @ 2026-02-02 11:54 UTC (permalink / raw)
  To: Andrew Cooper, Shashank Balaji, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Suresh Siddha,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Broadcom internal kernel review list,
	Paolo Bonzini, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: Ingo Molnar, linux-kernel, linux-hyperv, virtualization,
	jailhouse-dev, kvm, xen-devel, Rahul Bukte, Daniel Palmer,
	Tim Bird
In-Reply-To: <a7d93306-42e5-4617-91df-23f7dd35aa1c@citrix.com>

On 02.02.26 12:35, Andrew Cooper wrote:
> On 02/02/2026 9:51 am, Shashank Balaji wrote:
>> Interrupt remapping is an architectural dependency of x2apic, which is already
>> enabled in the defconfig.
> 
> There is no such dependency.  VMs for example commonly have x2APIC and
> no IOMMU, and even native system with fewer than 254 CPUs does not need
> interrupt remapping for IO-APIC interrupts to function correctly.
> 

It is theoretically possible with less than 254 CPUs, and that is why
virtualization uses it, but the Intel SDM clearly states:

"Routing of device interrupts to local APIC units operating in x2APIC
mode requires use of the interrupt-remapping architecture specified in
the Intel® Virtualization Technology for Directed I/O (Revision 1.3
and/or later versions)."

IIRC, ignoring this would move us into undefined behaviour space that
may work on some system and did not on others.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply

* Re: [PATCH 2/3] x86/defconfig: add CONFIG_IRQ_REMAP
From: Andrew Cooper @ 2026-02-02 11:35 UTC (permalink / raw)
  To: Shashank Balaji, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Suresh Siddha, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Broadcom internal kernel review list, Jan Kiszka,
	Paolo Bonzini, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: Andrew Cooper, Ingo Molnar, linux-kernel, linux-hyperv,
	virtualization, jailhouse-dev, kvm, xen-devel, Rahul Bukte,
	Daniel Palmer, Tim Bird
In-Reply-To: <20260202-x2apic-fix-v1-2-71c8f488a88b@sony.com>

On 02/02/2026 9:51 am, Shashank Balaji wrote:
> Interrupt remapping is an architectural dependency of x2apic, which is already
> enabled in the defconfig.

There is no such dependency.  VMs for example commonly have x2APIC and
no IOMMU, and even native system with fewer than 254 CPUs does not need
interrupt remapping for IO-APIC interrupts to function correctly.

For native systems with 255 or more CPUs you do want Interrupt Remapping
so enabling it in defconfig is a good move, but the justification needs
correcting.

~Andrew

^ permalink raw reply

* [PATCH 3/3] x86/virt: rename x2apic_available to x2apic_without_ir_available
From: Shashank Balaji @ 2026-02-02  9:51 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Suresh Siddha, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Jan Kiszka, Paolo Bonzini,
	Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: Ingo Molnar, linux-kernel, linux-hyperv, virtualization,
	jailhouse-dev, kvm, xen-devel, Rahul Bukte, Shashank Balaji,
	Daniel Palmer, Tim Bird
In-Reply-To: <20260202-x2apic-fix-v1-0-71c8f488a88b@sony.com>

No functional change.

x86_init.hyper.x2apic_available is used only in try_to_enable_x2apic to check if
x2apic needs to be disabled if interrupt remapping support isn't present. But
the name x2apic_available doesn't reflect that usage.

This is what x2apic_available is set to for various hypervisors:

	acrn		boot_cpu_has(X86_FEATURE_X2APIC)
	mshyperv	boot_cpu_has(X86_FEATURE_X2APIC)
	xen		boot_cpu_has(X86_FEATURE_X2APIC) or false
	vmware		vmware_legacy_x2apic_available
	kvm		kvm_cpuid_base() != 0
	jailhouse	x2apic_enabled()
	bhyve		true
	default		false

Bare metal and vmware correctly check if x2apic is available without interrupt
remapping. The rest of them check if x2apic is enabled/supported, and kvm just
checks if the kernel is running on kvm. The other hypervisors may have to have
their checks audited.

Also fix the backwards pr_info message printed on disabling x2apic because of
lack of irq remapping support.

Compile tested with all the hypervisor guest support enabled.

Co-developed-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
---
 arch/x86/include/asm/x86_init.h |  4 ++--
 arch/x86/kernel/apic/apic.c     |  4 ++--
 arch/x86/kernel/cpu/acrn.c      |  2 +-
 arch/x86/kernel/cpu/bhyve.c     |  2 +-
 arch/x86/kernel/cpu/mshyperv.c  |  2 +-
 arch/x86/kernel/cpu/vmware.c    |  2 +-
 arch/x86/kernel/jailhouse.c     |  2 +-
 arch/x86/kernel/kvm.c           |  2 +-
 arch/x86/kernel/x86_init.c      | 12 ++++++------
 arch/x86/xen/enlighten_hvm.c    |  4 ++--
 10 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 6c8a6ead84f6..b270d9eed755 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -116,7 +116,7 @@ struct x86_init_pci {
  * struct x86_hyper_init - x86 hypervisor init functions
  * @init_platform:		platform setup
  * @guest_late_init:		guest late init
- * @x2apic_available:		X2APIC detection
+ * @x2apic_without_ir_available: is x2apic available without irq remap?
  * @msi_ext_dest_id:		MSI supports 15-bit APIC IDs
  * @init_mem_mapping:		setup early mappings during init_mem_mapping()
  * @init_after_bootmem:		guest init after boot allocator is finished
@@ -124,7 +124,7 @@ struct x86_init_pci {
 struct x86_hyper_init {
 	void (*init_platform)(void);
 	void (*guest_late_init)(void);
-	bool (*x2apic_available)(void);
+	bool (*x2apic_without_ir_available)(void);
 	bool (*msi_ext_dest_id)(void);
 	void (*init_mem_mapping)(void);
 	void (*init_after_bootmem)(void);
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index cc64d61f82cf..8820b631f8a2 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1836,8 +1836,8 @@ static __init void try_to_enable_x2apic(int remap_mode)
 		 * Using X2APIC without IR is not architecturally supported
 		 * on bare metal but may be supported in guests.
 		 */
-		if (!x86_init.hyper.x2apic_available()) {
-			pr_info("x2apic: IRQ remapping doesn't support X2APIC mode\n");
+		if (!x86_init.hyper.x2apic_without_ir_available()) {
+			pr_info("x2apic: Not supported without IRQ remapping\n");
 			x2apic_disable();
 			return;
 		}
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index 2c5b51aad91a..9204b98d4786 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -77,5 +77,5 @@ const __initconst struct hypervisor_x86 x86_hyper_acrn = {
 	.detect                 = acrn_detect,
 	.type			= X86_HYPER_ACRN,
 	.init.init_platform     = acrn_init_platform,
-	.init.x2apic_available  = acrn_x2apic_available,
+	.init.x2apic_without_ir_available = acrn_x2apic_available,
 };
diff --git a/arch/x86/kernel/cpu/bhyve.c b/arch/x86/kernel/cpu/bhyve.c
index f1a8ca3dd1ed..91a90a7459ce 100644
--- a/arch/x86/kernel/cpu/bhyve.c
+++ b/arch/x86/kernel/cpu/bhyve.c
@@ -61,6 +61,6 @@ const struct hypervisor_x86 x86_hyper_bhyve __refconst = {
 	.name			= "Bhyve",
 	.detect			= bhyve_detect,
 	.init.init_platform	= x86_init_noop,
-	.init.x2apic_available	= bhyve_x2apic_available,
+	.init.x2apic_without_ir_available = bhyve_x2apic_available,
 	.init.msi_ext_dest_id	= bhyve_ext_dest_id,
 };
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 579fb2c64cfd..61458855094a 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -760,7 +760,7 @@ const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
 	.name			= "Microsoft Hyper-V",
 	.detect			= ms_hyperv_platform,
 	.type			= X86_HYPER_MS_HYPERV,
-	.init.x2apic_available	= ms_hyperv_x2apic_available,
+	.init.x2apic_without_ir_available = ms_hyperv_x2apic_available,
 	.init.msi_ext_dest_id	= ms_hyperv_msi_ext_dest_id,
 	.init.init_platform	= ms_hyperv_init_platform,
 	.init.guest_late_init	= ms_hyperv_late_init,
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index cb3f900c46fc..46d325818797 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -585,7 +585,7 @@ const __initconst struct hypervisor_x86 x86_hyper_vmware = {
 	.detect				= vmware_platform,
 	.type				= X86_HYPER_VMWARE,
 	.init.init_platform		= vmware_platform_setup,
-	.init.x2apic_available		= vmware_legacy_x2apic_available,
+	.init.x2apic_without_ir_available = vmware_legacy_x2apic_available,
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 	.runtime.sev_es_hcall_prepare	= vmware_sev_es_hcall_prepare,
 	.runtime.sev_es_hcall_finish	= vmware_sev_es_hcall_finish,
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
index 9e9a591a5fec..84a0bbe15989 100644
--- a/arch/x86/kernel/jailhouse.c
+++ b/arch/x86/kernel/jailhouse.c
@@ -291,6 +291,6 @@ const struct hypervisor_x86 x86_hyper_jailhouse __refconst = {
 	.name			= "Jailhouse",
 	.detect			= jailhouse_detect,
 	.init.init_platform	= jailhouse_init_platform,
-	.init.x2apic_available	= jailhouse_x2apic_available,
+	.init.x2apic_without_ir_available = jailhouse_x2apic_available,
 	.ignore_nopv		= true,
 };
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 37dc8465e0f5..709eba87d58e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -1042,7 +1042,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
 	.detect				= kvm_detect,
 	.type				= X86_HYPER_KVM,
 	.init.guest_late_init		= kvm_guest_init,
-	.init.x2apic_available		= kvm_para_available,
+	.init.x2apic_without_ir_available = kvm_para_available,
 	.init.msi_ext_dest_id		= kvm_msi_ext_dest_id,
 	.init.init_platform		= kvm_init_platform,
 #if defined(CONFIG_AMD_MEM_ENCRYPT)
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index ebefb77c37bb..9ddf8c901ac6 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -112,12 +112,12 @@ struct x86_init_ops x86_init __initdata = {
 	},
 
 	.hyper = {
-		.init_platform		= x86_init_noop,
-		.guest_late_init	= x86_init_noop,
-		.x2apic_available	= bool_x86_init_noop,
-		.msi_ext_dest_id	= bool_x86_init_noop,
-		.init_mem_mapping	= x86_init_noop,
-		.init_after_bootmem	= x86_init_noop,
+		.init_platform			= x86_init_noop,
+		.guest_late_init		= x86_init_noop,
+		.x2apic_without_ir_available	= bool_x86_init_noop,
+		.msi_ext_dest_id		= bool_x86_init_noop,
+		.init_mem_mapping		= x86_init_noop,
+		.init_after_bootmem		= x86_init_noop,
 	},
 
 	.acpi = {
diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index fe57ff85d004..42f3d21f313d 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -311,7 +311,7 @@ static uint32_t __init xen_platform_hvm(void)
 		 * detect PVH and panic there.
 		 */
 		h->init_platform = x86_init_noop;
-		h->x2apic_available = bool_x86_init_noop;
+		h->x2apic_without_ir_available = bool_x86_init_noop;
 		h->init_mem_mapping = x86_init_noop;
 		h->init_after_bootmem = x86_init_noop;
 		h->guest_late_init = xen_hvm_guest_late_init;
@@ -325,7 +325,7 @@ struct hypervisor_x86 x86_hyper_xen_hvm __initdata = {
 	.detect                 = xen_platform_hvm,
 	.type			= X86_HYPER_XEN_HVM,
 	.init.init_platform     = xen_hvm_guest_init,
-	.init.x2apic_available  = xen_x2apic_available,
+	.init.x2apic_without_ir_available = xen_x2apic_available,
 	.init.init_mem_mapping	= xen_hvm_init_mem_mapping,
 	.init.guest_late_init	= xen_hvm_guest_late_init,
 	.init.msi_ext_dest_id   = msi_ext_dest_id,

-- 
2.43.0


^ permalink raw reply related

* [PATCH 2/3] x86/defconfig: add CONFIG_IRQ_REMAP
From: Shashank Balaji @ 2026-02-02  9:51 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Suresh Siddha, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Jan Kiszka, Paolo Bonzini,
	Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: Ingo Molnar, linux-kernel, linux-hyperv, virtualization,
	jailhouse-dev, kvm, xen-devel, Rahul Bukte, Shashank Balaji,
	Daniel Palmer, Tim Bird
In-Reply-To: <20260202-x2apic-fix-v1-0-71c8f488a88b@sony.com>

Interrupt remapping is an architectural dependency of x2apic, which is already
enabled in the defconfig. Enable CONFIG_IRQ_REMAP so that a defconfig kernel on
bare metal actually uses x2apic.

Co-developed-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
---
 arch/x86/configs/x86_64_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
index 7d7310cdf8b0..269f7d808be4 100644
--- a/arch/x86/configs/x86_64_defconfig
+++ b/arch/x86/configs/x86_64_defconfig
@@ -230,6 +230,7 @@ CONFIG_EEEPC_LAPTOP=y
 CONFIG_AMD_IOMMU=y
 CONFIG_INTEL_IOMMU=y
 # CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
+CONFIG_IRQ_REMAP=y
 CONFIG_EXT4_FS=y
 CONFIG_EXT4_FS_POSIX_ACL=y
 CONFIG_EXT4_FS_SECURITY=y

-- 
2.43.0


^ permalink raw reply related

* [PATCH 1/3] x86/x2apic: disable x2apic on resume if the kernel expects so
From: Shashank Balaji @ 2026-02-02  9:51 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Suresh Siddha, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Jan Kiszka, Paolo Bonzini,
	Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: Ingo Molnar, linux-kernel, linux-hyperv, virtualization,
	jailhouse-dev, kvm, xen-devel, Rahul Bukte, Shashank Balaji,
	Daniel Palmer, Tim Bird, stable
In-Reply-To: <20260202-x2apic-fix-v1-0-71c8f488a88b@sony.com>

In lapic_resume, ensure x2apic is actually disabled when the kernel expects it
to be disabled, i.e. when x2apic_mode = 0.

x2apic_mode is set to 0 and x2apic is disabled on boot if the kernel doesn't
support irq remapping or for other reasons. On resume from s2ram
(/sys/power/mem_sleep = deep), firmware can re-enable x2apic, but the kernel
continues using the xapic interface because it didn't check to see if someone
enabled x2apic behind its back, which causes hangs. This situation happens on
defconfig + bare metal + s2ram, on which this fix has been tested.

Fixes: 6e1cb38a2aef ("x64, x2apic/intr-remap: add x2apic support, including enabling interrupt-remapping")
Cc: stable@vger.kernel.org
Co-developed-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
---
 arch/x86/kernel/apic/apic.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index d93f87f29d03..cc64d61f82cf 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2456,6 +2456,12 @@ static void lapic_resume(void *data)
 	if (x2apic_mode) {
 		__x2apic_enable();
 	} else {
+		/*
+		 * x2apic may have been re-enabled by the
+		 * firmware on resuming from s2ram
+		 */
+		__x2apic_disable();
+
 		/*
 		 * Make sure the APICBASE points to the right address
 		 *

-- 
2.43.0


^ permalink raw reply related

* [PATCH 0/3] x86/x2apic: Fix hang-up of defconfig kernel on resume from s2ram
From: Shashank Balaji @ 2026-02-02  9:51 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Suresh Siddha, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Jan Kiszka, Paolo Bonzini,
	Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: Ingo Molnar, linux-kernel, linux-hyperv, virtualization,
	jailhouse-dev, kvm, xen-devel, Rahul Bukte, Shashank Balaji,
	Daniel Palmer, Tim Bird, stable

On resume from s2ram, a defconfig kernel gets into a state where the x2apic
hardware state and the kernel's perceived state are different.

On boot, x2apic is enabled by the firmware, and then the kernel does the
following (relevant lines from dmesg):

	[    0.000381] x2apic: enabled by BIOS, switching to x2apic ops
	[    0.009939] APIC: Switched APIC routing to: cluster x2apic
	[    0.095151] x2apic: IRQ remapping doesn't support X2APIC mode
	[    0.095154] x2apic disabled
	[    0.095551] APIC: Switched APIC routing to: physical flat

defconfig has CONFIG_IRQ_REMAP=n, which leads to x2apic being disabled,
because on bare metal, x2apic has an architectural dependence on interrupt
remapping.

While resuming from s2ram, x2apic is enabled again by the firmware, but
the kernel continues using the physical flat apic routing. This causes a
hang-up and no console output.

Patch 1 fixes this in lapic_resume by disabling x2apic when the kernel expects
it to be disabled.
Patch 2 enables CONFIG_IRQ_REMAP in defconfig so that defconfig kernels at
least don't disable x2apic because of a lack of IRQ_REMAP support.
Patch 3 is a non-functional change renaming x2apic_available to
x2apic_without_ir_available in struct x86_hyper_init, to better convey
the semantic.

Signed-off-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
---
Shashank Balaji (3):
      x86/x2apic: disable x2apic on resume if the kernel expects so
      x86/defconfig: add CONFIG_IRQ_REMAP
      x86/virt: rename x2apic_available to x2apic_without_ir_available

 arch/x86/configs/x86_64_defconfig |  1 +
 arch/x86/include/asm/x86_init.h   |  4 ++--
 arch/x86/kernel/apic/apic.c       | 10 ++++++++--
 arch/x86/kernel/cpu/acrn.c        |  2 +-
 arch/x86/kernel/cpu/bhyve.c       |  2 +-
 arch/x86/kernel/cpu/mshyperv.c    |  2 +-
 arch/x86/kernel/cpu/vmware.c      |  2 +-
 arch/x86/kernel/jailhouse.c       |  2 +-
 arch/x86/kernel/kvm.c             |  2 +-
 arch/x86/kernel/x86_init.c        | 12 ++++++------
 arch/x86/xen/enlighten_hvm.c      |  4 ++--
 11 files changed, 25 insertions(+), 18 deletions(-)
---
base-commit: 18f7fcd5e69a04df57b563360b88be72471d6b62
change-id: 20260201-x2apic-fix-85c8c1b5cb90

Best regards,
-- 
Shashank Balaji <shashank.mahadasyam@sony.com>

^ permalink raw reply

* RE: [PATCH 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Michael Kelley @ 2026-02-02  6:25 UTC (permalink / raw)
  To: Anirudh Rayabharam, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260128160437.3342167-3-anirudh@anirudhrb.com>

From: Anirudh Rayabharam <anirudh@anirudhrb.com> Sent: Wednesday, January 28, 2026 8:05 AM
> 
> On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
> interrupts (SINTs) from the hypervisor for doorbells and intercepts.
> There is no such vector reserved for arm64.
> 
> On arm64, the INTID for SINTs should be in the SGI or PPI range. The
> hypervisor exposes a virtual device in the ACPI that reserves a
> PPI for this use. Introduce a platform_driver that binds to this ACPI
> device and obtains the interrupt vector that can be used for SINTs.
> 
> To better unify x86 and arm64 paths, introduce mshv_sint_irq_init() that
> either registers the platform_driver and obtains the INTID (arm64) or
> just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).
> 
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

I have a few comments below for starters. I'll do a more thorough review
when your v2 comes out. FWIW, I'm in full agreement with your take on why
the code is conditioned on HYPERVISOR_CALLBACK_VECTOR and not X86_64
vs. ARM64. Of course, I'm the one who wrote it that way in the VMBus driver
back when Microsoft first supported arm64 guests, so I'm not an independent
observer. :-)  But if need be, I can expound on the thinking at the time for this
case and other similar ones.

> ---
>  drivers/hv/mshv_root.h      |   2 +
>  drivers/hv/mshv_root_main.c |  11 ++-
>  drivers/hv/mshv_synic.c     | 152 ++++++++++++++++++++++++++++++++++--
>  3 files changed, 158 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index c02513f75429..c2d1e8d7452c 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -332,5 +332,7 @@ int mshv_region_get(struct mshv_mem_region *region);
>  bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
>  void mshv_region_movable_fini(struct mshv_mem_region *region);
>  bool mshv_region_movable_init(struct mshv_mem_region *region);
> +int mshv_synic_init(void);
> +void mshv_synic_cleanup(void);
> 
>  #endif /* _MSHV_ROOT_H_ */
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index abb34b37d552..6c2d4a80dbe3 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -2276,11 +2276,17 @@ static int __init mshv_parent_partition_init(void)
>  			MSHV_HV_MAX_VERSION);
>  	}
> 
> +	ret = mshv_synic_init();
> +	if (ret) {
> +		dev_err(dev, "Failed to initialize synic: %i\n", ret);
> +		goto device_deregister;
> +	}
> +
>  	mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
>  	if (!mshv_root.synic_pages) {
>  		dev_err(dev, "Failed to allocate percpu synic page\n");
>  		ret = -ENOMEM;
> -		goto device_deregister;
> +		goto synic_cleanup;
>  	}
> 
>  	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> @@ -2322,6 +2328,8 @@ static int __init mshv_parent_partition_init(void)
>  	cpuhp_remove_state(mshv_cpuhp_online);
>  free_synic_pages:
>  	free_percpu(mshv_root.synic_pages);
> +synic_cleanup:
> +	mshv_synic_cleanup();
>  device_deregister:
>  	misc_deregister(&mshv_dev);
>  	return ret;
> @@ -2337,6 +2345,7 @@ static void __exit mshv_parent_partition_exit(void)
>  		mshv_root_partition_exit();
>  	cpuhp_remove_state(mshv_cpuhp_online);
>  	free_percpu(mshv_root.synic_pages);
> +	mshv_synic_cleanup();
>  }
> 
>  module_init(mshv_parent_partition_init);
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index ba89655b0910..b7860a75b97e 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -10,13 +10,19 @@
>  #include <linux/kernel.h>
>  #include <linux/slab.h>
>  #include <linux/mm.h>
> +#include <linux/interrupt.h>
>  #include <linux/io.h>
>  #include <linux/random.h>
>  #include <asm/mshyperv.h>
> +#include <linux/platform_device.h>
> +#include <linux/acpi.h>
> 
>  #include "mshv_eventfd.h"
>  #include "mshv.h"
> 
> +static int mshv_interrupt = -1;
> +static int mshv_irq = -1;
> +
>  static u32 synic_event_ring_get_queued_port(u32 sint_index)
>  {
>  	struct hv_synic_event_ring_page **event_ring_page;
> @@ -446,14 +452,144 @@ void mshv_isr(void)
>  	}
>  }
> 
> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> +#ifdef CONFIG_ACPI
> +static long __percpu *mshv_evt;
> +
> +static acpi_status mshv_walk_resources(struct acpi_resource *res, void *ctx)
> +{
> +	struct resource r;
> +
> +	switch (res->type) {
> +	case ACPI_RESOURCE_TYPE_EXTENDED_IRQ:
> +		if (!acpi_dev_resource_interrupt(res, 0, &r)) {
> +			pr_err("Unable to parse MSHV ACPI interrupt\n");
> +			return AE_ERROR;
> +		}
> +		/* ARM64 INTID */
> +		mshv_interrupt = res->data.extended_irq.interrupts[0];
> +		/* Linux IRQ number */
> +		mshv_irq = r.start;
> +		pr_info("MSHV SINT INTID %d, IRQ %d\n",
> +			mshv_interrupt, mshv_irq);

Is this pr_info() necessary? Once Linux is running, the output of /proc/interrupts
shows both the IRQ and the INTID.

> +		return AE_OK;
> +	default:
> +		/* Unused resource type */
> +		return AE_OK;
> +	}
> +
> +	return AE_OK;

Having three occurrences of "return AE_OK" for a function as simple as this makes
me want to look for simplifications. :-) The structure matches vmbus_walk_resources(),
which does more. Maybe there's value in preserving the matching structure, but it
might be equally good to just do "if (res->type == ACPI_REOURCES_TYPE_EXTENDED_ID) {"
followed by the real work, and then have a single "return AE_OK".  That drops 5 lines of
fairly useless code and comments.

> +}
> +
> +static irqreturn_t mshv_percpu_isr(int irq, void *dev_id)
> +{
> +	mshv_isr();
> +	add_interrupt_randomness(irq);

The add_interrupt_randomness() should not be necessary. The Linux IRQ
infrastructure that invokes per-cpu interrupt handlers should already be doing it.
When using HYPERVISOR_CALLBACK_VECTOR the Linux IRQ infrastructure is
bypassed, so we must do it explicitly, but not in this case (unless I'm missing
something).

> +	return IRQ_HANDLED;
> +}
> +
> +static int mshv_sint_probe(struct platform_device *pdev)
> +{
> +	acpi_status result;
> +	int ret = 0;

Initializing to zero isn't necessary.  All exit paths explicitly set the value.

> +	struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
> +
> +	result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
> +					mshv_walk_resources, NULL);
> +

Usually don't put a blank line between a function call and testing its result.

> +	if (ACPI_FAILURE(result)) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	mshv_evt = alloc_percpu(long);
> +	if (!mshv_evt) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = request_percpu_irq(mshv_irq, mshv_percpu_isr, "MSHV", mshv_evt);
> +out:
> +	return ret;
> +}
> +
> +static void mshv_sint_remove(struct platform_device *pdev)
> +{
> +	free_percpu_irq(mshv_irq, mshv_evt);
> +	free_percpu(mshv_evt);
> +}
> +#else
> +static int mshv_sint_probe(struct platform_device *pdev)
> +{
> +	return -ENODEV;
> +}
> +
> +static void mshv_sint_remove(struct platform_device *pdev)
> +{
> +	return;
> +}
> +#endif
> +
> +
> +static const __maybe_unused struct acpi_device_id mshv_sint_device_ids[] = {
> +	{"MSFT1003", 0},
> +	{"", 0},
> +};
> +
> +static struct platform_driver mshv_sint_drv = {
> +	.probe = mshv_sint_probe,
> +	.remove = mshv_sint_remove,
> +	.driver = {
> +		.name = "mshv_sint",
> +		.acpi_match_table = ACPI_PTR(mshv_sint_device_ids),
> +		.probe_type = PROBE_FORCE_SYNCHRONOUS,
> +	},
> +};
> +#endif /* HYPERVISOR_CALLBACK_VECTOR */
> +
> +int mshv_synic_init(void)
> +{
> +#ifdef HYPERVISOR_CALLBACK_VECTOR
> +	mshv_interrupt = HYPERVISOR_CALLBACK_VECTOR;
> +	mshv_irq = -1;
> +	return 0;
> +#else
> +	int ret;
> +
> +	if (acpi_disabled)
> +		return -ENODEV;
> +
> +	ret = platform_driver_register(&mshv_sint_drv);
> +	if (ret)
> +		return ret;
> +
> +	if (mshv_interrupt == -1 || mshv_irq == -1) {
> +		ret = -ENODEV;
> +		goto out_unregister;
> +	}
> +
> +	return 0;
> +
> +out_unregister:
> +	platform_driver_unregister(&mshv_sint_drv);
> +	return ret;
> +#endif
> +}
> +
> +void mshv_synic_cleanup(void)
> +{
> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> +	if (!acpi_disabled)
> +		platform_driver_unregister(&mshv_sint_drv);
> +#endif
> +}
> +
>  int mshv_synic_cpu_init(unsigned int cpu)
>  {
>  	union hv_synic_simp simp;
>  	union hv_synic_siefp siefp;
>  	union hv_synic_sirbp sirbp;
> -#ifdef HYPERVISOR_CALLBACK_VECTOR
>  	union hv_synic_sint sint;
> -#endif
>  	union hv_synic_scontrol sctrl;
>  	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> @@ -496,10 +632,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> 
>  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> 
> -#ifdef HYPERVISOR_CALLBACK_VECTOR
> +	if (mshv_irq != -1)
> +		enable_percpu_irq(mshv_irq, 0);
> +
>  	/* Enable intercepts */
>  	sint.as_uint64 = 0;
> -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> +	sint.vector = mshv_interrupt;
>  	sint.masked = false;
>  	sint.auto_eoi = hv_recommend_using_aeoi();
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> @@ -507,13 +645,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> 
>  	/* Doorbell SINT */
>  	sint.as_uint64 = 0;
> -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> +	sint.vector = mshv_interrupt;
>  	sint.masked = false;
>  	sint.as_intercept = 1;
>  	sint.auto_eoi = hv_recommend_using_aeoi();
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
>  			      sint.as_uint64);
> -#endif
> 
>  	/* Enable global synic bit */
>  	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> @@ -568,6 +705,9 @@ int mshv_synic_cpu_exit(unsigned int cpu)
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
>  			      sint.as_uint64);
> 
> +	if (mshv_irq != -1)
> +		disable_percpu_irq(mshv_irq);
> +
>  	/* Disable Synic's event ring page */
>  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
>  	sirbp.sirbp_enabled = false;
> --
> 2.34.1
> 


^ permalink raw reply

* Re: [PATCH rdma-next] MAINTAINERS: Drop RDMA files from Hyper-V section
From: Leon Romanovsky @ 2026-02-01 13:44 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Leon Romanovsky; +Cc: linux-rdma, linux-hyperv
In-Reply-To: <20260128-get-maintainers-fix-v1-1-fc5e58ce9f02@nvidia.com>


On Wed, 28 Jan 2026 11:55:25 +0200, Leon Romanovsky wrote:
> MAINTAINERS entries are organized by subsystem ownership, and the RDMA
> files belong under drivers/infiniband. Remove the overly broad mana_ib
> entries from the Hyper-V section, and instead add the Hyper-V mailing list
> to CC on mana_ib patches.
> 
> This makes get_maintainer.pl behave more sensibly when running it on
> mana_ib patches.
> 
> [...]

Applied, thanks!

[1/1] MAINTAINERS: Drop RDMA files from Hyper-V section
      https://git.kernel.org/rdma/rdma/c/e5b0cfa32b1c3e

Best regards,
-- 
Leon Romanovsky <leon@kernel.org>


^ permalink raw reply

* [PATCH] pci: pci-hyperv-intf: remove unnecessary module_init/exit functions
From: Ethan Nelson-Moore @ 2026-01-31  2:00 UTC (permalink / raw)
  To: linux-hyperv, linux-pci
  Cc: Ethan Nelson-Moore, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Rob Herring, Bjorn Helgaas

The pci-hyperv-intf driver has unnecessary empty module_init and
module_exit functions. Remove them. Note that if a module_init function
exists, a module_exit function must also exist; otherwise, the module
cannot be unloaded.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
---
 drivers/pci/controller/pci-hyperv-intf.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv-intf.c b/drivers/pci/controller/pci-hyperv-intf.c
index 28b3e93d31c0..18acbda867f0 100644
--- a/drivers/pci/controller/pci-hyperv-intf.c
+++ b/drivers/pci/controller/pci-hyperv-intf.c
@@ -52,17 +52,5 @@ int hyperv_reg_block_invalidate(struct pci_dev *dev, void *context,
 }
 EXPORT_SYMBOL_GPL(hyperv_reg_block_invalidate);
 
-static void __exit exit_hv_pci_intf(void)
-{
-}
-
-static int __init init_hv_pci_intf(void)
-{
-	return 0;
-}
-
-module_init(init_hv_pci_intf);
-module_exit(exit_hv_pci_intf);
-
 MODULE_DESCRIPTION("Hyper-V PCI Interface");
 MODULE_LICENSE("GPL v2");
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Mukesh R @ 2026-01-30 23:44 UTC (permalink / raw)
  To: Jacob Pan
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux
In-Reply-To: <20260127112144.00002991@linux.microsoft.com>

On 1/27/26 11:21, Jacob Pan wrote:
> Hi Mukesh,
> 
> On Fri, 23 Jan 2026 18:01:29 -0800
> Mukesh R <mrathor@linux.microsoft.com> wrote:
> 
>> On 1/21/26 21:18, Jacob Pan wrote:
>>> Hi Mukesh,
>>>
>>> On Mon, 19 Jan 2026 22:42:27 -0800
>>> Mukesh R <mrathor@linux.microsoft.com> wrote:
>>>    
>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>
>>>> Add a new file to implement management of device domains, mapping
>>>> and unmapping of iommu memory, and other iommu_ops to fit within
>>>> the VFIO framework for PCI passthru on Hyper-V running Linux as
>>>> root or L1VH parent. This also implements direct attach mechanism
>>>> for PCI passthru, and it is also made to work within the VFIO
>>>> framework.
>>>>
>>>> At a high level, during boot the hypervisor creates a default
>>>> identity domain and attaches all devices to it. This nicely maps
>>>> to Linux iommu subsystem IOMMU_DOMAIN_IDENTITY domain. As a
>>>> result, Linux does not need to explicitly ask Hyper-V to attach
>>>> devices and do maps/unmaps during boot. As mentioned previously,
>>>> Hyper-V supports two ways to do PCI passthru:
>>>>
>>>>     1. Device Domain: root must create a device domain in the
>>>> hypervisor, and do map/unmap hypercalls for mapping and unmapping
>>>> guest RAM. All hypervisor communications use device id of type PCI
>>>> for identifying and referencing the device.
>>>>
>>>>     2. Direct Attach: the hypervisor will simply use the guest's HW
>>>>        page table for mappings, thus the host need not do map/unmap
>>>>        device memory hypercalls. As such, direct attach passthru
>>>> setup during guest boot is extremely fast. A direct attached device
>>>>        must be referenced via logical device id and not via the PCI
>>>>        device id.
>>>>
>>>> At present, L1VH root/parent only supports direct attaches. Also
>>>> direct attach is default in non-L1VH cases because there are some
>>>> significant performance issues with device domain implementation
>>>> currently for guests with higher RAM (say more than 8GB), and that
>>>> unfortunately cannot be addressed in the short term.
>>>>
>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>> ---
>>>>    MAINTAINERS                     |   1 +
>>>>    arch/x86/include/asm/mshyperv.h |   7 +-
>>>>    arch/x86/kernel/pci-dma.c       |   2 +
>>>>    drivers/iommu/Makefile          |   2 +-
>>>>    drivers/iommu/hyperv-iommu.c    | 876
>>>> ++++++++++++++++++++++++++++++++ include/linux/hyperv.h          |
>>>> 6 + 6 files changed, 890 insertions(+), 4 deletions(-)
>>>>    create mode 100644 drivers/iommu/hyperv-iommu.c
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index 381a0e086382..63160cee942c 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -11741,6 +11741,7 @@ F:	drivers/hid/hid-hyperv.c
>>>>    F:	drivers/hv/
>>>>    F:	drivers/infiniband/hw/mana/
>>>>    F:	drivers/input/serio/hyperv-keyboard.c
>>>> +F:	drivers/iommu/hyperv-iommu.c
>>> Given we are also developing a guest iommu driver on hyperv, I
>>> think it is more clear to name them accordingly. Perhaps,
>>> hyperv-iommu-root.c?
>>
>> well, l1vh is not quite root, more like a parent. But we've been using
>> l1vh root loosely to mean l1vh parent. so probably ok to rename it
>> to hyperv-iommu-root.c. I prefer not calling it parent or something
>> like that.
> yeah, something specific and different than the guest driver will do.
> 
>>>>    F:	drivers/iommu/hyperv-irq.c
>>>>    F:	drivers/net/ethernet/microsoft/
>>>>    F:	drivers/net/hyperv/
>>>> diff --git a/arch/x86/include/asm/mshyperv.h
>>>> b/arch/x86/include/asm/mshyperv.h index 97477c5a8487..e4ccdbbf1d12
>>>> 100644 --- a/arch/x86/include/asm/mshyperv.h
>>>> +++ b/arch/x86/include/asm/mshyperv.h
>>>> @@ -189,16 +189,17 @@ static inline void hv_apic_init(void) {}
>>>>    #endif
>>>>    
>>>>    #if IS_ENABLED(CONFIG_HYPERV_IOMMU)
>>>> -static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>>>> -{ return false; }       /* temporary */
>>>> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
>>>>    u64 hv_build_devid_oftype(struct pci_dev *pdev, enum
>>>> hv_device_type type); +u64 hv_iommu_get_curr_partid(void);
>>>>    #else	/* CONFIG_HYPERV_IOMMU */
>>>>    static inline bool hv_pcidev_is_attached_dev(struct pci_dev
>>>> *pdev) { return false; }
>>>> -
>>>>    static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
>>>>    				       enum hv_device_type type)
>>>>    { return 0; }
>>>> +static inline u64 hv_iommu_get_curr_partid(void)
>>>> +{ return HV_PARTITION_ID_INVALID; }
>>>>    
>>>>    #endif	/* CONFIG_HYPERV_IOMMU */
>>>>    
>>>> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
>>>> index 6267363e0189..cfeee6505e17 100644
>>>> --- a/arch/x86/kernel/pci-dma.c
>>>> +++ b/arch/x86/kernel/pci-dma.c
>>>> @@ -8,6 +8,7 @@
>>>>    #include <linux/gfp.h>
>>>>    #include <linux/pci.h>
>>>>    #include <linux/amd-iommu.h>
>>>> +#include <linux/hyperv.h>
>>>>    
>>>>    #include <asm/proto.h>
>>>>    #include <asm/dma.h>
>>>> @@ -105,6 +106,7 @@ void __init pci_iommu_alloc(void)
>>>>    	gart_iommu_hole_init();
>>>>    	amd_iommu_detect();
>>>>    	detect_intel_iommu();
>>>> +	hv_iommu_detect();
>> j
>>> Will this driver be x86 only?
>> Yes for now.
> If there is nothing x86 specific in this driver (assuming the
> hypercalls here are not x86 only), maybe you can move to the generic
> startup code.

It's x86 specific:

         x86_init.iommu.iommu_init = hv_iommu_init


>>>>    	swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
>>>>    }
>>>>    
>>>> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>>>> index 598c39558e7d..cc9774864b00 100644
>>>> --- a/drivers/iommu/Makefile
>>>> +++ b/drivers/iommu/Makefile
>>>> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
>>>>    obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>>>>    obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>>>>    obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>>>> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o
>>>> +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o hyperv-iommu.o
>>> DMA and IRQ remapping should be separate
>>
>> not sure i follow.
> In IOMMU subsystem, DMA remapping and IRQ remapping can be turned
> on/off independently. e.g. you could have an option to turn on IRQ
> remapping w/o DMA remapping. But here you tied them together.

oh, you are talking about the config option, yeah, I will move
CONFIG_IRQ_REMAP from Kconfig to here.

>>
>>>>    obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>>>>    obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>>>>    obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
>>>> diff --git a/drivers/iommu/hyperv-iommu.c
>>>> b/drivers/iommu/hyperv-iommu.c new file mode 100644
>>>> index 000000000000..548483fec6b1
>>>> --- /dev/null
>>>> +++ b/drivers/iommu/hyperv-iommu.c
>>>> @@ -0,0 +1,876 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Hyper-V root vIOMMU driver.
>>>> + * Copyright (C) 2026, Microsoft, Inc.
>>>> + */
>>>> +
>>>> +#include <linux/module.h>
>>> I don't think this is needed since this driver cannot be a module
>>>    
>>>> +#include <linux/pci.h>
>>>> +#include <linux/dmar.h>
>>> should not depend on Intel's DMAR
>>>    
>>>> +#include <linux/dma-map-ops.h>
>>>> +#include <linux/interval_tree.h>
>>>> +#include <linux/hyperv.h>
>>>> +#include "dma-iommu.h"
>>>> +#include <asm/iommu.h>
>>>> +#include <asm/mshyperv.h>
>>>> +
>>>> +/* We will not claim these PCI devices, eg hypervisor needs it for
>>>> debugger */ +static char *pci_devs_to_skip;
>>>> +static int __init hv_iommu_setup_skip(char *str)
>>>> +{
>>>> +	pci_devs_to_skip = str;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
>>>> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
>>>> +
>>>> +bool hv_no_attdev;	 /* disable direct device attach for
>>>> passthru */ +EXPORT_SYMBOL_GPL(hv_no_attdev);
>>>> +static int __init setup_hv_no_attdev(char *str)
>>>> +{
>>>> +	hv_no_attdev = true;
>>>> +	return 0;
>>>> +}
>>>> +__setup("hv_no_attdev", setup_hv_no_attdev);
>>>> +
>>>> +/* Iommu device that we export to the world. HyperV supports max
>>>> of one */ +static struct iommu_device hv_virt_iommu;
>>>> +
>>>> +struct hv_domain {
>>>> +	struct iommu_domain iommu_dom;
>>>> +	u32 domid_num;			      /* as opposed
>>>> to domain_id.type */
>>>> +	u32 num_attchd;		      /* number of
>>>> currently attached devices */
>>> rename to num_dev_attached?
>>>    
>>>> +	bool attached_dom;		      /* is this direct
>>>> attached dom? */
>>>> +	spinlock_t mappings_lock;	      /* protects
>>>> mappings_tree */
>>>> +	struct rb_root_cached mappings_tree;  /* iova to pa lookup
>>>> tree */ +};
>>>> +
>>>> +#define to_hv_domain(d) container_of(d, struct hv_domain,
>>>> iommu_dom) +
>>>> +struct hv_iommu_mapping {
>>>> +	phys_addr_t paddr;
>>>> +	struct interval_tree_node iova;
>>>> +	u32 flags;
>>>> +};
>>>> +
>>>> +/*
>>>> + * By default, during boot the hypervisor creates one Stage 2 (S2)
>>>> default
>>>> + * domain. Stage 2 means that the page table is controlled by the
>>>> hypervisor.
>>>> + *   S2 default: access to entire root partition memory. This for
>>>> us easily
>>>> + *		 maps to IOMMU_DOMAIN_IDENTITY in the iommu
>>>> subsystem, and
>>>> + *		 is called HV_DEVICE_DOMAIN_ID_S2_DEFAULT in the
>>>> hypervisor.
>>>> + *
>>>> + * Device Management:
>>>> + *   There are two ways to manage device attaches to domains:
>>>> + *     1. Domain Attach: A device domain is created in the
>>>> hypervisor, the
>>>> + *			 device is attached to this domain, and
>>>> then memory
>>>> + *			 ranges are mapped in the map callbacks.
>>>> + *     2. Direct Attach: No need to create a domain in the
>>>> hypervisor for direct
>>>> + *			 attached devices. A hypercall is made
>>>> to tell the
>>>> + *			 hypervisor to attach the device to a
>>>> guest. There is
>>>> + *			 no need for explicit memory mappings
>>>> because the
>>>> + *			 hypervisor will just use the guest HW
>>>> page table.
>>>> + *
>>>> + * Since a direct attach is much faster, it is the default. This
>>>> can be
>>>> + * changed via hv_no_attdev.
>>>> + *
>>>> + * L1VH: hypervisor only supports direct attach.
>>>> + */
>>>> +
>>>> +/*
>>>> + * Create dummy domain to correspond to hypervisor prebuilt
>>>> default identity
>>>> + * domain (dummy because we do not make hypercall to create them).
>>>> + */
>>>> +static struct hv_domain hv_def_identity_dom;
>>>> +
>>>> +static bool hv_special_domain(struct hv_domain *hvdom)
>>>> +{
>>>> +	return hvdom == &hv_def_identity_dom;
>>>> +}
>>>> +
>>>> +struct iommu_domain_geometry default_geometry = (struct
>>>> iommu_domain_geometry) {
>>>> +	.aperture_start = 0,
>>>> +	.aperture_end = -1UL,
>>>> +	.force_aperture = true,
>>>> +};
>>>> +
>>>> +/*
>>>> + * Since the relevant hypercalls can only fit less than 512 PFNs
>>>> in the pfn
>>>> + * array, report 1M max.
>>>> + */
>>>> +#define HV_IOMMU_PGSIZES (SZ_4K | SZ_1M)
>>>> +
>>>> +static u32 unique_id;	      /* unique numeric id of a new
>>>> domain */ +
>>>> +static void hv_iommu_detach_dev(struct iommu_domain *immdom,
>>>> +				struct device *dev);
>>>> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom,
>>>> ulong iova,
>>>> +				   size_t pgsize, size_t pgcount,
>>>> +				   struct iommu_iotlb_gather
>>>> *gather); +
>>>> +/*
>>>> + * If the current thread is a VMM thread, return the partition id
>>>> of the VM it
>>>> + * is managing, else return HV_PARTITION_ID_INVALID.
>>>> + */
>>>> +u64 hv_iommu_get_curr_partid(void)
>>>> +{
>>>> +	u64 (*fn)(pid_t pid);
>>>> +	u64 partid;
>>>> +
>>>> +	fn = symbol_get(mshv_pid_to_partid);
>>>> +	if (!fn)
>>>> +		return HV_PARTITION_ID_INVALID;
>>>> +
>>>> +	partid = fn(current->tgid);
>>>> +	symbol_put(mshv_pid_to_partid);
>>>> +
>>>> +	return partid;
>>>> +}
>>> This function is not iommu specific. Maybe move it to mshv code?
>>
>> Well, it is getting the information from mshv by calling a function
>> there for iommu, and is not needed if no HYPER_IOMMU. So this is
>> probably the best place for it.
>>
> ok, maybe move it to mshv after we have a second user. But the function
> name can be just hv_get_curr_partid(void), no?

it could, but by convention all public funcs here are hv_iommu_xxx..
and other reviewers might object... We really need virt/mshv/
sub directory... not sure if it's worth creating now for just one
function. so maybe we just live with it for now... we do have work
item to move some things from drivers/hv to virt/mshv/ .. so this
can get added to that whenever that happens.

>>>> +
>>>> +/* If this is a VMM thread, then this domain is for a guest VM */
>>>> +static bool hv_curr_thread_is_vmm(void)
>>>> +{
>>>> +	return hv_iommu_get_curr_partid() !=
>>>> HV_PARTITION_ID_INVALID; +}
>>>> +
>>>> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap
>>>> cap) +{
>>>> +	switch (cap) {
>>>> +	case IOMMU_CAP_CACHE_COHERENCY:
>>>> +		return true;
>>>> +	default:
>>>> +		return false;
>>>> +	}
>>>> +	return false;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Check if given pci device is a direct attached device. Caller
>>>> must have
>>>> + * verified pdev is a valid pci device.
>>>> + */
>>>> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>>>> +{
>>>> +	struct iommu_domain *iommu_domain;
>>>> +	struct hv_domain *hvdom;
>>>> +	struct device *dev = &pdev->dev;
>>>> +
>>>> +	iommu_domain = iommu_get_domain_for_dev(dev);
>>>> +	if (iommu_domain) {
>>>> +		hvdom = to_hv_domain(iommu_domain);
>>>> +		return hvdom->attached_dom;
>>>> +	}
>>>> +
>>>> +	return false;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
>>> Attached domain can change anytime, what guarantee does the caller
>>> have?
>>
>> Not sure I understand what can change: the device moving from attached
>> to non-attached? or the domain getting deleted? In any case, this is
>> called from leaf functions, so that should not happen... and it
>> will return false if the device did somehow got removed.
>>
> I was thinking the device can be attached to a different domain type at
> runtime, e.g. via sysfs to identity or DMA. But I guess here is a static
> attachment either for l1vh or root.

That is correct. It is extra work to support that if there is a good
usecase/demand.
  
>>>> +
>>>> +/* Create a new device domain in the hypervisor */
>>>> +static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
>>>> +{
>>>> +	u64 status;
>>>> +	unsigned long flags;
>>>> +	struct hv_input_device_domain *ddp;
>>>> +	struct hv_input_create_device_domain *input;
>>> nit: use consistent coding style, inverse Christmas tree.
>>>    
>>>> +
>>>> +	local_irq_save(flags);
>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> +	memset(input, 0, sizeof(*input));
>>>> +
>>>> +	ddp = &input->device_domain;
>>>> +	ddp->partition_id = HV_PARTITION_ID_SELF;
>>>> +	ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>>>> +	ddp->domain_id.id = hvdom->domid_num;
>>>> +
>>>> +
>>>> input->create_device_domain_flags.forward_progress_required = 1;
>>>> +	input->create_device_domain_flags.inherit_owning_vtl = 0;
>>>> +
>>>> +	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN,
>>>> input, NULL); +
>>>> +	local_irq_restore(flags);
>>>> +
>>>> +	if (!hv_result_success(status))
>>>> +		hv_status_err(status, "\n");
>>>> +
>>>> +	return hv_result_to_errno(status);
>>>> +}
>>>> +
>>>> +/* During boot, all devices are attached to this */
>>>> +static struct iommu_domain *hv_iommu_domain_alloc_identity(struct
>>>> device *dev) +{
>>>> +	return &hv_def_identity_dom.iommu_dom;
>>>> +}
>>>> +
>>>> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct
>>>> device *dev) +{
>>>> +	struct hv_domain *hvdom;
>>>> +	int rc;
>>>> +
>>>> +	if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() &&
>>>> !hv_no_attdev) {
>>>> +		pr_err("Hyper-V: l1vh iommu does not support host
>>>> devices\n");
>>> why is this an error if user input choose not to do direct attach?
>>
>> Like the error message says: on l1vh, direct attaches of host devices
>> (eg dpdk) is not supported. and l1vh only does direct attaches. IOW,
>> no host devices on l1vh.
>>
> This hv_no_attdev flag is really confusing to me, by default
> hv_no_attdev is false, which allows direct attach. And you are saying
> l1vh allows it.

Well, at the time of this design/coding, my understanding was we'd have
mapped devices on l1vh also. But now it looks like that would be bit
later than sooner .. unless AI bots start dumping code of course :) :)..

I could remove it from the if statement and add it when the support
is added, but is harmless and one less thing to remember.

> Why is this flag also controls host device attachment in l1vh? If you
> can tell the difference between direct host device attach and other
> direct attach, why don't you reject always reject host attach in l1vh?
> 
>>>> +		return NULL;
>>>> +	}
>>>> +
>>>> +	hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
>>>> +	if (hvdom == NULL)
>>>> +		goto out;
>>>> +
>>>> +	spin_lock_init(&hvdom->mappings_lock);
>>>> +	hvdom->mappings_tree = RB_ROOT_CACHED;
>>>> +
>>>> +	if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT)   /*
>>>> ie, 0 */
>>> This is true only when unique_id wraps around, right? Then this
>>> driver stops working?
>>
>> Correct. It's a u32, so if my math is right, and a device is attached
>> every second, it will take 136 years to wrap! Did i get that right?
>>
> This is still a unnecessary vulnerability.

Device passthru will fail and will not cause any corruption or
data theft issues... can make it u64 if it gives extra peace. not
worth all that mumbo jumbo for almost never gonna happen case.

>>> can you use an IDR for the unique_id and free it as you detach
>>> instead of doing this cyclic allocation?
>>>    
>>>> +		goto out_free;
>>>> +
>>>> +	hvdom->domid_num = unique_id;
>>>> +	hvdom->iommu_dom.geometry = default_geometry;
>>>> +	hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
>>>> +
>>>> +	/* For guests, by default we do direct attaches, so no
>>>> domain in hyp */
>>>> +	if (hv_curr_thread_is_vmm() && !hv_no_attdev)
>>>> +		hvdom->attached_dom = true;
>>>> +	else {
>>>> +		rc = hv_iommu_create_hyp_devdom(hvdom);
>>>> +		if (rc)
>>>> +			goto out_free_id;
>>>> +	}
>>>> +
>>>> +	return &hvdom->iommu_dom;
>>>> +
>>>> +out_free_id:
>>>> +	unique_id--;
>>>> +out_free:
>>>> +	kfree(hvdom);
>>>> +out:
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
>>>> +{
>>>> +	struct hv_domain *hvdom = to_hv_domain(immdom);
>>>> +	unsigned long flags;
>>>> +	u64 status;
>>>> +	struct hv_input_delete_device_domain *input;
>>>> +
>>>> +	if (hv_special_domain(hvdom))
>>>> +		return;
>>>> +
>>>> +	if (hvdom->num_attchd) {
>>>> +		pr_err("Hyper-V: can't free busy iommu domain
>>>> (%p)\n", immdom);
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	if (!hv_curr_thread_is_vmm() || hv_no_attdev) {
>>>> +		struct hv_input_device_domain *ddp;
>>>> +
>>>> +		local_irq_save(flags);
>>>> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> +		ddp = &input->device_domain;
>>>> +		memset(input, 0, sizeof(*input));
>>>> +
>>>> +		ddp->partition_id = HV_PARTITION_ID_SELF;
>>>> +		ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>>>> +		ddp->domain_id.id = hvdom->domid_num;
>>>> +
>>>> +		status =
>>>> hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
>>>> +					 NULL);
>>>> +		local_irq_restore(flags);
>>>> +
>>>> +		if (!hv_result_success(status))
>>>> +			hv_status_err(status, "\n");
>>>> +	}
>>
>>> you could free the domid here, no?
>> sorry, don't follow what you mean by domid, you mean unique_id?
>>
> yes.

no it's just a sequential number with no track of what's used.

>>>> +
>>>> +	kfree(hvdom);
>>>> +}
>>>> +
>>>> +/* Attach a device to a domain previously created in the
>>>> hypervisor */ +static int hv_iommu_att_dev2dom(struct hv_domain
>>>> *hvdom, struct pci_dev *pdev) +{
>>>> +	unsigned long flags;
>>>> +	u64 status;
>>>> +	enum hv_device_type dev_type;
>>>> +	struct hv_input_attach_device_domain *input;
>>>> +
>>>> +	local_irq_save(flags);
>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> +	memset(input, 0, sizeof(*input));
>>>> +
>>>> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>>>> +	input->device_domain.domain_id.type =
>>>> HV_DEVICE_DOMAIN_TYPE_S2;
>>>> +	input->device_domain.domain_id.id = hvdom->domid_num;
>>>> +
>>>> +	/* NB: Upon guest shutdown, device is re-attached to the
>>>> default domain
>>>> +	 * without explicit detach.
>>>> +	 */
>>>> +	if (hv_l1vh_partition())
>>>> +		dev_type = HV_DEVICE_TYPE_LOGICAL;
>>>> +	else
>>>> +		dev_type = HV_DEVICE_TYPE_PCI;
>>>> +
>>>> +	input->device_id.as_uint64 = hv_build_devid_oftype(pdev,
>>>> dev_type); +
>>>> +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN,
>>>> input, NULL);
>>>> +	local_irq_restore(flags);
>>>> +
>>>> +	if (!hv_result_success(status))
>>>> +		hv_status_err(status, "\n");
>>>> +
>>>> +	return hv_result_to_errno(status);
>>>> +}
>>>> +
>>>> +/* Caller must have validated that dev is a valid pci dev */
>>>> +static int hv_iommu_direct_attach_device(struct pci_dev *pdev)
>>>> +{
>>>> +	struct hv_input_attach_device *input;
>>>> +	u64 status;
>>>> +	int rc;
>>>> +	unsigned long flags;
>>>> +	union hv_device_id host_devid;
>>>> +	enum hv_device_type dev_type;
>>>> +	u64 ptid = hv_iommu_get_curr_partid();
>>>> +
>>>> +	if (ptid == HV_PARTITION_ID_INVALID) {
>>>> +		pr_err("Hyper-V: Invalid partition id in direct
>>>> attach\n");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	if (hv_l1vh_partition())
>>>> +		dev_type = HV_DEVICE_TYPE_LOGICAL;
>>>> +	else
>>>> +		dev_type = HV_DEVICE_TYPE_PCI;
>>>> +
>>>> +	host_devid.as_uint64 = hv_build_devid_oftype(pdev,
>>>> dev_type); +
>>>> +	do {
>>>> +		local_irq_save(flags);
>>>> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> +		memset(input, 0, sizeof(*input));
>>>> +		input->partition_id = ptid;
>>>> +		input->device_id = host_devid;
>>>> +
>>>> +		/* Hypervisor associates logical_id with this
>>>> device, and in
>>>> +		 * some hypercalls like retarget interrupts,
>>>> logical_id must be
>>>> +		 * used instead of the BDF. It is a required
>>>> parameter.
>>>> +		 */
>>>> +		input->attdev_flags.logical_id = 1;
>>>> +		input->logical_devid =
>>>> +			   hv_build_devid_oftype(pdev,
>>>> HV_DEVICE_TYPE_LOGICAL); +
>>>> +		status = hv_do_hypercall(HVCALL_ATTACH_DEVICE,
>>>> input, NULL);
>>>> +		local_irq_restore(flags);
>>>> +
>>>> +		if (hv_result(status) ==
>>>> HV_STATUS_INSUFFICIENT_MEMORY) {
>>>> +			rc = hv_call_deposit_pages(NUMA_NO_NODE,
>>>> ptid, 1);
>>>> +			if (rc)
>>>> +				break;
>>>> +		}
>>>> +	} while (hv_result(status) ==
>>>> HV_STATUS_INSUFFICIENT_MEMORY); +
>>>> +	if (!hv_result_success(status))
>>>> +		hv_status_err(status, "\n");
>>>> +
>>>> +	return hv_result_to_errno(status);
>>>> +}
>>>> +
>>>> +/* This to attach a device to both host app (like DPDK) and a
>>>> guest VM */
>>> The IOMMU driver should be agnostic to the type of consumer,
>>> whether a userspace driver or a VM. This comment is not necessary.
>>>    
>>>> +static int hv_iommu_attach_dev(struct iommu_domain *immdom,
>>>> struct device *dev,
>>>> +			       struct iommu_domain *old)
>>> This does not match upstream kernel prototype, which kernel version
>>> is this based on? I will stop here for now.
>>
>> As I mentioned in the cover letter:
>>            Based on: 8f0b4cce4481 (origin/hyperv-next)
>>
> where is this repo?

https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git
branch: hyperv-next

All our hyperv/mshv related patch submissions are merged there
first by Wei.

Thanks,
-Mukesh

.. deleted ......

^ permalink raw reply

* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Mukesh R @ 2026-01-30 22:51 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, romank
In-Reply-To: <aXkIGfos4l0kv_lF@skinsburskii.localdomain>

On 1/27/26 10:46, Stanislav Kinsburskii wrote:
> On Mon, Jan 26, 2026 at 07:02:29PM -0800, Mukesh R wrote:
>> On 1/26/26 07:57, Stanislav Kinsburskii wrote:
>>> On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
>>>> On 1/20/26 16:12, Stanislav Kinsburskii wrote:
>>>>> On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
>>>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>>>
>>>>>> Add a new file to implement management of device domains, mapping and
>>>>>> unmapping of iommu memory, and other iommu_ops to fit within the VFIO
>>>>>> framework for PCI passthru on Hyper-V running Linux as root or L1VH
>>>>>> parent. This also implements direct attach mechanism for PCI passthru,
>>>>>> and it is also made to work within the VFIO framework.
>>>>>>
>>>>>> At a high level, during boot the hypervisor creates a default identity
>>>>>> domain and attaches all devices to it. This nicely maps to Linux iommu
>>>>>> subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
>>>>>> need to explicitly ask Hyper-V to attach devices and do maps/unmaps
>>>>>> during boot. As mentioned previously, Hyper-V supports two ways to do
>>>>>> PCI passthru:
>>>>>>
>>>>>>      1. Device Domain: root must create a device domain in the hypervisor,
>>>>>>         and do map/unmap hypercalls for mapping and unmapping guest RAM.
>>>>>>         All hypervisor communications use device id of type PCI for
>>>>>>         identifying and referencing the device.
>>>>>>
>>>>>>      2. Direct Attach: the hypervisor will simply use the guest's HW
>>>>>>         page table for mappings, thus the host need not do map/unmap
>>>>>>         device memory hypercalls. As such, direct attach passthru setup
>>>>>>         during guest boot is extremely fast. A direct attached device
>>>>>>         must be referenced via logical device id and not via the PCI
>>>>>>         device id.
>>>>>>
>>>>>> At present, L1VH root/parent only supports direct attaches. Also direct
>>>>>> attach is default in non-L1VH cases because there are some significant
>>>>>> performance issues with device domain implementation currently for guests
>>>>>> with higher RAM (say more than 8GB), and that unfortunately cannot be
>>>>>> addressed in the short term.
>>>>>>
>>>>>
>>>>> <snip>
>>>>>
>>>
>>> <snip>
>>>
>>>>>> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
>>>>>> +{
>>>>>> +	struct pci_dev *pdev;
>>>>>> +	struct hv_domain *hvdom = to_hv_domain(immdom);
>>>>>> +
>>>>>> +	/* See the attach function, only PCI devices for now */
>>>>>> +	if (!dev_is_pci(dev))
>>>>>> +		return;
>>>>>> +
>>>>>> +	if (hvdom->num_attchd == 0)
>>>>>> +		pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
>>>>>> +
>>>>>> +	pdev = to_pci_dev(dev);
>>>>>> +
>>>>>> +	if (hvdom->attached_dom) {
>>>>>> +		hv_iommu_det_dev_from_guest(hvdom, pdev);
>>>>>> +
>>>>>> +		/* Do not reset attached_dom, hv_iommu_unmap_pages happens
>>>>>> +		 * next.
>>>>>> +		 */
>>>>>> +	} else {
>>>>>> +		hv_iommu_det_dev_from_dom(hvdom, pdev);
>>>>>> +	}
>>>>>> +
>>>>>> +	hvdom->num_attchd--;
>>>>>
>>>>> Shouldn't this be modified iff the detach succeeded?
>>>>
>>>> We want to still free the domain and not let it get stuck. The purpose
>>>> is more to make sure detach was called before domain free.
>>>>
>>>
>>> How can one debug subseqent errors if num_attchd is decremented
>>> unconditionally? In reality the device is left attached, but the related
>>> kernel metadata is gone.
>>
>> Error is printed in case of failed detach. If there is panic, at least
>> you can get some info about the device. Metadata in hypervisor is
>> around if failed.
>>
> 
> With this approach the only thing left is a kernel message.
> But if the state is kept intact, one could collect a kernel core and
> analyze it.

Again, most of linux stuff is cleaned up, the only state is in
hypervisor, and hypervisor can totally protect itself and devices.
So there is not much in kernel core as it got cleaned up already.
Think of this as additional check, we can remove in future after
it stands the test of time, until then, every debugging bit helps.

> And note, that there won't be a hypervisor core by default: our main
> context with the usptreamed version of the driver is L1VH and a kernel
> core is the only thing a third party customer can provide for our
> analysis.

Wei can correct me, but we are not only l1vh focused here. There is
work going on on all fronts.

Thanks,
-Mukesh

> Thanks,
> Stanislav

^ permalink raw reply

* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
From: Mukesh R @ 2026-01-30 22:17 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux
In-Reply-To: <aXkKhGvpaHUGclI-@skinsburskii.localdomain>

On 1/27/26 10:57, Stanislav Kinsburskii wrote:
> On Mon, Jan 26, 2026 at 07:07:22PM -0800, Mukesh R wrote:
>> On 1/26/26 10:15, Stanislav Kinsburskii wrote:
>>> On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
>>>> On 1/20/26 17:53, Stanislav Kinsburskii wrote:
>>>>> On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
>>>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>>>
>>>>>> Upon guest access, in case of missing mmio mapping, the hypervisor
>>>>>> generates an unmapped gpa intercept. In this path, lookup the PCI
>>>>>> resource pfn for the guest gpa, and ask the hypervisor to map it
>>>>>> via hypercall. The PCI resource pfn is maintained by the VFIO driver,
>>>>>> and obtained via fixup_user_fault call (similar to KVM).
>>>>>>
>>>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>>> ---
>>>>>>     drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 115 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>>>>>> index 03f3aa9f5541..4c8bc7cd0888 100644
>>>>>> --- a/drivers/hv/mshv_root_main.c
>>>>>> +++ b/drivers/hv/mshv_root_main.c
>>>>>> @@ -56,6 +56,14 @@ struct hv_stats_page {
>>>>>>     	};
>>>>>>     } __packed;
>>>>>> +bool hv_nofull_mmio;   /* don't map entire mmio region upon fault */
>>>>>> +static int __init setup_hv_full_mmio(char *str)
>>>>>> +{
>>>>>> +	hv_nofull_mmio = true;
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +__setup("hv_nofull_mmio", setup_hv_full_mmio);
>>>>>> +
>>>>>>     struct mshv_root mshv_root;
>>>>>>     enum hv_scheduler_type hv_scheduler_type;
>>>>>> @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
>>>>>>     }
>>>>>>     #ifdef CONFIG_X86_64
>>>>>> +
>>>>>> +/*
>>>>>> + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
>>>>>> + * else just return -errno.
>>>>>> + */
>>>>>> +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
>>>>>> +				       u64 *mmio_pfnp)
>>>>>> +{
>>>>>> +	struct vm_area_struct *vma;
>>>>>> +	bool is_mmio;
>>>>>> +	u64 uaddr;
>>>>>> +	struct mshv_mem_region *mreg;
>>>>>> +	struct follow_pfnmap_args pfnmap_args;
>>>>>> +	int rc = -EINVAL;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Do not allow mem region to be deleted beneath us. VFIO uses
>>>>>> +	 * useraddr vma to lookup pci bar pfn.
>>>>>> +	 */
>>>>>> +	spin_lock(&pt->pt_mem_regions_lock);
>>>>>> +
>>>>>> +	/* Get the region again under the lock */
>>>>>> +	mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>>>> +	if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>>>> +		goto unlock_pt_out;
>>>>>> +
>>>>>> +	uaddr = mreg->start_uaddr +
>>>>>> +		((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
>>>>>> +
>>>>>> +	mmap_read_lock(current->mm);
>>>>>
>>>>> Semaphore can't be taken under spinlock.
>>>
>>>>
>>>> Yeah, something didn't feel right here and I meant to recheck, now regret
>>>> rushing to submit the patch.
>>>>
>>>> Rethinking, I think the pt_mem_regions_lock is not needed to protect
>>>> the uaddr because unmap will properly serialize via the mm lock.
>>>>
>>>>
>>>>>> +	vma = vma_lookup(current->mm, uaddr);
>>>>>> +	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
>>>>>
>>>>> Why this check is needed again?
>>>>
>>>> To make sure region did not change. This check is under lock.
>>>>
>>>
>>> How can this happen? One can't change VMA type without unmapping it
>>> first. And unmapping it leads to a kernel MMIO region state dangling
>>> around without corresponding user space mapping.
>>
>> Right, and vm_flags would not be mmio expected then.
>>
>>> This is similar to dangling pinned regions and should likely be
>>> addressed the same way by utilizing MMU notifiers to destpoy memoty
>>> regions is VMA is detached.
>>
>> I don't think we need that. Either it succeeds if the region did not
>> change at all, or just fails.
>>
> 
> I'm afraid we do, as if the driver mapped a page with the previous
> memory region, and then the region is unmapped, the page will stay
> mapped in the hypervisor, but will be considered free by kernel, which
> in turn will lead to GPF upn next allocation.

There are no ram pages for mmio regions. Also, we don't do much with
mmio regions other than tell the hyp about it.

Thanks,
-Mukesh


> With pinned regions we issue is similar but less impacting: pages can't
> be released by user space unmapping and thus will be simply leaked, but
> the system stays intact.
> 
> MMIO regions are simila to movable region in this regard: they don't
> reference the user pages, and thus this guest region replaement is a
> stright wat to kernel panic.
> 
>>
>>>>> The region type is stored on the region itself.
>>>>> And the type is checked on the caller side.
>>>>>
>>>>>> +	if (!is_mmio)
>>>>>> +		goto unlock_mmap_out;
>>>>>> +
>>>>>> +	pfnmap_args.vma = vma;
>>>>>> +	pfnmap_args.address = uaddr;
>>>>>> +
>>>>>> +	rc = follow_pfnmap_start(&pfnmap_args);
>>>>>> +	if (rc) {
>>>>>> +		rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
>>>>>> +				      NULL);
>>>>>> +		if (rc)
>>>>>> +			goto unlock_mmap_out;
>>>>>> +
>>>>>> +		rc = follow_pfnmap_start(&pfnmap_args);
>>>>>> +		if (rc)
>>>>>> +			goto unlock_mmap_out;
>>>>>> +	}
>>>>>> +
>>>>>> +	*mmio_pfnp = pfnmap_args.pfn;
>>>>>> +	follow_pfnmap_end(&pfnmap_args);
>>>>>> +d
>>>>>> +unlock_mmap_out:
>>>>>> +	mmap_read_unlock(current->mm);
>>>>>> +unlock_pt_out:
>>>>>> +	spin_unlock(&pt->pt_mem_regions_lock);
>>>>>> +	return rc;
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
>>>>>> + * and resolve if possible.
>>>>>> + * Returns: True if valid mmio intercept and it was handled, else false
>>>>>> + */
>>>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
>>>>>> +{
>>>>>> +	struct hv_message *hvmsg = vp->vp_intercept_msg_page;
>>>>>> +	struct hv_x64_memory_intercept_message *msg;
>>>>>> +	union hv_x64_memory_access_info accinfo;
>>>>>> +	u64 gfn, mmio_spa, numpgs;
>>>>>> +	struct mshv_mem_region *mreg;
>>>>>> +	int rc;
>>>>>> +	struct mshv_partition *pt = vp->vp_partition;
>>>>>> +
>>>>>> +	msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
>>>>>> +	accinfo = msg->memory_access_info;
>>>>>> +
>>>>>> +	if (!accinfo.gva_gpa_valid)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* Do a fast check and bail if non mmio intercept */
>>>>>> +	gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
>>>>>> +	mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>>>
>>>>> This call needs to be protected by the spinlock.
>>>>
>>>> This is sorta fast path to bail. We recheck under partition lock above.
>>>>
>>>
>>> Accessing the list of regions without lock is unsafe.
>>
>> I am not sure why? This check is done by a vcpu thread, so regions
>> will not have just gone away.
>>
> 
> This is shared resources. Multiple VP thread get into this function
> simultaneously, so there is a race already. But this one we can live
> with without locking as they don't mutate the list of the regions.
> 
> The issue happens when VMM adds or removed another region as it mutates
> the list and races with VP threads doing this lookup.
> 
> Thanks,
> Stanislav
> 
> 
>> Thanks,
>> -Mukesh
>>
>>
>>> Thanks,
>>> Stanislav
>>>
>>>> Thanks,
>>>> -Mukesh
>>>>
>>>>
>>>>> Thanks,
>>>>> Stanislav
>>>>>
>>>>>> +	if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
>>>>>> +	if (rc)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	if (!hv_nofull_mmio) {		/* default case */
>>>>>> +		gfn = mreg->start_gfn;
>>>>>> +		mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
>>>>>> +		numpgs = mreg->nr_pages;
>>>>>> +	} else
>>>>>> +		numpgs = 1;
>>>>>> +
>>>>>> +	rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
>>>>>> +
>>>>>> +	return rc == 0;
>>>>>> +}
>>>>>> +
>>>>>>     static struct mshv_mem_region *
>>>>>>     mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
>>>>>>     {
>>>>>> @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
>>>>>>     	return ret;
>>>>>>     }
>>>>>> +
>>>>>>     #else  /* CONFIG_X86_64 */
>>>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
>>>>>>     static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
>>>>>>     #endif /* CONFIG_X86_64 */
>>>>>>     static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
>>>>>>     {
>>>>>>     	switch (vp->vp_intercept_msg_page->header.message_type) {
>>>>>> +	case HVMSG_UNMAPPED_GPA:
>>>>>> +		return mshv_handle_unmapped_gpa(vp);
>>>>>>     	case HVMSG_GPA_INTERCEPT:
>>>>>>     		return mshv_handle_gpa_intercept(vp);
>>>>>>     	}
>>>>>> -- 
>>>>>> 2.51.2.vfs.0.1
>>>>>>


^ permalink raw reply

* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Mukesh R @ 2026-01-30 22:10 UTC (permalink / raw)
  To: Jacob Pan
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux
In-Reply-To: <20260127143119.00006d2f@linux.microsoft.com>

On 1/27/26 14:31, Jacob Pan wrote:
> Hi Mukesh,
> 
>>>>> +
>>>>> +	if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() &&
>>>>> !hv_no_attdev) {
>>>>> +		pr_err("Hyper-V: l1vh iommu does not support
>>>>> host devices\n");
>>>> why is this an error if user input choose not to do direct
>>>> attach?
>>>
>>> Like the error message says: on l1vh, direct attaches of host
>>> devices (eg dpdk) is not supported. and l1vh only does direct
>>> attaches. IOW, no host devices on l1vh.
>>>    
>> This hv_no_attdev flag is really confusing to me, by default
>> hv_no_attdev is false, which allows direct attach. And you are saying
>> l1vh allows it.
>>
>> Why is this flag also controls host device attachment in l1vh? If you
>> can tell the difference between direct host device attach and other
>> direct attach, why don't you reject always reject host attach in l1vh?
> On second thought, if the hv_no_attdev knob is only meant to control
> host domain attach vs. direct attach, then it is irrelevant on L1VH.
> 
> Would it make more sense to rename this to something like
> hv_host_disable_direct_attach? That would better reflect its scope and
> allow it to be ignored under L1VH, and reduce the risk of users
> misinterpreting or misusing it.

It would, but it is kernel parameter and needs to be terse. It would
be documented properly tho, so we should be ok.

Thanks,
-Mukesh


^ permalink raw reply

* Re: [PATCH 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Anirudh Rayabharam @ 2026-01-30 20:45 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXz_4yWC6Jofqygj@skinsburskii.localdomain>

On Fri, Jan 30, 2026 at 11:00:51AM -0800, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 05:09:10PM +0000, Anirudh Rayabharam wrote:
> > On Thu, Jan 29, 2026 at 09:03:54AM -0800, Stanislav Kinsburskii wrote:
> > > On Thu, Jan 29, 2026 at 04:36:51AM +0000, Anirudh Rayabharam wrote:
> > > > On Wed, Jan 28, 2026 at 03:03:51PM -0800, Stanislav Kinsburskii wrote:
> > > > > On Wed, Jan 28, 2026 at 04:04:37PM +0000, Anirudh Rayabharam wrote:
> > > > > > From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > > 
> > > <snip>
> > > 
> > > > > 
> > > > > > +static int mshv_irq = -1;
> > > > > > +
> > > > > 
> > > > > Should this be a path of mshv_root structure?
> > > > 
> > > > This doesn't need to be globally accessible. It is only used in this file.
> > > > So I guess it doesn't need to be in mshv_root. What do you think?
> > > > 
> > > 
> > > Please, see below.
> > 
> > The below part doesn't make a case for this variable being part of the
> > mshv_root structure. Did you miss this part in your reply?
> > 
> 
> No, I didn't miss it. I just don't see the point of introducing there
> variables unless the goal is to weave more logic into the existent flow.

I introduced the variables in this file because they're only used in
this file. How would moving the variables to the mshv_root structure
help with the code weaving problem?

> 
> > > 
> > > <snip>
> > > 
> > > > > >  int mshv_synic_cpu_init(unsigned int cpu)
> > > > > >  {
> > > > > >  	union hv_synic_simp simp;
> > > > > >  	union hv_synic_siefp siefp;
> > > > > >  	union hv_synic_sirbp sirbp;
> > > > > > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > > > > >  	union hv_synic_sint sint;
> > > > > > -#endif
> > > > > >  	union hv_synic_scontrol sctrl;
> > > > > >  	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > > > > >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> > > > > > @@ -496,10 +632,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> > > > > >  
> > > > > >  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> > > > > >  
> > > > > > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > > > > > +	if (mshv_irq != -1)
> > > > > > +		enable_percpu_irq(mshv_irq, 0);
> > > > > > +
> > > > > 
> > > > > It's better to explicitly separate x86 and arm64 paths with #ifdefs.
> > > > > For example:
> > > > > 
> > > > > #ifdef CONFIG_X86_64
> > > > > int setup_cpu_sint() {
> > > > >   	/* Enable intercepts */
> > > > >   	sint.as_uint64 = 0;
> > > > > 	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > > > 	....
> > > > > }
> > > > > #endif
> > > > > #ifdef CONFIG_ARM64
> > > > > int setup_cpu_sint() {
> > > > > 	enable_percpu_irq(mshv_irq, 0);
> > > > > 
> > > > >   	/* Enable intercepts */
> > > > >   	sint.as_uint64 = 0;
> > > > > 	sint.vector = mshv_interrupt;
> > > > > 	....
> > > > > }
> > > > > #endif
> > > > 
> > > > This seems unnecessary. We've made the paths that determine
> > > > mshv_interrupt separate. Now we can just use that here.
> > > > 
> > > > There is no need to write two copies of 
> > > > 
> > > > 	...
> > > >    	sint.as_uint64 = 0;
> > > >  	sint.vector = <whatever>;
> > > > 	...
> > > > 
> > > > I could do the enable_percpu_irq() inside an ifdef. But do we gain
> > > > anything from it? Won't the compiler optimize the current code as well
> > > > since mshv_irq will always be -1 whenever HYPERVISOR_CALLBACK_VECTOR is
> > > > defined?
> > > > 
> > > 
> > > AFAIU this patc, x86 doesn’t need these variables at all. So it’s better
> > > to separate them completely and explicitly.
> > > 
> > > Also, this isn’t the only place where ARM-specific logic is added. This
> > > patch adds ARM-specific logic and tries to weave it into the existing
> > > x86 flow.
> > > 
> > > If it were only one place, that might be OK. But here it happens in
> > > several places. That makes the code harder to read and maintain. It also
> > > makes future extensions more risky (and they will likely follow). The
> > > dependencies are also not obvious. For example, on ARM the interrupt
> > > vector comes from ACPI (at least that’s what the comments say). So it’s
> > > not right to mix this into the common x86 path even if
> > > HYPERVISOR_CALLBACK_VECTOR is a x86-specific define.
> > 
> > We shouldn't think of this code in terms of X86 & ARM64. It's not about
> > arch at all. It's about whether or not we have a pre-defined vector
> > (a.k.a HYPERVISOR_CALLBACK_VECTOR). I feel that the current code cleanly
> > separates the two cases. The main difference in the two cases is in how
> > the vector is determined which is well seperated in the code paths. Once
> > the vector is determined, how we program it in the synic is the same for
> > both cases.
> > 
> 
> The major question is whether HYPERVISOR_CALLBACK_VECTOR can be
> defined on ARM. If it can’t, then it’s effectively an x86-only feature.

It cannot be defined for ARM. Just not possible with the way interrupts
are allocated on ARM.

> 
> The current code separates two cases. You are adding a third one: ARM,

The current code only really handles one case: where
HYPERVISOR_CALLBACK_VECTOR is defined. The other case is not handled at
all.

There is no case called ARM. The only two cases are:

  - HYPERVISOR_CALLBACK_VECTOR is defined
  - HYPERVISOR_CALLBACK_VECTOR is not defined

> with its own logic. But this is not stated explicitly in the code. As a
> result, we now have three cases mixed together, and the flow becomes
> spaghetti-like.
> 
> If we ever need to support DT on ARM (and we should expect that, because
> ACPI on ARM looks odd), we will need to add yet another case to this
> mix.

Nope there won't be another case. DT on ARM would fall under the
"HYPERVISOR_CALLBACK_VECTOR is not defined" case. Under that case, we
would check if we're using ACPI or DT and take the appropriate action.

Please see vmbus_drv.c, a similar approach is used there.

> 
> I hope you see the problem. The original code wasn't designed to be
> extensible. Since you are adding a new case, this is a good opportunity
> to redesign the flow and make it more extensible, instead of adding more
> logic on top.

I'll be sending a v2 soon where I believe I've cleaned this up
a little bit more. Let's see...

Thanks,
Anirudh.

> 
> > > 
> > > It would be much better to keep this ARM-specific logic in separate,
> > > conditionally compiled code. I suggest changing the flow to make this
> > > per-arch logic explicit. It will pay off later.
> > 
> > Most of the code introduced in this patch is conditionally compiled.
> > Building code from this patch on x86 will conditionally compile out a
> > large majority of it.
> > 
> > Are you by any chance suggesting we put it in a separate file?
> > 
> 
> No, I’m not suggesting to move it into a separate file yet.
> But making the arch-specific code clearly separated would be a good first step.
> 
> Thanks,
> Stanislav.
> 
> > Thanks,
> > Anirudh.
> > 
> > > 
> > > Thanks,
> > > Stanislav
> > > 
> > > > Thanks,
> > > > Anirudh.
> > > > 
> > > > > 
> > > > > Thanks,
> > > > > Stanislav
> > > > > 
> > > > > >  	/* Enable intercepts */
> > > > > >  	sint.as_uint64 = 0;
> > > > > > -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > > > > +	sint.vector = mshv_interrupt;
> > > > > >  	sint.masked = false;
> > > > > >  	sint.auto_eoi = hv_recommend_using_aeoi();
> > > > > >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> > > > > > @@ -507,13 +645,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> > > > > >  
> > > > > >  	/* Doorbell SINT */
> > > > > >  	sint.as_uint64 = 0;
> > > > > > -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > > > > +	sint.vector = mshv_interrupt;
> > > > > >  	sint.masked = false;
> > > > > >  	sint.as_intercept = 1;
> > > > > >  	sint.auto_eoi = hv_recommend_using_aeoi();
> > > > > >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> > > > > >  			      sint.as_uint64);
> > > > > > -#endif
> > > > > >  
> > > > > >  	/* Enable global synic bit */
> > > > > >  	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> > > > > > @@ -568,6 +705,9 @@ int mshv_synic_cpu_exit(unsigned int cpu)
> > > > > >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> > > > > >  			      sint.as_uint64);
> > > > > >  
> > > > > > +	if (mshv_irq != -1)
> > > > > > +		disable_percpu_irq(mshv_irq);
> > > > > > +
> > > > > >  	/* Disable Synic's event ring page */
> > > > > >  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> > > > > >  	sirbp.sirbp_enabled = false;
> > > > > > -- 
> > > > > > 2.34.1
> > > > > > 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Anirudh Rayabharam @ 2026-01-30 20:32 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXz8ldAeoWwGIxdu@skinsburskii.localdomain>

On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > hypervisor deposited pages.
> > > > > > > 
> > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > management is implemented.
> > > > > > 
> > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > and would work without any issue for L1VH.
> > > > > > 
> > > > > 
> > > > > No, it won't work and hypervsisor depostied pages won't be withdrawn.
> > > > 
> > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > right? What other deposited pages would be left?
> > > > 
> > > 
> > > The driver deposits two types of pages: one for the guests (withdrawn
> > > upon gust shutdown) and the other - for the host itself (never
> > > withdrawn).
> > > See hv_call_create_partition, for example: it deposits pages for the
> > > host partition.
> > 
> > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > Also, can't we forcefully kill all running partitions in module_exit and
> > then reclaim memory? Would this help with kernel consistency
> > irrespective of userspace behavior?
> > 
> 
> It would, but this is sloppy and cannot be a long-term solution.
> 
> It is also not reliable. We have no hook to prevent kexec. So if we fail
> to kill the guest or reclaim the memory for any reason, the new kernel
> may still crash.

Actually guests won't be running by the time we reach our module_exit
function during a kexec. Userspace processes would've been killed by
then.

Also, why is this sloppy? Isn't this what module_exit should be
doing anyway? If someone unloads our module we should be trying to
clean everything up (including killing guests) and reclaim memory.

In any case, we can BUG() out if we fail to reclaim the memory. That would
stop the kexec.

This is a better solution since instead of disabling KEXEC outright: our
driver made the best possible efforts to make kexec work.

> 
> There are two long-term solutions:
>  1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.

I honestly think we should focus efforts on making kexec work rather
than finding ways to prevent it.

Thanks,
Anirudh

>  2. Hand the shared kernel state over to the new kernel.
> 
> I sent a series for the first one. The second one is not ready yet.
> Anything else is neither robust nor reliable, so I don’t think it makes
> sense to pursue it.
> 
> Thanks,
> Stanislav
> 
> 
> > Thanks,
> > Anirudh.
> > 
> > > 
> > > Thanks,
> > > Stanislav
> > > 
> > > > Thanks,
> > > > Anirudh.
> > > > 
> > > > > Also, kernel consisntency must no depend on use space behavior. 
> > > > > 
> > > > > > Also, I don't think it is reasonable at all that someone needs to
> > > > > > disable basic kernel functionality such as kexec in order to use our
> > > > > > driver.
> > > > > > 
> > > > > 
> > > > > It's a temporary measure until proper page lifecycle management is
> > > > > supported in the driver.
> > > > > Mutual exclusion of the driver and kexec is given and thus should be
> > > > > expclitily stated in the Kconfig.
> > > > > 
> > > > > Thanks,
> > > > > Stanislav
> > > > > 
> > > > > > Thanks,
> > > > > > Anirudh.
> > > > > > 
> > > > > > > 
> > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > ---
> > > > > > >  drivers/hv/Kconfig |    1 +
> > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > 
> > > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > > --- a/drivers/hv/Kconfig
> > > > > > > +++ b/drivers/hv/Kconfig
> > > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > > >  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > > >  	# no particular order, making it impossible to reassemble larger pages
> > > > > > >  	depends on PAGE_SIZE_4KB
> > > > > > > +	depends on !KEXEC
> > > > > > >  	select EVENTFD
> > > > > > >  	select VIRT_XFER_TO_GUEST_WORK
> > > > > > >  	select HMM_MIRROR
> > > > > > > 
> > > > > > > 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox