Linux-HyperV List
* Re: [PATCH 1/1] PCI: hv: Remove unused field pci_bus in struct hv_pcibus_device
From: Wei Liu @ 2026-02-04  6:02 UTC
  To: mhklinux
  Cc: kys, haiyangz, wei.liu, decui, longli, lpieralisi, kwilczynski,
	mani, robh, bhelgaas, linux-pci, linux-kernel, linux-hyperv
In-Reply-To: <20260111170034.67558-1-mhklinux@outlook.com>

On Sun, Jan 11, 2026 at 09:00:34AM -0800, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> Field pci_bus in struct hv_pcibus_device is unused since
> commit 418cb6c8e051 ("PCI: hv: Generify PCI probing"). Remove it.
> 
> No functional change.
> 
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>

It looks like this trivial patch has not been picked up yet. I've queued
it up.

Thanks,
Wei

> ---
>  drivers/pci/controller/pci-hyperv.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 1e237d3538f9..7fcba05cec30 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -501,7 +501,6 @@ struct hv_pcibus_device {
>  	struct resource *low_mmio_res;
>  	struct resource *high_mmio_res;
>  	struct completion *survey_event;
> -	struct pci_bus *pci_bus;
>  	spinlock_t config_lock;	/* Avoid two threads writing index page */
>  	spinlock_t device_list_lock;	/* Protect lists below */
>  	void __iomem *cfg_addr;
> -- 
> 2.25.1
> 


* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Anirudh Rayabharam @ 2026-02-04  5:33 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYJPwp2i47P33xuz@skinsburskii.localdomain>

On Tue, Feb 03, 2026 at 11:42:58AM -0800, Stanislav Kinsburskii wrote:
> On Tue, Feb 03, 2026 at 04:46:03PM +0000, Anirudh Rayabharam wrote:
> > On Tue, Feb 03, 2026 at 07:40:36AM -0800, Stanislav Kinsburskii wrote:
> > > On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> > > > On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > > > > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > > > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > > > > > > > hypervisor deposited pages.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > > > > > management is implemented.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > > > > > > > > > and would work without any issue for L1VH.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > No, it won't work, and hypervisor-deposited pages won't be withdrawn.
> > > > > > > > > > > > 
> > > > > > > > > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > > > > > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > > > > > > > > right? What other deposited pages would be left?
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > > > > > > > upon guest shutdown) and the other - for the host itself (never
> > > > > > > > > > > withdrawn).
> > > > > > > > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > > > > > > > host partition.
> > > > > > > > > > 
> > > > > > > > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > > > > > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > > > > > > > then reclaim memory? Would this help with kernel consistency
> > > > > > > > > > irrespective of userspace behavior?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > > > > > 
> > > > > > > > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > > > > > > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > > > > > > > may still crash.
> > > > > > > > 
> > > > > > > > Actually guests won't be running by the time we reach our module_exit
> > > > > > > > function during a kexec. Userspace processes would've been killed by
> > > > > > > > then.
> > > > > > > > 
> > > > > > > 
> > > > > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > > > > We must not rely on OS to do graceful shutdown before doing
> > > > > > > kexec.
> > > > > > 
> > > > > > I see kexec -e is too brutal. Something like systemctl kexec is
> > > > > > more graceful and is probably used more commonly. In this case at least
> > > > > > we could register a reboot notifier and attempt to clean things up.
> > > > > > 
> > > > > > I think it is better to support kexec to this extent rather than
> > > > > > disabling it entirely.
> > > > > > 
> > > > > 
> > > > > You do understand that once our kernel is released to third parties, we
> > > > > can’t control how they will use kexec, right?
> > > > 
> > > > Yes, we can't. But that's okay. It is fine for us to say that only some
> > > > kexec scenarios are supported and some aren't (iff you're creating VMs
> > > > using MSHV; if you're not creating VMs all of kexec is supported).
> > > > 
> > > 
> > > Well, I disagree here. If we say the kernel supports MSHV, we must
> > > provide a robust solution. A partially working solution is not
> > > acceptable. It makes us look careless and can damage our reputation as a
> > > team (and as a company).
> > 
> > It won't if we call out upfront what is supported and what is not.
> > 
> > > 
> > > > > 
> > > > > This is a valid and existing option. We have to account for it. Yet
> > > > > again, L1VH will be used by arbitrary third parties out there, not just
> > > > > by us.
> > > > > 
> > > > > We can’t say the kernel supports MSHV until we close these gaps. We must
> > > > 
> > > > We can. It is okay to say some scenarios are supported and some aren't.
> > > > 
> > > > All kexecs are supported if they never create VMs using MSHV. If they do
> > > > create VMs using MSHV and we implement cleanup in a reboot notifier, at
> > > > least systemctl kexec and crashdump kexec would work, which are probably the
> > > > most common uses of kexec. It's okay to say that this is all we support
> > > > as of now.
> > > > 
> > > 
> > > I'm repeating myself, but I'll try to put it differently.
> > > There won't be any kernel core collected if a page was deposited. You're
> > > arguing for a lost cause here. Once a page is allocated and deposited,
> > > the crash kernel will try to write it into the core.
> > 
> > That's why we have to implement something where we attempt to destroy
> > partitions and reclaim memory (and BUG() out if that fails; which
> > hopefully should happen very rarely if at all). This should be *the*
> > solution we work towards. We don't need a temporary disable kexec
> > solution.
> > 
> 
> No, the solution is to preserve the shared state and pass it over via KHO.

Okay, then work towards it without doing temporary KEXEC disable. We can
call out that kexec is not supported until then. Disabling KEXEC is too
intrusive.

Is there any precedent for this? Do you know if any driver ever disabled
KEXEC this way?

> 
> > > 
> > > > Also, what makes you think customers would even be interested in enabling
> > > > our module in their kernel configs if it takes away kexec?
> > > > 
> > > 
> > > It's simple: L1VH isn't a host, so I can spin up new VMs instead of
> > > servicing the existing ones.
> > 
> > And what about the L2 VM state then? They might not be throwaway in all
> > cases.
> > 
> 
> L2 guest can (and likely will) be migrated from the old L1VH to the new
> one.
> And this is most likely the current scenario customers are using.
> 
> > > 
> > > Why do you think there won’t be customers interested in using MSHV in
> > > L1VH without kexec support?
> > 
> > Because they could already be using kexec for their servicing needs or
> > whatever. And no we can't just say "don't service these VMs just spin up
> > new ones".
> > 
> 
> Are you speculating or know for sure?

It's a reasonable assumption that people are using kexec for servicing.

> 
> > Also, keep in mind that once L1VH is available in Azure, the distros
> > that run on it would be the same distros that run on all other Azure
> > VMs. There won't be special distros with a kernel specifically built for
> > L1VH. And KEXEC is generally enabled in distros. Distro vendors won't be
> > happy that they would need to publish a separate version of their image with
> > MSHV_ROOT enabled and KEXEC disabled because they wouldn't want KEXEC to
> > be disabled for all Azure VMs. Also, the customers will be confused why
> > the same distro doesn't work on L1VH.
> > 
> 
> I don't think distro happiness is our concern. They already build custom

If distros are not happy they won't package this and consequently
nobody will use it.

> versions for Azure. They can build another custom version for L1VH if
> needed.

We should at least check if they are ready to do this.

Thanks,
Anirudh.

> 
> Anyway, I don't see the point in continuing this discussion. All points
> have been made, and solutions have been proposed.
> 
> If you can come up with something better in the next few days, so we at
> least have a chance to get it merged in the next merge window, great. If
> not, we should explicitly forbid the unsupported feature and move on.
> 
> Thanks,
> Stanislav
> 
> > Thanks,
> > Anirudh.


* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Mukesh R @ 2026-02-04  2:46 UTC
  To: Stanislav Kinsburskii
  Cc: Anirudh Rayabharam, kys, haiyangz, wei.liu, decui, longli,
	linux-hyperv, linux-kernel
In-Reply-To: <e03cea10-0970-88b6-ae44-7cb9759f2683@linux.microsoft.com>

On 2/2/26 12:15, Mukesh R wrote:
> On 2/2/26 08:43, Stanislav Kinsburskii wrote:
>> On Fri, Jan 30, 2026 at 11:47:48AM -0800, Mukesh R wrote:
>>> On 1/30/26 10:41, Stanislav Kinsburskii wrote:
>>>> On Fri, Jan 30, 2026 at 05:17:52PM +0000, Anirudh Rayabharam wrote:
>>>>> On Thu, Jan 29, 2026 at 06:59:31PM -0800, Mukesh R wrote:
>>>>>> On 1/28/26 15:08, Stanislav Kinsburskii wrote:
>>>>>>> On Tue, Jan 27, 2026 at 11:56:02AM -0800, Mukesh R wrote:
>>>>>>>> On 1/27/26 09:47, Stanislav Kinsburskii wrote:
>>>>>>>>> On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
>>>>>>>>>> On 1/26/26 16:21, Stanislav Kinsburskii wrote:
>>>>>>>>>>> On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
>>>>>>>>>>>> On 1/26/26 12:43, Stanislav Kinsburskii wrote:
>>>>>>>>>>>>> On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
>>>>>>>>>>>>>> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
>>>>>>>>>>>>>>> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
>>>>>>>>>>>>>>>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
>>>>>>>>>>>>>>>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
>>>>>>>>>>>>>>>>> runtime and never withdraws them. This creates a fundamental incompatibility
>>>>>>>>>>>>>>>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
>>>>>>>>>>>>>>>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
>>>>>>>>>>>>>>>>> hypervisor deposited pages.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
>>>>>>>>>>>>>>>>> management is implemented.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>          drivers/hv/Kconfig |    1 +
>>>>>>>>>>>>>>>>>          1 file changed, 1 insertion(+)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
>>>>>>>>>>>>>>>>> index 7937ac0cbd0f..cfd4501db0fa 100644
>>>>>>>>>>>>>>>>> --- a/drivers/hv/Kconfig
>>>>>>>>>>>>>>>>> +++ b/drivers/hv/Kconfig
>>>>>>>>>>>>>>>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
>>>>>>>>>>>>>>>>>              # e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>>>>>>>>>>>>>>>>>              # no particular order, making it impossible to reassemble larger pages
>>>>>>>>>>>>>>>>>              depends on PAGE_SIZE_4KB
>>>>>>>>>>>>>>>>> +    depends on !KEXEC
>>>>>>>>>>>>>>>>>              select EVENTFD
>>>>>>>>>>>>>>>>>              select VIRT_XFER_TO_GUEST_WORK
>>>>>>>>>>>>>>>>>              select HMM_MIRROR
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
>>>>>>>>>>>>>>>> implying that crash dump might be involved. Or did you test kdump
>>>>>>>>>>>>>>>> and it was fine?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, it will. Crash kexec depends on normal kexec functionality, so it
>>>>>>>>>>>>>>> will be affected as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So not sure I understand the reason for this patch. We can just block
>>>>>>>>>>>>>> kexec if there are any VMs running, right? Doing this would mean any
>>>>>>>>>>>>>> further development would be without a very important and major feature,
>>>>>>>>>>>>>> right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is an option. But until it's implemented and merged, a user mshv
>>>>>>>>>>>>> driver gets into a situation where kexec is broken in a non-obvious way.
>>>>>>>>>>>>> The system may crash at any time after kexec, depending on whether the
>>>>>>>>>>>>> new kernel touches the pages deposited to hypervisor or not. This is a
>>>>>>>>>>>>> bad user experience.
>>>>>>>>>>>>
>>>>>>>>>>>> I understand that. But with this we cannot collect core and debug any
>>>>>>>>>>>> crashes. I was thinking there would be a quick way to prohibit kexec
>>>>>>>>>>>> for update via notifier or some other quick hack. Did you already
>>>>>>>>>>>> explore that and didn't find anything, hence this?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This quick hack you mention isn't quick in the upstream kernel as there
>>>>>>>>>>> is no hook to interrupt kexec process except the live update one.
>>>>>>>>>>
>>>>>>>>>> That's the one we want to interrupt and block right? crash kexec
>>>>>>>>>> is ok and should be allowed. We can document we don't support kexec
>>>>>>>>>> for update for now.
>>>>>>>>>>
>>>>>>>>>>> I sent an RFC for that one, but given the details of today's conversation
>>>>>>>>>>> it won't be accepted as is.
>>>>>>>>>>
>>>>>>>>>> Are you talking about this?
>>>>>>>>>>
>>>>>>>>>>             "mshv: Add kexec safety for deposited pages"
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>>> Making mshv mutually exclusive with kexec is the only viable option for
>>>>>>>>>>> now given time constraints.
>>>>>>>>>>> It is intended to be replaced with proper page lifecycle management in
>>>>>>>>>>> the future.
>>>>>>>>>>
>>>>>>>>>> Yeah, that could take a long time and imo we cannot just disable KEXEC
>>>>>>>>>> completely. What we want is just block kexec for updates from some
>>>>>>>>>> mshv file for now, we can print during boot that kexec for updates is
>>>>>>>>>> not supported on mshv. Hope that makes sense.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The trade-off here is between disabling kexec support and having the
>>>>>>>>> kernel crash after kexec in a non-obvious way. This affects both regular
>>>>>>>>> kexec and crash kexec.
>>>>>>>>
>>>>>>>> crash kexec on baremetal is not affected, hence disabling that
>>>>>>>> doesn't make sense as we can't debug crashes then on bm.
>>>>>>>>
>>>>>>>
>>>>>>> Bare metal support is not currently relevant, as it is not available.
>>>>>>> This is the upstream kernel, and this driver will be accessible to
>>>>>>> third-party customers beginning with kernel 6.19 for running their
>>>>>>> kernels in Azure L1VH, so consistency is required.
>>>>>>
>>>>>> Well, without crashdump support, customers will not be running anything
>>>>>> anywhere.
>>>>>
>>>>> This is my concern too. I don't think customers will be particularly
>>>>> happy that kexec doesn't work with our driver.
>>>>>
>>>>
>>>> I wasn't clear earlier, so let me restate it. Today, kexec is not
>>>> supported in L1VH. This is a bug we have not fixed yet. Disabling kexec
>>>> is not a long-term solution. But it is better to disable it explicitly
>>>> than to have kernel crashes after kexec.
>>>
>>> I don't think there is disagreement on this. The undesired part is turning
>>> off KEXEC config completely.
>>>
>>
>> There is no disagreement on this either. If you have a better solution
>> that can be implemented and merged before next kernel merge window,
>> please propose it. Otherwise, this patch will remain as is for now.
> 
> Like I said previously, I'll explore a bit. I think I found something,
> but need to test it a bit and get second opinion on it. For me, I am

Nah, it works, but is too intrusive and no chance of being accepted. So
giving up on it. Hopefully a cleaner way can be achieved working with
kexec folks.

Thanks,
-Mukesh


> not convinced this absolutely has to be in this merge window as it only
> involves MSHV for l1vh and has been like this all this time. Moreover,
> other things like makedumpfile are broken on l1vh. But Wei can make
> final decision.
> 
> Thanks,
> -Mukesh
> 
>> Thanks,
>> Stanislav
>>
>>> Thanks,
>>> -Mukesh
>>>
>>>
>>>> This does not mean the bug should not be fixed. But the upstream kernel
>>>> has its own policies and merge windows. For kernel 6.19, it is better to
>>>> have a clear kexec error than random crashes after kexec.
>>>>
>>>> Thanks,
>>>> Stanislav
>>>>
>>>>> Thanks,
>>>>> Anirudh
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> -Mukesh
>>>>>>
>>>>>>> Thanks,
>>>>>>> Stanislav
>>>>>>>
>>>>>>>> Let me think and explore a bit, and if I come up with something, I'll
>>>>>>>> send a patch here. If nothing, then we can do this as last resort.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Mukesh
>>>>>>>>
>>>>>>>>
>>>>>>>>> It's a pity we can't apply a quick hack to disable only regular kexec.
>>>>>>>>> However, since crash kexec would hit the same issues, until we have a
>>>>>>>>> proper state transition for deposited pages, the best workaround for now
>>>>>>>>> is to reset the hypervisor state on every kexec, which needs design,
>>>>>>>>> work, and testing.
>>>>>>>>>
>>>>>>>>> Disabling kexec is the only consistent way to handle this in the
>>>>>>>>> upstream kernel at the moment.
>>>>>>>>>
>>>>>>>>> Thanks, Stanislav
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Mukesh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Stanislav
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -Mukesh
>>>>>>>>>>>>
>>>>>>>>>>>>> Therefore it should be explicitly forbidden as it's essentially not
>>>>>>>>>>>>> supported yet.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Stanislav
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Stanislav
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> -Mukesh
>>>>>>
>>>
> 



* [PATCH v1] x86/hyperv: Move hv crash init after hypercall pg setup
From: Mukesh R @ 2026-02-04  1:58 UTC
  To: linux-hyperv, linux-kernel; +Cc: wei.liu

hv_root_crash_init() does not set up hypervisor crash collection in the
baremetal case because the hypercall page is not yet set up when it is
called. This was missed because the internal mirror had fallen behind.

The fix is simple: move the crash init call after the hypercall page
setup.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 arch/x86/hyperv/hv_init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 14de43f4bc6c..7f3301bd081e 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -558,7 +558,6 @@ void __init hyperv_init(void)
 		memunmap(src);
 
 		hv_remap_tsc_clocksource();
-		hv_root_crash_init();
 		hv_sleep_notifiers_register();
 	} else {
 		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
@@ -567,6 +566,9 @@ void __init hyperv_init(void)
 
 	hv_set_hypercall_pg(hv_hypercall_pg);
 
+	if (hv_root_partition())        /* after set hypercall pg */
+		hv_root_crash_init();
+
 skip_hypercall_pg_init:
 	/*
 	 * hyperv_init() is called before LAPIC is initialized: see
-- 
2.51.2.vfs.0.1



* Re: [PATCH v0] x86/hyperv: Move hv crash init after hypercall pg setup
From: Mukesh R @ 2026-02-04  1:35 UTC
  To: Easwar Hariharan; +Cc: linux-hyperv, linux-kernel, wei.liu
In-Reply-To: <ae52e158-d138-4344-ab0c-74b2fae56ddb@linux.microsoft.com>

On 2/3/26 16:25, Easwar Hariharan wrote:
> On 2/3/2026 2:41 PM, Mukesh R wrote:
>> Fix a regression where hv_root_crash_init() fails a hypercall because
>> the hypercall page is not fully set up. The regression is caused by
>> the following commit:
>>
>> commit c8ed0812646e ("x86/hyperv: Use direct call to hypercall-page")
>>
> 
> Is that the right commit? The named commit was merged in v6.18-rc1 and
> hv_root_crash_init() was only merged in v6.19-rc1...
> 
> Thanks,
> Easwar (he/him)

Ah, you are right. I guess that commit was not in our internal
hyper-next mirror, so testing did not reveal the issue and I did not
notice it. Because a few things are missing, we have to use the internal
mirror to test. Anyway, I will fix the commit reference and resend.

Thanks,
-Mukesh


* Re: [PATCH 1/1] mshv: Add comment about huge page mappings in guest physical address space
From: Stanislav Kinsburskii @ 2026-02-04  0:54 UTC
  To: Michael Kelley
  Cc: mhkelley58@gmail.com, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB41575CA65B0A07C935F85665D49BA@SN6PR02MB4157.namprd02.prod.outlook.com>

On Tue, Feb 03, 2026 at 06:35:40PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, February 2, 2026 10:56 AM
> > 
> > On Mon, Feb 02, 2026 at 06:26:42PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, February 2, 2026 9:18 AM
> > > >
> > > > On Mon, Feb 02, 2026 at 08:51:01AM -0800, mhkelley58@gmail.com wrote:
> > > > > From: Michael Kelley <mhklinux@outlook.com>
> > > > >
> > > > > Huge page mappings in the guest physical address space depend on having
> > > > > matching alignment of the userspace address in the parent partition and
> > > > > of the guest physical address. Add a comment that captures this
> > > > > information. See the link to the mailing list thread.
> > > > >
> > > > > No code or functional change.
> > > > >
> > > > > Link: https://lore.kernel.org/linux-hyperv/aUrC94YvscoqBzh3@skinsburskii.localdomain/T/#m0871d2cae9b297fd397ddb8459e534981307c7dc
> > > > > Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> > > > > ---
> > > > >  drivers/hv/mshv_root_main.c | 14 ++++++++++++++
> > > > >  1 file changed, 14 insertions(+)
> > > > >
> > > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > > index 681b58154d5e..bc738ff4508e 100644
> > > > > --- a/drivers/hv/mshv_root_main.c
> > > > > +++ b/drivers/hv/mshv_root_main.c
> > > > > @@ -1389,6 +1389,20 @@ mshv_partition_ioctl_set_memory(struct mshv_partition *partition,
> > > > >  	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))
> > > > >  		return mshv_unmap_user_memory(partition, mem);
> > > > >
> > > > > +	/*
> > > > > +	 * If the userspace_addr and the guest physical address (as derived
> > > > > +	 * from the guest_pfn) have the same alignment modulo PMD huge page
> > > > > +	 * size, the MSHV driver can map any PMD huge pages to the guest
> > > > > +	 * physical address space as PMD huge pages. If the alignments do
> > > > > +	 * not match, PMD huge pages must be mapped as single pages in the
> > > > > +	 * guest physical address space. The MSHV driver does not enforce
> > > > > +	 * that the alignments match, and it invokes the hypervisor to set
> > > > > +	 * up correct functional mappings either way. See mshv_chunk_stride().
> > > > > +	 * The caller of the ioctl is responsible for providing userspace_addr
> > > > > +	 * and guest_pfn values with matching alignments if it wants the guest
> > > > > +	 * to get the performance benefits of PMD huge page mappings of its
> > > > > +	 * physical address space to real system memory.
> > > > > +	 */
> > > >
> > > > Thanks. However, I'd suggest reducing this comment a lot and putting the
> > > > details into the commit message instead. Also, why this place? Why not a
> > > > part of the function description instead, for example?
> > >
> > > In general, I'm very much an advocate of putting a bit more detail into code
> > > comments, so that someone new reading the code has a chance of figuring
> > > out what's going on without having to search through the commit history
> > > and read commit messages. The commit history is certainly useful for the
> > > historical record, and especially how things have changed over time. But for
> > > "how non-obvious things work now", I like to see that in the code comments.
> > >
> > 
> > This approach is not well aligned with the existing kernel coding style.
> > It is common to answer the "why" question in the commit message.
> > Code comments should focus on "what" the code does.
> > 
> > https://www.kernel.org/doc/html/latest/process/coding-style.html
> > 
> 
> Which says "Instead, put the comments at the head of the function,
> telling people what it does, and possibly WHY it does it." I'm good with
> that approach.
> 
> > For more details, it is common to use `git blame` to learn the context
> > of a change when needed.
> 
> Yep, I use that all the time for the historical record.
> 
> > 
> > > As for where to put the comment, I'm flexible. I thought about placing it
> > > outside the function as a "header" (which is what I think you mean by the
> > > "function description"), but the function handles both "map" and "unmap"
> > > operations, and this comment applies only to "map".  Hence I put it after
> > > the test for whether we're doing "map" vs. "unmap".  But I wouldn't object
> > > to it being placed as a function description, though the text would need to be
> > > enhanced to more broadly be a function description instead of just a comment
> > > about a specific aspect of "map" behavior.
> > >
> > 
> > As for the location, since this documents the userspace API, I would
> > rather place it above the function as part of the function description.
> > Even though the function handles both map and unmap, unmap also deals
> > with huge pages.
> 
> I'll do a version written as the function description. But the full function
> description will be more extensive to cover all the "what" that this function
> implements:
> * input parameters, and their valid values
> * map and unmap
> * when pinned vs. movable vs. mmio regions are created
> * what is done with huge pages in the above cases (i.e., a massaged version
>    of what I've already written)
> * populating and pinning of pages for pinned regions
> 
> Does that match with your expectations?

I’d rather suggest something simpler for the function header:

* What regions are created
* What page sizes are supported

I.e. describe what the function does, not the rationale or the
architecture behind it.

For example, something like this (suggested by AI, feel free to rewrite
completely):

 * Depending on the request, the region is created as pinned RAM, movable RAM,
 * or MMIO. PMD-sized huge page mappings are supported when the userspace
 * address and guest physical address (guest_pfn << PAGE_SHIFT) have matching
 * alignment modulo PMD_SIZE; otherwise the mapping is established using base
 * pages.

The rationale and architecture can be put into the commit message.

Thanks,
Stanislav

> Michael
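
For illustration only (not code from this thread): the alignment condition
in the suggested comment can be sketched in user-space C. The constants and
the helper name below are assumptions made for the example (x86-64 with
4 KiB base pages and 2 MiB PMD-sized pages); the driver's actual logic lives
in mshv_chunk_stride() and may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for x86-64 with 4 KiB base pages; not taken from the patch. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PMD_SIZE   (UINT64_C(1) << 21)   /* 2 MiB */

/*
 * Hypothetical helper: returns true when the userspace address and the
 * guest physical address (guest_pfn << PAGE_SHIFT) have the same offset
 * within a PMD-sized block, i.e. matching alignment modulo PMD_SIZE.
 */
static bool sketch_pmd_alignment_matches(uint64_t userspace_addr,
					 uint64_t guest_pfn)
{
	uint64_t gpa = guest_pfn << SKETCH_PAGE_SHIFT;
	uint64_t mask = SKETCH_PMD_SIZE - 1;

	/* The mask comparison is only valid for power-of-two sizes. */
	assert((SKETCH_PMD_SIZE & mask) == 0);

	return (userspace_addr & mask) == (gpa & mask);
}
```

With these assumed constants, userspace_addr 0x7f0000201000 and guest_pfn
0x201 share the offset 0x1000 within a 2 MiB block, so the helper returns
true; pairing the same address with guest_pfn 0x200 returns false, which is
the case where the mapping would fall back to base pages.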


* Re: [PATCH v0] x86/hyperv: Move hv crash init after hypercall pg setup
From: Easwar Hariharan @ 2026-02-04  0:25 UTC
  To: Mukesh R; +Cc: linux-hyperv, linux-kernel, easwar.hariharan, wei.liu
In-Reply-To: <20260203224121.26711-1-mrathor@linux.microsoft.com>

On 2/3/2026 2:41 PM, Mukesh R wrote:
> Fix a regression where hv_root_crash_init() fails a hypercall because
> the hypercall page is not fully set up. The regression is caused by
> the following commit:
> 
> commit c8ed0812646e ("x86/hyperv: Use direct call to hypercall-page")
> 

Is that the right commit? The named commit was merged in v6.18-rc1 and
hv_root_crash_init() was only merged in v6.19-rc1...

Thanks,
Easwar (he/him)


* Re: [PATCH v2 0/4] Improve Hyper-V memory deposit error handling
From: Mukesh R @ 2026-02-03 23:03 UTC
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177005499596.120041.5908089206606113719.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On 2/2/26 09:58, Stanislav Kinsburskii wrote:
> This series extends the MSHV driver to properly handle additional
> memory-related error codes from the Microsoft Hypervisor by depositing
> memory pages when needed.
> 
> Currently, when the hypervisor returns HV_STATUS_INSUFFICIENT_MEMORY
> during partition creation, the driver calls hv_call_deposit_pages() to
> provide the necessary memory. However, there are other memory-related
> error codes that indicate the hypervisor needs additional memory
> resources, but the driver does not attempt to deposit pages for these
> cases.
> 
> This series introduces a dedicated helper function to identify all
> memory-related error codes (HV_STATUS_INSUFFICIENT_MEMORY,
> HV_STATUS_INSUFFICIENT_BUFFERS, HV_STATUS_INSUFFICIENT_DEVICE_DOMAINS, and
> HV_STATUS_INSUFFICIENT_ROOT_MEMORY) and ensures the driver attempts to
> deposit pages for all of them via the new hv_deposit_memory() helper.
> 
> With these changes, partition creation becomes more robust by handling
> all scenarios where the hypervisor requires additional memory deposits.
> 
> v2:
> - Rename hv_result_oom() into hv_result_needs_memory()
> 
> ---
> 
> Stanislav Kinsburskii (4):
>        mshv: Introduce hv_result_needs_memory() helper function
>        mshv: Introduce hv_deposit_memory helper functions
>        mshv: Handle insufficient contiguous memory hypervisor status
>        mshv: Handle insufficient root memory hypervisor statuses
> 
> 
>   drivers/hv/hv_common.c         |    3 ++
>   drivers/hv/hv_proc.c           |   54 +++++++++++++++++++++++++++++++++++---
>   drivers/hv/mshv_root_hv_call.c |   45 +++++++++++++-------------------
>   drivers/hv/mshv_root_main.c    |    5 +---
>   include/asm-generic/mshyperv.h |   13 +++++++++
>   include/hyperv/hvgdk_mini.h    |   57 +++++++++++++++++++++-------------------
>   include/hyperv/hvhdk_mini.h    |    2 +
>   7 files changed, 119 insertions(+), 60 deletions(-)
> 

for the whole series:

Reviewed-by: Mukesh R <mrathor@linux.microsoft.com>
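
For anyone skimming the thread, the retry pattern the cover letter
describes can be sketched in user space roughly as below. This is only an
illustration under stated assumptions, not the driver code: every demo_*
name, the numeric status values, and the fake hypercall are invented for
the example. The real status codes live in include/hyperv/hvgdk_mini.h
and the real helpers are hv_result_needs_memory() and hv_deposit_memory().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status values, for illustration only. */
enum demo_status {
	DEMO_STATUS_SUCCESS = 0,
	DEMO_STATUS_INSUFFICIENT_MEMORY,
	DEMO_STATUS_INSUFFICIENT_BUFFERS,
	DEMO_STATUS_INSUFFICIENT_DEVICE_DOMAINS,
	DEMO_STATUS_INSUFFICIENT_ROOT_MEMORY,
	DEMO_STATUS_INVALID_PARAMETER,
};

/* Stand-in for the hypervisor: succeeds once enough pages are deposited. */
struct demo_hv {
	unsigned int pages_needed;
	unsigned int pages_deposited;
};

/* One place that knows which statuses mean "deposit more memory". */
static bool demo_result_needs_memory(uint64_t status)
{
	switch (status) {
	case DEMO_STATUS_INSUFFICIENT_MEMORY:
	case DEMO_STATUS_INSUFFICIENT_BUFFERS:
	case DEMO_STATUS_INSUFFICIENT_DEVICE_DOMAINS:
	case DEMO_STATUS_INSUFFICIENT_ROOT_MEMORY:
		return true;
	default:
		return false;
	}
}

/* Fake hypercall: asks for memory until the quota is met. */
static uint64_t demo_hypercall(const struct demo_hv *hv)
{
	if (hv->pages_deposited < hv->pages_needed)
		return DEMO_STATUS_INSUFFICIENT_MEMORY;
	return DEMO_STATUS_SUCCESS;
}

/* Validate the status first, then deposit a single page. */
static int demo_deposit_memory(struct demo_hv *hv, uint64_t status)
{
	if (!demo_result_needs_memory(status))
		return -ENOMEM;
	hv->pages_deposited++;
	return 0;
}

/* The do/while retry loop shape used by the callers in the series. */
static int demo_call_with_deposit(struct demo_hv *hv)
{
	uint64_t status;
	int ret;

	do {
		status = demo_hypercall(hv);
		if (!demo_result_needs_memory(status)) {
			ret = status == DEMO_STATUS_SUCCESS ? 0 : -EINVAL;
			break;
		}
		ret = demo_deposit_memory(hv, status);
	} while (!ret);

	return ret;
}
```

With pages_needed set to 3, demo_call_with_deposit() loops three times,
depositing one page per iteration, and then returns 0. The point of the
shape is that callers no longer care which particular status triggered
the deposit.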


^ permalink raw reply

* Re: [PATCH v2 2/4] mshv: Introduce hv_deposit_memory helper functions
From: Mukesh R @ 2026-02-03 23:01 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <177005514346.120041.5702271891856790910.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On 2/2/26 09:59, Stanislav Kinsburskii wrote:
> Introduce hv_deposit_memory_node() and hv_deposit_memory() helper
> functions to handle memory deposition with proper error handling.

deposition is a legal thing :) ... i think you just mean deposit.

> The new hv_deposit_memory_node() function takes the hypervisor status
> as a parameter and validates it before depositing pages. It checks for
> HV_STATUS_INSUFFICIENT_MEMORY specifically and returns an error for
> unexpected status codes.
> 
> This is a precursor patch to new out-of-memory error codes support.
> No functional changes intended.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>   drivers/hv/hv_proc.c           |   22 ++++++++++++++++++++--
>   drivers/hv/mshv_root_hv_call.c |   25 +++++++++----------------
>   drivers/hv/mshv_root_main.c    |    3 +--
>   include/asm-generic/mshyperv.h |   10 ++++++++++
>   4 files changed, 40 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> index e53204b9e05d..ffa25cd6e4e9 100644
> --- a/drivers/hv/hv_proc.c
> +++ b/drivers/hv/hv_proc.c
> @@ -110,6 +110,23 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>   }
>   EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
>   
> +int hv_deposit_memory_node(int node, u64 partition_id,
> +			   u64 hv_status)
> +{
> +	u32 num_pages;
> +
> +	switch (hv_result(hv_status)) {
> +	case HV_STATUS_INSUFFICIENT_MEMORY:
> +		num_pages = 1;
> +		break;
> +	default:
> +		hv_status_err(hv_status, "Unexpected!\n");
> +		return -ENOMEM;
> +	}
> +	return hv_call_deposit_pages(node, partition_id, num_pages);
> +}
> +EXPORT_SYMBOL_GPL(hv_deposit_memory_node);
> +
>   bool hv_result_needs_memory(u64 status)
>   {
>   	switch (hv_result(status)) {
> @@ -155,7 +172,8 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>   			}
>   			break;
>   		}
> -		ret = hv_call_deposit_pages(node, hv_current_partition_id, 1);
> +		ret = hv_deposit_memory_node(node, hv_current_partition_id,
> +					     status);
>   	} while (!ret);
>   
>   	return ret;
> @@ -197,7 +215,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>   			}
>   			break;
>   		}
> -		ret = hv_call_deposit_pages(node, partition_id, 1);
> +		ret = hv_deposit_memory_node(node, partition_id, status);
>   
>   	} while (!ret);
>   
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index 89afeeda21dd..174431cb5e0e 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -123,8 +123,7 @@ int hv_call_create_partition(u64 flags,
>   			break;
>   		}
>   		local_irq_restore(irq_flags);
> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> -					    hv_current_partition_id, 1);
> +		ret = hv_deposit_memory(hv_current_partition_id, status);
>   	} while (!ret);
>   
>   	return ret;
> @@ -151,7 +150,7 @@ int hv_call_initialize_partition(u64 partition_id)
>   			ret = hv_result_to_errno(status);
>   			break;
>   		}
> -		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> +		ret = hv_deposit_memory(partition_id, status);
>   	} while (!ret);
>   
>   	return ret;
> @@ -465,8 +464,7 @@ int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
>   		}
>   		local_irq_restore(flags);
>   
> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> -					    partition_id, 1);
> +		ret = hv_deposit_memory(partition_id, status);
>   	} while (!ret);
>   
>   	return ret;
> @@ -525,8 +523,7 @@ int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
>   		}
>   		local_irq_restore(flags);
>   
> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> -					    partition_id, 1);
> +		ret = hv_deposit_memory(partition_id, status);
>   	} while (!ret);
>   
>   	return ret;
> @@ -573,7 +570,7 @@ static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
>   
>   		local_irq_restore(flags);
>   
> -		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> +		ret = hv_deposit_memory(partition_id, status);
>   	} while (!ret);
>   
>   	return ret;
> @@ -722,8 +719,7 @@ hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
>   			ret = hv_result_to_errno(status);
>   			break;
>   		}
> -		ret = hv_call_deposit_pages(NUMA_NO_NODE, port_partition_id, 1);
> -
> +		ret = hv_deposit_memory(port_partition_id, status);
>   	} while (!ret);
>   
>   	return ret;
> @@ -776,8 +772,7 @@ hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
>   			ret = hv_result_to_errno(status);
>   			break;
>   		}
> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> -					    connection_partition_id, 1);
> +		ret = hv_deposit_memory(connection_partition_id, status);
>   	} while (!ret);
>   
>   	return ret;
> @@ -848,8 +843,7 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
>   			break;
>   		}
>   
> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> -					    hv_current_partition_id, 1);
> +		ret = hv_deposit_memory(hv_current_partition_id, status);
>   	} while (!ret);
>   
>   	return ret;
> @@ -885,8 +879,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
>   			return ret;
>   		}
>   
> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> -					    hv_current_partition_id, 1);
> +		ret = hv_deposit_memory(hv_current_partition_id, status);
>   		if (ret)
>   			return ret;
>   	} while (!ret);
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index ee30bfa6bb2e..dce255c94f9e 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -264,8 +264,7 @@ static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
>   		if (!hv_result_needs_memory(status))
>   			ret = hv_result_to_errno(status);
>   		else
> -			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> -						    pt_id, 1);
> +			ret = hv_deposit_memory(pt_id, status);
>   	} while (!ret);
>   
>   	args.status = hv_result(status);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 452426d5b2ab..d37b68238c97 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -344,6 +344,7 @@ static inline bool hv_parent_partition(void)
>   }
>   
>   bool hv_result_needs_memory(u64 status);
> +int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
>   int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
>   int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
>   int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
> @@ -353,6 +354,10 @@ static inline bool hv_root_partition(void) { return false; }
>   static inline bool hv_l1vh_partition(void) { return false; }
>   static inline bool hv_parent_partition(void) { return false; }
>   static inline bool hv_result_needs_memory(u64 status) { return false; }
> +static inline int hv_deposit_memory_node(int node, u64 partition_id, u64 status)
> +{
> +	return -EOPNOTSUPP;
> +}
>   static inline int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>   {
>   	return -EOPNOTSUPP;
> @@ -367,6 +372,11 @@ static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u3
>   }
>   #endif /* CONFIG_MSHV_ROOT */
>   
> +static inline int hv_deposit_memory(u64 partition_id, u64 status)
> +{
> +	return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
> +}
> +
>   #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>   u8 __init get_vtl(void);
>   #else
> 
> 


^ permalink raw reply

* [PATCH v0] x86/hyperv: Move hv crash init after hypercall pg setup
From: Mukesh R @ 2026-02-03 22:41 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel; +Cc: wei.liu

Fix a regression where hv_root_crash_init() fails a hypercall because
the hypercall page is not fully set up. The regression is caused by the
following commit:

commit c8ed0812646e ("x86/hyperv: Use direct call to hypercall-page")

The fix is simple: move the crash init call after the hypercall
page setup.

Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
 arch/x86/hyperv/hv_init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 14de43f4bc6c..7f3301bd081e 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -558,7 +558,6 @@ void __init hyperv_init(void)
 		memunmap(src);
 
 		hv_remap_tsc_clocksource();
-		hv_root_crash_init();
 		hv_sleep_notifiers_register();
 	} else {
 		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
@@ -567,6 +566,9 @@ void __init hyperv_init(void)
 
 	hv_set_hypercall_pg(hv_hypercall_pg);
 
+	if (hv_root_partition())        /* after set hypercall pg */
+		hv_root_crash_init();
+
 skip_hypercall_pg_init:
 	/*
 	 * hyperv_init() is called before LAPIC is initialized: see
-- 
2.51.2.vfs.0.1


^ permalink raw reply related

* Re: [PATCH 1/3] x86/x2apic: disable x2apic on resume if the kernel expects so
From: Sohil Mehta @ 2026-02-03 21:08 UTC (permalink / raw)
  To: Shashank Balaji, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Suresh Siddha, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Broadcom internal kernel review list, Jan Kiszka,
	Paolo Bonzini, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: Ingo Molnar, linux-kernel, linux-hyperv, virtualization,
	jailhouse-dev, kvm, xen-devel, Rahul Bukte, Daniel Palmer,
	Tim Bird, stable
In-Reply-To: <20260202-x2apic-fix-v1-1-71c8f488a88b@sony.com>

> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index d93f87f29d03..cc64d61f82cf 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -2456,6 +2456,12 @@ static void lapic_resume(void *data)
>  	if (x2apic_mode) {
>  		__x2apic_enable();
>  	} else {
> +		/*
> +		 * x2apic may have been re-enabled by the
> +		 * firmware on resuming from s2ram
> +		 */
> +		__x2apic_disable();
> +

We should likely only disable x2apic on platforms that support it and
need the disabling. How about?

...
} else {
	/*
	 *
	 */
	if (x2apic_enabled())
		__x2apic_disable();

I considered if an error message should be printed along with this. But,
I am not sure if it can really be called a firmware issue. It's probably
just that newer CPUs might have started defaulting to x2apic on.

Can you specify what platform you are encountering this on?



^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-02-03 19:42 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYImS_vEdR-kxBuQ@anirudh-surface.localdomain>

On Tue, Feb 03, 2026 at 04:46:03PM +0000, Anirudh Rayabharam wrote:
> On Tue, Feb 03, 2026 at 07:40:36AM -0800, Stanislav Kinsburskii wrote:
> > On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> > > On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > > > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > > > > > > hypervisor deposited pages.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > > > > management is implemented.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > > > > > > > > and would work without any issue for L1VH.
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > No, it won't work and hypervisor deposited pages won't be withdrawn.
> > > > > > > > > > > 
> > > > > > > > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > > > > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > > > > > > > right? What other deposited pages would be left?
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > > > > > > > upon guest shutdown) and the other - for the host itself (never
> > > > > > > > > > withdrawn).
> > > > > > > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > > > > > > host partition.
> > > > > > > > > 
> > > > > > > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > > > > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > > > > > > then reclaim memory? Would this help with kernel consistency
> > > > > > > > > irrespective of userspace behavior?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > > > > 
> > > > > > > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > > > > > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > > > > > > may still crash.
> > > > > > > 
> > > > > > > Actually guests won't be running by the time we reach our module_exit
> > > > > > > function during a kexec. Userspace processes would've been killed by
> > > > > > > then.
> > > > > > > 
> > > > > > 
> > > > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > > > We must not rely on OS to do graceful shutdown before doing
> > > > > > kexec.
> > > > > 
> > > > > I see kexec -e is too brutal. Something like systemctl kexec is
> > > > > more graceful and is probably used more commonly. In this case at least
> > > > > we could register a reboot notifier and attempt to clean things up.
> > > > > 
> > > > > I think it is better to support kexec to this extent rather than
> > > > > disabling it entirely.
> > > > > 
> > > > 
> > > > You do understand that once our kernel is released to third parties, we
> > > > can’t control how they will use kexec, right?
> > > 
> > > Yes, we can't. But that's okay. It is fine for us to say that only some
> > > kexec scenarios are supported and some aren't (iff you're creating VMs
> > > using MSHV; if you're not creating VMs all of kexec is supported).
> > > 
> > 
> > Well, I disagree here. If we say the kernel supports MSHV, we must
> > provide a robust solution. A partially working solution is not
> > acceptable. It makes us look careless and can damage our reputation as a
> > team (and as a company).
> 
> It won't if we call out upfront what is supported and what is not.
> 
> > 
> > > > 
> > > > This is a valid and existing option. We have to account for it. Yet
> > > > again, L1VH will be used by arbitrary third parties out there, not just
> > > > by us.
> > > > 
> > > > We can’t say the kernel supports MSHV until we close these gaps. We must
> > > 
> > > > We can. It is okay to say some scenarios are supported and some aren't.
> > > 
> > > All kexecs are supported if they never create VMs using MSHV. If they do
> > > create VMs using MSHV and we implement cleanup in a reboot notifier at
> > > > least systemctl kexec and crashdump kexec would work, which are probably the
> > > most common uses of kexec. It's okay to say that this is all we support
> > > as of now.
> > > 
> > 
> > I'm repeating myself, but I'll try to put it differently.
> > There won't be any kernel core collected if a page was deposited. You're
> > arguing for a lost cause here. Once a page is allocated and deposited,
> > the crash kernel will try to write it into the core.
> 
> That's why we have to implement something where we attempt to destroy
> partitions and reclaim memory (and BUG() out if that fails; which
> hopefully should happen very rarely if at all). This should be *the*
> solution we work towards. We don't need a temporary disable kexec
> solution.
> 

No, the solution is to preserve the shared state and pass it over via KHO.

> > 
> > > Also, what makes you think customers would even be interested in enabling
> > > our module in their kernel configs if it takes away kexec?
> > > 
> > 
> > It's simple: L1VH isn't a host, so I can spin up new VMs instead of
> > servicing the existing ones.
> 
> And what about the L2 VM state then? They might not be throwaway in all
> cases.
> 

L2 guest can (and likely will) be migrated from the old L1VH to the new
one.
And this is most likely the current scenario customers are using.

> > 
> > Why do you think there won’t be customers interested in using MSHV in
> > L1VH without kexec support?
> 
> Because they could already be using kexec for their servicing needs or
> whatever. And no we can't just say "don't service these VMs just spin up
> new ones".
> 

Are you speculating or know for sure?

> Also, keep in mind that once L1VH is available in Azure, the distros
> that run on it would be the same distros that run on all other Azure
> VMs. There won't be special distros with a kernel specifically built for
> L1VH. And KEXEC is generally enabled in distros. Distro vendors won't be
> happy that they would need to publish a separate version of their image with
> MSHV_ROOT enabled and KEXEC disabled because they wouldn't want KEXEC to
> be disabled for all Azure VMs. Also, the customers will be confused why
> the same distro doesn't work on L1VH.
> 

I don't think distro happiness is our concern. They already build custom
versions for Azure. They can build another custom version for L1VH if
needed.

Anyway, I don't see the point in continuing this discussion. All points
have been made, and solutions have been proposed.

If you can come up with something better in the next few days, so we at
least have a chance to get it merged in the next merge window, great. If
not, we should explicitly forbid the unsupported feature and move on.

Thanks,
Stanislav

> Thanks,
> Anirudh.

^ permalink raw reply

* RE: [PATCH 1/1] mshv: Add comment about huge page mappings in guest physical address space
From: Michael Kelley @ 2026-02-03 18:35 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: mhkelley58@gmail.com, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aYDzU5ujoBlzWaa6@skinsburskii.localdomain>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, February 2, 2026 10:56 AM
> 
> On Mon, Feb 02, 2026 at 06:26:42PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, February 2, 2026 9:18 AM
> > >
> > > On Mon, Feb 02, 2026 at 08:51:01AM -0800, mhkelley58@gmail.com wrote:
> > > > From: Michael Kelley <mhklinux@outlook.com>
> > > >
> > > > Huge page mappings in the guest physical address space depend on having
> > > > matching alignment of the userspace address in the parent partition and
> > > > of the guest physical address. Add a comment that captures this
> > > > information. See the link to the mailing list thread.
> > > >
> > > > No code or functional change.
> > > >
> > > > Link: https://lore.kernel.org/linux-hyperv/aUrC94YvscoqBzh3@skinsburskii.localdomain/T/#m0871d2cae9b297fd397ddb8459e534981307c7dc
> > > > Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> > > > ---
> > > >  drivers/hv/mshv_root_main.c | 14 ++++++++++++++
> > > >  1 file changed, 14 insertions(+)
> > > >
> > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > index 681b58154d5e..bc738ff4508e 100644
> > > > --- a/drivers/hv/mshv_root_main.c
> > > > +++ b/drivers/hv/mshv_root_main.c
> > > > @@ -1389,6 +1389,20 @@ mshv_partition_ioctl_set_memory(struct mshv_partition *partition,
> > > >  	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))
> > > >  		return mshv_unmap_user_memory(partition, mem);
> > > >
> > > > +	/*
> > > > +	 * If the userspace_addr and the guest physical address (as derived
> > > > +	 * from the guest_pfn) have the same alignment modulo PMD huge page
> > > > +	 * size, the MSHV driver can map any PMD huge pages to the guest
> > > > +	 * physical address space as PMD huge pages. If the alignments do
> > > > +	 * not match, PMD huge pages must be mapped as single pages in the
> > > > +	 * guest physical address space. The MSHV driver does not enforce
> > > > +	 * that the alignments match, and it invokes the hypervisor to set
> > > > +	 * up correct functional mappings either way. See mshv_chunk_stride().
> > > > +	 * The caller of the ioctl is responsible for providing userspace_addr
> > > > +	 * and guest_pfn values with matching alignments if it wants the guest
> > > > +	 * to get the performance benefits of PMD huge page mappings of its
> > > > +	 * physical address space to real system memory.
> > > > +	 */
> > >
> > > Thanks. However, I'd suggest to reduce this comment a lot and put the
> > > details into the commit message instead. Also, why this place? Why not a
> > > part of the function description instead, for example?
> >
> > In general, I'm very much an advocate of putting a bit more detail into code
> > comments, so that someone new reading the code has a chance of figuring
> > out what's going on without having to search through the commit history
> > and read commit messages. The commit history is certainly useful for the
> > historical record, and especially how things have changed over time. But for
> > "how non-obvious things work now", I like to see that in the code comments.
> >
> 
> This approach is not well aligned with the existing kernel coding style.
> It is common to answer the "why" question in the commit message.
> Code comments should focus on "what" the code does.
> 
> https://www.kernel.org/doc/html/latest/process/coding-style.html
> 

Which says "Instead, put the comments at the head of the function,
telling people what it does, and possibly WHY it does it." I'm good with
that approach.

> For more details, it is common to use `git blame` to learn the context
> of a change when needed.

Yep, I use that all the time for the historical record.

> 
> > As for where to put the comment, I'm flexible. I thought about placing it
> > outside the function as a "header" (which is what I think you mean by the
> > "function description"), but the function handles both "map" and "unmap"
> > operations, and this comment applies only to "map".  Hence I put it after
> > the test for whether we're doing "map" vs. "unmap".  But I wouldn't object
> > to it being placed as a function description, though the text would need to be
> > enhanced to more broadly be a function description instead of just a comment
> > about a specific aspect of "map" behavior.
> >
> 
> As for the location, since this documents the userspace API, I would
> rather place it above the function as part of the function description.
> Even though the function handles both map and unmap, unmap also deals
> with huge pages.

I'll do a version written as the function description. But the full function
description will be more extensive to cover all the "what" that this function
implements:
* input parameters, and their valid values
* map and unmap
* when pinned vs. movable vs. mmio regions are created
* what is done with huge pages in the above cases (i.e., a massaged version
   of what I've already written)
* populating and pinning of pages for pinned regions

Does that match with your expectations?

Michael
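
For reference, the "same alignment modulo PMD huge page size" condition
from the proposed comment boils down to a one-line check. The sketch
below is a user-space illustration only: the demo_* names are made up,
and the 4 KiB page / 2 MiB PMD sizes are an assumption (typical x86-64),
not something taken from the driver. The driver's own logic is in
mshv_chunk_stride().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes for illustration: 4 KiB base pages, 2 MiB PMD huge pages. */
#define DEMO_PAGE_SHIFT	12
#define DEMO_PMD_SHIFT	21
#define DEMO_PMD_SIZE	(UINT64_C(1) << DEMO_PMD_SHIFT)

/*
 * Returns true when userspace_addr and the guest physical address derived
 * from guest_pfn have the same offset within a PMD-sized region, i.e. when
 * a PMD huge page on the host side could be mapped as a PMD huge page in
 * the guest physical address space.
 */
static bool demo_pmd_alignment_matches(uint64_t userspace_addr,
				       uint64_t guest_pfn)
{
	uint64_t gpa = guest_pfn << DEMO_PAGE_SHIFT;

	/* XOR leaves only the differing bits; mask keeps the in-PMD offset. */
	return ((userspace_addr ^ gpa) & (DEMO_PMD_SIZE - 1)) == 0;
}
```

Note that neither address has to be PMD-aligned on its own; what matters
is that both are offset by the same amount within a 2 MiB region. A VMM
that wants huge page mappings would pick userspace_addr (for example via
an aligned mmap) so that this check holds for the chosen guest_pfn.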

^ permalink raw reply

* Re: [PATCH v2 1/2] mshv: refactor synic init and cleanup
From: Stanislav Kinsburskii @ 2026-02-03 16:57 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <bja6gpc4y5jhbujljlcv4lcje3zius776o3v6n7gxj6bfj2bfl@a6dwxx424xcb>

On Tue, Feb 03, 2026 at 10:19:10AM +0530, Anirudh Rayabharam wrote:
> On Mon, Feb 02, 2026 at 11:07:17AM -0800, Stanislav Kinsburskii wrote:
> > On Mon, Feb 02, 2026 at 06:27:05PM +0000, Anirudh Rayabharam wrote:
> > > From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > > 
> > > Rename mshv_synic_init() to mshv_synic_cpu_init() and
> > > mshv_synic_cleanup() to mshv_synic_cpu_exit() to better reflect that
> > > these functions handle per-cpu synic setup and teardown.
> > > 
> > > Use mshv_synic_init/cleanup() to perform init/cleanup that is not per-cpu.
> > > Move all the synic related setup from mshv_parent_partition_init.
> > > 
> > > Move the reboot notifier to mshv_synic.c because it currently only
> > > operates on the synic cpuhp state.
> > > 
> > > Move out synic_pages from the global mshv_root since its use is now
> > > completely local to mshv_synic.c.
> > > 
> > > This is in preparation for the next patch which will add more stuff to
> > > mshv_synic_init().
> > > 
> > > No functional change.
> > > 
> > > Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

<snip>

> > > diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> > > index f8b0337cdc82..98c58755846d 100644
> > > --- a/drivers/hv/mshv_synic.c
> > > +++ b/drivers/hv/mshv_synic.c
> > > @@ -12,11 +12,16 @@
> > >  #include <linux/mm.h>
> > >  #include <linux/io.h>
> > >  #include <linux/random.h>
> > > +#include <linux/cpuhotplug.h>
> > > +#include <linux/reboot.h>
> > >  #include <asm/mshyperv.h>
> > >  
> > >  #include "mshv_eventfd.h"
> > >  #include "mshv.h"
> > >  
> > > +static int synic_cpuhp_online;
> > > +static struct hv_synic_pages __percpu *synic_pages;
> > > +
> > >  static u32 synic_event_ring_get_queued_port(u32 sint_index)
> > >  {
> > >  	struct hv_synic_event_ring_page **event_ring_page;
> > > @@ -26,7 +31,7 @@ static u32 synic_event_ring_get_queued_port(u32 sint_index)
> > >  	u32 message;
> > >  	u8 tail;
> > >  
> > > -	spages = this_cpu_ptr(mshv_root.synic_pages);
> > > +	spages = this_cpu_ptr(synic_pages);
> > >  	event_ring_page = &spages->synic_event_ring_page;
> > >  	synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> > >  
> > > @@ -393,7 +398,7 @@ mshv_intercept_isr(struct hv_message *msg)
> > >  
> > >  void mshv_isr(void)
> > >  {
> > > -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > > +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> > >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> > >  	struct hv_message *msg;
> > >  	bool handled;
> > > @@ -446,7 +451,7 @@ void mshv_isr(void)
> > >  	}
> > >  }
> > >  
> > > -int mshv_synic_init(unsigned int cpu)
> > > +static int mshv_synic_cpu_init(unsigned int cpu)
> > >  {
> > >  	union hv_synic_simp simp;
> > >  	union hv_synic_siefp siefp;
> > > @@ -455,7 +460,7 @@ int mshv_synic_init(unsigned int cpu)
> > >  	union hv_synic_sint sint;
> > >  #endif
> > >  	union hv_synic_scontrol sctrl;
> > > -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > > +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> > >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> > >  	struct hv_synic_event_flags_page **event_flags_page =
> > >  			&spages->synic_event_flags_page;
> > > @@ -542,14 +547,14 @@ int mshv_synic_init(unsigned int cpu)
> > >  	return -EFAULT;
> > >  }
> > >  
> > > -int mshv_synic_cleanup(unsigned int cpu)
> > > +static int mshv_synic_cpu_exit(unsigned int cpu)
> > >  {
> > >  	union hv_synic_sint sint;
> > >  	union hv_synic_simp simp;
> > >  	union hv_synic_siefp siefp;
> > >  	union hv_synic_sirbp sirbp;
> > >  	union hv_synic_scontrol sctrl;
> > > -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > > +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> > >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> > >  	struct hv_synic_event_flags_page **event_flags_page =
> > >  		&spages->synic_event_flags_page;
> > > @@ -663,3 +668,57 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
> > >  
> > >  	mshv_portid_free(doorbell_portid);
> > >  }
> > > +
> > > +static int mshv_synic_reboot_notify(struct notifier_block *nb,
> > > +			      unsigned long code, void *unused)
> > > +{
> > > +	cpuhp_remove_state(synic_cpuhp_online);
> > > +	return 0;
> > > +}
> > > +
> > > +static struct notifier_block mshv_synic_reboot_nb = {
> > > +	.notifier_call = mshv_synic_reboot_notify,
> > > +};
> > > +
> > > +int __init mshv_synic_init(struct device *dev)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	synic_pages = alloc_percpu(struct hv_synic_pages);
> > > +	if (!synic_pages) {
> > > +		dev_err(dev, "Failed to allocate percpu synic page\n");
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> > > +				mshv_synic_cpu_init,
> > > +				mshv_synic_cpu_exit);
> > > +	if (ret < 0) {
> > > +		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
> > > +		goto free_synic_pages;
> > > +	}
> > > +
> > > +	synic_cpuhp_online = ret;
> > > +
> > > +	if (hv_root_partition()) {
> > 
> > Nit: it's probably better to branch in the notifier itself.
> > It will introduce an additional object, but the branching will be in one
> > place instead of two and it will also make the code simpler and easier to
> > read.
> 
> Maybe I introduce mshv_synic_root_partition_init/exit() which will have
> branching inside? Similar to what we did in mshv_root_main.c. That will
> avoid introducing the additional object. But I guess the branch will
> still be in both init and exit functions...
> 

This is a matter of taste, but from my POV, in general, less code is
better. The reboot notifier (or device shutdown) hook is not a hot path.
Also, we will need to do work there for L1VH eventually anyway. So
keeping the distinction in the callback makes more sense to me.
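
To make the suggestion concrete, here is a rough user-space model of
"branch in the callback, register unconditionally". It is a sketch only:
the demo_* names and the single-slot notifier are hypothetical stand-ins
for register_reboot_notifier() and hv_root_partition(), and the counter
stands in for the cpuhp_remove_state() call.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state, for illustration only. */
static bool demo_root_partition;	/* models hv_root_partition() */
static int demo_cpuhp_removed;		/* counts modelled cpuhp teardown */
static int (*demo_reboot_cb)(void);	/* single-slot "notifier chain" */

/* The root vs. L1VH distinction lives here and nowhere else. */
static int demo_synic_reboot_notify(void)
{
	if (!demo_root_partition)
		return 0;	/* L1VH: nothing to do yet */

	demo_cpuhp_removed++;
	return 0;
}

/* Init registers the callback unconditionally, with no branch. */
static void demo_synic_init(void)
{
	demo_reboot_cb = demo_synic_reboot_notify;
}

/* Models the reboot path invoking the registered notifier. */
static void demo_reboot(void)
{
	if (demo_reboot_cb)
		demo_reboot_cb();
}
```

Init and cleanup then stay symmetric with no hv_root_partition() checks,
and any future L1VH work only has to touch the callback.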

Thanks,
Stanislav

> Thanks,
> Anirudh.
> 
> > 
> > Thanks
> > Stanislav.
> > 
> > > +		ret = register_reboot_notifier(&mshv_synic_reboot_nb);
> > > +		if (ret)
> > > +			goto remove_cpuhp_state;
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +remove_cpuhp_state:
> > > +	cpuhp_remove_state(synic_cpuhp_online);
> > > +free_synic_pages:
> > > +	free_percpu(synic_pages);
> > > +	return ret;
> > > +}
> > > +
> > > +void mshv_synic_cleanup(void)
> > > +{
> > > +	if (hv_root_partition())
> > > +		unregister_reboot_notifier(&mshv_synic_reboot_nb);
> > > +	cpuhp_remove_state(synic_cpuhp_online);
> > > +	free_percpu(synic_pages);
> > > +}
> > > -- 
> > > 2.34.1
> > > 

^ permalink raw reply

* Re: [PATCH v3] mshv: Add support for integrated scheduler
From: Anirudh Rayabharam @ 2026-02-03 16:52 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYD553BXAyyeNV6M@skinsburskii.localdomain>

On Mon, Feb 02, 2026 at 11:24:23AM -0800, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 05:47:48PM +0000, Anirudh Rayabharam wrote:
> > On Fri, Jan 30, 2026 at 04:04:14PM +0000, Stanislav Kinsburskii wrote:
> > > Query the hypervisor for integrated scheduler support and use it if
> > > configured.
> > > 
> > > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > > root scheduler allows the root partition to schedule guest vCPUs across
> > > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > > scheduling entirely to the hypervisor.
> > > 
> > > Direct virtualization introduces a new privileged guest partition type - L1
> > > Virtual Host (L1VH) — which can create child partitions from its own
> > > resources. These child partitions are effectively siblings, scheduled by
> > > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > > CFS, and cpuset controllers can still be used, their effectiveness is
> > > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > > (typically round-robin across all allocated physical CPUs). As a result,
> > > the system may appear to "steal" time from the L1VH and its children.
> > > 
> > > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > > This allows an L1VH partition to schedule its own vCPUs and those of its
> > 
> > How could an L1VH partition schedule its own vCPUs?
> > 
> 
> By means of the integrated scheduler. Or, from another perspective,
> the same way as any other root partition does: by placing load on a
> particular core and halting when there is nothing to do.

Maybe it's the terminology that's throwing me off. An L1VH partition
would schedule *guest* vCPUs right? And not its own? Hypervisor will
schedule the vCPUs belonging to an L1VH.

> 
> > > guests across its "physical" cores, effectively emulating root scheduler
> > > behavior within the L1VH, while retaining core scheduler behavior for the
> > > rest of the system.
> > > 
> > > The integrated scheduler is controlled by the root partition and gated by
> > > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > > supports the integrated scheduler. The L1VH partition must then check if it
> > > is enabled by querying the corresponding extended partition property. If
> > > this property is true, the L1VH partition must use the root scheduler
> > > logic; otherwise, it must use the core scheduler. This requirement makes
> > > reading VMM capabilities in L1VH partition a requirement too.
> > > 
> > > Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > >  drivers/hv/mshv_root_main.c |   85 +++++++++++++++++++++++++++----------------
> > >  include/hyperv/hvhdk_mini.h |    7 +++-
> > >  2 files changed, 59 insertions(+), 33 deletions(-)
> > > 
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index 1134a82c7881..6a6bf641b352 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -2053,6 +2053,32 @@ static const char *scheduler_type_to_string(enum hv_scheduler_type type)
> > >  	};
> > >  }
> > >  
> > > +static int __init l1vh_retrive_scheduler_type(enum hv_scheduler_type *out)
> > 
> > typo: retrieve*
> > 
> 
> Thanks, will fix.
> 
> > > +{
> > > +	u64 integrated_sched_enabled;
> > > +	int ret;
> > > +
> > > +	*out = HV_SCHEDULER_TYPE_CORE_SMT;
> > > +
> > > +	if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
> > > +		return 0;
> > > +
> > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > +						HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED,
> > > +						0, &integrated_sched_enabled,
> > > +						sizeof(integrated_sched_enabled));
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	if (integrated_sched_enabled)
> > > +		*out = HV_SCHEDULER_TYPE_ROOT;
> > > +
> > > +	pr_debug("%s: integrated scheduler property read: ret=%d value=%llu\n",
> > > +		 __func__, ret, integrated_sched_enabled);
> > 
> > ret is always 0 here, right? We don't need to bother printing then.
> > 
> 
> Oh yes, good point. Will fix.
> 
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  /* TODO move this to hv_common.c when needed outside */
> > >  static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
> > >  {
> > > @@ -2085,13 +2111,12 @@ static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
> > >  /* Retrieve and stash the supported scheduler type */
> > >  static int __init mshv_retrieve_scheduler_type(struct device *dev)
> > >  {
> > > -	int ret = 0;
> > > +	int ret;
> > >  
> > >  	if (hv_l1vh_partition())
> > > -		hv_scheduler_type = HV_SCHEDULER_TYPE_CORE_SMT;
> > > +		ret = l1vh_retrive_scheduler_type(&hv_scheduler_type);
> > >  	else
> > >  		ret = hv_retrieve_scheduler_type(&hv_scheduler_type);
> > > -
> > >  	if (ret)
> > >  		return ret;
> > >  
> > > @@ -2211,42 +2236,29 @@ struct notifier_block mshv_reboot_nb = {
> > >  static void mshv_root_partition_exit(void)
> > >  {
> > >  	unregister_reboot_notifier(&mshv_reboot_nb);
> > > -	root_scheduler_deinit();
> > >  }
> > >  
> > >  static int __init mshv_root_partition_init(struct device *dev)
> > >  {
> > > -	int err;
> > > -
> > > -	err = root_scheduler_init(dev);
> > > -	if (err)
> > > -		return err;
> > > -
> > > -	err = register_reboot_notifier(&mshv_reboot_nb);
> > > -	if (err)
> > > -		goto root_sched_deinit;
> > > -
> > > -	return 0;
> > > -
> > > -root_sched_deinit:
> > > -	root_scheduler_deinit();
> > > -	return err;
> > > +	return register_reboot_notifier(&mshv_reboot_nb);
> > >  }
> > >  
> > > -static void mshv_init_vmm_caps(struct device *dev)
> > > +static int __init mshv_init_vmm_caps(struct device *dev)
> > >  {
> > > -	/*
> > > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > -	 */
> > > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > -					      0, &mshv_root.vmm_caps,
> > > -					      sizeof(mshv_root.vmm_caps)))
> > > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > > +	int ret;
> > > +
> > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > +						HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > +						0, &mshv_root.vmm_caps,
> > > +						sizeof(mshv_root.vmm_caps));
> > > +	if (ret && hv_l1vh_partition()) {
> > > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > +		return ret;
> > 
> > > I don't think we need to fail here. If there are no VMM caps available,
> > that means integrated scheduler is not supported by the hypervisor, so
> > fall back to core scheduler.
> > 
> 
> I believe we discussed this in a personal conversation earlier.
> Let me know if we need to discuss it further.

So if we want to ensure that we proceed only if vmm caps are available,
we don't need the "&& hv_l1vh_partition()" part?
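
To make the two policies under discussion concrete, here is a rough
userspace model of the L1VH scheduler selection. Everything named model_*
is hypothetical, and -EIO is only a placeholder for whatever error the
hypercall would return. The caps_required flag switches between "fail if
VMM caps cannot be read" and "fall back to the core scheduler".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum model_sched { MODEL_SCHED_CORE_SMT, MODEL_SCHED_ROOT };

/* Hypothetical snapshot of what the hypervisor would report. */
struct model_hv {
	bool caps_query_ok;		/* VMM capabilities query succeeded */
	bool cap_integrated;		/* vmm_enable_integrated_scheduler bit */
	bool prop_query_ok;		/* "scheduler enabled" property query succeeded */
	bool integrated_enabled;	/* value of that property */
};

static int model_l1vh_pick(const struct model_hv *hv, bool caps_required,
			   enum model_sched *out)
{
	assert(hv != NULL && out != NULL);

	*out = MODEL_SCHED_CORE_SMT;	/* default, as in the patch */

	if (!hv->caps_query_ok)
		return caps_required ? -EIO : 0;	/* fail vs. fall back */
	if (!hv->cap_integrated)
		return 0;		/* hypervisor lacks the feature */
	if (!hv->prop_query_ok)
		return -EIO;		/* property read failed */
	if (hv->integrated_enabled)
		*out = MODEL_SCHED_ROOT;
	return 0;
}
```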

Thanks,
Anirudh.

> 
> Thanks,
> Stanislav
> 
> > Thanks,
> > Anirudh
> > 
> > > +	}
> > >  
> > >  	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
> > > +
> > > +	return 0;
> > >  }
> > >  
> > >  static int __init mshv_parent_partition_init(void)
> > > @@ -2292,6 +2304,10 @@ static int __init mshv_parent_partition_init(void)
> > >  
> > >  	mshv_cpuhp_online = ret;
> > >  
> > > +	ret = mshv_init_vmm_caps(dev);
> > > +	if (ret)
> > > +		goto remove_cpu_state;
> > > +
> > >  	ret = mshv_retrieve_scheduler_type(dev);
> > >  	if (ret)
> > >  		goto remove_cpu_state;
> > > @@ -2301,11 +2317,13 @@ static int __init mshv_parent_partition_init(void)
> > >  	if (ret)
> > >  		goto remove_cpu_state;
> > >  
> > > -	mshv_init_vmm_caps(dev);
> > > +	ret = root_scheduler_init(dev);
> > > +	if (ret)
> > > +		goto exit_partition;
> > >  
> > >  	ret = mshv_irqfd_wq_init();
> > >  	if (ret)
> > > -		goto exit_partition;
> > > +		goto deinit_root_scheduler;
> > >  
> > >  	spin_lock_init(&mshv_root.pt_ht_lock);
> > >  	hash_init(mshv_root.pt_htable);
> > > @@ -2314,6 +2332,8 @@ static int __init mshv_parent_partition_init(void)
> > >  
> > >  	return 0;
> > >  
> > > +deinit_root_scheduler:
> > > +	root_scheduler_deinit();
> > >  exit_partition:
> > >  	if (hv_root_partition())
> > >  		mshv_root_partition_exit();
> > > @@ -2332,6 +2352,7 @@ static void __exit mshv_parent_partition_exit(void)
> > >  	mshv_port_table_fini();
> > >  	misc_deregister(&mshv_dev);
> > >  	mshv_irqfd_wq_cleanup();
> > > +	root_scheduler_deinit();
> > >  	if (hv_root_partition())
> > >  		mshv_root_partition_exit();
> > >  	cpuhp_remove_state(mshv_cpuhp_online);
> > > diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> > > index 41a29bf8ec14..c0300910808b 100644
> > > --- a/include/hyperv/hvhdk_mini.h
> > > +++ b/include/hyperv/hvhdk_mini.h
> > > @@ -87,6 +87,9 @@ enum hv_partition_property_code {
> > >  	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
> > >  	HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES		= 0x00010001,
> > >  
> > > +	/* Integrated scheduling properties */
> > > +	HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED	= 0x00020005,
> > > +
> > >  	/* Resource properties */
> > >  	HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING		= 0x00050005,
> > >  	HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION		= 0x00050017,
> > > @@ -102,7 +105,7 @@ enum hv_partition_property_code {
> > >  };
> > >  
> > >  #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT		1
> > > -#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	59
> > > +#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	57
> > >  
> > >  struct hv_partition_property_vmm_capabilities {
> > >  	u16 bank_count;
> > > @@ -119,6 +122,8 @@ struct hv_partition_property_vmm_capabilities {
> > >  			u64 reservedbit3: 1;
> > >  #endif
> > >  			u64 assignable_synthetic_proc_features: 1;
> > > +			u64 reservedbit5: 1;
> > > +			u64 vmm_enable_integrated_scheduler : 1;
> > >  			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
> > >  		} __packed;
> > >  	};
> > > 
> > > 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Anirudh Rayabharam @ 2026-02-03 16:46 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYIW9PhzqmyET8IL@skinsburskii.localdomain>

On Tue, Feb 03, 2026 at 07:40:36AM -0800, Stanislav Kinsburskii wrote:
> On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> > On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > > > > > hypervisor deposited pages.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > > > management is implemented.
> > > > > > > > > > > > 
> > > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > > > > > > > and would work without any issue for L1VH.
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > No, it won't work and hypervisor-deposited pages won't be withdrawn.
> > > > > > > > > > 
> > > > > > > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > > > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > > > > > > right? What other deposited pages would be left?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > > > > > upon guest shutdown) and the other - for the host itself (never
> > > > > > > > > withdrawn).
> > > > > > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > > > > > host partition.
> > > > > > > > 
> > > > > > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > > > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > > > > > then reclaim memory? Would this help with kernel consistency
> > > > > > > > irrespective of userspace behavior?
> > > > > > > > 
> > > > > > > 
> > > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > > > 
> > > > > > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > > > > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > > > > > may still crash.
> > > > > > 
> > > > > > Actually guests won't be running by the time we reach our module_exit
> > > > > > function during a kexec. Userspace processes would've been killed by
> > > > > > then.
> > > > > > 
> > > > > 
> > > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > > We must not rely on OS to do graceful shutdown before doing
> > > > > kexec.
> > > > 
> > > > I see kexec -e is too brutal. Something like systemctl kexec is
> > > > more graceful and is probably used more commonly. In this case at least
> > > > we could register a reboot notifier and attempt to clean things up.
> > > > 
> > > > I think it is better to support kexec to this extent rather than
> > > > disabling it entirely.
> > > > 
> > > 
> > > You do understand that once our kernel is released to third parties, we
> > > can’t control how they will use kexec, right?
> > 
> > Yes, we can't. But that's okay. It is fine for us to say that only some
> > kexec scenarios are supported and some aren't (iff you're creating VMs
> > using MSHV; if you're not creating VMs all of kexec is supported).
> > 
> 
> Well, I disagree here. If we say the kernel supports MSHV, we must
> provide a robust solution. A partially working solution is not
> acceptable. It makes us look careless and can damage our reputation as a
> team (and as a company).

It won't if we call out upfront what is supported and what is not.

> 
> > > 
> > > This is a valid and existing option. We have to account for it. Yet
> > > again, L1VH will be used by arbitrary third parties out there, not just
> > > by us.
> > > 
> > > We can’t say the kernel supports MSHV until we close these gaps. We must
> > 
> > We can. It is okay to say some scenarios are supported and some aren't.
> > 
> > All kexecs are supported if they never create VMs using MSHV. If they do
> > create VMs using MSHV and we implement cleanup in a reboot notifier at
> > least systemctl kexec and crashdump kexec would work, which are probably the
> > most common uses of kexec. It's okay to say that this is all we support
> > as of now.
> > 
> 
> I'm repeating myself, but I'll try to put it differently.
> There won't be any kernel core collected if a page was deposited. You're
> arguing for a lost cause here. Once a page is allocated and deposited,
> the crash kernel will try to write it into the core.

That's why we have to implement something where we attempt to destroy
partitions and reclaim memory (and BUG() out if that fails; which
hopefully should happen very rarely if at all). This should be *the*
solution we work towards. We don't need a temporary disable kexec
solution.
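
As a rough illustration of the flow I have in mind (a userspace model only;
the model_* names are hypothetical, and the rule that host pages can be
withdrawn once no partition is left is an assumption of the model, not a
statement about the hypervisor):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one guest partition and the pages deposited for it. */
struct model_partition {
	bool alive;
	bool destroy_fails;		/* simulate a teardown failure */
	unsigned long deposited_pages;
};

/*
 * Try to destroy every partition and withdraw its pages. Host pages are
 * withdrawn only once no partition is left (an assumption of this model).
 * Returns the number of pages still held by the hypervisor.
 */
static unsigned long model_reclaim_all(struct model_partition *pts, size_t n,
				       unsigned long *host_pages)
{
	unsigned long left = 0;
	bool all_gone = true;
	size_t i;

	assert(pts != NULL && host_pages != NULL);

	for (i = 0; i < n; i++) {
		if (pts[i].alive && !pts[i].destroy_fails) {
			pts[i].alive = false;
			pts[i].deposited_pages = 0;
		}
		if (pts[i].alive)
			all_gone = false;
		left += pts[i].deposited_pages;
	}

	if (all_gone)
		*host_pages = 0;

	return left + *host_pages;
}

/* A kernel version would BUG() here; the model only reports the verdict. */
static bool model_safe_to_kexec(unsigned long pages_left)
{
	return pages_left == 0;
}
```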

> 
> > Also, what makes you think customers would even be interested in enabling
> > our module in their kernel configs if it takes away kexec?
> > 
> 
> It's simple: L1VH isn't a host, so I can spin up new VMs instead of
> servicing the existing ones.

And what about the L2 VM state then? They might not be throwaway in all
cases.

> 
> Why do you think there won’t be customers interested in using MSHV in
> L1VH without kexec support?

Because they could already be using kexec for their servicing needs or
whatever. And no, we can't just say "don't service these VMs, just spin up
new ones".

Also, keep in mind that once L1VH is available in Azure, the distros
that run on it would be the same distros that run on all other Azure
VMs. There won't be special distros with a kernel specifically built for
L1VH. And KEXEC is generally enabled in distros. Distro vendors won't be
happy that they would need to publish a separate version of their image with
MSHV_ROOT enabled and KEXEC disabled because they wouldn't want KEXEC to
be disabled for all Azure VMs. Also, the customers will be confused why
the same distro doesn't work on L1VH.

Thanks,
Anirudh.


^ permalink raw reply

* [PATCH] x86: mshyperv: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-02-03 16:01 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: linux-hyperv, linux-kernel, Florian Bezdeka, RT, Mitchell Levy

From: Jan Kiszka <jan.kiszka@siemens.com>

Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
with related guest support enabled:

[    1.127941] hv_vmbus: registering driver hyperv_drm

[    1.132518] =============================
[    1.132519] [ BUG: Invalid wait context ]
[    1.132521] 6.19.0-rc8+ #9 Not tainted
[    1.132524] -----------------------------
[    1.132525] swapper/0/0 is trying to lock:
[    1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
[    1.132543] other info that might help us debug this:
[    1.132544] context-{2:2}
[    1.132545] 1 lock held by swapper/0/0:
[    1.132547]  #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
[    1.132557] stack backtrace:
[    1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
[    1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
[    1.132567] Call Trace:
[    1.132570]  <IRQ>
[    1.132573]  dump_stack_lvl+0x6e/0xa0
[    1.132581]  __lock_acquire+0xee0/0x21b0
[    1.132592]  lock_acquire+0xd5/0x2d0
[    1.132598]  ? vmbus_chan_sched+0xc4/0x2b0
[    1.132606]  ? lock_acquire+0xd5/0x2d0
[    1.132613]  ? vmbus_chan_sched+0x31/0x2b0
[    1.132619]  rt_spin_lock+0x3f/0x1f0
[    1.132623]  ? vmbus_chan_sched+0xc4/0x2b0
[    1.132629]  ? vmbus_chan_sched+0x31/0x2b0
[    1.132634]  vmbus_chan_sched+0xc4/0x2b0
[    1.132641]  vmbus_isr+0x2c/0x150
[    1.132648]  __sysvec_hyperv_callback+0x5f/0xa0
[    1.132654]  sysvec_hyperv_callback+0x88/0xb0
[    1.132658]  </IRQ>
[    1.132659]  <TASK>
[    1.132660]  asm_sysvec_hyperv_callback+0x1a/0x20

As code paths that handle vmbus IRQs use sleeping locks under PREEMPT_RT,
the complete vmbus_handler execution needs to be moved into thread
context. Open-coding this allows skipping the IPI that irq_work would
additionally bring and which we do not need, since this path always runs
in IRQ context, never in NMI context.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---

This should resolve what was once brought forward via [1]. Whether it
actually resolves all remaining compatibility issues of the hyperv
support with RT is not yet clear, though. So far, lockdep is happy when
using this plus [2].

[1] https://lore.kernel.org/all/20230809-b4-rt_preempt-fix-v1-0-7283bbdc8b14@gmail.com/
[2] https://lore.kernel.org/lkml/0c7fb5cd-fb21-4760-8593-e04bade84744@siemens.com/
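
For readers unfamiliar with the hand-off, the following is a simplified
single-CPU userspace model of the idea. It is a sketch under assumptions,
not the kernel code: the model_* names are hypothetical, there is no real
concurrency, and the thread wake-up is reduced to a flag check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-CPU model; all names are illustrative. */
static bool model_irq_pending;
static int model_handler_runs;
static void (*model_handler)(void);

static void model_vmbus_work(void)
{
	model_handler_runs++;	/* stands in for the real channel processing */
}

/* "Interrupt" side: only mark work pending; never run the handler here. */
static void model_irq(void)
{
	model_irq_pending = true;
}

/* "Thread" side: one pass of the should-run / run loop. */
static bool model_thread_pass(void)
{
	if (!model_irq_pending)
		return false;		/* nothing to do, thread would sleep */

	assert(model_handler != NULL);
	model_handler();		/* same order as the patch: run, then clear */
	model_irq_pending = false;
	return true;
}
```

The point of the split is that the handler body only ever runs from the
thread pass, where sleeping locks are allowed.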

 arch/x86/kernel/cpu/mshyperv.c | 52 ++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 579fb2c64cfd..1194ca452c52 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -17,6 +17,7 @@
 #include <linux/irq.h>
 #include <linux/kexec.h>
 #include <linux/random.h>
+#include <linux/smpboot.h>
 #include <asm/processor.h>
 #include <asm/hypervisor.h>
 #include <hyperv/hvhdk.h>
@@ -150,6 +151,43 @@ static void (*hv_stimer0_handler)(void);
 static void (*hv_kexec_handler)(void);
 static void (*hv_crash_handler)(struct pt_regs *regs);
 
+static DEFINE_PER_CPU(bool, vmbus_irq_pending);
+static DEFINE_PER_CPU(struct task_struct *, vmbus_irqd);
+
+static void vmbus_irqd_wake(void)
+{
+	struct task_struct *tsk = __this_cpu_read(vmbus_irqd);
+
+	__this_cpu_write(vmbus_irq_pending, true);
+	wake_up_process(tsk);
+}
+
+static void vmbus_irqd_setup(unsigned int cpu)
+{
+	sched_set_fifo(current);
+}
+
+static int vmbus_irqd_should_run(unsigned int cpu)
+{
+	return __this_cpu_read(vmbus_irq_pending);
+}
+
+static void run_vmbus_irqd(unsigned int cpu)
+{
+	vmbus_handler();
+	__this_cpu_write(vmbus_irq_pending, false);
+}
+
+static bool vmbus_irq_initialized;
+
+static struct smp_hotplug_thread vmbus_irq_threads = {
+	.store                  = &vmbus_irqd,
+	.setup			= vmbus_irqd_setup,
+	.thread_should_run      = vmbus_irqd_should_run,
+	.thread_fn              = run_vmbus_irqd,
+	.thread_comm            = "vmbus_irq/%u",
+};
+
 DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
@@ -158,8 +196,12 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
 	if (mshv_handler)
 		mshv_handler();
 
-	if (vmbus_handler)
-		vmbus_handler();
+	if (vmbus_handler) {
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			vmbus_irqd_wake();
+		else
+			vmbus_handler();
+	}
 
 	if (ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED)
 		apic_eoi();
@@ -174,6 +216,10 @@ void hv_setup_mshv_handler(void (*handler)(void))
 
 void hv_setup_vmbus_handler(void (*handler)(void))
 {
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !vmbus_irq_initialized) {
+		BUG_ON(smpboot_register_percpu_thread(&vmbus_irq_threads));
+		vmbus_irq_initialized = true;
+	}
 	vmbus_handler = handler;
 }
 
@@ -181,6 +227,8 @@ void hv_remove_vmbus_handler(void)
 {
 	/* We have no way to deallocate the interrupt gate */
 	vmbus_handler = NULL;
+	smpboot_unregister_percpu_thread(&vmbus_irq_threads);
+	vmbus_irq_initialized = false;
 }
 
 /*
-- 
2.51.0

^ permalink raw reply related

* Re: [PATCH 1/1] mshv: Use EPOLLIN and EPOLLHUP instead of POLLIN and POLLHUP
From: Stanislav Kinsburskii @ 2026-02-03 15:44 UTC (permalink / raw)
  To: mhklinux
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260129155154.484671-1-mhklinux@outlook.com>

On Thu, Jan 29, 2026 at 07:51:54AM -0800, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> mshv code currently uses the POLLIN and POLLHUP flags. Starting with
> commit a9a08845e9acb ("vfs: do bulk POLL* -> EPOLL* replacement") the
> intent is to use the EPOLL* versions throughout the kernel.
> 
> The comment at the top of mshv_eventfd.c describes it as being inspired
> by the KVM implementation, which was changed by the above mentioned
> commit in 2018 to use EPOLL*. mshv_eventfd.c is much newer than 2018
> and there's no statement as to why it must use the POLL* versions.
> So change it to use the EPOLL* versions. This change also resolves
> a 'sparse' warning.
> 
> No functional change, and the generated code is the same.
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202601220948.MUTO60W4-lkp@intel.com/
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
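
The "no functional change" claim can be illustrated from userspace, assuming
Linux headers, where the POLL* and EPOLL* bits share numeric values. This is
only a sketch: the kernel side additionally uses the __poll_t type, which
has no userspace equivalent, and the model_* names are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

/* On Linux the userspace POLL* and EPOLL* bits share numeric values. */
_Static_assert((uint32_t)EPOLLIN == (uint32_t)POLLIN, "EPOLLIN != POLLIN");
_Static_assert((uint32_t)EPOLLHUP == (uint32_t)POLLHUP, "EPOLLHUP != POLLHUP");

/* Hypothetical decoded view of a wakeup key. */
struct model_wakeup {
	bool readable;	/* data can be read (EPOLLIN) */
	bool hangup;	/* peer closed (EPOLLHUP) */
};

/* Decode an event mask using the EPOLL* names, as the patch does. */
static struct model_wakeup model_decode(uint32_t events)
{
	struct model_wakeup w = {
		.readable = (events & EPOLLIN) != 0,
		.hangup = (events & EPOLLHUP) != 0,
	};
	return w;
}
```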

> ---
>  drivers/hv/mshv_eventfd.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> index 0b75ff1edb73..dfc8b1092c02 100644
> --- a/drivers/hv/mshv_eventfd.c
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -295,13 +295,13 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
>  {
>  	struct mshv_irqfd *irqfd = container_of(wait, struct mshv_irqfd,
>  						irqfd_wait);
> -	unsigned long flags = (unsigned long)key;
> +	__poll_t flags = key_to_poll(key);
>  	int idx;
>  	unsigned int seq;
>  	struct mshv_partition *pt = irqfd->irqfd_partn;
>  	int ret = 0;
>  
> -	if (flags & POLLIN) {
> +	if (flags & EPOLLIN) {
>  		u64 cnt;
>  
>  		eventfd_ctx_do_read(irqfd->irqfd_eventfd_ctx, &cnt);
> @@ -320,7 +320,7 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
>  		ret = 1;
>  	}
>  
> -	if (flags & POLLHUP) {
> +	if (flags & EPOLLHUP) {
>  		/* The eventfd is closing, detach from the partition */
>  		unsigned long flags;
>  
> @@ -506,7 +506,7 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
>  	 */
>  	events = vfs_poll(fd_file(f), &irqfd->irqfd_polltbl);
>  
> -	if (events & POLLIN)
> +	if (events & EPOLLIN)
>  		mshv_assert_irq_slow(irqfd);
>  
>  	srcu_read_unlock(&pt->pt_irq_srcu, idx);
> -- 
> 2.25.1
> 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-02-03 15:40 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <wnh3ghsxxml32sldkm4qzlzre7nebor3oqtj6i7mlhqj2gwzys@o5w5rpzrhhc4>

On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > > > > hypervisor deposited pages.
> > > > > > > > > > > > 
> > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > > management is implemented.
> > > > > > > > > > > 
> > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > > > > > > and would work without any issue for L1VH.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > No, it won't work and hypervisor-deposited pages won't be withdrawn.
> > > > > > > > > 
> > > > > > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > > > > > right? What other deposited pages would be left?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > > > > upon guest shutdown) and the other - for the host itself (never
> > > > > > > > withdrawn).
> > > > > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > > > > host partition.
> > > > > > > 
> > > > > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > > > > then reclaim memory? Would this help with kernel consistency
> > > > > > > irrespective of userspace behavior?
> > > > > > > 
> > > > > > 
> > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > > 
> > > > > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > > > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > > > > may still crash.
> > > > > 
> > > > > Actually guests won't be running by the time we reach our module_exit
> > > > > function during a kexec. Userspace processes would've been killed by
> > > > > then.
> > > > > 
> > > > 
> > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > We must not rely on OS to do graceful shutdown before doing
> > > > kexec.
> > > 
> > > I see kexec -e is too brutal. Something like systemctl kexec is
> > > more graceful and is probably used more commonly. In this case at least
> > > we could register a reboot notifier and attempt to clean things up.
> > > 
> > > I think it is better to support kexec to this extent rather than
> > > disabling it entirely.
> > > 
> > 
> > You do understand that once our kernel is released to third parties, we
> > can’t control how they will use kexec, right?
> 
> Yes, we can't. But that's okay. It is fine for us to say that only some
> kexec scenarios are supported and some aren't (iff you're creating VMs
> using MSHV; if you're not creating VMs all of kexec is supported).
> 

Well, I disagree here. If we say the kernel supports MSHV, we must
provide a robust solution. A partially working solution is not
acceptable. It makes us look careless and can damage our reputation as a
team (and as a company).

> > 
> > This is a valid and existing option. We have to account for it. Yet
> > again, L1VH will be used by arbitrary third parties out there, not just
> > by us.
> > 
> > We can’t say the kernel supports MSHV until we close these gaps. We must
> 
> We can. It is okay to say some scenarios are supported and some aren't.
> 
> All kexecs are supported if they never create VMs using MSHV. If they do
> create VMs using MSHV and we implement cleanup in a reboot notifier at
> least systemctl kexec and crashdump kexec would work, which are probably the
> most common uses of kexec. It's okay to say that this is all we support
> as of now.
> 

I'm repeating myself, but I'll try to put it differently.
There won't be any kernel core collected if a page was deposited. You're
arguing for a lost cause here. Once a page is allocated and deposited,
the crash kernel will try to write it into the core.

> Also, what makes you think customers would even be interested in enabling
> our module in their kernel configs if it takes away kexec?
> 

It's simple: L1VH isn't a host, so I can spin up new VMs instead of
servicing the existing ones.

Why do you think there won’t be customers interested in using MSHV in
L1VH without kexec support?

Thanks,
Stanislav

> Thanks,
> Anirudh.
> 
> > not depend on user space to keep the kernel safe.
> > 
> > Do you agree?
> > 
> > Thanks,
> > Stanislav
> > 
> > > > 
> > > > > Also, why is this sloppy? Isn't this what module_exit should be
> > > > > doing anyway? If someone unloads our module we should be trying to
> > > > > clean everything up (including killing guests) and reclaim memory.
> > > > > 
> > > > 
> > > > Kexec does not unload modules, but it doesn't really matter even if it
> > > > would.
> > > > There are other means to plug into the reboot flow, but neither of them
> > > > is robust or reliable.
> > > > 
> > > > > In any case, we can BUG() out if we fail to reclaim the memory. That would
> > > > > stop the kexec.
> > > > > 
> > > > 
> > > > By killing the whole system? This is not a good user experience and I
> > > > don't see how this can be justified.
> > > 
> > > It is justified because, as you said, once we reach that failure we can
> > > no longer guarantee integrity. So BUG() makes sense. This BUG() would
> > > cause the system to go for a full reboot and restore integrity.
> > > 
> > > > 
> > > > > This is a better solution than disabling KEXEC outright: our
> > > > > driver would make the best possible effort to make kexec work.
> > > > > 
> > > > 
> > > > How is an unreliable feature leading to potential system crashes better
> > > > than disabling kexec outright?
> > > 
> > > Because there are ways of using the feature reliably. What if someone
> > > has MSHV_ROOT enabled but never starts a VM? (Just because someone has our
> > > driver enabled in the kernel doesn't mean they're using it.) What about crash
> > > dump?
> > > 
> > > It is far better to support some of these scenarios and be unreliable in
> > > some corner cases rather than disabling the feature completely.
> > > 
> > > Also, I'm curious if any other driver in the kernel has ever done this
> > > (force disable KEXEC).
> > > 
> > > > 
> > > > It's a complete opposite story for me: the latter provides a limited,
> > > > but robust functionality, while the former provides an unreliable and
> > > > unpredictable behavior.
> > > > 
> > > > > > 
> > > > > > There are two long-term solutions:
> > > > > >  1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.
> > > > > 
> > > > > I honestly think we should focus efforts on making kexec work rather
> > > > > than finding ways to prevent it.
> > > > > 
> > > > 
> > > > There is no argument about it. But until we have it fixed properly, we
> > > > have two options: either disable kexec or stop claiming we have our
> > > > driver up and ready for external customers. Given the importance of
> > > > this driver for current projects, I believe the better way would be to
> > > > explicitly limit the functionality instead of postponing the
> > > > productization of the driver.
> > > 
> > > It is okay to claim our driver as ready even if it doesn't support all
> > > kexec cases. If we can support the common cases such as crash dump and
> > > maybe kexec based servicing (pretty sure people do systemctl kexec and
> > > not kexec -e for this with proper teardown) we can claim that our driver
> > > is ready for general use.
> > > 
> > > Thanks,
> > > Anirudh.

^ permalink raw reply

* Re: [EXTERNAL] [PATCH] scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
From: Jan Kiszka @ 2026-02-03  6:10 UTC (permalink / raw)
  To: Long Li, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	James E.J. Bottomley, Martin K. Petersen,
	linux-hyperv@vger.kernel.org
  Cc: linux-scsi@vger.kernel.org, Linux Kernel Mailing List,
	Florian Bezdeka, RT, Mitchell Levy
In-Reply-To: <6b4933df-6af2-449c-922b-30ef8fd4c8b8@siemens.com>

On 03.02.26 06:57, Jan Kiszka wrote:
> On 03.02.26 00:47, Long Li wrote:
>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>
>>> This resolves the following splat and lock-up when running with PREEMPT_RT
>>> enabled on Hyper-V:
>>
>> Hi Jan,
>>
>> It's interesting to know the use-case of running a RT kernel over Hyper-V.
>>
>> Can you give an example?
>>
> 
> - functional testing of an RT base image over Hyper-V
> - re-use of a common RT base image, without exploiting RT properties
> 
>> As far as I know, Hyper-V makes no RT guarantees of scheduling VPs for a VM.
> 
> This is well understood and not our goal. We only need the kernel to run
> correctly over Hyper-V with PREEMPT-RT enabled, and that is not the case
> right now.
> 
> Thanks,
> Jan
> 
> PS: Who had the idea to drop a virtual UART from Gen 2 VMs? Early boot
> guest debugging is true fun now...
> 

OK, after some guessing, the patched kernel boots again. So I think I
also fixed the broken vmbus IRQ patch by threading it under RT.
Currently building a kernel inside the VM while lockdep is enabled.
Boot-up and first minutes of building didn't trigger any complaints.
Will share later on.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply

* Re: [EXTERNAL] [PATCH] scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
From: Jan Kiszka @ 2026-02-03  5:57 UTC (permalink / raw)
  To: Long Li, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	James E.J. Bottomley, Martin K. Petersen,
	linux-hyperv@vger.kernel.org
  Cc: linux-scsi@vger.kernel.org, Linux Kernel Mailing List,
	Florian Bezdeka, RT, Mitchell Levy
In-Reply-To: <DS3PR21MB5735CBC7D843174F9CA9039CCE9AA@DS3PR21MB5735.namprd21.prod.outlook.com>

On 03.02.26 00:47, Long Li wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> This resolves the following splat and lock-up when running with PREEMPT_RT
>> enabled on Hyper-V:
> 
> Hi Jan,
> 
> It's interesting to know the use-case of running a RT kernel over Hyper-V.
> 
> Can you give an example?
> 

- functional testing of an RT base image over Hyper-V
- re-use of a common RT base image, without exploiting RT properties

> As far as I know, Hyper-V makes no RT guarantees of scheduling VPs for a VM.

This is well understood and not our goal. We only need the kernel to run
correctly over Hyper-V with PREEMPT-RT enabled, and that is not the case
right now.

Thanks,
Jan

PS: Who had the idea to drop a virtual UART from Gen 2 VMs? Early boot
guest debugging is true fun now...

> 
> Thanks,
> Long
> 
>>
>> [  415.140818] BUG: scheduling while atomic: stress-ng-iomix/1048/0x00000002
>> [  415.140822] INFO: lockdep is turned off.
>> [  415.140823] Modules linked in: intel_rapl_msr intel_rapl_common
>> intel_uncore_frequency_common intel_pmc_core pmt_telemetry
>> pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec
>> ghash_clmulni_intel aesni_intel rapl binfmt_misc nls_ascii nls_cp437 vfat fat
>> snd_pcm hyperv_drm snd_timer drm_client_lib drm_shmem_helper snd sg
>> soundcore drm_kms_helper pcspkr hv_balloon hv_utils evdev joydev drm
>> configfs efi_pstore nfnetlink vsock_loopback
>> vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport
>> vsock vmw_vmci efivarfs autofs4 ext4 crc16 mbcache jbd2 sr_mod sd_mod
>> cdrom hv_storvsc serio_raw hid_generic scsi_transport_fc hid_hyperv
>> scsi_mod hid hv_netvsc hyperv_keyboard scsi_common
>> [  415.140846] Preemption disabled at:
>> [  415.140847] [<ffffffffc0656171>] storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
>> [  415.140854] CPU: 8 UID: 0 PID: 1048 Comm: stress-ng-iomix Not tainted 6.19.0-rc7 #30 PREEMPT_{RT,(full)}
>> [  415.140856] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/04/2024
>> [  415.140857] Call Trace:
>> [  415.140861]  <TASK>
>> [  415.140861]  ? storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
>> [  415.140863]  dump_stack_lvl+0x91/0xb0
>> [  415.140870]  __schedule_bug+0x9c/0xc0
>> [  415.140875]  __schedule+0xdf6/0x1300
>> [  415.140877]  ? rtlock_slowlock_locked+0x56c/0x1980
>> [  415.140879]  ? rcu_is_watching+0x12/0x60
>> [  415.140883]  schedule_rtlock+0x21/0x40
>> [  415.140885]  rtlock_slowlock_locked+0x502/0x1980
>> [  415.140891]  rt_spin_lock+0x89/0x1e0
>> [  415.140893]  hv_ringbuffer_write+0x87/0x2a0
>> [  415.140899]  vmbus_sendpacket_mpb_desc+0xb6/0xe0
>> [  415.140900]  ? rcu_is_watching+0x12/0x60
>> [  415.140902]  storvsc_queuecommand+0x669/0xbe0 [hv_storvsc]
>> [  415.140904]  ? HARDIRQ_verbose+0x10/0x10
>> [  415.140908]  ? __rq_qos_issue+0x28/0x40
>> [  415.140911]  scsi_queue_rq+0x760/0xd80 [scsi_mod]
>> [  415.140926]  __blk_mq_issue_directly+0x4a/0xc0
>> [  415.140928]  blk_mq_issue_direct+0x87/0x2b0
>> [  415.140931]  blk_mq_dispatch_queue_requests+0x120/0x440
>> [  415.140933]  blk_mq_flush_plug_list+0x7a/0x1a0
>> [  415.140935]  __blk_flush_plug+0xf4/0x150
>> [  415.140940]  __submit_bio+0x2b2/0x5c0
>> [  415.140944]  ? submit_bio_noacct_nocheck+0x272/0x360
>> [  415.140946]  submit_bio_noacct_nocheck+0x272/0x360
>> [  415.140951]  ext4_read_bh_lock+0x3e/0x60 [ext4]
>> [  415.140995]  ext4_block_write_begin+0x396/0x650 [ext4]
>> [  415.141018]  ? __pfx_ext4_da_get_block_prep+0x10/0x10 [ext4]
>> [  415.141038]  ext4_da_write_begin+0x1c4/0x350 [ext4]
>> [  415.141060]  generic_perform_write+0x14e/0x2c0
>> [  415.141065]  ext4_buffered_write_iter+0x6b/0x120 [ext4]
>> [  415.141083]  vfs_write+0x2ca/0x570
>> [  415.141087]  ksys_write+0x76/0xf0
>> [  415.141089]  do_syscall_64+0x99/0x1490
>> [  415.141093]  ? rcu_is_watching+0x12/0x60
>> [  415.141095]  ? finish_task_switch.isra.0+0xdf/0x3d0
>> [  415.141097]  ? rcu_is_watching+0x12/0x60
>> [  415.141098]  ? lock_release+0x1f0/0x2a0
>> [  415.141100]  ? rcu_is_watching+0x12/0x60
>> [  415.141101]  ? finish_task_switch.isra.0+0xe4/0x3d0
>> [  415.141103]  ? rcu_is_watching+0x12/0x60
>> [  415.141104]  ? __schedule+0xb34/0x1300
>> [  415.141106]  ? hrtimer_try_to_cancel+0x1d/0x170
>> [  415.141109]  ? do_nanosleep+0x8b/0x160
>> [  415.141111]  ? hrtimer_nanosleep+0x89/0x100
>> [  415.141114]  ? __pfx_hrtimer_wakeup+0x10/0x10
>> [  415.141116]  ? xfd_validate_state+0x26/0x90
>> [  415.141118]  ? rcu_is_watching+0x12/0x60
>> [  415.141120]  ? do_syscall_64+0x1e0/0x1490
>> [  415.141121]  ? do_syscall_64+0x1e0/0x1490
>> [  415.141123]  ? rcu_is_watching+0x12/0x60
>> [  415.141124]  ? do_syscall_64+0x1e0/0x1490
>> [  415.141125]  ? do_syscall_64+0x1e0/0x1490
>> [  415.141127]  ? irqentry_exit+0x140/0x7e0
>> [  415.141129]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> get_cpu() disables preemption while the spinlock hv_ringbuffer_write is using
>> is converted to an rt-mutex under PREEMPT_RT.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>> ---
>>
>> This is likely just the tip of an iceberg, see specifically [1], but if you never start
>> addressing it, it will continue to crash ships, even if those are only on test
>> cruises (we are fully aware that Hyper-V provides no RT guarantees for
>> guests). A pragmatic alternative to that would be a simple
>>
>> config HYPERV
>>     depends on !PREEMPT_RT
>>
>> Please share your thoughts if this fix is worth it, or if we should better stop
>> looking at the next splats that show up after it. We are currently considering to
>> thread some of the hv platform IRQs under PREEMPT_RT as potential next
>> step.
>>
>> TIA!
>>
>> [1]
>> https://lore.kernel.org/all/20230809-b4-rt_preempt-fix-v1-0-7283bbdc8b14@gmail.com/
>>
>>  drivers/scsi/storvsc_drv.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
>> index b43d876747b7..68c837146b9e 100644
>> --- a/drivers/scsi/storvsc_drv.c
>> +++ b/drivers/scsi/storvsc_drv.c
>> @@ -1855,8 +1855,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
>>  	cmd_request->payload_sz = payload_sz;
>>
>>  	/* Invokes the vsc to start an IO */
>> -	ret = storvsc_do_io(dev, cmd_request, get_cpu());
>> -	put_cpu();
>> +	migrate_disable();
>> +	ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
>> +	migrate_enable();
>>
>>  	if (ret)
>>  		scsi_dma_unmap(scmnd);
>> --
>> 2.51.0
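
For anyone less familiar with the RT locking rules, the problem fixed by
the patch above can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not kernel code: every model_*
name is hypothetical, and the single "may this task sleep?" check stands
in for what the scheduler reports as "scheduling while atomic" when a
spinlock_t (a sleeping lock under PREEMPT_RT) is taken with preemption
disabled.

```c
/*
 * Userspace toy model, NOT kernel code. All model_* names are hypothetical
 * stand-ins for the kernel primitives named in the comments.
 */
#include <stdbool.h>

struct model_task {
	int preempt_count;	/* > 0: atomic context, the task must not sleep */
	int migrate_disabled;	/* > 0: pinned to its CPU, but still allowed to sleep */
	int cpu;		/* CPU the task currently runs on */
};

/* Models get_cpu(): disables preemption and returns the current CPU. */
static int model_get_cpu(struct model_task *t)
{
	t->preempt_count++;
	return t->cpu;
}

/* Models put_cpu(): re-enables preemption. */
static void model_put_cpu(struct model_task *t)
{
	t->preempt_count--;
}

/* Models migrate_disable(): pins the task without disabling preemption. */
static void model_migrate_disable(struct model_task *t)
{
	t->migrate_disabled++;
}

/* Models migrate_enable(). */
static void model_migrate_enable(struct model_task *t)
{
	t->migrate_disabled--;
}

/*
 * Models taking a spinlock_t under PREEMPT_RT, where it may sleep.
 * Returns false where the real kernel would complain about
 * scheduling while atomic.
 */
static bool model_rt_spin_lock_ok(const struct model_task *t)
{
	return t->preempt_count == 0;
}

/* Old pattern: the I/O path runs between get_cpu() and put_cpu(). */
bool model_old_path(struct model_task *t)
{
	bool ok;

	(void)model_get_cpu(t);
	ok = model_rt_spin_lock_ok(t);	/* lock taken deep inside the I/O path */
	model_put_cpu(t);
	return ok;
}

/* New pattern: the I/O path runs between migrate_disable()/migrate_enable(). */
bool model_new_path(struct model_task *t)
{
	bool ok;

	model_migrate_disable(t);
	ok = model_rt_spin_lock_ok(t);	/* CPU number is stable, sleeping is fine */
	model_migrate_enable(t);
	return ok;
}
```

In this model the old path fails the check and the new path passes it,
while both leave the counters balanced afterwards. The model says nothing
about the remaining Hyper-V code paths mentioned above.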

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Anirudh Rayabharam @ 2026-02-03  5:04 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYD4gw-1qKYHcnXI@skinsburskii.localdomain>

On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > > > hypervisor deposited pages.
> > > > > > > > > > > 
> > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > management is implemented.
> > > > > > > > > > 
> > > > > > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > > > > > and would work without any issue for L1VH.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > No, it won't work and hypervisor-deposited pages won't be withdrawn.
> > > > > > > > 
> > > > > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > > > > right? What other deposited pages would be left?
> > > > > > > > 
> > > > > > > 
> > > > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > > > upon guest shutdown) and the other for the host itself (never
> > > > > > > withdrawn).
> > > > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > > > host partition.
> > > > > > 
> > > > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > > > then reclaim memory? Would this help with kernel consistency
> > > > > > irrespective of userspace behavior?
> > > > > > 
> > > > > 
> > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > 
> > > > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > > > may still crash.
> > > > 
> > > > Actually guests won't be running by the time we reach our module_exit
> > > > function during a kexec. Userspace processes would've been killed by
> > > > then.
> > > > 
> > > 
> > > No, they will not: "kexec -e" doesn't kill user processes.
> > > We must not rely on OS to do graceful shutdown before doing
> > > kexec.
> > 
> > I see kexec -e is too brutal. Something like systemctl kexec is
> > more graceful and is probably used more commonly. In this case at least
> > we could register a reboot notifier and attempt to clean things up.
> > 
> > I think it is better to support kexec to this extent rather than
> > disabling it entirely.
> > 
> 
> You do understand that once our kernel is released to third parties, we
> can’t control how they will use kexec, right?

Yes, we can't. But that's okay. It is fine for us to say that only some
kexec scenarios are supported and some aren't (iff you're creating VMs
using MSHV; if you're not creating VMs all of kexec is supported).

> 
> This is a valid and existing option. We have to account for it. Yet
> again, L1VH will be used by arbitrary third parties out there, not just
> by us.
> 
> We can’t say the kernel supports MSHV until we close these gaps. We must

We can. It is okay to say some scenarios are supported and some aren't.

All kexecs are supported if they never create VMs using MSHV. If they do
create VMs using MSHV and we implement cleanup in a reboot notifier, at
least systemctl kexec and crashdump kexec would work, and those are probably
the most common uses of kexec. It's okay to say that this is all we support
as of now.

Also, what makes you think customers would even be interested in enabling
our module in their kernel configs if it takes away kexec?

Thanks,
Anirudh.

> not depend on user space to keep the kernel safe.
> 
> Do you agree?
> 
> Thanks,
> Stanislav
> 
> > > 
> > > > Also, why is this sloppy? Isn't this what module_exit should be
> > > > doing anyway? If someone unloads our module we should be trying to
> > > > clean everything up (including killing guests) and reclaim memory.
> > > > 
> > > 
> > > Kexec does not unload modules, but it doesn't really matter even if it
> > > would.
> > > There are other means to plug into the reboot flow, but neither of them
> > > is robust or reliable.
> > > 
> > > > In any case, we can BUG() out if we fail to reclaim the memory. That would
> > > > stop the kexec.
> > > > 
> > > 
> > > By killing the whole system? This is not a good user experience and I
> > > don't see how this can be justified.
> > 
> > It is justified because, as you said, once we reach that failure we can
> > no longer guarantee integrity. So BUG() makes sense. This BUG() would
> > cause the system to go for a full reboot and restore integrity.
> > 
> > > 
> > > > This is a better solution than disabling KEXEC outright: our
> > > > driver would make the best possible effort to make kexec work.
> > > > 
> > > 
> > > How is an unreliable feature leading to potential system crashes better
> > > than disabling kexec outright?
> > 
> > Because there are ways of using the feature reliably. What if someone
> > has MSHV_ROOT enabled but never starts a VM? (Just because someone has our
> > driver enabled in the kernel doesn't mean they're using it.) What about crash
> > dump?
> > 
> > It is far better to support some of these scenarios and be unreliable in
> > some corner cases rather than disabling the feature completely.
> > 
> > Also, I'm curious if any other driver in the kernel has ever done this
> > (force disable KEXEC).
> > 
> > > 
> > > It's a complete opposite story for me: the latter provides a limited,
> > > but robust functionality, while the former provides an unreliable and
> > > unpredictable behavior.
> > > 
> > > > > 
> > > > > There are two long-term solutions:
> > > > >  1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.
> > > > 
> > > > I honestly think we should focus efforts on making kexec work rather
> > > > than finding ways to prevent it.
> > > > 
> > > 
> > > There is no argument about it. But until we have it fixed properly, we
> > > have two options: either disable kexec or stop claiming we have our
> > > driver up and ready for external customers. Given the importance of
> > > this driver for current projects, I believe the better way would be to
> > > explicitly limit the functionality instead of postponing the
> > > productization of the driver.
> > 
> > It is okay to claim our driver as ready even if it doesn't support all
> > kexec cases. If we can support the common cases such as crash dump and
> > maybe kexec based servicing (pretty sure people do systemctl kexec and
> > not kexec -e for this with proper teardown) we can claim that our driver
> > is ready for general use.
> > 
> > Thanks,
> > Anirudh.

^ permalink raw reply

* Re: [PATCH v2 1/2] mshv: refactor synic init and cleanup
From: Anirudh Rayabharam @ 2026-02-03  4:49 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYD15RxUIoGDJCv5@skinsburskii.localdomain>

On Mon, Feb 02, 2026 at 11:07:17AM -0800, Stanislav Kinsburskii wrote:
> On Mon, Feb 02, 2026 at 06:27:05PM +0000, Anirudh Rayabharam wrote:
> > From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > 
> > Rename mshv_synic_init() to mshv_synic_cpu_init() and
> > mshv_synic_cleanup() to mshv_synic_cpu_exit() to better reflect that
> > these functions handle per-cpu synic setup and teardown.
> > 
> > Use mshv_synic_init/cleanup() to perform init/cleanup that is not per-cpu.
> > Move all the synic related setup from mshv_parent_partition_init.
> > 
> > Move the reboot notifier to mshv_synic.c because it currently only
> > operates on the synic cpuhp state.
> > 
> > Move out synic_pages from the global mshv_root since its use is now
> > completely local to mshv_synic.c.
> > 
> > This is in preparation for the next patch which will add more stuff to
> > mshv_synic_init().
> > 
> > No functional change.
> > 
> > Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > ---
> >  drivers/hv/mshv_root.h      |  5 ++-
> >  drivers/hv/mshv_root_main.c | 59 +++++-------------------------
> >  drivers/hv/mshv_synic.c     | 71 +++++++++++++++++++++++++++++++++----
> >  3 files changed, 75 insertions(+), 60 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index 3c1d88b36741..26e0320c8097 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -183,7 +183,6 @@ struct hv_synic_pages {
> >  };
> >  
> >  struct mshv_root {
> > -	struct hv_synic_pages __percpu *synic_pages;
> >  	spinlock_t pt_ht_lock;
> >  	DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
> >  	struct hv_partition_property_vmm_capabilities vmm_caps;
> > @@ -242,8 +241,8 @@ int mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb,
> >  void mshv_unregister_doorbell(u64 partition_id, int doorbell_portid);
> >  
> >  void mshv_isr(void);
> > -int mshv_synic_init(unsigned int cpu);
> > -int mshv_synic_cleanup(unsigned int cpu);
> > +int mshv_synic_init(struct device *dev);
> > +void mshv_synic_cleanup(void);
> >  
> >  static inline bool mshv_partition_encrypted(struct mshv_partition *partition)
> >  {
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index 681b58154d5e..7c1666456e78 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -2035,7 +2035,6 @@ mshv_dev_release(struct inode *inode, struct file *filp)
> >  	return 0;
> >  }
> >  
> > -static int mshv_cpuhp_online;
> >  static int mshv_root_sched_online;
> >  
> >  static const char *scheduler_type_to_string(enum hv_scheduler_type type)
> > @@ -2198,40 +2197,14 @@ root_scheduler_deinit(void)
> >  	free_percpu(root_scheduler_output);
> >  }
> >  
> > -static int mshv_reboot_notify(struct notifier_block *nb,
> > -			      unsigned long code, void *unused)
> > -{
> > -	cpuhp_remove_state(mshv_cpuhp_online);
> > -	return 0;
> > -}
> > -
> 
> Unrelated to the change, but it would be great to get rid of this
> notifier altogether and just do the cleanup in the device shutdown hook.
> This is a cleaner approach as this is a device driver and we do have the
> device in hands.
> Do you think you could make this change a part of this series?

That needs more investigation. The notifier is there because it is
called during kexec. Whether the device shutdown hook also works is
something I need to check. I would prefer it to be a separate patch.

That also makes it easier to argue that this patch indeed has "No
functional changes".

> 
> > -struct notifier_block mshv_reboot_nb = {
> > -	.notifier_call = mshv_reboot_notify,
> > -};
> > -
> >  static void mshv_root_partition_exit(void)
> >  {
> > -	unregister_reboot_notifier(&mshv_reboot_nb);
> >  	root_scheduler_deinit();
> >  }
> >  
> >  static int __init mshv_root_partition_init(struct device *dev)
> >  {
> > -	int err;
> > -
> > -	err = root_scheduler_init(dev);
> > -	if (err)
> > -		return err;
> > -
> > -	err = register_reboot_notifier(&mshv_reboot_nb);
> > -	if (err)
> > -		goto root_sched_deinit;
> > -
> > -	return 0;
> > -
> > -root_sched_deinit:
> > -	root_scheduler_deinit();
> > -	return err;
> > +	return root_scheduler_init(dev);
> >  }
> >  
> 
> This conflicts with the "mshv: Add support for integrated scheduler"
> patch out there.
> Perhaps we should ask Wei to merge that change first.

Sure, I'm okay with that ordering.

> 
> >  static void mshv_init_vmm_caps(struct device *dev)
> > @@ -2276,31 +2249,18 @@ static int __init mshv_parent_partition_init(void)
> >  			MSHV_HV_MAX_VERSION);
> >  	}
> >  
> > -	mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
> > -	if (!mshv_root.synic_pages) {
> > -		dev_err(dev, "Failed to allocate percpu synic page\n");
> > -		ret = -ENOMEM;
> > +	ret = mshv_synic_init(dev);
> > +	if (ret)
> >  		goto device_deregister;
> > -	}
> > -
> > -	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> > -				mshv_synic_init,
> > -				mshv_synic_cleanup);
> > -	if (ret < 0) {
> > -		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
> > -		goto free_synic_pages;
> > -	}
> > -
> > -	mshv_cpuhp_online = ret;
> >  
> >  	ret = mshv_retrieve_scheduler_type(dev);
> >  	if (ret)
> > -		goto remove_cpu_state;
> > +		goto synic_cleanup;
> >  
> >  	if (hv_root_partition())
> >  		ret = mshv_root_partition_init(dev);
> >  	if (ret)
> > -		goto remove_cpu_state;
> > +		goto synic_cleanup;
> >  
> >  	mshv_init_vmm_caps(dev);
> >  
> > @@ -2318,10 +2278,8 @@ static int __init mshv_parent_partition_init(void)
> >  exit_partition:
> >  	if (hv_root_partition())
> >  		mshv_root_partition_exit();
> > -remove_cpu_state:
> > -	cpuhp_remove_state(mshv_cpuhp_online);
> > -free_synic_pages:
> > -	free_percpu(mshv_root.synic_pages);
> > +synic_cleanup:
> > +	mshv_synic_cleanup();
> >  device_deregister:
> >  	misc_deregister(&mshv_dev);
> >  	return ret;
> > @@ -2335,8 +2293,7 @@ static void __exit mshv_parent_partition_exit(void)
> >  	mshv_irqfd_wq_cleanup();
> >  	if (hv_root_partition())
> >  		mshv_root_partition_exit();
> > -	cpuhp_remove_state(mshv_cpuhp_online);
> > -	free_percpu(mshv_root.synic_pages);
> > +	mshv_synic_cleanup();
> >  }
> >  
> >  module_init(mshv_parent_partition_init);
> > diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> > index f8b0337cdc82..98c58755846d 100644
> > --- a/drivers/hv/mshv_synic.c
> > +++ b/drivers/hv/mshv_synic.c
> > @@ -12,11 +12,16 @@
> >  #include <linux/mm.h>
> >  #include <linux/io.h>
> >  #include <linux/random.h>
> > +#include <linux/cpuhotplug.h>
> > +#include <linux/reboot.h>
> >  #include <asm/mshyperv.h>
> >  
> >  #include "mshv_eventfd.h"
> >  #include "mshv.h"
> >  
> > +static int synic_cpuhp_online;
> > +static struct hv_synic_pages __percpu *synic_pages;
> > +
> >  static u32 synic_event_ring_get_queued_port(u32 sint_index)
> >  {
> >  	struct hv_synic_event_ring_page **event_ring_page;
> > @@ -26,7 +31,7 @@ static u32 synic_event_ring_get_queued_port(u32 sint_index)
> >  	u32 message;
> >  	u8 tail;
> >  
> > -	spages = this_cpu_ptr(mshv_root.synic_pages);
> > +	spages = this_cpu_ptr(synic_pages);
> >  	event_ring_page = &spages->synic_event_ring_page;
> >  	synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> >  
> > @@ -393,7 +398,7 @@ mshv_intercept_isr(struct hv_message *msg)
> >  
> >  void mshv_isr(void)
> >  {
> > -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> >  	struct hv_message *msg;
> >  	bool handled;
> > @@ -446,7 +451,7 @@ void mshv_isr(void)
> >  	}
> >  }
> >  
> > -int mshv_synic_init(unsigned int cpu)
> > +static int mshv_synic_cpu_init(unsigned int cpu)
> >  {
> >  	union hv_synic_simp simp;
> >  	union hv_synic_siefp siefp;
> > @@ -455,7 +460,7 @@ int mshv_synic_init(unsigned int cpu)
> >  	union hv_synic_sint sint;
> >  #endif
> >  	union hv_synic_scontrol sctrl;
> > -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> >  	struct hv_synic_event_flags_page **event_flags_page =
> >  			&spages->synic_event_flags_page;
> > @@ -542,14 +547,14 @@ int mshv_synic_init(unsigned int cpu)
> >  	return -EFAULT;
> >  }
> >  
> > -int mshv_synic_cleanup(unsigned int cpu)
> > +static int mshv_synic_cpu_exit(unsigned int cpu)
> >  {
> >  	union hv_synic_sint sint;
> >  	union hv_synic_simp simp;
> >  	union hv_synic_siefp siefp;
> >  	union hv_synic_sirbp sirbp;
> >  	union hv_synic_scontrol sctrl;
> > -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> >  	struct hv_synic_event_flags_page **event_flags_page =
> >  		&spages->synic_event_flags_page;
> > @@ -663,3 +668,57 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
> >  
> >  	mshv_portid_free(doorbell_portid);
> >  }
> > +
> > +static int mshv_synic_reboot_notify(struct notifier_block *nb,
> > +			      unsigned long code, void *unused)
> > +{
> > +	cpuhp_remove_state(synic_cpuhp_online);
> > +	return 0;
> > +}
> > +
> > +static struct notifier_block mshv_synic_reboot_nb = {
> > +	.notifier_call = mshv_synic_reboot_notify,
> > +};
> > +
> > +int __init mshv_synic_init(struct device *dev)
> > +{
> > +	int ret = 0;
> > +
> > +	synic_pages = alloc_percpu(struct hv_synic_pages);
> > +	if (!synic_pages) {
> > +		dev_err(dev, "Failed to allocate percpu synic page\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> > +				mshv_synic_cpu_init,
> > +				mshv_synic_cpu_exit);
> > +	if (ret < 0) {
> > +		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
> > +		goto free_synic_pages;
> > +	}
> > +
> > +	synic_cpuhp_online = ret;
> > +
> > +	if (hv_root_partition()) {
> 
> Nit: it's probably better to branch in the notifier itself.
> It will introduce an additional object, but the branching will be in one
> place instead of two and it will also make the code simpler and easier to
> read.

Maybe I could introduce mshv_synic_root_partition_init/exit(), which would
have the branching inside? Similar to what we did in mshv_root_main.c. That will
avoid introducing the additional object. But I guess the branch will
still be in both init and exit functions...
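
To make the idea concrete, here is a rough userspace sketch of that
shape. It is not kernel code and not a proposed patch: the stub_* names
and the two state variables are hypothetical stand-ins so that the
control flow can be compiled outside the kernel, and the helper names
are shortened from the ones mentioned above. It only shows the callers
invoking the helpers unconditionally while the root-partition check
lives inside them.

```c
/*
 * Userspace sketch, NOT kernel code. The stub_* names are hypothetical
 * stand-ins for hv_root_partition(), register_reboot_notifier(),
 * unregister_reboot_notifier() and the cpuhp state handling.
 */
#include <stdbool.h>

bool stub_is_root_partition;	/* stands in for hv_root_partition() */
int stub_notifier_registered;	/* 1 while the reboot notifier is registered */
int stub_cpuhp_state_live;	/* 1 while the cpuhp state is set up */

static int stub_register_reboot_notifier(void)
{
	stub_notifier_registered++;
	return 0;
}

static void stub_unregister_reboot_notifier(void)
{
	stub_notifier_registered--;
}

/* The root-partition branch lives inside the helper, not in the caller. */
static int synic_root_partition_init(void)
{
	if (!stub_is_root_partition)
		return 0;
	return stub_register_reboot_notifier();
}

static void synic_root_partition_exit(void)
{
	if (!stub_is_root_partition)
		return;
	stub_unregister_reboot_notifier();
}

/* Models the tail of the init path: the helper is called unconditionally. */
int synic_init_sketch(void)
{
	int ret;

	stub_cpuhp_state_live = 1;	/* models cpuhp_setup_state() succeeding */

	ret = synic_root_partition_init();
	if (ret)
		goto remove_cpuhp_state;

	return 0;

remove_cpuhp_state:
	stub_cpuhp_state_live = 0;	/* models cpuhp_remove_state() */
	return ret;
}

/* Models the cleanup path, mirroring init in reverse order. */
void synic_cleanup_sketch(void)
{
	synic_root_partition_exit();
	stub_cpuhp_state_live = 0;
}
```

As noted, the check still appears twice (once per helper), but neither
the init nor the cleanup function branches on the partition type itself.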

Thanks,
Anirudh.

> 
> Thanks
> Stanislav.
> 
> > +		ret = register_reboot_notifier(&mshv_synic_reboot_nb);
> > +		if (ret)
> > +			goto remove_cpuhp_state;
> > +	}
> > +
> > +	return 0;
> > +
> > +remove_cpuhp_state:
> > +	cpuhp_remove_state(synic_cpuhp_online);
> > +free_synic_pages:
> > +	free_percpu(synic_pages);
> > +	return ret;
> > +}
> > +
> > +void mshv_synic_cleanup(void)
> > +{
> > +	if (hv_root_partition())
> > +		unregister_reboot_notifier(&mshv_synic_reboot_nb);
> > +	cpuhp_remove_state(synic_cpuhp_online);
> > +	free_percpu(synic_pages);
> > +}
> > -- 
> > 2.34.1
> > 


* Re: [PATCH v2 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Anirudh Rayabharam @ 2026-02-03  4:40 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYD3XvbrOhH3NNP_@skinsburskii.localdomain>

On Mon, Feb 02, 2026 at 11:13:34AM -0800, Stanislav Kinsburskii wrote:
> On Mon, Feb 02, 2026 at 06:27:06PM +0000, Anirudh Rayabharam wrote:
> > From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > 
> > On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
> > interrupts (SINTs) from the hypervisor for doorbells and intercepts.
> > There is no such vector reserved for arm64.
> > 
> > On arm64, the INTID for SINTs should be in the SGI or PPI range. The
> > hypervisor exposes a virtual device in the ACPI that reserves a
> > PPI for this use. Introduce a platform_driver that binds to this ACPI
> > device and obtains the interrupt vector that can be used for SINTs.
> > 
> > To better unify x86 and arm64 paths, introduce mshv_sint_vector_init() that
> > either registers the platform_driver and obtains the INTID (arm64) or
> > just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).
> > 
> > Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > ---
> >  drivers/hv/mshv_synic.c | 163 ++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 156 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> > index 98c58755846d..de5fee6e9f29 100644
> > --- a/drivers/hv/mshv_synic.c
> > +++ b/drivers/hv/mshv_synic.c
> > @@ -10,17 +10,24 @@
> >  #include <linux/kernel.h>
> >  #include <linux/slab.h>
> >  #include <linux/mm.h>
> > +#include <linux/interrupt.h>
> >  #include <linux/io.h>
> >  #include <linux/random.h>
> >  #include <linux/cpuhotplug.h>
> >  #include <linux/reboot.h>
> >  #include <asm/mshyperv.h>
> > +#include <linux/platform_device.h>
> > +#include <linux/acpi.h>
> >  
> >  #include "mshv_eventfd.h"
> >  #include "mshv.h"
> >  
> >  static int synic_cpuhp_online;
> >  static struct hv_synic_pages __percpu *synic_pages;
> > +static int mshv_sint_vector = -1; /* hwirq for the SynIC SINTs */
> > +#ifndef HYPERVISOR_CALLBACK_VECTOR
> > +static int mshv_sint_irq = -1; /* Linux IRQ for mshv_sint_vector */
> > +#endif
> >  
> >  static u32 synic_event_ring_get_queued_port(u32 sint_index)
> >  {
> > @@ -456,9 +463,7 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> >  	union hv_synic_simp simp;
> >  	union hv_synic_siefp siefp;
> >  	union hv_synic_sirbp sirbp;
> > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> >  	union hv_synic_sint sint;
> > -#endif
> >  	union hv_synic_scontrol sctrl;
> >  	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> > @@ -501,10 +506,13 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> >  
> >  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> >  
> > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > +#ifndef HYPERVISOR_CALLBACK_VECTOR
> > +	enable_percpu_irq(mshv_sint_irq, 0);
> > +#endif
> > +
> >  	/* Enable intercepts */
> >  	sint.as_uint64 = 0;
> > -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > +	sint.vector = mshv_sint_vector;
> >  	sint.masked = false;
> >  	sint.auto_eoi = hv_recommend_using_aeoi();
> >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> > @@ -512,13 +520,12 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> >  
> >  	/* Doorbell SINT */
> >  	sint.as_uint64 = 0;
> > -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > +	sint.vector = mshv_sint_vector;
> >  	sint.masked = false;
> >  	sint.as_intercept = 1;
> >  	sint.auto_eoi = hv_recommend_using_aeoi();
> >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> >  			      sint.as_uint64);
> > -#endif
> >  
> >  	/* Enable global synic bit */
> >  	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> > @@ -573,6 +580,10 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
> >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> >  			      sint.as_uint64);
> >  
> > +#ifndef HYPERVISOR_CALLBACK_VECTOR
> > +	disable_percpu_irq(mshv_sint_irq);
> > +#endif
> > +
> >  	/* Disable Synic's event ring page */
> >  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> >  	sirbp.sirbp_enabled = false;
> > @@ -680,14 +691,149 @@ static struct notifier_block mshv_synic_reboot_nb = {
> >  	.notifier_call = mshv_synic_reboot_notify,
> >  };
> >  
> > +#ifndef HYPERVISOR_CALLBACK_VECTOR
> 
> You have introduced 4 ifdef branches (one around the variable and three
> in mshv_synic_cpu_init) and then you still have a big ifdef branch here.

If it is ifdefs you're counting I could remove the ones around
mshv_sint_irq and then wherever it is used I could do:

	if (mshv_sint_irq != -1)
		<do something...>

This will reduce #ifdefs but will mean one more variable in the
!HYPERVISOR_CALLBACK_VECTOR case (which I believe would be optimized
away anyway).
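
A minimal sketch of that runtime-check variant, as a userspace mock. The
*_stub helpers and the synic_cpu_*_irq wrappers are hypothetical and only
stand in for the per-CPU IRQ calls made from mshv_synic_cpu_init() and
mshv_synic_cpu_exit(); the -1 sentinel is the one used in the patch.

```c
/* Hypothetical stand-ins for the kernel per-CPU IRQ helpers. */
static int enabled_irq = -1;

static void enable_percpu_irq_stub(int irq)
{
	enabled_irq = irq;
}

static void disable_percpu_irq_stub(int irq)
{
	if (enabled_irq == irq)
		enabled_irq = -1;
}

/* Stays -1 when the platform provides a fixed callback vector. */
static int mshv_sint_irq = -1;

/* Runtime check replaces the #ifndef HYPERVISOR_CALLBACK_VECTOR block. */
static void synic_cpu_init_irq(void)
{
	if (mshv_sint_irq != -1)
		enable_percpu_irq_stub(mshv_sint_irq);
}

static void synic_cpu_exit_irq(void)
{
	if (mshv_sint_irq != -1)
		disable_percpu_irq_stub(mshv_sint_irq);
}
```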

> 
> Why is it better than simply introducing two different
> mshv_synic_cpu_init functions and have a single big ifdef instead
> (especially given that this code is arch-specific anyway and thus won't
> bloat the binary)?

There are only two lines of code in mshv_synic_cpu_init that are inside
ifdefs. It doesn't make sense to write the exact same function twice
with just two lines different. It's not a short function, so anybody
reading it would have a pretty hard time figuring out which lines are
different.

The big ifdef here is unavoidable because the whole platform_driver part
has to be implemented.

Other ifdefs are so small that they shouldn't be jarring when reading
the code.

> 
> This will also allow getting rid of the redundant mshv_sint_vector variable
> on x86.

Yeah, but the trade-off (two copies of largely the same
mshv_synic_cpu_init) is horrible. And the variable would be optimized
away anyway.

Thanks,
Anirudh.

> 
> Thanks,
> Stanislav
> 
> > +#ifdef CONFIG_ACPI
> > +static long __percpu *mshv_evt;
> > +
> > +static acpi_status mshv_walk_resources(struct acpi_resource *res, void *ctx)
> > +{
> > +	struct resource r;
> > +
> > +	if (res->type == ACPI_RESOURCE_TYPE_EXTENDED_IRQ) {
> > +		if (!acpi_dev_resource_interrupt(res, 0, &r)) {
> > +			pr_err("Unable to parse MSHV ACPI interrupt\n");
> > +			return AE_ERROR;
> > +		}
> > +		/* ARM64 INTID */
> > +		mshv_sint_vector = res->data.extended_irq.interrupts[0];
> > +		/* Linux IRQ number */
> > +		mshv_sint_irq = r.start;
> > +	}
> > +
> > +	return AE_OK;
> > +}
> > +
> > +static irqreturn_t mshv_percpu_isr(int irq, void *dev_id)
> > +{
> > +	mshv_isr();
> > +	return IRQ_HANDLED;
> > +}
> > +
> > +static int mshv_sint_probe(struct platform_device *pdev)
> > +{
> > +	acpi_status result;
> > +	int ret;
> > +	struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
> > +
> > +	result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
> > +					mshv_walk_resources, NULL);
> > +	if (ACPI_FAILURE(result)) {
> > +		ret = -ENODEV;
> > +		goto out_fail;
> > +	}
> > +
> > +	mshv_evt = alloc_percpu(long);
> > +	if (!mshv_evt) {
> > +		ret = -ENOMEM;
> > +		goto out_fail;
> > +	}
> > +
> > +	ret = request_percpu_irq(mshv_sint_irq, mshv_percpu_isr, "MSHV",
> > +		mshv_evt);
> > +	if (ret)
> > +		goto free_evt;
> > +
> > +	return 0;
> > +
> > +free_evt:
> > +	free_percpu(mshv_evt);
> > +out_fail:
> > +	mshv_sint_vector = -1;
> > +	mshv_sint_irq = -1;
> > +	return ret;
> > +}
> > +
> > +static void mshv_sint_remove(struct platform_device *pdev)
> > +{
> > +	free_percpu_irq(mshv_sint_irq, mshv_evt);
> > +	free_percpu(mshv_evt);
> > +}
> > +#else
> > +static int mshv_sint_probe(struct platform_device *pdev)
> > +{
> > +	return -ENODEV;
> > +}
> > +
> > +static void mshv_sint_remove(struct platform_device *pdev)
> > +{
> > +}
> > +#endif
> > +
> > +static const __maybe_unused struct acpi_device_id mshv_sint_device_ids[] = {
> > +	{"MSFT1003", 0},
> > +	{"", 0},
> > +};
> > +
> > +static struct platform_driver mshv_sint_drv = {
> > +	.probe = mshv_sint_probe,
> > +	.remove = mshv_sint_remove,
> > +	.driver = {
> > +		.name = "mshv_sint",
> > +		.acpi_match_table = ACPI_PTR(mshv_sint_device_ids),
> > +		.probe_type = PROBE_FORCE_SYNCHRONOUS,
> > +	},
> > +};
> > +
> > +static int __init mshv_sint_vector_init(void)
> > +{
> > +	int ret;
> > +
> > +	if (acpi_disabled)
> > +		return -ENODEV;
> > +
> > +	ret = platform_driver_register(&mshv_sint_drv);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (mshv_sint_vector == -1 || mshv_sint_irq == -1) {
> > +		platform_driver_unregister(&mshv_sint_drv);
> > +		return -ENODEV;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void mshv_sint_vector_cleanup(void)
> > +{
> > +	platform_driver_unregister(&mshv_sint_drv);
> > +}
> > +#else /* HYPERVISOR_CALLBACK_VECTOR */
> > +static int __init mshv_sint_vector_init(void)
> > +{
> > +	mshv_sint_vector = HYPERVISOR_CALLBACK_VECTOR;
> > +	return 0;
> > +}
> > +
> > +static void mshv_sint_vector_cleanup(void)
> > +{
> > +}
> > +#endif /* HYPERVISOR_CALLBACK_VECTOR */
> > +
> >  int __init mshv_synic_init(struct device *dev)
> >  {
> >  	int ret = 0;
> >  
> > +	ret = mshv_sint_vector_init();
> > +	if (ret) {
> > +		dev_err(dev, "Failed to get MSHV SINT vector: %i\n", ret);
> > +		return ret;
> > +	}
> > +
> >  	synic_pages = alloc_percpu(struct hv_synic_pages);
> >  	if (!synic_pages) {
> >  		dev_err(dev, "Failed to allocate percpu synic page\n");
> > -		return -ENOMEM;
> > +		ret = -ENOMEM;
> > +		goto sint_vector_cleanup;
> >  	}
> >  
> >  	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> > @@ -712,6 +858,8 @@ int __init mshv_synic_init(struct device *dev)
> >  	cpuhp_remove_state(synic_cpuhp_online);
> >  free_synic_pages:
> >  	free_percpu(synic_pages);
> > +sint_vector_cleanup:
> > +	mshv_sint_vector_cleanup();
> >  	return ret;
> >  }
> >  
> > @@ -721,4 +869,5 @@ void mshv_synic_cleanup(void)
> >  		unregister_reboot_notifier(&mshv_synic_reboot_nb);
> >  	cpuhp_remove_state(synic_cpuhp_online);
> >  	free_percpu(synic_pages);
> > +	mshv_sint_vector_cleanup();
> >  }
> > -- 
> > 2.34.1
> > 


* Re: [PATCH 1/1] mshv: Add comment about huge page mappings in guest physical address space
From: Mukesh R @ 2026-02-03  1:09 UTC (permalink / raw)
  To: Stanislav Kinsburskii, mhkelley58
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYDcLRhxx9wXRXBG@skinsburskii.localdomain>

On 2/2/26 09:17, Stanislav Kinsburskii wrote:
> On Mon, Feb 02, 2026 at 08:51:01AM -0800, mhkelley58@gmail.com wrote:
>> From: Michael Kelley <mhklinux@outlook.com>
>>
>> Huge page mappings in the guest physical address space depend on having
>> matching alignment of the userspace address in the parent partition and
>> of the guest physical address. Add a comment that captures this
>> information. See the link to the mailing list thread.
>>
>> No code or functional change.
>>
>> Link: https://lore.kernel.org/linux-hyperv/aUrC94YvscoqBzh3@skinsburskii.localdomain/T/#m0871d2cae9b297fd397ddb8459e534981307c7dc
>> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
>> ---
>>   drivers/hv/mshv_root_main.c | 14 ++++++++++++++
>>   1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>> index 681b58154d5e..bc738ff4508e 100644
>> --- a/drivers/hv/mshv_root_main.c
>> +++ b/drivers/hv/mshv_root_main.c
>> @@ -1389,6 +1389,20 @@ mshv_partition_ioctl_set_memory(struct mshv_partition *partition,
>>   	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))
>>   		return mshv_unmap_user_memory(partition, mem);
>>   
>> +	/*
>> +	 * If the userspace_addr and the guest physical address (as derived
>> +	 * from the guest_pfn) have the same alignment modulo PMD huge page
>> +	 * size, the MSHV driver can map any PMD huge pages to the guest
>> +	 * physical address space as PMD huge pages. If the alignments do
>> +	 * not match, PMD huge pages must be mapped as single pages in the
>> +	 * guest physical address space. The MSHV driver does not enforce
>> +	 * that the alignments match, and it invokes the hypervisor to set
>> +	 * up correct functional mappings either way. See mshv_chunk_stride().
>> +	 * The caller of the ioctl is responsible for providing userspace_addr
>> +	 * and guest_pfn values with matching alignments if it wants the guest
>> +	 * to get the performance benefits of PMD huge page mappings of its
>> +	 * physical address space to real system memory.
>> +	 */
> 
> Thanks. However, I'd suggest reducing this comment a lot and putting the
> details into the commit message instead. Also, why this place? Why not as
> part of the function description instead, for example?

Fwiw, I also prefer this in the function prologue. IMO, larger comments
belong outside the function rather than inside, except of course in cases
where it has to be that way. This makes functions easier to study.
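
For reference, the alignment condition described in the quoted comment
can be sketched as below. This is only an illustration, assuming 4 KiB
base pages and 2 MiB PMD huge pages; the SKETCH_* constants and the
pmd_alignment_matches() helper are hypothetical names, not part of the
MSHV driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12			/* assumes 4 KiB base pages */
#define SKETCH_PMD_SIZE   (UINT64_C(1) << 21)	/* assumes 2 MiB PMD pages */

/*
 * True when userspace_addr and the guest physical address derived from
 * guest_pfn have the same offset within a PMD-sized region.
 */
static bool pmd_alignment_matches(uint64_t userspace_addr, uint64_t guest_pfn)
{
	uint64_t gpa = guest_pfn << SKETCH_PAGE_SHIFT;
	uint64_t mask = SKETCH_PMD_SIZE - 1;

	return (userspace_addr & mask) == (gpa & mask);
}
```

A caller that wants PMD huge page mappings in the guest would pick
userspace_addr and guest_pfn so that this check holds; per the quoted
comment, the driver sets up working mappings either way.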

Thanks,
-Mukesh



> Thanks,
> Stanislav
> 
>>   	return mshv_map_user_memory(partition, mem);
>>   }
>>   
>> -- 
>> 2.25.1

