* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Mukesh R @ 2026-01-30 2:52 UTC
To: Michael Kelley, Stanislav Kinsburskii
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157EDC69791EF24D5DA8661D491A@SN6PR02MB4157.namprd02.prod.outlook.com>
On 1/28/26 07:53, Michael Kelley wrote:
> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, January 27, 2026 11:56 AM
>> To: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> Cc: kys@microsoft.com; haiyangz@microsoft.com; wei.liu@kernel.org;
>> decui@microsoft.com; longli@microsoft.com; linux-hyperv@vger.kernel.org;
>> linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
>>
>> On 1/27/26 09:47, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
>>>> On 1/26/26 16:21, Stanislav Kinsburskii wrote:
>>>>> On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
>>>>>> On 1/26/26 12:43, Stanislav Kinsburskii wrote:
>>>>>>> On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
>>>>>>>> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
>>>>>>>>> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
>>>>>>>>>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
>>>>>>>>>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
>>>>>>>>>>> runtime and never withdraws them. This creates a fundamental incompatibility
>>>>>>>>>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
>>>>>>>>>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
>>>>>>>>>>> hypervisor deposited pages.
>>>>>>>>>>>
>>>>>>>>>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
>>>>>>>>>>> management is implemented.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/hv/Kconfig | 1 +
>>>>>>>>>>> 1 file changed, 1 insertion(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
>>>>>>>>>>> index 7937ac0cbd0f..cfd4501db0fa 100644
>>>>>>>>>>> --- a/drivers/hv/Kconfig
>>>>>>>>>>> +++ b/drivers/hv/Kconfig
>>>>>>>>>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
>>>>>>>>>>> # e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>>>>>>>>>>> # no particular order, making it impossible to reassemble larger pages
>>>>>>>>>>> depends on PAGE_SIZE_4KB
>>>>>>>>>>> + depends on !KEXEC
>>>>>>>>>>> select EVENTFD
>>>>>>>>>>> select VIRT_XFER_TO_GUEST_WORK
>>>>>>>>>>> select HMM_MIRROR
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
>>>>>>>>>> implying that crash dump might be involved. Or did you test kdump
>>>>>>>>>> and it was fine?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, it will. Crash kexec depends on normal kexec functionality, so it
>>>>>>>>> will be affected as well.
>>>>>>>>
>>>>>>>> So not sure I understand the reason for this patch. We can just block
>>>>>>>> kexec if there are any VMs running, right? Doing this would mean any
>>>>>>>> further development would be without a very important and major feature,
>>>>>>>> right?
>>>>>>>
>>>>>>> This is an option. But until it's implemented and merged, a user mshv
>>>>>>> driver gets into a situation where kexec is broken in a non-obvious way.
>>>>>>> The system may crash at any time after kexec, depending on whether the
>>>>>>> new kernel touches the pages deposited to hypervisor or not. This is a
>>>>>>> bad user experience.
>>>>>>
>>>>>> I understand that. But with this we cannot collect core and debug any
>>>>>> crashes. I was thinking there would be a quick way to prohibit kexec
>>>>>> for update via notifier or some other quick hack. Did you already
>>>>>> explore that and didn't find anything, hence this?
>>>>>>
>>>>>
>>>>> This quick hack you mention isn't quick in the upstream kernel as there
>>>>> is no hook to interrupt kexec process except the live update one.
>>>>
>>>> That's the one we want to interrupt and block right? crash kexec
>>>> is ok and should be allowed. We can document we don't support kexec
>>>> for update for now.
>>>>
>>>>> I sent an RFC for that one but given today's conversation details it
>>>>> won't be accepted as is.
>>>>
>>>> Are you talking about this?
>>>>
>>>> "mshv: Add kexec safety for deposited pages"
>>>>
>>>
>>> Yes.
>>>
>>>>> Making mshv mutually exclusive with kexec is the only viable option for
>>>>> now given time constraints.
>>>>> It is intended to be replaced with proper page lifecycle management in
>>>>> the future.
>>>>
>>>> Yeah, that could take a long time and imo we cannot just disable KEXEC
>>>> completely. What we want is just block kexec for updates from some
>>>> mshv file for now, we can print during boot that kexec for updates is
>>>> not supported on mshv. Hope that makes sense.
>>>>
>>>
>>> The trade-off here is between disabling kexec support and having the
>>> kernel crash after kexec in a non-obvious way. This affects both regular
>>> kexec and crash kexec.
>>
>> crash kexec on baremetal is not affected, hence disabling that
>> doesn't make sense as we can't debug crashes then on bm.
>>
>> Let me think and explore a bit, and if I come up with something, I'll
>> send a patch here. If nothing, then we can do this as last resort.
>>
>> Thanks,
>> -Mukesh
>
> Maybe you've already looked at this, but there's a sysctl parameter
> kernel.kexec_load_limit_reboot that prevents loading a kexec
> kernel for reboot if the value is zero. Separately, there is
> kernel.kexec_load_limit_panic that controls whether a kexec
> kernel can be loaded for kdump purposes.
>
> kernel.kexec_load_limit_reboot defaults to -1, which allows an
> unlimited number of loading a kexec kernel for reboot. But the value
> can be set to zero with this kernel boot line parameter:
>
> sysctl.kernel.kexec_load_limit_reboot=0
>
> Alternatively, the mshv driver initialization could add code along
> the lines of process_sysctl_arg() to open
> /proc/sys/kernel/kexec_load_limit_reboot and write a value of zero.
> Then there's no dependency on setting the kernel boot line.
>
> The downside to either method is that after Linux in the root partition
> is up-and-running, it is possible to change the sysctl to a non-zero value,
> and then load a kexec kernel for reboot. So this approach isn't absolute
> protection against doing a kexec for reboot. But it makes it harder, and
> until there's a mechanism to reclaim the deposited pages, it might be
> a viable compromise to allow kdump to still be used.
Hmm, well... I think I see a much easier way to do this by
just hijacking __kexec_lock. I will resume my normal work tomorrow/Fri,
so let me test it out. If it works, I will send a patch Monday.
Thanks,
-Mukesh
> Just a thought ....
>
> Michael
>
>>
>>
>>> It's a pity we can't apply a quick hack to disable only regular kexec.
>>> However, since crash kexec would hit the same issues, until we have a
>>> proper state transition for deposited pages, the best workaround for now
>>> is to reset the hypervisor state on every kexec, which needs design,
>>> work, and testing.
>>>
>>> Disabling kexec is the only consistent way to handle this in the
>>> upstream kernel at the moment.
>>>
>>> Thanks, Stanislav
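The sysctl approach mentioned above (writing zero to
/proc/sys/kernel/kexec_load_limit_reboot) can also be driven from userspace
early in boot. The following is only a minimal sketch under that assumption:
write_sysctl_value() and block_kexec_reboot_load() are hypothetical helper
names, not anything proposed in this thread, and the in-kernel variant
discussed above would look different.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Path of the sysctl named in the thread. */
#define KEXEC_LOAD_LIMIT_REBOOT "/proc/sys/kernel/kexec_load_limit_reboot"

/*
 * Hypothetical helper: write a value string to a sysctl-style file.
 * Returns 0 on success or a negative errno value on failure.
 */
int write_sysctl_value(const char *path, const char *value)
{
	FILE *f;
	int ret = 0;

	assert(path && value);

	f = fopen(path, "w");
	if (!f)
		return -errno;

	if (fputs(value, f) == EOF)
		ret = -EIO;

	/* fclose() flushes the buffer, so a rejected write can surface here. */
	if (fclose(f) == EOF && !ret)
		ret = -errno;

	return ret;
}

/* Hypothetical wrapper: forbid loading a kexec kernel for reboot. */
int block_kexec_reboot_load(void)
{
	return write_sysctl_value(KEXEC_LOAD_LIMIT_REBOOT, "0\n");
}
```

Calling block_kexec_reboot_load() requires root privileges and a kernel that
exposes this sysctl; on other systems it simply returns a negative errno.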
* Re: [PATCH 2/4] mshv: Introduce hv_deposit_memory helper functions
From: Mukesh R @ 2026-01-30 2:49 UTC
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXqZSKmRdJMc6x5u@skinsburskii.localdomain>
On 1/28/26 15:18, Stanislav Kinsburskii wrote:
> On Tue, Jan 27, 2026 at 11:44:25AM -0800, Mukesh R wrote:
>> On 1/27/26 10:30, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 26, 2026 at 06:06:23PM -0800, Mukesh R wrote:
>>>> On 1/25/26 14:41, Stanislav Kinsburskii wrote:
>>>>> On Fri, Jan 23, 2026 at 04:33:39PM -0800, Mukesh R wrote:
>>>>>> On 1/22/26 17:35, Stanislav Kinsburskii wrote:
>>>>>>> Introduce hv_deposit_memory_node() and hv_deposit_memory() helper
>>>>>>> functions to handle memory deposition with proper error handling.
>>>>>>>
>>>>>>> The new hv_deposit_memory_node() function takes the hypervisor status
>>>>>>> as a parameter and validates it before depositing pages. It checks for
>>>>>>> HV_STATUS_INSUFFICIENT_MEMORY specifically and returns an error for
>>>>>>> unexpected status codes.
>>>>>>>
>>>>>>> This is a precursor patch to new out-of-memory error codes support.
>>>>>>> No functional changes intended.
>>>>>>>
>>>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>>>> ---
>>>>>>> drivers/hv/hv_proc.c | 22 ++++++++++++++++++++--
>>>>>>> drivers/hv/mshv_root_hv_call.c | 25 +++++++++----------------
>>>>>>> drivers/hv/mshv_root_main.c | 3 +--
>>>>>>> include/asm-generic/mshyperv.h | 10 ++++++++++
>>>>>>> 4 files changed, 40 insertions(+), 20 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
>>>>>>> index 80c66d1c74d5..c0c2bfc80d77 100644
>>>>>>> --- a/drivers/hv/hv_proc.c
>>>>>>> +++ b/drivers/hv/hv_proc.c
>>>>>>> @@ -110,6 +110,23 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>>>>>> }
>>>>>>> EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
>>>>>>> +int hv_deposit_memory_node(int node, u64 partition_id,
>>>>>>> + u64 hv_status)
>>>>>>> +{
>>>>>>> + u32 num_pages;
>>>>>>> +
>>>>>>> + switch (hv_result(hv_status)) {
>>>>>>> + case HV_STATUS_INSUFFICIENT_MEMORY:
>>>>>>> + num_pages = 1;
>>>>>>> + break;
>>>>>>> + default:
>>>>>>> + hv_status_err(hv_status, "Unexpected!\n");
>>>>>>> + return -ENOMEM;
>>>>>>> + }
>>>>>>> + return hv_call_deposit_pages(node, partition_id, num_pages);
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(hv_deposit_memory_node);
>>>>>>> +
>>>>>>
>>>>>> Different hypercalls may want to deposit different number of pages in one
>>>>>> shot. As feature evolves, page sizes get mixed, we'd almost need that
>>>>>> flexibility. So, imo, either we just don't do this for now, or add num pages
>>>>>> parameter to be passed down.
>>>>>>
>>>>>
>>>>> What do you mean by "page sizes get mixed"?
>>>>> A helper to deposit num pages already exists: it's
>>>>> hv_call_deposit_pages().
>>>>
>>>> My point, you are removing number of pages, and we may want to keep
>>>> that so one can quickly play around and change them.
>>>>
>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>> - pt_id, 1);
>>>> + ret = hv_deposit_memory(pt_id, status);
>>>>
>>>> For example, in hv_call_initialize_partition() we may realize after
>>>> some analysis that depositing 2 pages or 4 pages is much better.
>>>>
>>>
>>> We have been using this 1-page deposit logic from the beginning. To
>>> change the number of pages, simply replace hv_deposit_memory with
>>> hv_call_deposit_pages and specify the desired number of pages.
>>
>> You could perhaps rename it to hv_deposit_page().
>>
>
> Yes, this would be a good name, but unfortunately we can now receive
> statuses like HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY, where we need to
> deposit at least 8 consecutive pages. There is also another pair of
> status codes for required root pages, even when a guest partition-related
> hypercall is performed (see the next patch for details).
> This new helper is intended to cover all such cases, instead of branching
> for all these different cases in every function.
Got it, thanks.
> Thanks,
> Stanislav
>
>
>>> The proposed approach reduces code duplication and is less error-prone,
>>> as there are multiple error codes to handle. Consolidating the logic
>>> also makes the driver more robust.
>>>
>>>
>>> Thanks, Stanislav
>>>
>>>>> Thanks,
>>>>> Stanislav
>>>>>
>>>>>> Thanks,
>>>>>> -Mukesh
>>>>>>
>>>>>>
>>>>>>
>>>>>>> bool hv_result_oom(u64 status)
>>>>>>> {
>>>>>>> switch (hv_result(status)) {
>>>>>>> @@ -155,7 +172,8 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>>>>>>> }
>>>>>>> break;
>>>>>>> }
>>>>>>> - ret = hv_call_deposit_pages(node, hv_current_partition_id, 1);
>>>>>>> + ret = hv_deposit_memory_node(node, hv_current_partition_id,
>>>>>>> + status);
>>>>>>> } while (!ret);
>>>>>>> return ret;
>>>>>>> @@ -197,7 +215,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>>>>>>> }
>>>>>>> break;
>>>>>>> }
>>>>>>> - ret = hv_call_deposit_pages(node, partition_id, 1);
>>>>>>> + ret = hv_deposit_memory_node(node, partition_id, status);
>>>>>>> } while (!ret);
>>>>>>> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
>>>>>>> index 58c5cbf2e567..06f2bac8039d 100644
>>>>>>> --- a/drivers/hv/mshv_root_hv_call.c
>>>>>>> +++ b/drivers/hv/mshv_root_hv_call.c
>>>>>>> @@ -123,8 +123,7 @@ int hv_call_create_partition(u64 flags,
>>>>>>> break;
>>>>>>> }
>>>>>>> local_irq_restore(irq_flags);
>>>>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>>>> - hv_current_partition_id, 1);
>>>>>>> + ret = hv_deposit_memory(hv_current_partition_id, status);
>>>>>>> } while (!ret);
>>>>>>> return ret;
>>>>>>> @@ -151,7 +150,7 @@ int hv_call_initialize_partition(u64 partition_id)
>>>>>>> ret = hv_result_to_errno(status);
>>>>>>> break;
>>>>>>> }
>>>>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
>>>>>>> + ret = hv_deposit_memory(partition_id, status);
>>>>>>> } while (!ret);
>>>>>>> return ret;
>>>>>>> @@ -465,8 +464,7 @@ int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
>>>>>>> }
>>>>>>> local_irq_restore(flags);
>>>>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>>>> - partition_id, 1);
>>>>>>> + ret = hv_deposit_memory(partition_id, status);
>>>>>>> } while (!ret);
>>>>>>> return ret;
>>>>>>> @@ -525,8 +523,7 @@ int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
>>>>>>> }
>>>>>>> local_irq_restore(flags);
>>>>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>>>> - partition_id, 1);
>>>>>>> + ret = hv_deposit_memory(partition_id, status);
>>>>>>> } while (!ret);
>>>>>>> return ret;
>>>>>>> @@ -573,7 +570,7 @@ static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
>>>>>>> local_irq_restore(flags);
>>>>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
>>>>>>> + ret = hv_deposit_memory(partition_id, status);
>>>>>>> } while (!ret);
>>>>>>> return ret;
>>>>>>> @@ -722,8 +719,7 @@ hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
>>>>>>> ret = hv_result_to_errno(status);
>>>>>>> break;
>>>>>>> }
>>>>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE, port_partition_id, 1);
>>>>>>> -
>>>>>>> + ret = hv_deposit_memory(port_partition_id, status);
>>>>>>> } while (!ret);
>>>>>>> return ret;
>>>>>>> @@ -776,8 +772,7 @@ hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
>>>>>>> ret = hv_result_to_errno(status);
>>>>>>> break;
>>>>>>> }
>>>>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>>>> - connection_partition_id, 1);
>>>>>>> + ret = hv_deposit_memory(connection_partition_id, status);
>>>>>>> } while (!ret);
>>>>>>> return ret;
>>>>>>> @@ -848,8 +843,7 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
>>>>>>> break;
>>>>>>> }
>>>>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>>>> - hv_current_partition_id, 1);
>>>>>>> + ret = hv_deposit_memory(hv_current_partition_id, status);
>>>>>>> } while (!ret);
>>>>>>> return ret;
>>>>>>> @@ -885,8 +879,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
>>>>>>> return ret;
>>>>>>> }
>>>>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>>>> - hv_current_partition_id, 1);
>>>>>>> + ret = hv_deposit_memory(hv_current_partition_id, status);
>>>>>>> if (ret)
>>>>>>> return ret;
>>>>>>> } while (!ret);
>>>>>>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>>>>>>> index f4697497f83e..5fc572e31cd7 100644
>>>>>>> --- a/drivers/hv/mshv_root_main.c
>>>>>>> +++ b/drivers/hv/mshv_root_main.c
>>>>>>> @@ -264,8 +264,7 @@ static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
>>>>>>> if (!hv_result_oom(status))
>>>>>>> ret = hv_result_to_errno(status);
>>>>>>> else
>>>>>>> - ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>>>> - pt_id, 1);
>>>>>>> + ret = hv_deposit_memory(pt_id, status);
>>>>>>> } while (!ret);
>>>>>>> args.status = hv_result(status);
>>>>>>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>>>>>>> index b73352a7fc9e..c8e8976839f8 100644
>>>>>>> --- a/include/asm-generic/mshyperv.h
>>>>>>> +++ b/include/asm-generic/mshyperv.h
>>>>>>> @@ -344,6 +344,7 @@ static inline bool hv_parent_partition(void)
>>>>>>> }
>>>>>>> bool hv_result_oom(u64 status);
>>>>>>> +int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
>>>>>>> int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
>>>>>>> int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
>>>>>>> int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
>>>>>>> @@ -353,6 +354,10 @@ static inline bool hv_root_partition(void) { return false; }
>>>>>>> static inline bool hv_l1vh_partition(void) { return false; }
>>>>>>> static inline bool hv_parent_partition(void) { return false; }
>>>>>>> static inline bool hv_result_oom(u64 status) { return false; }
>>>>>>> +static inline int hv_deposit_memory_node(int node, u64 partition_id, u64 status)
>>>>>>> +{
>>>>>>> + return -EOPNOTSUPP;
>>>>>>> +}
>>>>>>> static inline int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>>>>>> {
>>>>>>> return -EOPNOTSUPP;
>>>>>>> @@ -367,6 +372,11 @@ static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u3
>>>>>>> }
>>>>>>> #endif /* CONFIG_MSHV_ROOT */
>>>>>>> +static inline int hv_deposit_memory(u64 partition_id, u64 status)
>>>>>>> +{
>>>>>>> + return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
>>>>>>> +}
>>>>>>> +
>>>>>>> #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>>>>>>> u8 __init get_vtl(void);
>>>>>>> #else
>>>>>>>
>>>>>>>
>>
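The deposit-and-retry pattern discussed in this thread can be modelled in
plain userspace C. The sketch below is an illustration only: the status
values, struct mock_hv, and all sketch_* / mock_* names are made up for the
example, and the 8-page case is taken from the remark above that
HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY needs at least 8 consecutive pages.
How the follow-up patch actually maps statuses to page counts is not shown
here, so treat that mapping as an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative status values; these are NOT the real hypervisor codes. */
enum sketch_status {
	SKETCH_SUCCESS = 0,
	SKETCH_INSUFFICIENT_MEMORY,
	SKETCH_INSUFFICIENT_CONTIGUOUS_MEMORY,
	SKETCH_INVALID_PARAMETER,
};

/* Mock hypervisor: the call succeeds once enough pages are deposited. */
struct mock_hv {
	uint32_t deposited;		/* pages deposited so far */
	uint32_t needed;		/* pages required before success */
	enum sketch_status shortage;	/* status reported while short */
};

enum sketch_status mock_hypercall(const struct mock_hv *hv)
{
	return hv->deposited >= hv->needed ? SKETCH_SUCCESS : hv->shortage;
}

/*
 * Status-driven deposit helper: the status decides how many pages to
 * deposit, and any status that is not an out-of-memory condition is an
 * error, as in hv_deposit_memory_node() from the patch.
 */
int sketch_deposit_memory(struct mock_hv *hv, enum sketch_status status)
{
	uint32_t num_pages;

	assert(hv);

	switch (status) {
	case SKETCH_INSUFFICIENT_MEMORY:
		num_pages = 1;
		break;
	case SKETCH_INSUFFICIENT_CONTIGUOUS_MEMORY:
		num_pages = 8;	/* assumed from the discussion above */
		break;
	default:
		return -ENOMEM;
	}

	hv->deposited += num_pages;
	return 0;
}

/* Caller-side loop: retry the call until it succeeds or a deposit fails. */
int sketch_call_with_deposit(struct mock_hv *hv)
{
	int ret;

	do {
		enum sketch_status status = mock_hypercall(hv);

		if (status == SKETCH_SUCCESS)
			return 0;
		ret = sketch_deposit_memory(hv, status);
	} while (!ret);

	return ret;
}
```

The point of the design is that callers keep a single retry loop and never
branch on individual status codes; only the helper knows how much to deposit.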
* RE: [PATCH v2] mshv: Add support for integrated scheduler
From: Michael Kelley @ 2026-01-30 1:29 UTC
To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <176971725312.67225.3938191771112866951.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, January 29, 2026 12:08 PM
>
> Microsoft Hypervisor originally provided two schedulers: root and core. The
> root scheduler allows the root partition to schedule guest vCPUs across
> physical cores, supporting both time slicing and CPU affinity (e.g., via
> cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> scheduling entirely to the hypervisor.
>
> Direct virtualization introduces a new privileged guest partition type - L1
> Virtual Host (L1VH) — which can create child partitions from its own
> resources. These child partitions are effectively siblings, scheduled by
> the hypervisor's core scheduler. This prevents the L1VH parent from setting
> affinity or time slicing for its own processes or guest VPs. While cgroups,
> CFS, and cpuset controllers can still be used, their effectiveness is
> unpredictable, as the core scheduler swaps vCPUs according to its own logic
> (typically round-robin across all allocated physical CPUs). As a result,
> the system may appear to "steal" time from the L1VH and its children.
>
> To address this, Microsoft Hypervisor introduces the integrated scheduler.
> This allows an L1VH partition to schedule its own vCPUs and those of its
> guests across its "physical" cores, effectively emulating root scheduler
> behavior within the L1VH, while retaining core scheduler behavior for the
> rest of the system.
>
> The integrated scheduler is controlled by the root partition and gated by
> the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> supports the integrated scheduler. The L1VH partition must then check if it
> is enabled by querying the corresponding extended partition property. If
> this property is true, the L1VH partition must use the root scheduler
> logic; otherwise, it must use the core scheduler. This makes reading VMM
> capabilities in the L1VH partition mandatory as well.
>
> Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_root_main.c | 85 +++++++++++++++++++++++++++----------------
> include/hyperv/hvhdk_mini.h | 7 +++-
> 2 files changed, 59 insertions(+), 33 deletions(-)
>
Looks good.
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
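The decision logic described in the commit message (capability bit first,
then the extended partition property) can be summarised as a small userspace
model. This is a sketch under stated assumptions, not the driver code: the
enum, the callback type, and every sketch_* name are invented for
illustration, and the three query functions stand in for the real hypercall.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum sketch_sched_type {
	SKETCH_SCHED_CORE_SMT,
	SKETCH_SCHED_ROOT,
};

/*
 * Hypothetical query callback standing in for the partition property
 * hypercall: returns 0 and fills *value, or a negative errno.
 */
typedef int (*sketch_query_fn)(uint64_t *value);

int sketch_query_enabled(uint64_t *value)  { *value = 1; return 0; }
int sketch_query_disabled(uint64_t *value) { *value = 0; return 0; }
int sketch_query_fails(uint64_t *value)    { (void)value; return -EIO; }

/*
 * Pick the scheduler logic an L1VH partition should use:
 *  - capability bit clear: core scheduler, no query needed
 *  - capability bit set, property true: root scheduler logic
 *  - capability bit set, property false: core scheduler
 *  - query failure: propagate the error
 */
int sketch_l1vh_scheduler_type(bool cap_integrated_sched,
			       sketch_query_fn query,
			       enum sketch_sched_type *out)
{
	uint64_t enabled = 0;
	int ret;

	assert(out);
	*out = SKETCH_SCHED_CORE_SMT;

	if (!cap_integrated_sched)
		return 0;

	assert(query);
	ret = query(&enabled);
	if (ret)
		return ret;

	if (enabled)
		*out = SKETCH_SCHED_ROOT;

	return 0;
}
```

Because the capability bit gates the property query, a failure to read the
VMM capabilities leaves an L1VH partition unable to choose between the two
scheduler paths.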
* RE: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Michael Kelley @ 2026-01-30 1:24 UTC
To: Stanislav Kinsburskii
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aXuwes2HNf4Og8lW@skinsburskii.localdomain>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, January 29, 2026 11:10 AM
>
> On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
>
> <snip>
>
> > > static int __init mshv_root_partition_init(struct device *dev)
> > > {
> > > int err;
> > >
> > > - err = root_scheduler_init(dev);
> > > - if (err)
> > > - return err;
> > > -
> > > err = register_reboot_notifier(&mshv_reboot_nb);
> > > if (err)
> > > - goto root_sched_deinit;
> > > + return err;
> > >
> > > return 0;
> >
> > This code is now:
> >
> > if (err)
> > return err;
> > return 0;
> >
> > which can be simplified to just:
> >
> > return err;
> >
> > Or drop the local variable 'err' and simplify the entire function to:
> >
> > return register_reboot_notifier(&mshv_reboot_nb);
> >
> > There's a tangential question here: Why is this reboot notifier
> > needed in the first place? All it does is remove the cpuhp state
> > that allocates/frees the per-cpu root_scheduler_input and
> > root_scheduler_output pages. Removing the state will free
> > the pages, but if Linux is rebooting, why bother?
> >
>
> This was originally done to support kexec.
> Here is the original commit message:
>
> mshv: perform synic cleanup during kexec
>
> Register a reboot notifier that performs synic cleanup when a kexec
> is in progress.
>
> One notable issue this commit fixes is one where after a kexec, virtio
> devices are not functional. Linux root partition receives MMIO doorbell
> events in the ring buffer in the SIRB synic page. The hypervisor maintains
> a head pointer where it writes new events into the ring buffer. The root
> partition maintains a tail pointer to read events from the buffer.
>
> Upon kexec reboot, all root data structures are re-initialized and thus the
> tail pointer gets reset to zero. The hypervisor on the other hand still
> retains the pre-kexec head pointer which could be non-zero. This means that
> when the hypervisor writes new events to the ring buffer, the root
> partition looks at the wrong place and doesn't find any events. So, future
> doorbell events never get delivered. As a result, virtqueue kicks never get
> delivered to the host.
>
> When the SIRB page is disabled the hypervisor resets the head pointer.
FWIW, I don't see that commit message anywhere in a public source code
tree. The calls to register/unregister_reboot_notifier() were in the original
introduction of mshv_root_main.c in upstream commit 621191d709b14.
Evidently the code described by that commit message was not submitted
upstream. And of course, the kexec() topic is now being revisited ....
So to clarify: Do you expect that in the future the reboot notifier will be
used for something that really is required for resetting hypervisor state
in the case of a kexec reboot?
>
> > > -root_sched_deinit:
> > > - root_scheduler_deinit();
> > > - return err;
> > > }
> > >
> > > -static void mshv_init_vmm_caps(struct device *dev)
> > > +static int mshv_init_vmm_caps(struct device *dev)
> > > {
> > > - /*
> > > - * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > - * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > - * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > - */
> > > - if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > - HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > - 0, &mshv_root.vmm_caps,
> > > - sizeof(mshv_root.vmm_caps)))
> > > - dev_warn(dev, "Unable to get VMM capabilities\n");
> > > + int ret;
> > > +
> > > + ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > + HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > + 0, &mshv_root.vmm_caps,
> > > + sizeof(mshv_root.vmm_caps));
> > > + if (ret) {
> > > + dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > + return ret;
> > > + }
> >
> > This is a functional change that isn't mentioned in the commit message.
> > Why is it now appropriate to fail instead of treating the VMM capabilities
> > as all disabled? Presumably there are older versions of the hypervisor that
> > don't support the requirements described in the original comment, but
> > perhaps they are no longer relevant?
> >
>
> To fail is now the only option for the L1VH partition. It must discover
> the scheduler type. Without this information, the partition cannot
> operate. The core scheduler logic will not work with an integrated
> scheduler, and vice versa.
>
> And yes, older hypervisor versions do not support L1VH.
That makes sense. Your change in v2 of the patch handles this
nicely. For the non-L1VH case, the v2 behavior is the same as before in
that the init path won't error out on older hypervisors that don't
support the requirements described in the original comment. That's
the case I am concerned about.
Michael
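The head/tail mismatch described in the quoted commit message can be
reproduced with a toy ring buffer. This is a deliberately simplified model
and an assumption-laden sketch: the real SIRB page layout is not shown in
this thread, and struct sketch_ring and the ring_* functions are hypothetical
names used only to show why a stale producer head makes the consumer miss
events after its tail is reset.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 8

struct sketch_ring {
	uint32_t slots[RING_SLOTS];	/* 0 means "empty slot" */
	uint32_t head;			/* producer index (hypervisor side) */
	uint32_t tail;			/* consumer index (root partition side) */
};

/* Producer writes a non-zero event at head and advances. */
void ring_produce(struct sketch_ring *r, uint32_t event)
{
	assert(r && event != 0);
	r->slots[r->head] = event;
	r->head = (r->head + 1) % RING_SLOTS;
}

/* Consumer reads at tail; returns 0 if it finds no event there. */
uint32_t ring_consume(struct sketch_ring *r)
{
	uint32_t event;

	assert(r);
	event = r->slots[r->tail];
	if (!event)
		return 0;

	r->slots[r->tail] = 0;
	r->tail = (r->tail + 1) % RING_SLOTS;
	return event;
}

/* kexec without cleanup: consumer state is reset, producer head is kept. */
void ring_kexec_without_cleanup(struct sketch_ring *r)
{
	r->tail = 0;
}

/* kexec after disabling the page: the producer head is reset as well. */
void ring_kexec_with_cleanup(struct sketch_ring *r)
{
	r->head = 0;
	r->tail = 0;
}
```

After ring_kexec_without_cleanup(), a new event lands at the old head while
the consumer polls slot 0, so ring_consume() keeps returning 0; after
ring_kexec_with_cleanup(), both sides agree on slot 0 again.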
* [PATCH v2] mshv: Add support for integrated scheduler
From: Stanislav Kinsburskii @ 2026-01-29 20:07 UTC
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
Query the hypervisor for integrated scheduler support and use it if
configured.
Microsoft Hypervisor originally provided two schedulers: root and core. The
root scheduler allows the root partition to schedule guest vCPUs across
physical cores, supporting both time slicing and CPU affinity (e.g., via
cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
scheduling entirely to the hypervisor.
Direct virtualization introduces a new privileged guest partition type - L1
Virtual Host (L1VH) — which can create child partitions from its own
resources. These child partitions are effectively siblings, scheduled by
the hypervisor's core scheduler. This prevents the L1VH parent from setting
affinity or time slicing for its own processes or guest VPs. While cgroups,
CFS, and cpuset controllers can still be used, their effectiveness is
unpredictable, as the core scheduler swaps vCPUs according to its own logic
(typically round-robin across all allocated physical CPUs). As a result,
the system may appear to "steal" time from the L1VH and its children.
To address this, Microsoft Hypervisor introduces the integrated scheduler.
This allows an L1VH partition to schedule its own vCPUs and those of its
guests across its "physical" cores, effectively emulating root scheduler
behavior within the L1VH, while retaining core scheduler behavior for the
rest of the system.
The integrated scheduler is controlled by the root partition and gated by
the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
supports the integrated scheduler. The L1VH partition must then check if it
is enabled by querying the corresponding extended partition property. If
this property is true, the L1VH partition must use the root scheduler
logic; otherwise, it must use the core scheduler. This makes reading VMM
capabilities in the L1VH partition mandatory as well.
Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_root_main.c | 85 +++++++++++++++++++++++++++----------------
include/hyperv/hvhdk_mini.h | 7 +++-
2 files changed, 59 insertions(+), 33 deletions(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1134a82c7881..086e455dd889 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2053,6 +2053,32 @@ static const char *scheduler_type_to_string(enum hv_scheduler_type type)
};
}
+static int __init l1vh_retrive_scheduler_type(enum hv_scheduler_type *out)
+{
+ u64 integrated_sched_enabled;
+ int ret;
+
+ *out = HV_SCHEDULER_TYPE_CORE_SMT;
+
+ if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
+ return 0;
+
+ ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
+ HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED,
+ 0, &integrated_sched_enabled,
+ sizeof(integrated_sched_enabled));
+ if (ret)
+ return ret;
+
+ if (integrated_sched_enabled)
+ *out = HV_SCHEDULER_TYPE_ROOT;
+
+ pr_debug("%s: integrated scheduler property read: ret=%d value=%llu\n",
+ __func__, ret, integrated_sched_enabled);
+
+ return 0;
+}
+
/* TODO move this to hv_common.c when needed outside */
static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
{
@@ -2085,13 +2111,12 @@ static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
/* Retrieve and stash the supported scheduler type */
static int __init mshv_retrieve_scheduler_type(struct device *dev)
{
- int ret = 0;
+ int ret;
if (hv_l1vh_partition())
- hv_scheduler_type = HV_SCHEDULER_TYPE_CORE_SMT;
+ ret = l1vh_retrive_scheduler_type(&hv_scheduler_type);
else
ret = hv_retrieve_scheduler_type(&hv_scheduler_type);
-
if (ret)
return ret;
@@ -2211,42 +2236,29 @@ struct notifier_block mshv_reboot_nb = {
static void mshv_root_partition_exit(void)
{
unregister_reboot_notifier(&mshv_reboot_nb);
- root_scheduler_deinit();
}
static int __init mshv_root_partition_init(struct device *dev)
{
- int err;
-
- err = root_scheduler_init(dev);
- if (err)
- return err;
-
- err = register_reboot_notifier(&mshv_reboot_nb);
- if (err)
- goto root_sched_deinit;
-
- return 0;
-
-root_sched_deinit:
- root_scheduler_deinit();
- return err;
+ return register_reboot_notifier(&mshv_reboot_nb);
}
-static void mshv_init_vmm_caps(struct device *dev)
+static int __init mshv_init_vmm_caps(struct device *dev)
{
- /*
- * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
- * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
- * case it's valid to proceed as if all vmm_caps are disabled (zero).
- */
- if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
- HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
- 0, &mshv_root.vmm_caps,
- sizeof(mshv_root.vmm_caps)))
- dev_warn(dev, "Unable to get VMM capabilities\n");
+ int ret;
+
+ ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
+ HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
+ 0, &mshv_root.vmm_caps,
+ sizeof(mshv_root.vmm_caps));
+ if (ret && hv_l1vh_partition()) {
+ dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
+ return ret;
+ }
dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
+
+ return 0;
}
static int __init mshv_parent_partition_init(void)
@@ -2292,6 +2304,10 @@ static int __init mshv_parent_partition_init(void)
mshv_cpuhp_online = ret;
+ ret = mshv_init_vmm_caps(dev);
+ if (ret)
+ goto remove_cpu_state;
+
ret = mshv_retrieve_scheduler_type(dev);
if (ret)
goto remove_cpu_state;
@@ -2301,11 +2317,13 @@ static int __init mshv_parent_partition_init(void)
if (ret)
goto remove_cpu_state;
- mshv_init_vmm_caps(dev);
+ ret = root_scheduler_init(dev);
+ if (ret)
+ goto exit_partition;
ret = mshv_irqfd_wq_init();
if (ret)
- goto exit_partition;
+ goto deinit_root_scheduler;
spin_lock_init(&mshv_root.pt_ht_lock);
hash_init(mshv_root.pt_htable);
@@ -2314,6 +2332,8 @@ static int __init mshv_parent_partition_init(void)
return 0;
+deinit_root_scheduler:
+ root_scheduler_deinit();
exit_partition:
if (hv_root_partition())
mshv_root_partition_exit();
@@ -2332,6 +2352,7 @@ static void __exit mshv_parent_partition_exit(void)
mshv_port_table_fini();
misc_deregister(&mshv_dev);
mshv_irqfd_wq_cleanup();
+ root_scheduler_deinit();
if (hv_root_partition())
mshv_root_partition_exit();
cpuhp_remove_state(mshv_cpuhp_online);
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 41a29bf8ec14..c0300910808b 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -87,6 +87,9 @@ enum hv_partition_property_code {
HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS = 0x00010000,
HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES = 0x00010001,
+ /* Integrated scheduling properties */
+ HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED = 0x00020005,
+
/* Resource properties */
HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING = 0x00050005,
HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION = 0x00050017,
@@ -102,7 +105,7 @@ enum hv_partition_property_code {
};
#define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT 1
-#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 59
+#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 57
struct hv_partition_property_vmm_capabilities {
u16 bank_count;
@@ -119,6 +122,8 @@ struct hv_partition_property_vmm_capabilities {
u64 reservedbit3: 1;
#endif
u64 assignable_synthetic_proc_features: 1;
+ u64 reservedbit5: 1;
+ u64 vmm_enable_integrated_scheduler : 1;
u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
} __packed;
};
* Re: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Stanislav Kinsburskii @ 2026-01-29 19:09 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415767BB59E00442812F47B5D49EA@SN6PR02MB4157.namprd02.prod.outlook.com>
On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> >
> > From: Andreea Pintilie <anpintil@microsoft.com>
> >
> > Query the hypervisor for integrated scheduler support and use it if
> > configured.
> >
> > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > root scheduler allows the root partition to schedule guest vCPUs across
> > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > scheduling entirely to the hypervisor.
> >
> > Direct virtualization introduces a new privileged guest partition type - L1
> > Virtual Host (L1VH) — which can create child partitions from its own
> > resources. These child partitions are effectively siblings, scheduled by
> > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > CFS, and cpuset controllers can still be used, their effectiveness is
> > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > (typically round-robin across all allocated physical CPUs). As a result,
> > the system may appear to "steal" time from the L1VH and its children.
> >
> > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > This allows an L1VH partition to schedule its own vCPUs and those of its
> > guests across its "physical" cores, effectively emulating root scheduler
> > behavior within the L1VH, while retaining core scheduler behavior for the
> > rest of the system.
> >
> > The integrated scheduler is controlled by the root partition and gated by
> > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > supports the integrated scheduler. The L1VH partition must then check if it
> > is enabled by querying the corresponding extended partition property. If
> > this property is true, the L1VH partition must use the root scheduler
> > logic; otherwise, it must use the core scheduler.
> >
> > Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> > drivers/hv/mshv_root_main.c | 79 +++++++++++++++++++++++++++++--------------
> > include/hyperv/hvhdk_mini.h | 6 +++
> > 2 files changed, 58 insertions(+), 27 deletions(-)
> >
<snip>
> > static int __init mshv_root_partition_init(struct device *dev)
> > {
> > int err;
> >
> > - err = root_scheduler_init(dev);
> > - if (err)
> > - return err;
> > -
> > err = register_reboot_notifier(&mshv_reboot_nb);
> > if (err)
> > - goto root_sched_deinit;
> > + return err;
> >
> > return 0;
>
> This code is now:
>
> if (err)
> return err;
> return 0;
>
> which can be simplified to just:
>
> return err;
>
> Or drop the local variable 'err' and simplify the entire function to:
>
> return register_reboot_notifier(&mshv_reboot_nb);
>
> There's a tangential question here: Why is this reboot notifier
> needed in the first place? All it does is remove the cpuhp state
> that allocates/frees the per-cpu root_scheduler_input and
> root_scheduler_output pages. Removing the state will free
> the pages, but if Linux is rebooting, why bother?
>
This was originally done to support kexec.
Here is the original commit message:
mshv: perform synic cleanup during kexec
Register a reboot notifier that performs synic cleanup when a kexec
is in progress.
One notable issue this commit fixes is one where after a kexec, virtio
devices are not functional. Linux root partition receives MMIO doorbell
events in the ring buffer in the SIRB synic page. The hypervisor maintains
a head pointer where it writes new events into the ring buffer. The root
partition maintains a tail pointer to read events from the buffer.
Upon kexec reboot, all root data structures are re-initialized and thus the
tail pointer gets reset to zero. The hypervisor on the other hand still
retains the pre-kexec head pointer which could be non-zero. This means that
when the hypervisor writes new events to the ring buffer, the root
partition looks at the wrong place and doesn't find any events. So, future
doorbell events never get delivered. As a result, virtqueue kicks never get
delivered to the host.
When the SIRB page is disabled the hypervisor resets the head pointer.
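The head/tail mismatch described in that commit message can be reproduced with a toy model. This is only a sketch: the ring layout, the "0 means empty" convention, and all names below are assumptions made for illustration, not the real SIRB page format or any mshv code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 8

/* Toy ring: a slot value of 0 means "empty". */
struct toy_ring {
	uint32_t slot[RING_SLOTS];
	unsigned int head;	/* producer index, owned by the "hypervisor" */
};

/* Producer side: write one event at head and advance. */
static void producer_post(struct toy_ring *r, uint32_t event)
{
	assert(event != 0);	/* 0 is reserved for "empty" in this model */
	r->slot[r->head] = event;
	r->head = (r->head + 1) % RING_SLOTS;
}

/* Consumer side: return the event at *tail, or 0 if that slot is empty. */
static uint32_t consumer_poll(struct toy_ring *r, unsigned int *tail)
{
	uint32_t event = r->slot[*tail];

	if (!event)
		return 0;
	r->slot[*tail] = 0;
	*tail = (*tail + 1) % RING_SLOTS;
	return event;
}

/*
 * Returns 1 if an event posted after a simulated kexec reaches the
 * consumer, 0 if it is missed. reset_producer models disabling the
 * SIRB page before the new kernel starts.
 */
static int event_seen_after_kexec(int reset_producer)
{
	struct toy_ring ring = { 0 };
	unsigned int tail = 0;
	int i;

	/* Pre-kexec traffic: three events posted and consumed. */
	for (i = 1; i <= 3; i++) {
		producer_post(&ring, (uint32_t)i);
		(void)consumer_poll(&ring, &tail);
	}

	if (reset_producer)
		memset(&ring, 0, sizeof(ring));	/* head goes back to 0 */
	tail = 0;	/* the new kernel re-initializes its own state */

	producer_post(&ring, 42);
	return consumer_poll(&ring, &tail) == 42;
}
```

In this model, event_seen_after_kexec(0) misses the event because the producer writes at slot 3 while the consumer polls slot 0. With the producer reset, both sides start from slot 0 again.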
> > -root_sched_deinit:
> > - root_scheduler_deinit();
> > - return err;
> > }
> >
> > -static void mshv_init_vmm_caps(struct device *dev)
> > +static int mshv_init_vmm_caps(struct device *dev)
> > {
> > - /*
> > - * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > - * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > - * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > - */
> > - if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > - HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > - 0, &mshv_root.vmm_caps,
> > - sizeof(mshv_root.vmm_caps)))
> > - dev_warn(dev, "Unable to get VMM capabilities\n");
> > + int ret;
> > +
> > + ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > + HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > + 0, &mshv_root.vmm_caps,
> > + sizeof(mshv_root.vmm_caps));
> > + if (ret) {
> > + dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > + return ret;
> > + }
>
> This is a functional change that isn't mentioned in the commit message.
> Why is it now appropriate to fail instead of treating the VMM capabilities
> as all disabled? Presumably there are older versions of the hypervisor that
> don't support the requirements described in the original comment, but
> perhaps they are no longer relevant?
>
To fail is now the only option for the L1VH partition. It must discover
the scheduler type. Without this information, the partition cannot
operate. The core scheduler logic will not work with an integrated
scheduler, and vice versa.
And yes, older hypervisor versions do not support L1VH.
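A minimal userspace sketch of that policy, assuming the failure is fatal only on L1VH and is still tolerated elsewhere. The hypercall is replaced by plain parameters; hvcall_ret, is_l1vh, reported_caps and the vmm_caps global are hypothetical stand-ins, not driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for mshv_root.vmm_caps (first bank only). */
static uint64_t vmm_caps;

/*
 * hvcall_ret and reported_caps are hypothetical inputs that replace the
 * real hypercall: its return code and the value it would copy back.
 */
static int init_vmm_caps(int hvcall_ret, bool is_l1vh, uint64_t reported_caps)
{
	assert(hvcall_ret <= 0);	/* 0 or a negative errno */

	vmm_caps = 0;
	if (hvcall_ret) {
		/* L1VH cannot pick a scheduler without the capabilities. */
		if (is_l1vh)
			return hvcall_ret;
		/* Otherwise proceed as if all capabilities are disabled. */
		return 0;
	}

	vmm_caps = reported_caps;
	return 0;
}
```

With this shape, a hypervisor that lacks the property still works for a non-L1VH parent partition, while an L1VH partition stops initialization instead of guessing the scheduler type.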
Thanks,
Stanislav
> >
> > dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
> > +
> > + return 0;
> > }
> >
> > static int __init mshv_parent_partition_init(void)
> > @@ -2292,6 +2310,10 @@ static int __init mshv_parent_partition_init(void)
> >
> > mshv_cpuhp_online = ret;
> >
> > + ret = mshv_init_vmm_caps(dev);
> > + if (ret)
> > + goto remove_cpu_state;
> > +
> > ret = mshv_retrieve_scheduler_type(dev);
> > if (ret)
> > goto remove_cpu_state;
> > @@ -2301,11 +2323,13 @@ static int __init mshv_parent_partition_init(void)
> > if (ret)
> > goto remove_cpu_state;
> >
> > - mshv_init_vmm_caps(dev);
> > + ret = root_scheduler_init(dev);
> > + if (ret)
> > + goto exit_partition;
> >
> > ret = mshv_irqfd_wq_init();
> > if (ret)
> > - goto exit_partition;
> > + goto deinit_root_scheduler;
> >
> > spin_lock_init(&mshv_root.pt_ht_lock);
> > hash_init(mshv_root.pt_htable);
> > @@ -2314,6 +2338,8 @@ static int __init mshv_parent_partition_init(void)
> >
> > return 0;
> >
> > +deinit_root_scheduler:
> > + root_scheduler_deinit();
> > exit_partition:
> > if (hv_root_partition())
> > mshv_root_partition_exit();
> > @@ -2332,6 +2358,7 @@ static void __exit mshv_parent_partition_exit(void)
> > mshv_port_table_fini();
> > misc_deregister(&mshv_dev);
> > mshv_irqfd_wq_cleanup();
> > + root_scheduler_deinit();
> > if (hv_root_partition())
> > mshv_root_partition_exit();
> > cpuhp_remove_state(mshv_cpuhp_online);
> > diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> > index aa03616f965b..0f7178fa88a8 100644
> > --- a/include/hyperv/hvhdk_mini.h
> > +++ b/include/hyperv/hvhdk_mini.h
> > @@ -87,6 +87,9 @@ enum hv_partition_property_code {
> > HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS = 0x00010000,
> > HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES = 0x00010001,
> >
> > + /* Integrated scheduling properties */
> > + HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED = 0x00020005,
> > +
> > /* Resource properties */
> > HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING = 0x00050005,
> > HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION = 0x00050017,
> > @@ -102,7 +105,7 @@ enum hv_partition_property_code {
> > };
> >
> > #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT 1
> > -#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 58
> > +#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 57
> >
> > struct hv_partition_property_vmm_capabilities {
> > u16 bank_count;
> > @@ -120,6 +123,7 @@ struct hv_partition_property_vmm_capabilities {
> > #endif
> > u64 assignable_synthetic_proc_features: 1;
> > u64 tag_hv_message_from_child: 1;
> > + u64 vmm_enable_integrated_scheduler : 1;
> > u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
> > } __packed;
> > };
> >
> >
>
* RE: [EXTERNAL] [PATCH rdma-next] MAINTAINERS: Drop RDMA files from Hyper-V section
From: Long Li @ 2026-01-29 17:56 UTC (permalink / raw)
To: Leon Romanovsky, Konstantin Taranov
Cc: linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <20260128-get-maintainers-fix-v1-1-fc5e58ce9f02@nvidia.com>
> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Wednesday, January 28, 2026 1:55 AM
> To: Long Li <longli@microsoft.com>; Konstantin Taranov
> <kotaranov@microsoft.com>
> Cc: linux-rdma@vger.kernel.org; linux-hyperv@vger.kernel.org
> Subject: [EXTERNAL] [PATCH rdma-next] MAINTAINERS: Drop RDMA files from
> Hyper-V section
>
> From: Leon Romanovsky <leonro@nvidia.com>
>
> MAINTAINERS entries are organized by subsystem ownership, and the RDMA
> files belong under drivers/infiniband. Remove the overly broad mana_ib
> entries from the Hyper-V section, and instead add the Hyper-V mailing list
> to CC on mana_ib patches.
>
> This makes get_maintainer.pl behave more sensibly when running it on
> mana_ib patches.
>
> Fixes: 428ca2d4c6aa ("MAINTAINERS: Add Long Li as a Hyper-V maintainer")
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Long Li <longli@microsoft.com>
> ---
> MAINTAINERS | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 12f49de7fe03..d2e3353a1d29 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11739,7 +11739,6 @@ F: arch/x86/kernel/cpu/mshyperv.c
> F: drivers/clocksource/hyperv_timer.c
> F: drivers/hid/hid-hyperv.c
> F: drivers/hv/
> -F: drivers/infiniband/hw/mana/
> F: drivers/input/serio/hyperv-keyboard.c
> F: drivers/iommu/hyperv-iommu.c
> F: drivers/net/ethernet/microsoft/
> @@ -11758,7 +11757,6 @@ F: include/hyperv/hvhdk_mini.h
> F: include/linux/hyperv.h
> F: include/net/mana
> F: include/uapi/linux/hyperv.h
> -F: include/uapi/rdma/mana-abi.h
> F: net/vmw_vsock/hyperv_transport.c
> F: tools/hv/
>
> @@ -17318,6 +17316,7 @@ MICROSOFT MANA RDMA DRIVER
> M: Long Li <longli@microsoft.com>
> M: Konstantin Taranov <kotaranov@microsoft.com>
> L: linux-rdma@vger.kernel.org
> +L: linux-hyperv@vger.kernel.org
> S: Supported
> F: drivers/infiniband/hw/mana/
> F: include/net/mana
>
> ---
> base-commit: a01745ccf7c41043c503546cae7ba7b0ff499d38
> change-id: 20260128-get-maintainers-fix-a9319fc985c8
>
> Best regards,
> --
> Leon Romanovsky <leonro@nvidia.com>
* Re: [PATCH v6 0/7] mshv: Debugfs interface for mshv_root
From: Stanislav Kinsburskii @ 2026-01-29 17:51 UTC (permalink / raw)
To: Nuno Das Neves
Cc: linux-hyperv, linux-kernel, mhklinux, kys, haiyangz, wei.liu,
decui, longli, prapal, mrathor, paekkaladevi
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>
On Wed, Jan 28, 2026 at 10:11:39AM -0800, Nuno Das Neves wrote:
> Expose hypervisor, logical processor, partition, and virtual processor
> statistics via debugfs. These are provided by mapping 'stats' pages via
> hypercall.
>
For the whole series:
Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Patch #1: Update hv_call_map_stats_page() to return success when
> HV_STATS_AREA_PARENT is unavailable, which is the case on some
> hypervisor versions, where it can fall back to HV_STATS_AREA_SELF
> Patch #2: Use struct hv_stats_page pointers instead of void *
> Patch #3: Make mshv_vp_stats_map/unmap() more flexible to use with debugfs
> code
> Patch #4: Always map vp stats page regardless of scheduler, to reuse in
> debugfs
> Patch #5: Change to hv_stats_page definition and
> VpRootDispatchThreadBlocked
> Patch #6: Introduce the definitions needed for the various stats pages
> Patch #7: Add mshv_debugfs.c, and integrate it with the mshv_root driver to
> expose the partition and VP stats.
>
> ---
> Changes in v6:
> - Fix whitespace and other checkpatch issues [Michael]
>
> Changes in v5:
> - Rename hv_counters.c to mshv_debugfs_counters.c [Michael]
> - Clarify unusual inclusion of mshv_debugfs_counters.c with comment. After
> discussion it is still included directly to keep things simple. Including
> arrays with unspecified size via a header means sizeof() cannot be used on
> the array.
> - Error if mshv_debugfs_counters.c is included elsewhere than mshv_debugfs.c
> - Use array index as stats page index to save space [Stanislav]
> - Enforce HV_STATS_AREA_PARENT and SELF fit in NUM_STATS_AREAS with
> static_assert and clarify with comment [Michael]
> - Return to using lp count from hv stats page for mshv_lps_count [Michael]
> - Use nr_cpu_ids instead of num_possible_cpus() [Michael]
> - Set mshv_lps_stats[idx] and the array itself to NULL on unmap and cleanup
> [Michael]
> - Rename HvLogicalProcessors and VpRootDispatchThreadBlocked to Linux style
> - Translate Linux cpu index to vp index via hv_vp_index on partition destroy
> [Michael]
> - Minor formatting cleanups [Michael]
>
> Changes in v4:
> - Put the counters definitions in static arrays in hv_counters.c, instead of
> as enums in hvhdk.h [Michael]
> - Due to the above, add an additional patch (#5) to simplify hv_stats_page,
> and retain the enum definition at the top of mshv_root_main.c for use with
> VpRootDispatchThreadBlocked. That is the only remaining use of the counter
> enum.
> - Due to the above, use num_present_cpus() as the number of LPs to map stats
> pages for - this number shouldn't change at runtime because the hypervisor
> doesn't support hotplug for root partition.
>
> Changes in v3:
> - Add 3 small refactor/cleanup patches (patches 2,3,4) from Stanislav. These
> simplify some of the debugfs code, and fix issues with mapping VP stats on
> L1VH.
> - Fix cleanup of parent stats dentries on module removal (via squashing some
> internal patches into patch #6) [Praveen]
> - Remove unused goto label [Stanislav, kernel bot]
> - Use struct hv_stats_page * instead of void * in mshv_debugfs.c [Stanislav]
> - Remove some redundant variables [Stanislav]
> - Rename debugfs dentry fields for brevity [Stanislav]
> - Use ERR_CAST() for the dentry error pointer returned from
> lp_debugfs_stats_create() [Stanislav]
> - Fix leak of pages allocated for lp stats mappings by storing them in an array
> [Michael]
> - Add comments to clarify PARENT vs SELF usage and edge cases [Michael]
> - Add VpLoadAvg for x86 and print the stat [Michael]
> - Add NUM_STATS_AREAS for array sizing in mshv_debugfs.c [Michael]
>
> Changes in v2:
> - Remove unnecessary pr_debug_once() in patch 1 [Stanislav Kinsburskii]
> - CONFIG_X86 -> CONFIG_X86_64 in patch 2 [Stanislav Kinsburskii]
>
> ---
> Nuno Das Neves (3):
> mshv: Update hv_stats_page definitions
> mshv: Add data for printing stats page counters
> mshv: Add debugfs to view hypervisor statistics
>
> Purna Pavan Chandra Aekkaladevi (1):
> mshv: Ignore second stats page map result failure
>
> Stanislav Kinsburskii (3):
> mshv: Use typed hv_stats_page pointers
> mshv: Improve mshv_vp_stats_map/unmap(), add them to mshv_root.h
> mshv: Always map child vp stats pages regardless of scheduler type
>
> drivers/hv/Makefile | 1 +
> drivers/hv/mshv_debugfs.c | 726 +++++++++++++++++++++++++++++
> drivers/hv/mshv_debugfs_counters.c | 490 +++++++++++++++++++
> drivers/hv/mshv_root.h | 49 +-
> drivers/hv/mshv_root_hv_call.c | 64 ++-
> drivers/hv/mshv_root_main.c | 140 +++---
> include/hyperv/hvhdk.h | 7 +
> 7 files changed, 1412 insertions(+), 65 deletions(-)
> create mode 100644 drivers/hv/mshv_debugfs.c
> create mode 100644 drivers/hv/mshv_debugfs_counters.c
>
> --
> 2.34.1
* RE: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Michael Kelley @ 2026-01-29 17:47 UTC (permalink / raw)
To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <176903495970.166619.12888807009225201668.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
>
> From: Andreea Pintilie <anpintil@microsoft.com>
>
> Query the hypervisor for integrated scheduler support and use it if
> configured.
>
> Microsoft Hypervisor originally provided two schedulers: root and core. The
> root scheduler allows the root partition to schedule guest vCPUs across
> physical cores, supporting both time slicing and CPU affinity (e.g., via
> cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> scheduling entirely to the hypervisor.
>
> Direct virtualization introduces a new privileged guest partition type - L1
> Virtual Host (L1VH) — which can create child partitions from its own
> resources. These child partitions are effectively siblings, scheduled by
> the hypervisor's core scheduler. This prevents the L1VH parent from setting
> affinity or time slicing for its own processes or guest VPs. While cgroups,
> CFS, and cpuset controllers can still be used, their effectiveness is
> unpredictable, as the core scheduler swaps vCPUs according to its own logic
> (typically round-robin across all allocated physical CPUs). As a result,
> the system may appear to "steal" time from the L1VH and its children.
>
> To address this, Microsoft Hypervisor introduces the integrated scheduler.
> This allows an L1VH partition to schedule its own vCPUs and those of its
> guests across its "physical" cores, effectively emulating root scheduler
> behavior within the L1VH, while retaining core scheduler behavior for the
> rest of the system.
>
> The integrated scheduler is controlled by the root partition and gated by
> the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> supports the integrated scheduler. The L1VH partition must then check if it
> is enabled by querying the corresponding extended partition property. If
> this property is true, the L1VH partition must use the root scheduler
> logic; otherwise, it must use the core scheduler.
>
> Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_root_main.c | 79 +++++++++++++++++++++++++++++--------------
> include/hyperv/hvhdk_mini.h | 6 +++
> 2 files changed, 58 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 1134a82c7881..7a36297feea7 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -2053,6 +2053,32 @@ static const char *scheduler_type_to_string(enum hv_scheduler_type type)
> };
> }
>
> +static int __init l1vh_retrive_scheduler_type(enum hv_scheduler_type *out)
> +{
> + size_t root_sched_enabled;
> + int ret;
> +
> + *out = HV_SCHEDULER_TYPE_CORE_SMT;
> +
> + if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
> + return 0;
> +
> + ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> + HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED,
> + 0, &root_sched_enabled,
> + sizeof(root_sched_enabled));
hv_call_get_partition_property_ex() makes a hypercall, and then copies
back the number of bytes indicated by the 4th argument above; i.e.,
sizeof(root_sched_enabled). But using the size of a Linux type (size_t) to
control how much data is copied back from a hypercall seems inappropriate.
There should be a hypervisor-defined size that is copied back, or at worst,
an exactly specified Linux size like u64. By comparison, the use of
hv_call_get_partition_property_ex() in mshv_init_vmm_caps() copies back
sizeof(struct hv_partition_property_vmm_capabilities) bytes, which comes
from hvhdk_mini.h, so that's good.
The naming of root_sched_enabled is a bit of a cognitive dissonance with
getting the INTEGRATED_SCHEDULER_ENABLED property. I'd suggest the
local variable should be named "integrated_sched_enabled". Code in this
function then makes the decision that if the integrated scheduler is enabled,
L1VH partitions should be using the root scheduler (which is what the
commit message describes).
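Both suggestions can be sketched together in a standalone form. This is not the driver code: the hypercall is replaced by a hypothetical mock (struct mock_hv, mock_get_property), the enum values are illustrative, and the 64-bit width of the property is an assumption standing in for whatever size the hypervisor headers define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative values only; the real enum lives in the Hyper-V headers. */
enum hv_scheduler_type {
	HV_SCHEDULER_TYPE_CORE_SMT = 3,
	HV_SCHEDULER_TYPE_ROOT = 4,
};

/* Hypothetical stand-in for what the hypervisor would report. */
struct mock_hv {
	int cap_integrated_sched;	/* vmm_enable_integrated_scheduler bit */
	int query_ret;			/* result of the property hypercall */
	uint64_t property_value;	/* INTEGRATED_SCHEDULER_ENABLED value */
};

/* Mock of the property query: copies back exactly out_size bytes. */
static int mock_get_property(const struct mock_hv *hv, void *out,
			     size_t out_size)
{
	/* Assumed property width; a size_t buffer would vary by arch. */
	assert(out_size == sizeof(hv->property_value));
	if (hv->query_ret)
		return hv->query_ret;
	memcpy(out, &hv->property_value, out_size);
	return 0;
}

/* Mirrors the driver's global of the same name. */
static enum hv_scheduler_type hv_scheduler_type;

static int l1vh_retrieve_scheduler_type(const struct mock_hv *hv,
					enum hv_scheduler_type *out)
{
	uint64_t integrated_sched_enabled = 0;	/* fixed width, not size_t */
	int ret;

	/* Default: the hypervisor's core scheduler handles everything. */
	*out = HV_SCHEDULER_TYPE_CORE_SMT;
	if (!hv->cap_integrated_sched)
		return 0;

	ret = mock_get_property(hv, &integrated_sched_enabled,
				sizeof(integrated_sched_enabled));
	if (ret)
		return ret;

	/* Integrated scheduler enabled: L1VH uses the root scheduler logic. */
	if (integrated_sched_enabled)
		*out = HV_SCHEDULER_TYPE_ROOT;
	return 0;
}
```

Naming the local after the property it holds keeps the final mapping step (integrated scheduler enabled, therefore root scheduler logic) visible as a separate decision.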
> + if (ret)
> + return ret;
> +
> + if (root_sched_enabled)
> + *out = HV_SCHEDULER_TYPE_ROOT;
> +
> + pr_debug("%s: integrated scheduler property read: ret=%d value=%lu\n",
> + __func__, ret, root_sched_enabled);
> +
> + return 0;
> +}
> +
> /* TODO move this to hv_common.c when needed outside */
> static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
> {
> @@ -2085,13 +2111,12 @@ static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
> /* Retrieve and stash the supported scheduler type */
> static int __init mshv_retrieve_scheduler_type(struct device *dev)
> {
> - int ret = 0;
> + int ret;
>
> if (hv_l1vh_partition())
> - hv_scheduler_type = HV_SCHEDULER_TYPE_CORE_SMT;
> + ret = l1vh_retrive_scheduler_type(&hv_scheduler_type);
> else
> ret = hv_retrieve_scheduler_type(&hv_scheduler_type);
> -
> if (ret)
> return ret;
>
> @@ -2211,42 +2236,35 @@ struct notifier_block mshv_reboot_nb = {
> static void mshv_root_partition_exit(void)
> {
> unregister_reboot_notifier(&mshv_reboot_nb);
> - root_scheduler_deinit();
> }
>
> static int __init mshv_root_partition_init(struct device *dev)
> {
> int err;
>
> - err = root_scheduler_init(dev);
> - if (err)
> - return err;
> -
> err = register_reboot_notifier(&mshv_reboot_nb);
> if (err)
> - goto root_sched_deinit;
> + return err;
>
> return 0;
This code is now:
if (err)
return err;
return 0;
which can be simplified to just:
return err;
Or drop the local variable 'err' and simplify the entire function to:
return register_reboot_notifier(&mshv_reboot_nb);
There's a tangential question here: Why is this reboot notifier
needed in the first place? All it does is remove the cpuhp state
that allocates/frees the per-cpu root_scheduler_input and
root_scheduler_output pages. Removing the state will free
the pages, but if Linux is rebooting, why bother?
> -
> -root_sched_deinit:
> - root_scheduler_deinit();
> - return err;
> }
>
> -static void mshv_init_vmm_caps(struct device *dev)
> +static int mshv_init_vmm_caps(struct device *dev)
> {
> - /*
> - * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> - * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> - * case it's valid to proceed as if all vmm_caps are disabled (zero).
> - */
> - if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> - HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> - 0, &mshv_root.vmm_caps,
> - sizeof(mshv_root.vmm_caps)))
> - dev_warn(dev, "Unable to get VMM capabilities\n");
> + int ret;
> +
> + ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> + HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> + 0, &mshv_root.vmm_caps,
> + sizeof(mshv_root.vmm_caps));
> + if (ret) {
> + dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> + return ret;
> + }
This is a functional change that isn't mentioned in the commit message.
Why is it now appropriate to fail instead of treating the VMM capabilities
as all disabled? Presumably there are older versions of the hypervisor that
don't support the requirements described in the original comment, but
perhaps they are no longer relevant?
>
> dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
> +
> + return 0;
> }
>
> static int __init mshv_parent_partition_init(void)
> @@ -2292,6 +2310,10 @@ static int __init mshv_parent_partition_init(void)
>
> mshv_cpuhp_online = ret;
>
> + ret = mshv_init_vmm_caps(dev);
> + if (ret)
> + goto remove_cpu_state;
> +
> ret = mshv_retrieve_scheduler_type(dev);
> if (ret)
> goto remove_cpu_state;
> @@ -2301,11 +2323,13 @@ static int __init mshv_parent_partition_init(void)
> if (ret)
> goto remove_cpu_state;
>
> - mshv_init_vmm_caps(dev);
> + ret = root_scheduler_init(dev);
> + if (ret)
> + goto exit_partition;
>
> ret = mshv_irqfd_wq_init();
> if (ret)
> - goto exit_partition;
> + goto deinit_root_scheduler;
>
> spin_lock_init(&mshv_root.pt_ht_lock);
> hash_init(mshv_root.pt_htable);
> @@ -2314,6 +2338,8 @@ static int __init mshv_parent_partition_init(void)
>
> return 0;
>
> +deinit_root_scheduler:
> + root_scheduler_deinit();
> exit_partition:
> if (hv_root_partition())
> mshv_root_partition_exit();
> @@ -2332,6 +2358,7 @@ static void __exit mshv_parent_partition_exit(void)
> mshv_port_table_fini();
> misc_deregister(&mshv_dev);
> mshv_irqfd_wq_cleanup();
> + root_scheduler_deinit();
> if (hv_root_partition())
> mshv_root_partition_exit();
> cpuhp_remove_state(mshv_cpuhp_online);
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index aa03616f965b..0f7178fa88a8 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -87,6 +87,9 @@ enum hv_partition_property_code {
> HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS = 0x00010000,
> HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES = 0x00010001,
>
> + /* Integrated scheduling properties */
> + HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED = 0x00020005,
> +
> /* Resource properties */
> HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING = 0x00050005,
> HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION = 0x00050017,
> @@ -102,7 +105,7 @@ enum hv_partition_property_code {
> };
>
> #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT 1
> -#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 58
> +#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 57
>
> struct hv_partition_property_vmm_capabilities {
> u16 bank_count;
> @@ -120,6 +123,7 @@ struct hv_partition_property_vmm_capabilities {
> #endif
> u64 assignable_synthetic_proc_features: 1;
> u64 tag_hv_message_from_child: 1;
> + u64 vmm_enable_integrated_scheduler : 1;
> u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
> } __packed;
> };
>
>
* RE: [PATCH 1/2] hyperv: Sync guest VMM capabilities structure with Microsoft Hypervisor ABI
From: Michael Kelley @ 2026-01-29 17:46 UTC (permalink / raw)
To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <176903495416.166619.16629695002971245203.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
>
> From: Andreea Pintilie <anpintil@microsoft.com>
>
> Update the partition VMM capability structure to match the hypervisor
> representation, bringing it up to date. This is a precursor patch for
> Root-on-Core scheduler feature support.
>
> Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> include/hyperv/hvhdk_mini.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index 41a29bf8ec14..aa03616f965b 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -102,7 +102,7 @@ enum hv_partition_property_code {
> };
>
> #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT 1
> -#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 59
> +#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 58
>
> struct hv_partition_property_vmm_capabilities {
> u16 bank_count;
> @@ -119,6 +119,7 @@ struct hv_partition_property_vmm_capabilities {
> u64 reservedbit3: 1;
> #endif
> u64 assignable_synthetic_proc_features: 1;
> + u64 tag_hv_message_from_child: 1;
> u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
> } __packed;
> };
The tag_hv_message_from_child field is not used in the 2nd patch of this
patch set, so it is added but never used. Is it added just to be a placeholder
so that field vmm_enable_integrated_scheduler can be added in the 2nd patch?
If that's the case, I'd suggest dropping this patch, and have the 2nd patch
add a "reservedbit5" field along with vmm_enable_integrated_scheduler.
If later there is a use for tag_hv_message_from_child, the "reservedbit5"
field can be renamed at that time.
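The bit accounting behind that suggestion can be checked with a small standalone sketch. It assumes a single 64-bit capability bank whose named single-bit fields are allocated from bit 0 upward, so that the new flag lands at bit 6; the helper names are hypothetical and not part of hvhdk_mini.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VMM_CAPS_BANK_BITS 64u

/* Width left for reserved0 once named_bits single-bit fields are defined. */
static unsigned int vmm_caps_reserved_width(unsigned int named_bits)
{
	assert(named_bits <= VMM_CAPS_BANK_BITS);
	return VMM_CAPS_BANK_BITS - named_bits;
}

/* Assumed bit position: bits 0-4 already named, bit 5 reserved, bit 6 new. */
#define VMM_CAP_INTEGRATED_SCHED_SHIFT 6

/* Test the integrated-scheduler capability in the raw bank value. */
static bool vmm_cap_integrated_sched(uint64_t bank0)
{
	return (bank0 >> VMM_CAP_INTEGRATED_SCHED_SHIFT) & 1;
}
```

Five named bits leave 59 reserved bits, six leave 58, and seven leave 57. Adding "reservedbit5" together with vmm_enable_integrated_scheduler therefore takes the reserved count from 59 to 57 in one patch and keeps the new flag at the same bit position.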
Michael
* Re: [PATCH 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Stanislav Kinsburskii @ 2026-01-29 17:03 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXrj4-KAxYfuK7k0@anirudh-surface.localdomain>
On Thu, Jan 29, 2026 at 04:36:51AM +0000, Anirudh Rayabharam wrote:
> On Wed, Jan 28, 2026 at 03:03:51PM -0800, Stanislav Kinsburskii wrote:
> > On Wed, Jan 28, 2026 at 04:04:37PM +0000, Anirudh Rayabharam wrote:
> > > From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
<snip>
> >
> > > +static int mshv_irq = -1;
> > > +
> >
> > Should this be a part of the mshv_root structure?
>
> This doesn't need to be globally accessible. It is only used in this file.
> So I guess it doesn't need to be in mshv_root. What do you think?
>
Please, see below.
<snip>
> > > int mshv_synic_cpu_init(unsigned int cpu)
> > > {
> > > union hv_synic_simp simp;
> > > union hv_synic_siefp siefp;
> > > union hv_synic_sirbp sirbp;
> > > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > > union hv_synic_sint sint;
> > > -#endif
> > > union hv_synic_scontrol sctrl;
> > > struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > > struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> > > @@ -496,10 +632,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> > >
> > > hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> > >
> > > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > > + if (mshv_irq != -1)
> > > + enable_percpu_irq(mshv_irq, 0);
> > > +
> >
> > It's better to explicitly separate x86 and arm64 paths with #ifdefs.
> > For example:
> >
> > #ifdef CONFIG_X86_64
> > int setup_cpu_sint() {
> > /* Enable intercepts */
> > sint.as_uint64 = 0;
> > sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > ....
> > }
> > #endif
> > #ifdef CONFIG_ARM64
> > int setup_cpu_sint() {
> > enable_percpu_irq(mshv_irq, 0);
> >
> > /* Enable intercepts */
> > sint.as_uint64 = 0;
> > sint.vector = mshv_interrupt;
> > ....
> > }
> > #endif
>
> This seems unnecessary. We've made the paths that determine
> mshv_interrupt separate. Now we can just use that here.
>
> There is no need to write two copies of
>
> ...
> sint.as_uint64 = 0;
> sint.vector = <whatever>;
> ...
>
> I could do the enable_percpu_irq() inside an ifdef. But do we gain
> anything from it? Won't the compiler optimize the current code as well
> since mshv_irq will always be -1 whenever HYPERVISOR_CALLBACK_VECTOR is
> defined?
>
AFAIU this patch, x86 doesn’t need these variables at all. So it’s better
to separate them completely and explicitly.
Also, this isn’t the only place where ARM-specific logic is added. This
patch adds ARM-specific logic and tries to weave it into the existing
x86 flow.
If it were only one place, that might be OK. But here it happens in
several places. That makes the code harder to read and maintain. It also
makes future extensions more risky (and they will likely follow). The
dependencies are also not obvious. For example, on ARM the interrupt
vector comes from ACPI (at least that’s what the comments say). So it’s
not right to mix this into the common x86 path even if
HYPERVISOR_CALLBACK_VECTOR is an x86-specific define.
It would be much better to keep this ARM-specific logic in separate,
conditionally compiled code. I suggest changing the flow to make this
per-arch logic explicit. It will pay off later.
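One possible shape for this, as a rough userspace sketch only: the field setup is written once in a shared helper, while each architecture gets its own conditionally compiled entry point. All sketch_* names, the SKETCH_ARCH_ARM64 switch, and the vector value are assumptions for illustration, not code from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for union hv_synic_sint; only two fields are modelled. */
struct sketch_sint {
	uint64_t vector;
	int masked;
};

/* Shared by both architectures, so the field setup is written once. */
static void sketch_fill_sint(struct sketch_sint *sint, uint64_t vector)
{
	sint->vector = vector;
	sint->masked = 0;
}

#ifdef SKETCH_ARCH_ARM64
/* arm64: the vector is discovered at runtime (from ACPI in the patch). */
static uint64_t sketch_discovered_vector;

static void sketch_setup_cpu_sint(struct sketch_sint *sint)
{
	/* The per-CPU IRQ enable would live here, and only here. */
	sketch_fill_sint(sint, sketch_discovered_vector);
}
#else
/* x86: compile-time constant, no runtime state. Value is illustrative only. */
#define SKETCH_CALLBACK_VECTOR 0xf3

static void sketch_setup_cpu_sint(struct sketch_sint *sint)
{
	sketch_fill_sint(sint, SKETCH_CALLBACK_VECTOR);
}
#endif
```

With this split, the x86 build carries no runtime vector or IRQ state at all, and the shared part is not duplicated.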
Thanks,
Stanislav
> Thanks,
> Anirudh.
>
> >
> > Thanks,
> > Stanislav
> >
> > > /* Enable intercepts */
> > > sint.as_uint64 = 0;
> > > - sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > + sint.vector = mshv_interrupt;
> > > sint.masked = false;
> > > sint.auto_eoi = hv_recommend_using_aeoi();
> > > hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> > > @@ -507,13 +645,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> > >
> > > /* Doorbell SINT */
> > > sint.as_uint64 = 0;
> > > - sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > + sint.vector = mshv_interrupt;
> > > sint.masked = false;
> > > sint.as_intercept = 1;
> > > sint.auto_eoi = hv_recommend_using_aeoi();
> > > hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> > > sint.as_uint64);
> > > -#endif
> > >
> > > /* Enable global synic bit */
> > > sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> > > @@ -568,6 +705,9 @@ int mshv_synic_cpu_exit(unsigned int cpu)
> > > hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> > > sint.as_uint64);
> > >
> > > + if (mshv_irq != -1)
> > > + disable_percpu_irq(mshv_irq);
> > > +
> > > /* Disable Synic's event ring page */
> > > sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> > > sirbp.sirbp_enabled = false;
> > > --
> > > 2.34.1
> > >
^ permalink raw reply
* [PATCH 1/1] mshv: Use EPOLLIN and EPOLLHUP instead of POLLIN and POLLHUP
From: mhkelley58 @ 2026-01-29 15:51 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv; +Cc: linux-kernel
From: Michael Kelley <mhklinux@outlook.com>
mshv code currently uses the POLLIN and POLLHUP flags. Starting with
commit a9a08845e9acb ("vfs: do bulk POLL* -> EPOLL* replacement") the
intent is to use the EPOLL* versions throughout the kernel.
The comment at the top of mshv_eventfd.c describes it as being inspired
by the KVM implementation, which was changed by the above mentioned
commit in 2018 to use EPOLL*. mshv_eventfd.c is much newer than 2018
and there's no statement as to why it must use the POLL* versions.
So change it to use the EPOLL* versions. This change also resolves
a 'sparse' warning.
No functional change, and the generated code is the same.
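As a quick illustration of why the generated code can stay the same, the following userspace sketch compares the two flag families. It assumes Linux with glibc-style <poll.h> and <sys/epoll.h>; the function name is hypothetical. In the kernel the EPOLL* constants additionally carry the __poll_t type, which is what the 'sparse' check cares about.

```c
#include <poll.h>
#include <sys/epoll.h>

/* Returns 1 when the POLL* and EPOLL* flags touched by the patch share values. */
static int sketch_poll_flags_match(void)
{
	return (unsigned int)POLLIN == (unsigned int)EPOLLIN &&
	       (unsigned int)POLLHUP == (unsigned int)EPOLLHUP;
}
```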
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202601220948.MUTO60W4-lkp@intel.com/
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/hv/mshv_eventfd.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 0b75ff1edb73..dfc8b1092c02 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -295,13 +295,13 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
{
struct mshv_irqfd *irqfd = container_of(wait, struct mshv_irqfd,
irqfd_wait);
- unsigned long flags = (unsigned long)key;
+ __poll_t flags = key_to_poll(key);
int idx;
unsigned int seq;
struct mshv_partition *pt = irqfd->irqfd_partn;
int ret = 0;
- if (flags & POLLIN) {
+ if (flags & EPOLLIN) {
u64 cnt;
eventfd_ctx_do_read(irqfd->irqfd_eventfd_ctx, &cnt);
@@ -320,7 +320,7 @@ static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
ret = 1;
}
- if (flags & POLLHUP) {
+ if (flags & EPOLLHUP) {
/* The eventfd is closing, detach from the partition */
unsigned long flags;
@@ -506,7 +506,7 @@ static int mshv_irqfd_assign(struct mshv_partition *pt,
*/
events = vfs_poll(fd_file(f), &irqfd->irqfd_polltbl);
- if (events & POLLIN)
+ if (events & EPOLLIN)
mshv_assert_irq_slow(irqfd);
srcu_read_unlock(&pt->pt_irq_srcu, idx);
--
2.25.1
^ permalink raw reply related
* [PATCH] scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
From: Jan Kiszka @ 2026-01-29 14:30 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
James E.J. Bottomley, Martin K. Petersen, linux-hyperv
Cc: linux-scsi@vger.kernel.org, Linux Kernel Mailing List,
Florian Bezdeka, RT, Mitchell Levy
From: Jan Kiszka <jan.kiszka@siemens.com>
This resolves the following splat and lock-up when running with PREEMPT_RT
enabled on Hyper-V:
[ 415.140818] BUG: scheduling while atomic: stress-ng-iomix/1048/0x00000002
[ 415.140822] INFO: lockdep is turned off.
[ 415.140823] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec ghash_clmulni_intel aesni_intel rapl binfmt_misc nls_ascii nls_cp437 vfat fat snd_pcm hyperv_drm snd_timer drm_client_lib drm_shmem_helper snd sg soundcore drm_kms_helper pcspkr hv_balloon hv_utils evdev joydev drm configfs efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport vsock vmw_vmci efivarfs autofs4 ext4 crc16 mbcache jbd2 sr_mod sd_mod cdrom hv_storvsc serio_raw hid_generic scsi_transport_fc hid_hyperv scsi_mod hid hv_netvsc hyperv_keyboard scsi_common
[ 415.140846] Preemption disabled at:
[ 415.140847] [<ffffffffc0656171>] storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
[ 415.140854] CPU: 8 UID: 0 PID: 1048 Comm: stress-ng-iomix Not tainted 6.19.0-rc7 #30 PREEMPT_{RT,(full)}
[ 415.140856] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/04/2024
[ 415.140857] Call Trace:
[ 415.140861] <TASK>
[ 415.140861] ? storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
[ 415.140863] dump_stack_lvl+0x91/0xb0
[ 415.140870] __schedule_bug+0x9c/0xc0
[ 415.140875] __schedule+0xdf6/0x1300
[ 415.140877] ? rtlock_slowlock_locked+0x56c/0x1980
[ 415.140879] ? rcu_is_watching+0x12/0x60
[ 415.140883] schedule_rtlock+0x21/0x40
[ 415.140885] rtlock_slowlock_locked+0x502/0x1980
[ 415.140891] rt_spin_lock+0x89/0x1e0
[ 415.140893] hv_ringbuffer_write+0x87/0x2a0
[ 415.140899] vmbus_sendpacket_mpb_desc+0xb6/0xe0
[ 415.140900] ? rcu_is_watching+0x12/0x60
[ 415.140902] storvsc_queuecommand+0x669/0xbe0 [hv_storvsc]
[ 415.140904] ? HARDIRQ_verbose+0x10/0x10
[ 415.140908] ? __rq_qos_issue+0x28/0x40
[ 415.140911] scsi_queue_rq+0x760/0xd80 [scsi_mod]
[ 415.140926] __blk_mq_issue_directly+0x4a/0xc0
[ 415.140928] blk_mq_issue_direct+0x87/0x2b0
[ 415.140931] blk_mq_dispatch_queue_requests+0x120/0x440
[ 415.140933] blk_mq_flush_plug_list+0x7a/0x1a0
[ 415.140935] __blk_flush_plug+0xf4/0x150
[ 415.140940] __submit_bio+0x2b2/0x5c0
[ 415.140944] ? submit_bio_noacct_nocheck+0x272/0x360
[ 415.140946] submit_bio_noacct_nocheck+0x272/0x360
[ 415.140951] ext4_read_bh_lock+0x3e/0x60 [ext4]
[ 415.140995] ext4_block_write_begin+0x396/0x650 [ext4]
[ 415.141018] ? __pfx_ext4_da_get_block_prep+0x10/0x10 [ext4]
[ 415.141038] ext4_da_write_begin+0x1c4/0x350 [ext4]
[ 415.141060] generic_perform_write+0x14e/0x2c0
[ 415.141065] ext4_buffered_write_iter+0x6b/0x120 [ext4]
[ 415.141083] vfs_write+0x2ca/0x570
[ 415.141087] ksys_write+0x76/0xf0
[ 415.141089] do_syscall_64+0x99/0x1490
[ 415.141093] ? rcu_is_watching+0x12/0x60
[ 415.141095] ? finish_task_switch.isra.0+0xdf/0x3d0
[ 415.141097] ? rcu_is_watching+0x12/0x60
[ 415.141098] ? lock_release+0x1f0/0x2a0
[ 415.141100] ? rcu_is_watching+0x12/0x60
[ 415.141101] ? finish_task_switch.isra.0+0xe4/0x3d0
[ 415.141103] ? rcu_is_watching+0x12/0x60
[ 415.141104] ? __schedule+0xb34/0x1300
[ 415.141106] ? hrtimer_try_to_cancel+0x1d/0x170
[ 415.141109] ? do_nanosleep+0x8b/0x160
[ 415.141111] ? hrtimer_nanosleep+0x89/0x100
[ 415.141114] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 415.141116] ? xfd_validate_state+0x26/0x90
[ 415.141118] ? rcu_is_watching+0x12/0x60
[ 415.141120] ? do_syscall_64+0x1e0/0x1490
[ 415.141121] ? do_syscall_64+0x1e0/0x1490
[ 415.141123] ? rcu_is_watching+0x12/0x60
[ 415.141124] ? do_syscall_64+0x1e0/0x1490
[ 415.141125] ? do_syscall_64+0x1e0/0x1490
[ 415.141127] ? irqentry_exit+0x140/0x7e0
[ 415.141129] entry_SYSCALL_64_after_hwframe+0x76/0x7e
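get_cpu() disables preemption, while the spinlock that hv_ringbuffer_write()
takes is converted to a sleeping rt-mutex under PREEMPT_RT. Use
migrate_disable() with smp_processor_id() instead, which keeps the task on
its CPU without disabling preemption.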
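The failure mode can be pictured with a toy userspace model. Nothing below is kernel code: the model_* names are hypothetical, and the counters only mimic the rule that a lock which may sleep must not be taken while preemption is disabled.

```c
#include <stdbool.h>

/* Toy model only: none of these are the kernel's real implementations. */
static int model_preempt_count;		/* > 0 means "atomic": sleeping is a bug */
static int model_migrate_disabled;	/* > 0 means the task is pinned to its CPU */

static void model_get_cpu(void)         { model_preempt_count++; }
static void model_put_cpu(void)         { model_preempt_count--; }
static void model_migrate_disable(void) { model_migrate_disabled++; }
static void model_migrate_enable(void)  { model_migrate_disabled--; }

/*
 * Under PREEMPT_RT a spinlock_t may sleep, so taking it is only legal
 * while preemption is enabled. Returns false for "scheduling while atomic".
 */
static bool model_rt_spin_lock_ok(void)
{
	return model_preempt_count == 0;
}

/* Old pattern: the lock is taken between get_cpu() and put_cpu(). */
static bool model_old_queuecommand(void)
{
	bool ok;

	model_get_cpu();
	ok = model_rt_spin_lock_ok();
	model_put_cpu();
	return ok;
}

/* New pattern: the CPU stays stable, but preemption remains enabled. */
static bool model_new_queuecommand(void)
{
	bool ok;

	model_migrate_disable();
	ok = model_rt_spin_lock_ok() && model_migrate_disabled > 0;
	model_migrate_enable();
	return ok;
}
```

In this model the old pattern always fails the check and the new one passes, which matches the splat in the trace above.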
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
This is likely just the tip of an iceberg, see specifically [1], but if
you never start addressing it, it will continue to crash ships, even if
those are only on test cruises (we are fully aware that Hyper-V provides
no RT guarantees for guests). A pragmatic alternative to that would be a
simple
config HYPERV
depends on !PREEMPT_RT
Please share your thoughts on whether this fix is worth it, or whether we
should stop chasing the next splats that show up after it. We are
currently considering threading some of the hv platform IRQs under
PREEMPT_RT as a potential next step.
TIA!
[1] https://lore.kernel.org/all/20230809-b4-rt_preempt-fix-v1-0-7283bbdc8b14@gmail.com/
drivers/scsi/storvsc_drv.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index b43d876747b7..68c837146b9e 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1855,8 +1855,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
cmd_request->payload_sz = payload_sz;
/* Invokes the vsc to start an IO */
- ret = storvsc_do_io(dev, cmd_request, get_cpu());
- put_cpu();
+ migrate_disable();
+ ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
+ migrate_enable();
if (ret)
scsi_dma_unmap(scmnd);
--
2.51.0
^ permalink raw reply related
* Re: [PATCH 1/1] PCI: hv: Remove unused field pci_bus in struct hv_pcibus_device
From: Prasanna Kumar T S M @ 2026-01-29 9:48 UTC (permalink / raw)
To: mhklinux, kys, haiyangz, wei.liu, decui, longli, lpieralisi,
kwilczynski, mani, robh, bhelgaas
Cc: linux-pci, linux-kernel, linux-hyperv
In-Reply-To: <20260111170034.67558-1-mhklinux@outlook.com>
On 11-01-2026 22:30, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
>
> Field pci_bus in struct hv_pcibus_device is unused since
> commit 418cb6c8e051 ("PCI: hv: Generify PCI probing"). Remove it.
>
> No functional change.
>
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
> drivers/pci/controller/pci-hyperv.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 1e237d3538f9..7fcba05cec30 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -501,7 +501,6 @@ struct hv_pcibus_device {
> struct resource *low_mmio_res;
> struct resource *high_mmio_res;
> struct completion *survey_event;
> - struct pci_bus *pci_bus;
> spinlock_t config_lock; /* Avoid two threads writing index page */
> spinlock_t device_list_lock; /* Protect lists below */
> void __iomem *cfg_addr;
Reviewed-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
Regards,
Prasanna Kumar
^ permalink raw reply
* Re: [PATCH 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Anirudh Rayabharam @ 2026-01-29 4:36 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXqV127NzazbDkau@skinsburskii.localdomain>
On Wed, Jan 28, 2026 at 03:03:51PM -0800, Stanislav Kinsburskii wrote:
> On Wed, Jan 28, 2026 at 04:04:37PM +0000, Anirudh Rayabharam wrote:
> > From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> >
> > On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
> > interrupts (SINTs) from the hypervisor for doorbells and intercepts.
> > There is no such vector reserved for arm64.
> >
> > On arm64, the INTID for SINTs should be in the SGI or PPI range. The
> > hypervisor exposes a virtual device in the ACPI that reserves a
> > PPI for this use. Introduce a platform_driver that binds to this ACPI
> > device and obtains the interrupt vector that can be used for SINTs.
> >
> > To better unify x86 and arm64 paths, introduce mshv_sint_irq_init() that
>
> Where is mshv_sint_irq_init?
Oops, this should be mshv_synic_init(). Leftover from a previous
development version of this patch :)
Will fix in the next version.
>
> > either registers the platform_driver and obtains the INTID (arm64) or
> > just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).
> >
> > Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > ---
> > drivers/hv/mshv_root.h | 2 +
> > drivers/hv/mshv_root_main.c | 11 ++-
> > drivers/hv/mshv_synic.c | 152 ++++++++++++++++++++++++++++++++++--
> > 3 files changed, 158 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index c02513f75429..c2d1e8d7452c 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -332,5 +332,7 @@ int mshv_region_get(struct mshv_mem_region *region);
> > bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
> > void mshv_region_movable_fini(struct mshv_mem_region *region);
> > bool mshv_region_movable_init(struct mshv_mem_region *region);
> > +int mshv_synic_init(void);
> > +void mshv_synic_cleanup(void);
> >
> > #endif /* _MSHV_ROOT_H_ */
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index abb34b37d552..6c2d4a80dbe3 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -2276,11 +2276,17 @@ static int __init mshv_parent_partition_init(void)
> > MSHV_HV_MAX_VERSION);
> > }
> >
> > + ret = mshv_synic_init();
> > + if (ret) {
> > + dev_err(dev, "Failed to initialize synic: %i\n", ret);
> > + goto device_deregister;
> > + }
> > +
> > mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
> > if (!mshv_root.synic_pages) {
> > dev_err(dev, "Failed to allocate percpu synic page\n");
> > ret = -ENOMEM;
> > - goto device_deregister;
> > + goto synic_cleanup;
> > }
>
> Should this become a part of mshv_synic_init()?
Yeah, good idea. Maybe even the below cpuhp_setup_state can be moved.
>
> >
> > ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> > @@ -2322,6 +2328,8 @@ static int __init mshv_parent_partition_init(void)
> > cpuhp_remove_state(mshv_cpuhp_online);
> > free_synic_pages:
> > free_percpu(mshv_root.synic_pages);
> > +synic_cleanup:
> > + mshv_synic_cleanup();
> > device_deregister:
> > misc_deregister(&mshv_dev);
> > return ret;
> > @@ -2337,6 +2345,7 @@ static void __exit mshv_parent_partition_exit(void)
> > mshv_root_partition_exit();
> > cpuhp_remove_state(mshv_cpuhp_online);
> > free_percpu(mshv_root.synic_pages);
> > + mshv_synic_cleanup();
>
> Please, follow the common convention where cleaup path is the reverse of
> init path.
Right, will fix this.
>
> > }
> >
> > module_init(mshv_parent_partition_init);
> > diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> > index ba89655b0910..b7860a75b97e 100644
> > --- a/drivers/hv/mshv_synic.c
> > +++ b/drivers/hv/mshv_synic.c
> > @@ -10,13 +10,19 @@
> > #include <linux/kernel.h>
> > #include <linux/slab.h>
> > #include <linux/mm.h>
> > +#include <linux/interrupt.h>
> > #include <linux/io.h>
> > #include <linux/random.h>
> > #include <asm/mshyperv.h>
> > +#include <linux/platform_device.h>
> > +#include <linux/acpi.h>
> >
> > #include "mshv_eventfd.h"
> > #include "mshv.h"
> >
> > +static int mshv_interrupt = -1;
>
> The name is a bit too short. What about mshv_callback_vector or
> mshv_irq_vector?
I like mshv_callback_vector. I'll change to that in the next version
unless someone else comes up with a better suggestion.
>
> > +static int mshv_irq = -1;
> > +
>
> Should this be a path of mshv_root structure?
This doesn't need to be globally accessible. It is only used in this file.
So I guess it doesn't need to be in mshv_root. What do you think?
>
> > static u32 synic_event_ring_get_queued_port(u32 sint_index)
> > {
> > struct hv_synic_event_ring_page **event_ring_page;
> > @@ -446,14 +452,144 @@ void mshv_isr(void)
> > }
> > }
> >
> > +#ifndef HYPERVISOR_CALLBACK_VECTOR
> > +#ifdef CONFIG_ACPI
> > +static long __percpu *mshv_evt;
> > +
> > +static acpi_status mshv_walk_resources(struct acpi_resource *res, void *ctx)
> > +{
> > + struct resource r;
> > +
> > + switch (res->type) {
> > + case ACPI_RESOURCE_TYPE_EXTENDED_IRQ:
> > + if (!acpi_dev_resource_interrupt(res, 0, &r)) {
> > + pr_err("Unable to parse MSHV ACPI interrupt\n");
> > + return AE_ERROR;
> > + }
> > + /* ARM64 INTID */
> > + mshv_interrupt = res->data.extended_irq.interrupts[0];
> > + /* Linux IRQ number */
> > + mshv_irq = r.start;
> > + pr_info("MSHV SINT INTID %d, IRQ %d\n",
> > + mshv_interrupt, mshv_irq);
> > + return AE_OK;
> > + default:
> > + /* Unused resource type */
> > + return AE_OK;
> > + }
> > +
> > + return AE_OK;
> > +}
> > +
> > +static irqreturn_t mshv_percpu_isr(int irq, void *dev_id)
> > +{
> > + mshv_isr();
> > + add_interrupt_randomness(irq);
> > + return IRQ_HANDLED;
> > +}
> > +
> > +static int mshv_sint_probe(struct platform_device *pdev)
> > +{
> > + acpi_status result;
> > + int ret = 0;
> > + struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
> > +
> > + result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
> > + mshv_walk_resources, NULL);
> > +
> > + if (ACPI_FAILURE(result)) {
> > + ret = -ENODEV;
> > + goto out;
> > + }
> > +
> > + mshv_evt = alloc_percpu(long);
> > + if (!mshv_evt) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + ret = request_percpu_irq(mshv_irq, mshv_percpu_isr, "MSHV", mshv_evt);
> > +out:
> > + return ret;
> > +}
> > +
> > +static void mshv_sint_remove(struct platform_device *pdev)
> > +{
> > + free_percpu_irq(mshv_irq, mshv_evt);
> > + free_percpu(mshv_evt);
> > +}
> > +#else
> > +static int mshv_sint_probe(struct platform_device *pdev)
> > +{
> > + return -ENODEV;
> > +}
> > +
> > +static void mshv_sint_remove(struct platform_device *pdev)
> > +{
> > + return;
> > +}
> > +#endif
> > +
>
> Is this all x86-compatible?
> The commit message says it's introduced for arm64.
> If it's incompatible, please, wrap it into #ifdefs and compile out for
> x86_64.
They are wrapped in #ifndef HYPERVISOR_CALLBACK_VECTOR.
If that is defined we use the hardcoded vector. It is currently
only defined for x86 so HYPERVISOR_CALLBACK_VECTOR is effectively a proxy
for "x86 enabled". This approach is better because we're not concerned
about whether it is x86 or arm; what we really want to figure out
is whether we have a pre-defined vector or not.
The VMBus driver follows this pattern too.
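The pattern described here, keyed on whether a vector is pre-defined and not on the architecture name, can be sketched roughly as follows. SKETCH_PREDEFINED_VECTOR is a hypothetical stand-in for HYPERVISOR_CALLBACK_VECTOR and is deliberately left undefined, so the sketch takes the "discover it at runtime" branch; the function and variable names are assumptions as well.

```c
/*
 * Hypothetical sketch: SKETCH_PREDEFINED_VECTOR plays the role of
 * HYPERVISOR_CALLBACK_VECTOR. Leaving it undefined models a platform
 * without a pre-defined vector.
 */
static int sketch_vector = -1;

/* Returns 0 on success, -1 if no vector is available. */
static int sketch_synic_init(int discovered_vector)
{
#ifdef SKETCH_PREDEFINED_VECTOR
	/* A fixed vector exists: no discovery needed. */
	(void)discovered_vector;
	sketch_vector = SKETCH_PREDEFINED_VECTOR;
	return 0;
#else
	/* No fixed vector: rely on what platform discovery reported. */
	if (discovered_vector < 0)
		return -1;
	sketch_vector = discovered_vector;
	return 0;
#endif
}
```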
>
> > +
> > +static const __maybe_unused struct acpi_device_id mshv_sint_device_ids[] = {
> > + {"MSFT1003", 0},
> > + {"", 0},
> > +};
> > +
> > +static struct platform_driver mshv_sint_drv = {
> > + .probe = mshv_sint_probe,
> > + .remove = mshv_sint_remove,
> > + .driver = {
> > + .name = "mshv_sint",
> > + .acpi_match_table = ACPI_PTR(mshv_sint_device_ids),
> > + .probe_type = PROBE_FORCE_SYNCHRONOUS,
> > + },
> > +};
> > +#endif /* HYPERVISOR_CALLBACK_VECTOR */
> > +
> > +int mshv_synic_init(void)
> > +{
> > +#ifdef HYPERVISOR_CALLBACK_VECTOR
> > + mshv_interrupt = HYPERVISOR_CALLBACK_VECTOR;
> > + mshv_irq = -1;
> > + return 0;
> > +#else
> > + int ret;
> > +
> > + if (acpi_disabled)
> > + return -ENODEV;
> > +
> > + ret = platform_driver_register(&mshv_sint_drv);
> > + if (ret)
> > + return ret;
> > +
> > + if (mshv_interrupt == -1 || mshv_irq == -1) {
> > + ret = -ENODEV;
> > + goto out_unregister;
> > + }
> > +
> > + return 0;
> > +
> > +out_unregister:
> > + platform_driver_unregister(&mshv_sint_drv);
> > + return ret;
> > +#endif
> > +}
> > +
> > +void mshv_synic_cleanup(void)
> > +{
> > +#ifndef HYPERVISOR_CALLBACK_VECTOR
> > + if (!acpi_disabled)
> > + platform_driver_unregister(&mshv_sint_drv);
> > +#endif
> > +}
> > +
> > int mshv_synic_cpu_init(unsigned int cpu)
> > {
> > union hv_synic_simp simp;
> > union hv_synic_siefp siefp;
> > union hv_synic_sirbp sirbp;
> > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > union hv_synic_sint sint;
> > -#endif
> > union hv_synic_scontrol sctrl;
> > struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> > @@ -496,10 +632,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> >
> > hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> >
> > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > + if (mshv_irq != -1)
> > + enable_percpu_irq(mshv_irq, 0);
> > +
>
> It's better to explicitly separate x86 and arm64 paths with #ifdefs.
> For example:
>
> #ifdef CONFIG_X86_64
> int setup_cpu_sint() {
> /* Enable intercepts */
> sint.as_uint64 = 0;
> sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> ....
> }
> #endif
> #ifdef CONFIG_ARM64
> int setup_cpu_sint() {
> enable_percpu_irq(mshv_irq, 0);
>
> /* Enable intercepts */
> sint.as_uint64 = 0;
> sint.vector = mshv_interrupt;
> ....
> }
> #endif
This seems unnecessary. We've made the paths that determine
mshv_interrupt separate. Now we can just use that here.
There is no need to write two copies of
...
sint.as_uint64 = 0;
sint.vector = <whatever>;
...
I could do the enable_percpu_irq() inside an ifdef. But do we gain
anything from it? Won't the compiler optimize the current code as well
since mshv_irq will always be -1 whenever HYPERVISOR_CALLBACK_VECTOR is
defined?
Thanks,
Anirudh.
>
> Thanks,
> Stanislav
>
> > /* Enable intercepts */
> > sint.as_uint64 = 0;
> > - sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > + sint.vector = mshv_interrupt;
> > sint.masked = false;
> > sint.auto_eoi = hv_recommend_using_aeoi();
> > hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> > @@ -507,13 +645,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> >
> > /* Doorbell SINT */
> > sint.as_uint64 = 0;
> > - sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > + sint.vector = mshv_interrupt;
> > sint.masked = false;
> > sint.as_intercept = 1;
> > sint.auto_eoi = hv_recommend_using_aeoi();
> > hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> > sint.as_uint64);
> > -#endif
> >
> > /* Enable global synic bit */
> > sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> > @@ -568,6 +705,9 @@ int mshv_synic_cpu_exit(unsigned int cpu)
> > hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> > sint.as_uint64);
> >
> > + if (mshv_irq != -1)
> > + disable_percpu_irq(mshv_irq);
> > +
> > /* Disable Synic's event ring page */
> > sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> > sirbp.sirbp_enabled = false;
> > --
> > 2.34.1
> >
^ permalink raw reply
* RE: [PATCH 1/1] PCI: hv: Remove unused field pci_bus in struct hv_pcibus_device
From: Michael Kelley @ 2026-01-29 4:35 UTC (permalink / raw)
To: lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
robh@kernel.org, bhelgaas@google.com
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org, Michael Kelley, kys@microsoft.com,
haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
longli@microsoft.com
In-Reply-To: <20260111170034.67558-1-mhklinux@outlook.com>
From: mhkelley58@gmail.com <mhkelley58@gmail.com> Sent: Sunday, January 11, 2026 9:01 AM
>
> From: Michael Kelley <mhklinux@outlook.com>
>
> Field pci_bus in struct hv_pcibus_device is unused since
> commit 418cb6c8e051 ("PCI: hv: Generify PCI probing"). Remove it.
>
> No functional change.
>
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Could a PCI maintainer give an Ack for this trivial patch?
Thx, Michael
> ---
> drivers/pci/controller/pci-hyperv.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 1e237d3538f9..7fcba05cec30 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -501,7 +501,6 @@ struct hv_pcibus_device {
> struct resource *low_mmio_res;
> struct resource *high_mmio_res;
> struct completion *survey_event;
> - struct pci_bus *pci_bus;
> spinlock_t config_lock; /* Avoid two threads writing index page */
> spinlock_t device_list_lock; /* Protect lists below */
> void __iomem *cfg_addr;
> --
> 2.25.1
>
^ permalink raw reply
* RE: [PATCH V0] x86/hyperv: Fix compiler warnings in hv_crash.c
From: Michael Kelley @ 2026-01-29 4:21 UTC (permalink / raw)
To: Mukesh R, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: wei.liu@kernel.org
In-Reply-To: <20260121024045.3834787-1-mrathor@linux.microsoft.com>
From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, January 20, 2026 6:41 PM
>
> Fix two compiler warnings:
> o smp_ops is only defined if CONFIG_SMP
> o status is set but not explicitly used.
>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202512301641.FC6OAbGM-lkp@intel.com/
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
> arch/x86/hyperv/hv_crash.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
> index c0e22921ace1..82915b22ceae 100644
> --- a/arch/x86/hyperv/hv_crash.c
> +++ b/arch/x86/hyperv/hv_crash.c
> @@ -279,7 +279,6 @@ static void hv_notify_prepare_hyp(void)
> static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
> {
> struct hv_input_disable_hyp_ex *input;
> - u64 status;
> int msecs = 1000, ccpu = smp_processor_id();
>
> if (ccpu == 0) {
> @@ -313,7 +312,7 @@ static noinline __noclone void crash_nmi_callback(struct
> pt_regs *regs)
> input->rip = trampoline_pa;
> input->arg = devirt_arg;
>
> - status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
> + hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
>
> hv_panic_timeout_reboot();
> }
> @@ -628,8 +627,9 @@ void hv_root_crash_init(void)
> if (rc)
> goto err_out;
>
> +#ifdef CONFIG_SMP
> smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
> -
> +#endif
> crash_kexec_post_notifiers = true;
> hv_crash_enabled = true;
> pr_info("Hyper-V: both linux and hypervisor kdump support enabled\n");
> --
> 2.51.2.vfs.0.1
>
Ingo Molnar has separately fixed the smp_ops problem in [1]. Removing
the unused "status" value looks good to me, though it's probably slightly
better to add (void) to hv_do_hypercall() as an explicit acknowledgement
that there's a return value that's not relevant and is being ignored; i.e.,
(void)hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
Regardless, for the unused "status" part of this patch,
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
[1] https://lore.kernel.org/all/176959812223.510.4055929851272785854.tip-bot2@tip-bot2/
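For readers unfamiliar with the idiom, a minimal userspace sketch of the (void) cast suggested above follows. sketch_do_hypercall() is a hypothetical stand-in, not hv_do_hypercall(); the point is only that the cast documents a deliberately ignored return value without needing a dummy variable.

```c
#include <stdint.h>

static int sketch_calls;

/* Hypothetical stand-in for a hypercall wrapper that returns a status. */
static uint64_t sketch_do_hypercall(uint64_t code)
{
	sketch_calls++;
	return code ? 0 : 1;
}

static void sketch_disable_hyp(void)
{
	/* Result intentionally ignored: nothing useful can be done with it here. */
	(void)sketch_do_hypercall(0x1234);
}
```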
^ permalink raw reply
* Re: [PATCH 2/4] mshv: Introduce hv_deposit_memory helper functions
From: Stanislav Kinsburskii @ 2026-01-28 23:18 UTC (permalink / raw)
To: Mukesh R
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <8d141a6a-d06f-f91a-686b-82f8f0facabc@linux.microsoft.com>
On Tue, Jan 27, 2026 at 11:44:25AM -0800, Mukesh R wrote:
> On 1/27/26 10:30, Stanislav Kinsburskii wrote:
> > On Mon, Jan 26, 2026 at 06:06:23PM -0800, Mukesh R wrote:
> > > On 1/25/26 14:41, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 23, 2026 at 04:33:39PM -0800, Mukesh R wrote:
> > > > > On 1/22/26 17:35, Stanislav Kinsburskii wrote:
> > > > > > Introduce hv_deposit_memory_node() and hv_deposit_memory() helper
> > > > > > functions to handle memory deposition with proper error handling.
> > > > > >
> > > > > > The new hv_deposit_memory_node() function takes the hypervisor status
> > > > > > as a parameter and validates it before depositing pages. It checks for
> > > > > > HV_STATUS_INSUFFICIENT_MEMORY specifically and returns an error for
> > > > > > unexpected status codes.
> > > > > >
> > > > > > This is a precursor patch to new out-of-memory error codes support.
> > > > > > No functional changes intended.
> > > > > >
> > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > ---
> > > > > > drivers/hv/hv_proc.c | 22 ++++++++++++++++++++--
> > > > > > drivers/hv/mshv_root_hv_call.c | 25 +++++++++----------------
> > > > > > drivers/hv/mshv_root_main.c | 3 +--
> > > > > > include/asm-generic/mshyperv.h | 10 ++++++++++
> > > > > > 4 files changed, 40 insertions(+), 20 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> > > > > > index 80c66d1c74d5..c0c2bfc80d77 100644
> > > > > > --- a/drivers/hv/hv_proc.c
> > > > > > +++ b/drivers/hv/hv_proc.c
> > > > > > @@ -110,6 +110,23 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
> > > > > > }
> > > > > > EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
> > > > > > +int hv_deposit_memory_node(int node, u64 partition_id,
> > > > > > + u64 hv_status)
> > > > > > +{
> > > > > > + u32 num_pages;
> > > > > > +
> > > > > > + switch (hv_result(hv_status)) {
> > > > > > + case HV_STATUS_INSUFFICIENT_MEMORY:
> > > > > > + num_pages = 1;
> > > > > > + break;
> > > > > > + default:
> > > > > > + hv_status_err(hv_status, "Unexpected!\n");
> > > > > > + return -ENOMEM;
> > > > > > + }
> > > > > > + return hv_call_deposit_pages(node, partition_id, num_pages);
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(hv_deposit_memory_node);
> > > > > > +
> > > > >
> > > > > Different hypercalls may want to deposit different number of pages in one
> > > > > shot. As feature evolves, page sizes get mixed, we'd almost need that
> > > > > flexibility. So, imo, either we just don't do this for now, or add num pages
> > > > > parameter to be passed down.
> > > > >
> > > >
> > > > > What do you mean by "page sizes get mixed"?
> > > > > A helper to deposit num pages already exists: it's
> > > > > hv_call_deposit_pages().
> > >
> > > > My point is, you are removing the number of pages, and we may want to
> > > > keep that so one can quickly play around and change it.
> > >
> > > - ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > - pt_id, 1);
> > > + ret = hv_deposit_memory(pt_id, status);
> > >
> > > For example, in hv_call_initialize_partition() we may realize after
> > > some analysis that depositing 2 pages or 4 pages is much better.
> > >
> >
> > We have been using this 1-page deposit logic from the beginning. To
> > change the number of pages, simply replace hv_deposit_memory with
> > hv_call_deposit_pages and specify the desired number of pages.
>
> You could perhaps rename it to hv_deposit_page().
>
Yes, this would be a good name, but unfortunately we can now receive
statuses like HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY, where we need to
deposit at least 8 consecutive pages. There is also another pair of
status codes for required root pages, even when a guest partition-related
hypercall is performed (see the next patch for details).
This new helper is intended to cover all such cases, instead of branching
for all these different cases in every function.
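To illustrate the idea, here is a minimal userspace sketch of one helper
choosing the deposit size from the status code. The enum values and the
demo_* names are hypothetical stand-ins, not the real HV_STATUS_*
definitions or the actual helper; only the 1-page and 8-page counts come
from the discussion above.

```c
/* Hypothetical stand-ins for hypervisor status codes (not the real values). */
enum demo_status {
	DEMO_STATUS_SUCCESS = 0,
	DEMO_STATUS_INSUFFICIENT_MEMORY,
	DEMO_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY,
	DEMO_STATUS_OTHER,
};

/*
 * Return how many pages to deposit for a given status, or -1 when the
 * status is not an out-of-memory condition this helper knows about.
 */
static int demo_pages_for_status(enum demo_status status)
{
	switch (status) {
	case DEMO_STATUS_INSUFFICIENT_MEMORY:
		return 1;	/* single page is enough */
	case DEMO_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY:
		return 8;	/* at least 8 consecutive pages */
	default:
		return -1;	/* unexpected status: caller should fail */
	}
}
```

With the mapping in one place, each call site only passes the status along
and does not need its own branch per status code.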
Thanks,
Stanislav
> > The proposed approach reduces code duplication and is less error-prone,
> > as there are multiple error codes to handle. Consolidating the logic
> > also makes the driver more robust.
> >
> >
> > Thanks, Stanislav
> >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > > Thanks,
> > > > > -Mukesh
> > > > >
> > > > >
> > > > >
> > > > > > bool hv_result_oom(u64 status)
> > > > > > {
> > > > > > switch (hv_result(status)) {
> > > > > > @@ -155,7 +172,8 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
> > > > > > }
> > > > > > break;
> > > > > > }
> > > > > > - ret = hv_call_deposit_pages(node, hv_current_partition_id, 1);
> > > > > > + ret = hv_deposit_memory_node(node, hv_current_partition_id,
> > > > > > + status);
> > > > > > } while (!ret);
> > > > > > return ret;
> > > > > > @@ -197,7 +215,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
> > > > > > }
> > > > > > break;
> > > > > > }
> > > > > > - ret = hv_call_deposit_pages(node, partition_id, 1);
> > > > > > + ret = hv_deposit_memory_node(node, partition_id, status);
> > > > > > } while (!ret);
> > > > > > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > > > > > index 58c5cbf2e567..06f2bac8039d 100644
> > > > > > --- a/drivers/hv/mshv_root_hv_call.c
> > > > > > +++ b/drivers/hv/mshv_root_hv_call.c
> > > > > > @@ -123,8 +123,7 @@ int hv_call_create_partition(u64 flags,
> > > > > > break;
> > > > > > }
> > > > > > local_irq_restore(irq_flags);
> > > > > > - ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > > > - hv_current_partition_id, 1);
> > > > > > + ret = hv_deposit_memory(hv_current_partition_id, status);
> > > > > > } while (!ret);
> > > > > > return ret;
> > > > > > @@ -151,7 +150,7 @@ int hv_call_initialize_partition(u64 partition_id)
> > > > > > ret = hv_result_to_errno(status);
> > > > > > break;
> > > > > > }
> > > > > > - ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> > > > > > + ret = hv_deposit_memory(partition_id, status);
> > > > > > } while (!ret);
> > > > > > return ret;
> > > > > > @@ -465,8 +464,7 @@ int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
> > > > > > }
> > > > > > local_irq_restore(flags);
> > > > > > - ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > > > - partition_id, 1);
> > > > > > + ret = hv_deposit_memory(partition_id, status);
> > > > > > } while (!ret);
> > > > > > return ret;
> > > > > > @@ -525,8 +523,7 @@ int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
> > > > > > }
> > > > > > local_irq_restore(flags);
> > > > > > - ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > > > - partition_id, 1);
> > > > > > + ret = hv_deposit_memory(partition_id, status);
> > > > > > } while (!ret);
> > > > > > return ret;
> > > > > > @@ -573,7 +570,7 @@ static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
> > > > > > local_irq_restore(flags);
> > > > > > - ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> > > > > > + ret = hv_deposit_memory(partition_id, status);
> > > > > > } while (!ret);
> > > > > > return ret;
> > > > > > @@ -722,8 +719,7 @@ hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
> > > > > > ret = hv_result_to_errno(status);
> > > > > > break;
> > > > > > }
> > > > > > - ret = hv_call_deposit_pages(NUMA_NO_NODE, port_partition_id, 1);
> > > > > > -
> > > > > > + ret = hv_deposit_memory(port_partition_id, status);
> > > > > > } while (!ret);
> > > > > > return ret;
> > > > > > @@ -776,8 +772,7 @@ hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
> > > > > > ret = hv_result_to_errno(status);
> > > > > > break;
> > > > > > }
> > > > > > - ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > > > - connection_partition_id, 1);
> > > > > > + ret = hv_deposit_memory(connection_partition_id, status);
> > > > > > } while (!ret);
> > > > > > return ret;
> > > > > > @@ -848,8 +843,7 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
> > > > > > break;
> > > > > > }
> > > > > > - ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > > > - hv_current_partition_id, 1);
> > > > > > + ret = hv_deposit_memory(hv_current_partition_id, status);
> > > > > > } while (!ret);
> > > > > > return ret;
> > > > > > @@ -885,8 +879,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
> > > > > > return ret;
> > > > > > }
> > > > > > - ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > > > - hv_current_partition_id, 1);
> > > > > > + ret = hv_deposit_memory(hv_current_partition_id, status);
> > > > > > if (ret)
> > > > > > return ret;
> > > > > > } while (!ret);
> > > > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > > > index f4697497f83e..5fc572e31cd7 100644
> > > > > > --- a/drivers/hv/mshv_root_main.c
> > > > > > +++ b/drivers/hv/mshv_root_main.c
> > > > > > @@ -264,8 +264,7 @@ static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
> > > > > > if (!hv_result_oom(status))
> > > > > > ret = hv_result_to_errno(status);
> > > > > > else
> > > > > > - ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > > > - pt_id, 1);
> > > > > > + ret = hv_deposit_memory(pt_id, status);
> > > > > > } while (!ret);
> > > > > > args.status = hv_result(status);
> > > > > > diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> > > > > > index b73352a7fc9e..c8e8976839f8 100644
> > > > > > --- a/include/asm-generic/mshyperv.h
> > > > > > +++ b/include/asm-generic/mshyperv.h
> > > > > > @@ -344,6 +344,7 @@ static inline bool hv_parent_partition(void)
> > > > > > }
> > > > > > bool hv_result_oom(u64 status);
> > > > > > +int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
> > > > > > int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
> > > > > > int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
> > > > > > int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
> > > > > > @@ -353,6 +354,10 @@ static inline bool hv_root_partition(void) { return false; }
> > > > > > static inline bool hv_l1vh_partition(void) { return false; }
> > > > > > static inline bool hv_parent_partition(void) { return false; }
> > > > > > static inline bool hv_result_oom(u64 status) { return false; }
> > > > > > +static inline int hv_deposit_memory_node(int node, u64 partition_id, u64 status)
> > > > > > +{
> > > > > > + return -EOPNOTSUPP;
> > > > > > +}
> > > > > > static inline int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
> > > > > > {
> > > > > > return -EOPNOTSUPP;
> > > > > > @@ -367,6 +372,11 @@ static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u3
> > > > > > }
> > > > > > #endif /* CONFIG_MSHV_ROOT */
> > > > > > +static inline int hv_deposit_memory(u64 partition_id, u64 status)
> > > > > > +{
> > > > > > + return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
> > > > > > +}
> > > > > > +
> > > > > > #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
> > > > > > u8 __init get_vtl(void);
> > > > > > #else
> > > > > >
> > > > > >
>
* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-01-28 23:11 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXo2X4mRioTa3sBl@anirudh-surface.localdomain>
On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > hypervisor deposited pages.
> > > >
> > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > management is implemented.
> > >
> > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > and would work without any issue for L1VH.
> > >
> >
> > No, it won't work and hypervisor deposited pages won't be withdrawn.
>
> All pages that were deposited in the context of a guest partition (i.e.
> with the guest partition ID), would be withdrawn when you kill the VMs,
> right? What other deposited pages would be left?
>
The driver deposits two types of pages: one for the guests (withdrawn
upon guest shutdown) and the other for the host itself (never
withdrawn).
See hv_call_create_partition, for example: it deposits pages for the
host partition.
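As a toy model of the point (a userspace sketch only; the demo_* names and
the ledger structure are hypothetical and do not exist in the driver),
stopping all guests still leaves the host-side deposits outstanding:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy ledger of pages currently held by the hypervisor (hypothetical). */
struct demo_ledger {
	uint64_t host_pages;	/* deposited for the host partition */
	uint64_t guest_pages;	/* deposited for guest partitions */
};

static void demo_deposit_host(struct demo_ledger *l, uint64_t n)
{
	l->host_pages += n;
}

static void demo_deposit_guest(struct demo_ledger *l, uint64_t n)
{
	l->guest_pages += n;
}

/* Guest teardown: pages deposited for guests are withdrawn. */
static void demo_shutdown_guests(struct demo_ledger *l)
{
	l->guest_pages = 0;
}

/* In this model, kexec is only safe if nothing remains deposited. */
static bool demo_kexec_is_safe(const struct demo_ledger *l)
{
	return l->host_pages == 0 && l->guest_pages == 0;
}
```

In this model, demo_kexec_is_safe() stays false after
demo_shutdown_guests() whenever any host deposit was made.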
Thanks,
Stanislav
> Thanks,
> Anirudh.
>
> > Also, kernel consistency must not depend on user space behavior.
> >
> > > Also, I don't think it is reasonable at all that someone needs to
> > > disable basic kernel functionality such as kexec in order to use our
> > > driver.
> > >
> >
> > It's a temporary measure until proper page lifecycle management is
> > supported in the driver.
> > Mutual exclusion of the driver and kexec is a given and thus should be
> > explicitly stated in the Kconfig.
> >
> > Thanks,
> > Stanislav
> >
> > > Thanks,
> > > Anirudh.
> > >
> > > >
> > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > ---
> > > > drivers/hv/Kconfig | 1 +
> > > > 1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > --- a/drivers/hv/Kconfig
> > > > +++ b/drivers/hv/Kconfig
> > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > # e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > # no particular order, making it impossible to reassemble larger pages
> > > > depends on PAGE_SIZE_4KB
> > > > + depends on !KEXEC
> > > > select EVENTFD
> > > > select VIRT_XFER_TO_GUEST_WORK
> > > > select HMM_MIRROR
> > > >
> > > >
* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-01-28 23:08 UTC (permalink / raw)
To: Mukesh R
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <4bcd7b66-6e3b-8f53-b688-ce0272123839@linux.microsoft.com>
On Tue, Jan 27, 2026 at 11:56:02AM -0800, Mukesh R wrote:
> On 1/27/26 09:47, Stanislav Kinsburskii wrote:
> > On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
> > > On 1/26/26 16:21, Stanislav Kinsburskii wrote:
> > > > On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
> > > > > On 1/26/26 12:43, Stanislav Kinsburskii wrote:
> > > > > > On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
> > > > > > > On 1/25/26 14:39, Stanislav Kinsburskii wrote:
> > > > > > > > On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
> > > > > > > > > On 1/23/26 14:20, Stanislav Kinsburskii wrote:
> > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > > hypervisor deposited pages.
> > > > > > > > > >
> > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > management is implemented.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > > > > ---
> > > > > > > > > > drivers/hv/Kconfig | 1 +
> > > > > > > > > > 1 file changed, 1 insertion(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > > > > > --- a/drivers/hv/Kconfig
> > > > > > > > > > +++ b/drivers/hv/Kconfig
> > > > > > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > > > > > > # e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > > > > > > # no particular order, making it impossible to reassemble larger pages
> > > > > > > > > > depends on PAGE_SIZE_4KB
> > > > > > > > > > + depends on !KEXEC
> > > > > > > > > > select EVENTFD
> > > > > > > > > > select VIRT_XFER_TO_GUEST_WORK
> > > > > > > > > > select HMM_MIRROR
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
> > > > > > > > > implying that crash dump might be involved. Or did you test kdump
> > > > > > > > > and it was fine?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, it will. Crash kexec depends on normal kexec functionality, so it
> > > > > > > > will be affected as well.
> > > > > > >
> > > > > > > So not sure I understand the reason for this patch. We can just block
> > > > > > > kexec if there are any VMs running, right? Doing this would mean any
> > > > > > > further development would be without a very important and major feature,
> > > > > > > right?
> > > > > >
> > > > > > This is an option. But until it's implemented and merged, a user of the mshv
> > > > > > driver gets into a situation where kexec is broken in a non-obvious way.
> > > > > > The system may crash at any time after kexec, depending on whether the
> > > > > > new kernel touches the pages deposited to hypervisor or not. This is a
> > > > > > bad user experience.
> > > > >
> > > > > I understand that. But with this we cannot collect core and debug any
> > > > > crashes. I was thinking there would be a quick way to prohibit kexec
> > > > > for update via notifier or some other quick hack. Did you already
> > > > > explore that and didn't find anything, hence this?
> > > > >
> > > >
> > > > This quick hack you mention isn't quick in the upstream kernel as there
> > > > is no hook to interrupt kexec process except the live update one.
> > >
> > > That's the one we want to interrupt and block right? crash kexec
> > > is ok and should be allowed. We can document we don't support kexec
> > > for update for now.
> > >
> > > > I sent an RFC for that one but given today's conversation details it
> > > > won't be accepted as is.
> > >
> > > Are you talking about this?
> > >
> > > "mshv: Add kexec safety for deposited pages"
> > >
> >
> > Yes.
> >
> > > > Making mshv mutually exclusive with kexec is the only viable option for
> > > > now given time constraints.
> > > > It is intended to be replaced with proper page lifecycle management in
> > > > the future.
> > >
> > > Yeah, that could take a long time and imo we cannot just disable KEXEC
> > > completely. What we want is just block kexec for updates from some
> > > mshv file for now, we can print during boot that kexec for updates is
> > > not supported on mshv. Hope that makes sense.
> > >
> >
> > The trade-off here is between disabling kexec support and having the
> > kernel crash after kexec in a non-obvious way. This affects both regular
> > kexec and crash kexec.
>
> crash kexec on baremetal is not affected, hence disabling that
> doesn't make sense as we can't debug crashes then on bm.
>
Bare metal support is not currently relevant, as it is not available.
This is the upstream kernel, and this driver will be accessible to
third-party customers beginning with kernel 6.19 for running their
kernels in Azure L1VH, so consistency is required.
Thanks,
Stanislav
> Let me think and explore a bit, and if I come up with something, I'll
> send a patch here. If nothing, then we can do this as last resort.
>
> Thanks,
> -Mukesh
>
>
> > It's a pity we can't apply a quick hack to disable only regular kexec.
> > However, since crash kexec would hit the same issues, until we have a
> > proper state transition for deposited pages, the best workaround for now
> > is to reset the hypervisor state on every kexec, which needs design,
> > work, and testing.
> >
> > Disabling kexec is the only consistent way to handle this in the
> > upstream kernel at the moment.
> >
> > Thanks, Stanislav
> >
> >
> > > Thanks,
> > > -Mukesh
> > >
> > >
> > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > > Thanks,
> > > > > -Mukesh
> > > > >
> > > > > > Therefore it should be explicitly forbidden as it's essentially not
> > > > > > supported yet.
> > > > > >
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > >
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Stanislav
> > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > -Mukesh
* Re: [PATCH 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Stanislav Kinsburskii @ 2026-01-28 23:03 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260128160437.3342167-3-anirudh@anirudhrb.com>
On Wed, Jan 28, 2026 at 04:04:37PM +0000, Anirudh Rayabharam wrote:
> From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
>
> On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
> interrupts (SINTs) from the hypervisor for doorbells and intercepts.
> There is no such vector reserved for arm64.
>
> On arm64, the INTID for SINTs should be in the SGI or PPI range. The
> hypervisor exposes a virtual device in the ACPI that reserves a
> PPI for this use. Introduce a platform_driver that binds to this ACPI
> device and obtains the interrupt vector that can be used for SINTs.
>
> To better unify x86 and arm64 paths, introduce mshv_sint_irq_init() that
Where is mshv_sint_irq_init?
> either registers the platform_driver and obtains the INTID (arm64) or
> just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).
>
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> ---
> drivers/hv/mshv_root.h | 2 +
> drivers/hv/mshv_root_main.c | 11 ++-
> drivers/hv/mshv_synic.c | 152 ++++++++++++++++++++++++++++++++++--
> 3 files changed, 158 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index c02513f75429..c2d1e8d7452c 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -332,5 +332,7 @@ int mshv_region_get(struct mshv_mem_region *region);
> bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
> void mshv_region_movable_fini(struct mshv_mem_region *region);
> bool mshv_region_movable_init(struct mshv_mem_region *region);
> +int mshv_synic_init(void);
> +void mshv_synic_cleanup(void);
>
> #endif /* _MSHV_ROOT_H_ */
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index abb34b37d552..6c2d4a80dbe3 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -2276,11 +2276,17 @@ static int __init mshv_parent_partition_init(void)
> MSHV_HV_MAX_VERSION);
> }
>
> + ret = mshv_synic_init();
> + if (ret) {
> + dev_err(dev, "Failed to initialize synic: %i\n", ret);
> + goto device_deregister;
> + }
> +
> mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
> if (!mshv_root.synic_pages) {
> dev_err(dev, "Failed to allocate percpu synic page\n");
> ret = -ENOMEM;
> - goto device_deregister;
> + goto synic_cleanup;
> }
Should this become a part of mshv_synic_init()?
>
> ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> @@ -2322,6 +2328,8 @@ static int __init mshv_parent_partition_init(void)
> cpuhp_remove_state(mshv_cpuhp_online);
> free_synic_pages:
> free_percpu(mshv_root.synic_pages);
> +synic_cleanup:
> + mshv_synic_cleanup();
> device_deregister:
> misc_deregister(&mshv_dev);
> return ret;
> @@ -2337,6 +2345,7 @@ static void __exit mshv_parent_partition_exit(void)
> mshv_root_partition_exit();
> cpuhp_remove_state(mshv_cpuhp_online);
> free_percpu(mshv_root.synic_pages);
> + mshv_synic_cleanup();
Please follow the common convention where the cleanup path is the reverse of
the init path.
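For reference, a minimal userspace sketch of that convention. The demo_*
names and the three resources 'a', 'b', 'c' are hypothetical and only
record the order of operations in a trace string; this is not the driver
code.

```c
#include <stddef.h>

/* Trace of acquire ("+x") and release ("-x") steps, for illustration. */
static char demo_trace[64];
static size_t demo_trace_len;

static void demo_record(char op, char id)
{
	if (demo_trace_len + 3 > sizeof(demo_trace))
		return;	/* trace full, drop the entry */
	demo_trace[demo_trace_len++] = op;
	demo_trace[demo_trace_len++] = id;
	demo_trace[demo_trace_len] = '\0';
}

/* Acquire resource 'id'; fail (without acquiring) if id == fail_id. */
static int demo_acquire(char id, char fail_id)
{
	if (id == fail_id)
		return -1;
	demo_record('+', id);
	return 0;
}

static void demo_release(char id)
{
	demo_record('-', id);
}

/* Init acquires a, b, c; on failure it unwinds in reverse order. */
static int demo_init(char fail_id)
{
	int ret;

	demo_trace_len = 0;
	demo_trace[0] = '\0';

	ret = demo_acquire('a', fail_id);
	if (ret)
		return ret;
	ret = demo_acquire('b', fail_id);
	if (ret)
		goto release_a;
	ret = demo_acquire('c', fail_id);
	if (ret)
		goto release_b;
	return 0;

release_b:
	demo_release('b');
release_a:
	demo_release('a');
	return ret;
}

/* Exit releases everything in the reverse order of demo_init(). */
static void demo_exit(void)
{
	demo_release('c');
	demo_release('b');
	demo_release('a');
}
```

The error labels in demo_init() and the body of demo_exit() both walk the
resources backwards, so whatever was set up last is torn down first.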
> }
>
> module_init(mshv_parent_partition_init);
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index ba89655b0910..b7860a75b97e 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -10,13 +10,19 @@
> #include <linux/kernel.h>
> #include <linux/slab.h>
> #include <linux/mm.h>
> +#include <linux/interrupt.h>
> #include <linux/io.h>
> #include <linux/random.h>
> #include <asm/mshyperv.h>
> +#include <linux/platform_device.h>
> +#include <linux/acpi.h>
>
> #include "mshv_eventfd.h"
> #include "mshv.h"
>
> +static int mshv_interrupt = -1;
The name is a bit too short. What about mshv_callback_vector or
mshv_irq_vector?
> +static int mshv_irq = -1;
> +
Should this be a part of the mshv_root structure?
> static u32 synic_event_ring_get_queued_port(u32 sint_index)
> {
> struct hv_synic_event_ring_page **event_ring_page;
> @@ -446,14 +452,144 @@ void mshv_isr(void)
> }
> }
>
> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> +#ifdef CONFIG_ACPI
> +static long __percpu *mshv_evt;
> +
> +static acpi_status mshv_walk_resources(struct acpi_resource *res, void *ctx)
> +{
> + struct resource r;
> +
> + switch (res->type) {
> + case ACPI_RESOURCE_TYPE_EXTENDED_IRQ:
> + if (!acpi_dev_resource_interrupt(res, 0, &r)) {
> + pr_err("Unable to parse MSHV ACPI interrupt\n");
> + return AE_ERROR;
> + }
> + /* ARM64 INTID */
> + mshv_interrupt = res->data.extended_irq.interrupts[0];
> + /* Linux IRQ number */
> + mshv_irq = r.start;
> + pr_info("MSHV SINT INTID %d, IRQ %d\n",
> + mshv_interrupt, mshv_irq);
> + return AE_OK;
> + default:
> + /* Unused resource type */
> + return AE_OK;
> + }
> +
> + return AE_OK;
> +}
> +
> +static irqreturn_t mshv_percpu_isr(int irq, void *dev_id)
> +{
> + mshv_isr();
> + add_interrupt_randomness(irq);
> + return IRQ_HANDLED;
> +}
> +
> +static int mshv_sint_probe(struct platform_device *pdev)
> +{
> + acpi_status result;
> + int ret = 0;
> + struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
> +
> + result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
> + mshv_walk_resources, NULL);
> +
> + if (ACPI_FAILURE(result)) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> + mshv_evt = alloc_percpu(long);
> + if (!mshv_evt) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ret = request_percpu_irq(mshv_irq, mshv_percpu_isr, "MSHV", mshv_evt);
> +out:
> + return ret;
> +}
> +
> +static void mshv_sint_remove(struct platform_device *pdev)
> +{
> + free_percpu_irq(mshv_irq, mshv_evt);
> + free_percpu(mshv_evt);
> +}
> +#else
> +static int mshv_sint_probe(struct platform_device *pdev)
> +{
> + return -ENODEV;
> +}
> +
> +static void mshv_sint_remove(struct platform_device *pdev)
> +{
> + return;
> +}
> +#endif
> +
Is this all x86-compatible?
The commit message says it's introduced for arm64.
If it's incompatible, please, wrap it into #ifdefs and compile out for
x86_64.
> +
> +static const __maybe_unused struct acpi_device_id mshv_sint_device_ids[] = {
> + {"MSFT1003", 0},
> + {"", 0},
> +};
> +
> +static struct platform_driver mshv_sint_drv = {
> + .probe = mshv_sint_probe,
> + .remove = mshv_sint_remove,
> + .driver = {
> + .name = "mshv_sint",
> + .acpi_match_table = ACPI_PTR(mshv_sint_device_ids),
> + .probe_type = PROBE_FORCE_SYNCHRONOUS,
> + },
> +};
> +#endif /* HYPERVISOR_CALLBACK_VECTOR */
> +
> +int mshv_synic_init(void)
> +{
> +#ifdef HYPERVISOR_CALLBACK_VECTOR
> + mshv_interrupt = HYPERVISOR_CALLBACK_VECTOR;
> + mshv_irq = -1;
> + return 0;
> +#else
> + int ret;
> +
> + if (acpi_disabled)
> + return -ENODEV;
> +
> + ret = platform_driver_register(&mshv_sint_drv);
> + if (ret)
> + return ret;
> +
> + if (mshv_interrupt == -1 || mshv_irq == -1) {
> + ret = -ENODEV;
> + goto out_unregister;
> + }
> +
> + return 0;
> +
> +out_unregister:
> + platform_driver_unregister(&mshv_sint_drv);
> + return ret;
> +#endif
> +}
> +
> +void mshv_synic_cleanup(void)
> +{
> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> + if (!acpi_disabled)
> + platform_driver_unregister(&mshv_sint_drv);
> +#endif
> +}
> +
> int mshv_synic_cpu_init(unsigned int cpu)
> {
> union hv_synic_simp simp;
> union hv_synic_siefp siefp;
> union hv_synic_sirbp sirbp;
> -#ifdef HYPERVISOR_CALLBACK_VECTOR
> union hv_synic_sint sint;
> -#endif
> union hv_synic_scontrol sctrl;
> struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> @@ -496,10 +632,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
>
> hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
>
> -#ifdef HYPERVISOR_CALLBACK_VECTOR
> + if (mshv_irq != -1)
> + enable_percpu_irq(mshv_irq, 0);
> +
It's better to explicitly separate x86 and arm64 paths with #ifdefs.
For example:
#ifdef CONFIG_X86_64
int setup_cpu_sint() {
/* Enable intercepts */
sint.as_uint64 = 0;
sint.vector = HYPERVISOR_CALLBACK_VECTOR;
....
}
#endif
#ifdef CONFIG_ARM64
int setup_cpu_sint() {
enable_percpu_irq(mshv_irq, 0);
/* Enable intercepts */
sint.as_uint64 = 0;
sint.vector = mshv_interrupt;
....
}
#endif
Thanks,
Stanislav
> /* Enable intercepts */
> sint.as_uint64 = 0;
> - sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> + sint.vector = mshv_interrupt;
> sint.masked = false;
> sint.auto_eoi = hv_recommend_using_aeoi();
> hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> @@ -507,13 +645,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
>
> /* Doorbell SINT */
> sint.as_uint64 = 0;
> - sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> + sint.vector = mshv_interrupt;
> sint.masked = false;
> sint.as_intercept = 1;
> sint.auto_eoi = hv_recommend_using_aeoi();
> hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> sint.as_uint64);
> -#endif
>
> /* Enable global synic bit */
> sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> @@ -568,6 +705,9 @@ int mshv_synic_cpu_exit(unsigned int cpu)
> hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> sint.as_uint64);
>
> + if (mshv_irq != -1)
> + disable_percpu_irq(mshv_irq);
> +
> /* Disable Synic's event ring page */
> sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> sirbp.sirbp_enabled = false;
> --
> 2.34.1
>
* RE: [PATCH v6 0/7] mshv: Debugfs interface for mshv_root
From: Michael Kelley @ 2026-01-28 19:20 UTC (permalink / raw)
To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, skinsburskii@linux.microsoft.com
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
prapal@linux.microsoft.com, mrathor@linux.microsoft.com,
paekkaladevi@linux.microsoft.com
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>
From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, January 28, 2026 10:12 AM
>
> Expose hypervisor, logical processor, partition, and virtual processor
> statistics via debugfs. These are provided by mapping 'stats' pages via
> hypercall.
>
> Patch #1: Update hv_call_map_stats_page() to return success when
> HV_STATS_AREA_PARENT is unavailable, which is the case on some
> hypervisor versions, where it can fall back to HV_STATS_AREA_SELF
> Patch #2: Use struct hv_stats_page pointers instead of void *
> Patch #3: Make mshv_vp_stats_map/unmap() more flexible to use with debugfs
> code
> Patch #4: Always map vp stats page regardless of scheduler, to reuse in
> debugfs
> Patch #5: Change to hv_stats_page definition and
> VpRootDispatchThreadBlocked
> Patch #6: Introduce the definitions needed for the various stats pages
> Patch #7: Add mshv_debugfs.c, and integrate it with the mshv_root driver to
> expose the partition and VP stats.
>
> ---
> Changes in v6:
> - Fix whitespace and other checkpatch issues [Michael]
>
> Changes in v5:
> - Rename hv_counters.c to mshv_debugfs_counters.c [Michael]
> - Clarify unusual inclusion of mshv_debugfs_counters.c with comment. After
> discussion it is still included directly to keep things simple. Including
> arrays with unspecified size via a header means sizeof() cannot be used on
> the array.
> - Error if mshv_debugfs_counters.c is included elsewhere than mshv_debugfs.c
> - Use array index as stats page index to save space [Stanislav]
> - Enforce HV_STATS_AREA_PARENT and SELF fit in NUM_STATS_AREAS with
> static_assert and clarify with comment [Michael]
> - Return to using lp count from hv stats page for mshv_lps_count [Michael]
> - Use nr_cpu_ids instead of num_possible_cpus() [Michael]
> - Set mshv_lps_stats[idx] and the array itself to NULL on unmap and cleanup
> [Michael]
> - Rename HvLogicalProcessors and VpRootDispatchThreadBlocked to Linux style
> - Translate Linux cpu index to vp index via hv_vp_index on partition destroy
> [Michael]
> - Minor formatting cleanups [Michael]
>
> Changes in v4:
> - Put the counters definitions in static arrays in hv_counters.c, instead of
> as enums in hvhdk.h [Michael]
> - Due to the above, add an additional patch (#5) to simplify hv_stats_page,
> and retain the enum definition at the top of mshv_root_main.c for use with
> VpRootDispatchThreadBlocked. That is the only remaining use of the counter
> enum.
> - Due to the above, use num_present_cpus() as the number of LPs to map stats
> pages for - this number shouldn't change at runtime because the hypervisor
> doesn't support hotplug for root partition.
>
> Changes in v3:
> - Add 3 small refactor/cleanup patches (patches 2,3,4) from Stanislav. These
> simplify some of the debugfs code, and fix issues with mapping VP stats on
> L1VH.
> - Fix cleanup of parent stats dentries on module removal (via squashing some
> internal patches into patch #6) [Praveen]
> - Remove unused goto label [Stanislav, kernel bot]
> - Use struct hv_stats_page * instead of void * in mshv_debugfs.c [Stanislav]
> - Remove some redundant variables [Stanislav]
> - Rename debugfs dentry fields for brevity [Stanislav]
> - Use ERR_CAST() for the dentry error pointer returned from
> lp_debugfs_stats_create() [Stanislav]
> - Fix leak of pages allocated for lp stats mappings by storing them in an array
> [Michael]
> - Add comments to clarify PARENT vs SELF usage and edge cases [Michael]
> - Add VpLoadAvg for x86 and print the stat [Michael]
> - Add NUM_STATS_AREAS for array sizing in mshv_debugfs.c [Michael]
>
> Changes in v2:
> - Remove unnecessary pr_debug_once() in patch 1 [Stanislav Kinsburskii]
> - CONFIG_X86 -> CONFIG_X86_64 in patch 2 [Stanislav Kinsburskii]
>
> ---
> Nuno Das Neves (3):
> mshv: Update hv_stats_page definitions
> mshv: Add data for printing stats page counters
> mshv: Add debugfs to view hypervisor statistics
>
> Purna Pavan Chandra Aekkaladevi (1):
> mshv: Ignore second stats page map result failure
>
> Stanislav Kinsburskii (3):
> mshv: Use typed hv_stats_page pointers
> mshv: Improve mshv_vp_stats_map/unmap(), add them to mshv_root.h
> mshv: Always map child vp stats pages regardless of scheduler type
>
> drivers/hv/Makefile | 1 +
> drivers/hv/mshv_debugfs.c | 726 +++++++++++++++++++++++++++++
> drivers/hv/mshv_debugfs_counters.c | 490 +++++++++++++++++++
> drivers/hv/mshv_root.h | 49 +-
> drivers/hv/mshv_root_hv_call.c | 64 ++-
> drivers/hv/mshv_root_main.c | 140 +++---
> include/hyperv/hvhdk.h | 7 +
> 7 files changed, 1412 insertions(+), 65 deletions(-)
> create mode 100644 drivers/hv/mshv_debugfs.c
> create mode 100644 drivers/hv/mshv_debugfs_counters.c
>
> --
> 2.34.1
Everything looks good to me.
For the entire series,
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
* [PATCH v6 7/7] mshv: Add debugfs to view hypervisor statistics
From: Nuno Das Neves @ 2026-01-28 18:11 UTC
To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
paekkaladevi, Nuno Das Neves, Jinank Jain
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>
Introduce a debugfs interface to expose root and child partition stats
when running with mshv_root.
Create a debugfs directory "mshv" containing 'stats' files organized by
type and id. The set of counters in a stats file depends on its type.
For example, an excerpt from a VP stats file:
TotalRunTime : 1997602722
HypervisorRunTime : 649671371
RemoteNodeRunTime : 0
NormalizedRunTime : 1997602721
IdealCpu : 0
HypercallsCount : 1708169
HypercallsTime : 111914774
PageInvalidationsCount : 0
PageInvalidationsTime : 0
On a root partition with some active child partitions, the entire
directory structure may look like:
mshv/
stats # hypervisor stats
lp/ # logical processors
0/ # LP id
stats # LP 0 stats
1/
2/
3/
partition/ # partition stats
1/ # root partition id
stats # root partition stats
vp/ # root virtual processors
0/ # root VP id
stats # root VP 0 stats
1/
2/
3/
42/ # child partition id
stats # child partition stats
vp/ # child VPs
0/ # child VP id
stats # child VP 0 stats
1/
43/
55/
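As a rough illustration of how a user-space tool might consume this layout, the sketch below builds the path of a VP stats file and parses one "Name : value" line. It assumes debugfs is mounted at /sys/kernel/debug (the location named in the header comment of mshv_debugfs.c). The helper names mshv_vp_stats_path() and mshv_parse_stat_line() are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the debugfs path of a VP stats file,
 * following the directory tree shown above. Assumes debugfs is mounted
 * at /sys/kernel/debug. Returns 0 on success, -1 if buf is too small.
 */
static int mshv_vp_stats_path(char *buf, size_t len,
			      unsigned long long partition_id,
			      unsigned int vp_index)
{
	int n = snprintf(buf, len,
			 "/sys/kernel/debug/mshv/partition/%llu/vp/%u/stats",
			 partition_id, vp_index);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: parse one "Name : value" line from a stats file.
 * Padding between the name and the colon is tolerated.
 * Returns 0 on success, -1 if the line does not match.
 */
static int mshv_parse_stat_line(const char *line, char *name, size_t name_len,
				unsigned long long *value)
{
	char tmp[64];

	if (sscanf(line, " %63[^ :\t] : %llu", tmp, value) != 2)
		return -1;
	if (strlen(tmp) >= name_len)
		return -1;
	strcpy(name, tmp);
	return 0;
}
```

A caller would open the path returned by mshv_vp_stats_path() with fopen(), read it line by line with fgets(), and pass each line to mshv_parse_stat_line(). The files are created with mode 0400, so reading them requires root.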
On L1VH, some stats are not present as it does not own the hardware
like the root partition does:
- The hypervisor and lp stats are not present
- L1VH's partition directory is named "self" because it can't get its
own id
- Some of L1VH's partition and VP stats fields are not populated, because
it can't map its own HV_STATS_AREA_PARENT page.
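The SELF/PARENT handling described above can be sketched in plain C. This is a simplified user-space model, not the driver code: struct stats_page, stats_alias_parent() and stats_pick() are hypothetical stand-ins. The selection rule (prefer the PARENT value, fall back to SELF when PARENT is 0) and the aliasing of PARENT to SELF when no PARENT page is available are taken from the patch below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for HV_STATS_AREA_SELF / HV_STATS_AREA_PARENT. */
enum { STATS_AREA_SELF = 0, STATS_AREA_PARENT = 1, NUM_AREAS = 2 };

/* Simplified model of a stats page: just an array of counters. */
struct stats_page {
	uint64_t data[4];
};

/*
 * If no PARENT page could be mapped (as for L1VH's own partition),
 * point the PARENT slot at the SELF page so readers need no special case.
 */
static void stats_alias_parent(const struct stats_page **areas)
{
	if (!areas[STATS_AREA_PARENT])
		areas[STATS_AREA_PARENT] = areas[STATS_AREA_SELF];
}

/* Prefer the PARENT value; fall back to SELF when PARENT reads 0. */
static uint64_t stats_pick(const struct stats_page **areas, size_t idx)
{
	uint64_t parent_val = areas[STATS_AREA_PARENT]->data[idx];
	uint64_t self_val = areas[STATS_AREA_SELF]->data[idx];

	return parent_val ? parent_val : self_val;
}
```

With the alias in place, a counter that exists only in the PARENT area simply reads as whatever the SELF page holds (typically 0), which matches the "not populated" behaviour described for L1VH.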
Co-developed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Co-developed-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Co-developed-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
Co-developed-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/Makefile | 1 +
drivers/hv/mshv_debugfs.c | 726 ++++++++++++++++++++++++++++++++++++
drivers/hv/mshv_root.h | 34 ++
drivers/hv/mshv_root_main.c | 26 +-
4 files changed, 785 insertions(+), 2 deletions(-)
create mode 100644 drivers/hv/mshv_debugfs.c
diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index a49f93c2d245..2593711c3628 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -15,6 +15,7 @@ hv_vmbus-$(CONFIG_HYPERV_TESTING) += hv_debugfs.o
hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
+mshv_root-$(CONFIG_DEBUG_FS) += mshv_debugfs.o
mshv_vtl-y := mshv_vtl_main.o
# Code that must be built-in
diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
new file mode 100644
index 000000000000..ebf2549eb44d
--- /dev/null
+++ b/drivers/hv/mshv_debugfs.c
@@ -0,0 +1,726 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * The /sys/kernel/debug/mshv directory contents.
+ * Contains various statistics data, provided by the hypervisor.
+ *
+ * Authors: Microsoft Linux virtualization team
+ */
+
+#include <linux/debugfs.h>
+#include <linux/stringify.h>
+#include <asm/mshyperv.h>
+#include <linux/slab.h>
+
+#include "mshv.h"
+#include "mshv_root.h"
+
+/* Ensure this file is not used elsewhere by accident */
+#define MSHV_DEBUGFS_C
+#include "mshv_debugfs_counters.c"
+
+#define U32_BUF_SZ 11
+#define U64_BUF_SZ 21
+/* Only support SELF and PARENT areas */
+#define NUM_STATS_AREAS 2
+static_assert(HV_STATS_AREA_SELF == 0 && HV_STATS_AREA_PARENT == 1,
+ "SELF and PARENT areas must be usable as indices into an array of size NUM_STATS_AREAS");
+/* HV_HYPERVISOR_COUNTER */
+#define HV_HYPERVISOR_COUNTER_LOGICAL_PROCESSORS 1
+
+static struct dentry *mshv_debugfs;
+static struct dentry *mshv_debugfs_partition;
+static struct dentry *mshv_debugfs_lp;
+static struct dentry **parent_vp_stats;
+static struct dentry *parent_partition_stats;
+
+static u64 mshv_lps_count;
+static struct hv_stats_page **mshv_lps_stats;
+
+static int lp_stats_show(struct seq_file *m, void *v)
+{
+ const struct hv_stats_page *stats = m->private;
+ int idx;
+
+ for (idx = 0; idx < ARRAY_SIZE(hv_lp_counters); idx++) {
+ char *name = hv_lp_counters[idx];
+
+ if (!name)
+ continue;
+ seq_printf(m, "%-32s: %llu\n", name, stats->data[idx]);
+ }
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(lp_stats);
+
+static void mshv_lp_stats_unmap(u32 lp_index)
+{
+ union hv_stats_object_identity identity = {
+ .lp.lp_index = lp_index,
+ .lp.stats_area_type = HV_STATS_AREA_SELF,
+ };
+ int err;
+
+ err = hv_unmap_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR,
+ mshv_lps_stats[lp_index], &identity);
+ if (err)
+ pr_err("%s: failed to unmap logical processor %u stats, err: %d\n",
+ __func__, lp_index, err);
+
+ mshv_lps_stats[lp_index] = NULL;
+}
+
+static struct hv_stats_page * __init mshv_lp_stats_map(u32 lp_index)
+{
+ union hv_stats_object_identity identity = {
+ .lp.lp_index = lp_index,
+ .lp.stats_area_type = HV_STATS_AREA_SELF,
+ };
+ struct hv_stats_page *stats;
+ int err;
+
+ err = hv_map_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR, &identity,
+ &stats);
+ if (err) {
+ pr_err("%s: failed to map logical processor %u stats, err: %d\n",
+ __func__, lp_index, err);
+ return ERR_PTR(err);
+ }
+ mshv_lps_stats[lp_index] = stats;
+
+ return stats;
+}
+
+static struct hv_stats_page * __init lp_debugfs_stats_create(u32 lp_index,
+ struct dentry *parent)
+{
+ struct dentry *dentry;
+ struct hv_stats_page *stats;
+
+ stats = mshv_lp_stats_map(lp_index);
+ if (IS_ERR(stats))
+ return stats;
+
+ dentry = debugfs_create_file("stats", 0400, parent,
+ stats, &lp_stats_fops);
+ if (IS_ERR(dentry)) {
+ mshv_lp_stats_unmap(lp_index);
+ return ERR_CAST(dentry);
+ }
+ return stats;
+}
+
+static int __init lp_debugfs_create(u32 lp_index, struct dentry *parent)
+{
+ struct dentry *idx;
+ char lp_idx_str[U32_BUF_SZ];
+ struct hv_stats_page *stats;
+ int err;
+
+ sprintf(lp_idx_str, "%u", lp_index);
+
+ idx = debugfs_create_dir(lp_idx_str, parent);
+ if (IS_ERR(idx))
+ return PTR_ERR(idx);
+
+ stats = lp_debugfs_stats_create(lp_index, idx);
+ if (IS_ERR(stats)) {
+ err = PTR_ERR(stats);
+ goto remove_debugfs_lp_idx;
+ }
+
+ return 0;
+
+remove_debugfs_lp_idx:
+ debugfs_remove_recursive(idx);
+ return err;
+}
+
+static void mshv_debugfs_lp_remove(void)
+{
+ int lp_index;
+
+ debugfs_remove_recursive(mshv_debugfs_lp);
+
+ for (lp_index = 0; lp_index < mshv_lps_count; lp_index++)
+ mshv_lp_stats_unmap(lp_index);
+
+ kfree(mshv_lps_stats);
+ mshv_lps_stats = NULL;
+}
+
+static int __init mshv_debugfs_lp_create(struct dentry *parent)
+{
+ struct dentry *lp_dir;
+ int err, lp_index;
+
+ mshv_lps_stats = kcalloc(mshv_lps_count,
+ sizeof(*mshv_lps_stats),
+ GFP_KERNEL_ACCOUNT);
+
+ if (!mshv_lps_stats)
+ return -ENOMEM;
+
+ lp_dir = debugfs_create_dir("lp", parent);
+ if (IS_ERR(lp_dir)) {
+ err = PTR_ERR(lp_dir);
+ goto free_lp_stats;
+ }
+
+ for (lp_index = 0; lp_index < mshv_lps_count; lp_index++) {
+ err = lp_debugfs_create(lp_index, lp_dir);
+ if (err)
+ goto remove_debugfs_lps;
+ }
+
+ mshv_debugfs_lp = lp_dir;
+
+ return 0;
+
+remove_debugfs_lps:
+ for (lp_index -= 1; lp_index >= 0; lp_index--)
+ mshv_lp_stats_unmap(lp_index);
+ debugfs_remove_recursive(lp_dir);
+free_lp_stats:
+ kfree(mshv_lps_stats);
+ mshv_lps_stats = NULL;
+
+ return err;
+}
+
+static int vp_stats_show(struct seq_file *m, void *v)
+{
+ const struct hv_stats_page **pstats = m->private;
+ u64 parent_val, self_val;
+ int idx;
+
+ /*
+ * For VP and partition stats, there may be two stats areas mapped,
+ * SELF and PARENT. These refer to the privilege level of the data in
+ * each page. Some fields may be 0 in SELF and nonzero in PARENT, or
+ * vice versa.
+ *
+ * Hence, prioritize printing from the PARENT page (more privileged
+ * data), but use the value from the SELF page if the PARENT value is
+ * 0.
+ */
+
+ for (idx = 0; idx < ARRAY_SIZE(hv_vp_counters); idx++) {
+ char *name = hv_vp_counters[idx];
+
+ if (!name)
+ continue;
+
+ parent_val = pstats[HV_STATS_AREA_PARENT]->data[idx];
+ self_val = pstats[HV_STATS_AREA_SELF]->data[idx];
+ seq_printf(m, "%-43s: %llu\n", name,
+ parent_val ? parent_val : self_val);
+ }
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(vp_stats);
+
+static void vp_debugfs_remove(struct dentry *vp_stats)
+{
+ debugfs_remove_recursive(vp_stats->d_parent);
+}
+
+static int vp_debugfs_create(u64 partition_id, u32 vp_index,
+ struct hv_stats_page **pstats,
+ struct dentry **vp_stats_ptr,
+ struct dentry *parent)
+{
+ struct dentry *vp_idx_dir, *d;
+ char vp_idx_str[U32_BUF_SZ];
+ int err;
+
+ sprintf(vp_idx_str, "%u", vp_index);
+
+ vp_idx_dir = debugfs_create_dir(vp_idx_str, parent);
+ if (IS_ERR(vp_idx_dir))
+ return PTR_ERR(vp_idx_dir);
+
+ d = debugfs_create_file("stats", 0400, vp_idx_dir,
+ pstats, &vp_stats_fops);
+ if (IS_ERR(d)) {
+ err = PTR_ERR(d);
+ goto remove_debugfs_vp_idx;
+ }
+
+ *vp_stats_ptr = d;
+
+ return 0;
+
+remove_debugfs_vp_idx:
+ debugfs_remove_recursive(vp_idx_dir);
+ return err;
+}
+
+static int partition_stats_show(struct seq_file *m, void *v)
+{
+ const struct hv_stats_page **pstats = m->private;
+ u64 parent_val, self_val;
+ int idx;
+
+ for (idx = 0; idx < ARRAY_SIZE(hv_partition_counters); idx++) {
+ char *name = hv_partition_counters[idx];
+
+ if (!name)
+ continue;
+
+ parent_val = pstats[HV_STATS_AREA_PARENT]->data[idx];
+ self_val = pstats[HV_STATS_AREA_SELF]->data[idx];
+ seq_printf(m, "%-37s: %llu\n", name,
+ parent_val ? parent_val : self_val);
+ }
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(partition_stats);
+
+static void mshv_partition_stats_unmap(u64 partition_id,
+ struct hv_stats_page *stats_page,
+ enum hv_stats_area_type stats_area_type)
+{
+ union hv_stats_object_identity identity = {
+ .partition.partition_id = partition_id,
+ .partition.stats_area_type = stats_area_type,
+ };
+ int err;
+
+ err = hv_unmap_stats_page(HV_STATS_OBJECT_PARTITION, stats_page,
+ &identity);
+ if (err)
+ pr_err("%s: failed to unmap partition %lld %s stats, err: %d\n",
+ __func__, partition_id,
+ (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
+ err);
+}
+
+static struct hv_stats_page *mshv_partition_stats_map(u64 partition_id,
+ enum hv_stats_area_type stats_area_type)
+{
+ union hv_stats_object_identity identity = {
+ .partition.partition_id = partition_id,
+ .partition.stats_area_type = stats_area_type,
+ };
+ struct hv_stats_page *stats;
+ int err;
+
+ err = hv_map_stats_page(HV_STATS_OBJECT_PARTITION, &identity, &stats);
+ if (err) {
+ pr_err("%s: failed to map partition %lld %s stats, err: %d\n",
+ __func__, partition_id,
+ (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
+ err);
+ return ERR_PTR(err);
+ }
+ return stats;
+}
+
+static int mshv_debugfs_partition_stats_create(u64 partition_id,
+ struct dentry **partition_stats_ptr,
+ struct dentry *parent)
+{
+ struct dentry *dentry;
+ struct hv_stats_page **pstats;
+ int err;
+
+ pstats = kcalloc(NUM_STATS_AREAS, sizeof(struct hv_stats_page *),
+ GFP_KERNEL_ACCOUNT);
+ if (!pstats)
+ return -ENOMEM;
+
+ pstats[HV_STATS_AREA_SELF] = mshv_partition_stats_map(partition_id,
+ HV_STATS_AREA_SELF);
+ if (IS_ERR(pstats[HV_STATS_AREA_SELF])) {
+ err = PTR_ERR(pstats[HV_STATS_AREA_SELF]);
+ goto cleanup;
+ }
+
+ /*
+ * L1VH partition cannot access its partition stats in parent area.
+ */
+ if (is_l1vh_parent(partition_id)) {
+ pstats[HV_STATS_AREA_PARENT] = pstats[HV_STATS_AREA_SELF];
+ } else {
+ pstats[HV_STATS_AREA_PARENT] = mshv_partition_stats_map(partition_id,
+ HV_STATS_AREA_PARENT);
+ if (IS_ERR(pstats[HV_STATS_AREA_PARENT])) {
+ err = PTR_ERR(pstats[HV_STATS_AREA_PARENT]);
+ goto unmap_self;
+ }
+ if (!pstats[HV_STATS_AREA_PARENT])
+ pstats[HV_STATS_AREA_PARENT] = pstats[HV_STATS_AREA_SELF];
+ }
+
+ dentry = debugfs_create_file("stats", 0400, parent,
+ pstats, &partition_stats_fops);
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
+ goto unmap_partition_stats;
+ }
+
+ *partition_stats_ptr = dentry;
+ return 0;
+
+unmap_partition_stats:
+ if (pstats[HV_STATS_AREA_PARENT] != pstats[HV_STATS_AREA_SELF])
+ mshv_partition_stats_unmap(partition_id, pstats[HV_STATS_AREA_PARENT],
+ HV_STATS_AREA_PARENT);
+unmap_self:
+ mshv_partition_stats_unmap(partition_id, pstats[HV_STATS_AREA_SELF],
+ HV_STATS_AREA_SELF);
+cleanup:
+ kfree(pstats);
+ return err;
+}
+
+static void partition_debugfs_remove(u64 partition_id, struct dentry *dentry)
+{
+ struct hv_stats_page **pstats = NULL;
+
+ pstats = dentry->d_inode->i_private;
+
+ debugfs_remove_recursive(dentry->d_parent);
+
+ if (pstats[HV_STATS_AREA_PARENT] != pstats[HV_STATS_AREA_SELF]) {
+ mshv_partition_stats_unmap(partition_id,
+ pstats[HV_STATS_AREA_PARENT],
+ HV_STATS_AREA_PARENT);
+ }
+
+ mshv_partition_stats_unmap(partition_id,
+ pstats[HV_STATS_AREA_SELF],
+ HV_STATS_AREA_SELF);
+
+ kfree(pstats);
+}
+
+static int partition_debugfs_create(u64 partition_id,
+ struct dentry **vp_dir_ptr,
+ struct dentry **partition_stats_ptr,
+ struct dentry *parent)
+{
+ char part_id_str[U64_BUF_SZ];
+ struct dentry *part_id_dir, *vp_dir;
+ int err;
+
+ if (is_l1vh_parent(partition_id))
+ sprintf(part_id_str, "self");
+ else
+ sprintf(part_id_str, "%llu", partition_id);
+
+ part_id_dir = debugfs_create_dir(part_id_str, parent);
+ if (IS_ERR(part_id_dir))
+ return PTR_ERR(part_id_dir);
+
+ vp_dir = debugfs_create_dir("vp", part_id_dir);
+ if (IS_ERR(vp_dir)) {
+ err = PTR_ERR(vp_dir);
+ goto remove_debugfs_partition_id;
+ }
+
+ err = mshv_debugfs_partition_stats_create(partition_id,
+ partition_stats_ptr,
+ part_id_dir);
+ if (err)
+ goto remove_debugfs_partition_id;
+
+ *vp_dir_ptr = vp_dir;
+
+ return 0;
+
+remove_debugfs_partition_id:
+ debugfs_remove_recursive(part_id_dir);
+ return err;
+}
+
+static void parent_vp_debugfs_remove(u32 vp_index,
+ struct dentry *vp_stats_ptr)
+{
+ struct hv_stats_page **pstats;
+
+ pstats = vp_stats_ptr->d_inode->i_private;
+ vp_debugfs_remove(vp_stats_ptr);
+ mshv_vp_stats_unmap(hv_current_partition_id, vp_index, pstats);
+ kfree(pstats);
+}
+
+static void mshv_debugfs_parent_partition_remove(void)
+{
+ int idx;
+
+ for_each_online_cpu(idx)
+ parent_vp_debugfs_remove(hv_vp_index[idx],
+ parent_vp_stats[idx]);
+
+ partition_debugfs_remove(hv_current_partition_id,
+ parent_partition_stats);
+ kfree(parent_vp_stats);
+ parent_vp_stats = NULL;
+ parent_partition_stats = NULL;
+}
+
+static int __init parent_vp_debugfs_create(u32 vp_index,
+ struct dentry **vp_stats_ptr,
+ struct dentry *parent)
+{
+ struct hv_stats_page **pstats;
+ int err;
+
+ pstats = kcalloc(NUM_STATS_AREAS, sizeof(struct hv_stats_page *),
+ GFP_KERNEL_ACCOUNT);
+ if (!pstats)
+ return -ENOMEM;
+
+ err = mshv_vp_stats_map(hv_current_partition_id, vp_index, pstats);
+ if (err)
+ goto cleanup;
+
+ err = vp_debugfs_create(hv_current_partition_id, vp_index, pstats,
+ vp_stats_ptr, parent);
+ if (err)
+ goto unmap_vp_stats;
+
+ return 0;
+
+unmap_vp_stats:
+ mshv_vp_stats_unmap(hv_current_partition_id, vp_index, pstats);
+cleanup:
+ kfree(pstats);
+ return err;
+}
+
+static int __init mshv_debugfs_parent_partition_create(void)
+{
+ struct dentry *vp_dir;
+ int err, idx, i;
+
+ mshv_debugfs_partition = debugfs_create_dir("partition",
+ mshv_debugfs);
+ if (IS_ERR(mshv_debugfs_partition))
+ return PTR_ERR(mshv_debugfs_partition);
+
+ err = partition_debugfs_create(hv_current_partition_id,
+ &vp_dir,
+ &parent_partition_stats,
+ mshv_debugfs_partition);
+ if (err)
+ goto remove_debugfs_partition;
+
+ parent_vp_stats = kcalloc(nr_cpu_ids, sizeof(*parent_vp_stats),
+ GFP_KERNEL);
+ if (!parent_vp_stats) {
+ err = -ENOMEM;
+ goto remove_debugfs_partition;
+ }
+
+ for_each_online_cpu(idx) {
+ err = parent_vp_debugfs_create(hv_vp_index[idx],
+ &parent_vp_stats[idx],
+ vp_dir);
+ if (err)
+ goto remove_debugfs_partition_vp;
+ }
+
+ return 0;
+
+remove_debugfs_partition_vp:
+ for_each_online_cpu(i) {
+ if (i >= idx)
+ break;
+ parent_vp_debugfs_remove(hv_vp_index[i], parent_vp_stats[i]);
+ }
+ partition_debugfs_remove(hv_current_partition_id,
+ parent_partition_stats);
+
+ kfree(parent_vp_stats);
+ parent_vp_stats = NULL;
+ parent_partition_stats = NULL;
+
+remove_debugfs_partition:
+ debugfs_remove_recursive(mshv_debugfs_partition);
+ mshv_debugfs_partition = NULL;
+ return err;
+}
+
+static int hv_stats_show(struct seq_file *m, void *v)
+{
+ const struct hv_stats_page *stats = m->private;
+ int idx;
+
+ for (idx = 0; idx < ARRAY_SIZE(hv_hypervisor_counters); idx++) {
+ char *name = hv_hypervisor_counters[idx];
+
+ if (!name)
+ continue;
+ seq_printf(m, "%-27s: %llu\n", name, stats->data[idx]);
+ }
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(hv_stats);
+
+static void mshv_hv_stats_unmap(void)
+{
+ union hv_stats_object_identity identity = {
+ .hv.stats_area_type = HV_STATS_AREA_SELF,
+ };
+ int err;
+
+ err = hv_unmap_stats_page(HV_STATS_OBJECT_HYPERVISOR, NULL, &identity);
+ if (err)
+ pr_err("%s: failed to unmap hypervisor stats: %d\n",
+ __func__, err);
+}
+
+static void * __init mshv_hv_stats_map(void)
+{
+ union hv_stats_object_identity identity = {
+ .hv.stats_area_type = HV_STATS_AREA_SELF,
+ };
+ struct hv_stats_page *stats;
+ int err;
+
+ err = hv_map_stats_page(HV_STATS_OBJECT_HYPERVISOR, &identity, &stats);
+ if (err) {
+ pr_err("%s: failed to map hypervisor stats: %d\n",
+ __func__, err);
+ return ERR_PTR(err);
+ }
+ return stats;
+}
+
+static int __init mshv_debugfs_hv_stats_create(struct dentry *parent)
+{
+ struct dentry *dentry;
+ u64 *stats;
+ int err;
+
+ stats = mshv_hv_stats_map();
+ if (IS_ERR(stats))
+ return PTR_ERR(stats);
+
+ dentry = debugfs_create_file("stats", 0400, parent,
+ stats, &hv_stats_fops);
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
+ pr_err("%s: failed to create hypervisor stats dentry: %d\n",
+ __func__, err);
+ goto unmap_hv_stats;
+ }
+
+ mshv_lps_count = stats[HV_HYPERVISOR_COUNTER_LOGICAL_PROCESSORS];
+
+ return 0;
+
+unmap_hv_stats:
+ mshv_hv_stats_unmap();
+ return err;
+}
+
+int mshv_debugfs_vp_create(struct mshv_vp *vp)
+{
+ struct mshv_partition *p = vp->vp_partition;
+
+ if (!mshv_debugfs)
+ return 0;
+
+ return vp_debugfs_create(p->pt_id, vp->vp_index,
+ vp->vp_stats_pages,
+ &vp->vp_stats_dentry,
+ p->pt_vp_dentry);
+}
+
+void mshv_debugfs_vp_remove(struct mshv_vp *vp)
+{
+ if (!mshv_debugfs)
+ return;
+
+ vp_debugfs_remove(vp->vp_stats_dentry);
+}
+
+int mshv_debugfs_partition_create(struct mshv_partition *partition)
+{
+ int err;
+
+ if (!mshv_debugfs)
+ return 0;
+
+ err = partition_debugfs_create(partition->pt_id,
+ &partition->pt_vp_dentry,
+ &partition->pt_stats_dentry,
+ mshv_debugfs_partition);
+ if (err)
+ return err;
+
+ return 0;
+}
+
+void mshv_debugfs_partition_remove(struct mshv_partition *partition)
+{
+ if (!mshv_debugfs)
+ return;
+
+ partition_debugfs_remove(partition->pt_id,
+ partition->pt_stats_dentry);
+}
+
+int __init mshv_debugfs_init(void)
+{
+ int err;
+
+ mshv_debugfs = debugfs_create_dir("mshv", NULL);
+ if (IS_ERR(mshv_debugfs)) {
+ pr_err("%s: failed to create debugfs directory\n", __func__);
+ return PTR_ERR(mshv_debugfs);
+ }
+
+ if (hv_root_partition()) {
+ err = mshv_debugfs_hv_stats_create(mshv_debugfs);
+ if (err)
+ goto remove_mshv_dir;
+
+ err = mshv_debugfs_lp_create(mshv_debugfs);
+ if (err)
+ goto unmap_hv_stats;
+ }
+
+ err = mshv_debugfs_parent_partition_create();
+ if (err)
+ goto unmap_lp_stats;
+
+ return 0;
+
+unmap_lp_stats:
+ if (hv_root_partition()) {
+ mshv_debugfs_lp_remove();
+ mshv_debugfs_lp = NULL;
+ }
+unmap_hv_stats:
+ if (hv_root_partition())
+ mshv_hv_stats_unmap();
+remove_mshv_dir:
+ debugfs_remove_recursive(mshv_debugfs);
+ mshv_debugfs = NULL;
+ return err;
+}
+
+void mshv_debugfs_exit(void)
+{
+ mshv_debugfs_parent_partition_remove();
+
+ if (hv_root_partition()) {
+ mshv_debugfs_lp_remove();
+ mshv_debugfs_lp = NULL;
+ mshv_hv_stats_unmap();
+ }
+
+ debugfs_remove_recursive(mshv_debugfs);
+ mshv_debugfs = NULL;
+ mshv_debugfs_partition = NULL;
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index e4912b0618fa..7332d9af8373 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -52,6 +52,9 @@ struct mshv_vp {
unsigned int kicked_by_hv;
wait_queue_head_t vp_suspend_queue;
} run;
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+ struct dentry *vp_stats_dentry;
+#endif
};
#define vp_fmt(fmt) "p%lluvp%u: " fmt
@@ -136,6 +139,10 @@ struct mshv_partition {
u64 isolation_type;
bool import_completed;
bool pt_initialized;
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+ struct dentry *pt_stats_dentry;
+ struct dentry *pt_vp_dentry;
+#endif
};
#define pt_fmt(fmt) "p%llu: " fmt
@@ -327,6 +334,33 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
void *property_value, size_t property_value_sz);
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+int __init mshv_debugfs_init(void);
+void mshv_debugfs_exit(void);
+
+int mshv_debugfs_partition_create(struct mshv_partition *partition);
+void mshv_debugfs_partition_remove(struct mshv_partition *partition);
+int mshv_debugfs_vp_create(struct mshv_vp *vp);
+void mshv_debugfs_vp_remove(struct mshv_vp *vp);
+#else
+static inline int __init mshv_debugfs_init(void)
+{
+ return 0;
+}
+static inline void mshv_debugfs_exit(void) { }
+
+static inline int mshv_debugfs_partition_create(struct mshv_partition *partition)
+{
+ return 0;
+}
+static inline void mshv_debugfs_partition_remove(struct mshv_partition *partition) { }
+static inline int mshv_debugfs_vp_create(struct mshv_vp *vp)
+{
+ return 0;
+}
+static inline void mshv_debugfs_vp_remove(struct mshv_vp *vp) { }
+#endif
+
extern struct mshv_root mshv_root;
extern enum hv_scheduler_type hv_scheduler_type;
extern u8 * __percpu *hv_synic_eventring_tail;
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 414d9cee5252..3a43e41e16a1 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1095,6 +1095,10 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
+ ret = mshv_debugfs_vp_create(vp);
+ if (ret)
+ goto put_partition;
+
/*
* Keep anon_inode_getfd last: it installs fd in the file struct and
* thus makes the state accessible in user space.
@@ -1102,7 +1106,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
O_RDWR | O_CLOEXEC);
if (ret < 0)
- goto put_partition;
+ goto remove_debugfs_vp;
/* already exclusive with the partition mutex for all ioctls */
partition->pt_vp_count++;
@@ -1110,6 +1114,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
return ret;
+remove_debugfs_vp:
+ mshv_debugfs_vp_remove(vp);
put_partition:
mshv_partition_put(partition);
free_vp:
@@ -1552,10 +1558,16 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)
if (ret)
goto withdraw_mem;
+ ret = mshv_debugfs_partition_create(partition);
+ if (ret)
+ goto finalize_partition;
+
partition->pt_initialized = true;
return 0;
+finalize_partition:
+ hv_call_finalize_partition(partition->pt_id);
withdraw_mem:
hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
@@ -1735,6 +1747,7 @@ static void destroy_partition(struct mshv_partition *partition)
if (!vp)
continue;
+ mshv_debugfs_vp_remove(vp);
mshv_vp_stats_unmap(partition->pt_id, vp->vp_index,
vp->vp_stats_pages);
@@ -1768,6 +1781,8 @@ static void destroy_partition(struct mshv_partition *partition)
partition->pt_vp_array[i] = NULL;
}
+ mshv_debugfs_partition_remove(partition);
+
/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
hv_call_finalize_partition(partition->pt_id);
@@ -2313,10 +2328,14 @@ static int __init mshv_parent_partition_init(void)
mshv_init_vmm_caps(dev);
- ret = mshv_irqfd_wq_init();
+ ret = mshv_debugfs_init();
if (ret)
goto exit_partition;
+ ret = mshv_irqfd_wq_init();
+ if (ret)
+ goto exit_debugfs;
+
spin_lock_init(&mshv_root.pt_ht_lock);
hash_init(mshv_root.pt_htable);
@@ -2324,6 +2343,8 @@ static int __init mshv_parent_partition_init(void)
return 0;
+exit_debugfs:
+ mshv_debugfs_exit();
exit_partition:
if (hv_root_partition())
mshv_root_partition_exit();
@@ -2340,6 +2361,7 @@ static void __exit mshv_parent_partition_exit(void)
{
hv_setup_mshv_handler(NULL);
mshv_port_table_fini();
+ mshv_debugfs_exit();
misc_deregister(&mshv_dev);
mshv_irqfd_wq_cleanup();
if (hv_root_partition())
--
2.34.1
* [PATCH v6 6/7] mshv: Add data for printing stats page counters
From: Nuno Das Neves @ 2026-01-28 18:11 UTC
To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
paekkaladevi, Nuno Das Neves
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>
Introduce mshv_debugfs_counters.c, containing static data
corresponding to HV_*_COUNTER enums in the hypervisor source.
Defining the enum members as an array instead makes more sense,
since it will be iterated over to print counter information to
debugfs.
Include hypervisor, logical processor, partition, and virtual
processor counters.
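To illustrate why an array suits this use, here is a minimal user-space sketch of the pattern: a name table built with designated initializers, where the array index is the counter's slot in the stats page and unnamed slots are left NULL and skipped when printing. The table demo_counters and the function demo_print_counters() are made up for the example. The real tables are in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Hypothetical counter table. Index == position of the counter in the
 * data array; indices without an initializer (0 and 3 here) are NULL.
 */
static const char *const demo_counters[] = {
	[1] = "DemoFirst",
	[2] = "DemoSecond",
	[4] = "DemoFourth",
};

/*
 * Print every named counter that fits within the n entries of data.
 * Returns the number of lines printed.
 */
static int demo_print_counters(FILE *out, const unsigned long long *data,
			       size_t n)
{
	int printed = 0;

	for (size_t idx = 0; idx < ARRAY_SIZE(demo_counters) && idx < n; idx++) {
		const char *name = demo_counters[idx];

		if (!name)	/* gap in the table: no counter at this index */
			continue;
		fprintf(out, "%-16s: %llu\n", name, data[idx]);
		printed++;
	}

	return printed;
}
```

Compared with an enum, the table keeps the name and the index in one place, and architecture-specific layouts can be selected with #if blocks inside the initializer, as the patch does for x86_64 and arm64.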
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
drivers/hv/mshv_debugfs_counters.c | 490 +++++++++++++++++++++++++++++
1 file changed, 490 insertions(+)
create mode 100644 drivers/hv/mshv_debugfs_counters.c
diff --git a/drivers/hv/mshv_debugfs_counters.c b/drivers/hv/mshv_debugfs_counters.c
new file mode 100644
index 000000000000..978536ba691f
--- /dev/null
+++ b/drivers/hv/mshv_debugfs_counters.c
@@ -0,0 +1,490 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * Data for printing stats page counters via debugfs.
+ *
+ * Authors: Microsoft Linux virtualization team
+ */
+
+/*
+ * For simplicity, this file is included directly in mshv_debugfs.c.
+ * If these are ever needed elsewhere they should be compiled separately.
+ * Ensure this file is not used twice by accident.
+ */
+#ifndef MSHV_DEBUGFS_C
+#error "This file should only be included in mshv_debugfs.c"
+#endif
+
+/* HV_HYPERVISOR_COUNTER */
+static char *hv_hypervisor_counters[] = {
+ [1] = "HvLogicalProcessors",
+ [2] = "HvPartitions",
+ [3] = "HvTotalPages",
+ [4] = "HvVirtualProcessors",
+ [5] = "HvMonitoredNotifications",
+ [6] = "HvModernStandbyEntries",
+ [7] = "HvPlatformIdleTransitions",
+ [8] = "HvHypervisorStartupCost",
+
+ [10] = "HvIOSpacePages",
+ [11] = "HvNonEssentialPagesForDump",
+ [12] = "HvSubsumedPages",
+};
+
+/* HV_CPU_COUNTER */
+static char *hv_lp_counters[] = {
+ [1] = "LpGlobalTime",
+ [2] = "LpTotalRunTime",
+ [3] = "LpHypervisorRunTime",
+ [4] = "LpHardwareInterrupts",
+ [5] = "LpContextSwitches",
+ [6] = "LpInterProcessorInterrupts",
+ [7] = "LpSchedulerInterrupts",
+ [8] = "LpTimerInterrupts",
+ [9] = "LpInterProcessorInterruptsSent",
+ [10] = "LpProcessorHalts",
+ [11] = "LpMonitorTransitionCost",
+ [12] = "LpContextSwitchTime",
+ [13] = "LpC1TransitionsCount",
+ [14] = "LpC1RunTime",
+ [15] = "LpC2TransitionsCount",
+ [16] = "LpC2RunTime",
+ [17] = "LpC3TransitionsCount",
+ [18] = "LpC3RunTime",
+ [19] = "LpRootVpIndex",
+ [20] = "LpIdleSequenceNumber",
+ [21] = "LpGlobalTscCount",
+ [22] = "LpActiveTscCount",
+ [23] = "LpIdleAccumulation",
+ [24] = "LpReferenceCycleCount0",
+ [25] = "LpActualCycleCount0",
+ [26] = "LpReferenceCycleCount1",
+ [27] = "LpActualCycleCount1",
+ [28] = "LpProximityDomainId",
+ [29] = "LpPostedInterruptNotifications",
+ [30] = "LpBranchPredictorFlushes",
+#if IS_ENABLED(CONFIG_X86_64)
+ [31] = "LpL1DataCacheFlushes",
+ [32] = "LpImmediateL1DataCacheFlushes",
+ [33] = "LpMbFlushes",
+ [34] = "LpCounterRefreshSequenceNumber",
+ [35] = "LpCounterRefreshReferenceTime",
+ [36] = "LpIdleAccumulationSnapshot",
+ [37] = "LpActiveTscCountSnapshot",
+ [38] = "LpHwpRequestContextSwitches",
+ [39] = "LpPlaceholder1",
+ [40] = "LpPlaceholder2",
+ [41] = "LpPlaceholder3",
+ [42] = "LpPlaceholder4",
+ [43] = "LpPlaceholder5",
+ [44] = "LpPlaceholder6",
+ [45] = "LpPlaceholder7",
+ [46] = "LpPlaceholder8",
+ [47] = "LpPlaceholder9",
+ [48] = "LpSchLocalRunListSize",
+ [49] = "LpReserveGroupId",
+ [50] = "LpRunningPriority",
+ [51] = "LpPerfmonInterruptCount",
+#elif IS_ENABLED(CONFIG_ARM64)
+ [31] = "LpCounterRefreshSequenceNumber",
+ [32] = "LpCounterRefreshReferenceTime",
+ [33] = "LpIdleAccumulationSnapshot",
+ [34] = "LpActiveTscCountSnapshot",
+ [35] = "LpHwpRequestContextSwitches",
+ [36] = "LpPlaceholder2",
+ [37] = "LpPlaceholder3",
+ [38] = "LpPlaceholder4",
+ [39] = "LpPlaceholder5",
+ [40] = "LpPlaceholder6",
+ [41] = "LpPlaceholder7",
+ [42] = "LpPlaceholder8",
+ [43] = "LpPlaceholder9",
+ [44] = "LpSchLocalRunListSize",
+ [45] = "LpReserveGroupId",
+ [46] = "LpRunningPriority",
+#endif
+};
+
+/* HV_PROCESS_COUNTER */
+static char *hv_partition_counters[] = {
+ [1] = "PtVirtualProcessors",
+
+ [3] = "PtTlbSize",
+ [4] = "PtAddressSpaces",
+ [5] = "PtDepositedPages",
+ [6] = "PtGpaPages",
+ [7] = "PtGpaSpaceModifications",
+ [8] = "PtVirtualTlbFlushEntires",
+ [9] = "PtRecommendedTlbSize",
+ [10] = "PtGpaPages4K",
+ [11] = "PtGpaPages2M",
+ [12] = "PtGpaPages1G",
+ [13] = "PtGpaPages512G",
+ [14] = "PtDevicePages4K",
+ [15] = "PtDevicePages2M",
+ [16] = "PtDevicePages1G",
+ [17] = "PtDevicePages512G",
+ [18] = "PtAttachedDevices",
+ [19] = "PtDeviceInterruptMappings",
+ [20] = "PtIoTlbFlushes",
+ [21] = "PtIoTlbFlushCost",
+ [22] = "PtDeviceInterruptErrors",
+ [23] = "PtDeviceDmaErrors",
+ [24] = "PtDeviceInterruptThrottleEvents",
+ [25] = "PtSkippedTimerTicks",
+ [26] = "PtPartitionId",
+#if IS_ENABLED(CONFIG_X86_64)
+ [27] = "PtNestedTlbSize",
+ [28] = "PtRecommendedNestedTlbSize",
+ [29] = "PtNestedTlbFreeListSize",
+ [30] = "PtNestedTlbTrimmedPages",
+ [31] = "PtPagesShattered",
+ [32] = "PtPagesRecombined",
+ [33] = "PtHwpRequestValue",
+ [34] = "PtAutoSuspendEnableTime",
+ [35] = "PtAutoSuspendTriggerTime",
+ [36] = "PtAutoSuspendDisableTime",
+ [37] = "PtPlaceholder1",
+ [38] = "PtPlaceholder2",
+ [39] = "PtPlaceholder3",
+ [40] = "PtPlaceholder4",
+ [41] = "PtPlaceholder5",
+ [42] = "PtPlaceholder6",
+ [43] = "PtPlaceholder7",
+ [44] = "PtPlaceholder8",
+ [45] = "PtHypervisorStateTransferGeneration",
+ [46] = "PtNumberofActiveChildPartitions",
+#elif IS_ENABLED(CONFIG_ARM64)
+ [27] = "PtHwpRequestValue",
+ [28] = "PtAutoSuspendEnableTime",
+ [29] = "PtAutoSuspendTriggerTime",
+ [30] = "PtAutoSuspendDisableTime",
+ [31] = "PtPlaceholder1",
+ [32] = "PtPlaceholder2",
+ [33] = "PtPlaceholder3",
+ [34] = "PtPlaceholder4",
+ [35] = "PtPlaceholder5",
+ [36] = "PtPlaceholder6",
+ [37] = "PtPlaceholder7",
+ [38] = "PtPlaceholder8",
+ [39] = "PtHypervisorStateTransferGeneration",
+ [40] = "PtNumberofActiveChildPartitions",
+#endif
+};
+
+/* HV_THREAD_COUNTER */
+static char *hv_vp_counters[] = {
+ [1] = "VpTotalRunTime",
+ [2] = "VpHypervisorRunTime",
+ [3] = "VpRemoteNodeRunTime",
+ [4] = "VpNormalizedRunTime",
+ [5] = "VpIdealCpu",
+
+ [7] = "VpHypercallsCount",
+ [8] = "VpHypercallsTime",
+#if IS_ENABLED(CONFIG_X86_64)
+ [9] = "VpPageInvalidationsCount",
+ [10] = "VpPageInvalidationsTime",
+ [11] = "VpControlRegisterAccessesCount",
+ [12] = "VpControlRegisterAccessesTime",
+ [13] = "VpIoInstructionsCount",
+ [14] = "VpIoInstructionsTime",
+ [15] = "VpHltInstructionsCount",
+ [16] = "VpHltInstructionsTime",
+ [17] = "VpMwaitInstructionsCount",
+ [18] = "VpMwaitInstructionsTime",
+ [19] = "VpCpuidInstructionsCount",
+ [20] = "VpCpuidInstructionsTime",
+ [21] = "VpMsrAccessesCount",
+ [22] = "VpMsrAccessesTime",
+ [23] = "VpOtherInterceptsCount",
+ [24] = "VpOtherInterceptsTime",
+ [25] = "VpExternalInterruptsCount",
+ [26] = "VpExternalInterruptsTime",
+ [27] = "VpPendingInterruptsCount",
+ [28] = "VpPendingInterruptsTime",
+ [29] = "VpEmulatedInstructionsCount",
+ [30] = "VpEmulatedInstructionsTime",
+ [31] = "VpDebugRegisterAccessesCount",
+ [32] = "VpDebugRegisterAccessesTime",
+ [33] = "VpPageFaultInterceptsCount",
+ [34] = "VpPageFaultInterceptsTime",
+ [35] = "VpGuestPageTableMaps",
+ [36] = "VpLargePageTlbFills",
+ [37] = "VpSmallPageTlbFills",
+ [38] = "VpReflectedGuestPageFaults",
+ [39] = "VpApicMmioAccesses",
+ [40] = "VpIoInterceptMessages",
+ [41] = "VpMemoryInterceptMessages",
+ [42] = "VpApicEoiAccesses",
+ [43] = "VpOtherMessages",
+ [44] = "VpPageTableAllocations",
+ [45] = "VpLogicalProcessorMigrations",
+ [46] = "VpAddressSpaceEvictions",
+ [47] = "VpAddressSpaceSwitches",
+ [48] = "VpAddressDomainFlushes",
+ [49] = "VpAddressSpaceFlushes",
+ [50] = "VpGlobalGvaRangeFlushes",
+ [51] = "VpLocalGvaRangeFlushes",
+ [52] = "VpPageTableEvictions",
+ [53] = "VpPageTableReclamations",
+ [54] = "VpPageTableResets",
+ [55] = "VpPageTableValidations",
+ [56] = "VpApicTprAccesses",
+ [57] = "VpPageTableWriteIntercepts",
+ [58] = "VpSyntheticInterrupts",
+ [59] = "VpVirtualInterrupts",
+ [60] = "VpApicIpisSent",
+ [61] = "VpApicSelfIpisSent",
+ [62] = "VpGpaSpaceHypercalls",
+ [63] = "VpLogicalProcessorHypercalls",
+ [64] = "VpLongSpinWaitHypercalls",
+ [65] = "VpOtherHypercalls",
+ [66] = "VpSyntheticInterruptHypercalls",
+ [67] = "VpVirtualInterruptHypercalls",
+ [68] = "VpVirtualMmuHypercalls",
+ [69] = "VpVirtualProcessorHypercalls",
+ [70] = "VpHardwareInterrupts",
+ [71] = "VpNestedPageFaultInterceptsCount",
+ [72] = "VpNestedPageFaultInterceptsTime",
+ [73] = "VpPageScans",
+ [74] = "VpLogicalProcessorDispatches",
+ [75] = "VpWaitingForCpuTime",
+ [76] = "VpExtendedHypercalls",
+ [77] = "VpExtendedHypercallInterceptMessages",
+ [78] = "VpMbecNestedPageTableSwitches",
+ [79] = "VpOtherReflectedGuestExceptions",
+ [80] = "VpGlobalIoTlbFlushes",
+ [81] = "VpGlobalIoTlbFlushCost",
+ [82] = "VpLocalIoTlbFlushes",
+ [83] = "VpLocalIoTlbFlushCost",
+ [84] = "VpHypercallsForwardedCount",
+ [85] = "VpHypercallsForwardingTime",
+ [86] = "VpPageInvalidationsForwardedCount",
+ [87] = "VpPageInvalidationsForwardingTime",
+ [88] = "VpControlRegisterAccessesForwardedCount",
+ [89] = "VpControlRegisterAccessesForwardingTime",
+ [90] = "VpIoInstructionsForwardedCount",
+ [91] = "VpIoInstructionsForwardingTime",
+ [92] = "VpHltInstructionsForwardedCount",
+ [93] = "VpHltInstructionsForwardingTime",
+ [94] = "VpMwaitInstructionsForwardedCount",
+ [95] = "VpMwaitInstructionsForwardingTime",
+ [96] = "VpCpuidInstructionsForwardedCount",
+ [97] = "VpCpuidInstructionsForwardingTime",
+ [98] = "VpMsrAccessesForwardedCount",
+ [99] = "VpMsrAccessesForwardingTime",
+ [100] = "VpOtherInterceptsForwardedCount",
+ [101] = "VpOtherInterceptsForwardingTime",
+ [102] = "VpExternalInterruptsForwardedCount",
+ [103] = "VpExternalInterruptsForwardingTime",
+ [104] = "VpPendingInterruptsForwardedCount",
+ [105] = "VpPendingInterruptsForwardingTime",
+ [106] = "VpEmulatedInstructionsForwardedCount",
+ [107] = "VpEmulatedInstructionsForwardingTime",
+ [108] = "VpDebugRegisterAccessesForwardedCount",
+ [109] = "VpDebugRegisterAccessesForwardingTime",
+ [110] = "VpPageFaultInterceptsForwardedCount",
+ [111] = "VpPageFaultInterceptsForwardingTime",
+ [112] = "VpVmclearEmulationCount",
+ [113] = "VpVmclearEmulationTime",
+ [114] = "VpVmptrldEmulationCount",
+ [115] = "VpVmptrldEmulationTime",
+ [116] = "VpVmptrstEmulationCount",
+ [117] = "VpVmptrstEmulationTime",
+ [118] = "VpVmreadEmulationCount",
+ [119] = "VpVmreadEmulationTime",
+ [120] = "VpVmwriteEmulationCount",
+ [121] = "VpVmwriteEmulationTime",
+ [122] = "VpVmxoffEmulationCount",
+ [123] = "VpVmxoffEmulationTime",
+ [124] = "VpVmxonEmulationCount",
+ [125] = "VpVmxonEmulationTime",
+ [126] = "VpNestedVMEntriesCount",
+ [127] = "VpNestedVMEntriesTime",
+ [128] = "VpNestedSLATSoftPageFaultsCount",
+ [129] = "VpNestedSLATSoftPageFaultsTime",
+ [130] = "VpNestedSLATHardPageFaultsCount",
+ [131] = "VpNestedSLATHardPageFaultsTime",
+ [132] = "VpInvEptAllContextEmulationCount",
+ [133] = "VpInvEptAllContextEmulationTime",
+ [134] = "VpInvEptSingleContextEmulationCount",
+ [135] = "VpInvEptSingleContextEmulationTime",
+ [136] = "VpInvVpidAllContextEmulationCount",
+ [137] = "VpInvVpidAllContextEmulationTime",
+ [138] = "VpInvVpidSingleContextEmulationCount",
+ [139] = "VpInvVpidSingleContextEmulationTime",
+ [140] = "VpInvVpidSingleAddressEmulationCount",
+ [141] = "VpInvVpidSingleAddressEmulationTime",
+ [142] = "VpNestedTlbPageTableReclamations",
+ [143] = "VpNestedTlbPageTableEvictions",
+ [144] = "VpFlushGuestPhysicalAddressSpaceHypercalls",
+ [145] = "VpFlushGuestPhysicalAddressListHypercalls",
+ [146] = "VpPostedInterruptNotifications",
+ [147] = "VpPostedInterruptScans",
+ [148] = "VpTotalCoreRunTime",
+ [149] = "VpMaximumRunTime",
+ [150] = "VpHwpRequestContextSwitches",
+ [151] = "VpWaitingForCpuTimeBucket0",
+ [152] = "VpWaitingForCpuTimeBucket1",
+ [153] = "VpWaitingForCpuTimeBucket2",
+ [154] = "VpWaitingForCpuTimeBucket3",
+ [155] = "VpWaitingForCpuTimeBucket4",
+ [156] = "VpWaitingForCpuTimeBucket5",
+ [157] = "VpWaitingForCpuTimeBucket6",
+ [158] = "VpVmloadEmulationCount",
+ [159] = "VpVmloadEmulationTime",
+ [160] = "VpVmsaveEmulationCount",
+ [161] = "VpVmsaveEmulationTime",
+ [162] = "VpGifInstructionEmulationCount",
+ [163] = "VpGifInstructionEmulationTime",
+ [164] = "VpEmulatedErrataSvmInstructions",
+ [165] = "VpPlaceholder1",
+ [166] = "VpPlaceholder2",
+ [167] = "VpPlaceholder3",
+ [168] = "VpPlaceholder4",
+ [169] = "VpPlaceholder5",
+ [170] = "VpPlaceholder6",
+ [171] = "VpPlaceholder7",
+ [172] = "VpPlaceholder8",
+ [173] = "VpContentionTime",
+ [174] = "VpWakeUpTime",
+ [175] = "VpSchedulingPriority",
+ [176] = "VpRdpmcInstructionsCount",
+ [177] = "VpRdpmcInstructionsTime",
+ [178] = "VpPerfmonPmuMsrAccessesCount",
+ [179] = "VpPerfmonLbrMsrAccessesCount",
+ [180] = "VpPerfmonIptMsrAccessesCount",
+ [181] = "VpPerfmonInterruptCount",
+ [182] = "VpVtl1DispatchCount",
+ [183] = "VpVtl2DispatchCount",
+ [184] = "VpVtl2DispatchBucket0",
+ [185] = "VpVtl2DispatchBucket1",
+ [186] = "VpVtl2DispatchBucket2",
+ [187] = "VpVtl2DispatchBucket3",
+ [188] = "VpVtl2DispatchBucket4",
+ [189] = "VpVtl2DispatchBucket5",
+ [190] = "VpVtl2DispatchBucket6",
+ [191] = "VpVtl1RunTime",
+ [192] = "VpVtl2RunTime",
+ [193] = "VpIommuHypercalls",
+ [194] = "VpCpuGroupHypercalls",
+ [195] = "VpVsmHypercalls",
+ [196] = "VpEventLogHypercalls",
+ [197] = "VpDeviceDomainHypercalls",
+ [198] = "VpDepositHypercalls",
+ [199] = "VpSvmHypercalls",
+ [200] = "VpBusLockAcquisitionCount",
+ [201] = "VpLoadAvg",
+ [202] = "VpRootDispatchThreadBlocked",
+ [203] = "VpIdleCpuTime",
+ [204] = "VpWaitingForCpuTimeBucket7",
+ [205] = "VpWaitingForCpuTimeBucket8",
+ [206] = "VpWaitingForCpuTimeBucket9",
+ [207] = "VpWaitingForCpuTimeBucket10",
+ [208] = "VpWaitingForCpuTimeBucket11",
+ [209] = "VpWaitingForCpuTimeBucket12",
+ [210] = "VpHierarchicalSuspendTime",
+ [211] = "VpExpressSchedulingAttempts",
+ [212] = "VpExpressSchedulingCount",
+#elif IS_ENABLED(CONFIG_ARM64)
+ [9] = "VpSysRegAccessesCount",
+ [10] = "VpSysRegAccessesTime",
+ [11] = "VpSmcInstructionsCount",
+ [12] = "VpSmcInstructionsTime",
+ [13] = "VpOtherInterceptsCount",
+ [14] = "VpOtherInterceptsTime",
+ [15] = "VpExternalInterruptsCount",
+ [16] = "VpExternalInterruptsTime",
+ [17] = "VpPendingInterruptsCount",
+ [18] = "VpPendingInterruptsTime",
+ [19] = "VpGuestPageTableMaps",
+ [20] = "VpLargePageTlbFills",
+ [21] = "VpSmallPageTlbFills",
+ [22] = "VpReflectedGuestPageFaults",
+ [23] = "VpMemoryInterceptMessages",
+ [24] = "VpOtherMessages",
+ [25] = "VpLogicalProcessorMigrations",
+ [26] = "VpAddressDomainFlushes",
+ [27] = "VpAddressSpaceFlushes",
+ [28] = "VpSyntheticInterrupts",
+ [29] = "VpVirtualInterrupts",
+ [30] = "VpApicSelfIpisSent",
+ [31] = "VpGpaSpaceHypercalls",
+ [32] = "VpLogicalProcessorHypercalls",
+ [33] = "VpLongSpinWaitHypercalls",
+ [34] = "VpOtherHypercalls",
+ [35] = "VpSyntheticInterruptHypercalls",
+ [36] = "VpVirtualInterruptHypercalls",
+ [37] = "VpVirtualMmuHypercalls",
+ [38] = "VpVirtualProcessorHypercalls",
+ [39] = "VpHardwareInterrupts",
+ [40] = "VpNestedPageFaultInterceptsCount",
+ [41] = "VpNestedPageFaultInterceptsTime",
+ [42] = "VpLogicalProcessorDispatches",
+ [43] = "VpWaitingForCpuTime",
+ [44] = "VpExtendedHypercalls",
+ [45] = "VpExtendedHypercallInterceptMessages",
+ [46] = "VpMbecNestedPageTableSwitches",
+ [47] = "VpOtherReflectedGuestExceptions",
+ [48] = "VpGlobalIoTlbFlushes",
+ [49] = "VpGlobalIoTlbFlushCost",
+ [50] = "VpLocalIoTlbFlushes",
+ [51] = "VpLocalIoTlbFlushCost",
+ [52] = "VpFlushGuestPhysicalAddressSpaceHypercalls",
+ [53] = "VpFlushGuestPhysicalAddressListHypercalls",
+ [54] = "VpPostedInterruptNotifications",
+ [55] = "VpPostedInterruptScans",
+ [56] = "VpTotalCoreRunTime",
+ [57] = "VpMaximumRunTime",
+ [58] = "VpWaitingForCpuTimeBucket0",
+ [59] = "VpWaitingForCpuTimeBucket1",
+ [60] = "VpWaitingForCpuTimeBucket2",
+ [61] = "VpWaitingForCpuTimeBucket3",
+ [62] = "VpWaitingForCpuTimeBucket4",
+ [63] = "VpWaitingForCpuTimeBucket5",
+ [64] = "VpWaitingForCpuTimeBucket6",
+ [65] = "VpHwpRequestContextSwitches",
+ [66] = "VpPlaceholder2",
+ [67] = "VpPlaceholder3",
+ [68] = "VpPlaceholder4",
+ [69] = "VpPlaceholder5",
+ [70] = "VpPlaceholder6",
+ [71] = "VpPlaceholder7",
+ [72] = "VpPlaceholder8",
+ [73] = "VpContentionTime",
+ [74] = "VpWakeUpTime",
+ [75] = "VpSchedulingPriority",
+ [76] = "VpVtl1DispatchCount",
+ [77] = "VpVtl2DispatchCount",
+ [78] = "VpVtl2DispatchBucket0",
+ [79] = "VpVtl2DispatchBucket1",
+ [80] = "VpVtl2DispatchBucket2",
+ [81] = "VpVtl2DispatchBucket3",
+ [82] = "VpVtl2DispatchBucket4",
+ [83] = "VpVtl2DispatchBucket5",
+ [84] = "VpVtl2DispatchBucket6",
+ [85] = "VpVtl1RunTime",
+ [86] = "VpVtl2RunTime",
+ [87] = "VpIommuHypercalls",
+ [88] = "VpCpuGroupHypercalls",
+ [89] = "VpVsmHypercalls",
+ [90] = "VpEventLogHypercalls",
+ [91] = "VpDeviceDomainHypercalls",
+ [92] = "VpDepositHypercalls",
+ [93] = "VpSvmHypercalls",
+ [94] = "VpLoadAvg",
+ [95] = "VpRootDispatchThreadBlocked",
+ [96] = "VpIdleCpuTime",
+ [97] = "VpWaitingForCpuTimeBucket7",
+ [98] = "VpWaitingForCpuTimeBucket8",
+ [99] = "VpWaitingForCpuTimeBucket9",
+ [100] = "VpWaitingForCpuTimeBucket10",
+ [101] = "VpWaitingForCpuTimeBucket11",
+ [102] = "VpWaitingForCpuTimeBucket12",
+ [103] = "VpHierarchicalSuspendTime",
+ [104] = "VpExpressSchedulingAttempts",
+ [105] = "VpExpressSchedulingCount",
+#endif
+};
--
2.34.1
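
The counter name tables above use designated initializers and leave some indices unset (for example [0] and [2] in hv_partition_counters), so any code that walks them has to skip NULL slots. A minimal userspace sketch of such a walk follows. The demo_* names and the helper are hypothetical and not part of the patch, and only a short excerpt of the table is reproduced.

```c
#include <stddef.h>
#include <stdio.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Excerpt of the partition counter names from the patch.
 * Indices 0 and 2 have no initializer, so they are NULL.
 */
static const char *const demo_partition_counters[] = {
	[1] = "PtVirtualProcessors",
	[3] = "PtTlbSize",
	[4] = "PtAddressSpaces",
	[5] = "PtDepositedPages",
};

/*
 * Hypothetical helper: print "name: value" for every named counter.
 * data[] is indexed by counter number, like the stats page.
 * Returns the number of lines written.
 */
static int demo_print_counters(FILE *out, const char *const *names,
			       size_t nr_names,
			       const unsigned long long *data)
{
	int printed = 0;
	size_t i;

	for (i = 0; i < nr_names; i++) {
		if (!names[i])	/* gap left by the designated initializers */
			continue;
		fprintf(out, "%-24s: %llu\n", names[i], data[i]);
		printed++;
	}

	return printed;
}
```

Sizing the loop with ARRAY_SIZE() rather than a separate max-counter constant keeps the bound tied to the highest index that actually has a name, which differs between the x86_64 and ARM64 variants of each table.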
* [PATCH v6 5/7] mshv: Update hv_stats_page definitions
From: Nuno Das Neves @ 2026-01-28 18:11 UTC (permalink / raw)
To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
paekkaladevi, Nuno Das Neves
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>
hv_stats_page belongs in hvhdk.h, so move it there.
It does not need a union to access the data for different counters;
use a single u64 array for simplicity and to match the Windows
definitions.
While at it, correct the ARM64 value for VpRootDispatchThreadBlocked
from 94 to 95.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
drivers/hv/mshv_root_main.c | 27 ++++++++-------------------
include/hyperv/hvhdk.h | 7 +++++++
2 files changed, 15 insertions(+), 19 deletions(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index fbfc9e7d9fa4..414d9cee5252 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -39,22 +39,12 @@ MODULE_AUTHOR("Microsoft");
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
-/* TODO move this to another file when debugfs code is added */
-enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
-#if defined(CONFIG_X86)
- VpRootDispatchThreadBlocked = 202,
+/* HV_THREAD_COUNTER */
+#if defined(CONFIG_X86_64)
+#define HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED 202
#elif defined(CONFIG_ARM64)
- VpRootDispatchThreadBlocked = 94,
+#define HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED 95
#endif
- VpStatsMaxCounter
-};
-
-struct hv_stats_page {
- union {
- u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
- u8 data[HV_HYP_PAGE_SIZE];
- };
-} __packed;
struct mshv_root mshv_root;
@@ -485,12 +475,11 @@ static u64 mshv_vp_interrupt_pending(struct mshv_vp *vp)
static bool mshv_vp_dispatch_thread_blocked(struct mshv_vp *vp)
{
struct hv_stats_page **stats = vp->vp_stats_pages;
- u64 *self_vp_cntrs = stats[HV_STATS_AREA_SELF]->vp_cntrs;
- u64 *parent_vp_cntrs = stats[HV_STATS_AREA_PARENT]->vp_cntrs;
+ u64 *self_vp_cntrs = stats[HV_STATS_AREA_SELF]->data;
+ u64 *parent_vp_cntrs = stats[HV_STATS_AREA_PARENT]->data;
- if (self_vp_cntrs[VpRootDispatchThreadBlocked])
- return self_vp_cntrs[VpRootDispatchThreadBlocked];
- return parent_vp_cntrs[VpRootDispatchThreadBlocked];
+ return parent_vp_cntrs[HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED] ||
+ self_vp_cntrs[HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED];
}
static int
diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
index 469186df7826..d87cfdb7d360 100644
--- a/include/hyperv/hvhdk.h
+++ b/include/hyperv/hvhdk.h
@@ -10,6 +10,13 @@
#include "hvhdk_mini.h"
#include "hvgdk.h"
+/*
+ * Hypervisor statistics page format
+ */
+struct hv_stats_page {
+ u64 data[HV_HYP_PAGE_SIZE / sizeof(u64)];
+} __packed;
+
/* Bits for dirty mask of hv_vp_register_page */
#define HV_X64_REGISTER_CLASS_GENERAL 0
#define HV_X64_REGISTER_CLASS_IP 1
--
2.34.1
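
For readers following the change outside the kernel tree, the flat-array layout and the simplified blocked check can be sketched in plain C as below. This is an illustration under stated assumptions, not the driver code: the demo_* names are hypothetical, DEMO_HYP_PAGE_SIZE assumes a 4 KiB hypervisor page in place of HV_HYP_PAGE_SIZE, and only the x86_64 counter index (202) from the patch is used.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stands in for HV_HYP_PAGE_SIZE (4 KiB hypervisor page). */
#define DEMO_HYP_PAGE_SIZE 4096

/* x86_64 index taken from the patch; ARM64 uses 95 instead. */
#define DEMO_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED 202

/* The whole page is one array of 64-bit counters, indexed by counter number. */
struct demo_stats_page {
	uint64_t data[DEMO_HYP_PAGE_SIZE / sizeof(uint64_t)];
};

/* The array must cover exactly one page, with no padding. */
_Static_assert(sizeof(struct demo_stats_page) == DEMO_HYP_PAGE_SIZE,
	       "stats page must be exactly one hypervisor page");

/* True if either the SELF or the PARENT stats area reports the counter as nonzero. */
static bool demo_dispatch_thread_blocked(const struct demo_stats_page *self,
					 const struct demo_stats_page *parent)
{
	const unsigned int idx = DEMO_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED;

	return parent->data[idx] || self->data[idx];
}
```

Because the function returns bool, the logical OR gives the same result as the earlier form that returned the SELF counter when nonzero and the PARENT counter otherwise.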