* Re: [PATCH v2 1/8] KVM: SVM: Add a helper to detect VMRUN failures
From: Yosry Ahmed @ 2026-01-02 16:44 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Vitaly Kuznetsov, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, kvm, linux-hyperv, linux-kernel,
Jim Mattson
In-Reply-To: <20251230211347.4099600-2-seanjc@google.com>
On Tue, Dec 30, 2025 at 01:13:40PM -0800, Sean Christopherson wrote:
> Add a helper to detect VMRUN failures so that KVM can guard against its
> own long-standing bug, where KVM neglects to set exitcode[63:32] when
> synthesizing a nested VMFAIL_INVALID VM-Exit. This will allow fixing
> KVM's mess of treating exitcode as two separate 32-bit values without
> breaking KVM-on-KVM when running on an older, unfixed KVM.
>
> Cc: Jim Mattson <jmattson@google.com>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> arch/x86/kvm/svm/nested.c | 16 +++++++---------
> arch/x86/kvm/svm/svm.c | 4 ++--
> arch/x86/kvm/svm/svm.h | 5 +++++
> 3 files changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index ba0f11c68372..f5bde972a2b1 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1134,7 +1134,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
> vmcb12->control.exit_info_1 = vmcb02->control.exit_info_1;
> vmcb12->control.exit_info_2 = vmcb02->control.exit_info_2;
>
> - if (vmcb12->control.exit_code != SVM_EXIT_ERR)
> + if (!svm_is_vmrun_failure(vmcb12->control.exit_code))
> nested_save_pending_event_to_vmcb12(svm, vmcb12);
>
> if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
> @@ -1425,6 +1425,9 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
> u32 exit_code = svm->vmcb->control.exit_code;
> int vmexit = NESTED_EXIT_HOST;
>
> + if (svm_is_vmrun_failure(exit_code))
> + return NESTED_EXIT_DONE;
> +
> switch (exit_code) {
> case SVM_EXIT_MSR:
> vmexit = nested_svm_exit_handled_msr(svm);
> @@ -1432,7 +1435,7 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
> case SVM_EXIT_IOIO:
> vmexit = nested_svm_intercept_ioio(svm);
> break;
> - case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
> + case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f:
> /*
> * Host-intercepted exceptions have been checked already in
> * nested_svm_exit_special. There is nothing to do here,
> @@ -1440,15 +1443,10 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
> */
> vmexit = NESTED_EXIT_DONE;
> break;
> - }
> - case SVM_EXIT_ERR: {
> - vmexit = NESTED_EXIT_DONE;
> - break;
> - }
> - default: {
> + default:
> if (vmcb12_is_intercept(&svm->nested.ctl, exit_code))
> vmexit = NESTED_EXIT_DONE;
> - }
> + break;
> }
>
> return vmexit;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 24d59ccfa40d..c2ddf2e0aa1a 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3540,7 +3540,7 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> return 1;
> }
>
> - if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
> + if (svm_is_vmrun_failure(svm->vmcb->control.exit_code)) {
> kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
> kvm_run->fail_entry.hardware_entry_failure_reason
> = svm->vmcb->control.exit_code;
> @@ -4311,7 +4311,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
>
> /* Track VMRUNs that have made past consistency checking */
> if (svm->nested.nested_run_pending &&
> - svm->vmcb->control.exit_code != SVM_EXIT_ERR)
> + !svm_is_vmrun_failure(svm->vmcb->control.exit_code))
> ++vcpu->stat.nested_run;
>
> svm->nested.nested_run_pending = 0;
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 01be93a53d07..0f006793f973 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -424,6 +424,11 @@ static __always_inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
> return container_of(vcpu, struct vcpu_svm, vcpu);
> }
>
> +static inline bool svm_is_vmrun_failure(u64 exit_code)
> +{
> + return (u32)exit_code == (u32)SVM_EXIT_ERR;
> +}
> +
> /*
> * Only the PDPTRs are loaded on demand into the shadow MMU. All other
> * fields are synchronized on VM-Exit, because accessing the VMCB is cheap.
> --
> 2.52.0.351.gbe84eed79e-goog
>
^ permalink raw reply
* RE: [PATCH v2 3/3] mshv: Add debugfs to view hypervisor statistics
From: Michael Kelley @ 2026-01-02 16:27 UTC (permalink / raw)
To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, skinsburskii@linux.microsoft.com
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
prapal@linux.microsoft.com, mrathor@linux.microsoft.com,
paekkaladevi@linux.microsoft.com, Jinank Jain
In-Reply-To: <ff9aa617-3980-49bf-9311-282b70714280@linux.microsoft.com>
From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, December 30, 2025 4:27 PM
>
> On 12/8/2025 7:21 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, December 5, 2025 10:59 AM
> >>
> >> Introduce a debugfs interface to expose root and child partition stats
> >> when running with mshv_root.
> >>
> >> Create a debugfs directory "mshv" containing 'stats' files organized by
> >> type and id. A stats file contains a number of counters depending on
> >> its type. e.g. an excerpt from a VP stats file:
> >>
> >> TotalRunTime : 1997602722
> >> HypervisorRunTime : 649671371
> >> RemoteNodeRunTime : 0
> >> NormalizedRunTime : 1997602721
> >> IdealCpu : 0
> >> HypercallsCount : 1708169
> >> HypercallsTime : 111914774
> >> PageInvalidationsCount : 0
> >> PageInvalidationsTime : 0
> >>
> >> On a root partition with some active child partitions, the entire
> >> directory structure may look like:
> >>
> >> mshv/
> >> stats # hypervisor stats
> >> lp/ # logical processors
> >> 0/ # LP id
> >> stats # LP 0 stats
> >> 1/
> >> 2/
> >> 3/
> >> partition/ # partition stats
> >> 1/ # root partition id
> >> stats # root partition stats
> >> vp/ # root virtual processors
> >> 0/ # root VP id
> >> stats # root VP 0 stats
> >> 1/
> >> 2/
> >> 3/
> >> 42/ # child partition id
> >> stats # child partition stats
> >> vp/ # child VPs
> >> 0/ # child VP id
> >> stats # child VP 0 stats
> >> 1/
> >> 43/
> >> 55/
> >>
> >
> > In the above directory tree, each of the "stats" files is in a directory
> > by itself, where the directory name is the number of whatever
> > entity the stats are for (lp, partition, or vp). Do you expect there to
> > be other files parallel to "stats" that will be added later? Otherwise
> > you could collapse one directory level. The "best" directory structure
> > is somewhat a matter of taste and judgment, so there's not a "right"
> > answer. I don't object if your preference is to keep the numbered
> > directories, even if they are likely to never contain more than the
> > "stats" file.
> >
> Good question, I'm not aware of a plan to add additional parallel files
> in future, but even so, I think this structure is fine as-is.
>
> I see how the VPs and LPs directories could be collapsed, but partitions
> need to be directories to contain the VPs, so that would be an
> inconsistency (some "stats" files and some "$ID" files) which seems worse
> to me. e.g.., are you suggesting something like this?
>
> mshv/
> stats # hypervisor stats
> lp/ # logical processors
> 0 # LP 0 stats
> 1 # LP 1 stats
> partition/ # partition stats directory
> 1/ # root partition id
> stats # root partition stats
> vp/ # root virtual processors
> 0 # root VP 0 stats
> 1 # root VP 1 stats
> 4/ # child partition id
> stats # child partition stats
> vp/ # child virtual processors
> 0 # child VP 0 stats
> 1 # child VP 1 stats
>
> Unless I'm misunderstanding what you mean, I think the original is better,
> both because it's more consistent and does leave room for adding additional
> files if we ever want to.
Fair enough. Just curious -- is there envisioned to be a user space program
written to read and display all these stats in some organized fashion? I'm
presuming the user space VMM should not have an operational dependency
on this data because it is debugfs.
>
> >> On L1VH, some stats are not present as it does not own the hardware
> >> like the root partition does:
> >> - The hypervisor and lp stats are not present
> >> - L1VH's partition directory is named "self" because it can't get its
> >> own id
> >> - Some of L1VH's partition and VP stats fields are not populated, because
> >> it can't map its own HV_STATS_AREA_PARENT page.
> >>
> >> Co-developed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> >> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> >> Co-developed-by: Praveen K Paladugu <prapal@linux.microsoft.com>
> >> Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
> >> Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> Co-developed-by: Purna Pavan Chandra Aekkaladevi
> >> <paekkaladevi@linux.microsoft.com>
> >> Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
> >> Co-developed-by: Jinank Jain <jinankjain@microsoft.com>
> >> Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> >> ---
> >> drivers/hv/Makefile | 1 +
> >> drivers/hv/mshv_debugfs.c | 1122 +++++++++++++++++++++++++++++++++++
> >> drivers/hv/mshv_root.h | 34 ++
> >> drivers/hv/mshv_root_main.c | 32 +-
> >> 4 files changed, 1185 insertions(+), 4 deletions(-)
> >> create mode 100644 drivers/hv/mshv_debugfs.c
> >>
> >> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
> >> index 58b8d07639f3..36278c936914 100644
> >> --- a/drivers/hv/Makefile
> >> +++ b/drivers/hv/Makefile
> >> @@ -15,6 +15,7 @@ hv_vmbus-$(CONFIG_HYPERV_TESTING) += hv_debugfs.o
> >> hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
> >> mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
> >> mshv_root_hv_call.o mshv_portid_table.o
> >> +mshv_root-$(CONFIG_DEBUG_FS) += mshv_debugfs.o
> >> mshv_vtl-y := mshv_vtl_main.o
> >>
> >> # Code that must be built-in
> >> diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
> >> new file mode 100644
> >> index 000000000000..581018690a27
> >> --- /dev/null
> >> +++ b/drivers/hv/mshv_debugfs.c
> >> @@ -0,0 +1,1122 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/*
> >> + * Copyright (c) 2025, Microsoft Corporation.
> >> + *
> >> + * The /sys/kernel/debug/mshv directory contents.
> >> + * Contains various statistics data, provided by the hypervisor.
> >> + *
> >> + * Authors: Microsoft Linux virtualization team
> >> + */
> >> +
> >> +#include <linux/debugfs.h>
> >> +#include <linux/stringify.h>
> >> +#include <asm/mshyperv.h>
> >> +#include <linux/slab.h>
> >> +
> >> +#include "mshv.h"
> >> +#include "mshv_root.h"
> >> +
> >> +#define U32_BUF_SZ 11
> >> +#define U64_BUF_SZ 21
> >> +
> >> +static struct dentry *mshv_debugfs;
> >> +static struct dentry *mshv_debugfs_partition;
> >> +static struct dentry *mshv_debugfs_lp;
> >> +
> >> +static u64 mshv_lps_count;
> >> +
> >> +static bool is_l1vh_parent(u64 partition_id)
> >> +{
> >> + return hv_l1vh_partition() && (partition_id == HV_PARTITION_ID_SELF);
> >> +}
> >> +
> >> +static int lp_stats_show(struct seq_file *m, void *v)
> >> +{
> >> + const struct hv_stats_page *stats = m->private;
> >> +
> >> +#define LP_SEQ_PRINTF(cnt) \
> >> + seq_printf(m, "%-29s: %llu\n", __stringify(cnt), stats->lp_cntrs[Lp##cnt])
> >> +
> >> + LP_SEQ_PRINTF(GlobalTime);
> >> + LP_SEQ_PRINTF(TotalRunTime);
> >> + LP_SEQ_PRINTF(HypervisorRunTime);
> >> + LP_SEQ_PRINTF(HardwareInterrupts);
> >> + LP_SEQ_PRINTF(ContextSwitches);
> >> + LP_SEQ_PRINTF(InterProcessorInterrupts);
> >> + LP_SEQ_PRINTF(SchedulerInterrupts);
> >> + LP_SEQ_PRINTF(TimerInterrupts);
> >> + LP_SEQ_PRINTF(InterProcessorInterruptsSent);
> >> + LP_SEQ_PRINTF(ProcessorHalts);
> >> + LP_SEQ_PRINTF(MonitorTransitionCost);
> >> + LP_SEQ_PRINTF(ContextSwitchTime);
> >> + LP_SEQ_PRINTF(C1TransitionsCount);
> >> + LP_SEQ_PRINTF(C1RunTime);
> >> + LP_SEQ_PRINTF(C2TransitionsCount);
> >> + LP_SEQ_PRINTF(C2RunTime);
> >> + LP_SEQ_PRINTF(C3TransitionsCount);
> >> + LP_SEQ_PRINTF(C3RunTime);
> >> + LP_SEQ_PRINTF(RootVpIndex);
> >> + LP_SEQ_PRINTF(IdleSequenceNumber);
> >> + LP_SEQ_PRINTF(GlobalTscCount);
> >> + LP_SEQ_PRINTF(ActiveTscCount);
> >> + LP_SEQ_PRINTF(IdleAccumulation);
> >> + LP_SEQ_PRINTF(ReferenceCycleCount0);
> >> + LP_SEQ_PRINTF(ActualCycleCount0);
> >> + LP_SEQ_PRINTF(ReferenceCycleCount1);
> >> + LP_SEQ_PRINTF(ActualCycleCount1);
> >> + LP_SEQ_PRINTF(ProximityDomainId);
> >> + LP_SEQ_PRINTF(PostedInterruptNotifications);
> >> + LP_SEQ_PRINTF(BranchPredictorFlushes);
> >> +#if IS_ENABLED(CONFIG_X86_64)
> >> + LP_SEQ_PRINTF(L1DataCacheFlushes);
> >> + LP_SEQ_PRINTF(ImmediateL1DataCacheFlushes);
> >> + LP_SEQ_PRINTF(MbFlushes);
> >> + LP_SEQ_PRINTF(CounterRefreshSequenceNumber);
> >> + LP_SEQ_PRINTF(CounterRefreshReferenceTime);
> >> + LP_SEQ_PRINTF(IdleAccumulationSnapshot);
> >> + LP_SEQ_PRINTF(ActiveTscCountSnapshot);
> >> + LP_SEQ_PRINTF(HwpRequestContextSwitches);
> >> + LP_SEQ_PRINTF(Placeholder1);
> >> + LP_SEQ_PRINTF(Placeholder2);
> >> + LP_SEQ_PRINTF(Placeholder3);
> >> + LP_SEQ_PRINTF(Placeholder4);
> >> + LP_SEQ_PRINTF(Placeholder5);
> >> + LP_SEQ_PRINTF(Placeholder6);
> >> + LP_SEQ_PRINTF(Placeholder7);
> >> + LP_SEQ_PRINTF(Placeholder8);
> >> + LP_SEQ_PRINTF(Placeholder9);
> >> + LP_SEQ_PRINTF(Placeholder10);
> >> + LP_SEQ_PRINTF(ReserveGroupId);
> >> + LP_SEQ_PRINTF(RunningPriority);
> >> + LP_SEQ_PRINTF(PerfmonInterruptCount);
> >> +#elif IS_ENABLED(CONFIG_ARM64)
> >> + LP_SEQ_PRINTF(CounterRefreshSequenceNumber);
> >> + LP_SEQ_PRINTF(CounterRefreshReferenceTime);
> >> + LP_SEQ_PRINTF(IdleAccumulationSnapshot);
> >> + LP_SEQ_PRINTF(ActiveTscCountSnapshot);
> >> + LP_SEQ_PRINTF(HwpRequestContextSwitches);
> >> + LP_SEQ_PRINTF(Placeholder2);
> >> + LP_SEQ_PRINTF(Placeholder3);
> >> + LP_SEQ_PRINTF(Placeholder4);
> >> + LP_SEQ_PRINTF(Placeholder5);
> >> + LP_SEQ_PRINTF(Placeholder6);
> >> + LP_SEQ_PRINTF(Placeholder7);
> >> + LP_SEQ_PRINTF(Placeholder8);
> >> + LP_SEQ_PRINTF(Placeholder9);
> >> + LP_SEQ_PRINTF(SchLocalRunListSize);
> >> + LP_SEQ_PRINTF(ReserveGroupId);
> >> + LP_SEQ_PRINTF(RunningPriority);
> >> +#endif
> >> +
> >> + return 0;
> >> +}
> >> +DEFINE_SHOW_ATTRIBUTE(lp_stats);
> >> +
> >> +static void mshv_lp_stats_unmap(u32 lp_index, void *stats_page_addr)
> >> +{
> >> + union hv_stats_object_identity identity = {
> >> + .lp.lp_index = lp_index,
> >> + .lp.stats_area_type = HV_STATS_AREA_SELF,
> >> + };
> >> + int err;
> >> +
> >> + err = hv_unmap_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR,
> >> + stats_page_addr, &identity);
> >> + if (err)
> >> + pr_err("%s: failed to unmap logical processor %u stats, err: %d\n",
> >> + __func__, lp_index, err);
> >> +}
> >> +
> >> +static void __init *mshv_lp_stats_map(u32 lp_index)
> >> +{
> >> + union hv_stats_object_identity identity = {
> >> + .lp.lp_index = lp_index,
> >> + .lp.stats_area_type = HV_STATS_AREA_SELF,
> >> + };
> >> + void *stats;
> >> + int err;
> >> +
> >> + err = hv_map_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR, &identity,
> >> + &stats);
> >> + if (err) {
> >> + pr_err("%s: failed to map logical processor %u stats, err: %d\n",
> >> + __func__, lp_index, err);
> >> + return ERR_PTR(err);
> >> + }
> >> +
> >> + return stats;
> >> +}
> >> +
> >> +static void __init *lp_debugfs_stats_create(u32 lp_index, struct dentry *parent)
> >> +{
> >> + struct dentry *dentry;
> >> + void *stats;
> >> +
> >> + stats = mshv_lp_stats_map(lp_index);
> >> + if (IS_ERR(stats))
> >> + return stats;
> >> +
> >> + dentry = debugfs_create_file("stats", 0400, parent,
> >> + stats, &lp_stats_fops);
> >> + if (IS_ERR(dentry)) {
> >> + mshv_lp_stats_unmap(lp_index, stats);
> >> + return dentry;
> >> + }
> >> + return stats;
> >> +}
> >> +
> >> +static int __init lp_debugfs_create(u32 lp_index, struct dentry *parent)
> >> +{
> >> + struct dentry *idx;
> >> + char lp_idx_str[U32_BUF_SZ];
> >> + void *stats;
> >> + int err;
> >> +
> >> + sprintf(lp_idx_str, "%u", lp_index);
> >> +
> >> + idx = debugfs_create_dir(lp_idx_str, parent);
> >> + if (IS_ERR(idx))
> >> + return PTR_ERR(idx);
> >> +
> >> + stats = lp_debugfs_stats_create(lp_index, idx);
> >> + if (IS_ERR(stats)) {
> >> + err = PTR_ERR(stats);
> >> + goto remove_debugfs_lp_idx;
> >> + }
> >> +
> >> + return 0;
> >> +
> >> +remove_debugfs_lp_idx:
> >> + debugfs_remove_recursive(idx);
> >> + return err;
> >> +}
> >> +
> >> +static void mshv_debugfs_lp_remove(void)
> >> +{
> >> + int lp_index;
> >> +
> >> + debugfs_remove_recursive(mshv_debugfs_lp);
> >> +
> >> + for (lp_index = 0; lp_index < mshv_lps_count; lp_index++)
> >> + mshv_lp_stats_unmap(lp_index, NULL);
> >
> > Passing NULL as the second argument here leaks the stats page
> > memory if Linux allocated the page as an overlay GPFN. But is that
> > considered OK because the debugfs entries for LPs are removed
> > only when the root partition is shutting down? That works as
> > long as hot-add/remove of CPUs isn't supported in the root
> > partition.
> >
> Hmm, at the very least this appears to be a memory leak if the mshv
> driver is built as a module and removed + reinserted. The stats
> pages can be mapped multiple times so it will just allocate a page
> (on L1VH anyway) and remap it each time. I will check and fix it in
> this patch.
OK. I was thinking that removing and re-inserting the mshv driver
module isn't possible from any practical standpoint without doing
a shutdown, but maybe there is a way.
>
> >> +}
> >> +
> >> +static int __init mshv_debugfs_lp_create(struct dentry *parent)
> >> +{
> >> + struct dentry *lp_dir;
> >> + int err, lp_index;
> >> +
> >> + lp_dir = debugfs_create_dir("lp", parent);
> >> + if (IS_ERR(lp_dir))
> >> + return PTR_ERR(lp_dir);
> >> +
> >> + for (lp_index = 0; lp_index < mshv_lps_count; lp_index++) {
> >> + err = lp_debugfs_create(lp_index, lp_dir);
> >> + if (err)
> >> + goto remove_debugfs_lps;
> >> + }
> >> +
> >> + mshv_debugfs_lp = lp_dir;
> >> +
> >> + return 0;
> >> +
> >> +remove_debugfs_lps:
> >> + for (lp_index -= 1; lp_index >= 0; lp_index--)
> >> + mshv_lp_stats_unmap(lp_index, NULL);
> >> + debugfs_remove_recursive(lp_dir);
> >> + return err;
> >> +}
> >> +
> >> +static int vp_stats_show(struct seq_file *m, void *v)
> >> +{
> >> + const struct hv_stats_page **pstats = m->private;
> >> +
> >> +#define VP_SEQ_PRINTF(cnt) \
> >> +do { \
> >> + if (pstats[HV_STATS_AREA_SELF]->vp_cntrs[Vp##cnt]) \
> >> + seq_printf(m, "%-30s: %llu\n", __stringify(cnt), \
> >> + pstats[HV_STATS_AREA_SELF]->vp_cntrs[Vp##cnt]); \
> >> + else \
> >> + seq_printf(m, "%-30s: %llu\n", __stringify(cnt), \
> >> + pstats[HV_STATS_AREA_PARENT]->vp_cntrs[Vp##cnt]); \
> >> +} while (0)
> >
> > I don't understand this logic. Like in mshv_vp_dispatch_thread_blocked(), if
> > the SELF value is zero, then the PARENT value is used. The implication is that
> > you never want to display a SELF value of zero, which is a bit unexpected
> > since I could imagine zero being valid for some counters. But the overall result
> > is that the displayed values may be a mix of SELF and PARENT values.
>
> Yes, the basic idea is: Display a nonzero value, if there is one on either SELF or
> PARENT pages. (I *think* the values will always be the same if they are nonzero.)
>
> I admit it's not an ideal design from my perspective. As far as I know, it was
> done this way to retain backward compatibility with hypervisors that don't support
> the concept of a PARENT stats area at all.
>
> > And of course after Patch 1 of this series, if running on an older hypervisor
> > that doesn't provide PARENT, then SELF will be used anyway, which further
> > muddies what's going on here, at least for me. :-)
> >
>
> Yes, but in the end we need to check both pages, so there's no avoiding this
> redundant check on old hypervisors without adding a separate code path just for
> that case, which doesn't seem worth it.
>
> > If this is the correct behavior, please add some code comments as to
> > why it makes sense, including in the case where PARENT isn't available.
> >
>
> Ok, will do.
>
> >> +
> >> + VP_SEQ_PRINTF(TotalRunTime);
> >> + VP_SEQ_PRINTF(HypervisorRunTime);
> >> + VP_SEQ_PRINTF(RemoteNodeRunTime);
> >> + VP_SEQ_PRINTF(NormalizedRunTime);
> >> + VP_SEQ_PRINTF(IdealCpu);
> >> + VP_SEQ_PRINTF(HypercallsCount);
> >> + VP_SEQ_PRINTF(HypercallsTime);
> >> +#if IS_ENABLED(CONFIG_X86_64)
> >> + VP_SEQ_PRINTF(PageInvalidationsCount);
> >> + VP_SEQ_PRINTF(PageInvalidationsTime);
> >> + VP_SEQ_PRINTF(ControlRegisterAccessesCount);
> >> + VP_SEQ_PRINTF(ControlRegisterAccessesTime);
> >> + VP_SEQ_PRINTF(IoInstructionsCount);
> >> + VP_SEQ_PRINTF(IoInstructionsTime);
> >> + VP_SEQ_PRINTF(HltInstructionsCount);
> >> + VP_SEQ_PRINTF(HltInstructionsTime);
> >> + VP_SEQ_PRINTF(MwaitInstructionsCount);
> >> + VP_SEQ_PRINTF(MwaitInstructionsTime);
> >> + VP_SEQ_PRINTF(CpuidInstructionsCount);
> >> + VP_SEQ_PRINTF(CpuidInstructionsTime);
> >> + VP_SEQ_PRINTF(MsrAccessesCount);
> >> + VP_SEQ_PRINTF(MsrAccessesTime);
> >> + VP_SEQ_PRINTF(OtherInterceptsCount);
> >> + VP_SEQ_PRINTF(OtherInterceptsTime);
> >> + VP_SEQ_PRINTF(ExternalInterruptsCount);
> >> + VP_SEQ_PRINTF(ExternalInterruptsTime);
> >> + VP_SEQ_PRINTF(PendingInterruptsCount);
> >> + VP_SEQ_PRINTF(PendingInterruptsTime);
> >> + VP_SEQ_PRINTF(EmulatedInstructionsCount);
> >> + VP_SEQ_PRINTF(EmulatedInstructionsTime);
> >> + VP_SEQ_PRINTF(DebugRegisterAccessesCount);
> >> + VP_SEQ_PRINTF(DebugRegisterAccessesTime);
> >> + VP_SEQ_PRINTF(PageFaultInterceptsCount);
> >> + VP_SEQ_PRINTF(PageFaultInterceptsTime);
> >> + VP_SEQ_PRINTF(GuestPageTableMaps);
> >> + VP_SEQ_PRINTF(LargePageTlbFills);
> >> + VP_SEQ_PRINTF(SmallPageTlbFills);
> >> + VP_SEQ_PRINTF(ReflectedGuestPageFaults);
> >> + VP_SEQ_PRINTF(ApicMmioAccesses);
> >> + VP_SEQ_PRINTF(IoInterceptMessages);
> >> + VP_SEQ_PRINTF(MemoryInterceptMessages);
> >> + VP_SEQ_PRINTF(ApicEoiAccesses);
> >> + VP_SEQ_PRINTF(OtherMessages);
> >> + VP_SEQ_PRINTF(PageTableAllocations);
> >> + VP_SEQ_PRINTF(LogicalProcessorMigrations);
> >> + VP_SEQ_PRINTF(AddressSpaceEvictions);
> >> + VP_SEQ_PRINTF(AddressSpaceSwitches);
> >> + VP_SEQ_PRINTF(AddressDomainFlushes);
> >> + VP_SEQ_PRINTF(AddressSpaceFlushes);
> >> + VP_SEQ_PRINTF(GlobalGvaRangeFlushes);
> >> + VP_SEQ_PRINTF(LocalGvaRangeFlushes);
> >> + VP_SEQ_PRINTF(PageTableEvictions);
> >> + VP_SEQ_PRINTF(PageTableReclamations);
> >> + VP_SEQ_PRINTF(PageTableResets);
> >> + VP_SEQ_PRINTF(PageTableValidations);
> >> + VP_SEQ_PRINTF(ApicTprAccesses);
> >> + VP_SEQ_PRINTF(PageTableWriteIntercepts);
> >> + VP_SEQ_PRINTF(SyntheticInterrupts);
> >> + VP_SEQ_PRINTF(VirtualInterrupts);
> >> + VP_SEQ_PRINTF(ApicIpisSent);
> >> + VP_SEQ_PRINTF(ApicSelfIpisSent);
> >> + VP_SEQ_PRINTF(GpaSpaceHypercalls);
> >> + VP_SEQ_PRINTF(LogicalProcessorHypercalls);
> >> + VP_SEQ_PRINTF(LongSpinWaitHypercalls);
> >> + VP_SEQ_PRINTF(OtherHypercalls);
> >> + VP_SEQ_PRINTF(SyntheticInterruptHypercalls);
> >> + VP_SEQ_PRINTF(VirtualInterruptHypercalls);
> >> + VP_SEQ_PRINTF(VirtualMmuHypercalls);
> >> + VP_SEQ_PRINTF(VirtualProcessorHypercalls);
> >> + VP_SEQ_PRINTF(HardwareInterrupts);
> >> + VP_SEQ_PRINTF(NestedPageFaultInterceptsCount);
> >> + VP_SEQ_PRINTF(NestedPageFaultInterceptsTime);
> >> + VP_SEQ_PRINTF(PageScans);
> >> + VP_SEQ_PRINTF(LogicalProcessorDispatches);
> >> + VP_SEQ_PRINTF(WaitingForCpuTime);
> >> + VP_SEQ_PRINTF(ExtendedHypercalls);
> >> + VP_SEQ_PRINTF(ExtendedHypercallInterceptMessages);
> >> + VP_SEQ_PRINTF(MbecNestedPageTableSwitches);
> >> + VP_SEQ_PRINTF(OtherReflectedGuestExceptions);
> >> + VP_SEQ_PRINTF(GlobalIoTlbFlushes);
> >> + VP_SEQ_PRINTF(GlobalIoTlbFlushCost);
> >> + VP_SEQ_PRINTF(LocalIoTlbFlushes);
> >> + VP_SEQ_PRINTF(LocalIoTlbFlushCost);
> >> + VP_SEQ_PRINTF(HypercallsForwardedCount);
> >> + VP_SEQ_PRINTF(HypercallsForwardingTime);
> >> + VP_SEQ_PRINTF(PageInvalidationsForwardedCount);
> >> + VP_SEQ_PRINTF(PageInvalidationsForwardingTime);
> >> + VP_SEQ_PRINTF(ControlRegisterAccessesForwardedCount);
> >> + VP_SEQ_PRINTF(ControlRegisterAccessesForwardingTime);
> >> + VP_SEQ_PRINTF(IoInstructionsForwardedCount);
> >> + VP_SEQ_PRINTF(IoInstructionsForwardingTime);
> >> + VP_SEQ_PRINTF(HltInstructionsForwardedCount);
> >> + VP_SEQ_PRINTF(HltInstructionsForwardingTime);
> >> + VP_SEQ_PRINTF(MwaitInstructionsForwardedCount);
> >> + VP_SEQ_PRINTF(MwaitInstructionsForwardingTime);
> >> + VP_SEQ_PRINTF(CpuidInstructionsForwardedCount);
> >> + VP_SEQ_PRINTF(CpuidInstructionsForwardingTime);
> >> + VP_SEQ_PRINTF(MsrAccessesForwardedCount);
> >> + VP_SEQ_PRINTF(MsrAccessesForwardingTime);
> >> + VP_SEQ_PRINTF(OtherInterceptsForwardedCount);
> >> + VP_SEQ_PRINTF(OtherInterceptsForwardingTime);
> >> + VP_SEQ_PRINTF(ExternalInterruptsForwardedCount);
> >> + VP_SEQ_PRINTF(ExternalInterruptsForwardingTime);
> >> + VP_SEQ_PRINTF(PendingInterruptsForwardedCount);
> >> + VP_SEQ_PRINTF(PendingInterruptsForwardingTime);
> >> + VP_SEQ_PRINTF(EmulatedInstructionsForwardedCount);
> >> + VP_SEQ_PRINTF(EmulatedInstructionsForwardingTime);
> >> + VP_SEQ_PRINTF(DebugRegisterAccessesForwardedCount);
> >> + VP_SEQ_PRINTF(DebugRegisterAccessesForwardingTime);
> >> + VP_SEQ_PRINTF(PageFaultInterceptsForwardedCount);
> >> + VP_SEQ_PRINTF(PageFaultInterceptsForwardingTime);
> >> + VP_SEQ_PRINTF(VmclearEmulationCount);
> >> + VP_SEQ_PRINTF(VmclearEmulationTime);
> >> + VP_SEQ_PRINTF(VmptrldEmulationCount);
> >> + VP_SEQ_PRINTF(VmptrldEmulationTime);
> >> + VP_SEQ_PRINTF(VmptrstEmulationCount);
> >> + VP_SEQ_PRINTF(VmptrstEmulationTime);
> >> + VP_SEQ_PRINTF(VmreadEmulationCount);
> >> + VP_SEQ_PRINTF(VmreadEmulationTime);
> >> + VP_SEQ_PRINTF(VmwriteEmulationCount);
> >> + VP_SEQ_PRINTF(VmwriteEmulationTime);
> >> + VP_SEQ_PRINTF(VmxoffEmulationCount);
> >> + VP_SEQ_PRINTF(VmxoffEmulationTime);
> >> + VP_SEQ_PRINTF(VmxonEmulationCount);
> >> + VP_SEQ_PRINTF(VmxonEmulationTime);
> >> + VP_SEQ_PRINTF(NestedVMEntriesCount);
> >> + VP_SEQ_PRINTF(NestedVMEntriesTime);
> >> + VP_SEQ_PRINTF(NestedSLATSoftPageFaultsCount);
> >> + VP_SEQ_PRINTF(NestedSLATSoftPageFaultsTime);
> >> + VP_SEQ_PRINTF(NestedSLATHardPageFaultsCount);
> >> + VP_SEQ_PRINTF(NestedSLATHardPageFaultsTime);
> >> + VP_SEQ_PRINTF(InvEptAllContextEmulationCount);
> >> + VP_SEQ_PRINTF(InvEptAllContextEmulationTime);
> >> + VP_SEQ_PRINTF(InvEptSingleContextEmulationCount);
> >> + VP_SEQ_PRINTF(InvEptSingleContextEmulationTime);
> >> + VP_SEQ_PRINTF(InvVpidAllContextEmulationCount);
> >> + VP_SEQ_PRINTF(InvVpidAllContextEmulationTime);
> >> + VP_SEQ_PRINTF(InvVpidSingleContextEmulationCount);
> >> + VP_SEQ_PRINTF(InvVpidSingleContextEmulationTime);
> >> + VP_SEQ_PRINTF(InvVpidSingleAddressEmulationCount);
> >> + VP_SEQ_PRINTF(InvVpidSingleAddressEmulationTime);
> >> + VP_SEQ_PRINTF(NestedTlbPageTableReclamations);
> >> + VP_SEQ_PRINTF(NestedTlbPageTableEvictions);
> >> + VP_SEQ_PRINTF(FlushGuestPhysicalAddressSpaceHypercalls);
> >> + VP_SEQ_PRINTF(FlushGuestPhysicalAddressListHypercalls);
> >> + VP_SEQ_PRINTF(PostedInterruptNotifications);
> >> + VP_SEQ_PRINTF(PostedInterruptScans);
> >> + VP_SEQ_PRINTF(TotalCoreRunTime);
> >> + VP_SEQ_PRINTF(MaximumRunTime);
> >> + VP_SEQ_PRINTF(HwpRequestContextSwitches);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket0);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket1);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket2);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket3);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket4);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket5);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket6);
> >> + VP_SEQ_PRINTF(VmloadEmulationCount);
> >> + VP_SEQ_PRINTF(VmloadEmulationTime);
> >> + VP_SEQ_PRINTF(VmsaveEmulationCount);
> >> + VP_SEQ_PRINTF(VmsaveEmulationTime);
> >> + VP_SEQ_PRINTF(GifInstructionEmulationCount);
> >> + VP_SEQ_PRINTF(GifInstructionEmulationTime);
> >> + VP_SEQ_PRINTF(EmulatedErrataSvmInstructions);
> >> + VP_SEQ_PRINTF(Placeholder1);
> >> + VP_SEQ_PRINTF(Placeholder2);
> >> + VP_SEQ_PRINTF(Placeholder3);
> >> + VP_SEQ_PRINTF(Placeholder4);
> >> + VP_SEQ_PRINTF(Placeholder5);
> >> + VP_SEQ_PRINTF(Placeholder6);
> >> + VP_SEQ_PRINTF(Placeholder7);
> >> + VP_SEQ_PRINTF(Placeholder8);
> >> + VP_SEQ_PRINTF(Placeholder9);
> >> + VP_SEQ_PRINTF(Placeholder10);
> >> + VP_SEQ_PRINTF(SchedulingPriority);
> >> + VP_SEQ_PRINTF(RdpmcInstructionsCount);
> >> + VP_SEQ_PRINTF(RdpmcInstructionsTime);
> >> + VP_SEQ_PRINTF(PerfmonPmuMsrAccessesCount);
> >> + VP_SEQ_PRINTF(PerfmonLbrMsrAccessesCount);
> >> + VP_SEQ_PRINTF(PerfmonIptMsrAccessesCount);
> >> + VP_SEQ_PRINTF(PerfmonInterruptCount);
> >> + VP_SEQ_PRINTF(Vtl1DispatchCount);
> >> + VP_SEQ_PRINTF(Vtl2DispatchCount);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket0);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket1);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket2);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket3);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket4);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket5);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket6);
> >> + VP_SEQ_PRINTF(Vtl1RunTime);
> >> + VP_SEQ_PRINTF(Vtl2RunTime);
> >> + VP_SEQ_PRINTF(IommuHypercalls);
> >> + VP_SEQ_PRINTF(CpuGroupHypercalls);
> >> + VP_SEQ_PRINTF(VsmHypercalls);
> >> + VP_SEQ_PRINTF(EventLogHypercalls);
> >> + VP_SEQ_PRINTF(DeviceDomainHypercalls);
> >> + VP_SEQ_PRINTF(DepositHypercalls);
> >> + VP_SEQ_PRINTF(SvmHypercalls);
> >> + VP_SEQ_PRINTF(BusLockAcquisitionCount);
> >
> > The x86 VpUnused counter is not shown. Any reason for that? All the
> > Placeholder counters *are* shown, so I'm just wondering what's
> > different.
> >
>
> Good question, I believe when this code was written VpUnused was
> actually undefined in our headers, because the value 201 was
> temporarily used for VpRootDispatchThreadBlocked before that was
> changed to 202 (the hypervisor version using 201 was never released
> publically so not considered a breaking change).
>
> Checking the code, 201 now refers to VpLoadAvg on x86 so I will
> update the definitions in patch #2 of this series to include that,
> and add it here in the debugfs code.
>
> >> +#elif IS_ENABLED(CONFIG_ARM64)
> >> + VP_SEQ_PRINTF(SysRegAccessesCount);
> >> + VP_SEQ_PRINTF(SysRegAccessesTime);
> >> + VP_SEQ_PRINTF(SmcInstructionsCount);
> >> + VP_SEQ_PRINTF(SmcInstructionsTime);
> >> + VP_SEQ_PRINTF(OtherInterceptsCount);
> >> + VP_SEQ_PRINTF(OtherInterceptsTime);
> >> + VP_SEQ_PRINTF(ExternalInterruptsCount);
> >> + VP_SEQ_PRINTF(ExternalInterruptsTime);
> >> + VP_SEQ_PRINTF(PendingInterruptsCount);
> >> + VP_SEQ_PRINTF(PendingInterruptsTime);
> >> + VP_SEQ_PRINTF(GuestPageTableMaps);
> >> + VP_SEQ_PRINTF(LargePageTlbFills);
> >> + VP_SEQ_PRINTF(SmallPageTlbFills);
> >> + VP_SEQ_PRINTF(ReflectedGuestPageFaults);
> >> + VP_SEQ_PRINTF(MemoryInterceptMessages);
> >> + VP_SEQ_PRINTF(OtherMessages);
> >> + VP_SEQ_PRINTF(LogicalProcessorMigrations);
> >> + VP_SEQ_PRINTF(AddressDomainFlushes);
> >> + VP_SEQ_PRINTF(AddressSpaceFlushes);
> >> + VP_SEQ_PRINTF(SyntheticInterrupts);
> >> + VP_SEQ_PRINTF(VirtualInterrupts);
> >> + VP_SEQ_PRINTF(ApicSelfIpisSent);
> >> + VP_SEQ_PRINTF(GpaSpaceHypercalls);
> >> + VP_SEQ_PRINTF(LogicalProcessorHypercalls);
> >> + VP_SEQ_PRINTF(LongSpinWaitHypercalls);
> >> + VP_SEQ_PRINTF(OtherHypercalls);
> >> + VP_SEQ_PRINTF(SyntheticInterruptHypercalls);
> >> + VP_SEQ_PRINTF(VirtualInterruptHypercalls);
> >> + VP_SEQ_PRINTF(VirtualMmuHypercalls);
> >> + VP_SEQ_PRINTF(VirtualProcessorHypercalls);
> >> + VP_SEQ_PRINTF(HardwareInterrupts);
> >> + VP_SEQ_PRINTF(NestedPageFaultInterceptsCount);
> >> + VP_SEQ_PRINTF(NestedPageFaultInterceptsTime);
> >> + VP_SEQ_PRINTF(LogicalProcessorDispatches);
> >> + VP_SEQ_PRINTF(WaitingForCpuTime);
> >> + VP_SEQ_PRINTF(ExtendedHypercalls);
> >> + VP_SEQ_PRINTF(ExtendedHypercallInterceptMessages);
> >> + VP_SEQ_PRINTF(MbecNestedPageTableSwitches);
> >> + VP_SEQ_PRINTF(OtherReflectedGuestExceptions);
> >> + VP_SEQ_PRINTF(GlobalIoTlbFlushes);
> >> + VP_SEQ_PRINTF(GlobalIoTlbFlushCost);
> >> + VP_SEQ_PRINTF(LocalIoTlbFlushes);
> >> + VP_SEQ_PRINTF(LocalIoTlbFlushCost);
> >> + VP_SEQ_PRINTF(FlushGuestPhysicalAddressSpaceHypercalls);
> >> + VP_SEQ_PRINTF(FlushGuestPhysicalAddressListHypercalls);
> >> + VP_SEQ_PRINTF(PostedInterruptNotifications);
> >> + VP_SEQ_PRINTF(PostedInterruptScans);
> >> + VP_SEQ_PRINTF(TotalCoreRunTime);
> >> + VP_SEQ_PRINTF(MaximumRunTime);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket0);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket1);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket2);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket3);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket4);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket5);
> >> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket6);
> >> + VP_SEQ_PRINTF(HwpRequestContextSwitches);
> >> + VP_SEQ_PRINTF(Placeholder2);
> >> + VP_SEQ_PRINTF(Placeholder3);
> >> + VP_SEQ_PRINTF(Placeholder4);
> >> + VP_SEQ_PRINTF(Placeholder5);
> >> + VP_SEQ_PRINTF(Placeholder6);
> >> + VP_SEQ_PRINTF(Placeholder7);
> >> + VP_SEQ_PRINTF(Placeholder8);
> >> + VP_SEQ_PRINTF(ContentionTime);
> >> + VP_SEQ_PRINTF(WakeUpTime);
> >> + VP_SEQ_PRINTF(SchedulingPriority);
> >> + VP_SEQ_PRINTF(Vtl1DispatchCount);
> >> + VP_SEQ_PRINTF(Vtl2DispatchCount);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket0);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket1);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket2);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket3);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket4);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket5);
> >> + VP_SEQ_PRINTF(Vtl2DispatchBucket6);
> >> + VP_SEQ_PRINTF(Vtl1RunTime);
> >> + VP_SEQ_PRINTF(Vtl2RunTime);
> >> + VP_SEQ_PRINTF(IommuHypercalls);
> >> + VP_SEQ_PRINTF(CpuGroupHypercalls);
> >> + VP_SEQ_PRINTF(VsmHypercalls);
> >> + VP_SEQ_PRINTF(EventLogHypercalls);
> >> + VP_SEQ_PRINTF(DeviceDomainHypercalls);
> >> + VP_SEQ_PRINTF(DepositHypercalls);
> >> + VP_SEQ_PRINTF(SvmHypercalls);
> >
> > The ARM64 VpLoadAvg counter is not shown? Any reason why?
> >
>
> I'm not sure, but could be related to the reasoning in the above
> comment - likely VpLoadAvg didn't exist before. I will add it.
>
> >> +#endif
> >
> > The VpRootDispatchThreadBlocked counter is not shown for either
> > x86 or ARM64. Is that intentional, and if so, why? I know the counter
> > is used in mshv_vp_dispatch_thread_blocked(), but it's not clear why
> > that means it shouldn't be shown here.
> >
>
> VpRootDispatchThreadBlocked is not really a 'stat' that you might want
> to expose like the other values, it's really a boolean control value
> that was tacked onto the vp stats page to facilitate fast interrupt
> injection used by the root scheduler. As such it isn't of much value to
> userspace.
I'd probably show it just for completeness and consistency, but I
don't have strong views on the topic.
>
> >> +
> >> + return 0;
> >> +}
> >
> > This function, vp_stats_show(), seems like a candidate for redoing based on a
> > static table that lists the counter names and index. Then the code just loops
> > through the table. On x86 each VP_SEQ_PRINTF() generates 42 bytes of code,
> > and there are 199 entries, so 8358 bytes. The table entries would probably
> > be 16 bytes each (a 64-bit pointer to the string constant, a 32-bit index value,
> > and 4 bytes of padding so each entry is 8-byte aligned). The actual space
> > saving isn't that large, but the code would be a lot more compact. The
> > other *_stats_shows() functions could do the same.
> >
> > It's distasteful to me to see 420 lines of enum entries in Patch 2 of this series,
> > then followed by another 420 lines of matching *_SEQ_PRINTF entries. But I
> > realize that the goal of the enum entries is to match the Windows code, so I
> > guess it is what it is. But there's an argument for ditching the enum entries
> > entirely, and using the putative static table to capture the information. It
> > doesn't seem like matching the Windows code is saving much sync effort
> > since any additions/ subtractions to the enum entries need to be matched
> > with changes in the *_stats_show() functions, or in my putative static table.
> > But I guess if Windows changed only the value for an enum entry without
> > additions/subtractions, that would sync more easily.
> >
>
> Keeping the definitions as close to Windows code as possible is a high priority,
> for consistency and hopefully partially automating that process in future. So,
> I'm against throwing away the enum values. The downside of having to update
> two code locations when adding a new enum member is fine by me.
>
> I'm not against replacing this sequence of macros with a loop over a table like
> the one you propose (in addition to keeping the enum values). That would save
> some space as you point out above, but the impact is fairly minimal.
>
> In terms of aesthetics the definition for a table will look very very similar to
> the list of VP_SEQ_PRINTF() that are currently here. So all in all, I don't see
> a strong reason to switch to a table, unless the space issue is more important
> that I realize.
>
> > I'm just throwing this out as a thought. You may prefer to keep everything
> > "as is", in which case ignore my comment and I won't raise it again.
> >
>
> Thanks, feel free to follow up if you have further thoughts on this part, I'm
> open to changing it if there's a reason. Right now it feels like mainly an
> aesthetics/cleanliness argument and I'm not sure it's worth the effort.
No further thoughts. I wanted to broach the idea, but I'm fine with
your judgment.
>
> >> +DEFINE_SHOW_ATTRIBUTE(vp_stats);
> >> +
> >> +static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index, void *stats_page_addr,
> >> + enum hv_stats_area_type stats_area_type)
> >> +{
> >> + union hv_stats_object_identity identity = {
> >> + .vp.partition_id = partition_id,
> >> + .vp.vp_index = vp_index,
> >> + .vp.stats_area_type = stats_area_type,
> >> + };
> >> + int err;
> >> +
> >> + err = hv_unmap_stats_page(HV_STATS_OBJECT_VP, stats_page_addr, &identity);
> >> + if (err)
> >> + pr_err("%s: failed to unmap partition %llu vp %u %s stats, err: %d\n",
> >> + __func__, partition_id, vp_index,
> >> + (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
> >> + err);
> >> +}
> >> +
> >> +static void *mshv_vp_stats_map(u64 partition_id, u32 vp_index,
> >> + enum hv_stats_area_type stats_area_type)
> >> +{
> >> + union hv_stats_object_identity identity = {
> >> + .vp.partition_id = partition_id,
> >> + .vp.vp_index = vp_index,
> >> + .vp.stats_area_type = stats_area_type,
> >> + };
> >> + void *stats;
> >> + int err;
> >> +
> >> + err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity, &stats);
> >> + if (err) {
> >> + pr_err("%s: failed to map partition %llu vp %u %s stats, err: %d\n",
> >> + __func__, partition_id, vp_index,
> >> + (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
> >> + err);
> >> + return ERR_PTR(err);
> >> + }
> >> + return stats;
> >> +}
> >
> > Presumably you've noticed that the functions mshv_vp_stats_map() and
> > mshv_vp_stats_unmap() also exist in mshv_root_main.c. They are static
> > functions in both places, so the compiler & linker do the right thing, but
> > it sure does make things a bit more complex for human readers. The versions
> > here follow a consistent pattern for (lp, vp, hv, partition), so maybe the ones
> > in mshv_root_main.c could be renamed to avoid confusion?
> >
>
> Good point - this is being addressed in our internal tree but hasn't made it into
> this patch set. I will consider squashing that into a later version of this set,
> but for now I'm treating it as a future cleanup patch to send later.
OK
>
> >> +
> >> +static int vp_debugfs_stats_create(u64 partition_id, u32 vp_index,
> >> + struct dentry **vp_stats_ptr,
> >> + struct dentry *parent)
> >> +{
> >> + struct dentry *dentry;
> >> + struct hv_stats_page **pstats;
> >> + int err;
> >> +
> >> + pstats = kcalloc(2, sizeof(struct hv_stats_page *), GFP_KERNEL_ACCOUNT);
> >
> > Open coding "2" as the first parameter makes assumptions about the values of
> > HV_STATS_AREA_SELF and HV_STATS_AREA_PARENT. Should use
> > HV_STATS_AREA_COUNT instead of "2" so that indexing into the array is certain
> > to work.
> >
>
> Thanks, I'll chang it to use HV_STATS_AREA_COUNT.
>
> >> + if (!pstats)
> >> + return -ENOMEM;
> >> +
> >> + pstats[HV_STATS_AREA_SELF] = mshv_vp_stats_map(partition_id, vp_index,
> >> + HV_STATS_AREA_SELF);
> >> + if (IS_ERR(pstats[HV_STATS_AREA_SELF])) {
> >> + err = PTR_ERR(pstats[HV_STATS_AREA_SELF]);
> >> + goto cleanup;
> >> + }
> >> +
> >> + /*
> >> + * L1VH partition cannot access its vp stats in parent area.
> >> + */
> >> + if (is_l1vh_parent(partition_id)) {
> >> + pstats[HV_STATS_AREA_PARENT] = pstats[HV_STATS_AREA_SELF];
> >> + } else {
> >> + pstats[HV_STATS_AREA_PARENT] = mshv_vp_stats_map(
> >> + partition_id, vp_index, HV_STATS_AREA_PARENT);
> >> + if (IS_ERR(pstats[HV_STATS_AREA_PARENT])) {
> >> + err = PTR_ERR(pstats[HV_STATS_AREA_PARENT]);
> >> + goto unmap_self;
> >> + }
> >> + if (!pstats[HV_STATS_AREA_PARENT])
> >> + pstats[HV_STATS_AREA_PARENT] = pstats[HV_STATS_AREA_SELF];
> >> + }
> >> +
> >> + dentry = debugfs_create_file("stats", 0400, parent,
> >> + pstats, &vp_stats_fops);
> >> + if (IS_ERR(dentry)) {
> >> + err = PTR_ERR(dentry);
> >> + goto unmap_vp_stats;
> >> + }
> >> +
> >> + *vp_stats_ptr = dentry;
> >> + return 0;
> >> +
> >> +unmap_vp_stats:
> >> + if (pstats[HV_STATS_AREA_PARENT] != pstats[HV_STATS_AREA_SELF])
> >> + mshv_vp_stats_unmap(partition_id, vp_index, pstats[HV_STATS_AREA_PARENT],
> >> + HV_STATS_AREA_PARENT);
> >> +unmap_self:
> >> + mshv_vp_stats_unmap(partition_id, vp_index, pstats[HV_STATS_AREA_SELF],
> >> + HV_STATS_AREA_SELF);
> >> +cleanup:
> >> + kfree(pstats);
> >> + return err;
> >> +}
> >> +
> >> +static void vp_debugfs_remove(u64 partition_id, u32 vp_index,
> >> + struct dentry *vp_stats)
> >> +{
> >> + struct hv_stats_page **pstats = NULL;
> >> + void *stats;
> >> +
> >> + pstats = vp_stats->d_inode->i_private;
> >> + debugfs_remove_recursive(vp_stats->d_parent);
> >> + if (pstats[HV_STATS_AREA_PARENT] != pstats[HV_STATS_AREA_SELF]) {
> >> + stats = pstats[HV_STATS_AREA_PARENT];
> >> + mshv_vp_stats_unmap(partition_id, vp_index, stats,
> >> + HV_STATS_AREA_PARENT);
> >> + }
> >> +
> >> + stats = pstats[HV_STATS_AREA_SELF];
> >> + mshv_vp_stats_unmap(partition_id, vp_index, stats, HV_STATS_AREA_SELF);
> >> +
> >> + kfree(pstats);
> >> +}
> >> +
> >> +static int vp_debugfs_create(u64 partition_id, u32 vp_index,
> >> + struct dentry **vp_stats_ptr,
> >> + struct dentry *parent)
> >> +{
> >> + struct dentry *vp_idx_dir;
> >> + char vp_idx_str[U32_BUF_SZ];
> >> + int err;
> >> +
> >> + sprintf(vp_idx_str, "%u", vp_index);
> >> +
> >> + vp_idx_dir = debugfs_create_dir(vp_idx_str, parent);
> >> + if (IS_ERR(vp_idx_dir))
> >> + return PTR_ERR(vp_idx_dir);
> >> +
> >> + err = vp_debugfs_stats_create(partition_id, vp_index, vp_stats_ptr,
> >> + vp_idx_dir);
> >> + if (err)
> >> + goto remove_debugfs_vp_idx;
> >> +
> >> + return 0;
> >> +
> >> +remove_debugfs_vp_idx:
> >> + debugfs_remove_recursive(vp_idx_dir);
> >> + return err;
> >> +}
> >> +
> >> +static int partition_stats_show(struct seq_file *m, void *v)
> >> +{
> >> + const struct hv_stats_page **pstats = m->private;
> >> +
> >> +#define PARTITION_SEQ_PRINTF(cnt) \
> >> +do { \
> >> + if (pstats[HV_STATS_AREA_SELF]->pt_cntrs[Partition##cnt]) \
> >> + seq_printf(m, "%-30s: %llu\n", __stringify(cnt), \
> >> + pstats[HV_STATS_AREA_SELF]->pt_cntrs[Partition##cnt]); \
> >> + else \
> >> + seq_printf(m, "%-30s: %llu\n", __stringify(cnt), \
> >> + pstats[HV_STATS_AREA_PARENT]->pt_cntrs[Partition##cnt]); \
> >> +} while (0)
> >
> > Same comment as for VP_SEQ_PRINTF.
> >
> Ack
>
> >> +
> >> + PARTITION_SEQ_PRINTF(VirtualProcessors);
> >> + PARTITION_SEQ_PRINTF(TlbSize);
> >> + PARTITION_SEQ_PRINTF(AddressSpaces);
> >> + PARTITION_SEQ_PRINTF(DepositedPages);
> >> + PARTITION_SEQ_PRINTF(GpaPages);
> >> + PARTITION_SEQ_PRINTF(GpaSpaceModifications);
> >> + PARTITION_SEQ_PRINTF(VirtualTlbFlushEntires);
> >> + PARTITION_SEQ_PRINTF(RecommendedTlbSize);
> >> + PARTITION_SEQ_PRINTF(GpaPages4K);
> >> + PARTITION_SEQ_PRINTF(GpaPages2M);
> >> + PARTITION_SEQ_PRINTF(GpaPages1G);
> >> + PARTITION_SEQ_PRINTF(GpaPages512G);
> >> + PARTITION_SEQ_PRINTF(DevicePages4K);
> >> + PARTITION_SEQ_PRINTF(DevicePages2M);
> >> + PARTITION_SEQ_PRINTF(DevicePages1G);
> >> + PARTITION_SEQ_PRINTF(DevicePages512G);
> >> + PARTITION_SEQ_PRINTF(AttachedDevices);
> >> + PARTITION_SEQ_PRINTF(DeviceInterruptMappings);
> >> + PARTITION_SEQ_PRINTF(IoTlbFlushes);
> >> + PARTITION_SEQ_PRINTF(IoTlbFlushCost);
> >> + PARTITION_SEQ_PRINTF(DeviceInterruptErrors);
> >> + PARTITION_SEQ_PRINTF(DeviceDmaErrors);
> >> + PARTITION_SEQ_PRINTF(DeviceInterruptThrottleEvents);
> >> + PARTITION_SEQ_PRINTF(SkippedTimerTicks);
> >> + PARTITION_SEQ_PRINTF(PartitionId);
> >> +#if IS_ENABLED(CONFIG_X86_64)
> >> + PARTITION_SEQ_PRINTF(NestedTlbSize);
> >> + PARTITION_SEQ_PRINTF(RecommendedNestedTlbSize);
> >> + PARTITION_SEQ_PRINTF(NestedTlbFreeListSize);
> >> + PARTITION_SEQ_PRINTF(NestedTlbTrimmedPages);
> >> + PARTITION_SEQ_PRINTF(PagesShattered);
> >> + PARTITION_SEQ_PRINTF(PagesRecombined);
> >> + PARTITION_SEQ_PRINTF(HwpRequestValue);
> >> +#elif IS_ENABLED(CONFIG_ARM64)
> >> + PARTITION_SEQ_PRINTF(HwpRequestValue);
> >> +#endif
> >> +
> >> + return 0;
> >> +}
> >> +DEFINE_SHOW_ATTRIBUTE(partition_stats);
> >> +
> >> +static void mshv_partition_stats_unmap(u64 partition_id, void *stats_page_addr,
> >> + enum hv_stats_area_type stats_area_type)
> >> +{
> >> + union hv_stats_object_identity identity = {
> >> + .partition.partition_id = partition_id,
> >> + .partition.stats_area_type = stats_area_type,
> >> + };
> >> + int err;
> >> +
> >> + err = hv_unmap_stats_page(HV_STATS_OBJECT_PARTITION, stats_page_addr,
> >> + &identity);
> >> + if (err) {
> >> + pr_err("%s: failed to unmap partition %lld %s stats, err: %d\n",
> >> + __func__, partition_id,
> >> + (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
> >> + err);
> >> + }
> >> +}
> >> +
> >> +static void *mshv_partition_stats_map(u64 partition_id,
> >> + enum hv_stats_area_type stats_area_type)
> >> +{
> >> + union hv_stats_object_identity identity = {
> >> + .partition.partition_id = partition_id,
> >> + .partition.stats_area_type = stats_area_type,
> >> + };
> >> + void *stats;
> >> + int err;
> >> +
> >> + err = hv_map_stats_page(HV_STATS_OBJECT_PARTITION, &identity, &stats);
> >> + if (err) {
> >> + pr_err("%s: failed to map partition %lld %s stats, err: %d\n",
> >> + __func__, partition_id,
> >> + (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
> >> + err);
> >> + return ERR_PTR(err);
> >> + }
> >> + return stats;
> >> +}
> >> +
> >> +static int mshv_debugfs_partition_stats_create(u64 partition_id,
> >> + struct dentry **partition_stats_ptr,
> >> + struct dentry *parent)
> >> +{
> >> + struct dentry *dentry;
> >> + struct hv_stats_page **pstats;
> >> + int err;
> >> +
> >> + pstats = kcalloc(2, sizeof(struct hv_stats_page *), GFP_KERNEL_ACCOUNT);
> >
> > Same comment here about the use of "2" as the first parameter.
> >
> Ack.
>
> >> + if (!pstats)
> >> + return -ENOMEM;
>
> <snip>
> Thanks for the comments, I appreciate the review!
>
> Nuno
^ permalink raw reply
* RE: [PATCH v2 2/3] mshv: Add definitions for stats pages
From: Michael Kelley @ 2026-01-02 16:27 UTC (permalink / raw)
To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, skinsburskii@linux.microsoft.com
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
prapal@linux.microsoft.com, mrathor@linux.microsoft.com,
paekkaladevi@linux.microsoft.com
In-Reply-To: <30cf6cd4-9826-425c-b3f1-465bfa372e01@linux.microsoft.com>
From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, December 30, 2025 3:04 PM
>
> On 12/8/2025 7:13 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, December 5, 2025 10:59 AM
> >>
> >> Add the definitions for hypervisor, logical processor, and partition
> >> stats pages.
> >>
> >> Move the definition for the VP stats page to its rightful place in
> >> hvhdk.h, and add the missing members.
> >>
> >> These enum members retain their CamelCase style, since they are imported
> >> directly from the hypervisor code They will be stringified when printing
> >
> > Missing a '.' (period) after "hypervisor code".
> >
> Ack
>
> >> the stats out, and retain more readability in this form.
> >>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> ---
> >> drivers/hv/mshv_root_main.c | 17 --
> >> include/hyperv/hvhdk.h | 437 ++++++++++++++++++++++++++++++++++++
> >> 2 files changed, 437 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> >> index f59a4ab47685..19006b788e85 100644
> >> --- a/drivers/hv/mshv_root_main.c
> >> +++ b/drivers/hv/mshv_root_main.c
> >> @@ -38,23 +38,6 @@ MODULE_AUTHOR("Microsoft");
> >> MODULE_LICENSE("GPL");
> >> MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
> >>
> >> -/* TODO move this to another file when debugfs code is added */
> >> -enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
> >> -#if defined(CONFIG_X86)
> >> - VpRootDispatchThreadBlocked = 202,
> >> -#elif defined(CONFIG_ARM64)
> >> - VpRootDispatchThreadBlocked = 94,
> >> -#endif
> >> - VpStatsMaxCounter
> >> -};
> >> -
> >> -struct hv_stats_page {
> >> - union {
> >> - u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
> >> - u8 data[HV_HYP_PAGE_SIZE];
> >> - };
> >> -} __packed;
> >> -
> >> struct mshv_root mshv_root;
> >>
> >> enum hv_scheduler_type hv_scheduler_type;
> >> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> >> index 469186df7826..51abbcd0ec37 100644
> >> --- a/include/hyperv/hvhdk.h
> >> +++ b/include/hyperv/hvhdk.h
> >> @@ -10,6 +10,443 @@
> >> #include "hvhdk_mini.h"
> >> #include "hvgdk.h"
> >>
> >> +enum hv_stats_hypervisor_counters { /* HV_HYPERVISOR_COUNTER
> */
> >> + HvLogicalProcessors = 1,
> >> + HvPartitions = 2,
> >> + HvTotalPages = 3,
> >> + HvVirtualProcessors = 4,
> >> + HvMonitoredNotifications = 5,
> >> + HvModernStandbyEntries = 6,
> >> + HvPlatformIdleTransitions = 7,
> >> + HvHypervisorStartupCost = 8,
> >> + HvIOSpacePages = 10,
> >> + HvNonEssentialPagesForDump = 11,
> >> + HvSubsumedPages = 12,
> >> + HvStatsMaxCounter
> >> +};
> >> +
> >> +enum hv_stats_partition_counters { /* HV_PROCESS_COUNTER */
> >> + PartitionVirtualProcessors = 1,
> >> + PartitionTlbSize = 3,
> >> + PartitionAddressSpaces = 4,
> >> + PartitionDepositedPages = 5,
> >> + PartitionGpaPages = 6,
> >> + PartitionGpaSpaceModifications = 7,
> >> + PartitionVirtualTlbFlushEntires = 8,
> >> + PartitionRecommendedTlbSize = 9,
> >> + PartitionGpaPages4K = 10,
> >> + PartitionGpaPages2M = 11,
> >> + PartitionGpaPages1G = 12,
> >> + PartitionGpaPages512G = 13,
> >> + PartitionDevicePages4K = 14,
> >> + PartitionDevicePages2M = 15,
> >> + PartitionDevicePages1G = 16,
> >> + PartitionDevicePages512G = 17,
> >> + PartitionAttachedDevices = 18,
> >> + PartitionDeviceInterruptMappings = 19,
> >> + PartitionIoTlbFlushes = 20,
> >> + PartitionIoTlbFlushCost = 21,
> >> + PartitionDeviceInterruptErrors = 22,
> >> + PartitionDeviceDmaErrors = 23,
> >> + PartitionDeviceInterruptThrottleEvents = 24,
> >> + PartitionSkippedTimerTicks = 25,
> >> + PartitionPartitionId = 26,
> >> +#if IS_ENABLED(CONFIG_X86_64)
> >> + PartitionNestedTlbSize = 27,
> >> + PartitionRecommendedNestedTlbSize = 28,
> >> + PartitionNestedTlbFreeListSize = 29,
> >> + PartitionNestedTlbTrimmedPages = 30,
> >> + PartitionPagesShattered = 31,
> >> + PartitionPagesRecombined = 32,
> >> + PartitionHwpRequestValue = 33,
> >> +#elif IS_ENABLED(CONFIG_ARM64)
> >> + PartitionHwpRequestValue = 27,
> >> +#endif
> >> + PartitionStatsMaxCounter
> >> +};
> >> +
> >> +enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
> >> + VpTotalRunTime = 1,
> >> + VpHypervisorRunTime = 2,
> >> + VpRemoteNodeRunTime = 3,
> >> + VpNormalizedRunTime = 4,
> >> + VpIdealCpu = 5,
> >> + VpHypercallsCount = 7,
> >> + VpHypercallsTime = 8,
> >> +#if IS_ENABLED(CONFIG_X86_64)
> >> + VpPageInvalidationsCount = 9,
> >> + VpPageInvalidationsTime = 10,
> >> + VpControlRegisterAccessesCount = 11,
> >> + VpControlRegisterAccessesTime = 12,
> >> + VpIoInstructionsCount = 13,
> >> + VpIoInstructionsTime = 14,
> >> + VpHltInstructionsCount = 15,
> >> + VpHltInstructionsTime = 16,
> >> + VpMwaitInstructionsCount = 17,
> >> + VpMwaitInstructionsTime = 18,
> >> + VpCpuidInstructionsCount = 19,
> >> + VpCpuidInstructionsTime = 20,
> >> + VpMsrAccessesCount = 21,
> >> + VpMsrAccessesTime = 22,
> >> + VpOtherInterceptsCount = 23,
> >> + VpOtherInterceptsTime = 24,
> >> + VpExternalInterruptsCount = 25,
> >> + VpExternalInterruptsTime = 26,
> >> + VpPendingInterruptsCount = 27,
> >> + VpPendingInterruptsTime = 28,
> >> + VpEmulatedInstructionsCount = 29,
> >> + VpEmulatedInstructionsTime = 30,
> >> + VpDebugRegisterAccessesCount = 31,
> >> + VpDebugRegisterAccessesTime = 32,
> >> + VpPageFaultInterceptsCount = 33,
> >> + VpPageFaultInterceptsTime = 34,
> >> + VpGuestPageTableMaps = 35,
> >> + VpLargePageTlbFills = 36,
> >> + VpSmallPageTlbFills = 37,
> >> + VpReflectedGuestPageFaults = 38,
> >> + VpApicMmioAccesses = 39,
> >> + VpIoInterceptMessages = 40,
> >> + VpMemoryInterceptMessages = 41,
> >> + VpApicEoiAccesses = 42,
> >> + VpOtherMessages = 43,
> >> + VpPageTableAllocations = 44,
> >> + VpLogicalProcessorMigrations = 45,
> >> + VpAddressSpaceEvictions = 46,
> >> + VpAddressSpaceSwitches = 47,
> >> + VpAddressDomainFlushes = 48,
> >> + VpAddressSpaceFlushes = 49,
> >> + VpGlobalGvaRangeFlushes = 50,
> >> + VpLocalGvaRangeFlushes = 51,
> >> + VpPageTableEvictions = 52,
> >> + VpPageTableReclamations = 53,
> >> + VpPageTableResets = 54,
> >> + VpPageTableValidations = 55,
> >> + VpApicTprAccesses = 56,
> >> + VpPageTableWriteIntercepts = 57,
> >> + VpSyntheticInterrupts = 58,
> >> + VpVirtualInterrupts = 59,
> >> + VpApicIpisSent = 60,
> >> + VpApicSelfIpisSent = 61,
> >> + VpGpaSpaceHypercalls = 62,
> >> + VpLogicalProcessorHypercalls = 63,
> >> + VpLongSpinWaitHypercalls = 64,
> >> + VpOtherHypercalls = 65,
> >> + VpSyntheticInterruptHypercalls = 66,
> >> + VpVirtualInterruptHypercalls = 67,
> >> + VpVirtualMmuHypercalls = 68,
> >> + VpVirtualProcessorHypercalls = 69,
> >> + VpHardwareInterrupts = 70,
> >> + VpNestedPageFaultInterceptsCount = 71,
> >> + VpNestedPageFaultInterceptsTime = 72,
> >> + VpPageScans = 73,
> >> + VpLogicalProcessorDispatches = 74,
> >> + VpWaitingForCpuTime = 75,
> >> + VpExtendedHypercalls = 76,
> >> + VpExtendedHypercallInterceptMessages = 77,
> >> + VpMbecNestedPageTableSwitches = 78,
> >> + VpOtherReflectedGuestExceptions = 79,
> >> + VpGlobalIoTlbFlushes = 80,
> >> + VpGlobalIoTlbFlushCost = 81,
> >> + VpLocalIoTlbFlushes = 82,
> >> + VpLocalIoTlbFlushCost = 83,
> >> + VpHypercallsForwardedCount = 84,
> >> + VpHypercallsForwardingTime = 85,
> >> + VpPageInvalidationsForwardedCount = 86,
> >> + VpPageInvalidationsForwardingTime = 87,
> >> + VpControlRegisterAccessesForwardedCount = 88,
> >> + VpControlRegisterAccessesForwardingTime = 89,
> >> + VpIoInstructionsForwardedCount = 90,
> >> + VpIoInstructionsForwardingTime = 91,
> >> + VpHltInstructionsForwardedCount = 92,
> >> + VpHltInstructionsForwardingTime = 93,
> >> + VpMwaitInstructionsForwardedCount = 94,
> >> + VpMwaitInstructionsForwardingTime = 95,
> >> + VpCpuidInstructionsForwardedCount = 96,
> >> + VpCpuidInstructionsForwardingTime = 97,
> >> + VpMsrAccessesForwardedCount = 98,
> >> + VpMsrAccessesForwardingTime = 99,
> >> + VpOtherInterceptsForwardedCount = 100,
> >> + VpOtherInterceptsForwardingTime = 101,
> >> + VpExternalInterruptsForwardedCount = 102,
> >> + VpExternalInterruptsForwardingTime = 103,
> >> + VpPendingInterruptsForwardedCount = 104,
> >> + VpPendingInterruptsForwardingTime = 105,
> >> + VpEmulatedInstructionsForwardedCount = 106,
> >> + VpEmulatedInstructionsForwardingTime = 107,
> >> + VpDebugRegisterAccessesForwardedCount = 108,
> >> + VpDebugRegisterAccessesForwardingTime = 109,
> >> + VpPageFaultInterceptsForwardedCount = 110,
> >> + VpPageFaultInterceptsForwardingTime = 111,
> >> + VpVmclearEmulationCount = 112,
> >> + VpVmclearEmulationTime = 113,
> >> + VpVmptrldEmulationCount = 114,
> >> + VpVmptrldEmulationTime = 115,
> >> + VpVmptrstEmulationCount = 116,
> >> + VpVmptrstEmulationTime = 117,
> >> + VpVmreadEmulationCount = 118,
> >> + VpVmreadEmulationTime = 119,
> >> + VpVmwriteEmulationCount = 120,
> >> + VpVmwriteEmulationTime = 121,
> >> + VpVmxoffEmulationCount = 122,
> >> + VpVmxoffEmulationTime = 123,
> >> + VpVmxonEmulationCount = 124,
> >> + VpVmxonEmulationTime = 125,
> >> + VpNestedVMEntriesCount = 126,
> >> + VpNestedVMEntriesTime = 127,
> >> + VpNestedSLATSoftPageFaultsCount = 128,
> >> + VpNestedSLATSoftPageFaultsTime = 129,
> >> + VpNestedSLATHardPageFaultsCount = 130,
> >> + VpNestedSLATHardPageFaultsTime = 131,
> >> + VpInvEptAllContextEmulationCount = 132,
> >> + VpInvEptAllContextEmulationTime = 133,
> >> + VpInvEptSingleContextEmulationCount = 134,
> >> + VpInvEptSingleContextEmulationTime = 135,
> >> + VpInvVpidAllContextEmulationCount = 136,
> >> + VpInvVpidAllContextEmulationTime = 137,
> >> + VpInvVpidSingleContextEmulationCount = 138,
> >> + VpInvVpidSingleContextEmulationTime = 139,
> >> + VpInvVpidSingleAddressEmulationCount = 140,
> >> + VpInvVpidSingleAddressEmulationTime = 141,
> >> + VpNestedTlbPageTableReclamations = 142,
> >> + VpNestedTlbPageTableEvictions = 143,
> >> + VpFlushGuestPhysicalAddressSpaceHypercalls = 144,
> >> + VpFlushGuestPhysicalAddressListHypercalls = 145,
> >> + VpPostedInterruptNotifications = 146,
> >> + VpPostedInterruptScans = 147,
> >> + VpTotalCoreRunTime = 148,
> >> + VpMaximumRunTime = 149,
> >> + VpHwpRequestContextSwitches = 150,
> >> + VpWaitingForCpuTimeBucket0 = 151,
> >> + VpWaitingForCpuTimeBucket1 = 152,
> >> + VpWaitingForCpuTimeBucket2 = 153,
> >> + VpWaitingForCpuTimeBucket3 = 154,
> >> + VpWaitingForCpuTimeBucket4 = 155,
> >> + VpWaitingForCpuTimeBucket5 = 156,
> >> + VpWaitingForCpuTimeBucket6 = 157,
> >> + VpVmloadEmulationCount = 158,
> >> + VpVmloadEmulationTime = 159,
> >> + VpVmsaveEmulationCount = 160,
> >> + VpVmsaveEmulationTime = 161,
> >> + VpGifInstructionEmulationCount = 162,
> >> + VpGifInstructionEmulationTime = 163,
> >> + VpEmulatedErrataSvmInstructions = 164,
> >> + VpPlaceholder1 = 165,
> >> + VpPlaceholder2 = 166,
> >> + VpPlaceholder3 = 167,
> >> + VpPlaceholder4 = 168,
> >> + VpPlaceholder5 = 169,
> >> + VpPlaceholder6 = 170,
> >> + VpPlaceholder7 = 171,
> >> + VpPlaceholder8 = 172,
> >> + VpPlaceholder9 = 173,
> >> + VpPlaceholder10 = 174,
> >> + VpSchedulingPriority = 175,
> >> + VpRdpmcInstructionsCount = 176,
> >> + VpRdpmcInstructionsTime = 177,
> >> + VpPerfmonPmuMsrAccessesCount = 178,
> >> + VpPerfmonLbrMsrAccessesCount = 179,
> >> + VpPerfmonIptMsrAccessesCount = 180,
> >> + VpPerfmonInterruptCount = 181,
> >> + VpVtl1DispatchCount = 182,
> >> + VpVtl2DispatchCount = 183,
> >> + VpVtl2DispatchBucket0 = 184,
> >> + VpVtl2DispatchBucket1 = 185,
> >> + VpVtl2DispatchBucket2 = 186,
> >> + VpVtl2DispatchBucket3 = 187,
> >> + VpVtl2DispatchBucket4 = 188,
> >> + VpVtl2DispatchBucket5 = 189,
> >> + VpVtl2DispatchBucket6 = 190,
> >> + VpVtl1RunTime = 191,
> >> + VpVtl2RunTime = 192,
> >> + VpIommuHypercalls = 193,
> >> + VpCpuGroupHypercalls = 194,
> >> + VpVsmHypercalls = 195,
> >> + VpEventLogHypercalls = 196,
> >> + VpDeviceDomainHypercalls = 197,
> >> + VpDepositHypercalls = 198,
> >> + VpSvmHypercalls = 199,
> >> + VpBusLockAcquisitionCount = 200,
> >> + VpUnused = 201,
> >> + VpRootDispatchThreadBlocked = 202,
> >> +#elif IS_ENABLED(CONFIG_ARM64)
> >> + VpSysRegAccessesCount = 9,
> >> + VpSysRegAccessesTime = 10,
> >> + VpSmcInstructionsCount = 11,
> >> + VpSmcInstructionsTime = 12,
> >> + VpOtherInterceptsCount = 13,
> >> + VpOtherInterceptsTime = 14,
> >> + VpExternalInterruptsCount = 15,
> >> + VpExternalInterruptsTime = 16,
> >> + VpPendingInterruptsCount = 17,
> >> + VpPendingInterruptsTime = 18,
> >> + VpGuestPageTableMaps = 19,
> >> + VpLargePageTlbFills = 20,
> >> + VpSmallPageTlbFills = 21,
> >> + VpReflectedGuestPageFaults = 22,
> >> + VpMemoryInterceptMessages = 23,
> >> + VpOtherMessages = 24,
> >> + VpLogicalProcessorMigrations = 25,
> >> + VpAddressDomainFlushes = 26,
> >> + VpAddressSpaceFlushes = 27,
> >> + VpSyntheticInterrupts = 28,
> >> + VpVirtualInterrupts = 29,
> >> + VpApicSelfIpisSent = 30,
> >> + VpGpaSpaceHypercalls = 31,
> >> + VpLogicalProcessorHypercalls = 32,
> >> + VpLongSpinWaitHypercalls = 33,
> >> + VpOtherHypercalls = 34,
> >> + VpSyntheticInterruptHypercalls = 35,
> >> + VpVirtualInterruptHypercalls = 36,
> >> + VpVirtualMmuHypercalls = 37,
> >> + VpVirtualProcessorHypercalls = 38,
> >> + VpHardwareInterrupts = 39,
> >> + VpNestedPageFaultInterceptsCount = 40,
> >> + VpNestedPageFaultInterceptsTime = 41,
> >> + VpLogicalProcessorDispatches = 42,
> >> + VpWaitingForCpuTime = 43,
> >> + VpExtendedHypercalls = 44,
> >> + VpExtendedHypercallInterceptMessages = 45,
> >> + VpMbecNestedPageTableSwitches = 46,
> >> + VpOtherReflectedGuestExceptions = 47,
> >> + VpGlobalIoTlbFlushes = 48,
> >> + VpGlobalIoTlbFlushCost = 49,
> >> + VpLocalIoTlbFlushes = 50,
> >> + VpLocalIoTlbFlushCost = 51,
> >> + VpFlushGuestPhysicalAddressSpaceHypercalls = 52,
> >> + VpFlushGuestPhysicalAddressListHypercalls = 53,
> >> + VpPostedInterruptNotifications = 54,
> >> + VpPostedInterruptScans = 55,
> >> + VpTotalCoreRunTime = 56,
> >> + VpMaximumRunTime = 57,
> >> + VpWaitingForCpuTimeBucket0 = 58,
> >> + VpWaitingForCpuTimeBucket1 = 59,
> >> + VpWaitingForCpuTimeBucket2 = 60,
> >> + VpWaitingForCpuTimeBucket3 = 61,
> >> + VpWaitingForCpuTimeBucket4 = 62,
> >> + VpWaitingForCpuTimeBucket5 = 63,
> >> + VpWaitingForCpuTimeBucket6 = 64,
> >> + VpHwpRequestContextSwitches = 65,
> >> + VpPlaceholder2 = 66,
> >> + VpPlaceholder3 = 67,
> >> + VpPlaceholder4 = 68,
> >> + VpPlaceholder5 = 69,
> >> + VpPlaceholder6 = 70,
> >> + VpPlaceholder7 = 71,
> >> + VpPlaceholder8 = 72,
> >> + VpContentionTime = 73,
> >> + VpWakeUpTime = 74,
> >> + VpSchedulingPriority = 75,
> >> + VpVtl1DispatchCount = 76,
> >> + VpVtl2DispatchCount = 77,
> >> + VpVtl2DispatchBucket0 = 78,
> >> + VpVtl2DispatchBucket1 = 79,
> >> + VpVtl2DispatchBucket2 = 80,
> >> + VpVtl2DispatchBucket3 = 81,
> >> + VpVtl2DispatchBucket4 = 82,
> >> + VpVtl2DispatchBucket5 = 83,
> >> + VpVtl2DispatchBucket6 = 84,
> >> + VpVtl1RunTime = 85,
> >> + VpVtl2RunTime = 86,
> >> + VpIommuHypercalls = 87,
> >> + VpCpuGroupHypercalls = 88,
> >> + VpVsmHypercalls = 89,
> >> + VpEventLogHypercalls = 90,
> >> + VpDeviceDomainHypercalls = 91,
> >> + VpDepositHypercalls = 92,
> >> + VpSvmHypercalls = 93,
> >> + VpLoadAvg = 94,
> >> + VpRootDispatchThreadBlocked = 95,
> >
> > In current code, VpRootDispatchThreadBlocked on ARM64 is 94. Is that an
> > error that is being corrected by this patch?
> >
>
> Hmm, I didn't realize this changed - 95 is the correct value. However,
> the mshv driver does not yet support on ARM64, so this fix doesn't
> have any impact right now. Do you suggest a separate patch to fix it?
I don’t see a need for a separate patch. Maybe just put a short note
about the change in the commit message for this patch, so that it is
recorded as intentional and not a random typo. Long lists like these
have been known to get fat-fingered from time-to-time. :-)
>
> >> +#endif
> >> + VpStatsMaxCounter
> >> +};
> >> +
> >> +enum hv_stats_lp_counters { /* HV_CPU_COUNTER */
> >> + LpGlobalTime = 1,
> >> + LpTotalRunTime = 2,
> >> + LpHypervisorRunTime = 3,
> >> + LpHardwareInterrupts = 4,
> >> + LpContextSwitches = 5,
> >> + LpInterProcessorInterrupts = 6,
> >> + LpSchedulerInterrupts = 7,
> >> + LpTimerInterrupts = 8,
> >> + LpInterProcessorInterruptsSent = 9,
> >> + LpProcessorHalts = 10,
> >> + LpMonitorTransitionCost = 11,
> >> + LpContextSwitchTime = 12,
> >> + LpC1TransitionsCount = 13,
> >> + LpC1RunTime = 14,
> >> + LpC2TransitionsCount = 15,
> >> + LpC2RunTime = 16,
> >> + LpC3TransitionsCount = 17,
> >> + LpC3RunTime = 18,
> >> + LpRootVpIndex = 19,
> >> + LpIdleSequenceNumber = 20,
> >> + LpGlobalTscCount = 21,
> >> + LpActiveTscCount = 22,
> >> + LpIdleAccumulation = 23,
> >> + LpReferenceCycleCount0 = 24,
> >> + LpActualCycleCount0 = 25,
> >> + LpReferenceCycleCount1 = 26,
> >> + LpActualCycleCount1 = 27,
> >> + LpProximityDomainId = 28,
> >> + LpPostedInterruptNotifications = 29,
> >> + LpBranchPredictorFlushes = 30,
> >> +#if IS_ENABLED(CONFIG_X86_64)
> >> + LpL1DataCacheFlushes = 31,
> >> + LpImmediateL1DataCacheFlushes = 32,
> >> + LpMbFlushes = 33,
> >> + LpCounterRefreshSequenceNumber = 34,
> >> + LpCounterRefreshReferenceTime = 35,
> >> + LpIdleAccumulationSnapshot = 36,
> >> + LpActiveTscCountSnapshot = 37,
> >> + LpHwpRequestContextSwitches = 38,
> >> + LpPlaceholder1 = 39,
> >> + LpPlaceholder2 = 40,
> >> + LpPlaceholder3 = 41,
> >> + LpPlaceholder4 = 42,
> >> + LpPlaceholder5 = 43,
> >> + LpPlaceholder6 = 44,
> >> + LpPlaceholder7 = 45,
> >> + LpPlaceholder8 = 46,
> >> + LpPlaceholder9 = 47,
> >> + LpPlaceholder10 = 48,
> >> + LpReserveGroupId = 49,
> >> + LpRunningPriority = 50,
> >> + LpPerfmonInterruptCount = 51,
> >> +#elif IS_ENABLED(CONFIG_ARM64)
> >> + LpCounterRefreshSequenceNumber = 31,
> >> + LpCounterRefreshReferenceTime = 32,
> >> + LpIdleAccumulationSnapshot = 33,
> >> + LpActiveTscCountSnapshot = 34,
> >> + LpHwpRequestContextSwitches = 35,
> >> + LpPlaceholder2 = 36,
> >> + LpPlaceholder3 = 37,
> >> + LpPlaceholder4 = 38,
> >> + LpPlaceholder5 = 39,
> >> + LpPlaceholder6 = 40,
> >> + LpPlaceholder7 = 41,
> >> + LpPlaceholder8 = 42,
> >> + LpPlaceholder9 = 43,
> >> + LpSchLocalRunListSize = 44,
> >> + LpReserveGroupId = 45,
> >> + LpRunningPriority = 46,
> >> +#endif
> >> + LpStatsMaxCounter
> >> +};
> >> +
> >> +/*
> >> + * Hypervisor statsitics page format
> >
> > s/statsitics/statistics/
> >
> Ack, thanks
>
> >> + */
> >> +struct hv_stats_page {
> >> + union {
> >> + u64 hv_cntrs[HvStatsMaxCounter]; /* Hypervisor counters */
> >> + u64 pt_cntrs[PartitionStatsMaxCounter]; /* Partition counters */
> >> + u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
> >> + u64 lp_cntrs[LpStatsMaxCounter]; /* LP counters */
> >> + u8 data[HV_HYP_PAGE_SIZE];
> >> + };
> >> +} __packed;
> >> +
> >> /* Bits for dirty mask of hv_vp_register_page */
> >> #define HV_X64_REGISTER_CLASS_GENERAL 0
> >> #define HV_X64_REGISTER_CLASS_IP 1
> >> --
> >> 2.34.1
^ permalink raw reply
* RE: [PATCH v2 1/3] mshv: Ignore second stats page map result failure
From: Michael Kelley @ 2026-01-02 16:27 UTC (permalink / raw)
To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, skinsburskii@linux.microsoft.com
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
prapal@linux.microsoft.com, mrathor@linux.microsoft.com,
paekkaladevi@linux.microsoft.com
In-Reply-To: <9a997f03-f1be-411b-b4b2-c28069b2a3ce@linux.microsoft.com>
From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Monday, December 29, 2025 4:28 PM
>
> On 12/8/2025 7:12 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, December 5, 2025 10:59 AM
> >>
> >> From: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
> >>
> >> Older versions of the hypervisor do not support HV_STATS_AREA_PARENT
> >> and return HV_STATUS_INVALID_PARAMETER for the second stats page
> >> mapping request.
> >>
> >> This results a failure in module init. Instead of failing, gracefully
> >> fall back to populating stats_pages[HV_STATS_AREA_PARENT] with the
> >> already-mapped stats_pages[HV_STATS_AREA_SELF].
> >
> > This explains "what" this patch does. But could you add an explanation of "why"
> > substituting SELF for the unavailable PARENT is the right thing to do? As a somewhat
> > outside reviewer, I don't know enough about SELF vs. PARENT to immediately know
> > why this substitution makes sense.
> >
> I'll attempt to explain. I'm a little hindered by the fact that like many of the
> root interfaces this is not well-documented, but this is my understanding:
>
> The stats areas HV_STATS_AREA_SELF and HV_STATS_AREA_PARENT indicate the
> privilege level of the data in the mapped stats page.
OK. But evidently that difference in "privilege level" (whatever that means) doesn't
affect what the root partition can do to read and display the data in debugfs, right?
>
> Both SELF and PARENT contain the same fields, but some fields that are 0 in the
> SELF page may be nonzero in PARENT page, and vice-versa. So, to read all the fields
> we need to map both pages if possible, and prioritize reading non-zero data from
> each field, by checking both the SELF and PARENT pages.
Overall, this mostly makes sense. Each VP and each partition has associated SELF and
PARENT stats pages. For the SELF page, the stats are presumably for the single
associated VP or partition. But "PARENT" terminology usually implies some kind of
hierarchy, as in a parent has one or more children. Parent-level stats would typically
be an aggregate of all its children's stats. But if that's the case here, choosing at runtime
on a per-field stat basis between SELF and PARENT would produce weird results. So
maybe that typical model of "parent" isn't correct here. If SELF and PARENT are only
indicating some kind of privilege level, maybe the PARENT page for each VP and each
partition is like the SELF page -- it contains stats only for the associated VP or partition.
>
> I don't know if it's possible for a given field to have a different (nonzero) value
> in both SELF and PARENT pages. I imagine in that case we'd want to prioritize the
> PARENT value, but it may simply not be possible.
It would be nice to confirm that this can't happen. If it can happen, that messes
up trying to construct a sensible model of how this all works. :-)
And a somewhat related question: Assuming that a particular stat appears in
either the SELF or the PARENT page, under what circumstances might that stat
move from one to the other, if ever? I would guess that for a given version of the
hypervisor, the split is always the same, across all VPs and all partitions running
on hypervisors of that version. But a different hypervisor version might split the
stats differently between SELF and PARENT. Of course, this stuff is overall a bit
unusual, so my guess might not be right.
I ask because making a runtime decision between SELF and PARENT for every
individual stat, every time it is read, is conceptually a lot of wasted motion if the
split is static and know-able ahead of time. But I say "conceptually" because I
can't immediately come up with a way to make things faster or more compact if
the split were to be static and know-able ahead of time. So it may be moot point
from an implementation standpoint, but I'm still interested in the answer from
the standpoint of being able to document the overall model of how this works.
>
> The API is designed in this way to be backward-compatible with older hypervisors
> that didn't have a concept of SELF and PARENT. Hence on older hypervisors (detectable
> via the error code), all we can do is map SELF and use it for everything.
In cases where PARENT can't be mapped by the root partition, does that mean
some of the stats just aren't available? Or does the hypervisor provide all the
stats in the SELF page?
>
> > Also, does this patch affect the logic in mshv_vp_dispatch_thread_blocked() where
> > a zero value for the SELF version of VpRootDispatchThreadBlocked is replaced by
> > the PARENT value? But that logic seems to be in the reverse direction -- replacing
> > a missing SELF value with the PARENT value -- whereas this patch is about replacing
> > missing PARENT values with SELF values. So are there two separate PARENT vs. SELF
> > issues overall? And after this patch is in place and PARENT values are replaced with
> > SELF on older hypervisor versions, the logic in mshv_vp_dispatch_thread_blocked()
> > then effectively becomes a no-op if the SELF value is zero, and the return value will
> > be zero. Is that problem?
> >
> This is the same issue, because we only care about any nonzero value in
> mshv_vp_dispatch_thread_blocked(). It doesn't matter which page we check first in that
> code, just that any nonzero value is returned as a boolean to indicate a blocked state.
>
> The code in question could be rewritten:
>
> return self_vp_cntrs[VpRootDispatchThreadBlocked] ||
> parent_vp_cntrs[VpRootDispatchThreadBlocked];
OK. It would be more consistent to apply the same logic (check SELF then PARENT,
or vice versa) in both mshv_vp_dispatch_thread_blocked() and in this new debugfs
code. As you know, for me inconsistencies always beg the question of "why"? :-)
But that's a minor point.
>
> >>
> >> Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> >> ---
> >> drivers/hv/mshv_root_hv_call.c | 41 ++++++++++++++++++++++++++++++----
> >> drivers/hv/mshv_root_main.c | 3 +++
> >> 2 files changed, 40 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> >> index 598eaff4ff29..b1770c7b500c 100644
> >> --- a/drivers/hv/mshv_root_hv_call.c
> >> +++ b/drivers/hv/mshv_root_hv_call.c
> >> @@ -855,6 +855,24 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
> >> return ret;
> >> }
> >>
> >> +static int
> >> +hv_stats_get_area_type(enum hv_stats_object_type type,
> >> + const union hv_stats_object_identity *identity)
> >> +{
> >> + switch (type) {
> >> + case HV_STATS_OBJECT_HYPERVISOR:
> >> + return identity->hv.stats_area_type;
> >> + case HV_STATS_OBJECT_LOGICAL_PROCESSOR:
> >> + return identity->lp.stats_area_type;
> >> + case HV_STATS_OBJECT_PARTITION:
> >> + return identity->partition.stats_area_type;
> >> + case HV_STATS_OBJECT_VP:
> >> + return identity->vp.stats_area_type;
> >> + }
> >> +
> >> + return -EINVAL;
> >> +}
> >> +
> >> static int hv_call_map_stats_page(enum hv_stats_object_type type,
> >> const union hv_stats_object_identity *identity,
> >> void **addr)
> >> @@ -863,7 +881,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
> >> struct hv_input_map_stats_page *input;
> >> struct hv_output_map_stats_page *output;
> >> u64 status, pfn;
> >> - int ret = 0;
> >> + int hv_status, ret = 0;
> >>
> >> do {
> >> local_irq_save(flags);
> >> @@ -878,11 +896,26 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
> >> pfn = output->map_location;
> >>
> >> local_irq_restore(flags);
> >> - if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
> >> - ret = hv_result_to_errno(status);
> >> +
> >> + hv_status = hv_result(status);
> >> + if (hv_status != HV_STATUS_INSUFFICIENT_MEMORY) {
> >> if (hv_result_success(status))
> >> break;
> >> - return ret;
> >> +
> >> + /*
> >> + * Older versions of the hypervisor do not support the
> >> + * PARENT stats area. In this case return "success" but
> >> + * set the page to NULL. The caller should check for
> >> + * this case and instead just use the SELF area.
> >> + */
> >> + if (hv_stats_get_area_type(type, identity) == HV_STATS_AREA_PARENT &&
> >> + hv_status == HV_STATUS_INVALID_PARAMETER) {
> >> + *addr = NULL;
> >> + return 0;
> >> + }
> >> +
> >> + hv_status_debug(status, "\n");
> >> + return hv_result_to_errno(status);
> >
> > Does the hv_call_map_stats_page2() function need a similar fix? Or is there a linkage
> > in hypervisor functionality where any hypervisor version that supports an overlay GPFN
> > also supports the PARENT stats? If such a linkage is why hv_call_map_stats_page2()
> > doesn't need a similar fix, please add a code comment to that effect in
> > hv_call_map_stats_page2().
> >
> Exactly; hv_call_map_stats_page2() is only available on hypervisors where the PARENT
> page is also available. I'll add a comment.
Thanks.
>
> >> }
> >>
> >> ret = hv_call_deposit_pages(NUMA_NO_NODE,
> >> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> >> index bc15d6f6922f..f59a4ab47685 100644
> >> --- a/drivers/hv/mshv_root_main.c
> >> +++ b/drivers/hv/mshv_root_main.c
> >> @@ -905,6 +905,9 @@ static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
> >> if (err)
> >> goto unmap_self;
> >>
> >> + if (!stats_pages[HV_STATS_AREA_PARENT])
> >> + stats_pages[HV_STATS_AREA_PARENT] = stats_pages[HV_STATS_AREA_SELF];
> >> +
> >> return 0;
> >>
> >> unmap_self:
> >> --
> >> 2.34.1
^ permalink raw reply
* RE: [RFC][PATCH v0] x86/hyperv: Reserve 3 interrupt vectors used exclusively by mshv
From: Michael Kelley @ 2026-01-02 16:02 UTC (permalink / raw)
To: Vitaly Kuznetsov, Mukesh Rathor, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com, tglx@linutronix.de,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, hpa@zytor.com
In-Reply-To: <877bu0au1t.fsf@redhat.com>
From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Friday, January 2, 2026 7:55 AM
>
> Mukesh Rathor <mrathor@linux.microsoft.com> writes:
>
> > MSVC compiler used to compile the Microsoft Hyper-V hypervisor currently,
> > has an assert intrinsic that uses interrupt vector 0x29 to create an
> > exception. This will cause hypervisor to then crash and collect core. As
> > such, if this interrupt number is assigned to a device by linux and the
> > device generates it, hypervisor will crash. There are two other such
> > vectors hard coded in the hypervisor, 0x2C and 0x2D.
> >
> > Fortunately, the three vectors are part of the kernel driver space, and
> > that makes it feasible to reserve them early so they are not assigned
> > later.
> >
> > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > ---
> > arch/x86/kernel/cpu/mshyperv.c | 22 ++++++++++++++++++++++
> > 1 file changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > index 579fb2c64cfd..19d41f7434df 100644
> > --- a/arch/x86/kernel/cpu/mshyperv.c
> > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > @@ -478,6 +478,25 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
> > }
> > EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
> >
> > +/*
> > + * Reserve vectors hard coded in the hypervisor. If used outside, the hypervisor
> > + * will crash or hang or break into debugger.
> > + */
> > +static void hv_reserve_irq_vectors(void)
> > +{
> > + #define HYPERV_DBG_FASTFAIL_VECTOR 0x29
> > + #define HYPERV_DBG_ASSERT_VECTOR 0x2C
> > + #define HYPERV_DBG_SERVICE_VECTOR 0x2D
> > +
> > + if (test_and_set_bit(HYPERV_DBG_ASSERT_VECTOR, system_vectors) ||
> > + test_and_set_bit(HYPERV_DBG_SERVICE_VECTOR, system_vectors) ||
> > + test_and_set_bit(HYPERV_DBG_FASTFAIL_VECTOR, system_vectors))
> > + BUG();
>
> Would it be less hackish to use sysvec_install() with a dummy handler
> for all three vectors instead?
It would be, but unfortunately, it doesn't work. sysvec_install() requires
that the vector be >= FIRST_SYSTEM_VECTOR, and these vectors are not.
Michael
>
> > +
> > + pr_info("Hyper-V:reserve vectors: %d %d %d\n", HYPERV_DBG_ASSERT_VECTOR,
> > + HYPERV_DBG_SERVICE_VECTOR, HYPERV_DBG_FASTFAIL_VECTOR);
> > +}
> > +
> > static void __init ms_hyperv_init_platform(void)
> > {
> > int hv_max_functions_eax, eax;
> > @@ -510,6 +529,9 @@ static void __init ms_hyperv_init_platform(void)
> >
> > hv_identify_partition_type();
> >
> > + if (hv_root_partition())
> > + hv_reserve_irq_vectors();
> > +
> > if (cc_platform_has(CC_ATTR_SNP_SECURE_AVIC))
> > ms_hyperv.hints |= HV_DEPRECATING_AEOI_RECOMMENDED;
>
> --
> Vitaly
>
^ permalink raw reply
* Re: [RFC][PATCH v0] x86/hyperv: Reserve 3 interrupt vectors used exclusively by mshv
From: Vitaly Kuznetsov @ 2026-01-02 15:55 UTC (permalink / raw)
To: Mukesh Rathor, linux-hyperv, linux-kernel
Cc: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
dave.hansen, x86, hpa
In-Reply-To: <20251231012100.681060-1-mrathor@linux.microsoft.com>
Mukesh Rathor <mrathor@linux.microsoft.com> writes:
> MSVC compiler used to compile the Microsoft Hyper-V hypervisor currently,
> has an assert intrinsic that uses interrupt vector 0x29 to create an
> exception. This will cause hypervisor to then crash and collect core. As
> such, if this interrupt number is assigned to a device by linux and the
> device generates it, hypervisor will crash. There are two other such
> vectors hard coded in the hypervisor, 0x2C and 0x2D.
>
> Fortunately, the three vectors are part of the kernel driver space, and
> that makes it feasible to reserve them early so they are not assigned
> later.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> arch/x86/kernel/cpu/mshyperv.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 579fb2c64cfd..19d41f7434df 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -478,6 +478,25 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
> }
> EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>
> +/*
> + * Reserve vectors hard coded in the hypervisor. If used outside, the hypervisor
> + * will crash or hang or break into debugger.
> + */
> +static void hv_reserve_irq_vectors(void)
> +{
> + #define HYPERV_DBG_FASTFAIL_VECTOR 0x29
> + #define HYPERV_DBG_ASSERT_VECTOR 0x2C
> + #define HYPERV_DBG_SERVICE_VECTOR 0x2D
> +
> + if (test_and_set_bit(HYPERV_DBG_ASSERT_VECTOR, system_vectors) ||
> + test_and_set_bit(HYPERV_DBG_SERVICE_VECTOR, system_vectors) ||
> + test_and_set_bit(HYPERV_DBG_FASTFAIL_VECTOR, system_vectors))
> + BUG();
Would it be less hackish to use sysvec_install() with a dummy handler
for all three vectors instead?
> +
> + pr_info("Hyper-V:reserve vectors: %d %d %d\n", HYPERV_DBG_ASSERT_VECTOR,
> + HYPERV_DBG_SERVICE_VECTOR, HYPERV_DBG_FASTFAIL_VECTOR);
> +}
> +
> static void __init ms_hyperv_init_platform(void)
> {
> int hv_max_functions_eax, eax;
> @@ -510,6 +529,9 @@ static void __init ms_hyperv_init_platform(void)
>
> hv_identify_partition_type();
>
> + if (hv_root_partition())
> + hv_reserve_irq_vectors();
> +
> if (cc_platform_has(CC_ATTR_SNP_SECURE_AVIC))
> ms_hyperv.hints |= HV_DEPRECATING_AEOI_RECOMMENDED;
--
Vitaly
^ permalink raw reply
* Re: [PATCH v2 2/8] KVM: SVM: Open code handling of unexpected exits in svm_invoke_exit_handler()
From: Gupta, Pankaj @ 2026-01-02 11:41 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-3-seanjc@google.com>
> Fold svm_check_exit_valid() and svm_handle_invalid_exit() into their sole
> caller, svm_invoke_exit_handler(), as having tiny single-use helpers makes
> the code unncessarily difficult to follow. This will also allow for
> additional cleanups in svm_invoke_exit_handler().
>
> No functional change intended.
>
> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
> ---
> arch/x86/kvm/svm/svm.c | 25 ++++++++++---------------
> 1 file changed, 10 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index c2ddf2e0aa1a..a523011f0923 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3443,23 +3443,13 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
> sev_free_decrypted_vmsa(vcpu, save);
> }
>
> -static bool svm_check_exit_valid(u64 exit_code)
> -{
> - return (exit_code < ARRAY_SIZE(svm_exit_handlers) &&
> - svm_exit_handlers[exit_code]);
> -}
> -
> -static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code)
> -{
> - dump_vmcb(vcpu);
> - kvm_prepare_unexpected_reason_exit(vcpu, exit_code);
> - return 0;
> -}
> -
> int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
> {
> - if (!svm_check_exit_valid(exit_code))
> - return svm_handle_invalid_exit(vcpu, exit_code);
> + if (exit_code >= ARRAY_SIZE(svm_exit_handlers))
> + goto unexpected_vmexit;
> +
> + if (!svm_exit_handlers[exit_code])
> + goto unexpected_vmexit;
>
> #ifdef CONFIG_MITIGATION_RETPOLINE
> if (exit_code == SVM_EXIT_MSR)
> @@ -3478,6 +3468,11 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
> #endif
> #endif
> return svm_exit_handlers[exit_code](vcpu);
> +
> +unexpected_vmexit:
> + dump_vmcb(vcpu);
> + kvm_prepare_unexpected_reason_exit(vcpu, exit_code);
> + return 0;
> }
>
> static void svm_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
^ permalink raw reply
* Re: [PATCH v2 8/8] KVM: SVM: Assert that Hyper-V's HV_SVM_EXITCODE_ENL == SVM_EXIT_SW
From: Vitaly Kuznetsov @ 2026-01-02 9:58 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-9-seanjc@google.com>
Sean Christopherson <seanjc@google.com> writes:
> Add a build-time assertiont that Hyper-V's "enlightened" exit code is that,
> same as the AMD-defined "Reserved for Host" exit code, mostly to help
> readers connect the dots and understand why synthesizing a software-defined
> exit code is safe/ok.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/svm/hyperv.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/arch/x86/kvm/svm/hyperv.c b/arch/x86/kvm/svm/hyperv.c
> index 3ec580d687f5..4f24dcb45116 100644
> --- a/arch/x86/kvm/svm/hyperv.c
> +++ b/arch/x86/kvm/svm/hyperv.c
> @@ -10,6 +10,12 @@ void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
>
> + /*
> + * The exit code used by Hyper-V for software-defined exits is reserved
> + * by AMD specifically for such use cases.
> + */
> + BUILD_BUG_ON(HV_SVM_EXITCODE_ENL != SVM_EXIT_SW);
> +
> svm->vmcb->control.exit_code = HV_SVM_EXITCODE_ENL;
> svm->vmcb->control.exit_info_1 = HV_SVM_ENL_EXITCODE_TRAP_AFTER_FLUSH;
> svm->vmcb->control.exit_info_2 = 0;
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Alternatively (or additionally?) to BUG_ON, I guess we could've
#define HV_SVM_EXITCODE_ENL SVM_EXIT_SW
unless including SVM's headers into include/hyperv/hvgdk.h is too big of
a mess.
--
Vitaly
^ permalink raw reply
* Re: [PATCH v2 1/1] Drivers: hv: Always do Hyper-V panic notification in hv_kmsg_dump()
From: vdso @ 2026-01-01 7:54 UTC (permalink / raw)
To: mhklinux
Cc: mhkelley58@gmail.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, kys@microsoft.com,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
dan.carpenter@linaro.org
In-Reply-To: <20251231201447.1399-1-mhklinux@outlook.com>
> On 12/31/2025 12:14 PM mhkelley58@gmail.com wrote:
>
>
> From: Michael Kelley <mhklinux@outlook.com>
>
> hv_kmsg_dump() currently skips the panic notification entirely if it
> doesn't get any message bytes to pass to Hyper-V due to an error from
> kmsg_dump_get_buffer(). Skipping the notification is undesirable because
> it leaves the Hyper-V host uncertain about the state of a panic'ed guest.
>
> Fix this by always doing the panic notification, even if bytes_written
> is zero. Also ensure that bytes_written is initialized, which fixes a
> kernel test robot warning. The warning is actually bogus because
> kmsg_dump_get_buffer() happens to set bytes_written even if it fails, and
> in the kernel test robot's CONFIG_PRINTK not set case, hv_kmsg_dump() is
> never called. But do the initialization for robustness and to quiet the
> static checker.
>
> Fixes: 9c318a1d9b50 ("Drivers: hv: move panic report code from vmbus to hv early init code")
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
> Closes: https://lore.kernel.org/all/202512172103.OcUspn1Z-lkp@intel.com/
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Roman Kisel <vdso@mailbox.org>
> ---
> Changes in v2:
> * Reworked patch to focus on always sending the panic message, with
> resolving the uninitialized variable report as a side effect. See
> discussion on v1 of the patch [1]
>
> [1] https://lore.kernel.org/linux-hyperv/20251219160832.1628-1-mhklinux@outlook.com/
>
> drivers/hv/hv_common.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 0a3ab7efed46..f1c17fb60dc1 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -195,13 +195,15 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
>
> /*
> * Write dump contents to the page. No need to synchronize; panic should
> - * be single-threaded.
> + * be single-threaded. Ignore failures from kmsg_dump_get_buffer() since
> + * panic notification should be done even if there is no message data.
> + * Don't assume bytes_written is set in case of failure, so initialize it.
> */
> kmsg_dump_rewind(&iter);
> - kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
> + bytes_written = 0;
> + (void)kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
> &bytes_written);
> - if (!bytes_written)
> - return;
> +
> /*
> * P3 to contain the physical address of the panic page & P4 to
> * contain the size of the panic data in that page. Rest of the
> @@ -210,7 +212,7 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
> hv_set_msr(HV_MSR_CRASH_P0, 0);
> hv_set_msr(HV_MSR_CRASH_P1, 0);
> hv_set_msr(HV_MSR_CRASH_P2, 0);
> - hv_set_msr(HV_MSR_CRASH_P3, virt_to_phys(hv_panic_page));
> + hv_set_msr(HV_MSR_CRASH_P3, bytes_written ? virt_to_phys(hv_panic_page) : 0);
> hv_set_msr(HV_MSR_CRASH_P4, bytes_written);
>
> /*
> --
> 2.25.1
^ permalink raw reply
* [PATCH v2 1/1] Drivers: hv: Always do Hyper-V panic notification in hv_kmsg_dump()
From: mhkelley58 @ 2025-12-31 20:14 UTC (permalink / raw)
To: haiyangz, wei.liu, decui, kys, linux-kernel, linux-hyperv,
dan.carpenter
From: Michael Kelley <mhklinux@outlook.com>
hv_kmsg_dump() currently skips the panic notification entirely if it
doesn't get any message bytes to pass to Hyper-V due to an error from
kmsg_dump_get_buffer(). Skipping the notification is undesirable because
it leaves the Hyper-V host uncertain about the state of a panic'ed guest.
Fix this by always doing the panic notification, even if bytes_written
is zero. Also ensure that bytes_written is initialized, which fixes a
kernel test robot warning. The warning is actually bogus because
kmsg_dump_get_buffer() happens to set bytes_written even if it fails, and
in the kernel test robot's CONFIG_PRINTK not set case, hv_kmsg_dump() is
never called. But do the initialization for robustness and to quiet the
static checker.
Fixes: 9c318a1d9b50 ("Drivers: hv: move panic report code from vmbus to hv early init code")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/202512172103.OcUspn1Z-lkp@intel.com/
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
Changes in v2:
* Reworked patch to focus on always sending the panic message, with
resolving the uninitialized variable report as a side effect. See
discussion on v1 of the patch [1]
[1] https://lore.kernel.org/linux-hyperv/20251219160832.1628-1-mhklinux@outlook.com/
drivers/hv/hv_common.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 0a3ab7efed46..f1c17fb60dc1 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -195,13 +195,15 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
/*
* Write dump contents to the page. No need to synchronize; panic should
- * be single-threaded.
+ * be single-threaded. Ignore failures from kmsg_dump_get_buffer() since
+ * panic notification should be done even if there is no message data.
+ * Don't assume bytes_written is set in case of failure, so initialize it.
*/
kmsg_dump_rewind(&iter);
- kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
+ bytes_written = 0;
+ (void)kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
&bytes_written);
- if (!bytes_written)
- return;
+
/*
* P3 to contain the physical address of the panic page & P4 to
* contain the size of the panic data in that page. Rest of the
@@ -210,7 +212,7 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
hv_set_msr(HV_MSR_CRASH_P0, 0);
hv_set_msr(HV_MSR_CRASH_P1, 0);
hv_set_msr(HV_MSR_CRASH_P2, 0);
- hv_set_msr(HV_MSR_CRASH_P3, virt_to_phys(hv_panic_page));
+ hv_set_msr(HV_MSR_CRASH_P3, bytes_written ? virt_to_phys(hv_panic_page) : 0);
hv_set_msr(HV_MSR_CRASH_P4, bytes_written);
/*
--
2.25.1
^ permalink raw reply related
* RE: [PATCH 1/1] Drivers: hv: Fix uninit'ed variable in hv_msg_dump() if CONFIG_PRINTK not set
From: Michael Kelley @ 2025-12-31 17:52 UTC (permalink / raw)
To: Michael Kelley, vdso@mailbox.org
Cc: haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
kys@microsoft.com, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org, dan.carpenter@linaro.org
In-Reply-To: <SN6PR02MB4157065BCB5064B3587F9E4AD4BCA@SN6PR02MB4157.namprd02.prod.outlook.com>
From: Michael Kelley <mhklinux@outlook.com> Sent: Monday, December 29, 2025 7:04 PM
>
> From: vdso@mailbox.org <vdso@mailbox.org> Sent: Sunday, December 28, 2025 8:12 PM
> >
> > > On 12/19/2025 8:08 AM mhkelley58@gmail.com wrote:
>
> [snip]
>
> > > @@ -198,9 +199,9 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
> > > * be single-threaded.
> > > */
> > > kmsg_dump_rewind(&iter);
> > > - kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
> > > - &bytes_written);
> > > - if (!bytes_written)
> > > + ret = kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
> > > + &bytes_written);
> > > + if (!ret || !bytes_written)
> > > return;
> > > /*
> > > * P3 to contain the physical address of the panic page & P4 to
> >
> > The existing code
> >
> > 1. doesn't care about the return value from kmsg_dump_get_buffer.
> > The return value wouldn't make the function return before, why does that
> > need to change?
>
> The existing code depends on the implementation of kmsg_dump_get_buffer()
> always setting bytes_written, even if it fails. That's atypical behavior, but it is
> what kmsg_dump_get_buffer() does -- except that if CONFIG_PRINTK=n, the
> stub kmsg_dump_get_buffer() does *not* do that. Testing the return value is
> the more typical pattern, and bytes_written should be used only if the return
> value indicates success. So that's why I proposed this change, instead of just
> initializing bytes_written to zero when it is defined. My proposed change
> makes the overall pattern more typical, and would work if the implementation
> of kmsg_dump_get_buffer() should ever change to not set bytes_written in
> some error case.
>
> >
> > 2. returns early when there are no bytes written.
> > I think it shouldn't as otherwise the crash control register isn't written to,
> > and the panic isn't signalled to the host. Is there another path maybe that
> > I'm not noticing?
>
> You make an excellent point. I didn't even think about the possibility of the
> current logic being wrong. There is hyperv_report_panic(), but it is not called
> if hv_panic_page is allocated, in order to avoid duplicate reports. I agree that
> this code should go ahead and send the panic report even if there's no
> message data. And in that case the discussion about testing the return value
> from kmsg_dump_get_buffer() is moot.
>
> I'll submit a new patch to change the behavior to send the panic report to
> the host even if the message length is zero. I did a quick test of that case,
> and it behaves like the case where HV_CRASH_CTL_CRASH_NOTIFY_MSG
> is not set, which is fine.
>
> I'll submit a new version of the patch focused on submitting the panic
> report to the hypervisor even if the message size is zero. Avoiding the
> uninitialized bytes_written will fall out of that change.
FWIW, the plot thickens here a bit. In testing my new version of
the patch, it turns out that hv_kmsg_dump() is never called when
CONFIG_PRINTK is not set. In that case, kmsg_dump_register()
fails, and hv_kmsg_dump_register() sets hv_panic_page back
to NULL. So hyperv_report_panic() is used to report the panic
to Hyper-V with no guest message.
The original uninitialized variable report is actually bogus since
the non-stub kmsg_dump_get_buffer() always sets bytes_written,
even if it returns false. But it's still worth initializing bytes_written
to zero so the static checkers don't generate noise.
I'll take all this into account in my new version of the patch.
Michael
>
> See a comment below in your suggested patch.
>
> >
> > That said, would it make sense to you the patch be something similar to:
> >
> > diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> > index 0a3ab7efed46..20e4a9a13b32 100644
> > --- a/drivers/hv/hv_common.c
> > +++ b/drivers/hv/hv_common.c
> > @@ -188,6 +188,7 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
> > {
> > struct kmsg_dump_iter iter;
> > size_t bytes_written;
> > + bool ret;
> >
> > /* We are only interested in panics. */
> > if (detail->reason != KMSG_DUMP_PANIC || !sysctl_record_panic_msg)
> > @@ -197,11 +198,16 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
> > * Write dump contents to the page. No need to synchronize; panic should
> > * be single-threaded.
> > */
> > + bytes_written = 0;
> > kmsg_dump_rewind(&iter);
> > - kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
> > + ret = kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
> > &bytes_written);
>
> Ignoring the return value can be made explicit as:
>
> + (void)kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
> &bytes_written);
>
> Plus an appropriate comment. Then there's no need to introduce the "ret" local
> variable and the somewhat funky:
>
> (void) ret;
>
> Michael
>
> > - if (!bytes_written)
> > - return;
> > + /*
> > + * Whether there is more data available or not, send what has been captured
> > + * to the host. Ignore the return value.
> > + */
> > + (void) ret;
> > +
> > /*
> > * P3 to contain the physical address of the panic page & P4 to
> > * contain the size of the panic data in that page. Rest of the
> > @@ -210,7 +216,7 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
> > hv_set_msr(HV_MSR_CRASH_P0, 0);
> > hv_set_msr(HV_MSR_CRASH_P1, 0);
> > hv_set_msr(HV_MSR_CRASH_P2, 0);
> > - hv_set_msr(HV_MSR_CRASH_P3, virt_to_phys(hv_panic_page));
> > + hv_set_msr(HV_MSR_CRASH_P3, bytes_written ? virt_to_phys(hv_panic_page) :
> NULL);
> > hv_set_msr(HV_MSR_CRASH_P4, bytes_written);
> >
> > /*
> >
> > --
> > Cheers,
> > Roman
> >
> > > --
> > > 2.25.1
^ permalink raw reply
* [RFC][PATCH v0] x86/hyperv: Reserve 3 interrupt vectors used exclusively by mshv
From: Mukesh Rathor @ 2025-12-31 1:21 UTC (permalink / raw)
To: linux-hyperv, linux-kernel
Cc: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
dave.hansen, x86, hpa
MSVC compiler used to compile the Microsoft Hyper-V hypervisor currently,
has an assert intrinsic that uses interrupt vector 0x29 to create an
exception. This will cause hypervisor to then crash and collect core. As
such, if this interrupt number is assigned to a device by linux and the
device generates it, hypervisor will crash. There are two other such
vectors hard coded in the hypervisor, 0x2C and 0x2D.
Fortunately, the three vectors are part of the kernel driver space, and
that makes it feasible to reserve them early so they are not assigned
later.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
arch/x86/kernel/cpu/mshyperv.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 579fb2c64cfd..19d41f7434df 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -478,6 +478,25 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
}
EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
+/*
+ * Reserve vectors hard coded in the hypervisor. If used outside, the hypervisor
+ * will crash or hang or break into debugger.
+ */
+static void hv_reserve_irq_vectors(void)
+{
+ #define HYPERV_DBG_FASTFAIL_VECTOR 0x29
+ #define HYPERV_DBG_ASSERT_VECTOR 0x2C
+ #define HYPERV_DBG_SERVICE_VECTOR 0x2D
+
+ if (test_and_set_bit(HYPERV_DBG_ASSERT_VECTOR, system_vectors) ||
+ test_and_set_bit(HYPERV_DBG_SERVICE_VECTOR, system_vectors) ||
+ test_and_set_bit(HYPERV_DBG_FASTFAIL_VECTOR, system_vectors))
+ BUG();
+
+ pr_info("Hyper-V:reserve vectors: %d %d %d\n", HYPERV_DBG_ASSERT_VECTOR,
+ HYPERV_DBG_SERVICE_VECTOR, HYPERV_DBG_FASTFAIL_VECTOR);
+}
+
static void __init ms_hyperv_init_platform(void)
{
int hv_max_functions_eax, eax;
@@ -510,6 +529,9 @@ static void __init ms_hyperv_init_platform(void)
hv_identify_partition_type();
+ if (hv_root_partition())
+ hv_reserve_irq_vectors();
+
if (cc_platform_has(CC_ATTR_SNP_SECURE_AVIC))
ms_hyperv.hints |= HV_DEPRECATING_AEOI_RECOMMENDED;
--
2.51.2.vfs.0.1
^ permalink raw reply related
* Re: [PATCH v2 3/3] mshv: Add debugfs to view hypervisor statistics
From: Nuno Das Neves @ 2025-12-31 0:26 UTC (permalink / raw)
To: Michael Kelley, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, skinsburskii@linux.microsoft.com
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
prapal@linux.microsoft.com, mrathor@linux.microsoft.com,
paekkaladevi@linux.microsoft.com, Jinank Jain
In-Reply-To: <SN6PR02MB4157938404BC0D12978ACD9BD4A2A@SN6PR02MB4157.namprd02.prod.outlook.com>
On 12/8/2025 7:21 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, December 5, 2025 10:59 AM
>>
>> Introduce a debugfs interface to expose root and child partition stats
>> when running with mshv_root.
>>
>> Create a debugfs directory "mshv" containing 'stats' files organized by
>> type and id. A stats file contains a number of counters depending on
>> its type. e.g. an excerpt from a VP stats file:
>>
>> TotalRunTime : 1997602722
>> HypervisorRunTime : 649671371
>> RemoteNodeRunTime : 0
>> NormalizedRunTime : 1997602721
>> IdealCpu : 0
>> HypercallsCount : 1708169
>> HypercallsTime : 111914774
>> PageInvalidationsCount : 0
>> PageInvalidationsTime : 0
>>
>> On a root partition with some active child partitions, the entire
>> directory structure may look like:
>>
>> mshv/
>> stats # hypervisor stats
>> lp/ # logical processors
>> 0/ # LP id
>> stats # LP 0 stats
>> 1/
>> 2/
>> 3/
>> partition/ # partition stats
>> 1/ # root partition id
>> stats # root partition stats
>> vp/ # root virtual processors
>> 0/ # root VP id
>> stats # root VP 0 stats
>> 1/
>> 2/
>> 3/
>> 42/ # child partition id
>> stats # child partition stats
>> vp/ # child VPs
>> 0/ # child VP id
>> stats # child VP 0 stats
>> 1/
>> 43/
>> 55/
>>
>
> In the above directory tree, each of the "stats" files is in a directory
> by itself, where the directory name is the number of whatever
> entity the stats are for (lp, partition, or vp). Do you expect there to
> be other files parallel to "stats" that will be added later? Otherwise
> you could collapse one directory level. The "best" directory structure
> is somewhat a matter of taste and judgment, so there's not a "right"
> answer. I don't object if your preference is to keep the numbered
> directories, even if they are likely to never contain more than the
> "stats" file.
>
Good question, I'm not aware of a plan to add additional parallel files
in future, but even so, I think this structure is fine as-is.
I see how the VPs and LPs directories could be collapsed, but partitions
need to be directories to contain the VPs, so that would be an
inconsistency (some "stats" files and some "$ID" files) which seems worse
to me. e.g.., are you suggesting something like this?
mshv/
stats # hypervisor stats
lp/ # logical processors
0 # LP 0 stats
1 # LP 1 stats
partition/ # partition stats directory
1/ # root partition id
stats # root partition stats
vp/ # root virtual processors
0 # root VP 0 stats
1 # root VP 1 stats
4/ # child partition id
stats # child partition stats
vp/ # child virtual processors
0 # child VP 0 stats
1 # child VP 1 stats
Unless I'm misunderstanding what you mean, I think the original is better,
both because it's more consistent and does leave room for adding additional
files if we ever want to.
>> On L1VH, some stats are not present as it does not own the hardware
>> like the root partition does:
>> - The hypervisor and lp stats are not present
>> - L1VH's partition directory is named "self" because it can't get its
>> own id
>> - Some of L1VH's partition and VP stats fields are not populated, because
>> it can't map its own HV_STATS_AREA_PARENT page.
>>
>> Co-developed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> Co-developed-by: Praveen K Paladugu <prapal@linux.microsoft.com>
>> Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
>> Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> Co-developed-by: Purna Pavan Chandra Aekkaladevi
>> <paekkaladevi@linux.microsoft.com>
>> Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
>> Co-developed-by: Jinank Jain <jinankjain@microsoft.com>
>> Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> ---
>> drivers/hv/Makefile | 1 +
>> drivers/hv/mshv_debugfs.c | 1122 +++++++++++++++++++++++++++++++++++
>> drivers/hv/mshv_root.h | 34 ++
>> drivers/hv/mshv_root_main.c | 32 +-
>> 4 files changed, 1185 insertions(+), 4 deletions(-)
>> create mode 100644 drivers/hv/mshv_debugfs.c
>>
>> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
>> index 58b8d07639f3..36278c936914 100644
>> --- a/drivers/hv/Makefile
>> +++ b/drivers/hv/Makefile
>> @@ -15,6 +15,7 @@ hv_vmbus-$(CONFIG_HYPERV_TESTING) += hv_debugfs.o
>> hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
>> mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
>> mshv_root_hv_call.o mshv_portid_table.o
>> +mshv_root-$(CONFIG_DEBUG_FS) += mshv_debugfs.o
>> mshv_vtl-y := mshv_vtl_main.o
>>
>> # Code that must be built-in
>> diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
>> new file mode 100644
>> index 000000000000..581018690a27
>> --- /dev/null
>> +++ b/drivers/hv/mshv_debugfs.c
>> @@ -0,0 +1,1122 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (c) 2025, Microsoft Corporation.
>> + *
>> + * The /sys/kernel/debug/mshv directory contents.
>> + * Contains various statistics data, provided by the hypervisor.
>> + *
>> + * Authors: Microsoft Linux virtualization team
>> + */
>> +
>> +#include <linux/debugfs.h>
>> +#include <linux/stringify.h>
>> +#include <asm/mshyperv.h>
>> +#include <linux/slab.h>
>> +
>> +#include "mshv.h"
>> +#include "mshv_root.h"
>> +
>> +#define U32_BUF_SZ 11
>> +#define U64_BUF_SZ 21
>> +
>> +static struct dentry *mshv_debugfs;
>> +static struct dentry *mshv_debugfs_partition;
>> +static struct dentry *mshv_debugfs_lp;
>> +
>> +static u64 mshv_lps_count;
>> +
>> +static bool is_l1vh_parent(u64 partition_id)
>> +{
>> + return hv_l1vh_partition() && (partition_id == HV_PARTITION_ID_SELF);
>> +}
>> +
>> +static int lp_stats_show(struct seq_file *m, void *v)
>> +{
>> + const struct hv_stats_page *stats = m->private;
>> +
>> +#define LP_SEQ_PRINTF(cnt) \
>> + seq_printf(m, "%-29s: %llu\n", __stringify(cnt), stats->lp_cntrs[Lp##cnt])
>> +
>> + LP_SEQ_PRINTF(GlobalTime);
>> + LP_SEQ_PRINTF(TotalRunTime);
>> + LP_SEQ_PRINTF(HypervisorRunTime);
>> + LP_SEQ_PRINTF(HardwareInterrupts);
>> + LP_SEQ_PRINTF(ContextSwitches);
>> + LP_SEQ_PRINTF(InterProcessorInterrupts);
>> + LP_SEQ_PRINTF(SchedulerInterrupts);
>> + LP_SEQ_PRINTF(TimerInterrupts);
>> + LP_SEQ_PRINTF(InterProcessorInterruptsSent);
>> + LP_SEQ_PRINTF(ProcessorHalts);
>> + LP_SEQ_PRINTF(MonitorTransitionCost);
>> + LP_SEQ_PRINTF(ContextSwitchTime);
>> + LP_SEQ_PRINTF(C1TransitionsCount);
>> + LP_SEQ_PRINTF(C1RunTime);
>> + LP_SEQ_PRINTF(C2TransitionsCount);
>> + LP_SEQ_PRINTF(C2RunTime);
>> + LP_SEQ_PRINTF(C3TransitionsCount);
>> + LP_SEQ_PRINTF(C3RunTime);
>> + LP_SEQ_PRINTF(RootVpIndex);
>> + LP_SEQ_PRINTF(IdleSequenceNumber);
>> + LP_SEQ_PRINTF(GlobalTscCount);
>> + LP_SEQ_PRINTF(ActiveTscCount);
>> + LP_SEQ_PRINTF(IdleAccumulation);
>> + LP_SEQ_PRINTF(ReferenceCycleCount0);
>> + LP_SEQ_PRINTF(ActualCycleCount0);
>> + LP_SEQ_PRINTF(ReferenceCycleCount1);
>> + LP_SEQ_PRINTF(ActualCycleCount1);
>> + LP_SEQ_PRINTF(ProximityDomainId);
>> + LP_SEQ_PRINTF(PostedInterruptNotifications);
>> + LP_SEQ_PRINTF(BranchPredictorFlushes);
>> +#if IS_ENABLED(CONFIG_X86_64)
>> + LP_SEQ_PRINTF(L1DataCacheFlushes);
>> + LP_SEQ_PRINTF(ImmediateL1DataCacheFlushes);
>> + LP_SEQ_PRINTF(MbFlushes);
>> + LP_SEQ_PRINTF(CounterRefreshSequenceNumber);
>> + LP_SEQ_PRINTF(CounterRefreshReferenceTime);
>> + LP_SEQ_PRINTF(IdleAccumulationSnapshot);
>> + LP_SEQ_PRINTF(ActiveTscCountSnapshot);
>> + LP_SEQ_PRINTF(HwpRequestContextSwitches);
>> + LP_SEQ_PRINTF(Placeholder1);
>> + LP_SEQ_PRINTF(Placeholder2);
>> + LP_SEQ_PRINTF(Placeholder3);
>> + LP_SEQ_PRINTF(Placeholder4);
>> + LP_SEQ_PRINTF(Placeholder5);
>> + LP_SEQ_PRINTF(Placeholder6);
>> + LP_SEQ_PRINTF(Placeholder7);
>> + LP_SEQ_PRINTF(Placeholder8);
>> + LP_SEQ_PRINTF(Placeholder9);
>> + LP_SEQ_PRINTF(Placeholder10);
>> + LP_SEQ_PRINTF(ReserveGroupId);
>> + LP_SEQ_PRINTF(RunningPriority);
>> + LP_SEQ_PRINTF(PerfmonInterruptCount);
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> + LP_SEQ_PRINTF(CounterRefreshSequenceNumber);
>> + LP_SEQ_PRINTF(CounterRefreshReferenceTime);
>> + LP_SEQ_PRINTF(IdleAccumulationSnapshot);
>> + LP_SEQ_PRINTF(ActiveTscCountSnapshot);
>> + LP_SEQ_PRINTF(HwpRequestContextSwitches);
>> + LP_SEQ_PRINTF(Placeholder2);
>> + LP_SEQ_PRINTF(Placeholder3);
>> + LP_SEQ_PRINTF(Placeholder4);
>> + LP_SEQ_PRINTF(Placeholder5);
>> + LP_SEQ_PRINTF(Placeholder6);
>> + LP_SEQ_PRINTF(Placeholder7);
>> + LP_SEQ_PRINTF(Placeholder8);
>> + LP_SEQ_PRINTF(Placeholder9);
>> + LP_SEQ_PRINTF(SchLocalRunListSize);
>> + LP_SEQ_PRINTF(ReserveGroupId);
>> + LP_SEQ_PRINTF(RunningPriority);
>> +#endif
>> +
>> + return 0;
>> +}
>> +DEFINE_SHOW_ATTRIBUTE(lp_stats);
>> +
>> +static void mshv_lp_stats_unmap(u32 lp_index, void *stats_page_addr)
>> +{
>> + union hv_stats_object_identity identity = {
>> + .lp.lp_index = lp_index,
>> + .lp.stats_area_type = HV_STATS_AREA_SELF,
>> + };
>> + int err;
>> +
>> + err = hv_unmap_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR,
>> + stats_page_addr, &identity);
>> + if (err)
>> + pr_err("%s: failed to unmap logical processor %u stats, err: %d\n",
>> + __func__, lp_index, err);
>> +}
>> +
>> +static void __init *mshv_lp_stats_map(u32 lp_index)
>> +{
>> + union hv_stats_object_identity identity = {
>> + .lp.lp_index = lp_index,
>> + .lp.stats_area_type = HV_STATS_AREA_SELF,
>> + };
>> + void *stats;
>> + int err;
>> +
>> + err = hv_map_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR, &identity,
>> + &stats);
>> + if (err) {
>> + pr_err("%s: failed to map logical processor %u stats, err: %d\n",
>> + __func__, lp_index, err);
>> + return ERR_PTR(err);
>> + }
>> +
>> + return stats;
>> +}
>> +
>> +static void __init *lp_debugfs_stats_create(u32 lp_index, struct dentry *parent)
>> +{
>> + struct dentry *dentry;
>> + void *stats;
>> +
>> + stats = mshv_lp_stats_map(lp_index);
>> + if (IS_ERR(stats))
>> + return stats;
>> +
>> + dentry = debugfs_create_file("stats", 0400, parent,
>> + stats, &lp_stats_fops);
>> + if (IS_ERR(dentry)) {
>> + mshv_lp_stats_unmap(lp_index, stats);
>> + return dentry;
>> + }
>> + return stats;
>> +}
>> +
>> +static int __init lp_debugfs_create(u32 lp_index, struct dentry *parent)
>> +{
>> + struct dentry *idx;
>> + char lp_idx_str[U32_BUF_SZ];
>> + void *stats;
>> + int err;
>> +
>> + sprintf(lp_idx_str, "%u", lp_index);
>> +
>> + idx = debugfs_create_dir(lp_idx_str, parent);
>> + if (IS_ERR(idx))
>> + return PTR_ERR(idx);
>> +
>> + stats = lp_debugfs_stats_create(lp_index, idx);
>> + if (IS_ERR(stats)) {
>> + err = PTR_ERR(stats);
>> + goto remove_debugfs_lp_idx;
>> + }
>> +
>> + return 0;
>> +
>> +remove_debugfs_lp_idx:
>> + debugfs_remove_recursive(idx);
>> + return err;
>> +}
>> +
>> +static void mshv_debugfs_lp_remove(void)
>> +{
>> + int lp_index;
>> +
>> + debugfs_remove_recursive(mshv_debugfs_lp);
>> +
>> + for (lp_index = 0; lp_index < mshv_lps_count; lp_index++)
>> + mshv_lp_stats_unmap(lp_index, NULL);
>
> Passing NULL as the second argument here leaks the stats page
> memory if Linux allocated the page as an overlay GPFN. But is that
> considered OK because the debugfs entries for LPs are removed
> only when the root partition is shutting down? That works as
> long as hot-add/remove of CPUs isn't supported in the root
> partition.
>
Hmm, at the very least this appears to be a memory leak if the mshv
driver is built as a module and removed + reinserted. The stats
pages can be mapped multiple times so it will just allocate a page
(on L1VH anyway) and remap it each time. I will check and fix it in
this patch.
>> +}
>> +
>> +static int __init mshv_debugfs_lp_create(struct dentry *parent)
>> +{
>> + struct dentry *lp_dir;
>> + int err, lp_index;
>> +
>> + lp_dir = debugfs_create_dir("lp", parent);
>> + if (IS_ERR(lp_dir))
>> + return PTR_ERR(lp_dir);
>> +
>> + for (lp_index = 0; lp_index < mshv_lps_count; lp_index++) {
>> + err = lp_debugfs_create(lp_index, lp_dir);
>> + if (err)
>> + goto remove_debugfs_lps;
>> + }
>> +
>> + mshv_debugfs_lp = lp_dir;
>> +
>> + return 0;
>> +
>> +remove_debugfs_lps:
>> + for (lp_index -= 1; lp_index >= 0; lp_index--)
>> + mshv_lp_stats_unmap(lp_index, NULL);
>> + debugfs_remove_recursive(lp_dir);
>> + return err;
>> +}
>> +
>> +static int vp_stats_show(struct seq_file *m, void *v)
>> +{
>> + const struct hv_stats_page **pstats = m->private;
>> +
>> +#define VP_SEQ_PRINTF(cnt) \
>> +do { \
>> + if (pstats[HV_STATS_AREA_SELF]->vp_cntrs[Vp##cnt]) \
>> + seq_printf(m, "%-30s: %llu\n", __stringify(cnt), \
>> + pstats[HV_STATS_AREA_SELF]->vp_cntrs[Vp##cnt]); \
>> + else \
>> + seq_printf(m, "%-30s: %llu\n", __stringify(cnt), \
>> + pstats[HV_STATS_AREA_PARENT]->vp_cntrs[Vp##cnt]); \
>> +} while (0)
>
> I don't understand this logic. Like in mshv_vp_dispatch_thread_blocked(), if
> the SELF value is zero, then the PARENT value is used. The implication is that
> you never want to display a SELF value of zero, which is a bit unexpected
> since I could imagine zero being valid for some counters. But the overall result
> is that the displayed values may be a mix of SELF and PARENT values.
Yes, the basic idea is: Display a nonzero value, if there is one on either SELF or
PARENT pages. (I *think* the values will always be the same if they are nonzero.)
I admit it's not an ideal design from my perspective. As far as I know, it was
done this way to retain backward compatibility with hypervisors that don't support
the concept of a PARENT stats area at all.
> And of course after Patch 1 of this series, if running on an older hypervisor
> that doesn't provide PARENT, then SELF will be used anyway, which further
> muddies what's going on here, at least for me. :-)
>
Yes, but in the end we need to check both pages, so there's no avoiding this
redundant check on old hypervisors without adding a separate code path just for
that case, which doesn't seem worth it.
> If this is the correct behavior, please add some code comments as to
> why it makes sense, including in the case where PARENT isn't available.
>
Ok, will do.
>> +
>> + VP_SEQ_PRINTF(TotalRunTime);
>> + VP_SEQ_PRINTF(HypervisorRunTime);
>> + VP_SEQ_PRINTF(RemoteNodeRunTime);
>> + VP_SEQ_PRINTF(NormalizedRunTime);
>> + VP_SEQ_PRINTF(IdealCpu);
>> + VP_SEQ_PRINTF(HypercallsCount);
>> + VP_SEQ_PRINTF(HypercallsTime);
>> +#if IS_ENABLED(CONFIG_X86_64)
>> + VP_SEQ_PRINTF(PageInvalidationsCount);
>> + VP_SEQ_PRINTF(PageInvalidationsTime);
>> + VP_SEQ_PRINTF(ControlRegisterAccessesCount);
>> + VP_SEQ_PRINTF(ControlRegisterAccessesTime);
>> + VP_SEQ_PRINTF(IoInstructionsCount);
>> + VP_SEQ_PRINTF(IoInstructionsTime);
>> + VP_SEQ_PRINTF(HltInstructionsCount);
>> + VP_SEQ_PRINTF(HltInstructionsTime);
>> + VP_SEQ_PRINTF(MwaitInstructionsCount);
>> + VP_SEQ_PRINTF(MwaitInstructionsTime);
>> + VP_SEQ_PRINTF(CpuidInstructionsCount);
>> + VP_SEQ_PRINTF(CpuidInstructionsTime);
>> + VP_SEQ_PRINTF(MsrAccessesCount);
>> + VP_SEQ_PRINTF(MsrAccessesTime);
>> + VP_SEQ_PRINTF(OtherInterceptsCount);
>> + VP_SEQ_PRINTF(OtherInterceptsTime);
>> + VP_SEQ_PRINTF(ExternalInterruptsCount);
>> + VP_SEQ_PRINTF(ExternalInterruptsTime);
>> + VP_SEQ_PRINTF(PendingInterruptsCount);
>> + VP_SEQ_PRINTF(PendingInterruptsTime);
>> + VP_SEQ_PRINTF(EmulatedInstructionsCount);
>> + VP_SEQ_PRINTF(EmulatedInstructionsTime);
>> + VP_SEQ_PRINTF(DebugRegisterAccessesCount);
>> + VP_SEQ_PRINTF(DebugRegisterAccessesTime);
>> + VP_SEQ_PRINTF(PageFaultInterceptsCount);
>> + VP_SEQ_PRINTF(PageFaultInterceptsTime);
>> + VP_SEQ_PRINTF(GuestPageTableMaps);
>> + VP_SEQ_PRINTF(LargePageTlbFills);
>> + VP_SEQ_PRINTF(SmallPageTlbFills);
>> + VP_SEQ_PRINTF(ReflectedGuestPageFaults);
>> + VP_SEQ_PRINTF(ApicMmioAccesses);
>> + VP_SEQ_PRINTF(IoInterceptMessages);
>> + VP_SEQ_PRINTF(MemoryInterceptMessages);
>> + VP_SEQ_PRINTF(ApicEoiAccesses);
>> + VP_SEQ_PRINTF(OtherMessages);
>> + VP_SEQ_PRINTF(PageTableAllocations);
>> + VP_SEQ_PRINTF(LogicalProcessorMigrations);
>> + VP_SEQ_PRINTF(AddressSpaceEvictions);
>> + VP_SEQ_PRINTF(AddressSpaceSwitches);
>> + VP_SEQ_PRINTF(AddressDomainFlushes);
>> + VP_SEQ_PRINTF(AddressSpaceFlushes);
>> + VP_SEQ_PRINTF(GlobalGvaRangeFlushes);
>> + VP_SEQ_PRINTF(LocalGvaRangeFlushes);
>> + VP_SEQ_PRINTF(PageTableEvictions);
>> + VP_SEQ_PRINTF(PageTableReclamations);
>> + VP_SEQ_PRINTF(PageTableResets);
>> + VP_SEQ_PRINTF(PageTableValidations);
>> + VP_SEQ_PRINTF(ApicTprAccesses);
>> + VP_SEQ_PRINTF(PageTableWriteIntercepts);
>> + VP_SEQ_PRINTF(SyntheticInterrupts);
>> + VP_SEQ_PRINTF(VirtualInterrupts);
>> + VP_SEQ_PRINTF(ApicIpisSent);
>> + VP_SEQ_PRINTF(ApicSelfIpisSent);
>> + VP_SEQ_PRINTF(GpaSpaceHypercalls);
>> + VP_SEQ_PRINTF(LogicalProcessorHypercalls);
>> + VP_SEQ_PRINTF(LongSpinWaitHypercalls);
>> + VP_SEQ_PRINTF(OtherHypercalls);
>> + VP_SEQ_PRINTF(SyntheticInterruptHypercalls);
>> + VP_SEQ_PRINTF(VirtualInterruptHypercalls);
>> + VP_SEQ_PRINTF(VirtualMmuHypercalls);
>> + VP_SEQ_PRINTF(VirtualProcessorHypercalls);
>> + VP_SEQ_PRINTF(HardwareInterrupts);
>> + VP_SEQ_PRINTF(NestedPageFaultInterceptsCount);
>> + VP_SEQ_PRINTF(NestedPageFaultInterceptsTime);
>> + VP_SEQ_PRINTF(PageScans);
>> + VP_SEQ_PRINTF(LogicalProcessorDispatches);
>> + VP_SEQ_PRINTF(WaitingForCpuTime);
>> + VP_SEQ_PRINTF(ExtendedHypercalls);
>> + VP_SEQ_PRINTF(ExtendedHypercallInterceptMessages);
>> + VP_SEQ_PRINTF(MbecNestedPageTableSwitches);
>> + VP_SEQ_PRINTF(OtherReflectedGuestExceptions);
>> + VP_SEQ_PRINTF(GlobalIoTlbFlushes);
>> + VP_SEQ_PRINTF(GlobalIoTlbFlushCost);
>> + VP_SEQ_PRINTF(LocalIoTlbFlushes);
>> + VP_SEQ_PRINTF(LocalIoTlbFlushCost);
>> + VP_SEQ_PRINTF(HypercallsForwardedCount);
>> + VP_SEQ_PRINTF(HypercallsForwardingTime);
>> + VP_SEQ_PRINTF(PageInvalidationsForwardedCount);
>> + VP_SEQ_PRINTF(PageInvalidationsForwardingTime);
>> + VP_SEQ_PRINTF(ControlRegisterAccessesForwardedCount);
>> + VP_SEQ_PRINTF(ControlRegisterAccessesForwardingTime);
>> + VP_SEQ_PRINTF(IoInstructionsForwardedCount);
>> + VP_SEQ_PRINTF(IoInstructionsForwardingTime);
>> + VP_SEQ_PRINTF(HltInstructionsForwardedCount);
>> + VP_SEQ_PRINTF(HltInstructionsForwardingTime);
>> + VP_SEQ_PRINTF(MwaitInstructionsForwardedCount);
>> + VP_SEQ_PRINTF(MwaitInstructionsForwardingTime);
>> + VP_SEQ_PRINTF(CpuidInstructionsForwardedCount);
>> + VP_SEQ_PRINTF(CpuidInstructionsForwardingTime);
>> + VP_SEQ_PRINTF(MsrAccessesForwardedCount);
>> + VP_SEQ_PRINTF(MsrAccessesForwardingTime);
>> + VP_SEQ_PRINTF(OtherInterceptsForwardedCount);
>> + VP_SEQ_PRINTF(OtherInterceptsForwardingTime);
>> + VP_SEQ_PRINTF(ExternalInterruptsForwardedCount);
>> + VP_SEQ_PRINTF(ExternalInterruptsForwardingTime);
>> + VP_SEQ_PRINTF(PendingInterruptsForwardedCount);
>> + VP_SEQ_PRINTF(PendingInterruptsForwardingTime);
>> + VP_SEQ_PRINTF(EmulatedInstructionsForwardedCount);
>> + VP_SEQ_PRINTF(EmulatedInstructionsForwardingTime);
>> + VP_SEQ_PRINTF(DebugRegisterAccessesForwardedCount);
>> + VP_SEQ_PRINTF(DebugRegisterAccessesForwardingTime);
>> + VP_SEQ_PRINTF(PageFaultInterceptsForwardedCount);
>> + VP_SEQ_PRINTF(PageFaultInterceptsForwardingTime);
>> + VP_SEQ_PRINTF(VmclearEmulationCount);
>> + VP_SEQ_PRINTF(VmclearEmulationTime);
>> + VP_SEQ_PRINTF(VmptrldEmulationCount);
>> + VP_SEQ_PRINTF(VmptrldEmulationTime);
>> + VP_SEQ_PRINTF(VmptrstEmulationCount);
>> + VP_SEQ_PRINTF(VmptrstEmulationTime);
>> + VP_SEQ_PRINTF(VmreadEmulationCount);
>> + VP_SEQ_PRINTF(VmreadEmulationTime);
>> + VP_SEQ_PRINTF(VmwriteEmulationCount);
>> + VP_SEQ_PRINTF(VmwriteEmulationTime);
>> + VP_SEQ_PRINTF(VmxoffEmulationCount);
>> + VP_SEQ_PRINTF(VmxoffEmulationTime);
>> + VP_SEQ_PRINTF(VmxonEmulationCount);
>> + VP_SEQ_PRINTF(VmxonEmulationTime);
>> + VP_SEQ_PRINTF(NestedVMEntriesCount);
>> + VP_SEQ_PRINTF(NestedVMEntriesTime);
>> + VP_SEQ_PRINTF(NestedSLATSoftPageFaultsCount);
>> + VP_SEQ_PRINTF(NestedSLATSoftPageFaultsTime);
>> + VP_SEQ_PRINTF(NestedSLATHardPageFaultsCount);
>> + VP_SEQ_PRINTF(NestedSLATHardPageFaultsTime);
>> + VP_SEQ_PRINTF(InvEptAllContextEmulationCount);
>> + VP_SEQ_PRINTF(InvEptAllContextEmulationTime);
>> + VP_SEQ_PRINTF(InvEptSingleContextEmulationCount);
>> + VP_SEQ_PRINTF(InvEptSingleContextEmulationTime);
>> + VP_SEQ_PRINTF(InvVpidAllContextEmulationCount);
>> + VP_SEQ_PRINTF(InvVpidAllContextEmulationTime);
>> + VP_SEQ_PRINTF(InvVpidSingleContextEmulationCount);
>> + VP_SEQ_PRINTF(InvVpidSingleContextEmulationTime);
>> + VP_SEQ_PRINTF(InvVpidSingleAddressEmulationCount);
>> + VP_SEQ_PRINTF(InvVpidSingleAddressEmulationTime);
>> + VP_SEQ_PRINTF(NestedTlbPageTableReclamations);
>> + VP_SEQ_PRINTF(NestedTlbPageTableEvictions);
>> + VP_SEQ_PRINTF(FlushGuestPhysicalAddressSpaceHypercalls);
>> + VP_SEQ_PRINTF(FlushGuestPhysicalAddressListHypercalls);
>> + VP_SEQ_PRINTF(PostedInterruptNotifications);
>> + VP_SEQ_PRINTF(PostedInterruptScans);
>> + VP_SEQ_PRINTF(TotalCoreRunTime);
>> + VP_SEQ_PRINTF(MaximumRunTime);
>> + VP_SEQ_PRINTF(HwpRequestContextSwitches);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket0);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket1);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket2);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket3);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket4);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket5);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket6);
>> + VP_SEQ_PRINTF(VmloadEmulationCount);
>> + VP_SEQ_PRINTF(VmloadEmulationTime);
>> + VP_SEQ_PRINTF(VmsaveEmulationCount);
>> + VP_SEQ_PRINTF(VmsaveEmulationTime);
>> + VP_SEQ_PRINTF(GifInstructionEmulationCount);
>> + VP_SEQ_PRINTF(GifInstructionEmulationTime);
>> + VP_SEQ_PRINTF(EmulatedErrataSvmInstructions);
>> + VP_SEQ_PRINTF(Placeholder1);
>> + VP_SEQ_PRINTF(Placeholder2);
>> + VP_SEQ_PRINTF(Placeholder3);
>> + VP_SEQ_PRINTF(Placeholder4);
>> + VP_SEQ_PRINTF(Placeholder5);
>> + VP_SEQ_PRINTF(Placeholder6);
>> + VP_SEQ_PRINTF(Placeholder7);
>> + VP_SEQ_PRINTF(Placeholder8);
>> + VP_SEQ_PRINTF(Placeholder9);
>> + VP_SEQ_PRINTF(Placeholder10);
>> + VP_SEQ_PRINTF(SchedulingPriority);
>> + VP_SEQ_PRINTF(RdpmcInstructionsCount);
>> + VP_SEQ_PRINTF(RdpmcInstructionsTime);
>> + VP_SEQ_PRINTF(PerfmonPmuMsrAccessesCount);
>> + VP_SEQ_PRINTF(PerfmonLbrMsrAccessesCount);
>> + VP_SEQ_PRINTF(PerfmonIptMsrAccessesCount);
>> + VP_SEQ_PRINTF(PerfmonInterruptCount);
>> + VP_SEQ_PRINTF(Vtl1DispatchCount);
>> + VP_SEQ_PRINTF(Vtl2DispatchCount);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket0);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket1);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket2);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket3);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket4);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket5);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket6);
>> + VP_SEQ_PRINTF(Vtl1RunTime);
>> + VP_SEQ_PRINTF(Vtl2RunTime);
>> + VP_SEQ_PRINTF(IommuHypercalls);
>> + VP_SEQ_PRINTF(CpuGroupHypercalls);
>> + VP_SEQ_PRINTF(VsmHypercalls);
>> + VP_SEQ_PRINTF(EventLogHypercalls);
>> + VP_SEQ_PRINTF(DeviceDomainHypercalls);
>> + VP_SEQ_PRINTF(DepositHypercalls);
>> + VP_SEQ_PRINTF(SvmHypercalls);
>> + VP_SEQ_PRINTF(BusLockAcquisitionCount);
>
> The x86 VpUnused counter is not shown. Any reason for that? All the
> Placeholder counters *are* shown, so I'm just wondering what's
> different.
>
Good question, I believe when this code was written VpUnused was
actually undefined in our headers, because the value 201 was
temporarily used for VpRootDispatchThreadBlocked before that was
changed to 202 (the hypervisor version using 201 was never released
publically so not considered a breaking change).
Checking the code, 201 now refers to VpLoadAvg on x86 so I will
update the definitions in patch #2 of this series to include that,
and add it here in the debugfs code.
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> + VP_SEQ_PRINTF(SysRegAccessesCount);
>> + VP_SEQ_PRINTF(SysRegAccessesTime);
>> + VP_SEQ_PRINTF(SmcInstructionsCount);
>> + VP_SEQ_PRINTF(SmcInstructionsTime);
>> + VP_SEQ_PRINTF(OtherInterceptsCount);
>> + VP_SEQ_PRINTF(OtherInterceptsTime);
>> + VP_SEQ_PRINTF(ExternalInterruptsCount);
>> + VP_SEQ_PRINTF(ExternalInterruptsTime);
>> + VP_SEQ_PRINTF(PendingInterruptsCount);
>> + VP_SEQ_PRINTF(PendingInterruptsTime);
>> + VP_SEQ_PRINTF(GuestPageTableMaps);
>> + VP_SEQ_PRINTF(LargePageTlbFills);
>> + VP_SEQ_PRINTF(SmallPageTlbFills);
>> + VP_SEQ_PRINTF(ReflectedGuestPageFaults);
>> + VP_SEQ_PRINTF(MemoryInterceptMessages);
>> + VP_SEQ_PRINTF(OtherMessages);
>> + VP_SEQ_PRINTF(LogicalProcessorMigrations);
>> + VP_SEQ_PRINTF(AddressDomainFlushes);
>> + VP_SEQ_PRINTF(AddressSpaceFlushes);
>> + VP_SEQ_PRINTF(SyntheticInterrupts);
>> + VP_SEQ_PRINTF(VirtualInterrupts);
>> + VP_SEQ_PRINTF(ApicSelfIpisSent);
>> + VP_SEQ_PRINTF(GpaSpaceHypercalls);
>> + VP_SEQ_PRINTF(LogicalProcessorHypercalls);
>> + VP_SEQ_PRINTF(LongSpinWaitHypercalls);
>> + VP_SEQ_PRINTF(OtherHypercalls);
>> + VP_SEQ_PRINTF(SyntheticInterruptHypercalls);
>> + VP_SEQ_PRINTF(VirtualInterruptHypercalls);
>> + VP_SEQ_PRINTF(VirtualMmuHypercalls);
>> + VP_SEQ_PRINTF(VirtualProcessorHypercalls);
>> + VP_SEQ_PRINTF(HardwareInterrupts);
>> + VP_SEQ_PRINTF(NestedPageFaultInterceptsCount);
>> + VP_SEQ_PRINTF(NestedPageFaultInterceptsTime);
>> + VP_SEQ_PRINTF(LogicalProcessorDispatches);
>> + VP_SEQ_PRINTF(WaitingForCpuTime);
>> + VP_SEQ_PRINTF(ExtendedHypercalls);
>> + VP_SEQ_PRINTF(ExtendedHypercallInterceptMessages);
>> + VP_SEQ_PRINTF(MbecNestedPageTableSwitches);
>> + VP_SEQ_PRINTF(OtherReflectedGuestExceptions);
>> + VP_SEQ_PRINTF(GlobalIoTlbFlushes);
>> + VP_SEQ_PRINTF(GlobalIoTlbFlushCost);
>> + VP_SEQ_PRINTF(LocalIoTlbFlushes);
>> + VP_SEQ_PRINTF(LocalIoTlbFlushCost);
>> + VP_SEQ_PRINTF(FlushGuestPhysicalAddressSpaceHypercalls);
>> + VP_SEQ_PRINTF(FlushGuestPhysicalAddressListHypercalls);
>> + VP_SEQ_PRINTF(PostedInterruptNotifications);
>> + VP_SEQ_PRINTF(PostedInterruptScans);
>> + VP_SEQ_PRINTF(TotalCoreRunTime);
>> + VP_SEQ_PRINTF(MaximumRunTime);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket0);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket1);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket2);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket3);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket4);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket5);
>> + VP_SEQ_PRINTF(WaitingForCpuTimeBucket6);
>> + VP_SEQ_PRINTF(HwpRequestContextSwitches);
>> + VP_SEQ_PRINTF(Placeholder2);
>> + VP_SEQ_PRINTF(Placeholder3);
>> + VP_SEQ_PRINTF(Placeholder4);
>> + VP_SEQ_PRINTF(Placeholder5);
>> + VP_SEQ_PRINTF(Placeholder6);
>> + VP_SEQ_PRINTF(Placeholder7);
>> + VP_SEQ_PRINTF(Placeholder8);
>> + VP_SEQ_PRINTF(ContentionTime);
>> + VP_SEQ_PRINTF(WakeUpTime);
>> + VP_SEQ_PRINTF(SchedulingPriority);
>> + VP_SEQ_PRINTF(Vtl1DispatchCount);
>> + VP_SEQ_PRINTF(Vtl2DispatchCount);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket0);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket1);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket2);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket3);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket4);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket5);
>> + VP_SEQ_PRINTF(Vtl2DispatchBucket6);
>> + VP_SEQ_PRINTF(Vtl1RunTime);
>> + VP_SEQ_PRINTF(Vtl2RunTime);
>> + VP_SEQ_PRINTF(IommuHypercalls);
>> + VP_SEQ_PRINTF(CpuGroupHypercalls);
>> + VP_SEQ_PRINTF(VsmHypercalls);
>> + VP_SEQ_PRINTF(EventLogHypercalls);
>> + VP_SEQ_PRINTF(DeviceDomainHypercalls);
>> + VP_SEQ_PRINTF(DepositHypercalls);
>> + VP_SEQ_PRINTF(SvmHypercalls);
>
> The ARM64 VpLoadAvg counter is not shown? Any reason why?
>
I'm not sure, but could be related to the reasoning in the above
comment - likely VpLoadAvg didn't exist before. I will add it.
>> +#endif
>
> The VpRootDispatchThreadBlocked counter is not shown for either
> x86 or ARM64. Is that intentional, and if so, why? I know the counter
> is used in mshv_vp_dispatch_thread_blocked(), but it's not clear why
> that means it shouldn't be shown here.
>
VpRootDispatchThreadBlocked is not really a 'stat' that you might want
to expose like the other values, it's really a boolean control value
that was tacked onto the vp stats page to facilitate fast interrupt
injection used by the root scheduler. As such it isn't of much value to
userspace.
>> +
>> + return 0;
>> +}
>
> This function, vp_stats_show(), seems like a candidate for redoing based on a
> static table that lists the counter names and index. Then the code just loops
> through the table. On x86 each VP_SEQ_PRINTF() generates 42 bytes of code,
> and there are 199 entries, so 8358 bytes. The table entries would probably
> be 16 bytes each (a 64-bit pointer to the string constant, a 32-bit index value,
> and 4 bytes of padding so each entry is 8-byte aligned). The actual space
> saving isn't that large, but the code would be a lot more compact. The
> other *_stats_shows() functions could do the same.
>
> It's distasteful to me to see 420 lines of enum entries in Patch 2 of this series,
> then followed by another 420 lines of matching *_SEQ_PRINTF entries. But I
> realize that the goal of the enum entries is to match the Windows code, so I
> guess it is what it is. But there's an argument for ditching the enum entries
> entirely, and using the putative static table to capture the information. It
> doesn't seem like matching the Windows code is saving much sync effort
> since any additions/ subtractions to the enum entries need to be matched
> with changes in the *_stats_show() functions, or in my putative static table.
> But I guess if Windows changed only the value for an enum entry without
> additions/subtractions, that would sync more easily.
>
Keeping the definitions as close to Windows code as possible is a high priority,
for consistency and hopefully partially automating that process in future. So,
I'm against throwing away the enum values. The downside of having to update
two code locations when adding a new enum member is fine by me.
I'm not against replacing this sequence of macros with a loop over a table like
the one you propose (in addition to keeping the enum values). That would save
some space as you point out above, but the impact is fairly minimal.
In terms of aesthetics the definition for a table will look very very similar to
the list of VP_SEQ_PRINTF() that are currently here. So all in all, I don't see
a strong reason to switch to a table, unless the space issue is more important
that I realize.
> I'm just throwing this out as a thought. You may prefer to keep everything
> "as is", in which case ignore my comment and I won't raise it again.
>
Thanks, feel free to follow up if you have further thoughts on this part, I'm
open to changing it if there's a reason. Right now it feels like mainly an
aesthetics/cleanliness argument and I'm not sure it's worth the effort.
>> +DEFINE_SHOW_ATTRIBUTE(vp_stats);
>> +
>> +static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index, void *stats_page_addr,
>> + enum hv_stats_area_type stats_area_type)
>> +{
>> + union hv_stats_object_identity identity = {
>> + .vp.partition_id = partition_id,
>> + .vp.vp_index = vp_index,
>> + .vp.stats_area_type = stats_area_type,
>> + };
>> + int err;
>> +
>> + err = hv_unmap_stats_page(HV_STATS_OBJECT_VP, stats_page_addr, &identity);
>> + if (err)
>> + pr_err("%s: failed to unmap partition %llu vp %u %s stats, err: %d\n",
>> + __func__, partition_id, vp_index,
>> + (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
>> + err);
>> +}
>> +
>> +static void *mshv_vp_stats_map(u64 partition_id, u32 vp_index,
>> + enum hv_stats_area_type stats_area_type)
>> +{
>> + union hv_stats_object_identity identity = {
>> + .vp.partition_id = partition_id,
>> + .vp.vp_index = vp_index,
>> + .vp.stats_area_type = stats_area_type,
>> + };
>> + void *stats;
>> + int err;
>> +
>> + err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity, &stats);
>> + if (err) {
>> + pr_err("%s: failed to map partition %llu vp %u %s stats, err: %d\n",
>> + __func__, partition_id, vp_index,
>> + (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
>> + err);
>> + return ERR_PTR(err);
>> + }
>> + return stats;
>> +}
>
> Presumably you've noticed that the functions mshv_vp_stats_map() and
> mshv_vp_stats_unmap() also exist in mshv_root_main.c. They are static
> functions in both places, so the compiler & linker do the right thing, but
> it sure does make things a bit more complex for human readers. The versions
> here follow a consistent pattern for (lp, vp, hv, partition), so maybe the ones
> in mshv_root_main.c could be renamed to avoid confusion?
>
Good point - this is being addressed in our internal tree but hasn't made it into
this patch set. I will consider squashing that into a later version of this set,
but for now I'm treating it as a future cleanup patch to send later.
>> +
>> +static int vp_debugfs_stats_create(u64 partition_id, u32 vp_index,
>> + struct dentry **vp_stats_ptr,
>> + struct dentry *parent)
>> +{
>> + struct dentry *dentry;
>> + struct hv_stats_page **pstats;
>> + int err;
>> +
>> + pstats = kcalloc(2, sizeof(struct hv_stats_page *), GFP_KERNEL_ACCOUNT);
>
> Open coding "2" as the first parameter makes assumptions about the values of
> HV_STATS_AREA_SELF and HV_STATS_AREA_PARENT. Should use
> HV_STATS_AREA_COUNT instead of "2" so that indexing into the array is certain
> to work.
>
Thanks, I'll chang it to use HV_STATS_AREA_COUNT.
>> + if (!pstats)
>> + return -ENOMEM;
>> +
>> + pstats[HV_STATS_AREA_SELF] = mshv_vp_stats_map(partition_id, vp_index,
>> + HV_STATS_AREA_SELF);
>> + if (IS_ERR(pstats[HV_STATS_AREA_SELF])) {
>> + err = PTR_ERR(pstats[HV_STATS_AREA_SELF]);
>> + goto cleanup;
>> + }
>> +
>> + /*
>> + * L1VH partition cannot access its vp stats in parent area.
>> + */
>> + if (is_l1vh_parent(partition_id)) {
>> + pstats[HV_STATS_AREA_PARENT] = pstats[HV_STATS_AREA_SELF];
>> + } else {
>> + pstats[HV_STATS_AREA_PARENT] = mshv_vp_stats_map(
>> + partition_id, vp_index, HV_STATS_AREA_PARENT);
>> + if (IS_ERR(pstats[HV_STATS_AREA_PARENT])) {
>> + err = PTR_ERR(pstats[HV_STATS_AREA_PARENT]);
>> + goto unmap_self;
>> + }
>> + if (!pstats[HV_STATS_AREA_PARENT])
>> + pstats[HV_STATS_AREA_PARENT] = pstats[HV_STATS_AREA_SELF];
>> + }
>> +
>> + dentry = debugfs_create_file("stats", 0400, parent,
>> + pstats, &vp_stats_fops);
>> + if (IS_ERR(dentry)) {
>> + err = PTR_ERR(dentry);
>> + goto unmap_vp_stats;
>> + }
>> +
>> + *vp_stats_ptr = dentry;
>> + return 0;
>> +
>> +unmap_vp_stats:
>> + if (pstats[HV_STATS_AREA_PARENT] != pstats[HV_STATS_AREA_SELF])
>> + mshv_vp_stats_unmap(partition_id, vp_index, pstats[HV_STATS_AREA_PARENT],
>> + HV_STATS_AREA_PARENT);
>> +unmap_self:
>> + mshv_vp_stats_unmap(partition_id, vp_index, pstats[HV_STATS_AREA_SELF],
>> + HV_STATS_AREA_SELF);
>> +cleanup:
>> + kfree(pstats);
>> + return err;
>> +}
>> +
>> +static void vp_debugfs_remove(u64 partition_id, u32 vp_index,
>> + struct dentry *vp_stats)
>> +{
>> + struct hv_stats_page **pstats = NULL;
>> + void *stats;
>> +
>> + pstats = vp_stats->d_inode->i_private;
>> + debugfs_remove_recursive(vp_stats->d_parent);
>> + if (pstats[HV_STATS_AREA_PARENT] != pstats[HV_STATS_AREA_SELF]) {
>> + stats = pstats[HV_STATS_AREA_PARENT];
>> + mshv_vp_stats_unmap(partition_id, vp_index, stats,
>> + HV_STATS_AREA_PARENT);
>> + }
>> +
>> + stats = pstats[HV_STATS_AREA_SELF];
>> + mshv_vp_stats_unmap(partition_id, vp_index, stats, HV_STATS_AREA_SELF);
>> +
>> + kfree(pstats);
>> +}
>> +
>> +static int vp_debugfs_create(u64 partition_id, u32 vp_index,
>> + struct dentry **vp_stats_ptr,
>> + struct dentry *parent)
>> +{
>> + struct dentry *vp_idx_dir;
>> + char vp_idx_str[U32_BUF_SZ];
>> + int err;
>> +
>> + sprintf(vp_idx_str, "%u", vp_index);
>> +
>> + vp_idx_dir = debugfs_create_dir(vp_idx_str, parent);
>> + if (IS_ERR(vp_idx_dir))
>> + return PTR_ERR(vp_idx_dir);
>> +
>> + err = vp_debugfs_stats_create(partition_id, vp_index, vp_stats_ptr,
>> + vp_idx_dir);
>> + if (err)
>> + goto remove_debugfs_vp_idx;
>> +
>> + return 0;
>> +
>> +remove_debugfs_vp_idx:
>> + debugfs_remove_recursive(vp_idx_dir);
>> + return err;
>> +}
>> +
>> +static int partition_stats_show(struct seq_file *m, void *v)
>> +{
>> + const struct hv_stats_page **pstats = m->private;
>> +
>> +#define PARTITION_SEQ_PRINTF(cnt) \
>> +do { \
>> + if (pstats[HV_STATS_AREA_SELF]->pt_cntrs[Partition##cnt]) \
>> + seq_printf(m, "%-30s: %llu\n", __stringify(cnt), \
>> + pstats[HV_STATS_AREA_SELF]->pt_cntrs[Partition##cnt]); \
>> + else \
>> + seq_printf(m, "%-30s: %llu\n", __stringify(cnt), \
>> + pstats[HV_STATS_AREA_PARENT]->pt_cntrs[Partition##cnt]); \
>> +} while (0)
>
> Same comment as for VP_SEQ_PRINTF.
>
Ack
>> +
>> + PARTITION_SEQ_PRINTF(VirtualProcessors);
>> + PARTITION_SEQ_PRINTF(TlbSize);
>> + PARTITION_SEQ_PRINTF(AddressSpaces);
>> + PARTITION_SEQ_PRINTF(DepositedPages);
>> + PARTITION_SEQ_PRINTF(GpaPages);
>> + PARTITION_SEQ_PRINTF(GpaSpaceModifications);
>> + PARTITION_SEQ_PRINTF(VirtualTlbFlushEntires);
>> + PARTITION_SEQ_PRINTF(RecommendedTlbSize);
>> + PARTITION_SEQ_PRINTF(GpaPages4K);
>> + PARTITION_SEQ_PRINTF(GpaPages2M);
>> + PARTITION_SEQ_PRINTF(GpaPages1G);
>> + PARTITION_SEQ_PRINTF(GpaPages512G);
>> + PARTITION_SEQ_PRINTF(DevicePages4K);
>> + PARTITION_SEQ_PRINTF(DevicePages2M);
>> + PARTITION_SEQ_PRINTF(DevicePages1G);
>> + PARTITION_SEQ_PRINTF(DevicePages512G);
>> + PARTITION_SEQ_PRINTF(AttachedDevices);
>> + PARTITION_SEQ_PRINTF(DeviceInterruptMappings);
>> + PARTITION_SEQ_PRINTF(IoTlbFlushes);
>> + PARTITION_SEQ_PRINTF(IoTlbFlushCost);
>> + PARTITION_SEQ_PRINTF(DeviceInterruptErrors);
>> + PARTITION_SEQ_PRINTF(DeviceDmaErrors);
>> + PARTITION_SEQ_PRINTF(DeviceInterruptThrottleEvents);
>> + PARTITION_SEQ_PRINTF(SkippedTimerTicks);
>> + PARTITION_SEQ_PRINTF(PartitionId);
>> +#if IS_ENABLED(CONFIG_X86_64)
>> + PARTITION_SEQ_PRINTF(NestedTlbSize);
>> + PARTITION_SEQ_PRINTF(RecommendedNestedTlbSize);
>> + PARTITION_SEQ_PRINTF(NestedTlbFreeListSize);
>> + PARTITION_SEQ_PRINTF(NestedTlbTrimmedPages);
>> + PARTITION_SEQ_PRINTF(PagesShattered);
>> + PARTITION_SEQ_PRINTF(PagesRecombined);
>> + PARTITION_SEQ_PRINTF(HwpRequestValue);
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> + PARTITION_SEQ_PRINTF(HwpRequestValue);
>> +#endif
>> +
>> + return 0;
>> +}
>> +DEFINE_SHOW_ATTRIBUTE(partition_stats);
>> +
>> +static void mshv_partition_stats_unmap(u64 partition_id, void *stats_page_addr,
>> + enum hv_stats_area_type stats_area_type)
>> +{
>> + union hv_stats_object_identity identity = {
>> + .partition.partition_id = partition_id,
>> + .partition.stats_area_type = stats_area_type,
>> + };
>> + int err;
>> +
>> + err = hv_unmap_stats_page(HV_STATS_OBJECT_PARTITION, stats_page_addr,
>> + &identity);
>> + if (err) {
>> + pr_err("%s: failed to unmap partition %lld %s stats, err: %d\n",
>> + __func__, partition_id,
>> + (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
>> + err);
>> + }
>> +}
>> +
>> +static void *mshv_partition_stats_map(u64 partition_id,
>> + enum hv_stats_area_type stats_area_type)
>> +{
>> + union hv_stats_object_identity identity = {
>> + .partition.partition_id = partition_id,
>> + .partition.stats_area_type = stats_area_type,
>> + };
>> + void *stats;
>> + int err;
>> +
>> + err = hv_map_stats_page(HV_STATS_OBJECT_PARTITION, &identity, &stats);
>> + if (err) {
>> + pr_err("%s: failed to map partition %lld %s stats, err: %d\n",
>> + __func__, partition_id,
>> + (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
>> + err);
>> + return ERR_PTR(err);
>> + }
>> + return stats;
>> +}
>> +
>> +static int mshv_debugfs_partition_stats_create(u64 partition_id,
>> + struct dentry **partition_stats_ptr,
>> + struct dentry *parent)
>> +{
>> + struct dentry *dentry;
>> + struct hv_stats_page **pstats;
>> + int err;
>> +
>> + pstats = kcalloc(2, sizeof(struct hv_stats_page *), GFP_KERNEL_ACCOUNT);
>
> Same comment here about the use of "2" as the first parameter.
>
Ack.
>> + if (!pstats)
>> + return -ENOMEM;
<snip>
Thanks for the comments, I appreciate the review!
Nuno
^ permalink raw reply
* Re: [PATCH v2 2/3] mshv: Add definitions for stats pages
From: Nuno Das Neves @ 2025-12-30 23:04 UTC (permalink / raw)
To: Michael Kelley, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, skinsburskii@linux.microsoft.com
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
prapal@linux.microsoft.com, mrathor@linux.microsoft.com,
paekkaladevi@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157729608C3B15BA9442591D4A2A@SN6PR02MB4157.namprd02.prod.outlook.com>
On 12/8/2025 7:13 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, December 5, 2025 10:59 AM
>>
>> Add the definitions for hypervisor, logical processor, and partition
>> stats pages.
>>
>> Move the definition for the VP stats page to its rightful place in
>> hvhdk.h, and add the missing members.
>>
>> These enum members retain their CamelCase style, since they are imported
>> directly from the hypervisor code They will be stringified when printing
>
> Missing a '.' (period) after "hypervisor code".
>
Ack
>> the stats out, and retain more readability in this form.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>> drivers/hv/mshv_root_main.c | 17 --
>> include/hyperv/hvhdk.h | 437 ++++++++++++++++++++++++++++++++++++
>> 2 files changed, 437 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>> index f59a4ab47685..19006b788e85 100644
>> --- a/drivers/hv/mshv_root_main.c
>> +++ b/drivers/hv/mshv_root_main.c
>> @@ -38,23 +38,6 @@ MODULE_AUTHOR("Microsoft");
>> MODULE_LICENSE("GPL");
>> MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface
>> /dev/mshv");
>>
>> -/* TODO move this to another file when debugfs code is added */
>> -enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
>> -#if defined(CONFIG_X86)
>> - VpRootDispatchThreadBlocked = 202,
>> -#elif defined(CONFIG_ARM64)
>> - VpRootDispatchThreadBlocked = 94,
>> -#endif
>> - VpStatsMaxCounter
>> -};
>> -
>> -struct hv_stats_page {
>> - union {
>> - u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
>> - u8 data[HV_HYP_PAGE_SIZE];
>> - };
>> -} __packed;
>> -
>> struct mshv_root mshv_root;
>>
>> enum hv_scheduler_type hv_scheduler_type;
>> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
>> index 469186df7826..51abbcd0ec37 100644
>> --- a/include/hyperv/hvhdk.h
>> +++ b/include/hyperv/hvhdk.h
>> @@ -10,6 +10,443 @@
>> #include "hvhdk_mini.h"
>> #include "hvgdk.h"
>>
>> +enum hv_stats_hypervisor_counters { /* HV_HYPERVISOR_COUNTER */
>> + HvLogicalProcessors = 1,
>> + HvPartitions = 2,
>> + HvTotalPages = 3,
>> + HvVirtualProcessors = 4,
>> + HvMonitoredNotifications = 5,
>> + HvModernStandbyEntries = 6,
>> + HvPlatformIdleTransitions = 7,
>> + HvHypervisorStartupCost = 8,
>> + HvIOSpacePages = 10,
>> + HvNonEssentialPagesForDump = 11,
>> + HvSubsumedPages = 12,
>> + HvStatsMaxCounter
>> +};
>> +
>> +enum hv_stats_partition_counters { /* HV_PROCESS_COUNTER */
>> + PartitionVirtualProcessors = 1,
>> + PartitionTlbSize = 3,
>> + PartitionAddressSpaces = 4,
>> + PartitionDepositedPages = 5,
>> + PartitionGpaPages = 6,
>> + PartitionGpaSpaceModifications = 7,
>> + PartitionVirtualTlbFlushEntires = 8,
>> + PartitionRecommendedTlbSize = 9,
>> + PartitionGpaPages4K = 10,
>> + PartitionGpaPages2M = 11,
>> + PartitionGpaPages1G = 12,
>> + PartitionGpaPages512G = 13,
>> + PartitionDevicePages4K = 14,
>> + PartitionDevicePages2M = 15,
>> + PartitionDevicePages1G = 16,
>> + PartitionDevicePages512G = 17,
>> + PartitionAttachedDevices = 18,
>> + PartitionDeviceInterruptMappings = 19,
>> + PartitionIoTlbFlushes = 20,
>> + PartitionIoTlbFlushCost = 21,
>> + PartitionDeviceInterruptErrors = 22,
>> + PartitionDeviceDmaErrors = 23,
>> + PartitionDeviceInterruptThrottleEvents = 24,
>> + PartitionSkippedTimerTicks = 25,
>> + PartitionPartitionId = 26,
>> +#if IS_ENABLED(CONFIG_X86_64)
>> + PartitionNestedTlbSize = 27,
>> + PartitionRecommendedNestedTlbSize = 28,
>> + PartitionNestedTlbFreeListSize = 29,
>> + PartitionNestedTlbTrimmedPages = 30,
>> + PartitionPagesShattered = 31,
>> + PartitionPagesRecombined = 32,
>> + PartitionHwpRequestValue = 33,
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> + PartitionHwpRequestValue = 27,
>> +#endif
>> + PartitionStatsMaxCounter
>> +};
>> +
>> +enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
>> + VpTotalRunTime = 1,
>> + VpHypervisorRunTime = 2,
>> + VpRemoteNodeRunTime = 3,
>> + VpNormalizedRunTime = 4,
>> + VpIdealCpu = 5,
>> + VpHypercallsCount = 7,
>> + VpHypercallsTime = 8,
>> +#if IS_ENABLED(CONFIG_X86_64)
>> + VpPageInvalidationsCount = 9,
>> + VpPageInvalidationsTime = 10,
>> + VpControlRegisterAccessesCount = 11,
>> + VpControlRegisterAccessesTime = 12,
>> + VpIoInstructionsCount = 13,
>> + VpIoInstructionsTime = 14,
>> + VpHltInstructionsCount = 15,
>> + VpHltInstructionsTime = 16,
>> + VpMwaitInstructionsCount = 17,
>> + VpMwaitInstructionsTime = 18,
>> + VpCpuidInstructionsCount = 19,
>> + VpCpuidInstructionsTime = 20,
>> + VpMsrAccessesCount = 21,
>> + VpMsrAccessesTime = 22,
>> + VpOtherInterceptsCount = 23,
>> + VpOtherInterceptsTime = 24,
>> + VpExternalInterruptsCount = 25,
>> + VpExternalInterruptsTime = 26,
>> + VpPendingInterruptsCount = 27,
>> + VpPendingInterruptsTime = 28,
>> + VpEmulatedInstructionsCount = 29,
>> + VpEmulatedInstructionsTime = 30,
>> + VpDebugRegisterAccessesCount = 31,
>> + VpDebugRegisterAccessesTime = 32,
>> + VpPageFaultInterceptsCount = 33,
>> + VpPageFaultInterceptsTime = 34,
>> + VpGuestPageTableMaps = 35,
>> + VpLargePageTlbFills = 36,
>> + VpSmallPageTlbFills = 37,
>> + VpReflectedGuestPageFaults = 38,
>> + VpApicMmioAccesses = 39,
>> + VpIoInterceptMessages = 40,
>> + VpMemoryInterceptMessages = 41,
>> + VpApicEoiAccesses = 42,
>> + VpOtherMessages = 43,
>> + VpPageTableAllocations = 44,
>> + VpLogicalProcessorMigrations = 45,
>> + VpAddressSpaceEvictions = 46,
>> + VpAddressSpaceSwitches = 47,
>> + VpAddressDomainFlushes = 48,
>> + VpAddressSpaceFlushes = 49,
>> + VpGlobalGvaRangeFlushes = 50,
>> + VpLocalGvaRangeFlushes = 51,
>> + VpPageTableEvictions = 52,
>> + VpPageTableReclamations = 53,
>> + VpPageTableResets = 54,
>> + VpPageTableValidations = 55,
>> + VpApicTprAccesses = 56,
>> + VpPageTableWriteIntercepts = 57,
>> + VpSyntheticInterrupts = 58,
>> + VpVirtualInterrupts = 59,
>> + VpApicIpisSent = 60,
>> + VpApicSelfIpisSent = 61,
>> + VpGpaSpaceHypercalls = 62,
>> + VpLogicalProcessorHypercalls = 63,
>> + VpLongSpinWaitHypercalls = 64,
>> + VpOtherHypercalls = 65,
>> + VpSyntheticInterruptHypercalls = 66,
>> + VpVirtualInterruptHypercalls = 67,
>> + VpVirtualMmuHypercalls = 68,
>> + VpVirtualProcessorHypercalls = 69,
>> + VpHardwareInterrupts = 70,
>> + VpNestedPageFaultInterceptsCount = 71,
>> + VpNestedPageFaultInterceptsTime = 72,
>> + VpPageScans = 73,
>> + VpLogicalProcessorDispatches = 74,
>> + VpWaitingForCpuTime = 75,
>> + VpExtendedHypercalls = 76,
>> + VpExtendedHypercallInterceptMessages = 77,
>> + VpMbecNestedPageTableSwitches = 78,
>> + VpOtherReflectedGuestExceptions = 79,
>> + VpGlobalIoTlbFlushes = 80,
>> + VpGlobalIoTlbFlushCost = 81,
>> + VpLocalIoTlbFlushes = 82,
>> + VpLocalIoTlbFlushCost = 83,
>> + VpHypercallsForwardedCount = 84,
>> + VpHypercallsForwardingTime = 85,
>> + VpPageInvalidationsForwardedCount = 86,
>> + VpPageInvalidationsForwardingTime = 87,
>> + VpControlRegisterAccessesForwardedCount = 88,
>> + VpControlRegisterAccessesForwardingTime = 89,
>> + VpIoInstructionsForwardedCount = 90,
>> + VpIoInstructionsForwardingTime = 91,
>> + VpHltInstructionsForwardedCount = 92,
>> + VpHltInstructionsForwardingTime = 93,
>> + VpMwaitInstructionsForwardedCount = 94,
>> + VpMwaitInstructionsForwardingTime = 95,
>> + VpCpuidInstructionsForwardedCount = 96,
>> + VpCpuidInstructionsForwardingTime = 97,
>> + VpMsrAccessesForwardedCount = 98,
>> + VpMsrAccessesForwardingTime = 99,
>> + VpOtherInterceptsForwardedCount = 100,
>> + VpOtherInterceptsForwardingTime = 101,
>> + VpExternalInterruptsForwardedCount = 102,
>> + VpExternalInterruptsForwardingTime = 103,
>> + VpPendingInterruptsForwardedCount = 104,
>> + VpPendingInterruptsForwardingTime = 105,
>> + VpEmulatedInstructionsForwardedCount = 106,
>> + VpEmulatedInstructionsForwardingTime = 107,
>> + VpDebugRegisterAccessesForwardedCount = 108,
>> + VpDebugRegisterAccessesForwardingTime = 109,
>> + VpPageFaultInterceptsForwardedCount = 110,
>> + VpPageFaultInterceptsForwardingTime = 111,
>> + VpVmclearEmulationCount = 112,
>> + VpVmclearEmulationTime = 113,
>> + VpVmptrldEmulationCount = 114,
>> + VpVmptrldEmulationTime = 115,
>> + VpVmptrstEmulationCount = 116,
>> + VpVmptrstEmulationTime = 117,
>> + VpVmreadEmulationCount = 118,
>> + VpVmreadEmulationTime = 119,
>> + VpVmwriteEmulationCount = 120,
>> + VpVmwriteEmulationTime = 121,
>> + VpVmxoffEmulationCount = 122,
>> + VpVmxoffEmulationTime = 123,
>> + VpVmxonEmulationCount = 124,
>> + VpVmxonEmulationTime = 125,
>> + VpNestedVMEntriesCount = 126,
>> + VpNestedVMEntriesTime = 127,
>> + VpNestedSLATSoftPageFaultsCount = 128,
>> + VpNestedSLATSoftPageFaultsTime = 129,
>> + VpNestedSLATHardPageFaultsCount = 130,
>> + VpNestedSLATHardPageFaultsTime = 131,
>> + VpInvEptAllContextEmulationCount = 132,
>> + VpInvEptAllContextEmulationTime = 133,
>> + VpInvEptSingleContextEmulationCount = 134,
>> + VpInvEptSingleContextEmulationTime = 135,
>> + VpInvVpidAllContextEmulationCount = 136,
>> + VpInvVpidAllContextEmulationTime = 137,
>> + VpInvVpidSingleContextEmulationCount = 138,
>> + VpInvVpidSingleContextEmulationTime = 139,
>> + VpInvVpidSingleAddressEmulationCount = 140,
>> + VpInvVpidSingleAddressEmulationTime = 141,
>> + VpNestedTlbPageTableReclamations = 142,
>> + VpNestedTlbPageTableEvictions = 143,
>> + VpFlushGuestPhysicalAddressSpaceHypercalls = 144,
>> + VpFlushGuestPhysicalAddressListHypercalls = 145,
>> + VpPostedInterruptNotifications = 146,
>> + VpPostedInterruptScans = 147,
>> + VpTotalCoreRunTime = 148,
>> + VpMaximumRunTime = 149,
>> + VpHwpRequestContextSwitches = 150,
>> + VpWaitingForCpuTimeBucket0 = 151,
>> + VpWaitingForCpuTimeBucket1 = 152,
>> + VpWaitingForCpuTimeBucket2 = 153,
>> + VpWaitingForCpuTimeBucket3 = 154,
>> + VpWaitingForCpuTimeBucket4 = 155,
>> + VpWaitingForCpuTimeBucket5 = 156,
>> + VpWaitingForCpuTimeBucket6 = 157,
>> + VpVmloadEmulationCount = 158,
>> + VpVmloadEmulationTime = 159,
>> + VpVmsaveEmulationCount = 160,
>> + VpVmsaveEmulationTime = 161,
>> + VpGifInstructionEmulationCount = 162,
>> + VpGifInstructionEmulationTime = 163,
>> + VpEmulatedErrataSvmInstructions = 164,
>> + VpPlaceholder1 = 165,
>> + VpPlaceholder2 = 166,
>> + VpPlaceholder3 = 167,
>> + VpPlaceholder4 = 168,
>> + VpPlaceholder5 = 169,
>> + VpPlaceholder6 = 170,
>> + VpPlaceholder7 = 171,
>> + VpPlaceholder8 = 172,
>> + VpPlaceholder9 = 173,
>> + VpPlaceholder10 = 174,
>> + VpSchedulingPriority = 175,
>> + VpRdpmcInstructionsCount = 176,
>> + VpRdpmcInstructionsTime = 177,
>> + VpPerfmonPmuMsrAccessesCount = 178,
>> + VpPerfmonLbrMsrAccessesCount = 179,
>> + VpPerfmonIptMsrAccessesCount = 180,
>> + VpPerfmonInterruptCount = 181,
>> + VpVtl1DispatchCount = 182,
>> + VpVtl2DispatchCount = 183,
>> + VpVtl2DispatchBucket0 = 184,
>> + VpVtl2DispatchBucket1 = 185,
>> + VpVtl2DispatchBucket2 = 186,
>> + VpVtl2DispatchBucket3 = 187,
>> + VpVtl2DispatchBucket4 = 188,
>> + VpVtl2DispatchBucket5 = 189,
>> + VpVtl2DispatchBucket6 = 190,
>> + VpVtl1RunTime = 191,
>> + VpVtl2RunTime = 192,
>> + VpIommuHypercalls = 193,
>> + VpCpuGroupHypercalls = 194,
>> + VpVsmHypercalls = 195,
>> + VpEventLogHypercalls = 196,
>> + VpDeviceDomainHypercalls = 197,
>> + VpDepositHypercalls = 198,
>> + VpSvmHypercalls = 199,
>> + VpBusLockAcquisitionCount = 200,
>> + VpUnused = 201,
>> + VpRootDispatchThreadBlocked = 202,
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> + VpSysRegAccessesCount = 9,
>> + VpSysRegAccessesTime = 10,
>> + VpSmcInstructionsCount = 11,
>> + VpSmcInstructionsTime = 12,
>> + VpOtherInterceptsCount = 13,
>> + VpOtherInterceptsTime = 14,
>> + VpExternalInterruptsCount = 15,
>> + VpExternalInterruptsTime = 16,
>> + VpPendingInterruptsCount = 17,
>> + VpPendingInterruptsTime = 18,
>> + VpGuestPageTableMaps = 19,
>> + VpLargePageTlbFills = 20,
>> + VpSmallPageTlbFills = 21,
>> + VpReflectedGuestPageFaults = 22,
>> + VpMemoryInterceptMessages = 23,
>> + VpOtherMessages = 24,
>> + VpLogicalProcessorMigrations = 25,
>> + VpAddressDomainFlushes = 26,
>> + VpAddressSpaceFlushes = 27,
>> + VpSyntheticInterrupts = 28,
>> + VpVirtualInterrupts = 29,
>> + VpApicSelfIpisSent = 30,
>> + VpGpaSpaceHypercalls = 31,
>> + VpLogicalProcessorHypercalls = 32,
>> + VpLongSpinWaitHypercalls = 33,
>> + VpOtherHypercalls = 34,
>> + VpSyntheticInterruptHypercalls = 35,
>> + VpVirtualInterruptHypercalls = 36,
>> + VpVirtualMmuHypercalls = 37,
>> + VpVirtualProcessorHypercalls = 38,
>> + VpHardwareInterrupts = 39,
>> + VpNestedPageFaultInterceptsCount = 40,
>> + VpNestedPageFaultInterceptsTime = 41,
>> + VpLogicalProcessorDispatches = 42,
>> + VpWaitingForCpuTime = 43,
>> + VpExtendedHypercalls = 44,
>> + VpExtendedHypercallInterceptMessages = 45,
>> + VpMbecNestedPageTableSwitches = 46,
>> + VpOtherReflectedGuestExceptions = 47,
>> + VpGlobalIoTlbFlushes = 48,
>> + VpGlobalIoTlbFlushCost = 49,
>> + VpLocalIoTlbFlushes = 50,
>> + VpLocalIoTlbFlushCost = 51,
>> + VpFlushGuestPhysicalAddressSpaceHypercalls = 52,
>> + VpFlushGuestPhysicalAddressListHypercalls = 53,
>> + VpPostedInterruptNotifications = 54,
>> + VpPostedInterruptScans = 55,
>> + VpTotalCoreRunTime = 56,
>> + VpMaximumRunTime = 57,
>> + VpWaitingForCpuTimeBucket0 = 58,
>> + VpWaitingForCpuTimeBucket1 = 59,
>> + VpWaitingForCpuTimeBucket2 = 60,
>> + VpWaitingForCpuTimeBucket3 = 61,
>> + VpWaitingForCpuTimeBucket4 = 62,
>> + VpWaitingForCpuTimeBucket5 = 63,
>> + VpWaitingForCpuTimeBucket6 = 64,
>> + VpHwpRequestContextSwitches = 65,
>> + VpPlaceholder2 = 66,
>> + VpPlaceholder3 = 67,
>> + VpPlaceholder4 = 68,
>> + VpPlaceholder5 = 69,
>> + VpPlaceholder6 = 70,
>> + VpPlaceholder7 = 71,
>> + VpPlaceholder8 = 72,
>> + VpContentionTime = 73,
>> + VpWakeUpTime = 74,
>> + VpSchedulingPriority = 75,
>> + VpVtl1DispatchCount = 76,
>> + VpVtl2DispatchCount = 77,
>> + VpVtl2DispatchBucket0 = 78,
>> + VpVtl2DispatchBucket1 = 79,
>> + VpVtl2DispatchBucket2 = 80,
>> + VpVtl2DispatchBucket3 = 81,
>> + VpVtl2DispatchBucket4 = 82,
>> + VpVtl2DispatchBucket5 = 83,
>> + VpVtl2DispatchBucket6 = 84,
>> + VpVtl1RunTime = 85,
>> + VpVtl2RunTime = 86,
>> + VpIommuHypercalls = 87,
>> + VpCpuGroupHypercalls = 88,
>> + VpVsmHypercalls = 89,
>> + VpEventLogHypercalls = 90,
>> + VpDeviceDomainHypercalls = 91,
>> + VpDepositHypercalls = 92,
>> + VpSvmHypercalls = 93,
>> + VpLoadAvg = 94,
>> + VpRootDispatchThreadBlocked = 95,
>
> In current code, VpRootDispatchThreadBlocked on ARM64 is 94. Is that an
> error that is being corrected by this patch?
>
Hmm, I didn't realize this changed - 95 is the correct value. However,
the mshv driver does not yet support on ARM64, so this fix doesn't
have any impact right now. Do you suggest a separate patch to fix it?
>> +#endif
>> + VpStatsMaxCounter
>> +};
>> +
>> +enum hv_stats_lp_counters { /* HV_CPU_COUNTER */
>> + LpGlobalTime = 1,
>> + LpTotalRunTime = 2,
>> + LpHypervisorRunTime = 3,
>> + LpHardwareInterrupts = 4,
>> + LpContextSwitches = 5,
>> + LpInterProcessorInterrupts = 6,
>> + LpSchedulerInterrupts = 7,
>> + LpTimerInterrupts = 8,
>> + LpInterProcessorInterruptsSent = 9,
>> + LpProcessorHalts = 10,
>> + LpMonitorTransitionCost = 11,
>> + LpContextSwitchTime = 12,
>> + LpC1TransitionsCount = 13,
>> + LpC1RunTime = 14,
>> + LpC2TransitionsCount = 15,
>> + LpC2RunTime = 16,
>> + LpC3TransitionsCount = 17,
>> + LpC3RunTime = 18,
>> + LpRootVpIndex = 19,
>> + LpIdleSequenceNumber = 20,
>> + LpGlobalTscCount = 21,
>> + LpActiveTscCount = 22,
>> + LpIdleAccumulation = 23,
>> + LpReferenceCycleCount0 = 24,
>> + LpActualCycleCount0 = 25,
>> + LpReferenceCycleCount1 = 26,
>> + LpActualCycleCount1 = 27,
>> + LpProximityDomainId = 28,
>> + LpPostedInterruptNotifications = 29,
>> + LpBranchPredictorFlushes = 30,
>> +#if IS_ENABLED(CONFIG_X86_64)
>> + LpL1DataCacheFlushes = 31,
>> + LpImmediateL1DataCacheFlushes = 32,
>> + LpMbFlushes = 33,
>> + LpCounterRefreshSequenceNumber = 34,
>> + LpCounterRefreshReferenceTime = 35,
>> + LpIdleAccumulationSnapshot = 36,
>> + LpActiveTscCountSnapshot = 37,
>> + LpHwpRequestContextSwitches = 38,
>> + LpPlaceholder1 = 39,
>> + LpPlaceholder2 = 40,
>> + LpPlaceholder3 = 41,
>> + LpPlaceholder4 = 42,
>> + LpPlaceholder5 = 43,
>> + LpPlaceholder6 = 44,
>> + LpPlaceholder7 = 45,
>> + LpPlaceholder8 = 46,
>> + LpPlaceholder9 = 47,
>> + LpPlaceholder10 = 48,
>> + LpReserveGroupId = 49,
>> + LpRunningPriority = 50,
>> + LpPerfmonInterruptCount = 51,
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> + LpCounterRefreshSequenceNumber = 31,
>> + LpCounterRefreshReferenceTime = 32,
>> + LpIdleAccumulationSnapshot = 33,
>> + LpActiveTscCountSnapshot = 34,
>> + LpHwpRequestContextSwitches = 35,
>> + LpPlaceholder2 = 36,
>> + LpPlaceholder3 = 37,
>> + LpPlaceholder4 = 38,
>> + LpPlaceholder5 = 39,
>> + LpPlaceholder6 = 40,
>> + LpPlaceholder7 = 41,
>> + LpPlaceholder8 = 42,
>> + LpPlaceholder9 = 43,
>> + LpSchLocalRunListSize = 44,
>> + LpReserveGroupId = 45,
>> + LpRunningPriority = 46,
>> +#endif
>> + LpStatsMaxCounter
>> +};
>> +
>> +/*
>> + * Hypervisor statsitics page format
>
> s/statsitics/statistics/
>
Ack, thanks
>> + */
>> +struct hv_stats_page {
>> + union {
>> + u64 hv_cntrs[HvStatsMaxCounter]; /* Hypervisor counters
>> */
>> + u64 pt_cntrs[PartitionStatsMaxCounter]; /* Partition
>> counters */
>> + u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
>> + u64 lp_cntrs[LpStatsMaxCounter]; /* LP counters */
>> + u8 data[HV_HYP_PAGE_SIZE];
>> + };
>> +} __packed;
>> +
>> /* Bits for dirty mask of hv_vp_register_page */
>> #define HV_X64_REGISTER_CLASS_GENERAL 0
>> #define HV_X64_REGISTER_CLASS_IP 1
>> --
>> 2.34.1
^ permalink raw reply
* [PATCH v2 8/8] KVM: SVM: Assert that Hyper-V's HV_SVM_EXITCODE_ENL == SVM_EXIT_SW
From: Sean Christopherson @ 2025-12-30 21:13 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-1-seanjc@google.com>
Add a build-time assertiont that Hyper-V's "enlightened" exit code is that,
same as the AMD-defined "Reserved for Host" exit code, mostly to help
readers connect the dots and understand why synthesizing a software-defined
exit code is safe/ok.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/svm/hyperv.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/svm/hyperv.c b/arch/x86/kvm/svm/hyperv.c
index 3ec580d687f5..4f24dcb45116 100644
--- a/arch/x86/kvm/svm/hyperv.c
+++ b/arch/x86/kvm/svm/hyperv.c
@@ -10,6 +10,12 @@ void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ /*
+ * The exit code used by Hyper-V for software-defined exits is reserved
+ * by AMD specifically for such use cases.
+ */
+ BUILD_BUG_ON(HV_SVM_EXITCODE_ENL != SVM_EXIT_SW);
+
svm->vmcb->control.exit_code = HV_SVM_EXITCODE_ENL;
svm->vmcb->control.exit_info_1 = HV_SVM_ENL_EXITCODE_TRAP_AFTER_FLUSH;
svm->vmcb->control.exit_info_2 = 0;
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply related
* [PATCH v2 7/8] KVM: SVM: Harden exit_code against being used in Spectre-like attacks
From: Sean Christopherson @ 2025-12-30 21:13 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-1-seanjc@google.com>
Explicitly clamp the exit code used to index KVM's exit handlers to guard
against Spectre-like attacks, mainly to provide consistency between VMX
and SVM (VMX was given the same treatment by commit c926f2f7230b ("KVM:
x86: Protect exit_reason from being used in Spectre-v1/L1TF attacks").
For normal VMs, it's _extremely_ unlikely the exit code could be used to
exploit a speculation vulnerability, as the exit code is set by hardware
and unexpected/unknown exit codes should be quite well bounded (as is/was
the case with VMX). But with SEV-ES+, the exit code is guest-controlled
as it comes from the GHCB, not from hardware, i.e. an attack from the
guest is at least somewhat plausible.
Irrespective of SEV-ES+, hardening KVM is easy and inexpensive, and such
an attack is theoretically possible.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/svm/svm.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b97e6763839b..a75cd832e194 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3477,6 +3477,7 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 __exit_code)
if (exit_code >= ARRAY_SIZE(svm_exit_handlers))
goto unexpected_vmexit;
+ exit_code = array_index_nospec(exit_code, ARRAY_SIZE(svm_exit_handlers));
if (!svm_exit_handlers[exit_code])
goto unexpected_vmexit;
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply related
* [PATCH v2 6/8] KVM: SVM: Limit incorrect check on SVM_EXIT_ERR to running as a VM
From: Sean Christopherson @ 2025-12-30 21:13 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-1-seanjc@google.com>
Limit KVM's incorrect check for VMXEXIT_INVALID, a.k.a. SVM_EXIT_ERR, to
running as a VM, as detected by X86_FEATURE_HYPERVISOR. The exit_code and
all failure codes, e.g. VMXEXIT_INVALID, are 64-bit values, and so checking
only bits 31:0 could result in false positives when running on non-broken
hardware, e.g. in the extremely unlikely scenario exit code 0xffffffffull
is ever generated by hardware.
Keep the 32-bit check to play nice with running on broken KVM (for years,
KVM has not set bits 63:32 when synthesizing nested SVM VM-Exits).
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/svm/svm.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 105f1394026e..f938679c0231 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -425,7 +425,10 @@ static __always_inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
static inline bool svm_is_vmrun_failure(u64 exit_code)
{
- return (u32)exit_code == (u32)SVM_EXIT_ERR;
+ if (cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
+ return (u32)exit_code == (u32)SVM_EXIT_ERR;
+
+ return exit_code == SVM_EXIT_ERR;
}
/*
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply related
* [PATCH v2 5/8] KVM: SVM: Treat exit_code as an unsigned 64-bit value through all of KVM
From: Sean Christopherson @ 2025-12-30 21:13 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-1-seanjc@google.com>
Fix KVM's long-standing buggy handling of SVM's exit_code as a 32-bit
value. Per the APM and Xen commit d1bd157fbc ("Big merge the HVM
full-virtualisation abstractions.") (which is arguably more trustworthy
than KVM), offset 0x70 is a single 64-bit value:
070h 63:0 EXITCODE
Track exit_code as a single u64 to prevent reintroducing bugs where KVM
neglects to correctly set bits 63:32.
Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface")
Cc: Jim Mattson <jmattson@google.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/svm.h | 3 +-
arch/x86/include/uapi/asm/svm.h | 32 ++++++++---------
arch/x86/kvm/svm/hyperv.c | 1 -
arch/x86/kvm/svm/nested.c | 13 ++-----
arch/x86/kvm/svm/sev.c | 36 +++++++------------
arch/x86/kvm/svm/svm.c | 7 ++--
arch/x86/kvm/svm/svm.h | 4 +--
arch/x86/kvm/trace.h | 6 ++--
include/hyperv/hvgdk.h | 2 +-
tools/testing/selftests/kvm/include/x86/svm.h | 3 +-
.../kvm/x86/svm_nested_soft_inject_test.c | 4 +--
11 files changed, 42 insertions(+), 69 deletions(-)
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 56aa99503dc4..ba28ff531b60 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -136,8 +136,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u32 int_vector;
u32 int_state;
u8 reserved_3[4];
- u32 exit_code;
- u32 exit_code_hi;
+ u64 exit_code;
u64 exit_info_1;
u64 exit_info_2;
u32 exit_int_info;
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 650e3256ea7d..010a45c9f614 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -103,38 +103,38 @@
#define SVM_EXIT_VMGEXIT 0x403
/* SEV-ES software-defined VMGEXIT events */
-#define SVM_VMGEXIT_MMIO_READ 0x80000001
-#define SVM_VMGEXIT_MMIO_WRITE 0x80000002
-#define SVM_VMGEXIT_NMI_COMPLETE 0x80000003
-#define SVM_VMGEXIT_AP_HLT_LOOP 0x80000004
-#define SVM_VMGEXIT_AP_JUMP_TABLE 0x80000005
+#define SVM_VMGEXIT_MMIO_READ 0x80000001ull
+#define SVM_VMGEXIT_MMIO_WRITE 0x80000002ull
+#define SVM_VMGEXIT_NMI_COMPLETE 0x80000003ull
+#define SVM_VMGEXIT_AP_HLT_LOOP 0x80000004ull
+#define SVM_VMGEXIT_AP_JUMP_TABLE 0x80000005ull
#define SVM_VMGEXIT_SET_AP_JUMP_TABLE 0
#define SVM_VMGEXIT_GET_AP_JUMP_TABLE 1
-#define SVM_VMGEXIT_PSC 0x80000010
-#define SVM_VMGEXIT_GUEST_REQUEST 0x80000011
-#define SVM_VMGEXIT_EXT_GUEST_REQUEST 0x80000012
-#define SVM_VMGEXIT_AP_CREATION 0x80000013
+#define SVM_VMGEXIT_PSC 0x80000010ull
+#define SVM_VMGEXIT_GUEST_REQUEST 0x80000011ull
+#define SVM_VMGEXIT_EXT_GUEST_REQUEST 0x80000012ull
+#define SVM_VMGEXIT_AP_CREATION 0x80000013ull
#define SVM_VMGEXIT_AP_CREATE_ON_INIT 0
#define SVM_VMGEXIT_AP_CREATE 1
#define SVM_VMGEXIT_AP_DESTROY 2
-#define SVM_VMGEXIT_SNP_RUN_VMPL 0x80000018
-#define SVM_VMGEXIT_SAVIC 0x8000001a
+#define SVM_VMGEXIT_SNP_RUN_VMPL 0x80000018ull
+#define SVM_VMGEXIT_SAVIC 0x8000001aull
#define SVM_VMGEXIT_SAVIC_REGISTER_GPA 0
#define SVM_VMGEXIT_SAVIC_UNREGISTER_GPA 1
#define SVM_VMGEXIT_SAVIC_SELF_GPA ~0ULL
-#define SVM_VMGEXIT_HV_FEATURES 0x8000fffd
-#define SVM_VMGEXIT_TERM_REQUEST 0x8000fffe
+#define SVM_VMGEXIT_HV_FEATURES 0x8000fffdull
+#define SVM_VMGEXIT_TERM_REQUEST 0x8000fffeull
#define SVM_VMGEXIT_TERM_REASON(reason_set, reason_code) \
/* SW_EXITINFO1[3:0] */ \
(((((u64)reason_set) & 0xf)) | \
/* SW_EXITINFO1[11:4] */ \
((((u64)reason_code) & 0xff) << 4))
-#define SVM_VMGEXIT_UNSUPPORTED_EVENT 0x8000ffff
+#define SVM_VMGEXIT_UNSUPPORTED_EVENT 0x8000ffffull
/* Exit code reserved for hypervisor/software use */
-#define SVM_EXIT_SW 0xf0000000
+#define SVM_EXIT_SW 0xf0000000ull
-#define SVM_EXIT_ERR -1
+#define SVM_EXIT_ERR -1ull
#define SVM_EXIT_REASONS \
{ SVM_EXIT_READ_CR0, "read_cr0" }, \
diff --git a/arch/x86/kvm/svm/hyperv.c b/arch/x86/kvm/svm/hyperv.c
index 088f6429b24c..3ec580d687f5 100644
--- a/arch/x86/kvm/svm/hyperv.c
+++ b/arch/x86/kvm/svm/hyperv.c
@@ -11,7 +11,6 @@ void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu)
struct vcpu_svm *svm = to_svm(vcpu);
svm->vmcb->control.exit_code = HV_SVM_EXITCODE_ENL;
- svm->vmcb->control.exit_code_hi = 0;
svm->vmcb->control.exit_info_1 = HV_SVM_ENL_EXITCODE_TRAP_AFTER_FLUSH;
svm->vmcb->control.exit_info_2 = 0;
nested_svm_vmexit(svm);
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index f5bde972a2b1..18b656508de9 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -45,7 +45,6 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
* correctly fill in the high bits of exit_info_1.
*/
vmcb->control.exit_code = SVM_EXIT_NPF;
- vmcb->control.exit_code_hi = 0;
vmcb->control.exit_info_1 = (1ULL << 32);
vmcb->control.exit_info_2 = fault->address;
}
@@ -421,7 +420,6 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
to->int_vector = from->int_vector;
to->int_state = from->int_state;
to->exit_code = from->exit_code;
- to->exit_code_hi = from->exit_code_hi;
to->exit_info_1 = from->exit_info_1;
to->exit_info_2 = from->exit_info_2;
to->exit_int_info = from->exit_int_info;
@@ -727,8 +725,8 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
enter_guest_mode(vcpu);
/*
- * Filled at exit: exit_code, exit_code_hi, exit_info_1, exit_info_2,
- * exit_int_info, exit_int_info_err, next_rip, insn_len, insn_bytes.
+ * Filled at exit: exit_code, exit_info_1, exit_info_2, exit_int_info,
+ * exit_int_info_err, next_rip, insn_len, insn_bytes.
*/
if (guest_cpu_cap_has(vcpu, X86_FEATURE_VGIF) &&
@@ -985,7 +983,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
if (!nested_vmcb_check_save(vcpu) ||
!nested_vmcb_check_controls(vcpu)) {
vmcb12->control.exit_code = SVM_EXIT_ERR;
- vmcb12->control.exit_code_hi = -1u;
vmcb12->control.exit_info_1 = 0;
vmcb12->control.exit_info_2 = 0;
goto out;
@@ -1018,7 +1015,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
svm->soft_int_injected = false;
svm->vmcb->control.exit_code = SVM_EXIT_ERR;
- svm->vmcb->control.exit_code_hi = -1u;
svm->vmcb->control.exit_info_1 = 0;
svm->vmcb->control.exit_info_2 = 0;
@@ -1130,7 +1126,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
vmcb12->control.int_state = vmcb02->control.int_state;
vmcb12->control.exit_code = vmcb02->control.exit_code;
- vmcb12->control.exit_code_hi = vmcb02->control.exit_code_hi;
vmcb12->control.exit_info_1 = vmcb02->control.exit_info_1;
vmcb12->control.exit_info_2 = vmcb02->control.exit_info_2;
@@ -1422,7 +1417,7 @@ static int nested_svm_intercept_ioio(struct vcpu_svm *svm)
static int nested_svm_intercept(struct vcpu_svm *svm)
{
- u32 exit_code = svm->vmcb->control.exit_code;
+ u64 exit_code = svm->vmcb->control.exit_code;
int vmexit = NESTED_EXIT_HOST;
if (svm_is_vmrun_failure(exit_code))
@@ -1494,7 +1489,6 @@ static void nested_svm_inject_exception_vmexit(struct kvm_vcpu *vcpu)
struct vmcb *vmcb = svm->vmcb;
vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + ex->vector;
- vmcb->control.exit_code_hi = 0;
if (ex->has_error_code)
vmcb->control.exit_info_1 = ex->error_code;
@@ -1669,7 +1663,6 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
dst->int_vector = from->int_vector;
dst->int_state = from->int_state;
dst->exit_code = from->exit_code;
- dst->exit_code_hi = from->exit_code_hi;
dst->exit_info_1 = from->exit_info_1;
dst->exit_info_2 = from->exit_info_2;
dst->exit_int_info = from->exit_int_info;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index f59c65abe3cf..794640bebb62 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3275,11 +3275,6 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
kvfree(svm->sev_es.ghcb_sa);
}
-static u64 kvm_get_cached_sw_exit_code(struct vmcb_control_area *control)
-{
- return (((u64)control->exit_code_hi) << 32) | control->exit_code;
-}
-
static void dump_ghcb(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3301,7 +3296,7 @@ static void dump_ghcb(struct vcpu_svm *svm)
*/
pr_err("GHCB (GPA=%016llx) snapshot:\n", svm->vmcb->control.ghcb_gpa);
pr_err("%-20s%016llx is_valid: %u\n", "sw_exit_code",
- kvm_get_cached_sw_exit_code(control), kvm_ghcb_sw_exit_code_is_valid(svm));
+ control->exit_code, kvm_ghcb_sw_exit_code_is_valid(svm));
pr_err("%-20s%016llx is_valid: %u\n", "sw_exit_info_1",
control->exit_info_1, kvm_ghcb_sw_exit_info_1_is_valid(svm));
pr_err("%-20s%016llx is_valid: %u\n", "sw_exit_info_2",
@@ -3335,7 +3330,6 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
struct vmcb_control_area *control = &svm->vmcb->control;
struct kvm_vcpu *vcpu = &svm->vcpu;
struct ghcb *ghcb = svm->sev_es.ghcb;
- u64 exit_code;
/*
* The GHCB protocol so far allows for the following data
@@ -3369,9 +3363,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
__kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(svm));
/* Copy the GHCB exit information into the VMCB fields */
- exit_code = kvm_ghcb_get_sw_exit_code(svm);
- control->exit_code = lower_32_bits(exit_code);
- control->exit_code_hi = upper_32_bits(exit_code);
+ control->exit_code = kvm_ghcb_get_sw_exit_code(svm);
control->exit_info_1 = kvm_ghcb_get_sw_exit_info_1(svm);
control->exit_info_2 = kvm_ghcb_get_sw_exit_info_2(svm);
svm->sev_es.sw_scratch = kvm_ghcb_get_sw_scratch_if_valid(svm);
@@ -3384,15 +3376,8 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
struct kvm_vcpu *vcpu = &svm->vcpu;
- u64 exit_code;
u64 reason;
- /*
- * Retrieve the exit code now even though it may not be marked valid
- * as it could help with debugging.
- */
- exit_code = kvm_get_cached_sw_exit_code(control);
-
/* Only GHCB Usage code 0 is supported */
if (svm->sev_es.ghcb->ghcb_usage) {
reason = GHCB_ERR_INVALID_USAGE;
@@ -3406,7 +3391,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
!kvm_ghcb_sw_exit_info_2_is_valid(svm))
goto vmgexit_err;
- switch (exit_code) {
+ switch (control->exit_code) {
case SVM_EXIT_READ_DR7:
break;
case SVM_EXIT_WRITE_DR7:
@@ -3507,15 +3492,19 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
return 0;
vmgexit_err:
+ /*
+ * Print the exit code even though it may not be marked valid as it
+ * could help with debugging.
+ */
if (reason == GHCB_ERR_INVALID_USAGE) {
vcpu_unimpl(vcpu, "vmgexit: ghcb usage %#x is not valid\n",
svm->sev_es.ghcb->ghcb_usage);
} else if (reason == GHCB_ERR_INVALID_EVENT) {
vcpu_unimpl(vcpu, "vmgexit: exit code %#llx is not valid\n",
- exit_code);
+ control->exit_code);
} else {
vcpu_unimpl(vcpu, "vmgexit: exit code %#llx input is not valid\n",
- exit_code);
+ control->exit_code);
dump_ghcb(svm);
}
@@ -4354,7 +4343,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
struct vmcb_control_area *control = &svm->vmcb->control;
- u64 ghcb_gpa, exit_code;
+ u64 ghcb_gpa;
int ret;
/* Validate the GHCB */
@@ -4396,8 +4385,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
svm_vmgexit_success(svm, 0);
- exit_code = kvm_get_cached_sw_exit_code(control);
- switch (exit_code) {
+ switch (control->exit_code) {
case SVM_VMGEXIT_MMIO_READ:
ret = setup_vmgexit_scratch(svm, true, control->exit_info_2);
if (ret)
@@ -4489,7 +4477,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ret = -EINVAL;
break;
default:
- ret = svm_invoke_exit_handler(vcpu, exit_code);
+ ret = svm_invoke_exit_handler(vcpu, control->exit_code);
}
return ret;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1ffe922e95fd..b97e6763839b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2443,7 +2443,6 @@ static bool check_selective_cr0_intercepted(struct kvm_vcpu *vcpu,
if (cr0 ^ val) {
svm->vmcb->control.exit_code = SVM_EXIT_CR0_SEL_WRITE;
- svm->vmcb->control.exit_code_hi = 0;
ret = (nested_svm_exit_handled(svm) == NESTED_EXIT_DONE);
}
@@ -3275,7 +3274,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
pr_err("%-20s%08x\n", "int_state:", control->int_state);
- pr_err("%-20s%08x\n", "exit_code:", control->exit_code);
+ pr_err("%-20s%016llx\n", "exit_code:", control->exit_code);
pr_err("%-20s%016llx\n", "exit_info1:", control->exit_info_1);
pr_err("%-20s%016llx\n", "exit_info2:", control->exit_info_2);
pr_err("%-20s%08x\n", "exit_int_info:", control->exit_int_info);
@@ -3525,7 +3524,6 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
struct vcpu_svm *svm = to_svm(vcpu);
struct kvm_run *kvm_run = vcpu->run;
- u32 exit_code = svm->vmcb->control.exit_code;
/* SEV-ES guests must use the CR write traps to track CR registers. */
if (!sev_es_guest(vcpu->kvm)) {
@@ -3561,7 +3559,7 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
if (exit_fastpath != EXIT_FASTPATH_NONE)
return 1;
- return svm_invoke_exit_handler(vcpu, exit_code);
+ return svm_invoke_exit_handler(vcpu, svm->vmcb->control.exit_code);
}
static int pre_svm_run(struct kvm_vcpu *vcpu)
@@ -4627,7 +4625,6 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
if (static_cpu_has(X86_FEATURE_NRIPS))
vmcb->control.next_rip = info->next_rip;
vmcb->control.exit_code = icpt_info.exit_code;
- vmcb->control.exit_code_hi = 0;
vmexit = nested_svm_exit_handled(svm);
ret = (vmexit == NESTED_EXIT_DONE) ? X86EMUL_INTERCEPTED
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 0f006793f973..105f1394026e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -159,8 +159,7 @@ struct vmcb_ctrl_area_cached {
u32 int_ctl;
u32 int_vector;
u32 int_state;
- u32 exit_code;
- u32 exit_code_hi;
+ u64 exit_code;
u64 exit_info_1;
u64 exit_info_2;
u32 exit_int_info;
@@ -767,7 +766,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm);
static inline int nested_svm_simple_vmexit(struct vcpu_svm *svm, u32 exit_code)
{
svm->vmcb->control.exit_code = exit_code;
- svm->vmcb->control.exit_code_hi = 0;
svm->vmcb->control.exit_info_1 = 0;
svm->vmcb->control.exit_info_2 = 0;
return nested_svm_vmexit(svm);
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index e79bc9cb7162..e7fdbe9efc90 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -383,10 +383,10 @@ TRACE_EVENT(kvm_apic,
#define kvm_print_exit_reason(exit_reason, isa) \
(isa == KVM_ISA_VMX) ? \
__print_symbolic(exit_reason & 0xffff, VMX_EXIT_REASONS) : \
- __print_symbolic(exit_reason, SVM_EXIT_REASONS), \
+ __print_symbolic_u64(exit_reason, SVM_EXIT_REASONS), \
(isa == KVM_ISA_VMX && exit_reason & ~0xffff) ? " " : "", \
(isa == KVM_ISA_VMX) ? \
- __print_flags(exit_reason & ~0xffff, " ", VMX_EXIT_REASON_FLAGS) : ""
+ __print_flags_u64(exit_reason & ~0xffff, " ", VMX_EXIT_REASON_FLAGS) : ""
#define TRACE_EVENT_KVM_EXIT(name) \
TRACE_EVENT(name, \
@@ -781,7 +781,7 @@ TRACE_EVENT_KVM_EXIT(kvm_nested_vmexit);
* Tracepoint for #VMEXIT reinjected to the guest
*/
TRACE_EVENT(kvm_nested_vmexit_inject,
- TP_PROTO(__u32 exit_code,
+ TP_PROTO(__u64 exit_code,
__u64 exit_info1, __u64 exit_info2,
__u32 exit_int_info, __u32 exit_int_info_err, __u32 isa),
TP_ARGS(exit_code, exit_info1, exit_info2,
diff --git a/include/hyperv/hvgdk.h b/include/hyperv/hvgdk.h
index dd6d4939ea29..384c3f3ff4a5 100644
--- a/include/hyperv/hvgdk.h
+++ b/include/hyperv/hvgdk.h
@@ -281,7 +281,7 @@ struct hv_vmcb_enlightenments {
#define HV_VMCB_NESTED_ENLIGHTENMENTS 31
/* Synthetic VM-Exit */
-#define HV_SVM_EXITCODE_ENL 0xf0000000
+#define HV_SVM_EXITCODE_ENL 0xf0000000ull
#define HV_SVM_ENL_EXITCODE_TRAP_AFTER_FLUSH (1)
/* VM_PARTITION_ASSIST_PAGE */
diff --git a/tools/testing/selftests/kvm/include/x86/svm.h b/tools/testing/selftests/kvm/include/x86/svm.h
index 29cffd0a9181..10b30b38bb3f 100644
--- a/tools/testing/selftests/kvm/include/x86/svm.h
+++ b/tools/testing/selftests/kvm/include/x86/svm.h
@@ -92,8 +92,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u32 int_vector;
u32 int_state;
u8 reserved_3[4];
- u32 exit_code;
- u32 exit_code_hi;
+ u64 exit_code;
u64 exit_info_1;
u64 exit_info_2;
u32 exit_int_info;
diff --git a/tools/testing/selftests/kvm/x86/svm_nested_soft_inject_test.c b/tools/testing/selftests/kvm/x86/svm_nested_soft_inject_test.c
index 7b6481d6c0d3..4bd1655f9e6d 100644
--- a/tools/testing/selftests/kvm/x86/svm_nested_soft_inject_test.c
+++ b/tools/testing/selftests/kvm/x86/svm_nested_soft_inject_test.c
@@ -103,7 +103,7 @@ static void l1_guest_code(struct svm_test_data *svm, uint64_t is_nmi, uint64_t i
run_guest(vmcb, svm->vmcb_gpa);
__GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL,
- "Expected VMMCAL #VMEXIT, got '0x%x', info1 = '0x%lx, info2 = '0x%lx'",
+ "Expected VMMCAL #VMEXIT, got '0x%lx', info1 = '0x%lx, info2 = '0x%lx'",
vmcb->control.exit_code,
vmcb->control.exit_info_1, vmcb->control.exit_info_2);
@@ -133,7 +133,7 @@ static void l1_guest_code(struct svm_test_data *svm, uint64_t is_nmi, uint64_t i
run_guest(vmcb, svm->vmcb_gpa);
__GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_HLT,
- "Expected HLT #VMEXIT, got '0x%x', info1 = '0x%lx, info2 = '0x%lx'",
+ "Expected HLT #VMEXIT, got '0x%lx', info1 = '0x%lx, info2 = '0x%lx'",
vmcb->control.exit_code,
vmcb->control.exit_info_1, vmcb->control.exit_info_2);
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply related
* [PATCH v2 4/8] KVM: SVM: Filter out 64-bit exit codes when invoking exit handlers on bare metal
From: Sean Christopherson @ 2025-12-30 21:13 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-1-seanjc@google.com>
Explicitly filter out 64-bit exit codes when invoking exit handlers, as
svm_exit_handlers[] will never be sized with entries that use bits 63:32.
Processing the non-failing exit code as a 32-bit value will allow tracking
exit_code as a single 64-bit value (which it is, architecturally). This
will also allow hardening KVM against Spectre-like attacks without needing
to do silly things to avoid build failures on 32-bit kernels
(array_index_nospec() rightly asserts that the index fits in an "unsigned
long").
Omit the check when running as a VM, as KVM has historically failed to set
bits 63:32 appropriately when synthesizing VM-Exits, i.e. KVM could get
false positives when running as a VM on an older, broken KVM/kernel. From
a functional perspective, omitting the check is "fine", as any unwanted
collision between e.g. VMEXIT_INVALID and a 32-bit exit code will be
fatal to KVM-on-KVM regardless of what KVM-as-L1 does.
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/svm/svm.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e24bedf1fc81..1ffe922e95fd 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3443,8 +3443,22 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
sev_free_decrypted_vmsa(vcpu, save);
}
-int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
+int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 __exit_code)
{
+ u32 exit_code = __exit_code;
+
+ /*
+ * SVM uses negative values, i.e. 64-bit values, to indicate that VMRUN
+ * failed. Report all such errors to userspace (note, VMEXIT_INVALID,
+ * a.k.a. SVM_EXIT_ERR, is special cased by svm_handle_exit()). Skip
+ * the check when running as a VM, as KVM has historically left garbage
+ * in bits 63:32, i.e. running KVM-on-KVM would hit false positives if
+ * the underlying kernel is buggy.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR) &&
+ (u64)exit_code != __exit_code)
+ goto unexpected_vmexit;
+
#ifdef CONFIG_MITIGATION_RETPOLINE
if (exit_code == SVM_EXIT_MSR)
return msr_interception(vcpu);
@@ -3471,7 +3485,7 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
unexpected_vmexit:
dump_vmcb(vcpu);
- kvm_prepare_unexpected_reason_exit(vcpu, exit_code);
+ kvm_prepare_unexpected_reason_exit(vcpu, __exit_code);
return 0;
}
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply related
* [PATCH v2 3/8] KVM: SVM: Check for an unexpected VM-Exit after RETPOLINE "fast" handling
From: Sean Christopherson @ 2025-12-30 21:13 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-1-seanjc@google.com>
Check for an unexpected/unhandled VM-Exit after the manual RETPOLINE=y
handling. The entire point of the RETPOLINE checks is to optimize for
common VM-Exits, i.e. checking for the rare case of an unsupported
VM-Exit is counter-productive. This also aligns SVM and VMX exit handling.
No functional change intended.
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/svm/svm.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index a523011f0923..e24bedf1fc81 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3445,12 +3445,6 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
{
- if (exit_code >= ARRAY_SIZE(svm_exit_handlers))
- goto unexpected_vmexit;
-
- if (!svm_exit_handlers[exit_code])
- goto unexpected_vmexit;
-
#ifdef CONFIG_MITIGATION_RETPOLINE
if (exit_code == SVM_EXIT_MSR)
return msr_interception(vcpu);
@@ -3467,6 +3461,12 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
return sev_handle_vmgexit(vcpu);
#endif
#endif
+ if (exit_code >= ARRAY_SIZE(svm_exit_handlers))
+ goto unexpected_vmexit;
+
+ if (!svm_exit_handlers[exit_code])
+ goto unexpected_vmexit;
+
return svm_exit_handlers[exit_code](vcpu);
unexpected_vmexit:
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply related
* [PATCH v2 2/8] KVM: SVM: Open code handling of unexpected exits in svm_invoke_exit_handler()
From: Sean Christopherson @ 2025-12-30 21:13 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-1-seanjc@google.com>
Fold svm_check_exit_valid() and svm_handle_invalid_exit() into their sole
caller, svm_invoke_exit_handler(), as having tiny single-use helpers makes
the code unncessarily difficult to follow. This will also allow for
additional cleanups in svm_invoke_exit_handler().
No functional change intended.
Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/svm/svm.c | 25 ++++++++++---------------
1 file changed, 10 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c2ddf2e0aa1a..a523011f0923 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3443,23 +3443,13 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
sev_free_decrypted_vmsa(vcpu, save);
}
-static bool svm_check_exit_valid(u64 exit_code)
-{
- return (exit_code < ARRAY_SIZE(svm_exit_handlers) &&
- svm_exit_handlers[exit_code]);
-}
-
-static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code)
-{
- dump_vmcb(vcpu);
- kvm_prepare_unexpected_reason_exit(vcpu, exit_code);
- return 0;
-}
-
int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
{
- if (!svm_check_exit_valid(exit_code))
- return svm_handle_invalid_exit(vcpu, exit_code);
+ if (exit_code >= ARRAY_SIZE(svm_exit_handlers))
+ goto unexpected_vmexit;
+
+ if (!svm_exit_handlers[exit_code])
+ goto unexpected_vmexit;
#ifdef CONFIG_MITIGATION_RETPOLINE
if (exit_code == SVM_EXIT_MSR)
@@ -3478,6 +3468,11 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
#endif
#endif
return svm_exit_handlers[exit_code](vcpu);
+
+unexpected_vmexit:
+ dump_vmcb(vcpu);
+ kvm_prepare_unexpected_reason_exit(vcpu, exit_code);
+ return 0;
}
static void svm_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply related
* [PATCH v2 1/8] KVM: SVM: Add a helper to detect VMRUN failures
From: Sean Christopherson @ 2025-12-30 21:13 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-1-seanjc@google.com>
Add a helper to detect VMRUN failures so that KVM can guard against its
own long-standing bug, where KVM neglects to set exitcode[63:32] when
synthesizing a nested VMFAIL_INVALID VM-Exit. This will allow fixing
KVM's mess of treating exitcode as two separate 32-bit values without
breaking KVM-on-KVM when running on an older, unfixed KVM.
Cc: Jim Mattson <jmattson@google.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/svm/nested.c | 16 +++++++---------
arch/x86/kvm/svm/svm.c | 4 ++--
arch/x86/kvm/svm/svm.h | 5 +++++
3 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index ba0f11c68372..f5bde972a2b1 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1134,7 +1134,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
vmcb12->control.exit_info_1 = vmcb02->control.exit_info_1;
vmcb12->control.exit_info_2 = vmcb02->control.exit_info_2;
- if (vmcb12->control.exit_code != SVM_EXIT_ERR)
+ if (!svm_is_vmrun_failure(vmcb12->control.exit_code))
nested_save_pending_event_to_vmcb12(svm, vmcb12);
if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
@@ -1425,6 +1425,9 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
u32 exit_code = svm->vmcb->control.exit_code;
int vmexit = NESTED_EXIT_HOST;
+ if (svm_is_vmrun_failure(exit_code))
+ return NESTED_EXIT_DONE;
+
switch (exit_code) {
case SVM_EXIT_MSR:
vmexit = nested_svm_exit_handled_msr(svm);
@@ -1432,7 +1435,7 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
case SVM_EXIT_IOIO:
vmexit = nested_svm_intercept_ioio(svm);
break;
- case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
+ case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f:
/*
* Host-intercepted exceptions have been checked already in
* nested_svm_exit_special. There is nothing to do here,
@@ -1440,15 +1443,10 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
*/
vmexit = NESTED_EXIT_DONE;
break;
- }
- case SVM_EXIT_ERR: {
- vmexit = NESTED_EXIT_DONE;
- break;
- }
- default: {
+ default:
if (vmcb12_is_intercept(&svm->nested.ctl, exit_code))
vmexit = NESTED_EXIT_DONE;
- }
+ break;
}
return vmexit;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 24d59ccfa40d..c2ddf2e0aa1a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3540,7 +3540,7 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
return 1;
}
- if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
+ if (svm_is_vmrun_failure(svm->vmcb->control.exit_code)) {
kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
kvm_run->fail_entry.hardware_entry_failure_reason
= svm->vmcb->control.exit_code;
@@ -4311,7 +4311,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
/* Track VMRUNs that have made past consistency checking */
if (svm->nested.nested_run_pending &&
- svm->vmcb->control.exit_code != SVM_EXIT_ERR)
+ !svm_is_vmrun_failure(svm->vmcb->control.exit_code))
++vcpu->stat.nested_run;
svm->nested.nested_run_pending = 0;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 01be93a53d07..0f006793f973 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -424,6 +424,11 @@ static __always_inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
return container_of(vcpu, struct vcpu_svm, vcpu);
}
+static inline bool svm_is_vmrun_failure(u64 exit_code)
+{
+ return (u32)exit_code == (u32)SVM_EXIT_ERR;
+}
+
/*
* Only the PDPTRs are loaded on demand into the shadow MMU. All other
* fields are synchronized on VM-Exit, because accessing the VMCB is cheap.
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply related
* [PATCH v2 0/8] KVM: SVM: Fix exit_code bugs
From: Sean Christopherson @ 2025-12-30 21:13 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
Fix (mostly benign) bugs in SVM where KVM treats exit codes as 32-bit values
instead of 64-bit values.
The most dangerous aspect of the mess is that simply fixing KVM would likely
break KVM-on-KVM setups if only L1 is patched. To try and avoid such
breakage while also fixing KVM, I opted to have KVM retain its checks on
only bits 31:0 if KVM is running as a VM (as detected by
X86_FEATURE_HYPERVISOR).
v2:
- Drop the nSVM #VMEXIT fixes (already merged).
- Collect reviews. [Yosry]
- Fix inverted svm_is_vmrun_failure() check. [Yosry]
- Use __print_symbolic_u64() and __print_flags_u64() in tracepoints. [Test Bot]
- Track exit_code as a u64 in KVM selftests.
- Make HV_SVM_EXITCODE_ENL an ull like everything else. [Michael]
- Add a compile-time assertion to verify HV_SVM_EXITCODE_ENL == SVM_EXIT_SW.
v1: https://lore.kernel.org/all/20251113225621.1688428-1-seanjc@google.com
Sean Christopherson (8):
KVM: SVM: Add a helper to detect VMRUN failures
KVM: SVM: Open code handling of unexpected exits in
svm_invoke_exit_handler()
KVM: SVM: Check for an unexpected VM-Exit after RETPOLINE "fast"
handling
KVM: SVM: Filter out 64-bit exit codes when invoking exit handlers on
bare metal
KVM: SVM: Treat exit_code as an unsigned 64-bit value through all of
KVM
KVM: SVM: Limit incorrect check on SVM_EXIT_ERR to running as a VM
KVM: SVM: Harden exit_code against being used in Spectre-like attacks
KVM: SVM: Assert that Hyper-V's HV_SVM_EXITCODE_ENL == SVM_EXIT_SW
arch/x86/include/asm/svm.h | 3 +-
arch/x86/include/uapi/asm/svm.h | 32 ++++++------
arch/x86/kvm/svm/hyperv.c | 7 ++-
arch/x86/kvm/svm/nested.c | 29 ++++-------
arch/x86/kvm/svm/sev.c | 36 +++++--------
arch/x86/kvm/svm/svm.c | 51 +++++++++++--------
arch/x86/kvm/svm/svm.h | 12 +++--
arch/x86/kvm/trace.h | 6 +--
include/hyperv/hvgdk.h | 2 +-
tools/testing/selftests/kvm/include/x86/svm.h | 3 +-
.../kvm/x86/svm_nested_soft_inject_test.c | 4 +-
11 files changed, 90 insertions(+), 95 deletions(-)
base-commit: 9448598b22c50c8a5bb77a9103e2d49f134c9578
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply
* Re: [PATCH] Drivers: hv: vmbus: fix typo in function name reference
From: vdso @ 2025-12-30 14:18 UTC (permalink / raw)
To: Julia Lawall
Cc: yunbolyu, kexinsun, ratnadiraw, xutong.ma, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, linux-hyperv, linux-kernel, K. Y. Srinivasan
In-Reply-To: <20251230141414.94472-1-Julia.Lawall@inria.fr>
> On 12/30/2025 6:14 AM Julia Lawall <julia.lawall@inria.fr> wrote:
>
>
> Replace cmxchg by cmpxchg.
>
> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
>
Reviewed-by: Roman Kisel <vdso@mailbox.org>
> ---
> drivers/hv/hyperv_vmbus.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index b2862e0a317a..cdbc5f5c3215 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -375,7 +375,7 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
> return;
>
> /*
> - * The cmxchg() above does an implicit memory barrier to
> + * The cmpxchg() above does an implicit memory barrier to
> * ensure the write to MessageType (ie set to
> * HVMSG_NONE) happens before we read the
> * MessagePending and EOMing. Otherwise, the EOMing
^ permalink raw reply
* [PATCH] Drivers: hv: vmbus: fix typo in function name reference
From: Julia Lawall @ 2025-12-30 14:14 UTC (permalink / raw)
To: K. Y. Srinivasan
Cc: yunbolyu, kexinsun, ratnadiraw, xutong.ma, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, linux-hyperv, linux-kernel
Replace cmxchg by cmpxchg.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
---
drivers/hv/hyperv_vmbus.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index b2862e0a317a..cdbc5f5c3215 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -375,7 +375,7 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
return;
/*
- * The cmxchg() above does an implicit memory barrier to
+ * The cmpxchg() above does an implicit memory barrier to
* ensure the write to MessageType (ie set to
* HVMSG_NONE) happens before we read the
* MessagePending and EOMing. Otherwise, the EOMing
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox